This is a follow-up to [0], which was a lot smaller. Thanks to Ryan for
feedback and review. This series is independent of Ryan's work on
adding support for LPA2 to KVM - the only potential source of conflict
should be the patch "arm64: kvm: Limit HYP VA and host S2 range to 48
bits when LPA2 is in effect", which could simply be dropped in favour of
the KVM changes to make it support LPA2.
The first ~15 patches of this series rework how the kernel VA space is
organized, so that the vmemmap region does not take up more space than
necessary, and so that most of it can be reclaimed when running a build
capable of 52-bit virtual addressing on hardware that is not. This is
needed because the vmemmap region will take up a substantial part of the
upper VA region that it shares with the kernel, modules and
vmalloc/vmap mappings once we enable LPA2 with 4k pages.
The next ~30 patches rework the early init code, reimplementing most of
the page table and relocation handling in C code. There are several
reasons why this is beneficial:
- we generally prefer C code over asm for these things, and the macros
that currently exist in head.S for creating the kernel page tables
are a good example why;
- we no longer need to create the kernel mapping in two passes, which
means we can remove the logic that copies parts of the fixmap and the
KAsan shadow from one set of page tables to the other; this is
especially advantageous for KAsan with LPA2, which needs more
elaborate shadow handling across multiple levels, since the KAsan
region cannot be placed on exact pgd_t boundaries in that case;
- we can read the ID registers and parse command line overrides before
creating the page tables, which simplifies the LPA2 case, as flicking
the global TCR_EL1.DS bit at a later stage would require elaborate
repainting of all page table descriptors, some of it with the MMU
disabled;
- we can use more elaborate logic to create the mappings, which means we
can use more precise mappings for code and data sections even when
using 2 MiB granularity, and this is a prerequisite for running with
WXN.
As part of the ID map changes, we decouple the ID map size from the
kernel VA size, and use 48-bit virtual addressing for the ID map in all
configurations.
The next 18 patches rework the existing LVA support as a CPU feature,
which simplifies some code and gets rid of the vabits_actual variable.
Then, LPA2 support is implemented in the same vein. This requires adding
support for 5 level paging as well, given that LPA2 introduces a new
paging level '-1' when using 4k pages.
Combined with the vmemmap changes at the start of the series, the
resulting LPA2/4k pages configuration will have the exact same VA space
layout as the ordinary 4k/4 levels configuration, and so LPA2 support
can reasonably be enabled by default, as the fallback is seamless on
non-LPA2 hardware.
In the 16k/LPA2 case, the fallback also reduces the number of paging
levels, resulting in a 47-bit VA space. This is based on the assumption
that hybrid LPA2/non-LPA2 16k pages kernels in production use would
prefer not to take the performance hit of 4 level paging to gain only a
single additional bit of VA space. (Note that generic Android kernels
use only 3 levels of paging today.) Bespoke 16k configurations can still
configure 48-bit virtual addressing as before.
Finally, the last two patches enable support for running with the WXN
control enabled. This was previously part of a separate series [1], but
given that the delta is tiny, it is included here as well.
[0] https://lore.kernel.org/all/[email protected]/
[1] https://lore.kernel.org/all/[email protected]/
Cc: Catalin Marinas <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Marc Zyngier <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Ryan Roberts <[email protected]>
Cc: Anshuman Khandual <[email protected]>
Cc: Kees Cook <[email protected]>
Anshuman Khandual (2):
arm64/mm: Add FEAT_LPA2 specific TCR_EL1.DS field
arm64/mm: Add FEAT_LPA2 specific ID_AA64MMFR0.TGRAN[2]
Ard Biesheuvel (57):
// KASLR / vmemmap reorg
arm64: kernel: Disable latent_entropy GCC plugin in early C runtime
arm64: mm: Take potential load offset into account when KASLR is off
arm64: mm: get rid of kimage_vaddr global variable
arm64: mm: Move PCI I/O emulation region above the vmemmap region
arm64: mm: Move fixmap region above vmemmap region
arm64: ptdump: Allow VMALLOC_END to be defined at boot
arm64: ptdump: Discover start of vmemmap region at runtime
arm64: vmemmap: Avoid base2 order of struct page size to dimension
region
arm64: mm: Reclaim unused vmemmap region for vmalloc use
arm64: kaslr: Adjust randomization range dynamically
arm64: kaslr: drop special case for ThunderX in kaslr_requires_kpti()
arm64: kvm: honour 'nokaslr' command line option for the HYP VA space
// Reimplement page table creation code in C
arm64: kernel: Manage absolute relocations in code built under pi/
arm64: kernel: Don't rely on objcopy to make code under pi/ __init
arm64: head: move relocation handling to C code
arm64: idreg-override: Omit non-NULL checks for override pointer
arm64: idreg-override: Prepare for place relative reloc patching
arm64: idreg-override: Avoid parameq() and parameqn()
arm64: idreg-override: avoid strlen() to check for empty strings
arm64: idreg-override: Avoid sprintf() for simple string concatenation
arm64: idreg-override: Avoid kstrtou64() to parse a single hex digit
arm64: idreg-override: Move to early mini C runtime
arm64: kernel: Remove early fdt remap code
arm64: head: Clear BSS and the kernel page tables in one go
arm64: Move feature overrides into the BSS section
arm64: head: Run feature override detection before mapping the kernel
arm64: head: move dynamic shadow call stack patching into early C
runtime
arm64: kaslr: Use feature override instead of parsing the cmdline
again
arm64: idreg-override: Create a pseudo feature for rodata=off
arm64: Add helpers to probe local CPU for PAC/BTI/E0PD support
arm64: head: allocate more pages for the kernel mapping
arm64: head: move memstart_offset_seed handling to C code
arm64: head: Move early kernel mapping routines into C code
arm64: mm: Use 48-bit virtual addressing for the permanent ID map
arm64: pgtable: Decouple PGDIR size macros from PGD/PUD/PMD levels
arm64: kernel: Create initial ID map from C code
arm64: mm: avoid fixmap for early swapper_pg_dir updates
arm64: mm: omit redundant remap of kernel image
arm64: Revert "mm: provide idmap pointer to cpu_replace_ttbr1()"
// Implement LPA2 support
arm64: mm: Handle LVA support as a CPU feature
arm64: mm: Add feature override support for LVA
arm64: mm: Wire up TCR.DS bit to PTE shareability fields
arm64: mm: Add LPA2 support to phys<->pte conversion routines
arm64: mm: Add definitions to support 5 levels of paging
arm64: mm: add LPA2 and 5 level paging support to G-to-nG conversion
arm64: Enable LPA2 at boot if supported by the system
arm64: mm: Add 5 level paging support to fixmap and swapper handling
arm64: kasan: Reduce minimum shadow alignment and enable 5 level
paging
arm64: mm: Add support for folding PUDs at runtime
arm64: ptdump: Disregard unaddressable VA space
arm64: ptdump: Deal with translation levels folded at runtime
arm64: kvm: avoid CONFIG_PGTABLE_LEVELS for runtime levels
arm64: kvm: Limit HYP VA and host S2 range to 48 bits when LPA2 is in
effect
arm64: Enable 52-bit virtual addressing for 4k and 16k granule configs
arm64: defconfig: Enable LPA2 support
// Allow WXN control to be enabled at boot
mm: add arch hook to validate mmap() prot flags
arm64: mm: add support for WXN memory translation attribute
Marc Zyngier (1):
arm64: Turn kaslr_feature_override into a generic SW feature override
arch/arm64/Kconfig | 34 +-
arch/arm64/configs/defconfig | 2 +-
arch/arm64/include/asm/assembler.h | 55 +--
arch/arm64/include/asm/cpufeature.h | 102 +++++
arch/arm64/include/asm/fixmap.h | 1 +
arch/arm64/include/asm/kasan.h | 2 -
arch/arm64/include/asm/kernel-pgtable.h | 104 ++---
arch/arm64/include/asm/memory.h | 50 +--
arch/arm64/include/asm/mman.h | 36 ++
arch/arm64/include/asm/mmu.h | 26 +-
arch/arm64/include/asm/mmu_context.h | 49 ++-
arch/arm64/include/asm/pgalloc.h | 53 ++-
arch/arm64/include/asm/pgtable-hwdef.h | 33 +-
arch/arm64/include/asm/pgtable-prot.h | 18 +-
arch/arm64/include/asm/pgtable-types.h | 6 +
arch/arm64/include/asm/pgtable.h | 229 +++++++++-
arch/arm64/include/asm/scs.h | 34 +-
arch/arm64/include/asm/setup.h | 3 -
arch/arm64/include/asm/sysreg.h | 2 +
arch/arm64/include/asm/tlb.h | 3 +-
arch/arm64/kernel/Makefile | 7 +-
arch/arm64/kernel/cpu_errata.c | 2 +-
arch/arm64/kernel/cpufeature.c | 90 ++--
arch/arm64/kernel/head.S | 465 ++------------------
arch/arm64/kernel/idreg-override.c | 322 --------------
arch/arm64/kernel/image-vars.h | 32 ++
arch/arm64/kernel/kaslr.c | 4 +-
arch/arm64/kernel/module.c | 2 +-
arch/arm64/kernel/pi/Makefile | 28 +-
arch/arm64/kernel/pi/idreg-override.c | 396 +++++++++++++++++
arch/arm64/kernel/pi/kaslr_early.c | 78 +---
arch/arm64/kernel/pi/map_kernel.c | 284 ++++++++++++
arch/arm64/kernel/pi/map_range.c | 104 +++++
arch/arm64/kernel/{ => pi}/patch-scs.c | 36 +-
arch/arm64/kernel/pi/pi.h | 30 ++
arch/arm64/kernel/pi/relacheck.c | 130 ++++++
arch/arm64/kernel/pi/relocate.c | 64 +++
arch/arm64/kernel/setup.c | 22 -
arch/arm64/kernel/sleep.S | 3 -
arch/arm64/kernel/suspend.c | 2 +-
arch/arm64/kernel/vmlinux.lds.S | 14 +-
arch/arm64/kvm/hyp/nvhe/mem_protect.c | 2 +
arch/arm64/kvm/mmu.c | 22 +-
arch/arm64/kvm/va_layout.c | 10 +-
arch/arm64/mm/init.c | 2 +-
arch/arm64/mm/kasan_init.c | 154 +++++--
arch/arm64/mm/mmap.c | 4 +
arch/arm64/mm/mmu.c | 268 ++++++-----
arch/arm64/mm/pgd.c | 17 +-
arch/arm64/mm/proc.S | 106 ++++-
arch/arm64/mm/ptdump.c | 43 +-
arch/arm64/tools/cpucaps | 1 +
include/linux/mman.h | 15 +
mm/mmap.c | 3 +
54 files changed, 2259 insertions(+), 1345 deletions(-)
delete mode 100644 arch/arm64/kernel/idreg-override.c
create mode 100644 arch/arm64/kernel/pi/idreg-override.c
create mode 100644 arch/arm64/kernel/pi/map_kernel.c
create mode 100644 arch/arm64/kernel/pi/map_range.c
rename arch/arm64/kernel/{ => pi}/patch-scs.c (89%)
create mode 100644 arch/arm64/kernel/pi/pi.h
create mode 100644 arch/arm64/kernel/pi/relacheck.c
create mode 100644 arch/arm64/kernel/pi/relocate.c
--
2.39.2
Avoid build issues in the early C code related to the latent_entropy GCC
plugin, by incorporating the C flags fragment that disables it.
Signed-off-by: Ard Biesheuvel <[email protected]>
---
arch/arm64/kernel/pi/Makefile | 1 +
1 file changed, 1 insertion(+)
diff --git a/arch/arm64/kernel/pi/Makefile b/arch/arm64/kernel/pi/Makefile
index 4c0ea3cd4ea406b6..c844a0546d7f0e62 100644
--- a/arch/arm64/kernel/pi/Makefile
+++ b/arch/arm64/kernel/pi/Makefile
@@ -3,6 +3,7 @@
KBUILD_CFLAGS := $(subst $(CC_FLAGS_FTRACE),,$(KBUILD_CFLAGS)) -fpie \
-Os -DDISABLE_BRANCH_PROFILING $(DISABLE_STACKLEAK_PLUGIN) \
+ $(DISABLE_LATENT_ENTROPY_PLUGIN) \
$(call cc-option,-mbranch-protection=none) \
-I$(srctree)/scripts/dtc/libfdt -fno-stack-protector \
-include $(srctree)/include/linux/hidden.h \
--
2.39.2
We enable CONFIG_RELOCATABLE even when CONFIG_RANDOMIZE_BASE is
disabled, and this permits the loader (i.e., EFI) to place the kernel
anywhere in physical memory as long as the base address is 64k aligned.
This means that the 'KASLR' case described in the header that defines
the size of the statically allocated page tables could take effect even
when CONFIG_RANDOMIZE_BASE=n. So check for CONFIG_RELOCATABLE instead.
Signed-off-by: Ard Biesheuvel <[email protected]>
---
arch/arm64/include/asm/kernel-pgtable.h | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/arch/arm64/include/asm/kernel-pgtable.h b/arch/arm64/include/asm/kernel-pgtable.h
index fcd14197756f0619..4d13c73171e1e360 100644
--- a/arch/arm64/include/asm/kernel-pgtable.h
+++ b/arch/arm64/include/asm/kernel-pgtable.h
@@ -53,7 +53,7 @@
* address is just pushed over a boundary and the start address isn't).
*/
-#ifdef CONFIG_RANDOMIZE_BASE
+#ifdef CONFIG_RELOCATABLE
#define EARLY_KASLR (1)
#else
#define EARLY_KASLR (0)
--
2.39.2
We store the address of _text in kimage_vaddr, but since commit
09e3c22a86f6889d ("arm64: Use a variable to store non-global mappings
decision"), we no longer reference this variable from modules so we no
longer need to export it.
In fact, we don't need it at all so let's just get rid of it.
Signed-off-by: Ard Biesheuvel <[email protected]>
---
arch/arm64/include/asm/memory.h | 6 ++----
arch/arm64/kernel/head.S | 2 +-
arch/arm64/mm/mmu.c | 3 ---
3 files changed, 3 insertions(+), 8 deletions(-)
diff --git a/arch/arm64/include/asm/memory.h b/arch/arm64/include/asm/memory.h
index 78e5163836a0ab95..a4e1d832a15a2d7a 100644
--- a/arch/arm64/include/asm/memory.h
+++ b/arch/arm64/include/asm/memory.h
@@ -182,6 +182,7 @@
#include <linux/types.h>
#include <asm/boot.h>
#include <asm/bug.h>
+#include <asm/sections.h>
#if VA_BITS > 48
extern u64 vabits_actual;
@@ -193,15 +194,12 @@ extern s64 memstart_addr;
/* PHYS_OFFSET - the physical address of the start of memory. */
#define PHYS_OFFSET ({ VM_BUG_ON(memstart_addr & 1); memstart_addr; })
-/* the virtual base of the kernel image */
-extern u64 kimage_vaddr;
-
/* the offset between the kernel virtual and physical mappings */
extern u64 kimage_voffset;
static inline unsigned long kaslr_offset(void)
{
- return kimage_vaddr - KIMAGE_VADDR;
+ return (u64)&_text - KIMAGE_VADDR;
}
static inline bool kaslr_enabled(void)
diff --git a/arch/arm64/kernel/head.S b/arch/arm64/kernel/head.S
index b98970907226b36c..65cdaaa2c859418f 100644
--- a/arch/arm64/kernel/head.S
+++ b/arch/arm64/kernel/head.S
@@ -482,7 +482,7 @@ SYM_FUNC_START_LOCAL(__primary_switched)
str_l x21, __fdt_pointer, x5 // Save FDT pointer
- ldr_l x4, kimage_vaddr // Save the offset between
+ adrp x4, _text // Save the offset between
sub x4, x4, x0 // the kernel virtual and
str_l x4, kimage_voffset, x5 // physical mappings
diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index 6f9d8898a02516f6..81e1420d2cc13246 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -50,9 +50,6 @@ u64 vabits_actual __ro_after_init = VA_BITS_MIN;
EXPORT_SYMBOL(vabits_actual);
#endif
-u64 kimage_vaddr __ro_after_init = (u64)&_text;
-EXPORT_SYMBOL(kimage_vaddr);
-
u64 kimage_voffset __ro_after_init;
EXPORT_SYMBOL(kimage_voffset);
--
2.39.2
Move the PCI I/O region above the vmemmap region in the kernel's VA
space. This will permit us to reclaim the lower part of the vmemmap
region for vmalloc/vmap allocations when running a 52-bit VA capable
build on a 48-bit VA capable system.
Signed-off-by: Ard Biesheuvel <[email protected]>
---
arch/arm64/include/asm/memory.h | 4 ++--
arch/arm64/mm/ptdump.c | 4 ++--
2 files changed, 4 insertions(+), 4 deletions(-)
diff --git a/arch/arm64/include/asm/memory.h b/arch/arm64/include/asm/memory.h
index a4e1d832a15a2d7a..6e321cc06a3c30f0 100644
--- a/arch/arm64/include/asm/memory.h
+++ b/arch/arm64/include/asm/memory.h
@@ -49,8 +49,8 @@
#define MODULES_VSIZE (SZ_128M)
#define VMEMMAP_START (-(UL(1) << (VA_BITS - VMEMMAP_SHIFT)))
#define VMEMMAP_END (VMEMMAP_START + VMEMMAP_SIZE)
-#define PCI_IO_END (VMEMMAP_START - SZ_8M)
-#define PCI_IO_START (PCI_IO_END - PCI_IO_SIZE)
+#define PCI_IO_START (VMEMMAP_END + SZ_8M)
+#define PCI_IO_END (PCI_IO_START + PCI_IO_SIZE)
#define FIXADDR_TOP (VMEMMAP_START - SZ_32M)
#if VA_BITS > 48
diff --git a/arch/arm64/mm/ptdump.c b/arch/arm64/mm/ptdump.c
index 9bc4066c5bf33a72..9d1f4cdc6672ed5f 100644
--- a/arch/arm64/mm/ptdump.c
+++ b/arch/arm64/mm/ptdump.c
@@ -47,10 +47,10 @@ static struct addr_marker address_markers[] = {
{ VMALLOC_END, "vmalloc() end" },
{ FIXADDR_START, "Fixmap start" },
{ FIXADDR_TOP, "Fixmap end" },
- { PCI_IO_START, "PCI I/O start" },
- { PCI_IO_END, "PCI I/O end" },
{ VMEMMAP_START, "vmemmap start" },
{ VMEMMAP_START + VMEMMAP_SIZE, "vmemmap end" },
+ { PCI_IO_START, "PCI I/O start" },
+ { PCI_IO_END, "PCI I/O end" },
{ -1, NULL },
};
--
2.39.2
Move the fixmap region above the vmemmap region, so that the start of
the vmemmap delineates the end of the region available for vmalloc and
vmap allocations and the randomized placement of the kernel and modules.
In a subsequent patch, we will take advantage of this to reclaim most of
the vmemmap area when running a 52-bit VA capable build with 52-bit
virtual addressing disabled at runtime.
Note that the existing guard region of 256 MiB covers the fixmap and PCI
I/O regions as well, so we can reduce it to 8 MiB, which is what we use in
other places too.
Signed-off-by: Ard Biesheuvel <[email protected]>
---
arch/arm64/include/asm/memory.h | 2 +-
arch/arm64/include/asm/pgtable.h | 2 +-
arch/arm64/mm/ptdump.c | 4 ++--
3 files changed, 4 insertions(+), 4 deletions(-)
diff --git a/arch/arm64/include/asm/memory.h b/arch/arm64/include/asm/memory.h
index 6e321cc06a3c30f0..9b9e52d823beccc6 100644
--- a/arch/arm64/include/asm/memory.h
+++ b/arch/arm64/include/asm/memory.h
@@ -51,7 +51,7 @@
#define VMEMMAP_END (VMEMMAP_START + VMEMMAP_SIZE)
#define PCI_IO_START (VMEMMAP_END + SZ_8M)
#define PCI_IO_END (PCI_IO_START + PCI_IO_SIZE)
-#define FIXADDR_TOP (VMEMMAP_START - SZ_32M)
+#define FIXADDR_TOP (ULONG_MAX - SZ_8M + 1)
#if VA_BITS > 48
#define VA_BITS_MIN (48)
diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index b6ba466e2e8a3fc7..3eff06c5d0eb73c7 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -22,7 +22,7 @@
* and fixed mappings
*/
#define VMALLOC_START (MODULES_END)
-#define VMALLOC_END (VMEMMAP_START - SZ_256M)
+#define VMALLOC_END (VMEMMAP_START - SZ_8M)
#define vmemmap ((struct page *)VMEMMAP_START - (memstart_addr >> PAGE_SHIFT))
diff --git a/arch/arm64/mm/ptdump.c b/arch/arm64/mm/ptdump.c
index 9d1f4cdc6672ed5f..76d28056bd14920a 100644
--- a/arch/arm64/mm/ptdump.c
+++ b/arch/arm64/mm/ptdump.c
@@ -45,12 +45,12 @@ static struct addr_marker address_markers[] = {
{ MODULES_END, "Modules end" },
{ VMALLOC_START, "vmalloc() area" },
{ VMALLOC_END, "vmalloc() end" },
- { FIXADDR_START, "Fixmap start" },
- { FIXADDR_TOP, "Fixmap end" },
{ VMEMMAP_START, "vmemmap start" },
{ VMEMMAP_START + VMEMMAP_SIZE, "vmemmap end" },
{ PCI_IO_START, "PCI I/O start" },
{ PCI_IO_END, "PCI I/O end" },
+ { FIXADDR_START, "Fixmap start" },
+ { FIXADDR_TOP, "Fixmap end" },
{ -1, NULL },
};
--
2.39.2
Extend the existing pattern for populating ptdump marker entries at
boot, and add handling of VMALLOC_END, which will cease to be a compile
time constant for configurations that support 52-bit virtual addressing.
Signed-off-by: Ard Biesheuvel <[email protected]>
---
arch/arm64/mm/ptdump.c | 8 +++++++-
1 file changed, 7 insertions(+), 1 deletion(-)
diff --git a/arch/arm64/mm/ptdump.c b/arch/arm64/mm/ptdump.c
index 76d28056bd14920a..910b35f02280cbdb 100644
--- a/arch/arm64/mm/ptdump.c
+++ b/arch/arm64/mm/ptdump.c
@@ -31,7 +31,12 @@ enum address_markers_idx {
PAGE_END_NR,
#if defined(CONFIG_KASAN_GENERIC) || defined(CONFIG_KASAN_SW_TAGS)
KASAN_START_NR,
+ KASAN_END_NR,
#endif
+ MODULES_NR,
+ MODULES_END_NR,
+ VMALLOC_START_NR,
+ VMALLOC_END_NR,
};
static struct addr_marker address_markers[] = {
@@ -44,7 +49,7 @@ static struct addr_marker address_markers[] = {
{ MODULES_VADDR, "Modules start" },
{ MODULES_END, "Modules end" },
{ VMALLOC_START, "vmalloc() area" },
- { VMALLOC_END, "vmalloc() end" },
+ { 0, "vmalloc() end" },
{ VMEMMAP_START, "vmemmap start" },
{ VMEMMAP_START + VMEMMAP_SIZE, "vmemmap end" },
{ PCI_IO_START, "PCI I/O start" },
@@ -379,6 +384,7 @@ static int __init ptdump_init(void)
#if defined(CONFIG_KASAN_GENERIC) || defined(CONFIG_KASAN_SW_TAGS)
address_markers[KASAN_START_NR].start_address = KASAN_SHADOW_START;
#endif
+ address_markers[VMALLOC_END_NR].start_address = VMALLOC_END;
ptdump_initialize();
ptdump_debugfs_register(&kernel_ptdump_info, "kernel_page_tables");
return 0;
--
2.39.2
We will soon reclaim the part of the vmemmap region that covers VA space
that is not addressable by the hardware. To avoid confusion, ensure that
the 'vmemmap start' marker points at the start of the region that is
actually being used for the struct page array, rather than the start of
the region we set aside for it at build time.
Signed-off-by: Ard Biesheuvel <[email protected]>
---
arch/arm64/mm/ptdump.c | 5 +++++
1 file changed, 5 insertions(+)
diff --git a/arch/arm64/mm/ptdump.c b/arch/arm64/mm/ptdump.c
index 910b35f02280cbdb..8f37d6d8b5216473 100644
--- a/arch/arm64/mm/ptdump.c
+++ b/arch/arm64/mm/ptdump.c
@@ -37,6 +37,7 @@ enum address_markers_idx {
MODULES_END_NR,
VMALLOC_START_NR,
VMALLOC_END_NR,
+ VMEMMAP_START_NR,
};
static struct addr_marker address_markers[] = {
@@ -386,6 +387,10 @@ static int __init ptdump_init(void)
#endif
address_markers[VMALLOC_END_NR].start_address = VMALLOC_END;
ptdump_initialize();
+ if (vabits_actual < VA_BITS) {
+ address_markers[VMEMMAP_START_NR].start_address =
+ (unsigned long)virt_to_page(_PAGE_OFFSET(vabits_actual));
+ }
ptdump_debugfs_register(&kernel_ptdump_info, "kernel_page_tables");
return 0;
}
--
2.39.2
The placement and size of the vmemmap region in the kernel virtual
address space is currently derived from the base2 order of the size of a
struct page. This makes for nicely aligned constants with lots of
leading 0xf and trailing 0x0 digits, but given that the actual struct
pages are indexed as an ordinary array, the resulting region is
severely overdimensioned when the size of a struct page is just over a
power of 2.
This doesn't matter today, but once we enable 52-bit virtual addressing
for 4k pages configurations, the vmemmap region may take up almost half
of the upper VA region with the current struct page upper bound at 64
bytes. And once we enable KMSAN or other features that push the size of
a struct page over 64 bytes, we will run out of VMALLOC space entirely.
So instead, let's derive the region size from the actual size of a
struct page, and place the entire region 1 GiB from the top of the VA
space, where it still doesn't share any lower level translation table
entries with the fixmap.
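To make the difference concrete (illustrative arithmetic only, assuming
4k pages, a 52-bit VA build and a 64-byte struct page): the addressable
linear range spans 2^52 - 2^47 bytes, i.e., just under 2^40 pages, so
the struct page array itself needs about 62 TiB. If struct page were to
grow to, say, 72 bytes, the old base2-order scheme would round up to
128 bytes per page and claim the entire 128 TiB upper VA region,
whereas sizing by the actual sizeof(struct page) only grows the
reservation to roughly 70 TiB.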
Signed-off-by: Ard Biesheuvel <[email protected]>
---
arch/arm64/include/asm/memory.h | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)
diff --git a/arch/arm64/include/asm/memory.h b/arch/arm64/include/asm/memory.h
index 9b9e52d823beccc6..830740ff79bab902 100644
--- a/arch/arm64/include/asm/memory.h
+++ b/arch/arm64/include/asm/memory.h
@@ -30,8 +30,8 @@
* keep a constant PAGE_OFFSET and "fallback" to using the higher end
* of the VMEMMAP where 52-bit support is not available in hardware.
*/
-#define VMEMMAP_SHIFT (PAGE_SHIFT - STRUCT_PAGE_MAX_SHIFT)
-#define VMEMMAP_SIZE ((_PAGE_END(VA_BITS_MIN) - PAGE_OFFSET) >> VMEMMAP_SHIFT)
+#define VMEMMAP_RANGE (_PAGE_END(VA_BITS_MIN) - PAGE_OFFSET)
+#define VMEMMAP_SIZE ((VMEMMAP_RANGE >> PAGE_SHIFT) * sizeof(struct page))
/*
* PAGE_OFFSET - the virtual address of the start of the linear map, at the
@@ -47,8 +47,8 @@
#define MODULES_END (MODULES_VADDR + MODULES_VSIZE)
#define MODULES_VADDR (_PAGE_END(VA_BITS_MIN))
#define MODULES_VSIZE (SZ_128M)
-#define VMEMMAP_START (-(UL(1) << (VA_BITS - VMEMMAP_SHIFT)))
-#define VMEMMAP_END (VMEMMAP_START + VMEMMAP_SIZE)
+#define VMEMMAP_START (VMEMMAP_END - VMEMMAP_SIZE)
+#define VMEMMAP_END (ULONG_MAX - SZ_1G + 1)
#define PCI_IO_START (VMEMMAP_END + SZ_8M)
#define PCI_IO_END (PCI_IO_START + PCI_IO_SIZE)
#define FIXADDR_TOP (ULONG_MAX - SZ_8M + 1)
--
2.39.2
The vmemmap array is statically sized based on the maximum supported
size of the virtual address space, but it is located inside the upper VA
region, which is statically sized based on the *minimum* supported size
of the VA space. This doesn't matter much when using 64k pages, which is
the only configuration that currently supports 52-bit virtual
addressing.
However, upcoming LPA2 support will change this picture somewhat, as in
that case, the vmemmap array will take up more than 25% of the upper VA
region when using 4k pages. Given that most of this space is never used
when running on a system that does not support 52-bit virtual
addressing, let's reclaim the unused vmemmap area in that case.
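As a rough illustration (4k pages and a 64-byte struct page assumed):
on hardware limited to 48-bit virtual addressing, the linear addresses
between _PAGE_OFFSET(52) and _PAGE_OFFSET(48) are never mapped, so the
(2^52 - 2^48) >> PAGE_SHIFT struct pages that would describe them,
roughly 60 TiB worth of vmemmap space, can be handed back to the
vmalloc region.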
Signed-off-by: Ard Biesheuvel <[email protected]>
---
arch/arm64/include/asm/pgtable.h | 8 ++++++--
1 file changed, 6 insertions(+), 2 deletions(-)
diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 3eff06c5d0eb73c7..2259898e8c3d990a 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -18,11 +18,15 @@
* VMALLOC range.
*
* VMALLOC_START: beginning of the kernel vmalloc space
- * VMALLOC_END: extends to the available space below vmemmap, PCI I/O space
- * and fixed mappings
+ * VMALLOC_END: extends to the available space below vmemmap
*/
#define VMALLOC_START (MODULES_END)
+#if VA_BITS == VA_BITS_MIN
#define VMALLOC_END (VMEMMAP_START - SZ_8M)
+#else
+#define VMEMMAP_UNUSED_NPAGES ((_PAGE_OFFSET(vabits_actual) - PAGE_OFFSET) >> PAGE_SHIFT)
+#define VMALLOC_END (VMEMMAP_START + VMEMMAP_UNUSED_NPAGES * sizeof(struct page) - SZ_8M)
+#endif
#define vmemmap ((struct page *)VMEMMAP_START - (memstart_addr >> PAGE_SHIFT))
--
2.39.2
Currently, we base the KASLR randomization range on a rough estimate of
the available space in the upper VA region: the lower 1/4th has the
module region and the upper 1/4th has the fixmap, vmemmap and PCI I/O
ranges, and so we pick a random location in the remaining space in the
middle.
Once we enable support for 5-level paging with 4k pages, this no longer
works: the vmemmap region, being dimensioned to cover a 52-bit linear
region, takes up so much space in the upper VA region (the size of which
is based on a 48-bit VA space for compatibility with non-LVA hardware)
that the region above the vmalloc region takes up more than a quarter of
the available space.
So instead of a heuristic, let's derive the randomization range from the
actual boundaries of the vmalloc region.
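The hunk below scales the seed onto that range with a widening multiply
instead of a modulo; a minimal sketch of the same trick (illustrative
only, the helper name is made up) looks like this:

  /* Scale a uniform 64-bit seed onto [0, range) without a division. */
  static u64 scale_seed(u64 seed, u64 range)
  {
          return ((__uint128_t)range * seed) >> 64;
  }

Adding range/2 to the scaled value, as kaslr_early_init() does below,
then confines the resulting offset to the middle half of the usable
window.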
Signed-off-by: Ard Biesheuvel <[email protected]>
---
arch/arm64/kernel/image-vars.h | 2 ++
arch/arm64/kernel/pi/kaslr_early.c | 11 ++++++-----
2 files changed, 8 insertions(+), 5 deletions(-)
diff --git a/arch/arm64/kernel/image-vars.h b/arch/arm64/kernel/image-vars.h
index 8309197c0ebd4a8e..b5906f8e18d7eb8d 100644
--- a/arch/arm64/kernel/image-vars.h
+++ b/arch/arm64/kernel/image-vars.h
@@ -34,6 +34,8 @@ PROVIDE(__pi___memcpy = __pi_memcpy);
PROVIDE(__pi___memmove = __pi_memmove);
PROVIDE(__pi___memset = __pi_memset);
+PROVIDE(__pi_vabits_actual = vabits_actual);
+
#ifdef CONFIG_KVM
/*
diff --git a/arch/arm64/kernel/pi/kaslr_early.c b/arch/arm64/kernel/pi/kaslr_early.c
index 17bff6e399e46b0b..b9e0bb4bc6a9766f 100644
--- a/arch/arm64/kernel/pi/kaslr_early.c
+++ b/arch/arm64/kernel/pi/kaslr_early.c
@@ -14,6 +14,7 @@
#include <asm/archrandom.h>
#include <asm/memory.h>
+#include <asm/pgtable.h>
/* taken from lib/string.c */
static char *__strstr(const char *s1, const char *s2)
@@ -87,7 +88,7 @@ static u64 get_kaslr_seed(void *fdt)
asmlinkage u64 kaslr_early_init(void *fdt)
{
- u64 seed;
+ u64 seed, range;
if (is_kaslr_disabled_cmdline(fdt))
return 0;
@@ -102,9 +103,9 @@ asmlinkage u64 kaslr_early_init(void *fdt)
/*
* OK, so we are proceeding with KASLR enabled. Calculate a suitable
* kernel image offset from the seed. Let's place the kernel in the
- * middle half of the VMALLOC area (VA_BITS_MIN - 2), and stay clear of
- * the lower and upper quarters to avoid colliding with other
- * allocations.
+ * 'middle' half of the VMALLOC area, and stay clear of the lower and
+ * upper quarters to avoid colliding with other allocations.
*/
- return BIT(VA_BITS_MIN - 3) + (seed & GENMASK(VA_BITS_MIN - 3, 0));
+ range = (VMALLOC_END - KIMAGE_VADDR) / 2;
+ return range / 2 + (((__uint128_t)range * seed) >> 64);
}
--
2.39.2
ThunderX is an obsolete platform that shipped without support for the
EFI_RNG_PROTOCOL in its firmware. Now that we no longer misidentify
small KASLR offsets as randomization being enabled, we can drop the
explicit check for ThunderX as well, given that KASLR is known to be
unavailable.
Note that we never enable KPTI on these systems, in spite of what this
function returns. However, using non-global mappings for code that is
executable at EL1 is what tickles the erratum on these cores, regardless
of whether KPTI is enabled or not, so non-global mappings should simply
never be used here.
Signed-off-by: Ard Biesheuvel <[email protected]>
---
arch/arm64/kernel/cpu_errata.c | 2 +-
arch/arm64/kernel/cpufeature.c | 12 ------------
2 files changed, 1 insertion(+), 13 deletions(-)
diff --git a/arch/arm64/kernel/cpu_errata.c b/arch/arm64/kernel/cpu_errata.c
index 307faa2b4395ed9f..530bbd6a2f6331fd 100644
--- a/arch/arm64/kernel/cpu_errata.c
+++ b/arch/arm64/kernel/cpu_errata.c
@@ -261,7 +261,7 @@ static const struct midr_range cavium_erratum_23154_cpus[] = {
#endif
#ifdef CONFIG_CAVIUM_ERRATUM_27456
-const struct midr_range cavium_erratum_27456_cpus[] = {
+static const struct midr_range cavium_erratum_27456_cpus[] = {
/* Cavium ThunderX, T88 pass 1.x - 2.1 */
MIDR_RANGE(MIDR_THUNDERX, 0, 0, 1, 1),
/* Cavium ThunderX, T81 pass 1.0 */
diff --git a/arch/arm64/kernel/cpufeature.c b/arch/arm64/kernel/cpufeature.c
index 2e3e5513977733b7..e9a138b7e3b22cc7 100644
--- a/arch/arm64/kernel/cpufeature.c
+++ b/arch/arm64/kernel/cpufeature.c
@@ -1621,18 +1621,6 @@ bool kaslr_requires_kpti(void)
return false;
}
- /*
- * Systems affected by Cavium erratum 24756 are incompatible
- * with KPTI.
- */
- if (IS_ENABLED(CONFIG_CAVIUM_ERRATUM_27456)) {
- extern const struct midr_range cavium_erratum_27456_cpus[];
-
- if (is_midr_in_range_list(read_cpuid_id(),
- cavium_erratum_27456_cpus))
- return false;
- }
-
return kaslr_enabled();
}
--
2.39.2
From: Marc Zyngier <[email protected]>
Disabling KASLR from the command line is implemented as a feature
override. Repaint it slightly so that it can further be used as
more generic infrastructure for SW override purposes.
Signed-off-by: Marc Zyngier <[email protected]>
[ardb: don't apply the override mask to val]
Signed-off-by: Ard Biesheuvel <[email protected]>
---
arch/arm64/include/asm/cpufeature.h | 4 ++++
arch/arm64/kernel/cpufeature.c | 2 ++
arch/arm64/kernel/idreg-override.c | 16 ++++++----------
arch/arm64/kernel/kaslr.c | 5 ++---
4 files changed, 14 insertions(+), 13 deletions(-)
diff --git a/arch/arm64/include/asm/cpufeature.h b/arch/arm64/include/asm/cpufeature.h
index 6bf013fb110d7923..bc10098901808c00 100644
--- a/arch/arm64/include/asm/cpufeature.h
+++ b/arch/arm64/include/asm/cpufeature.h
@@ -15,6 +15,8 @@
#define MAX_CPU_FEATURES 128
#define cpu_feature(x) KERNEL_HWCAP_ ## x
+#define ARM64_SW_FEATURE_OVERRIDE_NOKASLR 0
+
#ifndef __ASSEMBLY__
#include <linux/bug.h>
@@ -925,6 +927,8 @@ extern struct arm64_ftr_override id_aa64smfr0_override;
extern struct arm64_ftr_override id_aa64isar1_override;
extern struct arm64_ftr_override id_aa64isar2_override;
+extern struct arm64_ftr_override arm64_sw_feature_override;
+
u32 get_kvm_ipa_limit(void);
void dump_cpu_features(void);
diff --git a/arch/arm64/kernel/cpufeature.c b/arch/arm64/kernel/cpufeature.c
index e9a138b7e3b22cc7..88cec6c14743c4c5 100644
--- a/arch/arm64/kernel/cpufeature.c
+++ b/arch/arm64/kernel/cpufeature.c
@@ -657,6 +657,8 @@ struct arm64_ftr_override __ro_after_init id_aa64smfr0_override;
struct arm64_ftr_override __ro_after_init id_aa64isar1_override;
struct arm64_ftr_override __ro_after_init id_aa64isar2_override;
+struct arm64_ftr_override arm64_sw_feature_override;
+
static const struct __ftr_reg_entry {
u32 sys_id;
struct arm64_ftr_reg *reg;
diff --git a/arch/arm64/kernel/idreg-override.c b/arch/arm64/kernel/idreg-override.c
index d833d78a7f313563..434703e4e55cb785 100644
--- a/arch/arm64/kernel/idreg-override.c
+++ b/arch/arm64/kernel/idreg-override.c
@@ -138,15 +138,11 @@ static const struct ftr_set_desc smfr0 __initconst = {
},
};
-extern struct arm64_ftr_override kaslr_feature_override;
-
-static const struct ftr_set_desc kaslr __initconst = {
- .name = "kaslr",
-#ifdef CONFIG_RANDOMIZE_BASE
- .override = &kaslr_feature_override,
-#endif
+static const struct ftr_set_desc sw_features __initconst = {
+ .name = "arm64_sw",
+ .override = &arm64_sw_feature_override,
.fields = {
- FIELD("disabled", 0, NULL),
+ FIELD("nokaslr", ARM64_SW_FEATURE_OVERRIDE_NOKASLR, NULL),
{}
},
};
@@ -158,7 +154,7 @@ static const struct ftr_set_desc * const regs[] __initconst = {
&isar1,
&isar2,
&smfr0,
- &kaslr,
+ &sw_features,
};
static const struct {
@@ -175,7 +171,7 @@ static const struct {
"id_aa64isar1.api=0 id_aa64isar1.apa=0 "
"id_aa64isar2.gpa3=0 id_aa64isar2.apa3=0" },
{ "arm64.nomte", "id_aa64pfr1.mte=0" },
- { "nokaslr", "kaslr.disabled=1" },
+ { "nokaslr", "arm64_sw.nokaslr=1" },
};
static int __init find_field(const char *cmdline,
diff --git a/arch/arm64/kernel/kaslr.c b/arch/arm64/kernel/kaslr.c
index e7477f21a4c9d062..abcd996c747b8c97 100644
--- a/arch/arm64/kernel/kaslr.c
+++ b/arch/arm64/kernel/kaslr.c
@@ -23,8 +23,6 @@
u64 __ro_after_init module_alloc_base;
u16 __initdata memstart_offset_seed;
-struct arm64_ftr_override kaslr_feature_override __initdata;
-
static int __init kaslr_init(void)
{
u64 module_range;
@@ -36,7 +34,8 @@ static int __init kaslr_init(void)
*/
module_alloc_base = (u64)_etext - MODULES_VSIZE;
- if (kaslr_feature_override.val & kaslr_feature_override.mask & 0xf) {
+ if (cpuid_feature_extract_unsigned_field(arm64_sw_feature_override.val,
+ ARM64_SW_FEATURE_OVERRIDE_NOKASLR)) {
pr_info("KASLR disabled on command line\n");
return 0;
}
--
2.39.2
Debugging the code that runs in HYP is rather difficult, especially when
its placement is randomized. So let's take the 'nokaslr' command line
option into account, and allow the randomization of the HYP VA space to
be disabled at boot.
Signed-off-by: Ard Biesheuvel <[email protected]>
---
arch/arm64/include/asm/memory.h | 15 -------------
arch/arm64/include/asm/mmu.h | 23 ++++++++++++++++++++
arch/arm64/kernel/kaslr.c | 3 +--
arch/arm64/kvm/va_layout.c | 3 ++-
4 files changed, 26 insertions(+), 18 deletions(-)
diff --git a/arch/arm64/include/asm/memory.h b/arch/arm64/include/asm/memory.h
index 830740ff79bab902..f96975466ef1b752 100644
--- a/arch/arm64/include/asm/memory.h
+++ b/arch/arm64/include/asm/memory.h
@@ -197,21 +197,6 @@ extern s64 memstart_addr;
/* the offset between the kernel virtual and physical mappings */
extern u64 kimage_voffset;
-static inline unsigned long kaslr_offset(void)
-{
- return (u64)&_text - KIMAGE_VADDR;
-}
-
-static inline bool kaslr_enabled(void)
-{
- /*
- * The KASLR offset modulo MIN_KIMG_ALIGN is taken from the physical
- * placement of the image rather than from the seed, so a displacement
- * of less than MIN_KIMG_ALIGN means that no seed was provided.
- */
- return kaslr_offset() >= MIN_KIMG_ALIGN;
-}
-
/*
* Allow all memory at the discovery stage. We will clip it later.
*/
diff --git a/arch/arm64/include/asm/mmu.h b/arch/arm64/include/asm/mmu.h
index 48f8466a4be92ac3..9b9a206e4e9c9d4e 100644
--- a/arch/arm64/include/asm/mmu.h
+++ b/arch/arm64/include/asm/mmu.h
@@ -72,6 +72,29 @@ extern void *fixmap_remap_fdt(phys_addr_t dt_phys, int *size, pgprot_t prot);
extern void mark_linear_text_alias_ro(void);
extern bool kaslr_requires_kpti(void);
+static inline unsigned long kaslr_offset(void)
+{
+ return (u64)&_text - KIMAGE_VADDR;
+}
+
+static inline bool kaslr_enabled(void)
+{
+ /*
+ * The KASLR offset modulo MIN_KIMG_ALIGN is taken from the physical
+ * placement of the image rather than from the seed, so a displacement
+ * of less than MIN_KIMG_ALIGN means that no seed was provided.
+ */
+ return kaslr_offset() >= MIN_KIMG_ALIGN;
+}
+
+static inline bool kaslr_disabled_cmdline(void)
+{
+ if (cpuid_feature_extract_unsigned_field(arm64_sw_feature_override.val,
+ ARM64_SW_FEATURE_OVERRIDE_NOKASLR))
+ return true;
+ return false;
+}
+
#define INIT_MM_CONTEXT(name) \
.pgd = init_pg_dir,
diff --git a/arch/arm64/kernel/kaslr.c b/arch/arm64/kernel/kaslr.c
index abcd996c747b8c97..9e7f0bae5b5fb24b 100644
--- a/arch/arm64/kernel/kaslr.c
+++ b/arch/arm64/kernel/kaslr.c
@@ -34,8 +34,7 @@ static int __init kaslr_init(void)
*/
module_alloc_base = (u64)_etext - MODULES_VSIZE;
- if (cpuid_feature_extract_unsigned_field(arm64_sw_feature_override.val,
- ARM64_SW_FEATURE_OVERRIDE_NOKASLR)) {
+ if (kaslr_disabled_cmdline()) {
pr_info("KASLR disabled on command line\n");
return 0;
}
diff --git a/arch/arm64/kvm/va_layout.c b/arch/arm64/kvm/va_layout.c
index 91b22a014610b22f..341b67e2f2514e55 100644
--- a/arch/arm64/kvm/va_layout.c
+++ b/arch/arm64/kvm/va_layout.c
@@ -72,7 +72,8 @@ __init void kvm_compute_layout(void)
va_mask = GENMASK_ULL(tag_lsb - 1, 0);
tag_val = hyp_va_msb;
- if (IS_ENABLED(CONFIG_RANDOMIZE_BASE) && tag_lsb != (vabits_actual - 1)) {
+ if (IS_ENABLED(CONFIG_RANDOMIZE_BASE) && tag_lsb != (vabits_actual - 1) &&
+ !kaslr_disabled_cmdline()) {
/* We have some free bits to insert a random tag. */
tag_val |= get_random_long() & GENMASK_ULL(vabits_actual - 2, tag_lsb);
}
--
2.39.2
We will add some code under pi/ that contains global variables that
should not end up in __initdata, as they will not be writable via the
initial ID map. So only rely on objcopy for making the libfdt code
__init, and use explicit annotations for the rest.
Signed-off-by: Ard Biesheuvel <[email protected]>
---
arch/arm64/kernel/pi/Makefile | 6 ++++--
arch/arm64/kernel/pi/kaslr_early.c | 16 +++++++++-------
2 files changed, 13 insertions(+), 9 deletions(-)
diff --git a/arch/arm64/kernel/pi/Makefile b/arch/arm64/kernel/pi/Makefile
index bc32a431fe35e6f4..2bbe866417d453ff 100644
--- a/arch/arm64/kernel/pi/Makefile
+++ b/arch/arm64/kernel/pi/Makefile
@@ -28,11 +28,13 @@ quiet_cmd_piobjcopy = $(quiet_cmd_objcopy)
cmd_piobjcopy = $(cmd_objcopy) && $(obj)/relacheck $(@) $(<)
$(obj)/%.pi.o: OBJCOPYFLAGS := --prefix-symbols=__pi_ \
- --remove-section=.note.gnu.property \
- --prefix-alloc-sections=.init
+ --remove-section=.note.gnu.property
$(obj)/%.pi.o: $(obj)/%.o $(obj)/relacheck FORCE
$(call if_changed,piobjcopy)
+# ensure that all the lib- code ends up as __init code and data
+$(obj)/lib-%.pi.o: OBJCOPYFLAGS += --prefix-alloc-sections=.init
+
$(obj)/lib-%.o: $(srctree)/lib/%.c FORCE
$(call if_changed_rule,cc_o_c)
diff --git a/arch/arm64/kernel/pi/kaslr_early.c b/arch/arm64/kernel/pi/kaslr_early.c
index b9e0bb4bc6a9766f..167081b30a152d0a 100644
--- a/arch/arm64/kernel/pi/kaslr_early.c
+++ b/arch/arm64/kernel/pi/kaslr_early.c
@@ -17,7 +17,7 @@
#include <asm/pgtable.h>
/* taken from lib/string.c */
-static char *__strstr(const char *s1, const char *s2)
+static char *__init __strstr(const char *s1, const char *s2)
{
size_t l1, l2;
@@ -33,7 +33,7 @@ static char *__strstr(const char *s1, const char *s2)
}
return NULL;
}
-static bool cmdline_contains_nokaslr(const u8 *cmdline)
+static bool __init cmdline_contains_nokaslr(const u8 *cmdline)
{
const u8 *str;
@@ -41,7 +41,7 @@ static bool cmdline_contains_nokaslr(const u8 *cmdline)
return str == cmdline || (str > cmdline && *(str - 1) == ' ');
}
-static bool is_kaslr_disabled_cmdline(void *fdt)
+static bool __init is_kaslr_disabled_cmdline(void *fdt)
{
if (!IS_ENABLED(CONFIG_CMDLINE_FORCE)) {
int node;
@@ -67,17 +67,19 @@ static bool is_kaslr_disabled_cmdline(void *fdt)
return cmdline_contains_nokaslr(CONFIG_CMDLINE);
}
-static u64 get_kaslr_seed(void *fdt)
+static u64 __init get_kaslr_seed(void *fdt)
{
+ static char const chosen_str[] __initconst = "chosen";
+ static char const seed_str[] __initconst = "kaslr-seed";
int node, len;
fdt64_t *prop;
u64 ret;
- node = fdt_path_offset(fdt, "/chosen");
+ node = fdt_path_offset(fdt, chosen_str);
if (node < 0)
return 0;
- prop = fdt_getprop_w(fdt, node, "kaslr-seed", &len);
+ prop = fdt_getprop_w(fdt, node, seed_str, &len);
if (!prop || len != sizeof(u64))
return 0;
@@ -86,7 +88,7 @@ static u64 get_kaslr_seed(void *fdt)
return ret;
}
-asmlinkage u64 kaslr_early_init(void *fdt)
+asmlinkage u64 __init kaslr_early_init(void *fdt)
{
u64 seed, range;
--
2.39.2
Now that override pointers are always set, we can drop the various
non-NULL checks that we have in the code.
Signed-off-by: Ard Biesheuvel <[email protected]>
---
arch/arm64/kernel/idreg-override.c | 16 +++++-----------
1 file changed, 5 insertions(+), 11 deletions(-)
diff --git a/arch/arm64/kernel/idreg-override.c b/arch/arm64/kernel/idreg-override.c
index 434703e4e55cb785..b6e90ee6857eb758 100644
--- a/arch/arm64/kernel/idreg-override.c
+++ b/arch/arm64/kernel/idreg-override.c
@@ -196,9 +196,6 @@ static void __init match_options(const char *cmdline)
for (i = 0; i < ARRAY_SIZE(regs); i++) {
int f;
- if (!regs[i]->override)
- continue;
-
for (f = 0; strlen(regs[i]->fields[f].name); f++) {
u64 shift = regs[i]->fields[f].shift;
u64 width = regs[i]->fields[f].width ?: 4;
@@ -299,10 +296,8 @@ asmlinkage void __init init_feature_override(u64 boot_status)
int i;
for (i = 0; i < ARRAY_SIZE(regs); i++) {
- if (regs[i]->override) {
- regs[i]->override->val = 0;
- regs[i]->override->mask = 0;
- }
+ regs[i]->override->val = 0;
+ regs[i]->override->mask = 0;
}
__boot_status = boot_status;
@@ -310,9 +305,8 @@ asmlinkage void __init init_feature_override(u64 boot_status)
parse_cmdline();
for (i = 0; i < ARRAY_SIZE(regs); i++) {
- if (regs[i]->override)
- dcache_clean_inval_poc((unsigned long)regs[i]->override,
- (unsigned long)regs[i]->override +
- sizeof(*regs[i]->override));
+ dcache_clean_inval_poc((unsigned long)regs[i]->override,
+ (unsigned long)regs[i]->override +
+ sizeof(*regs[i]->override));
}
}
--
2.39.2
The mini C runtime runs before relocations are processed, and so it
cannot rely on statically initialized pointer variables.
Add a check to ensure that such code does not get introduced by
accident, by going over the relocations in each object, identifying the
ones that operate on data sections that are part of the executable
image, and raising an error if any relocations of type R_AARCH64_ABS64
exist. Note that such relocations are permitted in other places (e.g.,
debug sections) and will never occur in compiler generated code sections
when using the small code model, so only check sections that have
SHF_ALLOC set and SHF_EXECINSTR cleared.
To accommodate cases where statically initialized symbol references are
unavoidable, introduce a special case for ELF input data sections that
have ".rodata.prel64" in their names, and in these cases, instead of
rejecting any encountered ABS64 relocations, convert them into PREL64
relocations, which don't require any runtime fixups. Note that the code
in question must still be modified to deal with this, as it needs to
convert the 64-bit signed offsets into absolute addresses before use.
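As a hypothetical illustration of that pattern (the names are invented
for the example), such a reference is declared through a union so that
the initializer keeps its pointer type, and the consuming code converts
the stored place-relative offset back into a pointer via the accessor
added to pi.h below:

  static u64 some_object;

  static const union {
          u64 *ptr;
          prel64_t ptr_prel;
  } some_ref __prel64_initconst = { .ptr = &some_object };

  static u64 *get_some_object(void)
  {
          /* after relacheck, the stored value is a signed offset */
          return prel64_to_pointer(&some_ref.ptr_prel);
  }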
Signed-off-by: Ard Biesheuvel <[email protected]>
---
arch/arm64/kernel/pi/Makefile | 9 +-
arch/arm64/kernel/pi/pi.h | 14 +++
arch/arm64/kernel/pi/relacheck.c | 130 ++++++++++++++++++++
3 files changed, 151 insertions(+), 2 deletions(-)
diff --git a/arch/arm64/kernel/pi/Makefile b/arch/arm64/kernel/pi/Makefile
index c844a0546d7f0e62..bc32a431fe35e6f4 100644
--- a/arch/arm64/kernel/pi/Makefile
+++ b/arch/arm64/kernel/pi/Makefile
@@ -22,11 +22,16 @@ KCSAN_SANITIZE := n
UBSAN_SANITIZE := n
KCOV_INSTRUMENT := n
+hostprogs := relacheck
+
+quiet_cmd_piobjcopy = $(quiet_cmd_objcopy)
+ cmd_piobjcopy = $(cmd_objcopy) && $(obj)/relacheck $(@) $(<)
+
$(obj)/%.pi.o: OBJCOPYFLAGS := --prefix-symbols=__pi_ \
--remove-section=.note.gnu.property \
--prefix-alloc-sections=.init
-$(obj)/%.pi.o: $(obj)/%.o FORCE
- $(call if_changed,objcopy)
+$(obj)/%.pi.o: $(obj)/%.o $(obj)/relacheck FORCE
+ $(call if_changed,piobjcopy)
$(obj)/lib-%.o: $(srctree)/lib/%.c FORCE
$(call if_changed_rule,cc_o_c)
diff --git a/arch/arm64/kernel/pi/pi.h b/arch/arm64/kernel/pi/pi.h
new file mode 100644
index 0000000000000000..f455ad385976a664
--- /dev/null
+++ b/arch/arm64/kernel/pi/pi.h
@@ -0,0 +1,14 @@
+// SPDX-License-Identifier: GPL-2.0-only
+// Copyright 2023 Google LLC
+// Author: Ard Biesheuvel <[email protected]>
+
+#define __prel64_initconst __section(".init.rodata.prel64")
+
+typedef volatile signed long prel64_t;
+
+static inline void *prel64_to_pointer(const prel64_t *offset)
+{
+ if (!*offset)
+ return NULL;
+ return (void *)offset + *offset;
+}
diff --git a/arch/arm64/kernel/pi/relacheck.c b/arch/arm64/kernel/pi/relacheck.c
new file mode 100644
index 0000000000000000..29310c632e9312e7
--- /dev/null
+++ b/arch/arm64/kernel/pi/relacheck.c
@@ -0,0 +1,130 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (C) 2022 - Google LLC
+ * Author: Ard Biesheuvel <[email protected]>
+ */
+
+#include <elf.h>
+#include <fcntl.h>
+#include <stdbool.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <sys/mman.h>
+#include <sys/stat.h>
+#include <sys/types.h>
+#include <unistd.h>
+
+#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
+#define HOST_ORDER ELFDATA2LSB
+#elif __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__
+#define HOST_ORDER ELFDATA2MSB
+#endif
+
+static Elf64_Ehdr *ehdr;
+static Elf64_Shdr *shdr;
+static const char *strtab;
+static bool swap;
+
+static uint64_t swab_elfxword(uint64_t val)
+{
+ return swap ? __builtin_bswap64(val) : val;
+}
+
+static uint32_t swab_elfword(uint32_t val)
+{
+ return swap ? __builtin_bswap32(val) : val;
+}
+
+static uint16_t swab_elfhword(uint16_t val)
+{
+ return swap ? __builtin_bswap16(val) : val;
+}
+
+int main(int argc, char *argv[])
+{
+ struct stat stat;
+ int fd, ret;
+
+ if (argc < 3) {
+ fprintf(stderr, "file arguments missing\n");
+ exit(EXIT_FAILURE);
+ }
+
+ fd = open(argv[1], O_RDWR);
+ if (fd < 0) {
+ fprintf(stderr, "failed to open %s\n", argv[1]);
+ exit(EXIT_FAILURE);
+ }
+
+ ret = fstat(fd, &stat);
+ if (ret < 0) {
+ fprintf(stderr, "failed to stat() %s\n", argv[1]);
+ exit(EXIT_FAILURE);
+ }
+
+ ehdr = mmap(0, stat.st_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
+ if (ehdr == MAP_FAILED) {
+ fprintf(stderr, "failed to mmap() %s\n", argv[1]);
+ exit(EXIT_FAILURE);
+ }
+
+ swap = ehdr->e_ident[EI_DATA] != HOST_ORDER;
+ shdr = (void *)ehdr + swab_elfxword(ehdr->e_shoff);
+ strtab = (void *)ehdr +
+ swab_elfxword(shdr[swab_elfhword(ehdr->e_shstrndx)].sh_offset);
+
+ for (int i = 0; i < swab_elfhword(ehdr->e_shnum); i++) {
+ unsigned long info, flags;
+ bool prel64 = false;
+ Elf64_Rela *rela;
+ int numrela;
+
+ if (swab_elfword(shdr[i].sh_type) != SHT_RELA)
+ continue;
+
+ /* only consider RELA sections operating on data */
+ info = swab_elfword(shdr[i].sh_info);
+ flags = swab_elfxword(shdr[info].sh_flags);
+ if ((flags & (SHF_ALLOC | SHF_EXECINSTR)) != SHF_ALLOC)
+ continue;
+
+ /*
+ * We generally don't permit ABS64 relocations in the code that
+ * runs before relocation processing occurs. If statically
+ * initialized absolute symbol references are unavoidable, they
+ * may be emitted into a *.rodata.prel64 section and they will
+ * be converted to place-relative 64-bit references. This
+ * requires special handling in the referring code.
+ */
+ if (strstr(strtab + swab_elfword(shdr[info].sh_name),
+ ".rodata.prel64")) {
+ prel64 = true;
+ }
+
+ rela = (void *)ehdr + swab_elfxword(shdr[i].sh_offset);
+ numrela = swab_elfxword(shdr[i].sh_size) / sizeof(*rela);
+
+ for (int j = 0; j < numrela; j++) {
+ uint64_t info = swab_elfxword(rela[j].r_info);
+
+ if (ELF64_R_TYPE(info) != R_AARCH64_ABS64)
+ continue;
+
+ if (prel64) {
+ /* convert ABS64 into PREL64 */
+ info ^= R_AARCH64_ABS64 ^ R_AARCH64_PREL64;
+ rela[j].r_info = swab_elfxword(info);
+ } else {
+ fprintf(stderr,
+ "Unexpected absolute relocations detected in %s\n",
+ argv[2]);
+ close(fd);
+ unlink(argv[1]);
+ exit(EXIT_FAILURE);
+ }
+ }
+ }
+ close(fd);
+ return 0;
+}
--
2.39.2
The ID reg override handling code uses a rather elaborate data structure
that relies on statically initialized absolute address values in pointer
fields. This means that this code cannot run until relocation fixups
have been applied, and this is unfortunate, because it means we cannot
discover overrides for KASLR or LVA/LPA without creating the kernel
mapping and performing the relocations first.
This can be solved by switching to place-relative relocations, which can
be applied by the linker at build time. This means some additional
arithmetic is required when dereferencing these pointers, as we can no
longer dereference the pointer members directly.
So let's implement this for idreg-override.c in a preliminary way, i.e.,
convert all the references in code to use a special accessor that
produces the correct absolute value at runtime.
To preserve the strong type checking for the static initializers, use
union types for representing the hybrid quantities.
Signed-off-by: Ard Biesheuvel <[email protected]>
---
arch/arm64/kernel/idreg-override.c | 98 +++++++++++++-------
1 file changed, 65 insertions(+), 33 deletions(-)
diff --git a/arch/arm64/kernel/idreg-override.c b/arch/arm64/kernel/idreg-override.c
index b6e90ee6857eb758..fc9ed722621412bf 100644
--- a/arch/arm64/kernel/idreg-override.c
+++ b/arch/arm64/kernel/idreg-override.c
@@ -21,14 +21,32 @@
static u64 __boot_status __initdata;
+// temporary __prel64 related definitions
+// to be removed when this code is moved under pi/
+
+#define __prel64_initconst __initconst
+
+typedef void *prel64_t;
+
+static void *prel64_to_pointer(const prel64_t *p)
+{
+ return *p;
+}
+
struct ftr_set_desc {
char name[FTR_DESC_NAME_LEN];
- struct arm64_ftr_override *override;
+ union {
+ struct arm64_ftr_override *override;
+ prel64_t override_prel;
+ };
struct {
char name[FTR_DESC_FIELD_LEN];
u8 shift;
u8 width;
- bool (*filter)(u64 val);
+ union {
+ bool (*filter)(u64 val);
+ prel64_t filter_prel;
+ };
} fields[];
};
@@ -46,7 +64,7 @@ static bool __init mmfr1_vh_filter(u64 val)
val == 0);
}
-static const struct ftr_set_desc mmfr1 __initconst = {
+static const struct ftr_set_desc mmfr1 __prel64_initconst = {
.name = "id_aa64mmfr1",
.override = &id_aa64mmfr1_override,
.fields = {
@@ -70,7 +88,7 @@ static bool __init pfr0_sve_filter(u64 val)
return true;
}
-static const struct ftr_set_desc pfr0 __initconst = {
+static const struct ftr_set_desc pfr0 __prel64_initconst = {
.name = "id_aa64pfr0",
.override = &id_aa64pfr0_override,
.fields = {
@@ -94,7 +112,7 @@ static bool __init pfr1_sme_filter(u64 val)
return true;
}
-static const struct ftr_set_desc pfr1 __initconst = {
+static const struct ftr_set_desc pfr1 __prel64_initconst = {
.name = "id_aa64pfr1",
.override = &id_aa64pfr1_override,
.fields = {
@@ -105,7 +123,7 @@ static const struct ftr_set_desc pfr1 __initconst = {
},
};
-static const struct ftr_set_desc isar1 __initconst = {
+static const struct ftr_set_desc isar1 __prel64_initconst = {
.name = "id_aa64isar1",
.override = &id_aa64isar1_override,
.fields = {
@@ -117,7 +135,7 @@ static const struct ftr_set_desc isar1 __initconst = {
},
};
-static const struct ftr_set_desc isar2 __initconst = {
+static const struct ftr_set_desc isar2 __prel64_initconst = {
.name = "id_aa64isar2",
.override = &id_aa64isar2_override,
.fields = {
@@ -127,7 +145,7 @@ static const struct ftr_set_desc isar2 __initconst = {
},
};
-static const struct ftr_set_desc smfr0 __initconst = {
+static const struct ftr_set_desc smfr0 __prel64_initconst = {
.name = "id_aa64smfr0",
.override = &id_aa64smfr0_override,
.fields = {
@@ -138,7 +156,7 @@ static const struct ftr_set_desc smfr0 __initconst = {
},
};
-static const struct ftr_set_desc sw_features __initconst = {
+static const struct ftr_set_desc sw_features __prel64_initconst = {
.name = "arm64_sw",
.override = &arm64_sw_feature_override,
.fields = {
@@ -147,14 +165,17 @@ static const struct ftr_set_desc sw_features __initconst = {
},
};
-static const struct ftr_set_desc * const regs[] __initconst = {
- &mmfr1,
- &pfr0,
- &pfr1,
- &isar1,
- &isar2,
- &smfr0,
- &sw_features,
+static const union {
+ const struct ftr_set_desc *reg;
+ prel64_t reg_prel;
+} regs[] __prel64_initconst = {
+ { .reg = &mmfr1 },
+ { .reg = &pfr0 },
+ { .reg = &pfr1 },
+ { .reg = &isar1 },
+ { .reg = &isar2 },
+ { .reg = &smfr0 },
+ { .reg = &sw_features },
};
static const struct {
@@ -194,15 +215,20 @@ static void __init match_options(const char *cmdline)
int i;
for (i = 0; i < ARRAY_SIZE(regs); i++) {
+ const struct ftr_set_desc *reg = prel64_to_pointer(®s[i].reg_prel);
+ struct arm64_ftr_override *override;
int f;
- for (f = 0; strlen(regs[i]->fields[f].name); f++) {
- u64 shift = regs[i]->fields[f].shift;
- u64 width = regs[i]->fields[f].width ?: 4;
+ override = prel64_to_pointer(®->override_prel);
+
+ for (f = 0; strlen(reg->fields[f].name); f++) {
+ u64 shift = reg->fields[f].shift;
+ u64 width = reg->fields[f].width ?: 4;
u64 mask = GENMASK_ULL(shift + width - 1, shift);
+ bool (*filter)(u64 val);
u64 v;
- if (find_field(cmdline, regs[i], f, &v))
+ if (find_field(cmdline, reg, f, &v))
continue;
/*
@@ -210,16 +236,16 @@ static void __init match_options(const char *cmdline)
* it by setting the value to the all-ones while
* clearing the mask... Yes, this is fragile.
*/
- if (regs[i]->fields[f].filter &&
- !regs[i]->fields[f].filter(v)) {
- regs[i]->override->val |= mask;
- regs[i]->override->mask &= ~mask;
+ filter = prel64_to_pointer(®->fields[f].filter_prel);
+ if (filter && !filter(v)) {
+ override->val |= mask;
+ override->mask &= ~mask;
continue;
}
- regs[i]->override->val &= ~mask;
- regs[i]->override->val |= (v << shift) & mask;
- regs[i]->override->mask |= mask;
+ override->val &= ~mask;
+ override->val |= (v << shift) & mask;
+ override->mask |= mask;
return;
}
@@ -293,11 +319,16 @@ void init_feature_override(u64 boot_status);
asmlinkage void __init init_feature_override(u64 boot_status)
{
+ struct arm64_ftr_override *override;
+ const struct ftr_set_desc *reg;
int i;
for (i = 0; i < ARRAY_SIZE(regs); i++) {
- regs[i]->override->val = 0;
- regs[i]->override->mask = 0;
+ reg = prel64_to_pointer(®s[i].reg_prel);
+ override = prel64_to_pointer(®->override_prel);
+
+ override->val = 0;
+ override->mask = 0;
}
__boot_status = boot_status;
@@ -305,8 +336,9 @@ asmlinkage void __init init_feature_override(u64 boot_status)
parse_cmdline();
for (i = 0; i < ARRAY_SIZE(regs); i++) {
- dcache_clean_inval_poc((unsigned long)regs[i]->override,
- (unsigned long)regs[i]->override +
- sizeof(*regs[i]->override));
+ reg = prel64_to_pointer(®s[i].reg_prel);
+ override = prel64_to_pointer(®->override_prel);
+ dcache_clean_inval_poc((unsigned long)override,
+ (unsigned long)(override + 1));
}
}
--
2.39.2
Now that we have a mini C runtime before the kernel mapping is up, we
can move the non-trivial relocation processing code out of head.S and
reimplement it in C.
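For reference, a minimal sketch of what the RELA processing boils down
to in C (this approximates, but is not a verbatim copy of, the new
pi/relocate.c added below); 'offset' is the physical misalignment/KASLR
displacement that head.S keeps in x23:

  static void apply_rela(Elf64_Rela *rela, Elf64_Rela *end, u64 offset)
  {
          /* apply each R_AARCH64_RELATIVE entry in place */
          for (; rela < end; rela++) {
                  if (ELF64_R_TYPE(rela->r_info) != R_AARCH64_RELATIVE)
                          continue;
                  *(u64 *)(rela->r_offset + offset) = rela->r_addend + offset;
          }
  }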
Signed-off-by: Ard Biesheuvel <[email protected]>
---
arch/arm64/kernel/Makefile | 3 +-
arch/arm64/kernel/head.S | 104 ++------------------
arch/arm64/kernel/pi/Makefile | 5 +-
arch/arm64/kernel/pi/relocate.c | 62 ++++++++++++
arch/arm64/kernel/vmlinux.lds.S | 12 ++-
5 files changed, 82 insertions(+), 104 deletions(-)
diff --git a/arch/arm64/kernel/Makefile b/arch/arm64/kernel/Makefile
index ceba6792f5b3c473..e27168d6ed2050b9 100644
--- a/arch/arm64/kernel/Makefile
+++ b/arch/arm64/kernel/Makefile
@@ -58,7 +58,8 @@ obj-$(CONFIG_ACPI) += acpi.o
obj-$(CONFIG_ACPI_NUMA) += acpi_numa.o
obj-$(CONFIG_ARM64_ACPI_PARKING_PROTOCOL) += acpi_parking_protocol.o
obj-$(CONFIG_PARAVIRT) += paravirt.o
-obj-$(CONFIG_RANDOMIZE_BASE) += kaslr.o pi/
+obj-$(CONFIG_RELOCATABLE) += pi/
+obj-$(CONFIG_RANDOMIZE_BASE) += kaslr.o
obj-$(CONFIG_HIBERNATION) += hibernate.o hibernate-asm.o
obj-$(CONFIG_ELF_CORE) += elfcore.o
obj-$(CONFIG_KEXEC_CORE) += machine_kexec.o relocate_kernel.o \
diff --git a/arch/arm64/kernel/head.S b/arch/arm64/kernel/head.S
index 65cdaaa2c859418f..5047a2952ec273f9 100644
--- a/arch/arm64/kernel/head.S
+++ b/arch/arm64/kernel/head.S
@@ -81,7 +81,7 @@
* x20 primary_entry() .. __primary_switch() CPU boot mode
* x21 primary_entry() .. start_kernel() FDT pointer passed at boot in x0
* x22 create_idmap() .. start_kernel() ID map VA of the DT blob
- * x23 primary_entry() .. start_kernel() physical misalignment/KASLR offset
+ * x23 __primary_switch() physical misalignment/KASLR offset
* x24 __primary_switch() linear map KASLR seed
* x25 primary_entry() .. start_kernel() supported VA size
* x28 create_idmap() callee preserved temp register
@@ -389,7 +389,7 @@ SYM_FUNC_START_LOCAL(create_idmap)
/* Remap the kernel page tables r/w in the ID map */
adrp x1, _text
adrp x2, init_pg_dir
- adrp x3, init_pg_end
+ adrp x3, _end
bic x4, x2, #SWAPPER_BLOCK_SIZE - 1
mov x5, SWAPPER_RW_MMUFLAGS
mov x6, #SWAPPER_BLOCK_SHIFT
@@ -777,97 +777,6 @@ SYM_FUNC_START_LOCAL(__no_granule_support)
b 1b
SYM_FUNC_END(__no_granule_support)
-#ifdef CONFIG_RELOCATABLE
-SYM_FUNC_START_LOCAL(__relocate_kernel)
- /*
- * Iterate over each entry in the relocation table, and apply the
- * relocations in place.
- */
- adr_l x9, __rela_start
- adr_l x10, __rela_end
- mov_q x11, KIMAGE_VADDR // default virtual offset
- add x11, x11, x23 // actual virtual offset
-
-0: cmp x9, x10
- b.hs 1f
- ldp x12, x13, [x9], #24
- ldr x14, [x9, #-8]
- cmp w13, #R_AARCH64_RELATIVE
- b.ne 0b
- add x14, x14, x23 // relocate
- str x14, [x12, x23]
- b 0b
-
-1:
-#ifdef CONFIG_RELR
- /*
- * Apply RELR relocations.
- *
- * RELR is a compressed format for storing relative relocations. The
- * encoded sequence of entries looks like:
- * [ AAAAAAAA BBBBBBB1 BBBBBBB1 ... AAAAAAAA BBBBBB1 ... ]
- *
- * i.e. start with an address, followed by any number of bitmaps. The
- * address entry encodes 1 relocation. The subsequent bitmap entries
- * encode up to 63 relocations each, at subsequent offsets following
- * the last address entry.
- *
- * The bitmap entries must have 1 in the least significant bit. The
- * assumption here is that an address cannot have 1 in lsb. Odd
- * addresses are not supported. Any odd addresses are stored in the RELA
- * section, which is handled above.
- *
- * Excluding the least significant bit in the bitmap, each non-zero
- * bit in the bitmap represents a relocation to be applied to
- * a corresponding machine word that follows the base address
- * word. The second least significant bit represents the machine
- * word immediately following the initial address, and each bit
- * that follows represents the next word, in linear order. As such,
- * a single bitmap can encode up to 63 relocations in a 64-bit object.
- *
- * In this implementation we store the address of the next RELR table
- * entry in x9, the address being relocated by the current address or
- * bitmap entry in x13 and the address being relocated by the current
- * bit in x14.
- */
- adr_l x9, __relr_start
- adr_l x10, __relr_end
-
-2: cmp x9, x10
- b.hs 7f
- ldr x11, [x9], #8
- tbnz x11, #0, 3f // branch to handle bitmaps
- add x13, x11, x23
- ldr x12, [x13] // relocate address entry
- add x12, x12, x23
- str x12, [x13], #8 // adjust to start of bitmap
- b 2b
-
-3: mov x14, x13
-4: lsr x11, x11, #1
- cbz x11, 6f
- tbz x11, #0, 5f // skip bit if not set
- ldr x12, [x14] // relocate bit
- add x12, x12, x23
- str x12, [x14]
-
-5: add x14, x14, #8 // move to next bit's address
- b 4b
-
-6: /*
- * Move to the next bitmap's address. 8 is the word size, and 63 is the
- * number of significant bits in a bitmap entry.
- */
- add x13, x13, #(8 * 63)
- b 2b
-
-7:
-#endif
- ret
-
-SYM_FUNC_END(__relocate_kernel)
-#endif
-
SYM_FUNC_START_LOCAL(__primary_switch)
adrp x1, reserved_pg_dir
adrp x2, init_idmap_pg_dir
@@ -875,11 +784,11 @@ SYM_FUNC_START_LOCAL(__primary_switch)
#ifdef CONFIG_RELOCATABLE
adrp x23, KERNEL_START
and x23, x23, MIN_KIMG_ALIGN - 1
-#ifdef CONFIG_RANDOMIZE_BASE
- mov x0, x22
- adrp x1, init_pg_end
+ adrp x1, early_init_stack
mov sp, x1
mov x29, xzr
+#ifdef CONFIG_RANDOMIZE_BASE
+ mov x0, x22
bl __pi_kaslr_early_init
and x24, x0, #SZ_2M - 1 // capture memstart offset seed
bic x0, x0, #SZ_2M - 1
@@ -892,7 +801,8 @@ SYM_FUNC_START_LOCAL(__primary_switch)
adrp x1, init_pg_dir
load_ttbr1 x1, x1, x2
#ifdef CONFIG_RELOCATABLE
- bl __relocate_kernel
+ mov x0, x23
+ bl __pi_relocate_kernel
#endif
ldr x8, =__primary_switched
adrp x0, KERNEL_START // __pa(KERNEL_START)
diff --git a/arch/arm64/kernel/pi/Makefile b/arch/arm64/kernel/pi/Makefile
index 2bbe866417d453ff..d084c1dcf4165420 100644
--- a/arch/arm64/kernel/pi/Makefile
+++ b/arch/arm64/kernel/pi/Makefile
@@ -38,5 +38,6 @@ $(obj)/lib-%.pi.o: OBJCOPYFLAGS += --prefix-alloc-sections=.init
$(obj)/lib-%.o: $(srctree)/lib/%.c FORCE
$(call if_changed_rule,cc_o_c)
-obj-y := kaslr_early.pi.o lib-fdt.pi.o lib-fdt_ro.pi.o
-extra-y := $(patsubst %.pi.o,%.o,$(obj-y))
+obj-y := relocate.pi.o
+obj-$(CONFIG_RANDOMIZE_BASE) += kaslr_early.pi.o lib-fdt.pi.o lib-fdt_ro.pi.o
+extra-y := $(patsubst %.pi.o,%.o,$(obj-y))
diff --git a/arch/arm64/kernel/pi/relocate.c b/arch/arm64/kernel/pi/relocate.c
new file mode 100644
index 0000000000000000..1853408ea76b0e5d
--- /dev/null
+++ b/arch/arm64/kernel/pi/relocate.c
@@ -0,0 +1,62 @@
+// SPDX-License-Identifier: GPL-2.0-only
+// Copyright 2023 Google LLC
+// Authors: Ard Biesheuvel <[email protected]>
+// Peter Collingbourne <[email protected]>
+
+#include <linux/elf.h>
+#include <linux/init.h>
+#include <linux/types.h>
+
+extern const Elf64_Rela rela_start[], rela_end[];
+extern const u64 relr_start[], relr_end[];
+
+void __init relocate_kernel(u64 offset)
+{
+ u64 *place = NULL;
+
+ for (const Elf64_Rela *rela = rela_start; rela < rela_end; rela++) {
+ if (ELF64_R_TYPE(rela->r_info) != R_AARCH64_RELATIVE)
+ continue;
+ *(u64 *)(rela->r_offset + offset) = rela->r_addend + offset;
+ }
+
+ if (!IS_ENABLED(CONFIG_RELR) || !offset)
+ return;
+
+ /*
+ * Apply RELR relocations.
+ *
+ * RELR is a compressed format for storing relative relocations. The
+ * encoded sequence of entries looks like:
+ * [ AAAAAAAA BBBBBBB1 BBBBBBB1 ... AAAAAAAA BBBBBB1 ... ]
+ *
+ * i.e. start with an address, followed by any number of bitmaps. The
+ * address entry encodes 1 relocation. The subsequent bitmap entries
+ * encode up to 63 relocations each, at subsequent offsets following
+ * the last address entry.
+ *
+ * The bitmap entries must have 1 in the least significant bit. The
+ * assumption here is that an address cannot have 1 in lsb. Odd
+ * addresses are not supported. Any odd addresses are stored in the
+ * RELA section, which is handled above.
+ *
+ * With the exception of the least significant bit, each bit in the
+ * bitmap corresponds with a machine word that follows the base address
+ * word, and the bit value indicates whether or not a relocation needs
+ * to be applied to it. The second least significant bit represents the
+ * machine word immediately following the initial address, and each bit
+ * that follows represents the next word, in linear order. As such, a
+ * single bitmap can encode up to 63 relocations in a 64-bit object.
+ */
+ for (const u64 *relr = relr_start; relr < relr_end; relr++) {
+ if ((*relr & 1) == 0) {
+ place = (u64 *)(*relr + offset);
+ *place++ += offset;
+ } else {
+ for (u64 *p = place, r = *relr >> 1; r; p++, r >>= 1)
+ if (r & 1)
+ *p += offset;
+ place += 63;
+ }
+ }
+}
diff --git a/arch/arm64/kernel/vmlinux.lds.S b/arch/arm64/kernel/vmlinux.lds.S
index b9202c2ee18e02d8..ec24b1e70d606ec8 100644
--- a/arch/arm64/kernel/vmlinux.lds.S
+++ b/arch/arm64/kernel/vmlinux.lds.S
@@ -271,15 +271,15 @@ SECTIONS
HYPERVISOR_RELOC_SECTION
.rela.dyn : ALIGN(8) {
- __rela_start = .;
+ __pi_rela_start = .;
*(.rela .rela*)
- __rela_end = .;
+ __pi_rela_end = .;
}
.relr.dyn : ALIGN(8) {
- __relr_start = .;
+ __pi_relr_start = .;
*(.relr.dyn)
- __relr_end = .;
+ __pi_relr_end = .;
}
. = ALIGN(SEGMENT_ALIGN);
@@ -318,6 +318,10 @@ SECTIONS
init_pg_dir = .;
. += INIT_DIR_SIZE;
init_pg_end = .;
+#ifdef CONFIG_RELOCATABLE
+ . += SZ_4K; /* stack for the early relocation code */
+ early_init_stack = .;
+#endif
. = ALIGN(SEGMENT_ALIGN);
__pecoff_data_size = ABSOLUTE(. - __initdata_begin);
--
2.39.2
The early FDT remap code is no longer used, so let's drop it.
Signed-off-by: Ard Biesheuvel <[email protected]>
---
arch/arm64/include/asm/setup.h | 3 ---
arch/arm64/kernel/setup.c | 15 ---------------
2 files changed, 18 deletions(-)
diff --git a/arch/arm64/include/asm/setup.h b/arch/arm64/include/asm/setup.h
index f4af547ef54caa70..acc5e00bf3b0fafb 100644
--- a/arch/arm64/include/asm/setup.h
+++ b/arch/arm64/include/asm/setup.h
@@ -7,9 +7,6 @@
#include <uapi/asm/setup.h>
-void *get_early_fdt_ptr(void);
-void early_fdt_map(u64 dt_phys);
-
/*
* These two variables are used in the head.S file.
*/
diff --git a/arch/arm64/kernel/setup.c b/arch/arm64/kernel/setup.c
index b8ec7b3ac9cbe8a8..bda21a9245943c57 100644
--- a/arch/arm64/kernel/setup.c
+++ b/arch/arm64/kernel/setup.c
@@ -166,21 +166,6 @@ static void __init smp_build_mpidr_hash(void)
pr_warn("Large number of MPIDR hash buckets detected\n");
}
-static void *early_fdt_ptr __initdata;
-
-void __init *get_early_fdt_ptr(void)
-{
- return early_fdt_ptr;
-}
-
-asmlinkage void __init early_fdt_map(u64 dt_phys)
-{
- int fdt_size;
-
- early_fixmap_init();
- early_fdt_ptr = fixmap_remap_fdt(dt_phys, &fdt_size, PAGE_KERNEL);
-}
-
static void __init setup_machine_fdt(phys_addr_t dt_phys)
{
int size;
--
2.39.2
All ID register value overrides are =0 with the exception of the nokaslr
pseudo feature which uses =1. In order to remove the dependency on
kstrtou64(), which is part of the core kernel and no longer usable once
we move idreg-override into the early mini C runtime, let's just parse a
single hex digit (with optional leading 0x) and set the output value
accordingly.
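For example (values chosen for illustration), overrides such as
  arm64_sw.nokaslr=1
  id_aa64mmfr1.vh=0x0
are still accepted, whereas a multi-digit value such as
id_aa64mmfr1.vh=0x10 is now rejected, which is fine given that, as
noted above, all existing overrides use single digit values.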
Signed-off-by: Ard Biesheuvel <[email protected]>
---
arch/arm64/kernel/idreg-override.c | 16 +++++++++++++++-
1 file changed, 15 insertions(+), 1 deletion(-)
diff --git a/arch/arm64/kernel/idreg-override.c b/arch/arm64/kernel/idreg-override.c
index 58c02fe20e1050b4..758f0e86e2bd2a34 100644
--- a/arch/arm64/kernel/idreg-override.c
+++ b/arch/arm64/kernel/idreg-override.c
@@ -195,6 +195,20 @@ static const struct {
{ "nokaslr", "arm64_sw.nokaslr=1" },
};
+static int __init parse_hexdigit(const char *p, u64 *v)
+{
+ // skip "0x" if it comes next
+ if (p[0] == '0' && tolower(p[1]) == 'x')
+ p += 2;
+
+ // check whether the RHS is a single hex digit
+ if (!isxdigit(p[0]) || (p[1] && !isspace(p[1])))
+ return -EINVAL;
+
+ *v = tolower(*p) - (isdigit(*p) ? '0' : 'a' - 10);
+ return 0;
+}
+
static int __init find_field(const char *cmdline, char *opt, int len,
const struct ftr_set_desc *reg, int f, u64 *v)
{
@@ -208,7 +222,7 @@ static int __init find_field(const char *cmdline, char *opt, int len,
if (memcmp(cmdline, opt, len))
return -1;
- return kstrtou64(cmdline + len, 0, v);
+ return parse_hexdigit(cmdline + len, v);
}
static void __init match_options(const char *cmdline)
--
2.39.2
The only way parameq() and parameqn() deviate from the ordinary string
and memory routines is that they ignore the difference between dashes
and underscores.
Since we copy each command line argument into a buffer before passing it
to parameq() and parameqn() numerous times, let's convert all dashes
to underscores once, up front, and update the alias array accordingly.
This also helps reduce the dependency on kernel APIs that are no longer
available once we move this code into the early mini C runtime.
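For example, passing 'kvm-arm.mode=nvhe' on the command line now ends
up in the buffer as 'kvm_arm.mode=nvhe', which is why the alias table
is switched over to the underscore spelling as well.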
Signed-off-by: Ard Biesheuvel <[email protected]>
---
arch/arm64/kernel/idreg-override.c | 26 ++++++++++++--------
1 file changed, 16 insertions(+), 10 deletions(-)
diff --git a/arch/arm64/kernel/idreg-override.c b/arch/arm64/kernel/idreg-override.c
index fc9ed722621412bf..23bbbc37ac24ba09 100644
--- a/arch/arm64/kernel/idreg-override.c
+++ b/arch/arm64/kernel/idreg-override.c
@@ -182,8 +182,8 @@ static const struct {
char alias[FTR_ALIAS_NAME_LEN];
char feature[FTR_ALIAS_OPTION_LEN];
} aliases[] __initconst = {
- { "kvm-arm.mode=nvhe", "id_aa64mmfr1.vh=0" },
- { "kvm-arm.mode=protected", "id_aa64mmfr1.vh=0" },
+ { "kvm_arm.mode=nvhe", "id_aa64mmfr1.vh=0" },
+ { "kvm_arm.mode=protected", "id_aa64mmfr1.vh=0" },
{ "arm64.nosve", "id_aa64pfr0.sve=0 id_aa64pfr1.sme=0" },
{ "arm64.nosme", "id_aa64pfr1.sme=0" },
{ "arm64.nobti", "id_aa64pfr1.bt=0" },
@@ -204,7 +204,7 @@ static int __init find_field(const char *cmdline,
len = snprintf(opt, ARRAY_SIZE(opt), "%s.%s=",
reg->name, reg->fields[f].name);
- if (!parameqn(cmdline, opt, len))
+ if (memcmp(cmdline, opt, len))
return -1;
return kstrtou64(cmdline + len, 0, v);
@@ -261,23 +261,29 @@ static __init void __parse_cmdline(const char *cmdline, bool parse_aliases)
cmdline = skip_spaces(cmdline);
- for (len = 0; cmdline[len] && !isspace(cmdline[len]); len++);
+ /* terminate on "--" appearing on the command line by itself */
+ if (cmdline[0] == '-' && cmdline[1] == '-' && isspace(cmdline[2]))
+ return;
+
+ for (len = 0; cmdline[len] && !isspace(cmdline[len]); len++) {
+ if (len >= sizeof(buf) - 1)
+ break;
+ if (cmdline[len] == '-')
+ buf[len] = '_';
+ else
+ buf[len] = cmdline[len];
+ }
if (!len)
return;
- len = min(len, ARRAY_SIZE(buf) - 1);
- strncpy(buf, cmdline, len);
buf[len] = 0;
- if (strcmp(buf, "--") == 0)
- return;
-
cmdline += len;
match_options(buf);
for (i = 0; parse_aliases && i < ARRAY_SIZE(aliases); i++)
- if (parameq(buf, aliases[i].alias))
+ if (!memcmp(buf, aliases[i].alias, len + 1))
__parse_cmdline(aliases[i].feature, false);
} while (1);
}
--
2.39.2
strlen() is a costly way to decide whether a string is empty: if it is,
the first character will be NUL, so we can check for that directly.
Signed-off-by: Ard Biesheuvel <[email protected]>
---
arch/arm64/kernel/idreg-override.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/arch/arm64/kernel/idreg-override.c b/arch/arm64/kernel/idreg-override.c
index 23bbbc37ac24ba09..476dc3f0e9d90e22 100644
--- a/arch/arm64/kernel/idreg-override.c
+++ b/arch/arm64/kernel/idreg-override.c
@@ -221,7 +221,7 @@ static void __init match_options(const char *cmdline)
override = prel64_to_pointer(&reg->override_prel);
- for (f = 0; strlen(reg->fields[f].name); f++) {
+ for (f = 0; reg->fields[f].name[0] != '\0'; f++) {
u64 shift = reg->fields[f].shift;
u64 width = reg->fields[f].width ?: 4;
u64 mask = GENMASK_ULL(shift + width - 1, shift);
--
2.39.2
Instead of using sprintf() with the "%s.%s=" format, where the first
string argument is always the same in the inner loop of match_options(),
use simple memcpy() for string concatenation, and move the first copy to
the outer loop. This removes the dependency on sprintf(), which will be
difficult to fulfil when we move this code into the early mini C
runtime.
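For example, for the id_aa64mmfr1 register, match_options() now writes
'id_aa64mmfr1.' into opt[] once in the outer loop, and find_field()
appends 'vh=' (and the other field names in turn) before comparing the
result against the command line.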
Signed-off-by: Ard Biesheuvel <[email protected]>
---
arch/arm64/kernel/idreg-override.c | 19 +++++++++++++------
1 file changed, 13 insertions(+), 6 deletions(-)
diff --git a/arch/arm64/kernel/idreg-override.c b/arch/arm64/kernel/idreg-override.c
index 476dc3f0e9d90e22..58c02fe20e1050b4 100644
--- a/arch/arm64/kernel/idreg-override.c
+++ b/arch/arm64/kernel/idreg-override.c
@@ -195,14 +195,15 @@ static const struct {
{ "nokaslr", "arm64_sw.nokaslr=1" },
};
-static int __init find_field(const char *cmdline,
+static int __init find_field(const char *cmdline, char *opt, int len,
const struct ftr_set_desc *reg, int f, u64 *v)
{
- char opt[FTR_DESC_NAME_LEN + FTR_DESC_FIELD_LEN + 2];
- int len;
+ int flen = strlen(reg->fields[f].name);
- len = snprintf(opt, ARRAY_SIZE(opt), "%s.%s=",
- reg->name, reg->fields[f].name);
+ // append '<fieldname>=' to obtain '<name>.<fieldname>='
+ memcpy(opt + len, reg->fields[f].name, flen);
+ len += flen;
+ opt[len++] = '=';
if (memcmp(cmdline, opt, len))
return -1;
@@ -212,15 +213,21 @@ static int __init find_field(const char *cmdline,
static void __init match_options(const char *cmdline)
{
+ char opt[FTR_DESC_NAME_LEN + FTR_DESC_FIELD_LEN + 2];
int i;
for (i = 0; i < ARRAY_SIZE(regs); i++) {
const struct ftr_set_desc *reg = prel64_to_pointer(&regs[i].reg_prel);
struct arm64_ftr_override *override;
+ int len = strlen(reg->name);
int f;
override = prel64_to_pointer(&reg->override_prel);
+ // set opt[] to '<name>.'
+ memcpy(opt, reg->name, len);
+ opt[len++] = '.';
+
for (f = 0; reg->fields[f].name[0] != '\0'; f++) {
u64 shift = reg->fields[f].shift;
u64 width = reg->fields[f].width ?: 4;
@@ -228,7 +235,7 @@ static void __init match_options(const char *cmdline)
bool (*filter)(u64 val);
u64 v;
- if (find_field(cmdline, reg, f, &v))
+ if (find_field(cmdline, opt, len, reg, f, &v))
continue;
/*
--
2.39.2
Add some helpers that will be used by the early kernel mapping code to
check feature support on the local CPU. This permits the early kernel
mapping to be created with the right attributes, removing the need to
tear it down and recreate it.
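For instance, a condensed example from a later patch in this series,
which uses cpu_has_bti() to pick the right text permissions up front:
  pgprot_t text_prot = PAGE_KERNEL_ROX;
  if (IS_ENABLED(CONFIG_ARM64_BTI_KERNEL) && cpu_has_bti())
          text_prot = __pgprot_modify(text_prot, PTE_GP, PTE_GP);
Similarly, cpu_has_pac() is used there to decide whether dynamic shadow
call stack patching can be skipped.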
Signed-off-by: Ard Biesheuvel <[email protected]>
---
arch/arm64/include/asm/cpufeature.h | 56 ++++++++++++++++++++
arch/arm64/kernel/cpufeature.c | 12 +----
2 files changed, 57 insertions(+), 11 deletions(-)
diff --git a/arch/arm64/include/asm/cpufeature.h b/arch/arm64/include/asm/cpufeature.h
index edc7733aa49846b2..edefe3b36fe5c243 100644
--- a/arch/arm64/include/asm/cpufeature.h
+++ b/arch/arm64/include/asm/cpufeature.h
@@ -933,6 +933,62 @@ extern struct arm64_ftr_override arm64_sw_feature_override;
u32 get_kvm_ipa_limit(void);
void dump_cpu_features(void);
+static inline bool cpu_has_bti(void)
+{
+ u64 pfr1;
+
+ if (!IS_ENABLED(CONFIG_ARM64_BTI))
+ return false;
+
+ pfr1 = read_cpuid(ID_AA64PFR1_EL1);
+ pfr1 &= ~id_aa64pfr1_override.mask;
+ pfr1 |= id_aa64pfr1_override.val;
+
+ return cpuid_feature_extract_unsigned_field(pfr1,
+ ID_AA64PFR1_EL1_BT_SHIFT);
+}
+
+static inline bool cpu_has_e0pd(void)
+{
+ u64 mmfr2;
+
+ if (!IS_ENABLED(CONFIG_ARM64_E0PD))
+ return false;
+
+ mmfr2 = read_sysreg_s(SYS_ID_AA64MMFR2_EL1);
+ return cpuid_feature_extract_unsigned_field(mmfr2,
+ ID_AA64MMFR2_EL1_E0PD_SHIFT);
+}
+
+static inline bool cpu_has_pac(void)
+{
+ u64 isar1, isar2;
+ u8 feat;
+
+ if (!IS_ENABLED(CONFIG_ARM64_PTR_AUTH))
+ return false;
+
+ isar1 = read_cpuid(ID_AA64ISAR1_EL1);
+ isar1 &= ~id_aa64isar1_override.mask;
+ isar1 |= id_aa64isar1_override.val;
+ feat = cpuid_feature_extract_unsigned_field(isar1,
+ ID_AA64ISAR1_EL1_APA_SHIFT);
+ if (feat)
+ return true;
+
+ feat = cpuid_feature_extract_unsigned_field(isar1,
+ ID_AA64ISAR1_EL1_API_SHIFT);
+ if (feat)
+ return true;
+
+ isar2 = read_sysreg_s(SYS_ID_AA64ISAR2_EL1);
+ isar2 &= ~id_aa64isar2_override.mask;
+ isar2 |= id_aa64isar2_override.val;
+ feat = cpuid_feature_extract_unsigned_field(isar2,
+ ID_AA64ISAR2_EL1_APA3_SHIFT);
+ return feat;
+}
+
#endif /* __ASSEMBLY__ */
#endif
diff --git a/arch/arm64/kernel/cpufeature.c b/arch/arm64/kernel/cpufeature.c
index 0b16e676b68c6543..9838934fee028bcb 100644
--- a/arch/arm64/kernel/cpufeature.c
+++ b/arch/arm64/kernel/cpufeature.c
@@ -1609,21 +1609,11 @@ has_useable_cnp(const struct arm64_cpu_capabilities *entry, int scope)
*/
bool kaslr_requires_kpti(void)
{
- if (!IS_ENABLED(CONFIG_RANDOMIZE_BASE))
- return false;
-
/*
* E0PD does a similar job to KPTI so can be used instead
* where available.
*/
- if (IS_ENABLED(CONFIG_ARM64_E0PD)) {
- u64 mmfr2 = read_sysreg_s(SYS_ID_AA64MMFR2_EL1);
- if (cpuid_feature_extract_unsigned_field(mmfr2,
- ID_AA64MMFR2_EL1_E0PD_SHIFT))
- return false;
- }
-
- return kaslr_enabled();
+ return kaslr_enabled() && !cpu_has_e0pd();
}
static bool __meltdown_safe = true;
--
2.39.2
In preparation for switching to an early kernel mapping routine that
maps each segment according to its precise boundaries, and with the
correct attributes, let's allocate some extra pages for page tables for
the 4k page size configuration. This is necessary because the start and
end of each segment may not be aligned to the block size, and so we'll
need an extra page table at each segment boundary.
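For example, with 4k pages (and assuming the current SEGMENT_ALIGN of
64 KiB), SWAPPER_BLOCK_SIZE is 2 MiB and so exceeds SEGMENT_ALIGN,
making EARLY_SEGMENT_EXTRA_PAGES evaluate to KERNEL_SEGMENT_COUNT + 1,
i.e. 6 extra pages added to INIT_DIR_SIZE. With 16k or 64k pages,
SWAPPER_BLOCK_SIZE equals PAGE_SIZE, the comparison is false, and no
extra pages are reserved.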
Signed-off-by: Ard Biesheuvel <[email protected]>
---
arch/arm64/include/asm/kernel-pgtable.h | 11 ++++++++++-
1 file changed, 10 insertions(+), 1 deletion(-)
diff --git a/arch/arm64/include/asm/kernel-pgtable.h b/arch/arm64/include/asm/kernel-pgtable.h
index 4d13c73171e1e360..50b5c145358a5d8e 100644
--- a/arch/arm64/include/asm/kernel-pgtable.h
+++ b/arch/arm64/include/asm/kernel-pgtable.h
@@ -80,7 +80,7 @@
+ EARLY_PGDS((vstart), (vend), add) /* each PGDIR needs a next level page table */ \
+ EARLY_PUDS((vstart), (vend), add) /* each PUD needs a next level page table */ \
+ EARLY_PMDS((vstart), (vend), add)) /* each PMD needs a next level page table */
-#define INIT_DIR_SIZE (PAGE_SIZE * EARLY_PAGES(KIMAGE_VADDR, _end, EARLY_KASLR))
+#define INIT_DIR_SIZE (PAGE_SIZE * (EARLY_PAGES(KIMAGE_VADDR, _end, EARLY_KASLR) + EARLY_SEGMENT_EXTRA_PAGES))
/* the initial ID map may need two extra pages if it needs to be extended */
#if VA_BITS < 48
@@ -101,6 +101,15 @@
#define SWAPPER_TABLE_SHIFT PMD_SHIFT
#endif
+/* The number of segments in the kernel image (text, rodata, inittext, initdata, data+bss) */
+#define KERNEL_SEGMENT_COUNT 5
+
+#if SWAPPER_BLOCK_SIZE > SEGMENT_ALIGN
+#define EARLY_SEGMENT_EXTRA_PAGES (KERNEL_SEGMENT_COUNT + 1)
+#else
+#define EARLY_SEGMENT_EXTRA_PAGES 0
+#endif
+
/*
* Initial memory map attributes.
*/
--
2.39.2
Now that we can set BSS variables from the early code running from the
ID map, we can set memstart_offset_seed directly from the C code that
derives the value instead of passing it back and forth between C and asm
code.
Signed-off-by: Ard Biesheuvel <[email protected]>
---
arch/arm64/kernel/head.S | 7 -------
arch/arm64/kernel/image-vars.h | 1 +
arch/arm64/kernel/pi/kaslr_early.c | 4 ++++
3 files changed, 5 insertions(+), 7 deletions(-)
diff --git a/arch/arm64/kernel/head.S b/arch/arm64/kernel/head.S
index 70ad180eed364906..81c2dd06420992ea 100644
--- a/arch/arm64/kernel/head.S
+++ b/arch/arm64/kernel/head.S
@@ -82,7 +82,6 @@
* x21 primary_entry() .. start_kernel() FDT pointer passed at boot in x0
* x22 create_idmap() .. start_kernel() ID map VA of the DT blob
* x23 __primary_switch() physical misalignment/KASLR offset
- * x24 __primary_switch() linear map KASLR seed
* x25 primary_entry() .. start_kernel() supported VA size
* x28 create_idmap() callee preserved temp register
*/
@@ -483,11 +482,6 @@ SYM_FUNC_START_LOCAL(__primary_switched)
str x25, [x8] // ... observes the correct value
dc civac, x8 // Make visible to booting secondaries
#endif
-
-#ifdef CONFIG_RANDOMIZE_BASE
- adrp x5, memstart_offset_seed // Save KASLR linear map seed
- strh w24, [x5, :lo12:memstart_offset_seed]
-#endif
#if defined(CONFIG_KASAN_GENERIC) || defined(CONFIG_KASAN_SW_TAGS)
bl kasan_early_init
#endif
@@ -777,7 +771,6 @@ SYM_FUNC_START_LOCAL(__primary_switch)
#ifdef CONFIG_RANDOMIZE_BASE
mov x0, x22
bl __pi_kaslr_early_init
- and x24, x0, #SZ_2M - 1 // capture memstart offset seed
bic x0, x0, #SZ_2M - 1
orr x23, x23, x0 // record kernel offset
#endif
diff --git a/arch/arm64/kernel/image-vars.h b/arch/arm64/kernel/image-vars.h
index 5aa914ea919a1149..b7fa7fbf8fa543a6 100644
--- a/arch/arm64/kernel/image-vars.h
+++ b/arch/arm64/kernel/image-vars.h
@@ -44,6 +44,7 @@ PROVIDE(__pi_id_aa64smfr0_override = id_aa64smfr0_override);
PROVIDE(__pi_id_aa64zfr0_override = id_aa64zfr0_override);
PROVIDE(__pi_arm64_sw_feature_override = arm64_sw_feature_override);
PROVIDE(__pi__ctype = _ctype);
+PROVIDE(__pi_memstart_offset_seed = memstart_offset_seed);
#ifdef CONFIG_KVM
diff --git a/arch/arm64/kernel/pi/kaslr_early.c b/arch/arm64/kernel/pi/kaslr_early.c
index f2305e276ec36803..eeecee7ffd6fa125 100644
--- a/arch/arm64/kernel/pi/kaslr_early.c
+++ b/arch/arm64/kernel/pi/kaslr_early.c
@@ -16,6 +16,8 @@
#include <asm/memory.h>
#include <asm/pgtable.h>
+extern u16 memstart_offset_seed;
+
static u64 __init get_kaslr_seed(void *fdt)
{
static char const chosen_str[] __initconst = "chosen";
@@ -51,6 +53,8 @@ asmlinkage u64 __init kaslr_early_init(void *fdt)
return 0;
}
+ memstart_offset_seed = seed & U16_MAX;
+
/*
* OK, so we are proceeding with KASLR enabled. Calculate a suitable
* kernel image offset from the seed. Let's place the kernel in the
--
2.39.2
We will want to parse the ID register overrides even earlier, so that we
can take them into account before creating the kernel mapping. So
migrate the code and make it work in the context of the early C runtime.
We will move the invocation to an earlier stage in a subsequent patch.
Signed-off-by: Ard Biesheuvel <[email protected]>
---
arch/arm64/kernel/Makefile | 4 +--
arch/arm64/kernel/head.S | 5 ++-
arch/arm64/kernel/image-vars.h | 9 +++++
arch/arm64/kernel/pi/Makefile | 5 +--
arch/arm64/kernel/{ => pi}/idreg-override.c | 38 ++++++++------------
5 files changed, 30 insertions(+), 31 deletions(-)
diff --git a/arch/arm64/kernel/Makefile b/arch/arm64/kernel/Makefile
index e27168d6ed2050b9..4f1fcaebafcfe077 100644
--- a/arch/arm64/kernel/Makefile
+++ b/arch/arm64/kernel/Makefile
@@ -33,8 +33,7 @@ obj-y := debug-monitors.o entry.o irq.o fpsimd.o \
return_address.o cpuinfo.o cpu_errata.o \
cpufeature.o alternative.o cacheinfo.o \
smp.o smp_spin_table.o topology.o smccc-call.o \
- syscall.o proton-pack.o idreg-override.o idle.o \
- patching.o
+ syscall.o proton-pack.o idle.o patching.o pi/
obj-$(CONFIG_COMPAT) += sys32.o signal32.o \
sys_compat.o
@@ -58,7 +57,6 @@ obj-$(CONFIG_ACPI) += acpi.o
obj-$(CONFIG_ACPI_NUMA) += acpi_numa.o
obj-$(CONFIG_ARM64_ACPI_PARKING_PROTOCOL) += acpi_parking_protocol.o
obj-$(CONFIG_PARAVIRT) += paravirt.o
-obj-$(CONFIG_RELOCATABLE) += pi/
obj-$(CONFIG_RANDOMIZE_BASE) += kaslr.o
obj-$(CONFIG_HIBERNATION) += hibernate.o hibernate-asm.o
obj-$(CONFIG_ELF_CORE) += elfcore.o
diff --git a/arch/arm64/kernel/head.S b/arch/arm64/kernel/head.S
index 5047a2952ec273f9..0fa44b3188c1e204 100644
--- a/arch/arm64/kernel/head.S
+++ b/arch/arm64/kernel/head.S
@@ -510,10 +510,9 @@ SYM_FUNC_START_LOCAL(__primary_switched)
#if defined(CONFIG_KASAN_GENERIC) || defined(CONFIG_KASAN_SW_TAGS)
bl kasan_early_init
#endif
- mov x0, x21 // pass FDT address in x0
- bl early_fdt_map // Try mapping the FDT early
mov x0, x20 // pass the full boot status
- bl init_feature_override // Parse cpu feature overrides
+ mov x1, x22 // pass the low FDT mapping
+ bl __pi_init_feature_override // Parse cpu feature overrides
#ifdef CONFIG_UNWIND_PATCH_PAC_INTO_SCS
bl scs_patch_vmlinux
#endif
diff --git a/arch/arm64/kernel/image-vars.h b/arch/arm64/kernel/image-vars.h
index b5906f8e18d7eb8d..5aa914ea919a1149 100644
--- a/arch/arm64/kernel/image-vars.h
+++ b/arch/arm64/kernel/image-vars.h
@@ -35,6 +35,15 @@ PROVIDE(__pi___memmove = __pi_memmove);
PROVIDE(__pi___memset = __pi_memset);
PROVIDE(__pi_vabits_actual = vabits_actual);
+PROVIDE(__pi_id_aa64isar1_override = id_aa64isar1_override);
+PROVIDE(__pi_id_aa64isar2_override = id_aa64isar2_override);
+PROVIDE(__pi_id_aa64mmfr1_override = id_aa64mmfr1_override);
+PROVIDE(__pi_id_aa64pfr0_override = id_aa64pfr0_override);
+PROVIDE(__pi_id_aa64pfr1_override = id_aa64pfr1_override);
+PROVIDE(__pi_id_aa64smfr0_override = id_aa64smfr0_override);
+PROVIDE(__pi_id_aa64zfr0_override = id_aa64zfr0_override);
+PROVIDE(__pi_arm64_sw_feature_override = arm64_sw_feature_override);
+PROVIDE(__pi__ctype = _ctype);
#ifdef CONFIG_KVM
diff --git a/arch/arm64/kernel/pi/Makefile b/arch/arm64/kernel/pi/Makefile
index d084c1dcf4165420..7f6dfce893c3b88f 100644
--- a/arch/arm64/kernel/pi/Makefile
+++ b/arch/arm64/kernel/pi/Makefile
@@ -38,6 +38,7 @@ $(obj)/lib-%.pi.o: OBJCOPYFLAGS += --prefix-alloc-sections=.init
$(obj)/lib-%.o: $(srctree)/lib/%.c FORCE
$(call if_changed_rule,cc_o_c)
-obj-y := relocate.pi.o
-obj-$(CONFIG_RANDOMIZE_BASE) += kaslr_early.pi.o lib-fdt.pi.o lib-fdt_ro.pi.o
+obj-y := idreg-override.pi.o lib-fdt.pi.o lib-fdt_ro.pi.o
+obj-$(CONFIG_RELOCATABLE) += relocate.pi.o
+obj-$(CONFIG_RANDOMIZE_BASE) += kaslr_early.pi.o
extra-y := $(patsubst %.pi.o,%.o,$(obj-y))
diff --git a/arch/arm64/kernel/idreg-override.c b/arch/arm64/kernel/pi/idreg-override.c
similarity index 93%
rename from arch/arm64/kernel/idreg-override.c
rename to arch/arm64/kernel/pi/idreg-override.c
index 758f0e86e2bd2a34..4e76db6eb72c2087 100644
--- a/arch/arm64/kernel/idreg-override.c
+++ b/arch/arm64/kernel/pi/idreg-override.c
@@ -14,6 +14,8 @@
#include <asm/cpufeature.h>
#include <asm/setup.h>
+#include "pi.h"
+
#define FTR_DESC_NAME_LEN 20
#define FTR_DESC_FIELD_LEN 10
#define FTR_ALIAS_NAME_LEN 30
@@ -21,18 +23,6 @@
static u64 __boot_status __initdata;
-// temporary __prel64 related definitions
-// to be removed when this code is moved under pi/
-
-#define __prel64_initconst __initconst
-
-typedef void *prel64_t;
-
-static void *prel64_to_pointer(const prel64_t *p)
-{
- return *p;
-}
-
struct ftr_set_desc {
char name[FTR_DESC_NAME_LEN];
union {
@@ -309,16 +299,11 @@ static __init void __parse_cmdline(const char *cmdline, bool parse_aliases)
} while (1);
}
-static __init const u8 *get_bootargs_cmdline(void)
+static __init const u8 *get_bootargs_cmdline(const void *fdt)
{
const u8 *prop;
- void *fdt;
int node;
- fdt = get_early_fdt_ptr();
- if (!fdt)
- return NULL;
-
node = fdt_path_offset(fdt, "/chosen");
if (node < 0)
return NULL;
@@ -330,9 +315,9 @@ static __init const u8 *get_bootargs_cmdline(void)
return strlen(prop) ? prop : NULL;
}
-static __init void parse_cmdline(void)
+static __init void parse_cmdline(const void *fdt)
{
- const u8 *prop = get_bootargs_cmdline();
+ const u8 *prop = get_bootargs_cmdline(fdt);
if (IS_ENABLED(CONFIG_CMDLINE_FORCE) || !prop)
__parse_cmdline(CONFIG_CMDLINE, true);
@@ -342,9 +327,9 @@ static __init void parse_cmdline(void)
}
/* Keep checkers quiet */
-void init_feature_override(u64 boot_status);
+void init_feature_override(u64 boot_status, const void *fdt);
-asmlinkage void __init init_feature_override(u64 boot_status)
+asmlinkage void __init init_feature_override(u64 boot_status, const void *fdt)
{
struct arm64_ftr_override *override;
const struct ftr_set_desc *reg;
@@ -360,7 +345,7 @@ asmlinkage void __init init_feature_override(u64 boot_status)
__boot_status = boot_status;
- parse_cmdline();
+ parse_cmdline(fdt);
for (i = 0; i < ARRAY_SIZE(regs); i++) {
reg = prel64_to_pointer(&regs[i].reg_prel);
@@ -369,3 +354,10 @@ asmlinkage void __init init_feature_override(u64 boot_status)
(unsigned long)(override + 1));
}
}
+
+char * __init skip_spaces(const char *str)
+{
+ while (isspace(*str))
+ ++str;
+ return (char *)str;
+}
--
2.39.2
The asm version of the kernel mapping code works fine for creating a
coarse grained identity map, but for mapping the kernel down to its
exact boundaries with the right attributes, it is not suitable. This is
why we create a preliminary RWX kernel mapping first, and then rebuild
it from scratch later on.
So let's reimplement this in C, in a way that will make it unnecessary
to create the kernel page tables yet another time in paging_init().
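In a nutshell: when the image is relocatable or dynamic shadow call
stack patching may be needed, the text segments are first mapped with
ordinary read-write permissions, the relocations and/or SCS patches are
applied through that mapping, and the text is then unmapped and
remapped with its final read-only executable permissions (including
PTE_GP when BTI is in use). Otherwise, everything is mapped with its
final permissions in a single pass.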
Signed-off-by: Ard Biesheuvel <[email protected]>
---
arch/arm64/include/asm/mmu.h | 1 -
arch/arm64/include/asm/scs.h | 32 +---
arch/arm64/kernel/cpufeature.c | 2 +-
arch/arm64/kernel/head.S | 52 +-----
arch/arm64/kernel/image-vars.h | 16 ++
arch/arm64/kernel/pi/Makefile | 1 +
arch/arm64/kernel/pi/idreg-override.c | 22 ++-
arch/arm64/kernel/pi/kaslr_early.c | 12 +-
arch/arm64/kernel/pi/map_kernel.c | 171 ++++++++++++++++++++
arch/arm64/kernel/pi/map_range.c | 87 ++++++++++
arch/arm64/kernel/pi/patch-scs.c | 16 +-
arch/arm64/kernel/pi/pi.h | 12 ++
arch/arm64/kernel/pi/relocate.c | 2 +
arch/arm64/kernel/setup.c | 7 -
arch/arm64/kernel/vmlinux.lds.S | 4 +-
arch/arm64/mm/proc.S | 1 +
16 files changed, 317 insertions(+), 121 deletions(-)
diff --git a/arch/arm64/include/asm/mmu.h b/arch/arm64/include/asm/mmu.h
index 9b9a206e4e9c9d4e..e74dfae8e48214c3 100644
--- a/arch/arm64/include/asm/mmu.h
+++ b/arch/arm64/include/asm/mmu.h
@@ -70,7 +70,6 @@ extern void create_pgd_mapping(struct mm_struct *mm, phys_addr_t phys,
pgprot_t prot, bool page_mappings_only);
extern void *fixmap_remap_fdt(phys_addr_t dt_phys, int *size, pgprot_t prot);
extern void mark_linear_text_alias_ro(void);
-extern bool kaslr_requires_kpti(void);
static inline unsigned long kaslr_offset(void)
{
diff --git a/arch/arm64/include/asm/scs.h b/arch/arm64/include/asm/scs.h
index bcf8ad574807b82c..f246b1515fd59507 100644
--- a/arch/arm64/include/asm/scs.h
+++ b/arch/arm64/include/asm/scs.h
@@ -33,37 +33,11 @@
#include <asm/cpufeature.h>
#ifdef CONFIG_UNWIND_PATCH_PAC_INTO_SCS
-static inline bool should_patch_pac_into_scs(void)
-{
- u64 reg;
-
- /*
- * We only enable the shadow call stack dynamically if we are running
- * on a system that does not implement PAC or BTI. PAC and SCS provide
- * roughly the same level of protection, and BTI relies on the PACIASP
- * instructions serving as landing pads, preventing us from patching
- * those instructions into something else.
- */
- reg = read_sysreg_s(SYS_ID_AA64ISAR1_EL1);
- if (SYS_FIELD_GET(ID_AA64ISAR1_EL1, APA, reg) |
- SYS_FIELD_GET(ID_AA64ISAR1_EL1, API, reg))
- return false;
-
- reg = read_sysreg_s(SYS_ID_AA64ISAR2_EL1);
- if (SYS_FIELD_GET(ID_AA64ISAR2_EL1, APA3, reg))
- return false;
-
- if (IS_ENABLED(CONFIG_ARM64_BTI_KERNEL)) {
- reg = read_sysreg_s(SYS_ID_AA64PFR1_EL1);
- if (reg & (0xf << ID_AA64PFR1_EL1_BT_SHIFT))
- return false;
- }
- return true;
-}
-
static inline void dynamic_scs_init(void)
{
- if (should_patch_pac_into_scs()) {
+ extern bool __pi_dynamic_scs_is_enabled;
+
+ if (__pi_dynamic_scs_is_enabled) {
pr_info("Enabling dynamic shadow call stack\n");
static_branch_enable(&dynamic_scs_enabled);
}
diff --git a/arch/arm64/kernel/cpufeature.c b/arch/arm64/kernel/cpufeature.c
index 9838934fee028bcb..e9788671be044a47 100644
--- a/arch/arm64/kernel/cpufeature.c
+++ b/arch/arm64/kernel/cpufeature.c
@@ -1607,7 +1607,7 @@ has_useable_cnp(const struct arm64_cpu_capabilities *entry, int scope)
* state once the SMP CPUs are up and thus make the switch to non-global
* mappings if required.
*/
-bool kaslr_requires_kpti(void)
+static bool kaslr_requires_kpti(void)
{
/*
* E0PD does a similar job to KPTI so can be used instead
diff --git a/arch/arm64/kernel/head.S b/arch/arm64/kernel/head.S
index 81c2dd06420992ea..e45fd99e8ab4272a 100644
--- a/arch/arm64/kernel/head.S
+++ b/arch/arm64/kernel/head.S
@@ -81,7 +81,6 @@
* x20 primary_entry() .. __primary_switch() CPU boot mode
* x21 primary_entry() .. start_kernel() FDT pointer passed at boot in x0
* x22 create_idmap() .. start_kernel() ID map VA of the DT blob
- * x23 __primary_switch() physical misalignment/KASLR offset
* x25 primary_entry() .. start_kernel() supported VA size
* x28 create_idmap() callee preserved temp register
*/
@@ -408,24 +407,6 @@ SYM_FUNC_START_LOCAL(create_idmap)
0: ret x28
SYM_FUNC_END(create_idmap)
-SYM_FUNC_START_LOCAL(create_kernel_mapping)
- adrp x0, init_pg_dir
- mov_q x5, KIMAGE_VADDR // compile time __va(_text)
-#ifdef CONFIG_RELOCATABLE
- add x5, x5, x23 // add KASLR displacement
-#endif
- adrp x6, _end // runtime __pa(_end)
- adrp x3, _text // runtime __pa(_text)
- sub x6, x6, x3 // _end - _text
- add x6, x6, x5 // runtime __va(_end)
- mov x7, SWAPPER_RW_MMUFLAGS
-
- map_memory x0, x1, x5, x6, x7, x3, (VA_BITS - PGDIR_SHIFT), x10, x11, x12, x13, x14
-
- dsb ishst // sync with page table walker
- ret
-SYM_FUNC_END(create_kernel_mapping)
-
/*
* Initialize CPU registers with task-specific and cpu-specific context.
*
@@ -750,44 +731,13 @@ SYM_FUNC_START_LOCAL(__primary_switch)
adrp x2, init_idmap_pg_dir
bl __enable_mmu
- // Clear BSS
- adrp x0, __bss_start
- mov x1, xzr
- adrp x2, init_pg_end
- sub x2, x2, x0
- bl __pi_memset
- dsb ishst // Make zero page visible to PTW
-
adrp x1, early_init_stack
mov sp, x1
mov x29, xzr
mov x0, x20 // pass the full boot status
mov x1, x22 // pass the low FDT mapping
- bl __pi_init_feature_override // Parse cpu feature overrides
-
-#ifdef CONFIG_RELOCATABLE
- adrp x23, KERNEL_START
- and x23, x23, MIN_KIMG_ALIGN - 1
-#ifdef CONFIG_RANDOMIZE_BASE
- mov x0, x22
- bl __pi_kaslr_early_init
- bic x0, x0, #SZ_2M - 1
- orr x23, x23, x0 // record kernel offset
-#endif
-#endif
- bl create_kernel_mapping
+ bl __pi_early_map_kernel // Map and relocate the kernel
- adrp x1, init_pg_dir
- load_ttbr1 x1, x1, x2
-#ifdef CONFIG_RELOCATABLE
- mov x0, x23
- bl __pi_relocate_kernel
-#endif
-#ifdef CONFIG_UNWIND_PATCH_PAC_INTO_SCS
- ldr x0, =__eh_frame_start
- ldr x1, =__eh_frame_end
- bl __pi_scs_patch_vmlinux
-#endif
ldr x8, =__primary_switched
adrp x0, KERNEL_START // __pa(KERNEL_START)
br x8
diff --git a/arch/arm64/kernel/image-vars.h b/arch/arm64/kernel/image-vars.h
index b7fa7fbf8fa543a6..6eeb2a09c8d2441d 100644
--- a/arch/arm64/kernel/image-vars.h
+++ b/arch/arm64/kernel/image-vars.h
@@ -43,9 +43,25 @@ PROVIDE(__pi_id_aa64pfr1_override = id_aa64pfr1_override);
PROVIDE(__pi_id_aa64smfr0_override = id_aa64smfr0_override);
PROVIDE(__pi_id_aa64zfr0_override = id_aa64zfr0_override);
PROVIDE(__pi_arm64_sw_feature_override = arm64_sw_feature_override);
+PROVIDE(__pi_arm64_use_ng_mappings = arm64_use_ng_mappings);
PROVIDE(__pi__ctype = _ctype);
PROVIDE(__pi_memstart_offset_seed = memstart_offset_seed);
+PROVIDE(__pi_init_pg_dir = init_pg_dir);
+PROVIDE(__pi_init_pg_end = init_pg_end);
+
+PROVIDE(__pi__text = _text);
+PROVIDE(__pi__stext = _stext);
+PROVIDE(__pi__etext = _etext);
+PROVIDE(__pi___start_rodata = __start_rodata);
+PROVIDE(__pi___inittext_begin = __inittext_begin);
+PROVIDE(__pi___inittext_end = __inittext_end);
+PROVIDE(__pi___initdata_begin = __initdata_begin);
+PROVIDE(__pi___initdata_end = __initdata_end);
+PROVIDE(__pi__data = _data);
+PROVIDE(__pi___bss_start = __bss_start);
+PROVIDE(__pi__end = _end);
+
#ifdef CONFIG_KVM
/*
diff --git a/arch/arm64/kernel/pi/Makefile b/arch/arm64/kernel/pi/Makefile
index a8b302245f15326a..8c2f80a46b9387b0 100644
--- a/arch/arm64/kernel/pi/Makefile
+++ b/arch/arm64/kernel/pi/Makefile
@@ -39,6 +39,7 @@ $(obj)/lib-%.o: $(srctree)/lib/%.c FORCE
$(call if_changed_rule,cc_o_c)
obj-y := idreg-override.pi.o \
+ map_kernel.pi.o map_range.pi.o \
lib-fdt.pi.o lib-fdt_ro.pi.o
obj-$(CONFIG_RELOCATABLE) += relocate.pi.o
obj-$(CONFIG_RANDOMIZE_BASE) += kaslr_early.pi.o
diff --git a/arch/arm64/kernel/pi/idreg-override.c b/arch/arm64/kernel/pi/idreg-override.c
index 6c547cccaf6a9e9c..265b35b09dd488f1 100644
--- a/arch/arm64/kernel/pi/idreg-override.c
+++ b/arch/arm64/kernel/pi/idreg-override.c
@@ -301,37 +301,35 @@ static __init void __parse_cmdline(const char *cmdline, bool parse_aliases)
} while (1);
}
-static __init const u8 *get_bootargs_cmdline(const void *fdt)
+static __init const u8 *get_bootargs_cmdline(const void *fdt, int node)
{
+ static char const bootargs[] __initconst = "bootargs";
const u8 *prop;
- int node;
- node = fdt_path_offset(fdt, "/chosen");
if (node < 0)
return NULL;
- prop = fdt_getprop(fdt, node, "bootargs", NULL);
+ prop = fdt_getprop(fdt, node, bootargs, NULL);
if (!prop)
return NULL;
return strlen(prop) ? prop : NULL;
}
-static __init void parse_cmdline(const void *fdt)
+static __init void parse_cmdline(const void *fdt, int chosen)
{
- const u8 *prop = get_bootargs_cmdline(fdt);
+ static char const cmdline[] __initconst = CONFIG_CMDLINE;
+ const u8 *prop = get_bootargs_cmdline(fdt, chosen);
if (IS_ENABLED(CONFIG_CMDLINE_FORCE) || !prop)
- __parse_cmdline(CONFIG_CMDLINE, true);
+ __parse_cmdline(cmdline, true);
if (!IS_ENABLED(CONFIG_CMDLINE_FORCE) && prop)
__parse_cmdline(prop, true);
}
-/* Keep checkers quiet */
-void init_feature_override(u64 boot_status, const void *fdt);
-
-asmlinkage void __init init_feature_override(u64 boot_status, const void *fdt)
+void __init init_feature_override(u64 boot_status, const void *fdt,
+ int chosen)
{
struct arm64_ftr_override *override;
const struct ftr_set_desc *reg;
@@ -347,7 +345,7 @@ asmlinkage void __init init_feature_override(u64 boot_status, const void *fdt)
__boot_status = boot_status;
- parse_cmdline(fdt);
+ parse_cmdline(fdt, chosen);
for (i = 0; i < ARRAY_SIZE(regs); i++) {
reg = prel64_to_pointer(&regs[i].reg_prel);
diff --git a/arch/arm64/kernel/pi/kaslr_early.c b/arch/arm64/kernel/pi/kaslr_early.c
index eeecee7ffd6fa125..0257b43819db85eb 100644
--- a/arch/arm64/kernel/pi/kaslr_early.c
+++ b/arch/arm64/kernel/pi/kaslr_early.c
@@ -16,17 +16,17 @@
#include <asm/memory.h>
#include <asm/pgtable.h>
+#include "pi.h"
+
extern u16 memstart_offset_seed;
-static u64 __init get_kaslr_seed(void *fdt)
+static u64 __init get_kaslr_seed(void *fdt, int node)
{
- static char const chosen_str[] __initconst = "chosen";
static char const seed_str[] __initconst = "kaslr-seed";
- int node, len;
fdt64_t *prop;
u64 ret;
+ int len;
- node = fdt_path_offset(fdt, chosen_str);
if (node < 0)
return 0;
@@ -39,14 +39,14 @@ static u64 __init get_kaslr_seed(void *fdt)
return ret;
}
-asmlinkage u64 __init kaslr_early_init(void *fdt)
+u64 __init kaslr_early_init(void *fdt, int chosen)
{
u64 seed, range;
if (kaslr_disabled_cmdline())
return 0;
- seed = get_kaslr_seed(fdt);
+ seed = get_kaslr_seed(fdt, chosen);
if (!seed) {
if (!__early_cpu_has_rndr() ||
!__arm64_rndr((unsigned long *)&seed))
diff --git a/arch/arm64/kernel/pi/map_kernel.c b/arch/arm64/kernel/pi/map_kernel.c
new file mode 100644
index 0000000000000000..b573c964c7d88d1b
--- /dev/null
+++ b/arch/arm64/kernel/pi/map_kernel.c
@@ -0,0 +1,171 @@
+// SPDX-License-Identifier: GPL-2.0-only
+// Copyright 2023 Google LLC
+// Author: Ard Biesheuvel <[email protected]>
+
+#include <linux/init.h>
+#include <linux/libfdt.h>
+#include <linux/linkage.h>
+#include <linux/types.h>
+#include <linux/sizes.h>
+#include <linux/string.h>
+
+#include <asm/memory.h>
+#include <asm/pgalloc.h>
+#include <asm/pgtable.h>
+#include <asm/tlbflush.h>
+
+#include "pi.h"
+
+extern const u8 __eh_frame_start[], __eh_frame_end[];
+
+extern void idmap_cpu_replace_ttbr1(void *pgdir);
+
+static void map_segment(pgd_t *pg_dir, u64 *pgd, u64 va_offset,
+ void *start, void *end, pgprot_t prot,
+ bool may_use_cont, int root_level)
+{
+ map_range(pgd, ((u64)start + va_offset) & ~PAGE_OFFSET,
+ ((u64)end + va_offset) & ~PAGE_OFFSET, (u64)start,
+ prot, root_level, (pte_t *)pg_dir, may_use_cont, 0);
+}
+
+static void unmap_segment(pgd_t *pg_dir, u64 va_offset, void *start,
+ void *end, int root_level)
+{
+ map_segment(pg_dir, NULL, va_offset, start, end, __pgprot(0),
+ false, root_level);
+}
+
+static void __init map_kernel(u64 kaslr_offset, u64 va_offset, int root_level)
+{
+ bool enable_scs = IS_ENABLED(CONFIG_UNWIND_PATCH_PAC_INTO_SCS);
+ bool twopass = IS_ENABLED(CONFIG_RELOCATABLE);
+ u64 pgdp = (u64)init_pg_dir + PAGE_SIZE;
+ pgprot_t text_prot = PAGE_KERNEL_ROX;
+ pgprot_t data_prot = PAGE_KERNEL;
+ pgprot_t prot;
+
+ /*
+ * External debuggers may need to write directly to the text mapping to
+ * install SW breakpoints. Allow this (only) when explicitly requested
+ * with rodata=off.
+ */
+ if (cpuid_feature_extract_unsigned_field(arm64_sw_feature_override.val,
+ ARM64_SW_FEATURE_OVERRIDE_RODATA_OFF))
+ text_prot = PAGE_KERNEL_EXEC;
+
+ /*
+ * We only enable the shadow call stack dynamically if we are running
+ * on a system that does not implement PAC or BTI. PAC and SCS provide
+ * roughly the same level of protection, and BTI relies on the PACIASP
+ * instructions serving as landing pads, preventing us from patching
+ * those instructions into something else.
+ */
+ if (IS_ENABLED(CONFIG_ARM64_PTR_AUTH_KERNEL) && cpu_has_pac())
+ enable_scs = false;
+
+ if (IS_ENABLED(CONFIG_ARM64_BTI_KERNEL) && cpu_has_bti()) {
+ enable_scs = false;
+
+ /*
+ * If we have a CPU that supports BTI and a kernel built for
+ * BTI then mark the kernel executable text as guarded pages
+ * now so we don't have to rewrite the page tables later.
+ */
+ text_prot = __pgprot_modify(text_prot, PTE_GP, PTE_GP);
+ }
+
+ /* Map all code read-write on the first pass if needed */
+ twopass |= enable_scs;
+ prot = twopass ? data_prot : text_prot;
+
+ map_segment(init_pg_dir, &pgdp, va_offset, _stext, _etext, prot,
+ !twopass, root_level);
+ map_segment(init_pg_dir, &pgdp, va_offset, __start_rodata,
+ __inittext_begin, data_prot, false, root_level);
+ map_segment(init_pg_dir, &pgdp, va_offset, __inittext_begin,
+ __inittext_end, prot, false, root_level);
+ map_segment(init_pg_dir, &pgdp, va_offset, __initdata_begin,
+ __initdata_end, data_prot, false, root_level);
+ map_segment(init_pg_dir, &pgdp, va_offset, _data, _end, data_prot,
+ true, root_level);
+ dsb(ishst);
+
+ idmap_cpu_replace_ttbr1(init_pg_dir);
+
+ if (twopass) {
+ if (IS_ENABLED(CONFIG_RELOCATABLE))
+ relocate_kernel(kaslr_offset);
+
+ if (enable_scs) {
+ scs_patch(__eh_frame_start + va_offset,
+ __eh_frame_end - __eh_frame_start);
+ asm("ic ialluis");
+
+ dynamic_scs_is_enabled = true;
+ }
+
+ /*
+ * Unmap the text region before remapping it, to avoid
+ * potential TLB conflicts when creating the contiguous
+ * descriptors.
+ */
+ unmap_segment(init_pg_dir, va_offset, _stext, _etext,
+ root_level);
+ dsb(ishst);
+ isb();
+ __tlbi(vmalle1);
+ isb();
+
+ /*
+ * Remap these segments with different permissions
+ * No new page table allocations should be needed
+ */
+ map_segment(init_pg_dir, NULL, va_offset, _stext, _etext,
+ text_prot, true, root_level);
+ map_segment(init_pg_dir, NULL, va_offset, __inittext_begin,
+ __inittext_end, text_prot, false, root_level);
+ dsb(ishst);
+ }
+}
+
+asmlinkage void __init early_map_kernel(u64 boot_status, void *fdt)
+{
+ static char const chosen_str[] __initconst = "/chosen";
+ u64 va_base, pa_base = (u64)&_text;
+ u64 kaslr_offset = pa_base % MIN_KIMG_ALIGN;
+ int root_level = 4 - CONFIG_PGTABLE_LEVELS;
+ int chosen;
+
+ /* Clear BSS and the initial page tables */
+ memset(__bss_start, 0, (u64)init_pg_end - (u64)__bss_start);
+
+ /* Parse the command line for CPU feature overrides */
+ chosen = fdt_path_offset(fdt, chosen_str);
+ init_feature_override(boot_status, fdt, chosen);
+
+ /*
+ * The virtual KASLR displacement modulo 2MiB is decided by the
+ * physical placement of the image, as otherwise, we might not be able
+ * to create the early kernel mapping using 2 MiB block descriptors. So
+ * take the low bits of the KASLR offset from the physical address, and
+ * fill in the high bits from the seed.
+ */
+ if (IS_ENABLED(CONFIG_RANDOMIZE_BASE)) {
+ u64 kaslr_seed = kaslr_early_init(fdt, chosen);
+
+ /*
+ * Assume that any CPU that does not implement E0PD needs KPTI
+ * to ensure that KASLR randomized addresses will not leak.
+ * This means we need to use non-global mappings for the kernel
+ * text and data.
+ */
+ if (kaslr_seed && !cpu_has_e0pd())
+ arm64_use_ng_mappings = true;
+
+ kaslr_offset |= kaslr_seed & ~(MIN_KIMG_ALIGN - 1);
+ }
+
+ va_base = KIMAGE_VADDR + kaslr_offset;
+ map_kernel(kaslr_offset, va_base - pa_base, root_level);
+}
diff --git a/arch/arm64/kernel/pi/map_range.c b/arch/arm64/kernel/pi/map_range.c
new file mode 100644
index 0000000000000000..61cbd6e82418c033
--- /dev/null
+++ b/arch/arm64/kernel/pi/map_range.c
@@ -0,0 +1,87 @@
+// SPDX-License-Identifier: GPL-2.0-only
+// Copyright 2023 Google LLC
+// Author: Ard Biesheuvel <[email protected]>
+
+#include <linux/types.h>
+#include <linux/sizes.h>
+
+#include <asm/memory.h>
+#include <asm/pgalloc.h>
+#include <asm/pgtable.h>
+
+#include "pi.h"
+
+/**
+ * map_range - Map a contiguous range of physical pages into virtual memory
+ *
+ * @pte: Address of physical pointer to array of pages to
+ * allocate page tables from
+ * @start: Virtual address of the start of the range
+ * @end: Virtual address of the end of the range (exclusive)
+ * @pa: Physical address of the start of the range
+ * @level: Translation level for the mapping
+ * @tbl: The level @level page table to create the mappings in
+ * @may_use_cont: Whether the use of the contiguous attribute is allowed
+ * @va_offset: Offset between a physical page and its current mapping
+ * in the VA space
+ */
+void __init map_range(u64 *pte, u64 start, u64 end, u64 pa, pgprot_t prot,
+ int level, pte_t *tbl, bool may_use_cont, u64 va_offset)
+{
+ u64 cmask = (level == 3) ? CONT_PTE_SIZE - 1 : U64_MAX;
+ u64 protval = pgprot_val(prot) & ~PTE_TYPE_MASK;
+ int lshift = (3 - level) * (PAGE_SHIFT - 3);
+ u64 lmask = (PAGE_SIZE << lshift) - 1;
+
+ start &= PAGE_MASK;
+ pa &= PAGE_MASK;
+
+ /* Advance tbl to the entry that covers start */
+ tbl += (start >> (lshift + PAGE_SHIFT)) % PTRS_PER_PTE;
+
+ /*
+ * Set the right block/page bits for this level unless we are
+ * clearing the mapping
+ */
+ if (protval)
+ protval |= (level < 3) ? PMD_TYPE_SECT : PTE_TYPE_PAGE;
+
+ while (start < end) {
+ u64 next = min((start | lmask) + 1, PAGE_ALIGN(end));
+
+ if (level < 3 && (start | next | pa) & lmask) {
+ /*
+ * This chunk needs a finer grained mapping. Create a
+ * table mapping if necessary and recurse.
+ */
+ if (pte_none(*tbl)) {
+ *tbl = __pte(__phys_to_pte_val(*pte) |
+ PMD_TYPE_TABLE | PMD_TABLE_UXN);
+ *pte += PTRS_PER_PTE * sizeof(pte_t);
+ }
+ map_range(pte, start, next, pa, prot, level + 1,
+ (pte_t *)(__pte_to_phys(*tbl) + va_offset),
+ may_use_cont, va_offset);
+ } else {
+ /*
+ * Start a contiguous range if start and pa are
+ * suitably aligned
+ */
+ if (((start | pa) & cmask) == 0 && may_use_cont)
+ protval |= PTE_CONT;
+
+ /*
+ * Clear the contiguous attribute if the remaining
+ * range does not cover a contiguous block
+ */
+ if ((end & ~cmask) <= start)
+ protval &= ~PTE_CONT;
+
+ /* Put down a block or page mapping */
+ *tbl = __pte(__phys_to_pte_val(pa) | protval);
+ }
+ pa += next - start;
+ start = next;
+ tbl++;
+ }
+}
diff --git a/arch/arm64/kernel/pi/patch-scs.c b/arch/arm64/kernel/pi/patch-scs.c
index c65ef40d1e6b6b30..49d8b40e61bc050d 100644
--- a/arch/arm64/kernel/pi/patch-scs.c
+++ b/arch/arm64/kernel/pi/patch-scs.c
@@ -11,6 +11,10 @@
#include <asm/scs.h>
+#include "pi.h"
+
+bool dynamic_scs_is_enabled;
+
//
// This minimal DWARF CFI parser is partially based on the code in
// arch/arc/kernel/unwind.c, and on the document below:
@@ -46,8 +50,6 @@
#define DW_CFA_GNU_negative_offset_extended 0x2f
#define DW_CFA_hi_user 0x3f
-extern const u8 __eh_frame_start[], __eh_frame_end[];
-
enum {
PACIASP = 0xd503233f,
AUTIASP = 0xd50323bf,
@@ -250,13 +252,3 @@ int scs_patch(const u8 eh_frame[], int size)
}
return 0;
}
-
-asmlinkage void __init scs_patch_vmlinux(const u8 start[], const u8 end[])
-{
- if (!should_patch_pac_into_scs())
- return;
-
- scs_patch(start, end - start);
- asm("ic ialluis");
- isb();
-}
diff --git a/arch/arm64/kernel/pi/pi.h b/arch/arm64/kernel/pi/pi.h
index f455ad385976a664..dfc70828ad0ad2a8 100644
--- a/arch/arm64/kernel/pi/pi.h
+++ b/arch/arm64/kernel/pi/pi.h
@@ -2,6 +2,8 @@
// Copyright 2023 Google LLC
// Author: Ard Biesheuvel <[email protected]>
+#include <linux/types.h>
+
#define __prel64_initconst __section(".init.rodata.prel64")
typedef volatile signed long prel64_t;
@@ -12,3 +14,13 @@ static inline void *prel64_to_pointer(const prel64_t *offset)
return NULL;
return (void *)offset + *offset;
}
+
+extern bool dynamic_scs_is_enabled;
+
+void init_feature_override(u64 boot_status, const void *fdt, int chosen);
+u64 kaslr_early_init(void *fdt, int chosen);
+void relocate_kernel(u64 offset);
+int scs_patch(const u8 eh_frame[], int size);
+
+void map_range(u64 *pgd, u64 start, u64 end, u64 pa, pgprot_t prot,
+ int level, pte_t *tbl, bool may_use_cont, u64 va_offset);
diff --git a/arch/arm64/kernel/pi/relocate.c b/arch/arm64/kernel/pi/relocate.c
index 1853408ea76b0e5d..2407d269639870fc 100644
--- a/arch/arm64/kernel/pi/relocate.c
+++ b/arch/arm64/kernel/pi/relocate.c
@@ -7,6 +7,8 @@
#include <linux/init.h>
#include <linux/types.h>
+#include "pi.h"
+
extern const Elf64_Rela rela_start[], rela_end[];
extern const u64 relr_start[], relr_end[];
diff --git a/arch/arm64/kernel/setup.c b/arch/arm64/kernel/setup.c
index bda21a9245943c57..a4578c823ab57b6f 100644
--- a/arch/arm64/kernel/setup.c
+++ b/arch/arm64/kernel/setup.c
@@ -281,13 +281,6 @@ void __init __no_sanitize_address setup_arch(char **cmdline_p)
*cmdline_p = boot_command_line;
- /*
- * If know now we are going to need KPTI then use non-global
- * mappings from the start, avoiding the cost of rewriting
- * everything later.
- */
- arm64_use_ng_mappings = kaslr_requires_kpti();
-
early_fixmap_init();
early_ioremap_init();
diff --git a/arch/arm64/kernel/vmlinux.lds.S b/arch/arm64/kernel/vmlinux.lds.S
index 6c79ad2945749260..e22bf17209c80ff9 100644
--- a/arch/arm64/kernel/vmlinux.lds.S
+++ b/arch/arm64/kernel/vmlinux.lds.S
@@ -126,9 +126,9 @@ jiffies = jiffies_64;
#ifdef CONFIG_UNWIND_TABLES
#define UNWIND_DATA_SECTIONS \
.eh_frame : { \
- __eh_frame_start = .; \
+ __pi___eh_frame_start = .; \
*(.eh_frame) \
- __eh_frame_end = .; \
+ __pi___eh_frame_end = .; \
}
#else
#define UNWIND_DATA_SECTIONS
diff --git a/arch/arm64/mm/proc.S b/arch/arm64/mm/proc.S
index 91410f48809000a0..82e88f4521737c0e 100644
--- a/arch/arm64/mm/proc.S
+++ b/arch/arm64/mm/proc.S
@@ -195,6 +195,7 @@ SYM_TYPED_FUNC_START(idmap_cpu_replace_ttbr1)
ret
SYM_FUNC_END(idmap_cpu_replace_ttbr1)
+SYM_FUNC_ALIAS(__pi_idmap_cpu_replace_ttbr1, idmap_cpu_replace_ttbr1)
.popsection
#ifdef CONFIG_UNMAP_KERNEL_AT_EL0
--
2.39.2
We will move the CPU feature overrides into BSS in a subsequent patch,
and this requires that BSS is zeroed before the feature override
detection code runs. So let's map BSS read-write in the ID map, and zero
it via this mapping.
Since the kernel page tables are right next to it, and also zeroed via
the ID map, let's drop the separate clear_page_tables() function, and
just zero everything in one go.
Signed-off-by: Ard Biesheuvel <[email protected]>
---
arch/arm64/kernel/head.S | 33 +++++++-------------
1 file changed, 11 insertions(+), 22 deletions(-)
diff --git a/arch/arm64/kernel/head.S b/arch/arm64/kernel/head.S
index 0fa44b3188c1e204..ade0cb99c8a83a3d 100644
--- a/arch/arm64/kernel/head.S
+++ b/arch/arm64/kernel/head.S
@@ -177,17 +177,6 @@ SYM_CODE_START_LOCAL(preserve_boot_args)
ret
SYM_CODE_END(preserve_boot_args)
-SYM_FUNC_START_LOCAL(clear_page_tables)
- /*
- * Clear the init page tables.
- */
- adrp x0, init_pg_dir
- adrp x1, init_pg_end
- sub x2, x1, x0
- mov x1, xzr
- b __pi_memset // tail call
-SYM_FUNC_END(clear_page_tables)
-
/*
* Macro to populate page table entries, these entries can be pointers to the next level
* or last level entries pointing to physical memory.
@@ -386,9 +375,9 @@ SYM_FUNC_START_LOCAL(create_idmap)
map_memory x0, x1, x3, x6, x7, x3, IDMAP_PGD_ORDER, x10, x11, x12, x13, x14, EXTRA_SHIFT
- /* Remap the kernel page tables r/w in the ID map */
+ /* Remap BSS and the kernel page tables r/w in the ID map */
adrp x1, _text
- adrp x2, init_pg_dir
+ adrp x2, __bss_start
adrp x3, _end
bic x4, x2, #SWAPPER_BLOCK_SIZE - 1
mov x5, SWAPPER_RW_MMUFLAGS
@@ -489,14 +478,6 @@ SYM_FUNC_START_LOCAL(__primary_switched)
mov x0, x20
bl set_cpu_boot_mode_flag
- // Clear BSS
- adr_l x0, __bss_start
- mov x1, xzr
- adr_l x2, __bss_stop
- sub x2, x2, x0
- bl __pi_memset
- dsb ishst // Make zero page visible to PTW
-
#if VA_BITS > 48
adr_l x8, vabits_actual // Set this early so KASAN early init
str x25, [x8] // ... observes the correct value
@@ -780,6 +761,15 @@ SYM_FUNC_START_LOCAL(__primary_switch)
adrp x1, reserved_pg_dir
adrp x2, init_idmap_pg_dir
bl __enable_mmu
+
+ // Clear BSS
+ adrp x0, __bss_start
+ mov x1, xzr
+ adrp x2, init_pg_end
+ sub x2, x2, x0
+ bl __pi_memset
+ dsb ishst // Make zero page visible to PTW
+
#ifdef CONFIG_RELOCATABLE
adrp x23, KERNEL_START
and x23, x23, MIN_KIMG_ALIGN - 1
@@ -794,7 +784,6 @@ SYM_FUNC_START_LOCAL(__primary_switch)
orr x23, x23, x0 // record kernel offset
#endif
#endif
- bl clear_page_tables
bl create_kernel_mapping
adrp x1, init_pg_dir
--
2.39.2
The early kaslr code open codes the detection of 'nokaslr' on the kernel
command line, and this is no longer necessary now that the feature
detection code, which also looks for the same string, executes before
this code.
Signed-off-by: Ard Biesheuvel <[email protected]>
---
arch/arm64/kernel/pi/kaslr_early.c | 53 +-------------------
1 file changed, 1 insertion(+), 52 deletions(-)
diff --git a/arch/arm64/kernel/pi/kaslr_early.c b/arch/arm64/kernel/pi/kaslr_early.c
index 167081b30a152d0a..f2305e276ec36803 100644
--- a/arch/arm64/kernel/pi/kaslr_early.c
+++ b/arch/arm64/kernel/pi/kaslr_early.c
@@ -16,57 +16,6 @@
#include <asm/memory.h>
#include <asm/pgtable.h>
-/* taken from lib/string.c */
-static char *__init __strstr(const char *s1, const char *s2)
-{
- size_t l1, l2;
-
- l2 = strlen(s2);
- if (!l2)
- return (char *)s1;
- l1 = strlen(s1);
- while (l1 >= l2) {
- l1--;
- if (!memcmp(s1, s2, l2))
- return (char *)s1;
- s1++;
- }
- return NULL;
-}
-static bool __init cmdline_contains_nokaslr(const u8 *cmdline)
-{
- const u8 *str;
-
- str = __strstr(cmdline, "nokaslr");
- return str == cmdline || (str > cmdline && *(str - 1) == ' ');
-}
-
-static bool __init is_kaslr_disabled_cmdline(void *fdt)
-{
- if (!IS_ENABLED(CONFIG_CMDLINE_FORCE)) {
- int node;
- const u8 *prop;
-
- node = fdt_path_offset(fdt, "/chosen");
- if (node < 0)
- goto out;
-
- prop = fdt_getprop(fdt, node, "bootargs", NULL);
- if (!prop)
- goto out;
-
- if (cmdline_contains_nokaslr(prop))
- return true;
-
- if (IS_ENABLED(CONFIG_CMDLINE_EXTEND))
- goto out;
-
- return false;
- }
-out:
- return cmdline_contains_nokaslr(CONFIG_CMDLINE);
-}
-
static u64 __init get_kaslr_seed(void *fdt)
{
static char const chosen_str[] __initconst = "chosen";
@@ -92,7 +41,7 @@ asmlinkage u64 __init kaslr_early_init(void *fdt)
{
u64 seed, range;
- if (is_kaslr_disabled_cmdline(fdt))
+ if (kaslr_disabled_cmdline())
return 0;
seed = get_kaslr_seed(fdt);
--
2.39.2
Early in the boot, when .rodata is still writable, we can poke
swapper_pg_dir entries directly, and there is no need to go through the
fixmap. After a future patch, we will enter the kernel with
swapper_pg_dir already active, and early swapper_pg_dir updates for
creating the fixmap page table hierarchy itself cannot go through the
fixmap for obvious reaons. So let's keep track of whether rodata is
writable, and update the descriptor directly in that case.
As the same reasoning applies to early KASAN init, make the function
noinstr as well.
Signed-off-by: Ard Biesheuvel <[email protected]>
---
arch/arm64/mm/mmu.c | 16 +++++++++++++++-
1 file changed, 15 insertions(+), 1 deletion(-)
diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index 13f46c911558f21f..3f631f3bc2f80b2b 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -53,6 +53,8 @@ EXPORT_SYMBOL(kimage_voffset);
u32 __boot_cpu_mode[] = { BOOT_CPU_MODE_EL2, BOOT_CPU_MODE_EL1 };
+static bool rodata_is_rw __ro_after_init = true;
+
/*
* The booting CPU updates the failed status @__early_cpu_boot_status,
* with MMU turned off.
@@ -73,10 +75,21 @@ static pud_t bm_pud[PTRS_PER_PUD] __page_aligned_bss __maybe_unused;
static DEFINE_SPINLOCK(swapper_pgdir_lock);
static DEFINE_MUTEX(fixmap_lock);
-void set_swapper_pgd(pgd_t *pgdp, pgd_t pgd)
+void noinstr set_swapper_pgd(pgd_t *pgdp, pgd_t pgd)
{
pgd_t *fixmap_pgdp;
+ /*
+ * Don't bother with the fixmap if swapper_pg_dir is still mapped
+ * writable in the kernel mapping.
+ */
+ if (rodata_is_rw) {
+ WRITE_ONCE(*pgdp, pgd);
+ dsb(ishst);
+ isb();
+ return;
+ }
+
spin_lock(&swapper_pgdir_lock);
fixmap_pgdp = pgd_set_fixmap(__pa_symbol(pgdp));
WRITE_ONCE(*fixmap_pgdp, pgd);
@@ -614,6 +627,7 @@ void mark_rodata_ro(void)
* to cover NOTES and EXCEPTION_TABLE.
*/
section_size = (unsigned long)__init_begin - (unsigned long)__start_rodata;
+ WRITE_ONCE(rodata_is_rw, false);
update_mapping_prot(__pa_symbol(__start_rodata), (unsigned long)__start_rodata,
section_size, PAGE_KERNEL_RO);
--
2.39.2
The asm code that creates the initial ID map is rather intricate and
hard to follow. This is problematic because it makes adding support for
things like LPA2 or WXN more difficult than necessary. Also, it is
parameterized like the rest of the MM code to run with a configurable
number of levels, which is rather pointless: all AArch64 CPUs implement
support for 48-bit virtual addressing, and the only smaller VA size in
wide use is 39 bits, where many systems have DRAM located outside of the
addressable range and we need additional tricks to make things work in
that combination.
So let's bite the bullet, rip out all the fiddly asm macros and code,
and replace them with a C implementation based on the newly added
routines for creating the early kernel VA mappings. And while at it,
create the initial ID map based on 48-bit virtual addressing as well,
regardless of the number of configured levels for the kernel proper.
Note that this code may execute with the MMU and caches disabled, and is
therefore not permitted to make unaligned accesses. This shouldn't
generally happen in any case for the algorithm as implemented, but to be
sure, let's pass -mstrict-align to the compiler just in case.
Signed-off-by: Ard Biesheuvel <[email protected]>
---
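A note on the new -mstrict-align flag (illustrative, not part of the
patch): without it, the compiler is allowed to merge adjacent narrow
accesses into a single wider one that may be misaligned, which is fatal
with the MMU off, where alignment checking is in effect. A hypothetical
example of the kind of code generation this guards against:

    struct desc_pair {
            u32 lo;
            u32 hi;
    };

    static void clear_pair(struct desc_pair *p)
    {
            /*
             * Without -mstrict-align, these two 32-bit stores may be fused
             * into one 64-bit store; if 'p' is only 4-byte aligned, that
             * store is unaligned and faults when executed with the MMU off.
             */
            p->lo = 0;
            p->hi = 0;
    }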
arch/arm64/include/asm/assembler.h | 14 -
arch/arm64/include/asm/kernel-pgtable.h | 50 ++--
arch/arm64/include/asm/mmu_context.h | 6 +-
arch/arm64/kernel/head.S | 267 ++------------------
arch/arm64/kernel/image-vars.h | 1 +
arch/arm64/kernel/pi/Makefile | 3 +
arch/arm64/kernel/pi/map_kernel.c | 18 ++
arch/arm64/kernel/pi/map_range.c | 12 +
arch/arm64/kernel/pi/pi.h | 4 +
arch/arm64/mm/mmu.c | 5 -
arch/arm64/mm/proc.S | 2 +-
11 files changed, 87 insertions(+), 295 deletions(-)
diff --git a/arch/arm64/include/asm/assembler.h b/arch/arm64/include/asm/assembler.h
index 376a980f2bad08bb..beb53bbd8c19bb1c 100644
--- a/arch/arm64/include/asm/assembler.h
+++ b/arch/arm64/include/asm/assembler.h
@@ -345,20 +345,6 @@ alternative_cb_end
bfi \valreg, \t1sz, #TCR_T1SZ_OFFSET, #TCR_TxSZ_WIDTH
.endm
-/*
- * idmap_get_t0sz - get the T0SZ value needed to cover the ID map
- *
- * Calculate the maximum allowed value for TCR_EL1.T0SZ so that the
- * entire ID map region can be mapped. As T0SZ == (64 - #bits used),
- * this number conveniently equals the number of leading zeroes in
- * the physical address of _end.
- */
- .macro idmap_get_t0sz, reg
- adrp \reg, _end
- orr \reg, \reg, #(1 << VA_BITS_MIN) - 1
- clz \reg, \reg
- .endm
-
/*
* tcr_compute_pa_size - set TCR.(I)PS to the highest supported
* ID_AA64MMFR0_EL1.PARange value
diff --git a/arch/arm64/include/asm/kernel-pgtable.h b/arch/arm64/include/asm/kernel-pgtable.h
index bfc35944a5304e8e..f4d5858d2f66821d 100644
--- a/arch/arm64/include/asm/kernel-pgtable.h
+++ b/arch/arm64/include/asm/kernel-pgtable.h
@@ -29,6 +29,7 @@
#define SWAPPER_TABLE_SHIFT (SWAPPER_BLOCK_SHIFT + PAGE_SHIFT - 3)
#define SWAPPER_PGTABLE_LEVELS (CONFIG_PGTABLE_LEVELS - SWAPPER_SKIP_LEVEL)
+#define INIT_IDMAP_PGTABLE_LEVELS (IDMAP_LEVELS - SWAPPER_SKIP_LEVEL)
#define IDMAP_LEVELS ARM64_HW_PGTABLE_LEVELS(48)
#define IDMAP_ROOT_LEVEL (4 - IDMAP_LEVELS)
@@ -59,44 +60,39 @@
#define EARLY_ENTRIES(vstart, vend, shift, add) \
((((vend) - 1) >> (shift)) - ((vstart) >> (shift)) + 1 + add)
-#define EARLY_LEVEL(l, vstart, vend, add) \
- (SWAPPER_PGTABLE_LEVELS > l ? EARLY_ENTRIES(vstart, vend, SWAPPER_BLOCK_SHIFT + l * (PAGE_SHIFT - 3), add) : 0)
+#define EARLY_LEVEL(l, lvls, vstart, vend, add) \
+ (lvls > l ? EARLY_ENTRIES(vstart, vend, SWAPPER_BLOCK_SHIFT + l * (PAGE_SHIFT - 3), add) : 0)
-#define EARLY_PAGES(vstart, vend, add) (1 /* PGDIR page */ \
- + EARLY_LEVEL(3, (vstart), (vend), add) /* each entry needs a next level page table */ \
- + EARLY_LEVEL(2, (vstart), (vend), add) /* each entry needs a next level page table */ \
- + EARLY_LEVEL(1, (vstart), (vend), add))/* each entry needs a next level page table */
-#define INIT_DIR_SIZE (PAGE_SIZE * (EARLY_PAGES(KIMAGE_VADDR, _end, EARLY_KASLR) + EARLY_SEGMENT_EXTRA_PAGES))
+#define EARLY_PAGES(lvls, vstart, vend, add) (1 /* PGDIR page */ \
+ + EARLY_LEVEL(3, (lvls), (vstart), (vend), add) /* each entry needs a next level page table */ \
+ + EARLY_LEVEL(2, (lvls), (vstart), (vend), add) /* each entry needs a next level page table */ \
+ + EARLY_LEVEL(1, (lvls), (vstart), (vend), add))/* each entry needs a next level page table */
+#define INIT_DIR_SIZE (PAGE_SIZE * (EARLY_PAGES(SWAPPER_PGTABLE_LEVELS, KIMAGE_VADDR, _end, EARLY_KASLR) \
+ + EARLY_SEGMENT_EXTRA_PAGES))
-/* the initial ID map may need two extra pages if it needs to be extended */
-#if VA_BITS < 48
-#define INIT_IDMAP_DIR_SIZE ((INIT_IDMAP_DIR_PAGES + 2) * PAGE_SIZE)
-#else
-#define INIT_IDMAP_DIR_SIZE (INIT_IDMAP_DIR_PAGES * PAGE_SIZE)
-#endif
-#define INIT_IDMAP_DIR_PAGES EARLY_PAGES(KIMAGE_VADDR, _end + MAX_FDT_SIZE + SWAPPER_BLOCK_SIZE, 1)
+#define INIT_IDMAP_DIR_PAGES (EARLY_PAGES(INIT_IDMAP_PGTABLE_LEVELS, KIMAGE_VADDR, _end, 1))
+#define INIT_IDMAP_DIR_SIZE ((INIT_IDMAP_DIR_PAGES + EARLY_IDMAP_EXTRA_PAGES) * PAGE_SIZE)
+
+#define INIT_IDMAP_FDT_PAGES (EARLY_PAGES(INIT_IDMAP_PGTABLE_LEVELS, 0UL, UL(MAX_FDT_SIZE), 1) - 1)
+#define INIT_IDMAP_FDT_SIZE ((INIT_IDMAP_FDT_PAGES + EARLY_IDMAP_EXTRA_FDT_PAGES) * PAGE_SIZE)
/* The number of segments in the kernel image (text, rodata, inittext, initdata, data+bss) */
#define KERNEL_SEGMENT_COUNT 5
#if SWAPPER_BLOCK_SIZE > SEGMENT_ALIGN
#define EARLY_SEGMENT_EXTRA_PAGES (KERNEL_SEGMENT_COUNT + 1)
-#else
-#define EARLY_SEGMENT_EXTRA_PAGES 0
-#endif
-
/*
- * Initial memory map attributes.
+ * The initial ID map consists of the kernel image, mapped as two separate
+ * segments, and may appear misaligned wrt the swapper block size. This means
+ * we need 3 additional pages. The DT could straddle a swapper block boundary,
+ * so it may need 2.
*/
-#define SWAPPER_PTE_FLAGS (PTE_TYPE_PAGE | PTE_AF | PTE_SHARED)
-#define SWAPPER_PMD_FLAGS (PMD_TYPE_SECT | PMD_SECT_AF | PMD_SECT_S)
-
-#ifdef CONFIG_ARM64_4K_PAGES
-#define SWAPPER_RW_MMUFLAGS (PMD_ATTRINDX(MT_NORMAL) | SWAPPER_PMD_FLAGS)
-#define SWAPPER_RX_MMUFLAGS (SWAPPER_RW_MMUFLAGS | PMD_SECT_RDONLY)
+#define EARLY_IDMAP_EXTRA_PAGES 3
+#define EARLY_IDMAP_EXTRA_FDT_PAGES 2
#else
-#define SWAPPER_RW_MMUFLAGS (PTE_ATTRINDX(MT_NORMAL) | SWAPPER_PTE_FLAGS)
-#define SWAPPER_RX_MMUFLAGS (SWAPPER_RW_MMUFLAGS | PTE_RDONLY)
+#define EARLY_SEGMENT_EXTRA_PAGES 0
+#define EARLY_IDMAP_EXTRA_PAGES 0
+#define EARLY_IDMAP_EXTRA_FDT_PAGES 0
#endif
/*
diff --git a/arch/arm64/include/asm/mmu_context.h b/arch/arm64/include/asm/mmu_context.h
index 72dbd6400549c185..f9ae2891e4c72c7f 100644
--- a/arch/arm64/include/asm/mmu_context.h
+++ b/arch/arm64/include/asm/mmu_context.h
@@ -57,11 +57,9 @@ static inline void cpu_switch_mm(pgd_t *pgd, struct mm_struct *mm)
}
/*
- * TCR.T0SZ value to use when the ID map is active. Usually equals
- * TCR_T0SZ(VA_BITS), unless system RAM is positioned very high in
- * physical memory, in which case it will be smaller.
+ * TCR.T0SZ value to use when the ID map is active.
*/
-extern int idmap_t0sz;
+#define idmap_t0sz TCR_T0SZ(48)
/*
* Ensure TCR.T0SZ is set to the provided value.
diff --git a/arch/arm64/kernel/head.S b/arch/arm64/kernel/head.S
index fc6a4076d826b728..c6f8c3b1f026c07b 100644
--- a/arch/arm64/kernel/head.S
+++ b/arch/arm64/kernel/head.S
@@ -80,26 +80,42 @@
* x19 primary_entry() .. start_kernel() whether we entered with the MMU on
* x20 primary_entry() .. __primary_switch() CPU boot mode
* x21 primary_entry() .. start_kernel() FDT pointer passed at boot in x0
- * x22 create_idmap() .. start_kernel() ID map VA of the DT blob
* x25 primary_entry() .. start_kernel() supported VA size
- * x28 create_idmap() callee preserved temp register
*/
SYM_CODE_START(primary_entry)
bl record_mmu_state
bl preserve_boot_args
- bl create_idmap
+
+ adrp x1, early_init_stack
+ mov sp, x1
+ mov x29, xzr
+ adrp x0, init_idmap_pg_dir
+ bl __pi_create_init_idmap
+
+ /*
+ * If the page tables have been populated with non-cacheable
+ * accesses (MMU disabled), invalidate those tables again to
+ * remove any speculatively loaded cache lines.
+ */
+ cbnz x19, 0f
+ dmb sy
+ mov x1, x0 // end of used region
+ adrp x0, init_idmap_pg_dir
+ adr_l x2, dcache_inval_poc
+ blr x2
+ b 1f
/*
* If we entered with the MMU and caches on, clean the ID mapped part
* of the primary boot code to the PoC so we can safely execute it with
* the MMU off.
*/
- cbz x19, 0f
- adrp x0, __idmap_text_start
+0: adrp x0, __idmap_text_start
adr_l x1, __idmap_text_end
adr_l x2, dcache_clean_poc
blr x2
-0: mov x0, x19
+
+1: mov x0, x19
bl init_kernel_el // w0=cpu_boot_mode
mov x20, x0
@@ -175,238 +191,6 @@ SYM_CODE_START_LOCAL(preserve_boot_args)
ret
SYM_CODE_END(preserve_boot_args)
-/*
- * Macro to populate page table entries, these entries can be pointers to the next level
- * or last level entries pointing to physical memory.
- *
- * tbl: page table address
- * rtbl: pointer to page table or physical memory
- * index: start index to write
- * eindex: end index to write - [index, eindex] written to
- * flags: flags for pagetable entry to or in
- * inc: increment to rtbl between each entry
- * tmp1: temporary variable
- *
- * Preserves: tbl, eindex, flags, inc
- * Corrupts: index, tmp1
- * Returns: rtbl
- */
- .macro populate_entries, tbl, rtbl, index, eindex, flags, inc, tmp1
-.Lpe\@: phys_to_pte \tmp1, \rtbl
- orr \tmp1, \tmp1, \flags // tmp1 = table entry
- str \tmp1, [\tbl, \index, lsl #3]
- add \rtbl, \rtbl, \inc // rtbl = pa next level
- add \index, \index, #1
- cmp \index, \eindex
- b.ls .Lpe\@
- .endm
-
-/*
- * Compute indices of table entries from virtual address range. If multiple entries
- * were needed in the previous page table level then the next page table level is assumed
- * to be composed of multiple pages. (This effectively scales the end index).
- *
- * vstart: virtual address of start of range
- * vend: virtual address of end of range - we map [vstart, vend]
- * shift: shift used to transform virtual address into index
- * order: #imm 2log(number of entries in page table)
- * istart: index in table corresponding to vstart
- * iend: index in table corresponding to vend
- * count: On entry: how many extra entries were required in previous level, scales
- * our end index.
- * On exit: returns how many extra entries required for next page table level
- *
- * Preserves: vstart, vend
- * Returns: istart, iend, count
- */
- .macro compute_indices, vstart, vend, shift, order, istart, iend, count
- ubfx \istart, \vstart, \shift, \order
- ubfx \iend, \vend, \shift, \order
- add \iend, \iend, \count, lsl \order
- sub \count, \iend, \istart
- .endm
-
-/*
- * Map memory for specified virtual address range. Each level of page table needed supports
- * multiple entries. If a level requires n entries the next page table level is assumed to be
- * formed from n pages.
- *
- * tbl: location of page table
- * rtbl: address to be used for first level page table entry (typically tbl + PAGE_SIZE)
- * vstart: virtual address of start of range
- * vend: virtual address of end of range - we map [vstart, vend - 1]
- * flags: flags to use to map last level entries
- * phys: physical address corresponding to vstart - physical memory is contiguous
- * order: #imm 2log(number of entries in PGD table)
- *
- * If extra_shift is set, an extra level will be populated if the end address does
- * not fit in 'extra_shift' bits. This assumes vend is in the TTBR0 range.
- *
- * Temporaries: istart, iend, tmp, count, sv - these need to be different registers
- * Preserves: vstart, flags
- * Corrupts: tbl, rtbl, vend, istart, iend, tmp, count, sv
- */
- .macro map_memory, tbl, rtbl, vstart, vend, flags, phys, order, istart, iend, tmp, count, sv, extra_shift
- sub \vend, \vend, #1
- add \rtbl, \tbl, #PAGE_SIZE
- mov \count, #0
-
- .ifnb \extra_shift
- tst \vend, #~((1 << (\extra_shift)) - 1)
- b.eq .L_\@
- compute_indices \vstart, \vend, #\extra_shift, #(PAGE_SHIFT - 3), \istart, \iend, \count
- mov \sv, \rtbl
- populate_entries \tbl, \rtbl, \istart, \iend, #PMD_TYPE_TABLE, #PAGE_SIZE, \tmp
- mov \tbl, \sv
- .endif
-.L_\@:
- compute_indices \vstart, \vend, #PGDIR_SHIFT, #\order, \istart, \iend, \count
- mov \sv, \rtbl
- populate_entries \tbl, \rtbl, \istart, \iend, #PMD_TYPE_TABLE, #PAGE_SIZE, \tmp
- mov \tbl, \sv
-
-#if SWAPPER_PGTABLE_LEVELS > 3
- compute_indices \vstart, \vend, #PUD_SHIFT, #(PAGE_SHIFT - 3), \istart, \iend, \count
- mov \sv, \rtbl
- populate_entries \tbl, \rtbl, \istart, \iend, #PMD_TYPE_TABLE, #PAGE_SIZE, \tmp
- mov \tbl, \sv
-#endif
-
-#if SWAPPER_PGTABLE_LEVELS > 2
- compute_indices \vstart, \vend, #SWAPPER_TABLE_SHIFT, #(PAGE_SHIFT - 3), \istart, \iend, \count
- mov \sv, \rtbl
- populate_entries \tbl, \rtbl, \istart, \iend, #PMD_TYPE_TABLE, #PAGE_SIZE, \tmp
- mov \tbl, \sv
-#endif
-
- compute_indices \vstart, \vend, #SWAPPER_BLOCK_SHIFT, #(PAGE_SHIFT - 3), \istart, \iend, \count
- bic \rtbl, \phys, #SWAPPER_BLOCK_SIZE - 1
- populate_entries \tbl, \rtbl, \istart, \iend, \flags, #SWAPPER_BLOCK_SIZE, \tmp
- .endm
-
-/*
- * Remap a subregion created with the map_memory macro with modified attributes
- * or output address. The entire remapped region must have been covered in the
- * invocation of map_memory.
- *
- * x0: last level table address (returned in first argument to map_memory)
- * x1: start VA of the existing mapping
- * x2: start VA of the region to update
- * x3: end VA of the region to update (exclusive)
- * x4: start PA associated with the region to update
- * x5: attributes to set on the updated region
- * x6: order of the last level mappings
- */
-SYM_FUNC_START_LOCAL(remap_region)
- sub x3, x3, #1 // make end inclusive
-
- // Get the index offset for the start of the last level table
- lsr x1, x1, x6
- bfi x1, xzr, #0, #PAGE_SHIFT - 3
-
- // Derive the start and end indexes into the last level table
- // associated with the provided region
- lsr x2, x2, x6
- lsr x3, x3, x6
- sub x2, x2, x1
- sub x3, x3, x1
-
- mov x1, #1
- lsl x6, x1, x6 // block size at this level
-
- populate_entries x0, x4, x2, x3, x5, x6, x7
- ret
-SYM_FUNC_END(remap_region)
-
-SYM_FUNC_START_LOCAL(create_idmap)
- mov x28, lr
- /*
- * The ID map carries a 1:1 mapping of the physical address range
- * covered by the loaded image, which could be anywhere in DRAM. This
- * means that the required size of the VA (== PA) space is decided at
- * boot time, and could be more than the configured size of the VA
- * space for ordinary kernel and user space mappings.
- *
- * There are three cases to consider here:
- * - 39 <= VA_BITS < 48, and the ID map needs up to 48 VA bits to cover
- * the placement of the image. In this case, we configure one extra
- * level of translation on the fly for the ID map only. (This case
- * also covers 42-bit VA/52-bit PA on 64k pages).
- *
- * - VA_BITS == 48, and the ID map needs more than 48 VA bits. This can
- * only happen when using 64k pages, in which case we need to extend
- * the root level table rather than add a level. Note that we can
- * treat this case as 'always extended' as long as we take care not
- * to program an unsupported T0SZ value into the TCR register.
- *
- * - Combinations that would require two additional levels of
- * translation are not supported, e.g., VA_BITS==36 on 16k pages, or
- * VA_BITS==39/4k pages with 5-level paging, where the input address
- * requires more than 47 or 48 bits, respectively.
- */
-#if (VA_BITS < 48)
-#define IDMAP_PGD_ORDER (VA_BITS - PGDIR_SHIFT)
-#define EXTRA_SHIFT (PGDIR_SHIFT + PAGE_SHIFT - 3)
-
- /*
- * If VA_BITS < 48, we have to configure an additional table level.
- * First, we have to verify our assumption that the current value of
- * VA_BITS was chosen such that all translation levels are fully
- * utilised, and that lowering T0SZ will always result in an additional
- * translation level to be configured.
- */
-#if VA_BITS != EXTRA_SHIFT
-#error "Mismatch between VA_BITS and page size/number of translation levels"
-#endif
-#else
-#define IDMAP_PGD_ORDER (PHYS_MASK_SHIFT - PGDIR_SHIFT)
-#define EXTRA_SHIFT
- /*
- * If VA_BITS == 48, we don't have to configure an additional
- * translation level, but the top-level table has more entries.
- */
-#endif
- adrp x0, init_idmap_pg_dir
- adrp x3, _text
- adrp x6, _end + MAX_FDT_SIZE + SWAPPER_BLOCK_SIZE
- mov x7, SWAPPER_RX_MMUFLAGS
-
- map_memory x0, x1, x3, x6, x7, x3, IDMAP_PGD_ORDER, x10, x11, x12, x13, x14, EXTRA_SHIFT
-
- /* Remap [.init].data, BSS and the kernel page tables r/w in the ID map */
- adrp x1, _text
- adrp x2, __initdata_begin
- adrp x3, _end
- bic x4, x2, #SWAPPER_BLOCK_SIZE - 1
- mov x5, SWAPPER_RW_MMUFLAGS
- mov x6, #SWAPPER_BLOCK_SHIFT
- bl remap_region
-
- /* Remap the FDT after the kernel image */
- adrp x1, _text
- adrp x22, _end + SWAPPER_BLOCK_SIZE
- bic x2, x22, #SWAPPER_BLOCK_SIZE - 1
- bfi x22, x21, #0, #SWAPPER_BLOCK_SHIFT // remapped FDT address
- add x3, x2, #MAX_FDT_SIZE + SWAPPER_BLOCK_SIZE
- bic x4, x21, #SWAPPER_BLOCK_SIZE - 1
- mov x5, SWAPPER_RW_MMUFLAGS
- mov x6, #SWAPPER_BLOCK_SHIFT
- bl remap_region
-
- /*
- * Since the page tables have been populated with non-cacheable
- * accesses (MMU disabled), invalidate those tables again to
- * remove any speculatively loaded cache lines.
- */
- cbnz x19, 0f // skip cache invalidation if MMU is on
- dmb sy
-
- adrp x0, init_idmap_pg_dir
- adrp x1, init_idmap_pg_end
- bl dcache_inval_poc
-0: ret x28
-SYM_FUNC_END(create_idmap)
-
/*
* Initialize CPU registers with task-specific and cpu-specific context.
*
@@ -727,11 +511,6 @@ SYM_FUNC_START_LOCAL(__no_granule_support)
SYM_FUNC_END(__no_granule_support)
SYM_FUNC_START_LOCAL(__primary_switch)
- mrs x1, tcr_el1
- mov x2, #64 - VA_BITS
- tcr_set_t0sz x1, x2
- msr tcr_el1, x1
-
adrp x1, reserved_pg_dir
adrp x2, init_idmap_pg_dir
bl __enable_mmu
@@ -740,7 +519,7 @@ SYM_FUNC_START_LOCAL(__primary_switch)
mov sp, x1
mov x29, xzr
mov x0, x20 // pass the full boot status
- mov x1, x22 // pass the low FDT mapping
+ mov x1, x21 // pass the FDT
bl __pi_early_map_kernel // Map and relocate the kernel
ldr x8, =__primary_switched
diff --git a/arch/arm64/kernel/image-vars.h b/arch/arm64/kernel/image-vars.h
index 6eeb2a09c8d2441d..6ca235b09a30d5d3 100644
--- a/arch/arm64/kernel/image-vars.h
+++ b/arch/arm64/kernel/image-vars.h
@@ -47,6 +47,7 @@ PROVIDE(__pi_arm64_use_ng_mappings = arm64_use_ng_mappings);
PROVIDE(__pi__ctype = _ctype);
PROVIDE(__pi_memstart_offset_seed = memstart_offset_seed);
+PROVIDE(__pi_init_idmap_pg_dir = init_idmap_pg_dir);
PROVIDE(__pi_init_pg_dir = init_pg_dir);
PROVIDE(__pi_init_pg_end = init_pg_end);
diff --git a/arch/arm64/kernel/pi/Makefile b/arch/arm64/kernel/pi/Makefile
index 8c2f80a46b9387b0..4393b41f0b7133da 100644
--- a/arch/arm64/kernel/pi/Makefile
+++ b/arch/arm64/kernel/pi/Makefile
@@ -11,6 +11,9 @@ KBUILD_CFLAGS := $(subst $(CC_FLAGS_FTRACE),,$(KBUILD_CFLAGS)) -fpie \
-fno-asynchronous-unwind-tables -fno-unwind-tables \
$(call cc-option,-fno-addrsig)
+# this code may run with the MMU off so disable unaligned accesses
+CFLAGS_map_range.o += -mstrict-align
+
# remove SCS flags from all objects in this directory
KBUILD_CFLAGS := $(filter-out $(CC_FLAGS_SCS), $(KBUILD_CFLAGS))
# disable LTO
diff --git a/arch/arm64/kernel/pi/map_kernel.c b/arch/arm64/kernel/pi/map_kernel.c
index b573c964c7d88d1b..a718714eb671f290 100644
--- a/arch/arm64/kernel/pi/map_kernel.c
+++ b/arch/arm64/kernel/pi/map_kernel.c
@@ -129,6 +129,22 @@ static void __init map_kernel(u64 kaslr_offset, u64 va_offset, int root_level)
}
}
+static void map_fdt(u64 fdt)
+{
+ static u8 ptes[INIT_IDMAP_FDT_SIZE] __initdata __aligned(PAGE_SIZE);
+ u64 efdt = fdt + MAX_FDT_SIZE;
+ u64 ptep = (u64)ptes;
+
+ /*
+ * Map up to MAX_FDT_SIZE bytes, but avoid overlap with
+ * the kernel image.
+ */
+ map_range(&ptep, fdt, (u64)_text > fdt ? min((u64)_text, efdt) : efdt,
+ fdt, PAGE_KERNEL, IDMAP_ROOT_LEVEL,
+ (pte_t *)init_idmap_pg_dir, false, 0);
+ dsb(ishst);
+}
+
asmlinkage void __init early_map_kernel(u64 boot_status, void *fdt)
{
static char const chosen_str[] __initconst = "/chosen";
@@ -137,6 +153,8 @@ asmlinkage void __init early_map_kernel(u64 boot_status, void *fdt)
int root_level = 4 - CONFIG_PGTABLE_LEVELS;
int chosen;
+ map_fdt((u64)fdt);
+
/* Clear BSS and the initial page tables */
memset(__bss_start, 0, (u64)init_pg_end - (u64)__bss_start);
diff --git a/arch/arm64/kernel/pi/map_range.c b/arch/arm64/kernel/pi/map_range.c
index 61cbd6e82418c033..8de3fe2c1cb55722 100644
--- a/arch/arm64/kernel/pi/map_range.c
+++ b/arch/arm64/kernel/pi/map_range.c
@@ -85,3 +85,15 @@ void __init map_range(u64 *pte, u64 start, u64 end, u64 pa, pgprot_t prot,
tbl++;
}
}
+
+asmlinkage u64 __init create_init_idmap(pgd_t *pg_dir)
+{
+ u64 ptep = (u64)pg_dir + PAGE_SIZE;
+
+ map_range(&ptep, (u64)_stext, (u64)__initdata_begin, (u64)_stext,
+ PAGE_KERNEL_ROX, IDMAP_ROOT_LEVEL, (pte_t *)pg_dir, false, 0);
+ map_range(&ptep, (u64)__initdata_begin, (u64)_end, (u64)__initdata_begin,
+ PAGE_KERNEL, IDMAP_ROOT_LEVEL, (pte_t *)pg_dir, false, 0);
+
+ return ptep;
+}
diff --git a/arch/arm64/kernel/pi/pi.h b/arch/arm64/kernel/pi/pi.h
index dfc70828ad0ad2a8..a8fe79c9b111140f 100644
--- a/arch/arm64/kernel/pi/pi.h
+++ b/arch/arm64/kernel/pi/pi.h
@@ -17,6 +17,8 @@ static inline void *prel64_to_pointer(const prel64_t *offset)
extern bool dynamic_scs_is_enabled;
+extern pgd_t init_idmap_pg_dir[];
+
void init_feature_override(u64 boot_status, const void *fdt, int chosen);
u64 kaslr_early_init(void *fdt, int chosen);
void relocate_kernel(u64 offset);
@@ -24,3 +26,5 @@ int scs_patch(const u8 eh_frame[], int size);
void map_range(u64 *pgd, u64 start, u64 end, u64 pa, pgprot_t prot,
int level, pte_t *tbl, bool may_use_cont, u64 va_offset);
+
+asmlinkage u64 create_init_idmap(pgd_t *pgd);
diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index a59433ae4f5f8d02..13f46c911558f21f 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -43,8 +43,6 @@
#define NO_CONT_MAPPINGS BIT(1)
#define NO_EXEC_MAPPINGS BIT(2) /* assumes FEAT_HPDS is not used */
-int idmap_t0sz __ro_after_init;
-
#if VA_BITS > 48
u64 vabits_actual __ro_after_init = VA_BITS_MIN;
EXPORT_SYMBOL(vabits_actual);
@@ -798,8 +796,6 @@ void __init paging_init(void)
pgd_t *pgdp = pgd_set_fixmap(__pa_symbol(swapper_pg_dir));
extern pgd_t init_idmap_pg_dir[];
- idmap_t0sz = 63UL - __fls(__pa_symbol(_end) | GENMASK(VA_BITS_MIN - 1, 0));
-
map_kernel(pgdp);
map_mem(pgdp);
@@ -814,7 +810,6 @@ void __init paging_init(void)
memblock_allow_resize();
create_idmap();
- idmap_t0sz = TCR_T0SZ(48);
}
#ifdef CONFIG_MEMORY_HOTPLUG
diff --git a/arch/arm64/mm/proc.S b/arch/arm64/mm/proc.S
index c7129b21bfd5191f..d0748f18b2abdf0e 100644
--- a/arch/arm64/mm/proc.S
+++ b/arch/arm64/mm/proc.S
@@ -200,7 +200,7 @@ SYM_FUNC_ALIAS(__pi_idmap_cpu_replace_ttbr1, idmap_cpu_replace_ttbr1)
#ifdef CONFIG_UNMAP_KERNEL_AT_EL0
-#define KPTI_NG_PTE_FLAGS (PTE_ATTRINDX(MT_NORMAL) | SWAPPER_PTE_FLAGS)
+#define KPTI_NG_PTE_FLAGS (PTE_ATTRINDX(MT_NORMAL) | PTE_TYPE_PAGE | PTE_AF | PTE_SHARED)
.pushsection ".idmap.text", "awx"
--
2.39.2
Now that the early kernel mapping is created with all the right
attributes and segment boundaries, there is no longer a need to recreate
it and switch to it. This also means we no longer have to copy the kasan
shadow or some parts of the fixmap from one set of page tables to the
other.
Signed-off-by: Ard Biesheuvel <[email protected]>
---
arch/arm64/include/asm/kasan.h | 2 -
arch/arm64/include/asm/mmu.h | 2 +-
arch/arm64/kernel/image-vars.h | 1 +
arch/arm64/kernel/pi/map_kernel.c | 6 +-
arch/arm64/mm/kasan_init.c | 15 ---
arch/arm64/mm/mmu.c | 110 +++-----------------
6 files changed, 20 insertions(+), 116 deletions(-)
diff --git a/arch/arm64/include/asm/kasan.h b/arch/arm64/include/asm/kasan.h
index 12d5f47f7dbec628..ab52688ac4bd43b6 100644
--- a/arch/arm64/include/asm/kasan.h
+++ b/arch/arm64/include/asm/kasan.h
@@ -36,12 +36,10 @@ void kasan_init(void);
#define _KASAN_SHADOW_START(va) (KASAN_SHADOW_END - (1UL << ((va) - KASAN_SHADOW_SCALE_SHIFT)))
#define KASAN_SHADOW_START _KASAN_SHADOW_START(vabits_actual)
-void kasan_copy_shadow(pgd_t *pgdir);
asmlinkage void kasan_early_init(void);
#else
static inline void kasan_init(void) { }
-static inline void kasan_copy_shadow(pgd_t *pgdir) { }
#endif
#endif
diff --git a/arch/arm64/include/asm/mmu.h b/arch/arm64/include/asm/mmu.h
index e74dfae8e48214c3..fce956cd721ba64f 100644
--- a/arch/arm64/include/asm/mmu.h
+++ b/arch/arm64/include/asm/mmu.h
@@ -95,7 +95,7 @@ static inline bool kaslr_disabled_cmdline(void)
}
#define INIT_MM_CONTEXT(name) \
- .pgd = init_pg_dir,
+ .pgd = swapper_pg_dir,
#endif /* !__ASSEMBLY__ */
#endif
diff --git a/arch/arm64/kernel/image-vars.h b/arch/arm64/kernel/image-vars.h
index 6ca235b09a30d5d3..1c0e920a2466f851 100644
--- a/arch/arm64/kernel/image-vars.h
+++ b/arch/arm64/kernel/image-vars.h
@@ -50,6 +50,7 @@ PROVIDE(__pi_memstart_offset_seed = memstart_offset_seed);
PROVIDE(__pi_init_idmap_pg_dir = init_idmap_pg_dir);
PROVIDE(__pi_init_pg_dir = init_pg_dir);
PROVIDE(__pi_init_pg_end = init_pg_end);
+PROVIDE(__pi_swapper_pg_dir = swapper_pg_dir);
PROVIDE(__pi__text = _text);
PROVIDE(__pi__stext = _stext);
diff --git a/arch/arm64/kernel/pi/map_kernel.c b/arch/arm64/kernel/pi/map_kernel.c
index a718714eb671f290..a90c4d6fc75c35d0 100644
--- a/arch/arm64/kernel/pi/map_kernel.c
+++ b/arch/arm64/kernel/pi/map_kernel.c
@@ -125,8 +125,12 @@ static void __init map_kernel(u64 kaslr_offset, u64 va_offset, int root_level)
text_prot, true, root_level);
map_segment(init_pg_dir, NULL, va_offset, __inittext_begin,
__inittext_end, text_prot, false, root_level);
- dsb(ishst);
}
+
+ /* Copy the root page table to its final location */
+ memcpy((void *)swapper_pg_dir + va_offset, init_pg_dir, PGD_SIZE);
+ dsb(ishst);
+ idmap_cpu_replace_ttbr1(swapper_pg_dir);
}
static void map_fdt(u64 fdt)
diff --git a/arch/arm64/mm/kasan_init.c b/arch/arm64/mm/kasan_init.c
index e969e68de005fd2a..df98f496539f0e39 100644
--- a/arch/arm64/mm/kasan_init.c
+++ b/arch/arm64/mm/kasan_init.c
@@ -184,21 +184,6 @@ static void __init kasan_map_populate(unsigned long start, unsigned long end,
kasan_pgd_populate(start & PAGE_MASK, PAGE_ALIGN(end), node, false);
}
-/*
- * Copy the current shadow region into a new pgdir.
- */
-void __init kasan_copy_shadow(pgd_t *pgdir)
-{
- pgd_t *pgdp, *pgdp_new, *pgdp_end;
-
- pgdp = pgd_offset_k(KASAN_SHADOW_START);
- pgdp_end = pgd_offset_k(KASAN_SHADOW_END);
- pgdp_new = pgd_offset_pgd(pgdir, KASAN_SHADOW_START);
- do {
- set_pgd(pgdp_new, READ_ONCE(*pgdp));
- } while (pgdp++, pgdp_new++, pgdp != pgdp_end);
-}
-
static void __init clear_pgds(unsigned long start,
unsigned long end)
{
diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index 3f631f3bc2f80b2b..81634ff5f6a67476 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -634,9 +634,9 @@ void mark_rodata_ro(void)
debug_checkwx();
}
-static void __init map_kernel_segment(pgd_t *pgdp, void *va_start, void *va_end,
- pgprot_t prot, struct vm_struct *vma,
- int flags, unsigned long vm_flags)
+static void __init declare_vma(struct vm_struct *vma,
+ void *va_start, void *va_end,
+ unsigned long vm_flags)
{
phys_addr_t pa_start = __pa_symbol(va_start);
unsigned long size = va_end - va_start;
@@ -644,9 +644,6 @@ static void __init map_kernel_segment(pgd_t *pgdp, void *va_start, void *va_end,
BUG_ON(!PAGE_ALIGNED(pa_start));
BUG_ON(!PAGE_ALIGNED(size));
- __create_pgd_mapping(pgdp, pa_start, (unsigned long)va_start, size, prot,
- early_pgtable_alloc, flags);
-
if (!(vm_flags & VM_NO_GUARD))
size += PAGE_SIZE;
@@ -691,87 +688,17 @@ core_initcall(map_entry_trampoline);
#endif
/*
- * Open coded check for BTI, only for use to determine configuration
- * for early mappings for before the cpufeature code has run.
+ * Declare the VMA areas for the kernel
*/
-static bool arm64_early_this_cpu_has_bti(void)
+static void __init declare_kernel_vmas(void)
{
- u64 pfr1;
-
- if (!IS_ENABLED(CONFIG_ARM64_BTI_KERNEL))
- return false;
-
- pfr1 = __read_sysreg_by_encoding(SYS_ID_AA64PFR1_EL1);
- return cpuid_feature_extract_unsigned_field(pfr1,
- ID_AA64PFR1_EL1_BT_SHIFT);
-}
-
-/*
- * Create fine-grained mappings for the kernel.
- */
-static void __init map_kernel(pgd_t *pgdp)
-{
- static struct vm_struct vmlinux_text, vmlinux_rodata, vmlinux_inittext,
- vmlinux_initdata, vmlinux_data;
-
- /*
- * External debuggers may need to write directly to the text
- * mapping to install SW breakpoints. Allow this (only) when
- * explicitly requested with rodata=off.
- */
- pgprot_t text_prot = rodata_enabled ? PAGE_KERNEL_ROX : PAGE_KERNEL_EXEC;
-
- /*
- * If we have a CPU that supports BTI and a kernel built for
- * BTI then mark the kernel executable text as guarded pages
- * now so we don't have to rewrite the page tables later.
- */
- if (arm64_early_this_cpu_has_bti())
- text_prot = __pgprot_modify(text_prot, PTE_GP, PTE_GP);
+ static struct vm_struct vmlinux_seg[KERNEL_SEGMENT_COUNT];
- /*
- * Only rodata will be remapped with different permissions later on,
- * all other segments are allowed to use contiguous mappings.
- */
- map_kernel_segment(pgdp, _stext, _etext, text_prot, &vmlinux_text, 0,
- VM_NO_GUARD);
- map_kernel_segment(pgdp, __start_rodata, __inittext_begin, PAGE_KERNEL,
- &vmlinux_rodata, NO_CONT_MAPPINGS, VM_NO_GUARD);
- map_kernel_segment(pgdp, __inittext_begin, __inittext_end, text_prot,
- &vmlinux_inittext, 0, VM_NO_GUARD);
- map_kernel_segment(pgdp, __initdata_begin, __initdata_end, PAGE_KERNEL,
- &vmlinux_initdata, 0, VM_NO_GUARD);
- map_kernel_segment(pgdp, _data, _end, PAGE_KERNEL, &vmlinux_data, 0, 0);
-
- if (!READ_ONCE(pgd_val(*pgd_offset_pgd(pgdp, FIXADDR_START)))) {
- /*
- * The fixmap falls in a separate pgd to the kernel, and doesn't
- * live in the carveout for the swapper_pg_dir. We can simply
- * re-use the existing dir for the fixmap.
- */
- set_pgd(pgd_offset_pgd(pgdp, FIXADDR_START),
- READ_ONCE(*pgd_offset_k(FIXADDR_START)));
- } else if (CONFIG_PGTABLE_LEVELS > 3) {
- pgd_t *bm_pgdp;
- p4d_t *bm_p4dp;
- pud_t *bm_pudp;
- /*
- * The fixmap shares its top level pgd entry with the kernel
- * mapping. This can really only occur when we are running
- * with 16k/4 levels, so we can simply reuse the pud level
- * entry instead.
- */
- BUG_ON(!IS_ENABLED(CONFIG_ARM64_16K_PAGES));
- bm_pgdp = pgd_offset_pgd(pgdp, FIXADDR_START);
- bm_p4dp = p4d_offset(bm_pgdp, FIXADDR_START);
- bm_pudp = pud_set_fixmap_offset(bm_p4dp, FIXADDR_START);
- pud_populate(&init_mm, bm_pudp, lm_alias(bm_pmd));
- pud_clear_fixmap();
- } else {
- BUG();
- }
-
- kasan_copy_shadow(pgdp);
+ declare_vma(&vmlinux_seg[0], _stext, _etext, VM_NO_GUARD);
+ declare_vma(&vmlinux_seg[1], __start_rodata, __inittext_begin, VM_NO_GUARD);
+ declare_vma(&vmlinux_seg[2], __inittext_begin, __inittext_end, VM_NO_GUARD);
+ declare_vma(&vmlinux_seg[3], __initdata_begin, __initdata_end, VM_NO_GUARD);
+ declare_vma(&vmlinux_seg[4], _data, _end, 0);
}
void __pi_map_range(u64 *pgd, u64 start, u64 end, u64 pa, pgprot_t prot,
@@ -807,23 +734,12 @@ static void __init create_idmap(void)
void __init paging_init(void)
{
- pgd_t *pgdp = pgd_set_fixmap(__pa_symbol(swapper_pg_dir));
- extern pgd_t init_idmap_pg_dir[];
-
- map_kernel(pgdp);
- map_mem(pgdp);
-
- pgd_clear_fixmap();
-
- cpu_replace_ttbr1(lm_alias(swapper_pg_dir), init_idmap_pg_dir);
- init_mm.pgd = swapper_pg_dir;
-
- memblock_phys_free(__pa_symbol(init_pg_dir),
- __pa_symbol(init_pg_end) - __pa_symbol(init_pg_dir));
+ map_mem(swapper_pg_dir);
memblock_allow_resize();
create_idmap();
+ declare_kernel_vmas();
}
#ifdef CONFIG_MEMORY_HOTPLUG
--
2.39.2
In order to allow the CPU feature override detection code to run even
earlier, move the feature override global variables into BSS, which is
the only part of the static kernel image that is mapped read-write in
the initial ID map.
Signed-off-by: Ard Biesheuvel <[email protected]>
---
arch/arm64/kernel/cpufeature.c | 14 +++++++-------
1 file changed, 7 insertions(+), 7 deletions(-)
diff --git a/arch/arm64/kernel/cpufeature.c b/arch/arm64/kernel/cpufeature.c
index 88cec6c14743c4c5..0b16e676b68c6543 100644
--- a/arch/arm64/kernel/cpufeature.c
+++ b/arch/arm64/kernel/cpufeature.c
@@ -649,13 +649,13 @@ static const struct arm64_ftr_bits ftr_raz[] = {
#define ARM64_FTR_REG(id, table) \
__ARM64_FTR_REG_OVERRIDE(#id, id, table, &no_override)
-struct arm64_ftr_override __ro_after_init id_aa64mmfr1_override;
-struct arm64_ftr_override __ro_after_init id_aa64pfr0_override;
-struct arm64_ftr_override __ro_after_init id_aa64pfr1_override;
-struct arm64_ftr_override __ro_after_init id_aa64zfr0_override;
-struct arm64_ftr_override __ro_after_init id_aa64smfr0_override;
-struct arm64_ftr_override __ro_after_init id_aa64isar1_override;
-struct arm64_ftr_override __ro_after_init id_aa64isar2_override;
+struct arm64_ftr_override id_aa64mmfr1_override;
+struct arm64_ftr_override id_aa64pfr0_override;
+struct arm64_ftr_override id_aa64pfr1_override;
+struct arm64_ftr_override id_aa64zfr0_override;
+struct arm64_ftr_override id_aa64smfr0_override;
+struct arm64_ftr_override id_aa64isar1_override;
+struct arm64_ftr_override id_aa64isar2_override;
struct arm64_ftr_override arm64_sw_feature_override;
--
2.39.2
Add rodata=off to the set of kernel command line options that is parsed
early using the CPU feature override detection code, so we can easily
refer to it when creating the kernel mapping.
Signed-off-by: Ard Biesheuvel <[email protected]>
---
arch/arm64/include/asm/cpufeature.h | 1 +
arch/arm64/kernel/pi/idreg-override.c | 2 ++
2 files changed, 3 insertions(+)
diff --git a/arch/arm64/include/asm/cpufeature.h b/arch/arm64/include/asm/cpufeature.h
index bc10098901808c00..edc7733aa49846b2 100644
--- a/arch/arm64/include/asm/cpufeature.h
+++ b/arch/arm64/include/asm/cpufeature.h
@@ -16,6 +16,7 @@
#define cpu_feature(x) KERNEL_HWCAP_ ## x
#define ARM64_SW_FEATURE_OVERRIDE_NOKASLR 0
+#define ARM64_SW_FEATURE_OVERRIDE_RODATA_OFF 4
#ifndef __ASSEMBLY__
diff --git a/arch/arm64/kernel/pi/idreg-override.c b/arch/arm64/kernel/pi/idreg-override.c
index 4e76db6eb72c2087..6c547cccaf6a9e9c 100644
--- a/arch/arm64/kernel/pi/idreg-override.c
+++ b/arch/arm64/kernel/pi/idreg-override.c
@@ -151,6 +151,7 @@ static const struct ftr_set_desc sw_features __prel64_initconst = {
.override = &arm64_sw_feature_override,
.fields = {
FIELD("nokaslr", ARM64_SW_FEATURE_OVERRIDE_NOKASLR, NULL),
+ FIELD("rodataoff", ARM64_SW_FEATURE_OVERRIDE_RODATA_OFF, NULL),
{}
},
};
@@ -183,6 +184,7 @@ static const struct {
"id_aa64isar2.gpa3=0 id_aa64isar2.apa3=0" },
{ "arm64.nomte", "id_aa64pfr1.mte=0" },
{ "nokaslr", "arm64_sw.nokaslr=1" },
+ { "rodata=off", "arm64_sw.rodataoff=1" },
};
static int __init parse_hexdigit(const char *p, u64 *v)
--
2.39.2
From: Anshuman Khandual <[email protected]>
PAGE_SIZE support is tested against possible minimum and maximum values for
its respective ID_AA64MMFR0.TGRAN field, depending on whether it is signed
or unsigned. But the FEAT_LPA2 implementation also needs to be validated
for 4K and 16K page sizes via feature-specific ID_AA64MMFR0.TGRAN values.
Hence, add the FEAT_LPA2 specific ID_AA64MMFR0.TGRAN[2] values, per the
ARM ARM (0487G.A).
Acked-by: Catalin Marinas <[email protected]>
Signed-off-by: Anshuman Khandual <[email protected]>
Signed-off-by: Ard Biesheuvel <[email protected]>
---
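A note on intended usage (illustrative; the actual consumer appears later
in this series): the page-size agnostic define lets LPA2 detection be
written without #ifdefs on the granule size. The TGRAN field is signed
for 4K pages and unsigned for 16K, but in both cases the values that
advertise 52-bit support compare >= ID_AA64MMFR0_EL1_TGRAN_LPA2 when
extracted as a signed field, e.g. (hypothetical helper name):

    static bool cpu_has_lpa2_granule(void)
    {
            u64 mmfr0 = read_sysreg(id_aa64mmfr0_el1);
            int tgran = cpuid_feature_extract_signed_field(mmfr0,
                                    ID_AA64MMFR0_EL1_TGRAN_SHIFT);

            return tgran >= ID_AA64MMFR0_EL1_TGRAN_LPA2;
    }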
arch/arm64/include/asm/sysreg.h | 2 ++
1 file changed, 2 insertions(+)
diff --git a/arch/arm64/include/asm/sysreg.h b/arch/arm64/include/asm/sysreg.h
index 9e3ecba3c4e67936..21f2ceeb2d152469 100644
--- a/arch/arm64/include/asm/sysreg.h
+++ b/arch/arm64/include/asm/sysreg.h
@@ -630,11 +630,13 @@
#if defined(CONFIG_ARM64_4K_PAGES)
#define ID_AA64MMFR0_EL1_TGRAN_SHIFT ID_AA64MMFR0_EL1_TGRAN4_SHIFT
+#define ID_AA64MMFR0_EL1_TGRAN_LPA2 ID_AA64MMFR0_EL1_TGRAN4_52_BIT
#define ID_AA64MMFR0_EL1_TGRAN_SUPPORTED_MIN ID_AA64MMFR0_EL1_TGRAN4_SUPPORTED_MIN
#define ID_AA64MMFR0_EL1_TGRAN_SUPPORTED_MAX ID_AA64MMFR0_EL1_TGRAN4_SUPPORTED_MAX
#define ID_AA64MMFR0_EL1_TGRAN_2_SHIFT ID_AA64MMFR0_EL1_TGRAN4_2_SHIFT
#elif defined(CONFIG_ARM64_16K_PAGES)
#define ID_AA64MMFR0_EL1_TGRAN_SHIFT ID_AA64MMFR0_EL1_TGRAN16_SHIFT
+#define ID_AA64MMFR0_EL1_TGRAN_LPA2 ID_AA64MMFR0_EL1_TGRAN16_52_BIT
#define ID_AA64MMFR0_EL1_TGRAN_SUPPORTED_MIN ID_AA64MMFR0_EL1_TGRAN16_SUPPORTED_MIN
#define ID_AA64MMFR0_EL1_TGRAN_SUPPORTED_MAX ID_AA64MMFR0_EL1_TGRAN16_SUPPORTED_MAX
#define ID_AA64MMFR0_EL1_TGRAN_2_SHIFT ID_AA64MMFR0_EL1_TGRAN16_2_SHIFT
--
2.39.2
To permit the feature overrides to be taken into account before the
KASLR init code runs and the kernel mapping is created, move the
detection code to an earlier stage in the boot.
In a subsequent patch, this will be taken advantage of by merging the
preliminary and permanent mappings of the kernel text and data into a
single one that gets created and relocated before start_kernel() is
called.
Signed-off-by: Ard Biesheuvel <[email protected]>
---
arch/arm64/kernel/head.S | 17 +++++++++--------
arch/arm64/kernel/vmlinux.lds.S | 4 +---
2 files changed, 10 insertions(+), 11 deletions(-)
diff --git a/arch/arm64/kernel/head.S b/arch/arm64/kernel/head.S
index ade0cb99c8a83a3d..0a345898a12939af 100644
--- a/arch/arm64/kernel/head.S
+++ b/arch/arm64/kernel/head.S
@@ -375,9 +375,9 @@ SYM_FUNC_START_LOCAL(create_idmap)
map_memory x0, x1, x3, x6, x7, x3, IDMAP_PGD_ORDER, x10, x11, x12, x13, x14, EXTRA_SHIFT
- /* Remap BSS and the kernel page tables r/w in the ID map */
+ /* Remap [.init].data, BSS and the kernel page tables r/w in the ID map */
adrp x1, _text
- adrp x2, __bss_start
+ adrp x2, __initdata_begin
adrp x3, _end
bic x4, x2, #SWAPPER_BLOCK_SIZE - 1
mov x5, SWAPPER_RW_MMUFLAGS
@@ -491,9 +491,6 @@ SYM_FUNC_START_LOCAL(__primary_switched)
#if defined(CONFIG_KASAN_GENERIC) || defined(CONFIG_KASAN_SW_TAGS)
bl kasan_early_init
#endif
- mov x0, x20 // pass the full boot status
- mov x1, x22 // pass the low FDT mapping
- bl __pi_init_feature_override // Parse cpu feature overrides
#ifdef CONFIG_UNWIND_PATCH_PAC_INTO_SCS
bl scs_patch_vmlinux
#endif
@@ -770,12 +767,16 @@ SYM_FUNC_START_LOCAL(__primary_switch)
bl __pi_memset
dsb ishst // Make zero page visible to PTW
-#ifdef CONFIG_RELOCATABLE
- adrp x23, KERNEL_START
- and x23, x23, MIN_KIMG_ALIGN - 1
adrp x1, early_init_stack
mov sp, x1
mov x29, xzr
+ mov x0, x20 // pass the full boot status
+ mov x1, x22 // pass the low FDT mapping
+ bl __pi_init_feature_override // Parse cpu feature overrides
+
+#ifdef CONFIG_RELOCATABLE
+ adrp x23, KERNEL_START
+ and x23, x23, MIN_KIMG_ALIGN - 1
#ifdef CONFIG_RANDOMIZE_BASE
mov x0, x22
bl __pi_kaslr_early_init
diff --git a/arch/arm64/kernel/vmlinux.lds.S b/arch/arm64/kernel/vmlinux.lds.S
index ec24b1e70d606ec8..6c79ad2945749260 100644
--- a/arch/arm64/kernel/vmlinux.lds.S
+++ b/arch/arm64/kernel/vmlinux.lds.S
@@ -318,10 +318,8 @@ SECTIONS
init_pg_dir = .;
. += INIT_DIR_SIZE;
init_pg_end = .;
-#ifdef CONFIG_RELOCATABLE
- . += SZ_4K; /* stack for the early relocation code */
+ . += SZ_4K; /* stack for the early C runtime */
early_init_stack = .;
-#endif
. = ALIGN(SEGMENT_ALIGN);
__pecoff_data_size = ABSOLUTE(. - __initdata_begin);
--
2.39.2
Once we update the early kernel mapping code to only map the kernel once
with the right permissions, we can no longer perform code patching via
this mapping.
So move this code to an earlier stage of the boot, right after applying
the relocations.
Signed-off-by: Ard Biesheuvel <[email protected]>
---
arch/arm64/include/asm/scs.h | 2 +-
arch/arm64/kernel/Makefile | 2 --
arch/arm64/kernel/head.S | 8 +++---
arch/arm64/kernel/module.c | 2 +-
arch/arm64/kernel/pi/Makefile | 10 +++++---
arch/arm64/kernel/{ => pi}/patch-scs.c | 26 ++++++++++----------
6 files changed, 26 insertions(+), 24 deletions(-)
diff --git a/arch/arm64/include/asm/scs.h b/arch/arm64/include/asm/scs.h
index 13df982a080805e6..bcf8ad574807b82c 100644
--- a/arch/arm64/include/asm/scs.h
+++ b/arch/arm64/include/asm/scs.h
@@ -72,7 +72,7 @@ static inline void dynamic_scs_init(void)
static inline void dynamic_scs_init(void) {}
#endif
-int scs_patch(const u8 eh_frame[], int size);
+int __pi_scs_patch(const u8 eh_frame[], int size);
#endif /* __ASSEMBLY __ */
diff --git a/arch/arm64/kernel/Makefile b/arch/arm64/kernel/Makefile
index 4f1fcaebafcfe077..bae6194df6a50479 100644
--- a/arch/arm64/kernel/Makefile
+++ b/arch/arm64/kernel/Makefile
@@ -72,8 +72,6 @@ obj-$(CONFIG_ARM64_PTR_AUTH) += pointer_auth.o
obj-$(CONFIG_ARM64_MTE) += mte.o
obj-y += vdso-wrap.o
obj-$(CONFIG_COMPAT_VDSO) += vdso32-wrap.o
-obj-$(CONFIG_UNWIND_PATCH_PAC_INTO_SCS) += patch-scs.o
-CFLAGS_patch-scs.o += -mbranch-protection=none
# Force dependency (vdso*-wrap.S includes vdso.so through incbin)
$(obj)/vdso-wrap.o: $(obj)/vdso/vdso.so
diff --git a/arch/arm64/kernel/head.S b/arch/arm64/kernel/head.S
index 0a345898a12939af..70ad180eed364906 100644
--- a/arch/arm64/kernel/head.S
+++ b/arch/arm64/kernel/head.S
@@ -490,9 +490,6 @@ SYM_FUNC_START_LOCAL(__primary_switched)
#endif
#if defined(CONFIG_KASAN_GENERIC) || defined(CONFIG_KASAN_SW_TAGS)
bl kasan_early_init
-#endif
-#ifdef CONFIG_UNWIND_PATCH_PAC_INTO_SCS
- bl scs_patch_vmlinux
#endif
mov x0, x20
bl finalise_el2 // Prefer VHE if possible
@@ -792,6 +789,11 @@ SYM_FUNC_START_LOCAL(__primary_switch)
#ifdef CONFIG_RELOCATABLE
mov x0, x23
bl __pi_relocate_kernel
+#endif
+#ifdef CONFIG_UNWIND_PATCH_PAC_INTO_SCS
+ ldr x0, =__eh_frame_start
+ ldr x1, =__eh_frame_end
+ bl __pi_scs_patch_vmlinux
#endif
ldr x8, =__primary_switched
adrp x0, KERNEL_START // __pa(KERNEL_START)
diff --git a/arch/arm64/kernel/module.c b/arch/arm64/kernel/module.c
index 5af4975caeb58ff7..9df01fce6ed528a0 100644
--- a/arch/arm64/kernel/module.c
+++ b/arch/arm64/kernel/module.c
@@ -516,7 +516,7 @@ int module_finalize(const Elf_Ehdr *hdr,
if (scs_is_dynamic()) {
s = find_section(hdr, sechdrs, ".init.eh_frame");
if (s)
- scs_patch((void *)s->sh_addr, s->sh_size);
+ __pi_scs_patch((void *)s->sh_addr, s->sh_size);
}
return module_init_ftrace_plt(hdr, sechdrs, me);
diff --git a/arch/arm64/kernel/pi/Makefile b/arch/arm64/kernel/pi/Makefile
index 7f6dfce893c3b88f..a8b302245f15326a 100644
--- a/arch/arm64/kernel/pi/Makefile
+++ b/arch/arm64/kernel/pi/Makefile
@@ -38,7 +38,9 @@ $(obj)/lib-%.pi.o: OBJCOPYFLAGS += --prefix-alloc-sections=.init
$(obj)/lib-%.o: $(srctree)/lib/%.c FORCE
$(call if_changed_rule,cc_o_c)
-obj-y := idreg-override.pi.o lib-fdt.pi.o lib-fdt_ro.pi.o
-obj-$(CONFIG_RELOCATABLE) += relocate.pi.o
-obj-$(CONFIG_RANDOMIZE_BASE) += kaslr_early.pi.o
-extra-y := $(patsubst %.pi.o,%.o,$(obj-y))
+obj-y := idreg-override.pi.o \
+ lib-fdt.pi.o lib-fdt_ro.pi.o
+obj-$(CONFIG_RELOCATABLE) += relocate.pi.o
+obj-$(CONFIG_RANDOMIZE_BASE) += kaslr_early.pi.o
+obj-$(CONFIG_UNWIND_PATCH_PAC_INTO_SCS) += patch-scs.pi.o
+extra-y := $(patsubst %.pi.o,%.o,$(obj-y))
diff --git a/arch/arm64/kernel/patch-scs.c b/arch/arm64/kernel/pi/patch-scs.c
similarity index 91%
rename from arch/arm64/kernel/patch-scs.c
rename to arch/arm64/kernel/pi/patch-scs.c
index a1fe4b4ff5917670..c65ef40d1e6b6b30 100644
--- a/arch/arm64/kernel/patch-scs.c
+++ b/arch/arm64/kernel/pi/patch-scs.c
@@ -4,14 +4,11 @@
* Author: Ard Biesheuvel <[email protected]>
*/
-#include <linux/bug.h>
#include <linux/errno.h>
#include <linux/init.h>
#include <linux/linkage.h>
-#include <linux/printk.h>
#include <linux/types.h>
-#include <asm/cacheflush.h>
#include <asm/scs.h>
//
@@ -81,7 +78,11 @@ static void __always_inline scs_patch_loc(u64 loc)
*/
return;
}
- dcache_clean_pou(loc, loc + sizeof(u32));
+ if (IS_ENABLED(CONFIG_ARM64_WORKAROUND_CLEAN_CACHE))
+ asm("dc civac, %0" :: "r"(loc));
+ else
+ asm(ALTERNATIVE("dc cvau, %0", "nop", ARM64_HAS_CACHE_IDC)
+ :: "r"(loc));
}
/*
@@ -128,10 +129,10 @@ struct eh_frame {
};
};
-static int noinstr scs_handle_fde_frame(const struct eh_frame *frame,
- bool fde_has_augmentation_data,
- int code_alignment_factor,
- bool dry_run)
+static int scs_handle_fde_frame(const struct eh_frame *frame,
+ bool fde_has_augmentation_data,
+ int code_alignment_factor,
+ bool dry_run)
{
int size = frame->size - offsetof(struct eh_frame, opcodes) + 4;
u64 loc = (u64)offset_to_ptr(&frame->initial_loc);
@@ -198,14 +199,13 @@ static int noinstr scs_handle_fde_frame(const struct eh_frame *frame,
break;
default:
- pr_err("unhandled opcode: %02x in FDE frame %lx\n", opcode[-1], (uintptr_t)frame);
return -ENOEXEC;
}
}
return 0;
}
-int noinstr scs_patch(const u8 eh_frame[], int size)
+int scs_patch(const u8 eh_frame[], int size)
{
const u8 *p = eh_frame;
@@ -251,12 +251,12 @@ int noinstr scs_patch(const u8 eh_frame[], int size)
return 0;
}
-asmlinkage void __init scs_patch_vmlinux(void)
+asmlinkage void __init scs_patch_vmlinux(const u8 start[], const u8 end[])
{
if (!should_patch_pac_into_scs())
return;
- WARN_ON(scs_patch(__eh_frame_start, __eh_frame_end - __eh_frame_start));
- icache_inval_all_pou();
+ scs_patch(start, end - start);
+ asm("ic ialluis");
isb();
}
--
2.39.2
Add support for overriding the VARange field of the MMFR2 CPU ID
register. This permits the associated LVA feature to be overridden early
enough for the boot code that creates the kernel mapping to take it into
account.
Given that LPA2 implies LVA, disabling the latter should disable the
former as well. So override the ID_AA64MMFR0.TGran field of the current
page size as well if it advertises support for 52-bit addressing.
Signed-off-by: Ard Biesheuvel <[email protected]>
---
arch/arm64/include/asm/assembler.h | 17 ++++++-----
arch/arm64/include/asm/cpufeature.h | 4 +++
arch/arm64/kernel/cpufeature.c | 8 +++--
arch/arm64/kernel/image-vars.h | 2 ++
arch/arm64/kernel/pi/idreg-override.c | 31 ++++++++++++++++++++
5 files changed, 53 insertions(+), 9 deletions(-)
diff --git a/arch/arm64/include/asm/assembler.h b/arch/arm64/include/asm/assembler.h
index beb53bbd8c19bb1c..0710c17800a49b75 100644
--- a/arch/arm64/include/asm/assembler.h
+++ b/arch/arm64/include/asm/assembler.h
@@ -576,18 +576,21 @@ alternative_endif
.endm
/*
- * Offset ttbr1 to allow for 48-bit kernel VAs set with 52-bit PTRS_PER_PGD.
+ * If the kernel is built for 52-bit virtual addressing but the hardware only
+ * supports 48 bits, we cannot program the pgdir address into TTBR1 directly,
+ * but we have to add an offset so that the TTBR1 address corresponds with the
+ * pgdir entry that covers the lowest 48-bit addressable VA.
+ *
* orr is used as it can cover the immediate value (and is idempotent).
- * In future this may be nop'ed out when dealing with 52-bit kernel VAs.
* ttbr: Value of ttbr to set, modified.
*/
.macro offset_ttbr1, ttbr, tmp
#ifdef CONFIG_ARM64_VA_BITS_52
- mrs_s \tmp, SYS_ID_AA64MMFR2_EL1
- and \tmp, \tmp, #(0xf << ID_AA64MMFR2_EL1_VARange_SHIFT)
- cbnz \tmp, .Lskipoffs_\@
- orr \ttbr, \ttbr, #TTBR1_BADDR_4852_OFFSET
-.Lskipoffs_\@ :
+ mrs \tmp, tcr_el1
+ and \tmp, \tmp, #TCR_T1SZ_MASK
+ cmp \tmp, #TCR_T1SZ(VA_BITS_MIN)
+ orr \tmp, \ttbr, #TTBR1_BADDR_4852_OFFSET
+ csel \ttbr, \tmp, \ttbr, eq
#endif
.endm
diff --git a/arch/arm64/include/asm/cpufeature.h b/arch/arm64/include/asm/cpufeature.h
index a37f4956d5a7ef6e..7faf9a48339e7c8c 100644
--- a/arch/arm64/include/asm/cpufeature.h
+++ b/arch/arm64/include/asm/cpufeature.h
@@ -920,7 +920,9 @@ static inline unsigned int get_vmid_bits(u64 mmfr1)
struct arm64_ftr_reg *get_arm64_ftr_reg(u32 sys_id);
+extern struct arm64_ftr_override id_aa64mmfr0_override;
extern struct arm64_ftr_override id_aa64mmfr1_override;
+extern struct arm64_ftr_override id_aa64mmfr2_override;
extern struct arm64_ftr_override id_aa64pfr0_override;
extern struct arm64_ftr_override id_aa64pfr1_override;
extern struct arm64_ftr_override id_aa64zfr0_override;
@@ -994,6 +996,8 @@ static inline bool cpu_has_lva(void)
u64 mmfr2;
mmfr2 = read_sysreg_s(SYS_ID_AA64MMFR2_EL1);
+ mmfr2 &= ~id_aa64mmfr2_override.mask;
+ mmfr2 |= id_aa64mmfr2_override.val;
return cpuid_feature_extract_unsigned_field(mmfr2,
ID_AA64MMFR2_EL1_VARange_SHIFT);
}
diff --git a/arch/arm64/kernel/cpufeature.c b/arch/arm64/kernel/cpufeature.c
index 660dedcae173841a..f8e3f37accdddc86 100644
--- a/arch/arm64/kernel/cpufeature.c
+++ b/arch/arm64/kernel/cpufeature.c
@@ -649,7 +649,9 @@ static const struct arm64_ftr_bits ftr_raz[] = {
#define ARM64_FTR_REG(id, table) \
__ARM64_FTR_REG_OVERRIDE(#id, id, table, &no_override)
+struct arm64_ftr_override id_aa64mmfr0_override;
struct arm64_ftr_override id_aa64mmfr1_override;
+struct arm64_ftr_override id_aa64mmfr2_override;
struct arm64_ftr_override id_aa64pfr0_override;
struct arm64_ftr_override id_aa64pfr1_override;
struct arm64_ftr_override id_aa64zfr0_override;
@@ -713,10 +715,12 @@ static const struct __ftr_reg_entry {
&id_aa64isar2_override),
/* Op1 = 0, CRn = 0, CRm = 7 */
- ARM64_FTR_REG(SYS_ID_AA64MMFR0_EL1, ftr_id_aa64mmfr0),
+ ARM64_FTR_REG_OVERRIDE(SYS_ID_AA64MMFR0_EL1, ftr_id_aa64mmfr0,
+ &id_aa64mmfr0_override),
ARM64_FTR_REG_OVERRIDE(SYS_ID_AA64MMFR1_EL1, ftr_id_aa64mmfr1,
&id_aa64mmfr1_override),
- ARM64_FTR_REG(SYS_ID_AA64MMFR2_EL1, ftr_id_aa64mmfr2),
+ ARM64_FTR_REG_OVERRIDE(SYS_ID_AA64MMFR2_EL1, ftr_id_aa64mmfr2,
+ &id_aa64mmfr2_override),
/* Op1 = 0, CRn = 1, CRm = 2 */
ARM64_FTR_REG(SYS_ZCR_EL1, ftr_zcr),
diff --git a/arch/arm64/kernel/image-vars.h b/arch/arm64/kernel/image-vars.h
index 61d1d283a69ba5d8..79a7e0e3edd1aa21 100644
--- a/arch/arm64/kernel/image-vars.h
+++ b/arch/arm64/kernel/image-vars.h
@@ -36,7 +36,9 @@ PROVIDE(__pi___memset = __pi_memset);
PROVIDE(__pi_id_aa64isar1_override = id_aa64isar1_override);
PROVIDE(__pi_id_aa64isar2_override = id_aa64isar2_override);
+PROVIDE(__pi_id_aa64mmfr0_override = id_aa64mmfr0_override);
PROVIDE(__pi_id_aa64mmfr1_override = id_aa64mmfr1_override);
+PROVIDE(__pi_id_aa64mmfr2_override = id_aa64mmfr2_override);
PROVIDE(__pi_id_aa64pfr0_override = id_aa64pfr0_override);
PROVIDE(__pi_id_aa64pfr1_override = id_aa64pfr1_override);
PROVIDE(__pi_id_aa64smfr0_override = id_aa64smfr0_override);
diff --git a/arch/arm64/kernel/pi/idreg-override.c b/arch/arm64/kernel/pi/idreg-override.c
index 265b35b09dd488f1..c4ae5ffe5cb0c999 100644
--- a/arch/arm64/kernel/pi/idreg-override.c
+++ b/arch/arm64/kernel/pi/idreg-override.c
@@ -63,6 +63,35 @@ static const struct ftr_set_desc mmfr1 __prel64_initconst = {
},
};
+
+static bool __init mmfr2_varange_filter(u64 val)
+{
+ int __maybe_unused feat;
+
+ if (val)
+ return false;
+
+#ifdef CONFIG_ARM64_LPA2
+ feat = cpuid_feature_extract_signed_field(read_sysreg(id_aa64mmfr0_el1),
+ ID_AA64MMFR0_EL1_TGRAN_SHIFT);
+ if (feat >= ID_AA64MMFR0_EL1_TGRAN_LPA2) {
+ id_aa64mmfr0_override.val |=
+ (ID_AA64MMFR0_EL1_TGRAN_LPA2 - 1) << ID_AA64MMFR0_EL1_TGRAN_SHIFT;
+ id_aa64mmfr0_override.mask |= 0xfU << ID_AA64MMFR0_EL1_TGRAN_SHIFT;
+ }
+#endif
+ return true;
+}
+
+static const struct ftr_set_desc mmfr2 __prel64_initconst = {
+ .name = "id_aa64mmfr2",
+ .override = &id_aa64mmfr2_override,
+ .fields = {
+ FIELD("varange", ID_AA64MMFR2_EL1_VARange_SHIFT, mmfr2_varange_filter),
+ {}
+ },
+};
+
static bool __init pfr0_sve_filter(u64 val)
{
/*
@@ -161,6 +190,7 @@ static const union {
prel64_t reg_prel;
} regs[] __prel64_initconst = {
{ .reg = &mmfr1 },
+ { .reg = &mmfr2 },
{ .reg = &pfr0 },
{ .reg = &pfr1 },
{ .reg = &isar1 },
@@ -185,6 +215,7 @@ static const struct {
{ "arm64.nomte", "id_aa64pfr1.mte=0" },
{ "nokaslr", "arm64_sw.nokaslr=1" },
{ "rodata=off", "arm64_sw.rodataoff=1" },
+ { "arm64.nolva", "id_aa64mmfr2.varange=0" },
};
static int __init parse_hexdigit(const char *p, u64 *v)
--
2.39.2
In preparation for enabling LPA2 support, introduce the mask values for
converting between physical addresses and their representations in a
page table descriptor.
While at it, move the pte_to_phys asm macro into its only user, so that
we can freely modify it to use its input value register as a temp
register.
For LPA2, the PTE_ADDR_MASK contains two non-adjacent sequences of zero
bits, which means it no longer fits into the immediate field of an
ordinary ALU instruction. So let's redefine it to include the bits in
between as well, and only use it when converting from physical address
to PTE representation, where the distinction does not matter. Also
update the name accordingly to emphasize this.
Signed-off-by: Ard Biesheuvel <[email protected]>
---
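For illustration (not part of the patch): in the non-64k, LPA2 layout,
physical address bits [51:50] are carried in descriptor bits [9:8], so
with PTE_ADDR_HIGH_SHIFT == 42 the conversions work out as below (the
PTE_MAYBE_SHARED handling visible in the hunk further down is omitted
here):

    /* e.g. phys 0x000c000000400000 <-> descriptor address bits 0x0000000000400300 */
    static phys_addr_t example_pte_to_phys(pte_t pte)
    {
            return (pte_val(pte) & PTE_ADDR_LOW) |
                   ((pte_val(pte) & PTE_ADDR_HIGH) << PTE_ADDR_HIGH_SHIFT);
    }

    static pteval_t example_phys_to_pte_val(phys_addr_t phys)
    {
            /* bits [51:50] shift down into [9:8]; the mask drops bits above 49 */
            return (phys | (phys >> PTE_ADDR_HIGH_SHIFT)) & PHYS_TO_PTE_ADDR_MASK;
    }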
arch/arm64/include/asm/assembler.h | 16 ++--------------
arch/arm64/include/asm/pgtable-hwdef.h | 10 +++++++---
arch/arm64/include/asm/pgtable.h | 5 +++--
arch/arm64/mm/proc.S | 8 ++++++++
4 files changed, 20 insertions(+), 19 deletions(-)
diff --git a/arch/arm64/include/asm/assembler.h b/arch/arm64/include/asm/assembler.h
index 0710c17800a49b75..55e8731844cf7eb7 100644
--- a/arch/arm64/include/asm/assembler.h
+++ b/arch/arm64/include/asm/assembler.h
@@ -612,25 +612,13 @@ alternative_endif
.macro phys_to_pte, pte, phys
#ifdef CONFIG_ARM64_PA_BITS_52
- /*
- * We assume \phys is 64K aligned and this is guaranteed by only
- * supporting this configuration with 64K pages.
- */
- orr \pte, \phys, \phys, lsr #36
- and \pte, \pte, #PTE_ADDR_MASK
+ orr \pte, \phys, \phys, lsr #PTE_ADDR_HIGH_SHIFT
+ and \pte, \pte, #PHYS_TO_PTE_ADDR_MASK
#else
mov \pte, \phys
#endif
.endm
- .macro pte_to_phys, phys, pte
- and \phys, \pte, #PTE_ADDR_MASK
-#ifdef CONFIG_ARM64_PA_BITS_52
- orr \phys, \phys, \phys, lsl #PTE_ADDR_HIGH_SHIFT
- and \phys, \phys, GENMASK_ULL(PHYS_MASK_SHIFT - 1, PAGE_SHIFT)
-#endif
- .endm
-
/*
* tcr_clear_errata_bits - Clear TCR bits that trigger an errata on this CPU.
*/
diff --git a/arch/arm64/include/asm/pgtable-hwdef.h b/arch/arm64/include/asm/pgtable-hwdef.h
index c4ad7fbb12c5c07a..b91fe4781b066d54 100644
--- a/arch/arm64/include/asm/pgtable-hwdef.h
+++ b/arch/arm64/include/asm/pgtable-hwdef.h
@@ -155,13 +155,17 @@
#define PTE_PXN (_AT(pteval_t, 1) << 53) /* Privileged XN */
#define PTE_UXN (_AT(pteval_t, 1) << 54) /* User XN */
-#define PTE_ADDR_LOW (((_AT(pteval_t, 1) << (48 - PAGE_SHIFT)) - 1) << PAGE_SHIFT)
+#define PTE_ADDR_LOW (((_AT(pteval_t, 1) << (50 - PAGE_SHIFT)) - 1) << PAGE_SHIFT)
#ifdef CONFIG_ARM64_PA_BITS_52
+#ifdef CONFIG_ARM64_64K_PAGES
#define PTE_ADDR_HIGH (_AT(pteval_t, 0xf) << 12)
-#define PTE_ADDR_MASK (PTE_ADDR_LOW | PTE_ADDR_HIGH)
#define PTE_ADDR_HIGH_SHIFT 36
+#define PHYS_TO_PTE_ADDR_MASK (PTE_ADDR_LOW | PTE_ADDR_HIGH)
#else
-#define PTE_ADDR_MASK PTE_ADDR_LOW
+#define PTE_ADDR_HIGH (_AT(pteval_t, 0x3) << 8)
+#define PTE_ADDR_HIGH_SHIFT 42
+#define PHYS_TO_PTE_ADDR_MASK GENMASK_ULL(49, 8)
+#endif
#endif
/*
diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 2259898e8c3d990a..c8666d5c31fd1e52 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -80,15 +80,16 @@ extern unsigned long empty_zero_page[PAGE_SIZE / sizeof(unsigned long)];
#ifdef CONFIG_ARM64_PA_BITS_52
static inline phys_addr_t __pte_to_phys(pte_t pte)
{
+ pte_val(pte) &= ~PTE_MAYBE_SHARED;
return (pte_val(pte) & PTE_ADDR_LOW) |
((pte_val(pte) & PTE_ADDR_HIGH) << PTE_ADDR_HIGH_SHIFT);
}
static inline pteval_t __phys_to_pte_val(phys_addr_t phys)
{
- return (phys | (phys >> PTE_ADDR_HIGH_SHIFT)) & PTE_ADDR_MASK;
+ return (phys | (phys >> PTE_ADDR_HIGH_SHIFT)) & PHYS_TO_PTE_ADDR_MASK;
}
#else
-#define __pte_to_phys(pte) (pte_val(pte) & PTE_ADDR_MASK)
+#define __pte_to_phys(pte) (pte_val(pte) & PTE_ADDR_LOW)
#define __phys_to_pte_val(phys) (phys)
#endif
diff --git a/arch/arm64/mm/proc.S b/arch/arm64/mm/proc.S
index a5dc009b7dd5c141..74b846d77b88e843 100644
--- a/arch/arm64/mm/proc.S
+++ b/arch/arm64/mm/proc.S
@@ -204,6 +204,14 @@ SYM_FUNC_ALIAS(__pi_idmap_cpu_replace_ttbr1, idmap_cpu_replace_ttbr1)
.pushsection ".idmap.text", "awx"
+ .macro pte_to_phys, phys, pte
+ and \phys, \pte, #PTE_ADDR_LOW
+#ifdef CONFIG_ARM64_PA_BITS_52
+ and \pte, \pte, #PTE_ADDR_HIGH
+ orr \phys, \phys, \pte, lsl #PTE_ADDR_HIGH_SHIFT
+#endif
+ .endm
+
.macro kpti_mk_tbl_ng, type, num_entries
add end_\type\()p, cur_\type\()p, #\num_entries * 8
.Ldo_\type:
--
2.39.2
Add support for 5 level paging in the G-to-nG routine that creates its
own temporary page tables to traverse the swapper page tables. Also add
support for running the 5 level configuration with the top level folded
at runtime, to support CPUs that do not implement the LPA2 extension.
While at it, wire up the level skipping logic so it will also trigger on
4 level configurations with LPA2 enabled at build time but not active at
runtime, as we'll fall back to 3 level paging in that case.
Signed-off-by: Ard Biesheuvel <[email protected]>
---
arch/arm64/kernel/cpufeature.c | 9 ++-
arch/arm64/mm/proc.S | 70 +++++++++++++++++---
2 files changed, 66 insertions(+), 13 deletions(-)
diff --git a/arch/arm64/kernel/cpufeature.c b/arch/arm64/kernel/cpufeature.c
index f8e3f37accdddc86..b0fee11886e54ca2 100644
--- a/arch/arm64/kernel/cpufeature.c
+++ b/arch/arm64/kernel/cpufeature.c
@@ -1729,6 +1729,9 @@ kpti_install_ng_mappings(const struct arm64_cpu_capabilities *__unused)
pgd_t *kpti_ng_temp_pgd;
u64 alloc = 0;
+ if (levels == 5 && !pgtable_l5_enabled())
+ levels = 4;
+
if (__this_cpu_read(this_cpu_vector) == vectors) {
const char *v = arm64_get_bp_hardening_vector(EL1_VECTOR_KPTI);
@@ -1756,9 +1759,9 @@ kpti_install_ng_mappings(const struct arm64_cpu_capabilities *__unused)
//
// The physical pages are laid out as follows:
//
- // +--------+-/-------+-/------ +-\\--------+
- // : PTE[] : | PMD[] : | PUD[] : || PGD[] :
- // +--------+-\-------+-\------ +-//--------+
+ // +--------+-/-------+-/------ +-/------ +-\\\--------+
+ // : PTE[] : | PMD[] : | PUD[] : | P4D[] : ||| PGD[] :
+ // +--------+-\-------+-\------ +-\------ +-///--------+
// ^
// The first page is mapped into this hierarchy at a PMD_SHIFT
// aligned virtual address, so that we can manipulate the PTE
diff --git a/arch/arm64/mm/proc.S b/arch/arm64/mm/proc.S
index 74b846d77b88e843..f425cfc3e4dad188 100644
--- a/arch/arm64/mm/proc.S
+++ b/arch/arm64/mm/proc.S
@@ -215,16 +215,15 @@ SYM_FUNC_ALIAS(__pi_idmap_cpu_replace_ttbr1, idmap_cpu_replace_ttbr1)
.macro kpti_mk_tbl_ng, type, num_entries
add end_\type\()p, cur_\type\()p, #\num_entries * 8
.Ldo_\type:
- ldr \type, [cur_\type\()p] // Load the entry
+ ldr \type, [cur_\type\()p], #8 // Load the entry and advance
tbz \type, #0, .Lnext_\type // Skip invalid and
tbnz \type, #11, .Lnext_\type // non-global entries
orr \type, \type, #PTE_NG // Same bit for blocks and pages
- str \type, [cur_\type\()p] // Update the entry
+ str \type, [cur_\type\()p, #-8] // Update the entry
.ifnc \type, pte
tbnz \type, #1, .Lderef_\type
.endif
.Lnext_\type:
- add cur_\type\()p, cur_\type\()p, #8
cmp cur_\type\()p, end_\type\()p
b.ne .Ldo_\type
.endm
@@ -234,18 +233,18 @@ SYM_FUNC_ALIAS(__pi_idmap_cpu_replace_ttbr1, idmap_cpu_replace_ttbr1)
* fixmap slot associated with the current level.
*/
.macro kpti_map_pgtbl, type, level
- str xzr, [temp_pte, #8 * (\level + 1)] // break before make
+ str xzr, [temp_pte, #8 * (\level + 2)] // break before make
dsb nshst
- add pte, temp_pte, #PAGE_SIZE * (\level + 1)
+ add pte, temp_pte, #PAGE_SIZE * (\level + 2)
lsr pte, pte, #12
tlbi vaae1, pte
dsb nsh
isb
phys_to_pte pte, cur_\type\()p
- add cur_\type\()p, temp_pte, #PAGE_SIZE * (\level + 1)
+ add cur_\type\()p, temp_pte, #PAGE_SIZE * (\level + 2)
orr pte, pte, pte_flags
- str pte, [temp_pte, #8 * (\level + 1)]
+ str pte, [temp_pte, #8 * (\level + 2)]
dsb nshst
.endm
@@ -278,6 +277,8 @@ SYM_TYPED_FUNC_START(idmap_kpti_install_ng_mappings)
end_ptep .req x15
pte .req x16
valid .req x17
+ cur_p4dp .req x19
+ end_p4dp .req x20
mov x5, x3 // preserve temp_pte arg
mrs swapper_ttb, ttbr1_el1
@@ -285,6 +286,12 @@ SYM_TYPED_FUNC_START(idmap_kpti_install_ng_mappings)
cbnz cpu, __idmap_kpti_secondary
+#if CONFIG_PGTABLE_LEVELS > 4
+ stp x29, x30, [sp, #-32]!
+ mov x29, sp
+ stp x19, x20, [sp, #16]
+#endif
+
/* We're the boot CPU. Wait for the others to catch up */
sevl
1: wfe
@@ -302,9 +309,32 @@ SYM_TYPED_FUNC_START(idmap_kpti_install_ng_mappings)
mov pte_flags, #KPTI_NG_PTE_FLAGS
/* Everybody is enjoying the idmap, so we can rewrite swapper. */
+
+#ifdef CONFIG_ARM64_LPA2
+ /*
+ * If LPA2 support is configured, but 52-bit virtual addressing is not
+ * enabled at runtime, we will fall back to one level of paging less,
+ * and so we have to walk swapper_pg_dir as if we dereferenced its
+ * address from a PGD level entry, and terminate the PGD level loop
+ * right after.
+ */
+ adrp pgd, swapper_pg_dir // walk &swapper_pg_dir at the next level
+ mov cur_pgdp, end_pgdp // must be equal to terminate the PGD loop
+alternative_if_not ARM64_HAS_VA52
+ b .Lderef_pgd // skip to the next level
+alternative_else_nop_endif
+ /*
+ * LPA2 based 52-bit virtual addressing requires 52-bit physical
+ * addressing to be enabled as well. In this case, the shareability
+ * bits are repurposed as physical address bits, and should not be
+ * set in pte_flags.
+ */
+ bic pte_flags, pte_flags, #PTE_SHARED
+#endif
+
/* PGD */
adrp cur_pgdp, swapper_pg_dir
- kpti_map_pgtbl pgd, 0
+ kpti_map_pgtbl pgd, -1
kpti_mk_tbl_ng pgd, PTRS_PER_PGD
/* Ensure all the updated entries are visible to secondary CPUs */
@@ -317,16 +347,33 @@ SYM_TYPED_FUNC_START(idmap_kpti_install_ng_mappings)
/* Set the flag to zero to indicate that we're all done */
str wzr, [flag_ptr]
+#if CONFIG_PGTABLE_LEVELS > 4
+ ldp x19, x20, [sp, #16]
+ ldp x29, x30, [sp], #32
+#endif
ret
.Lderef_pgd:
+ /* P4D */
+ .if CONFIG_PGTABLE_LEVELS > 4
+ p4d .req x30
+ pte_to_phys cur_p4dp, pgd
+ kpti_map_pgtbl p4d, 0
+ kpti_mk_tbl_ng p4d, PTRS_PER_P4D
+ b .Lnext_pgd
+ .else /* CONFIG_PGTABLE_LEVELS <= 4 */
+ p4d .req pgd
+ .set .Lnext_p4d, .Lnext_pgd
+ .endif
+
+.Lderef_p4d:
/* PUD */
.if CONFIG_PGTABLE_LEVELS > 3
pud .req x10
- pte_to_phys cur_pudp, pgd
+ pte_to_phys cur_pudp, p4d
kpti_map_pgtbl pud, 1
kpti_mk_tbl_ng pud, PTRS_PER_PUD
- b .Lnext_pgd
+ b .Lnext_p4d
.else /* CONFIG_PGTABLE_LEVELS <= 3 */
pud .req pgd
.set .Lnext_pud, .Lnext_pgd
@@ -370,6 +417,9 @@ SYM_TYPED_FUNC_START(idmap_kpti_install_ng_mappings)
.unreq end_ptep
.unreq pte
.unreq valid
+ .unreq cur_p4dp
+ .unreq end_p4dp
+ .unreq p4d
/* Secondary CPUs end up here */
__idmap_kpti_secondary:
--
2.39.2
The mapping from PGD/PUD/PMD to levels and shifts is very confusing,
given that, due to folding, the shifts may be equal for different
levels, if the macros are even #define'd to begin with.
In a subsequent patch, we will modify the ID mapping code to decouple
the number of levels from the kernel's view of how these types are
folded, so prepare for this by reformulating the macros without the use
of these types.
Instead, use SWAPPER_BLOCK_SHIFT as the base quantity, and derive it
from either PAGE_SHIFT or PMD_SHIFT, which (if defined at all) are
defined unambiguously for a given page size, regardless of the number of
configured levels.
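As a quick sanity check on the new formulation, here is a standalone
sketch (not kernel code) that evaluates SWAPPER_BLOCK_SHIFT,
SWAPPER_TABLE_SHIFT and SWAPPER_SKIP_LEVEL for the three page sizes,
assuming the 2 MiB MIN_KIMG_ALIGN mandated by the boot protocol:

  #include <stdio.h>

  #define MIN_KIMG_ALIGN  (2UL << 20)     /* 2 MiB */

  int main(void)
  {
          int page_shifts[] = { 12, 14, 16 };

          for (int i = 0; i < 3; i++) {
                  int page_shift = page_shifts[i];
                  int pmd_shift = page_shift + (page_shift - 3); /* one level up */
                  unsigned long pmd_size = 1UL << pmd_shift;
                  int block_shift, skip;

                  if (pmd_size <= MIN_KIMG_ALIGN) {
                          block_shift = pmd_shift;        /* use section mappings */
                          skip = 1;
                  } else {
                          block_shift = page_shift;       /* map down to page level */
                          skip = 0;
                  }

                  printf("%2dk pages: BLOCK_SHIFT=%d TABLE_SHIFT=%d SKIP_LEVEL=%d\n",
                         1 << (page_shift - 10), block_shift,
                         block_shift + page_shift - 3, skip);
          }
          return 0;
  }

For 4k pages this reproduces the old PMD_SHIFT/PUD_SHIFT pair (21/30), and
for 16k and 64k pages the old PAGE_SHIFT/PMD_SHIFT pairs (14/25 and 16/29).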
Signed-off-by: Ard Biesheuvel <[email protected]>
---
arch/arm64/include/asm/kernel-pgtable.h | 65 ++++++--------------
1 file changed, 19 insertions(+), 46 deletions(-)
diff --git a/arch/arm64/include/asm/kernel-pgtable.h b/arch/arm64/include/asm/kernel-pgtable.h
index 2a2c80ffe59e5307..bfc35944a5304e8e 100644
--- a/arch/arm64/include/asm/kernel-pgtable.h
+++ b/arch/arm64/include/asm/kernel-pgtable.h
@@ -13,27 +13,22 @@
#include <asm/sparsemem.h>
/*
- * The linear mapping and the start of memory are both 2M aligned (per
- * the arm64 booting.txt requirements). Hence we can use section mapping
- * with 4K (section size = 2M) but not with 16K (section size = 32M) or
- * 64K (section size = 512M).
+ * The physical and virtual addresses of the start of the kernel image are
+ * equal modulo 2 MiB (per the arm64 booting.txt requirements). Hence we can
+ * use section mapping with 4K (section size = 2M) but not with 16K (section
+ * size = 32M) or 64K (section size = 512M).
*/
-
-/*
- * The idmap and swapper page tables need some space reserved in the kernel
- * image. Both require pgd, pud (4 levels only) and pmd tables to (section)
- * map the kernel. With the 64K page configuration, swapper and idmap need to
- * map to pte level. The swapper also maps the FDT (see __create_page_tables
- * for more information). Note that the number of ID map translation levels
- * could be increased on the fly if system RAM is out of reach for the default
- * VA range, so pages required to map highest possible PA are reserved in all
- * cases.
- */
-#ifdef CONFIG_ARM64_4K_PAGES
-#define SWAPPER_PGTABLE_LEVELS (CONFIG_PGTABLE_LEVELS - 1)
+#if defined(PMD_SIZE) && PMD_SIZE <= MIN_KIMG_ALIGN
+#define SWAPPER_BLOCK_SHIFT PMD_SHIFT
+#define SWAPPER_SKIP_LEVEL 1
#else
-#define SWAPPER_PGTABLE_LEVELS (CONFIG_PGTABLE_LEVELS)
+#define SWAPPER_BLOCK_SHIFT PAGE_SHIFT
+#define SWAPPER_SKIP_LEVEL 0
#endif
+#define SWAPPER_BLOCK_SIZE (UL(1) << SWAPPER_BLOCK_SHIFT)
+#define SWAPPER_TABLE_SHIFT (SWAPPER_BLOCK_SHIFT + PAGE_SHIFT - 3)
+
+#define SWAPPER_PGTABLE_LEVELS (CONFIG_PGTABLE_LEVELS - SWAPPER_SKIP_LEVEL)
#define IDMAP_LEVELS ARM64_HW_PGTABLE_LEVELS(48)
#define IDMAP_ROOT_LEVEL (4 - IDMAP_LEVELS)
@@ -64,24 +59,13 @@
#define EARLY_ENTRIES(vstart, vend, shift, add) \
((((vend) - 1) >> (shift)) - ((vstart) >> (shift)) + 1 + add)
-#define EARLY_PGDS(vstart, vend, add) (EARLY_ENTRIES(vstart, vend, PGDIR_SHIFT, add))
-
-#if SWAPPER_PGTABLE_LEVELS > 3
-#define EARLY_PUDS(vstart, vend, add) (EARLY_ENTRIES(vstart, vend, PUD_SHIFT, add))
-#else
-#define EARLY_PUDS(vstart, vend, add) (0)
-#endif
+#define EARLY_LEVEL(l, vstart, vend, add) \
+ (SWAPPER_PGTABLE_LEVELS > l ? EARLY_ENTRIES(vstart, vend, SWAPPER_BLOCK_SHIFT + l * (PAGE_SHIFT - 3), add) : 0)
-#if SWAPPER_PGTABLE_LEVELS > 2
-#define EARLY_PMDS(vstart, vend, add) (EARLY_ENTRIES(vstart, vend, SWAPPER_TABLE_SHIFT, add))
-#else
-#define EARLY_PMDS(vstart, vend, add) (0)
-#endif
-
-#define EARLY_PAGES(vstart, vend, add) ( 1 /* PGDIR page */ \
- + EARLY_PGDS((vstart), (vend), add) /* each PGDIR needs a next level page table */ \
- + EARLY_PUDS((vstart), (vend), add) /* each PUD needs a next level page table */ \
- + EARLY_PMDS((vstart), (vend), add)) /* each PMD needs a next level page table */
+#define EARLY_PAGES(vstart, vend, add) (1 /* PGDIR page */ \
+ + EARLY_LEVEL(3, (vstart), (vend), add) /* each entry needs a next level page table */ \
+ + EARLY_LEVEL(2, (vstart), (vend), add) /* each entry needs a next level page table */ \
+ + EARLY_LEVEL(1, (vstart), (vend), add))/* each entry needs a next level page table */
#define INIT_DIR_SIZE (PAGE_SIZE * (EARLY_PAGES(KIMAGE_VADDR, _end, EARLY_KASLR) + EARLY_SEGMENT_EXTRA_PAGES))
/* the initial ID map may need two extra pages if it needs to be extended */
@@ -92,17 +76,6 @@
#endif
#define INIT_IDMAP_DIR_PAGES EARLY_PAGES(KIMAGE_VADDR, _end + MAX_FDT_SIZE + SWAPPER_BLOCK_SIZE, 1)
-/* Initial memory map size */
-#ifdef CONFIG_ARM64_4K_PAGES
-#define SWAPPER_BLOCK_SHIFT PMD_SHIFT
-#define SWAPPER_BLOCK_SIZE PMD_SIZE
-#define SWAPPER_TABLE_SHIFT PUD_SHIFT
-#else
-#define SWAPPER_BLOCK_SHIFT PAGE_SHIFT
-#define SWAPPER_BLOCK_SIZE PAGE_SIZE
-#define SWAPPER_TABLE_SHIFT PMD_SHIFT
-#endif
-
/* The number of segments in the kernel image (text, rodata, inittext, initdata, data+bss) */
#define KERNEL_SEGMENT_COUNT 5
--
2.39.2
In order to support LPA2 on 16k pages in a way that permits non-LPA2
systems to run the same kernel image, we have to be able to fall back to
at most 48 bits of virtual addressing.
Falling back to 48 bits would result in a level 0 with only 2 entries,
which is suboptimal in terms of TLB utilization. So instead, let's fall
back to 47 bits in that case. This means we need to be able to fold PUDs
dynamically, similar to how we fold P4Ds for 48 bit virtual addressing
on LPA2 with 4k pages.
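The entry-count arithmetic behind this choice can be checked with a short
standalone sketch (not kernel code): with 16k pages each translation level
resolves 11 bits, so a 48-bit configuration needs a fourth level holding
only 2 entries, whereas 47 bits fits in 3 fully populated levels.

  #include <stdio.h>

  /* mirrors ARM64_HW_PGTABLE_LEVELS(): chunks of (page_shift - 3) bits */
  static int levels(int va_bits, int page_shift)
  {
          int bits_per_level = page_shift - 3;

          return (va_bits - page_shift + bits_per_level - 1) / bits_per_level;
  }

  int main(void)
  {
          int page_shift = 14;    /* 16k pages */

          for (int va = 47; va <= 48; va++) {
                  int l = levels(va, page_shift);
                  int top = va - page_shift - (l - 1) * (page_shift - 3);

                  printf("%d-bit VA: %d levels, %d entries in the top level\n",
                         va, l, 1 << top);
          }
          return 0;
  }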
Signed-off-by: Ard Biesheuvel <[email protected]>
---
arch/arm64/include/asm/pgalloc.h | 12 ++-
arch/arm64/include/asm/pgtable.h | 87 +++++++++++++++++---
arch/arm64/include/asm/tlb.h | 3 +-
arch/arm64/kernel/cpufeature.c | 2 +
arch/arm64/mm/mmu.c | 2 +-
arch/arm64/mm/pgd.c | 2 +
6 files changed, 94 insertions(+), 14 deletions(-)
diff --git a/arch/arm64/include/asm/pgalloc.h b/arch/arm64/include/asm/pgalloc.h
index cae8c648f4628709..aeba2cf15a253ee1 100644
--- a/arch/arm64/include/asm/pgalloc.h
+++ b/arch/arm64/include/asm/pgalloc.h
@@ -14,6 +14,7 @@
#include <asm/tlbflush.h>
#define __HAVE_ARCH_PGD_FREE
+#define __HAVE_ARCH_PUD_FREE
#include <asm-generic/pgalloc.h>
#define PGD_SIZE (PTRS_PER_PGD * sizeof(pgd_t))
@@ -43,7 +44,8 @@ static inline void __pud_populate(pud_t *pudp, phys_addr_t pmdp, pudval_t prot)
static inline void __p4d_populate(p4d_t *p4dp, phys_addr_t pudp, p4dval_t prot)
{
- set_p4d(p4dp, __p4d(__phys_to_p4d_val(pudp) | prot));
+ if (pgtable_l4_enabled())
+ set_p4d(p4dp, __p4d(__phys_to_p4d_val(pudp) | prot));
}
static inline void p4d_populate(struct mm_struct *mm, p4d_t *p4dp, pud_t *pudp)
@@ -53,6 +55,14 @@ static inline void p4d_populate(struct mm_struct *mm, p4d_t *p4dp, pud_t *pudp)
p4dval |= (mm == &init_mm) ? P4D_TABLE_UXN : P4D_TABLE_PXN;
__p4d_populate(p4dp, __pa(pudp), p4dval);
}
+
+static inline void pud_free(struct mm_struct *mm, pud_t *pud)
+{
+ if (!pgtable_l4_enabled())
+ return;
+ BUG_ON((unsigned long)pud & (PAGE_SIZE-1));
+ free_page((unsigned long)pud);
+}
#else
static inline void __p4d_populate(p4d_t *p4dp, phys_addr_t pudp, p4dval_t prot)
{
diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index a286ecc447d33b24..02ba59a8ede7e0fd 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -751,12 +751,27 @@ static inline pmd_t *pud_pgtable(pud_t pud)
#if CONFIG_PGTABLE_LEVELS > 3
+static __always_inline bool pgtable_l4_enabled(void)
+{
+ if (CONFIG_PGTABLE_LEVELS > 4 || !IS_ENABLED(CONFIG_ARM64_LPA2))
+ return true;
+ if (!alternative_has_feature_likely(ARM64_ALWAYS_BOOT))
+ return vabits_actual == VA_BITS;
+ return alternative_has_feature_unlikely(ARM64_HAS_VA52);
+}
+
+static inline bool mm_pud_folded(const struct mm_struct *mm)
+{
+ return !pgtable_l4_enabled();
+}
+#define mm_pud_folded mm_pud_folded
+
#define pud_ERROR(e) \
pr_err("%s:%d: bad pud %016llx.\n", __FILE__, __LINE__, pud_val(e))
-#define p4d_none(p4d) (!p4d_val(p4d))
-#define p4d_bad(p4d) (!(p4d_val(p4d) & 2))
-#define p4d_present(p4d) (p4d_val(p4d))
+#define p4d_none(p4d) (pgtable_l4_enabled() && !p4d_val(p4d))
+#define p4d_bad(p4d) (pgtable_l4_enabled() && !(p4d_val(p4d) & 2))
+#define p4d_present(p4d) (!p4d_none(p4d))
static inline void set_p4d(p4d_t *p4dp, p4d_t p4d)
{
@@ -772,7 +787,8 @@ static inline void set_p4d(p4d_t *p4dp, p4d_t p4d)
static inline void p4d_clear(p4d_t *p4dp)
{
- set_p4d(p4dp, __p4d(0));
+ if (pgtable_l4_enabled())
+ set_p4d(p4dp, __p4d(0));
}
static inline phys_addr_t p4d_page_paddr(p4d_t p4d)
@@ -780,25 +796,74 @@ static inline phys_addr_t p4d_page_paddr(p4d_t p4d)
return __p4d_to_phys(p4d);
}
+#define pud_index(addr) (((addr) >> PUD_SHIFT) & (PTRS_PER_PUD - 1))
+
+static inline pud_t *p4d_to_folded_pud(p4d_t *p4dp, unsigned long addr)
+{
+ return (pud_t *)PTR_ALIGN_DOWN(p4dp, PAGE_SIZE) + pud_index(addr);
+}
+
static inline pud_t *p4d_pgtable(p4d_t p4d)
{
return (pud_t *)__va(p4d_page_paddr(p4d));
}
-/* Find an entry in the first-level page table. */
-#define pud_offset_phys(dir, addr) (p4d_page_paddr(READ_ONCE(*(dir))) + pud_index(addr) * sizeof(pud_t))
+static inline phys_addr_t pud_offset_phys(p4d_t *p4dp, unsigned long addr)
+{
+ BUG_ON(!pgtable_l4_enabled());
-#define pud_set_fixmap(addr) ((pud_t *)set_fixmap_offset(FIX_PUD, addr))
-#define pud_set_fixmap_offset(p4d, addr) pud_set_fixmap(pud_offset_phys(p4d, addr))
-#define pud_clear_fixmap() clear_fixmap(FIX_PUD)
+ return p4d_page_paddr(READ_ONCE(*p4dp)) + pud_index(addr) * sizeof(pud_t);
+}
-#define p4d_page(p4d) pfn_to_page(__phys_to_pfn(__p4d_to_phys(p4d)))
+static inline
+pud_t *pud_offset_lockless(p4d_t *p4dp, p4d_t p4d, unsigned long addr)
+{
+ if (!pgtable_l4_enabled())
+ return p4d_to_folded_pud(p4dp, addr);
+ return (pud_t *)__va(p4d_page_paddr(p4d)) + pud_index(addr);
+}
+#define pud_offset_lockless pud_offset_lockless
+
+static inline pud_t *pud_offset(p4d_t *p4dp, unsigned long addr)
+{
+ return pud_offset_lockless(p4dp, READ_ONCE(*p4dp), addr);
+}
+#define pud_offset pud_offset
+
+static inline pud_t *pud_set_fixmap(unsigned long addr)
+{
+ if (!pgtable_l4_enabled())
+ return NULL;
+ return (pud_t *)set_fixmap_offset(FIX_PUD, addr);
+}
+
+static inline pud_t *pud_set_fixmap_offset(p4d_t *p4dp, unsigned long addr)
+{
+ if (!pgtable_l4_enabled())
+ return p4d_to_folded_pud(p4dp, addr);
+ return pud_set_fixmap(pud_offset_phys(p4dp, addr));
+}
+
+static inline void pud_clear_fixmap(void)
+{
+ if (pgtable_l4_enabled())
+ clear_fixmap(FIX_PUD);
+}
/* use ONLY for statically allocated translation tables */
-#define pud_offset_kimg(dir,addr) ((pud_t *)__phys_to_kimg(pud_offset_phys((dir), (addr))))
+static inline pud_t *pud_offset_kimg(p4d_t *p4dp, u64 addr)
+{
+ if (!pgtable_l4_enabled())
+ return p4d_to_folded_pud(p4dp, addr);
+ return (pud_t *)__phys_to_kimg(pud_offset_phys(p4dp, addr));
+}
+
+#define p4d_page(p4d) pfn_to_page(__phys_to_pfn(__p4d_to_phys(p4d)))
#else
+static inline bool pgtable_l4_enabled(void) { return false; }
+
#define p4d_page_paddr(p4d) ({ BUILD_BUG(); 0;})
/* Match pud_offset folding in <asm/generic/pgtable-nopud.h> */
diff --git a/arch/arm64/include/asm/tlb.h b/arch/arm64/include/asm/tlb.h
index c995d1f4594f6692..a23d33b6b56b8c35 100644
--- a/arch/arm64/include/asm/tlb.h
+++ b/arch/arm64/include/asm/tlb.h
@@ -94,7 +94,8 @@ static inline void __pmd_free_tlb(struct mmu_gather *tlb, pmd_t *pmdp,
static inline void __pud_free_tlb(struct mmu_gather *tlb, pud_t *pudp,
unsigned long addr)
{
- tlb_remove_table(tlb, virt_to_page(pudp));
+ if (pgtable_l4_enabled())
+ tlb_remove_table(tlb, virt_to_page(pudp));
}
#endif
diff --git a/arch/arm64/kernel/cpufeature.c b/arch/arm64/kernel/cpufeature.c
index b0fee11886e54ca2..62f6d397f27b07c0 100644
--- a/arch/arm64/kernel/cpufeature.c
+++ b/arch/arm64/kernel/cpufeature.c
@@ -1731,6 +1731,8 @@ kpti_install_ng_mappings(const struct arm64_cpu_capabilities *__unused)
if (levels == 5 && !pgtable_l5_enabled())
levels = 4;
+ else if (levels == 4 && !pgtable_l4_enabled())
+ levels = 3;
if (__this_cpu_read(this_cpu_vector) == vectors) {
const char *v = arm64_get_bp_hardening_vector(EL1_VECTOR_KPTI);
diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index 77cd163375124c6a..955c7c3341fbc9c2 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -1043,7 +1043,7 @@ static void free_empty_pud_table(p4d_t *p4dp, unsigned long addr,
free_empty_pmd_table(pudp, addr, next, floor, ceiling);
} while (addr = next, addr < end);
- if (CONFIG_PGTABLE_LEVELS <= 3)
+ if (!pgtable_l4_enabled())
return;
if (!pgtable_range_aligned(start, end, floor, ceiling, P4D_MASK))
diff --git a/arch/arm64/mm/pgd.c b/arch/arm64/mm/pgd.c
index 3c4f8a279d2bc76a..0c501cabc23846c4 100644
--- a/arch/arm64/mm/pgd.c
+++ b/arch/arm64/mm/pgd.c
@@ -21,6 +21,8 @@ static bool pgdir_is_page_size(void)
{
if (PGD_SIZE == PAGE_SIZE)
return true;
+ if (CONFIG_PGTABLE_LEVELS == 4)
+ return !pgtable_l4_enabled();
if (CONFIG_PGTABLE_LEVELS == 5)
return !pgtable_l5_enabled();
return false;
--
2.39.2
Add support for using 5 levels of paging in the fixmap, as well as in
the kernel page table handling code which uses fixmaps internally.
This also handles the case where a 5 level build runs on hardware that
only supports 4 levels of paging.
Signed-off-by: Ard Biesheuvel <[email protected]>
---
arch/arm64/include/asm/fixmap.h | 1 +
arch/arm64/include/asm/pgtable.h | 45 ++++++++++++--
arch/arm64/mm/mmu.c | 64 +++++++++++++++++---
3 files changed, 96 insertions(+), 14 deletions(-)
diff --git a/arch/arm64/include/asm/fixmap.h b/arch/arm64/include/asm/fixmap.h
index 71ed5fdf718bd0fd..8d4ec7bf74afb7a6 100644
--- a/arch/arm64/include/asm/fixmap.h
+++ b/arch/arm64/include/asm/fixmap.h
@@ -90,6 +90,7 @@ enum fixed_addresses {
FIX_PTE,
FIX_PMD,
FIX_PUD,
+ FIX_P4D,
FIX_PGD,
__end_of_fixed_addresses
diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index c667073e3f56755d..a286ecc447d33b24 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -613,12 +613,12 @@ static inline bool pud_table(pud_t pud) { return true; }
PUD_TYPE_TABLE)
#endif
-extern pgd_t init_pg_dir[PTRS_PER_PGD];
+extern pgd_t init_pg_dir[];
extern pgd_t init_pg_end[];
-extern pgd_t swapper_pg_dir[PTRS_PER_PGD];
-extern pgd_t idmap_pg_dir[PTRS_PER_PGD];
-extern pgd_t tramp_pg_dir[PTRS_PER_PGD];
-extern pgd_t reserved_pg_dir[PTRS_PER_PGD];
+extern pgd_t swapper_pg_dir[];
+extern pgd_t idmap_pg_dir[];
+extern pgd_t tramp_pg_dir[];
+extern pgd_t reserved_pg_dir[];
extern void set_swapper_pgd(pgd_t *pgdp, pgd_t pgd);
@@ -883,12 +883,47 @@ static inline p4d_t *p4d_offset(pgd_t *pgdp, unsigned long addr)
return p4d_offset_lockless(pgdp, READ_ONCE(*pgdp), addr);
}
+static inline p4d_t *p4d_set_fixmap(unsigned long addr)
+{
+ if (!pgtable_l5_enabled())
+ return NULL;
+ return (p4d_t *)set_fixmap_offset(FIX_P4D, addr);
+}
+
+static inline p4d_t *p4d_set_fixmap_offset(pgd_t *pgdp, unsigned long addr)
+{
+ if (!pgtable_l5_enabled())
+ return pgd_to_folded_p4d(pgdp, addr);
+ return p4d_set_fixmap(p4d_offset_phys(pgdp, addr));
+}
+
+static inline void p4d_clear_fixmap(void)
+{
+ if (pgtable_l5_enabled())
+ clear_fixmap(FIX_P4D);
+}
+
+/* use ONLY for statically allocated translation tables */
+static inline p4d_t *p4d_offset_kimg(pgd_t *pgdp, u64 addr)
+{
+ if (!pgtable_l5_enabled())
+ return pgd_to_folded_p4d(pgdp, addr);
+ return (p4d_t *)__phys_to_kimg(p4d_offset_phys(pgdp, addr));
+}
+
#define pgd_page(pgd) pfn_to_page(__phys_to_pfn(__pgd_to_phys(pgd)))
#else
static inline bool pgtable_l5_enabled(void) { return false; }
+/* Match p4d_offset folding in <asm/generic/pgtable-nop4d.h> */
+#define p4d_set_fixmap(addr) NULL
+#define p4d_set_fixmap_offset(p4dp, addr) ((p4d_t *)p4dp)
+#define p4d_clear_fixmap()
+
+#define p4d_offset_kimg(dir,addr) ((p4d_t *)dir)
+
#endif /* CONFIG_PGTABLE_LEVELS > 4 */
#define pgd_ERROR(e) \
diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index 6426df5211456cfe..77cd163375124c6a 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -315,15 +315,14 @@ static void alloc_init_cont_pmd(pud_t *pudp, unsigned long addr,
} while (addr = next, addr != end);
}
-static void alloc_init_pud(pgd_t *pgdp, unsigned long addr, unsigned long end,
+static void alloc_init_pud(p4d_t *p4dp, unsigned long addr, unsigned long end,
phys_addr_t phys, pgprot_t prot,
phys_addr_t (*pgtable_alloc)(int),
int flags)
{
unsigned long next;
- pud_t *pudp;
- p4d_t *p4dp = p4d_offset(pgdp, addr);
p4d_t p4d = READ_ONCE(*p4dp);
+ pud_t *pudp;
if (p4d_none(p4d)) {
p4dval_t p4dval = P4D_TYPE_TABLE | P4D_TABLE_UXN;
@@ -371,6 +370,46 @@ static void alloc_init_pud(pgd_t *pgdp, unsigned long addr, unsigned long end,
pud_clear_fixmap();
}
+static void alloc_init_p4d(pgd_t *pgdp, unsigned long addr, unsigned long end,
+ phys_addr_t phys, pgprot_t prot,
+ phys_addr_t (*pgtable_alloc)(int),
+ int flags)
+{
+ unsigned long next;
+ pgd_t pgd = READ_ONCE(*pgdp);
+ p4d_t *p4dp;
+
+ if (pgd_none(pgd)) {
+ pgdval_t pgdval = PGD_TYPE_TABLE | PGD_TABLE_UXN;
+ phys_addr_t p4d_phys;
+
+ if (flags & NO_EXEC_MAPPINGS)
+ pgdval |= PGD_TABLE_PXN;
+ BUG_ON(!pgtable_alloc);
+ p4d_phys = pgtable_alloc(P4D_SHIFT);
+ __pgd_populate(pgdp, p4d_phys, pgdval);
+ pgd = READ_ONCE(*pgdp);
+ }
+ BUG_ON(pgd_bad(pgd));
+
+ p4dp = p4d_set_fixmap_offset(pgdp, addr);
+ do {
+ p4d_t old_p4d = READ_ONCE(*p4dp);
+
+ next = p4d_addr_end(addr, end);
+
+ alloc_init_pud(p4dp, addr, next, phys, prot,
+ pgtable_alloc, flags);
+
+ BUG_ON(p4d_val(old_p4d) != 0 &&
+ p4d_val(old_p4d) != READ_ONCE(p4d_val(*p4dp)));
+
+ phys += next - addr;
+ } while (p4dp++, addr = next, addr != end);
+
+ p4d_clear_fixmap();
+}
+
static void __create_pgd_mapping_locked(pgd_t *pgdir, phys_addr_t phys,
unsigned long virt, phys_addr_t size,
pgprot_t prot,
@@ -393,7 +432,7 @@ static void __create_pgd_mapping_locked(pgd_t *pgdir, phys_addr_t phys,
do {
next = pgd_addr_end(addr, end);
- alloc_init_pud(pgdp, addr, next, phys, prot, pgtable_alloc,
+ alloc_init_p4d(pgdp, addr, next, phys, prot, pgtable_alloc,
flags);
phys += next - addr;
} while (pgdp++, addr = next, addr != end);
@@ -1120,10 +1159,19 @@ void vmemmap_free(unsigned long start, unsigned long end,
}
#endif /* CONFIG_MEMORY_HOTPLUG */
-static inline pud_t *fixmap_pud(unsigned long addr)
+static inline p4d_t *fixmap_p4d(unsigned long addr)
{
pgd_t *pgdp = pgd_offset_k(addr);
- p4d_t *p4dp = p4d_offset(pgdp, addr);
+ pgd_t pgd = READ_ONCE(*pgdp);
+
+ BUG_ON(pgd_none(pgd) || pgd_bad(pgd));
+
+ return p4d_offset_kimg(pgdp, addr);
+}
+
+static inline pud_t *fixmap_pud(unsigned long addr)
+{
+ p4d_t *p4dp = fixmap_p4d(addr);
p4d_t p4d = READ_ONCE(*p4dp);
BUG_ON(p4d_none(p4d) || p4d_bad(p4d));
@@ -1154,14 +1202,12 @@ static inline pte_t *fixmap_pte(unsigned long addr)
*/
void __init early_fixmap_init(void)
{
- pgd_t *pgdp;
p4d_t *p4dp, p4d;
pud_t *pudp;
pmd_t *pmdp;
unsigned long addr = FIXADDR_START;
- pgdp = pgd_offset_k(addr);
- p4dp = p4d_offset(pgdp, addr);
+ p4dp = fixmap_p4d(addr);
p4d = READ_ONCE(*p4dp);
if (CONFIG_PGTABLE_LEVELS > 3 &&
!(p4d_none(p4d) || p4d_page_paddr(p4d) == __pa_symbol(bm_pud))) {
--
2.39.2
Even though we support loading kernels anywhere in 48-bit addressable
physical memory, we create the ID maps based on the number of levels
that we happened to configure for the kernel VA and user VA spaces.
The reason for this is that the PGD/PUD/PMD based classification of
translation levels, along with the associated folding when the number of
levels is less than 5, does not permit creating a page table hierarchy
of a set number of levels. This means that, for instance, on 39-bit VA
kernels we need to configure an additional level above PGD level on the
fly, and 36-bit VA kernels still only support 47-bit virtual addressing
with this trick applied.
Now that we have a separate helper to populate page table hierarchies
that does not define the levels in terms of PUDs/PMDs/etc at all, let's
reuse it to create the permanent ID map with a fixed VA size of 48 bits.
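For reference, using the usual ARM64_HW_PGTABLE_LEVELS() arithmetic (the
number of (PAGE_SHIFT - 3)-bit chunks needed to cover the VA bits above
PAGE_SHIFT), the fixed 48-bit ID map works out as follows:

  IDMAP_LEVELS = ceil((48 - 12) /  9) = 4 for 4k pages,  IDMAP_ROOT_LEVEL = 0
  IDMAP_LEVELS = ceil((48 - 14) / 11) = 4 for 16k pages, IDMAP_ROOT_LEVEL = 0
  IDMAP_LEVELS = ceil((48 - 16) / 13) = 3 for 64k pages, IDMAP_ROOT_LEVEL = 1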
Signed-off-by: Ard Biesheuvel <[email protected]>
---
arch/arm64/include/asm/kernel-pgtable.h | 2 ++
arch/arm64/kernel/head.S | 5 +++
arch/arm64/kvm/mmu.c | 15 +++------
arch/arm64/mm/mmu.c | 32 +++++++++++---------
arch/arm64/mm/proc.S | 9 ++----
5 files changed, 31 insertions(+), 32 deletions(-)
diff --git a/arch/arm64/include/asm/kernel-pgtable.h b/arch/arm64/include/asm/kernel-pgtable.h
index 50b5c145358a5d8e..2a2c80ffe59e5307 100644
--- a/arch/arm64/include/asm/kernel-pgtable.h
+++ b/arch/arm64/include/asm/kernel-pgtable.h
@@ -35,6 +35,8 @@
#define SWAPPER_PGTABLE_LEVELS (CONFIG_PGTABLE_LEVELS)
#endif
+#define IDMAP_LEVELS ARM64_HW_PGTABLE_LEVELS(48)
+#define IDMAP_ROOT_LEVEL (4 - IDMAP_LEVELS)
/*
* If KASLR is enabled, then an offset K is added to the kernel address
diff --git a/arch/arm64/kernel/head.S b/arch/arm64/kernel/head.S
index e45fd99e8ab4272a..fc6a4076d826b728 100644
--- a/arch/arm64/kernel/head.S
+++ b/arch/arm64/kernel/head.S
@@ -727,6 +727,11 @@ SYM_FUNC_START_LOCAL(__no_granule_support)
SYM_FUNC_END(__no_granule_support)
SYM_FUNC_START_LOCAL(__primary_switch)
+ mrs x1, tcr_el1
+ mov x2, #64 - VA_BITS
+ tcr_set_t0sz x1, x2
+ msr tcr_el1, x1
+
adrp x1, reserved_pg_dir
adrp x2, init_idmap_pg_dir
bl __enable_mmu
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 7113587222ffe8e1..d64be7b5f6692e8b 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -1687,16 +1687,9 @@ int __init kvm_mmu_init(u32 *hyp_va_bits)
BUG_ON((hyp_idmap_start ^ (hyp_idmap_end - 1)) & PAGE_MASK);
/*
- * The ID map may be configured to use an extended virtual address
- * range. This is only the case if system RAM is out of range for the
- * currently configured page size and VA_BITS_MIN, in which case we will
- * also need the extended virtual range for the HYP ID map, or we won't
- * be able to enable the EL2 MMU.
- *
- * However, in some cases the ID map may be configured for fewer than
- * the number of VA bits used by the regular kernel stage 1. This
- * happens when VA_BITS=52 and the kernel image is placed in PA space
- * below 48 bits.
+ * The ID map is always configured for 48 bits of translation, which
+ * may be fewer than the number of VA bits used by the regular kernel
+ * stage 1, when VA_BITS=52.
*
* At EL2, there is only one TTBR register, and we can't switch between
* translation tables *and* update TCR_EL2.T0SZ at the same time. Bottom
@@ -1707,7 +1700,7 @@ int __init kvm_mmu_init(u32 *hyp_va_bits)
* 1 VA bits to assure that the hypervisor can both ID map its code page
* and map any kernel memory.
*/
- idmap_bits = 64 - ((idmap_t0sz & TCR_T0SZ_MASK) >> TCR_T0SZ_OFFSET);
+ idmap_bits = 48;
kernel_bits = vabits_actual;
*hyp_va_bits = max(idmap_bits, kernel_bits);
diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index 81e1420d2cc13246..a59433ae4f5f8d02 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -762,22 +762,21 @@ static void __init map_kernel(pgd_t *pgdp)
kasan_copy_shadow(pgdp);
}
+void __pi_map_range(u64 *pgd, u64 start, u64 end, u64 pa, pgprot_t prot,
+ int level, pte_t *tbl, bool may_use_cont, u64 va_offset);
+
+static u8 idmap_ptes[IDMAP_LEVELS - 1][PAGE_SIZE] __aligned(PAGE_SIZE) __ro_after_init,
+ kpti_ptes[IDMAP_LEVELS - 1][PAGE_SIZE] __aligned(PAGE_SIZE) __ro_after_init;
+
static void __init create_idmap(void)
{
u64 start = __pa_symbol(__idmap_text_start);
- u64 size = __pa_symbol(__idmap_text_end) - start;
- pgd_t *pgd = idmap_pg_dir;
- u64 pgd_phys;
-
- /* check if we need an additional level of translation */
- if (VA_BITS < 48 && idmap_t0sz < (64 - VA_BITS_MIN)) {
- pgd_phys = early_pgtable_alloc(PAGE_SHIFT);
- set_pgd(&idmap_pg_dir[start >> VA_BITS],
- __pgd(pgd_phys | P4D_TYPE_TABLE));
- pgd = __va(pgd_phys);
- }
- __create_pgd_mapping(pgd, start, start, size, PAGE_KERNEL_ROX,
- early_pgtable_alloc, 0);
+ u64 end = __pa_symbol(__idmap_text_end);
+ u64 ptep = __pa_symbol(idmap_ptes);
+
+ __pi_map_range(&ptep, start, end, start, PAGE_KERNEL_ROX,
+ IDMAP_ROOT_LEVEL, (pte_t *)idmap_pg_dir, false,
+ __phys_to_virt(ptep) - ptep);
if (IS_ENABLED(CONFIG_UNMAP_KERNEL_AT_EL0)) {
extern u32 __idmap_kpti_flag;
@@ -787,8 +786,10 @@ static void __init create_idmap(void)
* The KPTI G-to-nG conversion code needs a read-write mapping
* of its synchronization flag in the ID map.
*/
- __create_pgd_mapping(pgd, pa, pa, sizeof(u32), PAGE_KERNEL,
- early_pgtable_alloc, 0);
+ ptep = __pa_symbol(kpti_ptes);
+ __pi_map_range(&ptep, pa, pa + sizeof(u32), pa, PAGE_KERNEL,
+ IDMAP_ROOT_LEVEL, (pte_t *)idmap_pg_dir, false,
+ __phys_to_virt(ptep) - ptep);
}
}
@@ -813,6 +814,7 @@ void __init paging_init(void)
memblock_allow_resize();
create_idmap();
+ idmap_t0sz = TCR_T0SZ(48);
}
#ifdef CONFIG_MEMORY_HOTPLUG
diff --git a/arch/arm64/mm/proc.S b/arch/arm64/mm/proc.S
index 82e88f4521737c0e..c7129b21bfd5191f 100644
--- a/arch/arm64/mm/proc.S
+++ b/arch/arm64/mm/proc.S
@@ -422,9 +422,9 @@ SYM_FUNC_START(__cpu_setup)
mair .req x17
tcr .req x16
mov_q mair, MAIR_EL1_SET
- mov_q tcr, TCR_TxSZ(VA_BITS) | TCR_CACHE_FLAGS | TCR_SMP_FLAGS | \
- TCR_TG_FLAGS | TCR_KASLR_FLAGS | TCR_ASID16 | \
- TCR_TBI0 | TCR_A1 | TCR_KASAN_SW_FLAGS | TCR_MTE_FLAGS
+ mov_q tcr, TCR_T0SZ(48) | TCR_T1SZ(VA_BITS) | TCR_CACHE_FLAGS | \
+ TCR_SMP_FLAGS | TCR_TG_FLAGS | TCR_KASLR_FLAGS | TCR_ASID16 | \
+ TCR_TBI0 | TCR_A1 | TCR_KASAN_SW_FLAGS | TCR_MTE_FLAGS
tcr_clear_errata_bits tcr, x9, x5
@@ -432,10 +432,7 @@ SYM_FUNC_START(__cpu_setup)
sub x9, xzr, x0
add x9, x9, #64
tcr_set_t1sz tcr, x9
-#else
- idmap_get_t0sz x9
#endif
- tcr_set_t0sz tcr, x9
/*
* Set the IPS bits in TCR_EL1.
--
2.39.2
This reverts commit 1682c45b920643c, which is no longer needed now that
we create the permanent kernel mapping directly during early boot.
Signed-off-by: Ard Biesheuvel <[email protected]>
---
arch/arm64/include/asm/mmu_context.h | 13 ++++---------
arch/arm64/kernel/cpufeature.c | 2 +-
arch/arm64/kernel/suspend.c | 2 +-
arch/arm64/mm/kasan_init.c | 4 ++--
4 files changed, 8 insertions(+), 13 deletions(-)
diff --git a/arch/arm64/include/asm/mmu_context.h b/arch/arm64/include/asm/mmu_context.h
index f9ae2891e4c72c7f..bc1cef5002d60e02 100644
--- a/arch/arm64/include/asm/mmu_context.h
+++ b/arch/arm64/include/asm/mmu_context.h
@@ -104,18 +104,13 @@ static inline void cpu_uninstall_idmap(void)
cpu_switch_mm(mm->pgd, mm);
}
-static inline void __cpu_install_idmap(pgd_t *idmap)
+static inline void cpu_install_idmap(void)
{
cpu_set_reserved_ttbr0();
local_flush_tlb_all();
cpu_set_idmap_tcr_t0sz();
- cpu_switch_mm(lm_alias(idmap), &init_mm);
-}
-
-static inline void cpu_install_idmap(void)
-{
- __cpu_install_idmap(idmap_pg_dir);
+ cpu_switch_mm(lm_alias(idmap_pg_dir), &init_mm);
}
/*
@@ -146,7 +141,7 @@ static inline void cpu_install_ttbr0(phys_addr_t ttbr0, unsigned long t0sz)
* Atomically replaces the active TTBR1_EL1 PGD with a new VA-compatible PGD,
* avoiding the possibility of conflicting TLB entries being allocated.
*/
-static inline void cpu_replace_ttbr1(pgd_t *pgdp, pgd_t *idmap)
+static inline void cpu_replace_ttbr1(pgd_t *pgdp)
{
typedef void (ttbr_replace_func)(phys_addr_t);
extern ttbr_replace_func idmap_cpu_replace_ttbr1;
@@ -170,7 +165,7 @@ static inline void cpu_replace_ttbr1(pgd_t *pgdp, pgd_t *idmap)
replace_phys = (void *)__pa_symbol(idmap_cpu_replace_ttbr1);
- __cpu_install_idmap(idmap);
+ cpu_install_idmap();
/*
* We really don't want to take *any* exceptions while TTBR1 is
diff --git a/arch/arm64/kernel/cpufeature.c b/arch/arm64/kernel/cpufeature.c
index e9788671be044a47..b206de4758ce6fb3 100644
--- a/arch/arm64/kernel/cpufeature.c
+++ b/arch/arm64/kernel/cpufeature.c
@@ -3456,7 +3456,7 @@ subsys_initcall_sync(init_32bit_el0_mask);
static void __maybe_unused cpu_enable_cnp(struct arm64_cpu_capabilities const *cap)
{
- cpu_replace_ttbr1(lm_alias(swapper_pg_dir), idmap_pg_dir);
+ cpu_replace_ttbr1(lm_alias(swapper_pg_dir));
}
/*
diff --git a/arch/arm64/kernel/suspend.c b/arch/arm64/kernel/suspend.c
index 0fbdf5fe64d8da08..7c2391851db6c198 100644
--- a/arch/arm64/kernel/suspend.c
+++ b/arch/arm64/kernel/suspend.c
@@ -55,7 +55,7 @@ void notrace __cpu_suspend_exit(void)
/* Restore CnP bit in TTBR1_EL1 */
if (system_supports_cnp())
- cpu_replace_ttbr1(lm_alias(swapper_pg_dir), idmap_pg_dir);
+ cpu_replace_ttbr1(lm_alias(swapper_pg_dir));
/*
* PSTATE was not saved over suspend/resume, re-enable any detected
diff --git a/arch/arm64/mm/kasan_init.c b/arch/arm64/mm/kasan_init.c
index df98f496539f0e39..7e32f21fb8e1e227 100644
--- a/arch/arm64/mm/kasan_init.c
+++ b/arch/arm64/mm/kasan_init.c
@@ -221,7 +221,7 @@ static void __init kasan_init_shadow(void)
*/
memcpy(tmp_pg_dir, swapper_pg_dir, sizeof(tmp_pg_dir));
dsb(ishst);
- cpu_replace_ttbr1(lm_alias(tmp_pg_dir), idmap_pg_dir);
+ cpu_replace_ttbr1(lm_alias(tmp_pg_dir));
clear_pgds(KASAN_SHADOW_START, KASAN_SHADOW_END);
@@ -265,7 +265,7 @@ static void __init kasan_init_shadow(void)
PAGE_KERNEL_RO));
memset(kasan_early_shadow_page, KASAN_SHADOW_INIT, PAGE_SIZE);
- cpu_replace_ttbr1(lm_alias(swapper_pg_dir), idmap_pg_dir);
+ cpu_replace_ttbr1(lm_alias(swapper_pg_dir));
}
static void __init kasan_init_depth(void)
--
2.39.2
From: Anshuman Khandual <[email protected]>
As per the ARM ARM (0487G.A), the TCR_EL1.DS field controls whether 52-bit
input and output addresses are supported on the 4K and 16K page size
configurations when FEAT_LPA2 is implemented. Add the TCR_DS field
definition, which will be used when FEAT_LPA2 gets enabled.
Acked-by: Catalin Marinas <[email protected]>
Signed-off-by: Anshuman Khandual <[email protected]>
Signed-off-by: Ard Biesheuvel <[email protected]>
---
arch/arm64/include/asm/pgtable-hwdef.h | 1 +
1 file changed, 1 insertion(+)
diff --git a/arch/arm64/include/asm/pgtable-hwdef.h b/arch/arm64/include/asm/pgtable-hwdef.h
index f658aafc47dfa29a..c4ad7fbb12c5c07a 100644
--- a/arch/arm64/include/asm/pgtable-hwdef.h
+++ b/arch/arm64/include/asm/pgtable-hwdef.h
@@ -276,6 +276,7 @@
#define TCR_E0PD1 (UL(1) << 56)
#define TCR_TCMA0 (UL(1) << 57)
#define TCR_TCMA1 (UL(1) << 58)
+#define TCR_DS (UL(1) << 59)
/*
* TTBR.
--
2.39.2
The KVM code needs more work to support 5 level paging with LPA2, so for
the time being, limit KVM to 48-bit addressing on 4k and 16k page size
configurations. This can be reverted once the LPA2 support for KVM is
merged.
Signed-off-by: Ard Biesheuvel <[email protected]>
---
arch/arm64/kvm/hyp/nvhe/mem_protect.c | 2 ++
arch/arm64/kvm/mmu.c | 5 ++++-
arch/arm64/kvm/va_layout.c | 9 +++++----
3 files changed, 11 insertions(+), 5 deletions(-)
diff --git a/arch/arm64/kvm/hyp/nvhe/mem_protect.c b/arch/arm64/kvm/hyp/nvhe/mem_protect.c
index 552653fa18be34b2..e00b87ed4a8400f6 100644
--- a/arch/arm64/kvm/hyp/nvhe/mem_protect.c
+++ b/arch/arm64/kvm/hyp/nvhe/mem_protect.c
@@ -128,6 +128,8 @@ static void prepare_host_vtcr(void)
/* The host stage 2 is id-mapped, so use parange for T0SZ */
parange = kvm_get_parange(id_aa64mmfr0_el1_sys_val);
phys_shift = id_aa64mmfr0_parange_to_phys_shift(parange);
+ if (IS_ENABLED(CONFIG_ARM64_LPA2) && phys_shift > 48)
+ phys_shift = 48; // not implemented yet
host_mmu.arch.vtcr = kvm_get_vtcr(id_aa64mmfr0_el1_sys_val,
id_aa64mmfr1_el1_sys_val, phys_shift);
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 4e7c0f9a9c286c09..2ad9e6f1e101e52d 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -661,7 +661,8 @@ static int get_user_mapping_size(struct kvm *kvm, u64 addr)
{
struct kvm_pgtable pgt = {
.pgd = (kvm_pteref_t)kvm->mm->pgd,
- .ia_bits = vabits_actual,
+ .ia_bits = IS_ENABLED(CONFIG_ARM64_LPA2) ? 48
+ : vabits_actual,
.start_level = (KVM_PGTABLE_MAX_LEVELS -
ARM64_HW_PGTABLE_LEVELS(pgt.ia_bits)),
.mm_ops = &kvm_user_mm_ops,
@@ -1703,6 +1704,8 @@ int __init kvm_mmu_init(u32 *hyp_va_bits)
idmap_bits = 48;
kernel_bits = vabits_actual;
*hyp_va_bits = max(idmap_bits, kernel_bits);
+ if (IS_ENABLED(CONFIG_ARM64_LPA2))
+ *hyp_va_bits = 48; // LPA2 is not yet supported in KVM
kvm_debug("Using %u-bit virtual addresses at EL2\n", *hyp_va_bits);
kvm_debug("IDMAP page: %lx\n", hyp_idmap_start);
diff --git a/arch/arm64/kvm/va_layout.c b/arch/arm64/kvm/va_layout.c
index 341b67e2f2514e55..ac87d0c39c38f7d9 100644
--- a/arch/arm64/kvm/va_layout.c
+++ b/arch/arm64/kvm/va_layout.c
@@ -59,12 +59,13 @@ static void init_hyp_physvirt_offset(void)
*/
__init void kvm_compute_layout(void)
{
+ u64 vabits = IS_ENABLED(CONFIG_ARM64_LPA2) ? 48 : vabits_actual; // not yet
phys_addr_t idmap_addr = __pa_symbol(__hyp_idmap_text_start);
u64 hyp_va_msb;
/* Where is my RAM region? */
- hyp_va_msb = idmap_addr & BIT(vabits_actual - 1);
- hyp_va_msb ^= BIT(vabits_actual - 1);
+ hyp_va_msb = idmap_addr & BIT(vabits - 1);
+ hyp_va_msb ^= BIT(vabits - 1);
tag_lsb = fls64((u64)phys_to_virt(memblock_start_of_DRAM()) ^
(u64)(high_memory - 1));
@@ -72,10 +73,10 @@ __init void kvm_compute_layout(void)
va_mask = GENMASK_ULL(tag_lsb - 1, 0);
tag_val = hyp_va_msb;
- if (IS_ENABLED(CONFIG_RANDOMIZE_BASE) && tag_lsb != (vabits_actual - 1) &&
+ if (IS_ENABLED(CONFIG_RANDOMIZE_BASE) && tag_lsb != (vabits - 1) &&
!kaslr_disabled_cmdline()) {
/* We have some free bits to insert a random tag. */
- tag_val |= get_random_long() & GENMASK_ULL(vabits_actual - 2, tag_lsb);
+ tag_val |= get_random_long() & GENMASK_ULL(vabits - 2, tag_lsb);
}
tag_val >>= tag_lsb;
--
2.39.2
Currently, we detect CPU support for 52-bit virtual addressing (LVA)
extremely early, before creating the kernel page tables or enabling the
MMU. We cannot override the feature this early, and so large virtual
addressing is always enabled on CPUs that implement support for it if
the software support for it was enabled at build time. It also means we
rely on non-trivial code in asm to deal with this feature.
Given that both the ID map and the TTBR1 mapping of the kernel image are
guaranteed to be 48-bit addressable, it is not actually necessary to
enable support this early, and instead, we can model it as a CPU
feature. That way, we can rely on code patching to get the correct
TCR.T1SZ values programmed on secondary boot and resume from suspend.
On the primary boot path, we simply enable the MMU with 48-bit virtual
addressing initially, and update TCR.T1SZ from C code if LVA is supported,
right before creating the kernel mapping. Given that TTBR1 still
points to reserved_pg_dir at this point, updating TCR.T1SZ should be
safe without the need for explicit TLB maintenance.
Since this gets rid of all accesses to the vabits_actual variable from
asm code that occurred before TCR.T1SZ had been programmed, we no longer
have a need for this variable, and we can replace it with a C expression
that produces the correct value directly, based on the value of TCR.T1SZ.
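A minimal userspace sketch (not kernel code) of the replacement
expression, relying on T1SZ occupying bits [21:16] of TCR_EL1:

  #include <stdint.h>
  #include <stdio.h>

  /* vabits_actual = 64 - TCR_EL1.T1SZ */
  static unsigned int vabits_from_tcr(uint64_t tcr)
  {
          return 64 - ((tcr >> 16) & 63);
  }

  int main(void)
  {
          printf("%u\n", vabits_from_tcr(16UL << 16));    /* T1SZ=16 -> 48 bits */
          printf("%u\n", vabits_from_tcr(12UL << 16));    /* T1SZ=12 -> 52 bits */
          return 0;
  }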
Signed-off-by: Ard Biesheuvel <[email protected]>
---
arch/arm64/include/asm/cpufeature.h | 9 ++++++
arch/arm64/include/asm/memory.h | 13 ++++++++-
arch/arm64/kernel/cpufeature.c | 13 +++++++++
arch/arm64/kernel/head.S | 29 +++++---------------
arch/arm64/kernel/image-vars.h | 1 -
arch/arm64/kernel/pi/map_kernel.c | 3 ++
arch/arm64/kernel/sleep.S | 3 --
arch/arm64/mm/mmu.c | 5 ----
arch/arm64/mm/proc.S | 9 +++---
arch/arm64/tools/cpucaps | 1 +
10 files changed, 49 insertions(+), 37 deletions(-)
diff --git a/arch/arm64/include/asm/cpufeature.h b/arch/arm64/include/asm/cpufeature.h
index edefe3b36fe5c243..a37f4956d5a7ef6e 100644
--- a/arch/arm64/include/asm/cpufeature.h
+++ b/arch/arm64/include/asm/cpufeature.h
@@ -989,6 +989,15 @@ static inline bool cpu_has_pac(void)
return feat;
}
+static inline bool cpu_has_lva(void)
+{
+ u64 mmfr2;
+
+ mmfr2 = read_sysreg_s(SYS_ID_AA64MMFR2_EL1);
+ return cpuid_feature_extract_unsigned_field(mmfr2,
+ ID_AA64MMFR2_EL1_VARange_SHIFT);
+}
+
#endif /* __ASSEMBLY__ */
#endif
diff --git a/arch/arm64/include/asm/memory.h b/arch/arm64/include/asm/memory.h
index f96975466ef1b752..3e32d957aadcb2bb 100644
--- a/arch/arm64/include/asm/memory.h
+++ b/arch/arm64/include/asm/memory.h
@@ -183,9 +183,20 @@
#include <asm/boot.h>
#include <asm/bug.h>
#include <asm/sections.h>
+#include <asm/sysreg.h>
+
+static inline u64 __pure read_tcr(void)
+{
+ u64 tcr;
+
+ // read_sysreg() uses asm volatile, so avoid it here
+ asm("mrs %0, tcr_el1" : "=r"(tcr));
+ return tcr;
+}
#if VA_BITS > 48
-extern u64 vabits_actual;
+// For reasons of #include hell, we can't use TCR_T1SZ_OFFSET/TCR_T1SZ_MASK here
+#define vabits_actual (64 - ((read_tcr() >> 16) & 63))
#else
#define vabits_actual ((u64)VA_BITS)
#endif
diff --git a/arch/arm64/kernel/cpufeature.c b/arch/arm64/kernel/cpufeature.c
index b206de4758ce6fb3..660dedcae173841a 100644
--- a/arch/arm64/kernel/cpufeature.c
+++ b/arch/arm64/kernel/cpufeature.c
@@ -2752,6 +2752,19 @@ static const struct arm64_cpu_capabilities arm64_features[] = {
.matches = has_cpuid_feature,
.cpu_enable = cpu_enable_dit,
},
+#ifdef CONFIG_ARM64_VA_BITS_52
+ {
+ .desc = "52-bit Virtual Addressing (LVA)",
+ .capability = ARM64_HAS_VA52,
+ .type = ARM64_CPUCAP_BOOT_CPU_FEATURE,
+ .sys_reg = SYS_ID_AA64MMFR2_EL1,
+ .sign = FTR_UNSIGNED,
+ .field_width = 4,
+ .field_pos = ID_AA64MMFR2_EL1_VARange_SHIFT,
+ .matches = has_cpuid_feature,
+ .min_field_value = ID_AA64MMFR2_EL1_VARange_52,
+ },
+#endif
{},
};
diff --git a/arch/arm64/kernel/head.S b/arch/arm64/kernel/head.S
index c6f8c3b1f026c07b..47ebe3242d7feb7e 100644
--- a/arch/arm64/kernel/head.S
+++ b/arch/arm64/kernel/head.S
@@ -80,7 +80,6 @@
* x19 primary_entry() .. start_kernel() whether we entered with the MMU on
* x20 primary_entry() .. __primary_switch() CPU boot mode
* x21 primary_entry() .. start_kernel() FDT pointer passed at boot in x0
- * x25 primary_entry() .. start_kernel() supported VA size
*/
SYM_CODE_START(primary_entry)
bl record_mmu_state
@@ -125,14 +124,6 @@ SYM_CODE_START(primary_entry)
* On return, the CPU will be ready for the MMU to be turned on and
* the TCR will have been set.
*/
-#if VA_BITS > 48
- mrs_s x0, SYS_ID_AA64MMFR2_EL1
- tst x0, #0xf << ID_AA64MMFR2_EL1_VARange_SHIFT
- mov x0, #VA_BITS
- mov x25, #VA_BITS_MIN
- csel x25, x25, x0, eq
- mov x0, x25
-#endif
bl __cpu_setup // initialise processor
b __primary_switch
SYM_CODE_END(primary_entry)
@@ -242,11 +233,6 @@ SYM_FUNC_START_LOCAL(__primary_switched)
mov x0, x20
bl set_cpu_boot_mode_flag
-#if VA_BITS > 48
- adr_l x8, vabits_actual // Set this early so KASAN early init
- str x25, [x8] // ... observes the correct value
- dc civac, x8 // Make visible to booting secondaries
-#endif
#if defined(CONFIG_KASAN_GENERIC) || defined(CONFIG_KASAN_SW_TAGS)
bl kasan_early_init
#endif
@@ -374,10 +360,13 @@ SYM_FUNC_START_LOCAL(secondary_startup)
* Common entry point for secondary CPUs.
*/
mov x20, x0 // preserve boot mode
+
+#ifdef CONFIG_ARM64_VA_BITS_52
+alternative_if ARM64_HAS_VA52
bl __cpu_secondary_check52bitva
-#if VA_BITS > 48
- ldr_l x0, vabits_actual
+alternative_else_nop_endif
#endif
+
bl __cpu_setup // initialise processor
adrp x1, swapper_pg_dir
adrp x2, idmap_pg_dir
@@ -480,12 +469,8 @@ SYM_FUNC_START(__enable_mmu)
ret
SYM_FUNC_END(__enable_mmu)
+#ifdef CONFIG_ARM64_VA_BITS_52
SYM_FUNC_START(__cpu_secondary_check52bitva)
-#if VA_BITS > 48
- ldr_l x0, vabits_actual
- cmp x0, #52
- b.ne 2f
-
mrs_s x0, SYS_ID_AA64MMFR2_EL1
and x0, x0, #(0xf << ID_AA64MMFR2_EL1_VARange_SHIFT)
cbnz x0, 2f
@@ -496,9 +481,9 @@ SYM_FUNC_START(__cpu_secondary_check52bitva)
wfi
b 1b
-#endif
2: ret
SYM_FUNC_END(__cpu_secondary_check52bitva)
+#endif
SYM_FUNC_START_LOCAL(__no_granule_support)
/* Indicate that this CPU can't boot and is stuck in the kernel */
diff --git a/arch/arm64/kernel/image-vars.h b/arch/arm64/kernel/image-vars.h
index 1c0e920a2466f851..61d1d283a69ba5d8 100644
--- a/arch/arm64/kernel/image-vars.h
+++ b/arch/arm64/kernel/image-vars.h
@@ -34,7 +34,6 @@ PROVIDE(__pi___memcpy = __pi_memcpy);
PROVIDE(__pi___memmove = __pi_memmove);
PROVIDE(__pi___memset = __pi_memset);
-PROVIDE(__pi_vabits_actual = vabits_actual);
PROVIDE(__pi_id_aa64isar1_override = id_aa64isar1_override);
PROVIDE(__pi_id_aa64isar2_override = id_aa64isar2_override);
PROVIDE(__pi_id_aa64mmfr1_override = id_aa64mmfr1_override);
diff --git a/arch/arm64/kernel/pi/map_kernel.c b/arch/arm64/kernel/pi/map_kernel.c
index a90c4d6fc75c35d0..c1a5bef4e10a49d7 100644
--- a/arch/arm64/kernel/pi/map_kernel.c
+++ b/arch/arm64/kernel/pi/map_kernel.c
@@ -166,6 +166,9 @@ asmlinkage void __init early_map_kernel(u64 boot_status, void *fdt)
chosen = fdt_path_offset(fdt, chosen_str);
init_feature_override(boot_status, fdt, chosen);
+ if (VA_BITS > VA_BITS_MIN && cpu_has_lva())
+ sysreg_clear_set(tcr_el1, TCR_T1SZ_MASK, TCR_T1SZ(VA_BITS));
+
/*
* The virtual KASLR displacement modulo 2MiB is decided by the
* physical placement of the image, as otherwise, we might not be able
diff --git a/arch/arm64/kernel/sleep.S b/arch/arm64/kernel/sleep.S
index 2ae7cff1953aaf87..353e71bd40a1e1e5 100644
--- a/arch/arm64/kernel/sleep.S
+++ b/arch/arm64/kernel/sleep.S
@@ -102,9 +102,6 @@ SYM_CODE_START(cpu_resume)
mov x0, xzr
bl init_kernel_el
mov x19, x0 // preserve boot mode
-#if VA_BITS > 48
- ldr_l x0, vabits_actual
-#endif
bl __cpu_setup
/* enable the MMU early - so we can access sleep_save_stash by va */
adrp x1, swapper_pg_dir
diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index 81634ff5f6a67476..914745697fb8b30c 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -43,11 +43,6 @@
#define NO_CONT_MAPPINGS BIT(1)
#define NO_EXEC_MAPPINGS BIT(2) /* assumes FEAT_HPDS is not used */
-#if VA_BITS > 48
-u64 vabits_actual __ro_after_init = VA_BITS_MIN;
-EXPORT_SYMBOL(vabits_actual);
-#endif
-
u64 kimage_voffset __ro_after_init;
EXPORT_SYMBOL(kimage_voffset);
diff --git a/arch/arm64/mm/proc.S b/arch/arm64/mm/proc.S
index d0748f18b2abdf0e..a5dc009b7dd5c141 100644
--- a/arch/arm64/mm/proc.S
+++ b/arch/arm64/mm/proc.S
@@ -396,8 +396,6 @@ SYM_FUNC_END(idmap_kpti_install_ng_mappings)
*
* Initialise the processor for turning the MMU on.
*
- * Input:
- * x0 - actual number of VA bits (ignored unless VA_BITS > 48)
* Output:
* Return in x0 the value of the SCTLR_EL1 register.
*/
@@ -422,16 +420,17 @@ SYM_FUNC_START(__cpu_setup)
mair .req x17
tcr .req x16
mov_q mair, MAIR_EL1_SET
- mov_q tcr, TCR_T0SZ(48) | TCR_T1SZ(VA_BITS) | TCR_CACHE_FLAGS | \
+ mov_q tcr, TCR_T0SZ(48) | TCR_T1SZ(VA_BITS_MIN) | TCR_CACHE_FLAGS | \
TCR_SMP_FLAGS | TCR_TG_FLAGS | TCR_KASLR_FLAGS | TCR_ASID16 | \
TCR_TBI0 | TCR_A1 | TCR_KASAN_SW_FLAGS | TCR_MTE_FLAGS
tcr_clear_errata_bits tcr, x9, x5
#ifdef CONFIG_ARM64_VA_BITS_52
- sub x9, xzr, x0
- add x9, x9, #64
+ mov x9, #64 - VA_BITS
+alternative_if ARM64_HAS_VA52
tcr_set_t1sz tcr, x9
+alternative_else_nop_endif
#endif
/*
diff --git a/arch/arm64/tools/cpucaps b/arch/arm64/tools/cpucaps
index 37b1340e96466411..fce10148f000a8a4 100644
--- a/arch/arm64/tools/cpucaps
+++ b/arch/arm64/tools/cpucaps
@@ -43,6 +43,7 @@ HAS_SB
HAS_STAGE2_FWB
HAS_TIDCP1
HAS_TLB_RANGE
+HAS_VA52
HAS_VIRT_HOST_EXTN
HAS_WFXT
HW_DBM
--
2.39.2
Update Kconfig to permit 4k and 16k granule configurations to be built
with 52-bit virtual addressing, now that all the prerequisites are in
place.
While at it, update the feature description so it matches on the
appropriate feature bits depending on the page size. For simplicity,
let's just keep ARM64_HAS_VA52 as the feature name.
Note that LPA2 based 52-bit virtual addressing requires 52-bit physical
addressing support to be enabled as well, as programming TCR.TxSZ to
values below 16 is not allowed unless TCR.DS is set, which is what
activates the 52-bit physical addressing support.
While supporting the converse (52-bit physical addressing without 52-bit
virtual addressing) would be possible in principle, let's keep things
simple, by only allowing these features to be enabled at the same time.
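To make the dependency concrete: VA_BITS=52 corresponds to TCR.T1SZ =
64 - 52 = 12, which is below the limit of 16 that applies to the 4k and
16k granules while TCR.DS is clear, so the 52-bit VA option is only
meaningful in combination with ARM64_PA_BITS_52 on those granules.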
Signed-off-by: Ard Biesheuvel <[email protected]>
---
arch/arm64/Kconfig | 17 ++++++++-------
arch/arm64/kernel/cpufeature.c | 22 ++++++++++++++++----
2 files changed, 28 insertions(+), 11 deletions(-)
diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 52aac583823863e4..938fe1d090a5bb4e 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -351,7 +351,9 @@ config PGTABLE_LEVELS
default 3 if ARM64_64K_PAGES && (ARM64_VA_BITS_48 || ARM64_VA_BITS_52)
default 3 if ARM64_4K_PAGES && ARM64_VA_BITS_39
default 3 if ARM64_16K_PAGES && ARM64_VA_BITS_47
+ default 4 if ARM64_16K_PAGES && (ARM64_VA_BITS_48 || ARM64_VA_BITS_52)
default 4 if !ARM64_64K_PAGES && ARM64_VA_BITS_48
+ default 5 if ARM64_4K_PAGES && ARM64_VA_BITS_52
config ARCH_SUPPORTS_UPROBES
def_bool y
@@ -365,13 +367,13 @@ config BROKEN_GAS_INST
config KASAN_SHADOW_OFFSET
hex
depends on KASAN_GENERIC || KASAN_SW_TAGS
- default 0xdfff800000000000 if (ARM64_VA_BITS_48 || ARM64_VA_BITS_52) && !KASAN_SW_TAGS
- default 0xdfffc00000000000 if ARM64_VA_BITS_47 && !KASAN_SW_TAGS
+ default 0xdfff800000000000 if (ARM64_VA_BITS_48 || (ARM64_VA_BITS_52 && !ARM64_16K_PAGES)) && !KASAN_SW_TAGS
+ default 0xdfffc00000000000 if (ARM64_VA_BITS_47 || ARM64_VA_BITS_52) && ARM64_16K_PAGES && !KASAN_SW_TAGS
default 0xdffffe0000000000 if ARM64_VA_BITS_42 && !KASAN_SW_TAGS
default 0xdfffffc000000000 if ARM64_VA_BITS_39 && !KASAN_SW_TAGS
default 0xdffffff800000000 if ARM64_VA_BITS_36 && !KASAN_SW_TAGS
- default 0xefff800000000000 if (ARM64_VA_BITS_48 || ARM64_VA_BITS_52) && KASAN_SW_TAGS
- default 0xefffc00000000000 if ARM64_VA_BITS_47 && KASAN_SW_TAGS
+ default 0xefff800000000000 if (ARM64_VA_BITS_48 || (ARM64_VA_BITS_52 && !ARM64_16K_PAGES)) && KASAN_SW_TAGS
+ default 0xefffc00000000000 if (ARM64_VA_BITS_47 || ARM64_VA_BITS_52) && ARM64_16K_PAGES && KASAN_SW_TAGS
default 0xeffffe0000000000 if ARM64_VA_BITS_42 && KASAN_SW_TAGS
default 0xefffffc000000000 if ARM64_VA_BITS_39 && KASAN_SW_TAGS
default 0xeffffff800000000 if ARM64_VA_BITS_36 && KASAN_SW_TAGS
@@ -1220,7 +1222,7 @@ config ARM64_VA_BITS_48
config ARM64_VA_BITS_52
bool "52-bit"
- depends on ARM64_64K_PAGES && (ARM64_PAN || !ARM64_SW_TTBR0_PAN)
+ depends on ARM64_PAN || !ARM64_SW_TTBR0_PAN
help
Enable 52-bit virtual addressing for userspace when explicitly
requested via a hint to mmap(). The kernel will also use 52-bit
@@ -1267,10 +1269,11 @@ choice
config ARM64_PA_BITS_48
bool "48-bit"
+ depends on ARM64_64K_PAGES || !ARM64_VA_BITS_52
config ARM64_PA_BITS_52
- bool "52-bit (ARMv8.2)"
- depends on ARM64_64K_PAGES
+ bool "52-bit"
+ depends on ARM64_64K_PAGES || ARM64_VA_BITS_52
depends on ARM64_PAN || !ARM64_SW_TTBR0_PAN
help
Enable support for a 52-bit physical address space, introduced as
diff --git a/arch/arm64/kernel/cpufeature.c b/arch/arm64/kernel/cpufeature.c
index 62f6d397f27b07c0..4487745e9b660c5f 100644
--- a/arch/arm64/kernel/cpufeature.c
+++ b/arch/arm64/kernel/cpufeature.c
@@ -2763,15 +2763,29 @@ static const struct arm64_cpu_capabilities arm64_features[] = {
},
#ifdef CONFIG_ARM64_VA_BITS_52
{
- .desc = "52-bit Virtual Addressing (LVA)",
.capability = ARM64_HAS_VA52,
.type = ARM64_CPUCAP_BOOT_CPU_FEATURE,
- .sys_reg = SYS_ID_AA64MMFR2_EL1,
- .sign = FTR_UNSIGNED,
+ .matches = has_cpuid_feature,
.field_width = 4,
+#ifdef CONFIG_ARM64_64K_PAGES
+ .desc = "52-bit Virtual Addressing (LVA)",
+ .sign = FTR_SIGNED,
+ .sys_reg = SYS_ID_AA64MMFR2_EL1,
.field_pos = ID_AA64MMFR2_EL1_VARange_SHIFT,
- .matches = has_cpuid_feature,
.min_field_value = ID_AA64MMFR2_EL1_VARange_52,
+#else
+ .desc = "52-bit Virtual Addressing (LPA2)",
+ .sys_reg = SYS_ID_AA64MMFR0_EL1,
+#ifdef CONFIG_ARM64_4K_PAGES
+ .sign = FTR_SIGNED,
+ .field_pos = ID_AA64MMFR0_EL1_TGRAN4_SHIFT,
+ .min_field_value = ID_AA64MMFR0_EL1_TGRAN4_52_BIT,
+#else
+ .sign = FTR_UNSIGNED,
+ .field_pos = ID_AA64MMFR0_EL1_TGRAN16_SHIFT,
+ .min_field_value = ID_AA64MMFR0_EL1_TGRAN16_52_BIT,
+#endif
+#endif
},
#endif
{},
--
2.39.2
When LPA2 is enabled, bits 8 and 9 of page and block descriptors become
part of the output address instead of carrying shareability attributes
for the region in question.
So avoid setting these bits if TCR.DS == 1, which means LPA2 is enabled.
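A standalone sketch (not kernel code) of the resulting PTE_MAYBE_SHARED
behaviour: the shareability field (bits [9:8]) is only included in the
default protection bits while TCR.DS is clear.

  #include <stdint.h>
  #include <stdio.h>

  #define PTE_SHARED      (3ULL << 8)     /* SH[1:0] = inner shareable */
  #define TCR_DS          (1ULL << 59)

  /* mirrors PTE_MAYBE_SHARED: drop the SH bits when LPA2 (TCR.DS) is active */
  static uint64_t pte_maybe_shared(uint64_t tcr)
  {
          return (tcr & TCR_DS) ? 0 : PTE_SHARED;
  }

  int main(void)
  {
          printf("DS=0: %#llx\n", (unsigned long long)pte_maybe_shared(0));
          printf("DS=1: %#llx\n", (unsigned long long)pte_maybe_shared(TCR_DS));
          return 0;
  }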
Signed-off-by: Ard Biesheuvel <[email protected]>
---
arch/arm64/Kconfig | 4 ++++
arch/arm64/include/asm/pgtable-prot.h | 18 ++++++++++++++++--
arch/arm64/mm/mmap.c | 4 ++++
3 files changed, 24 insertions(+), 2 deletions(-)
diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 1023e896d46b8969..d287dad29198c843 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -1287,6 +1287,10 @@ config ARM64_PA_BITS
default 48 if ARM64_PA_BITS_48
default 52 if ARM64_PA_BITS_52
+config ARM64_LPA2
+ def_bool y
+ depends on ARM64_PA_BITS_52 && !ARM64_64K_PAGES
+
choice
prompt "Endianness"
default CPU_LITTLE_ENDIAN
diff --git a/arch/arm64/include/asm/pgtable-prot.h b/arch/arm64/include/asm/pgtable-prot.h
index 9b165117a454595a..269584d5a2c017fc 100644
--- a/arch/arm64/include/asm/pgtable-prot.h
+++ b/arch/arm64/include/asm/pgtable-prot.h
@@ -40,6 +40,20 @@ extern bool arm64_use_ng_mappings;
#define PTE_MAYBE_NG (arm64_use_ng_mappings ? PTE_NG : 0)
#define PMD_MAYBE_NG (arm64_use_ng_mappings ? PMD_SECT_NG : 0)
+#ifndef CONFIG_ARM64_LPA2
+#define lpa2_is_enabled() false
+#define PTE_MAYBE_SHARED PTE_SHARED
+#define PMD_MAYBE_SHARED PMD_SECT_S
+#else
+static inline bool __pure lpa2_is_enabled(void)
+{
+ return read_tcr() & TCR_DS;
+}
+
+#define PTE_MAYBE_SHARED (lpa2_is_enabled() ? 0 : PTE_SHARED)
+#define PMD_MAYBE_SHARED (lpa2_is_enabled() ? 0 : PMD_SECT_S)
+#endif
+
/*
* If we have userspace only BTI we don't want to mark kernel pages
* guarded even if the system does support BTI.
@@ -50,8 +64,8 @@ extern bool arm64_use_ng_mappings;
#define PTE_MAYBE_GP 0
#endif
-#define PROT_DEFAULT (_PROT_DEFAULT | PTE_MAYBE_NG)
-#define PROT_SECT_DEFAULT (_PROT_SECT_DEFAULT | PMD_MAYBE_NG)
+#define PROT_DEFAULT (PTE_TYPE_PAGE | PTE_MAYBE_NG | PTE_MAYBE_SHARED | PTE_AF)
+#define PROT_SECT_DEFAULT (PMD_TYPE_SECT | PMD_MAYBE_NG | PMD_MAYBE_SHARED | PMD_SECT_AF)
#define PROT_DEVICE_nGnRnE (PROT_DEFAULT | PTE_PXN | PTE_UXN | PTE_WRITE | PTE_ATTRINDX(MT_DEVICE_nGnRnE))
#define PROT_DEVICE_nGnRE (PROT_DEFAULT | PTE_PXN | PTE_UXN | PTE_WRITE | PTE_ATTRINDX(MT_DEVICE_nGnRE))
diff --git a/arch/arm64/mm/mmap.c b/arch/arm64/mm/mmap.c
index 8f5b7ce857ed4a8f..adcf547f74eb8e60 100644
--- a/arch/arm64/mm/mmap.c
+++ b/arch/arm64/mm/mmap.c
@@ -73,6 +73,10 @@ static int __init adjust_protection_map(void)
protection_map[VM_EXEC | VM_SHARED] = PAGE_EXECONLY;
}
+ if (lpa2_is_enabled())
+ for (int i = 0; i < ARRAY_SIZE(protection_map); i++)
+ pgprot_val(protection_map[i]) &= ~PTE_SHARED;
+
return 0;
}
arch_initcall(adjust_protection_map);
--
2.39.2
We typically enable support in defconfig for all architectural features
for which we can detect at runtime whether the hardware actually supports
them.
Now that we have implemented support for LPA2 based 52-bit virtual
addressing in a way that should not impact 48-bit operation on non-LPA2
CPU, we can do the same, and enable 52-bit virtual addressing by
default.
Signed-off-by: Ard Biesheuvel <[email protected]>
---
arch/arm64/configs/defconfig | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/arch/arm64/configs/defconfig b/arch/arm64/configs/defconfig
index 7790ee42c68a88d7..728b29549e058e94 100644
--- a/arch/arm64/configs/defconfig
+++ b/arch/arm64/configs/defconfig
@@ -69,7 +69,7 @@ CONFIG_ARCH_VEXPRESS=y
CONFIG_ARCH_VISCONTI=y
CONFIG_ARCH_XGENE=y
CONFIG_ARCH_ZYNQMP=y
-CONFIG_ARM64_VA_BITS_48=y
+CONFIG_ARM64_VA_BITS_52=y
CONFIG_SCHED_MC=y
CONFIG_SCHED_SMT=y
CONFIG_NUMA=y
--
2.39.2
Add a hook to permit architectures to perform validation on the prot
flags passed to mmap(), like arch_validate_prot() does for mprotect().
This will be used by arm64 to reject PROT_WRITE+PROT_EXEC mappings on
configurations that run with WXN enabled.
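As an illustration of how an architecture would opt in (a hypothetical sketch,
not the arm64 implementation, which appears later in the series):

	/* in the arch's <asm/mman.h>; hypothetical policy: refuse W+X mappings */
	static inline bool arch_validate_mmap_prot(unsigned long prot,
						   unsigned long addr)
	{
		return (prot & (PROT_WRITE | PROT_EXEC)) !=
		       (PROT_WRITE | PROT_EXEC);
	}
	#define arch_validate_mmap_prot arch_validate_mmap_prot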
Reviewed-by: Kees Cook <[email protected]>
Signed-off-by: Ard Biesheuvel <[email protected]>
---
include/linux/mman.h | 15 +++++++++++++++
mm/mmap.c | 3 +++
2 files changed, 18 insertions(+)
diff --git a/include/linux/mman.h b/include/linux/mman.h
index cee1e4b566d80095..25ffe6948fa21ff8 100644
--- a/include/linux/mman.h
+++ b/include/linux/mman.h
@@ -120,6 +120,21 @@ static inline bool arch_validate_flags(unsigned long flags)
#define arch_validate_flags arch_validate_flags
#endif
+#ifndef arch_validate_mmap_prot
+/*
+ * This is called from mmap(), which ignores unknown prot bits so the default
+ * is to accept anything.
+ *
+ * Returns true if the prot flags are valid
+ */
+static inline bool arch_validate_mmap_prot(unsigned long prot,
+ unsigned long addr)
+{
+ return true;
+}
+#define arch_validate_mmap_prot arch_validate_mmap_prot
+#endif
+
/*
* Optimisation macro. It is equivalent to:
* (x & bit1) ? bit2 : 0
diff --git a/mm/mmap.c b/mm/mmap.c
index 740b54be3ed4140f..02ccd45df0194796 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1214,6 +1214,9 @@ unsigned long do_mmap(struct file *file, unsigned long addr,
if (!(file && path_noexec(&file->f_path)))
prot |= PROT_EXEC;
+ if (!arch_validate_mmap_prot(prot, addr))
+ return -EACCES;
+
/* force arch specific MAP_FIXED handling in get_unmapped_area */
if (flags & MAP_FIXED_NOREPLACE)
flags |= MAP_FIXED;
--
2.39.2
The AArch64 virtual memory system supports a global WXN control, which
can be enabled to make all writable mappings implicitly no-exec. This is
a useful hardening feature, as it prevents mistakes in managing page
table permissions from being exploited to attack the system.
When enabled at EL1, the restrictions apply to both EL1 and EL0. EL1 is
completely under our control, and has been cleaned up to allow WXN to be
enabled from boot onwards. EL0 is not under our control, but given that
widely deployed security features such as selinux or PaX already limit
the ability of user space to create mappings that are writable and
executable at the same time, the impact of enabling this for EL0 is
expected to be limited. (For this reason, common user space libraries
that have a legitimate need for manipulating executable code already
carry fallbacks such as [0].)
If enabled at compile time, the feature can still be disabled at boot if
needed, by passing arm64.nowxn on the kernel command line.
[0] https://github.com/libffi/libffi/blob/master/src/closures.c#L440
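A user space JIT would typically cope with the resulting mmap() failure along
these lines (a sketch of the common fallback pattern, assuming memfd_create()
is available; not taken from libffi):

	#define _GNU_SOURCE
	#include <errno.h>
	#include <sys/mman.h>
	#include <unistd.h>

	static void *alloc_code_buffer(size_t size, void **rw, void **rx)
	{
		void *p = mmap(NULL, size, PROT_READ | PROT_WRITE | PROT_EXEC,
			       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
		if (p != MAP_FAILED) {	/* WXN (or selinux/PaX) not in effect */
			*rw = *rx = p;
			return p;
		}
		/* EACCES: fall back to two views of the same memfd */
		int fd = memfd_create("jit", MFD_CLOEXEC);
		ftruncate(fd, size);
		*rw = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
		*rx = mmap(NULL, size, PROT_READ | PROT_EXEC, MAP_SHARED, fd, 0);
		return *rx;	/* emit code via *rw, execute via *rx */
	}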
Signed-off-by: Ard Biesheuvel <[email protected]>
Reviewed-by: Kees Cook <[email protected]>
---
arch/arm64/Kconfig | 11 ++++++
arch/arm64/include/asm/cpufeature.h | 10 ++++++
arch/arm64/include/asm/mman.h | 36 ++++++++++++++++++++
arch/arm64/include/asm/mmu_context.h | 30 +++++++++++++++-
arch/arm64/kernel/pi/idreg-override.c | 4 ++-
arch/arm64/kernel/pi/map_kernel.c | 24 +++++++++++++
arch/arm64/mm/proc.S | 6 ++++
7 files changed, 119 insertions(+), 2 deletions(-)
diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 938fe1d090a5bb4e..4262f3f784696d94 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -1552,6 +1552,17 @@ config RODATA_FULL_DEFAULT_ENABLED
This requires the linear region to be mapped down to pages,
which may adversely affect performance in some cases.
+config ARM64_WXN
+ bool "Enable WXN attribute so all writable mappings are non-exec"
+ help
+ Set the WXN bit in the SCTLR system register so that all writable
+ mappings are treated as if the PXN/UXN bit is set as well.
+ If this is set to Y, it can still be disabled at runtime by
+ passing 'arm64.nowxn' on the kernel command line.
+
+ This should only be set if no software needs to be supported that
+ relies on being able to execute from writable mappings.
+
config ARM64_SW_TTBR0_PAN
bool "Emulate Privileged Access Never using TTBR0_EL1 switching"
help
diff --git a/arch/arm64/include/asm/cpufeature.h b/arch/arm64/include/asm/cpufeature.h
index 170e18cb2b4faf11..9a5a373a3fda7f58 100644
--- a/arch/arm64/include/asm/cpufeature.h
+++ b/arch/arm64/include/asm/cpufeature.h
@@ -17,6 +17,7 @@
#define ARM64_SW_FEATURE_OVERRIDE_NOKASLR 0
#define ARM64_SW_FEATURE_OVERRIDE_RODATA_OFF 4
+#define ARM64_SW_FEATURE_OVERRIDE_NOWXN 8
#ifndef __ASSEMBLY__
@@ -932,6 +933,15 @@ extern struct arm64_ftr_override id_aa64isar2_override;
extern struct arm64_ftr_override arm64_sw_feature_override;
+static inline bool arm64_wxn_enabled(void)
+{
+ if (!IS_ENABLED(CONFIG_ARM64_WXN) ||
+ cpuid_feature_extract_unsigned_field(arm64_sw_feature_override.val,
+ ARM64_SW_FEATURE_OVERRIDE_NOWXN))
+ return false;
+ return true;
+}
+
u32 get_kvm_ipa_limit(void);
void dump_cpu_features(void);
diff --git a/arch/arm64/include/asm/mman.h b/arch/arm64/include/asm/mman.h
index 5966ee4a61542edf..6d4940342ba73060 100644
--- a/arch/arm64/include/asm/mman.h
+++ b/arch/arm64/include/asm/mman.h
@@ -35,11 +35,40 @@ static inline unsigned long arch_calc_vm_flag_bits(unsigned long flags)
}
#define arch_calc_vm_flag_bits(flags) arch_calc_vm_flag_bits(flags)
+static inline bool arm64_check_wx_prot(unsigned long prot,
+ struct task_struct *tsk)
+{
+ /*
+ * When we are running with SCTLR_ELx.WXN==1, writable mappings are
+ * implicitly non-executable. This means we should reject such mappings
+ * when user space attempts to create them using mmap() or mprotect().
+ */
+ if (arm64_wxn_enabled() &&
+ ((prot & (PROT_WRITE | PROT_EXEC)) == (PROT_WRITE | PROT_EXEC))) {
+ /*
+ * User space libraries such as libffi carry elaborate
+ * heuristics to decide whether it is worth it to even attempt
+ * to create writable executable mappings, as PaX or selinux
+ * enabled systems will outright reject it. They will usually
+ * fall back to something else (e.g., two separate shared
+ * mmap()s of a temporary file) on failure.
+ */
+ pr_info_ratelimited(
+ "process %s (%d) attempted to create PROT_WRITE+PROT_EXEC mapping\n",
+ tsk->comm, tsk->pid);
+ return false;
+ }
+ return true;
+}
+
static inline bool arch_validate_prot(unsigned long prot,
unsigned long addr __always_unused)
{
unsigned long supported = PROT_READ | PROT_WRITE | PROT_EXEC | PROT_SEM;
+ if (!arm64_check_wx_prot(prot, current))
+ return false;
+
if (system_supports_bti())
supported |= PROT_BTI;
@@ -50,6 +79,13 @@ static inline bool arch_validate_prot(unsigned long prot,
}
#define arch_validate_prot(prot, addr) arch_validate_prot(prot, addr)
+static inline bool arch_validate_mmap_prot(unsigned long prot,
+ unsigned long addr)
+{
+ return arm64_check_wx_prot(prot, current);
+}
+#define arch_validate_mmap_prot arch_validate_mmap_prot
+
static inline bool arch_validate_flags(unsigned long vm_flags)
{
if (!system_supports_mte())
diff --git a/arch/arm64/include/asm/mmu_context.h b/arch/arm64/include/asm/mmu_context.h
index bc1cef5002d60e02..910c6e009515745d 100644
--- a/arch/arm64/include/asm/mmu_context.h
+++ b/arch/arm64/include/asm/mmu_context.h
@@ -20,13 +20,41 @@
#include <asm/cpufeature.h>
#include <asm/daifflags.h>
#include <asm/proc-fns.h>
-#include <asm-generic/mm_hooks.h>
#include <asm/cputype.h>
#include <asm/sysreg.h>
#include <asm/tlbflush.h>
extern bool rodata_full;
+static inline int arch_dup_mmap(struct mm_struct *oldmm,
+ struct mm_struct *mm)
+{
+ return 0;
+}
+
+static inline void arch_exit_mmap(struct mm_struct *mm)
+{
+}
+
+static inline void arch_unmap(struct mm_struct *mm,
+ unsigned long start, unsigned long end)
+{
+}
+
+static inline bool arch_vma_access_permitted(struct vm_area_struct *vma,
+ bool write, bool execute, bool foreign)
+{
+ if (IS_ENABLED(CONFIG_ARM64_WXN) && execute &&
+ (vma->vm_flags & (VM_WRITE | VM_EXEC)) == (VM_WRITE | VM_EXEC)) {
+ pr_warn_ratelimited(
+ "process %s (%d) attempted to execute from writable memory\n",
+ current->comm, current->pid);
+ /* disallow unless the nowxn override is set */
+ return !arm64_wxn_enabled();
+ }
+ return true;
+}
+
static inline void contextidr_thread_switch(struct task_struct *next)
{
if (!IS_ENABLED(CONFIG_PID_IN_CONTEXTIDR))
diff --git a/arch/arm64/kernel/pi/idreg-override.c b/arch/arm64/kernel/pi/idreg-override.c
index c4ae5ffe5cb0c999..0ceec20669e50913 100644
--- a/arch/arm64/kernel/pi/idreg-override.c
+++ b/arch/arm64/kernel/pi/idreg-override.c
@@ -181,6 +181,7 @@ static const struct ftr_set_desc sw_features __prel64_initconst = {
.fields = {
FIELD("nokaslr", ARM64_SW_FEATURE_OVERRIDE_NOKASLR, NULL),
FIELD("rodataoff", ARM64_SW_FEATURE_OVERRIDE_RODATA_OFF, NULL),
+ FIELD("nowxn", ARM64_SW_FEATURE_OVERRIDE_NOWXN, NULL),
{}
},
};
@@ -214,8 +215,9 @@ static const struct {
"id_aa64isar2.gpa3=0 id_aa64isar2.apa3=0" },
{ "arm64.nomte", "id_aa64pfr1.mte=0" },
{ "nokaslr", "arm64_sw.nokaslr=1" },
- { "rodata=off", "arm64_sw.rodataoff=1" },
+ { "rodata=off", "arm64_sw.rodataoff=1 arm64_sw.nowxn=1" },
{ "arm64.nolva", "id_aa64mmfr2.varange=0" },
+ { "arm64.nowxn", "arm64_sw.nowxn=1" },
};
static int __init parse_hexdigit(const char *p, u64 *v)
diff --git a/arch/arm64/kernel/pi/map_kernel.c b/arch/arm64/kernel/pi/map_kernel.c
index a14b7c1236b5707c..2d02fa94e48c92e1 100644
--- a/arch/arm64/kernel/pi/map_kernel.c
+++ b/arch/arm64/kernel/pi/map_kernel.c
@@ -133,6 +133,25 @@ static void __init map_kernel(u64 kaslr_offset, u64 va_offset, int root_level)
idmap_cpu_replace_ttbr1(swapper_pg_dir);
}
+static void noinline __section(".idmap.text") disable_wxn(void)
+{
+ u64 sctlr = read_sysreg(sctlr_el1) & ~SCTLR_ELx_WXN;
+
+ /*
+ * We cannot safely clear the WXN bit while the MMU and caches are on,
+ * so turn the MMU off, flush the TLBs and turn it on again but with
+ * the WXN bit cleared this time.
+ */
+ asm(" msr sctlr_el1, %0 ;"
+ " isb ;"
+ " tlbi vmalle1 ;"
+ " dsb nsh ;"
+ " isb ;"
+ " msr sctlr_el1, %1 ;"
+ " isb ;"
+ :: "r"(sctlr & ~SCTLR_ELx_M), "r"(sctlr));
+}
+
static void noinline __section(".idmap.text") set_ttbr0_for_lpa2(u64 ttbr)
{
u64 sctlr = read_sysreg(sctlr_el1);
@@ -230,6 +249,11 @@ asmlinkage void __init early_map_kernel(u64 boot_status, void *fdt)
if (va_bits > VA_BITS_MIN)
sysreg_clear_set(tcr_el1, TCR_T1SZ_MASK, TCR_T1SZ(va_bits));
+ if (IS_ENABLED(CONFIG_ARM64_WXN) &&
+ cpuid_feature_extract_unsigned_field(arm64_sw_feature_override.val,
+ ARM64_SW_FEATURE_OVERRIDE_NOWXN))
+ disable_wxn();
+
/*
* The virtual KASLR displacement modulo 2MiB is decided by the
* physical placement of the image, as otherwise, we might not be able
diff --git a/arch/arm64/mm/proc.S b/arch/arm64/mm/proc.S
index 68f8337cd8bf6ab9..1df7a031bcb40341 100644
--- a/arch/arm64/mm/proc.S
+++ b/arch/arm64/mm/proc.S
@@ -516,6 +516,12 @@ alternative_else_nop_endif
* Prepare SCTLR
*/
mov_q x0, INIT_SCTLR_EL1_MMU_ON
+#ifdef CONFIG_ARM64_WXN
+ ldr_l x1, arm64_sw_feature_override + FTR_OVR_VAL_OFFSET
+ tst x1, #0xf << ARM64_SW_FEATURE_OVERRIDE_NOWXN
+ orr x1, x0, #SCTLR_ELx_WXN
+ csel x0, x0, x1, ne
+#endif
ret // return to head.S
.unreq mair
--
2.39.2
Update the early kernel mapping code to take 52-bit virtual addressing
into account based on the LPA2 feature. This is a bit more involved than
LVA (which is supported with 64k pages only), given that some page table
descriptor bits change meaning in this case.
To keep the handling in asm to a minimum, the initial ID map is still
created with 48-bit virtual addressing, which implies that the kernel
image must be loaded into 48-bit addressable physical memory. This is
currently required by the boot protocol, even though we happen to
support placement outside of that for LVA/64k based configurations.
Enabling LPA2 involves more than setting TCR.T1SZ to a lower value:
there is also a DS bit in TCR that needs to be set, and which changes
the meaning of bits [9:8] in all page table descriptors. Since we cannot
enable DS and update every live page table descriptor at the same time,
let's pivot through another temporary mapping. This avoids the need to
reintroduce manipulations of the page tables with the MMU and caches
disabled.
To permit the LPA2 feature to be overridden on the kernel command line,
which may be necessary to work around silicon errata, or to deal with
mismatched features on heterogeneous SoC designs, test for CPU feature
overrides first, and only then enable LPA2.
Signed-off-by: Ard Biesheuvel <[email protected]>
---
arch/arm64/include/asm/assembler.h | 8 ++-
arch/arm64/include/asm/cpufeature.h | 18 +++++
arch/arm64/include/asm/memory.h | 4 ++
arch/arm64/kernel/head.S | 8 +++
arch/arm64/kernel/image-vars.h | 1 +
arch/arm64/kernel/pi/map_kernel.c | 70 +++++++++++++++++++-
arch/arm64/kernel/pi/map_range.c | 11 ++-
arch/arm64/kernel/pi/pi.h | 4 +-
arch/arm64/mm/init.c | 2 +-
arch/arm64/mm/mmu.c | 6 +-
arch/arm64/mm/proc.S | 3 +
11 files changed, 124 insertions(+), 11 deletions(-)
diff --git a/arch/arm64/include/asm/assembler.h b/arch/arm64/include/asm/assembler.h
index 55e8731844cf7eb7..d5e139ce0820479b 100644
--- a/arch/arm64/include/asm/assembler.h
+++ b/arch/arm64/include/asm/assembler.h
@@ -581,11 +581,17 @@ alternative_endif
* but we have to add an offset so that the TTBR1 address corresponds with the
* pgdir entry that covers the lowest 48-bit addressable VA.
*
+ * Note that this trick is only used for LVA/64k pages - LPA2/4k pages use an
+ * additional paging level, and on LPA2/16k pages, we would end up with a root
+ * level table with only 2 entries, which is suboptimal in terms of TLB
+ * utilization, so there we fall back to 47 bits of translation if LPA2 is not
+ * supported.
+ *
* orr is used as it can cover the immediate value (and is idempotent).
* ttbr: Value of ttbr to set, modified.
*/
.macro offset_ttbr1, ttbr, tmp
-#ifdef CONFIG_ARM64_VA_BITS_52
+#if defined(CONFIG_ARM64_VA_BITS_52) && !defined(CONFIG_ARM64_LPA2)
mrs \tmp, tcr_el1
and \tmp, \tmp, #TCR_T1SZ_MASK
cmp \tmp, #TCR_T1SZ(VA_BITS_MIN)
diff --git a/arch/arm64/include/asm/cpufeature.h b/arch/arm64/include/asm/cpufeature.h
index 7faf9a48339e7c8c..170e18cb2b4faf11 100644
--- a/arch/arm64/include/asm/cpufeature.h
+++ b/arch/arm64/include/asm/cpufeature.h
@@ -1002,6 +1002,24 @@ static inline bool cpu_has_lva(void)
ID_AA64MMFR2_EL1_VARange_SHIFT);
}
+static inline bool cpu_has_lpa2(void)
+{
+#ifdef CONFIG_ARM64_LPA2
+ u64 mmfr0;
+ int feat;
+
+ mmfr0 = read_sysreg(id_aa64mmfr0_el1);
+ mmfr0 &= ~id_aa64mmfr0_override.mask;
+ mmfr0 |= id_aa64mmfr0_override.val;
+ feat = cpuid_feature_extract_signed_field(mmfr0,
+ ID_AA64MMFR0_EL1_TGRAN_SHIFT);
+
+ return feat >= ID_AA64MMFR0_EL1_TGRAN_LPA2;
+#else
+ return false;
+#endif
+}
+
#endif /* __ASSEMBLY__ */
#endif
diff --git a/arch/arm64/include/asm/memory.h b/arch/arm64/include/asm/memory.h
index 3e32d957aadcb2bb..3163f0748d90b313 100644
--- a/arch/arm64/include/asm/memory.h
+++ b/arch/arm64/include/asm/memory.h
@@ -54,7 +54,11 @@
#define FIXADDR_TOP (ULONG_MAX - SZ_8M + 1)
#if VA_BITS > 48
+#ifdef CONFIG_ARM64_16K_PAGES
+#define VA_BITS_MIN (47)
+#else
#define VA_BITS_MIN (48)
+#endif
#else
#define VA_BITS_MIN (VA_BITS)
#endif
diff --git a/arch/arm64/kernel/head.S b/arch/arm64/kernel/head.S
index 47ebe3242d7feb7e..17b2a5518f5b7f3b 100644
--- a/arch/arm64/kernel/head.S
+++ b/arch/arm64/kernel/head.S
@@ -89,6 +89,7 @@ SYM_CODE_START(primary_entry)
mov sp, x1
mov x29, xzr
adrp x0, init_idmap_pg_dir
+ mov x1, xzr
bl __pi_create_init_idmap
/*
@@ -471,9 +472,16 @@ SYM_FUNC_END(__enable_mmu)
#ifdef CONFIG_ARM64_VA_BITS_52
SYM_FUNC_START(__cpu_secondary_check52bitva)
+#ifndef CONFIG_ARM64_LPA2
mrs_s x0, SYS_ID_AA64MMFR2_EL1
and x0, x0, #(0xf << ID_AA64MMFR2_EL1_VARange_SHIFT)
cbnz x0, 2f
+#else
+ mrs x0, id_aa64mmfr0_el1
+ sbfx x0, x0, #ID_AA64MMFR0_EL1_TGRAN_SHIFT, 4
+ cmp x0, #ID_AA64MMFR0_EL1_TGRAN_LPA2
+ b.ge 2f
+#endif
update_early_cpu_boot_status \
CPU_STUCK_IN_KERNEL | CPU_STUCK_REASON_52_BIT_VA, x0, x1
diff --git a/arch/arm64/kernel/image-vars.h b/arch/arm64/kernel/image-vars.h
index 79a7e0e3edd1aa21..15e396c9113189f9 100644
--- a/arch/arm64/kernel/image-vars.h
+++ b/arch/arm64/kernel/image-vars.h
@@ -49,6 +49,7 @@ PROVIDE(__pi__ctype = _ctype);
PROVIDE(__pi_memstart_offset_seed = memstart_offset_seed);
PROVIDE(__pi_init_idmap_pg_dir = init_idmap_pg_dir);
+PROVIDE(__pi_init_idmap_pg_end = init_idmap_pg_end);
PROVIDE(__pi_init_pg_dir = init_pg_dir);
PROVIDE(__pi_init_pg_end = init_pg_end);
PROVIDE(__pi_swapper_pg_dir = swapper_pg_dir);
diff --git a/arch/arm64/kernel/pi/map_kernel.c b/arch/arm64/kernel/pi/map_kernel.c
index c1a5bef4e10a49d7..a14b7c1236b5707c 100644
--- a/arch/arm64/kernel/pi/map_kernel.c
+++ b/arch/arm64/kernel/pi/map_kernel.c
@@ -128,11 +128,64 @@ static void __init map_kernel(u64 kaslr_offset, u64 va_offset, int root_level)
}
/* Copy the root page table to its final location */
- memcpy((void *)swapper_pg_dir + va_offset, init_pg_dir, PGD_SIZE);
+ memcpy((void *)swapper_pg_dir + va_offset, init_pg_dir, PAGE_SIZE);
dsb(ishst);
idmap_cpu_replace_ttbr1(swapper_pg_dir);
}
+static void noinline __section(".idmap.text") set_ttbr0_for_lpa2(u64 ttbr)
+{
+ u64 sctlr = read_sysreg(sctlr_el1);
+ u64 tcr = read_sysreg(tcr_el1) | TCR_DS;
+
+ asm(" msr sctlr_el1, %0 ;"
+ " isb ;"
+ " msr ttbr0_el1, %1 ;"
+ " msr tcr_el1, %2 ;"
+ " isb ;"
+ " tlbi vmalle1 ;"
+ " dsb nsh ;"
+ " isb ;"
+ " msr sctlr_el1, %3 ;"
+ " isb ;"
+ :: "r"(sctlr & ~SCTLR_ELx_M), "r"(ttbr), "r"(tcr), "r"(sctlr));
+}
+
+static void __init remap_idmap_for_lpa2(void)
+{
+ /* clear the bits that change meaning once LPA2 is turned on */
+ pteval_t mask = PTE_SHARED;
+
+ /*
+ * We have to clear bits [9:8] in all block or page descriptors in the
+ * initial ID map, as otherwise they will be (mis)interpreted as
+ * physical address bits once we flick the LPA2 switch (TCR.DS). Since
+ * we cannot manipulate live descriptors in that way without creating
+ * potential TLB conflicts, let's create another temporary ID map in a
+ * LPA2 compatible fashion, and update the initial ID map while running
+ * from that.
+ */
+ create_init_idmap(init_pg_dir, mask);
+ dsb(ishst);
+ set_ttbr0_for_lpa2((u64)init_pg_dir);
+
+ /*
+ * Recreate the initial ID map with the same granularity as before.
+ * Don't bother with the FDT, we no longer need it after this.
+ */
+ memset(init_idmap_pg_dir, 0,
+ (u64)init_idmap_pg_end - (u64)init_idmap_pg_dir);
+
+ create_init_idmap(init_idmap_pg_dir, mask);
+ dsb(ishst);
+
+ /* switch back to the updated initial ID map */
+ set_ttbr0_for_lpa2((u64)init_idmap_pg_dir);
+
+ /* wipe the temporary ID map from memory */
+ memset(init_pg_dir, 0, (u64)init_pg_end - (u64)init_pg_dir);
+}
+
static void map_fdt(u64 fdt)
{
static u8 ptes[INIT_IDMAP_FDT_SIZE] __initdata __aligned(PAGE_SIZE);
@@ -155,6 +208,7 @@ asmlinkage void __init early_map_kernel(u64 boot_status, void *fdt)
u64 va_base, pa_base = (u64)&_text;
u64 kaslr_offset = pa_base % MIN_KIMG_ALIGN;
int root_level = 4 - CONFIG_PGTABLE_LEVELS;
+ int va_bits = VA_BITS;
int chosen;
map_fdt((u64)fdt);
@@ -166,8 +220,15 @@ asmlinkage void __init early_map_kernel(u64 boot_status, void *fdt)
chosen = fdt_path_offset(fdt, chosen_str);
init_feature_override(boot_status, fdt, chosen);
- if (VA_BITS > VA_BITS_MIN && cpu_has_lva())
- sysreg_clear_set(tcr_el1, TCR_T1SZ_MASK, TCR_T1SZ(VA_BITS));
+ if (IS_ENABLED(CONFIG_ARM64_64K_PAGES) && !cpu_has_lva()) {
+ va_bits = VA_BITS_MIN;
+ } else if (IS_ENABLED(CONFIG_ARM64_LPA2) && !cpu_has_lpa2()) {
+ va_bits = VA_BITS_MIN;
+ root_level++;
+ }
+
+ if (va_bits > VA_BITS_MIN)
+ sysreg_clear_set(tcr_el1, TCR_T1SZ_MASK, TCR_T1SZ(va_bits));
/*
* The virtual KASLR displacement modulo 2MiB is decided by the
@@ -191,6 +252,9 @@ asmlinkage void __init early_map_kernel(u64 boot_status, void *fdt)
kaslr_offset |= kaslr_seed & ~(MIN_KIMG_ALIGN - 1);
}
+ if (IS_ENABLED(CONFIG_ARM64_LPA2) && va_bits > VA_BITS_MIN)
+ remap_idmap_for_lpa2();
+
va_base = KIMAGE_VADDR + kaslr_offset;
map_kernel(kaslr_offset, va_base - pa_base, root_level);
}
diff --git a/arch/arm64/kernel/pi/map_range.c b/arch/arm64/kernel/pi/map_range.c
index 8de3fe2c1cb55722..8a55c7775db8390f 100644
--- a/arch/arm64/kernel/pi/map_range.c
+++ b/arch/arm64/kernel/pi/map_range.c
@@ -86,14 +86,19 @@ void __init map_range(u64 *pte, u64 start, u64 end, u64 pa, pgprot_t prot,
}
}
-asmlinkage u64 __init create_init_idmap(pgd_t *pg_dir)
+asmlinkage u64 __init create_init_idmap(pgd_t *pg_dir, pteval_t clrmask)
{
u64 ptep = (u64)pg_dir + PAGE_SIZE;
+ pgprot_t text_prot = PAGE_KERNEL_ROX;
+ pgprot_t data_prot = PAGE_KERNEL;
+
+ pgprot_val(text_prot) &= ~clrmask;
+ pgprot_val(data_prot) &= ~clrmask;
map_range(&ptep, (u64)_stext, (u64)__initdata_begin, (u64)_stext,
- PAGE_KERNEL_ROX, IDMAP_ROOT_LEVEL, (pte_t *)pg_dir, false, 0);
+ text_prot, IDMAP_ROOT_LEVEL, (pte_t *)pg_dir, false, 0);
map_range(&ptep, (u64)__initdata_begin, (u64)_end, (u64)__initdata_begin,
- PAGE_KERNEL, IDMAP_ROOT_LEVEL, (pte_t *)pg_dir, false, 0);
+ data_prot, IDMAP_ROOT_LEVEL, (pte_t *)pg_dir, false, 0);
return ptep;
}
diff --git a/arch/arm64/kernel/pi/pi.h b/arch/arm64/kernel/pi/pi.h
index a8fe79c9b111140f..93d7457dd10d0380 100644
--- a/arch/arm64/kernel/pi/pi.h
+++ b/arch/arm64/kernel/pi/pi.h
@@ -17,7 +17,7 @@ static inline void *prel64_to_pointer(const prel64_t *offset)
extern bool dynamic_scs_is_enabled;
-extern pgd_t init_idmap_pg_dir[];
+extern pgd_t init_idmap_pg_dir[], init_idmap_pg_end[];
void init_feature_override(u64 boot_status, const void *fdt, int chosen);
u64 kaslr_early_init(void *fdt, int chosen);
@@ -27,4 +27,4 @@ int scs_patch(const u8 eh_frame[], int size);
void map_range(u64 *pgd, u64 start, u64 end, u64 pa, pgprot_t prot,
int level, pte_t *tbl, bool may_use_cont, u64 va_offset);
-asmlinkage u64 create_init_idmap(pgd_t *pgd);
+asmlinkage u64 create_init_idmap(pgd_t *pgd, pteval_t clrmask);
diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
index 58a0bb2c17f18cf5..362b6884427c083f 100644
--- a/arch/arm64/mm/init.c
+++ b/arch/arm64/mm/init.c
@@ -334,7 +334,7 @@ void __init arm64_memblock_init(void)
* physical address of PAGE_OFFSET, we have to *subtract* from it.
*/
if (IS_ENABLED(CONFIG_ARM64_VA_BITS_52) && (vabits_actual != 52))
- memstart_addr -= _PAGE_OFFSET(48) - _PAGE_OFFSET(52);
+ memstart_addr -= _PAGE_OFFSET(vabits_actual) - _PAGE_OFFSET(52);
/*
* Apply the memory limit if it was set. Since the kernel may be loaded
diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index a16bcfba2e500600..6426df5211456cfe 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -543,8 +543,12 @@ static void __init map_mem(pgd_t *pgdp)
* entries at any level are being shared between the linear region and
* the vmalloc region. Check whether this is true for the PGD level, in
* which case it is guaranteed to be true for all other levels as well.
+ * (Unless we are running with support for LPA2, in which case the
+ * entire reduced VA space is covered by a single pgd_t which will have
+ * been populated without the PXNTable attribute by the time we get here.)
*/
- BUILD_BUG_ON(pgd_index(direct_map_end - 1) == pgd_index(direct_map_end));
+ BUILD_BUG_ON(pgd_index(direct_map_end - 1) == pgd_index(direct_map_end) &&
+ pgd_index(_PAGE_OFFSET(VA_BITS_MIN)) != PTRS_PER_PGD - 1);
if (can_set_direct_map())
flags |= NO_BLOCK_MAPPINGS | NO_CONT_MAPPINGS;
diff --git a/arch/arm64/mm/proc.S b/arch/arm64/mm/proc.S
index f425cfc3e4dad188..68f8337cd8bf6ab9 100644
--- a/arch/arm64/mm/proc.S
+++ b/arch/arm64/mm/proc.S
@@ -488,6 +488,9 @@ SYM_FUNC_START(__cpu_setup)
mov x9, #64 - VA_BITS
alternative_if ARM64_HAS_VA52
tcr_set_t1sz tcr, x9
+#ifdef CONFIG_ARM64_LPA2
+ orr tcr, tcr, #TCR_DS
+#endif
alternative_else_nop_endif
#endif
--
2.39.2
Allow the KASAN init code to deal with 5 levels of paging, and relax the
requirement that the shadow region is aligned to the top level pgd_t
size. This is necessary for LPA2 based 52-bit virtual addressing, where
the KASAN shadow will never be aligned to the pgd_t size. Allowing this
also enables the 16k/48-bit case for KASAN, which is a nice bonus.
This involves some hackery to manipulate the root and next level page
tables without having to distinguish all the various configurations,
including 16k/48-bit (which has a two-entry pgd_t level), and LPA2
configurations running with one translation level less on non-LPA2
hardware.
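To make the "two-entry pgd_t level" concrete, the arithmetic for the
16k/48-bit case works out as follows (derived from the existing
ARM64_HW_PGTABLE_LEVELS() definition; illustrative only):

	/*
	 * 16k pages: PAGE_SHIFT == 14, each level resolves PAGE_SHIFT - 3 == 11 bits
	 * ARM64_HW_PGTABLE_LEVELS(48) == (48 - 4) / 11 == 4
	 * a root level entry covers PAGE_SIZE << ((4 - 1) * 11) == 1 << 47 bytes
	 * so a 48-bit VA space needs only 1 << (48 - 47) == 2 root level entries
	 */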
Signed-off-by: Ard Biesheuvel <[email protected]>
---
arch/arm64/Kconfig | 2 +-
arch/arm64/mm/kasan_init.c | 143 ++++++++++++++++++--
2 files changed, 130 insertions(+), 15 deletions(-)
diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index d287dad29198c843..52aac583823863e4 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -157,7 +157,7 @@ config ARM64
select HAVE_ARCH_HUGE_VMAP
select HAVE_ARCH_JUMP_LABEL
select HAVE_ARCH_JUMP_LABEL_RELATIVE
- select HAVE_ARCH_KASAN if !(ARM64_16K_PAGES && ARM64_VA_BITS_48)
+ select HAVE_ARCH_KASAN
select HAVE_ARCH_KASAN_VMALLOC if HAVE_ARCH_KASAN
select HAVE_ARCH_KASAN_SW_TAGS if HAVE_ARCH_KASAN
select HAVE_ARCH_KASAN_HW_TAGS if (HAVE_ARCH_KASAN && ARM64_MTE)
diff --git a/arch/arm64/mm/kasan_init.c b/arch/arm64/mm/kasan_init.c
index 7e32f21fb8e1e227..7ab7520133946e91 100644
--- a/arch/arm64/mm/kasan_init.c
+++ b/arch/arm64/mm/kasan_init.c
@@ -23,7 +23,7 @@
#if defined(CONFIG_KASAN_GENERIC) || defined(CONFIG_KASAN_SW_TAGS)
-static pgd_t tmp_pg_dir[PTRS_PER_PGD] __initdata __aligned(PGD_SIZE);
+static pgd_t tmp_pg_dir[PTRS_PER_PTE] __initdata __aligned(PAGE_SIZE);
/*
* The p*d_populate functions call virt_to_phys implicitly so they can't be used
@@ -99,6 +99,19 @@ static pud_t *__init kasan_pud_offset(p4d_t *p4dp, unsigned long addr, int node,
return early ? pud_offset_kimg(p4dp, addr) : pud_offset(p4dp, addr);
}
+static p4d_t *__init kasan_p4d_offset(pgd_t *pgdp, unsigned long addr, int node,
+ bool early)
+{
+ if (pgd_none(READ_ONCE(*pgdp))) {
+ phys_addr_t p4d_phys = early ?
+ __pa_symbol(kasan_early_shadow_p4d)
+ : kasan_alloc_zeroed_page(node);
+ __pgd_populate(pgdp, p4d_phys, PGD_TYPE_TABLE);
+ }
+
+ return early ? p4d_offset_kimg(pgdp, addr) : p4d_offset(pgdp, addr);
+}
+
static void __init kasan_pte_populate(pmd_t *pmdp, unsigned long addr,
unsigned long end, int node, bool early)
{
@@ -144,12 +157,12 @@ static void __init kasan_p4d_populate(pgd_t *pgdp, unsigned long addr,
unsigned long end, int node, bool early)
{
unsigned long next;
- p4d_t *p4dp = p4d_offset(pgdp, addr);
+ p4d_t *p4dp = kasan_p4d_offset(pgdp, addr, node, early);
do {
next = p4d_addr_end(addr, end);
kasan_pud_populate(p4dp, addr, next, node, early);
- } while (p4dp++, addr = next, addr != end);
+ } while (p4dp++, addr = next, addr != end && p4d_none(READ_ONCE(*p4dp)));
}
static void __init kasan_pgd_populate(unsigned long addr, unsigned long end,
@@ -165,14 +178,48 @@ static void __init kasan_pgd_populate(unsigned long addr, unsigned long end,
} while (pgdp++, addr = next, addr != end);
}
+#if defined(CONFIG_ARM64_64K_PAGES) || CONFIG_PGTABLE_LEVELS > 4
+#define SHADOW_ALIGN P4D_SIZE
+#else
+#define SHADOW_ALIGN PUD_SIZE
+#endif
+
+/*
+ * Return whether 'addr' is aligned to the size covered by a root level
+ * descriptor.
+ */
+static bool __init root_level_aligned(u64 addr)
+{
+ int shift = (ARM64_HW_PGTABLE_LEVELS(vabits_actual) - 1) * (PAGE_SHIFT - 3);
+
+ return (addr % (PAGE_SIZE << shift)) == 0;
+}
+
/* The early shadow maps everything to a single page of zeroes */
asmlinkage void __init kasan_early_init(void)
{
BUILD_BUG_ON(KASAN_SHADOW_OFFSET !=
KASAN_SHADOW_END - (1UL << (64 - KASAN_SHADOW_SCALE_SHIFT)));
- BUILD_BUG_ON(!IS_ALIGNED(_KASAN_SHADOW_START(VA_BITS), PGDIR_SIZE));
- BUILD_BUG_ON(!IS_ALIGNED(_KASAN_SHADOW_START(VA_BITS_MIN), PGDIR_SIZE));
- BUILD_BUG_ON(!IS_ALIGNED(KASAN_SHADOW_END, PGDIR_SIZE));
+ BUILD_BUG_ON(!IS_ALIGNED(_KASAN_SHADOW_START(VA_BITS), SHADOW_ALIGN));
+ BUILD_BUG_ON(!IS_ALIGNED(_KASAN_SHADOW_START(VA_BITS_MIN), SHADOW_ALIGN));
+ BUILD_BUG_ON(!IS_ALIGNED(KASAN_SHADOW_END, SHADOW_ALIGN));
+
+ if (!root_level_aligned(KASAN_SHADOW_START)) {
+ /*
+ * The start address is misaligned, and so the next level table
+ * will be shared with the linear region. This can happen with
+ * 4 or 5 level paging, so install a generic pte_t[] as the
+ * next level. This prevents the kasan_pgd_populate call below
+ * from inserting an entry that refers to the shared KASAN zero
+ * shadow pud_t[]/p4d_t[], which could end up getting corrupted
+ * when the linear region is mapped.
+ */
+ static pte_t tbl[PTRS_PER_PTE] __page_aligned_bss;
+ pgd_t *pgdp = pgd_offset_k(KASAN_SHADOW_START);
+
+ set_pgd(pgdp, __pgd(__pa_symbol(tbl) | PGD_TYPE_TABLE));
+ }
+
kasan_pgd_populate(KASAN_SHADOW_START, KASAN_SHADOW_END, NUMA_NO_NODE,
true);
}
@@ -184,20 +231,75 @@ static void __init kasan_map_populate(unsigned long start, unsigned long end,
kasan_pgd_populate(start & PAGE_MASK, PAGE_ALIGN(end), node, false);
}
-static void __init clear_pgds(unsigned long start,
- unsigned long end)
+/*
+ * Return the descriptor index of 'addr' in the root level table
+ */
+static int __init root_level_idx(u64 addr)
{
/*
- * Remove references to kasan page tables from
- * swapper_pg_dir. pgd_clear() can't be used
- * here because it's nop on 2,3-level pagetable setups
+ * On 64k pages, the TTBR1 range root tables are extended for 52-bit
+ * virtual addressing, and TTBR1 will simply point to the pgd_t entry
+ * that covers the start of the 48-bit addressable VA space if LVA is
+ * not implemented. This means we need to index the table as usual,
+ * instead of masking off bits based on vabits_actual.
*/
- for (; start < end; start += PGDIR_SIZE)
- set_pgd(pgd_offset_k(start), __pgd(0));
+ u64 vabits = IS_ENABLED(CONFIG_ARM64_64K_PAGES) ? VA_BITS
+ : vabits_actual;
+ int shift = (ARM64_HW_PGTABLE_LEVELS(vabits) - 1) * (PAGE_SHIFT - 3);
+
+ return (addr & ~_PAGE_OFFSET(vabits)) >> (shift + PAGE_SHIFT);
+}
+
+/*
+ * Clone a next level table from swapper_pg_dir into tmp_pg_dir
+ */
+static void __init clone_next_level(u64 addr, pgd_t *tmp_pg_dir, pud_t *pud)
+{
+ int idx = root_level_idx(addr);
+ pgd_t pgd = READ_ONCE(swapper_pg_dir[idx]);
+ pud_t *pudp = (pud_t *)__phys_to_kimg(__pgd_to_phys(pgd));
+
+ memcpy(pud, pudp, PAGE_SIZE);
+ tmp_pg_dir[idx] = __pgd(__phys_to_pgd_val(__pa_symbol(pud)) |
+ PUD_TYPE_TABLE);
+}
+
+/*
+ * Return the descriptor index of 'addr' in the next level table
+ */
+static int __init next_level_idx(u64 addr)
+{
+ int shift = (ARM64_HW_PGTABLE_LEVELS(vabits_actual) - 2) * (PAGE_SHIFT - 3);
+
+ return (addr >> (shift + PAGE_SHIFT)) % PTRS_PER_PTE;
+}
+
+/*
+ * Dereference the table descriptor at 'pgd_idx' and clear the entries from
+ * 'start' to 'end' (exclusive) from the table.
+ */
+static void __init clear_next_level(int pgd_idx, int start, int end)
+{
+ pgd_t pgd = READ_ONCE(swapper_pg_dir[pgd_idx]);
+ pud_t *pudp = (pud_t *)__phys_to_kimg(__pgd_to_phys(pgd));
+
+ memset(&pudp[start], 0, (end - start) * sizeof(pud_t));
+}
+
+static void __init clear_shadow(u64 start, u64 end)
+{
+ int l = root_level_idx(start), m = root_level_idx(end);
+
+ if (!root_level_aligned(start))
+ clear_next_level(l++, next_level_idx(start), PTRS_PER_PTE);
+ if (!root_level_aligned(end))
+ clear_next_level(m, 0, next_level_idx(end));
+ memset(&swapper_pg_dir[l], 0, (m - l) * sizeof(pgd_t));
}
static void __init kasan_init_shadow(void)
{
+ static pud_t pud[2][PTRS_PER_PUD] __initdata __aligned(PAGE_SIZE);
u64 kimg_shadow_start, kimg_shadow_end;
u64 mod_shadow_start, mod_shadow_end;
u64 vmalloc_shadow_end;
@@ -220,10 +322,23 @@ static void __init kasan_init_shadow(void)
* setup will be finished.
*/
memcpy(tmp_pg_dir, swapper_pg_dir, sizeof(tmp_pg_dir));
+
+ /*
+ * If the start or end address of the shadow region is not aligned to
+ * the root level size, we have to allocate a temporary next-level table
+ * in each case, clone the next level of descriptors, and install the
+ * table into tmp_pg_dir. Note that with 5 levels of paging, the next
+ * level will in fact be p4d_t, but that makes no difference in this
+ * case.
+ */
+ if (!root_level_aligned(KASAN_SHADOW_START))
+ clone_next_level(KASAN_SHADOW_START, tmp_pg_dir, pud[0]);
+ if (!root_level_aligned(KASAN_SHADOW_END))
+ clone_next_level(KASAN_SHADOW_END, tmp_pg_dir, pud[1]);
dsb(ishst);
cpu_replace_ttbr1(lm_alias(tmp_pg_dir));
- clear_pgds(KASAN_SHADOW_START, KASAN_SHADOW_END);
+ clear_shadow(KASAN_SHADOW_START, KASAN_SHADOW_END);
kasan_map_populate(kimg_shadow_start, kimg_shadow_end,
early_pfn_to_nid(virt_to_pfn(lm_alias(KERNEL_START))));
--
2.39.2
Add the required types and descriptor accessors to support 5 levels of
paging in the common code. This is one of the prerequisites for
supporting 52-bit virtual addressing with 4k pages.
Note that this does not cover the code that handles kernel mappings or
the fixmap.
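The net effect for generic walkers is that they keep their usual shape; a
minimal sketch of a kernel VA walk using the existing offset helpers
(illustrative only):

	pgd_t *pgdp = pgd_offset_k(addr);
	p4d_t *p4dp = p4d_offset(pgdp, addr);	/* folded onto the pgd level if !pgtable_l5_enabled() */
	pud_t *pudp = pud_offset(p4dp, addr);
	pmd_t *pmdp = pmd_offset(pudp, addr);
	pte_t *ptep = pte_offset_kernel(pmdp, addr);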
Signed-off-by: Ard Biesheuvel <[email protected]>
---
arch/arm64/include/asm/pgalloc.h | 41 ++++++++++
arch/arm64/include/asm/pgtable-hwdef.h | 22 +++++-
arch/arm64/include/asm/pgtable-types.h | 6 ++
arch/arm64/include/asm/pgtable.h | 82 +++++++++++++++++++-
arch/arm64/mm/mmu.c | 31 +++++++-
arch/arm64/mm/pgd.c | 15 +++-
6 files changed, 188 insertions(+), 9 deletions(-)
diff --git a/arch/arm64/include/asm/pgalloc.h b/arch/arm64/include/asm/pgalloc.h
index 237224484d0f6f11..cae8c648f4628709 100644
--- a/arch/arm64/include/asm/pgalloc.h
+++ b/arch/arm64/include/asm/pgalloc.h
@@ -60,6 +60,47 @@ static inline void __p4d_populate(p4d_t *p4dp, phys_addr_t pudp, p4dval_t prot)
}
#endif /* CONFIG_PGTABLE_LEVELS > 3 */
+#if CONFIG_PGTABLE_LEVELS > 4
+
+static inline void __pgd_populate(pgd_t *pgdp, phys_addr_t p4dp, pgdval_t prot)
+{
+ if (pgtable_l5_enabled())
+ set_pgd(pgdp, __pgd(__phys_to_pgd_val(p4dp) | prot));
+}
+
+static inline void pgd_populate(struct mm_struct *mm, pgd_t *pgdp, p4d_t *p4dp)
+{
+ pgdval_t pgdval = PGD_TYPE_TABLE;
+
+ pgdval |= (mm == &init_mm) ? PGD_TABLE_UXN : PGD_TABLE_PXN;
+ __pgd_populate(pgdp, __pa(p4dp), pgdval);
+}
+
+static inline p4d_t *p4d_alloc_one(struct mm_struct *mm, unsigned long addr)
+{
+ gfp_t gfp = GFP_PGTABLE_USER;
+
+ if (mm == &init_mm)
+ gfp = GFP_PGTABLE_KERNEL;
+ return (p4d_t *)get_zeroed_page(gfp);
+}
+
+static inline void p4d_free(struct mm_struct *mm, p4d_t *p4d)
+{
+ if (!pgtable_l5_enabled())
+ return;
+ BUG_ON((unsigned long)p4d & (PAGE_SIZE-1));
+ free_page((unsigned long)p4d);
+}
+
+#define __p4d_free_tlb(tlb, p4d, addr) p4d_free((tlb)->mm, p4d)
+#else
+static inline void __pgd_populate(pgd_t *pgdp, phys_addr_t p4dp, pgdval_t prot)
+{
+ BUILD_BUG();
+}
+#endif /* CONFIG_PGTABLE_LEVELS > 4 */
+
extern pgd_t *pgd_alloc(struct mm_struct *mm);
extern void pgd_free(struct mm_struct *mm, pgd_t *pgdp);
diff --git a/arch/arm64/include/asm/pgtable-hwdef.h b/arch/arm64/include/asm/pgtable-hwdef.h
index b91fe4781b066d54..b364b02e696b8172 100644
--- a/arch/arm64/include/asm/pgtable-hwdef.h
+++ b/arch/arm64/include/asm/pgtable-hwdef.h
@@ -26,10 +26,10 @@
#define ARM64_HW_PGTABLE_LEVELS(va_bits) (((va_bits) - 4) / (PAGE_SHIFT - 3))
/*
- * Size mapped by an entry at level n ( 0 <= n <= 3)
+ * Size mapped by an entry at level n ( -1 <= n <= 3)
* We map (PAGE_SHIFT - 3) at all translation levels and PAGE_SHIFT bits
* in the final page. The maximum number of translation levels supported by
- * the architecture is 4. Hence, starting at level n, we have further
+ * the architecture is 5. Hence, starting at level n, we have further
* ((4 - n) - 1) levels of translation excluding the offset within the page.
* So, the total number of bits mapped by an entry at level n is :
*
@@ -62,9 +62,16 @@
#define PTRS_PER_PUD (1 << (PAGE_SHIFT - 3))
#endif
+#if CONFIG_PGTABLE_LEVELS > 4
+#define P4D_SHIFT ARM64_HW_PGTABLE_LEVEL_SHIFT(0)
+#define P4D_SIZE (_AC(1, UL) << P4D_SHIFT)
+#define P4D_MASK (~(P4D_SIZE-1))
+#define PTRS_PER_P4D (1 << (PAGE_SHIFT - 3))
+#endif
+
/*
* PGDIR_SHIFT determines the size a top-level page table entry can map
- * (depending on the configuration, this level can be 0, 1 or 2).
+ * (depending on the configuration, this level can be -1, 0, 1 or 2).
*/
#define PGDIR_SHIFT ARM64_HW_PGTABLE_LEVEL_SHIFT(4 - CONFIG_PGTABLE_LEVELS)
#define PGDIR_SIZE (_AC(1, UL) << PGDIR_SHIFT)
@@ -87,6 +94,15 @@
/*
* Hardware page table definitions.
*
+ * Level -1 descriptor (PGD).
+ */
+#define PGD_TYPE_TABLE (_AT(pgdval_t, 3) << 0)
+#define PGD_TABLE_BIT (_AT(pgdval_t, 1) << 1)
+#define PGD_TYPE_MASK (_AT(pgdval_t, 3) << 0)
+#define PGD_TABLE_PXN (_AT(pgdval_t, 1) << 59)
+#define PGD_TABLE_UXN (_AT(pgdval_t, 1) << 60)
+
+/*
* Level 0 descriptor (P4D).
*/
#define P4D_TYPE_TABLE (_AT(p4dval_t, 3) << 0)
diff --git a/arch/arm64/include/asm/pgtable-types.h b/arch/arm64/include/asm/pgtable-types.h
index b8f158ae25273679..6d6d4065b0cb4ed1 100644
--- a/arch/arm64/include/asm/pgtable-types.h
+++ b/arch/arm64/include/asm/pgtable-types.h
@@ -36,6 +36,12 @@ typedef struct { pudval_t pud; } pud_t;
#define __pud(x) ((pud_t) { (x) } )
#endif
+#if CONFIG_PGTABLE_LEVELS > 4
+typedef struct { p4dval_t p4d; } p4d_t;
+#define p4d_val(x) ((x).p4d)
+#define __p4d(x) ((p4d_t) { (x) } )
+#endif
+
typedef struct { pgdval_t pgd; } pgd_t;
#define pgd_val(x) ((x).pgd)
#define __pgd(x) ((pgd_t) { (x) } )
diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index c8666d5c31fd1e52..c667073e3f56755d 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -800,7 +800,6 @@ static inline pud_t *p4d_pgtable(p4d_t p4d)
#else
#define p4d_page_paddr(p4d) ({ BUILD_BUG(); 0;})
-#define pgd_page_paddr(pgd) ({ BUILD_BUG(); 0;})
/* Match pud_offset folding in <asm/generic/pgtable-nopud.h> */
#define pud_set_fixmap(addr) NULL
@@ -811,6 +810,87 @@ static inline pud_t *p4d_pgtable(p4d_t p4d)
#endif /* CONFIG_PGTABLE_LEVELS > 3 */
+#if CONFIG_PGTABLE_LEVELS > 4
+
+static __always_inline bool pgtable_l5_enabled(void)
+{
+ if (!alternative_has_feature_likely(ARM64_ALWAYS_BOOT))
+ return vabits_actual == VA_BITS;
+ return alternative_has_feature_unlikely(ARM64_HAS_VA52);
+}
+
+static inline bool mm_p4d_folded(const struct mm_struct *mm)
+{
+ return !pgtable_l5_enabled();
+}
+#define mm_p4d_folded mm_p4d_folded
+
+#define p4d_ERROR(e) \
+ pr_err("%s:%d: bad p4d %016llx.\n", __FILE__, __LINE__, p4d_val(e))
+
+#define pgd_none(pgd) (pgtable_l5_enabled() && !pgd_val(pgd))
+#define pgd_bad(pgd) (pgtable_l5_enabled() && !(pgd_val(pgd) & 2))
+#define pgd_present(pgd) (!pgd_none(pgd))
+
+static inline void set_pgd(pgd_t *pgdp, pgd_t pgd)
+{
+ if (in_swapper_pgdir(pgdp)) {
+ set_swapper_pgd(pgdp, __pgd(pgd_val(pgd)));
+ return;
+ }
+
+ WRITE_ONCE(*pgdp, pgd);
+ dsb(ishst);
+ isb();
+}
+
+static inline void pgd_clear(pgd_t *pgdp)
+{
+ if (pgtable_l5_enabled())
+ set_pgd(pgdp, __pgd(0));
+}
+
+static inline phys_addr_t pgd_page_paddr(pgd_t pgd)
+{
+ return __pgd_to_phys(pgd);
+}
+
+#define p4d_index(addr) (((addr) >> P4D_SHIFT) & (PTRS_PER_P4D - 1))
+
+static inline p4d_t *pgd_to_folded_p4d(pgd_t *pgdp, unsigned long addr)
+{
+ return (p4d_t *)PTR_ALIGN_DOWN(pgdp, PAGE_SIZE) + p4d_index(addr);
+}
+
+static inline phys_addr_t p4d_offset_phys(pgd_t *pgdp, unsigned long addr)
+{
+ BUG_ON(!pgtable_l5_enabled());
+
+ return pgd_page_paddr(READ_ONCE(*pgdp)) + p4d_index(addr) * sizeof(p4d_t);
+}
+
+static inline
+p4d_t *p4d_offset_lockless(pgd_t *pgdp, pgd_t pgd, unsigned long addr)
+{
+ if (!pgtable_l5_enabled())
+ return pgd_to_folded_p4d(pgdp, addr);
+ return (p4d_t *)__va(pgd_page_paddr(pgd)) + p4d_index(addr);
+}
+#define p4d_offset_lockless p4d_offset_lockless
+
+static inline p4d_t *p4d_offset(pgd_t *pgdp, unsigned long addr)
+{
+ return p4d_offset_lockless(pgdp, READ_ONCE(*pgdp), addr);
+}
+
+#define pgd_page(pgd) pfn_to_page(__phys_to_pfn(__pgd_to_phys(pgd)))
+
+#else
+
+static inline bool pgtable_l5_enabled(void) { return false; }
+
+#endif /* CONFIG_PGTABLE_LEVELS > 4 */
+
#define pgd_ERROR(e) \
pr_err("%s:%d: bad pgd %016llx.\n", __FILE__, __LINE__, pgd_val(e))
diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index 914745697fb8b30c..a16bcfba2e500600 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -1003,7 +1003,7 @@ static void free_empty_pud_table(p4d_t *p4dp, unsigned long addr,
if (CONFIG_PGTABLE_LEVELS <= 3)
return;
- if (!pgtable_range_aligned(start, end, floor, ceiling, PGDIR_MASK))
+ if (!pgtable_range_aligned(start, end, floor, ceiling, P4D_MASK))
return;
/*
@@ -1026,8 +1026,8 @@ static void free_empty_p4d_table(pgd_t *pgdp, unsigned long addr,
unsigned long end, unsigned long floor,
unsigned long ceiling)
{
- unsigned long next;
p4d_t *p4dp, p4d;
+ unsigned long i, next, start = addr;
do {
next = p4d_addr_end(addr, end);
@@ -1039,6 +1039,27 @@ static void free_empty_p4d_table(pgd_t *pgdp, unsigned long addr,
WARN_ON(!p4d_present(p4d));
free_empty_pud_table(p4dp, addr, next, floor, ceiling);
} while (addr = next, addr < end);
+
+ if (!pgtable_l5_enabled())
+ return;
+
+ if (!pgtable_range_aligned(start, end, floor, ceiling, PGDIR_MASK))
+ return;
+
+ /*
+ * Check whether we can free the p4d page if the rest of the
+ * entries are empty. Overlap with other regions have been
+ * handled by the floor/ceiling check.
+ */
+ p4dp = p4d_offset(pgdp, 0UL);
+ for (i = 0; i < PTRS_PER_P4D; i++) {
+ if (!p4d_none(READ_ONCE(p4dp[i])))
+ return;
+ }
+
+ pgd_clear(pgdp);
+ __flush_tlb_kernel_pgtable(start);
+ free_hotplug_pgtable_page(virt_to_page(p4dp));
}
static void free_empty_tables(unsigned long addr, unsigned long end,
@@ -1283,6 +1304,12 @@ int pmd_set_huge(pmd_t *pmdp, phys_addr_t phys, pgprot_t prot)
return 1;
}
+#ifndef __PAGETABLE_P4D_FOLDED
+void p4d_clear_huge(p4d_t *p4dp)
+{
+}
+#endif
+
int pud_clear_huge(pud_t *pudp)
{
if (!pud_sect(READ_ONCE(*pudp)))
diff --git a/arch/arm64/mm/pgd.c b/arch/arm64/mm/pgd.c
index 4a64089e5771c1e2..3c4f8a279d2bc76a 100644
--- a/arch/arm64/mm/pgd.c
+++ b/arch/arm64/mm/pgd.c
@@ -17,11 +17,20 @@
static struct kmem_cache *pgd_cache __ro_after_init;
+static bool pgdir_is_page_size(void)
+{
+ if (PGD_SIZE == PAGE_SIZE)
+ return true;
+ if (CONFIG_PGTABLE_LEVELS == 5)
+ return !pgtable_l5_enabled();
+ return false;
+}
+
pgd_t *pgd_alloc(struct mm_struct *mm)
{
gfp_t gfp = GFP_PGTABLE_USER;
- if (PGD_SIZE == PAGE_SIZE)
+ if (pgdir_is_page_size())
return (pgd_t *)__get_free_page(gfp);
else
return kmem_cache_alloc(pgd_cache, gfp);
@@ -29,7 +38,7 @@ pgd_t *pgd_alloc(struct mm_struct *mm)
void pgd_free(struct mm_struct *mm, pgd_t *pgd)
{
- if (PGD_SIZE == PAGE_SIZE)
+ if (pgdir_is_page_size())
free_page((unsigned long)pgd);
else
kmem_cache_free(pgd_cache, pgd);
@@ -37,7 +46,7 @@ void pgd_free(struct mm_struct *mm, pgd_t *pgd)
void __init pgtable_cache_init(void)
{
- if (PGD_SIZE == PAGE_SIZE)
+ if (pgdir_is_page_size())
return;
#ifdef CONFIG_ARM64_PA_BITS_52
--
2.39.2
Configurations built with support for 52-bit virtual addressing can also
run on CPUs that only support 48 bits of VA space, in which case only
that part of swapper_pg_dir that represents the 48-bit addressable
region is relevant, and everything else is ignored by the hardware.
Our software pagetable walker has little in the way of input address
validation, and so it will happily start a walk from an address that is
not representable by the number of paging levels that are actually
active, resulting in lots of bogus output from the page table dumper
unless we take care to start at a valid address.
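Concretely (based on the existing _PAGE_OFFSET() definition; illustrative
only):

	/*
	 * _PAGE_OFFSET(va) == -(UL(1) << (va))
	 * _PAGE_OFFSET(52) == 0xfff0000000000000	(build-time PAGE_OFFSET of a 52-bit kernel)
	 * _PAGE_OFFSET(48) == 0xffff000000000000	(lowest VA the 48-bit tables can express)
	 *
	 * so with vabits_actual == 48, the dump has to start at _PAGE_OFFSET(48)
	 * rather than at the build-time PAGE_OFFSET.
	 */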
Signed-off-by: Ard Biesheuvel <[email protected]>
---
arch/arm64/mm/ptdump.c | 7 ++++---
1 file changed, 4 insertions(+), 3 deletions(-)
diff --git a/arch/arm64/mm/ptdump.c b/arch/arm64/mm/ptdump.c
index 8f37d6d8b5216473..8aee5d25f3d8cbe6 100644
--- a/arch/arm64/mm/ptdump.c
+++ b/arch/arm64/mm/ptdump.c
@@ -345,7 +345,7 @@ static void __init ptdump_initialize(void)
pg_level[i].mask |= pg_level[i].bits[j].mask;
}
-static struct ptdump_info kernel_ptdump_info = {
+static struct ptdump_info kernel_ptdump_info __ro_after_init = {
.mm = &init_mm,
.markers = address_markers,
.base_addr = PAGE_OFFSET,
@@ -364,7 +364,7 @@ void ptdump_check_wx(void)
.ptdump = {
.note_page = note_page,
.range = (struct ptdump_range[]) {
- {PAGE_OFFSET, ~0UL},
+ {_PAGE_OFFSET(vabits_actual), ~0UL},
{0, 0}
}
}
@@ -388,8 +388,9 @@ static int __init ptdump_init(void)
address_markers[VMALLOC_END_NR].start_address = VMALLOC_END;
ptdump_initialize();
if (vabits_actual < VA_BITS) {
+ kernel_ptdump_info.base_addr = _PAGE_OFFSET(vabits_actual);
address_markers[VMEMMAP_START_NR].start_address =
- (unsigned long)virt_to_page(_PAGE_OFFSET(vabits_actual));
+ (unsigned long)virt_to_page(kernel_ptdump_info.base_addr);
}
ptdump_debugfs_register(&kernel_ptdump_info, "kernel_page_tables");
return 0;
--
2.39.2
get_user_mapping_size() uses vabits_actual and CONFIG_PGTABLE_LEVELS to
provide the starting point for a table walk. This is fine for LVA, as
the number of translation levels is the same regardless of whether LVA
is enabled. However, with LPA2, this will no longer be the case, so
let's derive the number of levels from the number of VA bits directly.
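For example (using the existing ARM64_HW_PGTABLE_LEVELS() macro, 4k pages
assumed; illustrative only):

	/*
	 * ARM64_HW_PGTABLE_LEVELS(va_bits) == ((va_bits) - 4) / (PAGE_SHIFT - 3)
	 * 4k pages, 48-bit VA:        (48 - 4) / 9 == 4 levels
	 * 4k pages, 52-bit VA (LPA2): (52 - 4) / 9 == 5 levels
	 */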
Signed-off-by: Ard Biesheuvel <[email protected]>
---
arch/arm64/kvm/mmu.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index d64be7b5f6692e8b..4e7c0f9a9c286c09 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -663,7 +663,7 @@ static int get_user_mapping_size(struct kvm *kvm, u64 addr)
.pgd = (kvm_pteref_t)kvm->mm->pgd,
.ia_bits = vabits_actual,
.start_level = (KVM_PGTABLE_MAX_LEVELS -
- CONFIG_PGTABLE_LEVELS),
+ ARM64_HW_PGTABLE_LEVELS(pgt.ia_bits)),
.mm_ops = &kvm_user_mm_ops,
};
kvm_pte_t pte = 0; /* Keep GCC quiet... */
--
2.39.2
Currently, the ptdump code deals with folded PMD or PUD levels at build
time, by omitting those levels when invoking note_page. IOW, note_page()
is never invoked with level == 1 if P4Ds are folded in the build
configuration.
With the introduction of LPA2 support, we will defer some of these
folding decisions to runtime, so let's take care of this by overriding
the 'level' argument when this condition triggers.
Substituting the PUD or PMD strings for "PGD" when the level in question
is folded at build time is no longer necessary, and so the conditional
expressions can be simplified. This also makes the indirection of the
'name' field unnecessary, so change that into a char[] array, and make
the whole thing __ro_after_init.
Note that the mm_p?d_folded() functions currently ignore their mm
pointer arguments, but let's wire them up correctly anyway.
Signed-off-by: Ard Biesheuvel <[email protected]>
---
arch/arm64/mm/ptdump.c | 17 ++++++++++++-----
1 file changed, 12 insertions(+), 5 deletions(-)
diff --git a/arch/arm64/mm/ptdump.c b/arch/arm64/mm/ptdump.c
index 8aee5d25f3d8cbe6..0e0ad6a5a12e6f04 100644
--- a/arch/arm64/mm/ptdump.c
+++ b/arch/arm64/mm/ptdump.c
@@ -82,6 +82,7 @@ struct pg_state {
struct ptdump_state ptdump;
struct seq_file *seq;
const struct addr_marker *marker;
+ const struct mm_struct *mm;
unsigned long start_address;
int level;
u64 current_prot;
@@ -178,12 +179,12 @@ static const struct prot_bits pte_bits[] = {
struct pg_level {
const struct prot_bits *bits;
- const char *name;
- size_t num;
+ char name[4];
+ int num;
u64 mask;
};
-static struct pg_level pg_level[] = {
+static struct pg_level pg_level[] __ro_after_init = {
{ /* pgd */
.name = "PGD",
.bits = pte_bits,
@@ -193,11 +194,11 @@ static struct pg_level pg_level[] = {
.bits = pte_bits,
.num = ARRAY_SIZE(pte_bits),
}, { /* pud */
- .name = (CONFIG_PGTABLE_LEVELS > 3) ? "PUD" : "PGD",
+ .name = "PUD",
.bits = pte_bits,
.num = ARRAY_SIZE(pte_bits),
}, { /* pmd */
- .name = (CONFIG_PGTABLE_LEVELS > 2) ? "PMD" : "PGD",
+ .name = "PMD",
.bits = pte_bits,
.num = ARRAY_SIZE(pte_bits),
}, { /* pte */
@@ -261,6 +262,11 @@ static void note_page(struct ptdump_state *pt_st, unsigned long addr, int level,
static const char units[] = "KMGTPE";
u64 prot = 0;
+ /* check if the current level has been folded dynamically */
+ if ((level == 1 && mm_p4d_folded(st->mm)) ||
+ (level == 2 && mm_pud_folded(st->mm)))
+ level = 0;
+
if (level >= 0)
prot = val & pg_level[level].mask;
@@ -322,6 +328,7 @@ void ptdump_walk(struct seq_file *s, struct ptdump_info *info)
st = (struct pg_state){
.seq = s,
.marker = info->markers,
+ .mm = info->mm,
.level = -1,
.ptdump = {
.note_page = note_page,
--
2.39.2
Hi Ard,
Just to say that I plan to work my way through this lot over the next couple of
weeks. I hope you can tolerate comments dribbling in as I go?
I'll also try integrating this with my latest revision for the KVM side of
things and re-run all my tests. I'll report back in due course.
Thanks,
Ryan
On 07/03/2023 14:04, Ard Biesheuvel wrote:
> This is a followup to [0], which was a lot smaller. Thanks to Ryan for
> feedback and review. This series is independent from Ryan's work on
> adding support for LPA2 to KVM - the only potential source of conflict
> should be the patch "arm64: kvm: Limit HYP VA and host S2 range to 48
> bits when LPA2 is in effect", which could simply be dropped in favour of
> the KVM changes to make it support LPA2.
>
> The first ~15 patches of this series rework how the kernel VA space is
> organized, so that the vmemmap region does not take up more space than
> necessary, and so that most of it can be reclaimed when running a build
> capable of 52-bit virtual addressing on hardware that is not. This is
> needed because the vmemmap region will take up a substantial part of the
> upper VA region that it shares with the kernel, modules and
> vmalloc/vmap mappings once we enable LPA2 with 4k pages.
>
> The next ~30 patches rework the early init code, reimplementing most of
> the page table and relocation handling in C code. There are several
> reasons why this is beneficial:
> - we generally prefer C code over asm for these things, and the macros
> that currently exist in head.S for creating the kernel pages tables
> are a good example why;
> - we no longer need to create the kernel mapping in two passes, which
> means we can remove the logic that copies parts of the fixmap and the
> KAsan shadow from one set of page tables to the other; this is
> especially advantageous for KAsan with LPA2, which needs more
> elaborate shadow handling across multiple levels, since the KAsan
> region cannot be placed on exact pgd_t bouundaries in that case;
> - we can read the ID registers and parse command line overrides before
> creating the page tables, which simplifies the LPA2 case, as flicking
> the global TCR_EL1.DS bit at a later stage would require elaborate
> repainting of all page table descriptors, some of which with the MMU
> disabled;
> - we can use more elaborate logic to create the mappings, which means we
> can use more precise mappings for code and data sections even when
> using 2 MiB granularity, and this is a prerequisite for running with
> WXN.
>
> As part of the ID map changes, we decouple the ID map size from the
> kernel VA size, and switch to a 48-bit VA map for all configurations.
>
> The next 18 patches rework the existing LVA support as a CPU feature,
> which simplifies some code and gets rid of the vabits_actual variable.
> Then, LPA2 support is implemented in the same vein. This requires adding
> support for 5 level paging as well, given that LPA2 introduces a new
> paging level '-1' when using 4k pages.
>
> Combined with the vmemmap changes at the start of the series, the
> resulting LPA2/4k pages configuration will have the exact same VA space
> layout as the ordinary 4k/4 levels configuration, and so LPA2 support
> can reasonably be enabled by default, as the fallback is seamless on
> non-LPA2 hardware.
>
> In the 16k/LPA2 case, the fallback also reduces the number of paging
> levels, resulting in a 47-bit VA space. This is based on the assumption
> that hybrid LPA2/non-LPA2 16k pages kernels in production use would
> prefer not to take the performance hit of 4 level paging to gain only a
> single additional bit of VA space. (Note that generic Android kernels
> use only 3 levels of paging today.) Bespoke 16k configurations can still
> configure 48-bit virtual addressing as before.
>
> Finally, the last two patches enable support for running with the WXN
> control enabled. This was previously part of a separate series [1], but
> given that the delta is tiny, it is included here as well.
>
> [0] https://lore.kernel.org/all/[email protected]/
> [1] https://lore.kernel.org/all/[email protected]/
>
> Cc: Catalin Marinas <[email protected]>
> Cc: Will Deacon <[email protected]>
> Cc: Marc Zyngier <[email protected]>
> Cc: Mark Rutland <[email protected]>
> Cc: Ryan Roberts <[email protected]>
> Cc: Anshuman Khandual <[email protected]>
> Cc: Kees Cook <[email protected]>
>
> Anshuman Khandual (2):
> arm64/mm: Add FEAT_LPA2 specific TCR_EL1.DS field
> arm64/mm: Add FEAT_LPA2 specific ID_AA64MMFR0.TGRAN[2]
>
> Ard Biesheuvel (57):
>
> // KASLR / vmemmap reorg
> arm64: kernel: Disable latent_entropy GCC plugin in early C runtime
> arm64: mm: Take potential load offset into account when KASLR is off
> arm64: mm: get rid of kimage_vaddr global variable
> arm64: mm: Move PCI I/O emulation region above the vmemmap region
> arm64: mm: Move fixmap region above vmemmap region
> arm64: ptdump: Allow VMALLOC_END to be defined at boot
> arm64: ptdump: Discover start of vmemmap region at runtime
> arm64: vmemmap: Avoid base2 order of struct page size to dimension
> region
> arm64: mm: Reclaim unused vmemmap region for vmalloc use
> arm64: kaslr: Adjust randomization range dynamically
> arm64: kaslr: drop special case for ThunderX in kaslr_requires_kpti()
> arm64: kvm: honour 'nokaslr' command line option for the HYP VA space
>
> // Reimplement page table creation code in C
> arm64: kernel: Manage absolute relocations in code built under pi/
> arm64: kernel: Don't rely on objcopy to make code under pi/ __init
> arm64: head: move relocation handling to C code
> arm64: idreg-override: Omit non-NULL checks for override pointer
> arm64: idreg-override: Prepare for place relative reloc patching
> arm64: idreg-override: Avoid parameq() and parameqn()
> arm64: idreg-override: avoid strlen() to check for empty strings
> arm64: idreg-override: Avoid sprintf() for simple string concatenation
> arm64: idreg-override: Avoid kstrtou64() to parse a single hex digit
> arm64: idreg-override: Move to early mini C runtime
> arm64: kernel: Remove early fdt remap code
> arm64: head: Clear BSS and the kernel page tables in one go
> arm64: Move feature overrides into the BSS section
> arm64: head: Run feature override detection before mapping the kernel
> arm64: head: move dynamic shadow call stack patching into early C
> runtime
> arm64: kaslr: Use feature override instead of parsing the cmdline
> again
> arm64: idreg-override: Create a pseudo feature for rodata=off
> arm64: Add helpers to probe local CPU for PAC/BTI/E0PD support
> arm64: head: allocate more pages for the kernel mapping
> arm64: head: move memstart_offset_seed handling to C code
> arm64: head: Move early kernel mapping routines into C code
> arm64: mm: Use 48-bit virtual addressing for the permanent ID map
> arm64: pgtable: Decouple PGDIR size macros from PGD/PUD/PMD levels
> arm64: kernel: Create initial ID map from C code
> arm64: mm: avoid fixmap for early swapper_pg_dir updates
> arm64: mm: omit redundant remap of kernel image
> arm64: Revert "mm: provide idmap pointer to cpu_replace_ttbr1()"
>
> // Implement LPA2 support
> arm64: mm: Handle LVA support as a CPU feature
> arm64: mm: Add feature override support for LVA
> arm64: mm: Wire up TCR.DS bit to PTE shareability fields
> arm64: mm: Add LPA2 support to phys<->pte conversion routines
> arm64: mm: Add definitions to support 5 levels of paging
> arm64: mm: add LPA2 and 5 level paging support to G-to-nG conversion
> arm64: Enable LPA2 at boot if supported by the system
> arm64: mm: Add 5 level paging support to fixmap and swapper handling
> arm64: kasan: Reduce minimum shadow alignment and enable 5 level
> paging
> arm64: mm: Add support for folding PUDs at runtime
> arm64: ptdump: Disregard unaddressable VA space
> arm64: ptdump: Deal with translation levels folded at runtime
> arm64: kvm: avoid CONFIG_PGTABLE_LEVELS for runtime levels
> arm64: kvm: Limit HYP VA and host S2 range to 48 bits when LPA2 is in
> effect
> arm64: Enable 52-bit virtual addressing for 4k and 16k granule configs
> arm64: defconfig: Enable LPA2 support
>
> // Allow WXN control to be enabled at boot
> mm: add arch hook to validate mmap() prot flags
> arm64: mm: add support for WXN memory translation attribute
>
> Marc Zyngier (1):
> arm64: Turn kaslr_feature_override into a generic SW feature override
>
> arch/arm64/Kconfig | 34 +-
> arch/arm64/configs/defconfig | 2 +-
> arch/arm64/include/asm/assembler.h | 55 +--
> arch/arm64/include/asm/cpufeature.h | 102 +++++
> arch/arm64/include/asm/fixmap.h | 1 +
> arch/arm64/include/asm/kasan.h | 2 -
> arch/arm64/include/asm/kernel-pgtable.h | 104 ++---
> arch/arm64/include/asm/memory.h | 50 +--
> arch/arm64/include/asm/mman.h | 36 ++
> arch/arm64/include/asm/mmu.h | 26 +-
> arch/arm64/include/asm/mmu_context.h | 49 ++-
> arch/arm64/include/asm/pgalloc.h | 53 ++-
> arch/arm64/include/asm/pgtable-hwdef.h | 33 +-
> arch/arm64/include/asm/pgtable-prot.h | 18 +-
> arch/arm64/include/asm/pgtable-types.h | 6 +
> arch/arm64/include/asm/pgtable.h | 229 +++++++++-
> arch/arm64/include/asm/scs.h | 34 +-
> arch/arm64/include/asm/setup.h | 3 -
> arch/arm64/include/asm/sysreg.h | 2 +
> arch/arm64/include/asm/tlb.h | 3 +-
> arch/arm64/kernel/Makefile | 7 +-
> arch/arm64/kernel/cpu_errata.c | 2 +-
> arch/arm64/kernel/cpufeature.c | 90 ++--
> arch/arm64/kernel/head.S | 465 ++------------------
> arch/arm64/kernel/idreg-override.c | 322 --------------
> arch/arm64/kernel/image-vars.h | 32 ++
> arch/arm64/kernel/kaslr.c | 4 +-
> arch/arm64/kernel/module.c | 2 +-
> arch/arm64/kernel/pi/Makefile | 28 +-
> arch/arm64/kernel/pi/idreg-override.c | 396 +++++++++++++++++
> arch/arm64/kernel/pi/kaslr_early.c | 78 +---
> arch/arm64/kernel/pi/map_kernel.c | 284 ++++++++++++
> arch/arm64/kernel/pi/map_range.c | 104 +++++
> arch/arm64/kernel/{ => pi}/patch-scs.c | 36 +-
> arch/arm64/kernel/pi/pi.h | 30 ++
> arch/arm64/kernel/pi/relacheck.c | 130 ++++++
> arch/arm64/kernel/pi/relocate.c | 64 +++
> arch/arm64/kernel/setup.c | 22 -
> arch/arm64/kernel/sleep.S | 3 -
> arch/arm64/kernel/suspend.c | 2 +-
> arch/arm64/kernel/vmlinux.lds.S | 14 +-
> arch/arm64/kvm/hyp/nvhe/mem_protect.c | 2 +
> arch/arm64/kvm/mmu.c | 22 +-
> arch/arm64/kvm/va_layout.c | 10 +-
> arch/arm64/mm/init.c | 2 +-
> arch/arm64/mm/kasan_init.c | 154 +++++--
> arch/arm64/mm/mmap.c | 4 +
> arch/arm64/mm/mmu.c | 268 ++++++-----
> arch/arm64/mm/pgd.c | 17 +-
> arch/arm64/mm/proc.S | 106 ++++-
> arch/arm64/mm/ptdump.c | 43 +-
> arch/arm64/tools/cpucaps | 1 +
> include/linux/mman.h | 15 +
> mm/mmap.c | 3 +
> 54 files changed, 2259 insertions(+), 1345 deletions(-)
> delete mode 100644 arch/arm64/kernel/idreg-override.c
> create mode 100644 arch/arm64/kernel/pi/idreg-override.c
> create mode 100644 arch/arm64/kernel/pi/map_kernel.c
> create mode 100644 arch/arm64/kernel/pi/map_range.c
> rename arch/arm64/kernel/{ => pi}/patch-scs.c (89%)
> create mode 100644 arch/arm64/kernel/pi/pi.h
> create mode 100644 arch/arm64/kernel/pi/relacheck.c
> create mode 100644 arch/arm64/kernel/pi/relocate.c
>
On 07/03/2023 14:04, Ard Biesheuvel wrote:
> We will soon reclaim the part of the vmemmap region that covers VA space
> that is not addressable by the hardware. To avoid confusion, ensure that
> the 'vmemmap start' marker points at the start of the region that is
> actually being used for the struct page array, rather than the start of
> the region we set aside for it at build time.
>
> Signed-off-by: Ard Biesheuvel <[email protected]>
> ---
> arch/arm64/mm/ptdump.c | 5 +++++
> 1 file changed, 5 insertions(+)
>
> diff --git a/arch/arm64/mm/ptdump.c b/arch/arm64/mm/ptdump.c
> index 910b35f02280cbdb..8f37d6d8b5216473 100644
> --- a/arch/arm64/mm/ptdump.c
> +++ b/arch/arm64/mm/ptdump.c
> @@ -37,6 +37,7 @@ enum address_markers_idx {
> MODULES_END_NR,
> VMALLOC_START_NR,
> VMALLOC_END_NR,
> + VMEMMAP_START_NR,
> };
>
> static struct addr_marker address_markers[] = {
> @@ -386,6 +387,10 @@ static int __init ptdump_init(void)
> #endif
> address_markers[VMALLOC_END_NR].start_address = VMALLOC_END;
> ptdump_initialize();
> + if (vabits_actual < VA_BITS) {
> + address_markers[VMEMMAP_START_NR].start_address =
> + (unsigned long)virt_to_page(_PAGE_OFFSET(vabits_actual));
> + }
nit: Do you need the conditional here? Why not just do it unconditionally?
> ptdump_debugfs_register(&kernel_ptdump_info, "kernel_page_tables");
> return 0;
> }
On 07/03/2023 14:04, Ard Biesheuvel wrote:
> The vmemmap array is statically sized based on the maximum supported
> size of the virtual address space, but it is located inside the upper VA
> region, which is statically sized based on the *minimum* supported size
> of the VA space. This doesn't matter much when using 64k pages, which is
> the only configuration that currently supports 52-bit virtual
> addressing.
As I understand it, the vmemmap section only holds struct pages, and the number
of struct pages in the system is surely a function of PA size, not VA size? So
why is the region sized based on VA size?
>
> However, upcoming LPA2 support will change this picture somewhat, as in
> that case, the vmemmap array will take up more than 25% of the upper VA
> region when using 4k pages. Given that most of this space is never used
> when running on a system that does not support 52-bit virtual
> addressing, let's reclaim the unused vmemmap area in that case.
>
> Signed-off-by: Ard Biesheuvel <[email protected]>
> ---
> arch/arm64/include/asm/pgtable.h | 8 ++++++--
> 1 file changed, 6 insertions(+), 2 deletions(-)
>
> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
> index 3eff06c5d0eb73c7..2259898e8c3d990a 100644
> --- a/arch/arm64/include/asm/pgtable.h
> +++ b/arch/arm64/include/asm/pgtable.h
> @@ -18,11 +18,15 @@
> * VMALLOC range.
> *
> * VMALLOC_START: beginning of the kernel vmalloc space
> - * VMALLOC_END: extends to the available space below vmemmap, PCI I/O space
> - * and fixed mappings
> + * VMALLOC_END: extends to the available space below vmemmap
> */
> #define VMALLOC_START (MODULES_END)
> +#if VA_BITS == VA_BITS_MIN
> #define VMALLOC_END (VMEMMAP_START - SZ_8M)
> +#else
> +#define VMEMMAP_UNUSED_NPAGES ((_PAGE_OFFSET(vabits_actual) - PAGE_OFFSET) >> PAGE_SHIFT)
> +#define VMALLOC_END (VMEMMAP_START + VMEMMAP_UNUSED_NPAGES * sizeof(struct page) - SZ_8M)
> +#endif
>
> #define vmemmap ((struct page *)VMEMMAP_START - (memstart_addr >> PAGE_SHIFT))
>
On 07/03/2023 14:04, Ard Biesheuvel wrote:
> Extend the existing pattern for populating ptdump marker entries at
> boot, and add handling of VMALLOC_END, which will cease to be a compile
> time constant for configurations that support 52-bit virtual addressing.
>
> Signed-off-by: Ard Biesheuvel <[email protected]>
> ---
> arch/arm64/mm/ptdump.c | 8 +++++++-
> 1 file changed, 7 insertions(+), 1 deletion(-)
>
> diff --git a/arch/arm64/mm/ptdump.c b/arch/arm64/mm/ptdump.c
> index 76d28056bd14920a..910b35f02280cbdb 100644
> --- a/arch/arm64/mm/ptdump.c
> +++ b/arch/arm64/mm/ptdump.c
> @@ -31,7 +31,12 @@ enum address_markers_idx {
> PAGE_END_NR,
> #if defined(CONFIG_KASAN_GENERIC) || defined(CONFIG_KASAN_SW_TAGS)
> KASAN_START_NR,
> + KASAN_END_NR,
> #endif
> + MODULES_NR,
> + MODULES_END_NR,
> + VMALLOC_START_NR,
> + VMALLOC_END_NR,
> };
>
> static struct addr_marker address_markers[] = {
> @@ -44,7 +49,7 @@ static struct addr_marker address_markers[] = {
> { MODULES_VADDR, "Modules start" },
> { MODULES_END, "Modules end" },
> { VMALLOC_START, "vmalloc() area" },
> - { VMALLOC_END, "vmalloc() end" },
> + { 0, "vmalloc() end" },
> { VMEMMAP_START, "vmemmap start" },
> { VMEMMAP_START + VMEMMAP_SIZE, "vmemmap end" },
> { PCI_IO_START, "PCI I/O start" },
With all the VA layout changes, and the addition of 52-bit PA/VA for 4KB and
16KB pages, Documentation/arm64/memory.rst now looks very wrong. Suggest
updating it to reflect reality?
> @@ -379,6 +384,7 @@ static int __init ptdump_init(void)
> #if defined(CONFIG_KASAN_GENERIC) || defined(CONFIG_KASAN_SW_TAGS)
> address_markers[KASAN_START_NR].start_address = KASAN_SHADOW_START;
> #endif
> + address_markers[VMALLOC_END_NR].start_address = VMALLOC_END;
> ptdump_initialize();
> ptdump_debugfs_register(&kernel_ptdump_info, "kernel_page_tables");
> return 0;
On Tue, 7 Mar 2023 at 17:42, Ryan Roberts <[email protected]> wrote:
>
> On 07/03/2023 14:04, Ard Biesheuvel wrote:
> > The vmemmap array is statically sized based on the maximum supported
> > size of the virtual address space, but it is located inside the upper VA
> > region, which is statically sized based on the *minimum* supported size
> > of the VA space. This doesn't matter much when using 64k pages, which is
> > the only configuration that currently supports 52-bit virtual
> > addressing.
>
> As I understand it, the vmemmap section only holds struct pages, and the number
> of struct pages in the system is surely a function of PA size, not VA size? So
> why is the region sized based on VA size?
>
We do not implement CONFIG_HIGHMEM on arm64, and so the addressable PA
range is bounded by the size of the linear map (and in some cases,
DRAM starts outside of the VA addressable range entirely, e.g., on AMD
Seattle with 38-bit VAs). Also, the start of the linear map (i.e.,
PAGE_OFFSET) does not correspond to PA 0x0; it is based on the
physical placement of system memory, and it is randomized in some
cases as well. This is why we have
#define vmemmap ((struct page *)VMEMMAP_START - (memstart_addr >> PAGE_SHIFT))
This means btw that we could reclaim even more vmemmap space, i.e.,
nothing below phys_to_page(memblock_start_of_DRAM()) will ever be
used, but there are some intricacies with section rounding etc which
make this a bit tricky.
The main reason for this patch is to free up 15/16 of the vmemmap
region when using LPA2 and 4k pages, so that the resulting VA space
layout is identical to the 48-bit/4k pages one, making it reasonable
to enable LPA2 support by default.
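To put rough numbers on the 15/16 figure (purely illustrative, assuming 4k
pages, a 52-bit VA build and sizeof(struct page) == 64):

  52-bit VA:   linear map = 2^51 bytes
               vmemmap    = 2^51 / 2^12 * 64 = 2^45 bytes (32 TiB)
  48-bit h/w:  linear map = 2^47 bytes
               vmemmap    = 2^47 / 2^12 * 64 = 2^41 bytes (2 TiB)

So on hardware without 52-bit VA support only 1/16 of the statically sized
vmemmap window can ever be used, and the other 15/16 can go to vmalloc.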
> >
> > However, upcoming LPA2 support will change this picture somewhat, as in
> > that case, the vmemmap array will take up more than 25% of the upper VA
> > region when using 4k pages. Given that most of this space is never used
> > when running on a system that does not support 52-bit virtual
> > addressing, let's reclaim the unused vmemmap area in that case.
> >
> > Signed-off-by: Ard Biesheuvel <[email protected]>
> > ---
> > arch/arm64/include/asm/pgtable.h | 8 ++++++--
> > 1 file changed, 6 insertions(+), 2 deletions(-)
> >
> > diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
> > index 3eff06c5d0eb73c7..2259898e8c3d990a 100644
> > --- a/arch/arm64/include/asm/pgtable.h
> > +++ b/arch/arm64/include/asm/pgtable.h
> > @@ -18,11 +18,15 @@
> > * VMALLOC range.
> > *
> > * VMALLOC_START: beginning of the kernel vmalloc space
> > - * VMALLOC_END: extends to the available space below vmemmap, PCI I/O space
> > - * and fixed mappings
> > + * VMALLOC_END: extends to the available space below vmemmap
> > */
> > #define VMALLOC_START (MODULES_END)
> > +#if VA_BITS == VA_BITS_MIN
> > #define VMALLOC_END (VMEMMAP_START - SZ_8M)
> > +#else
> > +#define VMEMMAP_UNUSED_NPAGES ((_PAGE_OFFSET(vabits_actual) - PAGE_OFFSET) >> PAGE_SHIFT)
> > +#define VMALLOC_END (VMEMMAP_START + VMEMMAP_UNUSED_NPAGES * sizeof(struct page) - SZ_8M)
> > +#endif
> >
> > #define vmemmap ((struct page *)VMEMMAP_START - (memstart_addr >> PAGE_SHIFT))
> >
>
On Tue, 7 Mar 2023 at 17:58, Ryan Roberts <[email protected]> wrote:
>
> On 07/03/2023 14:04, Ard Biesheuvel wrote:
> > Extend the existing pattern for populating ptdump marker entries at
> > boot, and add handling of VMALLOC_END, which will cease to be a compile
> > time constant for configurations that support 52-bit virtual addressing.
> >
> > Signed-off-by: Ard Biesheuvel <[email protected]>
> > ---
> > arch/arm64/mm/ptdump.c | 8 +++++++-
> > 1 file changed, 7 insertions(+), 1 deletion(-)
> >
> > diff --git a/arch/arm64/mm/ptdump.c b/arch/arm64/mm/ptdump.c
> > index 76d28056bd14920a..910b35f02280cbdb 100644
> > --- a/arch/arm64/mm/ptdump.c
> > +++ b/arch/arm64/mm/ptdump.c
> > @@ -31,7 +31,12 @@ enum address_markers_idx {
> > PAGE_END_NR,
> > #if defined(CONFIG_KASAN_GENERIC) || defined(CONFIG_KASAN_SW_TAGS)
> > KASAN_START_NR,
> > + KASAN_END_NR,
> > #endif
> > + MODULES_NR,
> > + MODULES_END_NR,
> > + VMALLOC_START_NR,
> > + VMALLOC_END_NR,
> > };
> >
> > static struct addr_marker address_markers[] = {
> > @@ -44,7 +49,7 @@ static struct addr_marker address_markers[] = {
> > { MODULES_VADDR, "Modules start" },
> > { MODULES_END, "Modules end" },
> > { VMALLOC_START, "vmalloc() area" },
> > - { VMALLOC_END, "vmalloc() end" },
> > + { 0, "vmalloc() end" },
> > { VMEMMAP_START, "vmemmap start" },
> > { VMEMMAP_START + VMEMMAP_SIZE, "vmemmap end" },
> > { PCI_IO_START, "PCI I/O start" },
>
> With all the VA layout changes, and the addition of 52-bit PA/VA for 4KB and
> 16KB pages, Documentation/arm64/memory.rst now looks very wrong. Suggest
> updating it to reflect reality?
>
Yeah good point. And I don't think we necessarily have to describe
every imaginable combo exhaustively.
On Tue, 7 Mar 2023 at 17:28, Ryan Roberts <[email protected]> wrote:
>
> Hi Ard,
>
> Just to say that I plan to work my way through this lot over the next couple of
> weeks. I hope you can tolerate comments dribbling in as I go?
>
Yes, please take the time you need. I am confident we will still be
able to comfortably beat actual LPA2 hardware coming to market.
On 07/03/2023 14:04, Ard Biesheuvel wrote:
> We will move the CPU feature overrides into BSS in a subsequent patch,
> and this requires that BSS is zeroed before the feature override
> detection code runs. So let's map BSS read-write in the ID map, and zero
> it via this mapping.
>
> Since the kernel page tables are right next to it, and also zeroed via
> the ID map, let's drop the separate clear_page_tables() function, and
> just zero everything in one go.
>
> Signed-off-by: Ard Biesheuvel <[email protected]>
> ---
> arch/arm64/kernel/head.S | 33 +++++++-------------
> 1 file changed, 11 insertions(+), 22 deletions(-)
>
> diff --git a/arch/arm64/kernel/head.S b/arch/arm64/kernel/head.S
> index 0fa44b3188c1e204..ade0cb99c8a83a3d 100644
> --- a/arch/arm64/kernel/head.S
> +++ b/arch/arm64/kernel/head.S
> @@ -177,17 +177,6 @@ SYM_CODE_START_LOCAL(preserve_boot_args)
> ret
> SYM_CODE_END(preserve_boot_args)
>
> -SYM_FUNC_START_LOCAL(clear_page_tables)
> - /*
> - * Clear the init page tables.
> - */
> - adrp x0, init_pg_dir
> - adrp x1, init_pg_end
> - sub x2, x1, x0
> - mov x1, xzr
> - b __pi_memset // tail call
> -SYM_FUNC_END(clear_page_tables)
> -
> /*
> * Macro to populate page table entries, these entries can be pointers to the next level
> * or last level entries pointing to physical memory.
> @@ -386,9 +375,9 @@ SYM_FUNC_START_LOCAL(create_idmap)
>
> map_memory x0, x1, x3, x6, x7, x3, IDMAP_PGD_ORDER, x10, x11, x12, x13, x14, EXTRA_SHIFT
>
> - /* Remap the kernel page tables r/w in the ID map */
> + /* Remap BSS and the kernel page tables r/w in the ID map */
> adrp x1, _text
> - adrp x2, init_pg_dir
> + adrp x2, __bss_start
> adrp x3, _end
> bic x4, x2, #SWAPPER_BLOCK_SIZE - 1
> mov x5, SWAPPER_RW_MMUFLAGS
> @@ -489,14 +478,6 @@ SYM_FUNC_START_LOCAL(__primary_switched)
> mov x0, x20
> bl set_cpu_boot_mode_flag
>
> - // Clear BSS
> - adr_l x0, __bss_start
> - mov x1, xzr
> - adr_l x2, __bss_stop
> - sub x2, x2, x0
> - bl __pi_memset
> - dsb ishst // Make zero page visible to PTW
> -
> #if VA_BITS > 48
> adr_l x8, vabits_actual // Set this early so KASAN early init
> str x25, [x8] // ... observes the correct value
> @@ -780,6 +761,15 @@ SYM_FUNC_START_LOCAL(__primary_switch)
> adrp x1, reserved_pg_dir
> adrp x2, init_idmap_pg_dir
> bl __enable_mmu
> +
> + // Clear BSS
> + adrp x0, __bss_start
> + mov x1, xzr
> + adrp x2, init_pg_end
> + sub x2, x2, x0
> + bl __pi_memset
> + dsb ishst // Make zero page visible to PTW
Is it possible to add an assert somewhere (or at the very least a comment in
vmlinux.lds.S) to ensure that nothing gets inserted between the BSS and the page
tables? It feels a bit fragile otherwise.
I also wonder what's the point in calling __pi_memset() from here? Why not just
do it all in C?
> +
> #ifdef CONFIG_RELOCATABLE
> adrp x23, KERNEL_START
> and x23, x23, MIN_KIMG_ALIGN - 1
> @@ -794,7 +784,6 @@ SYM_FUNC_START_LOCAL(__primary_switch)
> orr x23, x23, x0 // record kernel offset
> #endif
> #endif
> - bl clear_page_tables
> bl create_kernel_mapping
>
> adrp x1, init_pg_dir
On Mon, 17 Apr 2023 at 16:00, Ryan Roberts <[email protected]> wrote:
>
> On 07/03/2023 14:04, Ard Biesheuvel wrote:
> > We will move the CPU feature overrides into BSS in a subsequent patch,
> > and this requires that BSS is zeroed before the feature override
> > detection code runs. So let's map BSS read-write in the ID map, and zero
> > it via this mapping.
> >
> > Since the kernel page tables are right next to it, and also zeroed via
> > the ID map, let's drop the separate clear_page_tables() function, and
> > just zero everything in one go.
> >
> > Signed-off-by: Ard Biesheuvel <[email protected]>
> > ---
> > arch/arm64/kernel/head.S | 33 +++++++-------------
> > 1 file changed, 11 insertions(+), 22 deletions(-)
> >
> > diff --git a/arch/arm64/kernel/head.S b/arch/arm64/kernel/head.S
> > index 0fa44b3188c1e204..ade0cb99c8a83a3d 100644
> > --- a/arch/arm64/kernel/head.S
> > +++ b/arch/arm64/kernel/head.S
> > @@ -177,17 +177,6 @@ SYM_CODE_START_LOCAL(preserve_boot_args)
> > ret
> > SYM_CODE_END(preserve_boot_args)
> >
> > -SYM_FUNC_START_LOCAL(clear_page_tables)
> > - /*
> > - * Clear the init page tables.
> > - */
> > - adrp x0, init_pg_dir
> > - adrp x1, init_pg_end
> > - sub x2, x1, x0
> > - mov x1, xzr
> > - b __pi_memset // tail call
> > -SYM_FUNC_END(clear_page_tables)
> > -
> > /*
> > * Macro to populate page table entries, these entries can be pointers to the next level
> > * or last level entries pointing to physical memory.
> > @@ -386,9 +375,9 @@ SYM_FUNC_START_LOCAL(create_idmap)
> >
> > map_memory x0, x1, x3, x6, x7, x3, IDMAP_PGD_ORDER, x10, x11, x12, x13, x14, EXTRA_SHIFT
> >
> > - /* Remap the kernel page tables r/w in the ID map */
> > + /* Remap BSS and the kernel page tables r/w in the ID map */
> > adrp x1, _text
> > - adrp x2, init_pg_dir
> > + adrp x2, __bss_start
> > adrp x3, _end
> > bic x4, x2, #SWAPPER_BLOCK_SIZE - 1
> > mov x5, SWAPPER_RW_MMUFLAGS
> > @@ -489,14 +478,6 @@ SYM_FUNC_START_LOCAL(__primary_switched)
> > mov x0, x20
> > bl set_cpu_boot_mode_flag
> >
> > - // Clear BSS
> > - adr_l x0, __bss_start
> > - mov x1, xzr
> > - adr_l x2, __bss_stop
> > - sub x2, x2, x0
> > - bl __pi_memset
> > - dsb ishst // Make zero page visible to PTW
> > -
> > #if VA_BITS > 48
> > adr_l x8, vabits_actual // Set this early so KASAN early init
> > str x25, [x8] // ... observes the correct value
> > @@ -780,6 +761,15 @@ SYM_FUNC_START_LOCAL(__primary_switch)
> > adrp x1, reserved_pg_dir
> > adrp x2, init_idmap_pg_dir
> > bl __enable_mmu
> > +
> > + // Clear BSS
> > + adrp x0, __bss_start
> > + mov x1, xzr
> > + adrp x2, init_pg_end
> > + sub x2, x2, x0
> > + bl __pi_memset
> > + dsb ishst // Make zero page visible to PTW
>
> Is it possible to add an assert somewhere (or at the very least a comment in
> vmlinux.lds.S) to ensure that nothing gets inserted between the BSS and the page
> tables? It feels a bit fragile otherwise.
>
I'm not sure that matters. The contents are not covered by the loaded
image so they are undefined otherwise in any case.
> I also wonder what's the point in calling __pi_memset() from here? Why not just
> do it all in C?
>
That happens in one of the subsequent patches.
On 17/04/2023 15:02, Ard Biesheuvel wrote:
> On Mon, 17 Apr 2023 at 16:00, Ryan Roberts <[email protected]> wrote:
>>
>> On 07/03/2023 14:04, Ard Biesheuvel wrote:
>>> We will move the CPU feature overrides into BSS in a subsequent patch,
>>> and this requires that BSS is zeroed before the feature override
>>> detection code runs. So let's map BSS read-write in the ID map, and zero
>>> it via this mapping.
>>>
>>> Since the kernel page tables are right next to it, and also zeroed via
>>> the ID map, let's drop the separate clear_page_tables() function, and
>>> just zero everything in one go.
>>>
>>> Signed-off-by: Ard Biesheuvel <[email protected]>
>>> ---
>>> arch/arm64/kernel/head.S | 33 +++++++-------------
>>> 1 file changed, 11 insertions(+), 22 deletions(-)
>>>
>>> diff --git a/arch/arm64/kernel/head.S b/arch/arm64/kernel/head.S
>>> index 0fa44b3188c1e204..ade0cb99c8a83a3d 100644
>>> --- a/arch/arm64/kernel/head.S
>>> +++ b/arch/arm64/kernel/head.S
>>> @@ -177,17 +177,6 @@ SYM_CODE_START_LOCAL(preserve_boot_args)
>>> ret
>>> SYM_CODE_END(preserve_boot_args)
>>>
>>> -SYM_FUNC_START_LOCAL(clear_page_tables)
>>> - /*
>>> - * Clear the init page tables.
>>> - */
>>> - adrp x0, init_pg_dir
>>> - adrp x1, init_pg_end
>>> - sub x2, x1, x0
>>> - mov x1, xzr
>>> - b __pi_memset // tail call
>>> -SYM_FUNC_END(clear_page_tables)
>>> -
>>> /*
>>> * Macro to populate page table entries, these entries can be pointers to the next level
>>> * or last level entries pointing to physical memory.
>>> @@ -386,9 +375,9 @@ SYM_FUNC_START_LOCAL(create_idmap)
>>>
>>> map_memory x0, x1, x3, x6, x7, x3, IDMAP_PGD_ORDER, x10, x11, x12, x13, x14, EXTRA_SHIFT
>>>
>>> - /* Remap the kernel page tables r/w in the ID map */
>>> + /* Remap BSS and the kernel page tables r/w in the ID map */
>>> adrp x1, _text
>>> - adrp x2, init_pg_dir
>>> + adrp x2, __bss_start
>>> adrp x3, _end
>>> bic x4, x2, #SWAPPER_BLOCK_SIZE - 1
>>> mov x5, SWAPPER_RW_MMUFLAGS
>>> @@ -489,14 +478,6 @@ SYM_FUNC_START_LOCAL(__primary_switched)
>>> mov x0, x20
>>> bl set_cpu_boot_mode_flag
>>>
>>> - // Clear BSS
>>> - adr_l x0, __bss_start
>>> - mov x1, xzr
>>> - adr_l x2, __bss_stop
>>> - sub x2, x2, x0
>>> - bl __pi_memset
>>> - dsb ishst // Make zero page visible to PTW
>>> -
>>> #if VA_BITS > 48
>>> adr_l x8, vabits_actual // Set this early so KASAN early init
>>> str x25, [x8] // ... observes the correct value
>>> @@ -780,6 +761,15 @@ SYM_FUNC_START_LOCAL(__primary_switch)
>>> adrp x1, reserved_pg_dir
>>> adrp x2, init_idmap_pg_dir
>>> bl __enable_mmu
>>> +
>>> + // Clear BSS
>>> + adrp x0, __bss_start
>>> + mov x1, xzr
>>> + adrp x2, init_pg_end
>>> + sub x2, x2, x0
>>> + bl __pi_memset
>>> + dsb ishst // Make zero page visible to PTW
>>
>> Is it possible to add an assert somewhere (or at the very least a comment in
>> vmlinux.lds.S) to ensure that nothing gets inserted between the BSS and the page
>> tables? It feels a bit fragile otherwise.
>>
>
> I'm not sure that matters. The contents are not covered by the loaded
> image so they are undefined otherwise in any case.
OK, so you couldn't accidentally zero anything in the image. But it could
represent a performance regression if something big was added between them that
doesn't need to be zeroed. All hypothetical, but this is currently an unstated
assumption that I think is worth stating at least as a comment in the linker script.
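For the assert, something along these lines is what I had in mind (a sketch
only - the exact symbol names and alignment would need checking against
vmlinux.lds.S):

  /* hypothetical: assumes init_pg_dir directly follows the page-aligned BSS */
  ASSERT(ALIGN(__bss_stop, PAGE_SIZE) == init_pg_dir,
         "init page tables must immediately follow BSS for the combined memset")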
>
>> I also wonder what's the point in calling __pi_memset() from here? Why not just
>> do it all in C?
>>
>
> That happens in one of the subsequent patches.
Ahh, cheers... Haven't got that far yet. (very impressive that you immediately
knew that given you posted the series 6 weeks ago! I usually can't remember what
I did yesterday ;-)
On Mon, 17 Apr 2023 at 16:28, Ryan Roberts <[email protected]> wrote:
>
> On 07/03/2023 14:04, Ard Biesheuvel wrote:
> > Add rodata=off to the set of kernel command line options that is parsed
> > early using the CPU feature override detection code, so we can easily
> > refer to it when creating the kernel mapping.
> >
> > Signed-off-by: Ard Biesheuvel <[email protected]>
> > ---
> > arch/arm64/include/asm/cpufeature.h | 1 +
> > arch/arm64/kernel/pi/idreg-override.c | 2 ++
> > 2 files changed, 3 insertions(+)
> >
> > diff --git a/arch/arm64/include/asm/cpufeature.h b/arch/arm64/include/asm/cpufeature.h
> > index bc10098901808c00..edc7733aa49846b2 100644
> > --- a/arch/arm64/include/asm/cpufeature.h
> > +++ b/arch/arm64/include/asm/cpufeature.h
> > @@ -16,6 +16,7 @@
> > #define cpu_feature(x) KERNEL_HWCAP_ ## x
> >
> > #define ARM64_SW_FEATURE_OVERRIDE_NOKASLR 0
> > +#define ARM64_SW_FEATURE_OVERRIDE_RODATA_OFF 4
>
> I assume these are bit numbers? Why not just use the next available bit (bit 1)
> for this new flag?
>
This (ab)uses the CPU feature framework, which is based on 4-bit
quantities. I don't remember if it matters or not, but IIRC the
default macros use 4-bit wide values.
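For reference, the consumer side would then look roughly like this (a sketch
using the generic cpufeature helpers, not the exact code from the series),
with each FIELD() occupying its own 4-bit slot:

  u64 sw = arm64_sw_feature_override.val & arm64_sw_feature_override.mask;

  bool rodata_off = cpuid_feature_extract_unsigned_field(sw,
                                  ARM64_SW_FEATURE_OVERRIDE_RODATA_OFF);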
On 07/03/2023 14:04, Ard Biesheuvel wrote:
> Add rodata=off to the set of kernel command line options that is parsed
> early using the CPU feature override detection code, so we can easily
> refer to it when creating the kernel mapping.
>
> Signed-off-by: Ard Biesheuvel <[email protected]>
> ---
> arch/arm64/include/asm/cpufeature.h | 1 +
> arch/arm64/kernel/pi/idreg-override.c | 2 ++
> 2 files changed, 3 insertions(+)
>
> diff --git a/arch/arm64/include/asm/cpufeature.h b/arch/arm64/include/asm/cpufeature.h
> index bc10098901808c00..edc7733aa49846b2 100644
> --- a/arch/arm64/include/asm/cpufeature.h
> +++ b/arch/arm64/include/asm/cpufeature.h
> @@ -16,6 +16,7 @@
> #define cpu_feature(x) KERNEL_HWCAP_ ## x
>
> #define ARM64_SW_FEATURE_OVERRIDE_NOKASLR 0
> +#define ARM64_SW_FEATURE_OVERRIDE_RODATA_OFF 4
I assume these are bit numbers? Why not just use the next available bit (bit 1)
for this new flag?
>
> #ifndef __ASSEMBLY__
>
> diff --git a/arch/arm64/kernel/pi/idreg-override.c b/arch/arm64/kernel/pi/idreg-override.c
> index 4e76db6eb72c2087..6c547cccaf6a9e9c 100644
> --- a/arch/arm64/kernel/pi/idreg-override.c
> +++ b/arch/arm64/kernel/pi/idreg-override.c
> @@ -151,6 +151,7 @@ static const struct ftr_set_desc sw_features __prel64_initconst = {
> .override = &arm64_sw_feature_override,
> .fields = {
> FIELD("nokaslr", ARM64_SW_FEATURE_OVERRIDE_NOKASLR, NULL),
> + FIELD("rodataoff", ARM64_SW_FEATURE_OVERRIDE_RODATA_OFF, NULL),
> {}
> },
> };
> @@ -183,6 +184,7 @@ static const struct {
> "id_aa64isar2.gpa3=0 id_aa64isar2.apa3=0" },
> { "arm64.nomte", "id_aa64pfr1.mte=0" },
> { "nokaslr", "arm64_sw.nokaslr=1" },
> + { "rodata=off", "arm64_sw.rodataoff=1" },
> };
>
> static int __init parse_hexdigit(const char *p, u64 *v)
On 17/04/2023 15:30, Ard Biesheuvel wrote:
> On Mon, 17 Apr 2023 at 16:28, Ryan Roberts <[email protected]> wrote:
>>
>> On 07/03/2023 14:04, Ard Biesheuvel wrote:
>>> Add rodata=off to the set of kernel command line options that is parsed
>>> early using the CPU feature override detection code, so we can easily
>>> refer to it when creating the kernel mapping.
>>>
>>> Signed-off-by: Ard Biesheuvel <[email protected]>
>>> ---
>>> arch/arm64/include/asm/cpufeature.h | 1 +
>>> arch/arm64/kernel/pi/idreg-override.c | 2 ++
>>> 2 files changed, 3 insertions(+)
>>>
>>> diff --git a/arch/arm64/include/asm/cpufeature.h b/arch/arm64/include/asm/cpufeature.h
>>> index bc10098901808c00..edc7733aa49846b2 100644
>>> --- a/arch/arm64/include/asm/cpufeature.h
>>> +++ b/arch/arm64/include/asm/cpufeature.h
>>> @@ -16,6 +16,7 @@
>>> #define cpu_feature(x) KERNEL_HWCAP_ ## x
>>>
>>> #define ARM64_SW_FEATURE_OVERRIDE_NOKASLR 0
>>> +#define ARM64_SW_FEATURE_OVERRIDE_RODATA_OFF 4
>>
>> I assume these are bit numbers? Why not just use the next available bit (bit 1)
>> for this new flag?
>>
>
> This (ab)uses the CPU feature framework, which is based on 4-bit
> quantities. I don't remember if it matters or not, but IIRC the
> default macros use 4-bit wide values.
OK, thanks.
On 07/03/2023 14:04, Ard Biesheuvel wrote:
> In preparation for switching to an early kernel mapping routine that
> maps each segment according to its precise boundaries, and with the
> correct attributes, let's allocate some extra pages for page tables for
> the 4k page size configuration. This is necessary because the start and
> end of each segment may not be aligned to the block size, and so we'll
> need an extra page table at each segment boundary.
>
> Signed-off-by: Ard Biesheuvel <[email protected]>
> ---
> arch/arm64/include/asm/kernel-pgtable.h | 11 ++++++++++-
> 1 file changed, 10 insertions(+), 1 deletion(-)
>
> diff --git a/arch/arm64/include/asm/kernel-pgtable.h b/arch/arm64/include/asm/kernel-pgtable.h
> index 4d13c73171e1e360..50b5c145358a5d8e 100644
> --- a/arch/arm64/include/asm/kernel-pgtable.h
> +++ b/arch/arm64/include/asm/kernel-pgtable.h
> @@ -80,7 +80,7 @@
> + EARLY_PGDS((vstart), (vend), add) /* each PGDIR needs a next level page table */ \
> + EARLY_PUDS((vstart), (vend), add) /* each PUD needs a next level page table */ \
> + EARLY_PMDS((vstart), (vend), add)) /* each PMD needs a next level page table */
> -#define INIT_DIR_SIZE (PAGE_SIZE * EARLY_PAGES(KIMAGE_VADDR, _end, EARLY_KASLR))
> +#define INIT_DIR_SIZE (PAGE_SIZE * (EARLY_PAGES(KIMAGE_VADDR, _end, EARLY_KASLR) + EARLY_SEGMENT_EXTRA_PAGES))
>
> /* the initial ID map may need two extra pages if it needs to be extended */
> #if VA_BITS < 48
> @@ -101,6 +101,15 @@
> #define SWAPPER_TABLE_SHIFT PMD_SHIFT
> #endif
>
> +/* The number of segments in the kernel image (text, rodata, inittext, initdata, data+bss) */
> +#define KERNEL_SEGMENT_COUNT 5
> +
> +#if SWAPPER_BLOCK_SIZE > SEGMENT_ALIGN
> +#define EARLY_SEGMENT_EXTRA_PAGES (KERNEL_SEGMENT_COUNT + 1)
I'm guessing the block size for 4K pages is PMD, so you need these extra pages
to define PTEs for the case where the section start/end addresses are not on
exact 2MB boundaries? But in that case, isn't it possible that you would need 2
extra PTE tables per segment - one for the start and one for the end?
> +#else
> +#define EARLY_SEGMENT_EXTRA_PAGES 0
> +#endif
> +
> /*
> * Initial memory map attributes.
> */
On Mon, 17 Apr 2023 at 17:48, Ryan Roberts <[email protected]> wrote:
>
> On 07/03/2023 14:04, Ard Biesheuvel wrote:
> > In preparation for switching to an early kernel mapping routine that
> > maps each segment according to its precise boundaries, and with the
> > correct attributes, let's allocate some extra pages for page tables for
> > the 4k page size configuration. This is necessary because the start and
> > end of each segment may not be aligned to the block size, and so we'll
> > need an extra page table at each segment boundary.
> >
> > Signed-off-by: Ard Biesheuvel <[email protected]>
> > ---
> > arch/arm64/include/asm/kernel-pgtable.h | 11 ++++++++++-
> > 1 file changed, 10 insertions(+), 1 deletion(-)
> >
> > diff --git a/arch/arm64/include/asm/kernel-pgtable.h b/arch/arm64/include/asm/kernel-pgtable.h
> > index 4d13c73171e1e360..50b5c145358a5d8e 100644
> > --- a/arch/arm64/include/asm/kernel-pgtable.h
> > +++ b/arch/arm64/include/asm/kernel-pgtable.h
> > @@ -80,7 +80,7 @@
> > + EARLY_PGDS((vstart), (vend), add) /* each PGDIR needs a next level page table */ \
> > + EARLY_PUDS((vstart), (vend), add) /* each PUD needs a next level page table */ \
> > + EARLY_PMDS((vstart), (vend), add)) /* each PMD needs a next level page table */
> > -#define INIT_DIR_SIZE (PAGE_SIZE * EARLY_PAGES(KIMAGE_VADDR, _end, EARLY_KASLR))
> > +#define INIT_DIR_SIZE (PAGE_SIZE * (EARLY_PAGES(KIMAGE_VADDR, _end, EARLY_KASLR) + EARLY_SEGMENT_EXTRA_PAGES))
> >
> > /* the initial ID map may need two extra pages if it needs to be extended */
> > #if VA_BITS < 48
> > @@ -101,6 +101,15 @@
> > #define SWAPPER_TABLE_SHIFT PMD_SHIFT
> > #endif
> >
> > +/* The number of segments in the kernel image (text, rodata, inittext, initdata, data+bss) */
> > +#define KERNEL_SEGMENT_COUNT 5
> > +
> > +#if SWAPPER_BLOCK_SIZE > SEGMENT_ALIGN
> > +#define EARLY_SEGMENT_EXTRA_PAGES (KERNEL_SEGMENT_COUNT + 1)
>
> I'm guessing the block size for 4K pages is PMD, so you need these extra pages
> to define PTEs for the case where the section start/end addresses are not on
> exact 2MB boundaries? But in that case, isn't it possible that you would need 2
> extra PTE tables per segment - one for the start and one for the end?
>
The end of one segment is the start of another, so we need one at the
start, plus one each for each segment end.
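In other words, with KERNEL_SEGMENT_COUNT == 5 there are at most 6 boundaries
that can fall in the middle of a 2 MiB block, each needing its own PTE-level
table:

  |<- text ->|<- rodata ->|<- inittext ->|<- initdata ->|<- data/bss ->|
  ^          ^            ^              ^              ^              ^
  1          2            3              4              5              6

hence EARLY_SEGMENT_EXTRA_PAGES == KERNEL_SEGMENT_COUNT + 1.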
On 17/04/2023 17:11, Ard Biesheuvel wrote:
> On Mon, 17 Apr 2023 at 17:48, Ryan Roberts <[email protected]> wrote:
>>
>> On 07/03/2023 14:04, Ard Biesheuvel wrote:
>>> In preparation for switching to an early kernel mapping routine that
>>> maps each segment according to its precise boundaries, and with the
>>> correct attributes, let's allocate some extra pages for page tables for
>>> the 4k page size configuration. This is necessary because the start and
>>> end of each segment may not be aligned to the block size, and so we'll
>>> need an extra page table at each segment boundary.
>>>
>>> Signed-off-by: Ard Biesheuvel <[email protected]>
>>> ---
>>> arch/arm64/include/asm/kernel-pgtable.h | 11 ++++++++++-
>>> 1 file changed, 10 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/arch/arm64/include/asm/kernel-pgtable.h b/arch/arm64/include/asm/kernel-pgtable.h
>>> index 4d13c73171e1e360..50b5c145358a5d8e 100644
>>> --- a/arch/arm64/include/asm/kernel-pgtable.h
>>> +++ b/arch/arm64/include/asm/kernel-pgtable.h
>>> @@ -80,7 +80,7 @@
>>> + EARLY_PGDS((vstart), (vend), add) /* each PGDIR needs a next level page table */ \
>>> + EARLY_PUDS((vstart), (vend), add) /* each PUD needs a next level page table */ \
>>> + EARLY_PMDS((vstart), (vend), add)) /* each PMD needs a next level page table */
>>> -#define INIT_DIR_SIZE (PAGE_SIZE * EARLY_PAGES(KIMAGE_VADDR, _end, EARLY_KASLR))
>>> +#define INIT_DIR_SIZE (PAGE_SIZE * (EARLY_PAGES(KIMAGE_VADDR, _end, EARLY_KASLR) + EARLY_SEGMENT_EXTRA_PAGES))
>>>
>>> /* the initial ID map may need two extra pages if it needs to be extended */
>>> #if VA_BITS < 48
>>> @@ -101,6 +101,15 @@
>>> #define SWAPPER_TABLE_SHIFT PMD_SHIFT
>>> #endif
>>>
>>> +/* The number of segments in the kernel image (text, rodata, inittext, initdata, data+bss) */
>>> +#define KERNEL_SEGMENT_COUNT 5
>>> +
>>> +#if SWAPPER_BLOCK_SIZE > SEGMENT_ALIGN
>>> +#define EARLY_SEGMENT_EXTRA_PAGES (KERNEL_SEGMENT_COUNT + 1)
>>
>> I'm guessing the block size for 4K pages is PMD, so you need these extra pages
>> to define PTEs for the case where the section start/end addresses are not on
>> exact 2MB boundaries? But in that case, isn't it possible that you would need 2
>> extra PTE tables per segment - one for the start and one for the end?
>>
>
> The end of one segment is the start of another, so we need one at the
> start, plus one each for each segment end.
Ahh, of course. Thanks.
On 07/03/2023 14:04, Ard Biesheuvel wrote:
> The asm version of the kernel mapping code works fine for creating a
> coarse grained identity map, but for mapping the kernel down to its
> exact boundaries with the right attributes, it is not suitable. This is
> why we create a preliminary RWX kernel mapping first, and then rebuild
> it from scratch later on.
>
> So let's reimplement this in C, in a way that will make it unnecessary
> to create the kernel page tables yet another time in paging_init().
>
> Signed-off-by: Ard Biesheuvel <[email protected]>
> ---
> [...]
> diff --git a/arch/arm64/kernel/pi/map_kernel.c b/arch/arm64/kernel/pi/map_kernel.c
> new file mode 100644
> index 0000000000000000..b573c964c7d88d1b
> --- /dev/null
> +++ b/arch/arm64/kernel/pi/map_kernel.c
> @@ -0,0 +1,171 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +// Copyright 2023 Google LLC
> +// Author: Ard Biesheuvel <[email protected]>
> +
> +#include <linux/init.h>
> +#include <linux/libfdt.h>
> +#include <linux/linkage.h>
> +#include <linux/types.h>
> +#include <linux/sizes.h>
> +#include <linux/string.h>
> +
> +#include <asm/memory.h>
> +#include <asm/pgalloc.h>
> +#include <asm/pgtable.h>
> +#include <asm/tlbflush.h>
> +
> +#include "pi.h"
> +
> +extern const u8 __eh_frame_start[], __eh_frame_end[];
> +
> +extern void idmap_cpu_replace_ttbr1(void *pgdir);
> +
> +static void map_segment(pgd_t *pg_dir, u64 *pgd, u64 va_offset,
> + void *start, void *end, pgprot_t prot,
> + bool may_use_cont, int root_level)
> +{
> + map_range(pgd, ((u64)start + va_offset) & ~PAGE_OFFSET,
> + ((u64)end + va_offset) & ~PAGE_OFFSET, (u64)start,
> + prot, root_level, (pte_t *)pg_dir, may_use_cont, 0);
I don't understand what you are doing with ~PAGE_OFFSET here. Is this intended
to be page alignment with PAGE_MASK? I'm guessing not, because you would want to
forward align the end address in that case.
> +}
> +
> +static void unmap_segment(pgd_t *pg_dir, u64 va_offset, void *start,
> + void *end, int root_level)
> +{
> + map_segment(pg_dir, NULL, va_offset, start, end, __pgprot(0),
> + false, root_level);
> +}
> +
> [...]
> diff --git a/arch/arm64/kernel/pi/map_range.c b/arch/arm64/kernel/pi/map_range.c
> new file mode 100644
> index 0000000000000000..61cbd6e82418c033
> --- /dev/null
> +++ b/arch/arm64/kernel/pi/map_range.c
> @@ -0,0 +1,87 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +// Copyright 2023 Google LLC
> +// Author: Ard Biesheuvel <[email protected]>
> +
> +#include <linux/types.h>
> +#include <linux/sizes.h>
> +
> +#include <asm/memory.h>
> +#include <asm/pgalloc.h>
> +#include <asm/pgtable.h>
> +
> +#include "pi.h"
> +
> +/**
> + * map_range - Map a contiguous range of physical pages into virtual memory
> + *
> + * @pte: Address of physical pointer to array of pages to
> + * allocate page tables from
> + * @start: Virtual address of the start of the range
> + * @end: Virtual address of the end of the range (exclusive)
> + * @pa: Physical address of the start of the range
> + * @level: Translation level for the mapping
> + * @tbl: The level @level page table to create the mappings in
> + * @may_use_cont: Whether the use of the contiguous attribute is allowed
> + * @va_offset: Offset between a physical page and its current mapping
> + * in the VA space
> + */
> +void __init map_range(u64 *pte, u64 start, u64 end, u64 pa, pgprot_t prot,
> + int level, pte_t *tbl, bool may_use_cont, u64 va_offset)
va_offset is always 0 (because the memory at *pte is id-mapped). Can it be
dropped? Or perhaps you are using this function later, once the memory is no
longer id-mapped?
> +{
> + u64 cmask = (level == 3) ? CONT_PTE_SIZE - 1 : U64_MAX;
> + u64 protval = pgprot_val(prot) & ~PTE_TYPE_MASK;
> + int lshift = (3 - level) * (PAGE_SHIFT - 3);
> + u64 lmask = (PAGE_SIZE << lshift) - 1;
> +
> + start &= PAGE_MASK;
> + pa &= PAGE_MASK;
> +
On Tue, 18 Apr 2023 at 11:31, Ryan Roberts <[email protected]> wrote:
>
> On 07/03/2023 14:04, Ard Biesheuvel wrote:
> > The asm version of the kernel mapping code works fine for creating a
> > coarse grained identity map, but for mapping the kernel down to its
> > exact boundaries with the right attributes, it is not suitable. This is
> > why we create a preliminary RWX kernel mapping first, and then rebuild
> > it from scratch later on.
> >
> > So let's reimplement this in C, in a way that will make it unnecessary
> > to create the kernel page tables yet another time in paging_init().
> >
> > Signed-off-by: Ard Biesheuvel <[email protected]>
> > ---
>
> > [...]
>
> > diff --git a/arch/arm64/kernel/pi/map_kernel.c b/arch/arm64/kernel/pi/map_kernel.c
> > new file mode 100644
> > index 0000000000000000..b573c964c7d88d1b
> > --- /dev/null
> > +++ b/arch/arm64/kernel/pi/map_kernel.c
> > @@ -0,0 +1,171 @@
> > +// SPDX-License-Identifier: GPL-2.0-only
> > +// Copyright 2023 Google LLC
> > +// Author: Ard Biesheuvel <[email protected]>
> > +
> > +#include <linux/init.h>
> > +#include <linux/libfdt.h>
> > +#include <linux/linkage.h>
> > +#include <linux/types.h>
> > +#include <linux/sizes.h>
> > +#include <linux/string.h>
> > +
> > +#include <asm/memory.h>
> > +#include <asm/pgalloc.h>
> > +#include <asm/pgtable.h>
> > +#include <asm/tlbflush.h>
> > +
> > +#include "pi.h"
> > +
> > +extern const u8 __eh_frame_start[], __eh_frame_end[];
> > +
> > +extern void idmap_cpu_replace_ttbr1(void *pgdir);
> > +
> > +static void map_segment(pgd_t *pg_dir, u64 *pgd, u64 va_offset,
> > + void *start, void *end, pgprot_t prot,
> > + bool may_use_cont, int root_level)
> > +{
> > + map_range(pgd, ((u64)start + va_offset) & ~PAGE_OFFSET,
> > + ((u64)end + va_offset) & ~PAGE_OFFSET, (u64)start,
> > + prot, root_level, (pte_t *)pg_dir, may_use_cont, 0);
>
> I don't understand what you are doing with ~PAGE_OFFSET here. Is this intended
> to be page alignment with PAGE_MASK? I'm guessing not, because you would want to
> forward align the end address in that case.
>
start + va_offset will produce an address that has leading 1 bits set
in positions that do not contribute to the translation. In order to
index the page tables correctly, those bits need to be cleared.
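A concrete example (illustrative only, assuming a 48-bit VA configuration, so
PAGE_OFFSET == 0xffff000000000000 and ~PAGE_OFFSET == 0x0000ffffffffffff):

  start + va_offset  == 0xffff800012345000   /* some TTBR1 VA */
  ... & ~PAGE_OFFSET == 0x0000800012345000   /* bits [63:48] cleared */

So the mask only strips the sign-extension bits that the table walk ignores;
it is not a PAGE_MASK style alignment operation.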
> > +}
> > +
> > +static void unmap_segment(pgd_t *pg_dir, u64 va_offset, void *start,
> > + void *end, int root_level)
> > +{
> > + map_segment(pg_dir, NULL, va_offset, start, end, __pgprot(0),
> > + false, root_level);
> > +}
> > +
>
> > [...]
>
> > diff --git a/arch/arm64/kernel/pi/map_range.c b/arch/arm64/kernel/pi/map_range.c
> > new file mode 100644
> > index 0000000000000000..61cbd6e82418c033
> > --- /dev/null
> > +++ b/arch/arm64/kernel/pi/map_range.c
> > @@ -0,0 +1,87 @@
> > +// SPDX-License-Identifier: GPL-2.0-only
> > +// Copyright 2023 Google LLC
> > +// Author: Ard Biesheuvel <[email protected]>
> > +
> > +#include <linux/types.h>
> > +#include <linux/sizes.h>
> > +
> > +#include <asm/memory.h>
> > +#include <asm/pgalloc.h>
> > +#include <asm/pgtable.h>
> > +
> > +#include "pi.h"
> > +
> > +/**
> > + * map_range - Map a contiguous range of physical pages into virtual memory
> > + *
> > + * @pte: Address of physical pointer to array of pages to
> > + * allocate page tables from
> > + * @start: Virtual address of the start of the range
> > + * @end: Virtual address of the end of the range (exclusive)
> > + * @pa: Physical address of the start of the range
> > + * @level: Translation level for the mapping
> > + * @tbl: The level @level page table to create the mappings in
> > + * @may_use_cont: Whether the use of the contiguous attribute is allowed
> > + * @va_offset: Offset between a physical page and its current mapping
> > + * in the VA space
> > + */
> > +void __init map_range(u64 *pte, u64 start, u64 end, u64 pa, pgprot_t prot,
> > + int level, pte_t *tbl, bool may_use_cont, u64 va_offset)
>
> va_offset is always 0 (because the memory at *pte is id-mapped). Can it be
> dropped? Or perhaps you are using this function later, once the memory is no
> longer id-mapped?
>
It will be used later.
On 07/03/2023 14:04, Ard Biesheuvel wrote:
> Even though we support loading kernels anywhere in 48-bit addressable
> physical memory, we create the ID maps based on the number of levels
> that we happened to configure for the kernel VA and user VA spaces.
>
> The reason for this is that the PGD/PUD/PMD based classification of
> translation levels, along with the associated folding when the number of
> levels is less than 5, does not permit creating a page table hierarchy
> of a set number of levels. This means that, for instance, on 39-bit VA
> kernels we need to configure an additional level above PGD level on the
> fly, and 36-bit VA kernels still only support 47-bit virtual addressing
> with this trick applied.
>
> Now that we have a separate helper to populate page table hierarchies
> that does not define the levels in terms of PUDS/PMDS/etc at all, let's
> reuse it to create the permanent ID map with a fixed VA size of 48 bits.
>
> Signed-off-by: Ard Biesheuvel <[email protected]>
> ---
> arch/arm64/include/asm/kernel-pgtable.h | 2 ++
> arch/arm64/kernel/head.S | 5 +++
> arch/arm64/kvm/mmu.c | 15 +++------
> arch/arm64/mm/mmu.c | 32 +++++++++++---------
> arch/arm64/mm/proc.S | 9 ++----
> 5 files changed, 31 insertions(+), 32 deletions(-)
>
> diff --git a/arch/arm64/include/asm/kernel-pgtable.h b/arch/arm64/include/asm/kernel-pgtable.h
> index 50b5c145358a5d8e..2a2c80ffe59e5307 100644
> --- a/arch/arm64/include/asm/kernel-pgtable.h
> +++ b/arch/arm64/include/asm/kernel-pgtable.h
> @@ -35,6 +35,8 @@
> #define SWAPPER_PGTABLE_LEVELS (CONFIG_PGTABLE_LEVELS)
> #endif
>
> +#define IDMAP_LEVELS ARM64_HW_PGTABLE_LEVELS(48)
> +#define IDMAP_ROOT_LEVEL (4 - IDMAP_LEVELS)
>
> /*
> * If KASLR is enabled, then an offset K is added to the kernel address
> diff --git a/arch/arm64/kernel/head.S b/arch/arm64/kernel/head.S
> index e45fd99e8ab4272a..fc6a4076d826b728 100644
> --- a/arch/arm64/kernel/head.S
> +++ b/arch/arm64/kernel/head.S
> @@ -727,6 +727,11 @@ SYM_FUNC_START_LOCAL(__no_granule_support)
> SYM_FUNC_END(__no_granule_support)
>
> SYM_FUNC_START_LOCAL(__primary_switch)
> + mrs x1, tcr_el1
> + mov x2, #64 - VA_BITS
> + tcr_set_t0sz x1, x2
> + msr tcr_el1, x1
> +
> adrp x1, reserved_pg_dir
> adrp x2, init_idmap_pg_dir
> bl __enable_mmu
> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> index 7113587222ffe8e1..d64be7b5f6692e8b 100644
> --- a/arch/arm64/kvm/mmu.c
> +++ b/arch/arm64/kvm/mmu.c
> @@ -1687,16 +1687,9 @@ int __init kvm_mmu_init(u32 *hyp_va_bits)
> BUG_ON((hyp_idmap_start ^ (hyp_idmap_end - 1)) & PAGE_MASK);
>
> /*
> - * The ID map may be configured to use an extended virtual address
> - * range. This is only the case if system RAM is out of range for the
> - * currently configured page size and VA_BITS_MIN, in which case we will
> - * also need the extended virtual range for the HYP ID map, or we won't
> - * be able to enable the EL2 MMU.
> - *
> - * However, in some cases the ID map may be configured for fewer than
> - * the number of VA bits used by the regular kernel stage 1. This
> - * happens when VA_BITS=52 and the kernel image is placed in PA space
> - * below 48 bits.
> + * The ID map is always configured for 48 bits of translation, which
> + * may be fewer than the number of VA bits used by the regular kernel
> + * stage 1, when VA_BITS=52.
> *
> * At EL2, there is only one TTBR register, and we can't switch between
> * translation tables *and* update TCR_EL2.T0SZ at the same time. Bottom
> @@ -1707,7 +1700,7 @@ int __init kvm_mmu_init(u32 *hyp_va_bits)
> * 1 VA bits to assure that the hypervisor can both ID map its code page
> * and map any kernel memory.
> */
> - idmap_bits = 64 - ((idmap_t0sz & TCR_T0SZ_MASK) >> TCR_T0SZ_OFFSET);
> + idmap_bits = 48;
> kernel_bits = vabits_actual;
> *hyp_va_bits = max(idmap_bits, kernel_bits);
This effectively means that the hypervisor always uses at least 48 VA bits.
Previously, I think it would have been 39 for (e.g.) Android builds? Does this
have any performance implications for pKVM?
>
> diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
> index 81e1420d2cc13246..a59433ae4f5f8d02 100644
> --- a/arch/arm64/mm/mmu.c
> +++ b/arch/arm64/mm/mmu.c
> @@ -762,22 +762,21 @@ static void __init map_kernel(pgd_t *pgdp)
> kasan_copy_shadow(pgdp);
> }
>
> +void __pi_map_range(u64 *pgd, u64 start, u64 end, u64 pa, pgprot_t prot,
> + int level, pte_t *tbl, bool may_use_cont, u64 va_offset);
> +
> +static u8 idmap_ptes[IDMAP_LEVELS - 1][PAGE_SIZE] __aligned(PAGE_SIZE) __ro_after_init,
> + kpti_ptes[IDMAP_LEVELS - 1][PAGE_SIZE] __aligned(PAGE_SIZE) __ro_after_init;
I see this new storage introduced, but I don't see you removing the storage for
the old method (I have a vague memory of it being defined in the linker script)?
> +
> static void __init create_idmap(void)
> {
> u64 start = __pa_symbol(__idmap_text_start);
> - u64 size = __pa_symbol(__idmap_text_end) - start;
> - pgd_t *pgd = idmap_pg_dir;
> - u64 pgd_phys;
> -
> - /* check if we need an additional level of translation */
> - if (VA_BITS < 48 && idmap_t0sz < (64 - VA_BITS_MIN)) {
> - pgd_phys = early_pgtable_alloc(PAGE_SHIFT);
> - set_pgd(&idmap_pg_dir[start >> VA_BITS],
> - __pgd(pgd_phys | P4D_TYPE_TABLE));
> - pgd = __va(pgd_phys);
> - }
> - __create_pgd_mapping(pgd, start, start, size, PAGE_KERNEL_ROX,
> - early_pgtable_alloc, 0);
> + u64 end = __pa_symbol(__idmap_text_end);
> + u64 ptep = __pa_symbol(idmap_ptes);
> +
> + __pi_map_range(&ptep, start, end, start, PAGE_KERNEL_ROX,
> + IDMAP_ROOT_LEVEL, (pte_t *)idmap_pg_dir, false,
> + __phys_to_virt(ptep) - ptep);
>
> if (IS_ENABLED(CONFIG_UNMAP_KERNEL_AT_EL0)) {
> extern u32 __idmap_kpti_flag;
> @@ -787,8 +786,10 @@ static void __init create_idmap(void)
> * The KPTI G-to-nG conversion code needs a read-write mapping
> * of its synchronization flag in the ID map.
> */
> - __create_pgd_mapping(pgd, pa, pa, sizeof(u32), PAGE_KERNEL,
> - early_pgtable_alloc, 0);
> + ptep = __pa_symbol(kpti_ptes);
> + __pi_map_range(&ptep, pa, pa + sizeof(u32), pa, PAGE_KERNEL,
> + IDMAP_ROOT_LEVEL, (pte_t *)idmap_pg_dir, false,
> + __phys_to_virt(ptep) - ptep);
> }
> }
>
> @@ -813,6 +814,7 @@ void __init paging_init(void)
> memblock_allow_resize();
>
> create_idmap();
> + idmap_t0sz = TCR_T0SZ(48);
> }
>
> #ifdef CONFIG_MEMORY_HOTPLUG
> diff --git a/arch/arm64/mm/proc.S b/arch/arm64/mm/proc.S
> index 82e88f4521737c0e..c7129b21bfd5191f 100644
> --- a/arch/arm64/mm/proc.S
> +++ b/arch/arm64/mm/proc.S
> @@ -422,9 +422,9 @@ SYM_FUNC_START(__cpu_setup)
> mair .req x17
> tcr .req x16
> mov_q mair, MAIR_EL1_SET
> - mov_q tcr, TCR_TxSZ(VA_BITS) | TCR_CACHE_FLAGS | TCR_SMP_FLAGS | \
> - TCR_TG_FLAGS | TCR_KASLR_FLAGS | TCR_ASID16 | \
> - TCR_TBI0 | TCR_A1 | TCR_KASAN_SW_FLAGS | TCR_MTE_FLAGS
> + mov_q tcr, TCR_T0SZ(48) | TCR_T1SZ(VA_BITS) | TCR_CACHE_FLAGS | \
> + TCR_SMP_FLAGS | TCR_TG_FLAGS | TCR_KASLR_FLAGS | TCR_ASID16 | \
> + TCR_TBI0 | TCR_A1 | TCR_KASAN_SW_FLAGS | TCR_MTE_FLAGS
You're hardcoding 48 in 3 places in this patch. I wonder if an IDMAP_VA_BITS
macro might help here?
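Something like the below, perhaps (just sketching the suggestion, the macro
name is made up):

  #define IDMAP_VA_BITS    48
  #define IDMAP_LEVELS     ARM64_HW_PGTABLE_LEVELS(IDMAP_VA_BITS)
  #define IDMAP_ROOT_LEVEL (4 - IDMAP_LEVELS)

with idmap_bits, idmap_t0sz and the TCR_T0SZ() value in __cpu_setup all
derived from IDMAP_VA_BITS instead of a literal 48.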
>
> tcr_clear_errata_bits tcr, x9, x5
>
> @@ -432,10 +432,7 @@ SYM_FUNC_START(__cpu_setup)
> sub x9, xzr, x0
> add x9, x9, #64
> tcr_set_t1sz tcr, x9
> -#else
> - idmap_get_t0sz x9
> #endif
> - tcr_set_t0sz tcr, x9
>
> /*
> * Set the IPS bits in TCR_EL1.
On 07/03/2023 14:05, Ard Biesheuvel wrote:
> Update the early kernel mapping code to take 52-bit virtual addressing
> into account based on the LPA2 feature. This is a bit more involved than
> LVA (which is supported with 64k pages only), given that some page table
> descriptor bits change meaning in this case.
>
> To keep the handling in asm to a minimum, the initial ID map is still
> created with 48-bit virtual addressing, which implies that the kernel
> image must be loaded into 48-bit addressable physical memory. This is
> currently required by the boot protocol, even though we happen to
> support placement outside of that for LVA/64k based configurations.
>
> Enabling LPA2 involves more than setting TCR.T1SZ to a lower value,
> there is also a DS bit in TCR that needs to be set, and which changes
> the meaning of bits [9:8] in all page table descriptors. Since we cannot
> enable DS and every live page table descriptor at the same time, let's
> pivot through another temporary mapping. This avoids the need to
> reintroduce manipulations of the page tables with the MMU and caches
> disabled.
>
> To permit the LPA2 feature to be overridden on the kernel command line,
> which may be necessary to work around silicon errata, or to deal with
> mismatched features on heterogeneous SoC designs, test for CPU feature
> overrides first, and only then enable LPA2.
>
> Signed-off-by: Ard Biesheuvel <[email protected]>
> ---
> arch/arm64/include/asm/assembler.h | 8 ++-
> arch/arm64/include/asm/cpufeature.h | 18 +++++
> arch/arm64/include/asm/memory.h | 4 ++
> arch/arm64/kernel/head.S | 8 +++
> arch/arm64/kernel/image-vars.h | 1 +
> arch/arm64/kernel/pi/map_kernel.c | 70 +++++++++++++++++++-
> arch/arm64/kernel/pi/map_range.c | 11 ++-
> arch/arm64/kernel/pi/pi.h | 4 +-
> arch/arm64/mm/init.c | 2 +-
> arch/arm64/mm/mmu.c | 6 +-
> arch/arm64/mm/proc.S | 3 +
> 11 files changed, 124 insertions(+), 11 deletions(-)
>
> diff --git a/arch/arm64/include/asm/assembler.h b/arch/arm64/include/asm/assembler.h
> index 55e8731844cf7eb7..d5e139ce0820479b 100644
> --- a/arch/arm64/include/asm/assembler.h
> +++ b/arch/arm64/include/asm/assembler.h
> @@ -581,11 +581,17 @@ alternative_endif
> * but we have to add an offset so that the TTBR1 address corresponds with the
> * pgdir entry that covers the lowest 48-bit addressable VA.
> *
> + * Note that this trick is only used for LVA/64k pages - LPA2/4k pages uses an
> + * additional paging level, and on LPA2/16k pages, we would end up with a root
> + * level table with only 2 entries, which is suboptimal in terms of TLB
> + * utilization, so there we fall back to 47 bits of translation if LPA2 is not
> + * supported.
> + *
> * orr is used as it can cover the immediate value (and is idempotent).
> * ttbr: Value of ttbr to set, modified.
> */
> .macro offset_ttbr1, ttbr, tmp
> -#ifdef CONFIG_ARM64_VA_BITS_52
> +#if defined(CONFIG_ARM64_VA_BITS_52) && !defined(CONFIG_ARM64_LPA2)
> mrs \tmp, tcr_el1
> and \tmp, \tmp, #TCR_T1SZ_MASK
> cmp \tmp, #TCR_T1SZ(VA_BITS_MIN)
> diff --git a/arch/arm64/include/asm/cpufeature.h b/arch/arm64/include/asm/cpufeature.h
> index 7faf9a48339e7c8c..170e18cb2b4faf11 100644
> --- a/arch/arm64/include/asm/cpufeature.h
> +++ b/arch/arm64/include/asm/cpufeature.h
> @@ -1002,6 +1002,24 @@ static inline bool cpu_has_lva(void)
> ID_AA64MMFR2_EL1_VARange_SHIFT);
> }
>
> +static inline bool cpu_has_lpa2(void)
> +{
> +#ifdef CONFIG_ARM64_LPA2
> + u64 mmfr0;
> + int feat;
> +
> + mmfr0 = read_sysreg(id_aa64mmfr0_el1);
> + mmfr0 &= ~id_aa64mmfr0_override.mask;
> + mmfr0 |= id_aa64mmfr0_override.val;
> + feat = cpuid_feature_extract_signed_field(mmfr0,
> + ID_AA64MMFR0_EL1_TGRAN_SHIFT);
> +
> + return feat >= ID_AA64MMFR0_EL1_TGRAN_LPA2;
> +#else
> + return false;
> +#endif
> +}
I wonder if we should rename this to cpu_has_lpa2_stage1()? I currently have a
system_supports_lpa2() function, which wraps a system cap, and reports true if
BOTH stage1 and stage2 are supported. I suspect this should be renamed to
something like system_has_lpa2_stage12() to match?
Regardless, in my series, KVM currently decides whether or not to use LPA2 page
tables based on system_supports_lpa2(). But I will need to add a new condition
whereby, if the kernel is using LPA2 (lpa2_is_enabled()) but stage2 reports that
LPA2 is not supported, KVM fails to initialize. Otherwise we could end up in a
situation where KVM can't map memory passed to it by the kernel.
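Something like the below is what I have in mind (hand-wavy sketch, the helper
names are mine and don't exist yet):

  /* in kvm_mmu_init() or thereabouts */
  if (lpa2_is_enabled() && !kvm_supports_lpa2_stage2())
          return -EINVAL; /* kernel PAs could exceed what stage 2 can map */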
On 07/03/2023 14:05, Ard Biesheuvel wrote:
> get_user_mapping_size() uses vabits_actual and CONFIG_PGTABLE_LEVELS to
> provide the starting point for a table walk. This is fine for LVA, as
> the number of translation levels is the same regardless of whether LVA
> is enabled. However, with LPA2, this will no longer be the case, so
> let's derive the number of levels from the number of VA bits directly.
>
> Signed-off-by: Ard Biesheuvel <[email protected]>
> ---
> arch/arm64/kvm/mmu.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> index d64be7b5f6692e8b..4e7c0f9a9c286c09 100644
> --- a/arch/arm64/kvm/mmu.c
> +++ b/arch/arm64/kvm/mmu.c
> @@ -663,7 +663,7 @@ static int get_user_mapping_size(struct kvm *kvm, u64 addr)
> .pgd = (kvm_pteref_t)kvm->mm->pgd,
> .ia_bits = vabits_actual,
> .start_level = (KVM_PGTABLE_MAX_LEVELS -
> - CONFIG_PGTABLE_LEVELS),
> + ARM64_HW_PGTABLE_LEVELS(pgt.ia_bits)),
> .mm_ops = &kvm_user_mm_ops,
> };
> kvm_pte_t pte = 0; /* Keep GCC quiet... */
You have the problem here that the KVM library (which isn't LPA2 aware) is
walking a kernel page table, which may now be in LPA2 format. I think this works
out ok as long as there are no physical addresses above 48 bits in the page
table. But otherwise, I doubt it works out very well...
On 07/03/2023 14:05, Ard Biesheuvel wrote:
> The KVM code needs more work to support 5 level paging with LPA2, so for
> the time being, limit KVM to 48 bit addressing on 4k and 16k pagesize
> configurations. This can be reverted once the LPA2 support for KVM is
> merged.
Don't you still have a problem that a user's memory could map to physical memory
above 48 bits that it tries to map into a KVM VM? How do you protect against
that? I think KVM needs to be disabled entirely when the kernel is using LPA2,
until KVM explicitly supports LPA2 too?
>
> Signed-off-by: Ard Biesheuvel <[email protected]>
> ---
> arch/arm64/kvm/hyp/nvhe/mem_protect.c | 2 ++
> arch/arm64/kvm/mmu.c | 5 ++++-
> arch/arm64/kvm/va_layout.c | 9 +++++----
> 3 files changed, 11 insertions(+), 5 deletions(-)
>
> diff --git a/arch/arm64/kvm/hyp/nvhe/mem_protect.c b/arch/arm64/kvm/hyp/nvhe/mem_protect.c
> index 552653fa18be34b2..e00b87ed4a8400f6 100644
> --- a/arch/arm64/kvm/hyp/nvhe/mem_protect.c
> +++ b/arch/arm64/kvm/hyp/nvhe/mem_protect.c
> @@ -128,6 +128,8 @@ static void prepare_host_vtcr(void)
> /* The host stage 2 is id-mapped, so use parange for T0SZ */
> parange = kvm_get_parange(id_aa64mmfr0_el1_sys_val);
> phys_shift = id_aa64mmfr0_parange_to_phys_shift(parange);
> + if (IS_ENABLED(CONFIG_ARM64_LPA2) && phys_shift > 48)
> + phys_shift = 48; // not implemented yet
>
> host_mmu.arch.vtcr = kvm_get_vtcr(id_aa64mmfr0_el1_sys_val,
> id_aa64mmfr1_el1_sys_val, phys_shift);
> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> index 4e7c0f9a9c286c09..2ad9e6f1e101e52d 100644
> --- a/arch/arm64/kvm/mmu.c
> +++ b/arch/arm64/kvm/mmu.c
> @@ -661,7 +661,8 @@ static int get_user_mapping_size(struct kvm *kvm, u64 addr)
> {
> struct kvm_pgtable pgt = {
> .pgd = (kvm_pteref_t)kvm->mm->pgd,
> - .ia_bits = vabits_actual,
> + .ia_bits = IS_ENABLED(CONFIG_ARM64_LPA2) ? 48
> + : vabits_actual,
> .start_level = (KVM_PGTABLE_MAX_LEVELS -
> ARM64_HW_PGTABLE_LEVELS(pgt.ia_bits)),
> .mm_ops = &kvm_user_mm_ops,
> @@ -1703,6 +1704,8 @@ int __init kvm_mmu_init(u32 *hyp_va_bits)
> idmap_bits = 48;
> kernel_bits = vabits_actual;
> *hyp_va_bits = max(idmap_bits, kernel_bits);
> + if (IS_ENABLED(CONFIG_ARM64_LPA2))
> + *hyp_va_bits = 48; // LPA2 is not yet supported in KVM
>
> kvm_debug("Using %u-bit virtual addresses at EL2\n", *hyp_va_bits);
> kvm_debug("IDMAP page: %lx\n", hyp_idmap_start);
> diff --git a/arch/arm64/kvm/va_layout.c b/arch/arm64/kvm/va_layout.c
> index 341b67e2f2514e55..ac87d0c39c38f7d9 100644
> --- a/arch/arm64/kvm/va_layout.c
> +++ b/arch/arm64/kvm/va_layout.c
> @@ -59,12 +59,13 @@ static void init_hyp_physvirt_offset(void)
> */
> __init void kvm_compute_layout(void)
> {
> + u64 vabits = IS_ENABLED(CONFIG_ARM64_LPA2) ? 48 : vabits_actual; // not yet
> phys_addr_t idmap_addr = __pa_symbol(__hyp_idmap_text_start);
> u64 hyp_va_msb;
>
> /* Where is my RAM region? */
> - hyp_va_msb = idmap_addr & BIT(vabits_actual - 1);
> - hyp_va_msb ^= BIT(vabits_actual - 1);
> + hyp_va_msb = idmap_addr & BIT(vabits - 1);
> + hyp_va_msb ^= BIT(vabits - 1);
>
> tag_lsb = fls64((u64)phys_to_virt(memblock_start_of_DRAM()) ^
> (u64)(high_memory - 1));
> @@ -72,10 +73,10 @@ __init void kvm_compute_layout(void)
> va_mask = GENMASK_ULL(tag_lsb - 1, 0);
> tag_val = hyp_va_msb;
>
> - if (IS_ENABLED(CONFIG_RANDOMIZE_BASE) && tag_lsb != (vabits_actual - 1) &&
> + if (IS_ENABLED(CONFIG_RANDOMIZE_BASE) && tag_lsb != (vabits - 1) &&
> !kaslr_disabled_cmdline()) {
> /* We have some free bits to insert a random tag. */
> - tag_val |= get_random_long() & GENMASK_ULL(vabits_actual - 2, tag_lsb);
> + tag_val |= get_random_long() & GENMASK_ULL(vabits - 2, tag_lsb);
> }
> tag_val >>= tag_lsb;
>
On 07/03/2023 14:04, Ard Biesheuvel wrote:
> This is a followup to [0], which was a lot smaller. Thanks to Ryan for
> feedback and review. This series is independent from Ryan's work on
> adding support for LPA2 to KVM - the only potential source of conflict
> should be the patch "arm64: kvm: Limit HYP VA and host S2 range to 48
> bits when LPA2 is in effect", which could simply be dropped in favour of
> the KVM changes to make it support LPA2.
>
> The first ~15 patches of this series rework how the kernel VA space is
> organized, so that the vmemmap region does not take up more space than
> necessary, and so that most of it can be reclaimed when running a build
> capable of 52-bit virtual addressing on hardware that is not. This is
> needed because the vmemmap region will take up a substantial part of the
> upper VA region that it shares with the kernel, modules and
> vmalloc/vmap mappings once we enable LPA2 with 4k pages.
>
> The next ~30 patches rework the early init code, reimplementing most of
> the page table and relocation handling in C code. There are several
> reasons why this is beneficial:
> - we generally prefer C code over asm for these things, and the macros
> that currently exist in head.S for creating the kernel pages tables
> are a good example why;
> - we no longer need to create the kernel mapping in two passes, which
> means we can remove the logic that copies parts of the fixmap and the
> KAsan shadow from one set of page tables to the other; this is
> especially advantageous for KAsan with LPA2, which needs more
> elaborate shadow handling across multiple levels, since the KAsan
> region cannot be placed on exact pgd_t bouundaries in that case;
> - we can read the ID registers and parse command line overrides before
> creating the page tables, which simplifies the LPA2 case, as flicking
> the global TCR_EL1.DS bit at a later stage would require elaborate
> repainting of all page table descriptors, some of which with the MMU
> disabled;
> - we can use more elaborate logic to create the mappings, which means we
> can use more precise mappings for code and data sections even when
> using 2 MiB granularity, and this is a prerequisite for running with
> WXN.
>
> As part of the ID map changes, we decouple the ID map size from the
> kernel VA size, and switch to a 48-bit VA map for all configurations.
>
> The next 18 patches rework the existing LVA support as a CPU feature,
> which simplifies some code and gets rid of the vabits_actual variable.
> Then, LPA2 support is implemented in the same vein. This requires adding
> support for 5 level paging as well, given that LPA2 introduces a new
> paging level '-1' when using 4k pages.
I still don't see any changes for TLBI, which I raised in the first round, and
which I think you need when enabling LPA2. I have two patches which I think are
appropriate; one covers the non-range tlbi routines (and is part of my KVM
series) and the other covers the range-based routines (this would need a bit of
fix-up to use your lpa2_is_enabled()). At [1] and [2] respectively; a rough
sketch of the range-based change follows the links below.
[1] https://lore.kernel.org/kvmarm/[email protected]/
[2] https://gitlab.arm.com/linux-arm/linux-rr/-/commit/38628decb785aea42a349a857b9f8a65a19e9c2b
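For the range-based case, the shape of the change would presumably be something
like the below (a sketch based on the description above, not the actual patch at
[2]; the helper name is made up). The point is that the base-address field of
the range TLBI instructions is scaled differently once LPA2/TCR_ELx.DS is in
effect (64k units rather than page-size units), so the encoding has to become
conditional on lpa2_is_enabled():

	/* Sketch only: 64k-unit scaling with LPA2, page-size units otherwise */
	static inline unsigned long tlbi_range_baddr(unsigned long addr)
	{
		return lpa2_is_enabled() ? (addr >> 16) : (addr >> PAGE_SHIFT);
	}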
Thanks,
Ryan
On Tue, Mar 07, 2023 at 03:04:23PM +0100, Ard Biesheuvel wrote:
> Avoid build issues in the early C code related to the latent_entropy GCC
> plugin, by incorporating the C flags fragment that disables it.
>
> Signed-off-by: Ard Biesheuvel <[email protected]>
Just to check, are you seeing issues today? IIUC the plugin only instruments
functions which are explicitly marked with __latent_entropy, and if we're
seeing that happen unexpectedly (or due to it being applied to __meminit /
__init), we might need to do likewise for other noinstr code.
Regardless, for this patch:
Acked-by: Mark Rutland <[email protected]>
Mark.
> ---
> arch/arm64/kernel/pi/Makefile | 1 +
> 1 file changed, 1 insertion(+)
>
> diff --git a/arch/arm64/kernel/pi/Makefile b/arch/arm64/kernel/pi/Makefile
> index 4c0ea3cd4ea406b6..c844a0546d7f0e62 100644
> --- a/arch/arm64/kernel/pi/Makefile
> +++ b/arch/arm64/kernel/pi/Makefile
> @@ -3,6 +3,7 @@
>
> KBUILD_CFLAGS := $(subst $(CC_FLAGS_FTRACE),,$(KBUILD_CFLAGS)) -fpie \
> -Os -DDISABLE_BRANCH_PROFILING $(DISABLE_STACKLEAK_PLUGIN) \
> + $(DISABLE_LATENT_ENTROPY_PLUGIN) \
> $(call cc-option,-mbranch-protection=none) \
> -I$(srctree)/scripts/dtc/libfdt -fno-stack-protector \
> -include $(srctree)/include/linux/hidden.h \
> --
> 2.39.2
>
>
On Tue, Mar 07, 2023 at 03:04:25PM +0100, Ard Biesheuvel wrote:
> We store the address of _text in kimage_vaddr, but since commit
> 09e3c22a86f6889d ("arm64: Use a variable to store non-global mappings
> decision"), we no longer reference this variable from modules so we no
> longer need to export it.
>
> In fact, we don't need it at all so let's just get rid of it.
>
> Signed-off-by: Ard Biesheuvel <[email protected]>
Acked-by: Mark Rutland <[email protected]>
Mark.
> ---
> arch/arm64/include/asm/memory.h | 6 ++----
> arch/arm64/kernel/head.S | 2 +-
> arch/arm64/mm/mmu.c | 3 ---
> 3 files changed, 3 insertions(+), 8 deletions(-)
>
> diff --git a/arch/arm64/include/asm/memory.h b/arch/arm64/include/asm/memory.h
> index 78e5163836a0ab95..a4e1d832a15a2d7a 100644
> --- a/arch/arm64/include/asm/memory.h
> +++ b/arch/arm64/include/asm/memory.h
> @@ -182,6 +182,7 @@
> #include <linux/types.h>
> #include <asm/boot.h>
> #include <asm/bug.h>
> +#include <asm/sections.h>
>
> #if VA_BITS > 48
> extern u64 vabits_actual;
> @@ -193,15 +194,12 @@ extern s64 memstart_addr;
> /* PHYS_OFFSET - the physical address of the start of memory. */
> #define PHYS_OFFSET ({ VM_BUG_ON(memstart_addr & 1); memstart_addr; })
>
> -/* the virtual base of the kernel image */
> -extern u64 kimage_vaddr;
> -
> /* the offset between the kernel virtual and physical mappings */
> extern u64 kimage_voffset;
>
> static inline unsigned long kaslr_offset(void)
> {
> - return kimage_vaddr - KIMAGE_VADDR;
> + return (u64)&_text - KIMAGE_VADDR;
> }
>
> static inline bool kaslr_enabled(void)
> diff --git a/arch/arm64/kernel/head.S b/arch/arm64/kernel/head.S
> index b98970907226b36c..65cdaaa2c859418f 100644
> --- a/arch/arm64/kernel/head.S
> +++ b/arch/arm64/kernel/head.S
> @@ -482,7 +482,7 @@ SYM_FUNC_START_LOCAL(__primary_switched)
>
> str_l x21, __fdt_pointer, x5 // Save FDT pointer
>
> - ldr_l x4, kimage_vaddr // Save the offset between
> + adrp x4, _text // Save the offset between
> sub x4, x4, x0 // the kernel virtual and
> str_l x4, kimage_voffset, x5 // physical mappings
>
> diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
> index 6f9d8898a02516f6..81e1420d2cc13246 100644
> --- a/arch/arm64/mm/mmu.c
> +++ b/arch/arm64/mm/mmu.c
> @@ -50,9 +50,6 @@ u64 vabits_actual __ro_after_init = VA_BITS_MIN;
> EXPORT_SYMBOL(vabits_actual);
> #endif
>
> -u64 kimage_vaddr __ro_after_init = (u64)&_text;
> -EXPORT_SYMBOL(kimage_vaddr);
> -
> u64 kimage_voffset __ro_after_init;
> EXPORT_SYMBOL(kimage_voffset);
>
> --
> 2.39.2
>
>
On Tue, Mar 07, 2023 at 03:04:24PM +0100, Ard Biesheuvel wrote:
> We enable CONFIG_RELOCATABLE even when CONFIG_RANDOMIZE_BASE is
> disabled, and this permits the loader (i.e., EFI) to place the kernel
> anywhere in physical memory as long as the base address is 64k aligned.
>
> This means that the 'KASLR' case described in the header that defines
> the size of the statically allocated page tables could take effect even
> when CONFIG_RANDOMIZE_BASE=n. So check for CONFIG_RELOCATABLE instead.
Could we please update the comment to describe that? As of this commit it'll
be left describing a KASLR-specific case, and it'd be good to have it mention
the case described in this commit message.
Thanks,
Mark.
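Perhaps something along these lines (suggested wording only, not taken from the
series):

	/*
	 * A relocatable kernel (CONFIG_RELOCATABLE=y) may be loaded at any
	 * 64k aligned physical address by the loader, even when
	 * CONFIG_RANDOMIZE_BASE=n, so the kernel mapping may still cross an
	 * additional table boundary at each level, and we must allocate the
	 * extra early page tables to cover that case.
	 */
	#ifdef CONFIG_RELOCATABLE
	#define EARLY_KASLR	(1)
	#else
	#define EARLY_KASLR	(0)
	#endif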
>
> Signed-off-by: Ard Biesheuvel <[email protected]>
> ---
> arch/arm64/include/asm/kernel-pgtable.h | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/arch/arm64/include/asm/kernel-pgtable.h b/arch/arm64/include/asm/kernel-pgtable.h
> index fcd14197756f0619..4d13c73171e1e360 100644
> --- a/arch/arm64/include/asm/kernel-pgtable.h
> +++ b/arch/arm64/include/asm/kernel-pgtable.h
> @@ -53,7 +53,7 @@
> * address is just pushed over a boundary and the start address isn't).
> */
>
> -#ifdef CONFIG_RANDOMIZE_BASE
> +#ifdef CONFIG_RELOCATABLE
> #define EARLY_KASLR (1)
> #else
> #define EARLY_KASLR (0)
> --
> 2.39.2
>
>
On Fri, 28 Apr 2023 at 11:38, Mark Rutland <[email protected]> wrote:
>
> On Tue, Mar 07, 2023 at 03:04:23PM +0100, Ard Biesheuvel wrote:
> > Avoid build issues in the early C code related to the latent_entropy GCC
> > plugin, by incorporating the C flags fragment that disables it.
> >
> > Signed-off-by: Ard Biesheuvel <[email protected]>
>
> Just to check, are you seeing issues today? IIUC the plugin only instruments
> functions which are explicitly marked with __latent_entropy, and if we're
> seeing that happen unexpectedly (or due to that being applying to __meminit /
> __init), we might need to do likewise for other noinstr code.
>
I don't quite remember, tbh, but it is unlikely that I would have
written or included this patch without having run into some actual
issue.
> Regardless, for this patch:
>
> Acked-by: Mark Rutland <[email protected]>
>
Thanks,
On Tue, Mar 07, 2023 at 03:04:27PM +0100, Ard Biesheuvel wrote:
> Move the fixmap region above the vmemmap region, so that the start of
> the vmemmap delineates the end of the region available for vmalloc and
> vmap allocations and the randomized placement of the kernel and modules.
>
> In a subsequent patch, we will take advantage of this to reclaim most of
> the vmemmap area when running a 52-bit VA capable build with 52-bit
> virtual addressing disabled at runtime.
>
> Note that the existing guard region of 256 MiB covers the fixmap and PCI
> I/O regions as well, so we can reduce it to 8 MiB, which is what we use in
> other places too.
>
> Signed-off-by: Ard Biesheuvel <[email protected]>
> ---
> arch/arm64/include/asm/memory.h | 2 +-
> arch/arm64/include/asm/pgtable.h | 2 +-
> arch/arm64/mm/ptdump.c | 4 ++--
> 3 files changed, 4 insertions(+), 4 deletions(-)
As a heads-up, this will (trivially) conflict with some of the arm64 fixmap
cleanups that'll be in v6.4-rc1, due to the FIXADDR_TOT_* changes.
>
> diff --git a/arch/arm64/include/asm/memory.h b/arch/arm64/include/asm/memory.h
> index 6e321cc06a3c30f0..9b9e52d823beccc6 100644
> --- a/arch/arm64/include/asm/memory.h
> +++ b/arch/arm64/include/asm/memory.h
> @@ -51,7 +51,7 @@
> #define VMEMMAP_END (VMEMMAP_START + VMEMMAP_SIZE)
> #define PCI_IO_START (VMEMMAP_END + SZ_8M)
> #define PCI_IO_END (PCI_IO_START + PCI_IO_SIZE)
> -#define FIXADDR_TOP (VMEMMAP_START - SZ_32M)
> +#define FIXADDR_TOP (ULONG_MAX - SZ_8M + 1)
Can we make this:
#define FIXADDR_TOP (-SZ_8M)
... as that would match the way we define PAGE_OFFSET (and VMEMMAP_START), and
it removes the need for the '+1' to handle ULONG_MAX being off by one from what
we actually want to subtract from.
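(For anyone double-checking: as an unsigned long the two spellings are the same
value, e.g.:

	/* assumes linux/sizes.h and linux/build_bug.h */
	BUILD_BUG_ON((unsigned long)(-SZ_8M) != ULONG_MAX - SZ_8M + 1);
)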
Mark.
>
> #if VA_BITS > 48
> #define VA_BITS_MIN (48)
> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
> index b6ba466e2e8a3fc7..3eff06c5d0eb73c7 100644
> --- a/arch/arm64/include/asm/pgtable.h
> +++ b/arch/arm64/include/asm/pgtable.h
> @@ -22,7 +22,7 @@
> * and fixed mappings
> */
> #define VMALLOC_START (MODULES_END)
> -#define VMALLOC_END (VMEMMAP_START - SZ_256M)
> +#define VMALLOC_END (VMEMMAP_START - SZ_8M)
>
> #define vmemmap ((struct page *)VMEMMAP_START - (memstart_addr >> PAGE_SHIFT))
>
> diff --git a/arch/arm64/mm/ptdump.c b/arch/arm64/mm/ptdump.c
> index 9d1f4cdc6672ed5f..76d28056bd14920a 100644
> --- a/arch/arm64/mm/ptdump.c
> +++ b/arch/arm64/mm/ptdump.c
> @@ -45,12 +45,12 @@ static struct addr_marker address_markers[] = {
> { MODULES_END, "Modules end" },
> { VMALLOC_START, "vmalloc() area" },
> { VMALLOC_END, "vmalloc() end" },
> - { FIXADDR_START, "Fixmap start" },
> - { FIXADDR_TOP, "Fixmap end" },
> { VMEMMAP_START, "vmemmap start" },
> { VMEMMAP_START + VMEMMAP_SIZE, "vmemmap end" },
> { PCI_IO_START, "PCI I/O start" },
> { PCI_IO_END, "PCI I/O end" },
> + { FIXADDR_START, "Fixmap start" },
> + { FIXADDR_TOP, "Fixmap end" },
> { -1, NULL },
> };
>
> --
> 2.39.2
>
>
On Tue, Mar 07, 2023 at 03:04:28PM +0100, Ard Biesheuvel wrote:
> Extend the existing pattern for populating ptdump marker entries at
> boot, and add handling of VMALLOC_END, which will cease to be a compile
> time constant for configurations that support 52-bit virtual addressing.
>
> Signed-off-by: Ard Biesheuvel <[email protected]>
> ---
> arch/arm64/mm/ptdump.c | 8 +++++++-
> 1 file changed, 7 insertions(+), 1 deletion(-)
>
> diff --git a/arch/arm64/mm/ptdump.c b/arch/arm64/mm/ptdump.c
> index 76d28056bd14920a..910b35f02280cbdb 100644
> --- a/arch/arm64/mm/ptdump.c
> +++ b/arch/arm64/mm/ptdump.c
> @@ -31,7 +31,12 @@ enum address_markers_idx {
> PAGE_END_NR,
> #if defined(CONFIG_KASAN_GENERIC) || defined(CONFIG_KASAN_SW_TAGS)
> KASAN_START_NR,
> + KASAN_END_NR,
> #endif
> + MODULES_NR,
> + MODULES_END_NR,
> + VMALLOC_START_NR,
> + VMALLOC_END_NR,
> };
While it'd be a bit more verbose, I reckon it'd be worth making all of this
dynamically initialized. That would naturally handle things which are not
compile-time constant, and it'd keep all the values in one place, so it's
easier to keep the start/end/name/whatever in sync.
Something like:
| enum address_markers_idx {
| MARKER_LINEAR_START,
| MARKER_LINEAR_END,
| #if defined(CONFIG_KASAN_GENERIC) || defined(CONFIG_KASAN_SW_TAGS)
| MARKER_KASAN_START,
| MARKER_KASAN_END,
| #endif
| MARKER_FOO_START,
| MARKER_FOO_END,
| MARKER_BAR_START,
| MARKER_BAR_END,
| MARKER_SENTINEL,
| };
|
| static __ro_after_init struct addr_marker address_markers[] = {
| [MARKER_SENTINEL] = {
| .start_address = 0,
| .name = NULL,
| }
| };
|
| static int __init ptdump_init(void)
| {
| address_markers[MARKER_LINEAR_START] = (struct addr_marker) {
| .start_address = [...],
| .name = "Linear mapping start",
| };
| address_markers[MARKER_LINEAR_END] = (struct addr_marker) {
| .start_address = [...],
| .name = "Linear mapping end",
| };
|
| #if defined(CONFIG_KASAN_GENERIC) || defined(CONFIG_KASAN_SW_TAGS)
| address_markers[MARKER_KASAN_START] = (struct addr_marker) {
| .start_address = [...],
| .name = "KASAN shadow start",
| };
| address_markers[MARKER_KASAN_END] = (struct addr_marker) {
| .start_address = [...],
| .name = "KASAN shadow end",
| };
| #endif
|
| [...]
|
| ptdump_initialize();
| ptdump_debugfs_register(&kernel_ptdump_info, "kernel_page_tables");
| return 0;
| };
Thanks,
Mark.
>
> static struct addr_marker address_markers[] = {
> @@ -44,7 +49,7 @@ static struct addr_marker address_markers[] = {
> { MODULES_VADDR, "Modules start" },
> { MODULES_END, "Modules end" },
> { VMALLOC_START, "vmalloc() area" },
> - { VMALLOC_END, "vmalloc() end" },
> + { 0, "vmalloc() end" },
> { VMEMMAP_START, "vmemmap start" },
> { VMEMMAP_START + VMEMMAP_SIZE, "vmemmap end" },
> { PCI_IO_START, "PCI I/O start" },
> @@ -379,6 +384,7 @@ static int __init ptdump_init(void)
> #if defined(CONFIG_KASAN_GENERIC) || defined(CONFIG_KASAN_SW_TAGS)
> address_markers[KASAN_START_NR].start_address = KASAN_SHADOW_START;
> #endif
> + address_markers[VMALLOC_END_NR].start_address = VMALLOC_END;
> ptdump_initialize();
> ptdump_debugfs_register(&kernel_ptdump_info, "kernel_page_tables");
> return 0;
> --
> 2.39.2
>
>
On Tue, Mar 07, 2023 at 03:04:29PM +0100, Ard Biesheuvel wrote:
> We will soon reclaim the part of the vmemmap region that covers VA space
> that is not addressable by the hardware. To avoid confusion, ensure that
> the 'vmemmap start' marker points at the start of the region that is
> actually being used for the struct page array, rather than the start of
> the region we set aside for it at build time.
>
> Signed-off-by: Ard Biesheuvel <[email protected]>
> ---
> arch/arm64/mm/ptdump.c | 5 +++++
> 1 file changed, 5 insertions(+)
>
> diff --git a/arch/arm64/mm/ptdump.c b/arch/arm64/mm/ptdump.c
> index 910b35f02280cbdb..8f37d6d8b5216473 100644
> --- a/arch/arm64/mm/ptdump.c
> +++ b/arch/arm64/mm/ptdump.c
> @@ -37,6 +37,7 @@ enum address_markers_idx {
> MODULES_END_NR,
> VMALLOC_START_NR,
> VMALLOC_END_NR,
> + VMEMMAP_START_NR,
> };
>
> static struct addr_marker address_markers[] = {
> @@ -386,6 +387,10 @@ static int __init ptdump_init(void)
> #endif
> address_markers[VMALLOC_END_NR].start_address = VMALLOC_END;
> ptdump_initialize();
> + if (vabits_actual < VA_BITS) {
> + address_markers[VMEMMAP_START_NR].start_address =
> + (unsigned long)virt_to_page(_PAGE_OFFSET(vabits_actual));
> + }
As Ryan suggested, I think it'd be worth doing this unconditionally, to keep
this simple.
Mark.
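FWIW, the unconditional version would just be the below (a sketch; the
assumption is that virt_to_page(_PAGE_OFFSET(vabits_actual)) evaluates to
VMEMMAP_START when running with the full VA_BITS, so nothing changes in that
case):

	address_markers[VMEMMAP_START_NR].start_address =
		(unsigned long)virt_to_page(_PAGE_OFFSET(vabits_actual));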
> ptdump_debugfs_register(&kernel_ptdump_info, "kernel_page_tables");
> return 0;
> }
> --
> 2.39.2
>
>
On Fri, Apr 28, 2023 at 11:54:16AM +0100, Ard Biesheuvel wrote:
> On Fri, 28 Apr 2023 at 11:38, Mark Rutland <[email protected]> wrote:
> >
> > On Tue, Mar 07, 2023 at 03:04:23PM +0100, Ard Biesheuvel wrote:
> > > Avoid build issues in the early C code related to the latent_entropy GCC
> > > plugin, by incorporating the C flags fragment that disables it.
> > >
> > > Signed-off-by: Ard Biesheuvel <[email protected]>
> >
> > Just to check, are you seeing issues today? IIUC the plugin only instruments
> > functions which are explicitly marked with __latent_entropy, and if we're
> > seeing that happen unexpectedly (or due to that being applying to __meminit /
> > __init), we might need to do likewise for other noinstr code.
> >
>
> I don't quite remember, tbh, but it is unlikely that I would have
> written or included this patch without having run into some actual
> issue.
Sure.
Looking at the series, from patch 15 onwards you mark portions of the PI code
as __init. Since __init currently implies __latent_entropy (which I think is a
bit crazy in and of itself...), that's why this'll start to fail.
It would be nice if we could mention that in the commit message, e.g.
| In subsequent patches we'll mark portions of the early C code as __init.
| Unfortunately, __init implies __latent_entropy, and this would result in the
| early C code being instrumented in an unsafe manner.
|
| Disable the latent entropy plugin for the early C code.
... though my ack stands regardless of whether we add such wording.
Mark.
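For context, the link described above is in the definition of __init itself
(simplified from include/linux/init.h; the exact attribute list may differ):

	/* __init carries __latent_entropy when the plugin is enabled */
	#define __init	__section(".init.text") __cold __latent_entropy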
> > Regardless, for this patch:
> >
> > Acked-by: Mark Rutland <[email protected]>
> >
>
> Thanks,