Unlike the previous versions, this patchset allows a single kernel to
support both sv39 and sv48 without being relocatable.
The idea comes from Arnd Bergmann, who suggested doing the same as x86,
that is, mapping the kernel at the end of the address space. This allows
the kernel to be linked at the same address for both sv39 and sv48, so it
does not need to be relocated at runtime.
This is an RFC because I need to at least rebase a few commits and add
documentation. The patches on which I most expect feedback are
1/12, 2/12 and 8/12. Note that moving the kernel out of the linear
mapping and sv48 support can be separate patchsets; I share them together
today to show that it works (this patchset is rebased on top of v5.10).
If we agree on the overall idea, I'll rebase my relocatable patchset
on top of this, and the KASLR implementation from Zong will then be greatly
simplified, since moving the kernel out of the linear mapping avoids
having to copy the kernel physically.
This implements sv48 support at runtime. The kernel will try to
boot with a 4-level page table and will fall back to 3-level if the HW does
not support it. Folding the 4th level into a 3-level page table has almost
no cost at runtime.
Finally, the user can now ask for sv39 explicitly through the device tree,
which reduces the memory footprint and the number of memory accesses
in case of a TLB miss.
Alexandre Ghiti (12):
riscv: Move kernel mapping outside of linear mapping
riscv: Protect the kernel linear mapping
riscv: Get rid of compile time logic with MAX_EARLY_MAPPING_SIZE
riscv: Allow to dynamically define VA_BITS
riscv: Simplify MAXPHYSMEM config
riscv: Prepare ptdump for vm layout dynamic addresses
asm-generic: Prepare for riscv use of pud_alloc_one and pud_free
riscv: Implement sv48 support
riscv: Allow user to downgrade to sv39 when hw supports sv48
riscv: Use pgtable_l4_enabled to output mmu type in cpuinfo
riscv: Explicit comment about user virtual address space size
riscv: Improve virtual kernel memory layout dump
arch/riscv/Kconfig | 34 +--
arch/riscv/boot/loader.lds.S | 3 +-
arch/riscv/include/asm/csr.h | 3 +-
arch/riscv/include/asm/fixmap.h | 3 +
arch/riscv/include/asm/page.h | 33 ++-
arch/riscv/include/asm/pgalloc.h | 40 +++
arch/riscv/include/asm/pgtable-64.h | 104 ++++++-
arch/riscv/include/asm/pgtable.h | 68 +++--
arch/riscv/include/asm/sparsemem.h | 6 +-
arch/riscv/kernel/cpu.c | 23 +-
arch/riscv/kernel/head.S | 6 +-
arch/riscv/kernel/module.c | 4 +-
arch/riscv/kernel/vmlinux.lds.S | 3 +-
arch/riscv/mm/context.c | 2 +-
arch/riscv/mm/init.c | 376 ++++++++++++++++++++----
arch/riscv/mm/physaddr.c | 2 +-
arch/riscv/mm/ptdump.c | 56 +++-
drivers/firmware/efi/libstub/efi-stub.c | 2 +-
include/asm-generic/pgalloc.h | 24 +-
include/linux/sizes.h | 3 +-
20 files changed, 648 insertions(+), 147 deletions(-)
--
2.20.1
This is a preparatory patch for relocatable kernel and sv48 support.
The kernel used to be linked at the PAGE_OFFSET address, so the linear
mapping could also serve as the kernel mapping. But the relocated kernel
base address will differ from PAGE_OFFSET, and since within the linear
mapping two different virtual addresses cannot point to the same physical
address, the kernel mapping needs to lie outside the linear mapping so that
we don't have to copy it to the same physical offset.
The kernel mapping is moved to the last 2GB of the address space, and
BPF and modules are pushed to the same range since they have to lie
close to the kernel, inside a 2GB window.
Note that a KASLR implementation will then simply have to move the kernel
within this 2GB range and adjust the BPF/modules regions accordingly.
In addition, by moving the kernel to the end of the address space, the
sv39 and sv48 kernels will be exactly the same, without needing to be
relocated at runtime.
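As an illustration only, the two-offset address translation described
above can be sketched in plain C. The ex_ names and the example addresses
are hypothetical stand-ins chosen for this sketch (an sv39-style
PAGE_OFFSET and a link address in the last 2GB); they are not the
kernel's definitions.

```c
#include <stdint.h>

/* Hypothetical example values, for illustration only. */
#define EX_PAGE_OFFSET      0xffffffe000000000ULL /* start of linear mapping */
#define EX_KERNEL_LINK_ADDR 0xffffffff80000000ULL /* last 2GB of address space */

static uint64_t ex_va_pa_offset;        /* linear mapping VA minus PA */
static uint64_t ex_va_kernel_pa_offset; /* kernel mapping VA minus PA */

/* Record both offsets once the physical load address is known. */
static void ex_setup_offsets(uint64_t load_pa)
{
	ex_va_pa_offset = EX_PAGE_OFFSET - load_pa;
	ex_va_kernel_pa_offset = EX_KERNEL_LINK_ADDR - load_pa;
}

/*
 * Addresses below the kernel link address belong to the linear mapping,
 * everything at or above it belongs to the kernel mapping.
 */
static uint64_t ex_va_to_pa(uint64_t va)
{
	if (va < EX_KERNEL_LINK_ADDR)
		return va - ex_va_pa_offset;

	return va - ex_va_kernel_pa_offset;
}
```

With both offsets derived from the same load address, a linear-mapping
address and a kernel-mapping address at the same distance from their
respective bases resolve to the same physical address.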
Suggested-by: Arnd Bergmann <[email protected]>
Signed-off-by: Alexandre Ghiti <[email protected]>
---
arch/riscv/boot/loader.lds.S | 3 +-
arch/riscv/include/asm/page.h | 10 ++++-
arch/riscv/include/asm/pgtable.h | 39 +++++++++++++------
arch/riscv/kernel/head.S | 3 +-
arch/riscv/kernel/module.c | 4 +-
arch/riscv/kernel/vmlinux.lds.S | 3 +-
arch/riscv/mm/init.c | 65 ++++++++++++++++++++++++--------
arch/riscv/mm/physaddr.c | 2 +-
8 files changed, 94 insertions(+), 35 deletions(-)
diff --git a/arch/riscv/boot/loader.lds.S b/arch/riscv/boot/loader.lds.S
index 47a5003c2e28..62d94696a19c 100644
--- a/arch/riscv/boot/loader.lds.S
+++ b/arch/riscv/boot/loader.lds.S
@@ -1,13 +1,14 @@
/* SPDX-License-Identifier: GPL-2.0 */
#include <asm/page.h>
+#include <asm/pgtable.h>
OUTPUT_ARCH(riscv)
ENTRY(_start)
SECTIONS
{
- . = PAGE_OFFSET;
+ . = KERNEL_LINK_ADDR;
.payload : {
*(.payload)
diff --git a/arch/riscv/include/asm/page.h b/arch/riscv/include/asm/page.h
index 2d50f76efe48..98188e315e8d 100644
--- a/arch/riscv/include/asm/page.h
+++ b/arch/riscv/include/asm/page.h
@@ -90,18 +90,26 @@ typedef struct page *pgtable_t;
#ifdef CONFIG_MMU
extern unsigned long va_pa_offset;
+extern unsigned long va_kernel_pa_offset;
extern unsigned long pfn_base;
#define ARCH_PFN_OFFSET (pfn_base)
#else
#define va_pa_offset 0
+#define va_kernel_pa_offset 0
#define ARCH_PFN_OFFSET (PAGE_OFFSET >> PAGE_SHIFT)
#endif /* CONFIG_MMU */
extern unsigned long max_low_pfn;
extern unsigned long min_low_pfn;
+extern unsigned long kernel_virt_addr;
#define __pa_to_va_nodebug(x) ((void *)((unsigned long) (x) + va_pa_offset))
-#define __va_to_pa_nodebug(x) ((unsigned long)(x) - va_pa_offset)
+#define linear_mapping_va_to_pa(x) ((unsigned long)(x) - va_pa_offset)
+#define kernel_mapping_va_to_pa(x) \
+ ((unsigned long)(x) - va_kernel_pa_offset)
+#define __va_to_pa_nodebug(x) \
+ (((x) < KERNEL_LINK_ADDR) ? \
+ linear_mapping_va_to_pa(x) : kernel_mapping_va_to_pa(x))
#ifdef CONFIG_DEBUG_VIRTUAL
extern phys_addr_t __virt_to_phys(unsigned long x);
diff --git a/arch/riscv/include/asm/pgtable.h b/arch/riscv/include/asm/pgtable.h
index 183f1f4b2ae6..102b728ca146 100644
--- a/arch/riscv/include/asm/pgtable.h
+++ b/arch/riscv/include/asm/pgtable.h
@@ -11,23 +11,32 @@
#include <asm/pgtable-bits.h>
-#ifndef __ASSEMBLY__
-
-/* Page Upper Directory not used in RISC-V */
-#include <asm-generic/pgtable-nopud.h>
-#include <asm/page.h>
-#include <asm/tlbflush.h>
-#include <linux/mm_types.h>
+#ifndef CONFIG_MMU
+#define KERNEL_VIRT_ADDR PAGE_OFFSET
+#define KERNEL_LINK_ADDR PAGE_OFFSET
+#else
-#ifdef CONFIG_MMU
+#define ADDRESS_SPACE_END (UL(-1))
+/*
+ * Leave 2GB for kernel, modules and BPF at the end of the address space
+ */
+#define KERNEL_VIRT_ADDR (ADDRESS_SPACE_END - SZ_2G + 1)
+#define KERNEL_LINK_ADDR KERNEL_VIRT_ADDR
#define VMALLOC_SIZE (KERN_VIRT_SIZE >> 1)
#define VMALLOC_END (PAGE_OFFSET - 1)
#define VMALLOC_START (PAGE_OFFSET - VMALLOC_SIZE)
+/* KASLR should leave at least 128MB for BPF after the kernel */
#define BPF_JIT_REGION_SIZE (SZ_128M)
-#define BPF_JIT_REGION_START (PAGE_OFFSET - BPF_JIT_REGION_SIZE)
-#define BPF_JIT_REGION_END (VMALLOC_END)
+#define BPF_JIT_REGION_START PFN_ALIGN((unsigned long)&_end)
+#define BPF_JIT_REGION_END (BPF_JIT_REGION_START + BPF_JIT_REGION_SIZE)
+
+/* Modules always live before the kernel */
+#ifdef CONFIG_64BIT
+#define VMALLOC_MODULE_START (PFN_ALIGN((unsigned long)&_end) - SZ_2G)
+#define VMALLOC_MODULE_END (PFN_ALIGN((unsigned long)&_start))
+#endif
/*
* Roughly size the vmemmap space to be large enough to fit enough
@@ -57,9 +66,16 @@
#define FIXADDR_SIZE PGDIR_SIZE
#endif
#define FIXADDR_START (FIXADDR_TOP - FIXADDR_SIZE)
-
#endif
+#ifndef __ASSEMBLY__
+
+/* Page Upper Directory not used in RISC-V */
+#include <asm-generic/pgtable-nopud.h>
+#include <asm/page.h>
+#include <asm/tlbflush.h>
+#include <linux/mm_types.h>
+
#ifdef CONFIG_64BIT
#include <asm/pgtable-64.h>
#else
@@ -467,6 +483,7 @@ static inline void __kernel_map_pages(struct page *page, int numpages, int enabl
#define kern_addr_valid(addr) (1) /* FIXME */
+extern char _start[];
extern void *dtb_early_va;
extern uintptr_t dtb_early_pa;
void setup_bootmem(void);
diff --git a/arch/riscv/kernel/head.S b/arch/riscv/kernel/head.S
index 7e849797c9c3..66f40c49bf68 100644
--- a/arch/riscv/kernel/head.S
+++ b/arch/riscv/kernel/head.S
@@ -69,7 +69,8 @@ pe_head_start:
#ifdef CONFIG_MMU
relocate:
/* Relocate return address */
- li a1, PAGE_OFFSET
+ la a1, kernel_virt_addr
+ REG_L a1, 0(a1)
la a2, _start
sub a1, a1, a2
add ra, ra, a1
diff --git a/arch/riscv/kernel/module.c b/arch/riscv/kernel/module.c
index 104fba889cf7..75a0b9541266 100644
--- a/arch/riscv/kernel/module.c
+++ b/arch/riscv/kernel/module.c
@@ -408,12 +408,10 @@ int apply_relocate_add(Elf_Shdr *sechdrs, const char *strtab,
}
#if defined(CONFIG_MMU) && defined(CONFIG_64BIT)
-#define VMALLOC_MODULE_START \
- max(PFN_ALIGN((unsigned long)&_end - SZ_2G), VMALLOC_START)
void *module_alloc(unsigned long size)
{
return __vmalloc_node_range(size, 1, VMALLOC_MODULE_START,
- VMALLOC_END, GFP_KERNEL,
+ VMALLOC_MODULE_END, GFP_KERNEL,
PAGE_KERNEL_EXEC, 0, NUMA_NO_NODE,
__builtin_return_address(0));
}
diff --git a/arch/riscv/kernel/vmlinux.lds.S b/arch/riscv/kernel/vmlinux.lds.S
index 3ffbd6cbdb86..c21dc46f41be 100644
--- a/arch/riscv/kernel/vmlinux.lds.S
+++ b/arch/riscv/kernel/vmlinux.lds.S
@@ -4,7 +4,8 @@
* Copyright (C) 2017 SiFive
*/
-#define LOAD_OFFSET PAGE_OFFSET
+#include <asm/pgtable.h>
+#define LOAD_OFFSET KERNEL_LINK_ADDR
#include <asm/vmlinux.lds.h>
#include <asm/page.h>
#include <asm/cache.h>
diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c
index 8e577f14f120..9d06ff0e015a 100644
--- a/arch/riscv/mm/init.c
+++ b/arch/riscv/mm/init.c
@@ -23,6 +23,9 @@
#include "../kernel/head.h"
+unsigned long kernel_virt_addr = KERNEL_VIRT_ADDR;
+EXPORT_SYMBOL(kernel_virt_addr);
+
unsigned long empty_zero_page[PAGE_SIZE / sizeof(unsigned long)]
__page_aligned_bss;
EXPORT_SYMBOL(empty_zero_page);
@@ -201,8 +204,12 @@ void __init setup_bootmem(void)
#ifdef CONFIG_MMU
static struct pt_alloc_ops pt_ops;
+/* Offset between linear mapping virtual address and kernel load address */
unsigned long va_pa_offset;
EXPORT_SYMBOL(va_pa_offset);
+/* Offset between kernel mapping virtual address and kernel load address */
+unsigned long va_kernel_pa_offset;
+EXPORT_SYMBOL(va_kernel_pa_offset);
unsigned long pfn_base;
EXPORT_SYMBOL(pfn_base);
@@ -316,7 +323,7 @@ static phys_addr_t __init alloc_pmd_early(uintptr_t va)
{
uintptr_t pmd_num;
- pmd_num = (va - PAGE_OFFSET) >> PGDIR_SHIFT;
+ pmd_num = (va - kernel_virt_addr) >> PGDIR_SHIFT;
BUG_ON(pmd_num >= NUM_EARLY_PMDS);
return (uintptr_t)&early_pmd[pmd_num * PTRS_PER_PMD];
}
@@ -431,17 +438,34 @@ static uintptr_t __init best_map_size(phys_addr_t base, phys_addr_t size)
#error "setup_vm() is called from head.S before relocate so it should not use absolute addressing."
#endif
+static uintptr_t load_pa, load_sz;
+
+static void __init create_kernel_page_table(pgd_t *pgdir, uintptr_t map_size)
+{
+ uintptr_t va, end_va;
+
+ end_va = kernel_virt_addr + load_sz;
+ for (va = kernel_virt_addr; va < end_va; va += map_size)
+ create_pgd_mapping(pgdir, va,
+ load_pa + (va - kernel_virt_addr),
+ map_size, PAGE_KERNEL_EXEC);
+}
+
asmlinkage void __init setup_vm(uintptr_t dtb_pa)
{
- uintptr_t va, pa, end_va;
- uintptr_t load_pa = (uintptr_t)(&_start);
- uintptr_t load_sz = (uintptr_t)(&_end) - load_pa;
- uintptr_t map_size = best_map_size(load_pa, MAX_EARLY_MAPPING_SIZE);
+ uintptr_t pa;
+ uintptr_t map_size;
#ifndef __PAGETABLE_PMD_FOLDED
pmd_t fix_bmap_spmd, fix_bmap_epmd;
#endif
+ load_pa = (uintptr_t)(&_start);
+ load_sz = (uintptr_t)(&_end) - load_pa;
+ map_size = best_map_size(load_pa, MAX_EARLY_MAPPING_SIZE);
+
va_pa_offset = PAGE_OFFSET - load_pa;
+ va_kernel_pa_offset = kernel_virt_addr - load_pa;
+
pfn_base = PFN_DOWN(load_pa);
/*
@@ -470,26 +494,22 @@ asmlinkage void __init setup_vm(uintptr_t dtb_pa)
create_pmd_mapping(fixmap_pmd, FIXADDR_START,
(uintptr_t)fixmap_pte, PMD_SIZE, PAGE_TABLE);
/* Setup trampoline PGD and PMD */
- create_pgd_mapping(trampoline_pg_dir, PAGE_OFFSET,
+ create_pgd_mapping(trampoline_pg_dir, kernel_virt_addr,
(uintptr_t)trampoline_pmd, PGDIR_SIZE, PAGE_TABLE);
- create_pmd_mapping(trampoline_pmd, PAGE_OFFSET,
+ create_pmd_mapping(trampoline_pmd, kernel_virt_addr,
load_pa, PMD_SIZE, PAGE_KERNEL_EXEC);
#else
/* Setup trampoline PGD */
- create_pgd_mapping(trampoline_pg_dir, PAGE_OFFSET,
+ create_pgd_mapping(trampoline_pg_dir, kernel_virt_addr,
load_pa, PGDIR_SIZE, PAGE_KERNEL_EXEC);
#endif
/*
- * Setup early PGD covering entire kernel which will allows
+ * Setup early PGD covering entire kernel which will allow
* us to reach paging_init(). We map all memory banks later
* in setup_vm_final() below.
*/
- end_va = PAGE_OFFSET + load_sz;
- for (va = PAGE_OFFSET; va < end_va; va += map_size)
- create_pgd_mapping(early_pg_dir, va,
- load_pa + (va - PAGE_OFFSET),
- map_size, PAGE_KERNEL_EXEC);
+ create_kernel_page_table(early_pg_dir, map_size);
#ifndef __PAGETABLE_PMD_FOLDED
/* Setup early PMD for DTB */
@@ -549,6 +569,7 @@ static void __init setup_vm_final(void)
uintptr_t va, map_size;
phys_addr_t pa, start, end;
u64 i;
+ static struct vm_struct vm_kernel = { 0 };
/**
* MMU is enabled at this point. But page table setup is not complete yet.
@@ -565,7 +586,7 @@ static void __init setup_vm_final(void)
__pa_symbol(fixmap_pgd_next),
PGDIR_SIZE, PAGE_TABLE);
- /* Map all memory banks */
+ /* Map all memory banks in the linear mapping */
for_each_mem_range(i, &start, &end) {
if (start >= end)
break;
@@ -577,10 +598,22 @@ static void __init setup_vm_final(void)
for (pa = start; pa < end; pa += map_size) {
va = (uintptr_t)__va(pa);
create_pgd_mapping(swapper_pg_dir, va, pa,
- map_size, PAGE_KERNEL_EXEC);
+ map_size, PAGE_KERNEL);
}
}
+ /* Map the kernel */
+ create_kernel_page_table(swapper_pg_dir, PMD_SIZE);
+
+ /* Reserve the vmalloc area occupied by the kernel */
+ vm_kernel.addr = (void *)kernel_virt_addr;
+ vm_kernel.phys_addr = load_pa;
+ vm_kernel.size = (load_sz + PMD_SIZE - 1) & ~(PMD_SIZE - 1);
+ vm_kernel.flags = VM_MAP | VM_NO_GUARD;
+ vm_kernel.caller = __builtin_return_address(0);
+
+ vm_area_add_early(&vm_kernel);
+
/* Clear fixmap PTE and PMD mappings */
clear_fixmap(FIX_PTE);
clear_fixmap(FIX_PMD);
diff --git a/arch/riscv/mm/physaddr.c b/arch/riscv/mm/physaddr.c
index e8e4dcd39fed..35703d5ef5fd 100644
--- a/arch/riscv/mm/physaddr.c
+++ b/arch/riscv/mm/physaddr.c
@@ -23,7 +23,7 @@ EXPORT_SYMBOL(__virt_to_phys);
phys_addr_t __phys_addr_symbol(unsigned long x)
{
- unsigned long kernel_start = (unsigned long)PAGE_OFFSET;
+ unsigned long kernel_start = (unsigned long)kernel_virt_addr;
unsigned long kernel_end = (unsigned long)_end;
/*
--
2.20.1
There is no need to compare MAX_EARLY_MAPPING_SIZE with PGDIR_SIZE at
compile time: MAX_EARLY_MAPPING_SIZE is set to 128MB, which is less than
PGDIR_SIZE (1GB), so the first branch is always taken. This allows the
early_pmd definition to be simplified.
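For reference, the arithmetic behind this simplification can be checked
with a small standalone sketch. The EX_ constants mirror the values quoted
above (128MB and a 1GB PGDIR_SIZE for sv39); ex_num_early_pmds() is a
hypothetical helper that reproduces the removed preprocessor logic.

```c
#include <stdint.h>

#define EX_PGDIR_SHIFT            30                 /* sv39: 1GB per PGD entry */
#define EX_PGDIR_SIZE             (1ULL << EX_PGDIR_SHIFT)
#define EX_MAX_EARLY_MAPPING_SIZE (128ULL << 20)     /* 128MB */

/* The condition that the removed #if used to test. */
_Static_assert(EX_MAX_EARLY_MAPPING_SIZE < EX_PGDIR_SIZE,
	       "early mapping must fit below PGDIR_SIZE");

/* Number of early PMD pages, following the removed #if/#else logic. */
static uint64_t ex_num_early_pmds(uint64_t max_early_size)
{
	if (max_early_size < EX_PGDIR_SIZE)
		return 1;

	return 1 + max_early_size / EX_PGDIR_SIZE;
}
```

Since 128MB is always below 1GB, the helper returns 1 for the configured
value, which is why a single PTRS_PER_PMD-sized array is enough.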
Signed-off-by: Alexandre Ghiti <[email protected]>
Reviewed-by: Anup Patel <[email protected]>
Reviewed-by: Palmer Dabbelt <[email protected]>
---
arch/riscv/mm/init.c | 14 +++-----------
1 file changed, 3 insertions(+), 11 deletions(-)
diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c
index 7b87c14f1d24..694efcc3a131 100644
--- a/arch/riscv/mm/init.c
+++ b/arch/riscv/mm/init.c
@@ -296,13 +296,7 @@ static void __init create_pte_mapping(pte_t *ptep,
pmd_t trampoline_pmd[PTRS_PER_PMD] __page_aligned_bss;
pmd_t fixmap_pmd[PTRS_PER_PMD] __page_aligned_bss;
-
-#if MAX_EARLY_MAPPING_SIZE < PGDIR_SIZE
-#define NUM_EARLY_PMDS 1UL
-#else
-#define NUM_EARLY_PMDS (1UL + MAX_EARLY_MAPPING_SIZE / PGDIR_SIZE)
-#endif
-pmd_t early_pmd[PTRS_PER_PMD * NUM_EARLY_PMDS] __initdata __aligned(PAGE_SIZE);
+pmd_t early_pmd[PTRS_PER_PMD] __initdata __aligned(PAGE_SIZE);
pmd_t early_dtb_pmd[PTRS_PER_PMD] __initdata __aligned(PAGE_SIZE);
static pmd_t *__init get_pmd_virt_early(phys_addr_t pa)
@@ -324,11 +318,9 @@ static pmd_t *get_pmd_virt_late(phys_addr_t pa)
static phys_addr_t __init alloc_pmd_early(uintptr_t va)
{
- uintptr_t pmd_num;
+ BUG_ON((va - kernel_virt_addr) >> PGDIR_SHIFT);
- pmd_num = (va - kernel_virt_addr) >> PGDIR_SHIFT;
- BUG_ON(pmd_num >= NUM_EARLY_PMDS);
- return (uintptr_t)&early_pmd[pmd_num * PTRS_PER_PMD];
+ return (uintptr_t)early_pmd;
}
static phys_addr_t __init alloc_pmd_fixmap(uintptr_t va)
--
2.20.1
Either the user specifies a maximum physical memory size of 2GB, or the
user lives with the system constraint, which for now is 1/4th of the
maximum addressable memory in Sv39 MMU mode (i.e. 128GB).
Signed-off-by: Alexandre Ghiti <[email protected]>
Reviewed-by: Anup Patel <[email protected]>
---
arch/riscv/Kconfig | 20 ++++++--------------
1 file changed, 6 insertions(+), 14 deletions(-)
diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
index 2979a44103be..852ab2f7a50d 100644
--- a/arch/riscv/Kconfig
+++ b/arch/riscv/Kconfig
@@ -127,7 +127,7 @@ config PAGE_OFFSET
default 0xC0000000 if 32BIT && MAXPHYSMEM_2GB
default 0x80000000 if 64BIT && !MMU
default 0xffffffff80000000 if 64BIT && MAXPHYSMEM_2GB
- default 0xffffffe000000000 if 64BIT && MAXPHYSMEM_128GB
+ default 0xffffffe000000000 if 64BIT && !MAXPHYSMEM_2GB
config ARCH_FLATMEM_ENABLE
def_bool y
@@ -235,19 +235,11 @@ config MODULE_SECTIONS
bool
select HAVE_MOD_ARCH_SPECIFIC
-choice
- prompt "Maximum Physical Memory"
- default MAXPHYSMEM_2GB if 32BIT
- default MAXPHYSMEM_2GB if 64BIT && CMODEL_MEDLOW
- default MAXPHYSMEM_128GB if 64BIT && CMODEL_MEDANY
-
- config MAXPHYSMEM_2GB
- bool "2GiB"
- config MAXPHYSMEM_128GB
- depends on 64BIT && CMODEL_MEDANY
- bool "128GiB"
-endchoice
-
+config MAXPHYSMEM_2GB
+ bool "Maximum Physical Memory 2GiB"
+ default y if 32BIT
+ default y if 64BIT && CMODEL_MEDLOW
+ default n
config SMP
bool "Symmetric Multi-Processing"
--
2.20.1
With 4-level page table folding at runtime, the size of the virtual
address space is not known at compile time, so VA_BITS must be set
dynamically for sparsemem to reserve the right amount of memory for
struct pages.
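To make the dependency concrete, here is a small sketch of how the
vmemmap size follows from VA_BITS, using the VMEMMAP_SHIFT formula from
pgtable.h. The value 6 for STRUCT_PAGE_MAX_SHIFT (a struct page of at
most 64 bytes) and the 4KB page size are assumptions made for this
example.

```c
#include <stdint.h>

#define EX_PAGE_SHIFT            12 /* assumption: 4KB pages */
#define EX_STRUCT_PAGE_MAX_SHIFT 6  /* assumption: struct page <= 64 bytes */

/* Same shape as VMEMMAP_SHIFT: enough struct pages for half the VA space. */
static unsigned int ex_vmemmap_shift(unsigned int va_bits)
{
	return va_bits - EX_PAGE_SHIFT - 1 + EX_STRUCT_PAGE_MAX_SHIFT;
}

/* Size in bytes of the vmemmap region for a given number of VA bits. */
static uint64_t ex_vmemmap_size(unsigned int va_bits)
{
	return 1ULL << ex_vmemmap_shift(va_bits);
}
```

Under these assumptions the region is 4GB for a 39-bit address space and
2TB for a 48-bit one, so a compile-time constant cannot fit both.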
Signed-off-by: Alexandre Ghiti <[email protected]>
---
arch/riscv/Kconfig | 10 ----------
arch/riscv/include/asm/pgtable.h | 11 +++++++++--
arch/riscv/include/asm/sparsemem.h | 6 +++++-
3 files changed, 14 insertions(+), 13 deletions(-)
diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
index 44377fd7860e..2979a44103be 100644
--- a/arch/riscv/Kconfig
+++ b/arch/riscv/Kconfig
@@ -122,16 +122,6 @@ config ZONE_DMA32
bool
default y if 64BIT
-config VA_BITS
- int
- default 32 if 32BIT
- default 39 if 64BIT
-
-config PA_BITS
- int
- default 34 if 32BIT
- default 56 if 64BIT
-
config PAGE_OFFSET
hex
default 0xC0000000 if 32BIT && MAXPHYSMEM_2GB
diff --git a/arch/riscv/include/asm/pgtable.h b/arch/riscv/include/asm/pgtable.h
index 102b728ca146..c7973bfd65bc 100644
--- a/arch/riscv/include/asm/pgtable.h
+++ b/arch/riscv/include/asm/pgtable.h
@@ -43,8 +43,14 @@
* struct pages to map half the virtual address space. Then
* position vmemmap directly below the VMALLOC region.
*/
+#ifdef CONFIG_64BIT
+#define VA_BITS 39
+#else
+#define VA_BITS 32
+#endif
+
#define VMEMMAP_SHIFT \
- (CONFIG_VA_BITS - PAGE_SHIFT - 1 + STRUCT_PAGE_MAX_SHIFT)
+ (VA_BITS - PAGE_SHIFT - 1 + STRUCT_PAGE_MAX_SHIFT)
#define VMEMMAP_SIZE BIT(VMEMMAP_SHIFT)
#define VMEMMAP_END (VMALLOC_START - 1)
#define VMEMMAP_START (VMALLOC_START - VMEMMAP_SIZE)
@@ -83,6 +89,7 @@
#endif /* CONFIG_64BIT */
#ifdef CONFIG_MMU
+
/* Number of entries in the page global directory */
#define PTRS_PER_PGD (PAGE_SIZE / sizeof(pgd_t))
/* Number of entries in the page table */
@@ -453,7 +460,7 @@ static inline int ptep_clear_flush_young(struct vm_area_struct *vma,
* and give the kernel the other (upper) half.
*/
#ifdef CONFIG_64BIT
-#define KERN_VIRT_START (-(BIT(CONFIG_VA_BITS)) + TASK_SIZE)
+#define KERN_VIRT_START (-(BIT(VA_BITS)) + TASK_SIZE)
#else
#define KERN_VIRT_START FIXADDR_START
#endif
diff --git a/arch/riscv/include/asm/sparsemem.h b/arch/riscv/include/asm/sparsemem.h
index 45a7018a8118..63acaecc3374 100644
--- a/arch/riscv/include/asm/sparsemem.h
+++ b/arch/riscv/include/asm/sparsemem.h
@@ -4,7 +4,11 @@
#define _ASM_RISCV_SPARSEMEM_H
#ifdef CONFIG_SPARSEMEM
-#define MAX_PHYSMEM_BITS CONFIG_PA_BITS
+#ifdef CONFIG_64BIT
+#define MAX_PHYSMEM_BITS 56
+#else
+#define MAX_PHYSMEM_BITS 34
+#endif /* CONFIG_64BIT */
#define SECTION_SIZE_BITS 27
#endif /* CONFIG_SPARSEMEM */
--
2.20.1
This is a preparatory patch for sv48 support, which will introduce a
dynamic PAGE_OFFSET.
A dynamic PAGE_OFFSET implies that all zones (vmalloc, vmemmap, fixaddr...)
whose addresses depend on PAGE_OFFSET become dynamic too, so they can no
longer be used to statically initialize the array that ptdump uses to
identify the different zones of the vm layout.
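The approach can be sketched outside the kernel as follows: the table is
declared with placeholder addresses and an enum indexes the entries that
get filled in once the layout is known. All ex_ names are hypothetical and
the table is reduced to three zones for brevity.

```c
#include <stddef.h>
#include <stdint.h>

struct ex_addr_marker {
	uint64_t start_address;
	const char *name;
};

/* One index per marker, in the same order as the table below. */
enum ex_marker_idx {
	EX_VMALLOC_START_NR,
	EX_VMALLOC_END_NR,
	EX_PAGE_OFFSET_NR,
	EX_END_OF_SPACE_NR
};

/* Addresses are unknown at build time, so they start out as 0. */
static struct ex_addr_marker ex_markers[] = {
	[EX_VMALLOC_START_NR] = { 0, "vmalloc() area" },
	[EX_VMALLOC_END_NR]   = { 0, "vmalloc() end" },
	[EX_PAGE_OFFSET_NR]   = { 0, "Linear mapping" },
	[EX_END_OF_SPACE_NR]  = { UINT64_MAX, NULL },
};

/* Fill in the addresses at init time, once page_offset is known. */
static void ex_markers_init(uint64_t page_offset, uint64_t vmalloc_size)
{
	ex_markers[EX_VMALLOC_START_NR].start_address = page_offset - vmalloc_size;
	ex_markers[EX_VMALLOC_END_NR].start_address = page_offset - 1;
	ex_markers[EX_PAGE_OFFSET_NR].start_address = page_offset;
}
```

Indexing through an enum keeps the init code independent of which
optional entries are compiled into the table.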
Signed-off-by: Alexandre Ghiti <[email protected]>
Reviewed-by: Anup Patel <[email protected]>
---
arch/riscv/mm/ptdump.c | 56 ++++++++++++++++++++++++++++++++++--------
1 file changed, 46 insertions(+), 10 deletions(-)
diff --git a/arch/riscv/mm/ptdump.c b/arch/riscv/mm/ptdump.c
index ace74dec7492..1be2ca81f8ad 100644
--- a/arch/riscv/mm/ptdump.c
+++ b/arch/riscv/mm/ptdump.c
@@ -58,29 +58,50 @@ struct ptd_mm_info {
unsigned long end;
};
+enum address_markers_idx {
+#ifdef CONFIG_KASAN
+ KASAN_SHADOW_START_NR,
+ KASAN_SHADOW_END_NR,
+#endif
+ FIXMAP_START_NR,
+ FIXMAP_END_NR,
+ PCI_IO_START_NR,
+ PCI_IO_END_NR,
+#ifdef CONFIG_SPARSEMEM_VMEMMAP
+ VMEMMAP_START_NR,
+ VMEMMAP_END_NR,
+#endif
+ VMALLOC_START_NR,
+ VMALLOC_END_NR,
+ PAGE_OFFSET_NR,
+ KERNEL_MAPPING_NR,
+ END_OF_SPACE_NR
+};
+
static struct addr_marker address_markers[] = {
#ifdef CONFIG_KASAN
{KASAN_SHADOW_START, "Kasan shadow start"},
{KASAN_SHADOW_END, "Kasan shadow end"},
#endif
- {FIXADDR_START, "Fixmap start"},
- {FIXADDR_TOP, "Fixmap end"},
- {PCI_IO_START, "PCI I/O start"},
- {PCI_IO_END, "PCI I/O end"},
+ {0, "Fixmap start"},
+ {0, "Fixmap end"},
+ {0, "PCI I/O start"},
+ {0, "PCI I/O end"},
#ifdef CONFIG_SPARSEMEM_VMEMMAP
- {VMEMMAP_START, "vmemmap start"},
- {VMEMMAP_END, "vmemmap end"},
+ {0, "vmemmap start"},
+ {0, "vmemmap end"},
#endif
- {VMALLOC_START, "vmalloc() area"},
- {VMALLOC_END, "vmalloc() end"},
- {PAGE_OFFSET, "Linear mapping"},
+ {0, "vmalloc() area"},
+ {0, "vmalloc() end"},
+ {0, "Linear mapping"},
+ {0, "Kernel mapping (kernel, BPF, modules)"},
{-1, NULL},
};
static struct ptd_mm_info kernel_ptd_info = {
.mm = &init_mm,
.markers = address_markers,
- .base_addr = KERN_VIRT_START,
+ .base_addr = 0,
.end = ULONG_MAX,
};
@@ -335,6 +356,21 @@ static int ptdump_init(void)
{
unsigned int i, j;
+ address_markers[FIXMAP_START_NR].start_address = FIXADDR_START;
+ address_markers[FIXMAP_END_NR].start_address = FIXADDR_TOP;
+ address_markers[PCI_IO_START_NR].start_address = PCI_IO_START;
+ address_markers[PCI_IO_END_NR].start_address = PCI_IO_END;
+#ifdef CONFIG_SPARSEMEM_VMEMMAP
+ address_markers[VMEMMAP_START_NR].start_address = VMEMMAP_START;
+ address_markers[VMEMMAP_END_NR].start_address = VMEMMAP_END;
+#endif
+ address_markers[VMALLOC_START_NR].start_address = VMALLOC_START;
+ address_markers[VMALLOC_END_NR].start_address = VMALLOC_END;
+ address_markers[PAGE_OFFSET_NR].start_address = PAGE_OFFSET;
+ address_markers[KERNEL_MAPPING_NR].start_address = KERNEL_LINK_ADDR;
+
+ kernel_ptd_info.base_addr = KERN_VIRT_START;
+
for (i = 0; i < ARRAY_SIZE(pg_level); i++)
for (j = 0; j < ARRAY_SIZE(pte_bits); j++)
pg_level[i].mask |= pte_bits[j].mask;
--
2.20.1
Adding a 4th level of page table allows a 64-bit kernel to address 2^48
bytes of virtual address space: in practice, that offers 128TB (2^47
bytes) of virtual address space to userspace and allows up to 64TB of
physical memory.
If the underlying hardware does not support sv48, we automatically
fall back to a standard 3-level page table by folding the new PUD level
into the PGDIR level. To detect HW capabilities at runtime, we rely on the
fact that SATP ignores writes that select an unsupported mode.
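The detection principle can be illustrated with a userspace simulation.
This is only a sketch: struct ex_hart and ex_csr_write_satp() are
hypothetical stand-ins for a real hart and a real CSR write, and the
actual probe in the kernel additionally needs a valid page table in place
before the mode is switched. The mode encodings match SATP_MODE_39 and
SATP_MODE_48 in csr.h.

```c
#include <stdbool.h>
#include <stdint.h>

#define EX_SATP_MODE_39   0x8000000000000000ULL
#define EX_SATP_MODE_48   0x9000000000000000ULL
#define EX_SATP_MODE_MASK 0xF000000000000000ULL

/* Simulated hart: remembers satp and whether sv48 is implemented. */
struct ex_hart {
	bool supports_sv48;
	uint64_t satp;
};

/* A write selecting an unsupported mode leaves satp entirely unchanged. */
static void ex_csr_write_satp(struct ex_hart *h, uint64_t val)
{
	if ((val & EX_SATP_MODE_MASK) == EX_SATP_MODE_48 && !h->supports_sv48)
		return;

	h->satp = val;
}

/* Try sv48 first, read the register back, fall back to sv39 otherwise. */
static uint64_t ex_probe_satp_mode(struct ex_hart *h, uint64_t root_ppn)
{
	uint64_t want = EX_SATP_MODE_48 | root_ppn;
	bool l4_enabled;

	ex_csr_write_satp(h, want);
	l4_enabled = (h->satp == want);

	/* Return to bare mode so the probe has no lasting effect. */
	ex_csr_write_satp(h, 0);

	return l4_enabled ? EX_SATP_MODE_48 : EX_SATP_MODE_39;
}
```

The read-back comparison is what turns the "ignored write" behaviour into
a capability test without needing a separate feature register.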
Signed-off-by: Alexandre Ghiti <[email protected]>
---
arch/riscv/Kconfig | 6 +-
arch/riscv/include/asm/csr.h | 3 +-
arch/riscv/include/asm/fixmap.h | 3 +
arch/riscv/include/asm/page.h | 12 ++
arch/riscv/include/asm/pgalloc.h | 40 +++++
arch/riscv/include/asm/pgtable-64.h | 104 +++++++++++-
arch/riscv/include/asm/pgtable.h | 12 +-
arch/riscv/kernel/head.S | 3 +-
arch/riscv/mm/context.c | 2 +-
arch/riscv/mm/init.c | 212 +++++++++++++++++++++---
drivers/firmware/efi/libstub/efi-stub.c | 2 +-
11 files changed, 362 insertions(+), 37 deletions(-)
diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
index 852ab2f7a50d..03205e11f952 100644
--- a/arch/riscv/Kconfig
+++ b/arch/riscv/Kconfig
@@ -127,7 +127,7 @@ config PAGE_OFFSET
default 0xC0000000 if 32BIT && MAXPHYSMEM_2GB
default 0x80000000 if 64BIT && !MMU
default 0xffffffff80000000 if 64BIT && MAXPHYSMEM_2GB
- default 0xffffffe000000000 if 64BIT && !MAXPHYSMEM_2GB
+ default 0xffffc00000000000 if 64BIT && !MAXPHYSMEM_2GB
config ARCH_FLATMEM_ENABLE
def_bool y
@@ -176,9 +176,11 @@ config GENERIC_HWEIGHT
config FIX_EARLYCON_MEM
def_bool MMU
+# On a 64BIT relocatable kernel, the 4-level page table is at runtime folded
+# on a 3-level page table when sv48 is not supported.
config PGTABLE_LEVELS
int
- default 3 if 64BIT
+ default 4 if 64BIT
default 2
config LOCKDEP_SUPPORT
diff --git a/arch/riscv/include/asm/csr.h b/arch/riscv/include/asm/csr.h
index cec462e198ce..d41536c3f8d4 100644
--- a/arch/riscv/include/asm/csr.h
+++ b/arch/riscv/include/asm/csr.h
@@ -40,11 +40,10 @@
#ifndef CONFIG_64BIT
#define SATP_PPN _AC(0x003FFFFF, UL)
#define SATP_MODE_32 _AC(0x80000000, UL)
-#define SATP_MODE SATP_MODE_32
#else
#define SATP_PPN _AC(0x00000FFFFFFFFFFF, UL)
#define SATP_MODE_39 _AC(0x8000000000000000, UL)
-#define SATP_MODE SATP_MODE_39
+#define SATP_MODE_48 _AC(0x9000000000000000, UL)
#endif
/* Exception cause high bit - is an interrupt if set */
diff --git a/arch/riscv/include/asm/fixmap.h b/arch/riscv/include/asm/fixmap.h
index 54cbf07fb4e9..c4e51929773a 100644
--- a/arch/riscv/include/asm/fixmap.h
+++ b/arch/riscv/include/asm/fixmap.h
@@ -24,6 +24,9 @@ enum fixed_addresses {
FIX_HOLE,
FIX_PTE,
FIX_PMD,
+#ifdef CONFIG_64BIT
+ FIX_PUD,
+#endif
FIX_TEXT_POKE1,
FIX_TEXT_POKE0,
FIX_EARLYCON_MEM_BASE,
diff --git a/arch/riscv/include/asm/page.h b/arch/riscv/include/asm/page.h
index a93e35aaa717..37ca192a7b80 100644
--- a/arch/riscv/include/asm/page.h
+++ b/arch/riscv/include/asm/page.h
@@ -31,7 +31,16 @@
* When not using MMU this corresponds to the first free page in
* physical memory (aligned on a page boundary).
*/
+#ifdef CONFIG_64BIT
+#define PAGE_OFFSET __page_offset
+/*
+ * By default, CONFIG_PAGE_OFFSET value corresponds to SV48 address space so
+ * define the PAGE_OFFSET value for SV39.
+ */
+#define PAGE_OFFSET_L3 0xffffffe000000000
+#else
#define PAGE_OFFSET _AC(CONFIG_PAGE_OFFSET, UL)
+#endif /* CONFIG_64BIT */
#define KERN_VIRT_SIZE (-PAGE_OFFSET)
@@ -102,6 +111,9 @@ extern unsigned long pfn_base;
extern unsigned long max_low_pfn;
extern unsigned long min_low_pfn;
extern unsigned long kernel_virt_addr;
+#ifdef CONFIG_64BIT
+extern unsigned long __page_offset;
+#endif
extern uintptr_t load_pa, load_sz;
#define linear_mapping_pa_to_va(x) ((void *)((unsigned long)(x) + va_pa_offset))
diff --git a/arch/riscv/include/asm/pgalloc.h b/arch/riscv/include/asm/pgalloc.h
index 23b1544e0ca5..2b7fb8156fc6 100644
--- a/arch/riscv/include/asm/pgalloc.h
+++ b/arch/riscv/include/asm/pgalloc.h
@@ -11,6 +11,8 @@
#include <asm/tlb.h>
#ifdef CONFIG_MMU
+#define __HAVE_ARCH_PUD_ALLOC_ONE
+#define __HAVE_ARCH_PUD_FREE
#include <asm-generic/pgalloc.h>
static inline void pmd_populate_kernel(struct mm_struct *mm,
@@ -36,6 +38,44 @@ static inline void pud_populate(struct mm_struct *mm, pud_t *pud, pmd_t *pmd)
set_pud(pud, __pud((pfn << _PAGE_PFN_SHIFT) | _PAGE_TABLE));
}
+
+static inline void p4d_populate(struct mm_struct *mm, p4d_t *p4d, pud_t *pud)
+{
+ if (pgtable_l4_enabled) {
+ unsigned long pfn = virt_to_pfn(pud);
+
+ set_p4d(p4d, __p4d((pfn << _PAGE_PFN_SHIFT) | _PAGE_TABLE));
+ }
+}
+
+static inline void p4d_populate_safe(struct mm_struct *mm, p4d_t *p4d,
+ pud_t *pud)
+{
+ if (pgtable_l4_enabled) {
+ unsigned long pfn = virt_to_pfn(pud);
+
+ set_p4d_safe(p4d,
+ __p4d((pfn << _PAGE_PFN_SHIFT) | _PAGE_TABLE));
+ }
+}
+
+#define pud_alloc_one pud_alloc_one
+static inline pud_t *pud_alloc_one(struct mm_struct *mm, unsigned long addr)
+{
+ if (pgtable_l4_enabled)
+ return __pud_alloc_one(mm, addr);
+
+ return NULL;
+}
+
+#define pud_free pud_free
+static inline void pud_free(struct mm_struct *mm, pud_t *pud)
+{
+ if (pgtable_l4_enabled)
+ __pud_free(mm, pud);
+}
+
+#define __pud_free_tlb(tlb, pud, addr) pud_free((tlb)->mm, pud)
#endif /* __PAGETABLE_PMD_FOLDED */
#define pmd_pgtable(pmd) pmd_page(pmd)
diff --git a/arch/riscv/include/asm/pgtable-64.h b/arch/riscv/include/asm/pgtable-64.h
index f3b0da64c6c8..338670897fbf 100644
--- a/arch/riscv/include/asm/pgtable-64.h
+++ b/arch/riscv/include/asm/pgtable-64.h
@@ -8,16 +8,36 @@
#include <linux/const.h>
-#define PGDIR_SHIFT 30
+extern bool pgtable_l4_enabled;
+
+#define PGDIR_SHIFT_L3 30
+#define PGDIR_SHIFT_L4 39
+#define PGDIR_SIZE_L3 (_AC(1, UL) << PGDIR_SHIFT_L3)
+
+#define PGDIR_SHIFT (pgtable_l4_enabled ? PGDIR_SHIFT_L4: PGDIR_SHIFT_L3)
/* Size of region mapped by a page global directory */
#define PGDIR_SIZE (_AC(1, UL) << PGDIR_SHIFT)
#define PGDIR_MASK (~(PGDIR_SIZE - 1))
+/* pud is folded into pgd in case of 3-level page table */
+#define PUD_SHIFT 30
+#define PUD_SIZE (_AC(1, UL) << PUD_SHIFT)
+#define PUD_MASK (~(PUD_SIZE - 1))
+
#define PMD_SHIFT 21
/* Size of region mapped by a page middle directory */
#define PMD_SIZE (_AC(1, UL) << PMD_SHIFT)
#define PMD_MASK (~(PMD_SIZE - 1))
+/* Page Upper Directory entry */
+typedef struct {
+ unsigned long pud;
+} pud_t;
+
+#define pud_val(x) ((x).pud)
+#define __pud(x) ((pud_t) { (x) })
+#define PTRS_PER_PUD (PAGE_SIZE / sizeof(pud_t))
+
/* Page Middle Directory entry */
typedef struct {
unsigned long pmd;
@@ -60,6 +80,16 @@ static inline void pud_clear(pud_t *pudp)
set_pud(pudp, __pud(0));
}
+static inline pud_t pfn_pud(unsigned long pfn, pgprot_t prot)
+{
+ return __pud((pfn << _PAGE_PFN_SHIFT) | pgprot_val(prot));
+}
+
+static inline unsigned long _pud_pfn(pud_t pud)
+{
+ return pud_val(pud) >> _PAGE_PFN_SHIFT;
+}
+
static inline unsigned long pud_page_vaddr(pud_t pud)
{
return (unsigned long)pfn_to_virt(pud_val(pud) >> _PAGE_PFN_SHIFT);
@@ -70,6 +100,17 @@ static inline struct page *pud_page(pud_t pud)
return pfn_to_page(pud_val(pud) >> _PAGE_PFN_SHIFT);
}
+#define mm_pud_folded mm_pud_folded
+static inline bool mm_pud_folded(struct mm_struct *mm)
+{
+ if (pgtable_l4_enabled)
+ return false;
+
+ return true;
+}
+
+#define pmd_index(addr) (((addr) >> PMD_SHIFT) & (PTRS_PER_PMD - 1))
+
static inline pmd_t pfn_pmd(unsigned long pfn, pgprot_t prot)
{
return __pmd((pfn << _PAGE_PFN_SHIFT) | pgprot_val(prot));
@@ -83,4 +124,65 @@ static inline unsigned long _pmd_pfn(pmd_t pmd)
#define pmd_ERROR(e) \
pr_err("%s:%d: bad pmd %016lx.\n", __FILE__, __LINE__, pmd_val(e))
+#define pud_ERROR(e) \
+ pr_err("%s:%d: bad pud %016lx.\n", __FILE__, __LINE__, pud_val(e))
+
+static inline void set_p4d(p4d_t *p4dp, p4d_t p4d)
+{
+ if (pgtable_l4_enabled)
+ *p4dp = p4d;
+ else
+ set_pud((pud_t *)p4dp, (pud_t){ p4d_val(p4d) });
+}
+
+static inline int p4d_none(p4d_t p4d)
+{
+ if (pgtable_l4_enabled)
+ return (p4d_val(p4d) == 0);
+
+ return 0;
+}
+
+static inline int p4d_present(p4d_t p4d)
+{
+ if (pgtable_l4_enabled)
+ return (p4d_val(p4d) & _PAGE_PRESENT);
+
+ return 1;
+}
+
+static inline int p4d_bad(p4d_t p4d)
+{
+ if (pgtable_l4_enabled)
+ return !p4d_present(p4d);
+
+ return 0;
+}
+
+static inline void p4d_clear(p4d_t *p4d)
+{
+ if (pgtable_l4_enabled)
+ set_p4d(p4d, __p4d(0));
+}
+
+static inline unsigned long p4d_page_vaddr(p4d_t p4d)
+{
+ if (pgtable_l4_enabled)
+ return (unsigned long)pfn_to_virt(
+ p4d_val(p4d) >> _PAGE_PFN_SHIFT);
+
+ return pud_page_vaddr((pud_t) { p4d_val(p4d) });
+}
+
+#define pud_index(addr) (((addr) >> PUD_SHIFT) & (PTRS_PER_PUD - 1))
+
+static inline pud_t *pud_offset(p4d_t *p4d, unsigned long address)
+{
+ if (pgtable_l4_enabled)
+ return (pud_t *)p4d_page_vaddr(*p4d) + pud_index(address);
+
+ return (pud_t *)p4d;
+}
+#define pud_offset pud_offset
+
#endif /* _ASM_RISCV_PGTABLE_64_H */
diff --git a/arch/riscv/include/asm/pgtable.h b/arch/riscv/include/asm/pgtable.h
index c7973bfd65bc..dd27d28f1d9e 100644
--- a/arch/riscv/include/asm/pgtable.h
+++ b/arch/riscv/include/asm/pgtable.h
@@ -44,7 +44,7 @@
* position vmemmap directly below the VMALLOC region.
*/
#ifdef CONFIG_64BIT
-#define VA_BITS 39
+#define VA_BITS (pgtable_l4_enabled ? 48 : 39)
#else
#define VA_BITS 32
#endif
@@ -76,8 +76,7 @@
#ifndef __ASSEMBLY__
-/* Page Upper Directory not used in RISC-V */
-#include <asm-generic/pgtable-nopud.h>
+#include <asm-generic/pgtable-nop4d.h>
#include <asm/page.h>
#include <asm/tlbflush.h>
#include <linux/mm_types.h>
@@ -470,9 +469,11 @@ static inline int ptep_clear_flush_young(struct vm_area_struct *vma,
* Note that PGDIR_SIZE must evenly divide TASK_SIZE.
*/
#ifdef CONFIG_64BIT
-#define TASK_SIZE (PGDIR_SIZE * PTRS_PER_PGD / 2)
+#define TASK_SIZE (PGDIR_SIZE * PTRS_PER_PGD / 2)
+#define TASK_SIZE_MIN (PGDIR_SIZE_L3 * PTRS_PER_PGD / 2)
#else
-#define TASK_SIZE FIXADDR_START
+#define TASK_SIZE FIXADDR_START
+#define TASK_SIZE_MIN TASK_SIZE
#endif
#else /* CONFIG_MMU */
@@ -493,6 +494,7 @@ static inline void __kernel_map_pages(struct page *page, int numpages, int enabl
extern char _start[];
extern void *dtb_early_va;
extern uintptr_t dtb_early_pa;
+extern u64 satp_mode;
void setup_bootmem(void);
void paging_init(void);
diff --git a/arch/riscv/kernel/head.S b/arch/riscv/kernel/head.S
index 66f40c49bf68..c98877307d4e 100644
--- a/arch/riscv/kernel/head.S
+++ b/arch/riscv/kernel/head.S
@@ -82,7 +82,8 @@ relocate:
/* Compute satp for kernel page tables, but don't load it yet */
srl a2, a0, PAGE_SHIFT
- li a1, SATP_MODE
+ la a1, satp_mode
+ REG_L a1, 0(a1)
or a2, a2, a1
/*
diff --git a/arch/riscv/mm/context.c b/arch/riscv/mm/context.c
index 613ec81a8979..0393ca1b4416 100644
--- a/arch/riscv/mm/context.c
+++ b/arch/riscv/mm/context.c
@@ -59,7 +59,7 @@ void switch_mm(struct mm_struct *prev, struct mm_struct *next,
cpumask_set_cpu(cpu, mm_cpumask(next));
#ifdef CONFIG_MMU
- csr_write(CSR_SATP, virt_to_pfn(next->pgd) | SATP_MODE);
+ csr_write(CSR_SATP, virt_to_pfn(next->pgd) | satp_mode);
local_flush_tlb_all();
#endif
diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c
index 694efcc3a131..cb23a30d9af3 100644
--- a/arch/riscv/mm/init.c
+++ b/arch/riscv/mm/init.c
@@ -23,8 +23,23 @@
#include "../kernel/head.h"
-unsigned long kernel_virt_addr = KERNEL_VIRT_ADDR;
+#ifdef CONFIG_64BIT
+u64 satp_mode = IS_ENABLED(CONFIG_MAXPHYSMEM_2GB) ?
+ SATP_MODE_39 : SATP_MODE_48;
+bool pgtable_l4_enabled = IS_ENABLED(CONFIG_MAXPHYSMEM_2GB) ? false : true;
+#else
+u64 satp_mode = SATP_MODE_32;
+bool pgtable_l4_enabled;
+#endif
+EXPORT_SYMBOL(pgtable_l4_enabled);
+EXPORT_SYMBOL(satp_mode);
+
+unsigned long kernel_virt_addr;
EXPORT_SYMBOL(kernel_virt_addr);
+#ifdef CONFIG_64BIT
+unsigned long __page_offset = _AC(CONFIG_PAGE_OFFSET, UL);
+EXPORT_SYMBOL(__page_offset);
+#endif
unsigned long empty_zero_page[PAGE_SIZE / sizeof(unsigned long)]
__page_aligned_bss;
@@ -41,6 +56,8 @@ struct pt_alloc_ops {
#ifndef __PAGETABLE_PMD_FOLDED
pmd_t *(*get_pmd_virt)(phys_addr_t pa);
phys_addr_t (*alloc_pmd)(uintptr_t va);
+ pud_t *(*get_pud_virt)(phys_addr_t pa);
+ phys_addr_t (*alloc_pud)(uintptr_t va);
#endif
};
@@ -294,9 +311,13 @@ static void __init create_pte_mapping(pte_t *ptep,
#ifndef __PAGETABLE_PMD_FOLDED
+pud_t trampoline_pud[PTRS_PER_PUD] __page_aligned_bss;
pmd_t trampoline_pmd[PTRS_PER_PMD] __page_aligned_bss;
+pud_t fixmap_pud[PTRS_PER_PUD] __page_aligned_bss;
pmd_t fixmap_pmd[PTRS_PER_PMD] __page_aligned_bss;
+pud_t early_pud[PTRS_PER_PUD] __initdata __aligned(PAGE_SIZE);
pmd_t early_pmd[PTRS_PER_PMD] __initdata __aligned(PAGE_SIZE);
+pud_t early_dtb_pud[PTRS_PER_PUD] __initdata __aligned(PAGE_SIZE);
pmd_t early_dtb_pmd[PTRS_PER_PMD] __initdata __aligned(PAGE_SIZE);
static pmd_t *__init get_pmd_virt_early(phys_addr_t pa)
@@ -318,7 +339,8 @@ static pmd_t *get_pmd_virt_late(phys_addr_t pa)
static phys_addr_t __init alloc_pmd_early(uintptr_t va)
{
- BUG_ON((va - kernel_virt_addr) >> PGDIR_SHIFT);
+ /* Only one PMD is available for early mapping */
+ BUG_ON((va - kernel_virt_addr) >> PUD_SHIFT);
return (uintptr_t)early_pmd;
}
@@ -364,20 +386,90 @@ static void __init create_pmd_mapping(pmd_t *pmdp,
create_pte_mapping(ptep, va, pa, sz, prot);
}
-#define pgd_next_t pmd_t
-#define alloc_pgd_next(__va) pt_ops.alloc_pmd(__va)
-#define get_pgd_next_virt(__pa) pt_ops.get_pmd_virt(__pa)
+static pud_t *__init get_pud_virt_early(phys_addr_t pa)
+{
+ return (pud_t *)((uintptr_t)pa);
+}
+
+static pud_t *__init get_pud_virt_fixmap(phys_addr_t pa)
+{
+ clear_fixmap(FIX_PUD);
+ return (pud_t *)set_fixmap_offset(FIX_PUD, pa);
+}
+
+static pud_t *__init get_pud_virt_late(phys_addr_t pa)
+{
+ return (pud_t *)__va(pa);
+}
+
+static phys_addr_t __init alloc_pud_early(uintptr_t va)
+{
+ /* Only one PUD is available for early mapping */
+ BUG_ON((va - kernel_virt_addr) >> PGDIR_SHIFT);
+
+ return (uintptr_t)early_pud;
+}
+
+static phys_addr_t __init alloc_pud_fixmap(uintptr_t va)
+{
+ return memblock_phys_alloc(PAGE_SIZE, PAGE_SIZE);
+}
+
+static phys_addr_t alloc_pud_late(uintptr_t va)
+{
+ unsigned long vaddr;
+
+ vaddr = __get_free_page(GFP_KERNEL);
+ BUG_ON(!vaddr);
+ return __pa(vaddr);
+}
+
+static void __init create_pud_mapping(pud_t *pudp,
+ uintptr_t va, phys_addr_t pa,
+ phys_addr_t sz, pgprot_t prot)
+{
+ pmd_t *nextp;
+ phys_addr_t next_phys;
+ uintptr_t pud_index = pud_index(va);
+
+ if (sz == PUD_SIZE) {
+ if (pud_val(pudp[pud_index]) == 0)
+ pudp[pud_index] = pfn_pud(PFN_DOWN(pa), prot);
+ return;
+ }
+
+ if (pud_val(pudp[pud_index]) == 0) {
+ next_phys = pt_ops.alloc_pmd(va);
+ pudp[pud_index] = pfn_pud(PFN_DOWN(next_phys), PAGE_TABLE);
+ nextp = pt_ops.get_pmd_virt(next_phys);
+ memset(nextp, 0, PAGE_SIZE);
+ } else {
+ next_phys = PFN_PHYS(_pud_pfn(pudp[pud_index]));
+ nextp = pt_ops.get_pmd_virt(next_phys);
+ }
+
+ create_pmd_mapping(nextp, va, pa, sz, prot);
+}
+
+#define pgd_next_t pud_t
+#define alloc_pgd_next(__va) pt_ops.alloc_pud(__va)
+#define get_pgd_next_virt(__pa) pt_ops.get_pud_virt(__pa)
#define create_pgd_next_mapping(__nextp, __va, __pa, __sz, __prot) \
- create_pmd_mapping(__nextp, __va, __pa, __sz, __prot)
-#define fixmap_pgd_next fixmap_pmd
+ create_pud_mapping(__nextp, __va, __pa, __sz, __prot)
+#define fixmap_pgd_next (pgtable_l4_enabled ? \
+ (uintptr_t)fixmap_pud : (uintptr_t)fixmap_pmd)
+#define trampoline_pgd_next (pgtable_l4_enabled ? \
+ (uintptr_t)trampoline_pud : (uintptr_t)trampoline_pmd)
+#define early_dtb_pgd_next (pgtable_l4_enabled ? \
+ (uintptr_t)early_dtb_pud : (uintptr_t)early_dtb_pmd)
#else
#define pgd_next_t pte_t
#define alloc_pgd_next(__va) pt_ops.alloc_pte(__va)
#define get_pgd_next_virt(__pa) pt_ops.get_pte_virt(__pa)
#define create_pgd_next_mapping(__nextp, __va, __pa, __sz, __prot) \
create_pte_mapping(__nextp, __va, __pa, __sz, __prot)
-#define fixmap_pgd_next fixmap_pte
-#endif
+#define fixmap_pgd_next ((uintptr_t)fixmap_pte)
+#endif /* __PAGETABLE_PMD_FOLDED */
void __init create_pgd_mapping(pgd_t *pgdp,
uintptr_t va, phys_addr_t pa,
@@ -387,6 +479,13 @@ void __init create_pgd_mapping(pgd_t *pgdp,
phys_addr_t next_phys;
uintptr_t pgd_idx = pgd_index(va);
+#ifndef __PAGETABLE_PMD_FOLDED
+ if (!pgtable_l4_enabled) {
+ create_pud_mapping((pud_t *)pgdp, va, pa, sz, prot);
+ return;
+ }
+#endif
+
if (sz == PGDIR_SIZE) {
if (pgd_val(pgdp[pgd_idx]) == 0)
pgdp[pgd_idx] = pfn_pgd(PFN_DOWN(pa), prot);
@@ -437,6 +536,48 @@ uintptr_t load_pa, load_sz;
EXPORT_SYMBOL(load_pa);
EXPORT_SYMBOL(load_sz);
+#if defined(CONFIG_64BIT) && !defined(CONFIG_MAXPHYSMEM_2GB)
+void disable_pgtable_l4(void)
+{
+ pgtable_l4_enabled = false;
+ __page_offset = PAGE_OFFSET_L3;
+ satp_mode = SATP_MODE_39;
+}
+
+/**
+ * There is a simple way to determine if 4-level is supported by the
+ * underlying hardware: establish 1:1 mapping in 4-level page table mode
+ * then read SATP to see if the configuration was taken into account
+ * meaning sv48 is supported.
+ */
+asmlinkage __init void set_satp_mode(uintptr_t load_pa)
+{
+ u64 identity_satp, hw_satp;
+
+ create_pgd_mapping(early_pg_dir, load_pa, (uintptr_t)early_pud,
+ PGDIR_SIZE, PAGE_TABLE);
+ create_pud_mapping(early_pud, load_pa, (uintptr_t)early_pmd,
+ PUD_SIZE, PAGE_TABLE);
+ create_pmd_mapping(early_pmd, load_pa, load_pa,
+ PMD_SIZE, PAGE_KERNEL_EXEC);
+
+ identity_satp = PFN_DOWN((uintptr_t)&early_pg_dir) | satp_mode;
+ local_flush_tlb_all();
+ csr_write(CSR_SATP, identity_satp);
+
+ hw_satp = csr_read(CSR_SATP);
+ csr_write(CSR_SATP, 0ULL);
+ local_flush_tlb_all();
+
+ if (hw_satp != identity_satp)
+ disable_pgtable_l4();
+
+ memset(early_pg_dir, 0, PAGE_SIZE);
+ memset(early_pud, 0, PAGE_SIZE);
+ memset(early_pmd, 0, PAGE_SIZE);
+}
+#endif
+
static void __init create_kernel_page_table(pgd_t *pgdir, uintptr_t map_size)
{
uintptr_t va, end_va;
@@ -460,9 +601,23 @@ asmlinkage void __init setup_vm(uintptr_t dtb_pa)
load_sz = (uintptr_t)(&_end) - load_pa;
map_size = best_map_size(load_pa, MAX_EARLY_MAPPING_SIZE);
+ pt_ops.alloc_pte = alloc_pte_early;
+ pt_ops.get_pte_virt = get_pte_virt_early;
+#ifndef __PAGETABLE_PMD_FOLDED
+ pt_ops.alloc_pmd = alloc_pmd_early;
+ pt_ops.get_pmd_virt = get_pmd_virt_early;
+ pt_ops.alloc_pud = alloc_pud_early;
+ pt_ops.get_pud_virt = get_pud_virt_early;
+#endif
+
+#if defined(CONFIG_64BIT) && !defined(CONFIG_MAXPHYSMEM_2GB)
+ set_satp_mode(load_pa);
+#endif
+
+ kernel_virt_addr = KERNEL_VIRT_ADDR;
+
va_pa_offset = PAGE_OFFSET - load_pa;
va_kernel_pa_offset = kernel_virt_addr - load_pa;
-
pfn_base = PFN_DOWN(load_pa);
/*
@@ -476,23 +631,24 @@ asmlinkage void __init setup_vm(uintptr_t dtb_pa)
BUG_ON((load_pa % map_size) != 0);
BUG_ON(load_sz > MAX_EARLY_MAPPING_SIZE);
- pt_ops.alloc_pte = alloc_pte_early;
- pt_ops.get_pte_virt = get_pte_virt_early;
-#ifndef __PAGETABLE_PMD_FOLDED
- pt_ops.alloc_pmd = alloc_pmd_early;
- pt_ops.get_pmd_virt = get_pmd_virt_early;
-#endif
/* Setup early PGD for fixmap */
create_pgd_mapping(early_pg_dir, FIXADDR_START,
- (uintptr_t)fixmap_pgd_next, PGDIR_SIZE, PAGE_TABLE);
+ fixmap_pgd_next, PGDIR_SIZE, PAGE_TABLE);
#ifndef __PAGETABLE_PMD_FOLDED
- /* Setup fixmap PMD */
+ /* Setup fixmap PUD and PMD */
+ if (pgtable_l4_enabled)
+ create_pud_mapping(fixmap_pud, FIXADDR_START,
+ (uintptr_t)fixmap_pmd, PUD_SIZE, PAGE_TABLE);
create_pmd_mapping(fixmap_pmd, FIXADDR_START,
(uintptr_t)fixmap_pte, PMD_SIZE, PAGE_TABLE);
+
/* Setup trampoline PGD and PMD */
create_pgd_mapping(trampoline_pg_dir, kernel_virt_addr,
- (uintptr_t)trampoline_pmd, PGDIR_SIZE, PAGE_TABLE);
+ trampoline_pgd_next, PGDIR_SIZE, PAGE_TABLE);
+ if (pgtable_l4_enabled)
+ create_pud_mapping(trampoline_pud, kernel_virt_addr,
+ (uintptr_t)trampoline_pmd, PUD_SIZE, PAGE_TABLE);
create_pmd_mapping(trampoline_pmd, kernel_virt_addr,
load_pa, PMD_SIZE, PAGE_KERNEL_EXEC);
#else
@@ -509,9 +665,12 @@ asmlinkage void __init setup_vm(uintptr_t dtb_pa)
create_kernel_page_table(early_pg_dir, map_size);
#ifndef __PAGETABLE_PMD_FOLDED
- /* Setup early PMD for DTB */
+ /* Setup early PUD and PMD for DTB */
create_pgd_mapping(early_pg_dir, DTB_EARLY_BASE_VA,
- (uintptr_t)early_dtb_pmd, PGDIR_SIZE, PAGE_TABLE);
+ (uintptr_t)early_dtb_pgd_next, PGDIR_SIZE, PAGE_TABLE);
+ if (pgtable_l4_enabled)
+ create_pud_mapping(early_dtb_pud, DTB_EARLY_BASE_VA,
+ (uintptr_t)early_dtb_pmd, PUD_SIZE, PAGE_TABLE);
/* Create two consecutive PMD mappings for FDT early scan */
pa = dtb_pa & ~(PMD_SIZE - 1);
create_pmd_mapping(early_dtb_pmd, DTB_EARLY_BASE_VA,
@@ -534,7 +693,7 @@ asmlinkage void __init setup_vm(uintptr_t dtb_pa)
* Bootime fixmap only can handle PMD_SIZE mapping. Thus, boot-ioremap
* range can not span multiple pmds.
*/
- BUILD_BUG_ON((__fix_to_virt(FIX_BTMAP_BEGIN) >> PMD_SHIFT)
+ BUG_ON((__fix_to_virt(FIX_BTMAP_BEGIN) >> PMD_SHIFT)
!= (__fix_to_virt(FIX_BTMAP_END) >> PMD_SHIFT));
#ifndef __PAGETABLE_PMD_FOLDED
@@ -577,6 +736,8 @@ static void __init setup_vm_final(void)
#ifndef __PAGETABLE_PMD_FOLDED
pt_ops.alloc_pmd = alloc_pmd_fixmap;
pt_ops.get_pmd_virt = get_pmd_virt_fixmap;
+ pt_ops.alloc_pud = alloc_pud_fixmap;
+ pt_ops.get_pud_virt = get_pud_virt_fixmap;
#endif
/* Setup swapper PGD for fixmap */
create_pgd_mapping(swapper_pg_dir, FIXADDR_START,
@@ -619,12 +780,13 @@ static void __init setup_vm_final(void)
vm_area_add_early(&vm_kernel);
- /* Clear fixmap PTE and PMD mappings */
+ /* Clear fixmap page table mappings */
clear_fixmap(FIX_PTE);
clear_fixmap(FIX_PMD);
+ clear_fixmap(FIX_PUD);
/* Move to swapper page table */
- csr_write(CSR_SATP, PFN_DOWN(__pa_symbol(swapper_pg_dir)) | SATP_MODE);
+ csr_write(CSR_SATP, PFN_DOWN(__pa_symbol(swapper_pg_dir)) | satp_mode);
local_flush_tlb_all();
/* generic page allocation functions must be used to setup page table */
@@ -633,6 +795,8 @@ static void __init setup_vm_final(void)
#ifndef __PAGETABLE_PMD_FOLDED
pt_ops.alloc_pmd = alloc_pmd_late;
pt_ops.get_pmd_virt = get_pmd_virt_late;
+ pt_ops.alloc_pud = alloc_pud_late;
+ pt_ops.get_pud_virt = get_pud_virt_late;
#endif
}
#else
diff --git a/drivers/firmware/efi/libstub/efi-stub.c b/drivers/firmware/efi/libstub/efi-stub.c
index 914a343c7785..f7e3405bceb8 100644
--- a/drivers/firmware/efi/libstub/efi-stub.c
+++ b/drivers/firmware/efi/libstub/efi-stub.c
@@ -41,7 +41,7 @@
#ifdef CONFIG_ARM64
# define EFI_RT_VIRTUAL_LIMIT DEFAULT_MAP_WINDOW_64
#else
-# define EFI_RT_VIRTUAL_LIMIT TASK_SIZE
+# define EFI_RT_VIRTUAL_LIMIT TASK_SIZE_MIN
#endif
static u64 virtmap_base = EFI_RT_VIRTUAL_BASE;
--
2.20.1
In the following commits, riscv will use almost the generic versions of
pud_alloc_one and pud_free, but an additional check is required since those
functions are only relevant when using at least a 4-level page table, which
is determined at runtime on riscv.
So move the bodies of those functions into helpers (__pud_alloc_one and
__pud_free) that riscv can call without duplicating code.
Signed-off-by: Alexandre Ghiti <[email protected]>
---
include/asm-generic/pgalloc.h | 24 ++++++++++++++++++------
1 file changed, 18 insertions(+), 6 deletions(-)
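As a rough illustration of the intent described in the commit message, an
architecture that picks the number of page table levels at runtime could wrap
the new helpers as sketched below. This is a userspace sketch, not the riscv
code from this series: pgtable_l4_enabled mirrors the riscv variable, while
fake_pud_alloc(), fake_pud_free(), arch_pud_alloc_one() and arch_pud_free()
are hypothetical stand-ins (calloc() plays the role of get_zeroed_page()).

```c
#include <stdbool.h>
#include <stdlib.h>

#define PAGE_SIZE 4096UL

typedef struct { unsigned long pud; } pud_t;

/* Runtime switch, mirroring the variable introduced by this series. */
static bool pgtable_l4_enabled = true;

/* Hypothetical stand-in for the generic __pud_alloc_one() helper. */
static pud_t *fake_pud_alloc(void)
{
	return calloc(1, PAGE_SIZE); /* zeroed page, like get_zeroed_page() */
}

/* Hypothetical stand-in for the generic __pud_free() helper. */
static void fake_pud_free(pud_t *pud)
{
	free(pud);
}

/* Arch override: only allocate a PUD page when 4 levels are in use. */
static pud_t *arch_pud_alloc_one(void)
{
	if (pgtable_l4_enabled)
		return fake_pud_alloc();

	return NULL; /* PUD level is folded: nothing to allocate */
}

/* Arch override: only free a PUD page when 4 levels are in use. */
static void arch_pud_free(pud_t *pud)
{
	if (pgtable_l4_enabled)
		fake_pud_free(pud);
}
```

The point of the split is that the arch override adds only the runtime check
and delegates the actual allocation to the shared helper.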
diff --git a/include/asm-generic/pgalloc.h b/include/asm-generic/pgalloc.h
index 02932efad3ab..977bea16cf1b 100644
--- a/include/asm-generic/pgalloc.h
+++ b/include/asm-generic/pgalloc.h
@@ -147,6 +147,15 @@ static inline void pmd_free(struct mm_struct *mm, pmd_t *pmd)
#if CONFIG_PGTABLE_LEVELS > 3
+static inline pud_t *__pud_alloc_one(struct mm_struct *mm, unsigned long addr)
+{
+ gfp_t gfp = GFP_PGTABLE_USER;
+
+ if (mm == &init_mm)
+ gfp = GFP_PGTABLE_KERNEL;
+ return (pud_t *)get_zeroed_page(gfp);
+}
+
#ifndef __HAVE_ARCH_PUD_ALLOC_ONE
/**
* pud_alloc_one - allocate a page for PUD-level page table
@@ -159,20 +168,23 @@ static inline void pmd_free(struct mm_struct *mm, pmd_t *pmd)
*/
static inline pud_t *pud_alloc_one(struct mm_struct *mm, unsigned long addr)
{
- gfp_t gfp = GFP_PGTABLE_USER;
-
- if (mm == &init_mm)
- gfp = GFP_PGTABLE_KERNEL;
- return (pud_t *)get_zeroed_page(gfp);
+ return __pud_alloc_one(mm, addr);
}
#endif
-static inline void pud_free(struct mm_struct *mm, pud_t *pud)
+static inline void __pud_free(struct mm_struct *mm, pud_t *pud)
{
BUG_ON((unsigned long)pud & (PAGE_SIZE-1));
free_page((unsigned long)pud);
}
+#ifndef __HAVE_ARCH_PUD_FREE
+static inline void pud_free(struct mm_struct *mm, pud_t *pud)
+{
+ __pud_free(mm, pud);
+}
+#endif
+
#endif /* CONFIG_PGTABLE_LEVELS > 3 */
#ifndef __HAVE_ARCH_PGD_FREE
--
2.20.1
Allow the user to ask for sv39 explicitly through the mmu-type property of
the cpu node of the device tree.
By default, the kernel boots with a 4-level page table if the hardware
supports it, but the user may prefer a 3-level page table: it consumes less
memory and is faster, since it requires fewer memory accesses on a TLB miss.
Signed-off-by: Alexandre Ghiti <[email protected]>
Reviewed-by: Anup Patel <[email protected]>
---
arch/riscv/mm/init.c | 26 ++++++++++++++++++++++++--
1 file changed, 24 insertions(+), 2 deletions(-)
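The two-step decision made by set_satp_mode() in this patch can be summarised
by the small sketch below. It is a simplified userspace model, not the kernel
code: choose_satp_mode() and enum satp_choice are hypothetical names, and the
hw_has_sv48 argument stands in for the result of the SATP probe.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

enum satp_choice { CHOICE_SV39, CHOICE_SV48 };

/*
 * mmu_type: value of the "mmu-type" property, or NULL when it is absent.
 * hw_has_sv48: outcome of the hardware probe (a plain input in this model).
 */
static enum satp_choice choose_satp_mode(const char *mmu_type, bool hw_has_sv48)
{
	/* 1/ An explicit sv39 request in the device tree wins. */
	if (mmu_type && !strcmp(mmu_type, "riscv,sv39"))
		return CHOICE_SV39;

	/* 2/ Otherwise use sv48 if the hardware accepts it, else sv39. */
	return hw_has_sv48 ? CHOICE_SV48 : CHOICE_SV39;
}
```

Note that a device tree asking for sv48 cannot force it: the hardware probe
still decides in that case.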
diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c
index cb23a30d9af3..f9a99cb1870b 100644
--- a/arch/riscv/mm/init.c
+++ b/arch/riscv/mm/init.c
@@ -550,10 +550,32 @@ void disable_pgtable_l4(void)
* then read SATP to see if the configuration was taken into account
* meaning sv48 is supported.
*/
-asmlinkage __init void set_satp_mode(uintptr_t load_pa)
+asmlinkage __init void set_satp_mode(uintptr_t load_pa, uintptr_t dtb_pa)
{
u64 identity_satp, hw_satp;
+ int cpus_node;
+ /* 1/ Check if the user asked for sv39 explicitly in the device tree */
+ cpus_node = fdt_path_offset((void *)dtb_pa, "/cpus");
+ if (cpus_node >= 0) {
+ int node;
+
+ fdt_for_each_subnode(node, (void *)dtb_pa, cpus_node) {
+ const char *mmu_type = fdt_getprop((void *)dtb_pa, node,
+ "mmu-type", NULL);
+ if (!mmu_type)
+ continue;
+
+ if (!strcmp(mmu_type, "riscv,sv39")) {
+ disable_pgtable_l4();
+ return;
+ }
+
+ break;
+ }
+ }
+
+ /* 2/ Determine if the HW supports sv48: if not, fallback to sv39 */
create_pgd_mapping(early_pg_dir, load_pa, (uintptr_t)early_pud,
PGDIR_SIZE, PAGE_TABLE);
create_pud_mapping(early_pud, load_pa, (uintptr_t)early_pmd,
@@ -611,7 +633,7 @@ asmlinkage void __init setup_vm(uintptr_t dtb_pa)
#endif
#if defined(CONFIG_64BIT) && !defined(CONFIG_MAXPHYSMEM_2GB)
- set_satp_mode(load_pa);
+ set_satp_mode(load_pa, dtb_pa);
#endif
kernel_virt_addr = KERNEL_VIRT_ADDR;
--
2.20.1
Now that the mmu type is determined at runtime by probing SATP, use the
global variable pgtable_l4_enabled to report the mmu type of the processor
in /proc/cpuinfo instead of relying on device tree information.
Signed-off-by: Alexandre Ghiti <[email protected]>
Reviewed-by: Anup Patel <[email protected]>
Reviewed-by: Palmer Dabbelt <[email protected]>
---
arch/riscv/include/asm/pgtable.h | 1 +
arch/riscv/kernel/cpu.c | 23 ++++++++++++-----------
2 files changed, 13 insertions(+), 11 deletions(-)
diff --git a/arch/riscv/include/asm/pgtable.h b/arch/riscv/include/asm/pgtable.h
index dd27d28f1d9e..95721016049d 100644
--- a/arch/riscv/include/asm/pgtable.h
+++ b/arch/riscv/include/asm/pgtable.h
@@ -495,6 +495,7 @@ extern char _start[];
extern void *dtb_early_va;
extern uintptr_t dtb_early_pa;
extern u64 satp_mode;
+extern bool pgtable_l4_enabled;
void setup_bootmem(void);
void paging_init(void);
diff --git a/arch/riscv/kernel/cpu.c b/arch/riscv/kernel/cpu.c
index 6d59e6906fdd..dea9b1c31889 100644
--- a/arch/riscv/kernel/cpu.c
+++ b/arch/riscv/kernel/cpu.c
@@ -7,6 +7,7 @@
#include <linux/seq_file.h>
#include <linux/of.h>
#include <asm/smp.h>
+#include <asm/pgtable.h>
/*
* Returns the hart ID of the given device tree node, or -ENODEV if the node
@@ -70,18 +71,19 @@ static void print_isa(struct seq_file *f, const char *isa)
seq_puts(f, "\n");
}
-static void print_mmu(struct seq_file *f, const char *mmu_type)
+static void print_mmu(struct seq_file *f)
{
+ char sv_type[16];
+
#if defined(CONFIG_32BIT)
- if (strcmp(mmu_type, "riscv,sv32") != 0)
- return;
+ strncpy(sv_type, "sv32", 5);
#elif defined(CONFIG_64BIT)
- if (strcmp(mmu_type, "riscv,sv39") != 0 &&
- strcmp(mmu_type, "riscv,sv48") != 0)
- return;
+ if (pgtable_l4_enabled)
+ strncpy(sv_type, "sv48", 5);
+ else
+ strncpy(sv_type, "sv39", 5);
#endif
-
- seq_printf(f, "mmu\t\t: %s\n", mmu_type+6);
+ seq_printf(f, "mmu\t\t: %s\n", sv_type);
}
static void *c_start(struct seq_file *m, loff_t *pos)
@@ -106,14 +108,13 @@ static int c_show(struct seq_file *m, void *v)
{
unsigned long cpu_id = (unsigned long)v - 1;
struct device_node *node = of_get_cpu_node(cpu_id, NULL);
- const char *compat, *isa, *mmu;
+ const char *compat, *isa;
seq_printf(m, "processor\t: %lu\n", cpu_id);
seq_printf(m, "hart\t\t: %lu\n", cpuid_to_hartid_map(cpu_id));
if (!of_property_read_string(node, "riscv,isa", &isa))
print_isa(m, isa);
- if (!of_property_read_string(node, "mmu-type", &mmu))
- print_mmu(m, mmu);
+ print_mmu(m);
if (!of_property_read_string(node, "compatible", &compat)
&& strcmp(compat, "riscv"))
seq_printf(m, "uarch\t\t: %s\n", compat);
--
2.20.1
Define precisely the size of the user-accessible virtual address space
for the sv32/39/48 mmu types and explain why the whole virtual address
space is split into two equal halves between kernel and user space.
Signed-off-by: Alexandre Ghiti <[email protected]>
Reviewed-by: Anup Patel <[email protected]>
Reviewed-by: Palmer Dabbelt <[email protected]>
---
arch/riscv/include/asm/pgtable.h | 11 +++++++++--
1 file changed, 9 insertions(+), 2 deletions(-)
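The RV64 values quoted in the new comment follow directly from
TASK_SIZE = PGDIR_SIZE * PTRS_PER_PGD / 2. The sketch below redoes that
arithmetic in userspace; task_size() is a hypothetical helper, and the
PGDIR_SHIFT values (30 for sv39, 39 for sv48) and the 512 entries per table
are assumed from 4 KiB pages with 8-byte entries. RV32 is not covered, since
its TASK_SIZE is FIXADDR_START rather than this formula.

```c
#include <stdint.h>

#define PTRS_PER_PGD 512ULL /* 4 KiB page / 8-byte entries (assumed) */

/* pgdir_shift is assumed to be 30 for sv39 and 39 for sv48. */
static uint64_t task_size(unsigned int pgdir_shift)
{
	uint64_t pgdir_size = 1ULL << pgdir_shift;

	/* User space gets the lower half of the PGD entries. */
	return pgdir_size * PTRS_PER_PGD / 2;
}
```

With these assumptions, task_size(30) is 2^38 (256GB) and task_size(39) is
2^47 (128TB), which matches the values in the comment.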
diff --git a/arch/riscv/include/asm/pgtable.h b/arch/riscv/include/asm/pgtable.h
index 95721016049d..360858cdbfdd 100644
--- a/arch/riscv/include/asm/pgtable.h
+++ b/arch/riscv/include/asm/pgtable.h
@@ -465,8 +465,15 @@ static inline int ptep_clear_flush_young(struct vm_area_struct *vma,
#endif
/*
- * Task size is 0x4000000000 for RV64 or 0x9fc00000 for RV32.
- * Note that PGDIR_SIZE must evenly divide TASK_SIZE.
+ * Task size is:
+ * - 0x9fc00000 (~2.5GB) for RV32.
+ * - 0x4000000000 ( 256GB) for RV64 using SV39 mmu
+ * - 0x800000000000 ( 128TB) for RV64 using SV48 mmu
+ *
+ * Note that PGDIR_SIZE must evenly divide TASK_SIZE since "RISC-V
+ * Instruction Set Manual Volume II: Privileged Architecture" states that
+ * "load and store effective addresses, which are 64bits, must have bits
+ * 63–48 all equal to bit 47, or else a page-fault exception will occur."
*/
#ifdef CONFIG_64BIT
#define TASK_SIZE (PGDIR_SIZE * PTRS_PER_PGD / 2)
--
2.20.1
With the arrival of sv48 and its large address space, it would be
cumbersome to statically define the unit used to print each region of the
virtual memory layout. Instead, determine it dynamically from the size of
the region.
Signed-off-by: Alexandre Ghiti <[email protected]>
---
arch/riscv/mm/init.c | 46 ++++++++++++++++++++++++++++++++++++-------
include/linux/sizes.h | 3 ++-
2 files changed, 41 insertions(+), 8 deletions(-)
diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c
index f9a99cb1870b..f06c21985274 100644
--- a/arch/riscv/mm/init.c
+++ b/arch/riscv/mm/init.c
@@ -80,30 +80,62 @@ static void setup_zero_page(void)
}
#if defined(CONFIG_MMU) && defined(CONFIG_DEBUG_VM)
+
+#define LOG2_SZ_1K ilog2(SZ_1K)
+#define LOG2_SZ_1M ilog2(SZ_1M)
+#define LOG2_SZ_1G ilog2(SZ_1G)
+#define LOG2_SZ_1T ilog2(SZ_1T)
+
static inline void print_mlk(char *name, unsigned long b, unsigned long t)
{
pr_notice("%12s : 0x%08lx - 0x%08lx (%4ld kB)\n", name, b, t,
- (((t) - (b)) >> 10));
+ (((t) - (b)) >> LOG2_SZ_1K));
}
static inline void print_mlm(char *name, unsigned long b, unsigned long t)
{
pr_notice("%12s : 0x%08lx - 0x%08lx (%4ld MB)\n", name, b, t,
- (((t) - (b)) >> 20));
+ (((t) - (b)) >> LOG2_SZ_1M));
+}
+
+static inline void print_mlg(char *name, unsigned long b, unsigned long t)
+{
+ pr_notice("%12s : 0x%08lx - 0x%08lx (%4ld GB)\n", name, b, t,
+ (((t) - (b)) >> LOG2_SZ_1G));
+}
+
+static inline void print_mlt(char *name, unsigned long b, unsigned long t)
+{
+ pr_notice("%12s : 0x%08lx - 0x%08lx (%4ld TB)\n", name, b, t,
+ (((t) - (b)) >> LOG2_SZ_1T));
+}
+
+static inline void print_ml(char *name, unsigned long b, unsigned long t)
+{
+ unsigned long diff = t - b;
+
+ if ((diff >> LOG2_SZ_1T) >= 10)
+ print_mlt(name, b, t);
+ else if ((diff >> LOG2_SZ_1G) >= 10)
+ print_mlg(name, b, t);
+ else if ((diff >> LOG2_SZ_1M) >= 10)
+ print_mlm(name, b, t);
+ else
+ print_mlk(name, b, t);
}
static void print_vm_layout(void)
{
pr_notice("Virtual kernel memory layout:\n");
- print_mlk("fixmap", (unsigned long)FIXADDR_START,
+ print_ml("fixmap", (unsigned long)FIXADDR_START,
(unsigned long)FIXADDR_TOP);
- print_mlm("pci io", (unsigned long)PCI_IO_START,
+ print_ml("pci io", (unsigned long)PCI_IO_START,
(unsigned long)PCI_IO_END);
- print_mlm("vmemmap", (unsigned long)VMEMMAP_START,
+ print_ml("vmemmap", (unsigned long)VMEMMAP_START,
(unsigned long)VMEMMAP_END);
- print_mlm("vmalloc", (unsigned long)VMALLOC_START,
+ print_ml("vmalloc", (unsigned long)VMALLOC_START,
(unsigned long)VMALLOC_END);
- print_mlm("lowmem", (unsigned long)PAGE_OFFSET,
+ print_ml("lowmem", (unsigned long)PAGE_OFFSET,
(unsigned long)high_memory);
}
#else
diff --git a/include/linux/sizes.h b/include/linux/sizes.h
index 9874f6f67537..9528b082873b 100644
--- a/include/linux/sizes.h
+++ b/include/linux/sizes.h
@@ -42,8 +42,9 @@
#define SZ_1G 0x40000000
#define SZ_2G 0x80000000
-
#define SZ_4G _AC(0x100000000, ULL)
+
+#define SZ_1T _AC(0x10000000000, ULL)
#define SZ_64T _AC(0x400000000000, ULL)
#endif /* __LINUX_SIZES_H__ */
--
2.20.1
The kernel is now mapped at the end of the address space and should be
accessed through this mapping only, so map the whole kernel in the linear
mapping as read-only.
Signed-off-by: Alexandre Ghiti <[email protected]>
---
arch/riscv/include/asm/page.h | 9 ++++++++-
arch/riscv/mm/init.c | 29 +++++++++++++++++++++--------
2 files changed, 29 insertions(+), 9 deletions(-)
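Because the read-only protection is applied with PMD-sized mappings, the
patch rounds the reserved kernel region up to a multiple of PMD_SIZE so that
no allocation lands between _end and the next PMD boundary. The rounding
expression used in memblock_reserve() can be checked in isolation with the
sketch below; pmd_align_up() is a hypothetical helper, and PMD_SIZE is
assumed to be 2 MiB (4 KiB pages on RV64).

```c
#include <stdint.h>

#define PMD_SIZE (1ULL << 21) /* 2 MiB, assuming 4 KiB pages on RV64 */

/* Round sz up to the next multiple of PMD_SIZE (a power of two). */
static uint64_t pmd_align_up(uint64_t sz)
{
	return (sz + PMD_SIZE - 1) & ~(PMD_SIZE - 1);
}
```

A size that is already aligned is returned unchanged, so the reservation
never grows by more than PMD_SIZE - 1 bytes.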
diff --git a/arch/riscv/include/asm/page.h b/arch/riscv/include/asm/page.h
index 98188e315e8d..a93e35aaa717 100644
--- a/arch/riscv/include/asm/page.h
+++ b/arch/riscv/include/asm/page.h
@@ -102,8 +102,15 @@ extern unsigned long pfn_base;
extern unsigned long max_low_pfn;
extern unsigned long min_low_pfn;
extern unsigned long kernel_virt_addr;
+extern uintptr_t load_pa, load_sz;
+
+#define linear_mapping_pa_to_va(x) ((void *)((unsigned long)(x) + va_pa_offset))
+#define kernel_mapping_pa_to_va(x) \
+ ((void *)((unsigned long) (x) + va_kernel_pa_offset))
+#define __pa_to_va_nodebug(x) \
+ (((x) >= load_pa && (x) < load_pa + load_sz) ? \
+ kernel_mapping_pa_to_va(x) : linear_mapping_pa_to_va(x))
-#define __pa_to_va_nodebug(x) ((void *)((unsigned long) (x) + va_pa_offset))
#define linear_mapping_va_to_pa(x) ((unsigned long)(x) - va_pa_offset)
#define kernel_mapping_va_to_pa(x) \
((unsigned long)(x) - va_kernel_pa_offset)
diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c
index 9d06ff0e015a..7b87c14f1d24 100644
--- a/arch/riscv/mm/init.c
+++ b/arch/riscv/mm/init.c
@@ -159,8 +159,6 @@ void __init setup_bootmem(void)
{
phys_addr_t mem_start = 0;
phys_addr_t start, end = 0;
- phys_addr_t vmlinux_end = __pa_symbol(&_end);
- phys_addr_t vmlinux_start = __pa_symbol(&_start);
u64 i;
/* Find the memory region containing the kernel */
@@ -168,7 +166,7 @@ void __init setup_bootmem(void)
phys_addr_t size = end - start;
if (!mem_start)
mem_start = start;
- if (start <= vmlinux_start && vmlinux_end <= end)
+ if (start <= load_pa && (load_pa + load_sz) <= end)
BUG_ON(size == 0);
}
@@ -179,8 +177,13 @@ void __init setup_bootmem(void)
*/
memblock_enforce_memory_limit(mem_start - PAGE_OFFSET);
- /* Reserve from the start of the kernel to the end of the kernel */
- memblock_reserve(vmlinux_start, vmlinux_end - vmlinux_start);
+ /*
+ * Reserve from the start of the kernel to the end of the kernel
+ * and make sure we align the reservation on PMD_SIZE since we will
+ * map the kernel in the linear mapping as read-only: we do not want
+ * any allocation to happen between _end and the next pmd aligned page.
+ */
+ memblock_reserve(load_pa, (load_sz + PMD_SIZE - 1) & ~(PMD_SIZE - 1));
max_pfn = PFN_DOWN(memblock_end_of_DRAM());
max_low_pfn = max_pfn;
@@ -438,7 +441,9 @@ static uintptr_t __init best_map_size(phys_addr_t base, phys_addr_t size)
#error "setup_vm() is called from head.S before relocate so it should not use absolute addressing."
#endif
-static uintptr_t load_pa, load_sz;
+uintptr_t load_pa, load_sz;
+EXPORT_SYMBOL(load_pa);
+EXPORT_SYMBOL(load_sz);
static void __init create_kernel_page_table(pgd_t *pgdir, uintptr_t map_size)
{
@@ -596,9 +601,17 @@ static void __init setup_vm_final(void)
map_size = best_map_size(start, end - start);
for (pa = start; pa < end; pa += map_size) {
- va = (uintptr_t)__va(pa);
+ pgprot_t prot = PAGE_KERNEL;
+
+ /* Protect the kernel mapping that lies in the linear mapping */
+ if (pa >= __pa(_start) && pa < __pa(_end))
+ prot = PAGE_KERNEL_READ;
+
+ /* Make sure we get virtual addresses in the linear mapping */
+ va = (uintptr_t)linear_mapping_pa_to_va(pa);
+
create_pgd_mapping(swapper_pg_dir, va, pa,
- map_size, PAGE_KERNEL);
+ map_size, prot);
}
}
--
2.20.1
On Tue, Jan 5, 2021 at 1:29 AM Alexandre Ghiti <[email protected]> wrote:
>
> This is a preparatory patch for relocatable kernel and sv48 support.
>
> The kernel used to be linked at PAGE_OFFSET address therefore we could use
> the linear mapping for the kernel mapping. But the relocated kernel base
> address will be different from PAGE_OFFSET and since in the linear mapping,
> two different virtual addresses cannot point to the same physical address,
> the kernel mapping needs to lie outside the linear mapping so that we don't
> have to copy it at the same physical offset.
>
> The kernel mapping is moved to the last 2GB of the address space and then
> BPF and modules are also pushed to the same range since they have to lie
> close to the kernel inside a 2GB window.
>
> Note then that KASLR implementation will simply have to move the kernel in
> this 2GB range and modify BPF/modules regions accordingly.
>
> In addition, by moving the kernel to the end of the address space, both
> sv39 and sv48 kernels will be exactly the same without needing to be
> relocated at runtime.
Awesome! This is a good approach with no performance impact.
>
> Suggested-by: Arnd Bergmann <[email protected]>
> Signed-off-by: Alexandre Ghiti <[email protected]>
> ---
> arch/riscv/boot/loader.lds.S | 3 +-
> arch/riscv/include/asm/page.h | 10 ++++-
> arch/riscv/include/asm/pgtable.h | 39 +++++++++++++------
> arch/riscv/kernel/head.S | 3 +-
> arch/riscv/kernel/module.c | 4 +-
> arch/riscv/kernel/vmlinux.lds.S | 3 +-
> arch/riscv/mm/init.c | 65 ++++++++++++++++++++++++--------
> arch/riscv/mm/physaddr.c | 2 +-
> 8 files changed, 94 insertions(+), 35 deletions(-)
>
> diff --git a/arch/riscv/boot/loader.lds.S b/arch/riscv/boot/loader.lds.S
> index 47a5003c2e28..62d94696a19c 100644
> --- a/arch/riscv/boot/loader.lds.S
> +++ b/arch/riscv/boot/loader.lds.S
> @@ -1,13 +1,14 @@
> /* SPDX-License-Identifier: GPL-2.0 */
>
> #include <asm/page.h>
> +#include <asm/pgtable.h>
>
> OUTPUT_ARCH(riscv)
> ENTRY(_start)
>
> SECTIONS
> {
> - . = PAGE_OFFSET;
> + . = KERNEL_LINK_ADDR;
>
> .payload : {
> *(.payload)
> diff --git a/arch/riscv/include/asm/page.h b/arch/riscv/include/asm/page.h
> index 2d50f76efe48..98188e315e8d 100644
> --- a/arch/riscv/include/asm/page.h
> +++ b/arch/riscv/include/asm/page.h
> @@ -90,18 +90,26 @@ typedef struct page *pgtable_t;
>
> #ifdef CONFIG_MMU
> extern unsigned long va_pa_offset;
> +extern unsigned long va_kernel_pa_offset;
> extern unsigned long pfn_base;
> #define ARCH_PFN_OFFSET (pfn_base)
> #else
> #define va_pa_offset 0
> +#define va_kernel_pa_offset 0
> #define ARCH_PFN_OFFSET (PAGE_OFFSET >> PAGE_SHIFT)
> #endif /* CONFIG_MMU */
>
> extern unsigned long max_low_pfn;
> extern unsigned long min_low_pfn;
> +extern unsigned long kernel_virt_addr;
>
> #define __pa_to_va_nodebug(x) ((void *)((unsigned long) (x) + va_pa_offset))
> -#define __va_to_pa_nodebug(x) ((unsigned long)(x) - va_pa_offset)
> +#define linear_mapping_va_to_pa(x) ((unsigned long)(x) - va_pa_offset)
> +#define kernel_mapping_va_to_pa(x) \
> + ((unsigned long)(x) - va_kernel_pa_offset)
> +#define __va_to_pa_nodebug(x) \
> + (((x) < KERNEL_LINK_ADDR) ? \
> + linear_mapping_va_to_pa(x) : kernel_mapping_va_to_pa(x))
>
> #ifdef CONFIG_DEBUG_VIRTUAL
> extern phys_addr_t __virt_to_phys(unsigned long x);
> diff --git a/arch/riscv/include/asm/pgtable.h b/arch/riscv/include/asm/pgtable.h
> index 183f1f4b2ae6..102b728ca146 100644
> --- a/arch/riscv/include/asm/pgtable.h
> +++ b/arch/riscv/include/asm/pgtable.h
> @@ -11,23 +11,32 @@
>
> #include <asm/pgtable-bits.h>
>
> -#ifndef __ASSEMBLY__
> -
> -/* Page Upper Directory not used in RISC-V */
> -#include <asm-generic/pgtable-nopud.h>
> -#include <asm/page.h>
> -#include <asm/tlbflush.h>
> -#include <linux/mm_types.h>
> +#ifndef CONFIG_MMU
> +#define KERNEL_VIRT_ADDR PAGE_OFFSET
> +#define KERNEL_LINK_ADDR PAGE_OFFSET
> +#else
>
> -#ifdef CONFIG_MMU
> +#define ADDRESS_SPACE_END (UL(-1))
> +/*
> + * Leave 2GB for kernel, modules and BPF at the end of the address space
> + */
> +#define KERNEL_VIRT_ADDR (ADDRESS_SPACE_END - SZ_2G + 1)
> +#define KERNEL_LINK_ADDR KERNEL_VIRT_ADDR
>
> #define VMALLOC_SIZE (KERN_VIRT_SIZE >> 1)
> #define VMALLOC_END (PAGE_OFFSET - 1)
> #define VMALLOC_START (PAGE_OFFSET - VMALLOC_SIZE)
>
> +/* KASLR should leave at least 128MB for BPF after the kernel */
> #define BPF_JIT_REGION_SIZE (SZ_128M)
> -#define BPF_JIT_REGION_START (PAGE_OFFSET - BPF_JIT_REGION_SIZE)
> -#define BPF_JIT_REGION_END (VMALLOC_END)
> +#define BPF_JIT_REGION_START PFN_ALIGN((unsigned long)&_end)
> +#define BPF_JIT_REGION_END (BPF_JIT_REGION_START + BPF_JIT_REGION_SIZE)
> +
> +/* Modules always live before the kernel */
> +#ifdef CONFIG_64BIT
> +#define VMALLOC_MODULE_START (PFN_ALIGN((unsigned long)&_end) - SZ_2G)
> +#define VMALLOC_MODULE_END (PFN_ALIGN((unsigned long)&_start))
> +#endif
This does not look right or I am missing something.
I think the VMALLOC_MODULE_START should be:
#define VMALLOC_MODULE_START (PFN_ALIGN((unsigned long)&_start) - SZ_2G)
>
> /*
> * Roughly size the vmemmap space to be large enough to fit enough
> @@ -57,9 +66,16 @@
> #define FIXADDR_SIZE PGDIR_SIZE
> #endif
> #define FIXADDR_START (FIXADDR_TOP - FIXADDR_SIZE)
> -
> #endif
>
> +#ifndef __ASSEMBLY__
> +
> +/* Page Upper Directory not used in RISC-V */
> +#include <asm-generic/pgtable-nopud.h>
> +#include <asm/page.h>
> +#include <asm/tlbflush.h>
> +#include <linux/mm_types.h>
> +
> #ifdef CONFIG_64BIT
> #include <asm/pgtable-64.h>
> #else
> @@ -467,6 +483,7 @@ static inline void __kernel_map_pages(struct page *page, int numpages, int enabl
>
> #define kern_addr_valid(addr) (1) /* FIXME */
>
> +extern char _start[];
> extern void *dtb_early_va;
> extern uintptr_t dtb_early_pa;
> void setup_bootmem(void);
> diff --git a/arch/riscv/kernel/head.S b/arch/riscv/kernel/head.S
> index 7e849797c9c3..66f40c49bf68 100644
> --- a/arch/riscv/kernel/head.S
> +++ b/arch/riscv/kernel/head.S
> @@ -69,7 +69,8 @@ pe_head_start:
> #ifdef CONFIG_MMU
> relocate:
> /* Relocate return address */
> - li a1, PAGE_OFFSET
> + la a1, kernel_virt_addr
> + REG_L a1, 0(a1)
> la a2, _start
> sub a1, a1, a2
> add ra, ra, a1
> diff --git a/arch/riscv/kernel/module.c b/arch/riscv/kernel/module.c
> index 104fba889cf7..75a0b9541266 100644
> --- a/arch/riscv/kernel/module.c
> +++ b/arch/riscv/kernel/module.c
> @@ -408,12 +408,10 @@ int apply_relocate_add(Elf_Shdr *sechdrs, const char *strtab,
> }
>
> #if defined(CONFIG_MMU) && defined(CONFIG_64BIT)
> -#define VMALLOC_MODULE_START \
> - max(PFN_ALIGN((unsigned long)&_end - SZ_2G), VMALLOC_START)
> void *module_alloc(unsigned long size)
> {
> return __vmalloc_node_range(size, 1, VMALLOC_MODULE_START,
> - VMALLOC_END, GFP_KERNEL,
> + VMALLOC_MODULE_END, GFP_KERNEL,
> PAGE_KERNEL_EXEC, 0, NUMA_NO_NODE,
> __builtin_return_address(0));
> }
> diff --git a/arch/riscv/kernel/vmlinux.lds.S b/arch/riscv/kernel/vmlinux.lds.S
> index 3ffbd6cbdb86..c21dc46f41be 100644
> --- a/arch/riscv/kernel/vmlinux.lds.S
> +++ b/arch/riscv/kernel/vmlinux.lds.S
> @@ -4,7 +4,8 @@
> * Copyright (C) 2017 SiFive
> */
>
> -#define LOAD_OFFSET PAGE_OFFSET
> +#include <asm/pgtable.h>
> +#define LOAD_OFFSET KERNEL_LINK_ADDR
> #include <asm/vmlinux.lds.h>
> #include <asm/page.h>
> #include <asm/cache.h>
> diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c
> index 8e577f14f120..9d06ff0e015a 100644
> --- a/arch/riscv/mm/init.c
> +++ b/arch/riscv/mm/init.c
> @@ -23,6 +23,9 @@
>
> #include "../kernel/head.h"
>
> +unsigned long kernel_virt_addr = KERNEL_VIRT_ADDR;
> +EXPORT_SYMBOL(kernel_virt_addr);
> +
> unsigned long empty_zero_page[PAGE_SIZE / sizeof(unsigned long)]
> __page_aligned_bss;
> EXPORT_SYMBOL(empty_zero_page);
> @@ -201,8 +204,12 @@ void __init setup_bootmem(void)
> #ifdef CONFIG_MMU
> static struct pt_alloc_ops pt_ops;
>
> +/* Offset between linear mapping virtual address and kernel load address */
> unsigned long va_pa_offset;
> EXPORT_SYMBOL(va_pa_offset);
> +/* Offset between kernel mapping virtual address and kernel load address */
> +unsigned long va_kernel_pa_offset;
> +EXPORT_SYMBOL(va_kernel_pa_offset);
> unsigned long pfn_base;
> EXPORT_SYMBOL(pfn_base);
>
> @@ -316,7 +323,7 @@ static phys_addr_t __init alloc_pmd_early(uintptr_t va)
> {
> uintptr_t pmd_num;
>
> - pmd_num = (va - PAGE_OFFSET) >> PGDIR_SHIFT;
> + pmd_num = (va - kernel_virt_addr) >> PGDIR_SHIFT;
> BUG_ON(pmd_num >= NUM_EARLY_PMDS);
> return (uintptr_t)&early_pmd[pmd_num * PTRS_PER_PMD];
> }
> @@ -431,17 +438,34 @@ static uintptr_t __init best_map_size(phys_addr_t base, phys_addr_t size)
> #error "setup_vm() is called from head.S before relocate so it should not use absolute addressing."
> #endif
>
> +static uintptr_t load_pa, load_sz;
> +
> +static void __init create_kernel_page_table(pgd_t *pgdir, uintptr_t map_size)
> +{
> + uintptr_t va, end_va;
> +
> + end_va = kernel_virt_addr + load_sz;
> + for (va = kernel_virt_addr; va < end_va; va += map_size)
> + create_pgd_mapping(pgdir, va,
> + load_pa + (va - kernel_virt_addr),
> + map_size, PAGE_KERNEL_EXEC);
> +}
> +
> asmlinkage void __init setup_vm(uintptr_t dtb_pa)
> {
> - uintptr_t va, pa, end_va;
> - uintptr_t load_pa = (uintptr_t)(&_start);
> - uintptr_t load_sz = (uintptr_t)(&_end) - load_pa;
> - uintptr_t map_size = best_map_size(load_pa, MAX_EARLY_MAPPING_SIZE);
> + uintptr_t pa;
> + uintptr_t map_size;
> #ifndef __PAGETABLE_PMD_FOLDED
> pmd_t fix_bmap_spmd, fix_bmap_epmd;
> #endif
>
> + load_pa = (uintptr_t)(&_start);
> + load_sz = (uintptr_t)(&_end) - load_pa;
> + map_size = best_map_size(load_pa, MAX_EARLY_MAPPING_SIZE);
> +
> va_pa_offset = PAGE_OFFSET - load_pa;
> + va_kernel_pa_offset = kernel_virt_addr - load_pa;
> +
> pfn_base = PFN_DOWN(load_pa);
>
> /*
> @@ -470,26 +494,22 @@ asmlinkage void __init setup_vm(uintptr_t dtb_pa)
> create_pmd_mapping(fixmap_pmd, FIXADDR_START,
> (uintptr_t)fixmap_pte, PMD_SIZE, PAGE_TABLE);
> /* Setup trampoline PGD and PMD */
> - create_pgd_mapping(trampoline_pg_dir, PAGE_OFFSET,
> + create_pgd_mapping(trampoline_pg_dir, kernel_virt_addr,
> (uintptr_t)trampoline_pmd, PGDIR_SIZE, PAGE_TABLE);
> - create_pmd_mapping(trampoline_pmd, PAGE_OFFSET,
> + create_pmd_mapping(trampoline_pmd, kernel_virt_addr,
> load_pa, PMD_SIZE, PAGE_KERNEL_EXEC);
> #else
> /* Setup trampoline PGD */
> - create_pgd_mapping(trampoline_pg_dir, PAGE_OFFSET,
> + create_pgd_mapping(trampoline_pg_dir, kernel_virt_addr,
> load_pa, PGDIR_SIZE, PAGE_KERNEL_EXEC);
> #endif
>
> /*
> - * Setup early PGD covering entire kernel which will allows
> + * Setup early PGD covering entire kernel which will allow
> * us to reach paging_init(). We map all memory banks later
> * in setup_vm_final() below.
> */
> - end_va = PAGE_OFFSET + load_sz;
> - for (va = PAGE_OFFSET; va < end_va; va += map_size)
> - create_pgd_mapping(early_pg_dir, va,
> - load_pa + (va - PAGE_OFFSET),
> - map_size, PAGE_KERNEL_EXEC);
> + create_kernel_page_table(early_pg_dir, map_size);
>
> #ifndef __PAGETABLE_PMD_FOLDED
> /* Setup early PMD for DTB */
> @@ -549,6 +569,7 @@ static void __init setup_vm_final(void)
> uintptr_t va, map_size;
> phys_addr_t pa, start, end;
> u64 i;
> + static struct vm_struct vm_kernel = { 0 };
>
> /**
> * MMU is enabled at this point. But page table setup is not complete yet.
> @@ -565,7 +586,7 @@ static void __init setup_vm_final(void)
> __pa_symbol(fixmap_pgd_next),
> PGDIR_SIZE, PAGE_TABLE);
>
> - /* Map all memory banks */
> + /* Map all memory banks in the linear mapping */
> for_each_mem_range(i, &start, &end) {
> if (start >= end)
> break;
> @@ -577,10 +598,22 @@ static void __init setup_vm_final(void)
> for (pa = start; pa < end; pa += map_size) {
> va = (uintptr_t)__va(pa);
> create_pgd_mapping(swapper_pg_dir, va, pa,
> - map_size, PAGE_KERNEL_EXEC);
> + map_size, PAGE_KERNEL);
> }
> }
>
> + /* Map the kernel */
> + create_kernel_page_table(swapper_pg_dir, PMD_SIZE);
> +
> + /* Reserve the vmalloc area occupied by the kernel */
> + vm_kernel.addr = (void *)kernel_virt_addr;
> + vm_kernel.phys_addr = load_pa;
> + vm_kernel.size = (load_sz + PMD_SIZE - 1) & ~(PMD_SIZE - 1);
> + vm_kernel.flags = VM_MAP | VM_NO_GUARD;
> + vm_kernel.caller = __builtin_return_address(0);
> +
> + vm_area_add_early(&vm_kernel);
> +
> /* Clear fixmap PTE and PMD mappings */
> clear_fixmap(FIX_PTE);
> clear_fixmap(FIX_PMD);
> diff --git a/arch/riscv/mm/physaddr.c b/arch/riscv/mm/physaddr.c
> index e8e4dcd39fed..35703d5ef5fd 100644
> --- a/arch/riscv/mm/physaddr.c
> +++ b/arch/riscv/mm/physaddr.c
> @@ -23,7 +23,7 @@ EXPORT_SYMBOL(__virt_to_phys);
>
> phys_addr_t __phys_addr_symbol(unsigned long x)
> {
> - unsigned long kernel_start = (unsigned long)PAGE_OFFSET;
> + unsigned long kernel_start = (unsigned long)kernel_virt_addr;
> unsigned long kernel_end = (unsigned long)_end;
>
> /*
> --
> 2.20.1
>
Apart from the minor comment above, this looks good to me.
Reviewed-by: Anup Patel <[email protected]>
Regards,
Anup
On Tue, Jan 5, 2021 at 1:31 AM Alexandre Ghiti <[email protected]> wrote:
>
> The kernel is now mapped at the end of the address space and it should
> be accessed through this mapping only: so map the whole kernel in the
> linear mapping as read only.
>
> Signed-off-by: Alexandre Ghiti <[email protected]>
> ---
> arch/riscv/include/asm/page.h | 9 ++++++++-
> arch/riscv/mm/init.c | 29 +++++++++++++++++++++--------
> 2 files changed, 29 insertions(+), 9 deletions(-)
>
> diff --git a/arch/riscv/include/asm/page.h b/arch/riscv/include/asm/page.h
> index 98188e315e8d..a93e35aaa717 100644
> --- a/arch/riscv/include/asm/page.h
> +++ b/arch/riscv/include/asm/page.h
> @@ -102,8 +102,15 @@ extern unsigned long pfn_base;
> extern unsigned long max_low_pfn;
> extern unsigned long min_low_pfn;
> extern unsigned long kernel_virt_addr;
> +extern uintptr_t load_pa, load_sz;
> +
> +#define linear_mapping_pa_to_va(x) ((void *)((unsigned long)(x) + va_pa_offset))
> +#define kernel_mapping_pa_to_va(x) \
> + ((void *)((unsigned long) (x) + va_kernel_pa_offset))
> +#define __pa_to_va_nodebug(x) \
> + ((x >= load_pa && x < load_pa + load_sz) ? \
> + kernel_mapping_pa_to_va(x): linear_mapping_pa_to_va(x))
This change should be part of PATCH1
>
> -#define __pa_to_va_nodebug(x) ((void *)((unsigned long) (x) + va_pa_offset))
> #define linear_mapping_va_to_pa(x) ((unsigned long)(x) - va_pa_offset)
> #define kernel_mapping_va_to_pa(x) \
> ((unsigned long)(x) - va_kernel_pa_offset)
> diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c
> index 9d06ff0e015a..7b87c14f1d24 100644
> --- a/arch/riscv/mm/init.c
> +++ b/arch/riscv/mm/init.c
> @@ -159,8 +159,6 @@ void __init setup_bootmem(void)
> {
> phys_addr_t mem_start = 0;
> phys_addr_t start, end = 0;
> - phys_addr_t vmlinux_end = __pa_symbol(&_end);
> - phys_addr_t vmlinux_start = __pa_symbol(&_start);
This as well.
> u64 i;
>
> /* Find the memory region containing the kernel */
> @@ -168,7 +166,7 @@ void __init setup_bootmem(void)
> phys_addr_t size = end - start;
> if (!mem_start)
> mem_start = start;
> - if (start <= vmlinux_start && vmlinux_end <= end)
> + if (start <= load_pa && (load_pa + load_sz) <= end)
> BUG_ON(size == 0);
> }
>
> @@ -179,8 +177,13 @@ void __init setup_bootmem(void)
> */
> memblock_enforce_memory_limit(mem_start - PAGE_OFFSET);
>
> - /* Reserve from the start of the kernel to the end of the kernel */
> - memblock_reserve(vmlinux_start, vmlinux_end - vmlinux_start);
> + /*
> + * Reserve from the start of the kernel to the end of the kernel
> + * and make sure we align the reservation on PMD_SIZE since we will
> + * map the kernel in the linear mapping as read-only: we do not want
> + * any allocation to happen between _end and the next pmd aligned page.
> + */
> + memblock_reserve(load_pa, (load_sz + PMD_SIZE - 1) & ~(PMD_SIZE - 1));
>
> max_pfn = PFN_DOWN(memblock_end_of_DRAM());
> max_low_pfn = max_pfn;
> @@ -438,7 +441,9 @@ static uintptr_t __init best_map_size(phys_addr_t base, phys_addr_t size)
> #error "setup_vm() is called from head.S before relocate so it should not use absolute addressing."
> #endif
>
> -static uintptr_t load_pa, load_sz;
> +uintptr_t load_pa, load_sz;
> +EXPORT_SYMBOL(load_pa);
> +EXPORT_SYMBOL(load_sz);
I think all changes up to here belong in PATCH1.
Only the changes from here onwards match this patch's description.
>
> static void __init create_kernel_page_table(pgd_t *pgdir, uintptr_t map_size)
> {
> @@ -596,9 +601,17 @@ static void __init setup_vm_final(void)
>
> map_size = best_map_size(start, end - start);
> for (pa = start; pa < end; pa += map_size) {
> - va = (uintptr_t)__va(pa);
> + pgprot_t prot = PAGE_KERNEL;
> +
> + /* Protect the kernel mapping that lies in the linear mapping */
> + if (pa >= __pa(_start) && pa < __pa(_end))
> + prot = PAGE_KERNEL_READ;
> +
> + /* Make sure we get virtual addresses in the linear mapping */
> + va = (uintptr_t)linear_mapping_pa_to_va(pa);
> +
> create_pgd_mapping(swapper_pg_dir, va, pa,
> - map_size, PAGE_KERNEL);
> + map_size, prot);
> }
> }
>
> --
> 2.20.1
>
Regards,
Anup
On Tue, Jan 5, 2021 at 1:33 AM Alexandre Ghiti <[email protected]> wrote:
>
> With 4-level page table folding at runtime, we don't know at compile time
> the size of the virtual address space so we must set VA_BITS dynamically
> so that sparsemem reserves the right amount of memory for struct pages.
>
> Signed-off-by: Alexandre Ghiti <[email protected]>
> ---
> arch/riscv/Kconfig | 10 ----------
> arch/riscv/include/asm/pgtable.h | 11 +++++++++--
> arch/riscv/include/asm/sparsemem.h | 6 +++++-
> 3 files changed, 14 insertions(+), 13 deletions(-)
>
> diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
> index 44377fd7860e..2979a44103be 100644
> --- a/arch/riscv/Kconfig
> +++ b/arch/riscv/Kconfig
> @@ -122,16 +122,6 @@ config ZONE_DMA32
> bool
> default y if 64BIT
>
> -config VA_BITS
> - int
> - default 32 if 32BIT
> - default 39 if 64BIT
> -
> -config PA_BITS
> - int
> - default 34 if 32BIT
> - default 56 if 64BIT
> -
> config PAGE_OFFSET
> hex
> default 0xC0000000 if 32BIT && MAXPHYSMEM_2GB
> diff --git a/arch/riscv/include/asm/pgtable.h b/arch/riscv/include/asm/pgtable.h
> index 102b728ca146..c7973bfd65bc 100644
> --- a/arch/riscv/include/asm/pgtable.h
> +++ b/arch/riscv/include/asm/pgtable.h
> @@ -43,8 +43,14 @@
> * struct pages to map half the virtual address space. Then
> * position vmemmap directly below the VMALLOC region.
> */
> +#ifdef CONFIG_64BIT
> +#define VA_BITS 39
> +#else
> +#define VA_BITS 32
> +#endif
> +
> #define VMEMMAP_SHIFT \
> - (CONFIG_VA_BITS - PAGE_SHIFT - 1 + STRUCT_PAGE_MAX_SHIFT)
> + (VA_BITS - PAGE_SHIFT - 1 + STRUCT_PAGE_MAX_SHIFT)
> #define VMEMMAP_SIZE BIT(VMEMMAP_SHIFT)
> #define VMEMMAP_END (VMALLOC_START - 1)
> #define VMEMMAP_START (VMALLOC_START - VMEMMAP_SIZE)
> @@ -83,6 +89,7 @@
> #endif /* CONFIG_64BIT */
>
> #ifdef CONFIG_MMU
> +
> /* Number of entries in the page global directory */
> #define PTRS_PER_PGD (PAGE_SIZE / sizeof(pgd_t))
> /* Number of entries in the page table */
> @@ -453,7 +460,7 @@ static inline int ptep_clear_flush_young(struct vm_area_struct *vma,
> * and give the kernel the other (upper) half.
> */
> #ifdef CONFIG_64BIT
> -#define KERN_VIRT_START (-(BIT(CONFIG_VA_BITS)) + TASK_SIZE)
> +#define KERN_VIRT_START (-(BIT(VA_BITS)) + TASK_SIZE)
> #else
> #define KERN_VIRT_START FIXADDR_START
> #endif
> diff --git a/arch/riscv/include/asm/sparsemem.h b/arch/riscv/include/asm/sparsemem.h
> index 45a7018a8118..63acaecc3374 100644
> --- a/arch/riscv/include/asm/sparsemem.h
> +++ b/arch/riscv/include/asm/sparsemem.h
> @@ -4,7 +4,11 @@
> #define _ASM_RISCV_SPARSEMEM_H
>
> #ifdef CONFIG_SPARSEMEM
> -#define MAX_PHYSMEM_BITS CONFIG_PA_BITS
> +#ifdef CONFIG_64BIT
> +#define MAX_PHYSMEM_BITS 56
> +#else
> +#define MAX_PHYSMEM_BITS 34
> +#endif /* CONFIG_64BIT */
> #define SECTION_SIZE_BITS 27
> #endif /* CONFIG_SPARSEMEM */
>
> --
> 2.20.1
>
Looks good to me.
Reviewed-by: Anup Patel <[email protected]>
Regards,
Anup
Hi Anup,
On 1/5/21 at 6:40 AM, Anup Patel wrote:
> On Tue, Jan 5, 2021 at 1:29 AM Alexandre Ghiti <[email protected]> wrote:
>>
>> This is a preparatory patch for relocatable kernel and sv48 support.
>>
>> The kernel used to be linked at PAGE_OFFSET address therefore we could use
>> the linear mapping for the kernel mapping. But the relocated kernel base
>> address will be different from PAGE_OFFSET and since in the linear mapping,
>> two different virtual addresses cannot point to the same physical address,
>> the kernel mapping needs to lie outside the linear mapping so that we don't
>> have to copy it at the same physical offset.
>>
>> The kernel mapping is moved to the last 2GB of the address space and then
>> BPF and modules are also pushed to the same range since they have to lie
>> close to the kernel inside a 2GB window.
>>
>> Note then that KASLR implementation will simply have to move the kernel in
>> this 2GB range and modify BPF/modules regions accordingly.
>>
>> In addition, by moving the kernel to the end of the address space, both
>> sv39 and sv48 kernels will be exactly the same without needing to be
>> relocated at runtime.
>
> Awesome ! This is a good approach with no performance impact.
>
>>
>> Suggested-by: Arnd Bergmann <[email protected]>
>> Signed-off-by: Alexandre Ghiti <[email protected]>
>> ---
>> arch/riscv/boot/loader.lds.S | 3 +-
>> arch/riscv/include/asm/page.h | 10 ++++-
>> arch/riscv/include/asm/pgtable.h | 39 +++++++++++++------
>> arch/riscv/kernel/head.S | 3 +-
>> arch/riscv/kernel/module.c | 4 +-
>> arch/riscv/kernel/vmlinux.lds.S | 3 +-
>> arch/riscv/mm/init.c | 65 ++++++++++++++++++++++++--------
>> arch/riscv/mm/physaddr.c | 2 +-
>> 8 files changed, 94 insertions(+), 35 deletions(-)
>>
>> diff --git a/arch/riscv/boot/loader.lds.S b/arch/riscv/boot/loader.lds.S
>> index 47a5003c2e28..62d94696a19c 100644
>> --- a/arch/riscv/boot/loader.lds.S
>> +++ b/arch/riscv/boot/loader.lds.S
>> @@ -1,13 +1,14 @@
>> /* SPDX-License-Identifier: GPL-2.0 */
>>
>> #include <asm/page.h>
>> +#include <asm/pgtable.h>
>>
>> OUTPUT_ARCH(riscv)
>> ENTRY(_start)
>>
>> SECTIONS
>> {
>> - . = PAGE_OFFSET;
>> + . = KERNEL_LINK_ADDR;
>>
>> .payload : {
>> *(.payload)
>> diff --git a/arch/riscv/include/asm/page.h b/arch/riscv/include/asm/page.h
>> index 2d50f76efe48..98188e315e8d 100644
>> --- a/arch/riscv/include/asm/page.h
>> +++ b/arch/riscv/include/asm/page.h
>> @@ -90,18 +90,26 @@ typedef struct page *pgtable_t;
>>
>> #ifdef CONFIG_MMU
>> extern unsigned long va_pa_offset;
>> +extern unsigned long va_kernel_pa_offset;
>> extern unsigned long pfn_base;
>> #define ARCH_PFN_OFFSET (pfn_base)
>> #else
>> #define va_pa_offset 0
>> +#define va_kernel_pa_offset 0
>> #define ARCH_PFN_OFFSET (PAGE_OFFSET >> PAGE_SHIFT)
>> #endif /* CONFIG_MMU */
>>
>> extern unsigned long max_low_pfn;
>> extern unsigned long min_low_pfn;
>> +extern unsigned long kernel_virt_addr;
>>
>> #define __pa_to_va_nodebug(x) ((void *)((unsigned long) (x) + va_pa_offset))
>> -#define __va_to_pa_nodebug(x) ((unsigned long)(x) - va_pa_offset)
>> +#define linear_mapping_va_to_pa(x) ((unsigned long)(x) - va_pa_offset)
>> +#define kernel_mapping_va_to_pa(x) \
>> + ((unsigned long)(x) - va_kernel_pa_offset)
>> +#define __va_to_pa_nodebug(x) \
>> + (((x) < KERNEL_LINK_ADDR) ? \
>> + linear_mapping_va_to_pa(x) : kernel_mapping_va_to_pa(x))
>>
>> #ifdef CONFIG_DEBUG_VIRTUAL
>> extern phys_addr_t __virt_to_phys(unsigned long x);
>> diff --git a/arch/riscv/include/asm/pgtable.h b/arch/riscv/include/asm/pgtable.h
>> index 183f1f4b2ae6..102b728ca146 100644
>> --- a/arch/riscv/include/asm/pgtable.h
>> +++ b/arch/riscv/include/asm/pgtable.h
>> @@ -11,23 +11,32 @@
>>
>> #include <asm/pgtable-bits.h>
>>
>> -#ifndef __ASSEMBLY__
>> -
>> -/* Page Upper Directory not used in RISC-V */
>> -#include <asm-generic/pgtable-nopud.h>
>> -#include <asm/page.h>
>> -#include <asm/tlbflush.h>
>> -#include <linux/mm_types.h>
>> +#ifndef CONFIG_MMU
>> +#define KERNEL_VIRT_ADDR PAGE_OFFSET
>> +#define KERNEL_LINK_ADDR PAGE_OFFSET
>> +#else
>>
>> -#ifdef CONFIG_MMU
>> +#define ADDRESS_SPACE_END (UL(-1))
>> +/*
>> + * Leave 2GB for kernel, modules and BPF at the end of the address space
>> + */
>> +#define KERNEL_VIRT_ADDR (ADDRESS_SPACE_END - SZ_2G + 1)
>> +#define KERNEL_LINK_ADDR KERNEL_VIRT_ADDR
>>
>> #define VMALLOC_SIZE (KERN_VIRT_SIZE >> 1)
>> #define VMALLOC_END (PAGE_OFFSET - 1)
>> #define VMALLOC_START (PAGE_OFFSET - VMALLOC_SIZE)
>>
>> +/* KASLR should leave at least 128MB for BPF after the kernel */
>> #define BPF_JIT_REGION_SIZE (SZ_128M)
>> -#define BPF_JIT_REGION_START (PAGE_OFFSET - BPF_JIT_REGION_SIZE)
>> -#define BPF_JIT_REGION_END (VMALLOC_END)
>> +#define BPF_JIT_REGION_START PFN_ALIGN((unsigned long)&_end)
>> +#define BPF_JIT_REGION_END (BPF_JIT_REGION_START + BPF_JIT_REGION_SIZE)
>> +
>> +/* Modules always live before the kernel */
>> +#ifdef CONFIG_64BIT
>> +#define VMALLOC_MODULE_START (PFN_ALIGN((unsigned long)&_end) - SZ_2G)
>> +#define VMALLOC_MODULE_END (PFN_ALIGN((unsigned long)&_start))
>> +#endif
>
> This does not look right or I am missing something.
>
> I think the VMALLOC_MODULE_START should be:
> #define VMALLOC_MODULE_START (PFN_ALIGN((unsigned long)&_start) - SZ_2G)
>
I think the patch is correct: worst-case, we want the first address of
the module area to be able to access the last address of the kernel, so
we must use _end and not _start to guarantee that the difference between
those 2 addresses is not greater than 2GB.
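
To illustrate the worst case with made-up numbers, here is a minimal
sketch in plain user-space C. The addresses and the 16MB image size are
assumptions chosen for illustration only, the EX_* names and helper
functions are hypothetical, and PFN_ALIGN is left out for brevity. It
compares the distance from the lowest module address to the last byte of
the kernel when the module area start is derived from _end (as in the
patch) and from _start (as suggested above):

```c
#include <assert.h>
#include <stdint.h>

#define SZ_2G 0x80000000ULL

/* Hypothetical layout values, chosen only for illustration. */
#define EX_KERNEL_START 0xffffffff80000000ULL	/* stands in for _start */
#define EX_KERNEL_SIZE  0x0000000001000000ULL	/* assumed 16MB image */
#define EX_KERNEL_END   (EX_KERNEL_START + EX_KERNEL_SIZE) /* stands in for _end */

/* Distance from the lowest module address to the last kernel byte. */
static uint64_t worst_case_reach(uint64_t module_start, uint64_t kernel_end)
{
	return (kernel_end - 1) - module_start;
}

/* Module area start as in the patch: derived from _end. */
static uint64_t module_start_from_end(void)
{
	return EX_KERNEL_END - SZ_2G;
}

/* Module area start as suggested in review: derived from _start. */
static uint64_t module_start_from_start(void)
{
	return EX_KERNEL_START - SZ_2G;
}

/* Sanity check of the two variants against the 2GB window. */
static void check_module_window(void)
{
	/* Derived from _end: every kernel byte stays within 2GB. */
	assert(worst_case_reach(module_start_from_end(), EX_KERNEL_END) < SZ_2G);
	/* Derived from _start: the window is exceeded by the image size. */
	assert(worst_case_reach(module_start_from_start(), EX_KERNEL_END) >= SZ_2G);
}
```

With these example values, the _end-based start leaves the last kernel
byte 2GB - 1 away from the first module address, while the _start-based
start overshoots the window by roughly the size of the kernel image.
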
>>
>> /*
>> * Roughly size the vmemmap space to be large enough to fit enough
>> @@ -57,9 +66,16 @@
>> #define FIXADDR_SIZE PGDIR_SIZE
>> #endif
>> #define FIXADDR_START (FIXADDR_TOP - FIXADDR_SIZE)
>> -
>> #endif
>>
>> +#ifndef __ASSEMBLY__
>> +
>> +/* Page Upper Directory not used in RISC-V */
>> +#include <asm-generic/pgtable-nopud.h>
>> +#include <asm/page.h>
>> +#include <asm/tlbflush.h>
>> +#include <linux/mm_types.h>
>> +
>> #ifdef CONFIG_64BIT
>> #include <asm/pgtable-64.h>
>> #else
>> @@ -467,6 +483,7 @@ static inline void __kernel_map_pages(struct page *page, int numpages, int enabl
>>
>> #define kern_addr_valid(addr) (1) /* FIXME */
>>
>> +extern char _start[];
>> extern void *dtb_early_va;
>> extern uintptr_t dtb_early_pa;
>> void setup_bootmem(void);
>> diff --git a/arch/riscv/kernel/head.S b/arch/riscv/kernel/head.S
>> index 7e849797c9c3..66f40c49bf68 100644
>> --- a/arch/riscv/kernel/head.S
>> +++ b/arch/riscv/kernel/head.S
>> @@ -69,7 +69,8 @@ pe_head_start:
>> #ifdef CONFIG_MMU
>> relocate:
>> /* Relocate return address */
>> - li a1, PAGE_OFFSET
>> + la a1, kernel_virt_addr
>> + REG_L a1, 0(a1)
>> la a2, _start
>> sub a1, a1, a2
>> add ra, ra, a1
>> diff --git a/arch/riscv/kernel/module.c b/arch/riscv/kernel/module.c
>> index 104fba889cf7..75a0b9541266 100644
>> --- a/arch/riscv/kernel/module.c
>> +++ b/arch/riscv/kernel/module.c
>> @@ -408,12 +408,10 @@ int apply_relocate_add(Elf_Shdr *sechdrs, const char *strtab,
>> }
>>
>> #if defined(CONFIG_MMU) && defined(CONFIG_64BIT)
>> -#define VMALLOC_MODULE_START \
>> - max(PFN_ALIGN((unsigned long)&_end - SZ_2G), VMALLOC_START)
>> void *module_alloc(unsigned long size)
>> {
>> return __vmalloc_node_range(size, 1, VMALLOC_MODULE_START,
>> - VMALLOC_END, GFP_KERNEL,
>> + VMALLOC_MODULE_END, GFP_KERNEL,
>> PAGE_KERNEL_EXEC, 0, NUMA_NO_NODE,
>> __builtin_return_address(0));
>> }
>> diff --git a/arch/riscv/kernel/vmlinux.lds.S b/arch/riscv/kernel/vmlinux.lds.S
>> index 3ffbd6cbdb86..c21dc46f41be 100644
>> --- a/arch/riscv/kernel/vmlinux.lds.S
>> +++ b/arch/riscv/kernel/vmlinux.lds.S
>> @@ -4,7 +4,8 @@
>> * Copyright (C) 2017 SiFive
>> */
>>
>> -#define LOAD_OFFSET PAGE_OFFSET
>> +#include <asm/pgtable.h>
>> +#define LOAD_OFFSET KERNEL_LINK_ADDR
>> #include <asm/vmlinux.lds.h>
>> #include <asm/page.h>
>> #include <asm/cache.h>
>> diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c
>> index 8e577f14f120..9d06ff0e015a 100644
>> --- a/arch/riscv/mm/init.c
>> +++ b/arch/riscv/mm/init.c
>> @@ -23,6 +23,9 @@
>>
>> #include "../kernel/head.h"
>>
>> +unsigned long kernel_virt_addr = KERNEL_VIRT_ADDR;
>> +EXPORT_SYMBOL(kernel_virt_addr);
>> +
>> unsigned long empty_zero_page[PAGE_SIZE / sizeof(unsigned long)]
>> __page_aligned_bss;
>> EXPORT_SYMBOL(empty_zero_page);
>> @@ -201,8 +204,12 @@ void __init setup_bootmem(void)
>> #ifdef CONFIG_MMU
>> static struct pt_alloc_ops pt_ops;
>>
>> +/* Offset between linear mapping virtual address and kernel load address */
>> unsigned long va_pa_offset;
>> EXPORT_SYMBOL(va_pa_offset);
>> +/* Offset between kernel mapping virtual address and kernel load address */
>> +unsigned long va_kernel_pa_offset;
>> +EXPORT_SYMBOL(va_kernel_pa_offset);
>> unsigned long pfn_base;
>> EXPORT_SYMBOL(pfn_base);
>>
>> @@ -316,7 +323,7 @@ static phys_addr_t __init alloc_pmd_early(uintptr_t va)
>> {
>> uintptr_t pmd_num;
>>
>> - pmd_num = (va - PAGE_OFFSET) >> PGDIR_SHIFT;
>> + pmd_num = (va - kernel_virt_addr) >> PGDIR_SHIFT;
>> BUG_ON(pmd_num >= NUM_EARLY_PMDS);
>> return (uintptr_t)&early_pmd[pmd_num * PTRS_PER_PMD];
>> }
>> @@ -431,17 +438,34 @@ static uintptr_t __init best_map_size(phys_addr_t base, phys_addr_t size)
>> #error "setup_vm() is called from head.S before relocate so it should not use absolute addressing."
>> #endif
>>
>> +static uintptr_t load_pa, load_sz;
>> +
>> +static void __init create_kernel_page_table(pgd_t *pgdir, uintptr_t map_size)
>> +{
>> + uintptr_t va, end_va;
>> +
>> + end_va = kernel_virt_addr + load_sz;
>> + for (va = kernel_virt_addr; va < end_va; va += map_size)
>> + create_pgd_mapping(pgdir, va,
>> + load_pa + (va - kernel_virt_addr),
>> + map_size, PAGE_KERNEL_EXEC);
>> +}
>> +
>> asmlinkage void __init setup_vm(uintptr_t dtb_pa)
>> {
>> - uintptr_t va, pa, end_va;
>> - uintptr_t load_pa = (uintptr_t)(&_start);
>> - uintptr_t load_sz = (uintptr_t)(&_end) - load_pa;
>> - uintptr_t map_size = best_map_size(load_pa, MAX_EARLY_MAPPING_SIZE);
>> + uintptr_t pa;
>> + uintptr_t map_size;
>> #ifndef __PAGETABLE_PMD_FOLDED
>> pmd_t fix_bmap_spmd, fix_bmap_epmd;
>> #endif
>>
>> + load_pa = (uintptr_t)(&_start);
>> + load_sz = (uintptr_t)(&_end) - load_pa;
>> + map_size = best_map_size(load_pa, MAX_EARLY_MAPPING_SIZE);
>> +
>> va_pa_offset = PAGE_OFFSET - load_pa;
>> + va_kernel_pa_offset = kernel_virt_addr - load_pa;
>> +
>> pfn_base = PFN_DOWN(load_pa);
>>
>> /*
>> @@ -470,26 +494,22 @@ asmlinkage void __init setup_vm(uintptr_t dtb_pa)
>> create_pmd_mapping(fixmap_pmd, FIXADDR_START,
>> (uintptr_t)fixmap_pte, PMD_SIZE, PAGE_TABLE);
>> /* Setup trampoline PGD and PMD */
>> - create_pgd_mapping(trampoline_pg_dir, PAGE_OFFSET,
>> + create_pgd_mapping(trampoline_pg_dir, kernel_virt_addr,
>> (uintptr_t)trampoline_pmd, PGDIR_SIZE, PAGE_TABLE);
>> - create_pmd_mapping(trampoline_pmd, PAGE_OFFSET,
>> + create_pmd_mapping(trampoline_pmd, kernel_virt_addr,
>> load_pa, PMD_SIZE, PAGE_KERNEL_EXEC);
>> #else
>> /* Setup trampoline PGD */
>> - create_pgd_mapping(trampoline_pg_dir, PAGE_OFFSET,
>> + create_pgd_mapping(trampoline_pg_dir, kernel_virt_addr,
>> load_pa, PGDIR_SIZE, PAGE_KERNEL_EXEC);
>> #endif
>>
>> /*
>> - * Setup early PGD covering entire kernel which will allows
>> + * Setup early PGD covering entire kernel which will allow
>> * us to reach paging_init(). We map all memory banks later
>> * in setup_vm_final() below.
>> */
>> - end_va = PAGE_OFFSET + load_sz;
>> - for (va = PAGE_OFFSET; va < end_va; va += map_size)
>> - create_pgd_mapping(early_pg_dir, va,
>> - load_pa + (va - PAGE_OFFSET),
>> - map_size, PAGE_KERNEL_EXEC);
>> + create_kernel_page_table(early_pg_dir, map_size);
>>
>> #ifndef __PAGETABLE_PMD_FOLDED
>> /* Setup early PMD for DTB */
>> @@ -549,6 +569,7 @@ static void __init setup_vm_final(void)
>> uintptr_t va, map_size;
>> phys_addr_t pa, start, end;
>> u64 i;
>> + static struct vm_struct vm_kernel = { 0 };
>>
>> /**
>> * MMU is enabled at this point. But page table setup is not complete yet.
>> @@ -565,7 +586,7 @@ static void __init setup_vm_final(void)
>> __pa_symbol(fixmap_pgd_next),
>> PGDIR_SIZE, PAGE_TABLE);
>>
>> - /* Map all memory banks */
>> + /* Map all memory banks in the linear mapping */
>> for_each_mem_range(i, &start, &end) {
>> if (start >= end)
>> break;
>> @@ -577,10 +598,22 @@ static void __init setup_vm_final(void)
>> for (pa = start; pa < end; pa += map_size) {
>> va = (uintptr_t)__va(pa);
>> create_pgd_mapping(swapper_pg_dir, va, pa,
>> - map_size, PAGE_KERNEL_EXEC);
>> + map_size, PAGE_KERNEL);
>> }
>> }
>>
>> + /* Map the kernel */
>> + create_kernel_page_table(swapper_pg_dir, PMD_SIZE);
>> +
>> + /* Reserve the vmalloc area occupied by the kernel */
>> + vm_kernel.addr = (void *)kernel_virt_addr;
>> + vm_kernel.phys_addr = load_pa;
>> + vm_kernel.size = (load_sz + PMD_SIZE - 1) & ~(PMD_SIZE - 1);
>> + vm_kernel.flags = VM_MAP | VM_NO_GUARD;
>> + vm_kernel.caller = __builtin_return_address(0);
>> +
>> + vm_area_add_early(&vm_kernel);
>> +
>> /* Clear fixmap PTE and PMD mappings */
>> clear_fixmap(FIX_PTE);
>> clear_fixmap(FIX_PMD);
>> diff --git a/arch/riscv/mm/physaddr.c b/arch/riscv/mm/physaddr.c
>> index e8e4dcd39fed..35703d5ef5fd 100644
>> --- a/arch/riscv/mm/physaddr.c
>> +++ b/arch/riscv/mm/physaddr.c
>> @@ -23,7 +23,7 @@ EXPORT_SYMBOL(__virt_to_phys);
>>
>> phys_addr_t __phys_addr_symbol(unsigned long x)
>> {
>> - unsigned long kernel_start = (unsigned long)PAGE_OFFSET;
>> + unsigned long kernel_start = (unsigned long)kernel_virt_addr;
>> unsigned long kernel_end = (unsigned long)_end;
>>
>> /*
>> --
>> 2.20.1
>>
>
> Apart from the minor comment above, this looks good to me.
>
> Reviewed-by: Anup Patel <[email protected]>
Thanks for your time, Anup,
>
> Regards,
> Anup
>
Alex
On 1/5/21 at 7:06 AM, Anup Patel wrote:
> On Tue, Jan 5, 2021 at 1:33 AM Alexandre Ghiti <[email protected]> wrote:
>>
>> With 4-level page table folding at runtime, we don't know at compile time
>> the size of the virtual address space so we must set VA_BITS dynamically
>> so that sparsemem reserves the right amount of memory for struct pages.
>>
>> Signed-off-by: Alexandre Ghiti <[email protected]>
>> ---
>> arch/riscv/Kconfig | 10 ----------
>> arch/riscv/include/asm/pgtable.h | 11 +++++++++--
>> arch/riscv/include/asm/sparsemem.h | 6 +++++-
>> 3 files changed, 14 insertions(+), 13 deletions(-)
>>
>> diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
>> index 44377fd7860e..2979a44103be 100644
>> --- a/arch/riscv/Kconfig
>> +++ b/arch/riscv/Kconfig
>> @@ -122,16 +122,6 @@ config ZONE_DMA32
>> bool
>> default y if 64BIT
>>
>> -config VA_BITS
>> - int
>> - default 32 if 32BIT
>> - default 39 if 64BIT
>> -
>> -config PA_BITS
>> - int
>> - default 34 if 32BIT
>> - default 56 if 64BIT
>> -
>> config PAGE_OFFSET
>> hex
>> default 0xC0000000 if 32BIT && MAXPHYSMEM_2GB
>> diff --git a/arch/riscv/include/asm/pgtable.h b/arch/riscv/include/asm/pgtable.h
>> index 102b728ca146..c7973bfd65bc 100644
>> --- a/arch/riscv/include/asm/pgtable.h
>> +++ b/arch/riscv/include/asm/pgtable.h
>> @@ -43,8 +43,14 @@
>> * struct pages to map half the virtual address space. Then
>> * position vmemmap directly below the VMALLOC region.
>> */
>> +#ifdef CONFIG_64BIT
>> +#define VA_BITS 39
>> +#else
>> +#define VA_BITS 32
>> +#endif
>> +
>> #define VMEMMAP_SHIFT \
>> - (CONFIG_VA_BITS - PAGE_SHIFT - 1 + STRUCT_PAGE_MAX_SHIFT)
>> + (VA_BITS - PAGE_SHIFT - 1 + STRUCT_PAGE_MAX_SHIFT)
>> #define VMEMMAP_SIZE BIT(VMEMMAP_SHIFT)
>> #define VMEMMAP_END (VMALLOC_START - 1)
>> #define VMEMMAP_START (VMALLOC_START - VMEMMAP_SIZE)
>> @@ -83,6 +89,7 @@
>> #endif /* CONFIG_64BIT */
>>
>> #ifdef CONFIG_MMU
>> +
>> /* Number of entries in the page global directory */
>> #define PTRS_PER_PGD (PAGE_SIZE / sizeof(pgd_t))
>> /* Number of entries in the page table */
>> @@ -453,7 +460,7 @@ static inline int ptep_clear_flush_young(struct vm_area_struct *vma,
>> * and give the kernel the other (upper) half.
>> */
>> #ifdef CONFIG_64BIT
>> -#define KERN_VIRT_START (-(BIT(CONFIG_VA_BITS)) + TASK_SIZE)
>> +#define KERN_VIRT_START (-(BIT(VA_BITS)) + TASK_SIZE)
>> #else
>> #define KERN_VIRT_START FIXADDR_START
>> #endif
>> diff --git a/arch/riscv/include/asm/sparsemem.h b/arch/riscv/include/asm/sparsemem.h
>> index 45a7018a8118..63acaecc3374 100644
>> --- a/arch/riscv/include/asm/sparsemem.h
>> +++ b/arch/riscv/include/asm/sparsemem.h
>> @@ -4,7 +4,11 @@
>> #define _ASM_RISCV_SPARSEMEM_H
>>
>> #ifdef CONFIG_SPARSEMEM
>> -#define MAX_PHYSMEM_BITS CONFIG_PA_BITS
>> +#ifdef CONFIG_64BIT
>> +#define MAX_PHYSMEM_BITS 56
>> +#else
>> +#define MAX_PHYSMEM_BITS 34
>> +#endif /* CONFIG_64BIT */
>> #define SECTION_SIZE_BITS 27
>> #endif /* CONFIG_SPARSEMEM */
>>
>> --
>> 2.20.1
>>
>
> Looks good to me.
>
> Reviewed-by: Anup Patel <[email protected]>
Thanks,
>
> Regards,
> Anup
>
Alex
On Wed, Jan 6, 2021 at 12:06 PM Alex Ghiti <[email protected]> wrote:
>
> Hi Anup,
>
> On 1/5/21 6:40 AM, Anup Patel wrote:
> > On Tue, Jan 5, 2021 at 1:29 AM Alexandre Ghiti <[email protected]> wrote:
> >>
> >> This is a preparatory patch for relocatable kernel and sv48 support.
> >>
> >> The kernel used to be linked at PAGE_OFFSET address therefore we could use
> >> the linear mapping for the kernel mapping. But the relocated kernel base
> >> address will be different from PAGE_OFFSET and since in the linear mapping,
> >> two different virtual addresses cannot point to the same physical address,
> >> the kernel mapping needs to lie outside the linear mapping so that we don't
> >> have to copy it at the same physical offset.
> >>
> >> The kernel mapping is moved to the last 2GB of the address space and then
> >> BPF and modules are also pushed to the same range since they have to lie
> >> close to the kernel inside a 2GB window.
> >>
> >> Note then that KASLR implementation will simply have to move the kernel in
> >> this 2GB range and modify BPF/modules regions accordingly.
> >>
> >> In addition, by moving the kernel to the end of the address space, both
> >> sv39 and sv48 kernels will be exactly the same without needing to be
> >> relocated at runtime.
> >
> > Awesome ! This is a good approach with no performance impact.
> >
> >>
> >> Suggested-by: Arnd Bergmann <[email protected]>
> >> Signed-off-by: Alexandre Ghiti <[email protected]>
> >> ---
> >> arch/riscv/boot/loader.lds.S | 3 +-
> >> arch/riscv/include/asm/page.h | 10 ++++-
> >> arch/riscv/include/asm/pgtable.h | 39 +++++++++++++------
> >> arch/riscv/kernel/head.S | 3 +-
> >> arch/riscv/kernel/module.c | 4 +-
> >> arch/riscv/kernel/vmlinux.lds.S | 3 +-
> >> arch/riscv/mm/init.c | 65 ++++++++++++++++++++++++--------
> >> arch/riscv/mm/physaddr.c | 2 +-
> >> 8 files changed, 94 insertions(+), 35 deletions(-)
> >>
> >> diff --git a/arch/riscv/boot/loader.lds.S b/arch/riscv/boot/loader.lds.S
> >> index 47a5003c2e28..62d94696a19c 100644
> >> --- a/arch/riscv/boot/loader.lds.S
> >> +++ b/arch/riscv/boot/loader.lds.S
> >> @@ -1,13 +1,14 @@
> >> /* SPDX-License-Identifier: GPL-2.0 */
> >>
> >> #include <asm/page.h>
> >> +#include <asm/pgtable.h>
> >>
> >> OUTPUT_ARCH(riscv)
> >> ENTRY(_start)
> >>
> >> SECTIONS
> >> {
> >> - . = PAGE_OFFSET;
> >> + . = KERNEL_LINK_ADDR;
> >>
> >> .payload : {
> >> *(.payload)
> >> diff --git a/arch/riscv/include/asm/page.h b/arch/riscv/include/asm/page.h
> >> index 2d50f76efe48..98188e315e8d 100644
> >> --- a/arch/riscv/include/asm/page.h
> >> +++ b/arch/riscv/include/asm/page.h
> >> @@ -90,18 +90,26 @@ typedef struct page *pgtable_t;
> >>
> >> #ifdef CONFIG_MMU
> >> extern unsigned long va_pa_offset;
> >> +extern unsigned long va_kernel_pa_offset;
> >> extern unsigned long pfn_base;
> >> #define ARCH_PFN_OFFSET (pfn_base)
> >> #else
> >> #define va_pa_offset 0
> >> +#define va_kernel_pa_offset 0
> >> #define ARCH_PFN_OFFSET (PAGE_OFFSET >> PAGE_SHIFT)
> >> #endif /* CONFIG_MMU */
> >>
> >> extern unsigned long max_low_pfn;
> >> extern unsigned long min_low_pfn;
> >> +extern unsigned long kernel_virt_addr;
> >>
> >> #define __pa_to_va_nodebug(x) ((void *)((unsigned long) (x) + va_pa_offset))
> >> -#define __va_to_pa_nodebug(x) ((unsigned long)(x) - va_pa_offset)
> >> +#define linear_mapping_va_to_pa(x) ((unsigned long)(x) - va_pa_offset)
> >> +#define kernel_mapping_va_to_pa(x) \
> >> + ((unsigned long)(x) - va_kernel_pa_offset)
> >> +#define __va_to_pa_nodebug(x) \
> >> + (((x) < KERNEL_LINK_ADDR) ? \
> >> + linear_mapping_va_to_pa(x) : kernel_mapping_va_to_pa(x))
> >>
> >> #ifdef CONFIG_DEBUG_VIRTUAL
> >> extern phys_addr_t __virt_to_phys(unsigned long x);
> >> diff --git a/arch/riscv/include/asm/pgtable.h b/arch/riscv/include/asm/pgtable.h
> >> index 183f1f4b2ae6..102b728ca146 100644
> >> --- a/arch/riscv/include/asm/pgtable.h
> >> +++ b/arch/riscv/include/asm/pgtable.h
> >> @@ -11,23 +11,32 @@
> >>
> >> #include <asm/pgtable-bits.h>
> >>
> >> -#ifndef __ASSEMBLY__
> >> -
> >> -/* Page Upper Directory not used in RISC-V */
> >> -#include <asm-generic/pgtable-nopud.h>
> >> -#include <asm/page.h>
> >> -#include <asm/tlbflush.h>
> >> -#include <linux/mm_types.h>
> >> +#ifndef CONFIG_MMU
> >> +#define KERNEL_VIRT_ADDR PAGE_OFFSET
> >> +#define KERNEL_LINK_ADDR PAGE_OFFSET
> >> +#else
> >>
> >> -#ifdef CONFIG_MMU
> >> +#define ADDRESS_SPACE_END (UL(-1))
> >> +/*
> >> + * Leave 2GB for kernel, modules and BPF at the end of the address space
> >> + */
> >> +#define KERNEL_VIRT_ADDR (ADDRESS_SPACE_END - SZ_2G + 1)
> >> +#define KERNEL_LINK_ADDR KERNEL_VIRT_ADDR
> >>
> >> #define VMALLOC_SIZE (KERN_VIRT_SIZE >> 1)
> >> #define VMALLOC_END (PAGE_OFFSET - 1)
> >> #define VMALLOC_START (PAGE_OFFSET - VMALLOC_SIZE)
> >>
> >> +/* KASLR should leave at least 128MB for BPF after the kernel */
> >> #define BPF_JIT_REGION_SIZE (SZ_128M)
> >> -#define BPF_JIT_REGION_START (PAGE_OFFSET - BPF_JIT_REGION_SIZE)
> >> -#define BPF_JIT_REGION_END (VMALLOC_END)
> >> +#define BPF_JIT_REGION_START PFN_ALIGN((unsigned long)&_end)
> >> +#define BPF_JIT_REGION_END (BPF_JIT_REGION_START + BPF_JIT_REGION_SIZE)
> >> +
> >> +/* Modules always live before the kernel */
> >> +#ifdef CONFIG_64BIT
> >> +#define VMALLOC_MODULE_START (PFN_ALIGN((unsigned long)&_end) - SZ_2G)
> >> +#define VMALLOC_MODULE_END (PFN_ALIGN((unsigned long)&_start))
> >> +#endif
> >
> > This does not look right or I am missing something.
> >
> > I think the VMALLOC_MODULE_START should be:
> > #define VMALLOC_MODULE_START (PFN_ALIGN((unsigned long)&_start) - SZ_2G)
> >
>
> I think the patch is correct: worst-case, we want the first address of
> the module area to be able to access the last address of the kernel, so
> we must use _end and not _start to guarantee that the difference between
> those 2 addresses is not greater than 2GB.
Makes sense. Please add a more detailed comment instead of the single-line comment.
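To spell out that worst case with numbers, here is a minimal user-space sketch. The image addresses and the helper names (`module_area_start`, `within_2g`) are made up for illustration, and the 2GB limit is taken as stated in the explanation above; the patch itself simply uses `PFN_ALIGN((unsigned long)&_end) - SZ_2G`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SZ_2G		0x80000000ULL

/* Made-up kernel image: linked at the last 2GB, 16MB long. */
#define KERNEL_START	0xffffffff80000000ULL
#define KERNEL_END	0xffffffff81000000ULL

/* Hypothetical helper: lowest module address when the area is
 * anchored 2GB below the given symbol address. */
static inline uint64_t module_area_start(uint64_t anchor)
{
	assert(anchor >= SZ_2G);
	return anchor - SZ_2G;
}

/* True if the two addresses are no more than 2GB apart. */
static inline bool within_2g(uint64_t from, uint64_t to)
{
	uint64_t dist = to > from ? to - from : from - to;

	return dist <= SZ_2G;
}
```

With these values, anchoring on `KERNEL_END` keeps the first module address within 2GB of both ends of the image, whereas anchoring on `KERNEL_START` would leave the last 16MB of the kernel out of reach of a module placed at the very start of the area.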
>
> >>
> >> /*
> >> * Roughly size the vmemmap space to be large enough to fit enough
> >> @@ -57,9 +66,16 @@
> >> #define FIXADDR_SIZE PGDIR_SIZE
> >> #endif
> >> #define FIXADDR_START (FIXADDR_TOP - FIXADDR_SIZE)
> >> -
> >> #endif
> >>
> >> +#ifndef __ASSEMBLY__
> >> +
> >> +/* Page Upper Directory not used in RISC-V */
> >> +#include <asm-generic/pgtable-nopud.h>
> >> +#include <asm/page.h>
> >> +#include <asm/tlbflush.h>
> >> +#include <linux/mm_types.h>
> >> +
> >> #ifdef CONFIG_64BIT
> >> #include <asm/pgtable-64.h>
> >> #else
> >> @@ -467,6 +483,7 @@ static inline void __kernel_map_pages(struct page *page, int numpages, int enabl
> >>
> >> #define kern_addr_valid(addr) (1) /* FIXME */
> >>
> >> +extern char _start[];
> >> extern void *dtb_early_va;
> >> extern uintptr_t dtb_early_pa;
> >> void setup_bootmem(void);
> >> diff --git a/arch/riscv/kernel/head.S b/arch/riscv/kernel/head.S
> >> index 7e849797c9c3..66f40c49bf68 100644
> >> --- a/arch/riscv/kernel/head.S
> >> +++ b/arch/riscv/kernel/head.S
> >> @@ -69,7 +69,8 @@ pe_head_start:
> >> #ifdef CONFIG_MMU
> >> relocate:
> >> /* Relocate return address */
> >> - li a1, PAGE_OFFSET
> >> + la a1, kernel_virt_addr
> >> + REG_L a1, 0(a1)
> >> la a2, _start
> >> sub a1, a1, a2
> >> add ra, ra, a1
> >> diff --git a/arch/riscv/kernel/module.c b/arch/riscv/kernel/module.c
> >> index 104fba889cf7..75a0b9541266 100644
> >> --- a/arch/riscv/kernel/module.c
> >> +++ b/arch/riscv/kernel/module.c
> >> @@ -408,12 +408,10 @@ int apply_relocate_add(Elf_Shdr *sechdrs, const char *strtab,
> >> }
> >>
> >> #if defined(CONFIG_MMU) && defined(CONFIG_64BIT)
> >> -#define VMALLOC_MODULE_START \
> >> - max(PFN_ALIGN((unsigned long)&_end - SZ_2G), VMALLOC_START)
> >> void *module_alloc(unsigned long size)
> >> {
> >> return __vmalloc_node_range(size, 1, VMALLOC_MODULE_START,
> >> - VMALLOC_END, GFP_KERNEL,
> >> + VMALLOC_MODULE_END, GFP_KERNEL,
> >> PAGE_KERNEL_EXEC, 0, NUMA_NO_NODE,
> >> __builtin_return_address(0));
> >> }
> >> diff --git a/arch/riscv/kernel/vmlinux.lds.S b/arch/riscv/kernel/vmlinux.lds.S
> >> index 3ffbd6cbdb86..c21dc46f41be 100644
> >> --- a/arch/riscv/kernel/vmlinux.lds.S
> >> +++ b/arch/riscv/kernel/vmlinux.lds.S
> >> @@ -4,7 +4,8 @@
> >> * Copyright (C) 2017 SiFive
> >> */
> >>
> >> -#define LOAD_OFFSET PAGE_OFFSET
> >> +#include <asm/pgtable.h>
> >> +#define LOAD_OFFSET KERNEL_LINK_ADDR
> >> #include <asm/vmlinux.lds.h>
> >> #include <asm/page.h>
> >> #include <asm/cache.h>
> >> diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c
> >> index 8e577f14f120..9d06ff0e015a 100644
> >> --- a/arch/riscv/mm/init.c
> >> +++ b/arch/riscv/mm/init.c
> >> @@ -23,6 +23,9 @@
> >>
> >> #include "../kernel/head.h"
> >>
> >> +unsigned long kernel_virt_addr = KERNEL_VIRT_ADDR;
> >> +EXPORT_SYMBOL(kernel_virt_addr);
> >> +
> >> unsigned long empty_zero_page[PAGE_SIZE / sizeof(unsigned long)]
> >> __page_aligned_bss;
> >> EXPORT_SYMBOL(empty_zero_page);
> >> @@ -201,8 +204,12 @@ void __init setup_bootmem(void)
> >> #ifdef CONFIG_MMU
> >> static struct pt_alloc_ops pt_ops;
> >>
> >> +/* Offset between linear mapping virtual address and kernel load address */
> >> unsigned long va_pa_offset;
> >> EXPORT_SYMBOL(va_pa_offset);
> >> +/* Offset between kernel mapping virtual address and kernel load address */
> >> +unsigned long va_kernel_pa_offset;
> >> +EXPORT_SYMBOL(va_kernel_pa_offset);
> >> unsigned long pfn_base;
> >> EXPORT_SYMBOL(pfn_base);
> >>
> >> @@ -316,7 +323,7 @@ static phys_addr_t __init alloc_pmd_early(uintptr_t va)
> >> {
> >> uintptr_t pmd_num;
> >>
> >> - pmd_num = (va - PAGE_OFFSET) >> PGDIR_SHIFT;
> >> + pmd_num = (va - kernel_virt_addr) >> PGDIR_SHIFT;
> >> BUG_ON(pmd_num >= NUM_EARLY_PMDS);
> >> return (uintptr_t)&early_pmd[pmd_num * PTRS_PER_PMD];
> >> }
> >> @@ -431,17 +438,34 @@ static uintptr_t __init best_map_size(phys_addr_t base, phys_addr_t size)
> >> #error "setup_vm() is called from head.S before relocate so it should not use absolute addressing."
> >> #endif
> >>
> >> +static uintptr_t load_pa, load_sz;
> >> +
> >> +static void __init create_kernel_page_table(pgd_t *pgdir, uintptr_t map_size)
> >> +{
> >> + uintptr_t va, end_va;
> >> +
> >> + end_va = kernel_virt_addr + load_sz;
> >> + for (va = kernel_virt_addr; va < end_va; va += map_size)
> >> + create_pgd_mapping(pgdir, va,
> >> + load_pa + (va - kernel_virt_addr),
> >> + map_size, PAGE_KERNEL_EXEC);
> >> +}
> >> +
> >> asmlinkage void __init setup_vm(uintptr_t dtb_pa)
> >> {
> >> - uintptr_t va, pa, end_va;
> >> - uintptr_t load_pa = (uintptr_t)(&_start);
> >> - uintptr_t load_sz = (uintptr_t)(&_end) - load_pa;
> >> - uintptr_t map_size = best_map_size(load_pa, MAX_EARLY_MAPPING_SIZE);
> >> + uintptr_t pa;
> >> + uintptr_t map_size;
> >> #ifndef __PAGETABLE_PMD_FOLDED
> >> pmd_t fix_bmap_spmd, fix_bmap_epmd;
> >> #endif
> >>
> >> + load_pa = (uintptr_t)(&_start);
> >> + load_sz = (uintptr_t)(&_end) - load_pa;
> >> + map_size = best_map_size(load_pa, MAX_EARLY_MAPPING_SIZE);
> >> +
> >> va_pa_offset = PAGE_OFFSET - load_pa;
> >> + va_kernel_pa_offset = kernel_virt_addr - load_pa;
> >> +
> >> pfn_base = PFN_DOWN(load_pa);
> >>
> >> /*
> >> @@ -470,26 +494,22 @@ asmlinkage void __init setup_vm(uintptr_t dtb_pa)
> >> create_pmd_mapping(fixmap_pmd, FIXADDR_START,
> >> (uintptr_t)fixmap_pte, PMD_SIZE, PAGE_TABLE);
> >> /* Setup trampoline PGD and PMD */
> >> - create_pgd_mapping(trampoline_pg_dir, PAGE_OFFSET,
> >> + create_pgd_mapping(trampoline_pg_dir, kernel_virt_addr,
> >> (uintptr_t)trampoline_pmd, PGDIR_SIZE, PAGE_TABLE);
> >> - create_pmd_mapping(trampoline_pmd, PAGE_OFFSET,
> >> + create_pmd_mapping(trampoline_pmd, kernel_virt_addr,
> >> load_pa, PMD_SIZE, PAGE_KERNEL_EXEC);
> >> #else
> >> /* Setup trampoline PGD */
> >> - create_pgd_mapping(trampoline_pg_dir, PAGE_OFFSET,
> >> + create_pgd_mapping(trampoline_pg_dir, kernel_virt_addr,
> >> load_pa, PGDIR_SIZE, PAGE_KERNEL_EXEC);
> >> #endif
> >>
> >> /*
> >> - * Setup early PGD covering entire kernel which will allows
> >> + * Setup early PGD covering entire kernel which will allow
> >> * us to reach paging_init(). We map all memory banks later
> >> * in setup_vm_final() below.
> >> */
> >> - end_va = PAGE_OFFSET + load_sz;
> >> - for (va = PAGE_OFFSET; va < end_va; va += map_size)
> >> - create_pgd_mapping(early_pg_dir, va,
> >> - load_pa + (va - PAGE_OFFSET),
> >> - map_size, PAGE_KERNEL_EXEC);
> >> + create_kernel_page_table(early_pg_dir, map_size);
> >>
> >> #ifndef __PAGETABLE_PMD_FOLDED
> >> /* Setup early PMD for DTB */
> >> @@ -549,6 +569,7 @@ static void __init setup_vm_final(void)
> >> uintptr_t va, map_size;
> >> phys_addr_t pa, start, end;
> >> u64 i;
> >> + static struct vm_struct vm_kernel = { 0 };
> >>
> >> /**
> >> * MMU is enabled at this point. But page table setup is not complete yet.
> >> @@ -565,7 +586,7 @@ static void __init setup_vm_final(void)
> >> __pa_symbol(fixmap_pgd_next),
> >> PGDIR_SIZE, PAGE_TABLE);
> >>
> >> - /* Map all memory banks */
> >> + /* Map all memory banks in the linear mapping */
> >> for_each_mem_range(i, &start, &end) {
> >> if (start >= end)
> >> break;
> >> @@ -577,10 +598,22 @@ static void __init setup_vm_final(void)
> >> for (pa = start; pa < end; pa += map_size) {
> >> va = (uintptr_t)__va(pa);
> >> create_pgd_mapping(swapper_pg_dir, va, pa,
> >> - map_size, PAGE_KERNEL_EXEC);
> >> + map_size, PAGE_KERNEL);
> >> }
> >> }
> >>
> >> + /* Map the kernel */
> >> + create_kernel_page_table(swapper_pg_dir, PMD_SIZE);
> >> +
> >> + /* Reserve the vmalloc area occupied by the kernel */
> >> + vm_kernel.addr = (void *)kernel_virt_addr;
> >> + vm_kernel.phys_addr = load_pa;
> >> + vm_kernel.size = (load_sz + PMD_SIZE - 1) & ~(PMD_SIZE - 1);
> >> + vm_kernel.flags = VM_MAP | VM_NO_GUARD;
> >> + vm_kernel.caller = __builtin_return_address(0);
> >> +
> >> + vm_area_add_early(&vm_kernel);
> >> +
> >> /* Clear fixmap PTE and PMD mappings */
> >> clear_fixmap(FIX_PTE);
> >> clear_fixmap(FIX_PMD);
> >> diff --git a/arch/riscv/mm/physaddr.c b/arch/riscv/mm/physaddr.c
> >> index e8e4dcd39fed..35703d5ef5fd 100644
> >> --- a/arch/riscv/mm/physaddr.c
> >> +++ b/arch/riscv/mm/physaddr.c
> >> @@ -23,7 +23,7 @@ EXPORT_SYMBOL(__virt_to_phys);
> >>
> >> phys_addr_t __phys_addr_symbol(unsigned long x)
> >> {
> >> - unsigned long kernel_start = (unsigned long)PAGE_OFFSET;
> >> + unsigned long kernel_start = (unsigned long)kernel_virt_addr;
> >> unsigned long kernel_end = (unsigned long)_end;
> >>
> >> /*
> >> --
> >> 2.20.1
> >>
> >
> > Apart from the minor comment above, this looks good to me.
> >
> > Reviewed-by: Anup Patel <[email protected]>
>
> Thanks for your time Anup,
>
> >
> > Regards,
> > Anup
> >
>
> Alex
Regards,
Anup
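
P.S. For anyone skimming the quoted `page.h` hunk, the new `__va_to_pa_nodebug()` logic can be sketched in user space as below. This is only an illustration: `PAGE_OFFSET`, `KERNEL_LINK_ADDR` and `LOAD_PA` are example values picked for the sketch, and `va_to_pa()` is a hypothetical stand-in for the macro, not the kernel code.

```c
#include <assert.h>
#include <stdint.h>

/* Example values only; the real ones depend on the config and on
 * where the bootloader placed the kernel. */
#define PAGE_OFFSET		0xffffffe000000000ULL	/* start of linear mapping */
#define KERNEL_LINK_ADDR	0xffffffff80000000ULL	/* last 2GB of address space */
#define LOAD_PA			0x0000000080200000ULL	/* physical load address */

/* The two offsets computed in setup_vm() in the patch. */
static const uint64_t va_pa_offset = PAGE_OFFSET - LOAD_PA;
static const uint64_t va_kernel_pa_offset = KERNEL_LINK_ADDR - LOAD_PA;

/* Hypothetical stand-in for __va_to_pa_nodebug(): pick the offset
 * according to which mapping the virtual address belongs to. */
static inline uint64_t va_to_pa(uint64_t va)
{
	assert(va >= PAGE_OFFSET);

	if (va < KERNEL_LINK_ADDR)
		return va - va_pa_offset;	/* linear mapping */
	return va - va_kernel_pa_offset;	/* kernel mapping */
}
```

Both `va_to_pa(PAGE_OFFSET)` and `va_to_pa(KERNEL_LINK_ADDR)` resolve to `LOAD_PA` here: the same physical pages are reachable through two virtual aliases, which is why the patch needs one offset per mapping.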
On 1/6/21 1:44 AM, Anup Patel wrote:
> On Wed, Jan 6, 2021 at 12:06 PM Alex Ghiti <[email protected]> wrote:
>>
>> Hi Anup,
>>
>> On 1/5/21 6:40 AM, Anup Patel wrote:
>>> On Tue, Jan 5, 2021 at 1:29 AM Alexandre Ghiti <[email protected]> wrote:
>>>>
>>>> This is a preparatory patch for relocatable kernel and sv48 support.
>>>>
>>>> The kernel used to be linked at PAGE_OFFSET address therefore we could use
>>>> the linear mapping for the kernel mapping. But the relocated kernel base
>>>> address will be different from PAGE_OFFSET and since in the linear mapping,
>>>> two different virtual addresses cannot point to the same physical address,
>>>> the kernel mapping needs to lie outside the linear mapping so that we don't
>>>> have to copy it at the same physical offset.
>>>>
>>>> The kernel mapping is moved to the last 2GB of the address space and then
>>>> BPF and modules are also pushed to the same range since they have to lie
>>>> close to the kernel inside a 2GB window.
>>>>
>>>> Note then that KASLR implementation will simply have to move the kernel in
>>>> this 2GB range and modify BPF/modules regions accordingly.
>>>>
>>>> In addition, by moving the kernel to the end of the address space, both
>>>> sv39 and sv48 kernels will be exactly the same without needing to be
>>>> relocated at runtime.
>>>
>>> Awesome ! This is a good approach with no performance impact.
>>>
>>>>
>>>> Suggested-by: Arnd Bergmann <[email protected]>
>>>> Signed-off-by: Alexandre Ghiti <[email protected]>
>>>> ---
>>>> arch/riscv/boot/loader.lds.S | 3 +-
>>>> arch/riscv/include/asm/page.h | 10 ++++-
>>>> arch/riscv/include/asm/pgtable.h | 39 +++++++++++++------
>>>> arch/riscv/kernel/head.S | 3 +-
>>>> arch/riscv/kernel/module.c | 4 +-
>>>> arch/riscv/kernel/vmlinux.lds.S | 3 +-
>>>> arch/riscv/mm/init.c | 65 ++++++++++++++++++++++++--------
>>>> arch/riscv/mm/physaddr.c | 2 +-
>>>> 8 files changed, 94 insertions(+), 35 deletions(-)
>>>>
>>>> diff --git a/arch/riscv/boot/loader.lds.S b/arch/riscv/boot/loader.lds.S
>>>> index 47a5003c2e28..62d94696a19c 100644
>>>> --- a/arch/riscv/boot/loader.lds.S
>>>> +++ b/arch/riscv/boot/loader.lds.S
>>>> @@ -1,13 +1,14 @@
>>>> /* SPDX-License-Identifier: GPL-2.0 */
>>>>
>>>> #include <asm/page.h>
>>>> +#include <asm/pgtable.h>
>>>>
>>>> OUTPUT_ARCH(riscv)
>>>> ENTRY(_start)
>>>>
>>>> SECTIONS
>>>> {
>>>> - . = PAGE_OFFSET;
>>>> + . = KERNEL_LINK_ADDR;
>>>>
>>>> .payload : {
>>>> *(.payload)
>>>> diff --git a/arch/riscv/include/asm/page.h b/arch/riscv/include/asm/page.h
>>>> index 2d50f76efe48..98188e315e8d 100644
>>>> --- a/arch/riscv/include/asm/page.h
>>>> +++ b/arch/riscv/include/asm/page.h
>>>> @@ -90,18 +90,26 @@ typedef struct page *pgtable_t;
>>>>
>>>> #ifdef CONFIG_MMU
>>>> extern unsigned long va_pa_offset;
>>>> +extern unsigned long va_kernel_pa_offset;
>>>> extern unsigned long pfn_base;
>>>> #define ARCH_PFN_OFFSET (pfn_base)
>>>> #else
>>>> #define va_pa_offset 0
>>>> +#define va_kernel_pa_offset 0
>>>> #define ARCH_PFN_OFFSET (PAGE_OFFSET >> PAGE_SHIFT)
>>>> #endif /* CONFIG_MMU */
>>>>
>>>> extern unsigned long max_low_pfn;
>>>> extern unsigned long min_low_pfn;
>>>> +extern unsigned long kernel_virt_addr;
>>>>
>>>> #define __pa_to_va_nodebug(x) ((void *)((unsigned long) (x) + va_pa_offset))
>>>> -#define __va_to_pa_nodebug(x) ((unsigned long)(x) - va_pa_offset)
>>>> +#define linear_mapping_va_to_pa(x) ((unsigned long)(x) - va_pa_offset)
>>>> +#define kernel_mapping_va_to_pa(x) \
>>>> + ((unsigned long)(x) - va_kernel_pa_offset)
>>>> +#define __va_to_pa_nodebug(x) \
>>>> + (((x) < KERNEL_LINK_ADDR) ? \
>>>> + linear_mapping_va_to_pa(x) : kernel_mapping_va_to_pa(x))
>>>>
>>>> #ifdef CONFIG_DEBUG_VIRTUAL
>>>> extern phys_addr_t __virt_to_phys(unsigned long x);
>>>> diff --git a/arch/riscv/include/asm/pgtable.h b/arch/riscv/include/asm/pgtable.h
>>>> index 183f1f4b2ae6..102b728ca146 100644
>>>> --- a/arch/riscv/include/asm/pgtable.h
>>>> +++ b/arch/riscv/include/asm/pgtable.h
>>>> @@ -11,23 +11,32 @@
>>>>
>>>> #include <asm/pgtable-bits.h>
>>>>
>>>> -#ifndef __ASSEMBLY__
>>>> -
>>>> -/* Page Upper Directory not used in RISC-V */
>>>> -#include <asm-generic/pgtable-nopud.h>
>>>> -#include <asm/page.h>
>>>> -#include <asm/tlbflush.h>
>>>> -#include <linux/mm_types.h>
>>>> +#ifndef CONFIG_MMU
>>>> +#define KERNEL_VIRT_ADDR PAGE_OFFSET
>>>> +#define KERNEL_LINK_ADDR PAGE_OFFSET
>>>> +#else
>>>>
>>>> -#ifdef CONFIG_MMU
>>>> +#define ADDRESS_SPACE_END (UL(-1))
>>>> +/*
>>>> + * Leave 2GB for kernel, modules and BPF at the end of the address space
>>>> + */
>>>> +#define KERNEL_VIRT_ADDR (ADDRESS_SPACE_END - SZ_2G + 1)
>>>> +#define KERNEL_LINK_ADDR KERNEL_VIRT_ADDR
>>>>
>>>> #define VMALLOC_SIZE (KERN_VIRT_SIZE >> 1)
>>>> #define VMALLOC_END (PAGE_OFFSET - 1)
>>>> #define VMALLOC_START (PAGE_OFFSET - VMALLOC_SIZE)
>>>>
>>>> +/* KASLR should leave at least 128MB for BPF after the kernel */
>>>> #define BPF_JIT_REGION_SIZE (SZ_128M)
>>>> -#define BPF_JIT_REGION_START (PAGE_OFFSET - BPF_JIT_REGION_SIZE)
>>>> -#define BPF_JIT_REGION_END (VMALLOC_END)
>>>> +#define BPF_JIT_REGION_START PFN_ALIGN((unsigned long)&_end)
>>>> +#define BPF_JIT_REGION_END (BPF_JIT_REGION_START + BPF_JIT_REGION_SIZE)
>>>> +
>>>> +/* Modules always live before the kernel */
>>>> +#ifdef CONFIG_64BIT
>>>> +#define VMALLOC_MODULE_START (PFN_ALIGN((unsigned long)&_end) - SZ_2G)
>>>> +#define VMALLOC_MODULE_END (PFN_ALIGN((unsigned long)&_start))
>>>> +#endif
>>>
>>> This does not look right or I am missing something.
>>>
>>> I think the VMALLOC_MODULE_START should be:
>>> #define VMALLOC_MODULE_START (PFN_ALIGN((unsigned long)&_start) - SZ_2G)
>>>
>>
>> I think the patch is correct: worst-case, we want the first address of
>> the module area to be able to access the last address of the kernel, so
>> we must use _end and not _start to guarantee that the difference between
>> those 2 addresses is not greater than 2GB.
>
> Makes sense. Please add a more detailed comment instead of the single-line comment.
>
OK, I'll do that.
I realize the naming is confusing: this area does not belong to vmalloc
anymore, so I'll change that too.
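One possible shape for that comment and rename, as a rough sketch only: `MODULES_VADDR` and `MODULES_END` are hypothetical names, and `IMAGE_START`/`IMAGE_END` are made-up stand-ins for `_start`/`_end` so that the fragment compiles outside the kernel.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SIZE	4096ULL
#define SZ_2G		0x80000000ULL
#define PFN_ALIGN(x)	(((x) + PAGE_SIZE - 1) & ~(PAGE_SIZE - 1))

/* Stand-ins for the linker symbols _start and _end (made-up values). */
#define IMAGE_START	0xffffffff80000000ULL
#define IMAGE_END	0xffffffff80a01234ULL

/*
 * Modules live right below the kernel image and no longer belong to
 * the vmalloc area. The region is anchored on the *end* of the image:
 * in the worst case, code at the first module address must reach the
 * last kernel address, so the distance between those two addresses
 * must not exceed 2GB.
 */
#define MODULES_VADDR	(PFN_ALIGN(IMAGE_END) - SZ_2G)
#define MODULES_END	(PFN_ALIGN(IMAGE_START))

/* Compile-time check of the property described in the comment. */
static_assert(IMAGE_END - MODULES_VADDR <= SZ_2G,
	      "first module address must reach the end of the kernel");
```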
Thanks again,
Alex
Hi Palmer,
On 1/4/21 2:58 PM, Alexandre Ghiti wrote:
> This patchset, contrary to the previous versions, allows to have a single
> kernel for sv39 and sv48 without being relocatable.
>
> The idea comes from Arnd Bergmann who suggested to do the same as x86,
> that is mapping the kernel to the end of the address space, which allows
> the kernel to be linked at the same address for both sv39 and sv48 and
> then does not require to be relocated at runtime.
>
> This is an RFC because I need to at least rebase a few commits and add
> documentation. The most interesting patches where I expect feedbacks are
> 1/12, 2/12 and 8/12. Note that moving the kernel out of the linear
> mapping and sv48 support can be separate patchsets, I share them together
> today to show that it works (this patchset is rebased on top of v5.10).
>
> If we agree about the overall idea, I'll rebase my relocatable patchset
> on top of that and then KASLR implementation from Zong will be greatly
> simplified since moving the kernel out of the linear mapping will avoid
> to copy the kernel physically.
>
> This implements sv48 support at runtime. The kernel will try to
> boot with 4-level page table and will fallback to 3-level if the HW does not
> support it. Folding the 4th level into a 3-level page table has almost no
> cost at runtime.
>
> Finally, the user can now ask for sv39 explicitly by using the device-tree
> which will reduce memory footprint and reduce the number of memory accesses
> in case of TLB miss.
>
> Alexandre Ghiti (12):
> riscv: Move kernel mapping outside of linear mapping
> riscv: Protect the kernel linear mapping
> riscv: Get rid of compile time logic with MAX_EARLY_MAPPING_SIZE
> riscv: Allow to dynamically define VA_BITS
> riscv: Simplify MAXPHYSMEM config
> riscv: Prepare ptdump for vm layout dynamic addresses
> asm-generic: Prepare for riscv use of pud_alloc_one and pud_free
> riscv: Implement sv48 support
> riscv: Allow user to downgrade to sv39 when hw supports sv48
> riscv: Use pgtable_l4_enabled to output mmu type in cpuinfo
> riscv: Explicit comment about user virtual address space size
> riscv: Improve virtual kernel memory layout dump
>
> arch/riscv/Kconfig | 34 +--
> arch/riscv/boot/loader.lds.S | 3 +-
> arch/riscv/include/asm/csr.h | 3 +-
> arch/riscv/include/asm/fixmap.h | 3 +
> arch/riscv/include/asm/page.h | 33 ++-
> arch/riscv/include/asm/pgalloc.h | 40 +++
> arch/riscv/include/asm/pgtable-64.h | 104 ++++++-
> arch/riscv/include/asm/pgtable.h | 68 +++--
> arch/riscv/include/asm/sparsemem.h | 6 +-
> arch/riscv/kernel/cpu.c | 23 +-
> arch/riscv/kernel/head.S | 6 +-
> arch/riscv/kernel/module.c | 4 +-
> arch/riscv/kernel/vmlinux.lds.S | 3 +-
> arch/riscv/mm/context.c | 2 +-
> arch/riscv/mm/init.c | 376 ++++++++++++++++++++----
> arch/riscv/mm/physaddr.c | 2 +-
> arch/riscv/mm/ptdump.c | 56 +++-
> drivers/firmware/efi/libstub/efi-stub.c | 2 +-
> include/asm-generic/pgalloc.h | 24 +-
> include/linux/sizes.h | 3 +-
> 20 files changed, 648 insertions(+), 147 deletions(-)
>
Any thoughts on the idea? Is it going in the right direction? I have
fixed quite a few things since I posted this, so don't bother giving this
patchset a full review.
Thanks,
Alex
On Sat, 30 Jan 2021 01:33:20 PST (-0800), [email protected] wrote:
> Hi Palmer,
>
> Any thoughts on the idea? Is it going in the right direction? I have
> fixed quite a few things since I posted this, so don't bother giving this
> patchset a full review.
My only real issue was the relocation stuff, which appears to be fixed. I
haven't had the time to look at the patches, but it wouldn't hurt to send
another revision. The best bet might be to just wait until 5.11 and send on
top of that: it's too late for this one anyway, and that'll be a stable base
to test from.