2021-12-06 10:47:38

by Alexandre Ghiti

Subject: [PATCH v3 00/13] Introduce sv48 support without relocatable kernel

* Please note the notable changes in memory layout and KASAN population *

This patchset makes it possible to use a single kernel image for both
sv39 and sv48 without making the kernel relocatable.

The idea comes from Arnd Bergmann, who suggested doing the same as x86,
that is, mapping the kernel at the end of the address space: this allows
the kernel to be linked at the same address for both sv39 and sv48 and
therefore removes the need for runtime relocation.

This implements sv48 support at runtime: the kernel tries to boot with a
4-level page table and falls back to a 3-level one if the hardware does
not support it. Folding the 4th level into a 3-level page table has
almost no cost at runtime.
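
The sv39/sv48 naming follows directly from the page-table arithmetic; as
an illustrative sketch (not part of the patch), with 4 KiB pages each
level resolves 9 bits of virtual address on top of the 12-bit page
offset:

```c
#include <assert.h>

/* Each 4 KiB page table holds 512 eight-byte entries, so every level
 * translates 9 more bits of virtual address beyond the 12-bit offset. */
static int va_bits(int levels)
{
	return 12 + levels * 9;
}
```

Three levels give sv39, four give sv48, and the upcoming five-level mode
gives sv57.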

Note that the KASAN region had to be moved to the end of the address
space since its location must be known at compile time and must then be
valid for both sv39 and sv48 (and the upcoming sv57).

Tested on:
- qemu rv64 sv39: OK
- qemu rv64 sv48: OK
- qemu rv64 sv39 + kasan: OK
- qemu rv64 sv48 + kasan: OK
- qemu rv32: OK

Changes in v3:
- Fix SZ_1T, thanks to Atish
- Fix warning in create_pud_mapping, thanks to Atish
- Fix k210 nommu build, thanks to Atish
- Fix wrong rebase as noted by Samuel
- * Downgrade to sv39 is only possible if !KASAN (see commit changelog) *
- * Move KASAN next to the kernel: virtual layouts changed and kasan population *

Changes in v2:
- Rebase onto for-next
- Fix KASAN
- Fix stack canary
- Get completely rid of MAXPHYSMEM configs
- Add documentation

Alexandre Ghiti (13):
riscv: Move KASAN mapping next to the kernel mapping
riscv: Split early kasan mapping to prepare sv48 introduction
riscv: Introduce functions to switch pt_ops
riscv: Allow to dynamically define VA_BITS
riscv: Get rid of MAXPHYSMEM configs
asm-generic: Prepare for riscv use of pud_alloc_one and pud_free
riscv: Implement sv48 support
riscv: Use pgtable_l4_enabled to output mmu_type in cpuinfo
riscv: Explicit comment about user virtual address space size
riscv: Improve virtual kernel memory layout dump
Documentation: riscv: Add sv48 description to VM layout
riscv: Initialize thread pointer before calling C functions
riscv: Allow user to downgrade to sv39 when hw supports sv48 if !KASAN

Documentation/riscv/vm-layout.rst | 48 ++-
arch/riscv/Kconfig | 37 +-
arch/riscv/configs/nommu_k210_defconfig | 1 -
.../riscv/configs/nommu_k210_sdcard_defconfig | 1 -
arch/riscv/configs/nommu_virt_defconfig | 1 -
arch/riscv/include/asm/csr.h | 3 +-
arch/riscv/include/asm/fixmap.h | 1 +
arch/riscv/include/asm/kasan.h | 11 +-
arch/riscv/include/asm/page.h | 20 +-
arch/riscv/include/asm/pgalloc.h | 40 ++
arch/riscv/include/asm/pgtable-64.h | 108 ++++-
arch/riscv/include/asm/pgtable.h | 47 +-
arch/riscv/include/asm/sparsemem.h | 6 +-
arch/riscv/kernel/cpu.c | 23 +-
arch/riscv/kernel/head.S | 4 +-
arch/riscv/mm/context.c | 4 +-
arch/riscv/mm/init.c | 408 ++++++++++++++----
arch/riscv/mm/kasan_init.c | 250 ++++++++---
drivers/firmware/efi/libstub/efi-stub.c | 2 +
drivers/pci/controller/pci-xgene.c | 2 +-
include/asm-generic/pgalloc.h | 24 +-
include/linux/sizes.h | 1 +
22 files changed, 833 insertions(+), 209 deletions(-)

--
2.32.0



2021-12-06 10:48:21

by Alexandre Ghiti

Subject: [PATCH v3 01/13] riscv: Move KASAN mapping next to the kernel mapping

Now that KASAN_SHADOW_OFFSET is defined at compile time as a Kconfig
value, it must remain constant whatever the size of the virtual address
space, which is only possible by pushing this region to the end of the
address space, next to the kernel mapping.
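
The new constants can be sanity-checked in a standalone sketch using the
values from this patch (`kasan_mem_to_shadow` mirrors the generic KASAN
translation): the new KASAN_SHADOW_OFFSET is exactly KASAN_SHADOW_END
minus 1/8th of the full 64-bit space, which is what the BUILD_BUG_ON in
kasan_early_init enforces, and the shadow of the new sv39 PAGE_OFFSET
lands inside the relocated kasan region:

```c
#include <assert.h>
#include <stdint.h>

#define KASAN_SHADOW_SCALE_SHIFT 3
#define KASAN_SHADOW_OFFSET 0xdfffffff00000000ULL /* new Kconfig default */
#define KASAN_SHADOW_END    0xffffffff00000000ULL /* MODULES_LOWEST_VADDR */

/* generic KASAN shadow translation: one shadow byte per 8 bytes */
static uint64_t kasan_mem_to_shadow(uint64_t addr)
{
	return (addr >> KASAN_SHADOW_SCALE_SHIFT) + KASAN_SHADOW_OFFSET;
}
```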

Signed-off-by: Alexandre Ghiti <[email protected]>
---
Documentation/riscv/vm-layout.rst | 12 ++++++------
arch/riscv/Kconfig | 4 ++--
arch/riscv/include/asm/kasan.h | 4 ++--
arch/riscv/include/asm/page.h | 6 +++++-
arch/riscv/include/asm/pgtable.h | 6 ++++--
arch/riscv/mm/init.c | 25 +++++++++++++------------
6 files changed, 32 insertions(+), 25 deletions(-)

diff --git a/Documentation/riscv/vm-layout.rst b/Documentation/riscv/vm-layout.rst
index b7f98930d38d..1bd687b97104 100644
--- a/Documentation/riscv/vm-layout.rst
+++ b/Documentation/riscv/vm-layout.rst
@@ -47,12 +47,12 @@ RISC-V Linux Kernel SV39
| Kernel-space virtual memory, shared between all processes:
____________________________________________________________|___________________________________________________________
| | | |
- ffffffc000000000 | -256 GB | ffffffc7ffffffff | 32 GB | kasan
- ffffffcefee00000 | -196 GB | ffffffcefeffffff | 2 MB | fixmap
- ffffffceff000000 | -196 GB | ffffffceffffffff | 16 MB | PCI io
- ffffffcf00000000 | -196 GB | ffffffcfffffffff | 4 GB | vmemmap
- ffffffd000000000 | -192 GB | ffffffdfffffffff | 64 GB | vmalloc/ioremap space
- ffffffe000000000 | -128 GB | ffffffff7fffffff | 124 GB | direct mapping of all physical memory
+ ffffffc6fee00000 | -228 GB | ffffffc6feffffff | 2 MB | fixmap
+ ffffffc6ff000000 | -228 GB | ffffffc6ffffffff | 16 MB | PCI io
+ ffffffc700000000 | -228 GB | ffffffc7ffffffff | 4 GB | vmemmap
+ ffffffc800000000 | -224 GB | ffffffd7ffffffff | 64 GB | vmalloc/ioremap space
+ ffffffd800000000 | -160 GB | fffffff6ffffffff | 124 GB | direct mapping of all physical memory
+ fffffff700000000 | -36 GB | fffffffeffffffff | 32 GB | kasan
__________________|____________|__________________|_________|____________________________________________________________
|
|
diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
index 6d5b63bd4bd9..6cd98ade5ebc 100644
--- a/arch/riscv/Kconfig
+++ b/arch/riscv/Kconfig
@@ -161,12 +161,12 @@ config PAGE_OFFSET
default 0xC0000000 if 32BIT && MAXPHYSMEM_1GB
default 0x80000000 if 64BIT && !MMU
default 0xffffffff80000000 if 64BIT && MAXPHYSMEM_2GB
- default 0xffffffe000000000 if 64BIT && MAXPHYSMEM_128GB
+ default 0xffffffd800000000 if 64BIT && MAXPHYSMEM_128GB

config KASAN_SHADOW_OFFSET
hex
depends on KASAN_GENERIC
- default 0xdfffffc800000000 if 64BIT
+ default 0xdfffffff00000000 if 64BIT
default 0xffffffff if 32BIT

config ARCH_FLATMEM_ENABLE
diff --git a/arch/riscv/include/asm/kasan.h b/arch/riscv/include/asm/kasan.h
index b00f503ec124..257a2495145a 100644
--- a/arch/riscv/include/asm/kasan.h
+++ b/arch/riscv/include/asm/kasan.h
@@ -28,8 +28,8 @@
#define KASAN_SHADOW_SCALE_SHIFT 3

#define KASAN_SHADOW_SIZE (UL(1) << ((CONFIG_VA_BITS - 1) - KASAN_SHADOW_SCALE_SHIFT))
-#define KASAN_SHADOW_START KERN_VIRT_START
-#define KASAN_SHADOW_END (KASAN_SHADOW_START + KASAN_SHADOW_SIZE)
+#define KASAN_SHADOW_START (KASAN_SHADOW_END - KASAN_SHADOW_SIZE)
+#define KASAN_SHADOW_END MODULES_LOWEST_VADDR
#define KASAN_SHADOW_OFFSET _AC(CONFIG_KASAN_SHADOW_OFFSET, UL)

void kasan_init(void);
diff --git a/arch/riscv/include/asm/page.h b/arch/riscv/include/asm/page.h
index 109c97e991a6..e03559f9b35e 100644
--- a/arch/riscv/include/asm/page.h
+++ b/arch/riscv/include/asm/page.h
@@ -33,7 +33,11 @@
*/
#define PAGE_OFFSET _AC(CONFIG_PAGE_OFFSET, UL)

-#define KERN_VIRT_SIZE (-PAGE_OFFSET)
+/*
+ * Half of the kernel address space (half of the entries of the page global
+ * directory) is for the direct mapping.
+ */
+#define KERN_VIRT_SIZE ((PTRS_PER_PGD / 2 * PGDIR_SIZE) / 2)

#ifndef __ASSEMBLY__

diff --git a/arch/riscv/include/asm/pgtable.h b/arch/riscv/include/asm/pgtable.h
index 39b550310ec6..d34f3a7a9701 100644
--- a/arch/riscv/include/asm/pgtable.h
+++ b/arch/riscv/include/asm/pgtable.h
@@ -39,8 +39,10 @@

/* Modules always live before the kernel */
#ifdef CONFIG_64BIT
-#define MODULES_VADDR (PFN_ALIGN((unsigned long)&_end) - SZ_2G)
-#define MODULES_END (PFN_ALIGN((unsigned long)&_start))
+/* This is used to define the end of the KASAN shadow region */
+#define MODULES_LOWEST_VADDR (KERNEL_LINK_ADDR - SZ_2G)
+#define MODULES_VADDR (PFN_ALIGN((unsigned long)&_end) - SZ_2G)
+#define MODULES_END (PFN_ALIGN((unsigned long)&_start))
#endif

/*
diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c
index c0cddf0fc22d..4224e9d0ecf5 100644
--- a/arch/riscv/mm/init.c
+++ b/arch/riscv/mm/init.c
@@ -103,6 +103,9 @@ static void __init print_vm_layout(void)
print_mlm("lowmem", (unsigned long)PAGE_OFFSET,
(unsigned long)high_memory);
#ifdef CONFIG_64BIT
+#ifdef CONFIG_KASAN
+ print_mlm("kasan", KASAN_SHADOW_START, KASAN_SHADOW_END);
+#endif
print_mlm("kernel", (unsigned long)KERNEL_LINK_ADDR,
(unsigned long)ADDRESS_SPACE_END);
#endif
@@ -130,18 +133,8 @@ void __init mem_init(void)
print_vm_layout();
}

-/*
- * The default maximal physical memory size is -PAGE_OFFSET for 32-bit kernel,
- * whereas for 64-bit kernel, the end of the virtual address space is occupied
- * by the modules/BPF/kernel mappings which reduces the available size of the
- * linear mapping.
- * Limit the memory size via mem.
- */
-#ifdef CONFIG_64BIT
-static phys_addr_t memory_limit = -PAGE_OFFSET - SZ_4G;
-#else
-static phys_addr_t memory_limit = -PAGE_OFFSET;
-#endif
+/* Limit the memory size via mem. */
+static phys_addr_t memory_limit;

static int __init early_mem(char *p)
{
@@ -613,6 +606,14 @@ asmlinkage void __init setup_vm(uintptr_t dtb_pa)

riscv_pfn_base = PFN_DOWN(kernel_map.phys_addr);

+ /*
+ * The default maximal physical memory size is KERN_VIRT_SIZE for 32-bit
+ * kernel, whereas for 64-bit kernel, the end of the virtual address
+ * space is occupied by the modules/BPF/kernel mappings which reduces
+ * the available size of the linear mapping.
+ */
+ memory_limit = KERN_VIRT_SIZE - (IS_ENABLED(CONFIG_64BIT) ? SZ_4G : 0);
+
/* Sanity check alignment and size */
BUG_ON((PAGE_OFFSET % PGDIR_SIZE) != 0);
BUG_ON((kernel_map.phys_addr % PMD_SIZE) != 0);
--
2.32.0


2021-12-06 10:49:24

by Alexandre Ghiti

Subject: [PATCH v3 02/13] riscv: Split early kasan mapping to prepare sv48 introduction

Now that the KASAN shadow region is next to the kernel, for sv48 this
region won't be aligned on PGDIR_SIZE, so populating it requires walking
down to lower levels of the page table. Instead of reimplementing the
page table walk for the early population, take advantage of the existing
functions used for the final population.

Note that the KASAN swapper initialization must also be split: memblock
is not initialized at this point, and since the last PGD is shared with
the kernel, we would need to allocate a PUD, so postpone the final KASAN
population until after the kernel population is done.
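
The alignment problem described above can be checked with plain
arithmetic; the sketch below is illustrative, with the helpers being
hypothetical reductions of the real macros:

```c
#include <assert.h>
#include <stdint.h>

#define KASAN_SHADOW_END 0xffffffff00000000ULL /* MODULES_LOWEST_VADDR */

/* shadow covers 1/8th of half the address space: 1 << (VA_BITS - 1 - 3) */
static uint64_t kasan_shadow_start(int va_bits)
{
	return KASAN_SHADOW_END - (1ULL << (va_bits - 1 - 3));
}

/* one top-level (PGD) entry maps 1 << (VA_BITS - 9) bytes */
static uint64_t pgdir_size(int va_bits)
{
	return 1ULL << (va_bits - 9);
}
```

For sv39 the region start is PGDIR-aligned, so the early code could get
away with PGD entries only; for sv48 it is not, hence the need to reuse
the page-table walk of the final population.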

Signed-off-by: Alexandre Ghiti <[email protected]>
---
arch/riscv/include/asm/kasan.h | 1 +
arch/riscv/mm/init.c | 4 ++
arch/riscv/mm/kasan_init.c | 113 ++++++++++++++++++---------------
3 files changed, 67 insertions(+), 51 deletions(-)

diff --git a/arch/riscv/include/asm/kasan.h b/arch/riscv/include/asm/kasan.h
index 257a2495145a..2788e2c46609 100644
--- a/arch/riscv/include/asm/kasan.h
+++ b/arch/riscv/include/asm/kasan.h
@@ -34,6 +34,7 @@

void kasan_init(void);
asmlinkage void kasan_early_init(void);
+void kasan_swapper_init(void);

#endif
#endif
diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c
index 4224e9d0ecf5..5010eba52738 100644
--- a/arch/riscv/mm/init.c
+++ b/arch/riscv/mm/init.c
@@ -742,6 +742,10 @@ static void __init setup_vm_final(void)
create_kernel_page_table(swapper_pg_dir, false);
#endif

+#ifdef CONFIG_KASAN
+ kasan_swapper_init();
+#endif
+
/* Clear fixmap PTE and PMD mappings */
clear_fixmap(FIX_PTE);
clear_fixmap(FIX_PMD);
diff --git a/arch/riscv/mm/kasan_init.c b/arch/riscv/mm/kasan_init.c
index 54294f83513d..1434a0225140 100644
--- a/arch/riscv/mm/kasan_init.c
+++ b/arch/riscv/mm/kasan_init.c
@@ -12,44 +12,6 @@
#include <asm/pgalloc.h>

extern pgd_t early_pg_dir[PTRS_PER_PGD];
-asmlinkage void __init kasan_early_init(void)
-{
- uintptr_t i;
- pgd_t *pgd = early_pg_dir + pgd_index(KASAN_SHADOW_START);
-
- BUILD_BUG_ON(KASAN_SHADOW_OFFSET !=
- KASAN_SHADOW_END - (1UL << (64 - KASAN_SHADOW_SCALE_SHIFT)));
-
- for (i = 0; i < PTRS_PER_PTE; ++i)
- set_pte(kasan_early_shadow_pte + i,
- mk_pte(virt_to_page(kasan_early_shadow_page),
- PAGE_KERNEL));
-
- for (i = 0; i < PTRS_PER_PMD; ++i)
- set_pmd(kasan_early_shadow_pmd + i,
- pfn_pmd(PFN_DOWN
- (__pa((uintptr_t) kasan_early_shadow_pte)),
- __pgprot(_PAGE_TABLE)));
-
- for (i = KASAN_SHADOW_START; i < KASAN_SHADOW_END;
- i += PGDIR_SIZE, ++pgd)
- set_pgd(pgd,
- pfn_pgd(PFN_DOWN
- (__pa(((uintptr_t) kasan_early_shadow_pmd))),
- __pgprot(_PAGE_TABLE)));
-
- /* init for swapper_pg_dir */
- pgd = pgd_offset_k(KASAN_SHADOW_START);
-
- for (i = KASAN_SHADOW_START; i < KASAN_SHADOW_END;
- i += PGDIR_SIZE, ++pgd)
- set_pgd(pgd,
- pfn_pgd(PFN_DOWN
- (__pa(((uintptr_t) kasan_early_shadow_pmd))),
- __pgprot(_PAGE_TABLE)));
-
- local_flush_tlb_all();
-}

static void __init kasan_populate_pte(pmd_t *pmd, unsigned long vaddr, unsigned long end)
{
@@ -108,26 +70,35 @@ static void __init kasan_populate_pmd(pgd_t *pgd, unsigned long vaddr, unsigned
set_pgd(pgd, pfn_pgd(PFN_DOWN(__pa(base_pmd)), PAGE_TABLE));
}

-static void __init kasan_populate_pgd(unsigned long vaddr, unsigned long end)
+static void __init kasan_populate_pgd(pgd_t *pgdp,
+ unsigned long vaddr, unsigned long end,
+ bool early)
{
phys_addr_t phys_addr;
- pgd_t *pgdp = pgd_offset_k(vaddr);
unsigned long next;

do {
next = pgd_addr_end(vaddr, end);

- /*
- * pgdp can't be none since kasan_early_init initialized all KASAN
- * shadow region with kasan_early_shadow_pmd: if this is stillthe case,
- * that means we can try to allocate a hugepage as a replacement.
- */
- if (pgd_page_vaddr(*pgdp) == (unsigned long)lm_alias(kasan_early_shadow_pmd) &&
- IS_ALIGNED(vaddr, PGDIR_SIZE) && (next - vaddr) >= PGDIR_SIZE) {
- phys_addr = memblock_phys_alloc(PGDIR_SIZE, PGDIR_SIZE);
- if (phys_addr) {
- set_pgd(pgdp, pfn_pgd(PFN_DOWN(phys_addr), PAGE_KERNEL));
+ if (IS_ALIGNED(vaddr, PGDIR_SIZE) && (next - vaddr) >= PGDIR_SIZE) {
+ if (early) {
+ phys_addr = __pa((uintptr_t)kasan_early_shadow_pgd_next);
+ set_pgd(pgdp, pfn_pgd(PFN_DOWN(phys_addr), PAGE_TABLE));
continue;
+ } else if (pgd_page_vaddr(*pgdp) ==
+ (unsigned long)lm_alias(kasan_early_shadow_pgd_next)) {
+ /*
+ * pgdp can't be none since kasan_early_init
+ * initialized all KASAN shadow region with
+ * kasan_early_shadow_pud: if this is still the
+ * case, that means we can try to allocate a
+ * hugepage as a replacement.
+ */
+ phys_addr = memblock_phys_alloc(PGDIR_SIZE, PGDIR_SIZE);
+ if (phys_addr) {
+ set_pgd(pgdp, pfn_pgd(PFN_DOWN(phys_addr), PAGE_KERNEL));
+ continue;
+ }
}
}

@@ -135,12 +106,52 @@ static void __init kasan_populate_pgd(unsigned long vaddr, unsigned long end)
} while (pgdp++, vaddr = next, vaddr != end);
}

+asmlinkage void __init kasan_early_init(void)
+{
+ uintptr_t i;
+
+ BUILD_BUG_ON(KASAN_SHADOW_OFFSET !=
+ KASAN_SHADOW_END - (1UL << (64 - KASAN_SHADOW_SCALE_SHIFT)));
+
+ for (i = 0; i < PTRS_PER_PTE; ++i)
+ set_pte(kasan_early_shadow_pte + i,
+ mk_pte(virt_to_page(kasan_early_shadow_page),
+ PAGE_KERNEL));
+
+ for (i = 0; i < PTRS_PER_PMD; ++i)
+ set_pmd(kasan_early_shadow_pmd + i,
+ pfn_pmd(PFN_DOWN
+ (__pa((uintptr_t)kasan_early_shadow_pte)),
+ PAGE_TABLE));
+
+ if (pgtable_l4_enabled) {
+ for (i = 0; i < PTRS_PER_PUD; ++i)
+ set_pud(kasan_early_shadow_pud + i,
+ pfn_pud(PFN_DOWN
+ (__pa(((uintptr_t)kasan_early_shadow_pmd))),
+ PAGE_TABLE));
+ }
+
+ kasan_populate_pgd(early_pg_dir + pgd_index(KASAN_SHADOW_START),
+ KASAN_SHADOW_START, KASAN_SHADOW_END, true);
+
+ local_flush_tlb_all();
+}
+
+void __init kasan_swapper_init(void)
+{
+ kasan_populate_pgd(pgd_offset_k(KASAN_SHADOW_START),
+ KASAN_SHADOW_START, KASAN_SHADOW_END, true);
+
+ local_flush_tlb_all();
+}
+
static void __init kasan_populate(void *start, void *end)
{
unsigned long vaddr = (unsigned long)start & PAGE_MASK;
unsigned long vend = PAGE_ALIGN((unsigned long)end);

- kasan_populate_pgd(vaddr, vend);
+ kasan_populate_pgd(pgd_offset_k(vaddr), vaddr, vend, false);

local_flush_tlb_all();
memset(start, KASAN_SHADOW_INIT, end - start);
--
2.32.0


2021-12-06 10:50:23

by Alexandre Ghiti

Subject: [PATCH v3 03/13] riscv: Introduce functions to switch pt_ops

This simply gathers the different pt_ops initializations into dedicated
functions and adds comments explaining why the page table operations
must be changed along the boot process.
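
The pattern is a plain function-pointer table swapped at well-defined
boot stages; a minimal user-space sketch of the mechanism (the toy
functions below are illustrative, not the kernel ones):

```c
#include <assert.h>

/* callers always go through pt_ops; only the setters know about stages */
struct pt_ops_t {
	int (*alloc_pte)(void);
};

static int alloc_pte_early(void)  { return 1; } /* no MMU: physical addrs */
static int alloc_pte_fixmap(void) { return 2; } /* MMU on, no linear map */
static int alloc_pte_late(void)   { return 3; } /* generic page allocator */

static struct pt_ops_t pt_ops;

static void pt_ops_set_early(void)  { pt_ops.alloc_pte = alloc_pte_early; }
static void pt_ops_set_fixmap(void) { pt_ops.alloc_pte = alloc_pte_fixmap; }
static void pt_ops_set_late(void)   { pt_ops.alloc_pte = alloc_pte_late; }
```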

Signed-off-by: Alexandre Ghiti <[email protected]>
---
arch/riscv/mm/init.c | 74 ++++++++++++++++++++++++++++++--------------
1 file changed, 51 insertions(+), 23 deletions(-)

diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c
index 5010eba52738..1552226fb6bd 100644
--- a/arch/riscv/mm/init.c
+++ b/arch/riscv/mm/init.c
@@ -582,6 +582,52 @@ static void __init create_fdt_early_page_table(pgd_t *pgdir, uintptr_t dtb_pa)
dtb_early_pa = dtb_pa;
}

+/*
+ * MMU is not enabled, the page tables are allocated directly using
+ * early_pmd/pud/p4d and the address returned is the physical one.
+ */
+void pt_ops_set_early(void)
+{
+ pt_ops.alloc_pte = alloc_pte_early;
+ pt_ops.get_pte_virt = get_pte_virt_early;
+#ifndef __PAGETABLE_PMD_FOLDED
+ pt_ops.alloc_pmd = alloc_pmd_early;
+ pt_ops.get_pmd_virt = get_pmd_virt_early;
+#endif
+}
+
+/*
+ * MMU is enabled but page table setup is not complete yet.
+ * fixmap page table alloc functions must be used as a means to temporarily
+ * map the allocated physical pages since the linear mapping does not exist yet.
+ *
+ * Note that this is called with MMU disabled, hence kernel_mapping_pa_to_va,
+ * but it will be used as described above.
+ */
+void pt_ops_set_fixmap(void)
+{
+ pt_ops.alloc_pte = kernel_mapping_pa_to_va((uintptr_t)alloc_pte_fixmap);
+ pt_ops.get_pte_virt = kernel_mapping_pa_to_va((uintptr_t)get_pte_virt_fixmap);
+#ifndef __PAGETABLE_PMD_FOLDED
+ pt_ops.alloc_pmd = kernel_mapping_pa_to_va((uintptr_t)alloc_pmd_fixmap);
+ pt_ops.get_pmd_virt = kernel_mapping_pa_to_va((uintptr_t)get_pmd_virt_fixmap);
+#endif
+}
+
+/*
+ * MMU is enabled and page table setup is complete, so from now, we can use
+ * generic page allocation functions to setup page table.
+ */
+void pt_ops_set_late(void)
+{
+ pt_ops.alloc_pte = alloc_pte_late;
+ pt_ops.get_pte_virt = get_pte_virt_late;
+#ifndef __PAGETABLE_PMD_FOLDED
+ pt_ops.alloc_pmd = alloc_pmd_late;
+ pt_ops.get_pmd_virt = get_pmd_virt_late;
+#endif
+}
+
asmlinkage void __init setup_vm(uintptr_t dtb_pa)
{
pmd_t __maybe_unused fix_bmap_spmd, fix_bmap_epmd;
@@ -626,12 +672,8 @@ asmlinkage void __init setup_vm(uintptr_t dtb_pa)
BUG_ON((kernel_map.virt_addr + kernel_map.size) > ADDRESS_SPACE_END - SZ_4K);
#endif

- pt_ops.alloc_pte = alloc_pte_early;
- pt_ops.get_pte_virt = get_pte_virt_early;
-#ifndef __PAGETABLE_PMD_FOLDED
- pt_ops.alloc_pmd = alloc_pmd_early;
- pt_ops.get_pmd_virt = get_pmd_virt_early;
-#endif
+ pt_ops_set_early();
+
/* Setup early PGD for fixmap */
create_pgd_mapping(early_pg_dir, FIXADDR_START,
(uintptr_t)fixmap_pgd_next, PGDIR_SIZE, PAGE_TABLE);
@@ -695,6 +737,8 @@ asmlinkage void __init setup_vm(uintptr_t dtb_pa)
pr_warn("FIX_BTMAP_BEGIN: %d\n", FIX_BTMAP_BEGIN);
}
#endif
+
+ pt_ops_set_fixmap();
}

static void __init setup_vm_final(void)
@@ -703,16 +747,6 @@ static void __init setup_vm_final(void)
phys_addr_t pa, start, end;
u64 i;

- /**
- * MMU is enabled at this point. But page table setup is not complete yet.
- * fixmap page table alloc functions should be used at this point
- */
- pt_ops.alloc_pte = alloc_pte_fixmap;
- pt_ops.get_pte_virt = get_pte_virt_fixmap;
-#ifndef __PAGETABLE_PMD_FOLDED
- pt_ops.alloc_pmd = alloc_pmd_fixmap;
- pt_ops.get_pmd_virt = get_pmd_virt_fixmap;
-#endif
/* Setup swapper PGD for fixmap */
create_pgd_mapping(swapper_pg_dir, FIXADDR_START,
__pa_symbol(fixmap_pgd_next),
@@ -754,13 +788,7 @@ static void __init setup_vm_final(void)
csr_write(CSR_SATP, PFN_DOWN(__pa_symbol(swapper_pg_dir)) | SATP_MODE);
local_flush_tlb_all();

- /* generic page allocation functions must be used to setup page table */
- pt_ops.alloc_pte = alloc_pte_late;
- pt_ops.get_pte_virt = get_pte_virt_late;
-#ifndef __PAGETABLE_PMD_FOLDED
- pt_ops.alloc_pmd = alloc_pmd_late;
- pt_ops.get_pmd_virt = get_pmd_virt_late;
-#endif
+ pt_ops_set_late();
}
#else
asmlinkage void __init setup_vm(uintptr_t dtb_pa)
--
2.32.0


2021-12-06 10:51:25

by Alexandre Ghiti

Subject: [PATCH v3 04/13] riscv: Allow to dynamically define VA_BITS

Since the 4-level page table can be folded at runtime, we don't know the
size of the virtual address space at compile time, so VA_BITS must be
set dynamically in order for sparsemem to reserve the right amount of
memory for struct pages.

Signed-off-by: Alexandre Ghiti <[email protected]>
---
arch/riscv/Kconfig | 10 ----------
arch/riscv/include/asm/kasan.h | 2 +-
arch/riscv/include/asm/pgtable.h | 10 ++++++++--
arch/riscv/include/asm/sparsemem.h | 6 +++++-
4 files changed, 14 insertions(+), 14 deletions(-)

diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
index 6cd98ade5ebc..c3a167eea011 100644
--- a/arch/riscv/Kconfig
+++ b/arch/riscv/Kconfig
@@ -146,16 +146,6 @@ config MMU
Select if you want MMU-based virtualised addressing space
support by paged memory management. If unsure, say 'Y'.

-config VA_BITS
- int
- default 32 if 32BIT
- default 39 if 64BIT
-
-config PA_BITS
- int
- default 34 if 32BIT
- default 56 if 64BIT
-
config PAGE_OFFSET
hex
default 0xC0000000 if 32BIT && MAXPHYSMEM_1GB
diff --git a/arch/riscv/include/asm/kasan.h b/arch/riscv/include/asm/kasan.h
index 2788e2c46609..743e6ff57996 100644
--- a/arch/riscv/include/asm/kasan.h
+++ b/arch/riscv/include/asm/kasan.h
@@ -27,7 +27,7 @@
*/
#define KASAN_SHADOW_SCALE_SHIFT 3

-#define KASAN_SHADOW_SIZE (UL(1) << ((CONFIG_VA_BITS - 1) - KASAN_SHADOW_SCALE_SHIFT))
+#define KASAN_SHADOW_SIZE (UL(1) << ((VA_BITS - 1) - KASAN_SHADOW_SCALE_SHIFT))
#define KASAN_SHADOW_START (KASAN_SHADOW_END - KASAN_SHADOW_SIZE)
#define KASAN_SHADOW_END MODULES_LOWEST_VADDR
#define KASAN_SHADOW_OFFSET _AC(CONFIG_KASAN_SHADOW_OFFSET, UL)
diff --git a/arch/riscv/include/asm/pgtable.h b/arch/riscv/include/asm/pgtable.h
index d34f3a7a9701..e1a52e22ad7e 100644
--- a/arch/riscv/include/asm/pgtable.h
+++ b/arch/riscv/include/asm/pgtable.h
@@ -50,8 +50,14 @@
* struct pages to map half the virtual address space. Then
* position vmemmap directly below the VMALLOC region.
*/
+#ifdef CONFIG_64BIT
+#define VA_BITS 39
+#else
+#define VA_BITS 32
+#endif
+
#define VMEMMAP_SHIFT \
- (CONFIG_VA_BITS - PAGE_SHIFT - 1 + STRUCT_PAGE_MAX_SHIFT)
+ (VA_BITS - PAGE_SHIFT - 1 + STRUCT_PAGE_MAX_SHIFT)
#define VMEMMAP_SIZE BIT(VMEMMAP_SHIFT)
#define VMEMMAP_END (VMALLOC_START - 1)
#define VMEMMAP_START (VMALLOC_START - VMEMMAP_SIZE)
@@ -653,7 +659,7 @@ static inline pmd_t pmdp_establish(struct vm_area_struct *vma,
* and give the kernel the other (upper) half.
*/
#ifdef CONFIG_64BIT
-#define KERN_VIRT_START (-(BIT(CONFIG_VA_BITS)) + TASK_SIZE)
+#define KERN_VIRT_START (-(BIT(VA_BITS)) + TASK_SIZE)
#else
#define KERN_VIRT_START FIXADDR_START
#endif
diff --git a/arch/riscv/include/asm/sparsemem.h b/arch/riscv/include/asm/sparsemem.h
index 45a7018a8118..63acaecc3374 100644
--- a/arch/riscv/include/asm/sparsemem.h
+++ b/arch/riscv/include/asm/sparsemem.h
@@ -4,7 +4,11 @@
#define _ASM_RISCV_SPARSEMEM_H

#ifdef CONFIG_SPARSEMEM
-#define MAX_PHYSMEM_BITS CONFIG_PA_BITS
+#ifdef CONFIG_64BIT
+#define MAX_PHYSMEM_BITS 56
+#else
+#define MAX_PHYSMEM_BITS 34
+#endif /* CONFIG_64BIT */
#define SECTION_SIZE_BITS 27
#endif /* CONFIG_SPARSEMEM */

--
2.32.0


2021-12-06 10:52:27

by Alexandre Ghiti

Subject: [PATCH v3 05/13] riscv: Get rid of MAXPHYSMEM configs

The CONFIG_MAXPHYSMEM_* options were actually never used: even the nommu
defconfigs that selected MAXPHYSMEM_2GB had no effect on PAGE_OFFSET,
since that default was preempted by the !MMU case right before it.

In addition, I suspect that commit 2bfc6cd81bd1 ("riscv: Move kernel
mapping outside of linear mapping"), which moved the kernel to
0xffffffff80000000, broke the MAXPHYSMEM_2GB config, which defined
PAGE_OFFSET at the same address.

Signed-off-by: Alexandre Ghiti <[email protected]>
---
arch/riscv/Kconfig | 23 ++-----------------
arch/riscv/configs/nommu_k210_defconfig | 1 -
.../riscv/configs/nommu_k210_sdcard_defconfig | 1 -
arch/riscv/configs/nommu_virt_defconfig | 1 -
4 files changed, 2 insertions(+), 24 deletions(-)

diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
index c3a167eea011..ac6c0cd9bc29 100644
--- a/arch/riscv/Kconfig
+++ b/arch/riscv/Kconfig
@@ -148,10 +148,9 @@ config MMU

config PAGE_OFFSET
hex
- default 0xC0000000 if 32BIT && MAXPHYSMEM_1GB
+ default 0xC0000000 if 32BIT
default 0x80000000 if 64BIT && !MMU
- default 0xffffffff80000000 if 64BIT && MAXPHYSMEM_2GB
- default 0xffffffd800000000 if 64BIT && MAXPHYSMEM_128GB
+ default 0xffffffd800000000 if 64BIT

config KASAN_SHADOW_OFFSET
hex
@@ -260,24 +259,6 @@ config MODULE_SECTIONS
bool
select HAVE_MOD_ARCH_SPECIFIC

-choice
- prompt "Maximum Physical Memory"
- default MAXPHYSMEM_1GB if 32BIT
- default MAXPHYSMEM_2GB if 64BIT && CMODEL_MEDLOW
- default MAXPHYSMEM_128GB if 64BIT && CMODEL_MEDANY
-
- config MAXPHYSMEM_1GB
- depends on 32BIT
- bool "1GiB"
- config MAXPHYSMEM_2GB
- depends on 64BIT && CMODEL_MEDLOW
- bool "2GiB"
- config MAXPHYSMEM_128GB
- depends on 64BIT && CMODEL_MEDANY
- bool "128GiB"
-endchoice
-
-
config SMP
bool "Symmetric Multi-Processing"
help
diff --git a/arch/riscv/configs/nommu_k210_defconfig b/arch/riscv/configs/nommu_k210_defconfig
index b16a2a12c82a..dae9179984cc 100644
--- a/arch/riscv/configs/nommu_k210_defconfig
+++ b/arch/riscv/configs/nommu_k210_defconfig
@@ -30,7 +30,6 @@ CONFIG_SLOB=y
# CONFIG_MMU is not set
CONFIG_SOC_CANAAN=y
CONFIG_SOC_CANAAN_K210_DTB_SOURCE="k210_generic"
-CONFIG_MAXPHYSMEM_2GB=y
CONFIG_SMP=y
CONFIG_NR_CPUS=2
CONFIG_CMDLINE="earlycon console=ttySIF0"
diff --git a/arch/riscv/configs/nommu_k210_sdcard_defconfig b/arch/riscv/configs/nommu_k210_sdcard_defconfig
index 61f887f65419..03f91525a059 100644
--- a/arch/riscv/configs/nommu_k210_sdcard_defconfig
+++ b/arch/riscv/configs/nommu_k210_sdcard_defconfig
@@ -22,7 +22,6 @@ CONFIG_SLOB=y
# CONFIG_MMU is not set
CONFIG_SOC_CANAAN=y
CONFIG_SOC_CANAAN_K210_DTB_SOURCE="k210_generic"
-CONFIG_MAXPHYSMEM_2GB=y
CONFIG_SMP=y
CONFIG_NR_CPUS=2
CONFIG_CMDLINE="earlycon console=ttySIF0 rootdelay=2 root=/dev/mmcblk0p1 ro"
diff --git a/arch/riscv/configs/nommu_virt_defconfig b/arch/riscv/configs/nommu_virt_defconfig
index e046a0babde4..f224be697785 100644
--- a/arch/riscv/configs/nommu_virt_defconfig
+++ b/arch/riscv/configs/nommu_virt_defconfig
@@ -27,7 +27,6 @@ CONFIG_SLOB=y
# CONFIG_SLAB_MERGE_DEFAULT is not set
# CONFIG_MMU is not set
CONFIG_SOC_VIRT=y
-CONFIG_MAXPHYSMEM_2GB=y
CONFIG_SMP=y
CONFIG_CMDLINE="root=/dev/vda rw earlycon=uart8250,mmio,0x10000000,115200n8 console=ttyS0"
CONFIG_CMDLINE_FORCE=y
--
2.32.0


2021-12-06 10:53:27

by Alexandre Ghiti

Subject: [PATCH v3 06/13] asm-generic: Prepare for riscv use of pud_alloc_one and pud_free

In the following commits, riscv will use the generic versions of
pud_alloc_one and pud_free almost as-is, but an additional check is
required since those functions are only relevant with at least a 4-level
page table, which on riscv is determined at runtime.

So move the content of those functions into helpers that riscv can use
without duplicating code.
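
The resulting split lets an architecture keep the generic behavior
behind a runtime check. A hypothetical user-space sketch of how riscv
can wrap the new __pud_* helpers (the pgtable_l4_enabled guard follows
the later riscv patch; calloc/free stand in for the page allocator):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

static bool pgtable_l4_enabled = true;

/* stand-ins for the generic __pud_alloc_one/__pud_free helpers */
static void *generic_pud_alloc(void) { return calloc(512, 8); }
static void generic_pud_free(void *pud) { free(pud); }

/* arch wrappers: a real PUD page only exists with 4 levels enabled */
static void *pud_alloc_one(void)
{
	return pgtable_l4_enabled ? generic_pud_alloc() : NULL;
}

static void pud_free(void *pud)
{
	if (pgtable_l4_enabled)
		generic_pud_free(pud);
}
```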

Signed-off-by: Alexandre Ghiti <[email protected]>
---
include/asm-generic/pgalloc.h | 24 ++++++++++++++++++------
1 file changed, 18 insertions(+), 6 deletions(-)

diff --git a/include/asm-generic/pgalloc.h b/include/asm-generic/pgalloc.h
index 02932efad3ab..977bea16cf1b 100644
--- a/include/asm-generic/pgalloc.h
+++ b/include/asm-generic/pgalloc.h
@@ -147,6 +147,15 @@ static inline void pmd_free(struct mm_struct *mm, pmd_t *pmd)

#if CONFIG_PGTABLE_LEVELS > 3

+static inline pud_t *__pud_alloc_one(struct mm_struct *mm, unsigned long addr)
+{
+ gfp_t gfp = GFP_PGTABLE_USER;
+
+ if (mm == &init_mm)
+ gfp = GFP_PGTABLE_KERNEL;
+ return (pud_t *)get_zeroed_page(gfp);
+}
+
#ifndef __HAVE_ARCH_PUD_ALLOC_ONE
/**
* pud_alloc_one - allocate a page for PUD-level page table
@@ -159,20 +168,23 @@ static inline void pmd_free(struct mm_struct *mm, pmd_t *pmd)
*/
static inline pud_t *pud_alloc_one(struct mm_struct *mm, unsigned long addr)
{
- gfp_t gfp = GFP_PGTABLE_USER;
-
- if (mm == &init_mm)
- gfp = GFP_PGTABLE_KERNEL;
- return (pud_t *)get_zeroed_page(gfp);
+ return __pud_alloc_one(mm, addr);
}
#endif

-static inline void pud_free(struct mm_struct *mm, pud_t *pud)
+static inline void __pud_free(struct mm_struct *mm, pud_t *pud)
{
BUG_ON((unsigned long)pud & (PAGE_SIZE-1));
free_page((unsigned long)pud);
}

+#ifndef __HAVE_ARCH_PUD_FREE
+static inline void pud_free(struct mm_struct *mm, pud_t *pud)
+{
+ __pud_free(mm, pud);
+}
+#endif
+
#endif /* CONFIG_PGTABLE_LEVELS > 3 */

#ifndef __HAVE_ARCH_PGD_FREE
--
2.32.0


2021-12-06 10:54:41

by Alexandre Ghiti

Subject: [PATCH v3 07/13] riscv: Implement sv48 support

By adding a new 4th page table level, allow 64-bit kernels to address
2^48 bytes of virtual address space: in practice, that offers 128TB of
virtual address space to userspace and allows up to 64TB of physical
memory.

If the underlying hardware does not support sv48, we automatically fall
back to a standard 3-level page table by folding the new PUD level into
the PGDIR level. To detect hardware capabilities at runtime, we rely on
the SATP behavior of ignoring writes with an unsupported mode.
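
The detection relies on the privileged-spec behavior that writing an
unsupported mode to satp leaves the register unchanged; a user-space
simulation of that probe (the toy csr_write/probe functions are
illustrative, not the kernel code):

```c
#include <assert.h>
#include <stdint.h>

#define SATP_MODE_48 0x9000000000000000ULL

static uint64_t satp;   /* toy CSR */
static int hw_has_sv48; /* what the simulated hart implements */

static void csr_write_satp(uint64_t val)
{
	/* an implementation ignores writes selecting an unsupported mode */
	if ((val >> 60) == 9 && !hw_has_sv48)
		return;
	satp = val;
}

/* write sv48 mode, then read back: if it stuck, the hart supports it */
static int probe_sv48(void)
{
	satp = 0;
	csr_write_satp(SATP_MODE_48);
	return (satp >> 60) == 9;
}
```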

Signed-off-by: Alexandre Ghiti <[email protected]>
---
arch/riscv/Kconfig | 4 +-
arch/riscv/include/asm/csr.h | 3 +-
arch/riscv/include/asm/fixmap.h | 1 +
arch/riscv/include/asm/kasan.h | 6 +-
arch/riscv/include/asm/page.h | 14 ++
arch/riscv/include/asm/pgalloc.h | 40 +++++
arch/riscv/include/asm/pgtable-64.h | 108 +++++++++++-
arch/riscv/include/asm/pgtable.h | 24 ++-
arch/riscv/kernel/head.S | 3 +-
arch/riscv/mm/context.c | 4 +-
arch/riscv/mm/init.c | 212 +++++++++++++++++++++---
arch/riscv/mm/kasan_init.c | 137 ++++++++++++++-
drivers/firmware/efi/libstub/efi-stub.c | 2 +
13 files changed, 514 insertions(+), 44 deletions(-)

diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
index ac6c0cd9bc29..d28fe0148e13 100644
--- a/arch/riscv/Kconfig
+++ b/arch/riscv/Kconfig
@@ -150,7 +150,7 @@ config PAGE_OFFSET
hex
default 0xC0000000 if 32BIT
default 0x80000000 if 64BIT && !MMU
- default 0xffffffd800000000 if 64BIT
+ default 0xffffaf8000000000 if 64BIT

config KASAN_SHADOW_OFFSET
hex
@@ -201,7 +201,7 @@ config FIX_EARLYCON_MEM

config PGTABLE_LEVELS
int
- default 3 if 64BIT
+ default 4 if 64BIT
default 2

config LOCKDEP_SUPPORT
diff --git a/arch/riscv/include/asm/csr.h b/arch/riscv/include/asm/csr.h
index 87ac65696871..3fdb971c7896 100644
--- a/arch/riscv/include/asm/csr.h
+++ b/arch/riscv/include/asm/csr.h
@@ -40,14 +40,13 @@
#ifndef CONFIG_64BIT
#define SATP_PPN _AC(0x003FFFFF, UL)
#define SATP_MODE_32 _AC(0x80000000, UL)
-#define SATP_MODE SATP_MODE_32
#define SATP_ASID_BITS 9
#define SATP_ASID_SHIFT 22
#define SATP_ASID_MASK _AC(0x1FF, UL)
#else
#define SATP_PPN _AC(0x00000FFFFFFFFFFF, UL)
#define SATP_MODE_39 _AC(0x8000000000000000, UL)
-#define SATP_MODE SATP_MODE_39
+#define SATP_MODE_48 _AC(0x9000000000000000, UL)
#define SATP_ASID_BITS 16
#define SATP_ASID_SHIFT 44
#define SATP_ASID_MASK _AC(0xFFFF, UL)
diff --git a/arch/riscv/include/asm/fixmap.h b/arch/riscv/include/asm/fixmap.h
index 54cbf07fb4e9..58a718573ad6 100644
--- a/arch/riscv/include/asm/fixmap.h
+++ b/arch/riscv/include/asm/fixmap.h
@@ -24,6 +24,7 @@ enum fixed_addresses {
FIX_HOLE,
FIX_PTE,
FIX_PMD,
+ FIX_PUD,
FIX_TEXT_POKE1,
FIX_TEXT_POKE0,
FIX_EARLYCON_MEM_BASE,
diff --git a/arch/riscv/include/asm/kasan.h b/arch/riscv/include/asm/kasan.h
index 743e6ff57996..0b85e363e778 100644
--- a/arch/riscv/include/asm/kasan.h
+++ b/arch/riscv/include/asm/kasan.h
@@ -28,7 +28,11 @@
#define KASAN_SHADOW_SCALE_SHIFT 3

#define KASAN_SHADOW_SIZE (UL(1) << ((VA_BITS - 1) - KASAN_SHADOW_SCALE_SHIFT))
-#define KASAN_SHADOW_START (KASAN_SHADOW_END - KASAN_SHADOW_SIZE)
+/*
+ * Depending on the size of the virtual address space, the region may not be
+ * aligned on PGDIR_SIZE, so force its alignment to ease its population.
+ */
+#define KASAN_SHADOW_START ((KASAN_SHADOW_END - KASAN_SHADOW_SIZE) & PGDIR_MASK)
#define KASAN_SHADOW_END MODULES_LOWEST_VADDR
#define KASAN_SHADOW_OFFSET _AC(CONFIG_KASAN_SHADOW_OFFSET, UL)

diff --git a/arch/riscv/include/asm/page.h b/arch/riscv/include/asm/page.h
index e03559f9b35e..d089fe46f7d8 100644
--- a/arch/riscv/include/asm/page.h
+++ b/arch/riscv/include/asm/page.h
@@ -31,7 +31,20 @@
* When not using MMU this corresponds to the first free page in
* physical memory (aligned on a page boundary).
*/
+#ifdef CONFIG_64BIT
+#ifdef CONFIG_MMU
+#define PAGE_OFFSET kernel_map.page_offset
+#else
+#define PAGE_OFFSET _AC(CONFIG_PAGE_OFFSET, UL)
+#endif
+/*
+ * By default, the CONFIG_PAGE_OFFSET value corresponds to the SV48 address
+ * space, so define the PAGE_OFFSET value for SV39 here.
+ */
+#define PAGE_OFFSET_L3 _AC(0xffffffd800000000, UL)
+#else
#define PAGE_OFFSET _AC(CONFIG_PAGE_OFFSET, UL)
+#endif /* CONFIG_64BIT */

/*
* Half of the kernel address space (half of the entries of the page global
@@ -90,6 +103,7 @@ extern unsigned long riscv_pfn_base;
#endif /* CONFIG_MMU */

struct kernel_mapping {
+ unsigned long page_offset;
unsigned long virt_addr;
uintptr_t phys_addr;
uintptr_t size;
diff --git a/arch/riscv/include/asm/pgalloc.h b/arch/riscv/include/asm/pgalloc.h
index 0af6933a7100..11823004b87a 100644
--- a/arch/riscv/include/asm/pgalloc.h
+++ b/arch/riscv/include/asm/pgalloc.h
@@ -11,6 +11,8 @@
#include <asm/tlb.h>

#ifdef CONFIG_MMU
+#define __HAVE_ARCH_PUD_ALLOC_ONE
+#define __HAVE_ARCH_PUD_FREE
#include <asm-generic/pgalloc.h>

static inline void pmd_populate_kernel(struct mm_struct *mm,
@@ -36,6 +38,44 @@ static inline void pud_populate(struct mm_struct *mm, pud_t *pud, pmd_t *pmd)

set_pud(pud, __pud((pfn << _PAGE_PFN_SHIFT) | _PAGE_TABLE));
}
+
+static inline void p4d_populate(struct mm_struct *mm, p4d_t *p4d, pud_t *pud)
+{
+ if (pgtable_l4_enabled) {
+ unsigned long pfn = virt_to_pfn(pud);
+
+ set_p4d(p4d, __p4d((pfn << _PAGE_PFN_SHIFT) | _PAGE_TABLE));
+ }
+}
+
+static inline void p4d_populate_safe(struct mm_struct *mm, p4d_t *p4d,
+ pud_t *pud)
+{
+ if (pgtable_l4_enabled) {
+ unsigned long pfn = virt_to_pfn(pud);
+
+ set_p4d_safe(p4d,
+ __p4d((pfn << _PAGE_PFN_SHIFT) | _PAGE_TABLE));
+ }
+}
+
+#define pud_alloc_one pud_alloc_one
+static inline pud_t *pud_alloc_one(struct mm_struct *mm, unsigned long addr)
+{
+ if (pgtable_l4_enabled)
+ return __pud_alloc_one(mm, addr);
+
+ return NULL;
+}
+
+#define pud_free pud_free
+static inline void pud_free(struct mm_struct *mm, pud_t *pud)
+{
+ if (pgtable_l4_enabled)
+ __pud_free(mm, pud);
+}
+
+#define __pud_free_tlb(tlb, pud, addr) pud_free((tlb)->mm, pud)
#endif /* __PAGETABLE_PMD_FOLDED */

static inline pgd_t *pgd_alloc(struct mm_struct *mm)
diff --git a/arch/riscv/include/asm/pgtable-64.h b/arch/riscv/include/asm/pgtable-64.h
index 228261aa9628..bbbdd66e5e2f 100644
--- a/arch/riscv/include/asm/pgtable-64.h
+++ b/arch/riscv/include/asm/pgtable-64.h
@@ -8,16 +8,36 @@

#include <linux/const.h>

-#define PGDIR_SHIFT 30
+extern bool pgtable_l4_enabled;
+
+#define PGDIR_SHIFT_L3 30
+#define PGDIR_SHIFT_L4 39
+#define PGDIR_SIZE_L3 (_AC(1, UL) << PGDIR_SHIFT_L3)
+
+#define PGDIR_SHIFT (pgtable_l4_enabled ? PGDIR_SHIFT_L4 : PGDIR_SHIFT_L3)
/* Size of region mapped by a page global directory */
#define PGDIR_SIZE (_AC(1, UL) << PGDIR_SHIFT)
#define PGDIR_MASK (~(PGDIR_SIZE - 1))

+/* pud is folded into pgd in case of 3-level page table */
+#define PUD_SHIFT 30
+#define PUD_SIZE (_AC(1, UL) << PUD_SHIFT)
+#define PUD_MASK (~(PUD_SIZE - 1))
+
#define PMD_SHIFT 21
/* Size of region mapped by a page middle directory */
#define PMD_SIZE (_AC(1, UL) << PMD_SHIFT)
#define PMD_MASK (~(PMD_SIZE - 1))

+/* Page Upper Directory entry */
+typedef struct {
+ unsigned long pud;
+} pud_t;
+
+#define pud_val(x) ((x).pud)
+#define __pud(x) ((pud_t) { (x) })
+#define PTRS_PER_PUD (PAGE_SIZE / sizeof(pud_t))
+
/* Page Middle Directory entry */
typedef struct {
unsigned long pmd;
@@ -59,6 +79,16 @@ static inline void pud_clear(pud_t *pudp)
set_pud(pudp, __pud(0));
}

+static inline pud_t pfn_pud(unsigned long pfn, pgprot_t prot)
+{
+ return __pud((pfn << _PAGE_PFN_SHIFT) | pgprot_val(prot));
+}
+
+static inline unsigned long _pud_pfn(pud_t pud)
+{
+ return pud_val(pud) >> _PAGE_PFN_SHIFT;
+}
+
static inline pmd_t *pud_pgtable(pud_t pud)
{
return (pmd_t *)pfn_to_virt(pud_val(pud) >> _PAGE_PFN_SHIFT);
@@ -69,6 +99,17 @@ static inline struct page *pud_page(pud_t pud)
return pfn_to_page(pud_val(pud) >> _PAGE_PFN_SHIFT);
}

+#define mm_pud_folded mm_pud_folded
+static inline bool mm_pud_folded(struct mm_struct *mm)
+{
+ if (pgtable_l4_enabled)
+ return false;
+
+ return true;
+}
+
+#define pmd_index(addr) (((addr) >> PMD_SHIFT) & (PTRS_PER_PMD - 1))
+
static inline pmd_t pfn_pmd(unsigned long pfn, pgprot_t prot)
{
return __pmd((pfn << _PAGE_PFN_SHIFT) | pgprot_val(prot));
@@ -84,4 +125,69 @@ static inline unsigned long _pmd_pfn(pmd_t pmd)
#define pmd_ERROR(e) \
pr_err("%s:%d: bad pmd %016lx.\n", __FILE__, __LINE__, pmd_val(e))

+#define pud_ERROR(e) \
+ pr_err("%s:%d: bad pud %016lx.\n", __FILE__, __LINE__, pud_val(e))
+
+static inline void set_p4d(p4d_t *p4dp, p4d_t p4d)
+{
+ if (pgtable_l4_enabled)
+ *p4dp = p4d;
+ else
+ set_pud((pud_t *)p4dp, (pud_t){ p4d_val(p4d) });
+}
+
+static inline int p4d_none(p4d_t p4d)
+{
+ if (pgtable_l4_enabled)
+ return (p4d_val(p4d) == 0);
+
+ return 0;
+}
+
+static inline int p4d_present(p4d_t p4d)
+{
+ if (pgtable_l4_enabled)
+ return (p4d_val(p4d) & _PAGE_PRESENT);
+
+ return 1;
+}
+
+static inline int p4d_bad(p4d_t p4d)
+{
+ if (pgtable_l4_enabled)
+ return !p4d_present(p4d);
+
+ return 0;
+}
+
+static inline void p4d_clear(p4d_t *p4d)
+{
+ if (pgtable_l4_enabled)
+ set_p4d(p4d, __p4d(0));
+}
+
+static inline pud_t *p4d_pgtable(p4d_t p4d)
+{
+ if (pgtable_l4_enabled)
+ return (pud_t *)pfn_to_virt(p4d_val(p4d) >> _PAGE_PFN_SHIFT);
+
+ return (pud_t *)pud_pgtable((pud_t) { p4d_val(p4d) });
+}
+
+static inline struct page *p4d_page(p4d_t p4d)
+{
+ return pfn_to_page(p4d_val(p4d) >> _PAGE_PFN_SHIFT);
+}
+
+#define pud_index(addr) (((addr) >> PUD_SHIFT) & (PTRS_PER_PUD - 1))
+
+#define pud_offset pud_offset
+static inline pud_t *pud_offset(p4d_t *p4d, unsigned long address)
+{
+ if (pgtable_l4_enabled)
+ return p4d_pgtable(*p4d) + pud_index(address);
+
+ return (pud_t *)p4d;
+}
+
#endif /* _ASM_RISCV_PGTABLE_64_H */
diff --git a/arch/riscv/include/asm/pgtable.h b/arch/riscv/include/asm/pgtable.h
index e1a52e22ad7e..e1c74ef4ead2 100644
--- a/arch/riscv/include/asm/pgtable.h
+++ b/arch/riscv/include/asm/pgtable.h
@@ -51,7 +51,7 @@
* position vmemmap directly below the VMALLOC region.
*/
#ifdef CONFIG_64BIT
-#define VA_BITS 39
+#define VA_BITS (pgtable_l4_enabled ? 48 : 39)
#else
#define VA_BITS 32
#endif
@@ -90,8 +90,7 @@

#ifndef __ASSEMBLY__

-/* Page Upper Directory not used in RISC-V */
-#include <asm-generic/pgtable-nopud.h>
+#include <asm-generic/pgtable-nop4d.h>
#include <asm/page.h>
#include <asm/tlbflush.h>
#include <linux/mm_types.h>
@@ -113,6 +112,17 @@
#define XIP_FIXUP(addr) (addr)
#endif /* CONFIG_XIP_KERNEL */

+struct pt_alloc_ops {
+ pte_t *(*get_pte_virt)(phys_addr_t pa);
+ phys_addr_t (*alloc_pte)(uintptr_t va);
+#ifndef __PAGETABLE_PMD_FOLDED
+ pmd_t *(*get_pmd_virt)(phys_addr_t pa);
+ phys_addr_t (*alloc_pmd)(uintptr_t va);
+ pud_t *(*get_pud_virt)(phys_addr_t pa);
+ phys_addr_t (*alloc_pud)(uintptr_t va);
+#endif
+};
+
#ifdef CONFIG_MMU
/* Number of entries in the page global directory */
#define PTRS_PER_PGD (PAGE_SIZE / sizeof(pgd_t))
@@ -669,9 +679,11 @@ static inline pmd_t pmdp_establish(struct vm_area_struct *vma,
* Note that PGDIR_SIZE must evenly divide TASK_SIZE.
*/
#ifdef CONFIG_64BIT
-#define TASK_SIZE (PGDIR_SIZE * PTRS_PER_PGD / 2)
+#define TASK_SIZE (PGDIR_SIZE * PTRS_PER_PGD / 2)
+#define TASK_SIZE_MIN (PGDIR_SIZE_L3 * PTRS_PER_PGD / 2)
#else
-#define TASK_SIZE FIXADDR_START
+#define TASK_SIZE FIXADDR_START
+#define TASK_SIZE_MIN TASK_SIZE
#endif

#else /* CONFIG_MMU */
@@ -697,6 +709,8 @@ extern uintptr_t _dtb_early_pa;
#define dtb_early_va _dtb_early_va
#define dtb_early_pa _dtb_early_pa
#endif /* CONFIG_XIP_KERNEL */
+extern u64 satp_mode;
+extern bool pgtable_l4_enabled;

void paging_init(void);
void misc_mem_init(void);
diff --git a/arch/riscv/kernel/head.S b/arch/riscv/kernel/head.S
index 52c5ff9804c5..c3c0ed559770 100644
--- a/arch/riscv/kernel/head.S
+++ b/arch/riscv/kernel/head.S
@@ -95,7 +95,8 @@ relocate:

/* Compute satp for kernel page tables, but don't load it yet */
srl a2, a0, PAGE_SHIFT
- li a1, SATP_MODE
+ la a1, satp_mode
+ REG_L a1, 0(a1)
or a2, a2, a1

/*
diff --git a/arch/riscv/mm/context.c b/arch/riscv/mm/context.c
index ee3459cb6750..a7246872bd30 100644
--- a/arch/riscv/mm/context.c
+++ b/arch/riscv/mm/context.c
@@ -192,7 +192,7 @@ static void set_mm_asid(struct mm_struct *mm, unsigned int cpu)
switch_mm_fast:
csr_write(CSR_SATP, virt_to_pfn(mm->pgd) |
((cntx & asid_mask) << SATP_ASID_SHIFT) |
- SATP_MODE);
+ satp_mode);

if (need_flush_tlb)
local_flush_tlb_all();
@@ -201,7 +201,7 @@ static void set_mm_asid(struct mm_struct *mm, unsigned int cpu)
static void set_mm_noasid(struct mm_struct *mm)
{
/* Switch the page table and blindly nuke entire local TLB */
- csr_write(CSR_SATP, virt_to_pfn(mm->pgd) | SATP_MODE);
+ csr_write(CSR_SATP, virt_to_pfn(mm->pgd) | satp_mode);
local_flush_tlb_all();
}

diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c
index 1552226fb6bd..6a19a1b1caf8 100644
--- a/arch/riscv/mm/init.c
+++ b/arch/riscv/mm/init.c
@@ -37,6 +37,17 @@ EXPORT_SYMBOL(kernel_map);
#define kernel_map (*(struct kernel_mapping *)XIP_FIXUP(&kernel_map))
#endif

+#ifdef CONFIG_64BIT
+u64 satp_mode = !IS_ENABLED(CONFIG_XIP_KERNEL) ? SATP_MODE_48 : SATP_MODE_39;
+#else
+u64 satp_mode = SATP_MODE_32;
+#endif
+EXPORT_SYMBOL(satp_mode);
+
+bool pgtable_l4_enabled = IS_ENABLED(CONFIG_64BIT) && !IS_ENABLED(CONFIG_XIP_KERNEL) ?
+ true : false;
+EXPORT_SYMBOL(pgtable_l4_enabled);
+
phys_addr_t phys_ram_base __ro_after_init;
EXPORT_SYMBOL(phys_ram_base);

@@ -53,15 +64,6 @@ extern char _start[];
void *_dtb_early_va __initdata;
uintptr_t _dtb_early_pa __initdata;

-struct pt_alloc_ops {
- pte_t *(*get_pte_virt)(phys_addr_t pa);
- phys_addr_t (*alloc_pte)(uintptr_t va);
-#ifndef __PAGETABLE_PMD_FOLDED
- pmd_t *(*get_pmd_virt)(phys_addr_t pa);
- phys_addr_t (*alloc_pmd)(uintptr_t va);
-#endif
-};
-
static phys_addr_t dma32_phys_limit __initdata;

static void __init zone_sizes_init(void)
@@ -222,7 +224,7 @@ static void __init setup_bootmem(void)
}

#ifdef CONFIG_MMU
-static struct pt_alloc_ops _pt_ops __initdata;
+struct pt_alloc_ops _pt_ops __initdata;

#ifdef CONFIG_XIP_KERNEL
#define pt_ops (*(struct pt_alloc_ops *)XIP_FIXUP(&_pt_ops))
@@ -238,6 +240,7 @@ pgd_t trampoline_pg_dir[PTRS_PER_PGD] __page_aligned_bss;
static pte_t fixmap_pte[PTRS_PER_PTE] __page_aligned_bss;

pgd_t early_pg_dir[PTRS_PER_PGD] __initdata __aligned(PAGE_SIZE);
+static pud_t __maybe_unused early_dtb_pud[PTRS_PER_PUD] __initdata __aligned(PAGE_SIZE);
static pmd_t __maybe_unused early_dtb_pmd[PTRS_PER_PMD] __initdata __aligned(PAGE_SIZE);

#ifdef CONFIG_XIP_KERNEL
@@ -326,6 +329,16 @@ static pmd_t early_pmd[PTRS_PER_PMD] __initdata __aligned(PAGE_SIZE);
#define early_pmd ((pmd_t *)XIP_FIXUP(early_pmd))
#endif /* CONFIG_XIP_KERNEL */

+static pud_t trampoline_pud[PTRS_PER_PUD] __page_aligned_bss;
+static pud_t fixmap_pud[PTRS_PER_PUD] __page_aligned_bss;
+static pud_t early_pud[PTRS_PER_PUD] __initdata __aligned(PAGE_SIZE);
+
+#ifdef CONFIG_XIP_KERNEL
+#define trampoline_pud ((pud_t *)XIP_FIXUP(trampoline_pud))
+#define fixmap_pud ((pud_t *)XIP_FIXUP(fixmap_pud))
+#define early_pud ((pud_t *)XIP_FIXUP(early_pud))
+#endif /* CONFIG_XIP_KERNEL */
+
static pmd_t *__init get_pmd_virt_early(phys_addr_t pa)
{
/* Before MMU is enabled */
@@ -345,7 +358,7 @@ static pmd_t *__init get_pmd_virt_late(phys_addr_t pa)

static phys_addr_t __init alloc_pmd_early(uintptr_t va)
{
- BUG_ON((va - kernel_map.virt_addr) >> PGDIR_SHIFT);
+ BUG_ON((va - kernel_map.virt_addr) >> PUD_SHIFT);

return (uintptr_t)early_pmd;
}
@@ -391,21 +404,97 @@ static void __init create_pmd_mapping(pmd_t *pmdp,
create_pte_mapping(ptep, va, pa, sz, prot);
}

-#define pgd_next_t pmd_t
-#define alloc_pgd_next(__va) pt_ops.alloc_pmd(__va)
-#define get_pgd_next_virt(__pa) pt_ops.get_pmd_virt(__pa)
+static pud_t *__init get_pud_virt_early(phys_addr_t pa)
+{
+ return (pud_t *)((uintptr_t)pa);
+}
+
+static pud_t *__init get_pud_virt_fixmap(phys_addr_t pa)
+{
+ clear_fixmap(FIX_PUD);
+ return (pud_t *)set_fixmap_offset(FIX_PUD, pa);
+}
+
+static pud_t *__init get_pud_virt_late(phys_addr_t pa)
+{
+ return (pud_t *)__va(pa);
+}
+
+static phys_addr_t __init alloc_pud_early(uintptr_t va)
+{
+ /* Only one PUD is available for early mapping */
+ BUG_ON((va - kernel_map.virt_addr) >> PGDIR_SHIFT);
+
+ return (uintptr_t)early_pud;
+}
+
+static phys_addr_t __init alloc_pud_fixmap(uintptr_t va)
+{
+ return memblock_phys_alloc(PAGE_SIZE, PAGE_SIZE);
+}
+
+static phys_addr_t alloc_pud_late(uintptr_t va)
+{
+ unsigned long vaddr;
+
+ vaddr = __get_free_page(GFP_KERNEL);
+ BUG_ON(!vaddr);
+ return __pa(vaddr);
+}
+
+static void __init create_pud_mapping(pud_t *pudp,
+ uintptr_t va, phys_addr_t pa,
+ phys_addr_t sz, pgprot_t prot)
+{
+ pmd_t *nextp;
+ phys_addr_t next_phys;
+ uintptr_t pud_index = pud_index(va);
+
+ if (sz == PUD_SIZE) {
+ if (pud_val(pudp[pud_index]) == 0)
+ pudp[pud_index] = pfn_pud(PFN_DOWN(pa), prot);
+ return;
+ }
+
+ if (pud_val(pudp[pud_index]) == 0) {
+ next_phys = pt_ops.alloc_pmd(va);
+ pudp[pud_index] = pfn_pud(PFN_DOWN(next_phys), PAGE_TABLE);
+ nextp = pt_ops.get_pmd_virt(next_phys);
+ memset(nextp, 0, PAGE_SIZE);
+ } else {
+ next_phys = PFN_PHYS(_pud_pfn(pudp[pud_index]));
+ nextp = pt_ops.get_pmd_virt(next_phys);
+ }
+
+ create_pmd_mapping(nextp, va, pa, sz, prot);
+}
+
+#define pgd_next_t pud_t
+#define alloc_pgd_next(__va) (pgtable_l4_enabled ? \
+ pt_ops.alloc_pud(__va) : pt_ops.alloc_pmd(__va))
+#define get_pgd_next_virt(__pa) (pgtable_l4_enabled ? \
+ pt_ops.get_pud_virt(__pa) : (pgd_next_t *)pt_ops.get_pmd_virt(__pa))
#define create_pgd_next_mapping(__nextp, __va, __pa, __sz, __prot) \
- create_pmd_mapping(__nextp, __va, __pa, __sz, __prot)
-#define fixmap_pgd_next fixmap_pmd
+ (pgtable_l4_enabled ? \
+ create_pud_mapping(__nextp, __va, __pa, __sz, __prot) : \
+ create_pmd_mapping((pmd_t *)__nextp, __va, __pa, __sz, __prot))
+#define fixmap_pgd_next (pgtable_l4_enabled ? \
+ (uintptr_t)fixmap_pud : (uintptr_t)fixmap_pmd)
+#define trampoline_pgd_next (pgtable_l4_enabled ? \
+ (uintptr_t)trampoline_pud : (uintptr_t)trampoline_pmd)
+#define early_dtb_pgd_next (pgtable_l4_enabled ? \
+ (uintptr_t)early_dtb_pud : (uintptr_t)early_dtb_pmd)
#else
#define pgd_next_t pte_t
#define alloc_pgd_next(__va) pt_ops.alloc_pte(__va)
#define get_pgd_next_virt(__pa) pt_ops.get_pte_virt(__pa)
#define create_pgd_next_mapping(__nextp, __va, __pa, __sz, __prot) \
create_pte_mapping(__nextp, __va, __pa, __sz, __prot)
-#define fixmap_pgd_next fixmap_pte
+#define fixmap_pgd_next ((uintptr_t)fixmap_pte)
+#define early_dtb_pgd_next ((uintptr_t)early_dtb_pmd)
+#define create_pud_mapping(__pmdp, __va, __pa, __sz, __prot)
#define create_pmd_mapping(__pmdp, __va, __pa, __sz, __prot)
-#endif
+#endif /* __PAGETABLE_PMD_FOLDED */

void __init create_pgd_mapping(pgd_t *pgdp,
uintptr_t va, phys_addr_t pa,
@@ -493,6 +582,57 @@ static __init pgprot_t pgprot_from_va(uintptr_t va)
}
#endif /* CONFIG_STRICT_KERNEL_RWX */

+#ifdef CONFIG_64BIT
+static void __init disable_pgtable_l4(void)
+{
+ pgtable_l4_enabled = false;
+ kernel_map.page_offset = PAGE_OFFSET_L3;
+ satp_mode = SATP_MODE_39;
+}
+
+/*
+ * There is a simple way to determine if 4-level paging is supported by
+ * the underlying hardware: establish a 1:1 mapping in 4-level page table
+ * mode, then read SATP back: if the written configuration was taken into
+ * account, sv48 is supported.
+ */
+static __init void set_satp_mode(void)
+{
+ u64 identity_satp, hw_satp;
+ uintptr_t set_satp_mode_pmd;
+
+ set_satp_mode_pmd = ((unsigned long)set_satp_mode) & PMD_MASK;
+ create_pgd_mapping(early_pg_dir,
+ set_satp_mode_pmd, (uintptr_t)early_pud,
+ PGDIR_SIZE, PAGE_TABLE);
+ create_pud_mapping(early_pud,
+ set_satp_mode_pmd, (uintptr_t)early_pmd,
+ PUD_SIZE, PAGE_TABLE);
+ /* Handle the case where set_satp_mode straddles 2 PMDs */
+ create_pmd_mapping(early_pmd,
+ set_satp_mode_pmd, set_satp_mode_pmd,
+ PMD_SIZE, PAGE_KERNEL_EXEC);
+ create_pmd_mapping(early_pmd,
+ set_satp_mode_pmd + PMD_SIZE,
+ set_satp_mode_pmd + PMD_SIZE,
+ PMD_SIZE, PAGE_KERNEL_EXEC);
+
+ identity_satp = PFN_DOWN((uintptr_t)&early_pg_dir) | satp_mode;
+
+ local_flush_tlb_all();
+ csr_write(CSR_SATP, identity_satp);
+ hw_satp = csr_swap(CSR_SATP, 0ULL);
+ local_flush_tlb_all();
+
+ if (hw_satp != identity_satp)
+ disable_pgtable_l4();
+
+ memset(early_pg_dir, 0, PAGE_SIZE);
+ memset(early_pud, 0, PAGE_SIZE);
+ memset(early_pmd, 0, PAGE_SIZE);
+}
+#endif
+
/*
* setup_vm() is called from head.S with MMU-off.
*
@@ -557,10 +697,15 @@ static void __init create_fdt_early_page_table(pgd_t *pgdir, uintptr_t dtb_pa)
uintptr_t pa = dtb_pa & ~(PMD_SIZE - 1);

create_pgd_mapping(early_pg_dir, DTB_EARLY_BASE_VA,
- IS_ENABLED(CONFIG_64BIT) ? (uintptr_t)early_dtb_pmd : pa,
+ IS_ENABLED(CONFIG_64BIT) ? early_dtb_pgd_next : pa,
PGDIR_SIZE,
IS_ENABLED(CONFIG_64BIT) ? PAGE_TABLE : PAGE_KERNEL);

+ if (pgtable_l4_enabled) {
+ create_pud_mapping(early_dtb_pud, DTB_EARLY_BASE_VA,
+ (uintptr_t)early_dtb_pmd, PUD_SIZE, PAGE_TABLE);
+ }
+
if (IS_ENABLED(CONFIG_64BIT)) {
create_pmd_mapping(early_dtb_pmd, DTB_EARLY_BASE_VA,
pa, PMD_SIZE, PAGE_KERNEL);
@@ -593,6 +738,8 @@ void pt_ops_set_early(void)
#ifndef __PAGETABLE_PMD_FOLDED
pt_ops.alloc_pmd = alloc_pmd_early;
pt_ops.get_pmd_virt = get_pmd_virt_early;
+ pt_ops.alloc_pud = alloc_pud_early;
+ pt_ops.get_pud_virt = get_pud_virt_early;
#endif
}

@@ -611,6 +758,8 @@ void pt_ops_set_fixmap(void)
#ifndef __PAGETABLE_PMD_FOLDED
pt_ops.alloc_pmd = kernel_mapping_pa_to_va((uintptr_t)alloc_pmd_fixmap);
pt_ops.get_pmd_virt = kernel_mapping_pa_to_va((uintptr_t)get_pmd_virt_fixmap);
+ pt_ops.alloc_pud = kernel_mapping_pa_to_va((uintptr_t)alloc_pud_fixmap);
+ pt_ops.get_pud_virt = kernel_mapping_pa_to_va((uintptr_t)get_pud_virt_fixmap);
#endif
}

@@ -625,6 +774,8 @@ void pt_ops_set_late(void)
#ifndef __PAGETABLE_PMD_FOLDED
pt_ops.alloc_pmd = alloc_pmd_late;
pt_ops.get_pmd_virt = get_pmd_virt_late;
+ pt_ops.alloc_pud = alloc_pud_late;
+ pt_ops.get_pud_virt = get_pud_virt_late;
#endif
}

@@ -633,6 +784,7 @@ asmlinkage void __init setup_vm(uintptr_t dtb_pa)
pmd_t __maybe_unused fix_bmap_spmd, fix_bmap_epmd;

kernel_map.virt_addr = KERNEL_LINK_ADDR;
+ kernel_map.page_offset = _AC(CONFIG_PAGE_OFFSET, UL);

#ifdef CONFIG_XIP_KERNEL
kernel_map.xiprom = (uintptr_t)CONFIG_XIP_PHYS_ADDR;
@@ -647,6 +799,11 @@ asmlinkage void __init setup_vm(uintptr_t dtb_pa)
kernel_map.phys_addr = (uintptr_t)(&_start);
kernel_map.size = (uintptr_t)(&_end) - kernel_map.phys_addr;
#endif
+
+#if defined(CONFIG_64BIT) && !defined(CONFIG_XIP_KERNEL)
+ set_satp_mode();
+#endif
+
kernel_map.va_pa_offset = PAGE_OFFSET - kernel_map.phys_addr;
kernel_map.va_kernel_pa_offset = kernel_map.virt_addr - kernel_map.phys_addr;

@@ -676,15 +833,21 @@ asmlinkage void __init setup_vm(uintptr_t dtb_pa)

/* Setup early PGD for fixmap */
create_pgd_mapping(early_pg_dir, FIXADDR_START,
- (uintptr_t)fixmap_pgd_next, PGDIR_SIZE, PAGE_TABLE);
+ fixmap_pgd_next, PGDIR_SIZE, PAGE_TABLE);

#ifndef __PAGETABLE_PMD_FOLDED
- /* Setup fixmap PMD */
+ /* Setup fixmap PUD and PMD */
+ if (pgtable_l4_enabled)
+ create_pud_mapping(fixmap_pud, FIXADDR_START,
+ (uintptr_t)fixmap_pmd, PUD_SIZE, PAGE_TABLE);
create_pmd_mapping(fixmap_pmd, FIXADDR_START,
(uintptr_t)fixmap_pte, PMD_SIZE, PAGE_TABLE);
/* Setup trampoline PGD and PMD */
create_pgd_mapping(trampoline_pg_dir, kernel_map.virt_addr,
- (uintptr_t)trampoline_pmd, PGDIR_SIZE, PAGE_TABLE);
+ trampoline_pgd_next, PGDIR_SIZE, PAGE_TABLE);
+ if (pgtable_l4_enabled)
+ create_pud_mapping(trampoline_pud, kernel_map.virt_addr,
+ (uintptr_t)trampoline_pmd, PUD_SIZE, PAGE_TABLE);
#ifdef CONFIG_XIP_KERNEL
create_pmd_mapping(trampoline_pmd, kernel_map.virt_addr,
kernel_map.xiprom, PMD_SIZE, PAGE_KERNEL_EXEC);
@@ -712,7 +875,7 @@ asmlinkage void __init setup_vm(uintptr_t dtb_pa)
* Bootime fixmap only can handle PMD_SIZE mapping. Thus, boot-ioremap
* range can not span multiple pmds.
*/
- BUILD_BUG_ON((__fix_to_virt(FIX_BTMAP_BEGIN) >> PMD_SHIFT)
+ BUG_ON((__fix_to_virt(FIX_BTMAP_BEGIN) >> PMD_SHIFT)
!= (__fix_to_virt(FIX_BTMAP_END) >> PMD_SHIFT));

#ifndef __PAGETABLE_PMD_FOLDED
@@ -783,9 +946,10 @@ static void __init setup_vm_final(void)
/* Clear fixmap PTE and PMD mappings */
clear_fixmap(FIX_PTE);
clear_fixmap(FIX_PMD);
+ clear_fixmap(FIX_PUD);

/* Move to swapper page table */
- csr_write(CSR_SATP, PFN_DOWN(__pa_symbol(swapper_pg_dir)) | SATP_MODE);
+ csr_write(CSR_SATP, PFN_DOWN(__pa_symbol(swapper_pg_dir)) | satp_mode);
local_flush_tlb_all();

pt_ops_set_late();
diff --git a/arch/riscv/mm/kasan_init.c b/arch/riscv/mm/kasan_init.c
index 1434a0225140..993f50571a3b 100644
--- a/arch/riscv/mm/kasan_init.c
+++ b/arch/riscv/mm/kasan_init.c
@@ -11,7 +11,29 @@
#include <asm/fixmap.h>
#include <asm/pgalloc.h>

+/*
+ * The KASAN shadow region must lie at a fixed address across sv39, sv48
+ * and sv57, which is right before the kernel mapping.
+ *
+ * For sv39, the region is aligned on PGDIR_SIZE so we only need to populate
+ * the page global directory with kasan_early_shadow_pmd.
+ *
+ * For sv48 and sv57, the region is not aligned on PGDIR_SIZE so the mapping
+ * must be divided as follows:
+ * - the first PGD entry, although incomplete, is populated with
+ * kasan_early_shadow_pud/p4d
+ * - the PGD entries in the middle are populated with kasan_early_shadow_pud/p4d
+ * - the last PGD entry is shared with the kernel mapping so populated at the
+ * lower levels pud/p4d
+ *
+ * In addition, when shallow populating a kasan region (for example vmalloc),
+ * this region may also not be aligned on PGDIR_SIZE, so we must go down to the
+ * pud level too.
+ */
+
extern pgd_t early_pg_dir[PTRS_PER_PGD];
+extern struct pt_alloc_ops _pt_ops __initdata;
+#define pt_ops _pt_ops

static void __init kasan_populate_pte(pmd_t *pmd, unsigned long vaddr, unsigned long end)
{
@@ -35,15 +57,19 @@ static void __init kasan_populate_pte(pmd_t *pmd, unsigned long vaddr, unsigned
set_pmd(pmd, pfn_pmd(PFN_DOWN(__pa(base_pte)), PAGE_TABLE));
}

-static void __init kasan_populate_pmd(pgd_t *pgd, unsigned long vaddr, unsigned long end)
+static void __init kasan_populate_pmd(pud_t *pud, unsigned long vaddr, unsigned long end)
{
phys_addr_t phys_addr;
pmd_t *pmdp, *base_pmd;
unsigned long next;

- base_pmd = (pmd_t *)pgd_page_vaddr(*pgd);
- if (base_pmd == lm_alias(kasan_early_shadow_pmd))
+ if (pud_none(*pud)) {
base_pmd = memblock_alloc(PTRS_PER_PMD * sizeof(pmd_t), PAGE_SIZE);
+ } else {
+ base_pmd = (pmd_t *)pud_pgtable(*pud);
+ if (base_pmd == lm_alias(kasan_early_shadow_pmd))
+ base_pmd = memblock_alloc(PTRS_PER_PMD * sizeof(pmd_t), PAGE_SIZE);
+ }

pmdp = base_pmd + pmd_index(vaddr);

@@ -67,9 +93,72 @@ static void __init kasan_populate_pmd(pgd_t *pgd, unsigned long vaddr, unsigned
* it entirely, memblock could allocate a page at a physical address
* where KASAN is not populated yet and then we'd get a page fault.
*/
- set_pgd(pgd, pfn_pgd(PFN_DOWN(__pa(base_pmd)), PAGE_TABLE));
+ set_pud(pud, pfn_pud(PFN_DOWN(__pa(base_pmd)), PAGE_TABLE));
+}
+
+static void __init kasan_populate_pud(pgd_t *pgd,
+ unsigned long vaddr, unsigned long end,
+ bool early)
+{
+ phys_addr_t phys_addr;
+ pud_t *pudp, *base_pud;
+ unsigned long next;
+
+ if (early) {
+ /*
+ * We can't use pgd_page_vaddr here as it would return a linear
+ * mapping address but it is not mapped yet, but when populating
+ * early_pg_dir, we need the physical address and when populating
+ * swapper_pg_dir, we need the kernel virtual address so use
+ * pt_ops facility.
+ */
+ base_pud = pt_ops.get_pud_virt(pfn_to_phys(_pgd_pfn(*pgd)));
+ } else {
+ base_pud = (pud_t *)pgd_page_vaddr(*pgd);
+ if (base_pud == lm_alias(kasan_early_shadow_pud))
+ base_pud = memblock_alloc(PTRS_PER_PUD * sizeof(pud_t), PAGE_SIZE);
+ }
+
+ pudp = base_pud + pud_index(vaddr);
+
+ do {
+ next = pud_addr_end(vaddr, end);
+
+ if (pud_none(*pudp) && IS_ALIGNED(vaddr, PUD_SIZE) && (next - vaddr) >= PUD_SIZE) {
+ if (early) {
+ phys_addr = __pa(((uintptr_t)kasan_early_shadow_pmd));
+ set_pud(pudp, pfn_pud(PFN_DOWN(phys_addr), PAGE_TABLE));
+ continue;
+ } else {
+ phys_addr = memblock_phys_alloc(PUD_SIZE, PUD_SIZE);
+ if (phys_addr) {
+ set_pud(pudp, pfn_pud(PFN_DOWN(phys_addr), PAGE_KERNEL));
+ continue;
+ }
+ }
+ }
+
+ kasan_populate_pmd(pudp, vaddr, next);
+ } while (pudp++, vaddr = next, vaddr != end);
+
+ /*
+ * Wait for the whole PGD to be populated before setting the PGD in
+ * the page table, otherwise, if we did set the PGD before populating
+ * it entirely, memblock could allocate a page at a physical address
+ * where KASAN is not populated yet and then we'd get a page fault.
+ */
+ if (!early)
+ set_pgd(pgd, pfn_pgd(PFN_DOWN(__pa(base_pud)), PAGE_TABLE));
}

+#define kasan_early_shadow_pgd_next (pgtable_l4_enabled ? \
+ (uintptr_t)kasan_early_shadow_pud : \
+ (uintptr_t)kasan_early_shadow_pmd)
+#define kasan_populate_pgd_next(pgdp, vaddr, next, early) \
+ (pgtable_l4_enabled ? \
+ kasan_populate_pud(pgdp, vaddr, next, early) : \
+ kasan_populate_pmd((pud_t *)pgdp, vaddr, next))
+
static void __init kasan_populate_pgd(pgd_t *pgdp,
unsigned long vaddr, unsigned long end,
bool early)
@@ -102,7 +191,7 @@ static void __init kasan_populate_pgd(pgd_t *pgdp,
}
}

- kasan_populate_pmd(pgdp, vaddr, next);
+ kasan_populate_pgd_next(pgdp, vaddr, next, early);
} while (pgdp++, vaddr = next, vaddr != end);
}

@@ -157,18 +246,54 @@ static void __init kasan_populate(void *start, void *end)
memset(start, KASAN_SHADOW_INIT, end - start);
}

+static void __init kasan_shallow_populate_pud(pgd_t *pgdp,
+ unsigned long vaddr, unsigned long end,
+ bool kasan_populate)
+{
+ unsigned long next;
+ pud_t *pudp, *base_pud;
+ pmd_t *base_pmd;
+ bool is_kasan_pmd;
+
+ base_pud = (pud_t *)pgd_page_vaddr(*pgdp);
+ pudp = base_pud + pud_index(vaddr);
+
+ if (kasan_populate)
+ memcpy(base_pud, (void *)kasan_early_shadow_pgd_next,
+ sizeof(pud_t) * PTRS_PER_PUD);
+
+ do {
+ next = pud_addr_end(vaddr, end);
+ is_kasan_pmd = (pud_pgtable(*pudp) == lm_alias(kasan_early_shadow_pmd));
+
+ if (is_kasan_pmd) {
+ base_pmd = memblock_alloc(PAGE_SIZE, PAGE_SIZE);
+ set_pud(pudp, pfn_pud(PFN_DOWN(__pa(base_pmd)), PAGE_TABLE));
+ }
+ } while (pudp++, vaddr = next, vaddr != end);
+}
+
static void __init kasan_shallow_populate_pgd(unsigned long vaddr, unsigned long end)
{
unsigned long next;
void *p;
pgd_t *pgd_k = pgd_offset_k(vaddr);
+ bool is_kasan_pgd_next;

do {
next = pgd_addr_end(vaddr, end);
- if (pgd_page_vaddr(*pgd_k) == (unsigned long)lm_alias(kasan_early_shadow_pmd)) {
+ is_kasan_pgd_next = (pgd_page_vaddr(*pgd_k) ==
+ (unsigned long)lm_alias(kasan_early_shadow_pgd_next));
+
+ if (is_kasan_pgd_next) {
p = memblock_alloc(PAGE_SIZE, PAGE_SIZE);
set_pgd(pgd_k, pfn_pgd(PFN_DOWN(__pa(p)), PAGE_TABLE));
}
+
+ if (IS_ALIGNED(vaddr, PGDIR_SIZE) && (next - vaddr) >= PGDIR_SIZE)
+ continue;
+
+ kasan_shallow_populate_pud(pgd_k, vaddr, next, is_kasan_pgd_next);
} while (pgd_k++, vaddr = next, vaddr != end);
}

diff --git a/drivers/firmware/efi/libstub/efi-stub.c b/drivers/firmware/efi/libstub/efi-stub.c
index 26e69788f27a..b3db5d91ed38 100644
--- a/drivers/firmware/efi/libstub/efi-stub.c
+++ b/drivers/firmware/efi/libstub/efi-stub.c
@@ -40,6 +40,8 @@

#ifdef CONFIG_ARM64
# define EFI_RT_VIRTUAL_LIMIT DEFAULT_MAP_WINDOW_64
+#elif defined(CONFIG_RISCV)
+# define EFI_RT_VIRTUAL_LIMIT TASK_SIZE_MIN
#else
# define EFI_RT_VIRTUAL_LIMIT TASK_SIZE
#endif
--
2.32.0
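The set_satp_mode() sequence above relies on satp being a WARL register: a write selecting an unsupported MODE simply has no effect, so reading the CSR back tells the kernel whether sv48 stuck. A minimal host-side sketch of that probe-and-fallback decision (the register model and helper names are hypothetical, not kernel code):

```c
#include <assert.h>
#include <stdint.h>

#define SATP_MODE_39	0x8000000000000000ULL
#define SATP_MODE_48	0x9000000000000000ULL

/* Model of a WARL satp: a write selecting an unsupported mode is ignored. */
static uint64_t satp_write_model(uint64_t cur, uint64_t val, int sv48_ok)
{
	uint64_t mode = val >> 60;

	if (mode == 9 && !sv48_ok)
		return cur;		/* write had no effect */

	return val;
}

/* Try sv48 first; fall back to sv39 when the read-back value differs. */
static uint64_t probe_satp_mode(int sv48_ok)
{
	uint64_t identity_satp = 0x1234ULL | SATP_MODE_48; /* fake early_pg_dir PFN */
	uint64_t hw_satp = satp_write_model(0, identity_satp, sv48_ok);

	return (hw_satp == identity_satp) ? SATP_MODE_48 : SATP_MODE_39;
}
```

On real hardware the same comparison drives disable_pgtable_l4(), which also rewrites satp_mode and kernel_map.page_offset for the 3-level layout.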


2021-12-06 10:55:30

by Alexandre Ghiti

[permalink] [raw]
Subject: [PATCH v3 08/13] riscv: Use pgtable_l4_enabled to output mmu_type in cpuinfo

Now that the MMU type is determined at runtime by probing the SATP
register, use the global variable pgtable_l4_enabled to output the
MMU type of the processor through /proc/cpuinfo instead of relying on
device tree information.

Signed-off-by: Alexandre Ghiti <[email protected]>
Reviewed-by: Anup Patel <[email protected]>
Reviewed-by: Palmer Dabbelt <[email protected]>
---
arch/riscv/kernel/cpu.c | 23 ++++++++++++-----------
1 file changed, 12 insertions(+), 11 deletions(-)

diff --git a/arch/riscv/kernel/cpu.c b/arch/riscv/kernel/cpu.c
index 6d59e6906fdd..dea9b1c31889 100644
--- a/arch/riscv/kernel/cpu.c
+++ b/arch/riscv/kernel/cpu.c
@@ -7,6 +7,7 @@
#include <linux/seq_file.h>
#include <linux/of.h>
#include <asm/smp.h>
+#include <asm/pgtable.h>

/*
* Returns the hart ID of the given device tree node, or -ENODEV if the node
@@ -70,18 +71,19 @@ static void print_isa(struct seq_file *f, const char *isa)
seq_puts(f, "\n");
}

-static void print_mmu(struct seq_file *f, const char *mmu_type)
+static void print_mmu(struct seq_file *f)
{
+ char sv_type[16];
+
#if defined(CONFIG_32BIT)
- if (strcmp(mmu_type, "riscv,sv32") != 0)
- return;
+ strncpy(sv_type, "sv32", 5);
#elif defined(CONFIG_64BIT)
- if (strcmp(mmu_type, "riscv,sv39") != 0 &&
- strcmp(mmu_type, "riscv,sv48") != 0)
- return;
+ if (pgtable_l4_enabled)
+ strncpy(sv_type, "sv48", 5);
+ else
+ strncpy(sv_type, "sv39", 5);
#endif
-
- seq_printf(f, "mmu\t\t: %s\n", mmu_type+6);
+ seq_printf(f, "mmu\t\t: %s\n", sv_type);
}

static void *c_start(struct seq_file *m, loff_t *pos)
@@ -106,14 +108,13 @@ static int c_show(struct seq_file *m, void *v)
{
unsigned long cpu_id = (unsigned long)v - 1;
struct device_node *node = of_get_cpu_node(cpu_id, NULL);
- const char *compat, *isa, *mmu;
+ const char *compat, *isa;

seq_printf(m, "processor\t: %lu\n", cpu_id);
seq_printf(m, "hart\t\t: %lu\n", cpuid_to_hartid_map(cpu_id));
if (!of_property_read_string(node, "riscv,isa", &isa))
print_isa(m, isa);
- if (!of_property_read_string(node, "mmu-type", &mmu))
- print_mmu(m, mmu);
+ print_mmu(m);
if (!of_property_read_string(node, "compatible", &compat)
&& strcmp(compat, "riscv"))
seq_printf(m, "uarch\t\t: %s\n", compat);
--
2.32.0
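The pgtable_l4_enabled flag read above is the same runtime switch that folds the pud level everywhere else in the series (p4d_populate, pud_offset, set_p4d, ...). The folding pattern itself can be sketched as follows (illustrative types and names, not the kernel's):

```c
#include <assert.h>
#include <stddef.h>

/* Runtime level-folding sketch: when pgtable_l4_enabled is false, the
 * pud level collapses into the level above, so a pud_offset-style walk
 * must return the incoming entry pointer unchanged. */
typedef struct { unsigned long val; } entry_t;

static int pgtable_l4_enabled = 1;

static entry_t *pud_offset_sketch(entry_t *p4d, entry_t *pud_table,
				  size_t index)
{
	if (pgtable_l4_enabled)
		return &pud_table[index];	/* real 4-level descent */

	return p4d;				/* folded: the p4d entry is the pud */
}
```

This is why a single kernel image can run in both modes: every level-4 accessor degenerates to a pass-through when the hardware only supports sv39.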


2021-12-06 10:56:31

by Alexandre Ghiti

[permalink] [raw]
Subject: [PATCH v3 09/13] riscv: Explicit comment about user virtual address space size

Define precisely the size of the user-accessible virtual address space
for the sv32/39/48 MMU types and explain why the whole virtual address
space is split into two equal halves between kernel and user space.

Signed-off-by: Alexandre Ghiti <[email protected]>
Reviewed-by: Anup Patel <[email protected]>
Reviewed-by: Palmer Dabbelt <[email protected]>
---
arch/riscv/include/asm/pgtable.h | 9 +++++++++
1 file changed, 9 insertions(+)

diff --git a/arch/riscv/include/asm/pgtable.h b/arch/riscv/include/asm/pgtable.h
index e1c74ef4ead2..fe1701329237 100644
--- a/arch/riscv/include/asm/pgtable.h
+++ b/arch/riscv/include/asm/pgtable.h
@@ -677,6 +677,15 @@ static inline pmd_t pmdp_establish(struct vm_area_struct *vma,
/*
* Task size is 0x4000000000 for RV64 or 0x9fc00000 for RV32.
* Note that PGDIR_SIZE must evenly divide TASK_SIZE.
+ * Task size is:
* -     0x9fc00000 (~2.5GB) for RV32.
* -   0x4000000000 ( 256GB) for RV64 using SV39 mmu
* - 0x800000000000 ( 128TB) for RV64 using SV48 mmu
+ *
+ * Note that PGDIR_SIZE must evenly divide TASK_SIZE since "RISC-V
+ * Instruction Set Manual Volume II: Privileged Architecture" states that
+ * "load and store effective addresses, which are 64bits, must have bits
+ * 63–48 all equal to bit 47, or else a page-fault exception will occur."
*/
#ifdef CONFIG_64BIT
#define TASK_SIZE (PGDIR_SIZE * PTRS_PER_PGD / 2)
--
2.32.0
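The sizes quoted in the comment follow directly from TASK_SIZE = PGDIR_SIZE * PTRS_PER_PGD / 2: user space owns the lower half of the page global directory. A quick sketch checking the RV64 arithmetic (assuming 4K pages, i.e. PTRS_PER_PGD = 512):

```c
#include <assert.h>
#include <stdint.h>

/* TASK_SIZE = PGDIR_SIZE * PTRS_PER_PGD / 2: half of the PGD entries
 * map user space, the other half the kernel. */
static uint64_t task_size(unsigned int pgdir_shift)
{
	uint64_t pgdir_size = 1ULL << pgdir_shift;	/* region per PGD entry */
	uint64_t ptrs_per_pgd = 512;			/* 4K page / 8-byte pgd_t */

	return pgdir_size * ptrs_per_pgd / 2;
}
```

With PGDIR_SHIFT = 30 (sv39) this yields 0x4000000000 (256GB), and with PGDIR_SHIFT = 39 (sv48) it yields 0x800000000000 (128TB), matching the comment.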


2021-12-06 10:57:34

by Alexandre Ghiti

[permalink] [raw]
Subject: [PATCH v3 10/13] riscv: Improve virtual kernel memory layout dump

With the arrival of sv48 and its much larger address space, it would be
cumbersome to statically define the unit used to print the different
portions of the virtual memory layout: instead, determine it dynamically.

Signed-off-by: Alexandre Ghiti <[email protected]>
---
arch/riscv/mm/init.c | 67 +++++++++++++++++++++++-------
drivers/pci/controller/pci-xgene.c | 2 +-
include/linux/sizes.h | 1 +
3 files changed, 54 insertions(+), 16 deletions(-)

diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c
index 6a19a1b1caf8..28de6ea0a720 100644
--- a/arch/riscv/mm/init.c
+++ b/arch/riscv/mm/init.c
@@ -79,37 +79,74 @@ static void __init zone_sizes_init(void)
}

#if defined(CONFIG_MMU) && defined(CONFIG_DEBUG_VM)
+
+#define LOG2_SZ_1K ilog2(SZ_1K)
+#define LOG2_SZ_1M ilog2(SZ_1M)
+#define LOG2_SZ_1G ilog2(SZ_1G)
+#define LOG2_SZ_1T ilog2(SZ_1T)
+
static inline void print_mlk(char *name, unsigned long b, unsigned long t)
{
pr_notice("%12s : 0x%08lx - 0x%08lx (%4ld kB)\n", name, b, t,
- (((t) - (b)) >> 10));
+ (((t) - (b)) >> LOG2_SZ_1K));
}

static inline void print_mlm(char *name, unsigned long b, unsigned long t)
{
pr_notice("%12s : 0x%08lx - 0x%08lx (%4ld MB)\n", name, b, t,
- (((t) - (b)) >> 20));
+ (((t) - (b)) >> LOG2_SZ_1M));
+}
+
+static inline void print_mlg(char *name, unsigned long b, unsigned long t)
+{
+ pr_notice("%12s : 0x%08lx - 0x%08lx (%4ld GB)\n", name, b, t,
+ (((t) - (b)) >> LOG2_SZ_1G));
+}
+
+#ifdef CONFIG_64BIT
+static inline void print_mlt(char *name, unsigned long b, unsigned long t)
+{
+ pr_notice("%12s : 0x%08lx - 0x%08lx (%4ld TB)\n", name, b, t,
+ (((t) - (b)) >> LOG2_SZ_1T));
+}
+#endif
+
+static inline void print_ml(char *name, unsigned long b, unsigned long t)
+{
+ unsigned long diff = t - b;
+
+#ifdef CONFIG_64BIT
+ if ((diff >> LOG2_SZ_1T) >= 10)
+ print_mlt(name, b, t);
+ else
+#endif
+ if ((diff >> LOG2_SZ_1G) >= 10)
+ print_mlg(name, b, t);
+ else if ((diff >> LOG2_SZ_1M) >= 10)
+ print_mlm(name, b, t);
+ else
+ print_mlk(name, b, t);
}

static void __init print_vm_layout(void)
{
pr_notice("Virtual kernel memory layout:\n");
- print_mlk("fixmap", (unsigned long)FIXADDR_START,
- (unsigned long)FIXADDR_TOP);
- print_mlm("pci io", (unsigned long)PCI_IO_START,
- (unsigned long)PCI_IO_END);
- print_mlm("vmemmap", (unsigned long)VMEMMAP_START,
- (unsigned long)VMEMMAP_END);
- print_mlm("vmalloc", (unsigned long)VMALLOC_START,
- (unsigned long)VMALLOC_END);
- print_mlm("lowmem", (unsigned long)PAGE_OFFSET,
- (unsigned long)high_memory);
+ print_ml("fixmap", (unsigned long)FIXADDR_START,
+ (unsigned long)FIXADDR_TOP);
+ print_ml("pci io", (unsigned long)PCI_IO_START,
+ (unsigned long)PCI_IO_END);
+ print_ml("vmemmap", (unsigned long)VMEMMAP_START,
+ (unsigned long)VMEMMAP_END);
+ print_ml("vmalloc", (unsigned long)VMALLOC_START,
+ (unsigned long)VMALLOC_END);
+ print_ml("lowmem", (unsigned long)PAGE_OFFSET,
+ (unsigned long)high_memory);
#ifdef CONFIG_64BIT
#ifdef CONFIG_KASAN
- print_mlm("kasan", KASAN_SHADOW_START, KASAN_SHADOW_END);
+ print_ml("kasan", KASAN_SHADOW_START, KASAN_SHADOW_END);
#endif
- print_mlm("kernel", (unsigned long)KERNEL_LINK_ADDR,
- (unsigned long)ADDRESS_SPACE_END);
+ print_ml("kernel", (unsigned long)KERNEL_LINK_ADDR,
+ (unsigned long)ADDRESS_SPACE_END);
#endif
}
#else
diff --git a/drivers/pci/controller/pci-xgene.c b/drivers/pci/controller/pci-xgene.c
index e64536047b65..187dcf8a9694 100644
--- a/drivers/pci/controller/pci-xgene.c
+++ b/drivers/pci/controller/pci-xgene.c
@@ -21,6 +21,7 @@
#include <linux/pci-ecam.h>
#include <linux/platform_device.h>
#include <linux/slab.h>
+#include <linux/sizes.h>

#include "../pci.h"

@@ -50,7 +51,6 @@
#define OB_LO_IO 0x00000002
#define XGENE_PCIE_VENDORID 0x10E8
#define XGENE_PCIE_DEVICEID 0xE004
-#define SZ_1T (SZ_1G*1024ULL)
#define PIPE_PHY_RATE_RD(src) ((0xc000 & (u32)(src)) >> 0xe)

#define XGENE_V1_PCI_EXP_CAP 0x40
diff --git a/include/linux/sizes.h b/include/linux/sizes.h
index 1ac79bcee2bb..0bc6cf394b08 100644
--- a/include/linux/sizes.h
+++ b/include/linux/sizes.h
@@ -47,6 +47,7 @@
#define SZ_8G _AC(0x200000000, ULL)
#define SZ_16G _AC(0x400000000, ULL)
#define SZ_32G _AC(0x800000000, ULL)
+#define SZ_1T _AC(0x10000000000, ULL)
#define SZ_64T _AC(0x400000000000, ULL)

#endif /* __LINUX_SIZES_H__ */
--
2.32.0


2021-12-06 10:58:35

by Alexandre Ghiti

Subject: [PATCH v3 11/13] Documentation: riscv: Add sv48 description to VM layout

sv48 was just introduced, so add its virtual memory layout to the
documentation.

Signed-off-by: Alexandre Ghiti <[email protected]>
---
Documentation/riscv/vm-layout.rst | 36 +++++++++++++++++++++++++++++++
1 file changed, 36 insertions(+)

diff --git a/Documentation/riscv/vm-layout.rst b/Documentation/riscv/vm-layout.rst
index 1bd687b97104..5b36e45fef60 100644
--- a/Documentation/riscv/vm-layout.rst
+++ b/Documentation/riscv/vm-layout.rst
@@ -61,3 +61,39 @@ RISC-V Linux Kernel SV39
ffffffff00000000 | -4 GB | ffffffff7fffffff | 2 GB | modules, BPF
ffffffff80000000 | -2 GB | ffffffffffffffff | 2 GB | kernel
__________________|____________|__________________|_________|____________________________________________________________
+
+
+RISC-V Linux Kernel SV48
+------------------------
+
+::
+
+ ========================================================================================================================
+ Start addr | Offset | End addr | Size | VM area description
+ ========================================================================================================================
+ | | | |
+ 0000000000000000 | 0 | 00007fffffffffff | 128 TB | user-space virtual memory, different per mm
+ __________________|____________|__________________|_________|___________________________________________________________
+ | | | |
+ 0000800000000000 | +128 TB | ffff7fffffffffff | ~16M TB | ... huge, almost 64 bits wide hole of non-canonical
+ | | | | virtual memory addresses up to the -128 TB
+ | | | | starting offset of kernel mappings.
+ __________________|____________|__________________|_________|___________________________________________________________
+ |
+ | Kernel-space virtual memory, shared between all processes:
+ ____________________________________________________________|___________________________________________________________
+ | | | |
+ ffff8d7ffee00000 | -114.5 TB | ffff8d7ffeffffff | 2 MB | fixmap
+ ffff8d7fff000000 | -114.5 TB | ffff8d7fffffffff | 16 MB | PCI io
+ ffff8d8000000000 | -114.5 TB | ffff8f7fffffffff | 2 TB | vmemmap
+ ffff8f8000000000 | -112.5 TB | ffffaf7fffffffff | 32 TB | vmalloc/ioremap space
+ ffffaf8000000000 | -80.5 TB | ffffef7fffffffff | 64 TB | direct mapping of all physical memory
+ ffffef8000000000 | -16.5 TB | fffffffeffffffff | 16.5 TB | kasan
+ __________________|____________|__________________|_________|____________________________________________________________
+ |
+ | Identical layout to the 39-bit one from here on:
+ ____________________________________________________________|____________________________________________________________
+ | | | |
+ ffffffff00000000 | -4 GB | ffffffff7fffffff | 2 GB | modules, BPF
+ ffffffff80000000 | -2 GB | ffffffffffffffff | 2 GB | kernel
+ __________________|____________|__________________|_________|____________________________________________________________
--
2.32.0


2021-12-06 10:59:34

by Alexandre Ghiti

Subject: [PATCH v3 12/13] riscv: Initialize thread pointer before calling C functions

The stack canary feature reads the canary value from the current task
structure, so the thread pointer register "tp" must be set before calling
any C function from head.S. By chance, setup_vm and all the functions it
calls do not currently seem to be subject to the canary check, but some
functions introduced in the following commits will be.

Fixes: f2c9699f65557a31 ("riscv: Add STACKPROTECTOR supported")
Signed-off-by: Alexandre Ghiti <[email protected]>
---
arch/riscv/kernel/head.S | 1 +
1 file changed, 1 insertion(+)

diff --git a/arch/riscv/kernel/head.S b/arch/riscv/kernel/head.S
index c3c0ed559770..86f7ee3d210d 100644
--- a/arch/riscv/kernel/head.S
+++ b/arch/riscv/kernel/head.S
@@ -302,6 +302,7 @@ clear_bss_done:
REG_S a0, (a2)

/* Initialize page tables and relocate to virtual addresses */
+ la tp, init_task
la sp, init_thread_union + THREAD_SIZE
XIP_FIXUP_OFFSET sp
#ifdef CONFIG_BUILTIN_DTB
--
2.32.0


2021-12-06 11:00:37

by Alexandre Ghiti

Subject: [PATCH v3 13/13] riscv: Allow user to downgrade to sv39 when hw supports sv48 if !KASAN

This is made possible by using the mmu-type property of the cpu nodes in
the device tree.

By default, the kernel boots with a 4-level page table if the hardware
supports it, but it can be worthwhile for the user to select a 3-level
page table instead, as it consumes less memory and is faster since it
requires fewer memory accesses on a TLB miss.

This functionality requires that kasan be disabled, since the fdt
functions called here are kasan instrumented and calling them with the
MMU off cannot work.

Signed-off-by: Alexandre Ghiti <[email protected]>
---
arch/riscv/mm/init.c | 32 ++++++++++++++++++++++++++++++--
1 file changed, 30 insertions(+), 2 deletions(-)

diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c
index 28de6ea0a720..299b5a44f902 100644
--- a/arch/riscv/mm/init.c
+++ b/arch/riscv/mm/init.c
@@ -633,10 +633,38 @@ static void __init disable_pgtable_l4(void)
* then read SATP to see if the configuration was taken into account
* meaning sv48 is supported.
*/
-static __init void set_satp_mode(void)
+static __init void set_satp_mode(uintptr_t dtb_pa)
{
u64 identity_satp, hw_satp;
uintptr_t set_satp_mode_pmd;
+#ifndef CONFIG_KASAN
+ /*
+ * The below fdt functions are kasan instrumented: since at this point
+ * there is no mapping for the kasan shadow memory, they can't be used
+ * when kasan is enabled, otherwise they trap.
+ */
+ int cpus_node;
+
+ /* Check if the user asked for sv39 explicitly in the device tree */
+ cpus_node = fdt_path_offset((void *)dtb_pa, "/cpus");
+ if (cpus_node >= 0) {
+ int node;
+
+ fdt_for_each_subnode(node, (void *)dtb_pa, cpus_node) {
+ const char *mmu_type = fdt_getprop((void *)dtb_pa, node,
+ "mmu-type", NULL);
+ if (!mmu_type)
+ continue;
+
+ if (!strcmp(mmu_type, "riscv,sv39")) {
+ disable_pgtable_l4();
+ return;
+ }
+
+ break;
+ }
+ }
+#endif

set_satp_mode_pmd = ((unsigned long)set_satp_mode) & PMD_MASK;
create_pgd_mapping(early_pg_dir,
@@ -838,7 +866,7 @@ asmlinkage void __init setup_vm(uintptr_t dtb_pa)
#endif

#if defined(CONFIG_64BIT) && !defined(CONFIG_XIP_KERNEL)
- set_satp_mode();
+ set_satp_mode(dtb_pa);
#endif

kernel_map.va_pa_offset = PAGE_OFFSET - kernel_map.phys_addr;
--
2.32.0


2021-12-06 11:05:43

by Alexandre Ghiti

Subject: Re: [PATCH v3 07/13] riscv: Implement sv48 support

On 12/6/21 11:46, Alexandre Ghiti wrote:
> By adding a new 4th page table level, allow the 64-bit kernel to address
> 2^48 bytes of virtual address space: in practice, that offers 128TB of
> virtual address space to userspace and allows up to 64TB of physical
> memory.
>
> If the underlying hardware does not support sv48, we automatically fall
> back to a standard 3-level page table by folding the new PUD level into
> the PGDIR level. To detect HW capabilities at runtime, we rely on the
> fact that SATP ignores writes with an unsupported mode.
>
> Signed-off-by: Alexandre Ghiti <[email protected]>
> ---
> arch/riscv/Kconfig | 4 +-
> arch/riscv/include/asm/csr.h | 3 +-
> arch/riscv/include/asm/fixmap.h | 1 +
> arch/riscv/include/asm/kasan.h | 6 +-
> arch/riscv/include/asm/page.h | 14 ++
> arch/riscv/include/asm/pgalloc.h | 40 +++++
> arch/riscv/include/asm/pgtable-64.h | 108 +++++++++++-
> arch/riscv/include/asm/pgtable.h | 24 ++-
> arch/riscv/kernel/head.S | 3 +-
> arch/riscv/mm/context.c | 4 +-
> arch/riscv/mm/init.c | 212 +++++++++++++++++++++---
> arch/riscv/mm/kasan_init.c | 137 ++++++++++++++-
> drivers/firmware/efi/libstub/efi-stub.c | 2 +
> 13 files changed, 514 insertions(+), 44 deletions(-)
>
> diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
> index ac6c0cd9bc29..d28fe0148e13 100644
> --- a/arch/riscv/Kconfig
> +++ b/arch/riscv/Kconfig
> @@ -150,7 +150,7 @@ config PAGE_OFFSET
> hex
> default 0xC0000000 if 32BIT
> default 0x80000000 if 64BIT && !MMU
> - default 0xffffffd800000000 if 64BIT
> + default 0xffffaf8000000000 if 64BIT
>
> config KASAN_SHADOW_OFFSET
> hex
> @@ -201,7 +201,7 @@ config FIX_EARLYCON_MEM
>
> config PGTABLE_LEVELS
> int
> - default 3 if 64BIT
> + default 4 if 64BIT
> default 2
>
> config LOCKDEP_SUPPORT
> diff --git a/arch/riscv/include/asm/csr.h b/arch/riscv/include/asm/csr.h
> index 87ac65696871..3fdb971c7896 100644
> --- a/arch/riscv/include/asm/csr.h
> +++ b/arch/riscv/include/asm/csr.h
> @@ -40,14 +40,13 @@
> #ifndef CONFIG_64BIT
> #define SATP_PPN _AC(0x003FFFFF, UL)
> #define SATP_MODE_32 _AC(0x80000000, UL)
> -#define SATP_MODE SATP_MODE_32
> #define SATP_ASID_BITS 9
> #define SATP_ASID_SHIFT 22
> #define SATP_ASID_MASK _AC(0x1FF, UL)
> #else
> #define SATP_PPN _AC(0x00000FFFFFFFFFFF, UL)
> #define SATP_MODE_39 _AC(0x8000000000000000, UL)
> -#define SATP_MODE SATP_MODE_39
> +#define SATP_MODE_48 _AC(0x9000000000000000, UL)
> #define SATP_ASID_BITS 16
> #define SATP_ASID_SHIFT 44
> #define SATP_ASID_MASK _AC(0xFFFF, UL)
> diff --git a/arch/riscv/include/asm/fixmap.h b/arch/riscv/include/asm/fixmap.h
> index 54cbf07fb4e9..58a718573ad6 100644
> --- a/arch/riscv/include/asm/fixmap.h
> +++ b/arch/riscv/include/asm/fixmap.h
> @@ -24,6 +24,7 @@ enum fixed_addresses {
> FIX_HOLE,
> FIX_PTE,
> FIX_PMD,
> + FIX_PUD,
> FIX_TEXT_POKE1,
> FIX_TEXT_POKE0,
> FIX_EARLYCON_MEM_BASE,
> diff --git a/arch/riscv/include/asm/kasan.h b/arch/riscv/include/asm/kasan.h
> index 743e6ff57996..0b85e363e778 100644
> --- a/arch/riscv/include/asm/kasan.h
> +++ b/arch/riscv/include/asm/kasan.h
> @@ -28,7 +28,11 @@
> #define KASAN_SHADOW_SCALE_SHIFT 3
>
> #define KASAN_SHADOW_SIZE (UL(1) << ((VA_BITS - 1) - KASAN_SHADOW_SCALE_SHIFT))
> -#define KASAN_SHADOW_START (KASAN_SHADOW_END - KASAN_SHADOW_SIZE)
> +/*
> + * Depending on the size of the virtual address space, the region may not be
> + * aligned on PGDIR_SIZE, so force its alignment to ease its population.
> + */
> +#define KASAN_SHADOW_START ((KASAN_SHADOW_END - KASAN_SHADOW_SIZE) & PGDIR_MASK)
> #define KASAN_SHADOW_END MODULES_LOWEST_VADDR
> #define KASAN_SHADOW_OFFSET _AC(CONFIG_KASAN_SHADOW_OFFSET, UL)
>
> diff --git a/arch/riscv/include/asm/page.h b/arch/riscv/include/asm/page.h
> index e03559f9b35e..d089fe46f7d8 100644
> --- a/arch/riscv/include/asm/page.h
> +++ b/arch/riscv/include/asm/page.h
> @@ -31,7 +31,20 @@
> * When not using MMU this corresponds to the first free page in
> * physical memory (aligned on a page boundary).
> */
> +#ifdef CONFIG_64BIT
> +#ifdef CONFIG_MMU
> +#define PAGE_OFFSET kernel_map.page_offset
> +#else
> +#define PAGE_OFFSET _AC(CONFIG_PAGE_OFFSET, UL)
> +#endif
> +/*
> + * By default, CONFIG_PAGE_OFFSET value corresponds to SV48 address space so
> + * define the PAGE_OFFSET value for SV39.
> + */
> +#define PAGE_OFFSET_L3 _AC(0xffffffd800000000, UL)
> +#else
> #define PAGE_OFFSET _AC(CONFIG_PAGE_OFFSET, UL)
> +#endif /* CONFIG_64BIT */
>
> /*
> * Half of the kernel address space (half of the entries of the page global
> @@ -90,6 +103,7 @@ extern unsigned long riscv_pfn_base;
> #endif /* CONFIG_MMU */
>
> struct kernel_mapping {
> + unsigned long page_offset;
> unsigned long virt_addr;
> uintptr_t phys_addr;
> uintptr_t size;
> diff --git a/arch/riscv/include/asm/pgalloc.h b/arch/riscv/include/asm/pgalloc.h
> index 0af6933a7100..11823004b87a 100644
> --- a/arch/riscv/include/asm/pgalloc.h
> +++ b/arch/riscv/include/asm/pgalloc.h
> @@ -11,6 +11,8 @@
> #include <asm/tlb.h>
>
> #ifdef CONFIG_MMU
> +#define __HAVE_ARCH_PUD_ALLOC_ONE
> +#define __HAVE_ARCH_PUD_FREE
> #include <asm-generic/pgalloc.h>
>
> static inline void pmd_populate_kernel(struct mm_struct *mm,
> @@ -36,6 +38,44 @@ static inline void pud_populate(struct mm_struct *mm, pud_t *pud, pmd_t *pmd)
>
> set_pud(pud, __pud((pfn << _PAGE_PFN_SHIFT) | _PAGE_TABLE));
> }
> +
> +static inline void p4d_populate(struct mm_struct *mm, p4d_t *p4d, pud_t *pud)
> +{
> + if (pgtable_l4_enabled) {
> + unsigned long pfn = virt_to_pfn(pud);
> +
> + set_p4d(p4d, __p4d((pfn << _PAGE_PFN_SHIFT) | _PAGE_TABLE));
> + }
> +}
> +
> +static inline void p4d_populate_safe(struct mm_struct *mm, p4d_t *p4d,
> + pud_t *pud)
> +{
> + if (pgtable_l4_enabled) {
> + unsigned long pfn = virt_to_pfn(pud);
> +
> + set_p4d_safe(p4d,
> + __p4d((pfn << _PAGE_PFN_SHIFT) | _PAGE_TABLE));
> + }
> +}
> +
> +#define pud_alloc_one pud_alloc_one
> +static inline pud_t *pud_alloc_one(struct mm_struct *mm, unsigned long addr)
> +{
> + if (pgtable_l4_enabled)
> + return __pud_alloc_one(mm, addr);
> +
> + return NULL;
> +}
> +
> +#define pud_free pud_free
> +static inline void pud_free(struct mm_struct *mm, pud_t *pud)
> +{
> + if (pgtable_l4_enabled)
> + __pud_free(mm, pud);
> +}
> +
> +#define __pud_free_tlb(tlb, pud, addr) pud_free((tlb)->mm, pud)
> #endif /* __PAGETABLE_PMD_FOLDED */
>
> static inline pgd_t *pgd_alloc(struct mm_struct *mm)
> diff --git a/arch/riscv/include/asm/pgtable-64.h b/arch/riscv/include/asm/pgtable-64.h
> index 228261aa9628..bbbdd66e5e2f 100644
> --- a/arch/riscv/include/asm/pgtable-64.h
> +++ b/arch/riscv/include/asm/pgtable-64.h
> @@ -8,16 +8,36 @@
>
> #include <linux/const.h>
>
> -#define PGDIR_SHIFT 30
> +extern bool pgtable_l4_enabled;
> +
> +#define PGDIR_SHIFT_L3 30
> +#define PGDIR_SHIFT_L4 39
> +#define PGDIR_SIZE_L3 (_AC(1, UL) << PGDIR_SHIFT_L3)
> +
> +#define PGDIR_SHIFT (pgtable_l4_enabled ? PGDIR_SHIFT_L4 : PGDIR_SHIFT_L3)
> /* Size of region mapped by a page global directory */
> #define PGDIR_SIZE (_AC(1, UL) << PGDIR_SHIFT)
> #define PGDIR_MASK (~(PGDIR_SIZE - 1))
>
> +/* pud is folded into pgd in case of 3-level page table */
> +#define PUD_SHIFT 30
> +#define PUD_SIZE (_AC(1, UL) << PUD_SHIFT)
> +#define PUD_MASK (~(PUD_SIZE - 1))
> +
> #define PMD_SHIFT 21
> /* Size of region mapped by a page middle directory */
> #define PMD_SIZE (_AC(1, UL) << PMD_SHIFT)
> #define PMD_MASK (~(PMD_SIZE - 1))
>
> +/* Page Upper Directory entry */
> +typedef struct {
> + unsigned long pud;
> +} pud_t;
> +
> +#define pud_val(x) ((x).pud)
> +#define __pud(x) ((pud_t) { (x) })
> +#define PTRS_PER_PUD (PAGE_SIZE / sizeof(pud_t))
> +
> /* Page Middle Directory entry */
> typedef struct {
> unsigned long pmd;
> @@ -59,6 +79,16 @@ static inline void pud_clear(pud_t *pudp)
> set_pud(pudp, __pud(0));
> }
>
> +static inline pud_t pfn_pud(unsigned long pfn, pgprot_t prot)
> +{
> + return __pud((pfn << _PAGE_PFN_SHIFT) | pgprot_val(prot));
> +}
> +
> +static inline unsigned long _pud_pfn(pud_t pud)
> +{
> + return pud_val(pud) >> _PAGE_PFN_SHIFT;
> +}
> +
> static inline pmd_t *pud_pgtable(pud_t pud)
> {
> return (pmd_t *)pfn_to_virt(pud_val(pud) >> _PAGE_PFN_SHIFT);
> @@ -69,6 +99,17 @@ static inline struct page *pud_page(pud_t pud)
> return pfn_to_page(pud_val(pud) >> _PAGE_PFN_SHIFT);
> }
>
> +#define mm_pud_folded mm_pud_folded
> +static inline bool mm_pud_folded(struct mm_struct *mm)
> +{
> + if (pgtable_l4_enabled)
> + return false;
> +
> + return true;
> +}
> +
> +#define pmd_index(addr) (((addr) >> PMD_SHIFT) & (PTRS_PER_PMD - 1))
> +
> static inline pmd_t pfn_pmd(unsigned long pfn, pgprot_t prot)
> {
> return __pmd((pfn << _PAGE_PFN_SHIFT) | pgprot_val(prot));
> @@ -84,4 +125,69 @@ static inline unsigned long _pmd_pfn(pmd_t pmd)
> #define pmd_ERROR(e) \
> pr_err("%s:%d: bad pmd %016lx.\n", __FILE__, __LINE__, pmd_val(e))
>
> +#define pud_ERROR(e) \
> + pr_err("%s:%d: bad pud %016lx.\n", __FILE__, __LINE__, pud_val(e))
> +
> +static inline void set_p4d(p4d_t *p4dp, p4d_t p4d)
> +{
> + if (pgtable_l4_enabled)
> + *p4dp = p4d;
> + else
> + set_pud((pud_t *)p4dp, (pud_t){ p4d_val(p4d) });
> +}
> +
> +static inline int p4d_none(p4d_t p4d)
> +{
> + if (pgtable_l4_enabled)
> + return (p4d_val(p4d) == 0);
> +
> + return 0;
> +}
> +
> +static inline int p4d_present(p4d_t p4d)
> +{
> + if (pgtable_l4_enabled)
> + return (p4d_val(p4d) & _PAGE_PRESENT);
> +
> + return 1;
> +}
> +
> +static inline int p4d_bad(p4d_t p4d)
> +{
> + if (pgtable_l4_enabled)
> + return !p4d_present(p4d);
> +
> + return 0;
> +}
> +
> +static inline void p4d_clear(p4d_t *p4d)
> +{
> + if (pgtable_l4_enabled)
> + set_p4d(p4d, __p4d(0));
> +}
> +
> +static inline pud_t *p4d_pgtable(p4d_t p4d)
> +{
> + if (pgtable_l4_enabled)
> + return (pud_t *)pfn_to_virt(p4d_val(p4d) >> _PAGE_PFN_SHIFT);
> +
> + return (pud_t *)pud_pgtable((pud_t) { p4d_val(p4d) });
> +}
> +
> +static inline struct page *p4d_page(p4d_t p4d)
> +{
> + return pfn_to_page(p4d_val(p4d) >> _PAGE_PFN_SHIFT);
> +}
> +
> +#define pud_index(addr) (((addr) >> PUD_SHIFT) & (PTRS_PER_PUD - 1))
> +
> +#define pud_offset pud_offset
> +static inline pud_t *pud_offset(p4d_t *p4d, unsigned long address)
> +{
> + if (pgtable_l4_enabled)
> + return p4d_pgtable(*p4d) + pud_index(address);
> +
> + return (pud_t *)p4d;
> +}
> +
> #endif /* _ASM_RISCV_PGTABLE_64_H */
> diff --git a/arch/riscv/include/asm/pgtable.h b/arch/riscv/include/asm/pgtable.h
> index e1a52e22ad7e..e1c74ef4ead2 100644
> --- a/arch/riscv/include/asm/pgtable.h
> +++ b/arch/riscv/include/asm/pgtable.h
> @@ -51,7 +51,7 @@
> * position vmemmap directly below the VMALLOC region.
> */
> #ifdef CONFIG_64BIT
> -#define VA_BITS 39
> +#define VA_BITS (pgtable_l4_enabled ? 48 : 39)
> #else
> #define VA_BITS 32
> #endif
> @@ -90,8 +90,7 @@
>
> #ifndef __ASSEMBLY__
>
> -/* Page Upper Directory not used in RISC-V */
> -#include <asm-generic/pgtable-nopud.h>
> +#include <asm-generic/pgtable-nop4d.h>
> #include <asm/page.h>
> #include <asm/tlbflush.h>
> #include <linux/mm_types.h>
> @@ -113,6 +112,17 @@
> #define XIP_FIXUP(addr) (addr)
> #endif /* CONFIG_XIP_KERNEL */
>
> +struct pt_alloc_ops {
> + pte_t *(*get_pte_virt)(phys_addr_t pa);
> + phys_addr_t (*alloc_pte)(uintptr_t va);
> +#ifndef __PAGETABLE_PMD_FOLDED
> + pmd_t *(*get_pmd_virt)(phys_addr_t pa);
> + phys_addr_t (*alloc_pmd)(uintptr_t va);
> + pud_t *(*get_pud_virt)(phys_addr_t pa);
> + phys_addr_t (*alloc_pud)(uintptr_t va);
> +#endif
> +};
> +
> #ifdef CONFIG_MMU
> /* Number of entries in the page global directory */
> #define PTRS_PER_PGD (PAGE_SIZE / sizeof(pgd_t))
> @@ -669,9 +679,11 @@ static inline pmd_t pmdp_establish(struct vm_area_struct *vma,
> * Note that PGDIR_SIZE must evenly divide TASK_SIZE.
> */
> #ifdef CONFIG_64BIT
> -#define TASK_SIZE (PGDIR_SIZE * PTRS_PER_PGD / 2)
> +#define TASK_SIZE (PGDIR_SIZE * PTRS_PER_PGD / 2)
> +#define TASK_SIZE_MIN (PGDIR_SIZE_L3 * PTRS_PER_PGD / 2)
> #else
> -#define TASK_SIZE FIXADDR_START
> +#define TASK_SIZE FIXADDR_START
> +#define TASK_SIZE_MIN TASK_SIZE
> #endif
>
> #else /* CONFIG_MMU */
> @@ -697,6 +709,8 @@ extern uintptr_t _dtb_early_pa;
> #define dtb_early_va _dtb_early_va
> #define dtb_early_pa _dtb_early_pa
> #endif /* CONFIG_XIP_KERNEL */
> +extern u64 satp_mode;
> +extern bool pgtable_l4_enabled;
>
> void paging_init(void);
> void misc_mem_init(void);
> diff --git a/arch/riscv/kernel/head.S b/arch/riscv/kernel/head.S
> index 52c5ff9804c5..c3c0ed559770 100644
> --- a/arch/riscv/kernel/head.S
> +++ b/arch/riscv/kernel/head.S
> @@ -95,7 +95,8 @@ relocate:
>
> /* Compute satp for kernel page tables, but don't load it yet */
> srl a2, a0, PAGE_SHIFT
> - li a1, SATP_MODE
> + la a1, satp_mode
> + REG_L a1, 0(a1)
> or a2, a2, a1
>
> /*
> diff --git a/arch/riscv/mm/context.c b/arch/riscv/mm/context.c
> index ee3459cb6750..a7246872bd30 100644
> --- a/arch/riscv/mm/context.c
> +++ b/arch/riscv/mm/context.c
> @@ -192,7 +192,7 @@ static void set_mm_asid(struct mm_struct *mm, unsigned int cpu)
> switch_mm_fast:
> csr_write(CSR_SATP, virt_to_pfn(mm->pgd) |
> ((cntx & asid_mask) << SATP_ASID_SHIFT) |
> - SATP_MODE);
> + satp_mode);
>
> if (need_flush_tlb)
> local_flush_tlb_all();
> @@ -201,7 +201,7 @@ static void set_mm_asid(struct mm_struct *mm, unsigned int cpu)
> static void set_mm_noasid(struct mm_struct *mm)
> {
> /* Switch the page table and blindly nuke entire local TLB */
> - csr_write(CSR_SATP, virt_to_pfn(mm->pgd) | SATP_MODE);
> + csr_write(CSR_SATP, virt_to_pfn(mm->pgd) | satp_mode);
> local_flush_tlb_all();
> }
>
> diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c
> index 1552226fb6bd..6a19a1b1caf8 100644
> --- a/arch/riscv/mm/init.c
> +++ b/arch/riscv/mm/init.c
> @@ -37,6 +37,17 @@ EXPORT_SYMBOL(kernel_map);
> #define kernel_map (*(struct kernel_mapping *)XIP_FIXUP(&kernel_map))
> #endif
>
> +#ifdef CONFIG_64BIT
> +u64 satp_mode = !IS_ENABLED(CONFIG_XIP_KERNEL) ? SATP_MODE_48 : SATP_MODE_39;
> +#else
> +u64 satp_mode = SATP_MODE_32;
> +#endif
> +EXPORT_SYMBOL(satp_mode);
> +
> +bool pgtable_l4_enabled = IS_ENABLED(CONFIG_64BIT) && !IS_ENABLED(CONFIG_XIP_KERNEL) ?
> + true : false;
> +EXPORT_SYMBOL(pgtable_l4_enabled);
> +
> phys_addr_t phys_ram_base __ro_after_init;
> EXPORT_SYMBOL(phys_ram_base);
>
> @@ -53,15 +64,6 @@ extern char _start[];
> void *_dtb_early_va __initdata;
> uintptr_t _dtb_early_pa __initdata;
>
> -struct pt_alloc_ops {
> - pte_t *(*get_pte_virt)(phys_addr_t pa);
> - phys_addr_t (*alloc_pte)(uintptr_t va);
> -#ifndef __PAGETABLE_PMD_FOLDED
> - pmd_t *(*get_pmd_virt)(phys_addr_t pa);
> - phys_addr_t (*alloc_pmd)(uintptr_t va);
> -#endif
> -};
> -
> static phys_addr_t dma32_phys_limit __initdata;
>
> static void __init zone_sizes_init(void)
> @@ -222,7 +224,7 @@ static void __init setup_bootmem(void)
> }
>
> #ifdef CONFIG_MMU
> -static struct pt_alloc_ops _pt_ops __initdata;
> +struct pt_alloc_ops _pt_ops __initdata;
>
> #ifdef CONFIG_XIP_KERNEL
> #define pt_ops (*(struct pt_alloc_ops *)XIP_FIXUP(&_pt_ops))
> @@ -238,6 +240,7 @@ pgd_t trampoline_pg_dir[PTRS_PER_PGD] __page_aligned_bss;
> static pte_t fixmap_pte[PTRS_PER_PTE] __page_aligned_bss;
>
> pgd_t early_pg_dir[PTRS_PER_PGD] __initdata __aligned(PAGE_SIZE);
> +static pud_t __maybe_unused early_dtb_pud[PTRS_PER_PUD] __initdata __aligned(PAGE_SIZE);
> static pmd_t __maybe_unused early_dtb_pmd[PTRS_PER_PMD] __initdata __aligned(PAGE_SIZE);
>
> #ifdef CONFIG_XIP_KERNEL
> @@ -326,6 +329,16 @@ static pmd_t early_pmd[PTRS_PER_PMD] __initdata __aligned(PAGE_SIZE);
> #define early_pmd ((pmd_t *)XIP_FIXUP(early_pmd))
> #endif /* CONFIG_XIP_KERNEL */
>
> +static pud_t trampoline_pud[PTRS_PER_PUD] __page_aligned_bss;
> +static pud_t fixmap_pud[PTRS_PER_PUD] __page_aligned_bss;
> +static pud_t early_pud[PTRS_PER_PUD] __initdata __aligned(PAGE_SIZE);
> +
> +#ifdef CONFIG_XIP_KERNEL
> +#define trampoline_pud ((pud_t *)XIP_FIXUP(trampoline_pud))
> +#define fixmap_pud ((pud_t *)XIP_FIXUP(fixmap_pud))
> +#define early_pud ((pud_t *)XIP_FIXUP(early_pud))
> +#endif /* CONFIG_XIP_KERNEL */
> +
> static pmd_t *__init get_pmd_virt_early(phys_addr_t pa)
> {
> /* Before MMU is enabled */
> @@ -345,7 +358,7 @@ static pmd_t *__init get_pmd_virt_late(phys_addr_t pa)
>
> static phys_addr_t __init alloc_pmd_early(uintptr_t va)
> {
> - BUG_ON((va - kernel_map.virt_addr) >> PGDIR_SHIFT);
> + BUG_ON((va - kernel_map.virt_addr) >> PUD_SHIFT);
>
> return (uintptr_t)early_pmd;
> }
> @@ -391,21 +404,97 @@ static void __init create_pmd_mapping(pmd_t *pmdp,
> create_pte_mapping(ptep, va, pa, sz, prot);
> }
>
> -#define pgd_next_t pmd_t
> -#define alloc_pgd_next(__va) pt_ops.alloc_pmd(__va)
> -#define get_pgd_next_virt(__pa) pt_ops.get_pmd_virt(__pa)
> +static pud_t *__init get_pud_virt_early(phys_addr_t pa)
> +{
> + return (pud_t *)((uintptr_t)pa);
> +}
> +
> +static pud_t *__init get_pud_virt_fixmap(phys_addr_t pa)
> +{
> + clear_fixmap(FIX_PUD);
> + return (pud_t *)set_fixmap_offset(FIX_PUD, pa);
> +}
> +
> +static pud_t *__init get_pud_virt_late(phys_addr_t pa)
> +{
> + return (pud_t *)__va(pa);
> +}
> +
> +static phys_addr_t __init alloc_pud_early(uintptr_t va)
> +{
> + /* Only one PUD is available for early mapping */
> + BUG_ON((va - kernel_map.virt_addr) >> PGDIR_SHIFT);
> +
> + return (uintptr_t)early_pud;
> +}
> +
> +static phys_addr_t __init alloc_pud_fixmap(uintptr_t va)
> +{
> + return memblock_phys_alloc(PAGE_SIZE, PAGE_SIZE);
> +}
> +
> +static phys_addr_t alloc_pud_late(uintptr_t va)
> +{
> + unsigned long vaddr;
> +
> + vaddr = __get_free_page(GFP_KERNEL);
> + BUG_ON(!vaddr);
> + return __pa(vaddr);
> +}
> +
> +static void __init create_pud_mapping(pud_t *pudp,
> + uintptr_t va, phys_addr_t pa,
> + phys_addr_t sz, pgprot_t prot)
> +{
> + pmd_t *nextp;
> + phys_addr_t next_phys;
> + uintptr_t pud_index = pud_index(va);
> +
> + if (sz == PUD_SIZE) {
> + if (pud_val(pudp[pud_index]) == 0)
> + pudp[pud_index] = pfn_pud(PFN_DOWN(pa), prot);
> + return;
> + }
> +
> + if (pud_val(pudp[pud_index]) == 0) {
> + next_phys = pt_ops.alloc_pmd(va);
> + pudp[pud_index] = pfn_pud(PFN_DOWN(next_phys), PAGE_TABLE);
> + nextp = pt_ops.get_pmd_virt(next_phys);
> + memset(nextp, 0, PAGE_SIZE);
> + } else {
> + next_phys = PFN_PHYS(_pud_pfn(pudp[pud_index]));
> + nextp = pt_ops.get_pmd_virt(next_phys);
> + }
> +
> + create_pmd_mapping(nextp, va, pa, sz, prot);
> +}
> +
> +#define pgd_next_t pud_t
> +#define alloc_pgd_next(__va) (pgtable_l4_enabled ? \
> + pt_ops.alloc_pud(__va) : pt_ops.alloc_pmd(__va))
> +#define get_pgd_next_virt(__pa) (pgtable_l4_enabled ? \
> + pt_ops.get_pud_virt(__pa) : (pgd_next_t *)pt_ops.get_pmd_virt(__pa))
> #define create_pgd_next_mapping(__nextp, __va, __pa, __sz, __prot) \
> - create_pmd_mapping(__nextp, __va, __pa, __sz, __prot)
> -#define fixmap_pgd_next fixmap_pmd
> + (pgtable_l4_enabled ? \
> + create_pud_mapping(__nextp, __va, __pa, __sz, __prot) : \
> + create_pmd_mapping((pmd_t *)__nextp, __va, __pa, __sz, __prot))
> +#define fixmap_pgd_next (pgtable_l4_enabled ? \
> + (uintptr_t)fixmap_pud : (uintptr_t)fixmap_pmd)
> +#define trampoline_pgd_next (pgtable_l4_enabled ? \
> + (uintptr_t)trampoline_pud : (uintptr_t)trampoline_pmd)
> +#define early_dtb_pgd_next (pgtable_l4_enabled ? \
> + (uintptr_t)early_dtb_pud : (uintptr_t)early_dtb_pmd)
> #else
> #define pgd_next_t pte_t
> #define alloc_pgd_next(__va) pt_ops.alloc_pte(__va)
> #define get_pgd_next_virt(__pa) pt_ops.get_pte_virt(__pa)
> #define create_pgd_next_mapping(__nextp, __va, __pa, __sz, __prot) \
> create_pte_mapping(__nextp, __va, __pa, __sz, __prot)
> -#define fixmap_pgd_next fixmap_pte
> +#define fixmap_pgd_next ((uintptr_t)fixmap_pte)
> +#define early_dtb_pgd_next ((uintptr_t)early_dtb_pmd)
> +#define create_pud_mapping(__pmdp, __va, __pa, __sz, __prot)
> #define create_pmd_mapping(__pmdp, __va, __pa, __sz, __prot)
> -#endif
> +#endif /* __PAGETABLE_PMD_FOLDED */
>
> void __init create_pgd_mapping(pgd_t *pgdp,
> uintptr_t va, phys_addr_t pa,
> @@ -493,6 +582,57 @@ static __init pgprot_t pgprot_from_va(uintptr_t va)
> }
> #endif /* CONFIG_STRICT_KERNEL_RWX */
>
> +#ifdef CONFIG_64BIT
> +static void __init disable_pgtable_l4(void)
> +{
> + pgtable_l4_enabled = false;
> + kernel_map.page_offset = PAGE_OFFSET_L3;
> + satp_mode = SATP_MODE_39;
> +}
> +
> +/*
> + * There is a simple way to determine if 4-level is supported by the
> + * underlying hardware: establish 1:1 mapping in 4-level page table mode
> + * then read SATP to see if the configuration was taken into account
> + * meaning sv48 is supported.
> + */
> +static __init void set_satp_mode(void)
> +{
> + u64 identity_satp, hw_satp;
> + uintptr_t set_satp_mode_pmd;
> +
> + set_satp_mode_pmd = ((unsigned long)set_satp_mode) & PMD_MASK;
> + create_pgd_mapping(early_pg_dir,
> + set_satp_mode_pmd, (uintptr_t)early_pud,
> + PGDIR_SIZE, PAGE_TABLE);
> + create_pud_mapping(early_pud,
> + set_satp_mode_pmd, (uintptr_t)early_pmd,
> + PUD_SIZE, PAGE_TABLE);
> + /* Handle the case where set_satp_mode straddles 2 PMDs */
> + create_pmd_mapping(early_pmd,
> + set_satp_mode_pmd, set_satp_mode_pmd,
> + PMD_SIZE, PAGE_KERNEL_EXEC);
> + create_pmd_mapping(early_pmd,
> + set_satp_mode_pmd + PMD_SIZE,
> + set_satp_mode_pmd + PMD_SIZE,
> + PMD_SIZE, PAGE_KERNEL_EXEC);
> +
> + identity_satp = PFN_DOWN((uintptr_t)&early_pg_dir) | satp_mode;
> +
> + local_flush_tlb_all();
> + csr_write(CSR_SATP, identity_satp);
> + hw_satp = csr_swap(CSR_SATP, 0ULL);
> + local_flush_tlb_all();
> +
> + if (hw_satp != identity_satp)
> + disable_pgtable_l4();
> +
> + memset(early_pg_dir, 0, PAGE_SIZE);
> + memset(early_pud, 0, PAGE_SIZE);
> + memset(early_pmd, 0, PAGE_SIZE);
> +}
> +#endif
> +
> /*
> * setup_vm() is called from head.S with MMU-off.
> *
> @@ -557,10 +697,15 @@ static void __init create_fdt_early_page_table(pgd_t *pgdir, uintptr_t dtb_pa)
> uintptr_t pa = dtb_pa & ~(PMD_SIZE - 1);
>
> create_pgd_mapping(early_pg_dir, DTB_EARLY_BASE_VA,
> - IS_ENABLED(CONFIG_64BIT) ? (uintptr_t)early_dtb_pmd : pa,
> + IS_ENABLED(CONFIG_64BIT) ? early_dtb_pgd_next : pa,
> PGDIR_SIZE,
> IS_ENABLED(CONFIG_64BIT) ? PAGE_TABLE : PAGE_KERNEL);
>
> + if (pgtable_l4_enabled) {
> + create_pud_mapping(early_dtb_pud, DTB_EARLY_BASE_VA,
> + (uintptr_t)early_dtb_pmd, PUD_SIZE, PAGE_TABLE);
> + }
> +
> if (IS_ENABLED(CONFIG_64BIT)) {
> create_pmd_mapping(early_dtb_pmd, DTB_EARLY_BASE_VA,
> pa, PMD_SIZE, PAGE_KERNEL);
> @@ -593,6 +738,8 @@ void pt_ops_set_early(void)
> #ifndef __PAGETABLE_PMD_FOLDED
> pt_ops.alloc_pmd = alloc_pmd_early;
> pt_ops.get_pmd_virt = get_pmd_virt_early;
> + pt_ops.alloc_pud = alloc_pud_early;
> + pt_ops.get_pud_virt = get_pud_virt_early;
> #endif
> }
>
> @@ -611,6 +758,8 @@ void pt_ops_set_fixmap(void)
> #ifndef __PAGETABLE_PMD_FOLDED
> pt_ops.alloc_pmd = kernel_mapping_pa_to_va((uintptr_t)alloc_pmd_fixmap);
> pt_ops.get_pmd_virt = kernel_mapping_pa_to_va((uintptr_t)get_pmd_virt_fixmap);
> + pt_ops.alloc_pud = kernel_mapping_pa_to_va((uintptr_t)alloc_pud_fixmap);
> + pt_ops.get_pud_virt = kernel_mapping_pa_to_va((uintptr_t)get_pud_virt_fixmap);
> #endif
> }
>
> @@ -625,6 +774,8 @@ void pt_ops_set_late(void)
> #ifndef __PAGETABLE_PMD_FOLDED
> pt_ops.alloc_pmd = alloc_pmd_late;
> pt_ops.get_pmd_virt = get_pmd_virt_late;
> + pt_ops.alloc_pud = alloc_pud_late;
> + pt_ops.get_pud_virt = get_pud_virt_late;
> #endif
> }
>
> @@ -633,6 +784,7 @@ asmlinkage void __init setup_vm(uintptr_t dtb_pa)
> pmd_t __maybe_unused fix_bmap_spmd, fix_bmap_epmd;
>
> kernel_map.virt_addr = KERNEL_LINK_ADDR;
> + kernel_map.page_offset = _AC(CONFIG_PAGE_OFFSET, UL);
>
> #ifdef CONFIG_XIP_KERNEL
> kernel_map.xiprom = (uintptr_t)CONFIG_XIP_PHYS_ADDR;
> @@ -647,6 +799,11 @@ asmlinkage void __init setup_vm(uintptr_t dtb_pa)
> kernel_map.phys_addr = (uintptr_t)(&_start);
> kernel_map.size = (uintptr_t)(&_end) - kernel_map.phys_addr;
> #endif
> +
> +#if defined(CONFIG_64BIT) && !defined(CONFIG_XIP_KERNEL)
> + set_satp_mode();
> +#endif
> +
> kernel_map.va_pa_offset = PAGE_OFFSET - kernel_map.phys_addr;
> kernel_map.va_kernel_pa_offset = kernel_map.virt_addr - kernel_map.phys_addr;
>
> @@ -676,15 +833,21 @@ asmlinkage void __init setup_vm(uintptr_t dtb_pa)
>
> /* Setup early PGD for fixmap */
> create_pgd_mapping(early_pg_dir, FIXADDR_START,
> - (uintptr_t)fixmap_pgd_next, PGDIR_SIZE, PAGE_TABLE);
> + fixmap_pgd_next, PGDIR_SIZE, PAGE_TABLE);
>
> #ifndef __PAGETABLE_PMD_FOLDED
> - /* Setup fixmap PMD */
> + /* Setup fixmap PUD and PMD */
> + if (pgtable_l4_enabled)
> + create_pud_mapping(fixmap_pud, FIXADDR_START,
> + (uintptr_t)fixmap_pmd, PUD_SIZE, PAGE_TABLE);
> create_pmd_mapping(fixmap_pmd, FIXADDR_START,
> (uintptr_t)fixmap_pte, PMD_SIZE, PAGE_TABLE);
> /* Setup trampoline PGD and PMD */
> create_pgd_mapping(trampoline_pg_dir, kernel_map.virt_addr,
> - (uintptr_t)trampoline_pmd, PGDIR_SIZE, PAGE_TABLE);
> + trampoline_pgd_next, PGDIR_SIZE, PAGE_TABLE);
> + if (pgtable_l4_enabled)
> + create_pud_mapping(trampoline_pud, kernel_map.virt_addr,
> + (uintptr_t)trampoline_pmd, PUD_SIZE, PAGE_TABLE);
> #ifdef CONFIG_XIP_KERNEL
> create_pmd_mapping(trampoline_pmd, kernel_map.virt_addr,
> kernel_map.xiprom, PMD_SIZE, PAGE_KERNEL_EXEC);
> @@ -712,7 +875,7 @@ asmlinkage void __init setup_vm(uintptr_t dtb_pa)
> * Bootime fixmap only can handle PMD_SIZE mapping. Thus, boot-ioremap
> * range can not span multiple pmds.
> */
> - BUILD_BUG_ON((__fix_to_virt(FIX_BTMAP_BEGIN) >> PMD_SHIFT)
> + BUG_ON((__fix_to_virt(FIX_BTMAP_BEGIN) >> PMD_SHIFT)
> != (__fix_to_virt(FIX_BTMAP_END) >> PMD_SHIFT));
>
> #ifndef __PAGETABLE_PMD_FOLDED
> @@ -783,9 +946,10 @@ static void __init setup_vm_final(void)
> /* Clear fixmap PTE and PMD mappings */
> clear_fixmap(FIX_PTE);
> clear_fixmap(FIX_PMD);
> + clear_fixmap(FIX_PUD);
>
> /* Move to swapper page table */
> - csr_write(CSR_SATP, PFN_DOWN(__pa_symbol(swapper_pg_dir)) | SATP_MODE);
> + csr_write(CSR_SATP, PFN_DOWN(__pa_symbol(swapper_pg_dir)) | satp_mode);
> local_flush_tlb_all();
>
> pt_ops_set_late();
> diff --git a/arch/riscv/mm/kasan_init.c b/arch/riscv/mm/kasan_init.c
> index 1434a0225140..993f50571a3b 100644
> --- a/arch/riscv/mm/kasan_init.c
> +++ b/arch/riscv/mm/kasan_init.c
> @@ -11,7 +11,29 @@
> #include <asm/fixmap.h>
> #include <asm/pgalloc.h>
>
> +/*
> + * Kasan shadow region must lie at a fixed address across sv39, sv48 and sv57
> + * which is right before the kernel.
> + *
> + * For sv39, the region is aligned on PGDIR_SIZE so we only need to populate
> + * the page global directory with kasan_early_shadow_pmd.
> + *
> + * For sv48 and sv57, the region is not aligned on PGDIR_SIZE so the mapping
> + * must be divided as follows:
> + * - the first PGD entry, although incomplete, is populated with
> + * kasan_early_shadow_pud/p4d
> + * - the PGD entries in the middle are populated with kasan_early_shadow_pud/p4d
> + * - the last PGD entry is shared with the kernel mapping so populated at the
> + * lower levels pud/p4d
> + *
> + * In addition, when shallow populating a kasan region (for example vmalloc),
> + * this region may also not be aligned on PGDIR size, so we must go down to the
> + * pud level too.
> + */
> +
> extern pgd_t early_pg_dir[PTRS_PER_PGD];
> +extern struct pt_alloc_ops _pt_ops __initdata;
> +#define pt_ops _pt_ops
>
> static void __init kasan_populate_pte(pmd_t *pmd, unsigned long vaddr, unsigned long end)
> {
> @@ -35,15 +57,19 @@ static void __init kasan_populate_pte(pmd_t *pmd, unsigned long vaddr, unsigned
> set_pmd(pmd, pfn_pmd(PFN_DOWN(__pa(base_pte)), PAGE_TABLE));
> }
>
> -static void __init kasan_populate_pmd(pgd_t *pgd, unsigned long vaddr, unsigned long end)
> +static void __init kasan_populate_pmd(pud_t *pud, unsigned long vaddr, unsigned long end)
> {
> phys_addr_t phys_addr;
> pmd_t *pmdp, *base_pmd;
> unsigned long next;
>
> - base_pmd = (pmd_t *)pgd_page_vaddr(*pgd);
> - if (base_pmd == lm_alias(kasan_early_shadow_pmd))
> + if (pud_none(*pud)) {
> base_pmd = memblock_alloc(PTRS_PER_PMD * sizeof(pmd_t), PAGE_SIZE);
> + } else {
> + base_pmd = (pmd_t *)pud_pgtable(*pud);
> + if (base_pmd == lm_alias(kasan_early_shadow_pmd))
> + base_pmd = memblock_alloc(PTRS_PER_PMD * sizeof(pmd_t), PAGE_SIZE);
> + }
>
> pmdp = base_pmd + pmd_index(vaddr);
>
> @@ -67,9 +93,72 @@ static void __init kasan_populate_pmd(pgd_t *pgd, unsigned long vaddr, unsigned
> * it entirely, memblock could allocate a page at a physical address
> * where KASAN is not populated yet and then we'd get a page fault.
> */
> - set_pgd(pgd, pfn_pgd(PFN_DOWN(__pa(base_pmd)), PAGE_TABLE));
> + set_pud(pud, pfn_pud(PFN_DOWN(__pa(base_pmd)), PAGE_TABLE));
> +}
> +
> +static void __init kasan_populate_pud(pgd_t *pgd,
> + unsigned long vaddr, unsigned long end,
> + bool early)
> +{
> + phys_addr_t phys_addr;
> + pud_t *pudp, *base_pud;
> + unsigned long next;
> +
> + if (early) {
> + /*
> + * We can't use pgd_page_vaddr here as it would return a linear
> + * mapping address but it is not mapped yet, but when populating
> + * early_pg_dir, we need the physical address and when populating
> + * swapper_pg_dir, we need the kernel virtual address so use
> + * pt_ops facility.
> + */
> + base_pud = pt_ops.get_pud_virt(pfn_to_phys(_pgd_pfn(*pgd)));
> + } else {
> + base_pud = (pud_t *)pgd_page_vaddr(*pgd);
> + if (base_pud == lm_alias(kasan_early_shadow_pud))
> + base_pud = memblock_alloc(PTRS_PER_PUD * sizeof(pud_t), PAGE_SIZE);
> + }
> +
> + pudp = base_pud + pud_index(vaddr);
> +
> + do {
> + next = pud_addr_end(vaddr, end);
> +
> + if (pud_none(*pudp) && IS_ALIGNED(vaddr, PUD_SIZE) && (next - vaddr) >= PUD_SIZE) {
> + if (early) {
> + phys_addr = __pa(((uintptr_t)kasan_early_shadow_pmd));
> + set_pud(pudp, pfn_pud(PFN_DOWN(phys_addr), PAGE_TABLE));
> + continue;
> + } else {
> + phys_addr = memblock_phys_alloc(PUD_SIZE, PUD_SIZE);
> + if (phys_addr) {
> + set_pud(pudp, pfn_pud(PFN_DOWN(phys_addr), PAGE_KERNEL));
> + continue;
> + }
> + }
> + }
> +
> + kasan_populate_pmd(pudp, vaddr, next);
> + } while (pudp++, vaddr = next, vaddr != end);
> +
> + /*
> + * Wait for the whole PGD to be populated before setting the PGD in
> + * the page table, otherwise, if we did set the PGD before populating
> + * it entirely, memblock could allocate a page at a physical address
> + * where KASAN is not populated yet and then we'd get a page fault.
> + */
> + if (!early)
> + set_pgd(pgd, pfn_pgd(PFN_DOWN(__pa(base_pud)), PAGE_TABLE));
> }
>
> +#define kasan_early_shadow_pgd_next (pgtable_l4_enabled ? \
> + (uintptr_t)kasan_early_shadow_pud : \
> + (uintptr_t)kasan_early_shadow_pmd)
> +#define kasan_populate_pgd_next(pgdp, vaddr, next, early) \
> + (pgtable_l4_enabled ? \
> + kasan_populate_pud(pgdp, vaddr, next, early) : \
> + kasan_populate_pmd((pud_t *)pgdp, vaddr, next))
> +
> static void __init kasan_populate_pgd(pgd_t *pgdp,
> unsigned long vaddr, unsigned long end,
> bool early)
> @@ -102,7 +191,7 @@ static void __init kasan_populate_pgd(pgd_t *pgdp,
> }
> }
>
> - kasan_populate_pmd(pgdp, vaddr, next);
> + kasan_populate_pgd_next(pgdp, vaddr, next, early);
> } while (pgdp++, vaddr = next, vaddr != end);
> }
>
> @@ -157,18 +246,54 @@ static void __init kasan_populate(void *start, void *end)
> memset(start, KASAN_SHADOW_INIT, end - start);
> }
>
> +static void __init kasan_shallow_populate_pud(pgd_t *pgdp,
> + unsigned long vaddr, unsigned long end,
> + bool kasan_populate)
> +{
> + unsigned long next;
> + pud_t *pudp, *base_pud;
> + pmd_t *base_pmd;
> + bool is_kasan_pmd;
> +
> + base_pud = (pud_t *)pgd_page_vaddr(*pgdp);
> + pudp = base_pud + pud_index(vaddr);
> +
> + if (kasan_populate)
> + memcpy(base_pud, (void *)kasan_early_shadow_pgd_next,
> + sizeof(pud_t) * PTRS_PER_PUD);
> +
> + do {
> + next = pud_addr_end(vaddr, end);
> + is_kasan_pmd = (pud_pgtable(*pudp) == lm_alias(kasan_early_shadow_pmd));
> +
> + if (is_kasan_pmd) {
> + base_pmd = memblock_alloc(PAGE_SIZE, PAGE_SIZE);
> + set_pud(pudp, pfn_pud(PFN_DOWN(__pa(base_pmd)), PAGE_TABLE));
> + }
> + } while (pudp++, vaddr = next, vaddr != end);
> +}
> +
> static void __init kasan_shallow_populate_pgd(unsigned long vaddr, unsigned long end)
> {
> unsigned long next;
> void *p;
> pgd_t *pgd_k = pgd_offset_k(vaddr);
> + bool is_kasan_pgd_next;
>
> do {
> next = pgd_addr_end(vaddr, end);
> - if (pgd_page_vaddr(*pgd_k) == (unsigned long)lm_alias(kasan_early_shadow_pmd)) {
> + is_kasan_pgd_next = (pgd_page_vaddr(*pgd_k) ==
> + (unsigned long)lm_alias(kasan_early_shadow_pgd_next));
> +
> + if (is_kasan_pgd_next) {
> p = memblock_alloc(PAGE_SIZE, PAGE_SIZE);
> set_pgd(pgd_k, pfn_pgd(PFN_DOWN(__pa(p)), PAGE_TABLE));
> }
> +
> + if (IS_ALIGNED(vaddr, PGDIR_SIZE) && (next - vaddr) >= PGDIR_SIZE)
> + continue;
> +
> + kasan_shallow_populate_pud(pgd_k, vaddr, next, is_kasan_pgd_next);
> } while (pgd_k++, vaddr = next, vaddr != end);
> }


@Qinglin: I can deal with the sv57 kasan population if need be, as it is a
bit tricky and I think it would save you quite some time :)


>
> diff --git a/drivers/firmware/efi/libstub/efi-stub.c b/drivers/firmware/efi/libstub/efi-stub.c
> index 26e69788f27a..b3db5d91ed38 100644
> --- a/drivers/firmware/efi/libstub/efi-stub.c
> +++ b/drivers/firmware/efi/libstub/efi-stub.c
> @@ -40,6 +40,8 @@
>
> #ifdef CONFIG_ARM64
> # define EFI_RT_VIRTUAL_LIMIT DEFAULT_MAP_WINDOW_64
> +#elif defined(CONFIG_RISCV)
> +# define EFI_RT_VIRTUAL_LIMIT TASK_SIZE_MIN
> #else
> # define EFI_RT_VIRTUAL_LIMIT TASK_SIZE
> #endif

2021-12-06 11:08:32

by Alexandre Ghiti

Subject: Re: [PATCH v3 00/13] Introduce sv48 support without relocatable kernel

And I messed up Atish's address; I was pretty sure I could recall it
without checking. I guess I was wrong :)

Sorry for the noise,

Alex

On 12/6/21 11:46, Alexandre Ghiti wrote:
> * Please note notable changes in memory layouts and kasan population *
>
> This patchset allows to have a single kernel for sv39 and sv48 without
> being relocatable.
>
> The idea comes from Arnd Bergmann who suggested to do the same as x86,
> that is mapping the kernel to the end of the address space, which allows
> the kernel to be linked at the same address for both sv39 and sv48 and
> then does not require to be relocated at runtime.
>
> This implements sv48 support at runtime. The kernel will try to
> boot with 4-level page table and will fallback to 3-level if the HW does not
> support it. Folding the 4th level into a 3-level page table has almost no
> cost at runtime.
>
> Note that kasan region had to be moved to the end of the address space
> since its location must be known at compile-time and then be valid for
> both sv39 and sv48 (and sv57 that is coming).
>
> Tested on:
> - qemu rv64 sv39: OK
> - qemu rv64 sv48: OK
> - qemu rv64 sv39 + kasan: OK
> - qemu rv64 sv48 + kasan: OK
> - qemu rv32: OK
>
> Changes in v3:
> - Fix SZ_1T, thanks to Atish
> - Fix warning create_pud_mapping, thanks to Atish
> - Fix k210 nommu build, thanks to Atish
> - Fix wrong rebase as noted by Samuel
> - * Downgrade to sv39 is only possible if !KASAN (see commit changelog) *
> - * Move KASAN next to the kernel: virtual layouts changed and kasan population *
>
> Changes in v2:
> - Rebase onto for-next
> - Fix KASAN
> - Fix stack canary
> - Get completely rid of MAXPHYSMEM configs
> - Add documentation
>
> Alexandre Ghiti (13):
> riscv: Move KASAN mapping next to the kernel mapping
> riscv: Split early kasan mapping to prepare sv48 introduction
> riscv: Introduce functions to switch pt_ops
> riscv: Allow to dynamically define VA_BITS
> riscv: Get rid of MAXPHYSMEM configs
> asm-generic: Prepare for riscv use of pud_alloc_one and pud_free
> riscv: Implement sv48 support
> riscv: Use pgtable_l4_enabled to output mmu_type in cpuinfo
> riscv: Explicit comment about user virtual address space size
> riscv: Improve virtual kernel memory layout dump
> Documentation: riscv: Add sv48 description to VM layout
> riscv: Initialize thread pointer before calling C functions
> riscv: Allow user to downgrade to sv39 when hw supports sv48 if !KASAN
>
> Documentation/riscv/vm-layout.rst | 48 ++-
> arch/riscv/Kconfig | 37 +-
> arch/riscv/configs/nommu_k210_defconfig | 1 -
> .../riscv/configs/nommu_k210_sdcard_defconfig | 1 -
> arch/riscv/configs/nommu_virt_defconfig | 1 -
> arch/riscv/include/asm/csr.h | 3 +-
> arch/riscv/include/asm/fixmap.h | 1
> arch/riscv/include/asm/kasan.h | 11 +-
> arch/riscv/include/asm/page.h | 20 +-
> arch/riscv/include/asm/pgalloc.h | 40 ++
> arch/riscv/include/asm/pgtable-64.h | 108 ++++-
> arch/riscv/include/asm/pgtable.h | 47 +-
> arch/riscv/include/asm/sparsemem.h | 6 +-
> arch/riscv/kernel/cpu.c | 23 +-
> arch/riscv/kernel/head.S | 4 +-
> arch/riscv/mm/context.c | 4 +-
> arch/riscv/mm/init.c | 408 ++++++++++++++----
> arch/riscv/mm/kasan_init.c | 250 ++++++++---
> drivers/firmware/efi/libstub/efi-stub.c | 2
> drivers/pci/controller/pci-xgene.c | 2 +-
> include/asm-generic/pgalloc.h | 24 +-
> include/linux/sizes.h | 1
> 22 files changed, 833 insertions(+), 209 deletions(-)
>
> --
> 2.32.0
>
>
> _______________________________________________
> linux-riscv mailing list
> [email protected]
> http://lists.infradead.org/mailman/listinfo/linux-riscv

2021-12-06 16:27:00

by Jisheng Zhang

Subject: Re: [PATCH v3 01/13] riscv: Move KASAN mapping next to the kernel mapping

On Mon, 6 Dec 2021 11:46:45 +0100
Alexandre Ghiti <[email protected]> wrote:

> Now that KASAN_SHADOW_OFFSET is defined at compile time as a config,
> this value must remain constant whatever the size of the virtual address
> space, which is only possible by pushing this region at the end of the
> address space next to the kernel mapping.
>
> Signed-off-by: Alexandre Ghiti <[email protected]>
> ---
> Documentation/riscv/vm-layout.rst | 12 ++++++------
> arch/riscv/Kconfig | 4 ++--
> arch/riscv/include/asm/kasan.h | 4 ++--
> arch/riscv/include/asm/page.h | 6 +++++-
> arch/riscv/include/asm/pgtable.h | 6 ++++--
> arch/riscv/mm/init.c | 25 +++++++++++++------------
> 6 files changed, 32 insertions(+), 25 deletions(-)
>
> diff --git a/Documentation/riscv/vm-layout.rst b/Documentation/riscv/vm-layout.rst
> index b7f98930d38d..1bd687b97104 100644
> --- a/Documentation/riscv/vm-layout.rst
> +++ b/Documentation/riscv/vm-layout.rst
> @@ -47,12 +47,12 @@ RISC-V Linux Kernel SV39
> | Kernel-space virtual memory, shared between all processes:
> ____________________________________________________________|___________________________________________________________
> | | | |
> - ffffffc000000000 | -256 GB | ffffffc7ffffffff | 32 GB | kasan
> - ffffffcefee00000 | -196 GB | ffffffcefeffffff | 2 MB | fixmap
> - ffffffceff000000 | -196 GB | ffffffceffffffff | 16 MB | PCI io
> - ffffffcf00000000 | -196 GB | ffffffcfffffffff | 4 GB | vmemmap
> - ffffffd000000000 | -192 GB | ffffffdfffffffff | 64 GB | vmalloc/ioremap space
> - ffffffe000000000 | -128 GB | ffffffff7fffffff | 124 GB | direct mapping of all physical memory
> + ffffffc6fee00000 | -228 GB | ffffffc6feffffff | 2 MB | fixmap
> + ffffffc6ff000000 | -228 GB | ffffffc6ffffffff | 16 MB | PCI io
> + ffffffc700000000 | -228 GB | ffffffc7ffffffff | 4 GB | vmemmap
> + ffffffc800000000 | -224 GB | ffffffd7ffffffff | 64 GB | vmalloc/ioremap space
> + ffffffd800000000 | -160 GB | fffffff6ffffffff | 124 GB | direct mapping of all physical memory
> + fffffff700000000 | -36 GB | fffffffeffffffff | 32 GB | kasan
> __________________|____________|__________________|_________|____________________________________________________________
> |
> |
> diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
> index 6d5b63bd4bd9..6cd98ade5ebc 100644
> --- a/arch/riscv/Kconfig
> +++ b/arch/riscv/Kconfig
> @@ -161,12 +161,12 @@ config PAGE_OFFSET
> default 0xC0000000 if 32BIT && MAXPHYSMEM_1GB
> default 0x80000000 if 64BIT && !MMU
> default 0xffffffff80000000 if 64BIT && MAXPHYSMEM_2GB
> - default 0xffffffe000000000 if 64BIT && MAXPHYSMEM_128GB
> + default 0xffffffd800000000 if 64BIT && MAXPHYSMEM_128GB
>
> config KASAN_SHADOW_OFFSET
> hex
> depends on KASAN_GENERIC
> - default 0xdfffffc800000000 if 64BIT
> + default 0xdfffffff00000000 if 64BIT
> default 0xffffffff if 32BIT
>
> config ARCH_FLATMEM_ENABLE
> diff --git a/arch/riscv/include/asm/kasan.h b/arch/riscv/include/asm/kasan.h
> index b00f503ec124..257a2495145a 100644
> --- a/arch/riscv/include/asm/kasan.h
> +++ b/arch/riscv/include/asm/kasan.h
> @@ -28,8 +28,8 @@
> #define KASAN_SHADOW_SCALE_SHIFT 3
>
> #define KASAN_SHADOW_SIZE (UL(1) << ((CONFIG_VA_BITS - 1) - KASAN_SHADOW_SCALE_SHIFT))
> -#define KASAN_SHADOW_START KERN_VIRT_START
> -#define KASAN_SHADOW_END (KASAN_SHADOW_START + KASAN_SHADOW_SIZE)
> +#define KASAN_SHADOW_START (KASAN_SHADOW_END - KASAN_SHADOW_SIZE)
> +#define KASAN_SHADOW_END MODULES_LOWEST_VADDR
> #define KASAN_SHADOW_OFFSET _AC(CONFIG_KASAN_SHADOW_OFFSET, UL)
>
> void kasan_init(void);
> diff --git a/arch/riscv/include/asm/page.h b/arch/riscv/include/asm/page.h
> index 109c97e991a6..e03559f9b35e 100644
> --- a/arch/riscv/include/asm/page.h
> +++ b/arch/riscv/include/asm/page.h
> @@ -33,7 +33,11 @@
> */
> #define PAGE_OFFSET _AC(CONFIG_PAGE_OFFSET, UL)
>
> -#define KERN_VIRT_SIZE (-PAGE_OFFSET)
> +/*
> + * Half of the kernel address space (half of the entries of the page global
> + * directory) is for the direct mapping.
> + */
> +#define KERN_VIRT_SIZE ((PTRS_PER_PGD / 2 * PGDIR_SIZE) / 2)
>
> #ifndef __ASSEMBLY__
>
> diff --git a/arch/riscv/include/asm/pgtable.h b/arch/riscv/include/asm/pgtable.h
> index 39b550310ec6..d34f3a7a9701 100644
> --- a/arch/riscv/include/asm/pgtable.h
> +++ b/arch/riscv/include/asm/pgtable.h
> @@ -39,8 +39,10 @@
>
> /* Modules always live before the kernel */
> #ifdef CONFIG_64BIT
> -#define MODULES_VADDR (PFN_ALIGN((unsigned long)&_end) - SZ_2G)
> -#define MODULES_END (PFN_ALIGN((unsigned long)&_start))
> +/* This is used to define the end of the KASAN shadow region */
> +#define MODULES_LOWEST_VADDR (KERNEL_LINK_ADDR - SZ_2G)
> +#define MODULES_VADDR (PFN_ALIGN((unsigned long)&_end) - SZ_2G)
> +#define MODULES_END (PFN_ALIGN((unsigned long)&_start))
> #endif
>
> /*
> diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c
> index c0cddf0fc22d..4224e9d0ecf5 100644
> --- a/arch/riscv/mm/init.c
> +++ b/arch/riscv/mm/init.c
> @@ -103,6 +103,9 @@ static void __init print_vm_layout(void)
> print_mlm("lowmem", (unsigned long)PAGE_OFFSET,
> (unsigned long)high_memory);
> #ifdef CONFIG_64BIT
> +#ifdef CONFIG_KASAN
> + print_mlm("kasan", KASAN_SHADOW_START, KASAN_SHADOW_END);
> +#endif

I think we'd better avoid #ifdef usage as much as possible.
For this KASAN case, we can make both KASAN_SHADOW_START and KASAN_SHADOW_END
always visible, as x86 does; then the above code can be:

	if (IS_ENABLED(CONFIG_KASAN))
		print_mlm("kasan", KASAN_SHADOW_START, KASAN_SHADOW_END);

Thanks

2021-12-09 04:33:39

by Qinglin Pan

Subject: Re: [PATCH v3 07/13] riscv: Implement sv48 support

Hi Alex,

On 2021/12/6 19:05, Alexandre Ghiti wrote:
> On 12/6/21 11:46, Alexandre Ghiti wrote:
>> By adding a new 4th level of page table, give the possibility to 64bit
>> kernel to address 2^48 bytes of virtual address: in practice, that offers
>> 128TB of virtual address space to userspace and allows up to 64TB of
>> physical memory.
>>
>> If the underlying hardware does not support sv48, we will automatically
>> fallback to a standard 3-level page table by folding the new PUD level into
>> PGDIR level. In order to detect HW capabilities at runtime, we
>> use SATP feature that ignores writes with an unsupported mode.
>>
>> Signed-off-by: Alexandre Ghiti <[email protected]>
>> ---
>>   arch/riscv/Kconfig                      |   4 +-
>>   arch/riscv/include/asm/csr.h            |   3 +-
>>   arch/riscv/include/asm/fixmap.h         |   1 +
>>   arch/riscv/include/asm/kasan.h          |   6 +-
>>   arch/riscv/include/asm/page.h           |  14 ++
>>   arch/riscv/include/asm/pgalloc.h        |  40 +++++
>>   arch/riscv/include/asm/pgtable-64.h     | 108 +++++++++++-
>>   arch/riscv/include/asm/pgtable.h        |  24 ++-
>>   arch/riscv/kernel/head.S                |   3 +-
>>   arch/riscv/mm/context.c                 |   4 +-
>>   arch/riscv/mm/init.c                    | 212 +++++++++++++++++++++---
>>   arch/riscv/mm/kasan_init.c              | 137 ++++++++++++++-
>>   drivers/firmware/efi/libstub/efi-stub.c |   2 +
>>   13 files changed, 514 insertions(+), 44 deletions(-)
>>
>> diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
>> index ac6c0cd9bc29..d28fe0148e13 100644
>> --- a/arch/riscv/Kconfig
>> +++ b/arch/riscv/Kconfig
>> @@ -150,7 +150,7 @@ config PAGE_OFFSET
>>       hex
>>       default 0xC0000000 if 32BIT
>>       default 0x80000000 if 64BIT && !MMU
>> -    default 0xffffffd800000000 if 64BIT
>> +    default 0xffffaf8000000000 if 64BIT
>>     config KASAN_SHADOW_OFFSET
>>       hex
>> @@ -201,7 +201,7 @@ config FIX_EARLYCON_MEM
>>     config PGTABLE_LEVELS
>>       int
>> -    default 3 if 64BIT
>> +    default 4 if 64BIT
>>       default 2
>>     config LOCKDEP_SUPPORT
>> diff --git a/arch/riscv/include/asm/csr.h b/arch/riscv/include/asm/csr.h
>> index 87ac65696871..3fdb971c7896 100644
>> --- a/arch/riscv/include/asm/csr.h
>> +++ b/arch/riscv/include/asm/csr.h
>> @@ -40,14 +40,13 @@
>>   #ifndef CONFIG_64BIT
>>   #define SATP_PPN    _AC(0x003FFFFF, UL)
>>   #define SATP_MODE_32    _AC(0x80000000, UL)
>> -#define SATP_MODE    SATP_MODE_32
>>   #define SATP_ASID_BITS    9
>>   #define SATP_ASID_SHIFT    22
>>   #define SATP_ASID_MASK    _AC(0x1FF, UL)
>>   #else
>>   #define SATP_PPN    _AC(0x00000FFFFFFFFFFF, UL)
>>   #define SATP_MODE_39    _AC(0x8000000000000000, UL)
>> -#define SATP_MODE    SATP_MODE_39
>> +#define SATP_MODE_48    _AC(0x9000000000000000, UL)
>>   #define SATP_ASID_BITS    16
>>   #define SATP_ASID_SHIFT    44
>>   #define SATP_ASID_MASK    _AC(0xFFFF, UL)
>> diff --git a/arch/riscv/include/asm/fixmap.h b/arch/riscv/include/asm/fixmap.h
>> index 54cbf07fb4e9..58a718573ad6 100644
>> --- a/arch/riscv/include/asm/fixmap.h
>> +++ b/arch/riscv/include/asm/fixmap.h
>> @@ -24,6 +24,7 @@ enum fixed_addresses {
>>       FIX_HOLE,
>>       FIX_PTE,
>>       FIX_PMD,
>> +    FIX_PUD,
>>       FIX_TEXT_POKE1,
>>       FIX_TEXT_POKE0,
>>       FIX_EARLYCON_MEM_BASE,
>> diff --git a/arch/riscv/include/asm/kasan.h b/arch/riscv/include/asm/kasan.h
>> index 743e6ff57996..0b85e363e778 100644
>> --- a/arch/riscv/include/asm/kasan.h
>> +++ b/arch/riscv/include/asm/kasan.h
>> @@ -28,7 +28,11 @@
>>   #define KASAN_SHADOW_SCALE_SHIFT    3
>>   #define KASAN_SHADOW_SIZE    (UL(1) << ((VA_BITS - 1) - KASAN_SHADOW_SCALE_SHIFT))
>> -#define KASAN_SHADOW_START    (KASAN_SHADOW_END - KASAN_SHADOW_SIZE)
>> +/*
>> + * Depending on the size of the virtual address space, the region may not be
>> + * aligned on PGDIR_SIZE, so force its alignment to ease its population.
>> + */
>> +#define KASAN_SHADOW_START    ((KASAN_SHADOW_END - KASAN_SHADOW_SIZE) & PGDIR_MASK)
>>   #define KASAN_SHADOW_END    MODULES_LOWEST_VADDR
>>   #define KASAN_SHADOW_OFFSET _AC(CONFIG_KASAN_SHADOW_OFFSET, UL)
>>   diff --git a/arch/riscv/include/asm/page.h b/arch/riscv/include/asm/page.h
>> index e03559f9b35e..d089fe46f7d8 100644
>> --- a/arch/riscv/include/asm/page.h
>> +++ b/arch/riscv/include/asm/page.h
>> @@ -31,7 +31,20 @@
>>    * When not using MMU this corresponds to the first free page in
>>    * physical memory (aligned on a page boundary).
>>    */
>> +#ifdef CONFIG_64BIT
>> +#ifdef CONFIG_MMU
>> +#define PAGE_OFFSET        kernel_map.page_offset
>> +#else
>> +#define PAGE_OFFSET        _AC(CONFIG_PAGE_OFFSET, UL)
>> +#endif
>> +/*
>> + * By default, CONFIG_PAGE_OFFSET value corresponds to SV48 address space so
>> + * define the PAGE_OFFSET value for SV39.
>> + */
>> +#define PAGE_OFFSET_L3        _AC(0xffffffd800000000, UL)
>> +#else
>>   #define PAGE_OFFSET        _AC(CONFIG_PAGE_OFFSET, UL)
>> +#endif /* CONFIG_64BIT */
>>     /*
>>    * Half of the kernel address space (half of the entries of the page global
>> @@ -90,6 +103,7 @@ extern unsigned long riscv_pfn_base;
>>   #endif /* CONFIG_MMU */
>>     struct kernel_mapping {
>> +    unsigned long page_offset;
>>       unsigned long virt_addr;
>>       uintptr_t phys_addr;
>>       uintptr_t size;
>> diff --git a/arch/riscv/include/asm/pgalloc.h b/arch/riscv/include/asm/pgalloc.h
>> index 0af6933a7100..11823004b87a 100644
>> --- a/arch/riscv/include/asm/pgalloc.h
>> +++ b/arch/riscv/include/asm/pgalloc.h
>> @@ -11,6 +11,8 @@
>>   #include <asm/tlb.h>
>>     #ifdef CONFIG_MMU
>> +#define __HAVE_ARCH_PUD_ALLOC_ONE
>> +#define __HAVE_ARCH_PUD_FREE
>>   #include <asm-generic/pgalloc.h>
>>     static inline void pmd_populate_kernel(struct mm_struct *mm,
>> @@ -36,6 +38,44 @@ static inline void pud_populate(struct mm_struct *mm, pud_t *pud, pmd_t *pmd)
>>         set_pud(pud, __pud((pfn << _PAGE_PFN_SHIFT) | _PAGE_TABLE));
>>   }
>> +
>> +static inline void p4d_populate(struct mm_struct *mm, p4d_t *p4d, pud_t *pud)
>> +{
>> +    if (pgtable_l4_enabled) {
>> +        unsigned long pfn = virt_to_pfn(pud);
>> +
>> +        set_p4d(p4d, __p4d((pfn << _PAGE_PFN_SHIFT) | _PAGE_TABLE));
>> +    }
>> +}
>> +
>> +static inline void p4d_populate_safe(struct mm_struct *mm, p4d_t *p4d,
>> +                     pud_t *pud)
>> +{
>> +    if (pgtable_l4_enabled) {
>> +        unsigned long pfn = virt_to_pfn(pud);
>> +
>> +        set_p4d_safe(p4d,
>> +                 __p4d((pfn << _PAGE_PFN_SHIFT) | _PAGE_TABLE));
>> +    }
>> +}
>> +
>> +#define pud_alloc_one pud_alloc_one
>> +static inline pud_t *pud_alloc_one(struct mm_struct *mm, unsigned long addr)
>> +{
>> +    if (pgtable_l4_enabled)
>> +        return __pud_alloc_one(mm, addr);
>> +
>> +    return NULL;
>> +}
>> +
>> +#define pud_free pud_free
>> +static inline void pud_free(struct mm_struct *mm, pud_t *pud)
>> +{
>> +    if (pgtable_l4_enabled)
>> +        __pud_free(mm, pud);
>> +}
>> +
>> +#define __pud_free_tlb(tlb, pud, addr) pud_free((tlb)->mm, pud)
>>   #endif /* __PAGETABLE_PMD_FOLDED */
>>     static inline pgd_t *pgd_alloc(struct mm_struct *mm)
>> diff --git a/arch/riscv/include/asm/pgtable-64.h b/arch/riscv/include/asm/pgtable-64.h
>> index 228261aa9628..bbbdd66e5e2f 100644
>> --- a/arch/riscv/include/asm/pgtable-64.h
>> +++ b/arch/riscv/include/asm/pgtable-64.h
>> @@ -8,16 +8,36 @@
>>     #include <linux/const.h>
>>   -#define PGDIR_SHIFT     30
>> +extern bool pgtable_l4_enabled;
>> +
>> +#define PGDIR_SHIFT_L3  30
>> +#define PGDIR_SHIFT_L4  39
>> +#define PGDIR_SIZE_L3   (_AC(1, UL) << PGDIR_SHIFT_L3)
>> +
>> +#define PGDIR_SHIFT     (pgtable_l4_enabled ? PGDIR_SHIFT_L4 : PGDIR_SHIFT_L3)
>>   /* Size of region mapped by a page global directory */
>>   #define PGDIR_SIZE      (_AC(1, UL) << PGDIR_SHIFT)
>>   #define PGDIR_MASK      (~(PGDIR_SIZE - 1))
>>   +/* pud is folded into pgd in case of 3-level page table */
>> +#define PUD_SHIFT      30
>> +#define PUD_SIZE       (_AC(1, UL) << PUD_SHIFT)
>> +#define PUD_MASK       (~(PUD_SIZE - 1))
>> +
>>   #define PMD_SHIFT       21
>>   /* Size of region mapped by a page middle directory */
>>   #define PMD_SIZE        (_AC(1, UL) << PMD_SHIFT)
>>   #define PMD_MASK        (~(PMD_SIZE - 1))
>>   +/* Page Upper Directory entry */
>> +typedef struct {
>> +    unsigned long pud;
>> +} pud_t;
>> +
>> +#define pud_val(x)      ((x).pud)
>> +#define __pud(x)        ((pud_t) { (x) })
>> +#define PTRS_PER_PUD    (PAGE_SIZE / sizeof(pud_t))
>> +
>>   /* Page Middle Directory entry */
>>   typedef struct {
>>       unsigned long pmd;
>> @@ -59,6 +79,16 @@ static inline void pud_clear(pud_t *pudp)
>>       set_pud(pudp, __pud(0));
>>   }
>>   +static inline pud_t pfn_pud(unsigned long pfn, pgprot_t prot)
>> +{
>> +    return __pud((pfn << _PAGE_PFN_SHIFT) | pgprot_val(prot));
>> +}
>> +
>> +static inline unsigned long _pud_pfn(pud_t pud)
>> +{
>> +    return pud_val(pud) >> _PAGE_PFN_SHIFT;
>> +}
>> +
>>   static inline pmd_t *pud_pgtable(pud_t pud)
>>   {
>>       return (pmd_t *)pfn_to_virt(pud_val(pud) >> _PAGE_PFN_SHIFT);
>> @@ -69,6 +99,17 @@ static inline struct page *pud_page(pud_t pud)
>>       return pfn_to_page(pud_val(pud) >> _PAGE_PFN_SHIFT);
>>   }
>>   +#define mm_pud_folded  mm_pud_folded
>> +static inline bool mm_pud_folded(struct mm_struct *mm)
>> +{
>> +    if (pgtable_l4_enabled)
>> +        return false;
>> +
>> +    return true;
>> +}
>> +
>> +#define pmd_index(addr) (((addr) >> PMD_SHIFT) & (PTRS_PER_PMD - 1))
>> +
>>   static inline pmd_t pfn_pmd(unsigned long pfn, pgprot_t prot)
>>   {
>>       return __pmd((pfn << _PAGE_PFN_SHIFT) | pgprot_val(prot));
>> @@ -84,4 +125,69 @@ static inline unsigned long _pmd_pfn(pmd_t pmd)
>>   #define pmd_ERROR(e) \
>>       pr_err("%s:%d: bad pmd %016lx.\n", __FILE__, __LINE__, pmd_val(e))
>>   +#define pud_ERROR(e)   \
>> +    pr_err("%s:%d: bad pud %016lx.\n", __FILE__, __LINE__, pud_val(e))
>> +
>> +static inline void set_p4d(p4d_t *p4dp, p4d_t p4d)
>> +{
>> +    if (pgtable_l4_enabled)
>> +        *p4dp = p4d;
>> +    else
>> +        set_pud((pud_t *)p4dp, (pud_t){ p4d_val(p4d) });
>> +}
>> +
>> +static inline int p4d_none(p4d_t p4d)
>> +{
>> +    if (pgtable_l4_enabled)
>> +        return (p4d_val(p4d) == 0);
>> +
>> +    return 0;
>> +}
>> +
>> +static inline int p4d_present(p4d_t p4d)
>> +{
>> +    if (pgtable_l4_enabled)
>> +        return (p4d_val(p4d) & _PAGE_PRESENT);
>> +
>> +    return 1;
>> +}
>> +
>> +static inline int p4d_bad(p4d_t p4d)
>> +{
>> +    if (pgtable_l4_enabled)
>> +        return !p4d_present(p4d);
>> +
>> +    return 0;
>> +}
>> +
>> +static inline void p4d_clear(p4d_t *p4d)
>> +{
>> +    if (pgtable_l4_enabled)
>> +        set_p4d(p4d, __p4d(0));
>> +}
>> +
>> +static inline pud_t *p4d_pgtable(p4d_t p4d)
>> +{
>> +    if (pgtable_l4_enabled)
>> +        return (pud_t *)pfn_to_virt(p4d_val(p4d) >> _PAGE_PFN_SHIFT);
>> +
>> +    return (pud_t *)pud_pgtable((pud_t) { p4d_val(p4d) });
>> +}
>> +
>> +static inline struct page *p4d_page(p4d_t p4d)
>> +{
>> +    return pfn_to_page(p4d_val(p4d) >> _PAGE_PFN_SHIFT);
>> +}
>> +
>> +#define pud_index(addr) (((addr) >> PUD_SHIFT) & (PTRS_PER_PUD - 1))
>> +
>> +#define pud_offset pud_offset
>> +static inline pud_t *pud_offset(p4d_t *p4d, unsigned long address)
>> +{
>> +    if (pgtable_l4_enabled)
>> +        return p4d_pgtable(*p4d) + pud_index(address);
>> +
>> +    return (pud_t *)p4d;
>> +}
>> +
>>   #endif /* _ASM_RISCV_PGTABLE_64_H */
>> diff --git a/arch/riscv/include/asm/pgtable.h b/arch/riscv/include/asm/pgtable.h
>> index e1a52e22ad7e..e1c74ef4ead2 100644
>> --- a/arch/riscv/include/asm/pgtable.h
>> +++ b/arch/riscv/include/asm/pgtable.h
>> @@ -51,7 +51,7 @@
>>    * position vmemmap directly below the VMALLOC region.
>>    */
>>   #ifdef CONFIG_64BIT
>> -#define VA_BITS        39
>> +#define VA_BITS        (pgtable_l4_enabled ? 48 : 39)
>>   #else
>>   #define VA_BITS        32
>>   #endif
>> @@ -90,8 +90,7 @@
>>     #ifndef __ASSEMBLY__
>>   -/* Page Upper Directory not used in RISC-V */
>> -#include <asm-generic/pgtable-nopud.h>
>> +#include <asm-generic/pgtable-nop4d.h>
>>   #include <asm/page.h>
>>   #include <asm/tlbflush.h>
>>   #include <linux/mm_types.h>
>> @@ -113,6 +112,17 @@
>>   #define XIP_FIXUP(addr)        (addr)
>>   #endif /* CONFIG_XIP_KERNEL */
>>   +struct pt_alloc_ops {
>> +    pte_t *(*get_pte_virt)(phys_addr_t pa);
>> +    phys_addr_t (*alloc_pte)(uintptr_t va);
>> +#ifndef __PAGETABLE_PMD_FOLDED
>> +    pmd_t *(*get_pmd_virt)(phys_addr_t pa);
>> +    phys_addr_t (*alloc_pmd)(uintptr_t va);
>> +    pud_t *(*get_pud_virt)(phys_addr_t pa);
>> +    phys_addr_t (*alloc_pud)(uintptr_t va);
>> +#endif
>> +};
>> +
>>   #ifdef CONFIG_MMU
>>   /* Number of entries in the page global directory */
>>   #define PTRS_PER_PGD    (PAGE_SIZE / sizeof(pgd_t))
>> @@ -669,9 +679,11 @@ static inline pmd_t pmdp_establish(struct vm_area_struct *vma,
>>    * Note that PGDIR_SIZE must evenly divide TASK_SIZE.
>>    */
>>   #ifdef CONFIG_64BIT
>> -#define TASK_SIZE (PGDIR_SIZE * PTRS_PER_PGD / 2)
>> +#define TASK_SIZE      (PGDIR_SIZE * PTRS_PER_PGD / 2)
>> +#define TASK_SIZE_MIN  (PGDIR_SIZE_L3 * PTRS_PER_PGD / 2)
>>   #else
>> -#define TASK_SIZE FIXADDR_START
>> +#define TASK_SIZE    FIXADDR_START
>> +#define TASK_SIZE_MIN    TASK_SIZE
>>   #endif
>>     #else /* CONFIG_MMU */
>> @@ -697,6 +709,8 @@ extern uintptr_t _dtb_early_pa;
>>   #define dtb_early_va    _dtb_early_va
>>   #define dtb_early_pa    _dtb_early_pa
>>   #endif /* CONFIG_XIP_KERNEL */
>> +extern u64 satp_mode;
>> +extern bool pgtable_l4_enabled;
>>     void paging_init(void);
>>   void misc_mem_init(void);
>> diff --git a/arch/riscv/kernel/head.S b/arch/riscv/kernel/head.S
>> index 52c5ff9804c5..c3c0ed559770 100644
>> --- a/arch/riscv/kernel/head.S
>> +++ b/arch/riscv/kernel/head.S
>> @@ -95,7 +95,8 @@ relocate:
>>         /* Compute satp for kernel page tables, but don't load it yet */
>>       srl a2, a0, PAGE_SHIFT
>> -    li a1, SATP_MODE
>> +    la a1, satp_mode
>> +    REG_L a1, 0(a1)
>>       or a2, a2, a1
>>         /*
>> diff --git a/arch/riscv/mm/context.c b/arch/riscv/mm/context.c
>> index ee3459cb6750..a7246872bd30 100644
>> --- a/arch/riscv/mm/context.c
>> +++ b/arch/riscv/mm/context.c
>> @@ -192,7 +192,7 @@ static void set_mm_asid(struct mm_struct *mm, unsigned int cpu)
>>   switch_mm_fast:
>>       csr_write(CSR_SATP, virt_to_pfn(mm->pgd) |
>>             ((cntx & asid_mask) << SATP_ASID_SHIFT) |
>> -          SATP_MODE);
>> +          satp_mode);
>>         if (need_flush_tlb)
>>           local_flush_tlb_all();
>> @@ -201,7 +201,7 @@ static void set_mm_asid(struct mm_struct *mm, unsigned int cpu)
>>   static void set_mm_noasid(struct mm_struct *mm)
>>   {
>>       /* Switch the page table and blindly nuke entire local TLB */
>> -    csr_write(CSR_SATP, virt_to_pfn(mm->pgd) | SATP_MODE);
>> +    csr_write(CSR_SATP, virt_to_pfn(mm->pgd) | satp_mode);
>>       local_flush_tlb_all();
>>   }
>>   diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c
>> index 1552226fb6bd..6a19a1b1caf8 100644
>> --- a/arch/riscv/mm/init.c
>> +++ b/arch/riscv/mm/init.c
>> @@ -37,6 +37,17 @@ EXPORT_SYMBOL(kernel_map);
>>   #define kernel_map    (*(struct kernel_mapping *)XIP_FIXUP(&kernel_map))
>>   #endif
>>   +#ifdef CONFIG_64BIT
>> +u64 satp_mode = !IS_ENABLED(CONFIG_XIP_KERNEL) ? SATP_MODE_48 : SATP_MODE_39;
>> +#else
>> +u64 satp_mode = SATP_MODE_32;
>> +#endif
>> +EXPORT_SYMBOL(satp_mode);
>> +
>> +bool pgtable_l4_enabled = IS_ENABLED(CONFIG_64BIT) && !IS_ENABLED(CONFIG_XIP_KERNEL) ?
>> +                true : false;
>> +EXPORT_SYMBOL(pgtable_l4_enabled);
>> +
>>   phys_addr_t phys_ram_base __ro_after_init;
>>   EXPORT_SYMBOL(phys_ram_base);
>>   @@ -53,15 +64,6 @@ extern char _start[];
>>   void *_dtb_early_va __initdata;
>>   uintptr_t _dtb_early_pa __initdata;
>>   -struct pt_alloc_ops {
>> -    pte_t *(*get_pte_virt)(phys_addr_t pa);
>> -    phys_addr_t (*alloc_pte)(uintptr_t va);
>> -#ifndef __PAGETABLE_PMD_FOLDED
>> -    pmd_t *(*get_pmd_virt)(phys_addr_t pa);
>> -    phys_addr_t (*alloc_pmd)(uintptr_t va);
>> -#endif
>> -};
>> -
>>   static phys_addr_t dma32_phys_limit __initdata;
>>     static void __init zone_sizes_init(void)
>> @@ -222,7 +224,7 @@ static void __init setup_bootmem(void)
>>   }
>>     #ifdef CONFIG_MMU
>> -static struct pt_alloc_ops _pt_ops __initdata;
>> +struct pt_alloc_ops _pt_ops __initdata;
>>     #ifdef CONFIG_XIP_KERNEL
>>   #define pt_ops (*(struct pt_alloc_ops *)XIP_FIXUP(&_pt_ops))
>> @@ -238,6 +240,7 @@ pgd_t trampoline_pg_dir[PTRS_PER_PGD] __page_aligned_bss;
>>   static pte_t fixmap_pte[PTRS_PER_PTE] __page_aligned_bss;
>>     pgd_t early_pg_dir[PTRS_PER_PGD] __initdata __aligned(PAGE_SIZE);
>> +static pud_t __maybe_unused early_dtb_pud[PTRS_PER_PUD] __initdata __aligned(PAGE_SIZE);
>>   static pmd_t __maybe_unused early_dtb_pmd[PTRS_PER_PMD] __initdata __aligned(PAGE_SIZE);
>>     #ifdef CONFIG_XIP_KERNEL
>> @@ -326,6 +329,16 @@ static pmd_t early_pmd[PTRS_PER_PMD] __initdata __aligned(PAGE_SIZE);
>>   #define early_pmd      ((pmd_t *)XIP_FIXUP(early_pmd))
>>   #endif /* CONFIG_XIP_KERNEL */
>>   +static pud_t trampoline_pud[PTRS_PER_PUD] __page_aligned_bss;
>> +static pud_t fixmap_pud[PTRS_PER_PUD] __page_aligned_bss;
>> +static pud_t early_pud[PTRS_PER_PUD] __initdata __aligned(PAGE_SIZE);
>> +
>> +#ifdef CONFIG_XIP_KERNEL
>> +#define trampoline_pud ((pud_t *)XIP_FIXUP(trampoline_pud))
>> +#define fixmap_pud     ((pud_t *)XIP_FIXUP(fixmap_pud))
>> +#define early_pud      ((pud_t *)XIP_FIXUP(early_pud))
>> +#endif /* CONFIG_XIP_KERNEL */
>> +
>>   static pmd_t *__init get_pmd_virt_early(phys_addr_t pa)
>>   {
>>       /* Before MMU is enabled */
>> @@ -345,7 +358,7 @@ static pmd_t *__init get_pmd_virt_late(phys_addr_t pa)
>>     static phys_addr_t __init alloc_pmd_early(uintptr_t va)
>>   {
>> -    BUG_ON((va - kernel_map.virt_addr) >> PGDIR_SHIFT);
>> +    BUG_ON((va - kernel_map.virt_addr) >> PUD_SHIFT);
>>         return (uintptr_t)early_pmd;
>>   }
>> @@ -391,21 +404,97 @@ static void __init create_pmd_mapping(pmd_t *pmdp,
>>       create_pte_mapping(ptep, va, pa, sz, prot);
>>   }
>>   -#define pgd_next_t        pmd_t
>> -#define alloc_pgd_next(__va)    pt_ops.alloc_pmd(__va)
>> -#define get_pgd_next_virt(__pa) pt_ops.get_pmd_virt(__pa)
>> +static pud_t *__init get_pud_virt_early(phys_addr_t pa)
>> +{
>> +    return (pud_t *)((uintptr_t)pa);
>> +}
>> +
>> +static pud_t *__init get_pud_virt_fixmap(phys_addr_t pa)
>> +{
>> +    clear_fixmap(FIX_PUD);
>> +    return (pud_t *)set_fixmap_offset(FIX_PUD, pa);
>> +}
>> +
>> +static pud_t *__init get_pud_virt_late(phys_addr_t pa)
>> +{
>> +    return (pud_t *)__va(pa);
>> +}
>> +
>> +static phys_addr_t __init alloc_pud_early(uintptr_t va)
>> +{
>> +    /* Only one PUD is available for early mapping */
>> +    BUG_ON((va - kernel_map.virt_addr) >> PGDIR_SHIFT);
>> +
>> +    return (uintptr_t)early_pud;
>> +}
>> +
>> +static phys_addr_t __init alloc_pud_fixmap(uintptr_t va)
>> +{
>> +    return memblock_phys_alloc(PAGE_SIZE, PAGE_SIZE);
>> +}
>> +
>> +static phys_addr_t alloc_pud_late(uintptr_t va)
>> +{
>> +    unsigned long vaddr;
>> +
>> +    vaddr = __get_free_page(GFP_KERNEL);
>> +    BUG_ON(!vaddr);
>> +    return __pa(vaddr);
>> +}
>> +
>> +static void __init create_pud_mapping(pud_t *pudp,
>> +                      uintptr_t va, phys_addr_t pa,
>> +                      phys_addr_t sz, pgprot_t prot)
>> +{
>> +    pmd_t *nextp;
>> +    phys_addr_t next_phys;
>> +    uintptr_t pud_index = pud_index(va);
>> +
>> +    if (sz == PUD_SIZE) {
>> +        if (pud_val(pudp[pud_index]) == 0)
>> +            pudp[pud_index] = pfn_pud(PFN_DOWN(pa), prot);
>> +        return;
>> +    }
>> +
>> +    if (pud_val(pudp[pud_index]) == 0) {
>> +        next_phys = pt_ops.alloc_pmd(va);
>> +        pudp[pud_index] = pfn_pud(PFN_DOWN(next_phys), PAGE_TABLE);
>> +        nextp = pt_ops.get_pmd_virt(next_phys);
>> +        memset(nextp, 0, PAGE_SIZE);
>> +    } else {
>> +        next_phys = PFN_PHYS(_pud_pfn(pudp[pud_index]));
>> +        nextp = pt_ops.get_pmd_virt(next_phys);
>> +    }
>> +
>> +    create_pmd_mapping(nextp, va, pa, sz, prot);
>> +}
>> +
>> +#define pgd_next_t        pud_t
>> +#define alloc_pgd_next(__va)    (pgtable_l4_enabled ?            \
>> +        pt_ops.alloc_pud(__va) : pt_ops.alloc_pmd(__va))
>> +#define get_pgd_next_virt(__pa)    (pgtable_l4_enabled ?            \
>> +        pt_ops.get_pud_virt(__pa) : (pgd_next_t *)pt_ops.get_pmd_virt(__pa))
>>   #define create_pgd_next_mapping(__nextp, __va, __pa, __sz, __prot)    \
>> -    create_pmd_mapping(__nextp, __va, __pa, __sz, __prot)
>> -#define fixmap_pgd_next        fixmap_pmd
>> +                (pgtable_l4_enabled ?            \
>> +        create_pud_mapping(__nextp, __va, __pa, __sz, __prot) :    \
>> +        create_pmd_mapping((pmd_t *)__nextp, __va, __pa, __sz, __prot))
>> +#define fixmap_pgd_next        (pgtable_l4_enabled ?            \
>> +        (uintptr_t)fixmap_pud : (uintptr_t)fixmap_pmd)
>> +#define trampoline_pgd_next    (pgtable_l4_enabled ?            \
>> +        (uintptr_t)trampoline_pud : (uintptr_t)trampoline_pmd)
>> +#define early_dtb_pgd_next    (pgtable_l4_enabled ?            \
>> +        (uintptr_t)early_dtb_pud : (uintptr_t)early_dtb_pmd)
>>   #else
>>   #define pgd_next_t        pte_t
>>   #define alloc_pgd_next(__va)    pt_ops.alloc_pte(__va)
>>   #define get_pgd_next_virt(__pa) pt_ops.get_pte_virt(__pa)
>>   #define create_pgd_next_mapping(__nextp, __va, __pa, __sz, __prot)    \
>>       create_pte_mapping(__nextp, __va, __pa, __sz, __prot)
>> -#define fixmap_pgd_next        fixmap_pte
>> +#define fixmap_pgd_next        ((uintptr_t)fixmap_pte)
>> +#define early_dtb_pgd_next    ((uintptr_t)early_dtb_pmd)
>> +#define create_pud_mapping(__pmdp, __va, __pa, __sz, __prot)
>>   #define create_pmd_mapping(__pmdp, __va, __pa, __sz, __prot)
>> -#endif
>> +#endif /* __PAGETABLE_PMD_FOLDED */
>>     void __init create_pgd_mapping(pgd_t *pgdp,
>>                         uintptr_t va, phys_addr_t pa,
>> @@ -493,6 +582,57 @@ static __init pgprot_t pgprot_from_va(uintptr_t va)
>>   }
>>   #endif /* CONFIG_STRICT_KERNEL_RWX */
>>   +#ifdef CONFIG_64BIT
>> +static void __init disable_pgtable_l4(void)
>> +{
>> +    pgtable_l4_enabled = false;
>> +    kernel_map.page_offset = PAGE_OFFSET_L3;
>> +    satp_mode = SATP_MODE_39;
>> +}
>> +
>> +/*
>> + * There is a simple way to determine if 4-level is supported by the
>> + * underlying hardware: establish 1:1 mapping in 4-level page table mode
>> + * then read SATP to see if the configuration was taken into account
>> + * meaning sv48 is supported.
>> + */
>> +static __init void set_satp_mode(void)
>> +{
>> +    u64 identity_satp, hw_satp;
>> +    uintptr_t set_satp_mode_pmd;
>> +
>> +    set_satp_mode_pmd = ((unsigned long)set_satp_mode) & PMD_MASK;
>> +    create_pgd_mapping(early_pg_dir,
>> +               set_satp_mode_pmd, (uintptr_t)early_pud,
>> +               PGDIR_SIZE, PAGE_TABLE);
>> +    create_pud_mapping(early_pud,
>> +               set_satp_mode_pmd, (uintptr_t)early_pmd,
>> +               PUD_SIZE, PAGE_TABLE);
>> +    /* Handle the case where set_satp_mode straddles 2 PMDs */
>> +    create_pmd_mapping(early_pmd,
>> +               set_satp_mode_pmd, set_satp_mode_pmd,
>> +               PMD_SIZE, PAGE_KERNEL_EXEC);
>> +    create_pmd_mapping(early_pmd,
>> +               set_satp_mode_pmd + PMD_SIZE,
>> +               set_satp_mode_pmd + PMD_SIZE,
>> +               PMD_SIZE, PAGE_KERNEL_EXEC);
>> +
>> +    identity_satp = PFN_DOWN((uintptr_t)&early_pg_dir) | satp_mode;
>> +
>> +    local_flush_tlb_all();
>> +    csr_write(CSR_SATP, identity_satp);
>> +    hw_satp = csr_swap(CSR_SATP, 0ULL);
>> +    local_flush_tlb_all();
>> +
>> +    if (hw_satp != identity_satp)
>> +        disable_pgtable_l4();
>> +
>> +    memset(early_pg_dir, 0, PAGE_SIZE);
>> +    memset(early_pud, 0, PAGE_SIZE);
>> +    memset(early_pmd, 0, PAGE_SIZE);
>> +}
>> +#endif
>> +
>>   /*
>>    * setup_vm() is called from head.S with MMU-off.
>>    *
>> @@ -557,10 +697,15 @@ static void __init create_fdt_early_page_table(pgd_t *pgdir, uintptr_t dtb_pa)
>>       uintptr_t pa = dtb_pa & ~(PMD_SIZE - 1);
>>         create_pgd_mapping(early_pg_dir, DTB_EARLY_BASE_VA,
>> -               IS_ENABLED(CONFIG_64BIT) ? (uintptr_t)early_dtb_pmd : pa,
>> +               IS_ENABLED(CONFIG_64BIT) ? early_dtb_pgd_next : pa,
>>                  PGDIR_SIZE,
>>                  IS_ENABLED(CONFIG_64BIT) ? PAGE_TABLE : PAGE_KERNEL);
>>   +    if (pgtable_l4_enabled) {
>> +        create_pud_mapping(early_dtb_pud, DTB_EARLY_BASE_VA,
>> +                   (uintptr_t)early_dtb_pmd, PUD_SIZE, PAGE_TABLE);
>> +    }
>> +
>>       if (IS_ENABLED(CONFIG_64BIT)) {
>>           create_pmd_mapping(early_dtb_pmd, DTB_EARLY_BASE_VA,
>>                      pa, PMD_SIZE, PAGE_KERNEL);
>> @@ -593,6 +738,8 @@ void pt_ops_set_early(void)
>>   #ifndef __PAGETABLE_PMD_FOLDED
>>       pt_ops.alloc_pmd = alloc_pmd_early;
>>       pt_ops.get_pmd_virt = get_pmd_virt_early;
>> +    pt_ops.alloc_pud = alloc_pud_early;
>> +    pt_ops.get_pud_virt = get_pud_virt_early;
>>   #endif
>>   }
>>   @@ -611,6 +758,8 @@ void pt_ops_set_fixmap(void)
>>   #ifndef __PAGETABLE_PMD_FOLDED
>>       pt_ops.alloc_pmd = kernel_mapping_pa_to_va((uintptr_t)alloc_pmd_fixmap);
>>       pt_ops.get_pmd_virt = kernel_mapping_pa_to_va((uintptr_t)get_pmd_virt_fixmap);
>> +    pt_ops.alloc_pud = kernel_mapping_pa_to_va((uintptr_t)alloc_pud_fixmap);
>> +    pt_ops.get_pud_virt = kernel_mapping_pa_to_va((uintptr_t)get_pud_virt_fixmap);
>>   #endif
>>   }
>>   @@ -625,6 +774,8 @@ void pt_ops_set_late(void)
>>   #ifndef __PAGETABLE_PMD_FOLDED
>>       pt_ops.alloc_pmd = alloc_pmd_late;
>>       pt_ops.get_pmd_virt = get_pmd_virt_late;
>> +    pt_ops.alloc_pud = alloc_pud_late;
>> +    pt_ops.get_pud_virt = get_pud_virt_late;
>>   #endif
>>   }
>>   @@ -633,6 +784,7 @@ asmlinkage void __init setup_vm(uintptr_t dtb_pa)
>>       pmd_t __maybe_unused fix_bmap_spmd, fix_bmap_epmd;
>>         kernel_map.virt_addr = KERNEL_LINK_ADDR;
>> +    kernel_map.page_offset = _AC(CONFIG_PAGE_OFFSET, UL);
>>     #ifdef CONFIG_XIP_KERNEL
>>       kernel_map.xiprom = (uintptr_t)CONFIG_XIP_PHYS_ADDR;
>> @@ -647,6 +799,11 @@ asmlinkage void __init setup_vm(uintptr_t dtb_pa)
>>       kernel_map.phys_addr = (uintptr_t)(&_start);
>>       kernel_map.size = (uintptr_t)(&_end) - kernel_map.phys_addr;
>>   #endif
>> +
>> +#if defined(CONFIG_64BIT) && !defined(CONFIG_XIP_KERNEL)
>> +    set_satp_mode();
>> +#endif
>> +
>>       kernel_map.va_pa_offset = PAGE_OFFSET - kernel_map.phys_addr;
>>       kernel_map.va_kernel_pa_offset = kernel_map.virt_addr - kernel_map.phys_addr;
>>   @@ -676,15 +833,21 @@ asmlinkage void __init setup_vm(uintptr_t dtb_pa)
>>         /* Setup early PGD for fixmap */
>>       create_pgd_mapping(early_pg_dir, FIXADDR_START,
>> -               (uintptr_t)fixmap_pgd_next, PGDIR_SIZE, PAGE_TABLE);
>> +               fixmap_pgd_next, PGDIR_SIZE, PAGE_TABLE);
>>     #ifndef __PAGETABLE_PMD_FOLDED
>> -    /* Setup fixmap PMD */
>> +    /* Setup fixmap PUD and PMD */
>> +    if (pgtable_l4_enabled)
>> +        create_pud_mapping(fixmap_pud, FIXADDR_START,
>> +                   (uintptr_t)fixmap_pmd, PUD_SIZE, PAGE_TABLE);
>>       create_pmd_mapping(fixmap_pmd, FIXADDR_START,
>>                  (uintptr_t)fixmap_pte, PMD_SIZE, PAGE_TABLE);
>>       /* Setup trampoline PGD and PMD */
>>       create_pgd_mapping(trampoline_pg_dir, kernel_map.virt_addr,
>> -               (uintptr_t)trampoline_pmd, PGDIR_SIZE, PAGE_TABLE);
>> +               trampoline_pgd_next, PGDIR_SIZE, PAGE_TABLE);
>> +    if (pgtable_l4_enabled)
>> +        create_pud_mapping(trampoline_pud, kernel_map.virt_addr,
>> +                   (uintptr_t)trampoline_pmd, PUD_SIZE, PAGE_TABLE);
>>   #ifdef CONFIG_XIP_KERNEL
>>       create_pmd_mapping(trampoline_pmd, kernel_map.virt_addr,
>>                  kernel_map.xiprom, PMD_SIZE, PAGE_KERNEL_EXEC);
>> @@ -712,7 +875,7 @@ asmlinkage void __init setup_vm(uintptr_t dtb_pa)
>>        * Bootime fixmap only can handle PMD_SIZE mapping. Thus, boot-ioremap
>>        * range can not span multiple pmds.
>>        */
>> -    BUILD_BUG_ON((__fix_to_virt(FIX_BTMAP_BEGIN) >> PMD_SHIFT)
>> +    BUG_ON((__fix_to_virt(FIX_BTMAP_BEGIN) >> PMD_SHIFT)
>>                != (__fix_to_virt(FIX_BTMAP_END) >> PMD_SHIFT));
>>     #ifndef __PAGETABLE_PMD_FOLDED
>> @@ -783,9 +946,10 @@ static void __init setup_vm_final(void)
>>       /* Clear fixmap PTE and PMD mappings */
>>       clear_fixmap(FIX_PTE);
>>       clear_fixmap(FIX_PMD);
>> +    clear_fixmap(FIX_PUD);
>>         /* Move to swapper page table */
>> -    csr_write(CSR_SATP, PFN_DOWN(__pa_symbol(swapper_pg_dir)) | SATP_MODE);
>> +    csr_write(CSR_SATP, PFN_DOWN(__pa_symbol(swapper_pg_dir)) | satp_mode);
>>       local_flush_tlb_all();
>>         pt_ops_set_late();
>> diff --git a/arch/riscv/mm/kasan_init.c b/arch/riscv/mm/kasan_init.c
>> index 1434a0225140..993f50571a3b 100644
>> --- a/arch/riscv/mm/kasan_init.c
>> +++ b/arch/riscv/mm/kasan_init.c
>> @@ -11,7 +11,29 @@
>>   #include <asm/fixmap.h>
>>   #include <asm/pgalloc.h>
>>   +/*
>> + * Kasan shadow region must lie at a fixed address across sv39, sv48 and sv57
>> + * which is right before the kernel.
>> + *
>> + * For sv39, the region is aligned on PGDIR_SIZE so we only need to populate
>> + * the page global directory with kasan_early_shadow_pmd.
>> + *
>> + * For sv48 and sv57, the region is not aligned on PGDIR_SIZE so the mapping
>> + * must be divided as follows:
>> + * - the first PGD entry, although incomplete, is populated with
>> + *   kasan_early_shadow_pud/p4d
>> + * - the PGD entries in the middle are populated with kasan_early_shadow_pud/p4d
>> + * - the last PGD entry is shared with the kernel mapping so populated at the
>> + *   lower levels pud/p4d
>> + *
>> + * In addition, when shallow populating a kasan region (for example vmalloc),
>> + * this region may also not be aligned on PGDIR size, so we must go down to the
>> + * pud level too.
>> + */
>> +
>>   extern pgd_t early_pg_dir[PTRS_PER_PGD];
>> +extern struct pt_alloc_ops _pt_ops __initdata;
>> +#define pt_ops    _pt_ops
>>     static void __init kasan_populate_pte(pmd_t *pmd, unsigned long vaddr, unsigned long end)
>>   {
>> @@ -35,15 +57,19 @@ static void __init kasan_populate_pte(pmd_t *pmd, unsigned long vaddr, unsigned
>>       set_pmd(pmd, pfn_pmd(PFN_DOWN(__pa(base_pte)), PAGE_TABLE));
>>   }
>>   -static void __init kasan_populate_pmd(pgd_t *pgd, unsigned long vaddr, unsigned long end)
>> +static void __init kasan_populate_pmd(pud_t *pud, unsigned long vaddr, unsigned long end)
>>   {
>>       phys_addr_t phys_addr;
>>       pmd_t *pmdp, *base_pmd;
>>       unsigned long next;
>>   -    base_pmd = (pmd_t *)pgd_page_vaddr(*pgd);
>> -    if (base_pmd == lm_alias(kasan_early_shadow_pmd))
>> +    if (pud_none(*pud)) {
>>           base_pmd = memblock_alloc(PTRS_PER_PMD * sizeof(pmd_t), PAGE_SIZE);
>> +    } else {
>> +        base_pmd = (pmd_t *)pud_pgtable(*pud);
>> +        if (base_pmd == lm_alias(kasan_early_shadow_pmd))
>> +            base_pmd = memblock_alloc(PTRS_PER_PMD * sizeof(pmd_t), PAGE_SIZE);
>> +    }
>>         pmdp = base_pmd + pmd_index(vaddr);
>>   @@ -67,9 +93,72 @@ static void __init kasan_populate_pmd(pgd_t *pgd, unsigned long vaddr, unsigned
>>        * it entirely, memblock could allocate a page at a physical address
>>        * where KASAN is not populated yet and then we'd get a page fault.
>>        */
>> -    set_pgd(pgd, pfn_pgd(PFN_DOWN(__pa(base_pmd)), PAGE_TABLE));
>> +    set_pud(pud, pfn_pud(PFN_DOWN(__pa(base_pmd)), PAGE_TABLE));
>> +}
>> +
>> +static void __init kasan_populate_pud(pgd_t *pgd,
>> +                      unsigned long vaddr, unsigned long end,
>> +                      bool early)
>> +{
>> +    phys_addr_t phys_addr;
>> +    pud_t *pudp, *base_pud;
>> +    unsigned long next;
>> +
>> +    if (early) {
>> +        /*
>> +         * We can't use pgd_page_vaddr here as it would return a linear
>> +         * mapping address but it is not mapped yet, but when populating
>> +         * early_pg_dir, we need the physical address and when populating
>> +         * swapper_pg_dir, we need the kernel virtual address so use
>> +         * pt_ops facility.
>> +         */
>> +        base_pud = pt_ops.get_pud_virt(pfn_to_phys(_pgd_pfn(*pgd)));
>> +    } else {
>> +        base_pud = (pud_t *)pgd_page_vaddr(*pgd);
>> +        if (base_pud == lm_alias(kasan_early_shadow_pud))
>> +            base_pud = memblock_alloc(PTRS_PER_PUD * sizeof(pud_t), PAGE_SIZE);
>> +    }
>> +
>> +    pudp = base_pud + pud_index(vaddr);
>> +
>> +    do {
>> +        next = pud_addr_end(vaddr, end);
>> +
>> +        if (pud_none(*pudp) && IS_ALIGNED(vaddr, PUD_SIZE) && (next - vaddr) >= PUD_SIZE) {
>> +            if (early) {
>> +                phys_addr = __pa(((uintptr_t)kasan_early_shadow_pmd));
>> +                set_pud(pudp, pfn_pud(PFN_DOWN(phys_addr), PAGE_TABLE));
>> +                continue;
>> +            } else {
>> +                phys_addr = memblock_phys_alloc(PUD_SIZE, PUD_SIZE);
>> +                if (phys_addr) {
>> +                    set_pud(pudp, pfn_pud(PFN_DOWN(phys_addr), PAGE_KERNEL));
>> +                    continue;
>> +                }
>> +            }
>> +        }
>> +
>> +        kasan_populate_pmd(pudp, vaddr, next);
>> +    } while (pudp++, vaddr = next, vaddr != end);
>> +
>> +    /*
>> +     * Wait for the whole PGD to be populated before setting the PGD in
>> +     * the page table, otherwise, if we did set the PGD before populating
>> +     * it entirely, memblock could allocate a page at a physical address
>> +     * where KASAN is not populated yet and then we'd get a page fault.
>> +     */
>> +    if (!early)
>> +        set_pgd(pgd, pfn_pgd(PFN_DOWN(__pa(base_pud)), PAGE_TABLE));
>>   }
>>   +#define kasan_early_shadow_pgd_next (pgtable_l4_enabled ?    \
>> +                (uintptr_t)kasan_early_shadow_pud : \
>> +                (uintptr_t)kasan_early_shadow_pmd)
>> +#define kasan_populate_pgd_next(pgdp, vaddr, next, early)            \
>> +        (pgtable_l4_enabled ?                        \
>> +            kasan_populate_pud(pgdp, vaddr, next, early) :        \
>> +            kasan_populate_pmd((pud_t *)pgdp, vaddr, next))
>> +
>>   static void __init kasan_populate_pgd(pgd_t *pgdp,
>>                         unsigned long vaddr, unsigned long end,
>>                         bool early)
>> @@ -102,7 +191,7 @@ static void __init kasan_populate_pgd(pgd_t *pgdp,
>>               }
>>           }
>>   -        kasan_populate_pmd(pgdp, vaddr, next);
>> +        kasan_populate_pgd_next(pgdp, vaddr, next, early);
>>       } while (pgdp++, vaddr = next, vaddr != end);
>>   }
>>   @@ -157,18 +246,54 @@ static void __init kasan_populate(void *start, void *end)
>>       memset(start, KASAN_SHADOW_INIT, end - start);
>>   }
>>   +static void __init kasan_shallow_populate_pud(pgd_t *pgdp,
>> +                          unsigned long vaddr, unsigned long end,
>> +                          bool kasan_populate)
>> +{
>> +    unsigned long next;
>> +    pud_t *pudp, *base_pud;
>> +    pmd_t *base_pmd;
>> +    bool is_kasan_pmd;
>> +
>> +    base_pud = (pud_t *)pgd_page_vaddr(*pgdp);
>> +    pudp = base_pud + pud_index(vaddr);
>> +
>> +    if (kasan_populate)
>> +        memcpy(base_pud, (void *)kasan_early_shadow_pgd_next,
>> +               sizeof(pud_t) * PTRS_PER_PUD);
>> +
>> +    do {
>> +        next = pud_addr_end(vaddr, end);
>> +        is_kasan_pmd = (pud_pgtable(*pudp) == lm_alias(kasan_early_shadow_pmd));
>> +
>> +        if (is_kasan_pmd) {
>> +            base_pmd = memblock_alloc(PAGE_SIZE, PAGE_SIZE);
>> +            set_pud(pudp, pfn_pud(PFN_DOWN(__pa(base_pmd)), PAGE_TABLE));
>> +        }
>> +    } while (pudp++, vaddr = next, vaddr != end);
>> +}
>> +
>>   static void __init kasan_shallow_populate_pgd(unsigned long vaddr, unsigned long end)
>>   {
>>       unsigned long next;
>>       void *p;
>>       pgd_t *pgd_k = pgd_offset_k(vaddr);
>> +    bool is_kasan_pgd_next;
>>         do {
>>           next = pgd_addr_end(vaddr, end);
>> -        if (pgd_page_vaddr(*pgd_k) == (unsigned long)lm_alias(kasan_early_shadow_pmd)) {
>> +        is_kasan_pgd_next = (pgd_page_vaddr(*pgd_k) ==
>> +                     (unsigned long)lm_alias(kasan_early_shadow_pgd_next));
>> +
>> +        if (is_kasan_pgd_next) {
>>               p = memblock_alloc(PAGE_SIZE, PAGE_SIZE);
>>               set_pgd(pgd_k, pfn_pgd(PFN_DOWN(__pa(p)), PAGE_TABLE));
>>           }
>> +
>> +        if (IS_ALIGNED(vaddr, PGDIR_SIZE) && (next - vaddr) >= PGDIR_SIZE)
>> +            continue;
>> +
>> +        kasan_shallow_populate_pud(pgd_k, vaddr, next, is_kasan_pgd_next);
>>       } while (pgd_k++, vaddr = next, vaddr != end);
>>   }
>
>
> @Qinglin: I can deal with sv57 kasan population if need be, as it is a bit tricky and I think it would save you quite some time :)

Thanks so much for your suggestion! I want to give it a try first, as I am
now working on the new Sv57 patchset :) I will ask for your help if I run
into any trouble, and thanks again!

Yours,
Qinglin

>
>
>>   diff --git a/drivers/firmware/efi/libstub/efi-stub.c b/drivers/firmware/efi/libstub/efi-stub.c
>> index 26e69788f27a..b3db5d91ed38 100644
>> --- a/drivers/firmware/efi/libstub/efi-stub.c
>> +++ b/drivers/firmware/efi/libstub/efi-stub.c
>> @@ -40,6 +40,8 @@
>>     #ifdef CONFIG_ARM64
>>   # define EFI_RT_VIRTUAL_LIMIT    DEFAULT_MAP_WINDOW_64
>> +#elif defined(CONFIG_RISCV)
>> +# define EFI_RT_VIRTUAL_LIMIT    TASK_SIZE_MIN
>>   #else
>>   # define EFI_RT_VIRTUAL_LIMIT    TASK_SIZE
>>   #endif


2021-12-20 09:11:22

by Guo Ren

[permalink] [raw]
Subject: Re: [PATCH v3 12/13] riscv: Initialize thread pointer before calling C functions

On Tue, Dec 7, 2021 at 11:55 AM Alexandre Ghiti
<[email protected]> wrote:
>
> Because of the stack canary feature that reads from the current task
> structure the stack canary value, the thread pointer register "tp" must
> be set before calling any C function from head.S: by chance, setup_vm
Shall we disable -fstack-protector for setup_vm() with an __attribute__?
Actually, we already initialize tp later.

> and all the functions that it calls does not seem to be part of the
> functions where the canary check is done, but in the following commits,
> some functions will.
>
> Fixes: f2c9699f65557a31 ("riscv: Add STACKPROTECTOR supported")
> Signed-off-by: Alexandre Ghiti <[email protected]>
> ---
> arch/riscv/kernel/head.S | 1 +
> 1 file changed, 1 insertion(+)
>
> diff --git a/arch/riscv/kernel/head.S b/arch/riscv/kernel/head.S
> index c3c0ed559770..86f7ee3d210d 100644
> --- a/arch/riscv/kernel/head.S
> +++ b/arch/riscv/kernel/head.S
> @@ -302,6 +302,7 @@ clear_bss_done:
> REG_S a0, (a2)
>
> /* Initialize page tables and relocate to virtual addresses */
> + la tp, init_task
> la sp, init_thread_union + THREAD_SIZE
> XIP_FIXUP_OFFSET sp
> #ifdef CONFIG_BUILTIN_DTB
> --
> 2.32.0
>


--
Best Regards
Guo Ren

ML: https://lore.kernel.org/linux-csky/

2021-12-20 09:17:36

by Ard Biesheuvel

[permalink] [raw]
Subject: Re: [PATCH v3 12/13] riscv: Initialize thread pointer before calling C functions

On Mon, 20 Dec 2021 at 10:11, Guo Ren <[email protected]> wrote:
>
> On Tue, Dec 7, 2021 at 11:55 AM Alexandre Ghiti
> <[email protected]> wrote:
> >
> > Because of the stack canary feature that reads from the current task
> > structure the stack canary value, the thread pointer register "tp" must
> > be set before calling any C function from head.S: by chance, setup_vm
> Shall we disable -fstack-protector for setup_vm() with __attribute__?

Don't use __attribute__((optimize())) for that: it is known to be
broken, and is documented as being for debug purposes only in the GCC
info pages:

https://gcc.gnu.org/onlinedocs/gcc/Common-Function-Attributes.html




> [...]

2021-12-20 13:41:08

by Guo Ren

[permalink] [raw]
Subject: Re: [PATCH v3 12/13] riscv: Initialize thread pointer before calling C functions

On Mon, Dec 20, 2021 at 5:17 PM Ard Biesheuvel <[email protected]> wrote:
>
> On Mon, 20 Dec 2021 at 10:11, Guo Ren <[email protected]> wrote:
> >
> > On Tue, Dec 7, 2021 at 11:55 AM Alexandre Ghiti
> > <[email protected]> wrote:
> > >
> > > Because of the stack canary feature that reads from the current task
> > > structure the stack canary value, the thread pointer register "tp" must
> > > be set before calling any C function from head.S: by chance, setup_vm
> > Shall we disable -fstack-protector for setup_vm() with __attribute__?
>
> Don't use __attribute__((optimize())) for that: it is known to be
> broken, and documented as debug purposes only in the GCC info pages:
>
> https://gcc.gnu.org/onlinedocs/gcc/Common-Function-Attributes.html
Oh, thx for the link.

> [...]



--
Best Regards
Guo Ren

ML: https://lore.kernel.org/linux-csky/

2021-12-26 09:16:33

by Jisheng Zhang

Subject: Re: [PATCH v3 07/13] riscv: Implement sv48 support

On Mon, 6 Dec 2021 11:46:51 +0100
Alexandre Ghiti <[email protected]> wrote:

> By adding a new 4th level of page table, give the possibility to 64bit
> kernel to address 2^48 bytes of virtual address: in practice, that offers
> 128TB of virtual address space to userspace and allows up to 64TB of
> physical memory.
>
> If the underlying hardware does not support sv48, we will automatically
> fallback to a standard 3-level page table by folding the new PUD level into
> PGDIR level. In order to detect HW capabilities at runtime, we
> use SATP feature that ignores writes with an unsupported mode.
>
> Signed-off-by: Alexandre Ghiti <[email protected]>
> ---
> arch/riscv/Kconfig | 4 +-
> arch/riscv/include/asm/csr.h | 3 +-
> arch/riscv/include/asm/fixmap.h | 1 +
> arch/riscv/include/asm/kasan.h | 6 +-
> arch/riscv/include/asm/page.h | 14 ++
> arch/riscv/include/asm/pgalloc.h | 40 +++++
> arch/riscv/include/asm/pgtable-64.h | 108 +++++++++++-
> arch/riscv/include/asm/pgtable.h | 24 ++-
> arch/riscv/kernel/head.S | 3 +-
> arch/riscv/mm/context.c | 4 +-
> arch/riscv/mm/init.c | 212 +++++++++++++++++++++---
> arch/riscv/mm/kasan_init.c | 137 ++++++++++++++-
> drivers/firmware/efi/libstub/efi-stub.c | 2 +
> 13 files changed, 514 insertions(+), 44 deletions(-)
>
> diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
> index ac6c0cd9bc29..d28fe0148e13 100644
> --- a/arch/riscv/Kconfig
> +++ b/arch/riscv/Kconfig
> @@ -150,7 +150,7 @@ config PAGE_OFFSET
> hex
> default 0xC0000000 if 32BIT
> default 0x80000000 if 64BIT && !MMU
> - default 0xffffffd800000000 if 64BIT
> + default 0xffffaf8000000000 if 64BIT
>
> config KASAN_SHADOW_OFFSET
> hex
> @@ -201,7 +201,7 @@ config FIX_EARLYCON_MEM
>
> config PGTABLE_LEVELS
> int
> - default 3 if 64BIT
> + default 4 if 64BIT
> default 2
>
> config LOCKDEP_SUPPORT
> diff --git a/arch/riscv/include/asm/csr.h b/arch/riscv/include/asm/csr.h
> index 87ac65696871..3fdb971c7896 100644
> --- a/arch/riscv/include/asm/csr.h
> +++ b/arch/riscv/include/asm/csr.h
> @@ -40,14 +40,13 @@
> #ifndef CONFIG_64BIT
> #define SATP_PPN _AC(0x003FFFFF, UL)
> #define SATP_MODE_32 _AC(0x80000000, UL)
> -#define SATP_MODE SATP_MODE_32
> #define SATP_ASID_BITS 9
> #define SATP_ASID_SHIFT 22
> #define SATP_ASID_MASK _AC(0x1FF, UL)
> #else
> #define SATP_PPN _AC(0x00000FFFFFFFFFFF, UL)
> #define SATP_MODE_39 _AC(0x8000000000000000, UL)
> -#define SATP_MODE SATP_MODE_39
> +#define SATP_MODE_48 _AC(0x9000000000000000, UL)
> #define SATP_ASID_BITS 16
> #define SATP_ASID_SHIFT 44
> #define SATP_ASID_MASK _AC(0xFFFF, UL)
> diff --git a/arch/riscv/include/asm/fixmap.h b/arch/riscv/include/asm/fixmap.h
> index 54cbf07fb4e9..58a718573ad6 100644
> --- a/arch/riscv/include/asm/fixmap.h
> +++ b/arch/riscv/include/asm/fixmap.h
> @@ -24,6 +24,7 @@ enum fixed_addresses {
> FIX_HOLE,
> FIX_PTE,
> FIX_PMD,
> + FIX_PUD,
> FIX_TEXT_POKE1,
> FIX_TEXT_POKE0,
> FIX_EARLYCON_MEM_BASE,
> diff --git a/arch/riscv/include/asm/kasan.h b/arch/riscv/include/asm/kasan.h
> index 743e6ff57996..0b85e363e778 100644
> --- a/arch/riscv/include/asm/kasan.h
> +++ b/arch/riscv/include/asm/kasan.h
> @@ -28,7 +28,11 @@
> #define KASAN_SHADOW_SCALE_SHIFT 3
>
> #define KASAN_SHADOW_SIZE (UL(1) << ((VA_BITS - 1) - KASAN_SHADOW_SCALE_SHIFT))
> -#define KASAN_SHADOW_START (KASAN_SHADOW_END - KASAN_SHADOW_SIZE)
> +/*
> + * Depending on the size of the virtual address space, the region may not be
> + * aligned on PGDIR_SIZE, so force its alignment to ease its population.
> + */
> +#define KASAN_SHADOW_START ((KASAN_SHADOW_END - KASAN_SHADOW_SIZE) & PGDIR_MASK)
> #define KASAN_SHADOW_END MODULES_LOWEST_VADDR
> #define KASAN_SHADOW_OFFSET _AC(CONFIG_KASAN_SHADOW_OFFSET, UL)
>
> diff --git a/arch/riscv/include/asm/page.h b/arch/riscv/include/asm/page.h
> index e03559f9b35e..d089fe46f7d8 100644
> --- a/arch/riscv/include/asm/page.h
> +++ b/arch/riscv/include/asm/page.h
> @@ -31,7 +31,20 @@
> * When not using MMU this corresponds to the first free page in
> * physical memory (aligned on a page boundary).
> */
> +#ifdef CONFIG_64BIT
> +#ifdef CONFIG_MMU
> +#define PAGE_OFFSET kernel_map.page_offset
> +#else
> +#define PAGE_OFFSET _AC(CONFIG_PAGE_OFFSET, UL)
> +#endif
> +/*
> + * By default, CONFIG_PAGE_OFFSET value corresponds to SV48 address space so
> + * define the PAGE_OFFSET value for SV39.
> + */
> +#define PAGE_OFFSET_L3 _AC(0xffffffd800000000, UL)
> +#else
> #define PAGE_OFFSET _AC(CONFIG_PAGE_OFFSET, UL)
> +#endif /* CONFIG_64BIT */
>
> /*
> * Half of the kernel address space (half of the entries of the page global
> @@ -90,6 +103,7 @@ extern unsigned long riscv_pfn_base;
> #endif /* CONFIG_MMU */
>
> struct kernel_mapping {
> + unsigned long page_offset;
> unsigned long virt_addr;
> uintptr_t phys_addr;
> uintptr_t size;
> diff --git a/arch/riscv/include/asm/pgalloc.h b/arch/riscv/include/asm/pgalloc.h
> index 0af6933a7100..11823004b87a 100644
> --- a/arch/riscv/include/asm/pgalloc.h
> +++ b/arch/riscv/include/asm/pgalloc.h
> @@ -11,6 +11,8 @@
> #include <asm/tlb.h>
>
> #ifdef CONFIG_MMU
> +#define __HAVE_ARCH_PUD_ALLOC_ONE
> +#define __HAVE_ARCH_PUD_FREE
> #include <asm-generic/pgalloc.h>
>
> static inline void pmd_populate_kernel(struct mm_struct *mm,
> @@ -36,6 +38,44 @@ static inline void pud_populate(struct mm_struct *mm, pud_t *pud, pmd_t *pmd)
>
> set_pud(pud, __pud((pfn << _PAGE_PFN_SHIFT) | _PAGE_TABLE));
> }
> +
> +static inline void p4d_populate(struct mm_struct *mm, p4d_t *p4d, pud_t *pud)
> +{
> + if (pgtable_l4_enabled) {
> + unsigned long pfn = virt_to_pfn(pud);
> +
> + set_p4d(p4d, __p4d((pfn << _PAGE_PFN_SHIFT) | _PAGE_TABLE));
> + }
> +}
> +
> +static inline void p4d_populate_safe(struct mm_struct *mm, p4d_t *p4d,
> + pud_t *pud)
> +{
> + if (pgtable_l4_enabled) {
> + unsigned long pfn = virt_to_pfn(pud);
> +
> + set_p4d_safe(p4d,
> + __p4d((pfn << _PAGE_PFN_SHIFT) | _PAGE_TABLE));
> + }
> +}
> +
> +#define pud_alloc_one pud_alloc_one
> +static inline pud_t *pud_alloc_one(struct mm_struct *mm, unsigned long addr)
> +{
> + if (pgtable_l4_enabled)
> + return __pud_alloc_one(mm, addr);
> +
> + return NULL;
> +}
> +
> +#define pud_free pud_free
> +static inline void pud_free(struct mm_struct *mm, pud_t *pud)
> +{
> + if (pgtable_l4_enabled)
> + __pud_free(mm, pud);
> +}
> +
> +#define __pud_free_tlb(tlb, pud, addr) pud_free((tlb)->mm, pud)
> #endif /* __PAGETABLE_PMD_FOLDED */
>
> static inline pgd_t *pgd_alloc(struct mm_struct *mm)
> diff --git a/arch/riscv/include/asm/pgtable-64.h b/arch/riscv/include/asm/pgtable-64.h
> index 228261aa9628..bbbdd66e5e2f 100644
> --- a/arch/riscv/include/asm/pgtable-64.h
> +++ b/arch/riscv/include/asm/pgtable-64.h
> @@ -8,16 +8,36 @@
>
> #include <linux/const.h>
>
> -#define PGDIR_SHIFT 30
> +extern bool pgtable_l4_enabled;
> +
> +#define PGDIR_SHIFT_L3 30
> +#define PGDIR_SHIFT_L4 39
> +#define PGDIR_SIZE_L3 (_AC(1, UL) << PGDIR_SHIFT_L3)
> +
> +#define PGDIR_SHIFT (pgtable_l4_enabled ? PGDIR_SHIFT_L4 : PGDIR_SHIFT_L3)
> /* Size of region mapped by a page global directory */
> #define PGDIR_SIZE (_AC(1, UL) << PGDIR_SHIFT)
> #define PGDIR_MASK (~(PGDIR_SIZE - 1))
>
> +/* pud is folded into pgd in case of 3-level page table */
> +#define PUD_SHIFT 30
> +#define PUD_SIZE (_AC(1, UL) << PUD_SHIFT)
> +#define PUD_MASK (~(PUD_SIZE - 1))
> +
> #define PMD_SHIFT 21
> /* Size of region mapped by a page middle directory */
> #define PMD_SIZE (_AC(1, UL) << PMD_SHIFT)
> #define PMD_MASK (~(PMD_SIZE - 1))
>
> +/* Page Upper Directory entry */
> +typedef struct {
> + unsigned long pud;
> +} pud_t;
> +
> +#define pud_val(x) ((x).pud)
> +#define __pud(x) ((pud_t) { (x) })
> +#define PTRS_PER_PUD (PAGE_SIZE / sizeof(pud_t))
> +
> /* Page Middle Directory entry */
> typedef struct {
> unsigned long pmd;
> @@ -59,6 +79,16 @@ static inline void pud_clear(pud_t *pudp)
> set_pud(pudp, __pud(0));
> }
>
> +static inline pud_t pfn_pud(unsigned long pfn, pgprot_t prot)
> +{
> + return __pud((pfn << _PAGE_PFN_SHIFT) | pgprot_val(prot));
> +}
> +
> +static inline unsigned long _pud_pfn(pud_t pud)
> +{
> + return pud_val(pud) >> _PAGE_PFN_SHIFT;
> +}
> +
> static inline pmd_t *pud_pgtable(pud_t pud)
> {
> return (pmd_t *)pfn_to_virt(pud_val(pud) >> _PAGE_PFN_SHIFT);
> @@ -69,6 +99,17 @@ static inline struct page *pud_page(pud_t pud)
> return pfn_to_page(pud_val(pud) >> _PAGE_PFN_SHIFT);
> }
>
> +#define mm_pud_folded mm_pud_folded
> +static inline bool mm_pud_folded(struct mm_struct *mm)
> +{
> + if (pgtable_l4_enabled)
> + return false;
> +
> + return true;
> +}
> +
> +#define pmd_index(addr) (((addr) >> PMD_SHIFT) & (PTRS_PER_PMD - 1))
> +
> static inline pmd_t pfn_pmd(unsigned long pfn, pgprot_t prot)
> {
> return __pmd((pfn << _PAGE_PFN_SHIFT) | pgprot_val(prot));
> @@ -84,4 +125,69 @@ static inline unsigned long _pmd_pfn(pmd_t pmd)
> #define pmd_ERROR(e) \
> pr_err("%s:%d: bad pmd %016lx.\n", __FILE__, __LINE__, pmd_val(e))
>
> +#define pud_ERROR(e) \
> + pr_err("%s:%d: bad pud %016lx.\n", __FILE__, __LINE__, pud_val(e))
> +
> +static inline void set_p4d(p4d_t *p4dp, p4d_t p4d)
> +{
> + if (pgtable_l4_enabled)
> + *p4dp = p4d;
> + else
> + set_pud((pud_t *)p4dp, (pud_t){ p4d_val(p4d) });
> +}
> +
> +static inline int p4d_none(p4d_t p4d)
> +{
> + if (pgtable_l4_enabled)
> + return (p4d_val(p4d) == 0);
> +
> + return 0;
> +}
> +
> +static inline int p4d_present(p4d_t p4d)
> +{
> + if (pgtable_l4_enabled)
> + return (p4d_val(p4d) & _PAGE_PRESENT);
> +
> + return 1;
> +}
> +
> +static inline int p4d_bad(p4d_t p4d)
> +{
> + if (pgtable_l4_enabled)
> + return !p4d_present(p4d);
> +
> + return 0;
> +}
> +
> +static inline void p4d_clear(p4d_t *p4d)
> +{
> + if (pgtable_l4_enabled)
> + set_p4d(p4d, __p4d(0));
> +}
> +
> +static inline pud_t *p4d_pgtable(p4d_t p4d)
> +{
> + if (pgtable_l4_enabled)
> + return (pud_t *)pfn_to_virt(p4d_val(p4d) >> _PAGE_PFN_SHIFT);
> +
> + return (pud_t *)pud_pgtable((pud_t) { p4d_val(p4d) });
> +}
> +
> +static inline struct page *p4d_page(p4d_t p4d)
> +{
> + return pfn_to_page(p4d_val(p4d) >> _PAGE_PFN_SHIFT);
> +}
> +
> +#define pud_index(addr) (((addr) >> PUD_SHIFT) & (PTRS_PER_PUD - 1))
> +
> +#define pud_offset pud_offset
> +static inline pud_t *pud_offset(p4d_t *p4d, unsigned long address)
> +{
> + if (pgtable_l4_enabled)
> + return p4d_pgtable(*p4d) + pud_index(address);
> +
> + return (pud_t *)p4d;
> +}
> +
> #endif /* _ASM_RISCV_PGTABLE_64_H */
> diff --git a/arch/riscv/include/asm/pgtable.h b/arch/riscv/include/asm/pgtable.h
> index e1a52e22ad7e..e1c74ef4ead2 100644
> --- a/arch/riscv/include/asm/pgtable.h
> +++ b/arch/riscv/include/asm/pgtable.h
> @@ -51,7 +51,7 @@
> * position vmemmap directly below the VMALLOC region.
> */
> #ifdef CONFIG_64BIT
> -#define VA_BITS 39
> +#define VA_BITS (pgtable_l4_enabled ? 48 : 39)
> #else
> #define VA_BITS 32
> #endif
> @@ -90,8 +90,7 @@
>
> #ifndef __ASSEMBLY__
>
> -/* Page Upper Directory not used in RISC-V */
> -#include <asm-generic/pgtable-nopud.h>
> +#include <asm-generic/pgtable-nop4d.h>
> #include <asm/page.h>
> #include <asm/tlbflush.h>
> #include <linux/mm_types.h>
> @@ -113,6 +112,17 @@
> #define XIP_FIXUP(addr) (addr)
> #endif /* CONFIG_XIP_KERNEL */
>
> +struct pt_alloc_ops {
> + pte_t *(*get_pte_virt)(phys_addr_t pa);
> + phys_addr_t (*alloc_pte)(uintptr_t va);
> +#ifndef __PAGETABLE_PMD_FOLDED
> + pmd_t *(*get_pmd_virt)(phys_addr_t pa);
> + phys_addr_t (*alloc_pmd)(uintptr_t va);
> + pud_t *(*get_pud_virt)(phys_addr_t pa);
> + phys_addr_t (*alloc_pud)(uintptr_t va);
> +#endif
> +};
> +
> #ifdef CONFIG_MMU
> /* Number of entries in the page global directory */
> #define PTRS_PER_PGD (PAGE_SIZE / sizeof(pgd_t))
> @@ -669,9 +679,11 @@ static inline pmd_t pmdp_establish(struct vm_area_struct *vma,
> * Note that PGDIR_SIZE must evenly divide TASK_SIZE.
> */
> #ifdef CONFIG_64BIT
> -#define TASK_SIZE (PGDIR_SIZE * PTRS_PER_PGD / 2)
> +#define TASK_SIZE (PGDIR_SIZE * PTRS_PER_PGD / 2)
> +#define TASK_SIZE_MIN (PGDIR_SIZE_L3 * PTRS_PER_PGD / 2)
> #else
> -#define TASK_SIZE FIXADDR_START
> +#define TASK_SIZE FIXADDR_START
> +#define TASK_SIZE_MIN TASK_SIZE
> #endif
>
> #else /* CONFIG_MMU */
> @@ -697,6 +709,8 @@ extern uintptr_t _dtb_early_pa;
> #define dtb_early_va _dtb_early_va
> #define dtb_early_pa _dtb_early_pa
> #endif /* CONFIG_XIP_KERNEL */
> +extern u64 satp_mode;
> +extern bool pgtable_l4_enabled;
>
> void paging_init(void);
> void misc_mem_init(void);
> diff --git a/arch/riscv/kernel/head.S b/arch/riscv/kernel/head.S
> index 52c5ff9804c5..c3c0ed559770 100644
> --- a/arch/riscv/kernel/head.S
> +++ b/arch/riscv/kernel/head.S
> @@ -95,7 +95,8 @@ relocate:
>
> /* Compute satp for kernel page tables, but don't load it yet */
> srl a2, a0, PAGE_SHIFT
> - li a1, SATP_MODE
> + la a1, satp_mode
> + REG_L a1, 0(a1)
> or a2, a2, a1
>
> /*
> diff --git a/arch/riscv/mm/context.c b/arch/riscv/mm/context.c
> index ee3459cb6750..a7246872bd30 100644
> --- a/arch/riscv/mm/context.c
> +++ b/arch/riscv/mm/context.c
> @@ -192,7 +192,7 @@ static void set_mm_asid(struct mm_struct *mm, unsigned int cpu)
> switch_mm_fast:
> csr_write(CSR_SATP, virt_to_pfn(mm->pgd) |
> ((cntx & asid_mask) << SATP_ASID_SHIFT) |
> - SATP_MODE);
> + satp_mode);
>
> if (need_flush_tlb)
> local_flush_tlb_all();
> @@ -201,7 +201,7 @@ static void set_mm_asid(struct mm_struct *mm, unsigned int cpu)
> static void set_mm_noasid(struct mm_struct *mm)
> {
> /* Switch the page table and blindly nuke entire local TLB */
> - csr_write(CSR_SATP, virt_to_pfn(mm->pgd) | SATP_MODE);
> + csr_write(CSR_SATP, virt_to_pfn(mm->pgd) | satp_mode);
> local_flush_tlb_all();
> }
>
> diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c
> index 1552226fb6bd..6a19a1b1caf8 100644
> --- a/arch/riscv/mm/init.c
> +++ b/arch/riscv/mm/init.c
> @@ -37,6 +37,17 @@ EXPORT_SYMBOL(kernel_map);
> #define kernel_map (*(struct kernel_mapping *)XIP_FIXUP(&kernel_map))
> #endif
>
> +#ifdef CONFIG_64BIT
> +u64 satp_mode = !IS_ENABLED(CONFIG_XIP_KERNEL) ? SATP_MODE_48 : SATP_MODE_39;
> +#else
> +u64 satp_mode = SATP_MODE_32;
> +#endif
> +EXPORT_SYMBOL(satp_mode);
> +
> +bool pgtable_l4_enabled = IS_ENABLED(CONFIG_64BIT) && !IS_ENABLED(CONFIG_XIP_KERNEL) ?
> + true : false;

Hi Alex,

I'm not sure whether we can use a static key for pgtable_l4_enabled.
Obviously, for a given HW platform, pgtable_l4_enabled won't change after
boot, and it sits in some hot code paths, so IMHO a static key may be
suitable for it.

Thanks


2021-12-29 03:42:49

by Guo Ren

Subject: Re: [PATCH v3 07/13] riscv: Implement sv48 support

On Tue, Dec 7, 2021 at 11:54 AM Alexandre Ghiti
<[email protected]> wrote:
>
> By adding a new 4th level of page table, give the possibility to 64bit
> kernel to address 2^48 bytes of virtual address: in practice, that offers
> 128TB of virtual address space to userspace and allows up to 64TB of
> physical memory.
>
> If the underlying hardware does not support sv48, we will automatically
> fallback to a standard 3-level page table by folding the new PUD level into
> PGDIR level. In order to detect HW capabilities at runtime, we
> use SATP feature that ignores writes with an unsupported mode.
>
> Signed-off-by: Alexandre Ghiti <[email protected]>
> ---
> arch/riscv/Kconfig | 4 +-
> arch/riscv/include/asm/csr.h | 3 +-
> arch/riscv/include/asm/fixmap.h | 1 +
> arch/riscv/include/asm/kasan.h | 6 +-
> arch/riscv/include/asm/page.h | 14 ++
> arch/riscv/include/asm/pgalloc.h | 40 +++++
> arch/riscv/include/asm/pgtable-64.h | 108 +++++++++++-
> arch/riscv/include/asm/pgtable.h | 24 ++-
> arch/riscv/kernel/head.S | 3 +-
> arch/riscv/mm/context.c | 4 +-
> arch/riscv/mm/init.c | 212 +++++++++++++++++++++---
> arch/riscv/mm/kasan_init.c | 137 ++++++++++++++-
> drivers/firmware/efi/libstub/efi-stub.c | 2 +
> 13 files changed, 514 insertions(+), 44 deletions(-)
>
> diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
> index ac6c0cd9bc29..d28fe0148e13 100644
> --- a/arch/riscv/Kconfig
> +++ b/arch/riscv/Kconfig
> @@ -150,7 +150,7 @@ config PAGE_OFFSET
> hex
> default 0xC0000000 if 32BIT
> default 0x80000000 if 64BIT && !MMU
> - default 0xffffffd800000000 if 64BIT
> + default 0xffffaf8000000000 if 64BIT
>
> config KASAN_SHADOW_OFFSET
> hex
> @@ -201,7 +201,7 @@ config FIX_EARLYCON_MEM
>
> config PGTABLE_LEVELS
> int
> - default 3 if 64BIT
> + default 4 if 64BIT
> default 2
>
> config LOCKDEP_SUPPORT
> diff --git a/arch/riscv/include/asm/csr.h b/arch/riscv/include/asm/csr.h
> index 87ac65696871..3fdb971c7896 100644
> --- a/arch/riscv/include/asm/csr.h
> +++ b/arch/riscv/include/asm/csr.h
> @@ -40,14 +40,13 @@
> #ifndef CONFIG_64BIT
> #define SATP_PPN _AC(0x003FFFFF, UL)
> #define SATP_MODE_32 _AC(0x80000000, UL)
> -#define SATP_MODE SATP_MODE_32
> #define SATP_ASID_BITS 9
> #define SATP_ASID_SHIFT 22
> #define SATP_ASID_MASK _AC(0x1FF, UL)
> #else
> #define SATP_PPN _AC(0x00000FFFFFFFFFFF, UL)
> #define SATP_MODE_39 _AC(0x8000000000000000, UL)
> -#define SATP_MODE SATP_MODE_39
> +#define SATP_MODE_48 _AC(0x9000000000000000, UL)
> #define SATP_ASID_BITS 16
> #define SATP_ASID_SHIFT 44
> #define SATP_ASID_MASK _AC(0xFFFF, UL)
> diff --git a/arch/riscv/include/asm/fixmap.h b/arch/riscv/include/asm/fixmap.h
> index 54cbf07fb4e9..58a718573ad6 100644
> --- a/arch/riscv/include/asm/fixmap.h
> +++ b/arch/riscv/include/asm/fixmap.h
> @@ -24,6 +24,7 @@ enum fixed_addresses {
> FIX_HOLE,
> FIX_PTE,
> FIX_PMD,
> + FIX_PUD,
> FIX_TEXT_POKE1,
> FIX_TEXT_POKE0,
> FIX_EARLYCON_MEM_BASE,
> diff --git a/arch/riscv/include/asm/kasan.h b/arch/riscv/include/asm/kasan.h
> index 743e6ff57996..0b85e363e778 100644
> --- a/arch/riscv/include/asm/kasan.h
> +++ b/arch/riscv/include/asm/kasan.h
> @@ -28,7 +28,11 @@
> #define KASAN_SHADOW_SCALE_SHIFT 3
>
> #define KASAN_SHADOW_SIZE (UL(1) << ((VA_BITS - 1) - KASAN_SHADOW_SCALE_SHIFT))
> -#define KASAN_SHADOW_START (KASAN_SHADOW_END - KASAN_SHADOW_SIZE)
> +/*
> + * Depending on the size of the virtual address space, the region may not be
> + * aligned on PGDIR_SIZE, so force its alignment to ease its population.
> + */
> +#define KASAN_SHADOW_START ((KASAN_SHADOW_END - KASAN_SHADOW_SIZE) & PGDIR_MASK)
> #define KASAN_SHADOW_END MODULES_LOWEST_VADDR
> #define KASAN_SHADOW_OFFSET _AC(CONFIG_KASAN_SHADOW_OFFSET, UL)
>
> diff --git a/arch/riscv/include/asm/page.h b/arch/riscv/include/asm/page.h
> index e03559f9b35e..d089fe46f7d8 100644
> --- a/arch/riscv/include/asm/page.h
> +++ b/arch/riscv/include/asm/page.h
> @@ -31,7 +31,20 @@
> * When not using MMU this corresponds to the first free page in
> * physical memory (aligned on a page boundary).
> */
> +#ifdef CONFIG_64BIT
> +#ifdef CONFIG_MMU
> +#define PAGE_OFFSET kernel_map.page_offset
> +#else
> +#define PAGE_OFFSET _AC(CONFIG_PAGE_OFFSET, UL)
> +#endif
> +/*
> + * By default, CONFIG_PAGE_OFFSET value corresponds to SV48 address space so
> + * define the PAGE_OFFSET value for SV39.
> + */
> +#define PAGE_OFFSET_L3 _AC(0xffffffd800000000, UL)
> +#else
> #define PAGE_OFFSET _AC(CONFIG_PAGE_OFFSET, UL)
> +#endif /* CONFIG_64BIT */
>
> /*
> * Half of the kernel address space (half of the entries of the page global
> @@ -90,6 +103,7 @@ extern unsigned long riscv_pfn_base;
> #endif /* CONFIG_MMU */
>
> struct kernel_mapping {
> + unsigned long page_offset;
> unsigned long virt_addr;
> uintptr_t phys_addr;
> uintptr_t size;
> diff --git a/arch/riscv/include/asm/pgalloc.h b/arch/riscv/include/asm/pgalloc.h
> index 0af6933a7100..11823004b87a 100644
> --- a/arch/riscv/include/asm/pgalloc.h
> +++ b/arch/riscv/include/asm/pgalloc.h
> @@ -11,6 +11,8 @@
> #include <asm/tlb.h>
>
> #ifdef CONFIG_MMU
> +#define __HAVE_ARCH_PUD_ALLOC_ONE
> +#define __HAVE_ARCH_PUD_FREE
> #include <asm-generic/pgalloc.h>
>
> static inline void pmd_populate_kernel(struct mm_struct *mm,
> @@ -36,6 +38,44 @@ static inline void pud_populate(struct mm_struct *mm, pud_t *pud, pmd_t *pmd)
>
> set_pud(pud, __pud((pfn << _PAGE_PFN_SHIFT) | _PAGE_TABLE));
> }
> +
> +static inline void p4d_populate(struct mm_struct *mm, p4d_t *p4d, pud_t *pud)
> +{
> + if (pgtable_l4_enabled) {
> + unsigned long pfn = virt_to_pfn(pud);
> +
> + set_p4d(p4d, __p4d((pfn << _PAGE_PFN_SHIFT) | _PAGE_TABLE));
> + }
> +}
> +
> +static inline void p4d_populate_safe(struct mm_struct *mm, p4d_t *p4d,
> + pud_t *pud)
> +{
> + if (pgtable_l4_enabled) {
> + unsigned long pfn = virt_to_pfn(pud);
> +
> + set_p4d_safe(p4d,
> + __p4d((pfn << _PAGE_PFN_SHIFT) | _PAGE_TABLE));
> + }
> +}
> +
> +#define pud_alloc_one pud_alloc_one
> +static inline pud_t *pud_alloc_one(struct mm_struct *mm, unsigned long addr)
> +{
> + if (pgtable_l4_enabled)
> + return __pud_alloc_one(mm, addr);
> +
> + return NULL;
> +}
> +
> +#define pud_free pud_free
> +static inline void pud_free(struct mm_struct *mm, pud_t *pud)
> +{
> + if (pgtable_l4_enabled)
> + __pud_free(mm, pud);
> +}
> +
> +#define __pud_free_tlb(tlb, pud, addr) pud_free((tlb)->mm, pud)
> #endif /* __PAGETABLE_PMD_FOLDED */
>
> static inline pgd_t *pgd_alloc(struct mm_struct *mm)
> diff --git a/arch/riscv/include/asm/pgtable-64.h b/arch/riscv/include/asm/pgtable-64.h
> index 228261aa9628..bbbdd66e5e2f 100644
> --- a/arch/riscv/include/asm/pgtable-64.h
> +++ b/arch/riscv/include/asm/pgtable-64.h
> @@ -8,16 +8,36 @@
>
> #include <linux/const.h>
>
> -#define PGDIR_SHIFT 30
> +extern bool pgtable_l4_enabled;
> +
> +#define PGDIR_SHIFT_L3 30
> +#define PGDIR_SHIFT_L4 39
> +#define PGDIR_SIZE_L3 (_AC(1, UL) << PGDIR_SHIFT_L3)
> +
> +#define PGDIR_SHIFT (pgtable_l4_enabled ? PGDIR_SHIFT_L4 : PGDIR_SHIFT_L3)
> /* Size of region mapped by a page global directory */
> #define PGDIR_SIZE (_AC(1, UL) << PGDIR_SHIFT)
> #define PGDIR_MASK (~(PGDIR_SIZE - 1))
>
> +/* pud is folded into pgd in case of 3-level page table */
> +#define PUD_SHIFT 30
> +#define PUD_SIZE (_AC(1, UL) << PUD_SHIFT)
> +#define PUD_MASK (~(PUD_SIZE - 1))
> +
> #define PMD_SHIFT 21
> /* Size of region mapped by a page middle directory */
> #define PMD_SIZE (_AC(1, UL) << PMD_SHIFT)
> #define PMD_MASK (~(PMD_SIZE - 1))
>
> +/* Page Upper Directory entry */
> +typedef struct {
> + unsigned long pud;
> +} pud_t;
> +
> +#define pud_val(x) ((x).pud)
> +#define __pud(x) ((pud_t) { (x) })
> +#define PTRS_PER_PUD (PAGE_SIZE / sizeof(pud_t))
> +
> /* Page Middle Directory entry */
> typedef struct {
> unsigned long pmd;
> @@ -59,6 +79,16 @@ static inline void pud_clear(pud_t *pudp)
> set_pud(pudp, __pud(0));
> }
>
> +static inline pud_t pfn_pud(unsigned long pfn, pgprot_t prot)
> +{
> + return __pud((pfn << _PAGE_PFN_SHIFT) | pgprot_val(prot));
> +}
> +
> +static inline unsigned long _pud_pfn(pud_t pud)
> +{
> + return pud_val(pud) >> _PAGE_PFN_SHIFT;
> +}
> +
> static inline pmd_t *pud_pgtable(pud_t pud)
> {
> return (pmd_t *)pfn_to_virt(pud_val(pud) >> _PAGE_PFN_SHIFT);
> @@ -69,6 +99,17 @@ static inline struct page *pud_page(pud_t pud)
> return pfn_to_page(pud_val(pud) >> _PAGE_PFN_SHIFT);
> }
>
> +#define mm_pud_folded mm_pud_folded
> +static inline bool mm_pud_folded(struct mm_struct *mm)
> +{
> + if (pgtable_l4_enabled)
> + return false;
> +
> + return true;
> +}
> +
> +#define pmd_index(addr) (((addr) >> PMD_SHIFT) & (PTRS_PER_PMD - 1))
> +
> static inline pmd_t pfn_pmd(unsigned long pfn, pgprot_t prot)
> {
> return __pmd((pfn << _PAGE_PFN_SHIFT) | pgprot_val(prot));
> @@ -84,4 +125,69 @@ static inline unsigned long _pmd_pfn(pmd_t pmd)
> #define pmd_ERROR(e) \
> pr_err("%s:%d: bad pmd %016lx.\n", __FILE__, __LINE__, pmd_val(e))
>
> +#define pud_ERROR(e) \
> + pr_err("%s:%d: bad pud %016lx.\n", __FILE__, __LINE__, pud_val(e))
> +
> +static inline void set_p4d(p4d_t *p4dp, p4d_t p4d)
> +{
> + if (pgtable_l4_enabled)
> + *p4dp = p4d;
> + else
> + set_pud((pud_t *)p4dp, (pud_t){ p4d_val(p4d) });
> +}
> +
> +static inline int p4d_none(p4d_t p4d)
> +{
> + if (pgtable_l4_enabled)
> + return (p4d_val(p4d) == 0);
> +
> + return 0;
> +}
> +
> +static inline int p4d_present(p4d_t p4d)
> +{
> + if (pgtable_l4_enabled)
> + return (p4d_val(p4d) & _PAGE_PRESENT);
> +
> + return 1;
> +}
> +
> +static inline int p4d_bad(p4d_t p4d)
> +{
> + if (pgtable_l4_enabled)
> + return !p4d_present(p4d);
> +
> + return 0;
> +}
> +
> +static inline void p4d_clear(p4d_t *p4d)
> +{
> + if (pgtable_l4_enabled)
> + set_p4d(p4d, __p4d(0));
> +}
> +
> +static inline pud_t *p4d_pgtable(p4d_t p4d)
> +{
> + if (pgtable_l4_enabled)
> + return (pud_t *)pfn_to_virt(p4d_val(p4d) >> _PAGE_PFN_SHIFT);
> +
> + return (pud_t *)pud_pgtable((pud_t) { p4d_val(p4d) });
> +}
> +
> +static inline struct page *p4d_page(p4d_t p4d)
> +{
> + return pfn_to_page(p4d_val(p4d) >> _PAGE_PFN_SHIFT);
> +}
> +
> +#define pud_index(addr) (((addr) >> PUD_SHIFT) & (PTRS_PER_PUD - 1))
> +
> +#define pud_offset pud_offset
> +static inline pud_t *pud_offset(p4d_t *p4d, unsigned long address)
> +{
> + if (pgtable_l4_enabled)
> + return p4d_pgtable(*p4d) + pud_index(address);
> +
> + return (pud_t *)p4d;
> +}
> +
> #endif /* _ASM_RISCV_PGTABLE_64_H */
> diff --git a/arch/riscv/include/asm/pgtable.h b/arch/riscv/include/asm/pgtable.h
> index e1a52e22ad7e..e1c74ef4ead2 100644
> --- a/arch/riscv/include/asm/pgtable.h
> +++ b/arch/riscv/include/asm/pgtable.h
> @@ -51,7 +51,7 @@
> * position vmemmap directly below the VMALLOC region.
> */
> #ifdef CONFIG_64BIT
> -#define VA_BITS 39
> +#define VA_BITS (pgtable_l4_enabled ? 48 : 39)
> #else
> #define VA_BITS 32
> #endif
> @@ -90,8 +90,7 @@
>
> #ifndef __ASSEMBLY__
>
> -/* Page Upper Directory not used in RISC-V */
> -#include <asm-generic/pgtable-nopud.h>
> +#include <asm-generic/pgtable-nop4d.h>
> #include <asm/page.h>
> #include <asm/tlbflush.h>
> #include <linux/mm_types.h>
> @@ -113,6 +112,17 @@
> #define XIP_FIXUP(addr) (addr)
> #endif /* CONFIG_XIP_KERNEL */
>
> +struct pt_alloc_ops {
> + pte_t *(*get_pte_virt)(phys_addr_t pa);
> + phys_addr_t (*alloc_pte)(uintptr_t va);
> +#ifndef __PAGETABLE_PMD_FOLDED
> + pmd_t *(*get_pmd_virt)(phys_addr_t pa);
> + phys_addr_t (*alloc_pmd)(uintptr_t va);
> + pud_t *(*get_pud_virt)(phys_addr_t pa);
> + phys_addr_t (*alloc_pud)(uintptr_t va);
> +#endif
> +};
> +
> #ifdef CONFIG_MMU
> /* Number of entries in the page global directory */
> #define PTRS_PER_PGD (PAGE_SIZE / sizeof(pgd_t))
> @@ -669,9 +679,11 @@ static inline pmd_t pmdp_establish(struct vm_area_struct *vma,
> * Note that PGDIR_SIZE must evenly divide TASK_SIZE.
> */
> #ifdef CONFIG_64BIT
> -#define TASK_SIZE (PGDIR_SIZE * PTRS_PER_PGD / 2)
> +#define TASK_SIZE (PGDIR_SIZE * PTRS_PER_PGD / 2)
> +#define TASK_SIZE_MIN (PGDIR_SIZE_L3 * PTRS_PER_PGD / 2)
> #else
> -#define TASK_SIZE FIXADDR_START
> +#define TASK_SIZE FIXADDR_START
> +#define TASK_SIZE_MIN TASK_SIZE
This is used by efi-stub.c; the rv64 compat patch also needs it, where we
reuse the DEFAULT_MAP_WINDOW_64 macro.

TASK_SIZE_MIN is also okay for me, but I think it should be a separate
patch together with the efi-stub modification.
https://lore.kernel.org/linux-riscv/[email protected]/

I've merged your patchset with the compat tree and we are testing them
together thoroughly and carefully.
https://github.com/c-sky/csky-linux/tree/riscv_compat_v2_sv48_v3

Now, rv32_rootfs & rv64_rootfs booting has passed, but I will give you my
Tested-by later, once everything has been fully tested. Your patch set is
very helpful, thanks.

ps: Could you give customers the chance to choose sv48 or sv39 in the dts?
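Regarding selecting the mode in the dts: the devicetree binding for RISC-V CPUs already defines a per-hart `mmu-type` property, which would be the natural hook for such a choice (a sketch; the surrounding node contents are illustrative only):

```dts
cpu@0 {
        device_type = "cpu";
        compatible = "riscv";
        riscv,isa = "rv64imafdc";
        /* "riscv,sv39" or "riscv,sv48" per the riscv cpus binding */
        mmu-type = "riscv,sv48";
};
```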


> #endif
>
> #else /* CONFIG_MMU */
> @@ -697,6 +709,8 @@ extern uintptr_t _dtb_early_pa;
> #define dtb_early_va _dtb_early_va
> #define dtb_early_pa _dtb_early_pa
> #endif /* CONFIG_XIP_KERNEL */
> +extern u64 satp_mode;
> +extern bool pgtable_l4_enabled;
>
> void paging_init(void);
> void misc_mem_init(void);
> diff --git a/arch/riscv/kernel/head.S b/arch/riscv/kernel/head.S
> index 52c5ff9804c5..c3c0ed559770 100644
> --- a/arch/riscv/kernel/head.S
> +++ b/arch/riscv/kernel/head.S
> @@ -95,7 +95,8 @@ relocate:
>
> /* Compute satp for kernel page tables, but don't load it yet */
> srl a2, a0, PAGE_SHIFT
> - li a1, SATP_MODE
> + la a1, satp_mode
> + REG_L a1, 0(a1)
> or a2, a2, a1
>
> /*
> diff --git a/arch/riscv/mm/context.c b/arch/riscv/mm/context.c
> index ee3459cb6750..a7246872bd30 100644
> --- a/arch/riscv/mm/context.c
> +++ b/arch/riscv/mm/context.c
> @@ -192,7 +192,7 @@ static void set_mm_asid(struct mm_struct *mm, unsigned int cpu)
> switch_mm_fast:
> csr_write(CSR_SATP, virt_to_pfn(mm->pgd) |
> ((cntx & asid_mask) << SATP_ASID_SHIFT) |
> - SATP_MODE);
> + satp_mode);
>
> if (need_flush_tlb)
> local_flush_tlb_all();
> @@ -201,7 +201,7 @@ static void set_mm_asid(struct mm_struct *mm, unsigned int cpu)
> static void set_mm_noasid(struct mm_struct *mm)
> {
> /* Switch the page table and blindly nuke entire local TLB */
> - csr_write(CSR_SATP, virt_to_pfn(mm->pgd) | SATP_MODE);
> + csr_write(CSR_SATP, virt_to_pfn(mm->pgd) | satp_mode);
> local_flush_tlb_all();
> }
>
> diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c
> index 1552226fb6bd..6a19a1b1caf8 100644
> --- a/arch/riscv/mm/init.c
> +++ b/arch/riscv/mm/init.c
> @@ -37,6 +37,17 @@ EXPORT_SYMBOL(kernel_map);
> #define kernel_map (*(struct kernel_mapping *)XIP_FIXUP(&kernel_map))
> #endif
>
> +#ifdef CONFIG_64BIT
> +u64 satp_mode = !IS_ENABLED(CONFIG_XIP_KERNEL) ? SATP_MODE_48 : SATP_MODE_39;
> +#else
> +u64 satp_mode = SATP_MODE_32;
> +#endif
> +EXPORT_SYMBOL(satp_mode);
> +
> +bool pgtable_l4_enabled = IS_ENABLED(CONFIG_64BIT) && !IS_ENABLED(CONFIG_XIP_KERNEL) ?
> + true : false;
> +EXPORT_SYMBOL(pgtable_l4_enabled);
> +
> phys_addr_t phys_ram_base __ro_after_init;
> EXPORT_SYMBOL(phys_ram_base);
>
> @@ -53,15 +64,6 @@ extern char _start[];
> void *_dtb_early_va __initdata;
> uintptr_t _dtb_early_pa __initdata;
>
> -struct pt_alloc_ops {
> - pte_t *(*get_pte_virt)(phys_addr_t pa);
> - phys_addr_t (*alloc_pte)(uintptr_t va);
> -#ifndef __PAGETABLE_PMD_FOLDED
> - pmd_t *(*get_pmd_virt)(phys_addr_t pa);
> - phys_addr_t (*alloc_pmd)(uintptr_t va);
> -#endif
> -};
> -
> static phys_addr_t dma32_phys_limit __initdata;
>
> static void __init zone_sizes_init(void)
> @@ -222,7 +224,7 @@ static void __init setup_bootmem(void)
> }
>
> #ifdef CONFIG_MMU
> -static struct pt_alloc_ops _pt_ops __initdata;
> +struct pt_alloc_ops _pt_ops __initdata;
>
> #ifdef CONFIG_XIP_KERNEL
> #define pt_ops (*(struct pt_alloc_ops *)XIP_FIXUP(&_pt_ops))
> @@ -238,6 +240,7 @@ pgd_t trampoline_pg_dir[PTRS_PER_PGD] __page_aligned_bss;
> static pte_t fixmap_pte[PTRS_PER_PTE] __page_aligned_bss;
>
> pgd_t early_pg_dir[PTRS_PER_PGD] __initdata __aligned(PAGE_SIZE);
> +static pud_t __maybe_unused early_dtb_pud[PTRS_PER_PUD] __initdata __aligned(PAGE_SIZE);
> static pmd_t __maybe_unused early_dtb_pmd[PTRS_PER_PMD] __initdata __aligned(PAGE_SIZE);
>
> #ifdef CONFIG_XIP_KERNEL
> @@ -326,6 +329,16 @@ static pmd_t early_pmd[PTRS_PER_PMD] __initdata __aligned(PAGE_SIZE);
> #define early_pmd ((pmd_t *)XIP_FIXUP(early_pmd))
> #endif /* CONFIG_XIP_KERNEL */
>
> +static pud_t trampoline_pud[PTRS_PER_PUD] __page_aligned_bss;
> +static pud_t fixmap_pud[PTRS_PER_PUD] __page_aligned_bss;
> +static pud_t early_pud[PTRS_PER_PUD] __initdata __aligned(PAGE_SIZE);
> +
> +#ifdef CONFIG_XIP_KERNEL
> +#define trampoline_pud ((pud_t *)XIP_FIXUP(trampoline_pud))
> +#define fixmap_pud ((pud_t *)XIP_FIXUP(fixmap_pud))
> +#define early_pud ((pud_t *)XIP_FIXUP(early_pud))
> +#endif /* CONFIG_XIP_KERNEL */
> +
> static pmd_t *__init get_pmd_virt_early(phys_addr_t pa)
> {
> /* Before MMU is enabled */
> @@ -345,7 +358,7 @@ static pmd_t *__init get_pmd_virt_late(phys_addr_t pa)
>
> static phys_addr_t __init alloc_pmd_early(uintptr_t va)
> {
> - BUG_ON((va - kernel_map.virt_addr) >> PGDIR_SHIFT);
> + BUG_ON((va - kernel_map.virt_addr) >> PUD_SHIFT);
>
> return (uintptr_t)early_pmd;
> }
> @@ -391,21 +404,97 @@ static void __init create_pmd_mapping(pmd_t *pmdp,
> create_pte_mapping(ptep, va, pa, sz, prot);
> }
>
> -#define pgd_next_t pmd_t
> -#define alloc_pgd_next(__va) pt_ops.alloc_pmd(__va)
> -#define get_pgd_next_virt(__pa) pt_ops.get_pmd_virt(__pa)
> +static pud_t *__init get_pud_virt_early(phys_addr_t pa)
> +{
> + return (pud_t *)((uintptr_t)pa);
> +}
> +
> +static pud_t *__init get_pud_virt_fixmap(phys_addr_t pa)
> +{
> + clear_fixmap(FIX_PUD);
> + return (pud_t *)set_fixmap_offset(FIX_PUD, pa);
> +}
> +
> +static pud_t *__init get_pud_virt_late(phys_addr_t pa)
> +{
> + return (pud_t *)__va(pa);
> +}
> +
> +static phys_addr_t __init alloc_pud_early(uintptr_t va)
> +{
> + /* Only one PUD is available for early mapping */
> + BUG_ON((va - kernel_map.virt_addr) >> PGDIR_SHIFT);
> +
> + return (uintptr_t)early_pud;
> +}
> +
> +static phys_addr_t __init alloc_pud_fixmap(uintptr_t va)
> +{
> + return memblock_phys_alloc(PAGE_SIZE, PAGE_SIZE);
> +}
> +
> +static phys_addr_t alloc_pud_late(uintptr_t va)
> +{
> + unsigned long vaddr;
> +
> + vaddr = __get_free_page(GFP_KERNEL);
> + BUG_ON(!vaddr);
> + return __pa(vaddr);
> +}
> +
> +static void __init create_pud_mapping(pud_t *pudp,
> + uintptr_t va, phys_addr_t pa,
> + phys_addr_t sz, pgprot_t prot)
> +{
> + pmd_t *nextp;
> + phys_addr_t next_phys;
> + uintptr_t pud_index = pud_index(va);
> +
> + if (sz == PUD_SIZE) {
> + if (pud_val(pudp[pud_index]) == 0)
> + pudp[pud_index] = pfn_pud(PFN_DOWN(pa), prot);
> + return;
> + }
> +
> + if (pud_val(pudp[pud_index]) == 0) {
> + next_phys = pt_ops.alloc_pmd(va);
> + pudp[pud_index] = pfn_pud(PFN_DOWN(next_phys), PAGE_TABLE);
> + nextp = pt_ops.get_pmd_virt(next_phys);
> + memset(nextp, 0, PAGE_SIZE);
> + } else {
> + next_phys = PFN_PHYS(_pud_pfn(pudp[pud_index]));
> + nextp = pt_ops.get_pmd_virt(next_phys);
> + }
> +
> + create_pmd_mapping(nextp, va, pa, sz, prot);
> +}
> +
> +#define pgd_next_t pud_t
> +#define alloc_pgd_next(__va) (pgtable_l4_enabled ? \
> + pt_ops.alloc_pud(__va) : pt_ops.alloc_pmd(__va))
> +#define get_pgd_next_virt(__pa) (pgtable_l4_enabled ? \
> + pt_ops.get_pud_virt(__pa) : (pgd_next_t *)pt_ops.get_pmd_virt(__pa))
> #define create_pgd_next_mapping(__nextp, __va, __pa, __sz, __prot) \
> - create_pmd_mapping(__nextp, __va, __pa, __sz, __prot)
> -#define fixmap_pgd_next fixmap_pmd
> + (pgtable_l4_enabled ? \
> + create_pud_mapping(__nextp, __va, __pa, __sz, __prot) : \
> + create_pmd_mapping((pmd_t *)__nextp, __va, __pa, __sz, __prot))
> +#define fixmap_pgd_next (pgtable_l4_enabled ? \
> + (uintptr_t)fixmap_pud : (uintptr_t)fixmap_pmd)
> +#define trampoline_pgd_next (pgtable_l4_enabled ? \
> + (uintptr_t)trampoline_pud : (uintptr_t)trampoline_pmd)
> +#define early_dtb_pgd_next (pgtable_l4_enabled ? \
> + (uintptr_t)early_dtb_pud : (uintptr_t)early_dtb_pmd)
> #else
> #define pgd_next_t pte_t
> #define alloc_pgd_next(__va) pt_ops.alloc_pte(__va)
> #define get_pgd_next_virt(__pa) pt_ops.get_pte_virt(__pa)
> #define create_pgd_next_mapping(__nextp, __va, __pa, __sz, __prot) \
> create_pte_mapping(__nextp, __va, __pa, __sz, __prot)
> -#define fixmap_pgd_next fixmap_pte
> +#define fixmap_pgd_next ((uintptr_t)fixmap_pte)
> +#define early_dtb_pgd_next ((uintptr_t)early_dtb_pmd)
> +#define create_pud_mapping(__pmdp, __va, __pa, __sz, __prot)
> #define create_pmd_mapping(__pmdp, __va, __pa, __sz, __prot)
> -#endif
> +#endif /* __PAGETABLE_PMD_FOLDED */
>
> void __init create_pgd_mapping(pgd_t *pgdp,
> uintptr_t va, phys_addr_t pa,
> @@ -493,6 +582,57 @@ static __init pgprot_t pgprot_from_va(uintptr_t va)
> }
> #endif /* CONFIG_STRICT_KERNEL_RWX */
>
> +#ifdef CONFIG_64BIT
> +static void __init disable_pgtable_l4(void)
> +{
> + pgtable_l4_enabled = false;
> + kernel_map.page_offset = PAGE_OFFSET_L3;
> + satp_mode = SATP_MODE_39;
> +}
> +
> +/*
> + * There is a simple way to determine if 4-level is supported by the
> + * underlying hardware: establish 1:1 mapping in 4-level page table mode
> + * then read SATP to see if the configuration was taken into account
> + * meaning sv48 is supported.
> + */
> +static __init void set_satp_mode(void)
> +{
> + u64 identity_satp, hw_satp;
> + uintptr_t set_satp_mode_pmd;
> +
> + set_satp_mode_pmd = ((unsigned long)set_satp_mode) & PMD_MASK;
> + create_pgd_mapping(early_pg_dir,
> + set_satp_mode_pmd, (uintptr_t)early_pud,
> + PGDIR_SIZE, PAGE_TABLE);
> + create_pud_mapping(early_pud,
> + set_satp_mode_pmd, (uintptr_t)early_pmd,
> + PUD_SIZE, PAGE_TABLE);
> + /* Handle the case where set_satp_mode straddles 2 PMDs */
> + create_pmd_mapping(early_pmd,
> + set_satp_mode_pmd, set_satp_mode_pmd,
> + PMD_SIZE, PAGE_KERNEL_EXEC);
> + create_pmd_mapping(early_pmd,
> + set_satp_mode_pmd + PMD_SIZE,
> + set_satp_mode_pmd + PMD_SIZE,
> + PMD_SIZE, PAGE_KERNEL_EXEC);
> +
> + identity_satp = PFN_DOWN((uintptr_t)&early_pg_dir) | satp_mode;
> +
> + local_flush_tlb_all();
> + csr_write(CSR_SATP, identity_satp);
> + hw_satp = csr_swap(CSR_SATP, 0ULL);
> + local_flush_tlb_all();
> +
> + if (hw_satp != identity_satp)
> + disable_pgtable_l4();
> +
> + memset(early_pg_dir, 0, PAGE_SIZE);
> + memset(early_pud, 0, PAGE_SIZE);
> + memset(early_pmd, 0, PAGE_SIZE);
> +}
> +#endif
> +
> /*
> * setup_vm() is called from head.S with MMU-off.
> *
> @@ -557,10 +697,15 @@ static void __init create_fdt_early_page_table(pgd_t *pgdir, uintptr_t dtb_pa)
> uintptr_t pa = dtb_pa & ~(PMD_SIZE - 1);
>
> create_pgd_mapping(early_pg_dir, DTB_EARLY_BASE_VA,
> - IS_ENABLED(CONFIG_64BIT) ? (uintptr_t)early_dtb_pmd : pa,
> + IS_ENABLED(CONFIG_64BIT) ? early_dtb_pgd_next : pa,
> PGDIR_SIZE,
> IS_ENABLED(CONFIG_64BIT) ? PAGE_TABLE : PAGE_KERNEL);
>
> + if (pgtable_l4_enabled) {
> + create_pud_mapping(early_dtb_pud, DTB_EARLY_BASE_VA,
> + (uintptr_t)early_dtb_pmd, PUD_SIZE, PAGE_TABLE);
> + }
> +
> if (IS_ENABLED(CONFIG_64BIT)) {
> create_pmd_mapping(early_dtb_pmd, DTB_EARLY_BASE_VA,
> pa, PMD_SIZE, PAGE_KERNEL);
> @@ -593,6 +738,8 @@ void pt_ops_set_early(void)
> #ifndef __PAGETABLE_PMD_FOLDED
> pt_ops.alloc_pmd = alloc_pmd_early;
> pt_ops.get_pmd_virt = get_pmd_virt_early;
> + pt_ops.alloc_pud = alloc_pud_early;
> + pt_ops.get_pud_virt = get_pud_virt_early;
> #endif
> }
>
> @@ -611,6 +758,8 @@ void pt_ops_set_fixmap(void)
> #ifndef __PAGETABLE_PMD_FOLDED
> pt_ops.alloc_pmd = kernel_mapping_pa_to_va((uintptr_t)alloc_pmd_fixmap);
> pt_ops.get_pmd_virt = kernel_mapping_pa_to_va((uintptr_t)get_pmd_virt_fixmap);
> + pt_ops.alloc_pud = kernel_mapping_pa_to_va((uintptr_t)alloc_pud_fixmap);
> + pt_ops.get_pud_virt = kernel_mapping_pa_to_va((uintptr_t)get_pud_virt_fixmap);
> #endif
> }
>
> @@ -625,6 +774,8 @@ void pt_ops_set_late(void)
> #ifndef __PAGETABLE_PMD_FOLDED
> pt_ops.alloc_pmd = alloc_pmd_late;
> pt_ops.get_pmd_virt = get_pmd_virt_late;
> + pt_ops.alloc_pud = alloc_pud_late;
> + pt_ops.get_pud_virt = get_pud_virt_late;
> #endif
> }
>
> @@ -633,6 +784,7 @@ asmlinkage void __init setup_vm(uintptr_t dtb_pa)
> pmd_t __maybe_unused fix_bmap_spmd, fix_bmap_epmd;
>
> kernel_map.virt_addr = KERNEL_LINK_ADDR;
> + kernel_map.page_offset = _AC(CONFIG_PAGE_OFFSET, UL);
>
> #ifdef CONFIG_XIP_KERNEL
> kernel_map.xiprom = (uintptr_t)CONFIG_XIP_PHYS_ADDR;
> @@ -647,6 +799,11 @@ asmlinkage void __init setup_vm(uintptr_t dtb_pa)
> kernel_map.phys_addr = (uintptr_t)(&_start);
> kernel_map.size = (uintptr_t)(&_end) - kernel_map.phys_addr;
> #endif
> +
> +#if defined(CONFIG_64BIT) && !defined(CONFIG_XIP_KERNEL)
> + set_satp_mode();
> +#endif
> +
> kernel_map.va_pa_offset = PAGE_OFFSET - kernel_map.phys_addr;
> kernel_map.va_kernel_pa_offset = kernel_map.virt_addr - kernel_map.phys_addr;
>
> @@ -676,15 +833,21 @@ asmlinkage void __init setup_vm(uintptr_t dtb_pa)
>
> /* Setup early PGD for fixmap */
> create_pgd_mapping(early_pg_dir, FIXADDR_START,
> - (uintptr_t)fixmap_pgd_next, PGDIR_SIZE, PAGE_TABLE);
> + fixmap_pgd_next, PGDIR_SIZE, PAGE_TABLE);
>
> #ifndef __PAGETABLE_PMD_FOLDED
> - /* Setup fixmap PMD */
> + /* Setup fixmap PUD and PMD */
> + if (pgtable_l4_enabled)
> + create_pud_mapping(fixmap_pud, FIXADDR_START,
> + (uintptr_t)fixmap_pmd, PUD_SIZE, PAGE_TABLE);
> create_pmd_mapping(fixmap_pmd, FIXADDR_START,
> (uintptr_t)fixmap_pte, PMD_SIZE, PAGE_TABLE);
> /* Setup trampoline PGD and PMD */
> create_pgd_mapping(trampoline_pg_dir, kernel_map.virt_addr,
> - (uintptr_t)trampoline_pmd, PGDIR_SIZE, PAGE_TABLE);
> + trampoline_pgd_next, PGDIR_SIZE, PAGE_TABLE);
> + if (pgtable_l4_enabled)
> + create_pud_mapping(trampoline_pud, kernel_map.virt_addr,
> + (uintptr_t)trampoline_pmd, PUD_SIZE, PAGE_TABLE);
> #ifdef CONFIG_XIP_KERNEL
> create_pmd_mapping(trampoline_pmd, kernel_map.virt_addr,
> kernel_map.xiprom, PMD_SIZE, PAGE_KERNEL_EXEC);
> @@ -712,7 +875,7 @@ asmlinkage void __init setup_vm(uintptr_t dtb_pa)
> * Bootime fixmap only can handle PMD_SIZE mapping. Thus, boot-ioremap
> * range can not span multiple pmds.
> */
> - BUILD_BUG_ON((__fix_to_virt(FIX_BTMAP_BEGIN) >> PMD_SHIFT)
> + BUG_ON((__fix_to_virt(FIX_BTMAP_BEGIN) >> PMD_SHIFT)
> != (__fix_to_virt(FIX_BTMAP_END) >> PMD_SHIFT));
>
> #ifndef __PAGETABLE_PMD_FOLDED
> @@ -783,9 +946,10 @@ static void __init setup_vm_final(void)
> /* Clear fixmap PTE and PMD mappings */
> clear_fixmap(FIX_PTE);
> clear_fixmap(FIX_PMD);
> + clear_fixmap(FIX_PUD);
>
> /* Move to swapper page table */
> - csr_write(CSR_SATP, PFN_DOWN(__pa_symbol(swapper_pg_dir)) | SATP_MODE);
> + csr_write(CSR_SATP, PFN_DOWN(__pa_symbol(swapper_pg_dir)) | satp_mode);
> local_flush_tlb_all();
>
> pt_ops_set_late();
> diff --git a/arch/riscv/mm/kasan_init.c b/arch/riscv/mm/kasan_init.c
> index 1434a0225140..993f50571a3b 100644
> --- a/arch/riscv/mm/kasan_init.c
> +++ b/arch/riscv/mm/kasan_init.c
> @@ -11,7 +11,29 @@
> #include <asm/fixmap.h>
> #include <asm/pgalloc.h>
>
> +/*
> + * Kasan shadow region must lie at a fixed address across sv39, sv48 and sv57
> + * which is right before the kernel.
> + *
> + * For sv39, the region is aligned on PGDIR_SIZE so we only need to populate
> + * the page global directory with kasan_early_shadow_pmd.
> + *
> + * For sv48 and sv57, the region is not aligned on PGDIR_SIZE so the mapping
> + * must be divided as follows:
> + * - the first PGD entry, although incomplete, is populated with
> + * kasan_early_shadow_pud/p4d
> + * - the PGD entries in the middle are populated with kasan_early_shadow_pud/p4d
> + * - the last PGD entry is shared with the kernel mapping so populated at the
> + * lower levels pud/p4d
> + *
> + * In addition, when shallow populating a kasan region (for example vmalloc),
> + * this region may also not be aligned on PGDIR size, so we must go down to the
> + * pud level too.
> + */
> +
> extern pgd_t early_pg_dir[PTRS_PER_PGD];
> +extern struct pt_alloc_ops _pt_ops __initdata;
> +#define pt_ops _pt_ops
>
> static void __init kasan_populate_pte(pmd_t *pmd, unsigned long vaddr, unsigned long end)
> {
> @@ -35,15 +57,19 @@ static void __init kasan_populate_pte(pmd_t *pmd, unsigned long vaddr, unsigned
> set_pmd(pmd, pfn_pmd(PFN_DOWN(__pa(base_pte)), PAGE_TABLE));
> }
>
> -static void __init kasan_populate_pmd(pgd_t *pgd, unsigned long vaddr, unsigned long end)
> +static void __init kasan_populate_pmd(pud_t *pud, unsigned long vaddr, unsigned long end)
> {
> phys_addr_t phys_addr;
> pmd_t *pmdp, *base_pmd;
> unsigned long next;
>
> - base_pmd = (pmd_t *)pgd_page_vaddr(*pgd);
> - if (base_pmd == lm_alias(kasan_early_shadow_pmd))
> + if (pud_none(*pud)) {
> base_pmd = memblock_alloc(PTRS_PER_PMD * sizeof(pmd_t), PAGE_SIZE);
> + } else {
> + base_pmd = (pmd_t *)pud_pgtable(*pud);
> + if (base_pmd == lm_alias(kasan_early_shadow_pmd))
> + base_pmd = memblock_alloc(PTRS_PER_PMD * sizeof(pmd_t), PAGE_SIZE);
> + }
>
> pmdp = base_pmd + pmd_index(vaddr);
>
> @@ -67,9 +93,72 @@ static void __init kasan_populate_pmd(pgd_t *pgd, unsigned long vaddr, unsigned
> * it entirely, memblock could allocate a page at a physical address
> * where KASAN is not populated yet and then we'd get a page fault.
> */
> - set_pgd(pgd, pfn_pgd(PFN_DOWN(__pa(base_pmd)), PAGE_TABLE));
> + set_pud(pud, pfn_pud(PFN_DOWN(__pa(base_pmd)), PAGE_TABLE));
> +}
> +
> +static void __init kasan_populate_pud(pgd_t *pgd,
> + unsigned long vaddr, unsigned long end,
> + bool early)
> +{
> + phys_addr_t phys_addr;
> + pud_t *pudp, *base_pud;
> + unsigned long next;
> +
> + if (early) {
> + /*
> + * We can't use pgd_page_vaddr here as it would return a linear
> + * mapping address but it is not mapped yet, but when populating
> + * early_pg_dir, we need the physical address and when populating
> + * swapper_pg_dir, we need the kernel virtual address so use
> + * pt_ops facility.
> + */
> + base_pud = pt_ops.get_pud_virt(pfn_to_phys(_pgd_pfn(*pgd)));
> + } else {
> + base_pud = (pud_t *)pgd_page_vaddr(*pgd);
> + if (base_pud == lm_alias(kasan_early_shadow_pud))
> + base_pud = memblock_alloc(PTRS_PER_PUD * sizeof(pud_t), PAGE_SIZE);
> + }
> +
> + pudp = base_pud + pud_index(vaddr);
> +
> + do {
> + next = pud_addr_end(vaddr, end);
> +
> + if (pud_none(*pudp) && IS_ALIGNED(vaddr, PUD_SIZE) && (next - vaddr) >= PUD_SIZE) {
> + if (early) {
> + phys_addr = __pa(((uintptr_t)kasan_early_shadow_pmd));
> + set_pud(pudp, pfn_pud(PFN_DOWN(phys_addr), PAGE_TABLE));
> + continue;
> + } else {
> + phys_addr = memblock_phys_alloc(PUD_SIZE, PUD_SIZE);
> + if (phys_addr) {
> + set_pud(pudp, pfn_pud(PFN_DOWN(phys_addr), PAGE_KERNEL));
> + continue;
> + }
> + }
> + }
> +
> + kasan_populate_pmd(pudp, vaddr, next);
> + } while (pudp++, vaddr = next, vaddr != end);
> +
> + /*
> + * Wait for the whole PGD to be populated before setting the PGD in
> + * the page table, otherwise, if we did set the PGD before populating
> + * it entirely, memblock could allocate a page at a physical address
> + * where KASAN is not populated yet and then we'd get a page fault.
> + */
> + if (!early)
> + set_pgd(pgd, pfn_pgd(PFN_DOWN(__pa(base_pud)), PAGE_TABLE));
> }
>
> +#define kasan_early_shadow_pgd_next (pgtable_l4_enabled ? \
> + (uintptr_t)kasan_early_shadow_pud : \
> + (uintptr_t)kasan_early_shadow_pmd)
> +#define kasan_populate_pgd_next(pgdp, vaddr, next, early) \
> + (pgtable_l4_enabled ? \
> + kasan_populate_pud(pgdp, vaddr, next, early) : \
> + kasan_populate_pmd((pud_t *)pgdp, vaddr, next))
> +
> static void __init kasan_populate_pgd(pgd_t *pgdp,
> unsigned long vaddr, unsigned long end,
> bool early)
> @@ -102,7 +191,7 @@ static void __init kasan_populate_pgd(pgd_t *pgdp,
> }
> }
>
> - kasan_populate_pmd(pgdp, vaddr, next);
> + kasan_populate_pgd_next(pgdp, vaddr, next, early);
> } while (pgdp++, vaddr = next, vaddr != end);
> }
>
> @@ -157,18 +246,54 @@ static void __init kasan_populate(void *start, void *end)
> memset(start, KASAN_SHADOW_INIT, end - start);
> }
>
> +static void __init kasan_shallow_populate_pud(pgd_t *pgdp,
> + unsigned long vaddr, unsigned long end,
> + bool kasan_populate)
> +{
> + unsigned long next;
> + pud_t *pudp, *base_pud;
> + pmd_t *base_pmd;
> + bool is_kasan_pmd;
> +
> + base_pud = (pud_t *)pgd_page_vaddr(*pgdp);
> + pudp = base_pud + pud_index(vaddr);
> +
> + if (kasan_populate)
> + memcpy(base_pud, (void *)kasan_early_shadow_pgd_next,
> + sizeof(pud_t) * PTRS_PER_PUD);
> +
> + do {
> + next = pud_addr_end(vaddr, end);
> + is_kasan_pmd = (pud_pgtable(*pudp) == lm_alias(kasan_early_shadow_pmd));
> +
> + if (is_kasan_pmd) {
> + base_pmd = memblock_alloc(PAGE_SIZE, PAGE_SIZE);
> + set_pud(pudp, pfn_pud(PFN_DOWN(__pa(base_pmd)), PAGE_TABLE));
> + }
> + } while (pudp++, vaddr = next, vaddr != end);
> +}
> +
> static void __init kasan_shallow_populate_pgd(unsigned long vaddr, unsigned long end)
> {
> unsigned long next;
> void *p;
> pgd_t *pgd_k = pgd_offset_k(vaddr);
> + bool is_kasan_pgd_next;
>
> do {
> next = pgd_addr_end(vaddr, end);
> - if (pgd_page_vaddr(*pgd_k) == (unsigned long)lm_alias(kasan_early_shadow_pmd)) {
> + is_kasan_pgd_next = (pgd_page_vaddr(*pgd_k) ==
> + (unsigned long)lm_alias(kasan_early_shadow_pgd_next));
> +
> + if (is_kasan_pgd_next) {
> p = memblock_alloc(PAGE_SIZE, PAGE_SIZE);
> set_pgd(pgd_k, pfn_pgd(PFN_DOWN(__pa(p)), PAGE_TABLE));
> }
> +
> + if (IS_ALIGNED(vaddr, PGDIR_SIZE) && (next - vaddr) >= PGDIR_SIZE)
> + continue;
> +
> + kasan_shallow_populate_pud(pgd_k, vaddr, next, is_kasan_pgd_next);
> } while (pgd_k++, vaddr = next, vaddr != end);
> }
>
> diff --git a/drivers/firmware/efi/libstub/efi-stub.c b/drivers/firmware/efi/libstub/efi-stub.c
> index 26e69788f27a..b3db5d91ed38 100644
> --- a/drivers/firmware/efi/libstub/efi-stub.c
> +++ b/drivers/firmware/efi/libstub/efi-stub.c
> @@ -40,6 +40,8 @@
>
> #ifdef CONFIG_ARM64
> # define EFI_RT_VIRTUAL_LIMIT DEFAULT_MAP_WINDOW_64
> +#elif defined(CONFIG_RISCV)
> +# define EFI_RT_VIRTUAL_LIMIT TASK_SIZE_MIN
> #else
> # define EFI_RT_VIRTUAL_LIMIT TASK_SIZE
> #endif
> --
> 2.32.0
>


--
Best Regards
Guo Ren

ML: https://lore.kernel.org/linux-csky/

2022-01-04 12:42:34

by Alexandre Ghiti

Subject: Re: [PATCH v3 07/13] riscv: Implement sv48 support

Hi Guo,

On Wed, Dec 29, 2021 at 4:42 AM Guo Ren <[email protected]> wrote:
>
> On Tue, Dec 7, 2021 at 11:54 AM Alexandre Ghiti
> <[email protected]> wrote:
> >
> > By adding a new 4th level of page table, allow the 64-bit kernel to
> > address 2^48 bytes of virtual address space: in practice, that offers
> > 128TB of virtual address space to userspace and allows up to 64TB of
> > physical memory.
> >
> > If the underlying hardware does not support sv48, we automatically fall
> > back to a standard 3-level page table by folding the new PUD level into
> > the PGDIR level. To detect HW capabilities at runtime, we rely on the
> > SATP feature that ignores writes with an unsupported mode.
> >
> > Signed-off-by: Alexandre Ghiti <[email protected]>
> > ---
> > arch/riscv/Kconfig | 4 +-
> > arch/riscv/include/asm/csr.h | 3 +-
> > arch/riscv/include/asm/fixmap.h | 1 +
> > arch/riscv/include/asm/kasan.h | 6 +-
> > arch/riscv/include/asm/page.h | 14 ++
> > arch/riscv/include/asm/pgalloc.h | 40 +++++
> > arch/riscv/include/asm/pgtable-64.h | 108 +++++++++++-
> > arch/riscv/include/asm/pgtable.h | 24 ++-
> > arch/riscv/kernel/head.S | 3 +-
> > arch/riscv/mm/context.c | 4 +-
> > arch/riscv/mm/init.c | 212 +++++++++++++++++++++---
> > arch/riscv/mm/kasan_init.c | 137 ++++++++++++++-
> > drivers/firmware/efi/libstub/efi-stub.c | 2 +
> > 13 files changed, 514 insertions(+), 44 deletions(-)
> >
> > diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
> > index ac6c0cd9bc29..d28fe0148e13 100644
> > --- a/arch/riscv/Kconfig
> > +++ b/arch/riscv/Kconfig
> > @@ -150,7 +150,7 @@ config PAGE_OFFSET
> > hex
> > default 0xC0000000 if 32BIT
> > default 0x80000000 if 64BIT && !MMU
> > - default 0xffffffd800000000 if 64BIT
> > + default 0xffffaf8000000000 if 64BIT
> >
> > config KASAN_SHADOW_OFFSET
> > hex
> > @@ -201,7 +201,7 @@ config FIX_EARLYCON_MEM
> >
> > config PGTABLE_LEVELS
> > int
> > - default 3 if 64BIT
> > + default 4 if 64BIT
> > default 2
> >
> > config LOCKDEP_SUPPORT
> > diff --git a/arch/riscv/include/asm/csr.h b/arch/riscv/include/asm/csr.h
> > index 87ac65696871..3fdb971c7896 100644
> > --- a/arch/riscv/include/asm/csr.h
> > +++ b/arch/riscv/include/asm/csr.h
> > @@ -40,14 +40,13 @@
> > #ifndef CONFIG_64BIT
> > #define SATP_PPN _AC(0x003FFFFF, UL)
> > #define SATP_MODE_32 _AC(0x80000000, UL)
> > -#define SATP_MODE SATP_MODE_32
> > #define SATP_ASID_BITS 9
> > #define SATP_ASID_SHIFT 22
> > #define SATP_ASID_MASK _AC(0x1FF, UL)
> > #else
> > #define SATP_PPN _AC(0x00000FFFFFFFFFFF, UL)
> > #define SATP_MODE_39 _AC(0x8000000000000000, UL)
> > -#define SATP_MODE SATP_MODE_39
> > +#define SATP_MODE_48 _AC(0x9000000000000000, UL)
> > #define SATP_ASID_BITS 16
> > #define SATP_ASID_SHIFT 44
> > #define SATP_ASID_MASK _AC(0xFFFF, UL)
> > diff --git a/arch/riscv/include/asm/fixmap.h b/arch/riscv/include/asm/fixmap.h
> > index 54cbf07fb4e9..58a718573ad6 100644
> > --- a/arch/riscv/include/asm/fixmap.h
> > +++ b/arch/riscv/include/asm/fixmap.h
> > @@ -24,6 +24,7 @@ enum fixed_addresses {
> > FIX_HOLE,
> > FIX_PTE,
> > FIX_PMD,
> > + FIX_PUD,
> > FIX_TEXT_POKE1,
> > FIX_TEXT_POKE0,
> > FIX_EARLYCON_MEM_BASE,
> > diff --git a/arch/riscv/include/asm/kasan.h b/arch/riscv/include/asm/kasan.h
> > index 743e6ff57996..0b85e363e778 100644
> > --- a/arch/riscv/include/asm/kasan.h
> > +++ b/arch/riscv/include/asm/kasan.h
> > @@ -28,7 +28,11 @@
> > #define KASAN_SHADOW_SCALE_SHIFT 3
> >
> > #define KASAN_SHADOW_SIZE (UL(1) << ((VA_BITS - 1) - KASAN_SHADOW_SCALE_SHIFT))
> > -#define KASAN_SHADOW_START (KASAN_SHADOW_END - KASAN_SHADOW_SIZE)
> > +/*
> > + * Depending on the size of the virtual address space, the region may not be
> > + * aligned on PGDIR_SIZE, so force its alignment to ease its population.
> > + */
> > +#define KASAN_SHADOW_START ((KASAN_SHADOW_END - KASAN_SHADOW_SIZE) & PGDIR_MASK)
> > #define KASAN_SHADOW_END MODULES_LOWEST_VADDR
> > #define KASAN_SHADOW_OFFSET _AC(CONFIG_KASAN_SHADOW_OFFSET, UL)
> >
> > diff --git a/arch/riscv/include/asm/page.h b/arch/riscv/include/asm/page.h
> > index e03559f9b35e..d089fe46f7d8 100644
> > --- a/arch/riscv/include/asm/page.h
> > +++ b/arch/riscv/include/asm/page.h
> > @@ -31,7 +31,20 @@
> > * When not using MMU this corresponds to the first free page in
> > * physical memory (aligned on a page boundary).
> > */
> > +#ifdef CONFIG_64BIT
> > +#ifdef CONFIG_MMU
> > +#define PAGE_OFFSET kernel_map.page_offset
> > +#else
> > +#define PAGE_OFFSET _AC(CONFIG_PAGE_OFFSET, UL)
> > +#endif
> > +/*
> > + * By default, CONFIG_PAGE_OFFSET value corresponds to SV48 address space so
> > + * define the PAGE_OFFSET value for SV39.
> > + */
> > +#define PAGE_OFFSET_L3 _AC(0xffffffd800000000, UL)
> > +#else
> > #define PAGE_OFFSET _AC(CONFIG_PAGE_OFFSET, UL)
> > +#endif /* CONFIG_64BIT */
> >
> > /*
> > * Half of the kernel address space (half of the entries of the page global
> > @@ -90,6 +103,7 @@ extern unsigned long riscv_pfn_base;
> > #endif /* CONFIG_MMU */
> >
> > struct kernel_mapping {
> > + unsigned long page_offset;
> > unsigned long virt_addr;
> > uintptr_t phys_addr;
> > uintptr_t size;
> > diff --git a/arch/riscv/include/asm/pgalloc.h b/arch/riscv/include/asm/pgalloc.h
> > index 0af6933a7100..11823004b87a 100644
> > --- a/arch/riscv/include/asm/pgalloc.h
> > +++ b/arch/riscv/include/asm/pgalloc.h
> > @@ -11,6 +11,8 @@
> > #include <asm/tlb.h>
> >
> > #ifdef CONFIG_MMU
> > +#define __HAVE_ARCH_PUD_ALLOC_ONE
> > +#define __HAVE_ARCH_PUD_FREE
> > #include <asm-generic/pgalloc.h>
> >
> > static inline void pmd_populate_kernel(struct mm_struct *mm,
> > @@ -36,6 +38,44 @@ static inline void pud_populate(struct mm_struct *mm, pud_t *pud, pmd_t *pmd)
> >
> > set_pud(pud, __pud((pfn << _PAGE_PFN_SHIFT) | _PAGE_TABLE));
> > }
> > +
> > +static inline void p4d_populate(struct mm_struct *mm, p4d_t *p4d, pud_t *pud)
> > +{
> > + if (pgtable_l4_enabled) {
> > + unsigned long pfn = virt_to_pfn(pud);
> > +
> > + set_p4d(p4d, __p4d((pfn << _PAGE_PFN_SHIFT) | _PAGE_TABLE));
> > + }
> > +}
> > +
> > +static inline void p4d_populate_safe(struct mm_struct *mm, p4d_t *p4d,
> > + pud_t *pud)
> > +{
> > + if (pgtable_l4_enabled) {
> > + unsigned long pfn = virt_to_pfn(pud);
> > +
> > + set_p4d_safe(p4d,
> > + __p4d((pfn << _PAGE_PFN_SHIFT) | _PAGE_TABLE));
> > + }
> > +}
> > +
> > +#define pud_alloc_one pud_alloc_one
> > +static inline pud_t *pud_alloc_one(struct mm_struct *mm, unsigned long addr)
> > +{
> > + if (pgtable_l4_enabled)
> > + return __pud_alloc_one(mm, addr);
> > +
> > + return NULL;
> > +}
> > +
> > +#define pud_free pud_free
> > +static inline void pud_free(struct mm_struct *mm, pud_t *pud)
> > +{
> > + if (pgtable_l4_enabled)
> > + __pud_free(mm, pud);
> > +}
> > +
> > +#define __pud_free_tlb(tlb, pud, addr) pud_free((tlb)->mm, pud)
> > #endif /* __PAGETABLE_PMD_FOLDED */
> >
> > static inline pgd_t *pgd_alloc(struct mm_struct *mm)
> > diff --git a/arch/riscv/include/asm/pgtable-64.h b/arch/riscv/include/asm/pgtable-64.h
> > index 228261aa9628..bbbdd66e5e2f 100644
> > --- a/arch/riscv/include/asm/pgtable-64.h
> > +++ b/arch/riscv/include/asm/pgtable-64.h
> > @@ -8,16 +8,36 @@
> >
> > #include <linux/const.h>
> >
> > -#define PGDIR_SHIFT 30
> > +extern bool pgtable_l4_enabled;
> > +
> > +#define PGDIR_SHIFT_L3 30
> > +#define PGDIR_SHIFT_L4 39
> > +#define PGDIR_SIZE_L3 (_AC(1, UL) << PGDIR_SHIFT_L3)
> > +
> > +#define PGDIR_SHIFT (pgtable_l4_enabled ? PGDIR_SHIFT_L4 : PGDIR_SHIFT_L3)
> > /* Size of region mapped by a page global directory */
> > #define PGDIR_SIZE (_AC(1, UL) << PGDIR_SHIFT)
> > #define PGDIR_MASK (~(PGDIR_SIZE - 1))
> >
> > +/* pud is folded into pgd in case of 3-level page table */
> > +#define PUD_SHIFT 30
> > +#define PUD_SIZE (_AC(1, UL) << PUD_SHIFT)
> > +#define PUD_MASK (~(PUD_SIZE - 1))
> > +
> > #define PMD_SHIFT 21
> > /* Size of region mapped by a page middle directory */
> > #define PMD_SIZE (_AC(1, UL) << PMD_SHIFT)
> > #define PMD_MASK (~(PMD_SIZE - 1))
> >
> > +/* Page Upper Directory entry */
> > +typedef struct {
> > + unsigned long pud;
> > +} pud_t;
> > +
> > +#define pud_val(x) ((x).pud)
> > +#define __pud(x) ((pud_t) { (x) })
> > +#define PTRS_PER_PUD (PAGE_SIZE / sizeof(pud_t))
> > +
> > /* Page Middle Directory entry */
> > typedef struct {
> > unsigned long pmd;
> > @@ -59,6 +79,16 @@ static inline void pud_clear(pud_t *pudp)
> > set_pud(pudp, __pud(0));
> > }
> >
> > +static inline pud_t pfn_pud(unsigned long pfn, pgprot_t prot)
> > +{
> > + return __pud((pfn << _PAGE_PFN_SHIFT) | pgprot_val(prot));
> > +}
> > +
> > +static inline unsigned long _pud_pfn(pud_t pud)
> > +{
> > + return pud_val(pud) >> _PAGE_PFN_SHIFT;
> > +}
> > +
> > static inline pmd_t *pud_pgtable(pud_t pud)
> > {
> > return (pmd_t *)pfn_to_virt(pud_val(pud) >> _PAGE_PFN_SHIFT);
> > @@ -69,6 +99,17 @@ static inline struct page *pud_page(pud_t pud)
> > return pfn_to_page(pud_val(pud) >> _PAGE_PFN_SHIFT);
> > }
> >
> > +#define mm_pud_folded mm_pud_folded
> > +static inline bool mm_pud_folded(struct mm_struct *mm)
> > +{
> > + if (pgtable_l4_enabled)
> > + return false;
> > +
> > + return true;
> > +}
> > +
> > +#define pmd_index(addr) (((addr) >> PMD_SHIFT) & (PTRS_PER_PMD - 1))
> > +
> > static inline pmd_t pfn_pmd(unsigned long pfn, pgprot_t prot)
> > {
> > return __pmd((pfn << _PAGE_PFN_SHIFT) | pgprot_val(prot));
> > @@ -84,4 +125,69 @@ static inline unsigned long _pmd_pfn(pmd_t pmd)
> > #define pmd_ERROR(e) \
> > pr_err("%s:%d: bad pmd %016lx.\n", __FILE__, __LINE__, pmd_val(e))
> >
> > +#define pud_ERROR(e) \
> > + pr_err("%s:%d: bad pud %016lx.\n", __FILE__, __LINE__, pud_val(e))
> > +
> > +static inline void set_p4d(p4d_t *p4dp, p4d_t p4d)
> > +{
> > + if (pgtable_l4_enabled)
> > + *p4dp = p4d;
> > + else
> > + set_pud((pud_t *)p4dp, (pud_t){ p4d_val(p4d) });
> > +}
> > +
> > +static inline int p4d_none(p4d_t p4d)
> > +{
> > + if (pgtable_l4_enabled)
> > + return (p4d_val(p4d) == 0);
> > +
> > + return 0;
> > +}
> > +
> > +static inline int p4d_present(p4d_t p4d)
> > +{
> > + if (pgtable_l4_enabled)
> > + return (p4d_val(p4d) & _PAGE_PRESENT);
> > +
> > + return 1;
> > +}
> > +
> > +static inline int p4d_bad(p4d_t p4d)
> > +{
> > + if (pgtable_l4_enabled)
> > + return !p4d_present(p4d);
> > +
> > + return 0;
> > +}
> > +
> > +static inline void p4d_clear(p4d_t *p4d)
> > +{
> > + if (pgtable_l4_enabled)
> > + set_p4d(p4d, __p4d(0));
> > +}
> > +
> > +static inline pud_t *p4d_pgtable(p4d_t p4d)
> > +{
> > + if (pgtable_l4_enabled)
> > + return (pud_t *)pfn_to_virt(p4d_val(p4d) >> _PAGE_PFN_SHIFT);
> > +
> > + return (pud_t *)pud_pgtable((pud_t) { p4d_val(p4d) });
> > +}
> > +
> > +static inline struct page *p4d_page(p4d_t p4d)
> > +{
> > + return pfn_to_page(p4d_val(p4d) >> _PAGE_PFN_SHIFT);
> > +}
> > +
> > +#define pud_index(addr) (((addr) >> PUD_SHIFT) & (PTRS_PER_PUD - 1))
> > +
> > +#define pud_offset pud_offset
> > +static inline pud_t *pud_offset(p4d_t *p4d, unsigned long address)
> > +{
> > + if (pgtable_l4_enabled)
> > + return p4d_pgtable(*p4d) + pud_index(address);
> > +
> > + return (pud_t *)p4d;
> > +}
> > +
> > #endif /* _ASM_RISCV_PGTABLE_64_H */
> > diff --git a/arch/riscv/include/asm/pgtable.h b/arch/riscv/include/asm/pgtable.h
> > index e1a52e22ad7e..e1c74ef4ead2 100644
> > --- a/arch/riscv/include/asm/pgtable.h
> > +++ b/arch/riscv/include/asm/pgtable.h
> > @@ -51,7 +51,7 @@
> > * position vmemmap directly below the VMALLOC region.
> > */
> > #ifdef CONFIG_64BIT
> > -#define VA_BITS 39
> > +#define VA_BITS (pgtable_l4_enabled ? 48 : 39)
> > #else
> > #define VA_BITS 32
> > #endif
> > @@ -90,8 +90,7 @@
> >
> > #ifndef __ASSEMBLY__
> >
> > -/* Page Upper Directory not used in RISC-V */
> > -#include <asm-generic/pgtable-nopud.h>
> > +#include <asm-generic/pgtable-nop4d.h>
> > #include <asm/page.h>
> > #include <asm/tlbflush.h>
> > #include <linux/mm_types.h>
> > @@ -113,6 +112,17 @@
> > #define XIP_FIXUP(addr) (addr)
> > #endif /* CONFIG_XIP_KERNEL */
> >
> > +struct pt_alloc_ops {
> > + pte_t *(*get_pte_virt)(phys_addr_t pa);
> > + phys_addr_t (*alloc_pte)(uintptr_t va);
> > +#ifndef __PAGETABLE_PMD_FOLDED
> > + pmd_t *(*get_pmd_virt)(phys_addr_t pa);
> > + phys_addr_t (*alloc_pmd)(uintptr_t va);
> > + pud_t *(*get_pud_virt)(phys_addr_t pa);
> > + phys_addr_t (*alloc_pud)(uintptr_t va);
> > +#endif
> > +};
> > +
> > #ifdef CONFIG_MMU
> > /* Number of entries in the page global directory */
> > #define PTRS_PER_PGD (PAGE_SIZE / sizeof(pgd_t))
> > @@ -669,9 +679,11 @@ static inline pmd_t pmdp_establish(struct vm_area_struct *vma,
> > * Note that PGDIR_SIZE must evenly divide TASK_SIZE.
> > */
> > #ifdef CONFIG_64BIT
> > -#define TASK_SIZE (PGDIR_SIZE * PTRS_PER_PGD / 2)
> > +#define TASK_SIZE (PGDIR_SIZE * PTRS_PER_PGD / 2)
> > +#define TASK_SIZE_MIN (PGDIR_SIZE_L3 * PTRS_PER_PGD / 2)
> > #else
> > -#define TASK_SIZE FIXADDR_START
> > +#define TASK_SIZE FIXADDR_START
> > +#define TASK_SIZE_MIN TASK_SIZE
> This is used by efi-stub.c; the rv64 compat patch also needs it, where
> we reuse the DEFAULT_MAP_WINDOW_64 macro.
>
> TASK_SIZE_MIN is also okay for me, but I think it should be a separate
> patch together with the efi-stub modification.

IMO, TASK_SIZE_MIN is more explicit than DEFAULT_MAP_WINDOW_64. I'll
split this change out into a separate patch in the next series.

> https://lore.kernel.org/linux-riscv/[email protected]/
>
> I've merged your patchset with the compat tree and we are testing them
> together thoroughly and carefully.
> https://github.com/c-sky/csky-linux/tree/riscv_compat_v2_sv48_v3
>
> For now, the rv32_rootfs & rv64_rootfs boots have passed, but I will
> give you my Tested-by later, once testing is complete. Your patch set
> is very helpful, thanks.

Thanks a lot, that will help move forward ;)

>
> PS: Could you give users the chance to choose sv48 or sv39 in the dts?
>

This is already implemented in patch 13.

Thanks!

Alex

>
> > #endif
> >
> > #else /* CONFIG_MMU */
> > @@ -697,6 +709,8 @@ extern uintptr_t _dtb_early_pa;
> > #define dtb_early_va _dtb_early_va
> > #define dtb_early_pa _dtb_early_pa
> > #endif /* CONFIG_XIP_KERNEL */
> > +extern u64 satp_mode;
> > +extern bool pgtable_l4_enabled;
> >
> > void paging_init(void);
> > void misc_mem_init(void);
> > diff --git a/arch/riscv/kernel/head.S b/arch/riscv/kernel/head.S
> > index 52c5ff9804c5..c3c0ed559770 100644
> > --- a/arch/riscv/kernel/head.S
> > +++ b/arch/riscv/kernel/head.S
> > @@ -95,7 +95,8 @@ relocate:
> >
> > /* Compute satp for kernel page tables, but don't load it yet */
> > srl a2, a0, PAGE_SHIFT
> > - li a1, SATP_MODE
> > + la a1, satp_mode
> > + REG_L a1, 0(a1)
> > or a2, a2, a1
> >
> > /*
> > diff --git a/arch/riscv/mm/context.c b/arch/riscv/mm/context.c
> > index ee3459cb6750..a7246872bd30 100644
> > --- a/arch/riscv/mm/context.c
> > +++ b/arch/riscv/mm/context.c
> > @@ -192,7 +192,7 @@ static void set_mm_asid(struct mm_struct *mm, unsigned int cpu)
> > switch_mm_fast:
> > csr_write(CSR_SATP, virt_to_pfn(mm->pgd) |
> > ((cntx & asid_mask) << SATP_ASID_SHIFT) |
> > - SATP_MODE);
> > + satp_mode);
> >
> > if (need_flush_tlb)
> > local_flush_tlb_all();
> > @@ -201,7 +201,7 @@ static void set_mm_asid(struct mm_struct *mm, unsigned int cpu)
> > static void set_mm_noasid(struct mm_struct *mm)
> > {
> > /* Switch the page table and blindly nuke entire local TLB */
> > - csr_write(CSR_SATP, virt_to_pfn(mm->pgd) | SATP_MODE);
> > + csr_write(CSR_SATP, virt_to_pfn(mm->pgd) | satp_mode);
> > local_flush_tlb_all();
> > }
> >
> > diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c
> > index 1552226fb6bd..6a19a1b1caf8 100644
> > --- a/arch/riscv/mm/init.c
> > +++ b/arch/riscv/mm/init.c
> > @@ -37,6 +37,17 @@ EXPORT_SYMBOL(kernel_map);
> > #define kernel_map (*(struct kernel_mapping *)XIP_FIXUP(&kernel_map))
> > #endif
> >
> > +#ifdef CONFIG_64BIT
> > +u64 satp_mode = !IS_ENABLED(CONFIG_XIP_KERNEL) ? SATP_MODE_48 : SATP_MODE_39;
> > +#else
> > +u64 satp_mode = SATP_MODE_32;
> > +#endif
> > +EXPORT_SYMBOL(satp_mode);
> > +
> > +bool pgtable_l4_enabled = IS_ENABLED(CONFIG_64BIT) && !IS_ENABLED(CONFIG_XIP_KERNEL) ?
> > + true : false;
> > +EXPORT_SYMBOL(pgtable_l4_enabled);
> > +
> > phys_addr_t phys_ram_base __ro_after_init;
> > EXPORT_SYMBOL(phys_ram_base);
> >
> > @@ -53,15 +64,6 @@ extern char _start[];
> > void *_dtb_early_va __initdata;
> > uintptr_t _dtb_early_pa __initdata;
> >
> > -struct pt_alloc_ops {
> > - pte_t *(*get_pte_virt)(phys_addr_t pa);
> > - phys_addr_t (*alloc_pte)(uintptr_t va);
> > -#ifndef __PAGETABLE_PMD_FOLDED
> > - pmd_t *(*get_pmd_virt)(phys_addr_t pa);
> > - phys_addr_t (*alloc_pmd)(uintptr_t va);
> > -#endif
> > -};
> > -
> > static phys_addr_t dma32_phys_limit __initdata;
> >
> > static void __init zone_sizes_init(void)
> > @@ -222,7 +224,7 @@ static void __init setup_bootmem(void)
> > }
> >
> > #ifdef CONFIG_MMU
> > -static struct pt_alloc_ops _pt_ops __initdata;
> > +struct pt_alloc_ops _pt_ops __initdata;
> >
> > #ifdef CONFIG_XIP_KERNEL
> > #define pt_ops (*(struct pt_alloc_ops *)XIP_FIXUP(&_pt_ops))
> > @@ -238,6 +240,7 @@ pgd_t trampoline_pg_dir[PTRS_PER_PGD] __page_aligned_bss;
> > static pte_t fixmap_pte[PTRS_PER_PTE] __page_aligned_bss;
> >
> > pgd_t early_pg_dir[PTRS_PER_PGD] __initdata __aligned(PAGE_SIZE);
> > +static pud_t __maybe_unused early_dtb_pud[PTRS_PER_PUD] __initdata __aligned(PAGE_SIZE);
> > static pmd_t __maybe_unused early_dtb_pmd[PTRS_PER_PMD] __initdata __aligned(PAGE_SIZE);
> >
> > #ifdef CONFIG_XIP_KERNEL
> > @@ -326,6 +329,16 @@ static pmd_t early_pmd[PTRS_PER_PMD] __initdata __aligned(PAGE_SIZE);
> > #define early_pmd ((pmd_t *)XIP_FIXUP(early_pmd))
> > #endif /* CONFIG_XIP_KERNEL */
> >
> > +static pud_t trampoline_pud[PTRS_PER_PUD] __page_aligned_bss;
> > +static pud_t fixmap_pud[PTRS_PER_PUD] __page_aligned_bss;
> > +static pud_t early_pud[PTRS_PER_PUD] __initdata __aligned(PAGE_SIZE);
> > +
> > +#ifdef CONFIG_XIP_KERNEL
> > +#define trampoline_pud ((pud_t *)XIP_FIXUP(trampoline_pud))
> > +#define fixmap_pud ((pud_t *)XIP_FIXUP(fixmap_pud))
> > +#define early_pud ((pud_t *)XIP_FIXUP(early_pud))
> > +#endif /* CONFIG_XIP_KERNEL */
> > +
> > static pmd_t *__init get_pmd_virt_early(phys_addr_t pa)
> > {
> > /* Before MMU is enabled */
> > @@ -345,7 +358,7 @@ static pmd_t *__init get_pmd_virt_late(phys_addr_t pa)
> >
> > static phys_addr_t __init alloc_pmd_early(uintptr_t va)
> > {
> > - BUG_ON((va - kernel_map.virt_addr) >> PGDIR_SHIFT);
> > + BUG_ON((va - kernel_map.virt_addr) >> PUD_SHIFT);
> >
> > return (uintptr_t)early_pmd;
> > }
> > @@ -391,21 +404,97 @@ static void __init create_pmd_mapping(pmd_t *pmdp,
> > create_pte_mapping(ptep, va, pa, sz, prot);
> > }
> >
> > -#define pgd_next_t pmd_t
> > -#define alloc_pgd_next(__va) pt_ops.alloc_pmd(__va)
> > -#define get_pgd_next_virt(__pa) pt_ops.get_pmd_virt(__pa)
> > +static pud_t *__init get_pud_virt_early(phys_addr_t pa)
> > +{
> > + return (pud_t *)((uintptr_t)pa);
> > +}
> > +
> > +static pud_t *__init get_pud_virt_fixmap(phys_addr_t pa)
> > +{
> > + clear_fixmap(FIX_PUD);
> > + return (pud_t *)set_fixmap_offset(FIX_PUD, pa);
> > +}
> > +
> > +static pud_t *__init get_pud_virt_late(phys_addr_t pa)
> > +{
> > + return (pud_t *)__va(pa);
> > +}
> > +
> > +static phys_addr_t __init alloc_pud_early(uintptr_t va)
> > +{
> > + /* Only one PUD is available for early mapping */
> > + BUG_ON((va - kernel_map.virt_addr) >> PGDIR_SHIFT);
> > +
> > + return (uintptr_t)early_pud;
> > +}
> > +
> > +static phys_addr_t __init alloc_pud_fixmap(uintptr_t va)
> > +{
> > + return memblock_phys_alloc(PAGE_SIZE, PAGE_SIZE);
> > +}
> > +
> > +static phys_addr_t alloc_pud_late(uintptr_t va)
> > +{
> > + unsigned long vaddr;
> > +
> > + vaddr = __get_free_page(GFP_KERNEL);
> > + BUG_ON(!vaddr);
> > + return __pa(vaddr);
> > +}
> > +
> > +static void __init create_pud_mapping(pud_t *pudp,
> > + uintptr_t va, phys_addr_t pa,
> > + phys_addr_t sz, pgprot_t prot)
> > +{
> > + pmd_t *nextp;
> > + phys_addr_t next_phys;
> > + uintptr_t pud_index = pud_index(va);
> > +
> > + if (sz == PUD_SIZE) {
> > + if (pud_val(pudp[pud_index]) == 0)
> > + pudp[pud_index] = pfn_pud(PFN_DOWN(pa), prot);
> > + return;
> > + }
> > +
> > + if (pud_val(pudp[pud_index]) == 0) {
> > + next_phys = pt_ops.alloc_pmd(va);
> > + pudp[pud_index] = pfn_pud(PFN_DOWN(next_phys), PAGE_TABLE);
> > + nextp = pt_ops.get_pmd_virt(next_phys);
> > + memset(nextp, 0, PAGE_SIZE);
> > + } else {
> > + next_phys = PFN_PHYS(_pud_pfn(pudp[pud_index]));
> > + nextp = pt_ops.get_pmd_virt(next_phys);
> > + }
> > +
> > + create_pmd_mapping(nextp, va, pa, sz, prot);
> > +}
> > +
> > +#define pgd_next_t pud_t
> > +#define alloc_pgd_next(__va) (pgtable_l4_enabled ? \
> > + pt_ops.alloc_pud(__va) : pt_ops.alloc_pmd(__va))
> > +#define get_pgd_next_virt(__pa) (pgtable_l4_enabled ? \
> > + pt_ops.get_pud_virt(__pa) : (pgd_next_t *)pt_ops.get_pmd_virt(__pa))
> > #define create_pgd_next_mapping(__nextp, __va, __pa, __sz, __prot) \
> > - create_pmd_mapping(__nextp, __va, __pa, __sz, __prot)
> > -#define fixmap_pgd_next fixmap_pmd
> > + (pgtable_l4_enabled ? \
> > + create_pud_mapping(__nextp, __va, __pa, __sz, __prot) : \
> > + create_pmd_mapping((pmd_t *)__nextp, __va, __pa, __sz, __prot))
> > +#define fixmap_pgd_next (pgtable_l4_enabled ? \
> > + (uintptr_t)fixmap_pud : (uintptr_t)fixmap_pmd)
> > +#define trampoline_pgd_next (pgtable_l4_enabled ? \
> > + (uintptr_t)trampoline_pud : (uintptr_t)trampoline_pmd)
> > +#define early_dtb_pgd_next (pgtable_l4_enabled ? \
> > + (uintptr_t)early_dtb_pud : (uintptr_t)early_dtb_pmd)
> > #else
> > #define pgd_next_t pte_t
> > #define alloc_pgd_next(__va) pt_ops.alloc_pte(__va)
> > #define get_pgd_next_virt(__pa) pt_ops.get_pte_virt(__pa)
> > #define create_pgd_next_mapping(__nextp, __va, __pa, __sz, __prot) \
> > create_pte_mapping(__nextp, __va, __pa, __sz, __prot)
> > -#define fixmap_pgd_next fixmap_pte
> > +#define fixmap_pgd_next ((uintptr_t)fixmap_pte)
> > +#define early_dtb_pgd_next ((uintptr_t)early_dtb_pmd)
> > +#define create_pud_mapping(__pmdp, __va, __pa, __sz, __prot)
> > #define create_pmd_mapping(__pmdp, __va, __pa, __sz, __prot)
> > -#endif
> > +#endif /* __PAGETABLE_PMD_FOLDED */
> >
> > void __init create_pgd_mapping(pgd_t *pgdp,
> > uintptr_t va, phys_addr_t pa,
> > @@ -493,6 +582,57 @@ static __init pgprot_t pgprot_from_va(uintptr_t va)
> > }
> > #endif /* CONFIG_STRICT_KERNEL_RWX */
> >
> > +#ifdef CONFIG_64BIT
> > +static void __init disable_pgtable_l4(void)
> > +{
> > + pgtable_l4_enabled = false;
> > + kernel_map.page_offset = PAGE_OFFSET_L3;
> > + satp_mode = SATP_MODE_39;
> > +}
> > +
> > +/*
> > + * There is a simple way to determine if 4-level is supported by the
> > + * underlying hardware: establish 1:1 mapping in 4-level page table mode
> > + * then read SATP to see if the configuration was taken into account
> > + * meaning sv48 is supported.
> > + */
> > +static __init void set_satp_mode(void)
> > +{
> > + u64 identity_satp, hw_satp;
> > + uintptr_t set_satp_mode_pmd;
> > +
> > + set_satp_mode_pmd = ((unsigned long)set_satp_mode) & PMD_MASK;
> > + create_pgd_mapping(early_pg_dir,
> > + set_satp_mode_pmd, (uintptr_t)early_pud,
> > + PGDIR_SIZE, PAGE_TABLE);
> > + create_pud_mapping(early_pud,
> > + set_satp_mode_pmd, (uintptr_t)early_pmd,
> > + PUD_SIZE, PAGE_TABLE);
> > + /* Handle the case where set_satp_mode straddles 2 PMDs */
> > + create_pmd_mapping(early_pmd,
> > + set_satp_mode_pmd, set_satp_mode_pmd,
> > + PMD_SIZE, PAGE_KERNEL_EXEC);
> > + create_pmd_mapping(early_pmd,
> > + set_satp_mode_pmd + PMD_SIZE,
> > + set_satp_mode_pmd + PMD_SIZE,
> > + PMD_SIZE, PAGE_KERNEL_EXEC);
> > +
> > + identity_satp = PFN_DOWN((uintptr_t)&early_pg_dir) | satp_mode;
> > +
> > + local_flush_tlb_all();
> > + csr_write(CSR_SATP, identity_satp);
> > + hw_satp = csr_swap(CSR_SATP, 0ULL);
> > + local_flush_tlb_all();
> > +
> > + if (hw_satp != identity_satp)
> > + disable_pgtable_l4();
> > +
> > + memset(early_pg_dir, 0, PAGE_SIZE);
> > + memset(early_pud, 0, PAGE_SIZE);
> > + memset(early_pmd, 0, PAGE_SIZE);
> > +}
> > +#endif
> > +
> > /*
> > * setup_vm() is called from head.S with MMU-off.
> > *
> > @@ -557,10 +697,15 @@ static void __init create_fdt_early_page_table(pgd_t *pgdir, uintptr_t dtb_pa)
> > uintptr_t pa = dtb_pa & ~(PMD_SIZE - 1);
> >
> > create_pgd_mapping(early_pg_dir, DTB_EARLY_BASE_VA,
> > - IS_ENABLED(CONFIG_64BIT) ? (uintptr_t)early_dtb_pmd : pa,
> > + IS_ENABLED(CONFIG_64BIT) ? early_dtb_pgd_next : pa,
> > PGDIR_SIZE,
> > IS_ENABLED(CONFIG_64BIT) ? PAGE_TABLE : PAGE_KERNEL);
> >
> > + if (pgtable_l4_enabled) {
> > + create_pud_mapping(early_dtb_pud, DTB_EARLY_BASE_VA,
> > + (uintptr_t)early_dtb_pmd, PUD_SIZE, PAGE_TABLE);
> > + }
> > +
> > if (IS_ENABLED(CONFIG_64BIT)) {
> > create_pmd_mapping(early_dtb_pmd, DTB_EARLY_BASE_VA,
> > pa, PMD_SIZE, PAGE_KERNEL);
> > @@ -593,6 +738,8 @@ void pt_ops_set_early(void)
> > #ifndef __PAGETABLE_PMD_FOLDED
> > pt_ops.alloc_pmd = alloc_pmd_early;
> > pt_ops.get_pmd_virt = get_pmd_virt_early;
> > + pt_ops.alloc_pud = alloc_pud_early;
> > + pt_ops.get_pud_virt = get_pud_virt_early;
> > #endif
> > }
> >
> > @@ -611,6 +758,8 @@ void pt_ops_set_fixmap(void)
> > #ifndef __PAGETABLE_PMD_FOLDED
> > pt_ops.alloc_pmd = kernel_mapping_pa_to_va((uintptr_t)alloc_pmd_fixmap);
> > pt_ops.get_pmd_virt = kernel_mapping_pa_to_va((uintptr_t)get_pmd_virt_fixmap);
> > + pt_ops.alloc_pud = kernel_mapping_pa_to_va((uintptr_t)alloc_pud_fixmap);
> > + pt_ops.get_pud_virt = kernel_mapping_pa_to_va((uintptr_t)get_pud_virt_fixmap);
> > #endif
> > }
> >
> > @@ -625,6 +774,8 @@ void pt_ops_set_late(void)
> > #ifndef __PAGETABLE_PMD_FOLDED
> > pt_ops.alloc_pmd = alloc_pmd_late;
> > pt_ops.get_pmd_virt = get_pmd_virt_late;
> > + pt_ops.alloc_pud = alloc_pud_late;
> > + pt_ops.get_pud_virt = get_pud_virt_late;
> > #endif
> > }
> >
> > @@ -633,6 +784,7 @@ asmlinkage void __init setup_vm(uintptr_t dtb_pa)
> > pmd_t __maybe_unused fix_bmap_spmd, fix_bmap_epmd;
> >
> > kernel_map.virt_addr = KERNEL_LINK_ADDR;
> > + kernel_map.page_offset = _AC(CONFIG_PAGE_OFFSET, UL);
> >
> > #ifdef CONFIG_XIP_KERNEL
> > kernel_map.xiprom = (uintptr_t)CONFIG_XIP_PHYS_ADDR;
> > @@ -647,6 +799,11 @@ asmlinkage void __init setup_vm(uintptr_t dtb_pa)
> > kernel_map.phys_addr = (uintptr_t)(&_start);
> > kernel_map.size = (uintptr_t)(&_end) - kernel_map.phys_addr;
> > #endif
> > +
> > +#if defined(CONFIG_64BIT) && !defined(CONFIG_XIP_KERNEL)
> > + set_satp_mode();
> > +#endif
> > +
> > kernel_map.va_pa_offset = PAGE_OFFSET - kernel_map.phys_addr;
> > kernel_map.va_kernel_pa_offset = kernel_map.virt_addr - kernel_map.phys_addr;
> >
> > @@ -676,15 +833,21 @@ asmlinkage void __init setup_vm(uintptr_t dtb_pa)
> >
> > /* Setup early PGD for fixmap */
> > create_pgd_mapping(early_pg_dir, FIXADDR_START,
> > - (uintptr_t)fixmap_pgd_next, PGDIR_SIZE, PAGE_TABLE);
> > + fixmap_pgd_next, PGDIR_SIZE, PAGE_TABLE);
> >
> > #ifndef __PAGETABLE_PMD_FOLDED
> > - /* Setup fixmap PMD */
> > + /* Setup fixmap PUD and PMD */
> > + if (pgtable_l4_enabled)
> > + create_pud_mapping(fixmap_pud, FIXADDR_START,
> > + (uintptr_t)fixmap_pmd, PUD_SIZE, PAGE_TABLE);
> > create_pmd_mapping(fixmap_pmd, FIXADDR_START,
> > (uintptr_t)fixmap_pte, PMD_SIZE, PAGE_TABLE);
> > /* Setup trampoline PGD and PMD */
> > create_pgd_mapping(trampoline_pg_dir, kernel_map.virt_addr,
> > - (uintptr_t)trampoline_pmd, PGDIR_SIZE, PAGE_TABLE);
> > + trampoline_pgd_next, PGDIR_SIZE, PAGE_TABLE);
> > + if (pgtable_l4_enabled)
> > + create_pud_mapping(trampoline_pud, kernel_map.virt_addr,
> > + (uintptr_t)trampoline_pmd, PUD_SIZE, PAGE_TABLE);
> > #ifdef CONFIG_XIP_KERNEL
> > create_pmd_mapping(trampoline_pmd, kernel_map.virt_addr,
> > kernel_map.xiprom, PMD_SIZE, PAGE_KERNEL_EXEC);
> > @@ -712,7 +875,7 @@ asmlinkage void __init setup_vm(uintptr_t dtb_pa)
> > * Bootime fixmap only can handle PMD_SIZE mapping. Thus, boot-ioremap
> > * range can not span multiple pmds.
> > */
> > - BUILD_BUG_ON((__fix_to_virt(FIX_BTMAP_BEGIN) >> PMD_SHIFT)
> > + BUG_ON((__fix_to_virt(FIX_BTMAP_BEGIN) >> PMD_SHIFT)
> > != (__fix_to_virt(FIX_BTMAP_END) >> PMD_SHIFT));
> >
> > #ifndef __PAGETABLE_PMD_FOLDED
> > @@ -783,9 +946,10 @@ static void __init setup_vm_final(void)
> > /* Clear fixmap PTE and PMD mappings */
> > clear_fixmap(FIX_PTE);
> > clear_fixmap(FIX_PMD);
> > + clear_fixmap(FIX_PUD);
> >
> > /* Move to swapper page table */
> > - csr_write(CSR_SATP, PFN_DOWN(__pa_symbol(swapper_pg_dir)) | SATP_MODE);
> > + csr_write(CSR_SATP, PFN_DOWN(__pa_symbol(swapper_pg_dir)) | satp_mode);
> > local_flush_tlb_all();
> >
> > pt_ops_set_late();
> > diff --git a/arch/riscv/mm/kasan_init.c b/arch/riscv/mm/kasan_init.c
> > index 1434a0225140..993f50571a3b 100644
> > --- a/arch/riscv/mm/kasan_init.c
> > +++ b/arch/riscv/mm/kasan_init.c
> > @@ -11,7 +11,29 @@
> > #include <asm/fixmap.h>
> > #include <asm/pgalloc.h>
> >
> > +/*
> > + * Kasan shadow region must lie at a fixed address across sv39, sv48 and sv57
> > + * which is right before the kernel.
> > + *
> > + * For sv39, the region is aligned on PGDIR_SIZE so we only need to populate
> > + * the page global directory with kasan_early_shadow_pmd.
> > + *
> > + * For sv48 and sv57, the region is not aligned on PGDIR_SIZE so the mapping
> > + * must be divided as follows:
> > + * - the first PGD entry, although incomplete, is populated with
> > + * kasan_early_shadow_pud/p4d
> > + * - the PGD entries in the middle are populated with kasan_early_shadow_pud/p4d
> > + * - the last PGD entry is shared with the kernel mapping so populated at the
> > + * lower levels pud/p4d
> > + *
> > + * In addition, when shallow populating a kasan region (for example vmalloc),
> > + * this region may also not be aligned on PGDIR size, so we must go down to the
> > + * pud level too.
> > + */
> > +
> > extern pgd_t early_pg_dir[PTRS_PER_PGD];
> > +extern struct pt_alloc_ops _pt_ops __initdata;
> > +#define pt_ops _pt_ops
> >
> > static void __init kasan_populate_pte(pmd_t *pmd, unsigned long vaddr, unsigned long end)
> > {
> > @@ -35,15 +57,19 @@ static void __init kasan_populate_pte(pmd_t *pmd, unsigned long vaddr, unsigned
> > set_pmd(pmd, pfn_pmd(PFN_DOWN(__pa(base_pte)), PAGE_TABLE));
> > }
> >
> > -static void __init kasan_populate_pmd(pgd_t *pgd, unsigned long vaddr, unsigned long end)
> > +static void __init kasan_populate_pmd(pud_t *pud, unsigned long vaddr, unsigned long end)
> > {
> > phys_addr_t phys_addr;
> > pmd_t *pmdp, *base_pmd;
> > unsigned long next;
> >
> > - base_pmd = (pmd_t *)pgd_page_vaddr(*pgd);
> > - if (base_pmd == lm_alias(kasan_early_shadow_pmd))
> > + if (pud_none(*pud)) {
> > base_pmd = memblock_alloc(PTRS_PER_PMD * sizeof(pmd_t), PAGE_SIZE);
> > + } else {
> > + base_pmd = (pmd_t *)pud_pgtable(*pud);
> > + if (base_pmd == lm_alias(kasan_early_shadow_pmd))
> > + base_pmd = memblock_alloc(PTRS_PER_PMD * sizeof(pmd_t), PAGE_SIZE);
> > + }
> >
> > pmdp = base_pmd + pmd_index(vaddr);
> >
> > @@ -67,9 +93,72 @@ static void __init kasan_populate_pmd(pgd_t *pgd, unsigned long vaddr, unsigned
> > * it entirely, memblock could allocate a page at a physical address
> > * where KASAN is not populated yet and then we'd get a page fault.
> > */
> > - set_pgd(pgd, pfn_pgd(PFN_DOWN(__pa(base_pmd)), PAGE_TABLE));
> > + set_pud(pud, pfn_pud(PFN_DOWN(__pa(base_pmd)), PAGE_TABLE));
> > +}
> > +
> > +static void __init kasan_populate_pud(pgd_t *pgd,
> > + unsigned long vaddr, unsigned long end,
> > + bool early)
> > +{
> > + phys_addr_t phys_addr;
> > + pud_t *pudp, *base_pud;
> > + unsigned long next;
> > +
> > + if (early) {
> > + /*
> > + * We can't use pgd_page_vaddr here as it would return a linear
> > + * mapping address but it is not mapped yet, but when populating
> > + * early_pg_dir, we need the physical address and when populating
> > + * swapper_pg_dir, we need the kernel virtual address so use
> > + * pt_ops facility.
> > + */
> > + base_pud = pt_ops.get_pud_virt(pfn_to_phys(_pgd_pfn(*pgd)));
> > + } else {
> > + base_pud = (pud_t *)pgd_page_vaddr(*pgd);
> > + if (base_pud == lm_alias(kasan_early_shadow_pud))
> > + base_pud = memblock_alloc(PTRS_PER_PUD * sizeof(pud_t), PAGE_SIZE);
> > + }
> > +
> > + pudp = base_pud + pud_index(vaddr);
> > +
> > + do {
> > + next = pud_addr_end(vaddr, end);
> > +
> > + if (pud_none(*pudp) && IS_ALIGNED(vaddr, PUD_SIZE) && (next - vaddr) >= PUD_SIZE) {
> > + if (early) {
> > + phys_addr = __pa(((uintptr_t)kasan_early_shadow_pmd));
> > + set_pud(pudp, pfn_pud(PFN_DOWN(phys_addr), PAGE_TABLE));
> > + continue;
> > + } else {
> > + phys_addr = memblock_phys_alloc(PUD_SIZE, PUD_SIZE);
> > + if (phys_addr) {
> > + set_pud(pudp, pfn_pud(PFN_DOWN(phys_addr), PAGE_KERNEL));
> > + continue;
> > + }
> > + }
> > + }
> > +
> > + kasan_populate_pmd(pudp, vaddr, next);
> > + } while (pudp++, vaddr = next, vaddr != end);
> > +
> > + /*
> > + * Wait for the whole PGD to be populated before setting the PGD in
> > + * the page table, otherwise, if we did set the PGD before populating
> > + * it entirely, memblock could allocate a page at a physical address
> > + * where KASAN is not populated yet and then we'd get a page fault.
> > + */
> > + if (!early)
> > + set_pgd(pgd, pfn_pgd(PFN_DOWN(__pa(base_pud)), PAGE_TABLE));
> > }
> >
> > +#define kasan_early_shadow_pgd_next (pgtable_l4_enabled ? \
> > + (uintptr_t)kasan_early_shadow_pud : \
> > + (uintptr_t)kasan_early_shadow_pmd)
> > +#define kasan_populate_pgd_next(pgdp, vaddr, next, early) \
> > + (pgtable_l4_enabled ? \
> > + kasan_populate_pud(pgdp, vaddr, next, early) : \
> > + kasan_populate_pmd((pud_t *)pgdp, vaddr, next))
> > +
> > static void __init kasan_populate_pgd(pgd_t *pgdp,
> > unsigned long vaddr, unsigned long end,
> > bool early)
> > @@ -102,7 +191,7 @@ static void __init kasan_populate_pgd(pgd_t *pgdp,
> > }
> > }
> >
> > - kasan_populate_pmd(pgdp, vaddr, next);
> > + kasan_populate_pgd_next(pgdp, vaddr, next, early);
> > } while (pgdp++, vaddr = next, vaddr != end);
> > }
> >
> > @@ -157,18 +246,54 @@ static void __init kasan_populate(void *start, void *end)
> > memset(start, KASAN_SHADOW_INIT, end - start);
> > }
> >
> > +static void __init kasan_shallow_populate_pud(pgd_t *pgdp,
> > + unsigned long vaddr, unsigned long end,
> > + bool kasan_populate)
> > +{
> > + unsigned long next;
> > + pud_t *pudp, *base_pud;
> > + pmd_t *base_pmd;
> > + bool is_kasan_pmd;
> > +
> > + base_pud = (pud_t *)pgd_page_vaddr(*pgdp);
> > + pudp = base_pud + pud_index(vaddr);
> > +
> > + if (kasan_populate)
> > + memcpy(base_pud, (void *)kasan_early_shadow_pgd_next,
> > + sizeof(pud_t) * PTRS_PER_PUD);
> > +
> > + do {
> > + next = pud_addr_end(vaddr, end);
> > + is_kasan_pmd = (pud_pgtable(*pudp) == lm_alias(kasan_early_shadow_pmd));
> > +
> > + if (is_kasan_pmd) {
> > + base_pmd = memblock_alloc(PAGE_SIZE, PAGE_SIZE);
> > + set_pud(pudp, pfn_pud(PFN_DOWN(__pa(base_pmd)), PAGE_TABLE));
> > + }
> > + } while (pudp++, vaddr = next, vaddr != end);
> > +}
> > +
> > static void __init kasan_shallow_populate_pgd(unsigned long vaddr, unsigned long end)
> > {
> > unsigned long next;
> > void *p;
> > pgd_t *pgd_k = pgd_offset_k(vaddr);
> > + bool is_kasan_pgd_next;
> >
> > do {
> > next = pgd_addr_end(vaddr, end);
> > - if (pgd_page_vaddr(*pgd_k) == (unsigned long)lm_alias(kasan_early_shadow_pmd)) {
> > + is_kasan_pgd_next = (pgd_page_vaddr(*pgd_k) ==
> > + (unsigned long)lm_alias(kasan_early_shadow_pgd_next));
> > +
> > + if (is_kasan_pgd_next) {
> > p = memblock_alloc(PAGE_SIZE, PAGE_SIZE);
> > set_pgd(pgd_k, pfn_pgd(PFN_DOWN(__pa(p)), PAGE_TABLE));
> > }
> > +
> > + if (IS_ALIGNED(vaddr, PGDIR_SIZE) && (next - vaddr) >= PGDIR_SIZE)
> > + continue;
> > +
> > + kasan_shallow_populate_pud(pgd_k, vaddr, next, is_kasan_pgd_next);
> > } while (pgd_k++, vaddr = next, vaddr != end);
> > }
> >
> > diff --git a/drivers/firmware/efi/libstub/efi-stub.c b/drivers/firmware/efi/libstub/efi-stub.c
> > index 26e69788f27a..b3db5d91ed38 100644
> > --- a/drivers/firmware/efi/libstub/efi-stub.c
> > +++ b/drivers/firmware/efi/libstub/efi-stub.c
> > @@ -40,6 +40,8 @@
> >
> > #ifdef CONFIG_ARM64
> > # define EFI_RT_VIRTUAL_LIMIT DEFAULT_MAP_WINDOW_64
> > +#elif defined(CONFIG_RISCV)
> > +# define EFI_RT_VIRTUAL_LIMIT TASK_SIZE_MIN
> > #else
> > # define EFI_RT_VIRTUAL_LIMIT TASK_SIZE
> > #endif
> > --
> > 2.32.0
> >
>
>
> --
> Best Regards
> Guo Ren
>
> ML: https://lore.kernel.org/linux-csky/

2022-01-04 12:45:08

by Alexandre Ghiti

Subject: Re: [PATCH v3 07/13] riscv: Implement sv48 support

Hi Jisheng,

On Sun, Dec 26, 2021 at 10:06 AM Jisheng Zhang
<[email protected]> wrote:
>
> On Mon, 6 Dec 2021 11:46:51 +0100
> Alexandre Ghiti <[email protected]> wrote:
>
> > By adding a new 4th level of page table, allow the 64-bit kernel to
> > address 2^48 bytes of virtual address space: in practice, that offers
> > 128TB of virtual address space to userspace and allows up to 64TB of
> > physical memory.
> >
> > If the underlying hardware does not support sv48, we automatically fall
> > back to a standard 3-level page table by folding the new PUD level into
> > the PGDIR level. To detect the HW capability at runtime, we rely on the
> > SATP behavior of ignoring writes with an unsupported mode.
> >
> > Signed-off-by: Alexandre Ghiti <[email protected]>
> > ---
> > arch/riscv/Kconfig | 4 +-
> > arch/riscv/include/asm/csr.h | 3 +-
> > arch/riscv/include/asm/fixmap.h | 1 +
> > arch/riscv/include/asm/kasan.h | 6 +-
> > arch/riscv/include/asm/page.h | 14 ++
> > arch/riscv/include/asm/pgalloc.h | 40 +++++
> > arch/riscv/include/asm/pgtable-64.h | 108 +++++++++++-
> > arch/riscv/include/asm/pgtable.h | 24 ++-
> > arch/riscv/kernel/head.S | 3 +-
> > arch/riscv/mm/context.c | 4 +-
> > arch/riscv/mm/init.c | 212 +++++++++++++++++++++---
> > arch/riscv/mm/kasan_init.c | 137 ++++++++++++++-
> > drivers/firmware/efi/libstub/efi-stub.c | 2 +
> > 13 files changed, 514 insertions(+), 44 deletions(-)
> >
> > diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
> > index ac6c0cd9bc29..d28fe0148e13 100644
> > --- a/arch/riscv/Kconfig
> > +++ b/arch/riscv/Kconfig
> > @@ -150,7 +150,7 @@ config PAGE_OFFSET
> > hex
> > default 0xC0000000 if 32BIT
> > default 0x80000000 if 64BIT && !MMU
> > - default 0xffffffd800000000 if 64BIT
> > + default 0xffffaf8000000000 if 64BIT
> >
> > config KASAN_SHADOW_OFFSET
> > hex
> > @@ -201,7 +201,7 @@ config FIX_EARLYCON_MEM
> >
> > config PGTABLE_LEVELS
> > int
> > - default 3 if 64BIT
> > + default 4 if 64BIT
> > default 2
> >
> > config LOCKDEP_SUPPORT
> > diff --git a/arch/riscv/include/asm/csr.h b/arch/riscv/include/asm/csr.h
> > index 87ac65696871..3fdb971c7896 100644
> > --- a/arch/riscv/include/asm/csr.h
> > +++ b/arch/riscv/include/asm/csr.h
> > @@ -40,14 +40,13 @@
> > #ifndef CONFIG_64BIT
> > #define SATP_PPN _AC(0x003FFFFF, UL)
> > #define SATP_MODE_32 _AC(0x80000000, UL)
> > -#define SATP_MODE SATP_MODE_32
> > #define SATP_ASID_BITS 9
> > #define SATP_ASID_SHIFT 22
> > #define SATP_ASID_MASK _AC(0x1FF, UL)
> > #else
> > #define SATP_PPN _AC(0x00000FFFFFFFFFFF, UL)
> > #define SATP_MODE_39 _AC(0x8000000000000000, UL)
> > -#define SATP_MODE SATP_MODE_39
> > +#define SATP_MODE_48 _AC(0x9000000000000000, UL)
> > #define SATP_ASID_BITS 16
> > #define SATP_ASID_SHIFT 44
> > #define SATP_ASID_MASK _AC(0xFFFF, UL)
> > diff --git a/arch/riscv/include/asm/fixmap.h b/arch/riscv/include/asm/fixmap.h
> > index 54cbf07fb4e9..58a718573ad6 100644
> > --- a/arch/riscv/include/asm/fixmap.h
> > +++ b/arch/riscv/include/asm/fixmap.h
> > @@ -24,6 +24,7 @@ enum fixed_addresses {
> > FIX_HOLE,
> > FIX_PTE,
> > FIX_PMD,
> > + FIX_PUD,
> > FIX_TEXT_POKE1,
> > FIX_TEXT_POKE0,
> > FIX_EARLYCON_MEM_BASE,
> > diff --git a/arch/riscv/include/asm/kasan.h b/arch/riscv/include/asm/kasan.h
> > index 743e6ff57996..0b85e363e778 100644
> > --- a/arch/riscv/include/asm/kasan.h
> > +++ b/arch/riscv/include/asm/kasan.h
> > @@ -28,7 +28,11 @@
> > #define KASAN_SHADOW_SCALE_SHIFT 3
> >
> > #define KASAN_SHADOW_SIZE (UL(1) << ((VA_BITS - 1) - KASAN_SHADOW_SCALE_SHIFT))
> > -#define KASAN_SHADOW_START (KASAN_SHADOW_END - KASAN_SHADOW_SIZE)
> > +/*
> > + * Depending on the size of the virtual address space, the region may not be
> > + * aligned on PGDIR_SIZE, so force its alignment to ease its population.
> > + */
> > +#define KASAN_SHADOW_START ((KASAN_SHADOW_END - KASAN_SHADOW_SIZE) & PGDIR_MASK)
> > #define KASAN_SHADOW_END MODULES_LOWEST_VADDR
> > #define KASAN_SHADOW_OFFSET _AC(CONFIG_KASAN_SHADOW_OFFSET, UL)
> >
> > diff --git a/arch/riscv/include/asm/page.h b/arch/riscv/include/asm/page.h
> > index e03559f9b35e..d089fe46f7d8 100644
> > --- a/arch/riscv/include/asm/page.h
> > +++ b/arch/riscv/include/asm/page.h
> > @@ -31,7 +31,20 @@
> > * When not using MMU this corresponds to the first free page in
> > * physical memory (aligned on a page boundary).
> > */
> > +#ifdef CONFIG_64BIT
> > +#ifdef CONFIG_MMU
> > +#define PAGE_OFFSET kernel_map.page_offset
> > +#else
> > +#define PAGE_OFFSET _AC(CONFIG_PAGE_OFFSET, UL)
> > +#endif
> > +/*
> > + * By default, CONFIG_PAGE_OFFSET value corresponds to SV48 address space so
> > + * define the PAGE_OFFSET value for SV39.
> > + */
> > +#define PAGE_OFFSET_L3 _AC(0xffffffd800000000, UL)
> > +#else
> > #define PAGE_OFFSET _AC(CONFIG_PAGE_OFFSET, UL)
> > +#endif /* CONFIG_64BIT */
> >
> > /*
> > * Half of the kernel address space (half of the entries of the page global
> > @@ -90,6 +103,7 @@ extern unsigned long riscv_pfn_base;
> > #endif /* CONFIG_MMU */
> >
> > struct kernel_mapping {
> > + unsigned long page_offset;
> > unsigned long virt_addr;
> > uintptr_t phys_addr;
> > uintptr_t size;
> > diff --git a/arch/riscv/include/asm/pgalloc.h b/arch/riscv/include/asm/pgalloc.h
> > index 0af6933a7100..11823004b87a 100644
> > --- a/arch/riscv/include/asm/pgalloc.h
> > +++ b/arch/riscv/include/asm/pgalloc.h
> > @@ -11,6 +11,8 @@
> > #include <asm/tlb.h>
> >
> > #ifdef CONFIG_MMU
> > +#define __HAVE_ARCH_PUD_ALLOC_ONE
> > +#define __HAVE_ARCH_PUD_FREE
> > #include <asm-generic/pgalloc.h>
> >
> > static inline void pmd_populate_kernel(struct mm_struct *mm,
> > @@ -36,6 +38,44 @@ static inline void pud_populate(struct mm_struct *mm, pud_t *pud, pmd_t *pmd)
> >
> > set_pud(pud, __pud((pfn << _PAGE_PFN_SHIFT) | _PAGE_TABLE));
> > }
> > +
> > +static inline void p4d_populate(struct mm_struct *mm, p4d_t *p4d, pud_t *pud)
> > +{
> > + if (pgtable_l4_enabled) {
> > + unsigned long pfn = virt_to_pfn(pud);
> > +
> > + set_p4d(p4d, __p4d((pfn << _PAGE_PFN_SHIFT) | _PAGE_TABLE));
> > + }
> > +}
> > +
> > +static inline void p4d_populate_safe(struct mm_struct *mm, p4d_t *p4d,
> > + pud_t *pud)
> > +{
> > + if (pgtable_l4_enabled) {
> > + unsigned long pfn = virt_to_pfn(pud);
> > +
> > + set_p4d_safe(p4d,
> > + __p4d((pfn << _PAGE_PFN_SHIFT) | _PAGE_TABLE));
> > + }
> > +}
> > +
> > +#define pud_alloc_one pud_alloc_one
> > +static inline pud_t *pud_alloc_one(struct mm_struct *mm, unsigned long addr)
> > +{
> > + if (pgtable_l4_enabled)
> > + return __pud_alloc_one(mm, addr);
> > +
> > + return NULL;
> > +}
> > +
> > +#define pud_free pud_free
> > +static inline void pud_free(struct mm_struct *mm, pud_t *pud)
> > +{
> > + if (pgtable_l4_enabled)
> > + __pud_free(mm, pud);
> > +}
> > +
> > +#define __pud_free_tlb(tlb, pud, addr) pud_free((tlb)->mm, pud)
> > #endif /* __PAGETABLE_PMD_FOLDED */
> >
> > static inline pgd_t *pgd_alloc(struct mm_struct *mm)
> > diff --git a/arch/riscv/include/asm/pgtable-64.h b/arch/riscv/include/asm/pgtable-64.h
> > index 228261aa9628..bbbdd66e5e2f 100644
> > --- a/arch/riscv/include/asm/pgtable-64.h
> > +++ b/arch/riscv/include/asm/pgtable-64.h
> > @@ -8,16 +8,36 @@
> >
> > #include <linux/const.h>
> >
> > -#define PGDIR_SHIFT 30
> > +extern bool pgtable_l4_enabled;
> > +
> > +#define PGDIR_SHIFT_L3 30
> > +#define PGDIR_SHIFT_L4 39
> > +#define PGDIR_SIZE_L3 (_AC(1, UL) << PGDIR_SHIFT_L3)
> > +
> > +#define PGDIR_SHIFT (pgtable_l4_enabled ? PGDIR_SHIFT_L4 : PGDIR_SHIFT_L3)
> > /* Size of region mapped by a page global directory */
> > #define PGDIR_SIZE (_AC(1, UL) << PGDIR_SHIFT)
> > #define PGDIR_MASK (~(PGDIR_SIZE - 1))
> >
> > +/* pud is folded into pgd in case of 3-level page table */
> > +#define PUD_SHIFT 30
> > +#define PUD_SIZE (_AC(1, UL) << PUD_SHIFT)
> > +#define PUD_MASK (~(PUD_SIZE - 1))
> > +
> > #define PMD_SHIFT 21
> > /* Size of region mapped by a page middle directory */
> > #define PMD_SIZE (_AC(1, UL) << PMD_SHIFT)
> > #define PMD_MASK (~(PMD_SIZE - 1))
> >
> > +/* Page Upper Directory entry */
> > +typedef struct {
> > + unsigned long pud;
> > +} pud_t;
> > +
> > +#define pud_val(x) ((x).pud)
> > +#define __pud(x) ((pud_t) { (x) })
> > +#define PTRS_PER_PUD (PAGE_SIZE / sizeof(pud_t))
> > +
> > /* Page Middle Directory entry */
> > typedef struct {
> > unsigned long pmd;
> > @@ -59,6 +79,16 @@ static inline void pud_clear(pud_t *pudp)
> > set_pud(pudp, __pud(0));
> > }
> >
> > +static inline pud_t pfn_pud(unsigned long pfn, pgprot_t prot)
> > +{
> > + return __pud((pfn << _PAGE_PFN_SHIFT) | pgprot_val(prot));
> > +}
> > +
> > +static inline unsigned long _pud_pfn(pud_t pud)
> > +{
> > + return pud_val(pud) >> _PAGE_PFN_SHIFT;
> > +}
> > +
> > static inline pmd_t *pud_pgtable(pud_t pud)
> > {
> > return (pmd_t *)pfn_to_virt(pud_val(pud) >> _PAGE_PFN_SHIFT);
> > @@ -69,6 +99,17 @@ static inline struct page *pud_page(pud_t pud)
> > return pfn_to_page(pud_val(pud) >> _PAGE_PFN_SHIFT);
> > }
> >
> > +#define mm_pud_folded mm_pud_folded
> > +static inline bool mm_pud_folded(struct mm_struct *mm)
> > +{
> > + if (pgtable_l4_enabled)
> > + return false;
> > +
> > + return true;
> > +}
> > +
> > +#define pmd_index(addr) (((addr) >> PMD_SHIFT) & (PTRS_PER_PMD - 1))
> > +
> > static inline pmd_t pfn_pmd(unsigned long pfn, pgprot_t prot)
> > {
> > return __pmd((pfn << _PAGE_PFN_SHIFT) | pgprot_val(prot));
> > @@ -84,4 +125,69 @@ static inline unsigned long _pmd_pfn(pmd_t pmd)
> > #define pmd_ERROR(e) \
> > pr_err("%s:%d: bad pmd %016lx.\n", __FILE__, __LINE__, pmd_val(e))
> >
> > +#define pud_ERROR(e) \
> > + pr_err("%s:%d: bad pud %016lx.\n", __FILE__, __LINE__, pud_val(e))
> > +
> > +static inline void set_p4d(p4d_t *p4dp, p4d_t p4d)
> > +{
> > + if (pgtable_l4_enabled)
> > + *p4dp = p4d;
> > + else
> > + set_pud((pud_t *)p4dp, (pud_t){ p4d_val(p4d) });
> > +}
> > +
> > +static inline int p4d_none(p4d_t p4d)
> > +{
> > + if (pgtable_l4_enabled)
> > + return (p4d_val(p4d) == 0);
> > +
> > + return 0;
> > +}
> > +
> > +static inline int p4d_present(p4d_t p4d)
> > +{
> > + if (pgtable_l4_enabled)
> > + return (p4d_val(p4d) & _PAGE_PRESENT);
> > +
> > + return 1;
> > +}
> > +
> > +static inline int p4d_bad(p4d_t p4d)
> > +{
> > + if (pgtable_l4_enabled)
> > + return !p4d_present(p4d);
> > +
> > + return 0;
> > +}
> > +
> > +static inline void p4d_clear(p4d_t *p4d)
> > +{
> > + if (pgtable_l4_enabled)
> > + set_p4d(p4d, __p4d(0));
> > +}
> > +
> > +static inline pud_t *p4d_pgtable(p4d_t p4d)
> > +{
> > + if (pgtable_l4_enabled)
> > + return (pud_t *)pfn_to_virt(p4d_val(p4d) >> _PAGE_PFN_SHIFT);
> > +
> > + return (pud_t *)pud_pgtable((pud_t) { p4d_val(p4d) });
> > +}
> > +
> > +static inline struct page *p4d_page(p4d_t p4d)
> > +{
> > + return pfn_to_page(p4d_val(p4d) >> _PAGE_PFN_SHIFT);
> > +}
> > +
> > +#define pud_index(addr) (((addr) >> PUD_SHIFT) & (PTRS_PER_PUD - 1))
> > +
> > +#define pud_offset pud_offset
> > +static inline pud_t *pud_offset(p4d_t *p4d, unsigned long address)
> > +{
> > + if (pgtable_l4_enabled)
> > + return p4d_pgtable(*p4d) + pud_index(address);
> > +
> > + return (pud_t *)p4d;
> > +}
> > +
> > #endif /* _ASM_RISCV_PGTABLE_64_H */
> > diff --git a/arch/riscv/include/asm/pgtable.h b/arch/riscv/include/asm/pgtable.h
> > index e1a52e22ad7e..e1c74ef4ead2 100644
> > --- a/arch/riscv/include/asm/pgtable.h
> > +++ b/arch/riscv/include/asm/pgtable.h
> > @@ -51,7 +51,7 @@
> > * position vmemmap directly below the VMALLOC region.
> > */
> > #ifdef CONFIG_64BIT
> > -#define VA_BITS 39
> > +#define VA_BITS (pgtable_l4_enabled ? 48 : 39)
> > #else
> > #define VA_BITS 32
> > #endif
> > @@ -90,8 +90,7 @@
> >
> > #ifndef __ASSEMBLY__
> >
> > -/* Page Upper Directory not used in RISC-V */
> > -#include <asm-generic/pgtable-nopud.h>
> > +#include <asm-generic/pgtable-nop4d.h>
> > #include <asm/page.h>
> > #include <asm/tlbflush.h>
> > #include <linux/mm_types.h>
> > @@ -113,6 +112,17 @@
> > #define XIP_FIXUP(addr) (addr)
> > #endif /* CONFIG_XIP_KERNEL */
> >
> > +struct pt_alloc_ops {
> > + pte_t *(*get_pte_virt)(phys_addr_t pa);
> > + phys_addr_t (*alloc_pte)(uintptr_t va);
> > +#ifndef __PAGETABLE_PMD_FOLDED
> > + pmd_t *(*get_pmd_virt)(phys_addr_t pa);
> > + phys_addr_t (*alloc_pmd)(uintptr_t va);
> > + pud_t *(*get_pud_virt)(phys_addr_t pa);
> > + phys_addr_t (*alloc_pud)(uintptr_t va);
> > +#endif
> > +};
> > +
> > #ifdef CONFIG_MMU
> > /* Number of entries in the page global directory */
> > #define PTRS_PER_PGD (PAGE_SIZE / sizeof(pgd_t))
> > @@ -669,9 +679,11 @@ static inline pmd_t pmdp_establish(struct vm_area_struct *vma,
> > * Note that PGDIR_SIZE must evenly divide TASK_SIZE.
> > */
> > #ifdef CONFIG_64BIT
> > -#define TASK_SIZE (PGDIR_SIZE * PTRS_PER_PGD / 2)
> > +#define TASK_SIZE (PGDIR_SIZE * PTRS_PER_PGD / 2)
> > +#define TASK_SIZE_MIN (PGDIR_SIZE_L3 * PTRS_PER_PGD / 2)
> > #else
> > -#define TASK_SIZE FIXADDR_START
> > +#define TASK_SIZE FIXADDR_START
> > +#define TASK_SIZE_MIN TASK_SIZE
> > #endif
> >
> > #else /* CONFIG_MMU */
> > @@ -697,6 +709,8 @@ extern uintptr_t _dtb_early_pa;
> > #define dtb_early_va _dtb_early_va
> > #define dtb_early_pa _dtb_early_pa
> > #endif /* CONFIG_XIP_KERNEL */
> > +extern u64 satp_mode;
> > +extern bool pgtable_l4_enabled;
> >
> > void paging_init(void);
> > void misc_mem_init(void);
> > diff --git a/arch/riscv/kernel/head.S b/arch/riscv/kernel/head.S
> > index 52c5ff9804c5..c3c0ed559770 100644
> > --- a/arch/riscv/kernel/head.S
> > +++ b/arch/riscv/kernel/head.S
> > @@ -95,7 +95,8 @@ relocate:
> >
> > /* Compute satp for kernel page tables, but don't load it yet */
> > srl a2, a0, PAGE_SHIFT
> > - li a1, SATP_MODE
> > + la a1, satp_mode
> > + REG_L a1, 0(a1)
> > or a2, a2, a1
> >
> > /*
> > diff --git a/arch/riscv/mm/context.c b/arch/riscv/mm/context.c
> > index ee3459cb6750..a7246872bd30 100644
> > --- a/arch/riscv/mm/context.c
> > +++ b/arch/riscv/mm/context.c
> > @@ -192,7 +192,7 @@ static void set_mm_asid(struct mm_struct *mm, unsigned int cpu)
> > switch_mm_fast:
> > csr_write(CSR_SATP, virt_to_pfn(mm->pgd) |
> > ((cntx & asid_mask) << SATP_ASID_SHIFT) |
> > - SATP_MODE);
> > + satp_mode);
> >
> > if (need_flush_tlb)
> > local_flush_tlb_all();
> > @@ -201,7 +201,7 @@ static void set_mm_asid(struct mm_struct *mm, unsigned int cpu)
> > static void set_mm_noasid(struct mm_struct *mm)
> > {
> > /* Switch the page table and blindly nuke entire local TLB */
> > - csr_write(CSR_SATP, virt_to_pfn(mm->pgd) | SATP_MODE);
> > + csr_write(CSR_SATP, virt_to_pfn(mm->pgd) | satp_mode);
> > local_flush_tlb_all();
> > }
> >
> > diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c
> > index 1552226fb6bd..6a19a1b1caf8 100644
> > --- a/arch/riscv/mm/init.c
> > +++ b/arch/riscv/mm/init.c
> > @@ -37,6 +37,17 @@ EXPORT_SYMBOL(kernel_map);
> > #define kernel_map (*(struct kernel_mapping *)XIP_FIXUP(&kernel_map))
> > #endif
> >
> > +#ifdef CONFIG_64BIT
> > +u64 satp_mode = !IS_ENABLED(CONFIG_XIP_KERNEL) ? SATP_MODE_48 : SATP_MODE_39;
> > +#else
> > +u64 satp_mode = SATP_MODE_32;
> > +#endif
> > +EXPORT_SYMBOL(satp_mode);
> > +
> > +bool pgtable_l4_enabled = IS_ENABLED(CONFIG_64BIT) && !IS_ENABLED(CONFIG_XIP_KERNEL) ?
> > + true : false;
>
> Hi Alex,
>
> I'm not sure whether we can use a static key for pgtable_l4_enabled or
> not. Obviously, for a specific HW platform, pgtable_l4_enabled won't change
> after boot, and it sits in some hot code paths, so IMHO a static key may be
> suitable for it.

Thanks for the suggestion, I'll explore that after this series is
merged if you don't mind.

Thanks,

Alex

>
> Thanks
>

2022-01-10 08:06:39

by Alexandre Ghiti

[permalink] [raw]
Subject: Re: [PATCH v3 12/13] riscv: Initialize thread pointer before calling C functions

Hi Palmer,

I ran into this issue again today: do you think you could take this
patch into for-next? I assume it is now too late to take the sv48
patchset; if not, I can respin it today or tomorrow.

Thanks,

Alex

On 12/6/21 11:46, Alexandre Ghiti wrote:
> The stack canary feature reads the canary value from the current task
> structure, so the thread pointer register "tp" must be set before
> calling any C function from head.S. By chance, setup_vm and all the
> functions it calls do not currently seem to be subject to the canary
> check, but some functions introduced in the following commits will be.
>
> Fixes: f2c9699f65557a31 ("riscv: Add STACKPROTECTOR supported")
> Signed-off-by: Alexandre Ghiti <[email protected]>
> ---
> arch/riscv/kernel/head.S | 1 +
> 1 file changed, 1 insertion(+)
>
> diff --git a/arch/riscv/kernel/head.S b/arch/riscv/kernel/head.S
> index c3c0ed559770..86f7ee3d210d 100644
> --- a/arch/riscv/kernel/head.S
> +++ b/arch/riscv/kernel/head.S
> @@ -302,6 +302,7 @@ clear_bss_done:
> REG_S a0, (a2)
>
> /* Initialize page tables and relocate to virtual addresses */
> + la tp, init_task
> la sp, init_thread_union + THREAD_SIZE
> XIP_FIXUP_OFFSET sp
> #ifdef CONFIG_BUILTIN_DTB

2022-01-21 21:06:16

by Palmer Dabbelt

[permalink] [raw]
Subject: Re: [PATCH v3 00/13] Introduce sv48 support without relocatable kernel

On Mon, 06 Dec 2021 02:46:44 PST (-0800), [email protected] wrote:
> * Please note notable changes in memory layouts and kasan population *
>
> This patchset allows to have a single kernel for sv39 and sv48 without
> being relocatable.
>
> The idea comes from Arnd Bergmann who suggested to do the same as x86,
> that is mapping the kernel to the end of the address space, which allows
> the kernel to be linked at the same address for both sv39 and sv48 and
> then does not require to be relocated at runtime.
>
> This implements sv48 support at runtime. The kernel will try to
> boot with 4-level page table and will fallback to 3-level if the HW does not
> support it. Folding the 4th level into a 3-level page table has almost no
> cost at runtime.
>
> Note that kasan region had to be moved to the end of the address space
> since its location must be known at compile-time and then be valid for
> both sv39 and sv48 (and sv57 that is coming).
>
> Tested on:
> - qemu rv64 sv39: OK
> - qemu rv64 sv48: OK
> - qemu rv64 sv39 + kasan: OK
> - qemu rv64 sv48 + kasan: OK
> - qemu rv32: OK
>
> Changes in v3:
> - Fix SZ_1T, thanks to Atish
> - Fix warning create_pud_mapping, thanks to Atish
> - Fix k210 nommu build, thanks to Atish
> - Fix wrong rebase as noted by Samuel
> - * Downgrade to sv39 is only possible if !KASAN (see commit changelog) *
> - * Move KASAN next to the kernel: virtual layouts changed and kasan population *
>
> Changes in v2:
> - Rebase onto for-next
> - Fix KASAN
> - Fix stack canary
> - Get completely rid of MAXPHYSMEM configs
> - Add documentation
>
> Alexandre Ghiti (13):
> riscv: Move KASAN mapping next to the kernel mapping
> riscv: Split early kasan mapping to prepare sv48 introduction
> riscv: Introduce functions to switch pt_ops
> riscv: Allow to dynamically define VA_BITS
> riscv: Get rid of MAXPHYSMEM configs
> asm-generic: Prepare for riscv use of pud_alloc_one and pud_free
> riscv: Implement sv48 support
> riscv: Use pgtable_l4_enabled to output mmu_type in cpuinfo
> riscv: Explicit comment about user virtual address space size
> riscv: Improve virtual kernel memory layout dump
> Documentation: riscv: Add sv48 description to VM layout
> riscv: Initialize thread pointer before calling C functions
> riscv: Allow user to downgrade to sv39 when hw supports sv48 if !KASAN
>
> Documentation/riscv/vm-layout.rst | 48 ++-
> arch/riscv/Kconfig | 37 +-
> arch/riscv/configs/nommu_k210_defconfig | 1 -
> .../riscv/configs/nommu_k210_sdcard_defconfig | 1 -
> arch/riscv/configs/nommu_virt_defconfig | 1 -
> arch/riscv/include/asm/csr.h | 3 +-
> arch/riscv/include/asm/fixmap.h | 1
> arch/riscv/include/asm/kasan.h | 11 +-
> arch/riscv/include/asm/page.h | 20 +-
> arch/riscv/include/asm/pgalloc.h | 40 ++
> arch/riscv/include/asm/pgtable-64.h | 108 ++++-
> arch/riscv/include/asm/pgtable.h | 47 +-
> arch/riscv/include/asm/sparsemem.h | 6 +-
> arch/riscv/kernel/cpu.c | 23 +-
> arch/riscv/kernel/head.S | 4 +-
> arch/riscv/mm/context.c | 4 +-
> arch/riscv/mm/init.c | 408 ++++++++++++++----
> arch/riscv/mm/kasan_init.c | 250 ++++++++---
> drivers/firmware/efi/libstub/efi-stub.c | 2
> drivers/pci/controller/pci-xgene.c | 2 +-
> include/asm-generic/pgalloc.h | 24 +-
> include/linux/sizes.h | 1
> 22 files changed, 833 insertions(+), 209 deletions(-)

Sorry this took a while. This is on for-next, with a bit of juggling: a
handful of trivial fixes for configs that were failing to build/boot and
some merge issues. I also pulled out that MAXPHYSMEM fix to the top, so
it'd be easier to backport. This is bigger than something I'd normally like to
take late in the cycle, but given there are a lot of cleanups, likely some
fixes, and it looks like folks have been testing this, I'm just going to go
with it.

Let me know if there are any issues with the merge; it was a bit hairy.
Probably best to just send along a fixup patch at this point.

Thanks!

2022-01-21 21:07:54

by Alexandre Ghiti

[permalink] [raw]
Subject: Re: [PATCH v3 00/13] Introduce sv48 support without relocatable kernel

On Thu, Jan 20, 2022 at 5:18 AM Palmer Dabbelt <[email protected]> wrote:
>
> On Mon, 06 Dec 2021 02:46:44 PST (-0800), [email protected] wrote:
> > [...]
>
> Sorry this took a while. This is on for-next, with a bit of juggling: a
> handful of trivial fixes for configs that were failing to build/boot and
> some merge issues. I also pulled out that MAXPHYSMEM fix to the top, so
> it'd be easier to backport. This is bigger than something I'd normally like to
> take late in the cycle, but given there's a lot of cleanups, likely some fixes,
> and it looks like folks have been testing this I'm just going to go with it.
>

Yes yes yes! That's fantastic news :)

> Let me know if there's any issues with the merge, it was a bit hairy.
> Probably best to just send along a fixup patch at this point.

I'm going to take a look at that now, and I'll fix anything that comes
up quickly :)

Thanks!

Alex

>
> Thanks!

2022-01-21 21:15:45

by Alexandre Ghiti

[permalink] [raw]
Subject: Re: [PATCH v3 00/13] Introduce sv48 support without relocatable kernel

On Thu, Jan 20, 2022 at 8:30 AM Alexandre Ghiti
<[email protected]> wrote:
>
> On Thu, Jan 20, 2022 at 5:18 AM Palmer Dabbelt <[email protected]> wrote:
> >
> > On Mon, 06 Dec 2021 02:46:44 PST (-0800), [email protected] wrote:
> > > [...]
> >
> > Sorry this took a while. This is on for-next, with a bit of juggling: a
> > handful of trivial fixes for configs that were failing to build/boot and
> > some merge issues. I also pulled out that MAXPHYSMEM fix to the top, so
> > it'd be easier to backport. This is bigger than something I'd normally like to
> > take late in the cycle, but given there's a lot of cleanups, likely some fixes,
> > and it looks like folks have been testing this I'm just going to go with it.
> >
>
> Yes yes yes! That's fantastic news :)
>
> > Let me know if there's any issues with the merge, it was a bit hairy.
> > Probably best to just send along a fixup patch at this point.
>
> I'm going to take a look at that now, and I'll fix anything that comes
> up quickly :)

I see in for-next that you did not take the following patches:

riscv: Improve virtual kernel memory layout dump
Documentation: riscv: Add sv48 description to VM layout
riscv: Initialize thread pointer before calling C functions
riscv: Allow user to downgrade to sv39 when hw supports sv48 if !KASAN

I'm not sure this was your intention. If it was, I believe that at
least the first 2 patches are needed in this series, the 3rd one is a
useful fix and we can discuss the 4th if that's an issue for you.

I tested for-next on both sv39 and sv48 successfully. I took a glance
at the code and noticed you fixed the PTRS_PER_PGD error, thanks for
that. Otherwise nothing obvious has popped up.

Thanks again,

Alex

>
> Thanks!
>
> Alex
>
> >
> > Thanks!

2022-02-18 14:54:30

by Alexandre Ghiti

[permalink] [raw]
Subject: Re: [PATCH v3 00/13] Introduce sv48 support without relocatable kernel

Hi Palmer,

On Thu, Jan 20, 2022 at 11:05 AM Alexandre Ghiti
<[email protected]> wrote:
>
> On Thu, Jan 20, 2022 at 8:30 AM Alexandre Ghiti
> <[email protected]> wrote:
> >
> > On Thu, Jan 20, 2022 at 5:18 AM Palmer Dabbelt <[email protected]> wrote:
> > >
> > > On Mon, 06 Dec 2021 02:46:44 PST (-0800), [email protected] wrote:
> > > > [...]
> > >
> > > Sorry this took a while. This is on for-next, with a bit of juggling: a
> > > handful of trivial fixes for configs that were failing to build/boot and
> > > some merge issues. I also pulled out that MAXPHYSMEM fix to the top, so
> > > it'd be easier to backport. This is bigger than something I'd normally like to
> > > take late in the cycle, but given there's a lot of cleanups, likely some fixes,
> > > and it looks like folks have been testing this I'm just going to go with it.
> > >
> >
> > Yes yes yes! That's fantastic news :)
> >
> > > Let me know if there's any issues with the merge, it was a bit hairy.
> > > Probably best to just send along a fixup patch at this point.
> >
> > I'm going to take a look at that now, and I'll fix anything that comes
> > up quickly :)
>
> I see in for-next that you did not take the following patches:
>
> riscv: Improve virtual kernel memory layout dump
> Documentation: riscv: Add sv48 description to VM layout
> riscv: Initialize thread pointer before calling C functions
> riscv: Allow user to downgrade to sv39 when hw supports sv48 if !KASAN
>
> I'm not sure this was your intention. If it was, I believe that at
> least the first 2 patches are needed in this series, the 3rd one is a
> useful fix and we can discuss the 4th if that's an issue for you.

Can you confirm that this was intentional and maybe explain the
motivation behind it? I ask because I see value in those patches.

Thanks,

Alex

>
> I tested for-next on both sv39 and sv48 successfully, I took a glance
> at the code and noticed you fixed the PTRS_PER_PGD error, thanks for
> that. Otherwise nothing obvious has popped.
>
> Thanks again,
>
> Alex
>
> >
> > Thanks!
> >
> > Alex
> >
> > >
> > > Thanks!

2022-04-02 15:30:51

by Alexandre Ghiti

[permalink] [raw]
Subject: Re: [PATCH v3 00/13] Introduce sv48 support without relocatable kernel

On Fri, Feb 18, 2022 at 11:45 AM Alexandre Ghiti
<[email protected]> wrote:
>
> Hi Palmer,
>
> On Thu, Jan 20, 2022 at 11:05 AM Alexandre Ghiti
> <[email protected]> wrote:
> >
> > On Thu, Jan 20, 2022 at 8:30 AM Alexandre Ghiti
> > <[email protected]> wrote:
> > >
> > > On Thu, Jan 20, 2022 at 5:18 AM Palmer Dabbelt <[email protected]> wrote:
> > > >
> > > > On Mon, 06 Dec 2021 02:46:44 PST (-0800), [email protected] wrote:
> > > > > [...]
> > > >
> > > > Sorry this took a while. This is on for-next, with a bit of juggling: a
> > > > handful of trivial fixes for configs that were failing to build/boot and
> > > > some merge issues. I also pulled out that MAXPHYSMEM fix to the top, so
> > > > it'd be easier to backport. This is bigger than something I'd normally like to
> > > > take late in the cycle, but given there's a lot of cleanups, likely some fixes,
> > > > and it looks like folks have been testing this I'm just going to go with it.
> > > >
> > >
> > > Yes yes yes! That's fantastic news :)
> > >
> > > > Let me know if there's any issues with the merge, it was a bit hairy.
> > > > Probably best to just send along a fixup patch at this point.
> > >
> > > I'm going to take a look at that now, and I'll fix anything that comes
> > > up quickly :)
> >
> > I see in for-next that you did not take the following patches:
> >
> > riscv: Improve virtual kernel memory layout dump
> > Documentation: riscv: Add sv48 description to VM layout
> > riscv: Initialize thread pointer before calling C functions
> > riscv: Allow user to downgrade to sv39 when hw supports sv48 if !KASAN
> >
> > I'm not sure this was your intention. If it was, I believe that at
> > least the first 2 patches are needed in this series, the 3rd one is a
> > useful fix and we can discuss the 4th if that's an issue for you.
>
> Can you confirm that this was intentional and maybe explain the
> motivation behind it? Because I see value in those patches.

Palmer,

I read that you were still taking patches for 5.18, so I confirm again
that the patches above are needed IMO.

Maybe even the relocatable series?

Thanks,

Alex

>
> Thanks,
>
> Alex
>
> >
> > I tested for-next on both sv39 and sv48 successfully, I took a glance
> > at the code and noticed you fixed the PTRS_PER_PGD error, thanks for
> > that. Otherwise nothing obvious has popped.
> >
> > Thanks again,
> >
> > Alex
> >
> > >
> > > Thanks!
> > >
> > > Alex
> > >
> > > >
> > > > Thanks!

2022-04-23 02:33:04

by Palmer Dabbelt

Subject: Re: [PATCH v3 00/13] Introduce sv48 support without relocatable kernel

On Fri, 01 Apr 2022 05:56:30 PDT (-0700), [email protected] wrote:
> On Fri, Feb 18, 2022 at 11:45 AM Alexandre Ghiti
> <[email protected]> wrote:
>>
>> Hi Palmer,
>>
>> On Thu, Jan 20, 2022 at 11:05 AM Alexandre Ghiti
>> <[email protected]> wrote:
>> >
>> > On Thu, Jan 20, 2022 at 8:30 AM Alexandre Ghiti
>> > <[email protected]> wrote:
>> > >
>> > > On Thu, Jan 20, 2022 at 5:18 AM Palmer Dabbelt <[email protected]> wrote:
>> > > >
>> > > > On Mon, 06 Dec 2021 02:46:44 PST (-0800), [email protected] wrote:
>> > > > > * Please note notable changes in memory layouts and kasan population *
>> > > > >
>> > > > > This patchset allows to have a single kernel for sv39 and sv48 without
>> > > > > being relocatable.
>> > > > >
>> > > > > The idea comes from Arnd Bergmann who suggested to do the same as x86,
>> > > > > that is mapping the kernel to the end of the address space, which allows
>> > > > > the kernel to be linked at the same address for both sv39 and sv48 and
>> > > > > then does not require to be relocated at runtime.
>> > > > >
>> > > > > This implements sv48 support at runtime. The kernel will try to
>> > > > > boot with 4-level page table and will fallback to 3-level if the HW does not
>> > > > > support it. Folding the 4th level into a 3-level page table has almost no
>> > > > > cost at runtime.
>> > > > >
>> > > > > Note that kasan region had to be moved to the end of the address space
>> > > > > since its location must be known at compile-time and then be valid for
>> > > > > both sv39 and sv48 (and sv57 that is coming).
>> > > > >
>> > > > > Tested on:
>> > > > > - qemu rv64 sv39: OK
>> > > > > - qemu rv64 sv48: OK
>> > > > > - qemu rv64 sv39 + kasan: OK
>> > > > > - qemu rv64 sv48 + kasan: OK
>> > > > > - qemu rv32: OK
>> > > > >
>> > > > > Changes in v3:
>> > > > > - Fix SZ_1T, thanks to Atish
>> > > > > - Fix warning create_pud_mapping, thanks to Atish
>> > > > > - Fix k210 nommu build, thanks to Atish
>> > > > > - Fix wrong rebase as noted by Samuel
>> > > > > - * Downgrade to sv39 is only possible if !KASAN (see commit changelog) *
>> > > > > - * Move KASAN next to the kernel: virtual layouts changed and kasan population *
>> > > > >
>> > > > > Changes in v2:
>> > > > > - Rebase onto for-next
>> > > > > - Fix KASAN
>> > > > > - Fix stack canary
>> > > > > - Get completely rid of MAXPHYSMEM configs
>> > > > > - Add documentation
>> > > > >
>> > > > > Alexandre Ghiti (13):
>> > > > > riscv: Move KASAN mapping next to the kernel mapping
>> > > > > riscv: Split early kasan mapping to prepare sv48 introduction
>> > > > > riscv: Introduce functions to switch pt_ops
>> > > > > riscv: Allow to dynamically define VA_BITS
>> > > > > riscv: Get rid of MAXPHYSMEM configs
>> > > > > asm-generic: Prepare for riscv use of pud_alloc_one and pud_free
>> > > > > riscv: Implement sv48 support
>> > > > > riscv: Use pgtable_l4_enabled to output mmu_type in cpuinfo
>> > > > > riscv: Explicit comment about user virtual address space size
>> > > > > riscv: Improve virtual kernel memory layout dump
>> > > > > Documentation: riscv: Add sv48 description to VM layout
>> > > > > riscv: Initialize thread pointer before calling C functions
>> > > > > riscv: Allow user to downgrade to sv39 when hw supports sv48 if !KASAN
>> > > > >
>> > > > > Documentation/riscv/vm-layout.rst | 48 ++-
>> > > > > arch/riscv/Kconfig | 37 +-
>> > > > > arch/riscv/configs/nommu_k210_defconfig | 1 -
>> > > > > .../riscv/configs/nommu_k210_sdcard_defconfig | 1 -
>> > > > > arch/riscv/configs/nommu_virt_defconfig | 1 -
>> > > > > arch/riscv/include/asm/csr.h | 3 +-
>> > > > > arch/riscv/include/asm/fixmap.h | 1
>> > > > > arch/riscv/include/asm/kasan.h | 11 +-
>> > > > > arch/riscv/include/asm/page.h | 20 +-
>> > > > > arch/riscv/include/asm/pgalloc.h | 40 ++
>> > > > > arch/riscv/include/asm/pgtable-64.h | 108 ++++-
>> > > > > arch/riscv/include/asm/pgtable.h | 47 +-
>> > > > > arch/riscv/include/asm/sparsemem.h | 6 +-
>> > > > > arch/riscv/kernel/cpu.c | 23 +-
>> > > > > arch/riscv/kernel/head.S | 4 +-
>> > > > > arch/riscv/mm/context.c | 4 +-
>> > > > > arch/riscv/mm/init.c | 408 ++++++++++++++----
>> > > > > arch/riscv/mm/kasan_init.c | 250 ++++++++---
>> > > > > drivers/firmware/efi/libstub/efi-stub.c | 2
>> > > > > drivers/pci/controller/pci-xgene.c | 2 +-
>> > > > > include/asm-generic/pgalloc.h | 24 +-
>> > > > > include/linux/sizes.h | 1
>> > > > > 22 files changed, 833 insertions(+), 209 deletions(-)
>> > > >
>> > > > Sorry this took a while. This is on for-next, with a bit of juggling: a
>> > > > handful of trivial fixes for configs that were failing to build/boot and
>> > > > some merge issues. I also pulled out that MAXPHYSMEM fix to the top, so
>> > > > it'd be easier to backport. This is bigger than something I'd normally like to
>> > > > take late in the cycle, but given there's a lot of cleanups, likely some fixes,
>> > > > and it looks like folks have been testing this I'm just going to go with it.
>> > > >
>> > >
>> > > Yes yes yes! That's fantastic news :)
>> > >
>> > > > Let me know if there's any issues with the merge, it was a bit hairy.
>> > > > Probably best to just send along a fixup patch at this point.
>> > >
>> > > I'm going to take a look at that now, and I'll fix anything that comes
>> > > up quickly :)
>> >
>> > I see in for-next that you did not take the following patches:
>> >
>> > riscv: Improve virtual kernel memory layout dump
>> > Documentation: riscv: Add sv48 description to VM layout
>> > riscv: Initialize thread pointer before calling C functions
>> > riscv: Allow user to downgrade to sv39 when hw supports sv48 if !KASAN
>> >
>> > I'm not sure this was your intention. If it was, I believe that at
>> > least the first 2 patches are needed in this series, the 3rd one is a
>> > useful fix and we can discuss the 4th if that's an issue for you.
>>
>> Can you confirm that this was intentional and maybe explain the
>> motivation behind it? Because I see value in those patches.
>
> Palmer,
>
> I read that you were still taking patches for 5.18, so I confirm again
> that the patches above are needed IMO.

It was too late for this when it was sent (I saw it then, but just got
around to actually doing the work to sort it out).

It took me a while to figure out exactly what was going on here, but I
think I remember now: that downgrade patch (and the follow-on I just
sent) is broken for medlow, because mm/init.c must be built medany
(which we're relying on for its mostly-PIC qualities). I remember being
in the middle of rebasing/debugging this a while ago; I must have
forgotten I was in the middle of that and accidentally merged the
branch as-is. I certainly wasn't trying to silently take half the patch
set and leave the rest in limbo, that's the wrong way to do things.

I'm not sure what the right answer is here, but I just sent a patch to
drop support for medlow. We'll have to talk about that; for now I
cleaned up some other minor issues, rearranged the docs and fix to come
first, and put this at palmer/riscv-sv48. I think it's reasonable to
take the doc and fix into fixes, then the dump improvement on for-next.
We'll have to see what folks think about medany-only kernels; the other
option would be to build FDT as medany, which seems a bit awkward.
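For context, the medany requirement on mm/init.c is enforced with a
per-file compiler-flag override in kbuild; a sketch of what that looks
like in arch/riscv/mm/Makefile (illustrative, the exact flags may
differ between trees):

```make
# Force the medany code model for init.o so the early boot code keeps
# its mostly-PIC qualities even if the rest of the kernel is medlow.
CFLAGS_init.o := -mcmodel=medany
```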

> Maybe even the relocatable series?

Do you mind giving me a pointer? I'm not sure why I'm so drop-prone
with your patches, I promise I'm not doing it on purpose.

>
> Thanks,
>
> Alex
>
>>
>> Thanks,
>>
>> Alex
>>
>> >
>> > I tested for-next on both sv39 and sv48 successfully, I took a glance
>> > at the code and noticed you fixed the PTRS_PER_PGD error, thanks for
>> > that. Otherwise nothing obvious has popped.
>> >
>> > Thanks again,
>> >
>> > Alex
>> >
>> > >
>> > > Thanks!
>> > >
>> > > Alex
>> > >
>> > > >
>> > > > Thanks!

2022-04-26 08:50:07

by Nick Kossifidis

Subject: Re: [PATCH v3 07/13] riscv: Implement sv48 support

Hello Alex,

On 12/6/21 12:46, Alexandre Ghiti wrote:
>
> +#ifdef CONFIG_64BIT
> +static void __init disable_pgtable_l4(void)
> +{
> + pgtable_l4_enabled = false;
> + kernel_map.page_offset = PAGE_OFFSET_L3;
> + satp_mode = SATP_MODE_39;
> +}
> +
> +/*
> + * There is a simple way to determine if 4-level is supported by the
> + * underlying hardware: establish 1:1 mapping in 4-level page table mode
> + * then read SATP to see if the configuration was taken into account
> + * meaning sv48 is supported.
> + */
> +static __init void set_satp_mode(void)
> +{
> + u64 identity_satp, hw_satp;
> + uintptr_t set_satp_mode_pmd;
> +
> + set_satp_mode_pmd = ((unsigned long)set_satp_mode) & PMD_MASK;
> + create_pgd_mapping(early_pg_dir,
> + set_satp_mode_pmd, (uintptr_t)early_pud,
> + PGDIR_SIZE, PAGE_TABLE);
> + create_pud_mapping(early_pud,
> + set_satp_mode_pmd, (uintptr_t)early_pmd,
> + PUD_SIZE, PAGE_TABLE);
> + /* Handle the case where set_satp_mode straddles 2 PMDs */
> + create_pmd_mapping(early_pmd,
> + set_satp_mode_pmd, set_satp_mode_pmd,
> + PMD_SIZE, PAGE_KERNEL_EXEC);
> + create_pmd_mapping(early_pmd,
> + set_satp_mode_pmd + PMD_SIZE,
> + set_satp_mode_pmd + PMD_SIZE,
> + PMD_SIZE, PAGE_KERNEL_EXEC);
> +
> + identity_satp = PFN_DOWN((uintptr_t)&early_pg_dir) | satp_mode;
> +
> + local_flush_tlb_all();
> + csr_write(CSR_SATP, identity_satp);
> + hw_satp = csr_swap(CSR_SATP, 0ULL);
> + local_flush_tlb_all();
> +
> + if (hw_satp != identity_satp)
> + disable_pgtable_l4();
> +
> + memset(early_pg_dir, 0, PAGE_SIZE);
> + memset(early_pud, 0, PAGE_SIZE);
> + memset(early_pmd, 0, PAGE_SIZE);
> +}
> +#endif
> +

When doing the 1:1 mapping you don't take into account the limitation
that all bits above 47 need to have the same value as bit 47. If the
kernel resides at a high physical address with bit 47 set, the
corresponding virtual address will be invalid, resulting in an
instruction fetch fault as the privilege spec mandates. We verified
this bug on our prototype. I suggest we re-write this in assembly and
do a proper satp switch like we do in head.S, so that we don't need the
1:1 mapping and we also have a way to recover in case this fails.
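The canonical-address rule described above can be expressed as a small
check. This helper is an illustrative sketch (the name is invented
here, it is not code from the patch): an sv48 virtual address is valid
only if bits 63:48 replicate bit 47, so bits 63:47 must be all zeros
or all ones.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * sv48 rule: a virtual address must be sign-extended from bit 47,
 * i.e. bits 63:47 (17 bits) are either all zero or all one.
 */
static bool sv48_va_is_canonical(uint64_t va)
{
	uint64_t top = va >> 47;	/* bits 63:47 */

	return top == 0 || top == 0x1ffff;
}
```

A physical address with bit 47 set (e.g. 0x0000800000000000) is thus
not usable as a 1:1 virtual address under sv48, which is exactly the
failure mode hit on the prototype.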

Regards,
Nick
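The detection scheme in the quoted hunk relies on satp being a WARL
(write-any-read-legal) register: writing an unsupported mode leaves
the old value in place, so a write-then-read-back reveals whether sv48
took effect. A toy user-space model of that probe (all names invented
here, purely illustrative, no real CSR access):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SATP_MODE_SHIFT	60
#define SATP_MODE_SV39	8ULL
#define SATP_MODE_SV48	9ULL

/* Simulated satp CSR and hardware capability. */
static uint64_t satp;
static bool hw_has_sv48;

/* WARL behaviour: a write of an unsupported mode is silently dropped. */
static void csr_write_satp(uint64_t val)
{
	uint64_t mode = val >> SATP_MODE_SHIFT;

	if (mode == SATP_MODE_SV48 && !hw_has_sv48)
		return;		/* hardware keeps the old value */
	satp = val;
}

/* The probe: try sv48, read back, fall back to sv39 on mismatch. */
static uint64_t probe_satp_mode(uint64_t root_pfn)
{
	uint64_t want = root_pfn | (SATP_MODE_SV48 << SATP_MODE_SHIFT);

	csr_write_satp(want);
	return (satp == want) ? SATP_MODE_SV48 : SATP_MODE_SV39;
}
```

The real code additionally needs the temporary identity mapping (so
the instruction fetch after the satp write still hits valid
translations), which is where the canonical-address issue above bites.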

2022-06-02 04:08:09

by Palmer Dabbelt

Subject: Re: [PATCH v3 00/13] Introduce sv48 support without relocatable kernel

On Fri, 22 Apr 2022 18:50:47 PDT (-0700), Palmer Dabbelt wrote:
> On Fri, 01 Apr 2022 05:56:30 PDT (-0700), [email protected] wrote:
>> On Fri, Feb 18, 2022 at 11:45 AM Alexandre Ghiti
>> <[email protected]> wrote:
>>>
>>> Hi Palmer,
>>>
>>> On Thu, Jan 20, 2022 at 11:05 AM Alexandre Ghiti
>>> <[email protected]> wrote:
>>> >
>>> > On Thu, Jan 20, 2022 at 8:30 AM Alexandre Ghiti
>>> > <[email protected]> wrote:
>>> > >
>>> > > On Thu, Jan 20, 2022 at 5:18 AM Palmer Dabbelt <[email protected]> wrote:
>>> > > >
>>> > > > On Mon, 06 Dec 2021 02:46:44 PST (-0800), [email protected] wrote:
>>> > > > > [...]
>>> > > >
>>> > > [...]
>>> >
>>> > I see in for-next that you did not take the following patches:
>>> >
>>> > riscv: Improve virtual kernel memory layout dump
>>> > Documentation: riscv: Add sv48 description to VM layout
>>> > riscv: Initialize thread pointer before calling C functions
>>> > riscv: Allow user to downgrade to sv39 when hw supports sv48 if !KASAN
>>> >
>>> > I'm not sure this was your intention. If it was, I believe that at
>>> > least the first 2 patches are needed in this series, the 3rd one is a
>>> > useful fix and we can discuss the 4th if that's an issue for you.
>>>
>>> Can you confirm that this was intentional and maybe explain the
>>> motivation behind it? Because I see value in those patches.
>>
>> Palmer,
>>
>> I read that you were still taking patches for 5.18, so I confirm again
>> that the patches above are needed IMO.
>
> It was too late for this when it was sent (I saw it then, but just got
> around to actually doing the work to sort it out).
>
> It took me a while to figure out exactly what was going on here, but I
> think I remember now: that downgrade patch (and the follow-on I just
> sent) is broken for medlow, because mm/init.c must be built medany
> (which we're relying on for its mostly-PIC qualities). I remember being
> in the middle of rebasing/debugging this a while ago; I must have
> forgotten I was in the middle of that and accidentally merged the
> branch as-is. I certainly wasn't trying to silently take half the patch
> set and leave the rest in limbo, that's the wrong way to do things.
>
> I'm not sure what the right answer is here, but I just sent a patch to
> drop support for medlow. We'll have to talk about that; for now I
> cleaned up some other minor issues, rearranged the docs and fix to come
> first, and put this at palmer/riscv-sv48. I think it's reasonable to
> take the doc and fix into fixes, then the dump improvement on for-next.
> We'll have to see what folks think about medany-only kernels; the other
> option would be to build FDT as medany, which seems a bit awkward.

All but the last one are on for-next, there's some discussion on that
last one that pointed out some better ways to do it.

>
>> Maybe even the relocatable series?
>
> Do you mind giving me a pointer? I'm not sure why I'm so drop-prone
> with your patches, I promise I'm not doing it on purpose.
>
>>
>> Thanks,
>>
>> Alex
>>
>>>
>>> Thanks,
>>>
>>> Alex
>>>
>>> >
>>> > I tested for-next on both sv39 and sv48 successfully, I took a glance
>>> > at the code and noticed you fixed the PTRS_PER_PGD error, thanks for
>>> > that. Otherwise nothing obvious has popped.
>>> >
>>> > Thanks again,
>>> >
>>> > Alex
>>> >
>>> > >
>>> > > Thanks!
>>> > >
>>> > > Alex
>>> > >
>>> > > >
>>> > > > Thanks!