From: Anup Patel <[email protected]>
This patchset primarily extends initial page table setup using fixmap
to boot Linux RISC-V kernel (64bit and 32bit) from any 4KB aligned address.
We also add 32bit defconfig to allow people to try 32bit Linux RISC-V
kernel as well.
The patchset is tested on SiFive Unleashed board and QEMU virt machine.
It can also be found in riscv_setup_vm_v1 branch of
https://github.com/avpatel/linux.git
Anup Patel (3):
RISC-V: Add separate defconfig for 32bit systems
RISC-V: Make setup_vm() independent of GCC code model
RISC-V: Allow booting kernel from any 4KB aligned address
arch/riscv/configs/rv32_defconfig | 84 +++++++
arch/riscv/include/asm/fixmap.h | 5 +
arch/riscv/include/asm/pgtable-64.h | 5 +
arch/riscv/include/asm/pgtable.h | 6 +-
arch/riscv/kernel/head.S | 2 +
arch/riscv/kernel/setup.c | 4 +-
arch/riscv/mm/init.c | 370 +++++++++++++++++++++++-----
7 files changed, 419 insertions(+), 57 deletions(-)
create mode 100644 arch/riscv/configs/rv32_defconfig
--
2.17.1
This patch adds rv32_defconfig for 32bit systems. The only
difference between rv32_defconfig and defconfig is that
rv32_defconfig has CONFIG_ARCH_RV32I=y.
Signed-off-by: Anup Patel <[email protected]>
---
arch/riscv/configs/rv32_defconfig | 84 +++++++++++++++++++++++++++++++
1 file changed, 84 insertions(+)
create mode 100644 arch/riscv/configs/rv32_defconfig
diff --git a/arch/riscv/configs/rv32_defconfig b/arch/riscv/configs/rv32_defconfig
new file mode 100644
index 000000000000..1a911ed8e772
--- /dev/null
+++ b/arch/riscv/configs/rv32_defconfig
@@ -0,0 +1,84 @@
+CONFIG_SYSVIPC=y
+CONFIG_POSIX_MQUEUE=y
+CONFIG_IKCONFIG=y
+CONFIG_IKCONFIG_PROC=y
+CONFIG_CGROUPS=y
+CONFIG_CGROUP_SCHED=y
+CONFIG_CFS_BANDWIDTH=y
+CONFIG_CGROUP_BPF=y
+CONFIG_NAMESPACES=y
+CONFIG_USER_NS=y
+CONFIG_CHECKPOINT_RESTORE=y
+CONFIG_BLK_DEV_INITRD=y
+CONFIG_EXPERT=y
+CONFIG_BPF_SYSCALL=y
+CONFIG_ARCH_RV32I=y
+CONFIG_SMP=y
+CONFIG_MODULES=y
+CONFIG_MODULE_UNLOAD=y
+CONFIG_NET=y
+CONFIG_PACKET=y
+CONFIG_UNIX=y
+CONFIG_INET=y
+CONFIG_IP_MULTICAST=y
+CONFIG_IP_ADVANCED_ROUTER=y
+CONFIG_IP_PNP=y
+CONFIG_IP_PNP_DHCP=y
+CONFIG_IP_PNP_BOOTP=y
+CONFIG_IP_PNP_RARP=y
+CONFIG_NETLINK_DIAG=y
+CONFIG_PCI=y
+CONFIG_PCIEPORTBUS=y
+CONFIG_PCI_HOST_GENERIC=y
+CONFIG_PCIE_XILINX=y
+CONFIG_DEVTMPFS=y
+CONFIG_BLK_DEV_LOOP=y
+CONFIG_VIRTIO_BLK=y
+CONFIG_BLK_DEV_SD=y
+CONFIG_BLK_DEV_SR=y
+CONFIG_ATA=y
+CONFIG_SATA_AHCI=y
+CONFIG_SATA_AHCI_PLATFORM=y
+CONFIG_NETDEVICES=y
+CONFIG_VIRTIO_NET=y
+CONFIG_MACB=y
+CONFIG_E1000E=y
+CONFIG_R8169=y
+CONFIG_MICROSEMI_PHY=y
+CONFIG_INPUT_MOUSEDEV=y
+CONFIG_SERIAL_8250=y
+CONFIG_SERIAL_8250_CONSOLE=y
+CONFIG_SERIAL_OF_PLATFORM=y
+CONFIG_SERIAL_EARLYCON_RISCV_SBI=y
+CONFIG_HVC_RISCV_SBI=y
+# CONFIG_PTP_1588_CLOCK is not set
+CONFIG_DRM=y
+CONFIG_DRM_RADEON=y
+CONFIG_FRAMEBUFFER_CONSOLE=y
+CONFIG_USB=y
+CONFIG_USB_XHCI_HCD=y
+CONFIG_USB_XHCI_PLATFORM=y
+CONFIG_USB_EHCI_HCD=y
+CONFIG_USB_EHCI_HCD_PLATFORM=y
+CONFIG_USB_OHCI_HCD=y
+CONFIG_USB_OHCI_HCD_PLATFORM=y
+CONFIG_USB_STORAGE=y
+CONFIG_USB_UAS=y
+CONFIG_VIRTIO_MMIO=y
+CONFIG_SIFIVE_PLIC=y
+CONFIG_EXT4_FS=y
+CONFIG_EXT4_FS_POSIX_ACL=y
+CONFIG_AUTOFS4_FS=y
+CONFIG_MSDOS_FS=y
+CONFIG_VFAT_FS=y
+CONFIG_TMPFS=y
+CONFIG_TMPFS_POSIX_ACL=y
+CONFIG_NFS_FS=y
+CONFIG_NFS_V4=y
+CONFIG_NFS_V4_1=y
+CONFIG_NFS_V4_2=y
+CONFIG_ROOT_NFS=y
+CONFIG_CRYPTO_USER_API_HASH=y
+CONFIG_CRYPTO_DEV_VIRTIO=y
+CONFIG_PRINTK_TIME=y
+# CONFIG_RCU_TRACE is not set
--
2.17.1
The setup_vm() must access kernel symbols in a position independent way
because it will be called from head.S with MMU off.
If we compile kernel with cmodel=medany then PC-relative addressing will
be used in setup_vm() to access kernel symbols so it works perfectly fine.
Although, if we compile kernel with cmodel=medlow then either absolute
addressing or PC-relative addressing (based on whichever requires fewer
instructions) is used to access kernel symbols in setup_vm(). This can
break setup_vm() whenever any absolute addressing is used to access
kernel symbols.
With the movement of setup_vm() from kernel/setup.c to mm/init.c, the
setup_vm() is now broken for cmodel=medlow but it works perfectly fine
for cmodel=medany.
This patch fixes setup_vm() and makes it independent of GCC code model
by accessing kernel symbols relative to kernel load address instead of
assuming PC-relative addressing.
Fixes: 6f1e9e946f0b ("RISC-V: Move setup_vm() to mm/init.c")
Signed-off-by: Anup Patel <[email protected]>
---
arch/riscv/kernel/head.S | 1 +
arch/riscv/mm/init.c | 71 ++++++++++++++++++++++++++--------------
2 files changed, 47 insertions(+), 25 deletions(-)
diff --git a/arch/riscv/kernel/head.S b/arch/riscv/kernel/head.S
index fe884cd69abd..7966262b4f9d 100644
--- a/arch/riscv/kernel/head.S
+++ b/arch/riscv/kernel/head.S
@@ -62,6 +62,7 @@ clear_bss_done:
/* Initialize page tables and relocate to virtual addresses */
la sp, init_thread_union + THREAD_SIZE
+ la a0, _start
call setup_vm
call relocate
diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c
index b379a75ac6a6..f35299f2f3d5 100644
--- a/arch/riscv/mm/init.c
+++ b/arch/riscv/mm/init.c
@@ -172,55 +172,76 @@ void __set_fixmap(enum fixed_addresses idx, phys_addr_t phys, pgprot_t prot)
}
}
-asmlinkage void __init setup_vm(void)
+static inline void *__early_va(void *ptr, uintptr_t load_pa)
{
extern char _start;
+ uintptr_t va = (uintptr_t)ptr;
+ uintptr_t sz = (uintptr_t)(&_end) - (uintptr_t)(&_start);
+
+ if (va >= PAGE_OFFSET && va < (PAGE_OFFSET + sz))
+ return (void *)(load_pa + (va - PAGE_OFFSET));
+ return (void *)va;
+}
+
+asmlinkage void __init setup_vm(uintptr_t load_pa)
+{
uintptr_t i;
- uintptr_t pa = (uintptr_t) &_start;
+#ifndef __PAGETABLE_PMD_FOLDED
+ pmd_t *pmdp;
+#endif
+ pgd_t *pgdp;
+ phys_addr_t map_pa;
+ pgprot_t tableprot = __pgprot(_PAGE_TABLE);
pgprot_t prot = __pgprot(pgprot_val(PAGE_KERNEL) | _PAGE_EXEC);
- va_pa_offset = PAGE_OFFSET - pa;
- pfn_base = PFN_DOWN(pa);
+ va_pa_offset = PAGE_OFFSET - load_pa;
+ pfn_base = PFN_DOWN(load_pa);
/* Sanity check alignment and size */
BUG_ON((PAGE_OFFSET % PGDIR_SIZE) != 0);
- BUG_ON((pa % (PAGE_SIZE * PTRS_PER_PTE)) != 0);
+ BUG_ON((load_pa % (PAGE_SIZE * PTRS_PER_PTE)) != 0);
#ifndef __PAGETABLE_PMD_FOLDED
- trampoline_pg_dir[(PAGE_OFFSET >> PGDIR_SHIFT) % PTRS_PER_PGD] =
- pfn_pgd(PFN_DOWN((uintptr_t)trampoline_pmd),
- __pgprot(_PAGE_TABLE));
- trampoline_pmd[0] = pfn_pmd(PFN_DOWN(pa), prot);
+ pgdp = __early_va(trampoline_pg_dir, load_pa);
+ map_pa = (uintptr_t)__early_va(trampoline_pmd, load_pa);
+ pgdp[(PAGE_OFFSET >> PGDIR_SHIFT) % PTRS_PER_PGD] =
+ pfn_pgd(PFN_DOWN(map_pa), tableprot);
+ trampoline_pmd[0] = pfn_pmd(PFN_DOWN(load_pa), prot);
+
+ pgdp = __early_va(swapper_pg_dir, load_pa);
for (i = 0; i < (-PAGE_OFFSET)/PGDIR_SIZE; ++i) {
size_t o = (PAGE_OFFSET >> PGDIR_SHIFT) % PTRS_PER_PGD + i;
- swapper_pg_dir[o] =
- pfn_pgd(PFN_DOWN((uintptr_t)swapper_pmd) + i,
- __pgprot(_PAGE_TABLE));
+ map_pa = (uintptr_t)__early_va(swapper_pmd, load_pa);
+ pgdp[o] = pfn_pgd(PFN_DOWN(map_pa) + i, tableprot);
}
+ pmdp = __early_va(swapper_pmd, load_pa);
for (i = 0; i < ARRAY_SIZE(swapper_pmd); i++)
- swapper_pmd[i] = pfn_pmd(PFN_DOWN(pa + i * PMD_SIZE), prot);
+ pmdp[i] = pfn_pmd(PFN_DOWN(load_pa + i * PMD_SIZE), prot);
- swapper_pg_dir[(FIXADDR_START >> PGDIR_SHIFT) % PTRS_PER_PGD] =
- pfn_pgd(PFN_DOWN((uintptr_t)fixmap_pmd),
- __pgprot(_PAGE_TABLE));
+ map_pa = (uintptr_t)__early_va(fixmap_pmd, load_pa);
+ pgdp[(FIXADDR_START >> PGDIR_SHIFT) % PTRS_PER_PGD] =
+ pfn_pgd(PFN_DOWN(map_pa), tableprot);
+ pmdp = __early_va(fixmap_pmd, load_pa);
+ map_pa = (uintptr_t)__early_va(fixmap_pte, load_pa);
fixmap_pmd[(FIXADDR_START >> PMD_SHIFT) % PTRS_PER_PMD] =
- pfn_pmd(PFN_DOWN((uintptr_t)fixmap_pte),
- __pgprot(_PAGE_TABLE));
+ pfn_pmd(PFN_DOWN(map_pa), tableprot);
#else
- trampoline_pg_dir[(PAGE_OFFSET >> PGDIR_SHIFT) % PTRS_PER_PGD] =
- pfn_pgd(PFN_DOWN(pa), prot);
+ pgdp = __early_va(trampoline_pg_dir, load_pa);
+ pgdp[(PAGE_OFFSET >> PGDIR_SHIFT) % PTRS_PER_PGD] =
+ pfn_pgd(PFN_DOWN(load_pa), prot);
+
+ pgdp = __early_va(swapper_pg_dir, load_pa);
for (i = 0; i < (-PAGE_OFFSET)/PGDIR_SIZE; ++i) {
size_t o = (PAGE_OFFSET >> PGDIR_SHIFT) % PTRS_PER_PGD + i;
- swapper_pg_dir[o] =
- pfn_pgd(PFN_DOWN(pa + i * PGDIR_SIZE), prot);
+ pgdp[o] = pfn_pgd(PFN_DOWN(load_pa + i * PGDIR_SIZE), prot);
}
- swapper_pg_dir[(FIXADDR_START >> PGDIR_SHIFT) % PTRS_PER_PGD] =
- pfn_pgd(PFN_DOWN((uintptr_t)fixmap_pte),
- __pgprot(_PAGE_TABLE));
+ map_pa = (uintptr_t)__early_va(fixmap_pte, load_pa);
+ pgdp[(FIXADDR_START >> PGDIR_SHIFT) % PTRS_PER_PGD] =
+ pfn_pgd(PFN_DOWN(map_pa), tableprot);
#endif
}
--
2.17.1
Currently, we have to boot RISCV64 kernel from a 2MB aligned physical
address and RISCV32 kernel from a 4MB aligned physical address. This
constraint is because initial pagetable setup (i.e. setup_vm()) maps
entire RAM using hugepages (i.e. 2MB for 3-level pagetable and 4MB for
2-level pagetable).
Further, the above booting constraint also results in memory wastage
because if we boot kernel from some <xyz> address (which is not same as
RAM start address) then RISCV kernel will map PAGE_OFFSET virtual address
linearly to <xyz> physical address and memory between RAM start and <xyz>
will be reserved/unusable.
For example, RISCV64 kernel booted from 0x80200000 will waste 2MB of RAM
and RISCV32 kernel booted from 0x80400000 will waste 4MB of RAM.
This patch re-writes the initial pagetable setup code to allow booting
RISCV32 and RISCV64 kernel from any 4KB (i.e. PAGE_SIZE) aligned address.
To achieve this:
1. We map kernel, dtb and only some amount of RAM (few MBs) using 4KB
mappings in setup_vm() (called from head.S)
2. Once we reach paging_init() (called from setup_arch()) after
memblock setup, we map all available memory banks using 4KB
mappings and memblock APIs.
With this patch in-place, the booting constraint for RISCV32 and RISCV64
kernel is much more relaxed and we can now boot kernel very close to
RAM start thereby minimizing memory wastage.
Signed-off-by: Anup Patel <[email protected]>
---
arch/riscv/include/asm/fixmap.h | 5 +
arch/riscv/include/asm/pgtable-64.h | 5 +
arch/riscv/include/asm/pgtable.h | 6 +-
arch/riscv/kernel/head.S | 1 +
arch/riscv/kernel/setup.c | 4 +-
arch/riscv/mm/init.c | 357 +++++++++++++++++++++++-----
6 files changed, 317 insertions(+), 61 deletions(-)
diff --git a/arch/riscv/include/asm/fixmap.h b/arch/riscv/include/asm/fixmap.h
index 57afe604b495..5cf53dd882e5 100644
--- a/arch/riscv/include/asm/fixmap.h
+++ b/arch/riscv/include/asm/fixmap.h
@@ -21,6 +21,11 @@
*/
enum fixed_addresses {
FIX_HOLE,
+#define FIX_FDT_SIZE SZ_1M
+ FIX_FDT_END,
+ FIX_FDT = FIX_FDT_END + FIX_FDT_SIZE / PAGE_SIZE - 1,
+ FIX_PTE,
+ FIX_PMD,
FIX_EARLYCON_MEM_BASE,
__end_of_fixed_addresses
};
diff --git a/arch/riscv/include/asm/pgtable-64.h b/arch/riscv/include/asm/pgtable-64.h
index 7aa0ea9bd8bb..56ecc3dc939d 100644
--- a/arch/riscv/include/asm/pgtable-64.h
+++ b/arch/riscv/include/asm/pgtable-64.h
@@ -78,6 +78,11 @@ static inline pmd_t pfn_pmd(unsigned long pfn, pgprot_t prot)
return __pmd((pfn << _PAGE_PFN_SHIFT) | pgprot_val(prot));
}
+static inline unsigned long _pmd_pfn(pmd_t pmd)
+{
+ return pmd_val(pmd) >> _PAGE_PFN_SHIFT;
+}
+
#define pmd_ERROR(e) \
pr_err("%s:%d: bad pmd %016lx.\n", __FILE__, __LINE__, pmd_val(e))
diff --git a/arch/riscv/include/asm/pgtable.h b/arch/riscv/include/asm/pgtable.h
index 1141364d990e..05fa2115e736 100644
--- a/arch/riscv/include/asm/pgtable.h
+++ b/arch/riscv/include/asm/pgtable.h
@@ -121,12 +121,16 @@ static inline void pmd_clear(pmd_t *pmdp)
set_pmd(pmdp, __pmd(0));
}
-
static inline pgd_t pfn_pgd(unsigned long pfn, pgprot_t prot)
{
return __pgd((pfn << _PAGE_PFN_SHIFT) | pgprot_val(prot));
}
+static inline unsigned long _pgd_pfn(pgd_t pgd)
+{
+ return pgd_val(pgd) >> _PAGE_PFN_SHIFT;
+}
+
#define pgd_index(addr) (((addr) >> PGDIR_SHIFT) & (PTRS_PER_PGD - 1))
/* Locate an entry in the page global directory */
diff --git a/arch/riscv/kernel/head.S b/arch/riscv/kernel/head.S
index 7966262b4f9d..12a3ec5eb8ab 100644
--- a/arch/riscv/kernel/head.S
+++ b/arch/riscv/kernel/head.S
@@ -63,6 +63,7 @@ clear_bss_done:
/* Initialize page tables and relocate to virtual addresses */
la sp, init_thread_union + THREAD_SIZE
la a0, _start
+ mv a1, s1
call setup_vm
call relocate
diff --git a/arch/riscv/kernel/setup.c b/arch/riscv/kernel/setup.c
index ecb654f6a79e..acdd0f74982b 100644
--- a/arch/riscv/kernel/setup.c
+++ b/arch/riscv/kernel/setup.c
@@ -30,6 +30,7 @@
#include <linux/sched/task.h>
#include <linux/swiotlb.h>
+#include <asm/fixmap.h>
#include <asm/setup.h>
#include <asm/sections.h>
#include <asm/pgtable.h>
@@ -62,7 +63,8 @@ unsigned long boot_cpu_hartid;
void __init parse_dtb(unsigned int hartid, void *dtb)
{
- if (early_init_dt_scan(__va(dtb)))
+ dtb = (void *)fix_to_virt(FIX_FDT) + ((uintptr_t)dtb & ~PAGE_MASK);
+ if (early_init_dt_scan(dtb))
return;
pr_err("No DTB passed to the kernel\n");
diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c
index f35299f2f3d5..ee55a4b90dec 100644
--- a/arch/riscv/mm/init.c
+++ b/arch/riscv/mm/init.c
@@ -1,14 +1,7 @@
+/* SPDX-License-Identifier: GPL-2.0 */
/*
+ * Copyright (C) 2019 Western Digital Corporation or its affiliates.
* Copyright (C) 2012 Regents of the University of California
- *
- * This program is free software; you can redistribute it and/or
- * modify it under the terms of the GNU General Public License
- * as published by the Free Software Foundation, version 2.
- *
- * This program is distributed in the hope that it will be useful,
- * but WITHOUT ANY WARRANTY; without even the implied warranty of
- * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
- * GNU General Public License for more details.
*/
#include <linux/init.h>
@@ -43,13 +36,6 @@ void setup_zero_page(void)
memset((void *)empty_zero_page, 0, PAGE_SIZE);
}
-void __init paging_init(void)
-{
- setup_zero_page();
- local_flush_tlb_all();
- zone_sizes_init();
-}
-
void __init mem_init(void)
{
#ifdef CONFIG_FLATMEM
@@ -146,13 +132,24 @@ void __init setup_bootmem(void)
pgd_t swapper_pg_dir[PTRS_PER_PGD] __page_aligned_bss;
pgd_t trampoline_pg_dir[PTRS_PER_PGD] __initdata __aligned(PAGE_SIZE);
+#define MAX_EARLY_MAPPING_SIZE SZ_128M
+
#ifndef __PAGETABLE_PMD_FOLDED
-#define NUM_SWAPPER_PMDS ((uintptr_t)-PAGE_OFFSET >> PGDIR_SHIFT)
-pmd_t swapper_pmd[PTRS_PER_PMD*((-PAGE_OFFSET)/PGDIR_SIZE)] __page_aligned_bss;
-pmd_t trampoline_pmd[PTRS_PER_PGD] __initdata __aligned(PAGE_SIZE);
+#if MAX_EARLY_MAPPING_SIZE < PGDIR_SIZE
+#define NUM_SWAPPER_PMDS 1UL
+#else
+#define NUM_SWAPPER_PMDS (MAX_EARLY_MAPPING_SIZE/PGDIR_SIZE)
+#endif
+pmd_t swapper_pmd[PTRS_PER_PMD*NUM_SWAPPER_PMDS] __page_aligned_bss;
+pmd_t trampoline_pmd[PTRS_PER_PMD] __initdata __aligned(PAGE_SIZE);
pmd_t fixmap_pmd[PTRS_PER_PMD] __page_aligned_bss;
+#define NUM_SWAPPER_PTES (MAX_EARLY_MAPPING_SIZE/PMD_SIZE)
+#else
+#define NUM_SWAPPER_PTES (MAX_EARLY_MAPPING_SIZE/PGDIR_SIZE)
#endif
+pte_t swapper_pte[PTRS_PER_PTE*NUM_SWAPPER_PTES] __page_aligned_bss;
+pte_t trampoline_pte[PTRS_PER_PTE] __initdata __aligned(PAGE_SIZE);
pte_t fixmap_pte[PTRS_PER_PTE] __page_aligned_bss;
void __set_fixmap(enum fixed_addresses idx, phys_addr_t phys, pgprot_t prot)
@@ -172,76 +169,318 @@ void __set_fixmap(enum fixed_addresses idx, phys_addr_t phys, pgprot_t prot)
}
}
+struct mapping_ops {
+ pte_t *(*get_pte_virt)(phys_addr_t pa);
+ phys_addr_t (*alloc_pte)(uintptr_t va, uintptr_t load_pa);
+ pmd_t *(*get_pmd_virt)(phys_addr_t pa);
+ phys_addr_t (*alloc_pmd)(uintptr_t va, uintptr_t load_pa);
+};
+
static inline void *__early_va(void *ptr, uintptr_t load_pa)
{
extern char _start;
uintptr_t va = (uintptr_t)ptr;
uintptr_t sz = (uintptr_t)(&_end) - (uintptr_t)(&_start);
- if (va >= PAGE_OFFSET && va < (PAGE_OFFSET + sz))
+ if (va >= PAGE_OFFSET && va <= (PAGE_OFFSET + sz))
return (void *)(load_pa + (va - PAGE_OFFSET));
return (void *)va;
}
-asmlinkage void __init setup_vm(uintptr_t load_pa)
+#define __early_pa(ptr, load_pa) (uintptr_t)__early_va(ptr, load_pa)
+
+static phys_addr_t __init final_alloc_pgtable(void)
+{
+ return memblock_phys_alloc(PAGE_SIZE, PAGE_SIZE);
+}
+
+static pte_t *__init early_get_pte_virt(phys_addr_t pa)
{
- uintptr_t i;
+ return (pte_t *)((uintptr_t)pa);
+}
+
+static pte_t *__init final_get_pte_virt(phys_addr_t pa)
+{
+ clear_fixmap(FIX_PTE);
+
+ return (pte_t *)set_fixmap_offset(FIX_PTE, pa);
+}
+
+static phys_addr_t __init early_alloc_pte(uintptr_t va, uintptr_t load_pa)
+{
+ pte_t *base = __early_va(swapper_pte, load_pa);
+ uintptr_t pte_num = ((va - PAGE_OFFSET) >> PMD_SHIFT);
+
+ BUG_ON(pte_num >= NUM_SWAPPER_PTES);
+
+ return (uintptr_t)&base[pte_num * PTRS_PER_PTE];
+}
+
+static phys_addr_t __init final_alloc_pte(uintptr_t va, uintptr_t load_pa)
+{
+ return final_alloc_pgtable();
+}
+
+static void __init create_pte_mapping(pte_t *ptep,
+ uintptr_t va, phys_addr_t pa,
+ phys_addr_t sz, pgprot_t prot)
+{
+ uintptr_t pte_index = pte_index(va);
+
+ BUG_ON(sz != PAGE_SIZE);
+
+ if (pte_none(ptep[pte_index]))
+ ptep[pte_index] = pfn_pte(PFN_DOWN(pa), prot);
+}
+
#ifndef __PAGETABLE_PMD_FOLDED
+static pmd_t *__init early_get_pmd_virt(phys_addr_t pa)
+{
+ return (pmd_t *)((uintptr_t)pa);
+}
+
+static pmd_t *__init final_get_pmd_virt(phys_addr_t pa)
+{
+ clear_fixmap(FIX_PMD);
+
+ return (pmd_t *)set_fixmap_offset(FIX_PMD, pa);
+}
+
+static phys_addr_t __init early_alloc_pmd(uintptr_t va, uintptr_t load_pa)
+{
+ pmd_t *base = __early_va(swapper_pmd, load_pa);
+ uintptr_t pmd_num = (va - PAGE_OFFSET) >> PGDIR_SHIFT;
+
+ BUG_ON(pmd_num >= NUM_SWAPPER_PMDS);
+
+ return (uintptr_t)&base[pmd_num * PTRS_PER_PMD];
+}
+
+static phys_addr_t __init final_alloc_pmd(uintptr_t va, uintptr_t load_pa)
+{
+ return final_alloc_pgtable();
+}
+
+static void __init create_pmd_mapping(pmd_t *pmdp,
+ uintptr_t va, phys_addr_t pa,
+ phys_addr_t sz, pgprot_t prot,
+ uintptr_t ops_load_pa,
+ struct mapping_ops *ops)
+{
+ pte_t *ptep;
+ phys_addr_t pte_phys;
+ uintptr_t pmd_index = pmd_index(va);
+
+ if (sz == PMD_SIZE) {
+ if (pmd_none(pmdp[pmd_index]))
+ pmdp[pmd_index] = pfn_pmd(PFN_DOWN(pa), prot);
+ return;
+ }
+
+ if (pmd_none(pmdp[pmd_index])) {
+ pte_phys = ops->alloc_pte(va, ops_load_pa);
+ pmdp[pmd_index] = pfn_pmd(PFN_DOWN(pte_phys),
+ __pgprot(_PAGE_TABLE));
+ ptep = ops->get_pte_virt(pte_phys);
+ memset(ptep, 0, PAGE_SIZE);
+ } else {
+ pte_phys = PFN_PHYS(_pmd_pfn(pmdp[pmd_index]));
+ ptep = ops->get_pte_virt(pte_phys);
+ }
+
+ create_pte_mapping(ptep, va, pa, sz, prot);
+}
+
+static void __init create_pgd_mapping(pgd_t *pgdp,
+ uintptr_t va, phys_addr_t pa,
+ phys_addr_t sz, pgprot_t prot,
+ uintptr_t ops_load_pa,
+ struct mapping_ops *ops)
+{
pmd_t *pmdp;
+ phys_addr_t pmd_phys;
+ uintptr_t pgd_index = pgd_index(va);
+
+ if (sz == PGDIR_SIZE) {
+ if (pgd_val(pgdp[pgd_index]) == 0)
+ pgdp[pgd_index] = pfn_pgd(PFN_DOWN(pa), prot);
+ return;
+ }
+
+ if (pgd_val(pgdp[pgd_index]) == 0) {
+ pmd_phys = ops->alloc_pmd(va, ops_load_pa);
+ pgdp[pgd_index] = pfn_pgd(PFN_DOWN(pmd_phys),
+ __pgprot(_PAGE_TABLE));
+ pmdp = ops->get_pmd_virt(pmd_phys);
+ memset(pmdp, 0, PAGE_SIZE);
+ } else {
+ pmd_phys = PFN_PHYS(_pgd_pfn(pgdp[pgd_index]));
+ pmdp = ops->get_pmd_virt(pmd_phys);
+ }
+
+ create_pmd_mapping(pmdp, va, pa, sz, prot, ops_load_pa, ops);
+}
+#else
+static void __init create_pgd_mapping(pgd_t *pgdp,
+ uintptr_t va, phys_addr_t pa,
+ phys_addr_t sz, pgprot_t prot,
+ uintptr_t ops_load_pa,
+ struct mapping_ops *ops)
+{
+ pte_t *ptep;
+ phys_addr_t pte_phys;
+ uintptr_t pgd_index = pgd_index(va);
+
+ if (sz == PGDIR_SIZE) {
+ if (pgd_val(pgdp[pgd_index]) == 0)
+ pgdp[pgd_index] = pfn_pgd(PFN_DOWN(pa), prot);
+ return;
+ }
+
+ if (pgd_val(pgdp[pgd_index]) == 0) {
+ pte_phys = ops->alloc_pte(va, ops_load_pa);
+ pgdp[pgd_index] = pfn_pgd(PFN_DOWN(pte_phys),
+ __pgprot(_PAGE_TABLE));
+ ptep = ops->get_pte_virt(pte_phys);
+ memset(ptep, 0, PAGE_SIZE);
+ } else {
+ pte_phys = PFN_PHYS(_pgd_pfn(pgdp[pgd_index]));
+ ptep = ops->get_pte_virt(pte_phys);
+ }
+
+ create_pte_mapping(ptep, va, pa, sz, prot);
+}
#endif
- pgd_t *pgdp;
+
+asmlinkage void __init setup_vm(uintptr_t load_pa, uintptr_t dtb_pa)
+{
phys_addr_t map_pa;
+ uintptr_t va, end_va;
+ uintptr_t load_sz = __early_pa(&_end, load_pa) - load_pa;
pgprot_t tableprot = __pgprot(_PAGE_TABLE);
pgprot_t prot = __pgprot(pgprot_val(PAGE_KERNEL) | _PAGE_EXEC);
+ struct mapping_ops ops;
va_pa_offset = PAGE_OFFSET - load_pa;
pfn_base = PFN_DOWN(load_pa);
/* Sanity check alignment and size */
BUG_ON((PAGE_OFFSET % PGDIR_SIZE) != 0);
- BUG_ON((load_pa % (PAGE_SIZE * PTRS_PER_PTE)) != 0);
+ BUG_ON((load_pa % PAGE_SIZE) != 0);
+ BUG_ON(load_sz > MAX_EARLY_MAPPING_SIZE);
+
+ /* Setup mapping ops */
+ ops.get_pte_virt = __early_va(early_get_pte_virt, load_pa);
+ ops.alloc_pte = __early_va(early_alloc_pte, load_pa);
+ ops.get_pmd_virt = NULL;
+ ops.alloc_pmd = NULL;
#ifndef __PAGETABLE_PMD_FOLDED
- pgdp = __early_va(trampoline_pg_dir, load_pa);
- map_pa = (uintptr_t)__early_va(trampoline_pmd, load_pa);
- pgdp[(PAGE_OFFSET >> PGDIR_SHIFT) % PTRS_PER_PGD] =
- pfn_pgd(PFN_DOWN(map_pa), tableprot);
- trampoline_pmd[0] = pfn_pmd(PFN_DOWN(load_pa), prot);
+ /* Update mapping ops for PMD */
+ ops.get_pmd_virt = __early_va(early_get_pmd_virt, load_pa);
+ ops.alloc_pmd = __early_va(early_alloc_pmd, load_pa);
+
+ /* Setup trampoline PGD and PMD */
+ map_pa = __early_pa(trampoline_pmd, load_pa);
+ create_pgd_mapping(__early_va(trampoline_pg_dir, load_pa),
+ PAGE_OFFSET, map_pa, PGDIR_SIZE, tableprot,
+ load_pa, &ops);
+ map_pa = __early_pa(trampoline_pte, load_pa);
+ create_pmd_mapping(__early_va(trampoline_pmd, load_pa),
+ PAGE_OFFSET, map_pa, PMD_SIZE, tableprot,
+ load_pa, &ops);
+
+ /* Setup swapper PGD and PMD for fixmap */
+ map_pa = __early_pa(fixmap_pmd, load_pa);
+ create_pgd_mapping(__early_va(swapper_pg_dir, load_pa),
+ FIXADDR_START, map_pa, PGDIR_SIZE, tableprot,
+ load_pa, &ops);
+ map_pa = __early_pa(fixmap_pte, load_pa);
+ create_pmd_mapping(__early_va(fixmap_pmd, load_pa),
+ FIXADDR_START, map_pa, PMD_SIZE, tableprot,
+ load_pa, &ops);
+#else
+ /* Setup trampoline PGD */
+ map_pa = __early_pa(trampoline_pte, load_pa);
+ create_pgd_mapping(__early_va(trampoline_pg_dir, load_pa),
+ PAGE_OFFSET, map_pa, PGDIR_SIZE, tableprot,
+ load_pa, &ops);
+
+ /* Setup swapper PGD for fixmap */
+ map_pa = __early_pa(fixmap_pte, load_pa);
+ create_pgd_mapping(__early_va(swapper_pg_dir, load_pa),
+ FIXADDR_START, map_pa, PGDIR_SIZE, tableprot,
+ load_pa, &ops);
+#endif
- pgdp = __early_va(swapper_pg_dir, load_pa);
+ /* Setup trampoline PTE */
+ end_va = PAGE_OFFSET + PAGE_SIZE*PTRS_PER_PTE;
+ for (va = PAGE_OFFSET; va < end_va; va += PAGE_SIZE)
+ create_pte_mapping(__early_va(trampoline_pte, load_pa),
+ va, load_pa + (va - PAGE_OFFSET),
+ PAGE_SIZE, prot);
+
+ /*
+ * Setup swapper PGD covering kernel and some amount of
+ * RAM which will allow us to reach paging_init(). We map
+ * all memory banks later in setup_vm_final() below.
+ */
+ end_va = PAGE_OFFSET + load_sz;
+ for (va = PAGE_OFFSET; va < end_va; va += PAGE_SIZE)
+ create_pgd_mapping(__early_va(swapper_pg_dir, load_pa),
+ va, load_pa + (va - PAGE_OFFSET),
+ PAGE_SIZE, prot, load_pa, &ops);
+
+ /* Create fixed mapping for early parsing of FDT */
+ end_va = __fix_to_virt(FIX_FDT) + FIX_FDT_SIZE;
+ for (va = __fix_to_virt(FIX_FDT); va < end_va; va += PAGE_SIZE)
+ create_pte_mapping(__early_va(fixmap_pte, load_pa),
+ va, dtb_pa + (va - __fix_to_virt(FIX_FDT)),
+ PAGE_SIZE, prot);
+}
- for (i = 0; i < (-PAGE_OFFSET)/PGDIR_SIZE; ++i) {
- size_t o = (PAGE_OFFSET >> PGDIR_SHIFT) % PTRS_PER_PGD + i;
+static void __init setup_vm_final(void)
+{
+ phys_addr_t pa, start, end;
+ struct memblock_region *reg;
+ struct mapping_ops ops;
+ pgprot_t prot = __pgprot(pgprot_val(PAGE_KERNEL) | _PAGE_EXEC);
- map_pa = (uintptr_t)__early_va(swapper_pmd, load_pa);
- pgdp[o] = pfn_pgd(PFN_DOWN(map_pa) + i, tableprot);
- }
- pmdp = __early_va(swapper_pmd, load_pa);
- for (i = 0; i < ARRAY_SIZE(swapper_pmd); i++)
- pmdp[i] = pfn_pmd(PFN_DOWN(load_pa + i * PMD_SIZE), prot);
-
- map_pa = (uintptr_t)__early_va(fixmap_pmd, load_pa);
- pgdp[(FIXADDR_START >> PGDIR_SHIFT) % PTRS_PER_PGD] =
- pfn_pgd(PFN_DOWN(map_pa), tableprot);
- pmdp = __early_va(fixmap_pmd, load_pa);
- map_pa = (uintptr_t)__early_va(fixmap_pte, load_pa);
- fixmap_pmd[(FIXADDR_START >> PMD_SHIFT) % PTRS_PER_PMD] =
- pfn_pmd(PFN_DOWN(map_pa), tableprot);
+ /* Setup mapping ops */
+ ops.get_pte_virt = final_get_pte_virt;
+ ops.alloc_pte = final_alloc_pte;
+#ifndef __PAGETABLE_PMD_FOLDED
+ ops.get_pmd_virt = final_get_pmd_virt;
+ ops.alloc_pmd = final_alloc_pmd;
#else
- pgdp = __early_va(trampoline_pg_dir, load_pa);
- pgdp[(PAGE_OFFSET >> PGDIR_SHIFT) % PTRS_PER_PGD] =
- pfn_pgd(PFN_DOWN(load_pa), prot);
-
- pgdp = __early_va(swapper_pg_dir, load_pa);
-
- for (i = 0; i < (-PAGE_OFFSET)/PGDIR_SIZE; ++i) {
- size_t o = (PAGE_OFFSET >> PGDIR_SHIFT) % PTRS_PER_PGD + i;
+ ops.get_pmd_virt = NULL;
+ ops.alloc_pmd = NULL;
+#endif
- pgdp[o] = pfn_pgd(PFN_DOWN(load_pa + i * PGDIR_SIZE), prot);
+ /* Map all memory banks */
+ for_each_memblock(memory, reg) {
+ start = reg->base;
+ end = start + reg->size;
+
+ if (start >= end)
+ break;
+ if (memblock_is_nomap(reg))
+ continue;
+
+ for (pa = start; pa < end; pa += PAGE_SIZE)
+ create_pgd_mapping(swapper_pg_dir,
+ (uintptr_t)__va(pa), pa,
+ PAGE_SIZE, prot, 0, &ops);
}
- map_pa = (uintptr_t)__early_va(fixmap_pte, load_pa);
- pgdp[(FIXADDR_START >> PGDIR_SHIFT) % PTRS_PER_PGD] =
- pfn_pgd(PFN_DOWN(map_pa), tableprot);
-#endif
+ clear_fixmap(FIX_PTE);
+ clear_fixmap(FIX_PMD);
+}
+
+void __init paging_init(void)
+{
+ setup_vm_final();
+ setup_zero_page();
+ local_flush_tlb_all();
+ zone_sizes_init();
}
--
2.17.1
On Tue, Mar 12, 2019 at 10:08:16PM +0000, Anup Patel wrote:
> The setup_vm() must access kernel symbols in a position independent way
> because it will be called from head.S with MMU off.
>
> If we compile kernel with cmodel=medany then PC-relative addressing will
> be used in setup_vm() to access kernel symbols so it works perfectly fine.
>
> Although, if we compile kernel with cmodel=medlow then either absolute
> addressing or PC-relative addressing (based on whichever requires fewer
> instructions) is used to access kernel symbols in setup_vm(). This can
> break setup_vm() whenever any absolute addressing is used to access
> kernel symbols.
>
> With the movement of setup_vm() from kernel/setup.c to mm/init.c, the
> setup_vm() is now broken for cmodel=medlow but it works perfectly fine
> for cmodel=medany.
>
> This patch fixes setup_vm() and makes it independent of GCC code model
> by accessing kernel symbols relative to kernel load address instead of
> assuming PC-relative addressing.
>
> Fixes: 6f1e9e946f0b ("RISC-V: Move setup_vm() to mm/init.c")
> Signed-off-by: Anup Patel <[email protected]>
> ---
> arch/riscv/kernel/head.S | 1 +
> arch/riscv/mm/init.c | 71 ++++++++++++++++++++++++++--------------
> 2 files changed, 47 insertions(+), 25 deletions(-)
>
> diff --git a/arch/riscv/kernel/head.S b/arch/riscv/kernel/head.S
> index fe884cd69abd..7966262b4f9d 100644
> --- a/arch/riscv/kernel/head.S
> +++ b/arch/riscv/kernel/head.S
> @@ -62,6 +62,7 @@ clear_bss_done:
>
> /* Initialize page tables and relocate to virtual addresses */
> la sp, init_thread_union + THREAD_SIZE
> + la a0, _start
> call setup_vm
> call relocate
>
> diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c
> index b379a75ac6a6..f35299f2f3d5 100644
> --- a/arch/riscv/mm/init.c
> +++ b/arch/riscv/mm/init.c
> @@ -172,55 +172,76 @@ void __set_fixmap(enum fixed_addresses idx, phys_addr_t phys, pgprot_t prot)
> }
> }
>
> -asmlinkage void __init setup_vm(void)
> +static inline void *__early_va(void *ptr, uintptr_t load_pa)
> {
> extern char _start;
> + uintptr_t va = (uintptr_t)ptr;
> + uintptr_t sz = (uintptr_t)(&_end) - (uintptr_t)(&_start);
> +
> + if (va >= PAGE_OFFSET && va < (PAGE_OFFSET + sz))
> + return (void *)(load_pa + (va - PAGE_OFFSET));
This is (void *)__pa(va), isn't it?
> + return (void *)va;
The below usage suggests that __early_va() should be used solely for
addresses inside the kernel. What will happen if the accesses is outside
that range? Isn't it a BUG()?
> +}
> +
> +asmlinkage void __init setup_vm(uintptr_t load_pa)
> +{
> uintptr_t i;
> - uintptr_t pa = (uintptr_t) &_start;
> +#ifndef __PAGETABLE_PMD_FOLDED
> + pmd_t *pmdp;
> +#endif
> + pgd_t *pgdp;
> + phys_addr_t map_pa;
> + pgprot_t tableprot = __pgprot(_PAGE_TABLE);
> pgprot_t prot = __pgprot(pgprot_val(PAGE_KERNEL) | _PAGE_EXEC);
>
> - va_pa_offset = PAGE_OFFSET - pa;
> - pfn_base = PFN_DOWN(pa);
> + va_pa_offset = PAGE_OFFSET - load_pa;
> + pfn_base = PFN_DOWN(load_pa);
>
> /* Sanity check alignment and size */
> BUG_ON((PAGE_OFFSET % PGDIR_SIZE) != 0);
> - BUG_ON((pa % (PAGE_SIZE * PTRS_PER_PTE)) != 0);
> + BUG_ON((load_pa % (PAGE_SIZE * PTRS_PER_PTE)) != 0);
>
> #ifndef __PAGETABLE_PMD_FOLDED
> - trampoline_pg_dir[(PAGE_OFFSET >> PGDIR_SHIFT) % PTRS_PER_PGD] =
> - pfn_pgd(PFN_DOWN((uintptr_t)trampoline_pmd),
> - __pgprot(_PAGE_TABLE));
> - trampoline_pmd[0] = pfn_pmd(PFN_DOWN(pa), prot);
> + pgdp = __early_va(trampoline_pg_dir, load_pa);
> + map_pa = (uintptr_t)__early_va(trampoline_pmd, load_pa);
This reads a bit strange: pa = va()
BTW, I think you could keep the pa local variable instead of introducing
map_pa.
> + pgdp[(PAGE_OFFSET >> PGDIR_SHIFT) % PTRS_PER_PGD] =
> + pfn_pgd(PFN_DOWN(map_pa), tableprot);
> + trampoline_pmd[0] = pfn_pmd(PFN_DOWN(load_pa), prot);
> +
> + pgdp = __early_va(swapper_pg_dir, load_pa);
>
> for (i = 0; i < (-PAGE_OFFSET)/PGDIR_SIZE; ++i) {
> size_t o = (PAGE_OFFSET >> PGDIR_SHIFT) % PTRS_PER_PGD + i;
>
> - swapper_pg_dir[o] =
> - pfn_pgd(PFN_DOWN((uintptr_t)swapper_pmd) + i,
> - __pgprot(_PAGE_TABLE));
> + map_pa = (uintptr_t)__early_va(swapper_pmd, load_pa);
> + pgdp[o] = pfn_pgd(PFN_DOWN(map_pa) + i, tableprot);
> }
> + pmdp = __early_va(swapper_pmd, load_pa);
> for (i = 0; i < ARRAY_SIZE(swapper_pmd); i++)
> - swapper_pmd[i] = pfn_pmd(PFN_DOWN(pa + i * PMD_SIZE), prot);
> + pmdp[i] = pfn_pmd(PFN_DOWN(load_pa + i * PMD_SIZE), prot);
>
> - swapper_pg_dir[(FIXADDR_START >> PGDIR_SHIFT) % PTRS_PER_PGD] =
> - pfn_pgd(PFN_DOWN((uintptr_t)fixmap_pmd),
> - __pgprot(_PAGE_TABLE));
> + map_pa = (uintptr_t)__early_va(fixmap_pmd, load_pa);
> + pgdp[(FIXADDR_START >> PGDIR_SHIFT) % PTRS_PER_PGD] =
> + pfn_pgd(PFN_DOWN(map_pa), tableprot);
> + pmdp = __early_va(fixmap_pmd, load_pa);
> + map_pa = (uintptr_t)__early_va(fixmap_pte, load_pa);
> fixmap_pmd[(FIXADDR_START >> PMD_SHIFT) % PTRS_PER_PMD] =
> - pfn_pmd(PFN_DOWN((uintptr_t)fixmap_pte),
> - __pgprot(_PAGE_TABLE));
> + pfn_pmd(PFN_DOWN(map_pa), tableprot);
> #else
> - trampoline_pg_dir[(PAGE_OFFSET >> PGDIR_SHIFT) % PTRS_PER_PGD] =
> - pfn_pgd(PFN_DOWN(pa), prot);
> + pgdp = __early_va(trampoline_pg_dir, load_pa);
> + pgdp[(PAGE_OFFSET >> PGDIR_SHIFT) % PTRS_PER_PGD] =
> + pfn_pgd(PFN_DOWN(load_pa), prot);
> +
> + pgdp = __early_va(swapper_pg_dir, load_pa);
>
> for (i = 0; i < (-PAGE_OFFSET)/PGDIR_SIZE; ++i) {
> size_t o = (PAGE_OFFSET >> PGDIR_SHIFT) % PTRS_PER_PGD + i;
>
> - swapper_pg_dir[o] =
> - pfn_pgd(PFN_DOWN(pa + i * PGDIR_SIZE), prot);
> + pgdp[o] = pfn_pgd(PFN_DOWN(load_pa + i * PGDIR_SIZE), prot);
> }
>
> - swapper_pg_dir[(FIXADDR_START >> PGDIR_SHIFT) % PTRS_PER_PGD] =
> - pfn_pgd(PFN_DOWN((uintptr_t)fixmap_pte),
> - __pgprot(_PAGE_TABLE));
> + map_pa = (uintptr_t)__early_va(fixmap_pte, load_pa);
> + pgdp[(FIXADDR_START >> PGDIR_SHIFT) % PTRS_PER_PGD] =
> + pfn_pgd(PFN_DOWN(map_pa), tableprot);
> #endif
> }
> --
> 2.17.1
>
--
Sincerely yours,
Mike.
On Tue, Mar 12, 2019 at 10:08:22PM +0000, Anup Patel wrote:
> Currently, we have to boot RISCV64 kernel from a 2MB aligned physical
> address and RISCV32 kernel from a 4MB aligned physical address. This
> constraint is because initial pagetable setup (i.e. setup_vm()) maps
> entire RAM using hugepages (i.e. 2MB for 3-level pagetable and 4MB for
> 2-level pagetable).
>
> Further, the above booting constraint also results in memory wastage
> because if we boot kernel from some <xyz> address (which is not same as
> RAM start address) then RISCV kernel will map PAGE_OFFSET virtual address
> linearly to <xyz> physical address and memory between RAM start and <xyz>
> will be reserved/unusable.
>
> For example, RISCV64 kernel booted from 0x80200000 will waste 2MB of RAM
> and RISCV32 kernel booted from 0x80400000 will waste 4MB of RAM.
>
> This patch re-writes the initial pagetable setup code to allow booting
> RISCV32 and RISCV64 kernel from any 4KB (i.e. PAGE_SIZE) aligned address.
>
> To achieve this:
> 1. We map kernel, dtb and only some amount of RAM (few MBs) using 4KB
> mappings in setup_vm() (called from head.S)
> 2. Once we reach paging_init() (called from setup_arch()) after
> memblock setup, we map all available memory banks using 4KB
> mappings and memblock APIs.
I'm not really familiar with RISC-V, but my guess would be that you'd get
worse TLB performance with 4KB mappings. Not mentioning the amount of
memory required for the page table itself.
If the only goal is to utilize the physical memory below the kernel, it
simply should not be reserved at the first place, something like:
diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c
index b379a75..6301ced 100644
--- a/arch/riscv/mm/init.c
+++ b/arch/riscv/mm/init.c
@@ -108,6 +108,7 @@ void __init setup_bootmem(void)
/* Find the memory region containing the kernel */
for_each_memblock(memory, reg) {
phys_addr_t vmlinux_end = __pa(_end);
+ phys_addr_t vmlinux_start = __pa(start);
phys_addr_t end = reg->base + reg->size;
if (reg->base <= vmlinux_end && vmlinux_end <= end) {
@@ -115,7 +116,8 @@ void __init setup_bootmem(void)
* Reserve from the start of the region to the end of
* the kernel
*/
- memblock_reserve(reg->base, vmlinux_end - reg->base);
+ memblock_reserve(vmlinux_start,
+ vmlinux_end - vmlinux_start);
mem_size = min(reg->size, (phys_addr_t)-PAGE_OFFSET);
}
}
> With this patch in-place, the booting constraint for RISCV32 and RISCV64
> kernel is much more relaxed and we can now boot kernel very close to
> RAM start thereby minimizng memory wastage.
>
> Signed-off-by: Anup Patel <[email protected]>
> ---
> arch/riscv/include/asm/fixmap.h | 5 +
> arch/riscv/include/asm/pgtable-64.h | 5 +
> arch/riscv/include/asm/pgtable.h | 6 +-
> arch/riscv/kernel/head.S | 1 +
> arch/riscv/kernel/setup.c | 4 +-
> arch/riscv/mm/init.c | 357 +++++++++++++++++++++++-----
> 6 files changed, 317 insertions(+), 61 deletions(-)
>
> diff --git a/arch/riscv/include/asm/fixmap.h b/arch/riscv/include/asm/fixmap.h
> index 57afe604b495..5cf53dd882e5 100644
> --- a/arch/riscv/include/asm/fixmap.h
> +++ b/arch/riscv/include/asm/fixmap.h
> @@ -21,6 +21,11 @@
> */
> enum fixed_addresses {
> FIX_HOLE,
> +#define FIX_FDT_SIZE SZ_1M
> + FIX_FDT_END,
> + FIX_FDT = FIX_FDT_END + FIX_FDT_SIZE / PAGE_SIZE - 1,
> + FIX_PTE,
> + FIX_PMD,
> FIX_EARLYCON_MEM_BASE,
> __end_of_fixed_addresses
> };
> diff --git a/arch/riscv/include/asm/pgtable-64.h b/arch/riscv/include/asm/pgtable-64.h
> index 7aa0ea9bd8bb..56ecc3dc939d 100644
> --- a/arch/riscv/include/asm/pgtable-64.h
> +++ b/arch/riscv/include/asm/pgtable-64.h
> @@ -78,6 +78,11 @@ static inline pmd_t pfn_pmd(unsigned long pfn, pgprot_t prot)
> return __pmd((pfn << _PAGE_PFN_SHIFT) | pgprot_val(prot));
> }
>
> +static inline unsigned long _pmd_pfn(pmd_t pmd)
> +{
> + return pmd_val(pmd) >> _PAGE_PFN_SHIFT;
> +}
> +
> #define pmd_ERROR(e) \
> pr_err("%s:%d: bad pmd %016lx.\n", __FILE__, __LINE__, pmd_val(e))
>
> diff --git a/arch/riscv/include/asm/pgtable.h b/arch/riscv/include/asm/pgtable.h
> index 1141364d990e..05fa2115e736 100644
> --- a/arch/riscv/include/asm/pgtable.h
> +++ b/arch/riscv/include/asm/pgtable.h
> @@ -121,12 +121,16 @@ static inline void pmd_clear(pmd_t *pmdp)
> set_pmd(pmdp, __pmd(0));
> }
>
> -
> static inline pgd_t pfn_pgd(unsigned long pfn, pgprot_t prot)
> {
> return __pgd((pfn << _PAGE_PFN_SHIFT) | pgprot_val(prot));
> }
>
> +static inline unsigned long _pgd_pfn(pgd_t pgd)
> +{
> + return pgd_val(pgd) >> _PAGE_PFN_SHIFT;
> +}
> +
> #define pgd_index(addr) (((addr) >> PGDIR_SHIFT) & (PTRS_PER_PGD - 1))
>
> /* Locate an entry in the page global directory */
> diff --git a/arch/riscv/kernel/head.S b/arch/riscv/kernel/head.S
> index 7966262b4f9d..12a3ec5eb8ab 100644
> --- a/arch/riscv/kernel/head.S
> +++ b/arch/riscv/kernel/head.S
> @@ -63,6 +63,7 @@ clear_bss_done:
> /* Initialize page tables and relocate to virtual addresses */
> la sp, init_thread_union + THREAD_SIZE
> la a0, _start
> + mv a1, s1
> call setup_vm
> call relocate
>
> diff --git a/arch/riscv/kernel/setup.c b/arch/riscv/kernel/setup.c
> index ecb654f6a79e..acdd0f74982b 100644
> --- a/arch/riscv/kernel/setup.c
> +++ b/arch/riscv/kernel/setup.c
> @@ -30,6 +30,7 @@
> #include <linux/sched/task.h>
> #include <linux/swiotlb.h>
>
> +#include <asm/fixmap.h>
> #include <asm/setup.h>
> #include <asm/sections.h>
> #include <asm/pgtable.h>
> @@ -62,7 +63,8 @@ unsigned long boot_cpu_hartid;
>
> void __init parse_dtb(unsigned int hartid, void *dtb)
> {
> - if (early_init_dt_scan(__va(dtb)))
> + dtb = (void *)fix_to_virt(FIX_FDT) + ((uintptr_t)dtb & ~PAGE_MASK);
> + if (early_init_dt_scan(dtb))
> return;
>
> pr_err("No DTB passed to the kernel\n");
> diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c
> index f35299f2f3d5..ee55a4b90dec 100644
> --- a/arch/riscv/mm/init.c
> +++ b/arch/riscv/mm/init.c
> @@ -1,14 +1,7 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> /*
> + * Copyright (C) 2019 Western Digital Corporation or its affiliates.
> * Copyright (C) 2012 Regents of the University of California
> - *
> - * This program is free software; you can redistribute it and/or
> - * modify it under the terms of the GNU General Public License
> - * as published by the Free Software Foundation, version 2.
> - *
> - * This program is distributed in the hope that it will be useful,
> - * but WITHOUT ANY WARRANTY; without even the implied warranty of
> - * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
> - * GNU General Public License for more details.
> */
>
> #include <linux/init.h>
> @@ -43,13 +36,6 @@ void setup_zero_page(void)
> memset((void *)empty_zero_page, 0, PAGE_SIZE);
> }
>
> -void __init paging_init(void)
> -{
> - setup_zero_page();
> - local_flush_tlb_all();
> - zone_sizes_init();
> -}
> -
> void __init mem_init(void)
> {
> #ifdef CONFIG_FLATMEM
> @@ -146,13 +132,24 @@ void __init setup_bootmem(void)
> pgd_t swapper_pg_dir[PTRS_PER_PGD] __page_aligned_bss;
> pgd_t trampoline_pg_dir[PTRS_PER_PGD] __initdata __aligned(PAGE_SIZE);
>
> +#define MAX_EARLY_MAPPING_SIZE SZ_128M
> +
> #ifndef __PAGETABLE_PMD_FOLDED
> -#define NUM_SWAPPER_PMDS ((uintptr_t)-PAGE_OFFSET >> PGDIR_SHIFT)
> -pmd_t swapper_pmd[PTRS_PER_PMD*((-PAGE_OFFSET)/PGDIR_SIZE)] __page_aligned_bss;
> -pmd_t trampoline_pmd[PTRS_PER_PGD] __initdata __aligned(PAGE_SIZE);
> +#if MAX_EARLY_MAPPING_SIZE < PGDIR_SIZE
> +#define NUM_SWAPPER_PMDS 1UL
> +#else
> +#define NUM_SWAPPER_PMDS (MAX_EARLY_MAPPING_SIZE/PGDIR_SIZE)
> +#endif
> +pmd_t swapper_pmd[PTRS_PER_PMD*NUM_SWAPPER_PMDS] __page_aligned_bss;
> +pmd_t trampoline_pmd[PTRS_PER_PMD] __initdata __aligned(PAGE_SIZE);
> pmd_t fixmap_pmd[PTRS_PER_PMD] __page_aligned_bss;
> +#define NUM_SWAPPER_PTES (MAX_EARLY_MAPPING_SIZE/PMD_SIZE)
> +#else
> +#define NUM_SWAPPER_PTES (MAX_EARLY_MAPPING_SIZE/PGDIR_SIZE)
> #endif
>
> +pte_t swapper_pte[PTRS_PER_PTE*NUM_SWAPPER_PTES] __page_aligned_bss;
> +pte_t trampoline_pte[PTRS_PER_PTE] __initdata __aligned(PAGE_SIZE);
> pte_t fixmap_pte[PTRS_PER_PTE] __page_aligned_bss;
>
> void __set_fixmap(enum fixed_addresses idx, phys_addr_t phys, pgprot_t prot)
> @@ -172,76 +169,318 @@ void __set_fixmap(enum fixed_addresses idx, phys_addr_t phys, pgprot_t prot)
> }
> }
>
> +struct mapping_ops {
> + pte_t *(*get_pte_virt)(phys_addr_t pa);
> + phys_addr_t (*alloc_pte)(uintptr_t va, uintptr_t load_pa);
> + pmd_t *(*get_pmd_virt)(phys_addr_t pa);
> + phys_addr_t (*alloc_pmd)(uintptr_t va, uintptr_t load_pa);
> +};
> +
> static inline void *__early_va(void *ptr, uintptr_t load_pa)
> {
> extern char _start;
> uintptr_t va = (uintptr_t)ptr;
> uintptr_t sz = (uintptr_t)(&_end) - (uintptr_t)(&_start);
>
> - if (va >= PAGE_OFFSET && va < (PAGE_OFFSET + sz))
> + if (va >= PAGE_OFFSET && va <= (PAGE_OFFSET + sz))
> return (void *)(load_pa + (va - PAGE_OFFSET));
> return (void *)va;
> }
>
> -asmlinkage void __init setup_vm(uintptr_t load_pa)
> +#define __early_pa(ptr, load_pa) (uintptr_t)__early_va(ptr, load_pa)
> +
> +static phys_addr_t __init final_alloc_pgtable(void)
> +{
> + return memblock_phys_alloc(PAGE_SIZE, PAGE_SIZE);
> +}
> +
> +static pte_t *__init early_get_pte_virt(phys_addr_t pa)
> {
> - uintptr_t i;
> + return (pte_t *)((uintptr_t)pa);
> +}
> +
> +static pte_t *__init final_get_pte_virt(phys_addr_t pa)
> +{
> + clear_fixmap(FIX_PTE);
> +
> + return (pte_t *)set_fixmap_offset(FIX_PTE, pa);
> +}
> +
> +static phys_addr_t __init early_alloc_pte(uintptr_t va, uintptr_t load_pa)
> +{
> + pte_t *base = __early_va(swapper_pte, load_pa);
> + uintptr_t pte_num = ((va - PAGE_OFFSET) >> PMD_SHIFT);
> +
> + BUG_ON(pte_num >= NUM_SWAPPER_PTES);
> +
> + return (uintptr_t)&base[pte_num * PTRS_PER_PTE];
> +}
> +
> +static phys_addr_t __init final_alloc_pte(uintptr_t va, uintptr_t load_pa)
> +{
> + return final_alloc_pgtable();
> +}
> +
> +static void __init create_pte_mapping(pte_t *ptep,
> + uintptr_t va, phys_addr_t pa,
> + phys_addr_t sz, pgprot_t prot)
> +{
> + uintptr_t pte_index = pte_index(va);
> +
> + BUG_ON(sz != PAGE_SIZE);
> +
> + if (pte_none(ptep[pte_index]))
> + ptep[pte_index] = pfn_pte(PFN_DOWN(pa), prot);
> +}
> +
> #ifndef __PAGETABLE_PMD_FOLDED
> +static pmd_t *__init early_get_pmd_virt(phys_addr_t pa)
> +{
> + return (pmd_t *)((uintptr_t)pa);
> +}
> +
> +static pmd_t *__init final_get_pmd_virt(phys_addr_t pa)
> +{
> + clear_fixmap(FIX_PMD);
> +
> + return (pmd_t *)set_fixmap_offset(FIX_PMD, pa);
> +}
> +
> +static phys_addr_t __init early_alloc_pmd(uintptr_t va, uintptr_t load_pa)
> +{
> + pmd_t *base = __early_va(swapper_pmd, load_pa);
> + uintptr_t pmd_num = (va - PAGE_OFFSET) >> PGDIR_SHIFT;
> +
> + BUG_ON(pmd_num >= NUM_SWAPPER_PMDS);
> +
> + return (uintptr_t)&base[pmd_num * PTRS_PER_PMD];
> +}
> +
> +static phys_addr_t __init final_alloc_pmd(uintptr_t va, uintptr_t load_pa)
> +{
> + return final_alloc_pgtable();
> +}
> +
> +static void __init create_pmd_mapping(pmd_t *pmdp,
> + uintptr_t va, phys_addr_t pa,
> + phys_addr_t sz, pgprot_t prot,
> + uintptr_t ops_load_pa,
> + struct mapping_ops *ops)
> +{
> + pte_t *ptep;
> + phys_addr_t pte_phys;
> + uintptr_t pmd_index = pmd_index(va);
> +
> + if (sz == PMD_SIZE) {
> + if (pmd_none(pmdp[pmd_index]))
> + pmdp[pmd_index] = pfn_pmd(PFN_DOWN(pa), prot);
> + return;
> + }
> +
> + if (pmd_none(pmdp[pmd_index])) {
> + pte_phys = ops->alloc_pte(va, ops_load_pa);
> + pmdp[pmd_index] = pfn_pmd(PFN_DOWN(pte_phys),
> + __pgprot(_PAGE_TABLE));
> + ptep = ops->get_pte_virt(pte_phys);
> + memset(ptep, 0, PAGE_SIZE);
> + } else {
> + pte_phys = PFN_PHYS(_pmd_pfn(pmdp[pmd_index]));
> + ptep = ops->get_pte_virt(pte_phys);
> + }
> +
> + create_pte_mapping(ptep, va, pa, sz, prot);
> +}
> +
> +static void __init create_pgd_mapping(pgd_t *pgdp,
> + uintptr_t va, phys_addr_t pa,
> + phys_addr_t sz, pgprot_t prot,
> + uintptr_t ops_load_pa,
> + struct mapping_ops *ops)
> +{
> pmd_t *pmdp;
> + phys_addr_t pmd_phys;
> + uintptr_t pgd_index = pgd_index(va);
> +
> + if (sz == PGDIR_SIZE) {
> + if (pgd_val(pgdp[pgd_index]) == 0)
> + pgdp[pgd_index] = pfn_pgd(PFN_DOWN(pa), prot);
> + return;
> + }
> +
> + if (pgd_val(pgdp[pgd_index]) == 0) {
> + pmd_phys = ops->alloc_pmd(va, ops_load_pa);
> + pgdp[pgd_index] = pfn_pgd(PFN_DOWN(pmd_phys),
> + __pgprot(_PAGE_TABLE));
> + pmdp = ops->get_pmd_virt(pmd_phys);
> + memset(pmdp, 0, PAGE_SIZE);
> + } else {
> + pmd_phys = PFN_PHYS(_pgd_pfn(pgdp[pgd_index]));
> + pmdp = ops->get_pmd_virt(pmd_phys);
> + }
> +
> + create_pmd_mapping(pmdp, va, pa, sz, prot, ops_load_pa, ops);
> +}
> +#else
> +static void __init create_pgd_mapping(pgd_t *pgdp,
> + uintptr_t va, phys_addr_t pa,
> + phys_addr_t sz, pgprot_t prot,
> + uintptr_t ops_load_pa,
> + struct mapping_ops *ops)
> +{
> + pte_t *ptep;
> + phys_addr_t pte_phys;
> + uintptr_t pgd_index = pgd_index(va);
> +
> + if (sz == PGDIR_SIZE) {
> + if (pgd_val(pgdp[pgd_index]) == 0)
> + pgdp[pgd_index] = pfn_pgd(PFN_DOWN(pa), prot);
> + return;
> + }
> +
> + if (pgd_val(pgdp[pgd_index]) == 0) {
> + pte_phys = ops->alloc_pte(va, ops_load_pa);
> + pgdp[pgd_index] = pfn_pgd(PFN_DOWN(pte_phys),
> + __pgprot(_PAGE_TABLE));
> + ptep = ops->get_pte_virt(pte_phys);
> + memset(ptep, 0, PAGE_SIZE);
> + } else {
> + pte_phys = PFN_PHYS(_pgd_pfn(pgdp[pgd_index]));
> + ptep = ops->get_pte_virt(pte_phys);
> + }
> +
> + create_pte_mapping(ptep, va, pa, sz, prot);
> +}
> #endif
> - pgd_t *pgdp;
> +
> +asmlinkage void __init setup_vm(uintptr_t load_pa, uintptr_t dtb_pa)
> +{
> phys_addr_t map_pa;
> + uintptr_t va, end_va;
> + uintptr_t load_sz = __early_pa(&_end, load_pa) - load_pa;
> pgprot_t tableprot = __pgprot(_PAGE_TABLE);
> pgprot_t prot = __pgprot(pgprot_val(PAGE_KERNEL) | _PAGE_EXEC);
> + struct mapping_ops ops;
>
> va_pa_offset = PAGE_OFFSET - load_pa;
> pfn_base = PFN_DOWN(load_pa);
>
> /* Sanity check alignment and size */
> BUG_ON((PAGE_OFFSET % PGDIR_SIZE) != 0);
> - BUG_ON((load_pa % (PAGE_SIZE * PTRS_PER_PTE)) != 0);
> + BUG_ON((load_pa % PAGE_SIZE) != 0);
> + BUG_ON(load_sz > MAX_EARLY_MAPPING_SIZE);
> +
> + /* Setup mapping ops */
> + ops.get_pte_virt = __early_va(early_get_pte_virt, load_pa);
> + ops.alloc_pte = __early_va(early_alloc_pte, load_pa);
> + ops.get_pmd_virt = NULL;
> + ops.alloc_pmd = NULL;
>
> #ifndef __PAGETABLE_PMD_FOLDED
> - pgdp = __early_va(trampoline_pg_dir, load_pa);
> - map_pa = (uintptr_t)__early_va(trampoline_pmd, load_pa);
> - pgdp[(PAGE_OFFSET >> PGDIR_SHIFT) % PTRS_PER_PGD] =
> - pfn_pgd(PFN_DOWN(map_pa), tableprot);
> - trampoline_pmd[0] = pfn_pmd(PFN_DOWN(load_pa), prot);
> + /* Update mapping ops for PMD */
> + ops.get_pmd_virt = __early_va(early_get_pmd_virt, load_pa);
> + ops.alloc_pmd = __early_va(early_alloc_pmd, load_pa);
> +
> + /* Setup trampoline PGD and PMD */
> + map_pa = __early_pa(trampoline_pmd, load_pa);
> + create_pgd_mapping(__early_va(trampoline_pg_dir, load_pa),
> + PAGE_OFFSET, map_pa, PGDIR_SIZE, tableprot,
> + load_pa, &ops);
> + map_pa = __early_pa(trampoline_pte, load_pa);
> + create_pmd_mapping(__early_va(trampoline_pmd, load_pa),
> + PAGE_OFFSET, map_pa, PMD_SIZE, tableprot,
> + load_pa, &ops);
> +
> + /* Setup swapper PGD and PMD for fixmap */
> + map_pa = __early_pa(fixmap_pmd, load_pa);
> + create_pgd_mapping(__early_va(swapper_pg_dir, load_pa),
> + FIXADDR_START, map_pa, PGDIR_SIZE, tableprot,
> + load_pa, &ops);
> + map_pa = __early_pa(fixmap_pte, load_pa);
> + create_pmd_mapping(__early_va(fixmap_pmd, load_pa),
> + FIXADDR_START, map_pa, PMD_SIZE, tableprot,
> + load_pa, &ops);
> +#else
> + /* Setup trampoline PGD */
> + map_pa = __early_pa(trampoline_pte, load_pa);
> + create_pgd_mapping(__early_va(trampoline_pg_dir, load_pa),
> + PAGE_OFFSET, map_pa, PGDIR_SIZE, tableprot,
> + load_pa, &ops);
> +
> + /* Setup swapper PGD for fixmap */
> + map_pa = __early_pa(fixmap_pte, load_pa);
> + create_pgd_mapping(__early_va(swapper_pg_dir, load_pa),
> + FIXADDR_START, map_pa, PGDIR_SIZE, tableprot,
> + load_pa, &ops);
> +#endif
>
> - pgdp = __early_va(swapper_pg_dir, load_pa);
> + /* Setup trampoling PTE */
> + end_va = PAGE_OFFSET + PAGE_SIZE*PTRS_PER_PTE;
> + for (va = PAGE_OFFSET; va < end_va; va += PAGE_SIZE)
> + create_pte_mapping(__early_va(trampoline_pte, load_pa),
> + va, load_pa + (va - PAGE_OFFSET),
> + PAGE_SIZE, prot);
> +
> + /*
> + * Setup swapper PGD covering kernel and some amount of
> + * RAM which will allows us to reach paging_init(). We map
> + * all memory banks later in setup_vm_final() below.
> + */
> + end_va = PAGE_OFFSET + load_sz;
> + for (va = PAGE_OFFSET; va < end_va; va += PAGE_SIZE)
> + create_pgd_mapping(__early_va(swapper_pg_dir, load_pa),
> + va, load_pa + (va - PAGE_OFFSET),
> + PAGE_SIZE, prot, load_pa, &ops);
> +
> + /* Create fixed mapping for early parsing of FDT */
> + end_va = __fix_to_virt(FIX_FDT) + FIX_FDT_SIZE;
> + for (va = __fix_to_virt(FIX_FDT); va < end_va; va += PAGE_SIZE)
> + create_pte_mapping(__early_va(fixmap_pte, load_pa),
> + va, dtb_pa + (va - __fix_to_virt(FIX_FDT)),
> + PAGE_SIZE, prot);
> +}
>
> - for (i = 0; i < (-PAGE_OFFSET)/PGDIR_SIZE; ++i) {
> - size_t o = (PAGE_OFFSET >> PGDIR_SHIFT) % PTRS_PER_PGD + i;
> +static void __init setup_vm_final(void)
> +{
> + phys_addr_t pa, start, end;
> + struct memblock_region *reg;
> + struct mapping_ops ops;
> + pgprot_t prot = __pgprot(pgprot_val(PAGE_KERNEL) | _PAGE_EXEC);
>
> - map_pa = (uintptr_t)__early_va(swapper_pmd, load_pa);
> - pgdp[o] = pfn_pgd(PFN_DOWN(map_pa) + i, tableprot);
> - }
> - pmdp = __early_va(swapper_pmd, load_pa);
> - for (i = 0; i < ARRAY_SIZE(swapper_pmd); i++)
> - pmdp[i] = pfn_pmd(PFN_DOWN(load_pa + i * PMD_SIZE), prot);
> -
> - map_pa = (uintptr_t)__early_va(fixmap_pmd, load_pa);
> - pgdp[(FIXADDR_START >> PGDIR_SHIFT) % PTRS_PER_PGD] =
> - pfn_pgd(PFN_DOWN(map_pa), tableprot);
> - pmdp = __early_va(fixmap_pmd, load_pa);
> - map_pa = (uintptr_t)__early_va(fixmap_pte, load_pa);
> - fixmap_pmd[(FIXADDR_START >> PMD_SHIFT) % PTRS_PER_PMD] =
> - pfn_pmd(PFN_DOWN(map_pa), tableprot);
> + /* Setup mapping ops */
> + ops.get_pte_virt = final_get_pte_virt;
> + ops.alloc_pte = final_alloc_pte;
> +#ifndef __PAGETABLE_PMD_FOLDED
> + ops.get_pmd_virt = final_get_pmd_virt;
> + ops.alloc_pmd = final_alloc_pmd;
> #else
> - pgdp = __early_va(trampoline_pg_dir, load_pa);
> - pgdp[(PAGE_OFFSET >> PGDIR_SHIFT) % PTRS_PER_PGD] =
> - pfn_pgd(PFN_DOWN(load_pa), prot);
> -
> - pgdp = __early_va(swapper_pg_dir, load_pa);
> -
> - for (i = 0; i < (-PAGE_OFFSET)/PGDIR_SIZE; ++i) {
> - size_t o = (PAGE_OFFSET >> PGDIR_SHIFT) % PTRS_PER_PGD + i;
> + ops.get_pmd_virt = NULL;
> + ops.alloc_pmd = NULL;
> +#endif
>
> - pgdp[o] = pfn_pgd(PFN_DOWN(load_pa + i * PGDIR_SIZE), prot);
> + /* Map all memory banks */
> + for_each_memblock(memory, reg) {
> + start = reg->base;
> + end = start + reg->size;
> +
> + if (start >= end)
> + break;
> + if (memblock_is_nomap(reg))
> + continue;
> +
> + for (pa = start; pa < end; pa += PAGE_SIZE)
> + create_pgd_mapping(swapper_pg_dir,
> + (uintptr_t)__va(pa), pa,
> + PAGE_SIZE, prot, 0, &ops);
> }
>
> - map_pa = (uintptr_t)__early_va(fixmap_pte, load_pa);
> - pgdp[(FIXADDR_START >> PGDIR_SHIFT) % PTRS_PER_PGD] =
> - pfn_pgd(PFN_DOWN(map_pa), tableprot);
> -#endif
> + clear_fixmap(FIX_PTE);
> + clear_fixmap(FIX_PMD);
> +}
> +
> +void __init paging_init(void)
> +{
> + setup_vm_final();
> + setup_zero_page();
> + local_flush_tlb_all();
> + zone_sizes_init();
> }
> --
> 2.17.1
>
--
Sincerely yours,
Mike.
On Thu, Mar 14, 2019 at 12:01 AM Mike Rapoport <[email protected]> wrote:
>
> On Tue, Mar 12, 2019 at 10:08:22PM +0000, Anup Patel wrote:
> > Currently, we have to boot RISCV64 kernel from a 2MB aligned physical
> > address and RISCV32 kernel from a 4MB aligned physical address. This
> > constraint is because initial pagetable setup (i.e. setup_vm()) maps
> > entire RAM using hugepages (i.e. 2MB for 3-level pagetable and 4MB for
> > 2-level pagetable).
> >
> > Further, the above booting contraint also results in memory wastage
> > because if we boot kernel from some <xyz> address (which is not same as
> > RAM start address) then RISCV kernel will map PAGE_OFFSET virtual address
> > lineraly to <xyz> physical address and memory between RAM start and <xyz>
> > will be reserved/unusable.
> >
> > For example, RISCV64 kernel booted from 0x80200000 will waste 2MB of RAM
> > and RISCV32 kernel booted from 0x80400000 will waste 4MB of RAM.
> >
> > This patch re-writes the initial pagetable setup code to allow booting
> > RISV32 and RISCV64 kernel from any 4KB (i.e. PAGE_SIZE) aligned address.
> >
> > To achieve this:
> > 1. We map kernel, dtb and only some amount of RAM (few MBs) using 4KB
> > mappings in setup_vm() (called from head.S)
> > 2. Once we reach paging_init() (called from setup_arch()) after
> > memblock setup, we map all available memory banks using 4KB
> > mappings and memblock APIs.
>
> I'm not really familiar with RISC-V, but my guess would be that you'd get
> worse TLB performance with 4KB mappings. Not mentioning the amount of
> memory required for the page table itself.
I agree we will see a hit in TLB performance due to 4KB mappings.
To address this, we can create 2MB (or 4MB on 32bit systems) mappings
whenever load_pa is aligned to them; otherwise we prefer 4KB mappings. In other
words, we create bigger mappings whenever possible and fall back to 4KB
mappings when not possible.
This way, if the kernel is booted from a 2MB (or 4MB) aligned address, we will
see good TLB performance for kernel addresses. Also, users are still free to
boot the Linux RISC-V kernel from any 4KB aligned address.
Of course, we will have to document this as part of Linux RISC-V booting
requirements under Documentation/ (which does not exist currently).
>
> If the only goal is to utilize the physical memory below the kernel, it
> simply should not be reserved at the first place, something like:
Well, our goal was two-fold:
1. We wanted to unify boot-time alignment requirements for 32bit and
64bit RISC-V systems
2. Save memory by allowing users to place the kernel just after the runtime
firmware at the start of RAM.
>
> diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c
> index b379a75..6301ced 100644
> --- a/arch/riscv/mm/init.c
> +++ b/arch/riscv/mm/init.c
> @@ -108,6 +108,7 @@ void __init setup_bootmem(void)
> /* Find the memory region containing the kernel */
> for_each_memblock(memory, reg) {
> phys_addr_t vmlinux_end = __pa(_end);
> + phys_addr_t vmlinux_start = __pa(start);
> phys_addr_t end = reg->base + reg->size;
>
> if (reg->base <= vmlinux_end && vmlinux_end <= end) {
> @@ -115,7 +116,8 @@ void __init setup_bootmem(void)
> * Reserve from the start of the region to the end of
> * the kernel
> */
> - memblock_reserve(reg->base, vmlinux_end - reg->base);
> + memblock_reserve(vmlinux_start,
> + vmlinux_end - vmlinux_start);
> mem_size = min(reg->size, (phys_addr_t)-PAGE_OFFSET);
> }
> }
Thanks for above changes. I will include them in my next revision.
Regards,
Anup
On Thu, Mar 14, 2019 at 02:36:01AM +0530, Anup Patel wrote:
> On Thu, Mar 14, 2019 at 12:01 AM Mike Rapoport <[email protected]> wrote:
> >
> > On Tue, Mar 12, 2019 at 10:08:22PM +0000, Anup Patel wrote:
> > > Currently, we have to boot RISCV64 kernel from a 2MB aligned physical
> > > address and RISCV32 kernel from a 4MB aligned physical address. This
> > > constraint is because initial pagetable setup (i.e. setup_vm()) maps
> > > entire RAM using hugepages (i.e. 2MB for 3-level pagetable and 4MB for
> > > 2-level pagetable).
> > >
> > > Further, the above booting contraint also results in memory wastage
> > > because if we boot kernel from some <xyz> address (which is not same as
> > > RAM start address) then RISCV kernel will map PAGE_OFFSET virtual address
> > > lineraly to <xyz> physical address and memory between RAM start and <xyz>
> > > will be reserved/unusable.
> > >
> > > For example, RISCV64 kernel booted from 0x80200000 will waste 2MB of RAM
> > > and RISCV32 kernel booted from 0x80400000 will waste 4MB of RAM.
> > >
> > > This patch re-writes the initial pagetable setup code to allow booting
> > > RISV32 and RISCV64 kernel from any 4KB (i.e. PAGE_SIZE) aligned address.
> > >
> > > To achieve this:
> > > 1. We map kernel, dtb and only some amount of RAM (few MBs) using 4KB
> > > mappings in setup_vm() (called from head.S)
> > > 2. Once we reach paging_init() (called from setup_arch()) after
> > > memblock setup, we map all available memory banks using 4KB
> > > mappings and memblock APIs.
> >
> > I'm not really familiar with RISC-V, but my guess would be that you'd get
> > worse TLB performance with 4KB mappings. Not mentioning the amount of
> > memory required for the page table itself.
>
> I agree we will see a hit in TLB performance due to 4KB mappings.
>
> To address this we can create, 2MB (or 4MB on 32bit system) mappings
> whenever load_pa is aligned to it otherwise we prefer 4KB mappings. In other
> words, we create bigger mappings whenever possible and fallback to 4KB
> mappings when not possible.
>
> This way if kernel is booted from 2MB (or 4MB) aligned address then we will
> see good TLB performance for kernel addresses. Also, users are still free to
> boot Linux RISC-V kernel from any 4KB aligned address.
>
> Of course, we will have to document this as part of Linux RISC-V booting
> requirements under Documentation/ (which does not exist currently).
>
> >
> > If the only goal is to utilize the physical memory below the kernel, it
> > simply should not be reserved at the first place, something like:
>
> Well, our goal was two-fold:
>
> 1. We wanted to unify boot-time alignment requirements for 32bit and
> 64bit RISC-V systems
Can't they both start from 4MB aligned address provided the memory below
the kernel can be freed?
> 2. Save memory by allowing users to place kernel just after the runtime
> firmware at starting of RAM.
If the firmware should be alive after kernel boot, its memory is the only
part that should be reserved below the kernel. Otherwise, the entire region
<physical memory start> - <kernel start> can be free.
Using 4K pages for the swapper_pg_dir is quite a change and I'm not
convinced it's really justified.
> >
> > diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c
> > index b379a75..6301ced 100644
> > --- a/arch/riscv/mm/init.c
> > +++ b/arch/riscv/mm/init.c
> > @@ -108,6 +108,7 @@ void __init setup_bootmem(void)
> > /* Find the memory region containing the kernel */
> > for_each_memblock(memory, reg) {
> > phys_addr_t vmlinux_end = __pa(_end);
> > + phys_addr_t vmlinux_start = __pa(start);
> > phys_addr_t end = reg->base + reg->size;
> >
> > if (reg->base <= vmlinux_end && vmlinux_end <= end) {
> > @@ -115,7 +116,8 @@ void __init setup_bootmem(void)
> > * Reserve from the start of the region to the end of
> > * the kernel
> > */
> > - memblock_reserve(reg->base, vmlinux_end - reg->base);
> > + memblock_reserve(vmlinux_start,
> > + vmlinux_end - vmlinux_start);
> > mem_size = min(reg->size, (phys_addr_t)-PAGE_OFFSET);
> > }
> > }
>
> Thanks for above changes. I will include them in my next revision.
>
> Regards,
> Anup
>
--
Sincerely yours,
Mike.
On Thu, Mar 14, 2019 at 12:23 PM Mike Rapoport <[email protected]> wrote:
>
> On Thu, Mar 14, 2019 at 02:36:01AM +0530, Anup Patel wrote:
> > On Thu, Mar 14, 2019 at 12:01 AM Mike Rapoport <[email protected]> wrote:
> > >
> > > On Tue, Mar 12, 2019 at 10:08:22PM +0000, Anup Patel wrote:
> > > > Currently, we have to boot RISCV64 kernel from a 2MB aligned physical
> > > > address and RISCV32 kernel from a 4MB aligned physical address. This
> > > > constraint is because initial pagetable setup (i.e. setup_vm()) maps
> > > > entire RAM using hugepages (i.e. 2MB for 3-level pagetable and 4MB for
> > > > 2-level pagetable).
> > > >
> > > > Further, the above booting contraint also results in memory wastage
> > > > because if we boot kernel from some <xyz> address (which is not same as
> > > > RAM start address) then RISCV kernel will map PAGE_OFFSET virtual address
> > > > lineraly to <xyz> physical address and memory between RAM start and <xyz>
> > > > will be reserved/unusable.
> > > >
> > > > For example, RISCV64 kernel booted from 0x80200000 will waste 2MB of RAM
> > > > and RISCV32 kernel booted from 0x80400000 will waste 4MB of RAM.
> > > >
> > > > This patch re-writes the initial pagetable setup code to allow booting
> > > > RISV32 and RISCV64 kernel from any 4KB (i.e. PAGE_SIZE) aligned address.
> > > >
> > > > To achieve this:
> > > > 1. We map kernel, dtb and only some amount of RAM (few MBs) using 4KB
> > > > mappings in setup_vm() (called from head.S)
> > > > 2. Once we reach paging_init() (called from setup_arch()) after
> > > > memblock setup, we map all available memory banks using 4KB
> > > > mappings and memblock APIs.
> > >
> > > I'm not really familiar with RISC-V, but my guess would be that you'd get
> > > worse TLB performance with 4KB mappings. Not mentioning the amount of
> > > memory required for the page table itself.
> >
> > I agree we will see a hit in TLB performance due to 4KB mappings.
> >
> > To address this we can create, 2MB (or 4MB on 32bit system) mappings
> > whenever load_pa is aligned to it otherwise we prefer 4KB mappings. In other
> > words, we create bigger mappings whenever possible and fallback to 4KB
> > mappings when not possible.
> >
> > This way if kernel is booted from 2MB (or 4MB) aligned address then we will
> > see good TLB performance for kernel addresses. Also, users are still free to
> > boot Linux RISC-V kernel from any 4KB aligned address.
> >
> > Of course, we will have to document this as part of Linux RISC-V booting
> > requirements under Documentation/ (which does not exist currently).
> >
> > >
> > > If the only goal is to utilize the physical memory below the kernel, it
> > > simply should not be reserved at the first place, something like:
> >
> > Well, our goal was two-fold:
> >
> > 1. We wanted to unify boot-time alignment requirements for 32bit and
> > 64bit RISC-V systems
>
> Can't they both start from 4MB aligned address provided the memory below
> the kernel can be freed?
Yes, they can both start from 4MB aligned address.
>
> > 2. Save memory by allowing users to place kernel just after the runtime
> > firmware at starting of RAM.
>
> If the firmware should be alive after kernel boot, it's memory is the only
> part that should be reserved below the kernel. Otherwise, the entire region
> <physical memory start> - <kernel start> can be free.
>
> Using 4K pages for the swapper_pg_dir is quite a change and I'm not
> convinced its really justified.
I understand your concern about TLB performance and more page
tables.
Not just 2MB/4MB mappings, we should be able to create even 1GB
mappings as well for good TLB performance.
I suggest we should use best possible mapping size (4KB, 2MB, or
1GB) based on alignment of kernel load address. This way users can
boot from any 4KB aligned address and setup_vm() will try to use
biggest possible mapping size.
For example, if the kernel load address is aligned to 2MB, then we create 2MB
mappings (i.e. bigger mappings) and use fewer page tables. The same is
possible for 1GB mappings as well.
Regards,
Anup
On Thu, Mar 14, 2019 at 11:28:32PM +0530, Anup Patel wrote:
> On Thu, Mar 14, 2019 at 12:23 PM Mike Rapoport <[email protected]> wrote:
> >
> > On Thu, Mar 14, 2019 at 02:36:01AM +0530, Anup Patel wrote:
> > > On Thu, Mar 14, 2019 at 12:01 AM Mike Rapoport <[email protected]> wrote:
> > > >
> > > > On Tue, Mar 12, 2019 at 10:08:22PM +0000, Anup Patel wrote:
> > > > > Currently, we have to boot RISCV64 kernel from a 2MB aligned physical
> > > > > address and RISCV32 kernel from a 4MB aligned physical address. This
> > > > > constraint is because initial pagetable setup (i.e. setup_vm()) maps
> > > > > entire RAM using hugepages (i.e. 2MB for 3-level pagetable and 4MB for
> > > > > 2-level pagetable).
> > > > >
> > > > > Further, the above booting contraint also results in memory wastage
> > > > > because if we boot kernel from some <xyz> address (which is not same as
> > > > > RAM start address) then RISCV kernel will map PAGE_OFFSET virtual address
> > > > > lineraly to <xyz> physical address and memory between RAM start and <xyz>
> > > > > will be reserved/unusable.
> > > > >
> > > > > For example, RISCV64 kernel booted from 0x80200000 will waste 2MB of RAM
> > > > > and RISCV32 kernel booted from 0x80400000 will waste 4MB of RAM.
> > > > >
> > > > > This patch re-writes the initial pagetable setup code to allow booting
> > > > > RISV32 and RISCV64 kernel from any 4KB (i.e. PAGE_SIZE) aligned address.
> > > > >
> > > > > To achieve this:
> > > > > 1. We map kernel, dtb and only some amount of RAM (few MBs) using 4KB
> > > > > mappings in setup_vm() (called from head.S)
> > > > > 2. Once we reach paging_init() (called from setup_arch()) after
> > > > > memblock setup, we map all available memory banks using 4KB
> > > > > mappings and memblock APIs.
> > > >
> > > > I'm not really familiar with RISC-V, but my guess would be that you'd get
> > > > worse TLB performance with 4KB mappings. Not mentioning the amount of
> > > > memory required for the page table itself.
> > >
> > > I agree we will see a hit in TLB performance due to 4KB mappings.
> > >
> > > To address this we can create, 2MB (or 4MB on 32bit system) mappings
> > > whenever load_pa is aligned to it otherwise we prefer 4KB mappings. In other
> > > words, we create bigger mappings whenever possible and fallback to 4KB
> > > mappings when not possible.
> > >
> > > This way if kernel is booted from 2MB (or 4MB) aligned address then we will
> > > see good TLB performance for kernel addresses. Also, users are still free to
> > > boot Linux RISC-V kernel from any 4KB aligned address.
> > >
> > > Of course, we will have to document this as part of Linux RISC-V booting
> > > requirements under Documentation/ (which does not exist currently).
> > >
> > > >
> > > > If the only goal is to utilize the physical memory below the kernel, it
> > > > simply should not be reserved at the first place, something like:
> > >
> > > Well, our goal was two-fold:
> > >
> > > 1. We wanted to unify boot-time alignment requirements for 32bit and
> > > 64bit RISC-V systems
> >
> > Can't they both start from 4MB aligned address provided the memory below
> > the kernel can be freed?
>
> Yes, they can both start from 4MB aligned address.
>
> >
> > > 2. Save memory by allowing users to place kernel just after the runtime
> > > firmware at starting of RAM.
> >
> > If the firmware should be alive after kernel boot, it's memory is the only
> > part that should be reserved below the kernel. Otherwise, the entire region
> > <physical memory start> - <kernel start> can be free.
> >
> > Using 4K pages for the swapper_pg_dir is quite a change and I'm not
> > convinced its really justified.
>
> I understand your concern about TLB performance and more page
> tables.
>
> Not just 2MB/4MB mappings, we should be able to create even 1GB
> mappings as well for good TLB performance.
>
> I suggest we should use best possible mapping size (4KB, 2MB, or
> 1GB) based on alignment of kernel load address. This way users can
> boot from any 4KB aligned address and setup_vm() will try to use
> biggest possible mapping size.
>
> For example, If the kernel load address is aligned to 2MB then we 2MB
> mappings bigger mappings and use fewer page tables. Same thing
> possible for 1GB mappings as well.
I still don't get why it is that important to relax alignment of the kernel
load address. Provided you can use the memory below the kernel, it really
should not matter.
> Regards,
> Anup
>
--
Sincerely yours,
Mike.
On Fri, Mar 15, 2019 at 9:28 PM Mike Rapoport <[email protected]> wrote:
>
> On Thu, Mar 14, 2019 at 11:28:32PM +0530, Anup Patel wrote:
> > On Thu, Mar 14, 2019 at 12:23 PM Mike Rapoport <[email protected]> wrote:
> > >
> > > On Thu, Mar 14, 2019 at 02:36:01AM +0530, Anup Patel wrote:
> > > > On Thu, Mar 14, 2019 at 12:01 AM Mike Rapoport <[email protected]> wrote:
> > > > >
> > > > > On Tue, Mar 12, 2019 at 10:08:22PM +0000, Anup Patel wrote:
> > > > > > Currently, we have to boot RISCV64 kernel from a 2MB aligned physical
> > > > > > address and RISCV32 kernel from a 4MB aligned physical address. This
> > > > > > constraint is because initial pagetable setup (i.e. setup_vm()) maps
> > > > > > entire RAM using hugepages (i.e. 2MB for 3-level pagetable and 4MB for
> > > > > > 2-level pagetable).
> > > > > >
> > > > > > Further, the above booting contraint also results in memory wastage
> > > > > > because if we boot kernel from some <xyz> address (which is not same as
> > > > > > RAM start address) then RISCV kernel will map PAGE_OFFSET virtual address
> > > > > > lineraly to <xyz> physical address and memory between RAM start and <xyz>
> > > > > > will be reserved/unusable.
> > > > > >
> > > > > > For example, RISCV64 kernel booted from 0x80200000 will waste 2MB of RAM
> > > > > > and RISCV32 kernel booted from 0x80400000 will waste 4MB of RAM.
> > > > > >
> > > > > > This patch re-writes the initial pagetable setup code to allow booting
> > > > > > RISV32 and RISCV64 kernel from any 4KB (i.e. PAGE_SIZE) aligned address.
> > > > > >
> > > > > > To achieve this:
> > > > > > 1. We map kernel, dtb and only some amount of RAM (few MBs) using 4KB
> > > > > > mappings in setup_vm() (called from head.S)
> > > > > > 2. Once we reach paging_init() (called from setup_arch()) after
> > > > > > memblock setup, we map all available memory banks using 4KB
> > > > > > mappings and memblock APIs.
> > > > >
> > > > > I'm not really familiar with RISC-V, but my guess would be that you'd get
> > > > > worse TLB performance with 4KB mappings. Not mentioning the amount of
> > > > > memory required for the page table itself.
> > > >
> > > > I agree we will see a hit in TLB performance due to 4KB mappings.
> > > >
> > > > To address this we can create, 2MB (or 4MB on 32bit system) mappings
> > > > whenever load_pa is aligned to it otherwise we prefer 4KB mappings. In other
> > > > words, we create bigger mappings whenever possible and fallback to 4KB
> > > > mappings when not possible.
> > > >
> > > > This way if kernel is booted from 2MB (or 4MB) aligned address then we will
> > > > see good TLB performance for kernel addresses. Also, users are still free to
> > > > boot Linux RISC-V kernel from any 4KB aligned address.
> > > >
> > > > Of course, we will have to document this as part of Linux RISC-V booting
> > > > requirements under Documentation/ (which does not exist currently).
> > > >
> > > > >
> > > > > If the only goal is to utilize the physical memory below the kernel, it
> > > > > simply should not be reserved at the first place, something like:
> > > >
> > > > Well, our goal was two-fold:
> > > >
> > > > 1. We wanted to unify boot-time alignment requirements for 32bit and
> > > > 64bit RISC-V systems
> > >
> > > Can't they both start from 4MB aligned address provided the memory below
> > > the kernel can be freed?
> >
> > Yes, they can both start from 4MB aligned address.
> >
> > >
> > > > 2. Save memory by allowing users to place kernel just after the runtime
> > > > firmware at starting of RAM.
> > >
> > > If the firmware should be alive after kernel boot, it's memory is the only
> > > part that should be reserved below the kernel. Otherwise, the entire region
> > > <physical memory start> - <kernel start> can be free.
> > >
> > > Using 4K pages for the swapper_pg_dir is quite a change and I'm not
> > > convinced its really justified.
> >
> > I understand your concern about TLB performance and more page
> > tables.
> >
> > Not just 2MB/4MB mappings, we should be able to create even 1GB
> > mappings as well for good TLB performance.
> >
> > I suggest we should use best possible mapping size (4KB, 2MB, or
> > 1GB) based on alignment of kernel load address. This way users can
> > boot from any 4KB aligned address and setup_vm() will try to use
> > biggest possible mapping size.
> >
> > For example, If the kernel load address is aligned to 2MB then we 2MB
> > mappings bigger mappings and use fewer page tables. Same thing
> > possible for 1GB mappings as well.
>
> I still don't get why it is that important to relax alignment of the kernel
> load address. Provided you can use the memory below the kernel, it really
> should not matter.
The original idea was just to relax the alignment constraint on the kernel
load address.
What I am suggesting now is to improve this patch so that we can
dynamically select mapping size based on kernel load address. This
will achieve both:
1. Relaxed constraint on kernel load address
2. Better TLB performance whenever possible
Regards,
Anup
On Fri, Mar 15, 2019 at 9:28 PM Mike Rapoport <[email protected]> wrote:
>
> On Thu, Mar 14, 2019 at 11:28:32PM +0530, Anup Patel wrote:
> > On Thu, Mar 14, 2019 at 12:23 PM Mike Rapoport <[email protected]> wrote:
> > >
> > > On Thu, Mar 14, 2019 at 02:36:01AM +0530, Anup Patel wrote:
> > > > On Thu, Mar 14, 2019 at 12:01 AM Mike Rapoport <[email protected]> wrote:
> > > > >
> > > > > On Tue, Mar 12, 2019 at 10:08:22PM +0000, Anup Patel wrote:
> > > > > > Currently, we have to boot RISCV64 kernel from a 2MB aligned physical
> > > > > > address and RISCV32 kernel from a 4MB aligned physical address. This
> > > > > > constraint is because initial pagetable setup (i.e. setup_vm()) maps
> > > > > > entire RAM using hugepages (i.e. 2MB for 3-level pagetable and 4MB for
> > > > > > 2-level pagetable).
> > > > > >
> > > > > > Further, the above booting contraint also results in memory wastage
> > > > > > because if we boot kernel from some <xyz> address (which is not same as
> > > > > > RAM start address) then RISCV kernel will map PAGE_OFFSET virtual address
> > > > > > lineraly to <xyz> physical address and memory between RAM start and <xyz>
> > > > > > will be reserved/unusable.
> > > > > >
> > > > > > For example, RISCV64 kernel booted from 0x80200000 will waste 2MB of RAM
> > > > > > and RISCV32 kernel booted from 0x80400000 will waste 4MB of RAM.
> > > > > >
> > > > > > This patch re-writes the initial pagetable setup code to allow booting
> > > > > > RISV32 and RISCV64 kernel from any 4KB (i.e. PAGE_SIZE) aligned address.
> > > > > >
> > > > > > To achieve this:
> > > > > > 1. We map kernel, dtb and only some amount of RAM (few MBs) using 4KB
> > > > > > mappings in setup_vm() (called from head.S)
> > > > > > 2. Once we reach paging_init() (called from setup_arch()) after
> > > > > > memblock setup, we map all available memory banks using 4KB
> > > > > > mappings and memblock APIs.
> > > > >
> > > > > I'm not really familiar with RISC-V, but my guess would be that you'd get
> > > > > worse TLB performance with 4KB mappings. Not mentioning the amount of
> > > > > memory required for the page table itself.
> > > >
> > > > I agree we will see a hit in TLB performance due to 4KB mappings.
> > > >
> > > > To address this we can create, 2MB (or 4MB on 32bit system) mappings
> > > > whenever load_pa is aligned to it otherwise we prefer 4KB mappings. In other
> > > > words, we create bigger mappings whenever possible and fallback to 4KB
> > > > mappings when not possible.
> > > >
> > > > This way if kernel is booted from 2MB (or 4MB) aligned address then we will
> > > > see good TLB performance for kernel addresses. Also, users are still free to
> > > > boot Linux RISC-V kernel from any 4KB aligned address.
> > > >
> > > > Of course, we will have to document this as part of Linux RISC-V booting
> > > > requirements under Documentation/ (which does not exist currently).
> > > >
> > > > >
> > > > > If the only goal is to utilize the physical memory below the kernel, it
> > > > > simply should not be reserved at the first place, something like:
> > > >
> > > > Well, our goal was two-fold:
> > > >
> > > > 1. We wanted to unify boot-time alignment requirements for 32bit and
> > > > 64bit RISC-V systems
> > >
> > > Can't they both start from 4MB aligned address provided the memory below
> > > the kernel can be freed?
> >
> > Yes, they can both start from 4MB aligned address.
> >
> > >
> > > > 2. Save memory by allowing users to place kernel just after the runtime
> > > > firmware at starting of RAM.
> > >
> > > If the firmware should be alive after kernel boot, it's memory is the only
> > > part that should be reserved below the kernel. Otherwise, the entire region
> > > <physical memory start> - <kernel start> can be free.
> > >
> > > Using 4K pages for the swapper_pg_dir is quite a change and I'm not
> > > convinced its really justified.
> >
> > I understand your concern about TLB performance and more page
> > tables.
> >
> > Not just 2MB/4MB mappings, we should be able to create even 1GB
> > mappings as well for good TLB performance.
> >
> > I suggest we should use best possible mapping size (4KB, 2MB, or
> > 1GB) based on alignment of kernel load address. This way users can
> > boot from any 4KB aligned address and setup_vm() will try to use
> > biggest possible mapping size.
> >
> > For example, If the kernel load address is aligned to 2MB then we 2MB
> > mappings bigger mappings and use fewer page tables. Same thing
> > possible for 1GB mappings as well.
>
> I still don't get why it is that important to relax alignment of the kernel
> load address. Provided you can use the memory below the kernel, it really
> should not matter.
Irrespective of the constraint on the kernel load address, we certainly need
to allow the memory below the kernel to be usable, but that's a separate change.
Currently, the memory below kernel is ignored by
early_init_dt_add_memory_arch() in
drivers/of/fdt.c
Regards,
Anup
On Fri, Mar 15, 2019 at 9:52 PM Anup Patel <[email protected]> wrote:
>
> On Fri, Mar 15, 2019 at 9:28 PM Mike Rapoport <[email protected]> wrote:
> >
> > On Thu, Mar 14, 2019 at 11:28:32PM +0530, Anup Patel wrote:
> > > On Thu, Mar 14, 2019 at 12:23 PM Mike Rapoport <[email protected]> wrote:
> > > >
> > > > On Thu, Mar 14, 2019 at 02:36:01AM +0530, Anup Patel wrote:
> > > > > On Thu, Mar 14, 2019 at 12:01 AM Mike Rapoport <[email protected]> wrote:
> > > > > >
> > > > > > On Tue, Mar 12, 2019 at 10:08:22PM +0000, Anup Patel wrote:
> > > > > > > Currently, we have to boot RISCV64 kernel from a 2MB aligned physical
> > > > > > > address and RISCV32 kernel from a 4MB aligned physical address. This
> > > > > > > constraint is because initial pagetable setup (i.e. setup_vm()) maps
> > > > > > > entire RAM using hugepages (i.e. 2MB for 3-level pagetable and 4MB for
> > > > > > > 2-level pagetable).
> > > > > > >
> > > > > > > Further, the above booting contraint also results in memory wastage
> > > > > > > because if we boot kernel from some <xyz> address (which is not same as
> > > > > > > RAM start address) then RISCV kernel will map PAGE_OFFSET virtual address
> > > > > > > lineraly to <xyz> physical address and memory between RAM start and <xyz>
> > > > > > > will be reserved/unusable.
> > > > > > >
> > > > > > > For example, RISCV64 kernel booted from 0x80200000 will waste 2MB of RAM
> > > > > > > and RISCV32 kernel booted from 0x80400000 will waste 4MB of RAM.
> > > > > > >
> > > > > > > This patch re-writes the initial pagetable setup code to allow booting
> > > > > > > RISV32 and RISCV64 kernel from any 4KB (i.e. PAGE_SIZE) aligned address.
> > > > > > >
> > > > > > > To achieve this:
> > > > > > > 1. We map kernel, dtb and only some amount of RAM (few MBs) using 4KB
> > > > > > > mappings in setup_vm() (called from head.S)
> > > > > > > 2. Once we reach paging_init() (called from setup_arch()) after
> > > > > > > memblock setup, we map all available memory banks using 4KB
> > > > > > > mappings and memblock APIs.
> > > > > >
> > > > > > I'm not really familiar with RISC-V, but my guess would be that you'd get
> > > > > > worse TLB performance with 4KB mappings. Not mentioning the amount of
> > > > > > memory required for the page table itself.
> > > > >
> > > > > I agree we will see a hit in TLB performance due to 4KB mappings.
> > > > >
> > > > > To address this we can create, 2MB (or 4MB on 32bit system) mappings
> > > > > whenever load_pa is aligned to it otherwise we prefer 4KB mappings. In other
> > > > > words, we create bigger mappings whenever possible and fallback to 4KB
> > > > > mappings when not possible.
> > > > >
> > > > > This way if kernel is booted from 2MB (or 4MB) aligned address then we will
> > > > > see good TLB performance for kernel addresses. Also, users are still free to
> > > > > boot Linux RISC-V kernel from any 4KB aligned address.
> > > > >
> > > > > Of course, we will have to document this as part of Linux RISC-V booting
> > > > > requirements under Documentation/ (which does not exist currently).
> > > > >
> > > > > >
> > > > > > If the only goal is to utilize the physical memory below the kernel, it
> > > > > > simply should not be reserved at the first place, something like:
> > > > >
> > > > > Well, our goal was two-fold:
> > > > >
> > > > > 1. We wanted to unify boot-time alignment requirements for 32bit and
> > > > > 64bit RISC-V systems
> > > >
> > > > Can't they both start from 4MB aligned address provided the memory below
> > > > the kernel can be freed?
> > >
> > > Yes, they can both start from 4MB aligned address.
> > >
> > > >
> > > > > 2. Save memory by allowing users to place kernel just after the runtime
> > > > > firmware at starting of RAM.
> > > >
> > > > If the firmware should be alive after kernel boot, it's memory is the only
> > > > part that should be reserved below the kernel. Otherwise, the entire region
> > > > <physical memory start> - <kernel start> can be free.
> > > >
> > > > Using 4K pages for the swapper_pg_dir is quite a change and I'm not
> > > > convinced its really justified.
> > >
> > > I understand your concern about TLB performance and more page
> > > tables.
> > >
> > > Not just 2MB/4MB mappings, we should be able to create even 1GB
> > > mappings as well for good TLB performance.
> > >
> > > I suggest we should use best possible mapping size (4KB, 2MB, or
> > > 1GB) based on alignment of kernel load address. This way users can
> > > boot from any 4KB aligned address and setup_vm() will try to use
> > > biggest possible mapping size.
> > >
> > > For example, If the kernel load address is aligned to 2MB then we 2MB
> > > mappings bigger mappings and use fewer page tables. Same thing
> > > possible for 1GB mappings as well.
> >
> > I still don't get why it is that important to relax alignment of the kernel
> > load address. Provided you can use the memory below the kernel, it really
> > should not matter.
>
> Irrespective to constraint on kernel load address, we certainly need
> to allow memory below kernel to be usable but that's a separate change.
>
> Currently, the memory below kernel is ignored by
> early_init_dt_add_memory_arch() in
> drivers/of/fdt.c
>
I explored the possibility of re-claiming memory below kernel but then
we have an issue in this case.
For RISC-V kernel, PAGE_OFFSET is mapped to kernel load address
(i.e. load_pa in this code). The va_pa_offset is based on load_pa so linear
conversion of VA-to-PA and PA-to-VA won't be possible on the memory
below the kernel. I guess this is why early_init_dt_add_memory_arch() is
marking the memory below the kernel as reserved. Is there a better way to do it?
We started exploring ways to re-claim memory below kernel because
we are trying to get Linux working on Kendryte K210 board
(https://kendryte.com/). This board has dual-core 64bit RISC-V but it
only has 8MB RAM.
Regards,
Anup
On Sat, Mar 16, 2019 at 04:55:30AM +0530, Anup Patel wrote:
> On Fri, Mar 15, 2019 at 9:52 PM Anup Patel <[email protected]> wrote:
> >
> > On Fri, Mar 15, 2019 at 9:28 PM Mike Rapoport <[email protected]> wrote:
> > >
> > > I still don't get why it is that important to relax alignment of the kernel
> > > load address. Provided you can use the memory below the kernel, it really
> > > should not matter.
> >
> > Irrespective to constraint on kernel load address, we certainly need
> > to allow memory below kernel to be usable but that's a separate change.
> >
> > Currently, the memory below kernel is ignored by
> > early_init_dt_add_memory_arch() in
> > drivers/of/fdt.c
> >
>
> I explored the possibility of re-claiming memory below kernel but then
> we have an issue in this case.
>
> For RISC-V kernel, PAGE_OFFSET is mapped to kernel load address
> (i.e. load_pa in this code). The va_pa_offset is based on load_pa so linear
> conversion of VA-to-PA and PA-to-VA won't be possible on the memory
> below kernel. I guess this is why early_init_dt_add_memory_arch() is
> marking memory below kernel as reserved. Is there better way to do it??
>
> We started exploring ways to re-claim memory below kernel because
> we are trying to get Linux working on Kendryte K210 board
> (https://kendryte.com/). This board has dual-core 64bit RISC-V but it
> only has 8MB RAM.
Huh, 8MB of RAM is tough...
It is possible to use the memory below the kernel, e.g x86-64 does that.
But it is definitely a separate change and with such RAM diet using 4K
pages seems unavoidable.
I still have concern about using 4K pages whenever the load address is not
2M (4M) aligned. People tend to not pay enough attention to such details
and they would load the kernel at an arbitrary address and get the
performance hit.
I think the default should remain as is and the ability to map the kernel
with 4K pages (and use 4K aligned load address) should be a Kconfig option.
Another thing I'd like to suggest is to completely split swapper_pg_dir
initialization from setup_vm() and keep this function solely for
initialization of the trampoline_pg_dir. The trampoline_pg_dir can use
large pages and the memory below the kernel start can be mapped there
simply by mapping the entire large page containing the kernel start.
Then, the swapper_pg_dir setup can run with virtual memory enabled and can
have much more flexibility.
>
> Regards,
> Anup
>
--
Sincerely yours,
Mike.
On Mon, Mar 18, 2019 at 12:48 PM Mike Rapoport <[email protected]> wrote:
>
> On Sat, Mar 16, 2019 at 04:55:30AM +0530, Anup Patel wrote:
> > On Fri, Mar 15, 2019 at 9:52 PM Anup Patel <[email protected]> wrote:
> > >
> > > On Fri, Mar 15, 2019 at 9:28 PM Mike Rapoport <[email protected]> wrote:
> > > >
> > > > I still don't get why it is that important to relax alignment of the kernel
> > > > load address. Provided you can use the memory below the kernel, it really
> > > > should not matter.
> > >
> > > Irrespective to constraint on kernel load address, we certainly need
> > > to allow memory below kernel to be usable but that's a separate change.
> > >
> > > Currently, the memory below kernel is ignored by
> > > early_init_dt_add_memory_arch() in
> > > drivers/of/fdt.c
> > >
> >
> > I explored the possibility of re-claiming memory below kernel but then
> > we have an issue in this case.
> >
> > For RISC-V kernel, PAGE_OFFSET is mapped to kernel load address
> > (i.e. load_pa in this code). The va_pa_offset is based on load_pa so linear
> > conversion of VA-to-PA and PA-to-VA won't be possible on the memory
> > below kernel. I guess this is why early_init_dt_add_memory_arch() is
> > marking memory below kernel as reserved. Is there better way to do it??
> >
> > We started exploring ways to re-claim memory below kernel because
> > we are trying to get Linux working on Kendryte K210 board
> > (https://kendryte.com/). This board has dual-core 64bit RISC-V but it
> > only has 8MB RAM.
>
> Huh, 8MB of RAM is tough...
>
> It is possible to use the memory below the kernel, e.g x86-64 does that.
> But it is definitely a separate change and with such RAM diet using 4K
> pages seems unavoidable.
>
> I still have concern about using 4K pages whenever the load address is not
> 2M (4M) aligned. People tend to not pay enough attention to such details
> and they would load the kernel at an arbitrary address and get the
> performance hit.
>
> I think the default should remain as is and the ability to map the kernel
> with 4K pages (and use 4K aligned load address) should be a Kconfig option.
I agree people will tend not to pay attention to the load address alignment,
but this is also possible with the current approach. Currently, if a user boots the
kernel from any non-2M aligned address then we don't see any prints at all, which
lets users think it is a kernel bug. In fact, I have made the same mistake a couple
of times.
Another approach (apart from a Kconfig option) would be to throw a big-fat
warning when users boot the kernel from a 4K aligned load address; this way
at least the kernel boots instead of showing no prints. Your thoughts?
>
> Another thing I'd like to suggest is to completely split swapper_pg_dir
> initialization from setup_vm() and keep this function solely for
> initialization of the trampoline_pg_dir. The trampoline_pg_dir can use
> large pages and the memory below the kernel start can be mapped there
> simply by mapping the entire large page containing the kernel start.
> Then, the swapper_pg_dir setup can run with virtual memory enabled and can
> have much more flexibility.
Sure, this is a good suggestion. I will add this as separate patch in this
series.
Regards,
Anup
On Mon, 18 Mar 2019, Mike Rapoport wrote:
> On Sat, Mar 16, 2019 at 04:55:30AM +0530, Anup Patel wrote:
>
> > We started exploring ways to re-claim memory below kernel because
> > we are trying to get Linux working on Kendryte K210 board
> > (https://kendryte.com/). This board has dual-core 64bit RISC-V but it
> > only has 8MB RAM.
>
> Huh, 8MB of RAM is tough...
>
> It is possible to use the memory below the kernel, e.g x86-64 does that.
> But it is definitely a separate change and with such RAM diet using 4K
> pages seems unavoidable.
>
> I still have concern about using 4K pages whenever the load address is not
> 2M (4M) aligned. People tend to not pay enough attention to such details
> and they would load the kernel at an arbitrary address and get the
> performance hit.
>
> I think the default should remain as is and the ability to map the kernel
> with 4K pages (and use 4K aligned load address) should be a Kconfig option.
Agreed. That Kconfig parameter should also be default-off.
Only a small number of people will try to run RISC-V Linux on a Kendryte
board. That niche use-case shouldn't impact the much larger group of
people who will run Linux on more reasonably-sized systems. No one should
need to ask people to report their kernel load address whenever someone
reports a performance regression.
- Paul
On Mon, Mar 18, 2019 at 06:46:18PM +0530, Anup Patel wrote:
> On Mon, Mar 18, 2019 at 12:48 PM Mike Rapoport <[email protected]> wrote:
> >
> > On Sat, Mar 16, 2019 at 04:55:30AM +0530, Anup Patel wrote:
> > > On Fri, Mar 15, 2019 at 9:52 PM Anup Patel <[email protected]> wrote:
> > > >
> > > > On Fri, Mar 15, 2019 at 9:28 PM Mike Rapoport <[email protected]> wrote:
> > > > >
> > > > > I still don't get why it is that important to relax alignment of the kernel
> > > > > load address. Provided you can use the memory below the kernel, it really
> > > > > should not matter.
> > > >
> > > > Irrespective to constraint on kernel load address, we certainly need
> > > > to allow memory below kernel to be usable but that's a separate change.
> > > >
> > > > Currently, the memory below kernel is ignored by
> > > > early_init_dt_add_memory_arch() in
> > > > drivers/of/fdt.c
> > > >
> > >
> > > I explored the possibility of re-claiming memory below kernel but then
> > > we have an issue in this case.
> > >
> > > For RISC-V kernel, PAGE_OFFSET is mapped to kernel load address
> > > (i.e. load_pa in this code). The va_pa_offset is based on load_pa so linear
> > > conversion of VA-to-PA and PA-to-VA won't be possible on the memory
> > > below kernel. I guess this is why early_init_dt_add_memory_arch() is
> > > marking memory below kernel as reserved. Is there better way to do it??
> > >
> > > We started exploring ways to re-claim memory below kernel because
> > > we are trying to get Linux working on Kendryte K210 board
> > > (https://kendryte.com/). This board has dual-core 64bit RISC-V but it
> > > only has 8MB RAM.
> >
> > Huh, 8MB of RAM is tough...
> >
> > It is possible to use the memory below the kernel, e.g x86-64 does that.
> > But it is definitely a separate change and with such RAM diet using 4K
> > pages seems unavoidable.
> >
> > I still have concern about using 4K pages whenever the load address is not
> > 2M (4M) aligned. People tend to not pay enough attention to such details
> > and they would load the kernel at an arbitrary address and get the
> > performance hit.
> >
> > I think the default should remain as is and the ability to map the kernel
> > with 4K pages (and use 4K aligned load address) should be a Kconfig option.
>
> I agree people will tend to not pay attention on the load address alignment
> but this is also possible with current approach. Currently, if user boots kernel
> form any non-2M aligned address then we don't see any prints at all which
> let's users think it to be kernel bug. In fact, I have done same mistake couple
> of times.
>
> Another approach (apart from kconfig option) would be to throw big-fat
> warning when users boot kernel form 4K aligned load address this way
> atleast kernel boots instead of no prints. Your thoughts??
That should be panic() rather than warning. If the trampoline_pg_dir will
map everything, it can be emitted during the initialization of
swapper_pg_dir.
> >
> > Another thing I'd like to suggest is to completely split swapper_pg_dir
> > initialization from setup_vm() and keep this function solely for
> > initialization of the trampoline_pg_dir. The trampoline_pg_dir can use
> > large pages and the memory below the kernel start can be mapped there
> > simply by mapping the entire large page containing the kernel start.
> > Then, the swapper_pg_dir setup can run with virtual memory enabled and can
> > have much more flexibility.
>
> Sure, this is a good suggestion. I will add this as separate patch in this
> series.
> Regards,
> Anup
>
--
Sincerely yours,
Mike.
On Tue, 12 Mar 2019 15:08:12 PDT (-0700), Anup Patel wrote:
> This patch adds rv32_defconfig for 32bit systems. The only
> difference between rv32_defconfig and defconfig is that
> rv32_defconfig has CONFIG_ARCH_RV32I=y.
Thanks. I think it makes sense to have this in 5.1 so I'm going to take it
into the next RC.
>
> Signed-off-by: Anup Patel <[email protected]>
> ---
> arch/riscv/configs/rv32_defconfig | 84 +++++++++++++++++++++++++++++++
> 1 file changed, 84 insertions(+)
> create mode 100644 arch/riscv/configs/rv32_defconfig
>
> diff --git a/arch/riscv/configs/rv32_defconfig b/arch/riscv/configs/rv32_defconfig
> new file mode 100644
> index 000000000000..1a911ed8e772
> --- /dev/null
> +++ b/arch/riscv/configs/rv32_defconfig
> @@ -0,0 +1,84 @@
> +CONFIG_SYSVIPC=y
> +CONFIG_POSIX_MQUEUE=y
> +CONFIG_IKCONFIG=y
> +CONFIG_IKCONFIG_PROC=y
> +CONFIG_CGROUPS=y
> +CONFIG_CGROUP_SCHED=y
> +CONFIG_CFS_BANDWIDTH=y
> +CONFIG_CGROUP_BPF=y
> +CONFIG_NAMESPACES=y
> +CONFIG_USER_NS=y
> +CONFIG_CHECKPOINT_RESTORE=y
> +CONFIG_BLK_DEV_INITRD=y
> +CONFIG_EXPERT=y
> +CONFIG_BPF_SYSCALL=y
> +CONFIG_ARCH_RV32I=y
> +CONFIG_SMP=y
> +CONFIG_MODULES=y
> +CONFIG_MODULE_UNLOAD=y
> +CONFIG_NET=y
> +CONFIG_PACKET=y
> +CONFIG_UNIX=y
> +CONFIG_INET=y
> +CONFIG_IP_MULTICAST=y
> +CONFIG_IP_ADVANCED_ROUTER=y
> +CONFIG_IP_PNP=y
> +CONFIG_IP_PNP_DHCP=y
> +CONFIG_IP_PNP_BOOTP=y
> +CONFIG_IP_PNP_RARP=y
> +CONFIG_NETLINK_DIAG=y
> +CONFIG_PCI=y
> +CONFIG_PCIEPORTBUS=y
> +CONFIG_PCI_HOST_GENERIC=y
> +CONFIG_PCIE_XILINX=y
> +CONFIG_DEVTMPFS=y
> +CONFIG_BLK_DEV_LOOP=y
> +CONFIG_VIRTIO_BLK=y
> +CONFIG_BLK_DEV_SD=y
> +CONFIG_BLK_DEV_SR=y
> +CONFIG_ATA=y
> +CONFIG_SATA_AHCI=y
> +CONFIG_SATA_AHCI_PLATFORM=y
> +CONFIG_NETDEVICES=y
> +CONFIG_VIRTIO_NET=y
> +CONFIG_MACB=y
> +CONFIG_E1000E=y
> +CONFIG_R8169=y
> +CONFIG_MICROSEMI_PHY=y
> +CONFIG_INPUT_MOUSEDEV=y
> +CONFIG_SERIAL_8250=y
> +CONFIG_SERIAL_8250_CONSOLE=y
> +CONFIG_SERIAL_OF_PLATFORM=y
> +CONFIG_SERIAL_EARLYCON_RISCV_SBI=y
> +CONFIG_HVC_RISCV_SBI=y
> +# CONFIG_PTP_1588_CLOCK is not set
> +CONFIG_DRM=y
> +CONFIG_DRM_RADEON=y
> +CONFIG_FRAMEBUFFER_CONSOLE=y
> +CONFIG_USB=y
> +CONFIG_USB_XHCI_HCD=y
> +CONFIG_USB_XHCI_PLATFORM=y
> +CONFIG_USB_EHCI_HCD=y
> +CONFIG_USB_EHCI_HCD_PLATFORM=y
> +CONFIG_USB_OHCI_HCD=y
> +CONFIG_USB_OHCI_HCD_PLATFORM=y
> +CONFIG_USB_STORAGE=y
> +CONFIG_USB_UAS=y
> +CONFIG_VIRTIO_MMIO=y
> +CONFIG_SIFIVE_PLIC=y
> +CONFIG_EXT4_FS=y
> +CONFIG_EXT4_FS_POSIX_ACL=y
> +CONFIG_AUTOFS4_FS=y
> +CONFIG_MSDOS_FS=y
> +CONFIG_VFAT_FS=y
> +CONFIG_TMPFS=y
> +CONFIG_TMPFS_POSIX_ACL=y
> +CONFIG_NFS_FS=y
> +CONFIG_NFS_V4=y
> +CONFIG_NFS_V4_1=y
> +CONFIG_NFS_V4_2=y
> +CONFIG_ROOT_NFS=y
> +CONFIG_CRYPTO_USER_API_HASH=y
> +CONFIG_CRYPTO_DEV_VIRTIO=y
> +CONFIG_PRINTK_TIME=y
> +# CONFIG_RCU_TRACE is not set
> --
> 2.17.1
On Tue, 12 Mar 2019 15:08:16 PDT (-0700), Anup Patel wrote:
> The setup_vm() must access kernel symbols in a position independent way
> because it will be called from head.S with MMU off.
>
> If we compile kernel with cmodel=medany then PC-relative addressing will
> be used in setup_vm() to access kernel symbols so it works perfectly fine.
>
> Although, if we compile kernel with cmodel=medlow then either absolute
> addressing or PC-relative addressing (based on whichever requires fewer
> instructions) is used to access kernel symbols in setup_vm(). This can
> break setup_vm() whenever any absolute addressing is used to access
> kernel symbols.
>
> With the movement of setup_vm() from kernel/setup.c to mm/init.c, the
> setup_vm() is now broken for cmodel=medlow but it works perfectly fine
> for cmodel=medany.
>
> This patch fixes setup_vm() and makes it independent of GCC code model
> by accessing kernel symbols relative to kernel load address instead of
> assuming PC-relative addressing.
I think we ended up with a cleaner solution as 387181dcdb6c ("RISC-V: Always
compile mm/init.c with cmodel=medany and notrace"), but let me know if I missed
something here.
>
> Fixes: 6f1e9e946f0b ("RISC-V: Move setup_vm() to mm/init.c")
> Signed-off-by: Anup Patel <[email protected]>
> ---
> arch/riscv/kernel/head.S | 1 +
> arch/riscv/mm/init.c | 71 ++++++++++++++++++++++++++--------------
> 2 files changed, 47 insertions(+), 25 deletions(-)
>
> diff --git a/arch/riscv/kernel/head.S b/arch/riscv/kernel/head.S
> index fe884cd69abd..7966262b4f9d 100644
> --- a/arch/riscv/kernel/head.S
> +++ b/arch/riscv/kernel/head.S
> @@ -62,6 +62,7 @@ clear_bss_done:
>
> /* Initialize page tables and relocate to virtual addresses */
> la sp, init_thread_union + THREAD_SIZE
> + la a0, _start
> call setup_vm
> call relocate
>
> diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c
> index b379a75ac6a6..f35299f2f3d5 100644
> --- a/arch/riscv/mm/init.c
> +++ b/arch/riscv/mm/init.c
> @@ -172,55 +172,76 @@ void __set_fixmap(enum fixed_addresses idx, phys_addr_t phys, pgprot_t prot)
> }
> }
>
> -asmlinkage void __init setup_vm(void)
> +static inline void *__early_va(void *ptr, uintptr_t load_pa)
> {
> extern char _start;
> + uintptr_t va = (uintptr_t)ptr;
> + uintptr_t sz = (uintptr_t)(&_end) - (uintptr_t)(&_start);
> +
> + if (va >= PAGE_OFFSET && va < (PAGE_OFFSET + sz))
> + return (void *)(load_pa + (va - PAGE_OFFSET));
> + return (void *)va;
> +}
> +
> +asmlinkage void __init setup_vm(uintptr_t load_pa)
> +{
> uintptr_t i;
> - uintptr_t pa = (uintptr_t) &_start;
> +#ifndef __PAGETABLE_PMD_FOLDED
> + pmd_t *pmdp;
> +#endif
> + pgd_t *pgdp;
> + phys_addr_t map_pa;
> + pgprot_t tableprot = __pgprot(_PAGE_TABLE);
> pgprot_t prot = __pgprot(pgprot_val(PAGE_KERNEL) | _PAGE_EXEC);
>
> - va_pa_offset = PAGE_OFFSET - pa;
> - pfn_base = PFN_DOWN(pa);
> + va_pa_offset = PAGE_OFFSET - load_pa;
> + pfn_base = PFN_DOWN(load_pa);
>
> /* Sanity check alignment and size */
> BUG_ON((PAGE_OFFSET % PGDIR_SIZE) != 0);
> - BUG_ON((pa % (PAGE_SIZE * PTRS_PER_PTE)) != 0);
> + BUG_ON((load_pa % (PAGE_SIZE * PTRS_PER_PTE)) != 0);
>
> #ifndef __PAGETABLE_PMD_FOLDED
> - trampoline_pg_dir[(PAGE_OFFSET >> PGDIR_SHIFT) % PTRS_PER_PGD] =
> - pfn_pgd(PFN_DOWN((uintptr_t)trampoline_pmd),
> - __pgprot(_PAGE_TABLE));
> - trampoline_pmd[0] = pfn_pmd(PFN_DOWN(pa), prot);
> + pgdp = __early_va(trampoline_pg_dir, load_pa);
> + map_pa = (uintptr_t)__early_va(trampoline_pmd, load_pa);
> + pgdp[(PAGE_OFFSET >> PGDIR_SHIFT) % PTRS_PER_PGD] =
> + pfn_pgd(PFN_DOWN(map_pa), tableprot);
> + trampoline_pmd[0] = pfn_pmd(PFN_DOWN(load_pa), prot);
> +
> + pgdp = __early_va(swapper_pg_dir, load_pa);
>
> for (i = 0; i < (-PAGE_OFFSET)/PGDIR_SIZE; ++i) {
> size_t o = (PAGE_OFFSET >> PGDIR_SHIFT) % PTRS_PER_PGD + i;
>
> - swapper_pg_dir[o] =
> - pfn_pgd(PFN_DOWN((uintptr_t)swapper_pmd) + i,
> - __pgprot(_PAGE_TABLE));
> + map_pa = (uintptr_t)__early_va(swapper_pmd, load_pa);
> + pgdp[o] = pfn_pgd(PFN_DOWN(map_pa) + i, tableprot);
> }
> + pmdp = __early_va(swapper_pmd, load_pa);
> for (i = 0; i < ARRAY_SIZE(swapper_pmd); i++)
> - swapper_pmd[i] = pfn_pmd(PFN_DOWN(pa + i * PMD_SIZE), prot);
> + pmdp[i] = pfn_pmd(PFN_DOWN(load_pa + i * PMD_SIZE), prot);
>
> - swapper_pg_dir[(FIXADDR_START >> PGDIR_SHIFT) % PTRS_PER_PGD] =
> - pfn_pgd(PFN_DOWN((uintptr_t)fixmap_pmd),
> - __pgprot(_PAGE_TABLE));
> + map_pa = (uintptr_t)__early_va(fixmap_pmd, load_pa);
> + pgdp[(FIXADDR_START >> PGDIR_SHIFT) % PTRS_PER_PGD] =
> + pfn_pgd(PFN_DOWN(map_pa), tableprot);
> + pmdp = __early_va(fixmap_pmd, load_pa);
> + map_pa = (uintptr_t)__early_va(fixmap_pte, load_pa);
> fixmap_pmd[(FIXADDR_START >> PMD_SHIFT) % PTRS_PER_PMD] =
> - pfn_pmd(PFN_DOWN((uintptr_t)fixmap_pte),
> - __pgprot(_PAGE_TABLE));
> + pfn_pmd(PFN_DOWN(map_pa), tableprot);
> #else
> - trampoline_pg_dir[(PAGE_OFFSET >> PGDIR_SHIFT) % PTRS_PER_PGD] =
> - pfn_pgd(PFN_DOWN(pa), prot);
> + pgdp = __early_va(trampoline_pg_dir, load_pa);
> + pgdp[(PAGE_OFFSET >> PGDIR_SHIFT) % PTRS_PER_PGD] =
> + pfn_pgd(PFN_DOWN(load_pa), prot);
> +
> + pgdp = __early_va(swapper_pg_dir, load_pa);
>
> for (i = 0; i < (-PAGE_OFFSET)/PGDIR_SIZE; ++i) {
> size_t o = (PAGE_OFFSET >> PGDIR_SHIFT) % PTRS_PER_PGD + i;
>
> - swapper_pg_dir[o] =
> - pfn_pgd(PFN_DOWN(pa + i * PGDIR_SIZE), prot);
> + pgdp[o] = pfn_pgd(PFN_DOWN(load_pa + i * PGDIR_SIZE), prot);
> }
>
> - swapper_pg_dir[(FIXADDR_START >> PGDIR_SHIFT) % PTRS_PER_PGD] =
> - pfn_pgd(PFN_DOWN((uintptr_t)fixmap_pte),
> - __pgprot(_PAGE_TABLE));
> + map_pa = (uintptr_t)__early_va(fixmap_pte, load_pa);
> + pgdp[(FIXADDR_START >> PGDIR_SHIFT) % PTRS_PER_PGD] =
> + pfn_pgd(PFN_DOWN(map_pa), tableprot);
> #endif
> }
> --
> 2.17.1
On Tue, Apr 9, 2019 at 10:17 PM Palmer Dabbelt <[email protected]> wrote:
>
> On Tue, 12 Mar 2019 15:08:16 PDT (-0700), Anup Patel wrote:
> > The setup_vm() must access kernel symbols in a position independent way
> > because it will be called from head.S with MMU off.
> >
> > If we compile kernel with cmodel=medany then PC-relative addressing will
> > be used in setup_vm() to access kernel symbols so it works perfectly fine.
> >
> > Although, if we compile kernel with cmodel=medlow then either absolute
> > addressing or PC-relative addressing (based on whichever requires fewer
> > instructions) is used to access kernel symbols in setup_vm(). This can
> > break setup_vm() whenever any absolute addressing is used to access
> > kernel symbols.
> >
> > With the movement of setup_vm() from kernel/setup.c to mm/init.c, the
> > setup_vm() is now broken for cmodel=medlow but it works perfectly fine
> > for cmodel=medany.
> >
> > This patch fixes setup_vm() and makes it independent of GCC code model
> > by accessing kernel symbols relative to kernel load address instead of
> > assuming PC-relative addressing.
>
> I think we ended up with a cleaner solution as 387181dcdb6c ("RISC-V: Always
> compile mm/init.c with cmodel=medany and notrace"), but let me know if I missed
> something here.
Yes, please ignore this patch. I have dropped it in latest version of
this patch series.
Regards,
Anup
On Tue, Apr 9, 2019 at 10:14 PM Palmer Dabbelt <[email protected]> wrote:
>
> On Tue, 12 Mar 2019 15:08:12 PDT (-0700), Anup Patel wrote:
> > This patch adds rv32_defconfig for 32bit systems. The only
> > difference between rv32_defconfig and defconfig is that
> > rv32_defconfig has CONFIG_ARCH_RV32I=y.
>
> Thanks. I think it makes sense to have this in 5.1 so I'm going to take it
> into the next RC.
Thanks, Palmer.
Can you consider "[PATCH v3] RISC-V: Fix Maximum Physical Memory 2GiB
option for 64bit systems" for 5.1?
Refer, https://patchwork.kernel.org/patch/10886895/
The above patch is also required for 64bit system with more than 128GiB memory
(i.e. server-class system).
We can remove "Maximum Physical Memory 2GiB" option as separate patch.
Regards,
Anup
Στις 2019-03-13 00:08, Anup Patel έγραψε:
> Currently, we have to boot RISCV64 kernel from a 2MB aligned physical
> address and RISCV32 kernel from a 4MB aligned physical address. This
> constraint is because initial pagetable setup (i.e. setup_vm()) maps
> entire RAM using hugepages (i.e. 2MB for 3-level pagetable and 4MB for
> 2-level pagetable).
>
> Further, the above booting constraint also results in memory wastage
> because if we boot kernel from some <xyz> address (which is not same as
> RAM start address) then RISCV kernel will map PAGE_OFFSET virtual
> address
> linearly to <xyz> physical address and memory between RAM start and
> <xyz>
> will be reserved/unusable.
>
> For example, RISCV64 kernel booted from 0x80200000 will waste 2MB of
> RAM
> and RISCV32 kernel booted from 0x80400000 will waste 4MB of RAM.
>
> This patch re-writes the initial pagetable setup code to allow booting
> RISCV32 and RISCV64 kernel from any 4KB (i.e. PAGE_SIZE) aligned
> address.
>
> To achieve this:
> 1. We map kernel, dtb and only some amount of RAM (few MBs) using 4KB
> mappings in setup_vm() (called from head.S)
> 2. Once we reach paging_init() (called from setup_arch()) after
> memblock setup, we map all available memory banks using 4KB
> mappings and memblock APIs.
>
> With this patch in-place, the booting constraint for RISCV32 and
> RISCV64
> kernel is much more relaxed and we can now boot kernel very close to
> RAM start thereby minimizing memory wastage.
>
> Signed-off-by: Anup Patel <[email protected]>
> ---
> arch/riscv/include/asm/fixmap.h | 5 +
> arch/riscv/include/asm/pgtable-64.h | 5 +
> arch/riscv/include/asm/pgtable.h | 6 +-
> arch/riscv/kernel/head.S | 1 +
> arch/riscv/kernel/setup.c | 4 +-
> arch/riscv/mm/init.c | 357 +++++++++++++++++++++++-----
> 6 files changed, 317 insertions(+), 61 deletions(-)
>
> diff --git a/arch/riscv/include/asm/fixmap.h
> b/arch/riscv/include/asm/fixmap.h
> index 57afe604b495..5cf53dd882e5 100644
> --- a/arch/riscv/include/asm/fixmap.h
> +++ b/arch/riscv/include/asm/fixmap.h
> @@ -21,6 +21,11 @@
> */
> enum fixed_addresses {
> FIX_HOLE,
> +#define FIX_FDT_SIZE SZ_1M
> + FIX_FDT_END,
> + FIX_FDT = FIX_FDT_END + FIX_FDT_SIZE / PAGE_SIZE - 1,
> + FIX_PTE,
> + FIX_PMD,
> FIX_EARLYCON_MEM_BASE,
> __end_of_fixed_addresses
> };
> diff --git a/arch/riscv/include/asm/pgtable-64.h
> b/arch/riscv/include/asm/pgtable-64.h
> index 7aa0ea9bd8bb..56ecc3dc939d 100644
> --- a/arch/riscv/include/asm/pgtable-64.h
> +++ b/arch/riscv/include/asm/pgtable-64.h
> @@ -78,6 +78,11 @@ static inline pmd_t pfn_pmd(unsigned long pfn,
> pgprot_t prot)
> return __pmd((pfn << _PAGE_PFN_SHIFT) | pgprot_val(prot));
> }
>
> +static inline unsigned long _pmd_pfn(pmd_t pmd)
> +{
> + return pmd_val(pmd) >> _PAGE_PFN_SHIFT;
> +}
> +
> #define pmd_ERROR(e) \
> pr_err("%s:%d: bad pmd %016lx.\n", __FILE__, __LINE__, pmd_val(e))
>
> diff --git a/arch/riscv/include/asm/pgtable.h
> b/arch/riscv/include/asm/pgtable.h
> index 1141364d990e..05fa2115e736 100644
> --- a/arch/riscv/include/asm/pgtable.h
> +++ b/arch/riscv/include/asm/pgtable.h
> @@ -121,12 +121,16 @@ static inline void pmd_clear(pmd_t *pmdp)
> set_pmd(pmdp, __pmd(0));
> }
>
> -
> static inline pgd_t pfn_pgd(unsigned long pfn, pgprot_t prot)
> {
> return __pgd((pfn << _PAGE_PFN_SHIFT) | pgprot_val(prot));
> }
>
> +static inline unsigned long _pgd_pfn(pgd_t pgd)
> +{
> + return pgd_val(pgd) >> _PAGE_PFN_SHIFT;
> +}
> +
> #define pgd_index(addr) (((addr) >> PGDIR_SHIFT) & (PTRS_PER_PGD - 1))
>
> /* Locate an entry in the page global directory */
> diff --git a/arch/riscv/kernel/head.S b/arch/riscv/kernel/head.S
> index 7966262b4f9d..12a3ec5eb8ab 100644
> --- a/arch/riscv/kernel/head.S
> +++ b/arch/riscv/kernel/head.S
> @@ -63,6 +63,7 @@ clear_bss_done:
> /* Initialize page tables and relocate to virtual addresses */
> la sp, init_thread_union + THREAD_SIZE
> la a0, _start
> + mv a1, s1
> call setup_vm
> call relocate
>
> diff --git a/arch/riscv/kernel/setup.c b/arch/riscv/kernel/setup.c
> index ecb654f6a79e..acdd0f74982b 100644
> --- a/arch/riscv/kernel/setup.c
> +++ b/arch/riscv/kernel/setup.c
> @@ -30,6 +30,7 @@
> #include <linux/sched/task.h>
> #include <linux/swiotlb.h>
>
> +#include <asm/fixmap.h>
> #include <asm/setup.h>
> #include <asm/sections.h>
> #include <asm/pgtable.h>
> @@ -62,7 +63,8 @@ unsigned long boot_cpu_hartid;
>
> void __init parse_dtb(unsigned int hartid, void *dtb)
> {
> - if (early_init_dt_scan(__va(dtb)))
> + dtb = (void *)fix_to_virt(FIX_FDT) + ((uintptr_t)dtb & ~PAGE_MASK);
> + if (early_init_dt_scan(dtb))
> return;
>
> pr_err("No DTB passed to the kernel\n");
> diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c
> index f35299f2f3d5..ee55a4b90dec 100644
> --- a/arch/riscv/mm/init.c
> +++ b/arch/riscv/mm/init.c
> @@ -1,14 +1,7 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> /*
> + * Copyright (C) 2019 Western Digital Corporation or its affiliates.
> * Copyright (C) 2012 Regents of the University of California
> - *
> - * This program is free software; you can redistribute it and/or
> - * modify it under the terms of the GNU General Public License
> - * as published by the Free Software Foundation, version 2.
> - *
> - * This program is distributed in the hope that it will be useful,
> - * but WITHOUT ANY WARRANTY; without even the implied warranty of
> - * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
> - * GNU General Public License for more details.
> */
>
> #include <linux/init.h>
> @@ -43,13 +36,6 @@ void setup_zero_page(void)
> memset((void *)empty_zero_page, 0, PAGE_SIZE);
> }
>
> -void __init paging_init(void)
> -{
> - setup_zero_page();
> - local_flush_tlb_all();
> - zone_sizes_init();
> -}
> -
> void __init mem_init(void)
> {
> #ifdef CONFIG_FLATMEM
> @@ -146,13 +132,24 @@ void __init setup_bootmem(void)
> pgd_t swapper_pg_dir[PTRS_PER_PGD] __page_aligned_bss;
> pgd_t trampoline_pg_dir[PTRS_PER_PGD] __initdata __aligned(PAGE_SIZE);
>
> +#define MAX_EARLY_MAPPING_SIZE SZ_128M
> +
> #ifndef __PAGETABLE_PMD_FOLDED
> -#define NUM_SWAPPER_PMDS ((uintptr_t)-PAGE_OFFSET >> PGDIR_SHIFT)
> -pmd_t swapper_pmd[PTRS_PER_PMD*((-PAGE_OFFSET)/PGDIR_SIZE)]
> __page_aligned_bss;
> -pmd_t trampoline_pmd[PTRS_PER_PGD] __initdata __aligned(PAGE_SIZE);
> +#if MAX_EARLY_MAPPING_SIZE < PGDIR_SIZE
> +#define NUM_SWAPPER_PMDS 1UL
> +#else
> +#define NUM_SWAPPER_PMDS (MAX_EARLY_MAPPING_SIZE/PGDIR_SIZE)
> +#endif
> +pmd_t swapper_pmd[PTRS_PER_PMD*NUM_SWAPPER_PMDS] __page_aligned_bss;
> +pmd_t trampoline_pmd[PTRS_PER_PMD] __initdata __aligned(PAGE_SIZE);
> pmd_t fixmap_pmd[PTRS_PER_PMD] __page_aligned_bss;
> +#define NUM_SWAPPER_PTES (MAX_EARLY_MAPPING_SIZE/PMD_SIZE)
> +#else
> +#define NUM_SWAPPER_PTES (MAX_EARLY_MAPPING_SIZE/PGDIR_SIZE)
> #endif
>
> +pte_t swapper_pte[PTRS_PER_PTE*NUM_SWAPPER_PTES] __page_aligned_bss;
> +pte_t trampoline_pte[PTRS_PER_PTE] __initdata __aligned(PAGE_SIZE);
> pte_t fixmap_pte[PTRS_PER_PTE] __page_aligned_bss;
>
> void __set_fixmap(enum fixed_addresses idx, phys_addr_t phys, pgprot_t
> prot)
> @@ -172,76 +169,318 @@ void __set_fixmap(enum fixed_addresses idx,
> phys_addr_t phys, pgprot_t prot)
> }
> }
>
> +struct mapping_ops {
> + pte_t *(*get_pte_virt)(phys_addr_t pa);
> + phys_addr_t (*alloc_pte)(uintptr_t va, uintptr_t load_pa);
> + pmd_t *(*get_pmd_virt)(phys_addr_t pa);
> + phys_addr_t (*alloc_pmd)(uintptr_t va, uintptr_t load_pa);
> +};
> +
> static inline void *__early_va(void *ptr, uintptr_t load_pa)
> {
> extern char _start;
> uintptr_t va = (uintptr_t)ptr;
> uintptr_t sz = (uintptr_t)(&_end) - (uintptr_t)(&_start);
>
> - if (va >= PAGE_OFFSET && va < (PAGE_OFFSET + sz))
> + if (va >= PAGE_OFFSET && va <= (PAGE_OFFSET + sz))
> return (void *)(load_pa + (va - PAGE_OFFSET));
> return (void *)va;
> }
>
> -asmlinkage void __init setup_vm(uintptr_t load_pa)
> +#define __early_pa(ptr, load_pa) (uintptr_t)__early_va(ptr, load_pa)
> +
> +static phys_addr_t __init final_alloc_pgtable(void)
> +{
> + return memblock_phys_alloc(PAGE_SIZE, PAGE_SIZE);
> +}
> +
> +static pte_t *__init early_get_pte_virt(phys_addr_t pa)
> {
> - uintptr_t i;
> + return (pte_t *)((uintptr_t)pa);
> +}
> +
> +static pte_t *__init final_get_pte_virt(phys_addr_t pa)
> +{
> + clear_fixmap(FIX_PTE);
> +
> + return (pte_t *)set_fixmap_offset(FIX_PTE, pa);
> +}
> +
> +static phys_addr_t __init early_alloc_pte(uintptr_t va, uintptr_t
> load_pa)
> +{
> + pte_t *base = __early_va(swapper_pte, load_pa);
> + uintptr_t pte_num = ((va - PAGE_OFFSET) >> PMD_SHIFT);
> +
> + BUG_ON(pte_num >= NUM_SWAPPER_PTES);
> +
> + return (uintptr_t)&base[pte_num * PTRS_PER_PTE];
> +}
> +
> +static phys_addr_t __init final_alloc_pte(uintptr_t va, uintptr_t
> load_pa)
> +{
> + return final_alloc_pgtable();
> +}
> +
> +static void __init create_pte_mapping(pte_t *ptep,
> + uintptr_t va, phys_addr_t pa,
> + phys_addr_t sz, pgprot_t prot)
> +{
> + uintptr_t pte_index = pte_index(va);
> +
> + BUG_ON(sz != PAGE_SIZE);
> +
> + if (pte_none(ptep[pte_index]))
> + ptep[pte_index] = pfn_pte(PFN_DOWN(pa), prot);
> +}
> +
> #ifndef __PAGETABLE_PMD_FOLDED
> +static pmd_t *__init early_get_pmd_virt(phys_addr_t pa)
> +{
> + return (pmd_t *)((uintptr_t)pa);
> +}
> +
> +static pmd_t *__init final_get_pmd_virt(phys_addr_t pa)
> +{
> + clear_fixmap(FIX_PMD);
> +
> + return (pmd_t *)set_fixmap_offset(FIX_PMD, pa);
> +}
> +
> +static phys_addr_t __init early_alloc_pmd(uintptr_t va, uintptr_t
> load_pa)
> +{
> + pmd_t *base = __early_va(swapper_pmd, load_pa);
> + uintptr_t pmd_num = (va - PAGE_OFFSET) >> PGDIR_SHIFT;
> +
> + BUG_ON(pmd_num >= NUM_SWAPPER_PMDS);
> +
> + return (uintptr_t)&base[pmd_num * PTRS_PER_PMD];
> +}
> +
> +static phys_addr_t __init final_alloc_pmd(uintptr_t va, uintptr_t
> load_pa)
> +{
> + return final_alloc_pgtable();
> +}
> +
> +static void __init create_pmd_mapping(pmd_t *pmdp,
> + uintptr_t va, phys_addr_t pa,
> + phys_addr_t sz, pgprot_t prot,
> + uintptr_t ops_load_pa,
> + struct mapping_ops *ops)
> +{
> + pte_t *ptep;
> + phys_addr_t pte_phys;
> + uintptr_t pmd_index = pmd_index(va);
> +
> + if (sz == PMD_SIZE) {
> + if (pmd_none(pmdp[pmd_index]))
> + pmdp[pmd_index] = pfn_pmd(PFN_DOWN(pa), prot);
> + return;
> + }
> +
> + if (pmd_none(pmdp[pmd_index])) {
> + pte_phys = ops->alloc_pte(va, ops_load_pa);
> + pmdp[pmd_index] = pfn_pmd(PFN_DOWN(pte_phys),
> + __pgprot(_PAGE_TABLE));
> + ptep = ops->get_pte_virt(pte_phys);
> + memset(ptep, 0, PAGE_SIZE);
> + } else {
> + pte_phys = PFN_PHYS(_pmd_pfn(pmdp[pmd_index]));
> + ptep = ops->get_pte_virt(pte_phys);
> + }
> +
> + create_pte_mapping(ptep, va, pa, sz, prot);
> +}
> +
> +static void __init create_pgd_mapping(pgd_t *pgdp,
> + uintptr_t va, phys_addr_t pa,
> + phys_addr_t sz, pgprot_t prot,
> + uintptr_t ops_load_pa,
> + struct mapping_ops *ops)
> +{
> pmd_t *pmdp;
> + phys_addr_t pmd_phys;
> + uintptr_t pgd_index = pgd_index(va);
> +
> + if (sz == PGDIR_SIZE) {
> + if (pgd_val(pgdp[pgd_index]) == 0)
> + pgdp[pgd_index] = pfn_pgd(PFN_DOWN(pa), prot);
> + return;
> + }
> +
> + if (pgd_val(pgdp[pgd_index]) == 0) {
> + pmd_phys = ops->alloc_pmd(va, ops_load_pa);
> + pgdp[pgd_index] = pfn_pgd(PFN_DOWN(pmd_phys),
> + __pgprot(_PAGE_TABLE));
> + pmdp = ops->get_pmd_virt(pmd_phys);
> + memset(pmdp, 0, PAGE_SIZE);
> + } else {
> + pmd_phys = PFN_PHYS(_pgd_pfn(pgdp[pgd_index]));
> + pmdp = ops->get_pmd_virt(pmd_phys);
> + }
> +
> + create_pmd_mapping(pmdp, va, pa, sz, prot, ops_load_pa, ops);
> +}
> +#else
> +static void __init create_pgd_mapping(pgd_t *pgdp,
> + uintptr_t va, phys_addr_t pa,
> + phys_addr_t sz, pgprot_t prot,
> + uintptr_t ops_load_pa,
> + struct mapping_ops *ops)
> +{
> + pte_t *ptep;
> + phys_addr_t pte_phys;
> + uintptr_t pgd_index = pgd_index(va);
> +
> + if (sz == PGDIR_SIZE) {
> + if (pgd_val(pgdp[pgd_index]) == 0)
> + pgdp[pgd_index] = pfn_pgd(PFN_DOWN(pa), prot);
> + return;
> + }
> +
> + if (pgd_val(pgdp[pgd_index]) == 0) {
> + pte_phys = ops->alloc_pte(va, ops_load_pa);
> + pgdp[pgd_index] = pfn_pgd(PFN_DOWN(pte_phys),
> + __pgprot(_PAGE_TABLE));
> + ptep = ops->get_pte_virt(pte_phys);
> + memset(ptep, 0, PAGE_SIZE);
> + } else {
> + pte_phys = PFN_PHYS(_pgd_pfn(pgdp[pgd_index]));
> + ptep = ops->get_pte_virt(pte_phys);
> + }
> +
> + create_pte_mapping(ptep, va, pa, sz, prot);
> +}
> #endif
> - pgd_t *pgdp;
> +
> +asmlinkage void __init setup_vm(uintptr_t load_pa, uintptr_t dtb_pa)
> +{
> phys_addr_t map_pa;
> + uintptr_t va, end_va;
> + uintptr_t load_sz = __early_pa(&_end, load_pa) - load_pa;
> pgprot_t tableprot = __pgprot(_PAGE_TABLE);
> pgprot_t prot = __pgprot(pgprot_val(PAGE_KERNEL) | _PAGE_EXEC);
> + struct mapping_ops ops;
>
> va_pa_offset = PAGE_OFFSET - load_pa;
> pfn_base = PFN_DOWN(load_pa);
>
> /* Sanity check alignment and size */
> BUG_ON((PAGE_OFFSET % PGDIR_SIZE) != 0);
> - BUG_ON((load_pa % (PAGE_SIZE * PTRS_PER_PTE)) != 0);
> + BUG_ON((load_pa % PAGE_SIZE) != 0);
> + BUG_ON(load_sz > MAX_EARLY_MAPPING_SIZE);
> +
> + /* Setup mapping ops */
> + ops.get_pte_virt = __early_va(early_get_pte_virt, load_pa);
> + ops.alloc_pte = __early_va(early_alloc_pte, load_pa);
> + ops.get_pmd_virt = NULL;
> + ops.alloc_pmd = NULL;
>
> #ifndef __PAGETABLE_PMD_FOLDED
> - pgdp = __early_va(trampoline_pg_dir, load_pa);
> - map_pa = (uintptr_t)__early_va(trampoline_pmd, load_pa);
> - pgdp[(PAGE_OFFSET >> PGDIR_SHIFT) % PTRS_PER_PGD] =
> - pfn_pgd(PFN_DOWN(map_pa), tableprot);
> - trampoline_pmd[0] = pfn_pmd(PFN_DOWN(load_pa), prot);
> + /* Update mapping ops for PMD */
> + ops.get_pmd_virt = __early_va(early_get_pmd_virt, load_pa);
> + ops.alloc_pmd = __early_va(early_alloc_pmd, load_pa);
> +
> + /* Setup trampoline PGD and PMD */
> + map_pa = __early_pa(trampoline_pmd, load_pa);
> + create_pgd_mapping(__early_va(trampoline_pg_dir, load_pa),
> + PAGE_OFFSET, map_pa, PGDIR_SIZE, tableprot,
> + load_pa, &ops);
> + map_pa = __early_pa(trampoline_pte, load_pa);
> + create_pmd_mapping(__early_va(trampoline_pmd, load_pa),
> + PAGE_OFFSET, map_pa, PMD_SIZE, tableprot,
> + load_pa, &ops);
> +
> + /* Setup swapper PGD and PMD for fixmap */
> + map_pa = __early_pa(fixmap_pmd, load_pa);
> + create_pgd_mapping(__early_va(swapper_pg_dir, load_pa),
> + FIXADDR_START, map_pa, PGDIR_SIZE, tableprot,
> + load_pa, &ops);
> + map_pa = __early_pa(fixmap_pte, load_pa);
> + create_pmd_mapping(__early_va(fixmap_pmd, load_pa),
> + FIXADDR_START, map_pa, PMD_SIZE, tableprot,
> + load_pa, &ops);
> +#else
> + /* Setup trampoline PGD */
> + map_pa = __early_pa(trampoline_pte, load_pa);
> + create_pgd_mapping(__early_va(trampoline_pg_dir, load_pa),
> + PAGE_OFFSET, map_pa, PGDIR_SIZE, tableprot,
> + load_pa, &ops);
> +
> + /* Setup swapper PGD for fixmap */
> + map_pa = __early_pa(fixmap_pte, load_pa);
> + create_pgd_mapping(__early_va(swapper_pg_dir, load_pa),
> + FIXADDR_START, map_pa, PGDIR_SIZE, tableprot,
> + load_pa, &ops);
> +#endif
>
> - pgdp = __early_va(swapper_pg_dir, load_pa);
> + /* Setup trampoline PTE */
> + end_va = PAGE_OFFSET + PAGE_SIZE*PTRS_PER_PTE;
> + for (va = PAGE_OFFSET; va < end_va; va += PAGE_SIZE)
> + create_pte_mapping(__early_va(trampoline_pte, load_pa),
> + va, load_pa + (va - PAGE_OFFSET),
> + PAGE_SIZE, prot);
> +
> + /*
> + * Setup swapper PGD covering kernel and some amount of
> + * RAM which will allow us to reach paging_init(). We map
> + * all memory banks later in setup_vm_final() below.
> + */
> + end_va = PAGE_OFFSET + load_sz;
> + for (va = PAGE_OFFSET; va < end_va; va += PAGE_SIZE)
> + create_pgd_mapping(__early_va(swapper_pg_dir, load_pa),
> + va, load_pa + (va - PAGE_OFFSET),
> + PAGE_SIZE, prot, load_pa, &ops);
> +
> + /* Create fixed mapping for early parsing of FDT */
> + end_va = __fix_to_virt(FIX_FDT) + FIX_FDT_SIZE;
> + for (va = __fix_to_virt(FIX_FDT); va < end_va; va += PAGE_SIZE)
> + create_pte_mapping(__early_va(fixmap_pte, load_pa),
> + va, dtb_pa + (va - __fix_to_virt(FIX_FDT)),
> + PAGE_SIZE, prot);
> +}
>
> - for (i = 0; i < (-PAGE_OFFSET)/PGDIR_SIZE; ++i) {
> - size_t o = (PAGE_OFFSET >> PGDIR_SHIFT) % PTRS_PER_PGD + i;
> +static void __init setup_vm_final(void)
> +{
> + phys_addr_t pa, start, end;
> + struct memblock_region *reg;
> + struct mapping_ops ops;
> + pgprot_t prot = __pgprot(pgprot_val(PAGE_KERNEL) | _PAGE_EXEC);
>
> - map_pa = (uintptr_t)__early_va(swapper_pmd, load_pa);
> - pgdp[o] = pfn_pgd(PFN_DOWN(map_pa) + i, tableprot);
> - }
> - pmdp = __early_va(swapper_pmd, load_pa);
> - for (i = 0; i < ARRAY_SIZE(swapper_pmd); i++)
> - pmdp[i] = pfn_pmd(PFN_DOWN(load_pa + i * PMD_SIZE), prot);
> -
> - map_pa = (uintptr_t)__early_va(fixmap_pmd, load_pa);
> - pgdp[(FIXADDR_START >> PGDIR_SHIFT) % PTRS_PER_PGD] =
> - pfn_pgd(PFN_DOWN(map_pa), tableprot);
> - pmdp = __early_va(fixmap_pmd, load_pa);
> - map_pa = (uintptr_t)__early_va(fixmap_pte, load_pa);
> - fixmap_pmd[(FIXADDR_START >> PMD_SHIFT) % PTRS_PER_PMD] =
> - pfn_pmd(PFN_DOWN(map_pa), tableprot);
> + /* Setup mapping ops */
> + ops.get_pte_virt = final_get_pte_virt;
> + ops.alloc_pte = final_alloc_pte;
> +#ifndef __PAGETABLE_PMD_FOLDED
> + ops.get_pmd_virt = final_get_pmd_virt;
> + ops.alloc_pmd = final_alloc_pmd;
> #else
> - pgdp = __early_va(trampoline_pg_dir, load_pa);
> - pgdp[(PAGE_OFFSET >> PGDIR_SHIFT) % PTRS_PER_PGD] =
> - pfn_pgd(PFN_DOWN(load_pa), prot);
> -
> - pgdp = __early_va(swapper_pg_dir, load_pa);
> -
> - for (i = 0; i < (-PAGE_OFFSET)/PGDIR_SIZE; ++i) {
> - size_t o = (PAGE_OFFSET >> PGDIR_SHIFT) % PTRS_PER_PGD + i;
> + ops.get_pmd_virt = NULL;
> + ops.alloc_pmd = NULL;
> +#endif
>
> - pgdp[o] = pfn_pgd(PFN_DOWN(load_pa + i * PGDIR_SIZE), prot);
> + /* Map all memory banks */
> + for_each_memblock(memory, reg) {
> + start = reg->base;
> + end = start + reg->size;
> +
> + if (start >= end)
> + break;
> + if (memblock_is_nomap(reg))
> + continue;
> +
> + for (pa = start; pa < end; pa += PAGE_SIZE)
> + create_pgd_mapping(swapper_pg_dir,
> + (uintptr_t)__va(pa), pa,
> + PAGE_SIZE, prot, 0, &ops);
> }
>
> - map_pa = (uintptr_t)__early_va(fixmap_pte, load_pa);
> - pgdp[(FIXADDR_START >> PGDIR_SHIFT) % PTRS_PER_PGD] =
> - pfn_pgd(PFN_DOWN(map_pa), tableprot);
> -#endif
> + clear_fixmap(FIX_PTE);
> + clear_fixmap(FIX_PMD);
> +}
> +
> +void __init paging_init(void)
> +{
> + setup_vm_final();
> + setup_zero_page();
> + local_flush_tlb_all();
> + zone_sizes_init();
> }
Some thoughts on that:
a) I agree we don't want to waste memory but that doesn't mean we need
to change the kernel's load address alignment requirements, we can
re-claim the address below the kernel and use pages from there (instead
of hugepages).
b) If we do so OpenSBI, BBL and any other firmware / SBI implementation
will need to modify the device tree they pass on to the kernel to
indicate where they are in memory so that the kernel won't try to use
their memory. In case of kexec for example right now I have no way of
knowing where the firmware is, I get the /memory node and it says that
the whole memory is usable (and its start address is aligned), which is
wrong.
Regards
Nick
On Tue, 09 Apr 2019 23:09:03 PDT (-0700), [email protected] wrote:
> On Tue, Apr 9, 2019 at 10:14 PM Palmer Dabbelt <[email protected]> wrote:
>>
>> On Tue, 12 Mar 2019 15:08:12 PDT (-0700), Anup Patel wrote:
>> > This patch adds rv32_defconfig for 32bit systems. The only
>> > difference between rv32_defconfig and defconfig is that
>> > rv32_defconfig has CONFIG_ARCH_RV32I=y.
>>
>> Thanks. I think it makes sense to have this in 5.1 so I'm going to take it
>> into the next RC.
>
> Thanks, Palmer.
>
> Can you consider "[PATCH v3] RISC-V: Fix Maximum Physical Memory 2GiB
> option for 64bit systems" for 5.1?
>
> Refer, https://patchwork.kernel.org/patch/10886895/
>
> The above patch is also required for 64bit system with more than 128GiB memory
> (i.e. server-class system).
>
> We can remove "Maximum Physical Memory 2GiB" option as separate patch.
Ya, that should go in. It's on the branch, thanks for the reminder!