From: Mike Rapoport <[email protected]>
Hi,
There are several architectures that use very similar code for setup if
"System RAM" resources under iomem_resource tree and requesting memory
resources corresponding to the kernel code, data etc.
The flow for resources setup iterates over memory regions registered in
memblock, adds a resource for each region as "System RAM" and than
registers the areas used by the kernel image and, optionally, the crash
kernel area, as children of the "System RAM" resources.
The notable differences are:
* arm/arm64 use [__text, __init_begin] range for kernel code resource and
[_sdata, _end] range for kernel data, while most other architectures use
more fine grained ranges.
* arm has "System RAM (boot alias)" that do not seem useful for any other
architecture
* arm64 has special treatment for NOMAP areas and all the areas reserved in
memblock
* s390 has crashk_res in parallel with the "System RAM" resource, but it
seems it was required some time ago but no longer actually needed.
These patches use s390 implementation of the resource setup as the basis
and then switch MIPS, arm and arm64 to use it with modifications required
to support each architecture.
The generic code loops over all memblock.memory regions, adds the NOMAP
regions as "reserved" iomem resources and "normal" regions as "System RAM"
iomem resrouces, reserves the areas occupied by the kernel code, rodata,
data and bss, if there is crash kernel resource it is also reserved.
In addition, if an architectures selects
CONFIG_ARCH_WANT_RESERVE_MEMBLOCK_RESERVED_REGIONS (bad name, but I could
not find a better one) the memblock.reserved regions are registered as
"reserved" resources in iomem_resource.
It would be also possible to convert other architectures (e.g, RISC-V and
sh) to use the common infrastructure.
Mike Rapoport (5):
s390: make crashk_res resource a child of "System RAM"
memblock: introduce generic memblock_setup_resources()
arm: switch to generic memblock_setup_resources()
MIPS: switch to generic memblock_setup_resources
arm64: switch to generic memblock_setup_resources()
arch/Kconfig | 7 ++
arch/arm/kernel/setup.c | 37 +----------
arch/arm64/Kconfig | 1 +
arch/arm64/kernel/setup.c | 101 +----------------------------
arch/mips/kernel/setup.c | 78 ++--------------------
arch/s390/kernel/setup.c | 86 +------------------------
include/linux/memblock.h | 2 +
mm/memblock.c | 132 ++++++++++++++++++++++++++++++++++++++
8 files changed, 151 insertions(+), 293 deletions(-)
base-commit: c4681547bcce777daf576925a966ffa824edd09d
--
2.28.0
From: Mike Rapoport <[email protected]>
Commit 4e042af463f8 ("s390/kexec: fix crash on resize of reserved memory")
added a comment that says "crash kernel resource should not be part of the
System RAM resource" but never explained why. As it looks from the code in
the kernel and in kexec there is no actual reason for that.
Keeping crashk_res inline with other resources makes code simpler and
cleaner, and allows future consolidation of the resources setup across
several architectures.
Signed-off-by: Mike Rapoport <[email protected]>
---
arch/s390/kernel/setup.c | 21 +++++----------------
1 file changed, 5 insertions(+), 16 deletions(-)
diff --git a/arch/s390/kernel/setup.c b/arch/s390/kernel/setup.c
index 5aab59ad5688..30430e7c1b03 100644
--- a/arch/s390/kernel/setup.c
+++ b/arch/s390/kernel/setup.c
@@ -500,6 +500,9 @@ static struct resource __initdata *standard_resources[] = {
&code_resource,
&data_resource,
&bss_resource,
+#ifdef CONFIG_CRASH_DUMP
+ &crashk_res,
+#endif
};
static void __init setup_resources(void)
@@ -535,7 +538,7 @@ static void __init setup_resources(void)
for (j = 0; j < ARRAY_SIZE(standard_resources); j++) {
std_res = standard_resources[j];
- if (std_res->start < res->start ||
+ if (!std_res->end || std_res->start < res->start ||
std_res->start > res->end)
continue;
if (std_res->end > res->end) {
@@ -552,20 +555,6 @@ static void __init setup_resources(void)
}
}
}
-#ifdef CONFIG_CRASH_DUMP
- /*
- * Re-add removed crash kernel memory as reserved memory. This makes
- * sure it will be mapped with the identity mapping and struct pages
- * will be created, so it can be resized later on.
- * However add it later since the crash kernel resource should not be
- * part of the System RAM resource.
- */
- if (crashk_res.end) {
- memblock_add_node(crashk_res.start, resource_size(&crashk_res), 0);
- memblock_reserve(crashk_res.start, resource_size(&crashk_res));
- insert_resource(&iomem_resource, &crashk_res);
- }
-#endif
}
static void __init setup_ident_map_size(void)
@@ -733,7 +722,7 @@ static void __init reserve_crashkernel(void)
diag10_range(PFN_DOWN(crash_base), PFN_DOWN(crash_size));
crashk_res.start = crash_base;
crashk_res.end = crash_base + crash_size - 1;
- memblock_remove(crash_base, crash_size);
+ memblock_reserve(crash_base, crash_size);
pr_info("Reserving %lluMB of memory at %lluMB "
"for crashkernel (System RAM: %luMB)\n",
crash_size >> 20, crash_base >> 20,
--
2.28.0
From: Mike Rapoport <[email protected]>
Several architecture have very similar implementation of the setup of the
resource tree. The implementations loop over memory ranges registered in
memblock, request a "System RAM" resource for each range and then reserve
resource rages corresponding to various sections of the kernel image. If
kexec/crash dump are enable the crashk_res region is also reserved.
Move the implementation of resource tree setup from s390 to memblock. It
will be then used by several other architectures as well.
Signed-off-by: Mike Rapoport <[email protected]>
---
arch/s390/kernel/setup.c | 73 +--------------------------------
include/linux/memblock.h | 2 +
mm/memblock.c | 87 ++++++++++++++++++++++++++++++++++++++++
3 files changed, 90 insertions(+), 72 deletions(-)
diff --git a/arch/s390/kernel/setup.c b/arch/s390/kernel/setup.c
index 30430e7c1b03..6f3c82cc0b0d 100644
--- a/arch/s390/kernel/setup.c
+++ b/arch/s390/kernel/setup.c
@@ -481,80 +481,9 @@ static void __init setup_lowcore_dat_on(void)
__ctl_set_bit(0, 28);
}
-static struct resource code_resource = {
- .name = "Kernel code",
- .flags = IORESOURCE_BUSY | IORESOURCE_SYSTEM_RAM,
-};
-
-static struct resource data_resource = {
- .name = "Kernel data",
- .flags = IORESOURCE_BUSY | IORESOURCE_SYSTEM_RAM,
-};
-
-static struct resource bss_resource = {
- .name = "Kernel bss",
- .flags = IORESOURCE_BUSY | IORESOURCE_SYSTEM_RAM,
-};
-
-static struct resource __initdata *standard_resources[] = {
- &code_resource,
- &data_resource,
- &bss_resource,
-#ifdef CONFIG_CRASH_DUMP
- &crashk_res,
-#endif
-};
-
static void __init setup_resources(void)
{
- struct resource *res, *std_res, *sub_res;
- phys_addr_t start, end;
- int j;
- u64 i;
-
- code_resource.start = (unsigned long) _text;
- code_resource.end = (unsigned long) _etext - 1;
- data_resource.start = (unsigned long) _etext;
- data_resource.end = (unsigned long) _edata - 1;
- bss_resource.start = (unsigned long) __bss_start;
- bss_resource.end = (unsigned long) __bss_stop - 1;
-
- for_each_mem_range(i, &start, &end) {
- res = memblock_alloc(sizeof(*res), 8);
- if (!res)
- panic("%s: Failed to allocate %zu bytes align=0x%x\n",
- __func__, sizeof(*res), 8);
- res->flags = IORESOURCE_BUSY | IORESOURCE_SYSTEM_RAM;
-
- res->name = "System RAM";
- res->start = start;
- /*
- * In memblock, end points to the first byte after the
- * range while in resourses, end points to the last byte in
- * the range.
- */
- res->end = end - 1;
- request_resource(&iomem_resource, res);
-
- for (j = 0; j < ARRAY_SIZE(standard_resources); j++) {
- std_res = standard_resources[j];
- if (!std_res->end || std_res->start < res->start ||
- std_res->start > res->end)
- continue;
- if (std_res->end > res->end) {
- sub_res = memblock_alloc(sizeof(*sub_res), 8);
- if (!sub_res)
- panic("%s: Failed to allocate %zu bytes align=0x%x\n",
- __func__, sizeof(*sub_res), 8);
- *sub_res = *std_res;
- sub_res->end = res->end;
- std_res->start = res->end + 1;
- request_resource(res, sub_res);
- } else {
- request_resource(res, std_res);
- }
- }
- }
+ memblock_setup_resources();
}
static void __init setup_ident_map_size(void)
diff --git a/include/linux/memblock.h b/include/linux/memblock.h
index 5984fff3f175..44c29ebae842 100644
--- a/include/linux/memblock.h
+++ b/include/linux/memblock.h
@@ -121,6 +121,8 @@ void memblock_free_all(void);
void reset_node_managed_pages(pg_data_t *pgdat);
void reset_all_zones_managed_pages(void);
+void memblock_setup_resources(void);
+
/* Low level functions */
void __next_mem_range(u64 *idx, int nid, enum memblock_flags flags,
struct memblock_type *type_a,
diff --git a/mm/memblock.c b/mm/memblock.c
index afaefa8fc6ab..504435753259 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -16,6 +16,8 @@
#include <linux/kmemleak.h>
#include <linux/seq_file.h>
#include <linux/memblock.h>
+#include <linux/ioport.h>
+#include <linux/kexec.h>
#include <asm/sections.h>
#include <linux/io.h>
@@ -2062,6 +2064,91 @@ void __init memblock_free_all(void)
totalram_pages_add(pages);
}
+static struct resource code_resource = {
+ .name = "Kernel code",
+ .flags = IORESOURCE_BUSY | IORESOURCE_SYSTEM_RAM,
+};
+
+static struct resource rodata_resource = {
+ .name = "Kernel rodata",
+ .flags = IORESOURCE_BUSY | IORESOURCE_SYSTEM_RAM,
+};
+
+static struct resource data_resource = {
+ .name = "Kernel data",
+ .flags = IORESOURCE_BUSY | IORESOURCE_SYSTEM_RAM,
+};
+
+static struct resource bss_resource = {
+ .name = "Kernel bss",
+ .flags = IORESOURCE_BUSY | IORESOURCE_SYSTEM_RAM,
+};
+
+static struct resource __initdata *standard_resources[] = {
+ &code_resource,
+ &rodata_resource,
+ &data_resource,
+ &bss_resource,
+#ifdef CONFIG_KEXEC_CORE
+ &crashk_res,
+#endif
+};
+
+void __init memblock_setup_resources(void)
+{
+ struct resource *res, *kres, *sub_res;
+ phys_addr_t start, end;
+ int j;
+ u64 i;
+
+ code_resource.start = __pa_symbol(_text);
+ code_resource.end = __pa_symbol(_etext)-1;
+ rodata_resource.start = __pa_symbol(__start_rodata);
+ rodata_resource.end = __pa_symbol(__end_rodata)-1;
+ data_resource.start = __pa_symbol(_sdata);
+ data_resource.end = __pa_symbol(_edata)-1;
+ bss_resource.start = __pa_symbol(__bss_start);
+ bss_resource.end = __pa_symbol(__bss_stop)-1;
+
+ for_each_mem_range(i, &start, &end) {
+ res = memblock_alloc(sizeof(*res), 8);
+ if (!res)
+ panic("%s: Failed to allocate %zu bytes align=0x%x\n",
+ __func__, sizeof(*res), 8);
+ res->flags = IORESOURCE_BUSY | IORESOURCE_SYSTEM_RAM;
+
+ res->name = "System RAM";
+ res->start = start;
+
+ /*
+ * In memblock, end points to the first byte after the
+ * range while in resourses, end points to the last byte in
+ * the range.
+ */
+ res->end = end - 1;
+ request_resource(&iomem_resource, res);
+
+ for (j = 0; j < ARRAY_SIZE(standard_resources); j++) {
+ kres = standard_resources[j];
+ if (!kres->end || kres->start < res->start ||
+ kres->start > res->end)
+ continue;
+ if (kres->end > res->end) {
+ sub_res = memblock_alloc(sizeof(*sub_res), 8);
+ if (!sub_res)
+ panic("%s: Failed to allocate %zu bytes align=0x%x\n",
+ __func__, sizeof(*sub_res), 8);
+ *sub_res = *kres;
+ sub_res->end = res->end;
+ kres->start = res->end + 1;
+ request_resource(res, sub_res);
+ } else {
+ request_resource(res, kres);
+ }
+ }
+ }
+}
+
#if defined(CONFIG_DEBUG_FS) && defined(CONFIG_ARCH_KEEP_MEMBLOCK)
static int memblock_debug_show(struct seq_file *m, void *private)
--
2.28.0
From: Mike Rapoport <[email protected]>
MIPS version of resource setup is very similar to the generic one. The only
difference is the reservation of the crash kernel area that is spread among
several functions.
Switch MIPS to use the generic version.
Signed-off-by: Mike Rapoport <[email protected]>
---
arch/mips/kernel/setup.c | 78 +++-------------------------------------
1 file changed, 5 insertions(+), 73 deletions(-)
diff --git a/arch/mips/kernel/setup.c b/arch/mips/kernel/setup.c
index 23a140327a0b..be49217f0f22 100644
--- a/arch/mips/kernel/setup.c
+++ b/arch/mips/kernel/setup.c
@@ -79,10 +79,6 @@ static const char builtin_cmdline[] __initconst = "";
unsigned long mips_io_port_base = -1;
EXPORT_SYMBOL(mips_io_port_base);
-static struct resource code_resource = { .name = "Kernel code", };
-static struct resource data_resource = { .name = "Kernel data", };
-static struct resource bss_resource = { .name = "Kernel bss", };
-
unsigned long __kaslr_offset __ro_after_init;
EXPORT_SYMBOL(__kaslr_offset);
@@ -469,31 +465,18 @@ static void __init mips_parse_crashkernel(void)
}
}
+ memblock_reserve(crashk_res.start, resource_size(&crashk_res));
+
crashk_res.start = crash_base;
crashk_res.end = crash_base + crash_size - 1;
-}
-
-static void __init request_crashkernel(struct resource *res)
-{
- int ret;
-
- if (crashk_res.start == crashk_res.end)
- return;
- ret = request_resource(res, &crashk_res);
- if (!ret)
- pr_info("Reserving %ldMB of memory at %ldMB for crashkernel\n",
- (unsigned long)(resource_size(&crashk_res) >> 20),
- (unsigned long)(crashk_res.start >> 20));
+ pr_info("Reserving %lldMB of memory at %lldMB for crashkernel\n",
+ crash_base >> 20, crash_size);
}
#else /* !defined(CONFIG_KEXEC) */
static void __init mips_parse_crashkernel(void)
{
}
-
-static void __init request_crashkernel(struct resource *res)
-{
-}
#endif /* !defined(CONFIG_KEXEC) */
static void __init check_kernel_sections_mem(void)
@@ -656,10 +639,6 @@ static void __init arch_mem_init(char **cmdline_p)
mips_reserve_vmcore();
mips_parse_crashkernel();
-#ifdef CONFIG_KEXEC
- if (crashk_res.start != crashk_res.end)
- memblock_reserve(crashk_res.start, resource_size(&crashk_res));
-#endif
device_tree_init();
/*
@@ -683,53 +662,6 @@ static void __init arch_mem_init(char **cmdline_p)
early_memtest(PFN_PHYS(ARCH_PFN_OFFSET), PFN_PHYS(max_low_pfn));
}
-static void __init resource_init(void)
-{
- phys_addr_t start, end;
- u64 i;
-
- if (UNCAC_BASE != IO_BASE)
- return;
-
- code_resource.start = __pa_symbol(&_text);
- code_resource.end = __pa_symbol(&_etext) - 1;
- data_resource.start = __pa_symbol(&_etext);
- data_resource.end = __pa_symbol(&_edata) - 1;
- bss_resource.start = __pa_symbol(&__bss_start);
- bss_resource.end = __pa_symbol(&__bss_stop) - 1;
-
- for_each_mem_range(i, &start, &end) {
- struct resource *res;
-
- res = memblock_alloc(sizeof(struct resource), SMP_CACHE_BYTES);
- if (!res)
- panic("%s: Failed to allocate %zu bytes\n", __func__,
- sizeof(struct resource));
-
- res->start = start;
- /*
- * In memblock, end points to the first byte after the
- * range while in resourses, end points to the last byte in
- * the range.
- */
- res->end = end - 1;
- res->flags = IORESOURCE_SYSTEM_RAM | IORESOURCE_BUSY;
- res->name = "System RAM";
-
- request_resource(&iomem_resource, res);
-
- /*
- * We don't know which RAM region contains kernel data,
- * so we try it repeatedly and let the resource manager
- * test it.
- */
- request_resource(res, &code_resource);
- request_resource(res, &data_resource);
- request_resource(res, &bss_resource);
- request_crashkernel(res);
- }
-}
-
#ifdef CONFIG_SMP
static void __init prefill_possible_map(void)
{
@@ -771,7 +703,7 @@ void __init setup_arch(char **cmdline_p)
arch_mem_init(cmdline_p);
dmi_setup();
- resource_init();
+ memblock_setup_resources();
plat_smp_setup();
prefill_possible_map();
--
2.28.0
From: Mike Rapoport <[email protected]>
The registration of "System RAM" resources on arm is very similar to the
generic version.
The major differences are the registration of "System RAM (boot alias)"
areas and minor differences in the way kernel image resources are reserved.
Replace the open coded registration of the "System RAM" resource and the
requests for kernel image and data areas in the resource tree with the
generic implementation.
Signed-off-by: Mike Rapoport <[email protected]>
---
arch/arm/kernel/setup.c | 37 +------------------------------------
1 file changed, 1 insertion(+), 36 deletions(-)
diff --git a/arch/arm/kernel/setup.c b/arch/arm/kernel/setup.c
index 1a5edf562e85..d14edf42815b 100644
--- a/arch/arm/kernel/setup.c
+++ b/arch/arm/kernel/setup.c
@@ -174,23 +174,9 @@ static struct resource mem_res[] = {
.end = 0,
.flags = IORESOURCE_MEM
},
- {
- .name = "Kernel code",
- .start = 0,
- .end = 0,
- .flags = IORESOURCE_SYSTEM_RAM
- },
- {
- .name = "Kernel data",
- .start = 0,
- .end = 0,
- .flags = IORESOURCE_SYSTEM_RAM
- }
};
#define video_ram mem_res[0]
-#define kernel_code mem_res[1]
-#define kernel_data mem_res[2]
static struct resource io_res[] = {
{
@@ -849,10 +835,7 @@ static void __init request_standard_resources(const struct machine_desc *mdesc)
struct resource *res;
u64 i;
- kernel_code.start = virt_to_phys(_text);
- kernel_code.end = virt_to_phys(__init_begin - 1);
- kernel_data.start = virt_to_phys(_sdata);
- kernel_data.end = virt_to_phys(_end - 1);
+ memblock_setup_resources();
for_each_mem_range(i, &start, &end) {
unsigned long boot_alias_start;
@@ -881,24 +864,6 @@ static void __init request_standard_resources(const struct machine_desc *mdesc)
res->flags = IORESOURCE_MEM | IORESOURCE_BUSY;
request_resource(&iomem_resource, res);
}
-
- res = memblock_alloc(sizeof(*res), SMP_CACHE_BYTES);
- if (!res)
- panic("%s: Failed to allocate %zu bytes\n", __func__,
- sizeof(*res));
- res->name = "System RAM";
- res->start = start;
- res->end = res_end;
- res->flags = IORESOURCE_SYSTEM_RAM | IORESOURCE_BUSY;
-
- request_resource(&iomem_resource, res);
-
- if (kernel_code.start >= res->start &&
- kernel_code.end <= res->end)
- request_resource(res, &kernel_code);
- if (kernel_data.start >= res->start &&
- kernel_data.end <= res->end)
- request_resource(res, &kernel_data);
}
if (mdesc->video_start) {
--
2.28.0
From: Mike Rapoport <[email protected]>
The implementation of resource setup on arm64 takes into account NOMAP
regions and registers them as "Reserved" rather than as "System RAM".
In addition, there is an extra pass that adds regions that were reserved in
memblock to the resource tree as well.
Update the generic version of memblock_setup_resources() to take into the
account NOMAP regions and move the registration of the reserved regions to
memblock to have all the "System RAM" related resource data structures and
code close to each other.
Only enable requests for the reserved resources if an architecture selects
CONFIG_ARCH_WANT_RESERVE_MEMBLOCK_RESERVED_REGIONS.
Signed-off-by: Mike Rapoport <[email protected]>
---
arch/Kconfig | 7 +++
arch/arm64/Kconfig | 1 +
arch/arm64/kernel/setup.c | 101 +-------------------------------------
mm/memblock.c | 73 +++++++++++++++++++++------
4 files changed, 68 insertions(+), 114 deletions(-)
diff --git a/arch/Kconfig b/arch/Kconfig
index c45b770d3579..9c999dc8ab47 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -1275,6 +1275,13 @@ config ARCH_SPLIT_ARG64
config ARCH_HAS_ELFCORE_COMPAT
bool
+config ARCH_WANT_RESERVE_MEMBLOCK_RESERVED_REGIONS
+ bool
+ help
+ If an architecture requires that memory regions reserved in
+ memblock will appear as "Reserved" in the resource tree, select
+ this option.
+
source "kernel/gcov/Kconfig"
source "scripts/gcc-plugins/Kconfig"
diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 9f1d8566bbf9..0e1f41650a6a 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -93,6 +93,7 @@ config ARM64
select ARCH_WANT_FRAME_POINTERS
select ARCH_WANT_HUGE_PMD_SHARE if ARM64_4K_PAGES || (ARM64_16K_PAGES && !ARM64_VA_BITS_36)
select ARCH_WANT_LD_ORPHAN_WARN
+ select ARCH_WANT_RESERVE_MEMBLOCK_RESERVED_REGIONS
select ARCH_HAS_UBSAN_SANITIZE_ALL
select ARM_AMBA
select ARM_ARCH_TIMER
diff --git a/arch/arm64/kernel/setup.c b/arch/arm64/kernel/setup.c
index 61845c0821d9..90e97cf70844 100644
--- a/arch/arm64/kernel/setup.c
+++ b/arch/arm64/kernel/setup.c
@@ -51,32 +51,8 @@
#include <asm/xen/hypervisor.h>
#include <asm/mmu_context.h>
-static int num_standard_resources;
-static struct resource *standard_resources;
-
phys_addr_t __fdt_pointer __initdata;
-/*
- * Standard memory resources
- */
-static struct resource mem_res[] = {
- {
- .name = "Kernel code",
- .start = 0,
- .end = 0,
- .flags = IORESOURCE_SYSTEM_RAM
- },
- {
- .name = "Kernel data",
- .start = 0,
- .end = 0,
- .flags = IORESOURCE_SYSTEM_RAM
- }
-};
-
-#define kernel_code mem_res[0]
-#define kernel_data mem_res[1]
-
/*
* The recorded values of x0 .. x3 upon kernel entry.
*/
@@ -214,81 +190,6 @@ static void __init setup_machine_fdt(phys_addr_t dt_phys)
dump_stack_set_arch_desc("%s (DT)", name);
}
-static void __init request_standard_resources(void)
-{
- struct memblock_region *region;
- struct resource *res;
- unsigned long i = 0;
- size_t res_size;
-
- kernel_code.start = __pa_symbol(_stext);
- kernel_code.end = __pa_symbol(__init_begin - 1);
- kernel_data.start = __pa_symbol(_sdata);
- kernel_data.end = __pa_symbol(_end - 1);
-
- num_standard_resources = memblock.memory.cnt;
- res_size = num_standard_resources * sizeof(*standard_resources);
- standard_resources = memblock_alloc(res_size, SMP_CACHE_BYTES);
- if (!standard_resources)
- panic("%s: Failed to allocate %zu bytes\n", __func__, res_size);
-
- for_each_mem_region(region) {
- res = &standard_resources[i++];
- if (memblock_is_nomap(region)) {
- res->name = "reserved";
- res->flags = IORESOURCE_MEM;
- } else {
- res->name = "System RAM";
- res->flags = IORESOURCE_SYSTEM_RAM | IORESOURCE_BUSY;
- }
- res->start = __pfn_to_phys(memblock_region_memory_base_pfn(region));
- res->end = __pfn_to_phys(memblock_region_memory_end_pfn(region)) - 1;
-
- request_resource(&iomem_resource, res);
-
- if (kernel_code.start >= res->start &&
- kernel_code.end <= res->end)
- request_resource(res, &kernel_code);
- if (kernel_data.start >= res->start &&
- kernel_data.end <= res->end)
- request_resource(res, &kernel_data);
-#ifdef CONFIG_KEXEC_CORE
- /* Userspace will find "Crash kernel" region in /proc/iomem. */
- if (crashk_res.end && crashk_res.start >= res->start &&
- crashk_res.end <= res->end)
- request_resource(res, &crashk_res);
-#endif
- }
-}
-
-static int __init reserve_memblock_reserved_regions(void)
-{
- u64 i, j;
-
- for (i = 0; i < num_standard_resources; ++i) {
- struct resource *mem = &standard_resources[i];
- phys_addr_t r_start, r_end, mem_size = resource_size(mem);
-
- if (!memblock_is_region_reserved(mem->start, mem_size))
- continue;
-
- for_each_reserved_mem_range(j, &r_start, &r_end) {
- resource_size_t start, end;
-
- start = max(PFN_PHYS(PFN_DOWN(r_start)), mem->start);
- end = min(PFN_PHYS(PFN_UP(r_end)) - 1, mem->end);
-
- if (start > mem->end || end < mem->start)
- continue;
-
- reserve_region_with_split(mem, start, end, "reserved");
- }
- }
-
- return 0;
-}
-arch_initcall(reserve_memblock_reserved_regions);
-
u64 __cpu_logical_map[NR_CPUS] = { [0 ... NR_CPUS-1] = INVALID_HWID };
u64 cpu_logical_map(unsigned int cpu)
@@ -359,7 +260,7 @@ void __init __no_sanitize_address setup_arch(char **cmdline_p)
kasan_init();
- request_standard_resources();
+ memblock_setup_resources();
early_ioremap_reset();
diff --git a/mm/memblock.c b/mm/memblock.c
index 504435753259..82d4b0f5bf5e 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -2094,12 +2094,15 @@ static struct resource __initdata *standard_resources[] = {
#endif
};
+static int num_resources;
+static struct resource *resources;
+
void __init memblock_setup_resources(void)
{
struct resource *res, *kres, *sub_res;
- phys_addr_t start, end;
- int j;
- u64 i;
+ struct memblock_region *region;
+ size_t res_size;
+ int i;
code_resource.start = __pa_symbol(_text);
code_resource.end = __pa_symbol(_etext)-1;
@@ -2110,26 +2113,36 @@ void __init memblock_setup_resources(void)
bss_resource.start = __pa_symbol(__bss_start);
bss_resource.end = __pa_symbol(__bss_stop)-1;
- for_each_mem_range(i, &start, &end) {
- res = memblock_alloc(sizeof(*res), 8);
- if (!res)
- panic("%s: Failed to allocate %zu bytes align=0x%x\n",
- __func__, sizeof(*res), 8);
- res->flags = IORESOURCE_BUSY | IORESOURCE_SYSTEM_RAM;
+ num_resources = memblock.memory.cnt;
+ res_size = num_resources * sizeof(*resources);
+ resources = memblock_alloc(res_size, SMP_CACHE_BYTES);
- res->name = "System RAM";
- res->start = start;
+ if (!resources)
+ panic("%s: Failed to allocate %zu bytes align=0x%x\n",
+ __func__, res_size, 8);
+
+ res = resources;
+ for_each_mem_region(region) {
+ if (memblock_is_nomap(region)) {
+ res->name = "reserved";
+ res->flags = IORESOURCE_MEM;
+ } else {
+ res->name = "System RAM";
+ res->flags = IORESOURCE_SYSTEM_RAM | IORESOURCE_BUSY;
+ }
+ res->start = region->base;
/*
* In memblock, end points to the first byte after the
* range while in resourses, end points to the last byte in
* the range.
*/
- res->end = end - 1;
+ res->end = region->base + region->size - 1;
+
request_resource(&iomem_resource, res);
- for (j = 0; j < ARRAY_SIZE(standard_resources); j++) {
- kres = standard_resources[j];
+ for (i = 0; i < ARRAY_SIZE(standard_resources); i++) {
+ kres = standard_resources[i];
if (!kres->end || kres->start < res->start ||
kres->start > res->end)
continue;
@@ -2146,9 +2159,41 @@ void __init memblock_setup_resources(void)
request_resource(res, kres);
}
}
+
+ res++;
}
}
+#ifdef CONFIG_ARCH_WANT_RESERVE_MEMBLOCK_RESERVED_REGIONS
+static int __init reserve_memblock_reserved_regions(void)
+{
+ u64 i, j;
+
+ for (i = 0; i < num_resources; ++i) {
+ struct resource *mem = &resources[i];
+ phys_addr_t r_start, r_end, mem_size = resource_size(mem);
+
+ if (!memblock_is_region_reserved(mem->start, mem_size))
+ continue;
+
+ for_each_reserved_mem_range(j, &r_start, &r_end) {
+ resource_size_t start, end;
+
+ start = max(PFN_PHYS(PFN_DOWN(r_start)), mem->start);
+ end = min(PFN_PHYS(PFN_UP(r_end)) - 1, mem->end);
+
+ if (start > mem->end || end < mem->start)
+ continue;
+
+ reserve_region_with_split(mem, start, end, "reserved");
+ }
+ }
+
+ return 0;
+}
+arch_initcall(reserve_memblock_reserved_regions);
+#endif
+
#if defined(CONFIG_DEBUG_FS) && defined(CONFIG_ARCH_KEEP_MEMBLOCK)
static int memblock_debug_show(struct seq_file *m, void *private)
--
2.28.0
On 31.05.21 14:29, Mike Rapoport wrote:
> From: Mike Rapoport <[email protected]>
>
> Commit 4e042af463f8 ("s390/kexec: fix crash on resize of reserved memory")
> added a comment that says "crash kernel resource should not be part of the
> System RAM resource" but never explained why. As it looks from the code in
> the kernel and in kexec there is no actual reason for that.
Are you sure?
Looking at kexec-tools: kexec/arch/s390/kexec-s390.c
get_memory_ranges_s390() wants "System RAM" and Crash kernel only with
"with_crashk=1". Your patch would change that. "Crash kernel" would
always be included if you make it a child of "System RAM".
Further, get_memory_ranges() and is_crashkernel_mem_reserved() look out
for "Crash kernel\n" via parse_iomem_single().
However, parse_iomem_single() does not care about ranges that start with
spaces IIRC via
sscanf(line, "%llx-%llx : %n" ...
So once you make "Crash kernel" a child of "System RAM", kexec-tools
would break if I'm not completely wrong.
--
Thanks,
David / dhildenb
On 01.06.21 10:45, David Hildenbrand wrote:
> On 31.05.21 14:29, Mike Rapoport wrote:
>> From: Mike Rapoport <[email protected]>
>>
>> Commit 4e042af463f8 ("s390/kexec: fix crash on resize of reserved memory")
>> added a comment that says "crash kernel resource should not be part of the
>> System RAM resource" but never explained why. As it looks from the code in
>> the kernel and in kexec there is no actual reason for that.
>
> Are you sure?
>
> Looking at kexec-tools: kexec/arch/s390/kexec-s390.c
>
> get_memory_ranges_s390() wants "System RAM" and Crash kernel only with
> "with_crashk=1". Your patch would change that. "Crash kernel" would
> always be included if you make it a child of "System RAM".
>
> Further, get_memory_ranges() and is_crashkernel_mem_reserved() look out
> for "Crash kernel\n" via parse_iomem_single().
>
> However, parse_iomem_single() does not care about ranges that start with
> spaces IIRC via
> sscanf(line, "%llx-%llx : %n" ...
I think I'm wrong about that one because I read
"Input white-space characters (as specified by the isspace function) are
skipped, unless the specification includes a [ , c , or n specifier"
So having it as a child won't affect parse_iomem_single().
--
Thanks,
David / dhildenb
On Mon, 31 May 2021 15:29:55 +0300
Mike Rapoport <[email protected]> wrote:
> From: Mike Rapoport <[email protected]>
>
> Commit 4e042af463f8 ("s390/kexec: fix crash on resize of reserved memory")
> added a comment that says "crash kernel resource should not be part of the
> System RAM resource" but never explained why. As it looks from the code in
> the kernel and in kexec there is no actual reason for that.
Still testing, but so far everything works fine.
>
> Keeping crashk_res inline with other resources makes code simpler and
> cleaner, and allows future consolidation of the resources setup across
> several architectures.
>
> Signed-off-by: Mike Rapoport <[email protected]>
> ---
> arch/s390/kernel/setup.c | 21 +++++----------------
> 1 file changed, 5 insertions(+), 16 deletions(-)
>
> diff --git a/arch/s390/kernel/setup.c b/arch/s390/kernel/setup.c
> index 5aab59ad5688..30430e7c1b03 100644
> --- a/arch/s390/kernel/setup.c
> +++ b/arch/s390/kernel/setup.c
> @@ -500,6 +500,9 @@ static struct resource __initdata *standard_resources[] = {
> &code_resource,
> &data_resource,
> &bss_resource,
> +#ifdef CONFIG_CRASH_DUMP
> + &crashk_res,
> +#endif
> };
>
> static void __init setup_resources(void)
> @@ -535,7 +538,7 @@ static void __init setup_resources(void)
>
> for (j = 0; j < ARRAY_SIZE(standard_resources); j++) {
> std_res = standard_resources[j];
> - if (std_res->start < res->start ||
> + if (!std_res->end || std_res->start < res->start ||
> std_res->start > res->end)
> continue;
> if (std_res->end > res->end) {
Why is this extra check for !std_res->end added here? I assume it
might be needed later, after you moved this to common code, but I
cannot see how any of the other patches in this series would require
that.
> @@ -552,20 +555,6 @@ static void __init setup_resources(void)
> }
> }
> }
> -#ifdef CONFIG_CRASH_DUMP
> - /*
> - * Re-add removed crash kernel memory as reserved memory. This makes
> - * sure it will be mapped with the identity mapping and struct pages
> - * will be created, so it can be resized later on.
> - * However add it later since the crash kernel resource should not be
> - * part of the System RAM resource.
> - */
> - if (crashk_res.end) {
> - memblock_add_node(crashk_res.start, resource_size(&crashk_res), 0);
> - memblock_reserve(crashk_res.start, resource_size(&crashk_res));
> - insert_resource(&iomem_resource, &crashk_res);
> - }
> -#endif
> }
>
> static void __init setup_ident_map_size(void)
> @@ -733,7 +722,7 @@ static void __init reserve_crashkernel(void)
> diag10_range(PFN_DOWN(crash_base), PFN_DOWN(crash_size));
> crashk_res.start = crash_base;
> crashk_res.end = crash_base + crash_size - 1;
> - memblock_remove(crash_base, crash_size);
> + memblock_reserve(crash_base, crash_size);
> pr_info("Reserving %lluMB of memory at %lluMB "
> "for crashkernel (System RAM: %luMB)\n",
> crash_size >> 20, crash_base >> 20,
Other architectures check the return value of memblock_reserve() at
this point, and exit crashkernel reservation if it fails. IIUC, the
only reason why memblock_reserve() could fail would be the same reason
why also memblock_remove() could fail, i.e. that memblock_double_array()
would fail. And since we also do not check that at the moment, your
patch would probably not (additionally) break anything.
Still, this might be something for an add-on patch (for us). Do you
happen to know how likely it would be that memblock_remove/reserve()
could fail at this point?
On Mon, May 31, 2021 at 03:29:54PM +0300, Mike Rapoport wrote:
> * arm has "System RAM (boot alias)" that do not seem useful for any other
> architecture
This is VERY important for kexec and must _not_ be removed, since you
will be causing a userspace regression by doing so.
--
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 40Mbps down 10Mbps up. Decent connectivity at last!
On Mon, May 31, 2021 at 03:29:56PM +0300, Mike Rapoport wrote:
> + code_resource.start = __pa_symbol(_text);
> + code_resource.end = __pa_symbol(_etext)-1;
> + rodata_resource.start = __pa_symbol(__start_rodata);
> + rodata_resource.end = __pa_symbol(__end_rodata)-1;
> + data_resource.start = __pa_symbol(_sdata);
> + data_resource.end = __pa_symbol(_edata)-1;
> + bss_resource.start = __pa_symbol(__bss_start);
> + bss_resource.end = __pa_symbol(__bss_stop)-1;
This falls short on 32-bit ARM. The old code was:
- kernel_code.start = virt_to_phys(_text);
- kernel_code.end = virt_to_phys(__init_begin - 1);
- kernel_data.start = virt_to_phys(_sdata);
- kernel_data.end = virt_to_phys(_end - 1);
If I look at one of my kernels:
c0008000 T _text
c0b5b000 R __end_rodata
... exception and unwind tables live here ...
c0c00000 T __init_begin
c0e00000 D _sdata
c0e68870 D _edata
c0e68870 B __bss_start
c0e995d4 B __bss_stop
c0e995d4 B _end
So the original covers _text..__init_begin-1 which includes the
exception and unwind tables. Your version above omits these, which
leaves them exposed.
--
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 40Mbps down 10Mbps up. Decent connectivity at last!
On Tue, Jun 01, 2021 at 11:02:17AM +0200, David Hildenbrand wrote:
> On 01.06.21 10:45, David Hildenbrand wrote:
> > On 31.05.21 14:29, Mike Rapoport wrote:
> > > From: Mike Rapoport <[email protected]>
> > >
> > > Commit 4e042af463f8 ("s390/kexec: fix crash on resize of reserved memory")
> > > added a comment that says "crash kernel resource should not be part of the
> > > System RAM resource" but never explained why. As it looks from the code in
> > > the kernel and in kexec there is no actual reason for that.
> >
> > Are you sure?
> >
> > Looking at kexec-tools: kexec/arch/s390/kexec-s390.c
> >
> > get_memory_ranges_s390() wants "System RAM" and Crash kernel only with
> > "with_crashk=1". Your patch would change that. "Crash kernel" would
> > always be included if you make it a child of "System RAM".
> >
> > Further, get_memory_ranges() and is_crashkernel_mem_reserved() look out
> > for "Crash kernel\n" via parse_iomem_single().
> >
> > However, parse_iomem_single() does not care about ranges that start with
> > spaces IIRC via
> > sscanf(line, "%llx-%llx : %n" ...
>
> I think I'm wrong about that one because I read
>
> "Input white-space characters (as specified by the isspace function) are
> skipped, unless the specification includes a [ , c , or n specifier"
>
> So having it as a child won't affect parse_iomem_single().
Yes, this was my understanding as well.
--
Sincerely yours,
Mike.
On Tue, Jun 01, 2021 at 03:18:36PM +0200, Gerald Schaefer wrote:
> On Mon, 31 May 2021 15:29:55 +0300
> Mike Rapoport <[email protected]> wrote:
>
> > From: Mike Rapoport <[email protected]>
> >
> > Commit 4e042af463f8 ("s390/kexec: fix crash on resize of reserved memory")
> > added a comment that says "crash kernel resource should not be part of the
> > System RAM resource" but never explained why. As it looks from the code in
> > the kernel and in kexec there is no actual reason for that.
>
> Still testing, but so far everything works fine.
>
> >
> > Keeping crashk_res inline with other resources makes code simpler and
> > cleaner, and allows future consolidation of the resources setup across
> > several architectures.
> >
> > Signed-off-by: Mike Rapoport <[email protected]>
> > ---
> > arch/s390/kernel/setup.c | 21 +++++----------------
> > 1 file changed, 5 insertions(+), 16 deletions(-)
> >
> > diff --git a/arch/s390/kernel/setup.c b/arch/s390/kernel/setup.c
> > index 5aab59ad5688..30430e7c1b03 100644
> > --- a/arch/s390/kernel/setup.c
> > +++ b/arch/s390/kernel/setup.c
> > @@ -500,6 +500,9 @@ static struct resource __initdata *standard_resources[] = {
> > &code_resource,
> > &data_resource,
> > &bss_resource,
> > +#ifdef CONFIG_CRASH_DUMP
> > + &crashk_res,
> > +#endif
> > };
> >
> > static void __init setup_resources(void)
> > @@ -535,7 +538,7 @@ static void __init setup_resources(void)
> >
> > for (j = 0; j < ARRAY_SIZE(standard_resources); j++) {
> > std_res = standard_resources[j];
> > - if (std_res->start < res->start ||
> > + if (!std_res->end || std_res->start < res->start ||
> > std_res->start > res->end)
> > continue;
> > if (std_res->end > res->end) {
>
> Why is this extra check for !std_res->end added here? I assume it
> might be needed later, after you moved this to common code, but I
> cannot see how any of the other patches in this series would require
> that.
This is needed to avoid requesting empty crash_res.
> > @@ -552,20 +555,6 @@ static void __init setup_resources(void)
> > }
> > }
> > }
> > -#ifdef CONFIG_CRASH_DUMP
> > - /*
> > - * Re-add removed crash kernel memory as reserved memory. This makes
> > - * sure it will be mapped with the identity mapping and struct pages
> > - * will be created, so it can be resized later on.
> > - * However add it later since the crash kernel resource should not be
> > - * part of the System RAM resource.
> > - */
> > - if (crashk_res.end) {
> > - memblock_add_node(crashk_res.start, resource_size(&crashk_res), 0);
> > - memblock_reserve(crashk_res.start, resource_size(&crashk_res));
> > - insert_resource(&iomem_resource, &crashk_res);
> > - }
> > -#endif
> > }
> >
> > static void __init setup_ident_map_size(void)
> > @@ -733,7 +722,7 @@ static void __init reserve_crashkernel(void)
> > diag10_range(PFN_DOWN(crash_base), PFN_DOWN(crash_size));
> > crashk_res.start = crash_base;
> > crashk_res.end = crash_base + crash_size - 1;
> > - memblock_remove(crash_base, crash_size);
> > + memblock_reserve(crash_base, crash_size);
> > pr_info("Reserving %lluMB of memory at %lluMB "
> > "for crashkernel (System RAM: %luMB)\n",
> > crash_size >> 20, crash_base >> 20,
>
> Other architectures check the return value of memblock_reserve() at
> this point, and exit crashkernel reservation if it fails. IIUC, the
> only reason why memblock_reserve() could fail would be the same reason
> why also memblock_remove() could fail, i.e. that memblock_double_array()
> would fail. And since we also do not check that at the moment, your
> patch would probably not (additionally) break anything.
We don't check the return value of memblock_reserve() (or memblock_remove()
for that matter) in vast majority of cases all over the place, but it is
very much unlikely it will fail because there is not enough memory.
> Still, this might be something for an add-on patch (for us). Do you
> happen to know how likely it would be that memblock_remove/reserve()
> could fail at this point?
Generally, the worst case scenario is that memblock_remove/reserve() will
need to double one of the memblock arrays. This will require to allocate
several kilobytes of memory, but since the memory this early is mostly free
it should not be a problem.
Looking particularly at s390 case, there are handful of reservations before
setup_resouces(), so we can accommodate more than 100 calls to
memblock_reserve() before we'll need to increase memblock.reserved array.
--
Sincerely yours,
Mike.
On Tue, Jun 01, 2021 at 02:44:29PM +0100, Russell King (Oracle) wrote:
> On Mon, May 31, 2021 at 03:29:54PM +0300, Mike Rapoport wrote:
> > * arm has "System RAM (boot alias)" that do not seem useful for any other
> > architecture
>
> This is VERY important for kexec and must _not_ be removed, since you
> will be causing a userspace regression by doing so.
I didn't remove these. I'll update the changelog and the cover letter to
make it clear.
--
Sincerely yours,
Mike.
On Wed, Jun 02, 2021 at 11:33:10AM +0300, Mike Rapoport wrote:
> On Tue, Jun 01, 2021 at 02:54:15PM +0100, Russell King (Oracle) wrote:
> > On Mon, May 31, 2021 at 03:29:56PM +0300, Mike Rapoport wrote:
> > > + code_resource.start = __pa_symbol(_text);
> > > + code_resource.end = __pa_symbol(_etext)-1;
> > > + rodata_resource.start = __pa_symbol(__start_rodata);
> > > + rodata_resource.end = __pa_symbol(__end_rodata)-1;
> > > + data_resource.start = __pa_symbol(_sdata);
> > > + data_resource.end = __pa_symbol(_edata)-1;
> > > + bss_resource.start = __pa_symbol(__bss_start);
> > > + bss_resource.end = __pa_symbol(__bss_stop)-1;
> >
> > This falls short on 32-bit ARM. The old code was:
> >
> > - kernel_code.start = virt_to_phys(_text);
> > - kernel_code.end = virt_to_phys(__init_begin - 1);
> > - kernel_data.start = virt_to_phys(_sdata);
> > - kernel_data.end = virt_to_phys(_end - 1);
> >
> > If I look at one of my kernels:
> >
> > c0008000 T _text
> > c0b5b000 R __end_rodata
> > ... exception and unwind tables live here ...
> > c0c00000 T __init_begin
> > c0e00000 D _sdata
> > c0e68870 D _edata
> > c0e68870 B __bss_start
> > c0e995d4 B __bss_stop
> > c0e995d4 B _end
> >
> > So the original covers _text..__init_begin-1 which includes the
> > exception and unwind tables. Your version above omits these, which
> > leaves them exposed.
>
> Right, this needs to be fixed. Is there any reason the exception and unwind
> tables cannot be placed between _sdata and _edata?
>
> It seems to me that they were left outside for purely historical reasons.
> Commit ee951c630c5c ("ARM: 7568/1: Sort exception table at compile time")
> moved the exception tables out of .data section before _sdata existed.
> Commit 14c4a533e099 ("ARM: 8583/1: mm: fix location of _etext") moved
> _etext before the unwind tables and didn't bother to put them into data or
> rodata areas.
You can not assume that all sections will be between these symbols. This
isn't specific to 32-bit ARM. If you look at x86's vmlinux.lds.in, you
will see that BUG_TABLE and ORC_UNWIND_TABLE are after _edata, along
with many other undiscarded sections before __bss_start. So it seems
your assumptions in trying to clean this up are somewhat false.
--
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 40Mbps down 10Mbps up. Decent connectivity at last!
On Tue, Jun 01, 2021 at 02:54:15PM +0100, Russell King (Oracle) wrote:
> On Mon, May 31, 2021 at 03:29:56PM +0300, Mike Rapoport wrote:
> > + code_resource.start = __pa_symbol(_text);
> > + code_resource.end = __pa_symbol(_etext)-1;
> > + rodata_resource.start = __pa_symbol(__start_rodata);
> > + rodata_resource.end = __pa_symbol(__end_rodata)-1;
> > + data_resource.start = __pa_symbol(_sdata);
> > + data_resource.end = __pa_symbol(_edata)-1;
> > + bss_resource.start = __pa_symbol(__bss_start);
> > + bss_resource.end = __pa_symbol(__bss_stop)-1;
>
> This falls short on 32-bit ARM. The old code was:
>
> - kernel_code.start = virt_to_phys(_text);
> - kernel_code.end = virt_to_phys(__init_begin - 1);
> - kernel_data.start = virt_to_phys(_sdata);
> - kernel_data.end = virt_to_phys(_end - 1);
>
> If I look at one of my kernels:
>
> c0008000 T _text
> c0b5b000 R __end_rodata
> ... exception and unwind tables live here ...
> c0c00000 T __init_begin
> c0e00000 D _sdata
> c0e68870 D _edata
> c0e68870 B __bss_start
> c0e995d4 B __bss_stop
> c0e995d4 B _end
>
> So the original covers _text..__init_begin-1 which includes the
> exception and unwind tables. Your version above omits these, which
> leaves them exposed.
Right, this needs to be fixed. Is there any reason the exception and unwind
tables cannot be placed between _sdata and _edata?
It seems to me that they were left outside for purely historical reasons.
Commit ee951c630c5c ("ARM: 7568/1: Sort exception table at compile time")
moved the exception tables out of .data section before _sdata existed.
Commit 14c4a533e099 ("ARM: 8583/1: mm: fix location of _etext") moved
_etext before the unwind tables and didn't bother to put them into data or
rodata areas.
--
Sincerely yours,
Mike.
On Wed, Jun 02, 2021 at 11:15:21AM +0100, Russell King (Oracle) wrote:
> On Wed, Jun 02, 2021 at 11:33:10AM +0300, Mike Rapoport wrote:
> > On Tue, Jun 01, 2021 at 02:54:15PM +0100, Russell King (Oracle) wrote:
> > > On Mon, May 31, 2021 at 03:29:56PM +0300, Mike Rapoport wrote:
> > > > + code_resource.start = __pa_symbol(_text);
> > > > + code_resource.end = __pa_symbol(_etext)-1;
> > > > + rodata_resource.start = __pa_symbol(__start_rodata);
> > > > + rodata_resource.end = __pa_symbol(__end_rodata)-1;
> > > > + data_resource.start = __pa_symbol(_sdata);
> > > > + data_resource.end = __pa_symbol(_edata)-1;
> > > > + bss_resource.start = __pa_symbol(__bss_start);
> > > > + bss_resource.end = __pa_symbol(__bss_stop)-1;
> > >
> > > This falls short on 32-bit ARM. The old code was:
> > >
> > > - kernel_code.start = virt_to_phys(_text);
> > > - kernel_code.end = virt_to_phys(__init_begin - 1);
> > > - kernel_data.start = virt_to_phys(_sdata);
> > > - kernel_data.end = virt_to_phys(_end - 1);
> > >
> > > If I look at one of my kernels:
> > >
> > > c0008000 T _text
> > > c0b5b000 R __end_rodata
> > > ... exception and unwind tables live here ...
> > > c0c00000 T __init_begin
> > > c0e00000 D _sdata
> > > c0e68870 D _edata
> > > c0e68870 B __bss_start
> > > c0e995d4 B __bss_stop
> > > c0e995d4 B _end
> > >
> > > So the original covers _text..__init_begin-1 which includes the
> > > exception and unwind tables. Your version above omits these, which
> > > leaves them exposed.
> >
> > Right, this needs to be fixed. Is there any reason the exception and unwind
> > tables cannot be placed between _sdata and _edata?
> >
> > It seems to me that they were left outside for purely historical reasons.
> > Commit ee951c630c5c ("ARM: 7568/1: Sort exception table at compile time")
> > moved the exception tables out of .data section before _sdata existed.
> > Commit 14c4a533e099 ("ARM: 8583/1: mm: fix location of _etext") moved
> > _etext before the unwind tables and didn't bother to put them into data or
> > rodata areas.
>
> You can not assume that all sections will be between these symbols. This
> isn't specific to 32-bit ARM. If you look at x86's vmlinux.lds.in, you
> will see that BUG_TABLE and ORC_UNWIND_TABLE are after _edata, along
> with many other undiscarded sections before __bss_start.
But if you look at x86's setup_arch() all these never make it to the
resource tree. So there are holes in /proc/iomem between the kernel
resources.
> So it seems your assumptions in trying to clean this up are somewhat
> false.
My assumption was that there is complete lack of consistency between what
is reserved memory and how it is reported in /proc/iomem or
/sys/firmware/memmap for that matter. I'm not trying to clean this up, I'm
trying to make different views of the physical memory consistent.
Consolidating several similar per-arch implementations is the first step in
this direction.
--
Sincerely yours,
Mike.
On Wed, Jun 02, 2021 at 04:54:17PM +0300, Mike Rapoport wrote:
> On Wed, Jun 02, 2021 at 11:15:21AM +0100, Russell King (Oracle) wrote:
> > On Wed, Jun 02, 2021 at 11:33:10AM +0300, Mike Rapoport wrote:
> > > On Tue, Jun 01, 2021 at 02:54:15PM +0100, Russell King (Oracle) wrote:
> > > > If I look at one of my kernels:
> > > >
> > > > c0008000 T _text
> > > > c0b5b000 R __end_rodata
> > > > ... exception and unwind tables live here ...
> > > > c0c00000 T __init_begin
> > > > c0e00000 D _sdata
> > > > c0e68870 D _edata
> > > > c0e68870 B __bss_start
> > > > c0e995d4 B __bss_stop
> > > > c0e995d4 B _end
> > > >
> > > > So the original covers _text..__init_begin-1 which includes the
> > > > exception and unwind tables. Your version above omits these, which
> > > > leaves them exposed.
> > >
> > > Right, this needs to be fixed. Is there any reason the exception and unwind
> > > tables cannot be placed between _sdata and _edata?
> > >
> > > It seems to me that they were left outside for purely historical reasons.
> > > Commit ee951c630c5c ("ARM: 7568/1: Sort exception table at compile time")
> > > moved the exception tables out of .data section before _sdata existed.
> > > Commit 14c4a533e099 ("ARM: 8583/1: mm: fix location of _etext") moved
> > > _etext before the unwind tables and didn't bother to put them into data or
> > > rodata areas.
> >
> > You can not assume that all sections will be between these symbols. This
> > isn't specific to 32-bit ARM. If you look at x86's vmlinux.lds.in, you
> > will see that BUG_TABLE and ORC_UNWIND_TABLE are after _edata, along
> > with many other undiscarded sections before __bss_start.
>
> But if you look at x86's setup_arch() all these never make it to the
> resource tree. So there are holes in /proc/iomem between the kernel
> resources.
Also true. However, my point was to counter your claim that these
sections should be part of the .text/.data/.rodata etc sections in the
output vmlinux.
There is, however, a more important point. The __ex_table section
must exist and be separate from the .text/.data/.rodata sections in
the output ELF file, as sorttable (the exception table sorter) relies
on this to be able to find the table and sort it.
So, it isn't entirely "for historical reasons" as you said two messages
ago.
> > So it seems your assumptions in trying to clean this up are somewhat
> > false.
>
> My assumption was that there is complete lack of consistency between what
> is reserved memory and how it is reported in /proc/iomem or
> /sys/firmware/memmap for that matter. I'm not trying to clean this up, I'm
> trying to make different views of the physical memory consistent.
> Consolidating several similar per-arch implementations is the first step in
> this direction.
It looks to me that there is quite a number of things that need fixing.
One glaring thing is the kernel's init memory - should that be counted
as reserved memory? It's marked as such in memblock and /proc/iomem,
yet we free these pages into the page allocator after boot meaning
they are just like any other page in the memory allocator - they are
most certainly not "reserved" at that point.
So, what is reported as reserved in firmware maps will be different
from memblock. Memblock includes kernel boot-time allocations, which
count as "reserved" but are not part of the firmware maps - these will
be for things like early page tables and the struct page array. So,
you're never going to get consistency between memblock and firmware.
Memblock and /proc/iomem should be fairly consistent - areas marked
as reserved in memblock seem to be propagated into /proc/iomem,
including areas around the kernel image (the resources that you're
changing in your patch.) Here's an example:
/sys/kernel/debug/memblock/reserved:
1: 0x0000000081210000..0x0000000082d6efff
2: 0x0000000082d71000..0x0000000082d7ffff
81210000-821cffff : Kernel code
821d0000-8246ffff : reserved
82470000-82d7ffff : Kernel data
This is aarch64, which isn't as accurate as 32-bit ARM in /proc/iomem:
/sys/kernel/debug/memblock/reserved:
1: 0x0000000040200000..0x0000000040ea1c17
/proc/iomem:
40008000-40bfffff : Kernel code
40e00000-40ea1c17 : Kernel data
32-bit ARM doesn't forward the memblock reserved areas into /proc/iomem
because they are kernel allocations. In the example I show above for
32-bit ARM, there are no firmware reserved regions, yet there are 19
memblock "reserved" regions.
I think part of the problem here is understanding what "reserved" means
in these cases. For something passed to the kernel from firmware, it's
an area that firmware doesn't want the OS to use. For memblock, it is
those areas plus allocations made early on during kernel boot before
the page allocator is up and running, and includes areas of memory that
these allocations must avoid (e.g. due to initramfs or device tree
temporarily residing there.) Then there's differences in what should
be placed in /proc/iomem.
Now, bear in mind that /proc/iomem is a user API, one which userspace
depends on. If we start going around making /proc/iomem report stuff
like kernel boot time reservations as "reserved" memory, we will end up
breaking the kexec tooling on some platforms. For example, kexec
tooling for 32-bit ARM parses /proc/iomem, looking for "System RAM",
"System RAM (boot alias)" and "reserved" regions.
So, I think changes to make this "more consistent" come with high
risk.
--
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 40Mbps down 10Mbps up. Decent connectivity at last!
On Wed, Jun 02, 2021 at 04:51:41PM +0100, Russell King (Oracle) wrote:
> On Wed, Jun 02, 2021 at 04:54:17PM +0300, Mike Rapoport wrote:
> > On Wed, Jun 02, 2021 at 11:15:21AM +0100, Russell King (Oracle) wrote:
> > > On Wed, Jun 02, 2021 at 11:33:10AM +0300, Mike Rapoport wrote:
> > > > On Tue, Jun 01, 2021 at 02:54:15PM +0100, Russell King (Oracle) wrote:
> > > > > If I look at one of my kernels:
> > > > >
> > > > > c0008000 T _text
> > > > > c0b5b000 R __end_rodata
> > > > > ... exception and unwind tables live here ...
> > > > > c0c00000 T __init_begin
> > > > > c0e00000 D _sdata
> > > > > c0e68870 D _edata
> > > > > c0e68870 B __bss_start
> > > > > c0e995d4 B __bss_stop
> > > > > c0e995d4 B _end
> > > > >
> > > > > So the original covers _text..__init_begin-1 which includes the
> > > > > exception and unwind tables. Your version above omits these, which
> > > > > leaves them exposed.
> > > >
> > > > Right, this needs to be fixed. Is there any reason the exception and unwind
> > > > tables cannot be placed between _sdata and _edata?
> > > >
> > > > It seems to me that they were left outside for purely historical reasons.
> > > > Commit ee951c630c5c ("ARM: 7568/1: Sort exception table at compile time")
> > > > moved the exception tables out of .data section before _sdata existed.
> > > > Commit 14c4a533e099 ("ARM: 8583/1: mm: fix location of _etext") moved
> > > > _etext before the unwind tables and didn't bother to put them into data or
> > > > rodata areas.
> > >
> > > You can not assume that all sections will be between these symbols. This
> > > isn't specific to 32-bit ARM. If you look at x86's vmlinux.lds.in, you
> > > will see that BUG_TABLE and ORC_UNWIND_TABLE are after _edata, along
> > > with many other undiscarded sections before __bss_start.
> >
> > But if you look at x86's setup_arch() all these never make it to the
> > resource tree. So there are holes in /proc/iomem between the kernel
> > resources.
>
> Also true. However, my point was to counter your claim that these
> sections should be part of the .text/.data/.rodata etc sections in the
> output vmlinux.
>
> There is, however, a more important point. The __ex_table section
> must exist and be separate from the .text/.data/.rodata sections in
> the output ELF file, as sorttable (the exception table sorter) relies
> on this to be able to find the table and sort it.
>
> So, it isn't entirely "for historical reasons" as you said two messages
> ago.
Back then when __ex_table was moved from .data section, _sdata and _edata
were part of the .data section. Today they are not. So something like the
patch below will ensure for instance that __ex_table would be a part of
"Kernel data" in /proc/iomem without moving it to the .data section:
diff --git a/arch/arm/kernel/vmlinux.lds.S b/arch/arm/kernel/vmlinux.lds.S
index f7f4620d59c3..2991feceab31 100644
--- a/arch/arm/kernel/vmlinux.lds.S
+++ b/arch/arm/kernel/vmlinux.lds.S
@@ -72,13 +72,6 @@ SECTIONS
RO_DATA(PAGE_SIZE)
- . = ALIGN(4);
- __ex_table : AT(ADDR(__ex_table) - LOAD_OFFSET) {
- __start___ex_table = .;
- ARM_MMU_KEEP(*(__ex_table))
- __stop___ex_table = .;
- }
-
#ifdef CONFIG_ARM_UNWIND
ARM_UNWIND_SECTIONS
#endif
@@ -143,6 +136,14 @@ SECTIONS
__init_end = .;
_sdata = .;
+
+ . = ALIGN(4);
+ __ex_table : AT(ADDR(__ex_table) - LOAD_OFFSET) {
+ __start___ex_table = .;
+ ARM_MMU_KEEP(*(__ex_table))
+ __stop___ex_table = .;
+ }
+
RW_DATA(L1_CACHE_BYTES, PAGE_SIZE, THREAD_SIZE)
_edata = .;
> Now, bear in mind that /proc/iomem is a user API, one which userspace
> depends on. If we start going around making /proc/iomem report stuff
> like kernel boot time reservations as "reserved" memory, we will end up
> breaking the kexec tooling on some platforms. For example, kexec
> tooling for 32-bit ARM parses /proc/iomem, looking for "System RAM",
> "System RAM (boot alias)" and "reserved" regions.
>
> So, I think changes to make this "more consistent" come with high
> risk.
I agree there is a risk but I don't think it's high. It does not look like
the minor changes in "reserved" reporting in /proc/iomem will break kexec
tooling. Anyway the amount of reserved and free memory depends on a
particular system, kernel version, configuration and command line.
I have no intention to report kernel boot time reservations
to /proc/iomem on architectures that do not report them there today,
although this also does not seem like a significant factor.
On the other hand, making /proc/iomem reporting consistent among
architectures will allow to reduce complexity of both the kernel and kexec
tools in the long run.
--
Sincerely yours,
Mike.
On Wed, Jun 02, 2021 at 09:43:32PM +0300, Mike Rapoport wrote:
> Back then when __ex_table was moved from .data section, _sdata and _edata
> were part of the .data section. Today they are not. So something like the
> patch below will ensure for instance that __ex_table would be a part of
> "Kernel data" in /proc/iomem without moving it to the .data section:
>
> diff --git a/arch/arm/kernel/vmlinux.lds.S b/arch/arm/kernel/vmlinux.lds.S
> index f7f4620d59c3..2991feceab31 100644
> --- a/arch/arm/kernel/vmlinux.lds.S
> +++ b/arch/arm/kernel/vmlinux.lds.S
> @@ -72,13 +72,6 @@ SECTIONS
>
> RO_DATA(PAGE_SIZE)
>
> - . = ALIGN(4);
> - __ex_table : AT(ADDR(__ex_table) - LOAD_OFFSET) {
> - __start___ex_table = .;
> - ARM_MMU_KEEP(*(__ex_table))
> - __stop___ex_table = .;
> - }
> -
> #ifdef CONFIG_ARM_UNWIND
> ARM_UNWIND_SECTIONS
> #endif
> @@ -143,6 +136,14 @@ SECTIONS
> __init_end = .;
>
> _sdata = .;
> +
> + . = ALIGN(4);
> + __ex_table : AT(ADDR(__ex_table) - LOAD_OFFSET) {
> + __start___ex_table = .;
> + ARM_MMU_KEEP(*(__ex_table))
> + __stop___ex_table = .;
> + }
> +
> RW_DATA(L1_CACHE_BYTES, PAGE_SIZE, THREAD_SIZE)
> _edata = .;
This example has undesirable security implications. It moves the
exception table out of the read-only mappings into the read-write
mappings, thereby providing a way for an attacker to bypass the
read-only protection on the kernel and manipulate code pointers at
potentially known addresses for distro built kernels.
> I agree there is a risk but I don't think it's high. It does not look like
> the minor changes in "reserved" reporting in /proc/iomem will break kexec
> tooling.
What makes you come to that conclusion? The kexec tools architecture
backends get to decide what they do when parsing /proc/iomem.
Currently, only firmware areas are marked reserved in /proc/iomem on
32-bit ARM.
This is read by kexec, and entered into its memory_range[] table as
either RAM, or RESERVED.
kexec uses this to search for a suitable hole in the memory map to
place the kernel in physical memory. The addition of what I will call
ficticious "reserved" areas by the host kernel because the host kernel
happened to use them _will_ have an impact on this.
They _are_ ficticious, because they are purely an artifact of the host
kernel being run, and are of no consequence to tooling such as kexec.
What such tooling is interested in is which areas it needs to avoid
because of firmware.
I think what isn't helping here is that you haven't adequately
described what your overall objective actually is. Framing it in
terms of wanting the reserved memory to be consistent between the
various kernel "interfaces" such as /proc/iomem, the memblock debugfs
and firmware is very ambiguous and open to different interpretations,
whcih I think is what the problem is here.
> Anyway the amount of reserved and free memory depends on a
> particular system, kernel version, configuration and command line.
> I have no intention to report kernel boot time reservations
> to /proc/iomem on architectures that do not report them there today,
> although this also does not seem like a significant factor.
You seem to be missing the point I've tried to make. The areas in
memblock that are marked "reserved" are the areas of reserved memory
from the firmware _plus_ the areas that the kernel has made during
boot which are of no consequence to userspace.
Wanting /proc/iomem, memblock and firmware to all agree on the values
that they mark as "reserved" is IMHO unrealistic.
--
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 40Mbps down 10Mbps up. Decent connectivity at last!
On Wed, Jun 02, 2021 at 09:15:02PM +0100, Russell King (Oracle) wrote:
> On Wed, Jun 02, 2021 at 09:43:32PM +0300, Mike Rapoport wrote:
> > Back then when __ex_table was moved from .data section, _sdata and _edata
> > were part of the .data section. Today they are not. So something like the
> > patch below will ensure for instance that __ex_table would be a part of
> > "Kernel data" in /proc/iomem without moving it to the .data section:
> >
> This example has undesirable security implications. It moves the
> exception table out of the read-only mappings into the read-write
> mappings, thereby providing a way for an attacker to bypass the
> read-only protection on the kernel and manipulate code pointers at
> potentially known addresses for distro built kernels.
My point was that __ex_table can be in "Kernel data" or "Kernel rodata"
without loosing the ability to sort it.
> You seem to be missing the point I've tried to make. The areas in
> memblock that are marked "reserved" are the areas of reserved memory
> from the firmware _plus_ the areas that the kernel has made during
> boot which are of no consequence to userspace.
I know what areas are marked "reserved" in memblock.
I never suggested to report "ficticious" reserved areas in /proc/iomem
unless an architecture already reports them there, like arm64 for example.
You are right I should have described better the overall objective, but
sill I feel that we keep missing each other points.
I'll update the descriptions for the next repost, hopefully it'll help.
--
Sincerely yours,
Mike.