Add wildcard option of reserving physical memory on kernel command line
Background:
In ChromeOS, we have 1 MB of pstore ramoops reserved so that we can extract
dmesg output and some other information when a crash happens in the field.
(This is only done when the user selects "Allow Google to collect data for
improving the system"). But there are cases when there's a bug that
requires more data to be retrieved to figure out what is happening. We would
like to increase the pstore size, either temporarily, or maybe even
permanently. The pstore on these devices are at a fixed location in RAM (as
the RAM is not cleared on soft reboots nor crashes). The location is chosen
by the BIOS (coreboot) and passed to the kernel via ACPI tables on x86.
There's a driver that queries for this to initialize the pstore for
ChromeOS:
See drivers/platform/chrome/chromeos_pstore.c
Problem:
The problem is that, even though there's a process to change the kernel on
these systems, and is done regularly to install updates, the firmware is
updated much less frequently. Choosing the place in RAM also takes special
care, and may be in a different address for different boards. Updating the
size via firmware is a large effort and not something that many are willing
to do for a temporary pstore size change.
Requirement:
Need a way to reserve memory that will be at a consistent location for
every boot, if the kernel and system are the same. Does not need to work
if rebooting to a different kernel, or if the system can change the
memory layout between boots.
The reserved memory can not be an hard coded address, as the same kernel /
command line needs to run on several different machines. The picked memory
reservation just needs to be the same for a given machine, but may be
different for different machines.
Solution:
The solution I have come up with is to introduce a new "memmap=" kernel
command line (for x86 and I would like something similar for ARM that uses
device tree). As "memmap=" kernel command line parameter takes on several
flavors already, I would like to introduce a new one. The "memmap=" kernel
parameter is of the format of:
memmap=nn[Xss]
Where nn is the size, 'X' defines the flavor, and 'ss' usually a parameter
to that flavor. The '$' flavor is to reserve physical memory where you could
have:
memmap=12M$0xb000000
Where 12 megs of memory will be reserved at the address 0xb0000000. This
memory will not be part of the memory used by the kernel's memory management
system. (e.g. alloc_pages() and kmalloc() will not return memory in that
location).
I would like to introduce a "wildcard" flavor that is of the format:
memmap=nn*align:label
Where nn is the size of memory to reserve, the align is the alignment of
that memory, and label is the way for other sub-systems to find that memory.
This way the kernel command line could have:
memmap=12M*4096:oops ramoops.mem_name=oops
At boot up, the kernel will search for 12 megabytes in usable memory regions
with an alignment of 4096. It will start at the highest regions and work its
way down (for those old devices that want access to lower address DMA). When
it finds a region, it will save it off in a small table and mark it with the
"oops" label. Then the pstore ramoops sub-system could ask for that memory
and location, and it will map itself there.
This prototype allows for 8 different mappings (which may be overkill, 4 is
probably plenty) with 16 byte size to store the label. The table lookup is
only available until boot finishes, which means it is only available for
builtin code and not for modules.
I have tested this and it works for us to solve the above problem. We can
update the kernel and command line and increase the size of pstore without
needing to update the firmware, or knowing every memory layout of each
board. I only tested this locally, it has not been tested in the field. Before
doing anything, I am looking for feedback. Maybe I missed something. Perhaps
there's a better way. Anyway, this is both a Proof of Concept as well as a
Request for Comments.
Thanks!
Steven Rostedt (Google) (2):
mm/x86: Add wildcard '*' option as memmap=nn*align:name
pstore/ramoops: Add ramoops.mem_name= command line option
----
arch/x86/kernel/e820.c | 91 ++++++++++++++++++++++++++++++++++++++++++++++++++
fs/pstore/ram.c | 18 ++++++++++
include/linux/mm.h | 2 ++
mm/memory.c | 7 ++++
4 files changed, 118 insertions(+)
On Tue, 09 Apr 2024 17:02:54 -0400
Steven Rostedt <[email protected]> wrote:
> memmap=12M*4096:oops ramoops.mem_name=oops
I forgot to mention that this makes it trivial for any machine that doesn't
clear memory on soft-reboot, to enable console ramoops (to have access to
the last boot dmesg without needing serial).
I tested this on a couple of my test boxes and on QEMU, and it works rather
well.
-- Steve
On Tue, Apr 09, 2024 at 05:23:58PM -0400, Steven Rostedt wrote:
> On Tue, 09 Apr 2024 17:02:54 -0400
> Steven Rostedt <[email protected]> wrote:
>
> > memmap=12M*4096:oops ramoops.mem_name=oops
>
> I forgot to mention that this makes it trivial for any machine that doesn't
> clear memory on soft-reboot, to enable console ramoops (to have access to
> the last boot dmesg without needing serial).
>
> I tested this on a couple of my test boxes and on QEMU, and it works rather
> well.
I've long wanted a "stable for this machine and kernel" memory region
like this for pstore. It would make testing much easier.
--
Kees Cook
>> I forgot to mention that this makes it trivial for any machine that doesn't
>> clear memory on soft-reboot, to enable console ramoops (to have access to
>> the last boot dmesg without needing serial).
>>
>> I tested this on a couple of my test boxes and on QEMU, and it works rather
>> well.
>
> I've long wanted a "stable for this machine and kernel" memory region
> like this for pstore. It would make testing much easier.
Which systems does this work on? I'd assume that servers (and anything
else with ECC memory) would nuke contents while resetting ECC to clean
state.
-Tony
> On Apr 10, 2024, at 3:55 AM, Luck, Tony <[email protected]> wrote:
>
>
>>
>>> I forgot to mention that this makes it trivial for any machine that doesn't
>>> clear memory on soft-reboot, to enable console ramoops (to have access to
>>> the last boot dmesg without needing serial).
>>>
>>> I tested this on a couple of my test boxes and on QEMU, and it works rather
>>> well.
>>
>> I've long wanted a "stable for this machine and kernel" memory region
>> like this for pstore. It would make testing much easier.
>
> Which systems does this work on? I'd assume that servers (and anything
> else with ECC memory) would nuke contents while resetting ECC to clean
> state.
If that were the case universally, then ramoops pstore backend would not work either?
And yet we get the last kernel logs via the pstore for many years now, on embedded-ish devices.
From my reading, ECC-enabled DRAM is not present on lots of systems and IIRC, pstore ramoops has its own ECC.
Or did I miss a recent trend with ECC-enabled DRAM?
- Joel
>
> -Tony
On Tue, 9 Apr 2024 22:25:33 +0000
"Luck, Tony" <[email protected]> wrote:
> >> I forgot to mention that this makes it trivial for any machine that doesn't
> >> clear memory on soft-reboot, to enable console ramoops (to have access to
> >> the last boot dmesg without needing serial).
> >>
> >> I tested this on a couple of my test boxes and on QEMU, and it works rather
> >> well.
> >
> > I've long wanted a "stable for this machine and kernel" memory region
> > like this for pstore. It would make testing much easier.
>
> Which systems does this work on? I'd assume that servers (and anything
> else with ECC memory) would nuke contents while resetting ECC to clean
> state.
>
Well I tested it on a couple of chromebooks, a test box and a laptop (as
well as QEMU). I know that ramoops has an ecc option. I'm guessing that
would help here (but I'd have to defer to others to answer that).
-- Steve
On Tue, Apr 09, 2024 at 10:25:33PM +0000, Luck, Tony wrote:
> >> I forgot to mention that this makes it trivial for any machine that doesn't
> >> clear memory on soft-reboot, to enable console ramoops (to have access to
> >> the last boot dmesg without needing serial).
> >>
> >> I tested this on a couple of my test boxes and on QEMU, and it works rather
> >> well.
> >
> > I've long wanted a "stable for this machine and kernel" memory region
> > like this for pstore. It would make testing much easier.
>
> Which systems does this work on? I'd assume that servers (and anything
> else with ECC memory) would nuke contents while resetting ECC to clean
> state.
Do ECC servers wipe their RAM by default? I know that if you build with
CONFIG_RESET_ATTACK_MITIGATION=y on an EFI system that supports the
MemoryOverwriteRequestControl EFI variable you'll get a RAM wipe...
--
Kees Cook
> Do ECC servers wipe their RAM by default? I know that if you build with
> CONFIG_RESET_ATTACK_MITIGATION=y on an EFI system that supports the
> MemoryOverwriteRequestControl EFI variable you'll get a RAM wipe...
I know that after I've been running RAS tests that inject ECC errors into thousands of
pages, those errors all disappear after a reboot.
I think some BIOS have options to speed up boot by skipping memory initialization.
I don't know if anyone makes that the default mode.
-Tony
On 09/04/2024 19:25, Luck, Tony wrote:
>>> I forgot to mention that this makes it trivial for any machine that doesn't
>>> clear memory on soft-reboot, to enable console ramoops (to have access to
>>> the last boot dmesg without needing serial).
>>>
>>> I tested this on a couple of my test boxes and on QEMU, and it works rather
>>> well.
>>
>> I've long wanted a "stable for this machine and kernel" memory region
>> like this for pstore. It would make testing much easier.
>
> Which systems does this work on? I'd assume that servers (and anything
> else with ECC memory) would nuke contents while resetting ECC to clean
> state.
>
> -Tony
Thanks Steve! Like Kees, I've been wanting a consistent way of mapping
some RAM for pstore for a while, without resorting to platform drivers
like Chromebooks do...
The idea seems very interesting and helpful, I'll test it here. My only
concern / "complain" is that it's currently only implemented for builtin
ramoops, which is not the default in many distros (like Arch, Ubuntu,
Debian). I read patch 2 (and discussion), so I think would be good to
have that builtin helper implemented upfront to allow modular usage of
ramoops.
Now, responding to Tony: Steam Deck also uses pstore/ram to store logs,
and I've tested in my AMD desktop, it does work. Seems disabling memory
retraining in BIOS (to speedup boot?) is somewhat common, not sure for
servers though. As Joel mentioned as well, quite common to use
pstore/ram in ARM embedded world.
Cheers,
Guilherme
On Thu, 11 Apr 2024 16:11:55 -0300
"Guilherme G. Piccoli" <[email protected]> wrote:
> Thanks Steve! Like Kees, I've been wanting a consistent way of mapping
> some RAM for pstore for a while, without resorting to platform drivers
> like Chromebooks do...
Great!
>
> The idea seems very interesting and helpful, I'll test it here. My only
> concern / "complain" is that it's currently only implemented for builtin
> ramoops, which is not the default in many distros (like Arch, Ubuntu,
> Debian). I read patch 2 (and discussion), so I think would be good to
> have that builtin helper implemented upfront to allow modular usage of
> ramoops.
What I think I could do is to have a check after memory is allocated to
copy the table mapping (in the heap) if it is filled. The reason I did it
this way was because it was the easiest way to save the label to address
memory before memory is initialized. I use a __initdata array (why waste
memory if it's hardly ever used).
But, after memory is initialized, we can check if the table has content,
and if so allocate a copy and store it there and use that table instead.
That would give modules a way to find the address as well.
-- Steve
On 11/04/2024 16:40, Steven Rostedt wrote:
> [...]
> What I think I could do is to have a check after memory is allocated to
> copy the table mapping (in the heap) if it is filled. The reason I did it
> this way was because it was the easiest way to save the label to address
> memory before memory is initialized. I use a __initdata array (why waste
> memory if it's hardly ever used).
>
> But, after memory is initialized, we can check if the table has content,
> and if so allocate a copy and store it there and use that table instead.
> That would give modules a way to find the address as well.
>
> -- Steve
>
Thanks Steve, seems a good idea. With that, I could test on kdumpst (the
tool used on Steam Deck), since it relies on modular pstore/ram.
Cheers!
On Fri, 12 Apr 2024 09:17:18 -0300
"Guilherme G. Piccoli" <[email protected]> wrote:
> Thanks Steve, seems a good idea. With that, I could test on kdumpst (the
> tool used on Steam Deck), since it relies on modular pstore/ram.
Something like this could work.
-- Steve
diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c
index a8831ef30c73..878aee8b2399 100644
--- a/arch/x86/kernel/e820.c
+++ b/arch/x86/kernel/e820.c
@@ -16,6 +16,7 @@
#include <linux/firmware-map.h>
#include <linux/sort.h>
#include <linux/memory_hotplug.h>
+#include <linux/mm.h>
#include <asm/e820/api.h>
#include <asm/setup.h>
@@ -64,61 +65,6 @@ struct e820_table *e820_table __refdata = &e820_table_init;
struct e820_table *e820_table_kexec __refdata = &e820_table_kexec_init;
struct e820_table *e820_table_firmware __refdata = &e820_table_firmware_init;
-/* For wildcard memory requests, have a table to find them later */
-#define E820_MAX_MAPS 8
-#define E820_MAP_NAME_SIZE 16
-struct e820_mmap_map {
- char name[E820_MAP_NAME_SIZE];
- u64 start;
- u64 size;
-};
-static struct e820_mmap_map e820_mmap_list[E820_MAX_MAPS] __initdata;
-static int e820_mmap_size __initdata;
-
-/* Add wildcard region with a lookup name */
-static int __init e820_add_mmap(u64 start, u64 size, const char *name)
-{
- struct e820_mmap_map *map;
-
- if (!name || !name[0] || strlen(name) >= E820_MAP_NAME_SIZE)
- return -EINVAL;
-
- if (e820_mmap_size >= E820_MAX_MAPS)
- return -1;
-
- map = &e820_mmap_list[e820_mmap_size++];
- map->start = start;
- map->size = size;
- strcpy(map->name, name);
- return 0;
-}
-
-/**
- * memmap_named - Find a wildcard region with a given name
- * @name: The name that is attached to a wildcard region
- * @start: If found, holds the start address
- * @size: If found, holds the size of the address.
- *
- * Returns: 1 if found or 0 if not found.
- */
-int __init memmap_named(const char *name, u64 *start, u64 *size)
-{
- struct e820_mmap_map *map;
- int i;
-
- for (i = 0; i < e820_mmap_size; i++) {
- map = &e820_mmap_list[i];
- if (!map->size)
- continue;
- if (strcmp(name, map->name) == 0) {
- *start = map->start;
- *size = map->size;
- return 1;
- }
- }
- return 0;
-}
-
/* For PCI or other memory-mapped resources */
unsigned long pci_mem_start = 0xaeedbabe;
#ifdef CONFIG_PCI
@@ -1024,6 +970,8 @@ static int __init parse_memmap_one(char *p)
e820__range_add(start_at, mem_size, E820_TYPE_RESERVED);
} else if (*p == '*') {
u64 align;
+ int ret;
+
/* Followed by alignment and ':' then the name */
align = memparse(p+1, &p);
start_at = e820__region(mem_size, align);
@@ -1032,9 +980,10 @@ static int __init parse_memmap_one(char *p)
if (*p != ':')
return -EINVAL;
p++;
- e820_add_mmap(start_at, mem_size, p);
+ ret = memmap_add(start_at, mem_size, p);
p += strlen(p);
- e820__range_add(start_at, mem_size, E820_TYPE_RESERVED);
+ if (!ret)
+ e820__range_add(start_at, mem_size, E820_TYPE_RESERVED);
} else if (*p == '!') {
start_at = memparse(p+1, &p);
e820__range_add(start_at, mem_size, E820_TYPE_PRAM);
diff --git a/fs/pstore/ram.c b/fs/pstore/ram.c
index c200388399fb..22d2e2731dc2 100644
--- a/fs/pstore/ram.c
+++ b/fs/pstore/ram.c
@@ -919,7 +919,6 @@ static void __init ramoops_register_dummy(void)
{
struct ramoops_platform_data pdata;
-#ifndef MODULE
/* Only allowed when builtin */
if (mem_name) {
u64 start;
@@ -930,7 +929,6 @@ static void __init ramoops_register_dummy(void)
mem_size = size;
}
}
-#endif
/*
* Prepare a dummy platform data structure to carry the module
diff --git a/include/linux/mm.h b/include/linux/mm.h
index cf9b34454c6f..6ce1c6929d1f 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -4203,5 +4203,6 @@ static inline bool pfn_is_unaccepted_memory(unsigned long pfn)
}
int memmap_named(const char *name, u64 *start, u64 *size);
+int memmap_add(long start, long size, const char *name);
#endif /* _LINUX_MM_H */
diff --git a/mm/memory.c b/mm/memory.c
index 7a29f17df7c1..fe054e1bb678 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -120,12 +120,6 @@ static bool vmf_orig_pte_uffd_wp(struct vm_fault *vmf)
return pte_marker_uffd_wp(vmf->orig_pte);
}
-int __init __weak memmap_named(const char *name, u64 *start, u64 *size)
-{
- pr_info("Kernel command line: memmap=nn*align:name not supported on this kernel");
- /* zero means not found */
- return 0;
-}
/*
* A number of key systems in x86 including ioremap() rely on the assumption
diff --git a/mm/mm_init.c b/mm/mm_init.c
index 549e76af8f82..e5b729b83fdc 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -154,6 +154,77 @@ static __init int set_mminit_loglevel(char *str)
early_param("mminit_loglevel", set_mminit_loglevel);
#endif /* CONFIG_DEBUG_MEMORY_INIT */
+/* For wildcard memory requests, have a table to find them later */
+#define MAX_MAPS 8
+#define MAP_NAME_SIZE 16
+struct mmap_map {
+ char name[MAP_NAME_SIZE];
+ long start;
+ long size;
+};
+static struct mmap_map early_mmap_list[MAX_MAPS] __initdata;
+static int early_mmap_size __initdata;
+static struct mmap_map *mmap_list;
+
+/* Add wildcard region with a lookup name */
+int memmap_add(long start, long size, const char *name)
+{
+ struct mmap_map *map;
+
+ if (!name || !name[0] || strlen(name) >= MAP_NAME_SIZE)
+ return -EINVAL;
+
+ if (early_mmap_size >= MAX_MAPS)
+ return -1;
+
+ map = &early_mmap_list[early_mmap_size++];
+ map->start = start;
+ map->size = size;
+ strcpy(map->name, name);
+ return 0;
+}
+
+static void __init memmap_copy(void)
+{
+ if (!early_mmap_size)
+ return;
+
+ mmap_list = kcalloc(early_mmap_size + 1, sizeof(mmap_list), GFP_KERNEL);
+ if (!mmap_list)
+ return;
+
+ for (int i = 0; i < early_mmap_size; i++)
+ mmap_list[i] = early_mmap_list[i];
+}
+
+/**
+ * memmap_named - Find a wildcard region with a given name
+ * @name: The name that is attached to a wildcard region
+ * @start: If found, holds the start address
+ * @size: If found, holds the size of the address.
+ *
+ * Returns: 1 if found or 0 if not found.
+ */
+int memmap_named(const char *name, u64 *start, u64 *size)
+{
+ struct mmap_map *map;
+
+ if (!mmap_list)
+ return 0;
+
+ for (int i = 0; mmap_list[i].name[0]; i++) {
+ map = &mmap_list[i];
+ if (!map->size)
+ continue;
+ if (strcmp(name, map->name) == 0) {
+ *start = map->start;
+ *size = map->size;
+ return 1;
+ }
+ }
+ return 0;
+}
+
struct kobject *mm_kobj;
#ifdef CONFIG_SMP
@@ -2793,4 +2864,5 @@ void __init mm_core_init(void)
pti_init();
kmsan_init_runtime();
mm_cache_init();
+ memmap_copy();
}
On Fri, Apr 12, 2024 at 01:22:43PM -0400, Steven Rostedt wrote:
> On Fri, 12 Apr 2024 09:17:18 -0300
> "Guilherme G. Piccoli" <[email protected]> wrote:
>
> > Thanks Steve, seems a good idea. With that, I could test on kdumpst (the
> > tool used on Steam Deck), since it relies on modular pstore/ram.
>
> Something like this could work.
>
> -- Steve
>
> diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c
> index a8831ef30c73..878aee8b2399 100644
> --- a/arch/x86/kernel/e820.c
> +++ b/arch/x86/kernel/e820.c
> @@ -16,6 +16,7 @@
> #include <linux/firmware-map.h>
> #include <linux/sort.h>
> #include <linux/memory_hotplug.h>
> +#include <linux/mm.h>
>
> #include <asm/e820/api.h>
> #include <asm/setup.h>
> @@ -64,61 +65,6 @@ struct e820_table *e820_table __refdata = &e820_table_init;
> struct e820_table *e820_table_kexec __refdata = &e820_table_kexec_init;
> struct e820_table *e820_table_firmware __refdata = &e820_table_firmware_init;
>
> -/* For wildcard memory requests, have a table to find them later */
> -#define E820_MAX_MAPS 8
> -#define E820_MAP_NAME_SIZE 16
> -struct e820_mmap_map {
> - char name[E820_MAP_NAME_SIZE];
> - u64 start;
> - u64 size;
> -};
> -static struct e820_mmap_map e820_mmap_list[E820_MAX_MAPS] __initdata;
> -static int e820_mmap_size __initdata;
> -
> -/* Add wildcard region with a lookup name */
> -static int __init e820_add_mmap(u64 start, u64 size, const char *name)
> -{
> - struct e820_mmap_map *map;
> -
> - if (!name || !name[0] || strlen(name) >= E820_MAP_NAME_SIZE)
> - return -EINVAL;
> -
> - if (e820_mmap_size >= E820_MAX_MAPS)
> - return -1;
> -
> - map = &e820_mmap_list[e820_mmap_size++];
> - map->start = start;
> - map->size = size;
> - strcpy(map->name, name);
> - return 0;
> -}
> -
> -/**
> - * memmap_named - Find a wildcard region with a given name
> - * @name: The name that is attached to a wildcard region
> - * @start: If found, holds the start address
> - * @size: If found, holds the size of the address.
> - *
> - * Returns: 1 if found or 0 if not found.
> - */
> -int __init memmap_named(const char *name, u64 *start, u64 *size)
> -{
> - struct e820_mmap_map *map;
> - int i;
> -
> - for (i = 0; i < e820_mmap_size; i++) {
> - map = &e820_mmap_list[i];
> - if (!map->size)
> - continue;
> - if (strcmp(name, map->name) == 0) {
> - *start = map->start;
> - *size = map->size;
> - return 1;
> - }
> - }
> - return 0;
> -}
> -
> /* For PCI or other memory-mapped resources */
> unsigned long pci_mem_start = 0xaeedbabe;
> #ifdef CONFIG_PCI
> @@ -1024,6 +970,8 @@ static int __init parse_memmap_one(char *p)
> e820__range_add(start_at, mem_size, E820_TYPE_RESERVED);
> } else if (*p == '*') {
> u64 align;
> + int ret;
> +
> /* Followed by alignment and ':' then the name */
> align = memparse(p+1, &p);
> start_at = e820__region(mem_size, align);
> @@ -1032,9 +980,10 @@ static int __init parse_memmap_one(char *p)
> if (*p != ':')
> return -EINVAL;
> p++;
> - e820_add_mmap(start_at, mem_size, p);
> + ret = memmap_add(start_at, mem_size, p);
> p += strlen(p);
> - e820__range_add(start_at, mem_size, E820_TYPE_RESERVED);
> + if (!ret)
> + e820__range_add(start_at, mem_size, E820_TYPE_RESERVED);
> } else if (*p == '!') {
> start_at = memparse(p+1, &p);
> e820__range_add(start_at, mem_size, E820_TYPE_PRAM);
> diff --git a/fs/pstore/ram.c b/fs/pstore/ram.c
> index c200388399fb..22d2e2731dc2 100644
> --- a/fs/pstore/ram.c
> +++ b/fs/pstore/ram.c
> @@ -919,7 +919,6 @@ static void __init ramoops_register_dummy(void)
> {
> struct ramoops_platform_data pdata;
>
> -#ifndef MODULE
> /* Only allowed when builtin */
> if (mem_name) {
> u64 start;
> @@ -930,7 +929,6 @@ static void __init ramoops_register_dummy(void)
> mem_size = size;
> }
> }
> -#endif
>
> /*
> * Prepare a dummy platform data structure to carry the module
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index cf9b34454c6f..6ce1c6929d1f 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -4203,5 +4203,6 @@ static inline bool pfn_is_unaccepted_memory(unsigned long pfn)
> }
>
> int memmap_named(const char *name, u64 *start, u64 *size);
> +int memmap_add(long start, long size, const char *name);
>
> #endif /* _LINUX_MM_H */
> diff --git a/mm/memory.c b/mm/memory.c
> index 7a29f17df7c1..fe054e1bb678 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -120,12 +120,6 @@ static bool vmf_orig_pte_uffd_wp(struct vm_fault *vmf)
> return pte_marker_uffd_wp(vmf->orig_pte);
> }
>
> -int __init __weak memmap_named(const char *name, u64 *start, u64 *size)
> -{
> - pr_info("Kernel command line: memmap=nn*align:name not supported on this kernel");
> - /* zero means not found */
> - return 0;
> -}
>
> /*
> * A number of key systems in x86 including ioremap() rely on the assumption
> diff --git a/mm/mm_init.c b/mm/mm_init.c
> index 549e76af8f82..e5b729b83fdc 100644
> --- a/mm/mm_init.c
> +++ b/mm/mm_init.c
> @@ -154,6 +154,77 @@ static __init int set_mminit_loglevel(char *str)
> early_param("mminit_loglevel", set_mminit_loglevel);
> #endif /* CONFIG_DEBUG_MEMORY_INIT */
>
> +/* For wildcard memory requests, have a table to find them later */
> +#define MAX_MAPS 8
> +#define MAP_NAME_SIZE 16
> +struct mmap_map {
> + char name[MAP_NAME_SIZE];
> + long start;
> + long size;
> +};
> +static struct mmap_map early_mmap_list[MAX_MAPS] __initdata;
> +static int early_mmap_size __initdata;
> +static struct mmap_map *mmap_list;
> +
> +/* Add wildcard region with a lookup name */
> +int memmap_add(long start, long size, const char *name)
> +{
> + struct mmap_map *map;
> +
> + if (!name || !name[0] || strlen(name) >= MAP_NAME_SIZE)
> + return -EINVAL;
> +
> + if (early_mmap_size >= MAX_MAPS)
> + return -1;
> +
> + map = &early_mmap_list[early_mmap_size++];
> + map->start = start;
> + map->size = size;
> + strcpy(map->name, name);
> + return 0;
> +}
> +
> +static void __init memmap_copy(void)
> +{
> + if (!early_mmap_size)
> + return;
> +
> + mmap_list = kcalloc(early_mmap_size + 1, sizeof(mmap_list), GFP_KERNEL);
We can keep early_mmap_size after boot and then we don't need to allocate
an extra element in the mmap_list. No strong feeling here, though.
> + if (!mmap_list)
> + return;
> +
> + for (int i = 0; i < early_mmap_size; i++)
> + mmap_list[i] = early_mmap_list[i];
> +}
With something like this
/*
* Parse early_reserve_mem=nn:align:name
*/
static int __init early_reserve_mem(char *p)
{
phys_addr_t start, size, align;
char *oldp;
int err;
if (!p)
return -EINVAL;
oldp = p;
size = memparse(p, &p);
if (p == oldp)
return -EINVAL;
if (*p != ':')
return -EINVAL;
align = memparse(p+1, &p);
if (*p != ':')
return -EINVAL;
start = memblock_phys_alloc(size, align);
if (!start)
return -ENOMEM;
p++;
err = memmap_add(start, size, p);
if (err) {
memblock_phys_free(start, size);
return err;
}
p += strlen(p);
return *p == '\0' ? 0: -EINVAL;
}
__setup("early_reserve_mem=", early_reserve_mem);
you don't need to touch e820 and it will work the same for all
architectures.
We'd need a better naming, but I couldn't think of something better yet.
> +
> +/**
> + * memmap_named - Find a wildcard region with a given name
> + * @name: The name that is attached to a wildcard region
> + * @start: If found, holds the start address
> + * @size: If found, holds the size of the address.
> + *
> + * Returns: 1 if found or 0 if not found.
> + */
> +int memmap_named(const char *name, u64 *start, u64 *size)
> +{
> + struct mmap_map *map;
> +
> + if (!mmap_list)
> + return 0;
> +
> + for (int i = 0; mmap_list[i].name[0]; i++) {
> + map = &mmap_list[i];
> + if (!map->size)
> + continue;
> + if (strcmp(name, map->name) == 0) {
> + *start = map->start;
> + *size = map->size;
> + return 1;
> + }
> + }
> + return 0;
> +}
> +
> struct kobject *mm_kobj;
>
> #ifdef CONFIG_SMP
> @@ -2793,4 +2864,5 @@ void __init mm_core_init(void)
> pti_init();
> kmsan_init_runtime();
> mm_cache_init();
> + memmap_copy();
> }
--
Sincerely yours,
Mike.
On Wed, 1 May 2024 17:45:49 +0300
Mike Rapoport <[email protected]> wrote:
> > +static void __init memmap_copy(void)
> > +{
> > + if (!early_mmap_size)
> > + return;
> > +
> > + mmap_list = kcalloc(early_mmap_size + 1, sizeof(mmap_list), GFP_KERNEL);
>
> We can keep early_mmap_size after boot and then we don't need to allocate
> an extra element in the mmap_list. No strong feeling here, though.
>
> > + if (!mmap_list)
> > + return;
> > +
> > + for (int i = 0; i < early_mmap_size; i++)
> > + mmap_list[i] = early_mmap_list[i];
> > +}
>
> With something like this
>
> /*
> * Parse early_reserve_mem=nn:align:name
> */
> static int __init early_reserve_mem(char *p)
> {
> phys_addr_t start, size, align;
> char *oldp;
> int err;
>
> if (!p)
> return -EINVAL;
>
> oldp = p;
> size = memparse(p, &p);
> if (p == oldp)
> return -EINVAL;
>
> if (*p != ':')
> return -EINVAL;
>
> align = memparse(p+1, &p);
> if (*p != ':')
> return -EINVAL;
>
> start = memblock_phys_alloc(size, align);
So this will allocate the same physical location for every boot, if booting
the same kernel and having the same physical memory layout?
-- Steve
> if (!start)
> return -ENOMEM;
>
> p++;
> err = memmap_add(start, size, p);
> if (err) {
> memblock_phys_free(start, size);
> return err;
> }
>
> p += strlen(p);
>
> return *p == '\0' ? 0: -EINVAL;
> }
> __setup("early_reserve_mem=", early_reserve_mem);
>
> you don't need to touch e820 and it will work the same for all
> architectures.
>
> We'd need a better naming, but I couldn't think of something better yet.
>
> > +
> > +/**
> > + * memmap_named - Find a wildcard region with a given name
> > + * @name: The name that is attached to a wildcard region
> > + * @start: If found, holds the start address
> > + * @size: If found, holds the size of the address.
> > + *
> > + * Returns: 1 if found or 0 if not found.
> > + */
> > +int memmap_named(const char *name, u64 *start, u64 *size)
> > +{
> > + struct mmap_map *map;
> > +
> > + if (!mmap_list)
> > + return 0;
> > +
> > + for (int i = 0; mmap_list[i].name[0]; i++) {
> > + map = &mmap_list[i];
> > + if (!map->size)
> > + continue;
> > + if (strcmp(name, map->name) == 0) {
> > + *start = map->start;
> > + *size = map->size;
> > + return 1;
> > + }
> > + }
> > + return 0;
> > +}
> > +
> > struct kobject *mm_kobj;
> >
> > #ifdef CONFIG_SMP
> > @@ -2793,4 +2864,5 @@ void __init mm_core_init(void)
> > pti_init();
> > kmsan_init_runtime();
> > mm_cache_init();
> > + memmap_copy();
> > }
>
On Wed, May 01, 2024 at 10:54:55AM -0400, Steven Rostedt wrote:
> On Wed, 1 May 2024 17:45:49 +0300
> Mike Rapoport <[email protected]> wrote:
>
> > > +static void __init memmap_copy(void)
> > > +{
> > > + if (!early_mmap_size)
> > > + return;
> > > +
> > > + mmap_list = kcalloc(early_mmap_size + 1, sizeof(mmap_list), GFP_KERNEL);
> >
> > We can keep early_mmap_size after boot and then we don't need to allocate
> > an extra element in the mmap_list. No strong feeling here, though.
> >
> > > + if (!mmap_list)
> > > + return;
> > > +
> > > + for (int i = 0; i < early_mmap_size; i++)
> > > + mmap_list[i] = early_mmap_list[i];
> > > +}
> >
> > With something like this
> >
> > /*
> > * Parse early_reserve_mem=nn:align:name
> > */
> > static int __init early_reserve_mem(char *p)
> > {
> > phys_addr_t start, size, align;
> > char *oldp;
> > int err;
> >
> > if (!p)
> > return -EINVAL;
> >
> > oldp = p;
> > size = memparse(p, &p);
> > if (p == oldp)
> > return -EINVAL;
> >
> > if (*p != ':')
> > return -EINVAL;
> >
> > align = memparse(p+1, &p);
> > if (*p != ':')
> > return -EINVAL;
> >
> > start = memblock_phys_alloc(size, align);
>
> So this will allocate the same physical location for every boot, if booting
> the same kernel and having the same physical memory layout?
Up to kaslr that might use that location for the kernel image.
But it's the same as allocating from e820 after kaslr.
And, TBH, I don't have good ideas how to ensure the same physical location
with randomization of the physical address of the kernel image.
> -- Steve
>
>
> > if (!start)
> > return -ENOMEM;
> >
> > p++;
> > err = memmap_add(start, size, p);
> > if (err) {
> > memblock_phys_free(start, size);
> > return err;
> > }
> >
> > p += strlen(p);
> >
> > return *p == '\0' ? 0: -EINVAL;
> > }
> > __setup("early_reserve_mem=", early_reserve_mem);
> >
> > you don't need to touch e820 and it will work the same for all
> > architectures.
> >
> > We'd need a better naming, but I couldn't think of something better yet.
> >
> > > +
> > > +/**
> > > + * memmap_named - Find a wildcard region with a given name
> > > + * @name: The name that is attached to a wildcard region
> > > + * @start: If found, holds the start address
> > > + * @size: If found, holds the size of the address.
> > > + *
> > > + * Returns: 1 if found or 0 if not found.
> > > + */
> > > +int memmap_named(const char *name, u64 *start, u64 *size)
> > > +{
> > > + struct mmap_map *map;
> > > +
> > > + if (!mmap_list)
> > > + return 0;
> > > +
> > > + for (int i = 0; mmap_list[i].name[0]; i++) {
> > > + map = &mmap_list[i];
> > > + if (!map->size)
> > > + continue;
> > > + if (strcmp(name, map->name) == 0) {
> > > + *start = map->start;
> > > + *size = map->size;
> > > + return 1;
> > > + }
> > > + }
> > > + return 0;
> > > +}
> > > +
> > > struct kobject *mm_kobj;
> > >
> > > #ifdef CONFIG_SMP
> > > @@ -2793,4 +2864,5 @@ void __init mm_core_init(void)
> > > pti_init();
> > > kmsan_init_runtime();
> > > mm_cache_init();
> > > + memmap_copy();
> > > }
> >
>
--
Sincerely yours,
Mike.
On Wed, May 01, 2024 at 12:09:04PM -0400, Steven Rostedt wrote:
> On Wed, 1 May 2024 18:30:40 +0300
> Mike Rapoport <[email protected]> wrote:
>
> > > > /*
> > > > * Parse early_reserve_mem=nn:align:name
> > > > */
> > > > static int __init early_reserve_mem(char *p)
> > > > {
> > > > phys_addr_t start, size, align;
> > > > char *oldp;
> > > > int err;
> > > >
> > > > if (!p)
> > > > return -EINVAL;
> > > >
> > > > oldp = p;
> > > > size = memparse(p, &p);
> > > > if (p == oldp)
> > > > return -EINVAL;
> > > >
> > > > if (*p != ':')
> > > > return -EINVAL;
> > > >
> > > > align = memparse(p+1, &p);
> > > > if (*p != ':')
> > > > return -EINVAL;
> > > >
> > > > start = memblock_phys_alloc(size, align);
> > >
> > > So this will allocate the same physical location for every boot, if booting
> > > the same kernel and having the same physical memory layout?
> >
> > Up to kaslr that might use that location for the kernel image.
> > But it's the same as allocating from e820 after kaslr.
> >
> > And, TBH, I don't have good ideas how to ensure the same physical location
> > with randomization of the physical address of the kernel image.
>
> I'll try it out. Looking at arch/x86/boot/compressed/kaslr.c, if I read the
> code correctly, it creates up to a 100 slots to store the kernel.
>
> The method I used was to make sure that the allocation was always done at
> the top address of memory, which I think would in most cases never be
> assigned by KASLR.
>
> This looks to just grab the next available physical address, which KASLR
> can most definitely mess with.
On x86 memblock allocates from top of the memory. As this will run later
than e820, the allocation will be lower than from e820, but still close to
the top of the memory.
> I would still like to get the highest address possible.
>
> -- Steve
--
Sincerely yours,
Mike.
On Wed, 1 May 2024 18:30:40 +0300
Mike Rapoport <[email protected]> wrote:
> > > /*
> > > * Parse early_reserve_mem=nn:align:name
> > > */
> > > static int __init early_reserve_mem(char *p)
> > > {
> > > phys_addr_t start, size, align;
> > > char *oldp;
> > > int err;
> > >
> > > if (!p)
> > > return -EINVAL;
> > >
> > > oldp = p;
> > > size = memparse(p, &p);
> > > if (p == oldp)
> > > return -EINVAL;
> > >
> > > if (*p != ':')
> > > return -EINVAL;
> > >
> > > align = memparse(p+1, &p);
> > > if (*p != ':')
> > > return -EINVAL;
> > >
> > > start = memblock_phys_alloc(size, align);
> >
> > So this will allocate the same physical location for every boot, if booting
> > the same kernel and having the same physical memory layout?
>
> Up to kaslr that might use that location for the kernel image.
> But it's the same as allocating from e820 after kaslr.
>
> And, TBH, I don't have good ideas how to ensure the same physical location
> with randomization of the physical address of the kernel image.
I'll try it out. Looking at arch/x86/boot/compressed/kaslr.c, if I read the
code correctly, it creates up to a 100 slots to store the kernel.
The method I used was to make sure that the allocation was always done at
the top address of memory, which I think would in most cases never be
assigned by KASLR.
This looks to just grab the next available physical address, which KASLR
can most definitely mess with.
I would still like to get the highest address possible.
-- Steve
On Wed, 1 May 2024 18:30:40 +0300
Mike Rapoport <[email protected]> wrote:
> > So this will allocate the same physical location for every boot, if booting
> > the same kernel and having the same physical memory layout?
>
> Up to kaslr that might use that location for the kernel image.
> But it's the same as allocating from e820 after kaslr.
>
> And, TBH, I don't have good ideas how to ensure the same physical location
> with randomization of the physical address of the kernel image.
>
I tried this approach and it unfortunately picks a different physical
location every time :-(
So it is either adding to e820 tables or we create a new way to
allocate memory at early boot up.
Below is the patch I used.
-- Steve
diff --git a/include/linux/mm.h b/include/linux/mm.h
index b6bdaa18b9e9..74aaf0bcb363 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -4204,4 +4204,6 @@ static inline bool pfn_is_unaccepted_memory(unsigned long pfn)
return range_contains_unaccepted_memory(paddr, paddr + PAGE_SIZE);
}
+int memmap_named(const char *name, unsigned long *start, unsigned long *size);
+
#endif /* _LINUX_MM_H */
diff --git a/mm/memblock.c b/mm/memblock.c
index d09136e040d3..3c015395d262 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -2243,6 +2243,101 @@ void __init memblock_free_all(void)
pages = free_low_memory_core_early();
totalram_pages_add(pages);
}
+/* For wildcard memory requests, have a table to find them later */
+#define MEMMAP_MAX_MAPS 8
+#define MEMMAP_NAME_SIZE 16
+struct memmap_map {
+ char name[MEMMAP_NAME_SIZE];
+ unsigned long start;
+ unsigned long size;
+};
+static struct memmap_map memmap_list[MEMMAP_MAX_MAPS] __initdata;
+static int memmap_size __initdata;
+
+/* Add wildcard region with a lookup name */
+static int __init memmap_add(u64 start, u64 size, const char *name)
+{
+ struct memmap_map *map;
+
+ if (!name || !name[0] || strlen(name) >= MEMMAP_NAME_SIZE)
+ return -EINVAL;
+
+ if (memmap_size >= MEMMAP_MAX_MAPS)
+ return -1;
+
+ map = &memmap_list[memmap_size++];
+ map->start = start;
+ map->size = size;
+ strcpy(map->name, name);
+ return 0;
+}
+
+/**
+ * memmap_named - Find a wildcard region with a given name
+ * @name: The name that is attached to a wildcard region
+ * @start: If found, holds the start address
+ * @size: If found, holds the size of the address.
+ *
+ * Returns: 1 if found or 0 if not found.
+ */
+int __init memmap_named(const char *name, unsigned long *start, unsigned long *size)
+{
+ struct memmap_map *map;
+ int i;
+
+ for (i = 0; i < memmap_size; i++) {
+ map = &memmap_list[i];
+ if (!map->size)
+ continue;
+ if (strcmp(name, map->name) == 0) {
+ *start = map->start;
+ *size = map->size;
+ return 1;
+ }
+ }
+ return 0;
+}
+
+/*
+ * Parse early_reserve_mem=nn:align:name
+ */
+static int __init early_reserve_mem(char *p)
+{
+ phys_addr_t start, size, align;
+ char *oldp;
+ int err;
+
+ if (!p)
+ return -EINVAL;
+
+ oldp = p;
+ size = memparse(p, &p);
+ if (p == oldp)
+ return -EINVAL;
+
+ if (*p != ':')
+ return -EINVAL;
+
+ align = memparse(p+1, &p);
+ if (*p != ':')
+ return -EINVAL;
+
+ start = memblock_phys_alloc(size, align);
+ if (!start)
+ return -ENOMEM;
+
+ p++;
+ err = memmap_add(start, size, p);
+ if (err) {
+ memblock_phys_free(start, size);
+ return err;
+ }
+
+ p += strlen(p);
+
+ return *p == '\0' ? 0: -EINVAL;
+}
+__setup("early_reserve_mem=", early_reserve_mem);
#if defined(CONFIG_DEBUG_FS) && defined(CONFIG_ARCH_KEEP_MEMBLOCK)
static const char * const flagname[] = {
On Thu, 9 May 2024 00:00:23 -0400
Steven Rostedt <[email protected]> wrote:
> I tried this approach and it unfortunately picks a different physical
> location every time :-(
>
> So it is either adding to e820 tables or we create a new way to
> allocate memory at early boot up.
>
Hmm, now I'm testing it more and it always seems to end up in the same
location. I'm not sure why it failed the first three times I tried it :-/
-- Steve
On Thu, 9 May 2024 23:24:20 +0300
Mike Rapoport <[email protected]> wrote:
> On Thu, May 09, 2024 at 01:31:22PM -0400, Steven Rostedt wrote:
> > On Thu, 9 May 2024 00:00:23 -0400
> > Steven Rostedt <[email protected]> wrote:
> >
> > > I tried this approach and it unfortunately picks a different physical
> > > location every time :-(
> > >
> > > So it is either adding to e820 tables or we create a new way to
> > > allocate memory at early boot up.
> > >
> >
> > Hmm, now I'm testing it more and it always seems to end up in the same
> > location. I'm not sure why it failed the first three times I tried it :-/
>
> If the kernel and the command line were the same, they shouldn't have.
> e820 translates to memblock the same way every boot and the early
> allocations also do not change from boot to boot.
>
> Could it be that three runs that failed had some differences in the kernel
> parameters?
>
I wonder if KASLR caused it or not. But when I first added it to
another machine, it failed to get the same address on the second boot,
but was fine after that.
Could be just something with my setup. I'm going to backport this to
5.15 and test this on a Chromebook and see what happens there (as
that's the motivation behind this work).
-- Steve
On Thu, May 09, 2024 at 01:31:22PM -0400, Steven Rostedt wrote:
> On Thu, 9 May 2024 00:00:23 -0400
> Steven Rostedt <[email protected]> wrote:
>
> > I tried this approach and it unfortunately picks a different physical
> > location every time :-(
> >
> > So it is either adding to e820 tables or we create a new way to
> > allocate memory at early boot up.
> >
>
> Hmm, now I'm testing it more and it always seems to end up in the same
> location. I'm not sure why it failed the first three times I tried it :-/
If the kernel and the command line were the same, they shouldn't have.
e820 translates to memblock the same way every boot and the early
allocations also do not change from boot to boot.
Could it be that three runs that failed had some differences in the kernel
parameters?
> -- Steve
--
Sincerely yours,
Mike.