2020-11-03 17:33:49

by Nicolas Saenz Julienne

Subject: [PATCH v6 0/7] arm64: Default to 32-bit wide ZONE_DMA

Using two distinct DMA zones turned out to be problematic. Here's an
attempt to go back to a saner default.

I tested this on both an RPi4 and QEMU.

---

Changes since v5:
- Unify ACPI/DT functions

Changes since v4:
- Fix of_dma_get_max_cpu_address() so it returns the last addressable
address, not the limit

Changes since v3:
- Drop patch adding define in dma-mapping
- Address small review changes
- Update Ard's patch
- Add new patch removing examples from mmzone.h

Changes since v2:
- Introduce Ard's patch
- Improve OF dma-ranges parsing function
- Add unit test for OF function
- Address small changes
- Move crashkernel reservation later in boot process

Changes since v1:
- Parse dma-ranges instead of using machine compatible string

Ard Biesheuvel (1):
arm64: mm: Set ZONE_DMA size based on early IORT scan

Nicolas Saenz Julienne (6):
arm64: mm: Move reserve_crashkernel() into mem_init()
arm64: mm: Move zone_dma_bits initialization into zone_sizes_init()
of/address: Introduce of_dma_get_max_cpu_address()
of: unittest: Add test for of_dma_get_max_cpu_address()
arm64: mm: Set ZONE_DMA size based on devicetree's dma-ranges
mm: Remove examples from enum zone_type comment

arch/arm64/mm/init.c | 18 ++++++-------
drivers/acpi/arm64/iort.c | 55 +++++++++++++++++++++++++++++++++++++++
drivers/of/address.c | 42 ++++++++++++++++++++++++++++++
drivers/of/unittest.c | 18 +++++++++++++
include/linux/acpi_iort.h | 4 +++
include/linux/mmzone.h | 20 --------------
include/linux/of.h | 7 +++++
7 files changed, 135 insertions(+), 29 deletions(-)

--
2.29.1


2020-11-03 17:33:58

by Nicolas Saenz Julienne

Subject: [PATCH v6 2/7] arm64: mm: Move zone_dma_bits initialization into zone_sizes_init()

zone_dma_bits's initialization happens earlier than it's actually
needed, in arm64_memblock_init(). So move it into the more suitable
zone_sizes_init().

Signed-off-by: Nicolas Saenz Julienne <[email protected]>
Tested-by: Jeremy Linton <[email protected]>
---
arch/arm64/mm/init.c | 7 ++-----
1 file changed, 2 insertions(+), 5 deletions(-)

diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
index fc4ab0d6d5d2..410721fc4fc0 100644
--- a/arch/arm64/mm/init.c
+++ b/arch/arm64/mm/init.c
@@ -190,6 +190,8 @@ static void __init zone_sizes_init(unsigned long min, unsigned long max)
unsigned long max_zone_pfns[MAX_NR_ZONES] = {0};

#ifdef CONFIG_ZONE_DMA
+ zone_dma_bits = ARM64_ZONE_DMA_BITS;
+ arm64_dma_phys_limit = max_zone_phys(zone_dma_bits);
max_zone_pfns[ZONE_DMA] = PFN_DOWN(arm64_dma_phys_limit);
#endif
#ifdef CONFIG_ZONE_DMA32
@@ -376,11 +378,6 @@ void __init arm64_memblock_init(void)

early_init_fdt_scan_reserved_mem();

- if (IS_ENABLED(CONFIG_ZONE_DMA)) {
- zone_dma_bits = ARM64_ZONE_DMA_BITS;
- arm64_dma_phys_limit = max_zone_phys(ARM64_ZONE_DMA_BITS);
- }
-
if (IS_ENABLED(CONFIG_ZONE_DMA32))
arm64_dma32_phys_limit = max_zone_phys(32);
else
--
2.29.1

2020-11-03 17:34:03

by Nicolas Saenz Julienne

Subject: [PATCH v6 1/7] arm64: mm: Move reserve_crashkernel() into mem_init()

crashkernel might reserve memory located in ZONE_DMA. We plan to delay
ZONE_DMA's initialization until after unflattening the devicetree and
ACPI's boot table initialization, so move the crashkernel reservation
later in the boot process, specifically into mem_init(). This is the
last place crashkernel will be able to reserve the memory before the
page allocator kicks in. There isn't any apparent reason for doing this
earlier.

Signed-off-by: Nicolas Saenz Julienne <[email protected]>
Tested-by: Jeremy Linton <[email protected]>
---
arch/arm64/mm/init.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
index 095540667f0f..fc4ab0d6d5d2 100644
--- a/arch/arm64/mm/init.c
+++ b/arch/arm64/mm/init.c
@@ -386,8 +386,6 @@ void __init arm64_memblock_init(void)
else
arm64_dma32_phys_limit = PHYS_MASK + 1;

- reserve_crashkernel();
-
reserve_elfcorehdr();

high_memory = __va(memblock_end_of_DRAM() - 1) + 1;
@@ -508,6 +506,8 @@ void __init mem_init(void)
else
swiotlb_force = SWIOTLB_NO_FORCE;

+ reserve_crashkernel();
+
set_max_mapnr(max_pfn - PHYS_PFN_OFFSET);

#ifndef CONFIG_SPARSEMEM_VMEMMAP
--
2.29.1

2020-11-03 17:34:12

by Nicolas Saenz Julienne

Subject: [PATCH v6 7/7] mm: Remove examples from enum zone_type comment

We can't really list every setup in common code. On top of that, the
examples are unlikely to stay accurate for long, as things change in the
arch trees independently of this comment.

Suggested-by: Christoph Hellwig <[email protected]>
Signed-off-by: Nicolas Saenz Julienne <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
---
include/linux/mmzone.h | 20 --------------------
1 file changed, 20 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index fb3bf696c05e..9d0c454d23cd 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -354,26 +354,6 @@ enum zone_type {
* DMA mask is assumed when ZONE_DMA32 is defined. Some 64-bit
* platforms may need both zones as they support peripherals with
* different DMA addressing limitations.
- *
- * Some examples:
- *
- * - i386 and x86_64 have a fixed 16M ZONE_DMA and ZONE_DMA32 for the
- * rest of the lower 4G.
- *
- * - arm only uses ZONE_DMA, the size, up to 4G, may vary depending on
- * the specific device.
- *
- * - arm64 has a fixed 1G ZONE_DMA and ZONE_DMA32 for the rest of the
- * lower 4G.
- *
- * - powerpc only uses ZONE_DMA, the size, up to 2G, may vary
- * depending on the specific device.
- *
- * - s390 uses ZONE_DMA fixed to the lower 2G.
- *
- * - ia64 and riscv only use ZONE_DMA32.
- *
- * - parisc uses neither.
*/
#ifdef CONFIG_ZONE_DMA
ZONE_DMA,
--
2.29.1

2020-11-03 17:34:22

by Nicolas Saenz Julienne

Subject: [PATCH v6 5/7] arm64: mm: Set ZONE_DMA size based on devicetree's dma-ranges

We recently introduced a 1 GB sized ZONE_DMA to cater for platforms
incorporating masters that can address less than 32 bits of DMA, in
particular the Raspberry Pi 4, which has 4 or 8 GB of DRAM, but has
peripherals that can only address up to 1 GB (and its PCIe host
bridge can only access the bottom 3 GB).

The DMA layer also needs to be able to allocate memory that is
guaranteed to meet those DMA constraints, for bounce buffering as well
as allocating the backing for consistent mappings. This is why the 1 GB
ZONE_DMA was introduced recently. Unfortunately, it turns out that having
a 1 GB ZONE_DMA as well as a ZONE_DMA32 causes problems with kdump, and
potentially in other places where allocations cannot cross zone
boundaries. Therefore, we should avoid having two separate DMA zones
when possible.

So, with the help of of_dma_get_max_cpu_address(), get the topmost
physical address accessible to all DMA masters in the system and use
that information to fine-tune ZONE_DMA's size. In the absence of
addressing-limited masters ZONE_DMA will span the whole 32-bit address
space; otherwise, as in the case of the Raspberry Pi 4, it'll only span
the 30-bit address space, with ZONE_DMA32 covering the rest of the
32-bit address space.
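
To make the sizing concrete, here's a minimal userspace sketch of the
arithmetic described above (illustrative only, not kernel code; the
0x3fffffff input is an assumed Raspberry Pi 4 style 1 GB limit):

/* Sketch: derive zone_dma_bits from the highest DMA'able CPU address. */
#include <stdint.h>
#include <stdio.h>

static unsigned int fls64(uint64_t x)		/* highest set bit, 1-based */
{
	return x ? 64 - __builtin_clzll(x) : 0;
}

int main(void)
{
	/* assumed examples: an RPi4-like 1 GB limit, and "no limit found" */
	uint64_t max_cpu_addr[] = { 0x3fffffffULL, UINT64_MAX };

	for (int i = 0; i < 2; i++) {
		unsigned int bits = fls64(max_cpu_addr[i]);
		unsigned int zone_dma_bits = bits < 32 ? bits : 32;	/* min(32U, bits) */

		printf("max addr %#llx -> zone_dma_bits %u\n",
		       (unsigned long long)max_cpu_addr[i], zone_dma_bits);
	}
	return 0;	/* prints 30 for the RPi4 case, 32 otherwise */
}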

Signed-off-by: Nicolas Saenz Julienne <[email protected]>

---

Changes since v4:
- Use fls64 as we're now using the max address (as opposed to the
limit)

Changes since v3:
- Simplify code for readability.

Changes since v2:
- Updated commit log by shamelessly copying Ard's ACPI commit log

arch/arm64/mm/init.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
index 410721fc4fc0..a2ce8a9a71a6 100644
--- a/arch/arm64/mm/init.c
+++ b/arch/arm64/mm/init.c
@@ -42,8 +42,6 @@
#include <asm/tlb.h>
#include <asm/alternative.h>

-#define ARM64_ZONE_DMA_BITS 30
-
/*
* We need to be able to catch inadvertent references to memstart_addr
* that occur (potentially in generic code) before arm64_memblock_init()
@@ -188,9 +186,11 @@ static phys_addr_t __init max_zone_phys(unsigned int zone_bits)
static void __init zone_sizes_init(unsigned long min, unsigned long max)
{
unsigned long max_zone_pfns[MAX_NR_ZONES] = {0};
+ unsigned int __maybe_unused dt_zone_dma_bits;

#ifdef CONFIG_ZONE_DMA
- zone_dma_bits = ARM64_ZONE_DMA_BITS;
+ dt_zone_dma_bits = fls64(of_dma_get_max_cpu_address(NULL));
+ zone_dma_bits = min(32U, dt_zone_dma_bits);
arm64_dma_phys_limit = max_zone_phys(zone_dma_bits);
max_zone_pfns[ZONE_DMA] = PFN_DOWN(arm64_dma_phys_limit);
#endif
--
2.29.1

2020-11-03 17:34:23

by Nicolas Saenz Julienne

Subject: [PATCH v6 6/7] arm64: mm: Set ZONE_DMA size based on early IORT scan

From: Ard Biesheuvel <[email protected]>

We recently introduced a 1 GB sized ZONE_DMA to cater for platforms
incorporating masters that can address less than 32 bits of DMA, in
particular the Raspberry Pi 4, which has 4 or 8 GB of DRAM, but has
peripherals that can only address up to 1 GB (and its PCIe host
bridge can only access the bottom 3 GB).

Instructing the DMA layer about these limitations is straightforward,
even though we had to fix some issues regarding memory limits set in
the IORT for named components, and regarding the handling of ACPI _DMA
methods. However, the DMA layer also needs to be able to allocate
memory that is guaranteed to meet those DMA constraints, for bounce
buffering as well as allocating the backing for consistent mappings.

This is why the 1 GB ZONE_DMA was introduced recently. Unfortunately,
it turns out that having a 1 GB ZONE_DMA as well as a ZONE_DMA32 causes
problems with kdump, and potentially in other places where allocations
cannot cross zone boundaries. Therefore, we should avoid having two
separate DMA zones when possible.

So let's do an early scan of the IORT, and only create the ZONE_DMA
if we encounter any devices that need it. This puts the burden on
the firmware to describe such limitations in the IORT, which may be
redundant (and less precise) if _DMA methods are also being provided.
However, it should be noted that this situation is highly unusual for
arm64 ACPI machines. Also, the DMA subsystem still gives precedence to
the _DMA method if implemented, and so we will not lose the ability to
perform streaming DMA outside the ZONE_DMA if the _DMA method permits
it.
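
As a rough illustration of how the per-node limits are folded into a
single address (a standalone sketch, not the IORT code itself; the
30/32/0-bit limits are made-up example values):

/* Sketch: fold per-device address-limit fields into one DMA limit. */
#include <stdint.h>
#include <stdio.h>

#define DMA_BIT_MASK(n)	(((n) == 64) ? ~0ULL : ((1ULL << (n)) - 1))

static uint64_t min_not_zero(uint64_t a, uint64_t b)
{
	if (!a)
		return b;
	if (!b)
		return a;
	return a < b ? a : b;
}

int main(void)
{
	/* assumed memory_address_limit values from three IORT nodes */
	unsigned int node_limits[] = { 30, 32, 0 /* 0 == not specified */ };
	uint64_t limit = UINT64_MAX;	/* "no constrained device" default */

	for (unsigned int i = 0; i < 3; i++)
		limit = min_not_zero(limit, DMA_BIT_MASK(node_limits[i]));

	printf("limit: %#llx\n", (unsigned long long)limit);
	return 0;	/* 0x3fffffff: the most constrained (30-bit) node wins */
}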

Cc: Jeremy Linton <[email protected]>
Cc: Lorenzo Pieralisi <[email protected]>
Cc: Nicolas Saenz Julienne <[email protected]>
Cc: Rob Herring <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Robin Murphy <[email protected]>
Cc: Hanjun Guo <[email protected]>
Cc: Sudeep Holla <[email protected]>
Cc: Anshuman Khandual <[email protected]>
Signed-off-by: Ard Biesheuvel <[email protected]>
[nsaenz: unified implementation with DT's counterpart]
Signed-off-by: Nicolas Saenz Julienne <[email protected]>
Tested-by: Jeremy Linton <[email protected]>
Acked-by: Lorenzo Pieralisi <[email protected]>
Acked-by: Hanjun Guo <[email protected]>

---

Changes since v5:
- Unify with DT's counterpart, return phys_addr_t

Changes since v3:
- Use min_not_zero()
- Check revision
- Remove unnecessary #ifdef in zone_sizes_init()

arch/arm64/mm/init.c | 5 +++-
drivers/acpi/arm64/iort.c | 55 +++++++++++++++++++++++++++++++++++++++
include/linux/acpi_iort.h | 4 +++
3 files changed, 63 insertions(+), 1 deletion(-)

diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
index a2ce8a9a71a6..ca5d4b10679d 100644
--- a/arch/arm64/mm/init.c
+++ b/arch/arm64/mm/init.c
@@ -29,6 +29,7 @@
#include <linux/kexec.h>
#include <linux/crash_dump.h>
#include <linux/hugetlb.h>
+#include <linux/acpi_iort.h>

#include <asm/boot.h>
#include <asm/fixmap.h>
@@ -186,11 +187,13 @@ static phys_addr_t __init max_zone_phys(unsigned int zone_bits)
static void __init zone_sizes_init(unsigned long min, unsigned long max)
{
unsigned long max_zone_pfns[MAX_NR_ZONES] = {0};
+ unsigned int __maybe_unused acpi_zone_dma_bits;
unsigned int __maybe_unused dt_zone_dma_bits;

#ifdef CONFIG_ZONE_DMA
+ acpi_zone_dma_bits = fls64(acpi_iort_dma_get_max_cpu_address());
dt_zone_dma_bits = fls64(of_dma_get_max_cpu_address(NULL));
- zone_dma_bits = min(32U, dt_zone_dma_bits);
+ zone_dma_bits = min3(32U, dt_zone_dma_bits, acpi_zone_dma_bits);
arm64_dma_phys_limit = max_zone_phys(zone_dma_bits);
max_zone_pfns[ZONE_DMA] = PFN_DOWN(arm64_dma_phys_limit);
#endif
diff --git a/drivers/acpi/arm64/iort.c b/drivers/acpi/arm64/iort.c
index 9929ff50c0c0..1787406684aa 100644
--- a/drivers/acpi/arm64/iort.c
+++ b/drivers/acpi/arm64/iort.c
@@ -1718,3 +1718,58 @@ void __init acpi_iort_init(void)

iort_init_platform_devices();
}
+
+#ifdef CONFIG_ZONE_DMA
+/*
+ * Extract the highest CPU physical address accessible to all DMA masters in
+ * the system. PHYS_ADDR_MAX is returned when no constrained device is found.
+ */
+phys_addr_t __init acpi_iort_dma_get_max_cpu_address(void)
+{
+ phys_addr_t limit = PHYS_ADDR_MAX;
+ struct acpi_iort_node *node, *end;
+ struct acpi_table_iort *iort;
+ acpi_status status;
+ int i;
+
+ if (acpi_disabled)
+ return limit;
+
+ status = acpi_get_table(ACPI_SIG_IORT, 0,
+ (struct acpi_table_header **)&iort);
+ if (ACPI_FAILURE(status))
+ return limit;
+
+ node = ACPI_ADD_PTR(struct acpi_iort_node, iort, iort->node_offset);
+ end = ACPI_ADD_PTR(struct acpi_iort_node, iort, iort->header.length);
+
+ for (i = 0; i < iort->node_count; i++) {
+ if (node >= end)
+ break;
+
+ switch (node->type) {
+ struct acpi_iort_named_component *ncomp;
+ struct acpi_iort_root_complex *rc;
+ phys_addr_t local_limit;
+
+ case ACPI_IORT_NODE_NAMED_COMPONENT:
+ ncomp = (struct acpi_iort_named_component *)node->node_data;
+ local_limit = DMA_BIT_MASK(ncomp->memory_address_limit);
+ limit = min_not_zero(limit, local_limit);
+ break;
+
+ case ACPI_IORT_NODE_PCI_ROOT_COMPLEX:
+ if (node->revision < 1)
+ break;
+
+ rc = (struct acpi_iort_root_complex *)node->node_data;
+ local_limit = DMA_BIT_MASK(rc->memory_address_limit);
+ limit = min_not_zero(limit, local_limit);
+ break;
+ }
+ node = ACPI_ADD_PTR(struct acpi_iort_node, node, node->length);
+ }
+ acpi_put_table(&iort->header);
+ return limit;
+}
+#endif
diff --git a/include/linux/acpi_iort.h b/include/linux/acpi_iort.h
index 20a32120bb88..1a12baa58e40 100644
--- a/include/linux/acpi_iort.h
+++ b/include/linux/acpi_iort.h
@@ -38,6 +38,7 @@ void iort_dma_setup(struct device *dev, u64 *dma_addr, u64 *size);
const struct iommu_ops *iort_iommu_configure_id(struct device *dev,
const u32 *id_in);
int iort_iommu_msi_get_resv_regions(struct device *dev, struct list_head *head);
+phys_addr_t acpi_iort_dma_get_max_cpu_address(void);
#else
static inline void acpi_iort_init(void) { }
static inline u32 iort_msi_map_id(struct device *dev, u32 id)
@@ -55,6 +56,9 @@ static inline const struct iommu_ops *iort_iommu_configure_id(
static inline
int iort_iommu_msi_get_resv_regions(struct device *dev, struct list_head *head)
{ return 0; }
+
+static inline phys_addr_t acpi_iort_dma_get_max_cpu_address(void)
+{ return PHYS_ADDR_MAX; }
#endif

#endif /* __ACPI_IORT_H__ */
--
2.29.1

2020-11-03 17:36:10

by Nicolas Saenz Julienne

Subject: [PATCH v6 4/7] of: unittest: Add test for of_dma_get_max_cpu_address()

Introduce a test for of_dma_get_max_cpu_address(). It uses the same DT
data as the rest of the dma-ranges unit tests.

Signed-off-by: Nicolas Saenz Julienne <[email protected]>
Reviewed-by: Rob Herring <[email protected]>

---
Changes since v5:
- Update address expected by test

Changes since v3:
- Remove HAS_DMA guards

drivers/of/unittest.c | 18 ++++++++++++++++++
1 file changed, 18 insertions(+)

diff --git a/drivers/of/unittest.c b/drivers/of/unittest.c
index 06cc988faf78..98cc0163301b 100644
--- a/drivers/of/unittest.c
+++ b/drivers/of/unittest.c
@@ -869,6 +869,23 @@ static void __init of_unittest_changeset(void)
#endif
}

+static void __init of_unittest_dma_get_max_cpu_address(void)
+{
+ struct device_node *np;
+ phys_addr_t cpu_addr;
+
+ np = of_find_node_by_path("/testcase-data/address-tests");
+ if (!np) {
+ pr_err("missing testcase data\n");
+ return;
+ }
+
+ cpu_addr = of_dma_get_max_cpu_address(np);
+ unittest(cpu_addr == 0x4fffffff,
+ "of_dma_get_max_cpu_address: wrong CPU addr %pad (expecting %x)\n",
+ &cpu_addr, 0x4fffffff);
+}
+
static void __init of_unittest_dma_ranges_one(const char *path,
u64 expect_dma_addr, u64 expect_paddr)
{
@@ -3266,6 +3283,7 @@ static int __init of_unittest(void)
of_unittest_changeset();
of_unittest_parse_interrupts();
of_unittest_parse_interrupts_extended();
+ of_unittest_dma_get_max_cpu_address();
of_unittest_parse_dma_ranges();
of_unittest_pci_dma_ranges();
of_unittest_match_node();
--
2.29.1

2020-11-03 17:36:29

by Nicolas Saenz Julienne

Subject: [PATCH v6 3/7] of/address: Introduce of_dma_get_max_cpu_address()

Introduce of_dma_get_max_cpu_address(), which provides the highest CPU
physical address addressable by all DMA masters in the system. It's
especially useful for setting memory zone sizes at early boot time.
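
As a rough usage illustration (a standalone sketch, not the in-kernel
parser; the dma-ranges window below is a made-up example exposing only
the bottom 1 GB of RAM to DMA masters):

/* Sketch: the returned value is the end of the highest dma-ranges window. */
#include <stdint.h>
#include <stdio.h>

struct range {
	uint64_t cpu_addr;	/* CPU (parent) address of the window */
	uint64_t size;		/* window size */
};

int main(void)
{
	struct range ranges[] = { { 0x0, 0x40000000ULL } };	/* assumed 1 GB window */
	uint64_t cpu_end = 0;

	for (unsigned int i = 0; i < 1; i++)
		if (ranges[i].cpu_addr + ranges[i].size > cpu_end)
			cpu_end = ranges[i].cpu_addr + ranges[i].size - 1;

	printf("max DMA'able CPU address: %#llx\n", (unsigned long long)cpu_end);
	return 0;	/* 0x3fffffff */
}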

Signed-off-by: Nicolas Saenz Julienne <[email protected]>
Reviewed-by: Rob Herring <[email protected]>

---

Changes since v4:
- Return max address, not address limit (one off difference)

Changes since v3:
- use u64 with cpu_end

Changes since v2:
- Use PHYS_ADDR_MAX
- Return phys_addr_t
- Rename function
- Correct subject
- Add support to start parsing from an arbitrary device node in order
for the function to work with unit tests

drivers/of/address.c | 42 ++++++++++++++++++++++++++++++++++++++++++
include/linux/of.h | 7 +++++++
2 files changed, 49 insertions(+)

diff --git a/drivers/of/address.c b/drivers/of/address.c
index eb9ab4f1e80b..09c0af7fd1c4 100644
--- a/drivers/of/address.c
+++ b/drivers/of/address.c
@@ -1024,6 +1024,48 @@ int of_dma_get_range(struct device_node *np, const struct bus_dma_region **map)
}
#endif /* CONFIG_HAS_DMA */

+/**
+ * of_dma_get_max_cpu_address - Gets highest CPU address suitable for DMA
+ * @np: The node to start searching from or NULL to start from the root
+ *
+ * Gets the highest CPU physical address that is addressable by all DMA masters
+ * in the sub-tree pointed by np, or the whole tree if NULL is passed. If no
+ * DMA constrained device is found, it returns PHYS_ADDR_MAX.
+ */
+phys_addr_t __init of_dma_get_max_cpu_address(struct device_node *np)
+{
+ phys_addr_t max_cpu_addr = PHYS_ADDR_MAX;
+ struct of_range_parser parser;
+ phys_addr_t subtree_max_addr;
+ struct device_node *child;
+ struct of_range range;
+ const __be32 *ranges;
+ u64 cpu_end = 0;
+ int len;
+
+ if (!np)
+ np = of_root;
+
+ ranges = of_get_property(np, "dma-ranges", &len);
+ if (ranges && len) {
+ of_dma_range_parser_init(&parser, np);
+ for_each_of_range(&parser, &range)
+ if (range.cpu_addr + range.size > cpu_end)
+ cpu_end = range.cpu_addr + range.size - 1;
+
+ if (max_cpu_addr > cpu_end)
+ max_cpu_addr = cpu_end;
+ }
+
+ for_each_available_child_of_node(np, child) {
+ subtree_max_addr = of_dma_get_max_cpu_address(child);
+ if (max_cpu_addr > subtree_max_addr)
+ max_cpu_addr = subtree_max_addr;
+ }
+
+ return max_cpu_addr;
+}
+
/**
* of_dma_is_coherent - Check if device is coherent
* @np: device node
diff --git a/include/linux/of.h b/include/linux/of.h
index 5d51891cbf1a..9ed5b8532c30 100644
--- a/include/linux/of.h
+++ b/include/linux/of.h
@@ -558,6 +558,8 @@ int of_map_id(struct device_node *np, u32 id,
const char *map_name, const char *map_mask_name,
struct device_node **target, u32 *id_out);

+phys_addr_t of_dma_get_max_cpu_address(struct device_node *np);
+
#else /* CONFIG_OF */

static inline void of_core_init(void)
@@ -995,6 +997,11 @@ static inline int of_map_id(struct device_node *np, u32 id,
return -EINVAL;
}

+static inline phys_addr_t of_dma_get_max_cpu_address(struct device_node *np)
+{
+ return PHYS_ADDR_MAX;
+}
+
#define of_match_ptr(_ptr) NULL
#define of_match_node(_matches, _node) NULL
#endif /* CONFIG_OF */
--
2.29.1

2020-11-05 16:13:11

by James Morse

Subject: Re: [PATCH v6 1/7] arm64: mm: Move reserve_crashkernel() into mem_init()

Hi!

On 03/11/2020 17:31, Nicolas Saenz Julienne wrote:
> crashkernel might reserve memory located in ZONE_DMA. We plan to delay
> ZONE_DMA's initialization after unflattening the devicetree and ACPI's
> boot table initialization, so move it later in the boot process.
> Specifically into mem_init(), this is the last place crashkernel will be
> able to reserve the memory before the page allocator kicks in.

> There
> isn't any apparent reason for doing this earlier.

It's so that map_mem() can carve it out of the linear/direct map.
This is so that stray writes from a crashing kernel can't accidentally corrupt the kdump
kernel. We depend on this if we continue with kdump, but failed to offline all the other
CPUs. We also depend on this when skipping the checksum code in purgatory, which can be
exceedingly slow.

Grepping around, the current order is:

start_kernel()
-> setup_arch()
-> arm64_memblock_init() /* reserve */
-> paging_init()
-> map_mem() /* carve out reservation */
[...]
-> mm_init()
-> mem_init()


I agree we should add comments to make this apparent!


Thanks,

James


> diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
> index 095540667f0f..fc4ab0d6d5d2 100644
> --- a/arch/arm64/mm/init.c
> +++ b/arch/arm64/mm/init.c
> @@ -386,8 +386,6 @@ void __init arm64_memblock_init(void)
> else
> arm64_dma32_phys_limit = PHYS_MASK + 1;
>
> - reserve_crashkernel();
> -
> reserve_elfcorehdr();
>
> high_memory = __va(memblock_end_of_DRAM() - 1) + 1;
> @@ -508,6 +506,8 @@ void __init mem_init(void)
> else
> swiotlb_force = SWIOTLB_NO_FORCE;
>
> + reserve_crashkernel();
> +
> set_max_mapnr(max_pfn - PHYS_PFN_OFFSET);
>
> #ifndef CONFIG_SPARSEMEM_VMEMMAP
>

2020-11-06 18:50:54

by Nicolas Saenz Julienne

Subject: Re: [PATCH v6 1/7] arm64: mm: Move reserve_crashkernel() into mem_init()

Hi James, thanks for the review. Some comments/questions below.

On Thu, 2020-11-05 at 16:11 +0000, James Morse wrote:
> Hi!
>
> On 03/11/2020 17:31, Nicolas Saenz Julienne wrote:
> > crashkernel might reserve memory located in ZONE_DMA. We plan to delay
> > ZONE_DMA's initialization after unflattening the devicetree and ACPI's
> > boot table initialization, so move it later in the boot process.
> > Specifically into mem_init(), this is the last place crashkernel will be
> > able to reserve the memory before the page allocator kicks in.
> > There
> > isn't any apparent reason for doing this earlier.
>
> It's so that map_mem() can carve it out of the linear/direct map.
> This is so that stray writes from a crashing kernel can't accidentally corrupt the kdump
> kernel. We depend on this if we continue with kdump, but failed to offline all the other
> CPUs.

I presume here you refer to arch_kexec_protect_crashkres(), IIUC this will only
happen further down the line, after having loaded the kdump kernel image. But
it also depends on the mappings to be PAGE sized (flags == NO_BLOCK_MAPPINGS |
NO_CONT_MAPPINGS).

> We also depend on this when skipping the checksum code in purgatory, which can be
> exceedingly slow.

This one I don't fully understand, so I'll lazily assume the prerequisite is
the same WRT how memory is mapped. :)

Ultimately there's also /sys/kernel/kexec_crash_size's handling. Same
prerequisite.

Keeping in mind acpi_table_upgrade() and unflatten_device_tree() depend on
having the linear mappings available. I don't see any simple way of solving
this. Both moving the firmware description routines to use fixmap and correcting
the linear mapping further down the line so as to include kdump's regions seem
excessive/impossible (feel free to correct me here). I'd be happy to hear
suggestions. Otherwise we're back to hard-coding the information as we
initially did.

Let me stress that knowing the DMA constraints in the system before reserving
crashkernel's regions is necessary if we ever want it to work seamlessly on all
platforms. Be it small stuff like the Raspberry Pi or huge servers with TB of
memory.

Regards,
Nicolas



2020-11-10 18:20:57

by Catalin Marinas

Subject: Re: [PATCH v6 1/7] arm64: mm: Move reserve_crashkernel() into mem_init()

On Fri, Nov 06, 2020 at 07:46:29PM +0100, Nicolas Saenz Julienne wrote:
> On Thu, 2020-11-05 at 16:11 +0000, James Morse wrote:
> > On 03/11/2020 17:31, Nicolas Saenz Julienne wrote:
> > > crashkernel might reserve memory located in ZONE_DMA. We plan to delay
> > > ZONE_DMA's initialization after unflattening the devicetree and ACPI's
> > > boot table initialization, so move it later in the boot process.
> > > Specifically into mem_init(), this is the last place crashkernel will be
> > > able to reserve the memory before the page allocator kicks in.
> > > There
> > > isn't any apparent reason for doing this earlier.
> >
> > It's so that map_mem() can carve it out of the linear/direct map.
> > This is so that stray writes from a crashing kernel can't accidentally corrupt the kdump
> > kernel. We depend on this if we continue with kdump, but failed to offline all the other
> > CPUs.
>
> I presume here you refer to arch_kexec_protect_crashkres(), IIUC this will only
> happen further down the line, after having loaded the kdump kernel image. But
> it also depends on the mappings to be PAGE sized (flags == NO_BLOCK_MAPPINGS |
> NO_CONT_MAPPINGS).

IIUC, arch_kexec_protect_crashkres() is only for the crashkernel image,
not the whole reserved memory that the crashkernel will use. For the
latter, we avoid the linear map by marking it as nomap in map_mem().

> > We also depend on this when skipping the checksum code in purgatory, which can be
> > exceedingly slow.
>
> This one I don't fully understand, so I'll lazily assume the prerequisite is
> the same WRT how memory is mapped. :)
>
> Ultimately there's also /sys/kernel/kexec_crash_size's handling. Same
> prerequisite.
>
> Keeping in mind acpi_table_upgrade() and unflatten_device_tree() depend on
> having the linear mappings available.

So it looks like reserve_crashkernel() wants to reserve memory before
setting up the linear map with the information about the DMA zones in
place but that comes later when we can parse the firmware tables.

I wonder, instead of not mapping the crashkernel reservation, can we not
do an arch_kexec_protect_crashkres() for the whole reservation after we
created the linear map?

> Let me stress that knowing the DMA constraints in the system before reserving
> crashkernel's regions is necessary if we ever want it to work seamlessly on all
> platforms. Be it small stuff like the Raspberry Pi or huge servers with TB of
> memory.

Indeed. So we have 3 options (so far):

1. Allow the crashkernel reservation to go into the linear map but set
it to invalid once allocated.

2. Parse the flattened DT (not sure what we do with ACPI) before
creating the linear map. We may have to rely on some SoC ID here
instead of actual DMA ranges.

3. Assume the smallest ZONE_DMA possible on arm64 (1GB) for crashkernel
reservations and not rely on arm64_dma_phys_limit in
reserve_crashkernel().

I think (2) we tried hard to avoid. Option (3) brings us back to the
issues we had on large crashkernel reservations regressing on some
platforms (though it's been a while since, they mostly went quiet ;)).
However, with Chen's crashkernel patches we end up with two
reservations, one in the low DMA zone and one higher, potentially above
4GB. Having a fixed 1GB limit wouldn't be any worse for crashkernel
reservations than what we have now.

If (1) works, I'd go for it (James knows this part better than me),
otherwise we can go for (3).

--
Catalin

2020-11-12 16:00:47

by Nicolas Saenz Julienne

Subject: Re: [PATCH v6 1/7] arm64: mm: Move reserve_crashkernel() into mem_init()

Hi Catalin,

On Tue, 2020-11-10 at 18:17 +0000, Catalin Marinas wrote:
> On Fri, Nov 06, 2020 at 07:46:29PM +0100, Nicolas Saenz Julienne wrote:
> > On Thu, 2020-11-05 at 16:11 +0000, James Morse wrote:
> > > On 03/11/2020 17:31, Nicolas Saenz Julienne wrote:
> > > > crashkernel might reserve memory located in ZONE_DMA. We plan to delay
> > > > ZONE_DMA's initialization after unflattening the devicetree and ACPI's
> > > > boot table initialization, so move it later in the boot process.
> > > > Specifically into mem_init(), this is the last place crashkernel will be
> > > > able to reserve the memory before the page allocator kicks in.
> > > > There
> > > > isn't any apparent reason for doing this earlier.
> > >
> > > It's so that map_mem() can carve it out of the linear/direct map.
> > > This is so that stray writes from a crashing kernel can't accidentally corrupt the kdump
> > > kernel. We depend on this if we continue with kdump, but failed to offline all the other
> > > CPUs.
> >
> > I presume here you refer to arch_kexec_protect_crashkres(), IIUC this will only
> > happen further down the line, after having loaded the kdump kernel image. But
> > it also depends on the mappings to be PAGE sized (flags == NO_BLOCK_MAPPINGS |
> > NO_CONT_MAPPINGS).
>
> IIUC, arch_kexec_protect_crashkres() is only for the crashkernel image,
> not the whole reserved memory that the crashkernel will use. For the
> latter, we avoid the linear map by marking it as nomap in map_mem().

I'm not sure we're on the same page here, so sorry if this was already implied.

The crashkernel memory mapping is bypassed while preparing the linear mappings
but it is then mapped right away, with page granularity and !MTE.
See paging_init()->map_mem():

/*
* Use page-level mappings here so that we can shrink the region
* in page granularity and put back unused memory to buddy system
* through /sys/kernel/kexec_crash_size interface.
*/
if (crashk_res.end) {
__map_memblock(pgdp, crashk_res.start, crashk_res.end + 1,
PAGE_KERNEL,
NO_BLOCK_MAPPINGS | NO_CONT_MAPPINGS);
memblock_clear_nomap(crashk_res.start,
resource_size(&crashk_res));
}

IIUC the inconvenience here is that we need special mapping options for
crashkernel and updating those after having mapped that memory as regular
memory isn't possible/easy to do.

> > > We also depend on this when skipping the checksum code in purgatory, which can be
> > > exceedingly slow.
> >
> > This one I don't fully understand, so I'll lazily assume the prerequisite is
> > the same WRT how memory is mapped. :)
> >
> > Ultimately there's also /sys/kernel/kexec_crash_size's handling. Same
> > prerequisite.
> >
> > Keeping in mind acpi_table_upgrade() and unflatten_device_tree() depend on
> > having the linear mappings available.
>
> So it looks like reserve_crashkernel() wants to reserve memory before
> setting up the linear map with the information about the DMA zones in
> place but that comes later when we can parse the firmware tables.
>
> I wonder, instead of not mapping the crashkernel reservation, can we not
> do an arch_kexec_protect_crashkres() for the whole reservation after we
> created the linear map?

arch_kexec_protect_crashkres() depends on __change_memory_common() which
ultimately depends on the memory to be mapped with PAGE_SIZE pages. As I
comment above, the trick would work as long as there is a way to update the
linear mappings with whatever crashkernel needs later in the boot process.

> > Let me stress that knowing the DMA constraints in the system before reserving
> > crashkernel's regions is necessary if we ever want it to work seamlessly on all
> > platforms. Be it small stuff like the Raspberry Pi or huge servers with TB of
> > memory.
>
> Indeed. So we have 3 options (so far):
>
> 1. Allow the crashkernel reservation to go into the linear map but set
> it to invalid once allocated.
>
> 2. Parse the flattened DT (not sure what we do with ACPI) before
> creating the linear map. We may have to rely on some SoC ID here
> instead of actual DMA ranges.
>
> 3. Assume the smallest ZONE_DMA possible on arm64 (1GB) for crashkernel
> reservations and not rely on arm64_dma_phys_limit in
> reserve_crashkernel().
>
> I think (2) we tried hard to avoid. Option (3) brings us back to the
> issues we had on large crashkernel reservations regressing on some
> platforms (though it's been a while since, they mostly went quiet ;)).
> However, with Chen's crashkernel patches we end up with two
> reservations, one in the low DMA zone and one higher, potentially above
> 4GB. Having a fixed 1GB limit wouldn't be any worse for crashkernel
> reservations than what we have now.
>
> If (1) works, I'd go for it (James knows this part better than me),
> otherwise we can go for (3).

Overall, I'd prefer (1) as well, and I'd be happy to have a go at it. If not
I'll append (3) in this series.

Regards,
Nicolas



2020-11-13 11:32:50

by Catalin Marinas

Subject: Re: [PATCH v6 1/7] arm64: mm: Move reserve_crashkernel() into mem_init()

Hi Nicolas,

On Thu, Nov 12, 2020 at 04:56:38PM +0100, Nicolas Saenz Julienne wrote:
> On Tue, 2020-11-10 at 18:17 +0000, Catalin Marinas wrote:
> > On Fri, Nov 06, 2020 at 07:46:29PM +0100, Nicolas Saenz Julienne wrote:
> > > On Thu, 2020-11-05 at 16:11 +0000, James Morse wrote:
> > > > On 03/11/2020 17:31, Nicolas Saenz Julienne wrote:
> > > > > crashkernel might reserve memory located in ZONE_DMA. We plan to delay
> > > > > ZONE_DMA's initialization after unflattening the devicetree and ACPI's
> > > > > boot table initialization, so move it later in the boot process.
> > > > > Specifically into mem_init(), this is the last place crashkernel will be
> > > > > able to reserve the memory before the page allocator kicks in.
> > > > > There
> > > > > isn't any apparent reason for doing this earlier.
> > > >
> > > > It's so that map_mem() can carve it out of the linear/direct map.
> > > > This is so that stray writes from a crashing kernel can't accidentally corrupt the kdump
> > > > kernel. We depend on this if we continue with kdump, but failed to offline all the other
> > > > CPUs.
> > >
> > > I presume here you refer to arch_kexec_protect_crashkres(), IIUC this will only
> > > happen further down the line, after having loaded the kdump kernel image. But
> > > it also depends on the mappings to be PAGE sized (flags == NO_BLOCK_MAPPINGS |
> > > NO_CONT_MAPPINGS).
> >
> > IIUC, arch_kexec_protect_crashkres() is only for the crashkernel image,
> > not the whole reserved memory that the crashkernel will use. For the
> > latter, we avoid the linear map by marking it as nomap in map_mem().
>
> I'm not sure we're on the same page here, so sorry if this was already implied.
>
> The crashkernel memory mapping is bypassed while preparing the linear mappings
> but it is then mapped right away, with page granularity and !MTE.
> See paging_init()->map_mem():
>
> /*
> * Use page-level mappings here so that we can shrink the region
> * in page granularity and put back unused memory to buddy system
> * through /sys/kernel/kexec_crash_size interface.
> */
> if (crashk_res.end) {
> __map_memblock(pgdp, crashk_res.start, crashk_res.end + 1,
> PAGE_KERNEL,
> NO_BLOCK_MAPPINGS | NO_CONT_MAPPINGS);
> memblock_clear_nomap(crashk_res.start,
> resource_size(&crashk_res));
> }
>
> IIUC the inconvenience here is that we need special mapping options for
> crashkernel and updating those after having mapped that memory as regular
> memory isn't possible/easy to do.

You are right, it still gets mapped but with page granularity. However,
to James' point, we still need to know the crashkernel range in
map_mem() as arch_kexec_protect_crashkres() relies on having page rather
than block mappings.

> > > > We also depend on this when skipping the checksum code in purgatory, which can be
> > > > exceedingly slow.
> > >
> > > This one I don't fully understand, so I'll lazily assume the prerequisite is
> > > the same WRT how memory is mapped. :)
> > >
> > > Ultimately there's also /sys/kernel/kexec_crash_size's handling. Same
> > > prerequisite.
> > >
> > > Keeping in mind acpi_table_upgrade() and unflatten_device_tree() depend on
> > > having the linear mappings available.
> >
> > So it looks like reserve_crashkernel() wants to reserve memory before
> > setting up the linear map with the information about the DMA zones in
> > place but that comes later when we can parse the firmware tables.
> >
> > I wonder, instead of not mapping the crashkernel reservation, can we not
> > do an arch_kexec_protect_crashkres() for the whole reservation after we
> > created the linear map?
>
> arch_kexec_protect_crashkres() depends on __change_memory_common() which
> ultimately depends on the memory to be mapped with PAGE_SIZE pages. As I
> comment above, the trick would work as long as there is a way to update the
> linear mappings with whatever crashkernel needs later in the boot process.

Breaking block mappings into pages is a lot more difficult later. OTOH,
the default these days is rodata_full==true, so I don't think we have
block mappings anyway. We could add NO_BLOCK_MAPPINGS if KEXEC_CORE is
enabled.

> > > Let me stress that knowing the DMA constraints in the system before reserving
> > > crashkernel's regions is necessary if we ever want it to work seamlessly on all
> > > platforms. Be it small stuff like the Raspberry Pi or huge servers with TB of
> > > memory.
> >
> > Indeed. So we have 3 options (so far):
> >
> > 1. Allow the crashkernel reservation to go into the linear map but set
> > it to invalid once allocated.
> >
> > 2. Parse the flattened DT (not sure what we do with ACPI) before
> > creating the linear map. We may have to rely on some SoC ID here
> > instead of actual DMA ranges.
> >
> > 3. Assume the smallest ZONE_DMA possible on arm64 (1GB) for crashkernel
> > reservations and not rely on arm64_dma_phys_limit in
> > reserve_crashkernel().
> >
> > I think (2) we tried hard to avoid. Option (3) brings us back to the
> > issues we had on large crashkernel reservations regressing on some
> > platforms (though it's been a while since, they mostly went quiet ;)).
> > However, with Chen's crashkernel patches we end up with two
> > reservations, one in the low DMA zone and one higher, potentially above
> > 4GB. Having a fixed 1GB limit wouldn't be any worse for crashkernel
> > reservations than what we have now.
> >
> > If (1) works, I'd go for it (James knows this part better than me),
> > otherwise we can go for (3).
>
> Overall, I'd prefer (1) as well, and I'd be happy to have a go at it. If not
> I'll append (3) in this series.

I think for 1 we could also remove the additional KEXEC_CORE checks,
something like below, untested:

diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index 3e5a6913acc8..27ab609c1c0c 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -477,7 +477,8 @@ static void __init map_mem(pgd_t *pgdp)
int flags = 0;
u64 i;

- if (rodata_full || debug_pagealloc_enabled())
+ if (rodata_full || debug_pagealloc_enabled() ||
+ IS_ENABLED(CONFIG_KEXEC_CORE))
flags = NO_BLOCK_MAPPINGS | NO_CONT_MAPPINGS;

/*
@@ -487,11 +488,6 @@ static void __init map_mem(pgd_t *pgdp)
* the following for-loop
*/
memblock_mark_nomap(kernel_start, kernel_end - kernel_start);
-#ifdef CONFIG_KEXEC_CORE
- if (crashk_res.end)
- memblock_mark_nomap(crashk_res.start,
- resource_size(&crashk_res));
-#endif

/* map all the memory banks */
for_each_mem_range(i, &start, &end) {
@@ -518,21 +514,6 @@ static void __init map_mem(pgd_t *pgdp)
__map_memblock(pgdp, kernel_start, kernel_end,
PAGE_KERNEL, NO_CONT_MAPPINGS);
memblock_clear_nomap(kernel_start, kernel_end - kernel_start);
-
-#ifdef CONFIG_KEXEC_CORE
- /*
- * Use page-level mappings here so that we can shrink the region
- * in page granularity and put back unused memory to buddy system
- * through /sys/kernel/kexec_crash_size interface.
- */
- if (crashk_res.end) {
- __map_memblock(pgdp, crashk_res.start, crashk_res.end + 1,
- PAGE_KERNEL,
- NO_BLOCK_MAPPINGS | NO_CONT_MAPPINGS);
- memblock_clear_nomap(crashk_res.start,
- resource_size(&crashk_res));
- }
-#endif
}

void mark_rodata_ro(void)

--
Catalin

2020-11-19 14:12:59

by Nicolas Saenz Julienne

Subject: Re: [PATCH v6 1/7] arm64: mm: Move reserve_crashkernel() into mem_init()

Hi Catalin, James,
sorry for the late reply but I got sidetracked.

On Fri, 2020-11-13 at 11:29 +0000, Catalin Marinas wrote:
[...]
> > > > Let me stress that knowing the DMA constraints in the system before reserving
> > > > crashkernel's regions is necessary if we ever want it to work seamlessly on all
> > > > platforms. Be it small stuff like the Raspberry Pi or huge servers with TB of
> > > > memory.
> > >
> > > Indeed. So we have 3 options (so far):
> > >
> > > 1. Allow the crashkernel reservation to go into the linear map but set
> > > it to invalid once allocated.
> > >
> > > 2. Parse the flattened DT (not sure what we do with ACPI) before
> > > creating the linear map. We may have to rely on some SoC ID here
> > > instead of actual DMA ranges.
> > >
> > > 3. Assume the smallest ZONE_DMA possible on arm64 (1GB) for crashkernel
> > > reservations and not rely on arm64_dma_phys_limit in
> > > reserve_crashkernel().
> > >
> > > I think (2) we tried hard to avoid. Option (3) brings us back to the
> > > issues we had on large crashkernel reservations regressing on some
> > > platforms (though it's been a while since, they mostly went quiet ;)).
> > > However, with Chen's crashkernel patches we end up with two
> > > reservations, one in the low DMA zone and one higher, potentially above
> > > 4GB. Having a fixed 1GB limit wouldn't be any worse for crashkernel
> > > reservations than what we have now.
> > >
> > > If (1) works, I'd go for it (James knows this part better than me),
> > > otherwise we can go for (3).
> >
> > Overall, I'd prefer (1) as well, and I'd be happy to have a go at it. If not
> > I'll append (3) in this series.
>
> I think for 1 we could also remove the additional KEXEC_CORE checks,
> something like below, untested:
>
> diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
> index 3e5a6913acc8..27ab609c1c0c 100644
> --- a/arch/arm64/mm/mmu.c
> +++ b/arch/arm64/mm/mmu.c
> @@ -477,7 +477,8 @@ static void __init map_mem(pgd_t *pgdp)
> int flags = 0;
> u64 i;
>
> - if (rodata_full || debug_pagealloc_enabled())
> + if (rodata_full || debug_pagealloc_enabled() ||
> + IS_ENABLED(CONFIG_KEXEC_CORE))
> flags = NO_BLOCK_MAPPINGS | NO_CONT_MAPPINGS;
>
> /*
> @@ -487,11 +488,6 @@ static void __init map_mem(pgd_t *pgdp)
> * the following for-loop
> */
> memblock_mark_nomap(kernel_start, kernel_end - kernel_start);
> -#ifdef CONFIG_KEXEC_CORE
> - if (crashk_res.end)
> - memblock_mark_nomap(crashk_res.start,
> - resource_size(&crashk_res));
> -#endif
>
> /* map all the memory banks */
> for_each_mem_range(i, &start, &end) {
> @@ -518,21 +514,6 @@ static void __init map_mem(pgd_t *pgdp)
> __map_memblock(pgdp, kernel_start, kernel_end,
> PAGE_KERNEL, NO_CONT_MAPPINGS);
> memblock_clear_nomap(kernel_start, kernel_end - kernel_start);
> -
> -#ifdef CONFIG_KEXEC_CORE
> - /*
> - * Use page-level mappings here so that we can shrink the region
> - * in page granularity and put back unused memory to buddy system
> - * through /sys/kernel/kexec_crash_size interface.
> - */
> - if (crashk_res.end) {
> - __map_memblock(pgdp, crashk_res.start, crashk_res.end + 1,
> - PAGE_KERNEL,
> - NO_BLOCK_MAPPINGS | NO_CONT_MAPPINGS);
> - memblock_clear_nomap(crashk_res.start,
> - resource_size(&crashk_res));
> - }
> -#endif
> }
>
> void mark_rodata_ro(void)

So as far as I'm concerned this is good enough for me. I took the time to
properly test crashkernel on RPi4 using the series, this patch, and another
small fix to properly update /proc/iomem.

I'll send v7 soon, but before, James (or anyone for that matter) any obvious
push-back to Catalin's solution?

Regards,
Nicolas



2020-11-19 17:13:33

by Catalin Marinas

Subject: Re: [PATCH v6 1/7] arm64: mm: Move reserve_crashkernel() into mem_init()

On Thu, Nov 19, 2020 at 03:09:58PM +0100, Nicolas Saenz Julienne wrote:
> On Fri, 2020-11-13 at 11:29 +0000, Catalin Marinas wrote:
> [...]
> > > > > Let me stress that knowing the DMA constraints in the system before reserving
> > > > > crashkernel's regions is necessary if we ever want it to work seamlessly on all
> > > > > platforms. Be it small stuff like the Raspberry Pi or huge servers with TB of
> > > > > memory.
> > > >
> > > > Indeed. So we have 3 options (so far):
> > > >
> > > > 1. Allow the crashkernel reservation to go into the linear map but set
> > > > it to invalid once allocated.
> > > >
> > > > 2. Parse the flattened DT (not sure what we do with ACPI) before
> > > > creating the linear map. We may have to rely on some SoC ID here
> > > > instead of actual DMA ranges.
> > > >
> > > > 3. Assume the smallest ZONE_DMA possible on arm64 (1GB) for crashkernel
> > > > reservations and not rely on arm64_dma_phys_limit in
> > > > reserve_crashkernel().
> > > >
> > > > I think (2) we tried hard to avoid. Option (3) brings us back to the
> > > > issues we had on large crashkernel reservations regressing on some
> > > > platforms (though it's been a while since, they mostly went quiet ;)).
> > > > However, with Chen's crashkernel patches we end up with two
> > > > reservations, one in the low DMA zone and one higher, potentially above
> > > > 4GB. Having a fixed 1GB limit wouldn't be any worse for crashkernel
> > > > reservations than what we have now.
> > > >
> > > > If (1) works, I'd go for it (James knows this part better than me),
> > > > otherwise we can go for (3).
> > >
> > > Overall, I'd prefer (1) as well, and I'd be happy to have a go at it. If not
> > > I'll append (3) in this series.
> >
> > I think for 1 we could also remove the additional KEXEC_CORE checks,
> > something like below, untested:
> >
> > diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
> > index 3e5a6913acc8..27ab609c1c0c 100644
> > --- a/arch/arm64/mm/mmu.c
> > +++ b/arch/arm64/mm/mmu.c
> > @@ -477,7 +477,8 @@ static void __init map_mem(pgd_t *pgdp)
> > int flags = 0;
> > u64 i;
> >
> > - if (rodata_full || debug_pagealloc_enabled())
> > + if (rodata_full || debug_pagealloc_enabled() ||
> > + IS_ENABLED(CONFIG_KEXEC_CORE))
> > flags = NO_BLOCK_MAPPINGS | NO_CONT_MAPPINGS;
> >
> > /*
> > @@ -487,11 +488,6 @@ static void __init map_mem(pgd_t *pgdp)
> > * the following for-loop
> > */
> > memblock_mark_nomap(kernel_start, kernel_end - kernel_start);
> > -#ifdef CONFIG_KEXEC_CORE
> > - if (crashk_res.end)
> > - memblock_mark_nomap(crashk_res.start,
> > - resource_size(&crashk_res));
> > -#endif
> >
> > /* map all the memory banks */
> > for_each_mem_range(i, &start, &end) {
> > @@ -518,21 +514,6 @@ static void __init map_mem(pgd_t *pgdp)
> > __map_memblock(pgdp, kernel_start, kernel_end,
> > PAGE_KERNEL, NO_CONT_MAPPINGS);
> > memblock_clear_nomap(kernel_start, kernel_end - kernel_start);
> > -
> > -#ifdef CONFIG_KEXEC_CORE
> > - /*
> > - * Use page-level mappings here so that we can shrink the region
> > - * in page granularity and put back unused memory to buddy system
> > - * through /sys/kernel/kexec_crash_size interface.
> > - */
> > - if (crashk_res.end) {
> > - __map_memblock(pgdp, crashk_res.start, crashk_res.end + 1,
> > - PAGE_KERNEL,
> > - NO_BLOCK_MAPPINGS | NO_CONT_MAPPINGS);
> > - memblock_clear_nomap(crashk_res.start,
> > - resource_size(&crashk_res));
> > - }
> > -#endif
> > }
> >
> > void mark_rodata_ro(void)
>
> So as far as I'm concerned this is good enough for me. I took the time to
> properly test crashkernel on RPi4 using the series, this patch, and another
> small fix to properly update /proc/iomem.
>
> I'll send v7 soon, but before, James (or anyone for that matter) any obvious
> push-back to Catalin's solution?

I talked to James earlier and he was suggesting that we check the
command line for any crashkernel reservations and only disable block
mappings in that case, see the diff below on top of the one I already
sent (still testing it).

If you don't have any other changes for v7, I'm happy to pick v6 up on
top of the no-block-mapping fix.

diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index ed71b1c305d7..acdec0c67d3b 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -469,6 +469,21 @@ void __init mark_linear_text_alias_ro(void)
PAGE_KERNEL_RO);
}

+static bool crash_mem_map __initdata;
+
+static int __init enable_crash_mem_map(char *arg)
+{
+ /*
+ * Proper parameter parsing is done by reserve_crashkernel(). We only
+ * need to know if the linear map has to avoid block mappings so that
+ * the crashkernel reservations can be unmapped later.
+ */
+ crash_mem_map = false;
+
+ return 0;
+}
+early_param("crashkernel", enable_crash_mem_map);
+
static void __init map_mem(pgd_t *pgdp)
{
phys_addr_t kernel_start = __pa_symbol(_stext);
@@ -477,8 +492,7 @@ static void __init map_mem(pgd_t *pgdp)
int flags = 0;
u64 i;

- if (rodata_full || debug_pagealloc_enabled() ||
- IS_ENABLED(CONFIG_KEXEC_CORE))
+ if (rodata_full || debug_pagealloc_enabled() || crash_mem_map)
flags = NO_BLOCK_MAPPINGS | NO_CONT_MAPPINGS;

/*

2020-11-19 17:29:31

by Catalin Marinas

Subject: Re: [PATCH v6 1/7] arm64: mm: Move reserve_crashkernel() into mem_init()

On Thu, Nov 19, 2020 at 05:10:49PM +0000, Catalin Marinas wrote:
> diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
> index ed71b1c305d7..acdec0c67d3b 100644
> --- a/arch/arm64/mm/mmu.c
> +++ b/arch/arm64/mm/mmu.c
> @@ -469,6 +469,21 @@ void __init mark_linear_text_alias_ro(void)
> PAGE_KERNEL_RO);
> }
>
> +static bool crash_mem_map __initdata;
> +
> +static int __init enable_crash_mem_map(char *arg)
> +{
> + /*
> + * Proper parameter parsing is done by reserve_crashkernel(). We only
> + * need to know if the linear map has to avoid block mappings so that
> + * the crashkernel reservations can be unmapped later.
> + */
> + crash_mem_map = false;

It should be set to true.

--
Catalin

2020-11-19 17:29:56

by Nicolas Saenz Julienne

Subject: Re: [PATCH v6 1/7] arm64: mm: Move reserve_crashkernel() into mem_init()

On Thu, 2020-11-19 at 17:10 +0000, Catalin Marinas wrote:
> On Thu, Nov 19, 2020 at 03:09:58PM +0100, Nicolas Saenz Julienne wrote:
> > On Fri, 2020-11-13 at 11:29 +0000, Catalin Marinas wrote:
> > [...]
> > > > > > Let me stress that knowing the DMA constraints in the system before reserving
> > > > > > crashkernel's regions is necessary if we ever want it to work seamlessly on all
> > > > > > platforms. Be it small stuff like the Raspberry Pi or huge servers with TB of
> > > > > > memory.
> > > > >
> > > > > Indeed. So we have 3 options (so far):
> > > > >
> > > > > 1. Allow the crashkernel reservation to go into the linear map but set
> > > > > it to invalid once allocated.
> > > > >
> > > > > 2. Parse the flattened DT (not sure what we do with ACPI) before
> > > > > creating the linear map. We may have to rely on some SoC ID here
> > > > > instead of actual DMA ranges.
> > > > >
> > > > > 3. Assume the smallest ZONE_DMA possible on arm64 (1GB) for crashkernel
> > > > > reservations and not rely on arm64_dma_phys_limit in
> > > > > reserve_crashkernel().
> > > > >
> > > > > I think (2) we tried hard to avoid. Option (3) brings us back to the
> > > > > issues we had on large crashkernel reservations regressing on some
> > > > > platforms (though it's been a while since, they mostly went quiet ;)).
> > > > > However, with Chen's crashkernel patches we end up with two
> > > > > reservations, one in the low DMA zone and one higher, potentially above
> > > > > 4GB. Having a fixed 1GB limit wouldn't be any worse for crashkernel
> > > > > reservations than what we have now.
> > > > >
> > > > > If (1) works, I'd go for it (James knows this part better than me),
> > > > > otherwise we can go for (3).
> > > >
> > > > Overall, I'd prefer (1) as well, and I'd be happy to have a go at it. If not
> > > > I'll append (3) in this series.
> > >
> > > I think for 1 we could also remove the additional KEXEC_CORE checks,
> > > something like below, untested:
> > >
> > > diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
> > > index 3e5a6913acc8..27ab609c1c0c 100644
> > > --- a/arch/arm64/mm/mmu.c
> > > +++ b/arch/arm64/mm/mmu.c
> > > @@ -477,7 +477,8 @@ static void __init map_mem(pgd_t *pgdp)
> > > int flags = 0;
> > > u64 i;
> > >
> > > - if (rodata_full || debug_pagealloc_enabled())
> > > + if (rodata_full || debug_pagealloc_enabled() ||
> > > + IS_ENABLED(CONFIG_KEXEC_CORE))
> > > flags = NO_BLOCK_MAPPINGS | NO_CONT_MAPPINGS;
> > >
> > > /*
> > > @@ -487,11 +488,6 @@ static void __init map_mem(pgd_t *pgdp)
> > > * the following for-loop
> > > */
> > > memblock_mark_nomap(kernel_start, kernel_end - kernel_start);
> > > -#ifdef CONFIG_KEXEC_CORE
> > > - if (crashk_res.end)
> > > - memblock_mark_nomap(crashk_res.start,
> > > - resource_size(&crashk_res));
> > > -#endif
> > >
> > > /* map all the memory banks */
> > > for_each_mem_range(i, &start, &end) {
> > > @@ -518,21 +514,6 @@ static void __init map_mem(pgd_t *pgdp)
> > > __map_memblock(pgdp, kernel_start, kernel_end,
> > > PAGE_KERNEL, NO_CONT_MAPPINGS);
> > > memblock_clear_nomap(kernel_start, kernel_end - kernel_start);
> > > -
> > > -#ifdef CONFIG_KEXEC_CORE
> > > - /*
> > > - * Use page-level mappings here so that we can shrink the region
> > > - * in page granularity and put back unused memory to buddy system
> > > - * through /sys/kernel/kexec_crash_size interface.
> > > - */
> > > - if (crashk_res.end) {
> > > - __map_memblock(pgdp, crashk_res.start, crashk_res.end + 1,
> > > - PAGE_KERNEL,
> > > - NO_BLOCK_MAPPINGS | NO_CONT_MAPPINGS);
> > > - memblock_clear_nomap(crashk_res.start,
> > > - resource_size(&crashk_res));
> > > - }
> > > -#endif
> > > }
> > >
> > > void mark_rodata_ro(void)
> >
> > So as far as I'm concerned this is good enough for me. I took the time to
> > properly test crashkernel on RPi4 using the series, this patch, and another
> > small fix to properly update /proc/iomem.
> >
> > I'll send v7 soon, but before, James (or anyone for that matter) any obvious
> > push-back to Catalin's solution?
>
> I talked to James earlier and he was suggesting that we check the
> command line for any crashkernel reservations and only disable block
> mappings in that case, see the diff below on top of the one I already
> sent (still testing it).

That's even better :)

> If you don't have any other changes for v7, I'm happy to pick v6 up on
> top of the no-block-mapping fix.

Yes, I've got a small change in patch #1: the crashkernel reservation has to be
performed before request_standard_resources() is called, which is OK since
we're all set up by then. I moved the crashkernel reservation to the end of
bootmem_init(). I attached the patch. If it's easier for you I'll send v7.

Regards,
Nicolas


Attachments:
0001-arm64-mm-Move-reserve_crashkernel-into-mem_init.patch (1.40 kB)

2020-11-19 17:47:23

by Catalin Marinas

Subject: Re: [PATCH v6 1/7] arm64: mm: Move reserve_crashkernel() into mem_init()

On Thu, Nov 19, 2020 at 06:25:29PM +0100, Nicolas Saenz Julienne wrote:
> On Thu, 2020-11-19 at 17:10 +0000, Catalin Marinas wrote:
> > On Thu, Nov 19, 2020 at 03:09:58PM +0100, Nicolas Saenz Julienne wrote:
> > > On Fri, 2020-11-13 at 11:29 +0000, Catalin Marinas wrote:
> > > [...]
> > > > > > > Let me stress that knowing the DMA constraints in the system before reserving
> > > > > > > crashkernel's regions is necessary if we ever want it to work seamlessly on all
> > > > > > > platforms. Be it small stuff like the Raspberry Pi or huge servers with TB of
> > > > > > > memory.
> > > > > >
> > > > > > Indeed. So we have 3 options (so far):
> > > > > >
> > > > > > 1. Allow the crashkernel reservation to go into the linear map but set
> > > > > > it to invalid once allocated.
> > > > > >
> > > > > > 2. Parse the flattened DT (not sure what we do with ACPI) before
> > > > > > creating the linear map. We may have to rely on some SoC ID here
> > > > > > instead of actual DMA ranges.
> > > > > >
> > > > > > 3. Assume the smallest ZONE_DMA possible on arm64 (1GB) for crashkernel
> > > > > > reservations and not rely on arm64_dma_phys_limit in
> > > > > > reserve_crashkernel().
> > > > > >
> > > > > > I think (2) we tried hard to avoid. Option (3) brings us back to the
> > > > > > issues we had on large crashkernel reservations regressing on some
> > > > > > platforms (though it's been a while since, they mostly went quiet ;)).
> > > > > > However, with Chen's crashkernel patches we end up with two
> > > > > > reservations, one in the low DMA zone and one higher, potentially above
> > > > > > 4GB. Having a fixed 1GB limit wouldn't be any worse for crashkernel
> > > > > > reservations than what we have now.
> > > > > >
> > > > > > If (1) works, I'd go for it (James knows this part better than me),
> > > > > > otherwise we can go for (3).
> > > > >
> > > > > Overall, I'd prefer (1) as well, and I'd be happy to have a go at it. If not
> > > > > I'll append (3) in this series.
> > > >
> > > > I think for 1 we could also remove the additional KEXEC_CORE checks,
> > > > something like below, untested:
> > > >
> > > > diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
> > > > index 3e5a6913acc8..27ab609c1c0c 100644
> > > > --- a/arch/arm64/mm/mmu.c
> > > > +++ b/arch/arm64/mm/mmu.c
> > > > @@ -477,7 +477,8 @@ static void __init map_mem(pgd_t *pgdp)
> > > > int flags = 0;
> > > > u64 i;
> > > >
> > > > - if (rodata_full || debug_pagealloc_enabled())
> > > > + if (rodata_full || debug_pagealloc_enabled() ||
> > > > + IS_ENABLED(CONFIG_KEXEC_CORE))
> > > > flags = NO_BLOCK_MAPPINGS | NO_CONT_MAPPINGS;
> > > >
> > > > /*
> > > > @@ -487,11 +488,6 @@ static void __init map_mem(pgd_t *pgdp)
> > > > * the following for-loop
> > > > */
> > > > memblock_mark_nomap(kernel_start, kernel_end - kernel_start);
> > > > -#ifdef CONFIG_KEXEC_CORE
> > > > - if (crashk_res.end)
> > > > - memblock_mark_nomap(crashk_res.start,
> > > > - resource_size(&crashk_res));
> > > > -#endif
> > > >
> > > > /* map all the memory banks */
> > > > for_each_mem_range(i, &start, &end) {
> > > > @@ -518,21 +514,6 @@ static void __init map_mem(pgd_t *pgdp)
> > > > __map_memblock(pgdp, kernel_start, kernel_end,
> > > > PAGE_KERNEL, NO_CONT_MAPPINGS);
> > > > memblock_clear_nomap(kernel_start, kernel_end - kernel_start);
> > > > -
> > > > -#ifdef CONFIG_KEXEC_CORE
> > > > - /*
> > > > - * Use page-level mappings here so that we can shrink the region
> > > > - * in page granularity and put back unused memory to buddy system
> > > > - * through /sys/kernel/kexec_crash_size interface.
> > > > - */
> > > > - if (crashk_res.end) {
> > > > - __map_memblock(pgdp, crashk_res.start, crashk_res.end + 1,
> > > > - PAGE_KERNEL,
> > > > - NO_BLOCK_MAPPINGS | NO_CONT_MAPPINGS);
> > > > - memblock_clear_nomap(crashk_res.start,
> > > > - resource_size(&crashk_res));
> > > > - }
> > > > -#endif
> > > > }
> > > >
> > > > void mark_rodata_ro(void)
> > >
> > > So as far as I'm concerned this is good enough for me. I took the time to
> > > properly test crashkernel on RPi4 using the series, this patch, and another
> > > small fix to properly update /proc/iomem.
> > >
> > > I'll send v7 soon, but before, James (or anyone for that matter) any obvious
> > > push-back to Catalin's solution?
> >
> > I talked to James earlier and he was suggesting that we check the
> > command line for any crashkernel reservations and only disable block
> > mappings in that case, see the diff below on top of the one I already
> > sent (still testing it).
>
> That's even better :)
>
> > If you don't have any other changes for v7, I'm happy to pick v6 up on
> > top of the no-block-mapping fix.
>
> Yes, I've got a small change in patch #1: the crashkernel reservation has to be
> performed before request_standard_resources() is called, which is OK since
> we're all set up by then. I moved the crashkernel reservation to the end of
> bootmem_init(). I attached the patch. If it's easier for you I'll send v7.

Please send a v7, otherwise b4 gets confused.

Thanks.

--
Catalin

2020-11-19 18:21:00

by James Morse

Subject: Re: [PATCH v6 1/7] arm64: mm: Move reserve_crashkernel() into mem_init()

Hi,

(sorry for the late response)

On 06/11/2020 18:46, Nicolas Saenz Julienne wrote:
> On Thu, 2020-11-05 at 16:11 +0000, James Morse wrote:>> We also depend on this when skipping the checksum code in purgatory, which can be
>> exceedingly slow.
>
> This one I don't fully understand, so I'll lazily assume the prerequisite is
> the same WRT how memory is mapped. :)

The aim is that it's never normally mapped by the kernel. This is so that if we can't get rid of
the secondary CPUs (e.g. they have IRQs masked), but they are busy scribbling all over
memory, we have a rough guarantee that they aren't scribbling over the kdump kernel.

We can skip the checksum in purgatory, as there is very little risk of the memory having
been corrupted.


> Ultimately there's also /sys/kernel/kexec_crash_size's handling. Same
> prerequisite.

Yeah, this lets you release PAGE_SIZEs back to the allocator, which means the
marked-invalid page tables we have hidden there need to be PAGE_SIZE mappings.


Thanks,

James


> Keeping in mind acpi_table_upgrade() and unflatten_device_tree() depend on
> having the linear mappings available. I don't see any simple way of solving
> this. Both moving the firmware description routines to use fixmap and correcting
> the linear mapping further down the line so as to include kdump's regions seem
> excessive/impossible (feel free to correct me here). I'd be happy to hear
> suggestions. Otherwise we're back to hard-coding the information as we
> initially did.
>
> Let me stress that knowing the DMA constraints in the system before reserving
> crashkernel's regions is necessary if we ever want it to work seamlessly on all
> platforms. Be it small stuff like the Raspberry Pi or huge servers with TB of
> memory.