2020-11-11 14:55:46

by David Hildenbrand

[permalink] [raw]
Subject: [PATCH v2 0/8] powernv/memtrace: don't abuse memory hot(un)plug infrastructure for memory allocations

Based on latest linux/master

powernv/memtrace is the only in-kernel user that rips out random memory
it never added (doesn't own) in order to allocate memory without a
linear mapping. Let's stop abusing memory hot(un)plug infrastructure for
that - use alloc_contig_pages() for allocating memory and remove the
linear mapping manually.
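
Condensed, the new allocation path boils down to something like the
following (a sketch only - PageOffline() handling and error paths are
omitted, see patch #8 for the real code):

    static u64 memtrace_alloc_node(u32 nid, u64 size)
    {
            const unsigned long nr_pages = PHYS_PFN(size);
            struct page *page;

            /* Allocate contiguous, size-aligned memory on the target node. */
            page = alloc_contig_pages(nr_pages, GFP_KERNEL | __GFP_THISNODE |
                                      __GFP_NOWARN, nid, NULL);
            if (!page)
                    return 0;

            /* Clear while we still have a linear mapping, then unmap. */
            memtrace_clear_range(page_to_pfn(page), nr_pages);
            arch_remove_linear_mapping(PFN_PHYS(page_to_pfn(page)), size);
            return PFN_PHYS(page_to_pfn(page));
    }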

The original idea was discussed in:
https://lkml.kernel.org/r/[email protected]

I only tested via QEMU/TCG with a single NUMA node - see patch #8 for more
details.

Error handling and cleanup handling in the memtrace code is a mess - that
should definitely get cleaned up sooner or later. Once we have __GFP_ZERO
support for alloc_contig_pages(), we can drop the manual clearing. I added
a TODO for now, so this series can go via the powerpc tree - the __GFP_ZERO
change is then better suited for the mm tree.

v1 -> v2:
- Tweaks to patch descriptions
- "powernv/memtrace: don't leak kernel memory to user space"
-- Added. Reported by Michael.
- "powernv/memtrace: fix crashing the kernel when enabling concurrently"
-- Added, discovered while testing.
- "powerpc/mm: protect linear mapping modifications by a mutex"
-- Added. Although we currently won't have concurrency, this is cleaner and
future-proof.
- "powerepc/book3s64/hash: drop WARN_ON in hash__remove_section_mapping"
-- Added. Suggested by Oscar
- "powernv/memtrace: don't abuse memory hot(un)plug infrastructure for
memory allocations"
-- Reshuffle the code to make review easier.
-- Add a TODO regarding __GFP_ZERO. Adapt to changed page clearing code.
-- Use GFP_KERNEL | __GFP_THISNODE | __GFP_NOWARN for allocations.


David Hildenbrand (8):
powernv/memtrace: don't leak kernel memory to user space
powernv/memtrace: fix crashing the kernel when enabling concurrently
powerpc/mm: factor out creating/removing linear mapping
powerpc/mm: protect linear mapping modifications by a mutex
powerpc/mm: print warning in arch_remove_linear_mapping()
powerpc/book3s64/hash: drop WARN_ON in hash__remove_section_mapping
powerpc/mm: remove linear mapping if __add_pages() fails in
arch_add_memory()
powernv/memtrace: don't abuse memory hot(un)plug infrastructure for
memory allocations

arch/powerpc/mm/book3s64/hash_utils.c | 1 -
arch/powerpc/mm/mem.c | 53 +++++--
arch/powerpc/platforms/powernv/Kconfig | 8 +-
arch/powerpc/platforms/powernv/memtrace.c | 175 ++++++++++------------
include/linux/memory_hotplug.h | 3 +
5 files changed, 125 insertions(+), 115 deletions(-)

--
2.26.2


2020-11-11 14:56:01

by David Hildenbrand

[permalink] [raw]
Subject: [PATCH v2 1/8] powernv/memtrace: don't leak kernel memory to user space

We currently leak kernel memory to user space, because memory offlining
doesn't do any implicit clearing of memory and we are missing explicit
clearing of memory.

Let's keep it simple and clear pages before removing the linear mapping.

Reproduced in QEMU/TCG with 10 GiB of main memory:
[root@localhost ~]# dd obs=9G if=/dev/urandom of=/dev/null
[... wait until "free -m" used counter no longer changes and cancel]
19665802+0 records in
1+0 records out
9663676416 bytes (9.7 GB, 9.0 GiB) copied, 135.548 s, 71.3 MB/s
[root@localhost ~]# cat /sys/devices/system/memory/block_size_bytes
40000000
[root@localhost ~]# echo 0x40000000 > /sys/kernel/debug/powerpc/memtrace/enable
[ 402.978663][ T1086] page:000000001bc4bc74 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x24900
[ 402.980063][ T1086] flags: 0x7ffff000001000(reserved)
[ 402.980415][ T1086] raw: 007ffff000001000 c00c000000924008 c00c000000924008 0000000000000000
[ 402.980627][ T1086] raw: 0000000000000000 0000000000000000 00000001ffffffff 0000000000000000
[ 402.980845][ T1086] page dumped because: unmovable page
[ 402.989608][ T1086] Offlined Pages 16384
[ 403.324155][ T1086] memtrace: Allocated trace memory on node 0 at 0x0000000200000000

Before this patch:
[root@localhost ~]# hexdump -C /sys/kernel/debug/powerpc/memtrace/00000000/trace | head
00000000 c8 25 72 51 4d 26 36 c5 5c c2 56 15 d5 1a cd 10 |.%rQM&6.\.V.....|
00000010 19 b9 50 b2 cb e3 60 b8 ec 0a f3 ec 4b 3c 39 f0 |..P...`.....K<9.|
00000020 4e 5a 4c cf bd 26 19 ff 37 79 13 67 24 b7 b8 57 |NZL..&..7y.g$..W|
00000030 98 3e f5 be 6f 14 6a bd a4 52 bc 6e e9 e0 c1 5d |.>..o.j..R.n...]|
00000040 76 b3 ae b5 88 d7 da e3 64 23 85 2c 10 88 07 b6 |v.......d#.,....|
00000050 9a d8 91 de f7 50 27 69 2e 64 9c 6f d3 19 45 79 |.....P'i.d.o..Ey|
00000060 6a 6f 8a 61 71 19 1f c7 f1 df 28 26 ca 0f 84 55 |jo.aq.....(&...U|
00000070 01 3f be e4 e2 e1 da ff 7b 8c 8e 32 37 b4 24 53 |.?......{..27.$S|
00000080 1b 70 30 45 56 e6 8c c4 0e b5 4c fb 9f dd 88 06 |.p0EV.....L.....|
00000090 ef c4 18 79 f1 60 b1 5c 79 59 4d f4 36 d7 4a 5c |...y.`.\yYM.6.J\|

After this patch:
[root@localhost ~]# hexdump -C /sys/kernel/debug/powerpc/memtrace/00000000/trace | head
00000000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
*
40000000

Reported-by: Michael Ellerman <[email protected]>
Fixes: 9d5171a8f248 ("powerpc/powernv: Enable removal of memory for in memory tracing")
Cc: [email protected] # v4.14+
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: Rashmica Gupta <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Mike Rapoport <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Oscar Salvador <[email protected]>
Cc: Wei Yang <[email protected]>
Signed-off-by: David Hildenbrand <[email protected]>
---
arch/powerpc/platforms/powernv/memtrace.c | 22 ++++++++++++++++++++++
1 file changed, 22 insertions(+)

diff --git a/arch/powerpc/platforms/powernv/memtrace.c b/arch/powerpc/platforms/powernv/memtrace.c
index 6828108486f8..eea1f94482ff 100644
--- a/arch/powerpc/platforms/powernv/memtrace.c
+++ b/arch/powerpc/platforms/powernv/memtrace.c
@@ -67,6 +67,23 @@ static int change_memblock_state(struct memory_block *mem, void *arg)
return 0;
}

+static void memtrace_clear_range(unsigned long start_pfn,
+ unsigned long nr_pages)
+{
+ unsigned long pfn;
+
+ /*
+ * As pages are offline, we cannot trust the memmap anymore. As HIGHMEM
+ * does not apply, avoid passing around "struct page" and use
+ * clear_page() directly.
+ */
+ for (pfn = start_pfn; pfn < start_pfn + nr_pages; pfn++) {
+ if (IS_ALIGNED(pfn, PAGES_PER_SECTION))
+ cond_resched();
+ clear_page(__va(PFN_PHYS(pfn)));
+ }
+}
+
/* called with device_hotplug_lock held */
static bool memtrace_offline_pages(u32 nid, u64 start_pfn, u64 nr_pages)
{
@@ -111,6 +128,11 @@ static u64 memtrace_alloc_node(u32 nid, u64 size)
lock_device_hotplug();
for (base_pfn = end_pfn; base_pfn > start_pfn; base_pfn -= nr_pages) {
if (memtrace_offline_pages(nid, base_pfn, nr_pages) == true) {
+ /*
+ * Clear the range while we still have a linear
+ * mapping.
+ */
+ memtrace_clear_range(base_pfn, nr_pages);
/*
* Remove memory in memory block size chunks so that
* iomem resources are always split to the same size and
--
2.26.2

2020-11-11 14:56:04

by David Hildenbrand

[permalink] [raw]
Subject: [PATCH v2 4/8] powerpc/mm: protect linear mapping modifications by a mutex

This code currently relies on mem_hotplug_begin()/mem_hotplug_done() -
create_section_mapping()/remove_section_mapping() implementations
cannot tolerate being called concurrently.

Let's prepare for callers (memtrace) not holding any such locks (and
don't force them to mess with memory hotplug locks).

Other parts in these functions don't seem to rely on external locking.

Cc: Michael Ellerman <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: Rashmica Gupta <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Mike Rapoport <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Oscar Salvador <[email protected]>
Cc: Wei Yang <[email protected]>
Signed-off-by: David Hildenbrand <[email protected]>
---
arch/powerpc/mm/mem.c | 5 +++++
1 file changed, 5 insertions(+)

diff --git a/arch/powerpc/mm/mem.c b/arch/powerpc/mm/mem.c
index 8a86d81f8df0..ca5c4b54c366 100644
--- a/arch/powerpc/mm/mem.c
+++ b/arch/powerpc/mm/mem.c
@@ -58,6 +58,7 @@
#define CPU_FTR_NOEXECUTE 0
#endif

+static DEFINE_MUTEX(linear_mapping_mutex);
unsigned long long memory_limit;
bool init_mem_is_free;

@@ -126,8 +127,10 @@ int __ref arch_create_linear_mapping(int nid, u64 start, u64 size,
int rc;

start = (unsigned long)__va(start);
+ mutex_lock(&linear_mapping_mutex);
rc = create_section_mapping(start, start + size, nid,
params->pgprot);
+ mutex_unlock(&linear_mapping_mutex);
if (rc) {
pr_warn("Unable to create linear mapping for 0x%llx..0x%llx: %d\n",
start, start + size, rc);
@@ -144,7 +147,9 @@ void __ref arch_remove_linear_mapping(u64 start, u64 size)
start = (unsigned long)__va(start);
flush_dcache_range_chunked(start, start + size, FLUSH_CHUNK_SIZE);

+ mutex_lock(&linear_mapping_mutex);
ret = remove_section_mapping(start, start + size);
+ mutex_unlock(&linear_mapping_mutex);
WARN_ON_ONCE(ret);

/* Ensure all vmalloc mappings are flushed in case they also
--
2.26.2

2020-11-11 14:56:06

by David Hildenbrand

[permalink] [raw]
Subject: [PATCH v2 6/8] powerpc/book3s64/hash: drop WARN_ON in hash__remove_section_mapping

The single caller (arch_remove_linear_mapping()) prints a proper warning
when this function fails. No need to eventually crash the kernel - let's
drop this WARN_ON.

Suggested-by: Oscar Salvador <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: Rashmica Gupta <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Mike Rapoport <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Oscar Salvador <[email protected]>
Cc: Wei Yang <[email protected]>
Cc: "Aneesh Kumar K.V" <[email protected]>
Cc: Nicholas Piggin <[email protected]>
Signed-off-by: David Hildenbrand <[email protected]>
---
arch/powerpc/mm/book3s64/hash_utils.c | 1 -
1 file changed, 1 deletion(-)

diff --git a/arch/powerpc/mm/book3s64/hash_utils.c b/arch/powerpc/mm/book3s64/hash_utils.c
index 24702c0a92e0..d2dcb7757c68 100644
--- a/arch/powerpc/mm/book3s64/hash_utils.c
+++ b/arch/powerpc/mm/book3s64/hash_utils.c
@@ -845,7 +845,6 @@ int hash__remove_section_mapping(unsigned long start, unsigned long end)
{
int rc = htab_remove_mapping(start, end, mmu_linear_psize,
mmu_kernel_ssize);
- WARN_ON(rc < 0);

if (resize_hpt_for_hotplug(memblock_phys_mem_size()) == -ENOSPC)
pr_warn("Hash collision while resizing HPT\n");
--
2.26.2

2020-11-11 14:56:10

by David Hildenbrand

[permalink] [raw]
Subject: [PATCH v2 7/8] powerpc/mm: remove linear mapping if __add_pages() fails in arch_add_memory()

Let's revert what we did in case seomthing goes wrong and we return an
error - as already done on arm64 and s390x.

Cc: Michael Ellerman <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: Rashmica Gupta <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Mike Rapoport <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Oscar Salvador <[email protected]>
Cc: Wei Yang <[email protected]>
Signed-off-by: David Hildenbrand <[email protected]>
---
arch/powerpc/mm/mem.c | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/mm/mem.c b/arch/powerpc/mm/mem.c
index c5755b9efb64..8b946ec68d1b 100644
--- a/arch/powerpc/mm/mem.c
+++ b/arch/powerpc/mm/mem.c
@@ -170,7 +170,10 @@ int __ref arch_add_memory(int nid, u64 start, u64 size,
rc = arch_create_linear_mapping(nid, start, size, params);
if (rc)
return rc;
- return __add_pages(nid, start_pfn, nr_pages, params);
+ rc = __add_pages(nid, start_pfn, nr_pages, params);
+ if (rc)
+ arch_remove_linear_mapping(start, size);
+ return rc;
}

void __ref arch_remove_memory(int nid, u64 start, u64 size,
--
2.26.2

2020-11-11 14:56:21

by David Hildenbrand

[permalink] [raw]
Subject: [PATCH v2 8/8] powernv/memtrace: don't abuse memory hot(un)plug infrastructure for memory allocations

Let's use alloc_contig_pages() for allocating memory and remove the
linear mapping manually via arch_remove_linear_mapping(). Mark all pages
PG_offline, such that they will definitely not get touched - e.g.,
when hibernating. When freeing memory, try to revert what we did.

The original idea was discussed in:
https://lkml.kernel.org/r/[email protected]

This is similar to CONFIG_DEBUG_PAGEALLOC handling on other
architectures, whereby only single pages are unmapped from the linear
mapping. Let's mimic what memory hot(un)plug would do with the linear
mapping.

We now need MEMORY_HOTPLUG and CONTIG_ALLOC as dependencies. Add a TODO
that we want to use __GFP_ZERO for clearing once alloc_contig_pages()
understands that.
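
Hypothetically, once alloc_contig_pages() gains __GFP_ZERO support, the
explicit clearing collapses into the allocation itself - not valid today,
just the direction the TODO points at:

    page = alloc_contig_pages(nr_pages, GFP_KERNEL | __GFP_ZERO |
                              __GFP_THISNODE | __GFP_NOWARN, nid, NULL);

and the memtrace_clear_range() call goes away.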

Tested in QEMU/TCG with 10 GiB of main memory:
[root@localhost ~]# echo 0x40000000 > /sys/kernel/debug/powerpc/memtrace/enable
[ 105.903043][ T1080] memtrace: Allocated trace memory on node 0 at 0x0000000080000000
[root@localhost ~]# echo 0x40000000 > /sys/kernel/debug/powerpc/memtrace/enable
[ 145.042493][ T1080] radix-mmu: Mapped 0x0000000080000000-0x00000000c0000000 with 64.0 KiB pages
[ 145.049019][ T1080] memtrace: Freed trace memory back on node 0
[ 145.333960][ T1080] memtrace: Allocated trace memory on node 0 at 0x0000000080000000
[root@localhost ~]# echo 0x80000000 > /sys/kernel/debug/powerpc/memtrace/enable
[ 213.606916][ T1080] radix-mmu: Mapped 0x0000000080000000-0x00000000c0000000 with 64.0 KiB pages
[ 213.613855][ T1080] memtrace: Freed trace memory back on node 0
[ 214.185094][ T1080] memtrace: Allocated trace memory on node 0 at 0x0000000080000000
[root@localhost ~]# echo 0x100000000 > /sys/kernel/debug/powerpc/memtrace/enable
[ 234.874872][ T1080] radix-mmu: Mapped 0x0000000080000000-0x0000000100000000 with 64.0 KiB pages
[ 234.886974][ T1080] memtrace: Freed trace memory back on node 0
[ 234.890153][ T1080] memtrace: Failed to allocate trace memory on node 0
[root@localhost ~]# echo 0x40000000 > /sys/kernel/debug/powerpc/memtrace/enable
[ 259.490196][ T1080] memtrace: Allocated trace memory on node 0 at 0x0000000080000000

I also made sure allocated memory is properly zeroed.

Note 1: We currently won't be allocating from ZONE_MOVABLE - because our
pages are not movable. However, as we don't run with any memory
hot(un)plug mechanism around, we could make an exception to
increase the chance of allocations succeeding.

Note 2: PG_reserved isn't sufficient. E.g., kernel_page_present() used
along with PG_reserved in hibernation code will always return "true"
on powerpc, resulting in the pages getting touched. PG_reserved is
too generic - e.g., it also indicates boot allocations. (See the
sketch below these notes.)

Note 3: For now, we keep using memory_block_size_bytes() as minimum
granularity.
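
For illustration, the hibernation check that makes PG_reserved
insufficient looks roughly like this (paraphrased from memory from
kernel/power/snapshot.c, saveable_page() - not the verbatim code):

    if (PageOffline(page))
            return NULL;    /* never touched - exactly what we want */

    if (PageReserved(page) &&
        (!kernel_page_present(page) || pfn_is_nosave(pfn)))
            return NULL;    /* never triggers here: kernel_page_present()
                               always returns "true" on powerpc */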

Suggested-by: Michal Hocko <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: Rashmica Gupta <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Mike Rapoport <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Oscar Salvador <[email protected]>
Cc: Wei Yang <[email protected]>
Signed-off-by: David Hildenbrand <[email protected]>
---
arch/powerpc/platforms/powernv/Kconfig | 8 +-
arch/powerpc/platforms/powernv/memtrace.c | 163 ++++++++--------------
2 files changed, 62 insertions(+), 109 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/Kconfig b/arch/powerpc/platforms/powernv/Kconfig
index 938803eab0ad..619b093a0657 100644
--- a/arch/powerpc/platforms/powernv/Kconfig
+++ b/arch/powerpc/platforms/powernv/Kconfig
@@ -27,11 +27,11 @@ config OPAL_PRD
recovery diagnostics on OpenPower machines

config PPC_MEMTRACE
- bool "Enable removal of RAM from kernel mappings for tracing"
- depends on PPC_POWERNV && MEMORY_HOTREMOVE
+ bool "Enable runtime allocation of RAM for tracing"
+ depends on PPC_POWERNV && MEMORY_HOTPLUG && CONTIG_ALLOC
help
- Enabling this option allows for the removal of memory (RAM)
- from the kernel mappings to be used for hardware tracing.
+ Enabling this option allows for runtime allocation of memory (RAM)
+ for hardware tracing.

config PPC_VAS
bool "IBM Virtual Accelerator Switchboard (VAS)"
diff --git a/arch/powerpc/platforms/powernv/memtrace.c b/arch/powerpc/platforms/powernv/memtrace.c
index 0e42fe2d7b6a..5fc9408bb0b3 100644
--- a/arch/powerpc/platforms/powernv/memtrace.c
+++ b/arch/powerpc/platforms/powernv/memtrace.c
@@ -51,33 +51,12 @@ static const struct file_operations memtrace_fops = {
.open = simple_open,
};

-static int check_memblock_online(struct memory_block *mem, void *arg)
-{
- if (mem->state != MEM_ONLINE)
- return -1;
-
- return 0;
-}
-
-static int change_memblock_state(struct memory_block *mem, void *arg)
-{
- unsigned long state = (unsigned long)arg;
-
- mem->state = state;
-
- return 0;
-}
-
static void memtrace_clear_range(unsigned long start_pfn,
unsigned long nr_pages)
{
unsigned long pfn;

- /*
- * As pages are offline, we cannot trust the memmap anymore. As HIGHMEM
- * does not apply, avoid passing around "struct page" and use
- * clear_page() directly.
- */
+ /* As HIGHMEM does not apply, use clear_page() directly. */
for (pfn = start_pfn; pfn < start_pfn + nr_pages; pfn++) {
if (IS_ALIGNED(pfn, PAGES_PER_SECTION))
cond_resched();
@@ -85,72 +64,39 @@ static void memtrace_clear_range(unsigned long start_pfn,
}
}

-/* called with device_hotplug_lock held */
-static bool memtrace_offline_pages(u32 nid, u64 start_pfn, u64 nr_pages)
-{
- const unsigned long start = PFN_PHYS(start_pfn);
- const unsigned long size = PFN_PHYS(nr_pages);
-
- if (walk_memory_blocks(start, size, NULL, check_memblock_online))
- return false;
-
- walk_memory_blocks(start, size, (void *)MEM_GOING_OFFLINE,
- change_memblock_state);
-
- if (offline_pages(start_pfn, nr_pages)) {
- walk_memory_blocks(start, size, (void *)MEM_ONLINE,
- change_memblock_state);
- return false;
- }
-
- walk_memory_blocks(start, size, (void *)MEM_OFFLINE,
- change_memblock_state);
-
-
- return true;
-}
-
static u64 memtrace_alloc_node(u32 nid, u64 size)
{
- u64 start_pfn, end_pfn, nr_pages, pfn;
- u64 base_pfn;
- u64 bytes = memory_block_size_bytes();
+ const unsigned long nr_pages = PHYS_PFN(size);
+ unsigned long pfn, start_pfn;
+ struct page *page;

- if (!node_spanned_pages(nid))
+ /*
+ * Trace memory needs to be aligned to the size, which is guaranteed
+ * by alloc_contig_pages().
+ */
+ page = alloc_contig_pages(nr_pages, GFP_KERNEL | __GFP_THISNODE |
+ __GFP_NOWARN, nid, NULL);
+ if (!page)
return 0;
+ start_pfn = page_to_pfn(page);

- start_pfn = node_start_pfn(nid);
- end_pfn = node_end_pfn(nid);
- nr_pages = size >> PAGE_SHIFT;
-
- /* Trace memory needs to be aligned to the size */
- end_pfn = round_down(end_pfn - nr_pages, nr_pages);
-
- lock_device_hotplug();
- for (base_pfn = end_pfn; base_pfn > start_pfn; base_pfn -= nr_pages) {
- if (memtrace_offline_pages(nid, base_pfn, nr_pages) == true) {
- /*
- * Clear the range while we still have a linear
- * mapping.
- */
- memtrace_clear_range(base_pfn, nr_pages);
- /*
- * Remove memory in memory block size chunks so that
- * iomem resources are always split to the same size and
- * we never try to remove memory that spans two iomem
- * resources.
- */
- end_pfn = base_pfn + nr_pages;
- for (pfn = base_pfn; pfn < end_pfn; pfn += bytes>> PAGE_SHIFT) {
- __remove_memory(nid, pfn << PAGE_SHIFT, bytes);
- }
- unlock_device_hotplug();
- return base_pfn << PAGE_SHIFT;
- }
- }
- unlock_device_hotplug();
+ /*
+ * Clear the range while we still have a linear mapping.
+ *
+ * TODO: use __GFP_ZERO with alloc_contig_pages() once supported.
+ */
+ memtrace_clear_range(start_pfn, nr_pages);

- return 0;
+ /*
+ * Set pages PageOffline(), to indicate that nobody (e.g., hibernation,
+ * dumping, ...) should be touching these pages.
+ */
+ for (pfn = start_pfn; pfn < start_pfn + nr_pages; pfn++)
+ __SetPageOffline(pfn_to_page(pfn));
+
+ arch_remove_linear_mapping(PFN_PHYS(start_pfn), size);
+
+ return PFN_PHYS(start_pfn);
}

static int memtrace_init_regions_runtime(u64 size)
@@ -220,16 +166,30 @@ static int memtrace_init_debugfs(void)
return ret;
}

-static int online_mem_block(struct memory_block *mem, void *arg)
+static int memtrace_free(int nid, u64 start, u64 size)
{
- return device_online(&mem->dev);
+ struct mhp_params params = { .pgprot = PAGE_KERNEL };
+ const unsigned long nr_pages = PHYS_PFN(size);
+ const unsigned long start_pfn = PHYS_PFN(start);
+ unsigned long pfn;
+ int ret;
+
+ ret = arch_create_linear_mapping(nid, start, size, &params);
+ if (ret)
+ return ret;
+
+ for (pfn = start_pfn; pfn < start_pfn + nr_pages; pfn++)
+ __ClearPageOffline(pfn_to_page(pfn));
+
+ free_contig_range(start_pfn, nr_pages);
+ return 0;
}

/*
- * Iterate through the chunks of memory we have removed from the kernel
- * and attempt to add them back to the kernel.
+ * Iterate through the chunks of memory we allocated and attempt to expose
+ * them back to the kernel.
*/
-static int memtrace_online(void)
+static int memtrace_free_regions(void)
{
int i, ret = 0;
struct memtrace_entry *ent;
@@ -237,7 +197,7 @@ static int memtrace_online(void)
for (i = memtrace_array_nr - 1; i >= 0; i--) {
ent = &memtrace_array[i];

- /* We have onlined this chunk previously */
+ /* We have freed this chunk previously */
if (ent->nid == NUMA_NO_NODE)
continue;

@@ -247,30 +207,25 @@ static int memtrace_online(void)
ent->mem = 0;
}

- if (add_memory(ent->nid, ent->start, ent->size, MHP_NONE)) {
- pr_err("Failed to add trace memory to node %d\n",
+ if (memtrace_free(ent->nid, ent->start, ent->size)) {
+ pr_err("Failed to free trace memory on node %d\n",
ent->nid);
ret += 1;
continue;
}

- lock_device_hotplug();
- walk_memory_blocks(ent->start, ent->size, NULL,
- online_mem_block);
- unlock_device_hotplug();
-
/*
- * Memory was added successfully so clean up references to it
- * so on reentry we can tell that this chunk was added.
+ * Memory was freed successfully so clean up references to it
+ * so on reentry we can tell that this chunk was freed.
*/
debugfs_remove_recursive(ent->dir);
- pr_info("Added trace memory back to node %d\n", ent->nid);
+ pr_info("Freed trace memory back on node %d\n", ent->nid);
ent->size = ent->start = ent->nid = NUMA_NO_NODE;
}
if (ret)
return ret;

- /* If all chunks of memory were added successfully, reset globals */
+ /* If all chunks of memory were freed successfully, reset globals */
kfree(memtrace_array);
memtrace_array = NULL;
memtrace_size = 0;
@@ -295,18 +250,16 @@ static int memtrace_enable_set(void *data, u64 val)

mutex_lock(&memtrace_mutex);

- /* Re-add/online previously removed/offlined memory */
- if (memtrace_size) {
- if (memtrace_online())
- goto out_unlock;
- }
+ /* Free all previously allocated memory. */
+ if (memtrace_size && memtrace_free_regions())
+ goto out_unlock;

if (!val) {
rc = 0;
goto out_unlock;
}

- /* Offline and remove memory */
+ /* Allocate memory. */
if (memtrace_init_regions_runtime(val))
goto out_unlock;

--
2.26.2

2020-11-11 14:57:22

by David Hildenbrand

[permalink] [raw]
Subject: [PATCH v2 5/8] powerpc/mm: print warning in arch_remove_linear_mapping()

Let's print a warning similar to the one in arch_create_linear_mapping()
instead of WARN_ON_ONCE() and eventually crashing the kernel.

Reviewed-by: Oscar Salvador <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: Rashmica Gupta <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Mike Rapoport <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Oscar Salvador <[email protected]>
Cc: Wei Yang <[email protected]>
Signed-off-by: David Hildenbrand <[email protected]>
---
arch/powerpc/mm/mem.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/mm/mem.c b/arch/powerpc/mm/mem.c
index ca5c4b54c366..c5755b9efb64 100644
--- a/arch/powerpc/mm/mem.c
+++ b/arch/powerpc/mm/mem.c
@@ -150,7 +150,9 @@ void __ref arch_remove_linear_mapping(u64 start, u64 size)
mutex_lock(&linear_mapping_mutex);
ret = remove_section_mapping(start, start + size);
mutex_unlock(&linear_mapping_mutex);
- WARN_ON_ONCE(ret);
+ if (ret)
+ pr_warn("Unable to remove linear mapping for 0x%llx..0x%llx: %d\n",
+ start, start + size, ret);

/* Ensure all vmalloc mappings are flushed in case they also
* hit that section of memory
--
2.26.2

2020-11-11 14:58:02

by David Hildenbrand

[permalink] [raw]
Subject: [PATCH v2 2/8] powernv/memtrace: fix crashing the kernel when enabling concurrently

It's very easy to crash the kernel right now by simply trying to enable
memtrace concurrently, hammering on the "enable" interface:

loop.sh:
#!/bin/bash

dmesg --console-off

while true; do
echo 0x40000000 > /sys/kernel/debug/powerpc/memtrace/enable
done

[root@localhost ~]# loop.sh &
[root@localhost ~]# loop.sh &

This quickly results in a kernel crash. Let's properly protect the code
using a mutex.

Fixes: 9d5171a8f248 ("powerpc/powernv: Enable removal of memory for in memory tracing")
Cc: [email protected]# v4.14+
Cc: Michael Ellerman <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: Rashmica Gupta <[email protected]>
Signed-off-by: David Hildenbrand <[email protected]>
---
arch/powerpc/platforms/powernv/memtrace.c | 22 +++++++++++++++-------
1 file changed, 15 insertions(+), 7 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/memtrace.c b/arch/powerpc/platforms/powernv/memtrace.c
index eea1f94482ff..0e42fe2d7b6a 100644
--- a/arch/powerpc/platforms/powernv/memtrace.c
+++ b/arch/powerpc/platforms/powernv/memtrace.c
@@ -30,6 +30,7 @@ struct memtrace_entry {
char name[16];
};

+static DEFINE_MUTEX(memtrace_mutex);
static u64 memtrace_size;

static struct memtrace_entry *memtrace_array;
@@ -279,6 +280,7 @@ static int memtrace_online(void)

static int memtrace_enable_set(void *data, u64 val)
{
+ int rc = -EAGAIN;
u64 bytes;

/*
@@ -291,25 +293,31 @@ static int memtrace_enable_set(void *data, u64 val)
return -EINVAL;
}

+ mutex_lock(&memtrace_mutex);
+
/* Re-add/online previously removed/offlined memory */
if (memtrace_size) {
if (memtrace_online())
- return -EAGAIN;
+ goto out_unlock;
}

- if (!val)
- return 0;
+ if (!val) {
+ rc = 0;
+ goto out_unlock;
+ }

/* Offline and remove memory */
if (memtrace_init_regions_runtime(val))
- return -EINVAL;
+ goto out_unlock;

if (memtrace_init_debugfs())
- return -EINVAL;
+ goto out_unlock;

memtrace_size = val;
-
- return 0;
+ rc = 0;
+out_unlock:
+ mutex_unlock(&memtrace_mutex);
+ return rc;
}

static int memtrace_enable_get(void *data, u64 *val)
--
2.26.2

2020-11-11 14:58:12

by David Hildenbrand

[permalink] [raw]
Subject: [PATCH v2 3/8] powerpc/mm: factor out creating/removing linear mapping

We want to stop abusing memory hotplug infrastructure in memtrace code
to perform allocations and remove the linear mapping. Instead we will use
alloc_contig_pages() and remove the linear mapping manually.

Let's factor out creating/removing the linear mapping into
arch_create_linear_mapping() / arch_remove_linear_mapping() - so in the
future, we might be able to have whole arch_add_memory() /
arch_remove_memory() be implemented in common code.

Cc: Michael Ellerman <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: Rashmica Gupta <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Mike Rapoport <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Oscar Salvador <[email protected]>
Cc: Wei Yang <[email protected]>
Signed-off-by: David Hildenbrand <[email protected]>
---
arch/powerpc/mm/mem.c | 41 +++++++++++++++++++++++-----------
include/linux/memory_hotplug.h | 3 +++
2 files changed, 31 insertions(+), 13 deletions(-)

diff --git a/arch/powerpc/mm/mem.c b/arch/powerpc/mm/mem.c
index 01ec2a252f09..8a86d81f8df0 100644
--- a/arch/powerpc/mm/mem.c
+++ b/arch/powerpc/mm/mem.c
@@ -120,34 +120,26 @@ static void flush_dcache_range_chunked(unsigned long start, unsigned long stop,
}
}

-int __ref arch_add_memory(int nid, u64 start, u64 size,
- struct mhp_params *params)
+int __ref arch_create_linear_mapping(int nid, u64 start, u64 size,
+ struct mhp_params *params)
{
- unsigned long start_pfn = start >> PAGE_SHIFT;
- unsigned long nr_pages = size >> PAGE_SHIFT;
int rc;

start = (unsigned long)__va(start);
rc = create_section_mapping(start, start + size, nid,
params->pgprot);
if (rc) {
- pr_warn("Unable to create mapping for hot added memory 0x%llx..0x%llx: %d\n",
+ pr_warn("Unable to create linear mapping for 0x%llx..0x%llx: %d\n",
start, start + size, rc);
return -EFAULT;
}
-
- return __add_pages(nid, start_pfn, nr_pages, params);
+ return 0;
}

-void __ref arch_remove_memory(int nid, u64 start, u64 size,
- struct vmem_altmap *altmap)
+void __ref arch_remove_linear_mapping(u64 start, u64 size)
{
- unsigned long start_pfn = start >> PAGE_SHIFT;
- unsigned long nr_pages = size >> PAGE_SHIFT;
int ret;

- __remove_pages(start_pfn, nr_pages, altmap);
-
/* Remove htab bolted mappings for this section of memory */
start = (unsigned long)__va(start);
flush_dcache_range_chunked(start, start + size, FLUSH_CHUNK_SIZE);
@@ -160,6 +152,29 @@ void __ref arch_remove_memory(int nid, u64 start, u64 size,
*/
vm_unmap_aliases();
}
+
+int __ref arch_add_memory(int nid, u64 start, u64 size,
+ struct mhp_params *params)
+{
+ unsigned long start_pfn = start >> PAGE_SHIFT;
+ unsigned long nr_pages = size >> PAGE_SHIFT;
+ int rc;
+
+ rc = arch_create_linear_mapping(nid, start, size, params);
+ if (rc)
+ return rc;
+ return __add_pages(nid, start_pfn, nr_pages, params);
+}
+
+void __ref arch_remove_memory(int nid, u64 start, u64 size,
+ struct vmem_altmap *altmap)
+{
+ unsigned long start_pfn = start >> PAGE_SHIFT;
+ unsigned long nr_pages = size >> PAGE_SHIFT;
+
+ __remove_pages(start_pfn, nr_pages, altmap);
+ arch_remove_linear_mapping(start, size);
+}
#endif

#ifndef CONFIG_NEED_MULTIPLE_NODES
diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index d65c6fdc5cfc..00b9e9bd3850 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -375,6 +375,9 @@ extern struct page *sparse_decode_mem_map(unsigned long coded_mem_map,
unsigned long pnum);
extern struct zone *zone_for_pfn_range(int online_type, int nid, unsigned start_pfn,
unsigned long nr_pages);
+extern int arch_create_linear_mapping(int nid, u64 start, u64 size,
+ struct mhp_params *params);
+void arch_remove_linear_mapping(u64 start, u64 size);
#endif /* CONFIG_MEMORY_HOTPLUG */

#endif /* __LINUX_MEMORY_HOTPLUG_H */
--
2.26.2

2020-11-17 15:15:47

by Oscar Salvador

[permalink] [raw]
Subject: Re: [PATCH v2 1/8] powernv/memtrace: don't leak kernel memory to user space

On Wed, Nov 11, 2020 at 03:53:15PM +0100, David Hildenbrand wrote:
> Reported-by: Michael Ellerman <[email protected]>
> Fixes: 9d5171a8f248 ("powerpc/powernv: Enable removal of memory for in memory tracing")
> Cc: [email protected] # v4.14+
> Cc: Benjamin Herrenschmidt <[email protected]>
> Cc: Paul Mackerras <[email protected]>
> Cc: Rashmica Gupta <[email protected]>
> Cc: Andrew Morton <[email protected]>
> Cc: Mike Rapoport <[email protected]>
> Cc: Michal Hocko <[email protected]>
> Cc: Oscar Salvador <[email protected]>
> Cc: Wei Yang <[email protected]>
> Signed-off-by: David Hildenbrand <[email protected]>

Reviewed-by: Oscar Salvador <[email protected]>

--
Oscar Salvador
SUSE L3

2020-11-17 15:25:34

by Oscar Salvador

[permalink] [raw]
Subject: Re: [PATCH v2 2/8] powernv/memtrace: fix crashing the kernel when enabling concurrently

On Wed, Nov 11, 2020 at 03:53:16PM +0100, David Hildenbrand wrote:
> Fixes: 9d5171a8f248 ("powerpc/powernv: Enable removal of memory for in memory tracing")
> Cc: [email protected]# v4.14+
> Cc: Michael Ellerman <[email protected]>
> Cc: Benjamin Herrenschmidt <[email protected]>
> Cc: Paul Mackerras <[email protected]>
> Cc: Rashmica Gupta <[email protected]>
> Signed-off-by: David Hildenbrand <[email protected]>

Reviewed-by: Oscar Salvador <[email protected]>

--
Oscar Salvador
SUSE L3

2020-11-17 15:31:23

by Oscar Salvador

[permalink] [raw]
Subject: Re: [PATCH v2 3/8] powerpc/mm: factor out creating/removing linear mapping

On Wed, Nov 11, 2020 at 03:53:17PM +0100, David Hildenbrand wrote:
> We want to stop abusing memory hotplug infrastructure in memtrace code
> to perform allocations and remove the linear mapping. Instead we will use
> alloc_contig_pages() and remove the linear mapping manually.
>
> Let's factor out creating/removing the linear mapping into
> arch_create_linear_mapping() / arch_remove_linear_mapping() - so in the
> future, we might be able to have whole arch_add_memory() /
> arch_remove_memory() be implemented in common code.
>
> Cc: Michael Ellerman <[email protected]>
> Cc: Benjamin Herrenschmidt <[email protected]>
> Cc: Paul Mackerras <[email protected]>
> Cc: Rashmica Gupta <[email protected]>
> Cc: Andrew Morton <[email protected]>
> Cc: Mike Rapoport <[email protected]>
> Cc: Michal Hocko <[email protected]>
> Cc: Oscar Salvador <[email protected]>
> Cc: Wei Yang <[email protected]>
> Signed-off-by: David Hildenbrand <[email protected]>

Reviewed-by: Oscar Salvador <[email protected]>

--
Oscar Salvador
SUSE L3

2020-11-17 15:39:50

by Oscar Salvador

[permalink] [raw]
Subject: Re: [PATCH v2 4/8] powerpc/mm: protect linear mapping modifications by a mutex

On Wed, Nov 11, 2020 at 03:53:18PM +0100, David Hildenbrand wrote:
> @@ -144,7 +147,9 @@ void __ref arch_remove_linear_mapping(u64 start, u64 size)
> start = (unsigned long)__va(start);
> flush_dcache_range_chunked(start, start + size, FLUSH_CHUNK_SIZE);
>
> + mutex_lock(&linear_mapping_mutex);
> ret = remove_section_mapping(start, start + size);
> + mutex_unlock(&linear_mapping_mutex);
> WARN_ON_ONCE(ret);

My expertise in this area is low, so bear with me.

Why we do not need to protect flush_dcache_range_chunked and
vm_unmap_aliases?

--
Oscar Salvador
SUSE L3

2020-11-17 15:49:36

by Oscar Salvador

[permalink] [raw]
Subject: Re: [PATCH v2 6/8] powerpc/book3s64/hash: drop WARN_ON in hash__remove_section_mapping

On Wed, Nov 11, 2020 at 03:53:20PM +0100, David Hildenbrand wrote:
> The single caller (arch_remove_linear_mapping()) prints a proper warning
> when this function fails. No need to eventually crash the kernel - let's
> drop this WARN_ON.
>
> Suggested-by: Oscar Salvador <[email protected]>
> Cc: Michael Ellerman <[email protected]>
> Cc: Benjamin Herrenschmidt <[email protected]>
> Cc: Paul Mackerras <[email protected]>
> Cc: Rashmica Gupta <[email protected]>
> Cc: Andrew Morton <[email protected]>
> Cc: Mike Rapoport <[email protected]>
> Cc: Michal Hocko <[email protected]>
> Cc: Oscar Salvador <[email protected]>
> Cc: Wei Yang <[email protected]>
> Cc: "Aneesh Kumar K.V" <[email protected]>
> Cc: Nicholas Piggin <[email protected]>
> Signed-off-by: David Hildenbrand <[email protected]>

Reviewed-by: Oscar Salvador <[email protected]>

--
Oscar Salvador
SUSE L3

2020-11-17 15:51:27

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v2 4/8] powerpc/mm: protect linear mapping modifications by a mutex

On 17.11.20 16:37, Oscar Salvador wrote:
> On Wed, Nov 11, 2020 at 03:53:18PM +0100, David Hildenbrand wrote:
>> @@ -144,7 +147,9 @@ void __ref arch_remove_linear_mapping(u64 start, u64 size)
>> start = (unsigned long)__va(start);
>> flush_dcache_range_chunked(start, start + size, FLUSH_CHUNK_SIZE);
>>
>> + mutex_lock(&linear_mapping_mutex);
>> ret = remove_section_mapping(start, start + size);
>> + mutex_unlock(&linear_mapping_mutex);
>> WARN_ON_ONCE(ret);
>
> My expertise in this area is low, so bear with me.
>
> Why we do not need to protect flush_dcache_range_chunked and
> vm_unmap_aliases?
>

vm_unmap_aliases() does its own locking and can handle concurrent calls.


flush_dcache_range_chunked()->flush_dcache_range() ends up as a sequence
of memory barriers paired with dcbf instructions.

dcbf: Copies modified cache blocks to main storage and invalidates the
copy in the data cache.

It's called from various places and no global variables seem to be
involved, so it looks like it doesn't need any kind of locking.
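
Simplified, flush_dcache_range() itself is essentially the following
(condensed from arch/powerpc/include/asm/cacheflush.h, from memory - not
verbatim):

    static inline void flush_dcache_range(unsigned long start, unsigned long stop)
    {
            unsigned long bytes = l1_dcache_bytes();
            void *addr = (void *)(start & ~(bytes - 1));
            unsigned long size = stop - (unsigned long)addr + (bytes - 1);
            unsigned long i;

            mb();   /* sync */
            for (i = 0; i < size >> l1_dcache_shift(); i++, addr += bytes)
                    dcbf(addr);     /* flush+invalidate one cache block */
            mb();   /* sync */
    }

Purely per-cacheline operations on the passed range - there is no shared
state to serialize on.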

--
Thanks,

David / dhildenb

2020-11-17 15:53:37

by Oscar Salvador

[permalink] [raw]
Subject: Re: [PATCH v2 7/8] powerpc/mm: remove linear mapping if __add_pages() fails in arch_add_memory()

On Wed, Nov 11, 2020 at 03:53:21PM +0100, David Hildenbrand wrote:
> Let's revert what we did in case seomthing goes wrong and we return an
"something" :-)
> error - as already done on arm64 and s390x.
>
> Cc: Michael Ellerman <[email protected]>
> Cc: Benjamin Herrenschmidt <[email protected]>
> Cc: Paul Mackerras <[email protected]>
> Cc: Rashmica Gupta <[email protected]>
> Cc: Andrew Morton <[email protected]>
> Cc: Mike Rapoport <[email protected]>
> Cc: Michal Hocko <[email protected]>
> Cc: Oscar Salvador <[email protected]>
> Cc: Wei Yang <[email protected]>
> Signed-off-by: David Hildenbrand <[email protected]>

Reviewed-by: Oscar Salvador <[email protected]>

--
Oscar Salvador
SUSE L3

2020-11-17 15:57:24

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v2 7/8] powerpc/mm: remove linear mapping if __add_pages() fails in arch_add_memory()

On 17.11.20 16:51, Oscar Salvador wrote:
> On Wed, Nov 11, 2020 at 03:53:21PM +0100, David Hildenbrand wrote:
>> Let's revert what we did in case seomthing goes wrong and we return an
> "something" :-)

Thanks! :)

@Michael, I assume if I don't have to resend, this can be fixed up?

--
Thanks,

David / dhildenb

2020-11-17 16:50:40

by Oscar Salvador

[permalink] [raw]
Subject: Re: [PATCH v2 8/8] powernv/memtrace: don't abuse memory hot(un)plug infrastructure for memory allocations

On Wed, Nov 11, 2020 at 03:53:22PM +0100, David Hildenbrand wrote:
> Suggested-by: Michal Hocko <[email protected]>
> Cc: Michael Ellerman <[email protected]>
> Cc: Benjamin Herrenschmidt <[email protected]>
> Cc: Paul Mackerras <[email protected]>
> Cc: Rashmica Gupta <[email protected]>
> Cc: Andrew Morton <[email protected]>
> Cc: Mike Rapoport <[email protected]>
> Cc: Michal Hocko <[email protected]>
> Cc: Oscar Salvador <[email protected]>
> Cc: Wei Yang <[email protected]>
> Signed-off-by: David Hildenbrand <[email protected]>

LGTM, and definitely much cleaner:

Reviewed-by: Oscar Salvador <[email protected]>

--
Oscar Salvador
SUSE L3

2020-11-18 02:03:03

by Michael Ellerman

[permalink] [raw]
Subject: Re: [PATCH v2 7/8] powerpc/mm: remove linear mapping if __add_pages() fails in arch_add_memory()

David Hildenbrand <[email protected]> writes:
> On 17.11.20 16:51, Oscar Salvador wrote:
>> On Wed, Nov 11, 2020 at 03:53:21PM +0100, David Hildenbrand wrote:
>>> Let's revert what we did in case seomthing goes wrong and we return an
>> "something" :-)
>
> Thanks! :)
>
> @Michael, I assume if I don't have to resend, this can be fixed up?

Yep, I fixed it up.

cheers

2020-11-25 12:01:28

by Michael Ellerman

[permalink] [raw]
Subject: Re: [PATCH v2 0/8] powernv/memtrace: don't abuse memory hot(un)plug infrastructure for memory allocations

On Wed, 11 Nov 2020 15:53:14 +0100, David Hildenbrand wrote:
> Based on latest linux/master
>
> powernv/memtrace is the only in-kernel user that rips out random memory
> it never added (doesn't own) in order to allocate memory without a
> linear mapping. Let's stop abusing memory hot(un)plug infrastructure for
> that - use alloc_contig_pages() for allocating memory and remove the
> linear mapping manually.
>
> [...]

Applied to powerpc/next.

[1/8] powerpc/powernv/memtrace: Don't leak kernel memory to user space
https://git.kernel.org/powerpc/c/c74cf7a3d59a21b290fe0468f5b470d0b8ee37df
[2/8] powerpc/powernv/memtrace: Fix crashing the kernel when enabling concurrently
https://git.kernel.org/powerpc/c/d6718941a2767fb383e105d257d2105fe4f15f0e
[3/8] powerpc/mm: factor out creating/removing linear mapping
https://git.kernel.org/powerpc/c/4abb1e5b63ac3281275315fc6b0cde0b9c2e2e42
[4/8] powerpc/mm: protect linear mapping modifications by a mutex
https://git.kernel.org/powerpc/c/e5b2af044f31bf18defa557a8cd11c23caefa34c
[5/8] powerpc/mm: print warning in arch_remove_linear_mapping()
https://git.kernel.org/powerpc/c/1f73ad3e8d755dbec52fcec98618a7ce4de12af2
[6/8] powerpc/book3s64/hash: Drop WARN_ON in hash__remove_section_mapping()
https://git.kernel.org/powerpc/c/d8bd9a121c2f2bc8b36da930dc91b69fd2a705e2
[7/8] powerpc/mm: remove linear mapping if __add_pages() fails in arch_add_memory()
https://git.kernel.org/powerpc/c/ca2c36cae9d48b180ea51259e35ab3d95d327df2
[8/8] powernv/memtrace: don't abuse memory hot(un)plug infrastructure for memory allocations
https://git.kernel.org/powerpc/c/0bd4b96d99108b7ea9bac0573957483be7781d70

cheers