Thanks to Alex for his continued review and Josh for running v2! Please
continue to review and test, and acks for the padata parts would be
appreciated.
Daniel
--
Deferred struct page init is a bottleneck in kernel boot--the biggest
for us and probably others. Optimizing it maximizes availability for
large-memory systems and allows spinning up short-lived VMs as needed
without having to leave them running. It also benefits bare metal
machines hosting VMs that are sensitive to downtime. In projects such
as VMM Fast Restart[1], where guest state is preserved across kexec
reboot, it helps prevent application and network timeouts in the guests.

So, multithread deferred init to take full advantage of system memory
bandwidth.

Extend padata, a framework that handles many parallel singlethreaded
jobs, to handle multithreaded jobs as well by adding support for
splitting up the work evenly, specifying a minimum amount of work that's
appropriate for one helper thread to do, load balancing between helpers,
and coordinating them. More documentation in patches 4 and 8.
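
As a rough illustration of the splitting described above, here is a minimal
user-space sketch. It is not the padata implementation: struct mt_job,
nr_helpers(), chunk_size(), and the chunks_per_helper parameter are
hypothetical names chosen for this example, and the real interface is the one
documented in patches 4 and 8.

```c
#include <assert.h>

/* Hypothetical job description; names are illustrative, not padata's. */
struct mt_job {
	unsigned long size;        /* total units of work, e.g. pages */
	unsigned long min_chunk;   /* least work worth giving one helper */
	unsigned long max_threads; /* upper bound on helper threads */
};

/*
 * Number of helpers: enough that each gets at least min_chunk units,
 * never more than max_threads, and never fewer than one.
 */
unsigned long nr_helpers(const struct mt_job *job)
{
	unsigned long n;

	assert(job->min_chunk > 0 && job->max_threads > 0);

	n = job->size / job->min_chunk;
	if (n < 1)
		n = 1;
	if (n > job->max_threads)
		n = job->max_threads;
	return n;
}

/*
 * Chunk size: aim for several chunks per helper so that helpers which
 * finish early can pick up more work (load balancing), but never hand
 * out less than min_chunk at a time.
 */
unsigned long chunk_size(const struct mt_job *job,
			 unsigned long chunks_per_helper)
{
	unsigned long chunk;

	assert(chunks_per_helper > 0);

	chunk = job->size / (nr_helpers(job) * chunks_per_helper);
	if (chunk < job->min_chunk)
		chunk = job->min_chunk;
	return chunk;
}
```

With size 1000, min_chunk 100, and max_threads 4, this sketch picks 4 helpers
and, for 4 chunks per helper, falls back to the 100-unit minimum chunk. A
caller would still need to clamp the final chunk to the end of the range.
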
This series is the first step in a project to address other
memory-proportional bottlenecks in the kernel such as pmem struct page init,
vfio page pinning, hugetlb fallocate, and munmap. Deferred page init
doesn't require concurrency limits, resource control, or priority
adjustments like these other users will because it happens during boot
when the system is otherwise idle and waiting for page init to finish.

This has been run on a variety of x86 systems and speeds up kernel boot
by 4% to 49%, saving up to 1.6 out of 4 seconds. Patch 6 has more
numbers.

The powerpc and s390 lists are included in case they want to give this a
try, since they had enabled this feature when it was configured per arch.

Series based on v5.7-rc7 plus these three from mmotm:
mm-call-touch_nmi_watchdog-on-max-order-boundaries-in-deferred-init.patch
mm-initialize-deferred-pages-with-interrupts-enabled.patch
mm-call-cond_resched-from-deferred_init_memmap.patch
and it's available here:
git://oss.oracle.com/git/linux-dmjordan.git padata-mt-definit-v3
https://oss.oracle.com/git/gitweb.cgi?p=linux-dmjordan.git;a=shortlog;h=refs/heads/padata-mt-definit-v3
and the future users and related features are available as
work-in-progress:
git://oss.oracle.com/git/linux-dmjordan.git padata-mt-wip-v0.5
https://oss.oracle.com/git/gitweb.cgi?p=linux-dmjordan.git;a=shortlog;h=refs/heads/padata-mt-wip-v0.5
v3:
- Remove nr_pages accounting as suggested by Alex, adding a new patch
- Align deferred init ranges up not down, simplify surrounding code (Alex)
- Add Josh's T-b's from v2 (Josh's T-b's for v1 lost in rebase, apologies!)
- Move padata.h include up in init/main.c to reduce patch collisions (Andrew)
- Slightly reword Documentation patch
- Rebase on v5.7-rc7 and retest
v2:
- Improve the problem statement (Andrew, Josh, Pavel)
- Add T-b's to unchanged patches (Josh)
- Fully initialize max-order blocks to avoid buddy issues (Alex)
- Parallelize on section-aligned boundaries to avoid potential
false sharing (Alex)
- Return the maximum thread count from a function that architectures
can override, with the generic version returning 1 (current
behavior). Override for x86 since that's the only arch this series
has been tested on so far. Other archs can test with more threads
by dropping patch 6.
- Rebase to v5.7-rc6, rerun tests
RFC v4 [2] -> v1:
- merged with padata (Peter)
- got rid of the 'task' nomenclature (Peter, Jon)
future work branch:
- made lockdep-aware (Jason, Peter)
- adjust workqueue worker priority with renice_or_cancel() (Tejun)
- fixed undo problem in VFIO (Alex)
The remaining feedback, mainly resource control awareness (cgroup etc),
is TODO for later series.
[1] https://static.sched.com/hosted_files/kvmforum2019/66/VMM-fast-restart_kvmforum2019.pdf
https://www.youtube.com/watch?v=pBsHnf93tcQ
https://lore.kernel.org/linux-mm/[email protected]/
[2] https://lore.kernel.org/linux-mm/[email protected]/
Daniel Jordan (8):
padata: remove exit routine
padata: initialize earlier
padata: allocate work structures for parallel jobs from a pool
padata: add basic support for multithreaded jobs
mm: don't track number of pages during deferred initialization
mm: parallelize deferred_init_memmap()
mm: make deferred init's max threads arch-specific
padata: document multithreaded jobs
Documentation/core-api/padata.rst | 41 +++--
arch/x86/mm/init_64.c | 12 ++
include/linux/memblock.h | 3 +
include/linux/padata.h | 43 ++++-
init/main.c | 2 +
kernel/padata.c | 277 ++++++++++++++++++++++++------
mm/Kconfig | 6 +-
mm/page_alloc.c | 59 +++++--
8 files changed, 361 insertions(+), 82 deletions(-)
base-commit: 9cb1fd0efd195590b828b9b865421ad345a4a145
prerequisite-patch-id: 4ad522141e1119a325a9799dad2bd982fbac8b7c
prerequisite-patch-id: 169273327e56f5461101a71dfbd6b4cfd4570cf0
prerequisite-patch-id: 0f34692c8a9673d4c4f6a3545cf8ec3a2abf8620
--
2.26.2
Using padata during deferred init has only been tested on x86, so for
now limit it to this architecture.
If another arch wants this, it can find the max thread limit that's best
for it and override deferred_page_init_max_threads().
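
The override relies on the linker preferring a strong definition over a weak
one. A minimal user-space sketch of that pattern follows, assuming GCC or
Clang and their __attribute__((weak)), which is what the kernel's __weak macro
expands to. max_threads_for_node() and its parameter are hypothetical
stand-ins for this example, not the kernel function.

```c
/*
 * Generic fallback, analogous to the __weak default in this patch:
 * used only when no strong definition is linked into the program.
 */
__attribute__((weak)) int max_threads_for_node(unsigned int nr_node_cpus)
{
	(void)nr_node_cpus;	/* the generic version ignores the CPU count */
	return 1;
}

/*
 * A hypothetical arch file, compiled separately and linked in, could
 * supply the strong definition that replaces the fallback above:
 *
 *	int max_threads_for_node(unsigned int nr_node_cpus)
 *	{
 *		return nr_node_cpus ? (int)nr_node_cpus : 1;
 *	}
 */
```

Built on its own, with no strong definition present, a call such as
max_threads_for_node(8) resolves to the weak fallback.
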
Signed-off-by: Daniel Jordan <[email protected]>
Tested-by: Josh Triplett <[email protected]>
---
arch/x86/mm/init_64.c | 12 ++++++++++++
include/linux/memblock.h | 3 +++
mm/page_alloc.c | 13 ++++++++-----
3 files changed, 23 insertions(+), 5 deletions(-)
diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index 8b5f73f5e207c..2d749ec12ea8a 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -1260,6 +1260,18 @@ void __init mem_init(void)
mem_init_print_info(NULL);
}
+#ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
+int __init deferred_page_init_max_threads(const struct cpumask *node_cpumask)
+{
+ /*
+ * More CPUs always led to greater speedups on tested systems, up to
+ * all the nodes' CPUs. Use all since the system is otherwise idle
+ * now.
+ */
+ return max_t(int, cpumask_weight(node_cpumask), 1);
+}
+#endif
+
int kernel_set_to_readonly;
void mark_rodata_ro(void)
diff --git a/include/linux/memblock.h b/include/linux/memblock.h
index 6bc37a731d27b..2b289df44194f 100644
--- a/include/linux/memblock.h
+++ b/include/linux/memblock.h
@@ -275,6 +275,9 @@ void __next_mem_pfn_range_in_zone(u64 *idx, struct zone *zone,
#define for_each_free_mem_pfn_range_in_zone_from(i, zone, p_start, p_end) \
for (; i != U64_MAX; \
__next_mem_pfn_range_in_zone(&i, zone, p_start, p_end))
+
+int __init deferred_page_init_max_threads(const struct cpumask *node_cpumask);
+
#endif /* CONFIG_DEFERRED_STRUCT_PAGE_INIT */
/**
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 1d47016849531..329fd1a809c59 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1835,6 +1835,13 @@ deferred_init_memmap_chunk(unsigned long start_pfn, unsigned long end_pfn,
}
}
+/* An arch may override for more concurrency. */
+__weak int __init
+deferred_page_init_max_threads(const struct cpumask *node_cpumask)
+{
+ return 1;
+}
+
/* Initialise remaining memory on a node */
static int __init deferred_init_memmap(void *data)
{
@@ -1883,11 +1890,7 @@ static int __init deferred_init_memmap(void *data)
first_init_pfn))
goto zone_empty;
- /*
- * More CPUs always led to greater speedups on tested systems, up to
- * all the nodes' CPUs. Use all since the system is otherwise idle now.
- */
- max_threads = max(cpumask_weight(cpumask), 1u);
+ max_threads = deferred_page_init_max_threads(cpumask);
while (spfn < epfn) {
unsigned long epfn_align = ALIGN(epfn, PAGES_PER_SECTION);
--
2.26.2
padata_driver_exit() is unnecessary because padata isn't built as a
module and doesn't exit.
padata's init routine will soon allocate memory, so getting rid of the
exit function now avoids pointless code to free it.
Signed-off-by: Daniel Jordan <[email protected]>
Tested-by: Josh Triplett <[email protected]>
---
kernel/padata.c | 6 ------
1 file changed, 6 deletions(-)
diff --git a/kernel/padata.c b/kernel/padata.c
index a6afa12fb75ee..835919c745266 100644
--- a/kernel/padata.c
+++ b/kernel/padata.c
@@ -1072,10 +1072,4 @@ static __init int padata_driver_init(void)
}
module_init(padata_driver_init);
-static __exit void padata_driver_exit(void)
-{
- cpuhp_remove_multi_state(CPUHP_PADATA_DEAD);
- cpuhp_remove_multi_state(hp_online);
-}
-module_exit(padata_driver_exit);
#endif
--
2.26.2