2020-05-20 18:29:38

by Daniel Jordan

Subject: [PATCH v2 5/7] mm: parallelize deferred_init_memmap()

Deferred struct page init is a significant bottleneck in kernel boot.
Optimizing it maximizes availability for large-memory systems and allows
spinning up short-lived VMs as needed without having to leave them
running. It also benefits bare metal machines hosting VMs that are
sensitive to downtime. In projects such as VMM Fast Restart[1], where
guest state is preserved across kexec reboot, it helps prevent
application and network timeouts in the guests.

Multithread to take full advantage of system memory bandwidth.

The maximum number of threads is capped at the number of CPUs on the
node because speedups always improve with additional threads on every
system tested, and at this phase of boot, the system is otherwise idle
and waiting on page init to finish.

Helper threads operate on section-aligned ranges to both avoid false
sharing when setting the pageblock's migrate type and to avoid accessing
uninitialized buddy pages, though max order alignment is enough for the
latter.

The minimum chunk size is also a section. There was benefit to using
multiple threads even on relatively small memory (1G) systems, and this
is the smallest size that the alignment allows.
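
To make the alignment and minimum-chunk rules above concrete, here is a
minimal userspace sketch. It is not the padata implementation. It assumes
4 KiB pages and 128 MiB sections (so PAGES_PER_SECTION is 32768, as on
x86-64), and the helpers pick_chunk_size(), count_chunks() and
print_chunks() are hypothetical names. It only shows how a section-sized,
section-aligned minimum chunk limits how finely a range can be split.

```c
#include <stdio.h>

/* Assumed value: 128 MiB sections with 4 KiB pages, as on x86-64. */
#define PAGES_PER_SECTION 32768UL

/* Round x up to a multiple of a (a must be nonzero). */
unsigned long round_up_to(unsigned long x, unsigned long a)
{
	return (x + a - 1) / a * a;
}

/*
 * Hypothetical helper: pick a chunk size for splitting `size` pages among
 * at most `max_threads` workers (max_threads must be at least 1), never
 * smaller than `min_chunk` and always a multiple of `align`.
 */
unsigned long pick_chunk_size(unsigned long size, unsigned long max_threads,
			      unsigned long min_chunk, unsigned long align)
{
	unsigned long chunk = size / max_threads;

	if (chunk < min_chunk)
		chunk = min_chunk;
	return round_up_to(chunk, align);
}

/* Number of chunks of `chunk` pages needed to cover `size` pages. */
unsigned long count_chunks(unsigned long size, unsigned long chunk)
{
	return (size + chunk - 1) / chunk;
}

/* Print the chunk boundaries for the PFN range [start, start + size). */
void print_chunks(unsigned long start, unsigned long size,
		  unsigned long chunk)
{
	unsigned long end = start + size;

	for (unsigned long pfn = start; pfn < end; pfn += chunk) {
		unsigned long chunk_end = pfn + chunk < end ? pfn + chunk : end;

		printf("chunk: [%lu, %lu)\n", pfn, chunk_end);
	}
}
```

Under these assumptions a 1G range is 262144 pages, or 8 sections, so at
most 8 helper threads can receive work: asking for 16 threads with
pick_chunk_size(262144, 16, PAGES_PER_SECTION, PAGES_PER_SECTION) still
yields chunks of one section each.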

The reported time (milliseconds) is that of the slowest node to
initialize, since boot blocks until all nodes finish. intel_pstate is
loaded in active mode without hwp and with turbo enabled, and intel_idle
is active as well.

Intel(R) Xeon(R) Platinum 8167M CPU @ 2.00GHz (Skylake, bare metal)
2 nodes * 26 cores * 2 threads = 104 CPUs
384G/node = 768G memory

               kernel boot                 deferred init
               ------------------------    ------------------------
node% (thr)    speedup  time_ms (stdev)    speedup  time_ms (stdev)
      (  0)         --   4078.0 (  9.0)         --   1779.0 (  8.7)
   2% (  1)       1.4%   4021.3 (  2.9)       3.4%   1717.7 (  7.8)
  12% (  6)      35.1%   2644.7 ( 35.3)      80.8%    341.0 ( 35.5)
  25% ( 13)      38.7%   2498.0 ( 34.2)      89.1%    193.3 ( 32.3)
  37% ( 19)      39.1%   2482.0 ( 25.2)      90.1%    175.3 ( 31.7)
  50% ( 26)      38.8%   2495.0 (  8.7)      89.1%    193.7 (  3.5)
  75% ( 39)      39.2%   2478.0 ( 21.0)      90.3%    172.7 ( 26.7)
 100% ( 52)      40.0%   2448.0 (  2.0)      91.9%    143.3 (  1.5)
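
The speedup columns are not defined in the text. They appear to be the
relative reduction in time against the 0-thread baseline row, and the
small check below assumes that definition. The function name
speedup_pct() is made up for illustration.

```c
#include <stdio.h>

/*
 * Assumed definition: percentage of the baseline time that was saved.
 * baseline_ms is the 0-thread row, time_ms is the row being compared.
 */
double speedup_pct(double baseline_ms, double time_ms)
{
	return (baseline_ms - time_ms) / baseline_ms * 100.0;
}

/* Print one comparison rounded to one decimal, like the tables. */
void print_speedup(const char *label, double baseline_ms, double time_ms)
{
	printf("%s: %.1f%%\n", label, speedup_pct(baseline_ms, time_ms));
}
```

With the Skylake numbers, speedup_pct(1779.0, 143.3) is about 91.9 for
deferred init and speedup_pct(4078.0, 2448.0) is about 40.0 for kernel
boot, which agrees with the 100% row.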

Intel(R) Xeon(R) CPU E5-2699C v4 @ 2.20GHz (Broadwell, bare metal)
1 node * 16 cores * 2 threads = 32 CPUs
192G/node = 192G memory

               kernel boot                 deferred init
               ------------------------    ------------------------
node% (thr)    speedup  time_ms (stdev)    speedup  time_ms (stdev)
      (  0)         --   1996.0 ( 18.0)         --   1104.3 (  6.7)
   3% (  1)       1.4%   1968.0 (  3.0)       2.7%   1074.7 (  9.0)
  12% (  4)      40.1%   1196.0 ( 22.7)      72.4%    305.3 ( 16.8)
  25% (  8)      47.4%   1049.3 ( 17.2)      84.2%    174.0 ( 10.6)
  37% ( 12)      48.3%   1032.0 ( 14.9)      86.8%    145.3 (  2.5)
  50% ( 16)      48.9%   1020.3 (  2.5)      88.0%    133.0 (  1.7)
  75% ( 24)      49.1%   1016.3 (  8.1)      88.4%    128.0 (  1.7)
 100% ( 32)      49.4%   1009.0 (  8.5)      88.6%    126.3 (  0.6)

Intel(R) Xeon(R) CPU E5-2699 v3 @ 2.30GHz (Haswell, bare metal)
2 nodes * 18 cores * 2 threads = 72 CPUs
128G/node = 256G memory

               kernel boot                 deferred init
               ------------------------    ------------------------
node% (thr)    speedup  time_ms (stdev)    speedup  time_ms (stdev)
      (  0)         --   1682.7 (  6.7)         --    630.0 (  4.6)
   3% (  1)       0.4%   1676.0 (  2.0)       0.7%    625.3 (  3.2)
  12% (  4)      25.8%   1249.0 (  1.0)      68.2%    200.3 (  1.2)
  25% (  9)      30.0%   1178.0 (  5.2)      79.7%    128.0 (  3.5)
  37% ( 13)      30.6%   1167.7 (  3.1)      81.3%    117.7 (  1.2)
  50% ( 18)      30.6%   1167.3 (  2.3)      81.4%    117.0 (  1.0)
  75% ( 27)      31.0%   1161.3 (  4.6)      82.5%    110.0 (  6.9)
 100% ( 36)      32.1%   1142.0 (  3.6)      85.7%     90.0 (  1.0)

AMD EPYC 7551 32-Core Processor (Zen, kvm guest)
1 node * 8 cores * 2 threads = 16 CPUs
64G/node = 64G memory

               kernel boot                 deferred init
               ------------------------    ------------------------
node% (thr)    speedup  time_ms (stdev)    speedup  time_ms (stdev)
      (  0)         --   1003.7 ( 16.6)         --    243.3 (  8.1)
   6% (  1)       1.4%    990.0 (  4.6)       1.2%    240.3 (  1.5)
  12% (  2)      11.4%    889.3 ( 16.7)      44.5%    135.0 (  3.0)
  25% (  4)      16.8%    835.3 (  9.0)      65.8%     83.3 (  2.5)
  37% (  6)      18.6%    816.7 ( 17.6)      70.4%     72.0 (  1.0)
  50% (  8)      18.2%    821.0 (  5.0)      70.7%     71.3 (  1.2)
  75% ( 12)      19.0%    813.3 (  5.0)      71.8%     68.7 (  2.1)
 100% ( 16)      19.8%    805.3 ( 10.8)      76.4%     57.3 ( 15.9)

Server-oriented distros that enable deferred page init sometimes run in
small VMs, and they still benefit even though the fraction of boot time
saved is smaller:

AMD EPYC 7551 32-Core Processor (Zen, kvm guest)
1 node * 2 cores * 2 threads = 4 CPUs
16G/node = 16G memory

               kernel boot                 deferred init
               ------------------------    ------------------------
node% (thr)    speedup  time_ms (stdev)    speedup  time_ms (stdev)
      (  0)         --    722.3 (  9.5)         --     50.7 (  0.6)
  25% (  1)      -3.3%    746.3 (  4.7)      -2.0%     51.7 (  1.2)
  50% (  2)       0.2%    721.0 ( 11.3)      29.6%     35.7 (  4.9)
  75% (  3)      -0.3%    724.3 ( 11.2)      48.7%     26.0 (  0.0)
 100% (  4)       3.0%    700.3 ( 13.6)      55.9%     22.3 (  0.6)

Intel(R) Xeon(R) CPU E5-2699 v3 @ 2.30GHz (Haswell, kvm guest)
1 node * 2 cores * 2 threads = 4 CPUs
14G/node = 14G memory

               kernel boot                 deferred init
               ------------------------    ------------------------
node% (thr)    speedup  time_ms (stdev)    speedup  time_ms (stdev)
      (  0)         --    673.0 (  6.9)         --     57.0 (  1.0)
  25% (  1)      -0.6%    677.3 ( 19.8)       1.8%     56.0 (  1.0)
  50% (  2)       3.4%    650.0 (  3.6)      36.8%     36.0 (  5.2)
  75% (  3)       4.2%    644.7 (  7.6)      56.1%     25.0 (  1.0)
 100% (  4)       5.3%    637.0 (  5.6)      63.2%     21.0 (  0.0)

On Josh's 96-CPU and 192G memory system:

Without this patch series:
[ 0.487132] node 0 initialised, 23398907 pages in 292ms
[ 0.499132] node 1 initialised, 24189223 pages in 304ms
...
[ 0.629376] Run /sbin/init as init process

With this patch series:
[ 0.227868] node 0 initialised, 23398907 pages in 28ms
[ 0.230019] node 1 initialised, 24189223 pages in 28ms
...
[ 0.361069] Run /sbin/init as init process

[1] https://static.sched.com/hosted_files/kvmforum2019/66/VMM-fast-restart_kvmforum2019.pdf

Signed-off-by: Daniel Jordan <[email protected]>
---
mm/Kconfig | 6 ++---
mm/page_alloc.c | 60 ++++++++++++++++++++++++++++++++++++++++++++-----
2 files changed, 58 insertions(+), 8 deletions(-)

diff --git a/mm/Kconfig b/mm/Kconfig
index c1acc34c1c358..04c1da3f9f44c 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -750,13 +750,13 @@ config DEFERRED_STRUCT_PAGE_INIT
 	depends on SPARSEMEM
 	depends on !NEED_PER_CPU_KM
 	depends on 64BIT
+	select PADATA
 	help
 	  Ordinarily all struct pages are initialised during early boot in a
 	  single thread. On very large machines this can take a considerable
 	  amount of time. If this option is set, large machines will bring up
-	  a subset of memmap at boot and then initialise the rest in parallel
-	  by starting one-off "pgdatinitX" kernel thread for each node X. This
-	  has a potential performance impact on processes running early in the
+	  a subset of memmap at boot and then initialise the rest in parallel.
+	  This has a potential performance impact on tasks running early in the
 	  lifetime of the system until these kthreads finish the
 	  initialisation.

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index d0c0d9364aa6d..9cb780e8dec78 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -68,6 +68,7 @@
#include <linux/lockdep.h>
#include <linux/nmi.h>
#include <linux/psi.h>
+#include <linux/padata.h>

#include <asm/sections.h>
#include <asm/tlbflush.h>
@@ -1814,16 +1815,44 @@ deferred_init_maxorder(u64 *i, struct zone *zone, unsigned long *start_pfn,
 	return nr_pages;
 }

+struct definit_args {
+	struct zone *zone;
+	atomic_long_t nr_pages;
+};
+
+static void __init
+deferred_init_memmap_chunk(unsigned long start_pfn, unsigned long end_pfn,
+			   void *arg)
+{
+	unsigned long spfn, epfn, nr_pages = 0;
+	struct definit_args *args = arg;
+	struct zone *zone = args->zone;
+	u64 i;
+
+	deferred_init_mem_pfn_range_in_zone(&i, zone, &spfn, &epfn, start_pfn);
+
+	/*
+	 * Initialize and free pages in MAX_ORDER sized increments so that we
+	 * can avoid introducing any issues with the buddy allocator.
+	 */
+	while (spfn < end_pfn) {
+		nr_pages += deferred_init_maxorder(&i, zone, &spfn, &epfn);
+		cond_resched();
+	}
+
+	atomic_long_add(nr_pages, &args->nr_pages);
+}
+
 /* Initialise remaining memory on a node */
 static int __init deferred_init_memmap(void *data)
 {
 	pg_data_t *pgdat = data;
 	const struct cpumask *cpumask = cpumask_of_node(pgdat->node_id);
 	unsigned long spfn = 0, epfn = 0, nr_pages = 0;
-	unsigned long first_init_pfn, flags;
+	unsigned long first_init_pfn, flags, epfn_align;
 	unsigned long start = jiffies;
 	struct zone *zone;
-	int zid;
+	int zid, max_threads;
 	u64 i;

 	/* Bind memory initialisation thread to a local node if possible */
@@ -1863,11 +1892,32 @@ static int __init deferred_init_memmap(void *data)
 		goto zone_empty;

 	/*
-	 * Initialize and free pages in MAX_ORDER sized increments so
-	 * that we can avoid introducing any issues with the buddy
-	 * allocator.
+	 * More CPUs always led to greater speedups on tested systems, up to
+	 * all the nodes' CPUs. Use all since the system is otherwise idle now.
 	 */
+	max_threads = max(cpumask_weight(cpumask), 1u);
+
 	while (spfn < epfn) {
+		epfn_align = ALIGN_DOWN(epfn, PAGES_PER_SECTION);
+
+		if (IS_ALIGNED(spfn, PAGES_PER_SECTION) &&
+		    epfn_align - spfn >= PAGES_PER_SECTION) {
+			struct definit_args arg = { zone, ATOMIC_LONG_INIT(0) };
+			struct padata_mt_job job = {
+				.thread_fn = deferred_init_memmap_chunk,
+				.fn_arg = &arg,
+				.start = spfn,
+				.size = epfn_align - spfn,
+				.align = PAGES_PER_SECTION,
+				.min_chunk = PAGES_PER_SECTION,
+				.max_threads = max_threads,
+			};
+
+			padata_do_multithreaded(&job);
+			nr_pages += atomic_long_read(&arg.nr_pages);
+			spfn = epfn_align;
+		}
+
 		nr_pages += deferred_init_maxorder(&i, zone, &spfn, &epfn);
 		cond_resched();
 	}
--
2.26.2


2020-05-21 01:30:24

by Alexander Duyck

Subject: Re: [PATCH v2 5/7] mm: parallelize deferred_init_memmap()

On Wed, May 20, 2020 at 11:27 AM Daniel Jordan
<[email protected]> wrote:
>
> @@ -1814,16 +1815,44 @@ deferred_init_maxorder(u64 *i, struct zone *zone, unsigned long *start_pfn,
>  	return nr_pages;
>  }
>
> +struct definit_args {
> +	struct zone *zone;
> +	atomic_long_t nr_pages;
> +};
> +
> +static void __init
> +deferred_init_memmap_chunk(unsigned long start_pfn, unsigned long end_pfn,
> +			   void *arg)
> +{
> +	unsigned long spfn, epfn, nr_pages = 0;
> +	struct definit_args *args = arg;
> +	struct zone *zone = args->zone;
> +	u64 i;
> +
> +	deferred_init_mem_pfn_range_in_zone(&i, zone, &spfn, &epfn, start_pfn);
> +
> +	/*
> +	 * Initialize and free pages in MAX_ORDER sized increments so that we
> +	 * can avoid introducing any issues with the buddy allocator.
> +	 */
> +	while (spfn < end_pfn) {
> +		nr_pages += deferred_init_maxorder(&i, zone, &spfn, &epfn);
> +		cond_resched();
> +	}
> +
> +	atomic_long_add(nr_pages, &args->nr_pages);
> +}
> +

Personally I would get rid of nr_pages entirely. It isn't worth the
cache thrash to have this atomic variable bouncing around. You could
probably just have this function return void, since all nr_pages is
used for is a pr_info statement at the end of the initialization, which
will be completely useless now that the threads are running in
parallel.

We only really need the nr_pages logic in deferred_grow_zone in order
to track if we have freed enough pages to allow us to go back to what
we were doing.

>  /* Initialise remaining memory on a node */
>  static int __init deferred_init_memmap(void *data)
>  {
>  	pg_data_t *pgdat = data;
>  	const struct cpumask *cpumask = cpumask_of_node(pgdat->node_id);
>  	unsigned long spfn = 0, epfn = 0, nr_pages = 0;
> -	unsigned long first_init_pfn, flags;
> +	unsigned long first_init_pfn, flags, epfn_align;
>  	unsigned long start = jiffies;
>  	struct zone *zone;
> -	int zid;
> +	int zid, max_threads;
>  	u64 i;
>
>  	/* Bind memory initialisation thread to a local node if possible */
> @@ -1863,11 +1892,32 @@ static int __init deferred_init_memmap(void *data)
>  		goto zone_empty;
>
>  	/*
> -	 * Initialize and free pages in MAX_ORDER sized increments so
> -	 * that we can avoid introducing any issues with the buddy
> -	 * allocator.
> +	 * More CPUs always led to greater speedups on tested systems, up to
> +	 * all the nodes' CPUs. Use all since the system is otherwise idle now.
>  	 */
> +	max_threads = max(cpumask_weight(cpumask), 1u);
> +
>  	while (spfn < epfn) {
> +		epfn_align = ALIGN_DOWN(epfn, PAGES_PER_SECTION);
> +
> +		if (IS_ALIGNED(spfn, PAGES_PER_SECTION) &&
> +		    epfn_align - spfn >= PAGES_PER_SECTION) {
> +			struct definit_args arg = { zone, ATOMIC_LONG_INIT(0) };
> +			struct padata_mt_job job = {
> +				.thread_fn = deferred_init_memmap_chunk,
> +				.fn_arg = &arg,
> +				.start = spfn,
> +				.size = epfn_align - spfn,
> +				.align = PAGES_PER_SECTION,
> +				.min_chunk = PAGES_PER_SECTION,
> +				.max_threads = max_threads,
> +			};
> +
> +			padata_do_multithreaded(&job);
> +			nr_pages += atomic_long_read(&arg.nr_pages);
> +			spfn = epfn_align;
> +		}
> +
>  		nr_pages += deferred_init_maxorder(&i, zone, &spfn, &epfn);
>  		cond_resched();
>  	}

This doesn't look right. You are basically adding threads in addition
to calls to deferred_init_maxorder. In addition you are spawning one
job per section instead of per range. Really you should be going for
something more along the lines of:

	while (spfn < epfn) {
		unsigned long epfn_align = ALIGN(epfn, PAGES_PER_SECTION);
		struct definit_args arg = { zone, ATOMIC_LONG_INIT(0) };
		struct padata_mt_job job = {
			.thread_fn = deferred_init_memmap_chunk,
			.fn_arg = &arg,
			.start = spfn,
			.size = epfn_align - spfn,
			.align = PAGES_PER_SECTION,
			.min_chunk = PAGES_PER_SECTION,
			.max_threads = max_threads,
		};

		padata_do_multithreaded(&job);

		for_each_free_mem_pfn_range_in_zone_from(i, zone, spfn, epfn) {
			if (epfn_align <= spfn)
				break;
		}
	}

This should accomplish the same thing, but much more efficiently. The
only thing you really lose is the tracking of nr_pages, which doesn't
add anything since the value could shift around depending on how many
times deferred_grow_zone got called.

Also, spfn should already be section aligned, or at least be in a new
section unrelated to the one we just scheduled, so there is no need for
the extra checks you had.

2020-05-21 15:01:30

by Alexander Duyck

Subject: Re: [PATCH v2 5/7] mm: parallelize deferred_init_memmap()

On Wed, May 20, 2020 at 6:29 PM Alexander Duyck
<[email protected]> wrote:
>
> This doesn't look right. You are basically adding threads in addition
> to calls to deferred_init_maxorder. In addition you are spawning one
> job per section instead of per range. Really you should be going for
> something more along the lines of:
>
> while (spfn < epfn) {
> 	unsigned long epfn_align = ALIGN(epfn, PAGES_PER_SECTION);
> 	struct definit_args arg = { zone, ATOMIC_LONG_INIT(0) };
> 	struct padata_mt_job job = {
> 		.thread_fn = deferred_init_memmap_chunk,
> 		.fn_arg = &arg,
> 		.start = spfn,
> 		.size = epfn_align - spfn,
> 		.align = PAGES_PER_SECTION,
> 		.min_chunk = PAGES_PER_SECTION,
> 		.max_threads = max_threads,
> 	};
>
> 	padata_do_multithreaded(&job);
>
> 	for_each_free_mem_pfn_range_in_zone_from(i, zone, spfn, epfn) {
> 		if (epfn_align <= spfn)
> 			break;
> 	}
> }
>

So I was thinking about my suggestion further and the loop at the end
isn't quite correct as I believe it could lead to gaps. The loop on
the end should probably be:

	for_each_free_mem_pfn_range_in_zone_from(i, zone, spfn, epfn) {
		if (epfn <= epfn_align)
			continue;
		if (spfn < epfn_align)
			spfn = epfn_align;
		break;
	}

That would generate a new range where epfn_align has actually ended
and there is a range of new PFNs to process.
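
A minimal userspace sketch of the corrected loop, assuming the free ranges sit in a plain array instead of being walked with for_each_free_mem_pfn_range_in_zone_from(). struct pfn_range and next_range_after() are hypothetical names, and the model scans from the first range rather than resuming from the iterator position i.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for one free memory range [spfn, epfn). */
struct pfn_range {
	unsigned long spfn;
	unsigned long epfn;
};

/*
 * Model of the corrected loop: skip ranges that end at or before
 * epfn_align, trim the front of a range that straddles it, and stop.
 * Returns 1 and fills *next if there is more work, otherwise 0.
 */
static int next_range_after(const struct pfn_range *r, size_t n,
			    unsigned long epfn_align,
			    struct pfn_range *next)
{
	for (size_t i = 0; i < n; i++) {
		unsigned long spfn = r[i].spfn, epfn = r[i].epfn;

		assert(spfn < epfn);
		if (epfn <= epfn_align)
			continue;		/* fully covered by the last job */
		if (spfn < epfn_align)
			spfn = epfn_align;	/* partly covered: trim the front */
		next->spfn = spfn;
		next->epfn = epfn;
		return 1;
	}
	return 0;
}
```

With ranges [0,100) and [150,300) and epfn_align at 200, the model yields [200,300) as the next range to schedule, so the tail of the straddling range is kept.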

Thanks.

- Alex

2020-05-21 15:39:09

by Daniel Jordan

Subject: Re: [PATCH v2 5/7] mm: parallelize deferred_init_memmap()

On Wed, May 20, 2020 at 06:29:32PM -0700, Alexander Duyck wrote:
> On Wed, May 20, 2020 at 11:27 AM Daniel Jordan
> > @@ -1814,16 +1815,44 @@ deferred_init_maxorder(u64 *i, struct zone *zone, unsigned long *start_pfn,
> > return nr_pages;
> > }
> >
> > +struct definit_args {
> > + struct zone *zone;
> > + atomic_long_t nr_pages;
> > +};
> > +
> > +static void __init
> > +deferred_init_memmap_chunk(unsigned long start_pfn, unsigned long end_pfn,
> > + void *arg)
> > +{
> > + unsigned long spfn, epfn, nr_pages = 0;
> > + struct definit_args *args = arg;
> > + struct zone *zone = args->zone;
> > + u64 i;
> > +
> > + deferred_init_mem_pfn_range_in_zone(&i, zone, &spfn, &epfn, start_pfn);
> > +
> > + /*
> > + * Initialize and free pages in MAX_ORDER sized increments so that we
> > + * can avoid introducing any issues with the buddy allocator.
> > + */
> > + while (spfn < end_pfn) {
> > + nr_pages += deferred_init_maxorder(&i, zone, &spfn, &epfn);
> > + cond_resched();
> > + }
> > +
> > + atomic_long_add(nr_pages, &args->nr_pages);
> > +}
> > +
>
> Personally I would get rid of nr_pages entirely. It isn't worth the
> cache thrash to have this atomic variable bouncing around.

One of the things I tried to optimize was the managed_pages atomic adds in
__free_pages_core, but performance stayed the same on the biggest machine I
tested when it was done once at the end of page init instead of in every thread
for every pageblock.

I'm not sure this atomic would matter either, given it's less frequent.
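
For reference, the pattern under discussion can be sketched in userspace C11 as below: each chunk counts pages in a local variable and publishes the total with a single atomic add. definit_args_model and init_chunk() are hypothetical stand-ins for the patch's definit_args and deferred_init_memmap_chunk(), and the step parameter only imitates working in fixed-size increments.

```c
#include <assert.h>
#include <stdatomic.h>

/* Userspace stand-in for the patch's struct definit_args. */
struct definit_args_model {
	atomic_long nr_pages;
};

/* Hypothetical per-chunk worker: count locally, publish once. */
static void init_chunk(unsigned long start_pfn, unsigned long end_pfn,
		       unsigned long step, struct definit_args_model *args)
{
	unsigned long nr_pages = 0;

	assert(step > 0 && start_pfn <= end_pfn);
	for (unsigned long pfn = start_pfn; pfn < end_pfn; pfn += step) {
		unsigned long left = end_pfn - pfn;

		/* Plain add on a local: no shared cache line is touched. */
		nr_pages += left < step ? left : step;
	}
	/* One atomic operation per chunk, not one per increment. */
	atomic_fetch_add(&args->nr_pages, (long)nr_pages);
}
```

The shared counter is written once per chunk, which is why it is expected to be far less frequent than a per-pageblock atomic.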

> You could
> probably just have this function return void since all nr_pages is
> used for is a pr_info statement at the end of the initialization
> which will be completely useless now anyway since we really have the
> threads running in parallel anyway.

The timestamp is still useful for observability, page init is a significant
part of kernel boot on big machines, over 10% sometimes with these patches.

It's mostly the time that matters though, I agree the number of pages is less
important and is probably worth removing just to simplify the code. I'll do it
if no one sees a reason to keep it.

> We only really need the nr_pages logic in deferred_grow_zone in order
> to track if we have freed enough pages to allow us to go back to what
> we were doing.
>
> > @@ -1863,11 +1892,32 @@ static int __init deferred_init_memmap(void *data)
> > goto zone_empty;
> >
> > /*
> > - * Initialize and free pages in MAX_ORDER sized increments so
> > - * that we can avoid introducing any issues with the buddy
> > - * allocator.
> > + * More CPUs always led to greater speedups on tested systems, up to
> > + * all the nodes' CPUs. Use all since the system is otherwise idle now.
> > */
> > + max_threads = max(cpumask_weight(cpumask), 1u);
> > +
> > while (spfn < epfn) {
> > + epfn_align = ALIGN_DOWN(epfn, PAGES_PER_SECTION);
> > +
> > + if (IS_ALIGNED(spfn, PAGES_PER_SECTION) &&
> > + epfn_align - spfn >= PAGES_PER_SECTION) {
> > + struct definit_args arg = { zone, ATOMIC_LONG_INIT(0) };
> > + struct padata_mt_job job = {
> > + .thread_fn = deferred_init_memmap_chunk,
> > + .fn_arg = &arg,
> > + .start = spfn,
> > + .size = epfn_align - spfn,
> > + .align = PAGES_PER_SECTION,
> > + .min_chunk = PAGES_PER_SECTION,
> > + .max_threads = max_threads,
> > + };
> > +
> > + padata_do_multithreaded(&job);
> > + nr_pages += atomic_long_read(&arg.nr_pages);
> > + spfn = epfn_align;
> > + }
> > +
> > nr_pages += deferred_init_maxorder(&i, zone, &spfn, &epfn);
> > cond_resched();
> > }
>
> This doesn't look right. You are basically adding threads in addition
> to calls to deferred_init_maxorder.

The deferred_init_maxorder call is there to do the remaining, non-section
aligned part of a range. It doesn't have to be done this way.

> In addition you are spawning one
> job per section instead of per range.

That's not what's happening, all the above is doing is aligning the end of the
range down to a section. Each thread is working on way more than a section at
a time.
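
A rough sketch of why each thread ends up with much more than a section. It assumes the job is split by dividing its size among the threads times a balancing factor, with min_chunk as a floor and the result rounded up to the alignment. model_chunk_size() and the balancing factor are assumptions for illustration, not padata's actual code, and 32768 pages per section is an assumed value.

```c
#include <assert.h>

/*
 * Hypothetical model of how a multithreaded job might be cut into
 * chunks: divide the work among threads (times a balancing factor),
 * never go below min_chunk, and round up to the requested alignment.
 */
static unsigned long model_chunk_size(unsigned long size,
				      unsigned long max_threads,
				      unsigned long balance_factor,
				      unsigned long min_chunk,
				      unsigned long align)
{
	unsigned long chunk;

	assert(max_threads && balance_factor && align);
	chunk = size / (max_threads * balance_factor);
	if (chunk < min_chunk)
		chunk = min_chunk;
	/* Round up to a multiple of align. */
	return (chunk + align - 1) / align * align;
}
```

Plugging in a 384G node (100663296 4K pages), 52 threads and an assumed factor of 4 gives chunks of 491520 pages, or 15 sections each under these assumptions.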

> Really you should be going for
> something more along the lines of:
>
> while (spfn < epfn) {
> 	unsigned long epfn_align = ALIGN(epfn, PAGES_PER_SECTION);
> 	struct definit_args arg = { zone, ATOMIC_LONG_INIT(0) };
> 	struct padata_mt_job job = {
> 		.thread_fn = deferred_init_memmap_chunk,
> 		.fn_arg = &arg,
> 		.start = spfn,
> 		.size = epfn_align - spfn,
> 		.align = PAGES_PER_SECTION,
> 		.min_chunk = PAGES_PER_SECTION,
> 		.max_threads = max_threads,
> 	};
>
> 	padata_do_multithreaded(&job);
>
> 	for_each_free_mem_pfn_range_in_zone_from(i, zone, spfn, epfn) {
> 		if (epfn_align <= spfn)
> 			break;
> 	}
> }

I can see what you're getting at even though I think this can leave ranges
uninitialized. Starting with range [a,b), b is aligned up to d and the inner
loop skips [c,e).

a      b   c   d      e
|              |      |    section boundaries
[      )   [          )
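
The gap can be reproduced with a small userspace model, assuming the free ranges are kept in an array. struct pfn_range and pfns_skipped() are hypothetical names; the function counts the PFNs that the proposed inner loop passes over because it only stops at a range whose start is at or beyond epfn_align.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for one free memory range [spfn, epfn). */
struct pfn_range {
	unsigned long spfn;
	unsigned long epfn;
};

/*
 * PFNs left behind by a loop that breaks only when
 * epfn_align <= spfn: the part of a straddling range that lies
 * past epfn_align is never handed to a job.
 */
static unsigned long pfns_skipped(const struct pfn_range *r, size_t n,
				  unsigned long epfn_align)
{
	unsigned long skipped = 0;

	for (size_t i = 0; i < n; i++) {
		assert(r[i].spfn < r[i].epfn);
		if (epfn_align <= r[i].spfn)
			break;	/* work resumes here, nothing lost after this */
		if (r[i].epfn > epfn_align)
			skipped += r[i].epfn - epfn_align;
	}
	return skipped;
}
```

Taking a=0, b=60, c=80, d=100 and e=140 as made-up values, the ranges [0,60) and [80,140) with epfn_align=100 leave [100,140) untouched, which is the [d,e) tail of the second range.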

We could use deferred_init_mem_pfn_range_in_zone() instead of the for_each
loop.

What I was trying to avoid by aligning down is creating a discontiguous pfn
range that gets passed to padata. We already discussed how those are handled
by the zone iterator in the thread function, but job->size can be exaggerated
to include parts of the range that are never touched. Thinking more about it
though, it's a small fraction of the total work and shouldn't matter.
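
To put a number on the exaggeration, a sketch like the following compares the job window with the pages that actually fall inside a free range. It is a userspace model under the assumption that the ranges live in an array; struct pfn_range and pages_touched() are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for one free memory range [spfn, epfn). */
struct pfn_range {
	unsigned long spfn;
	unsigned long epfn;
};

/*
 * Pages inside the job window [start, start + size) that belong to
 * some free range. The rest of the window is a hole that the thread
 * function's zone iterator steps over without doing any work.
 */
static unsigned long pages_touched(const struct pfn_range *r, size_t n,
				   unsigned long start, unsigned long size)
{
	unsigned long end = start + size, touched = 0;

	for (size_t i = 0; i < n; i++) {
		unsigned long lo = r[i].spfn > start ? r[i].spfn : start;
		unsigned long hi = r[i].epfn < end ? r[i].epfn : end;

		assert(r[i].spfn < r[i].epfn);
		if (lo < hi)
			touched += hi - lo;
	}
	return touched;
}
```

With ranges [0,60) and [80,140) and a job covering [0,100), job->size would be 100 while only 80 pages are touched; the hole [60,80) is the overcount.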

> This should accomplish the same thing, but much more efficiently.

Well, more cleanly. I'll give it a try.

> The
> only thing you really lose is the tracking of nr_pages which really
> doesn't add anything anyway since the value could shift around
> depending on how many times deferred_grow_zone got called anyway.
>
> Also the spfn should already be section aligned, or at least be in a
> new section unrelated to the one we just scheduled, so there is no
> need for the extra checks you had.

I was doing it to be robust to future changes. Otherwise epfn_align - spfn
could be huge when aligning down, but with aligning up it won't matter and can
be removed.

2020-05-21 16:47:23

by Alexander Duyck

Subject: Re: [PATCH v2 5/7] mm: parallelize deferred_init_memmap()

On Thu, May 21, 2020 at 8:37 AM Daniel Jordan wrote:
>
> On Wed, May 20, 2020 at 06:29:32PM -0700, Alexander Duyck wrote:
> > On Wed, May 20, 2020 at 11:27 AM Daniel Jordan
> > > @@ -1814,16 +1815,44 @@ deferred_init_maxorder(u64 *i, struct zone *zone, unsigned long *start_pfn,
> > > return nr_pages;
> > > }
> > >
> > > +struct definit_args {
> > > + struct zone *zone;
> > > + atomic_long_t nr_pages;
> > > +};
> > > +
> > > +static void __init
> > > +deferred_init_memmap_chunk(unsigned long start_pfn, unsigned long end_pfn,
> > > + void *arg)
> > > +{
> > > + unsigned long spfn, epfn, nr_pages = 0;
> > > + struct definit_args *args = arg;
> > > + struct zone *zone = args->zone;
> > > + u64 i;
> > > +
> > > + deferred_init_mem_pfn_range_in_zone(&i, zone, &spfn, &epfn, start_pfn);
> > > +
> > > + /*
> > > + * Initialize and free pages in MAX_ORDER sized increments so that we
> > > + * can avoid introducing any issues with the buddy allocator.
> > > + */
> > > + while (spfn < end_pfn) {
> > > + nr_pages += deferred_init_maxorder(&i, zone, &spfn, &epfn);
> > > + cond_resched();
> > > + }
> > > +
> > > + atomic_long_add(nr_pages, &args->nr_pages);
> > > +}
> > > +
> >
> > Personally I would get rid of nr_pages entirely. It isn't worth the
> > cache thrash to have this atomic variable bouncing around.
>
> One of the things I tried to optimize was the managed_pages atomic adds in
> __free_pages_core, but performance stayed the same on the biggest machine I
> tested when it was done once at the end of page init instead of in every thread
> for every pageblock.
>
> I'm not sure this atomic would matter either, given it's less frequent.

It is more about not bothering with the extra tracking. We don't
really need it and having it doesn't really add much in the way of
value.

> > You could
> > probably just have this function return void since all nr_pages is
> > used for is a pr_info statement at the end of the initialization
> > which will be completely useless now anyway since we really have the
> > threads running in parallel anyway.
>
> The timestamp is still useful for observability, page init is a significant
> part of kernel boot on big machines, over 10% sometimes with these patches.

Agreed.

> It's mostly the time that matters though, I agree the number of pages is less
> important and is probably worth removing just to simplify the code. I'll do it
> if no one sees a reason to keep it.

Sounds good.

> > We only really need the nr_pages logic in deferred_grow_zone in order
> > to track if we have freed enough pages to allow us to go back to what
> > we were doing.
> >
> > > @@ -1863,11 +1892,32 @@ static int __init deferred_init_memmap(void *data)
> > > goto zone_empty;
> > >
> > > /*
> > > - * Initialize and free pages in MAX_ORDER sized increments so
> > > - * that we can avoid introducing any issues with the buddy
> > > - * allocator.
> > > + * More CPUs always led to greater speedups on tested systems, up to
> > > + * all the nodes' CPUs. Use all since the system is otherwise idle now.
> > > */
> > > + max_threads = max(cpumask_weight(cpumask), 1u);
> > > +
> > > while (spfn < epfn) {
> > > + epfn_align = ALIGN_DOWN(epfn, PAGES_PER_SECTION);
> > > +
> > > + if (IS_ALIGNED(spfn, PAGES_PER_SECTION) &&
> > > + epfn_align - spfn >= PAGES_PER_SECTION) {
> > > + struct definit_args arg = { zone, ATOMIC_LONG_INIT(0) };
> > > + struct padata_mt_job job = {
> > > + .thread_fn = deferred_init_memmap_chunk,
> > > + .fn_arg = &arg,
> > > + .start = spfn,
> > > + .size = epfn_align - spfn,
> > > + .align = PAGES_PER_SECTION,
> > > + .min_chunk = PAGES_PER_SECTION,
> > > + .max_threads = max_threads,
> > > + };
> > > +
> > > + padata_do_multithreaded(&job);
> > > + nr_pages += atomic_long_read(&arg.nr_pages);
> > > + spfn = epfn_align;
> > > + }
> > > +
> > > nr_pages += deferred_init_maxorder(&i, zone, &spfn, &epfn);
> > > cond_resched();
> > > }
> >
> > This doesn't look right. You are basically adding threads in addition
> > to calls to deferred_init_maxorder.
>
> The deferred_init_maxorder call is there to do the remaining, non-section
> aligned part of a range. It doesn't have to be done this way.

It is also doing the advancing though isn't it?

> > In addition you are spawning one
> > job per section instead of per range.
>
> That's not what's happening, all the above is doing is aligning the end of the
> range down to a section. Each thread is working on way more than a section at
> a time.

Yeah, now that I reread it I see that. For some reason I was thinking
you were aligning spfn, not epfn.

> > Really you should be going for
> > something more along the lines of:
> >
> > while (spfn < epfn) {
> > 	unsigned long epfn_align = ALIGN(epfn, PAGES_PER_SECTION);
> > 	struct definit_args arg = { zone, ATOMIC_LONG_INIT(0) };
> > 	struct padata_mt_job job = {
> > 		.thread_fn = deferred_init_memmap_chunk,
> > 		.fn_arg = &arg,
> > 		.start = spfn,
> > 		.size = epfn_align - spfn,
> > 		.align = PAGES_PER_SECTION,
> > 		.min_chunk = PAGES_PER_SECTION,
> > 		.max_threads = max_threads,
> > 	};
> >
> > 	padata_do_multithreaded(&job);
> >
> > 	for_each_free_mem_pfn_range_in_zone_from(i, zone, spfn, epfn) {
> > 		if (epfn_align <= spfn)
> > 			break;
> > 	}
> > }
>
> I can see what you're getting at even though I think this can leave ranges
> uninitialized. Starting with range [a,b), b is aligned up to d and the inner
> loop skips [c,e).
>
> a      b   c   d      e
> |              |      |    section boundaries
> [      )   [          )

I think I resolved this with the fix for it I described in the other
email. We just need to swap out spfn for epfn and make sure we align
spfn with epfn_align. Then I think that takes care of possible skips.

> We could use deferred_init_mem_pfn_range_in_zone() instead of the for_each
> loop.
>
> What I was trying to avoid by aligning down is creating a discontiguous pfn
> range that gets passed to padata. We already discussed how those are handled
> by the zone iterator in the thread function, but job->size can be exaggerated
> to include parts of the range that are never touched. Thinking more about it
> though, it's a small fraction of the total work and shouldn't matter.

So the problem with aligning down is that you are going to be slowed
up as you have to go single threaded to initialize whatever remains.
So worst case scenario is that you have a section aligned block and
you will process all but 1 section in parallel, and then have to
process the remaining section one max order block at a time.
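
As a back-of-the-envelope sketch of that serial tail: the number of one-block-at-a-time steps is the tail length divided by the max order block size, rounded up. serial_steps() is a hypothetical helper, and the two constants are assumed values for x86-64 with 4K pages (128M sections, 4M max order blocks), not figures taken from this thread.

```c
#include <assert.h>

/* Assumed values for illustration only. */
#define PAGES_PER_SECTION	32768UL	/* 128 MiB / 4 KiB */
#define MAX_ORDER_NR_PAGES	1024UL	/* 4 MiB / 4 KiB */

/*
 * Number of single-threaded steps needed to initialize tail_pages
 * when each step handles at most block_pages.
 */
static unsigned long serial_steps(unsigned long tail_pages,
				  unsigned long block_pages)
{
	assert(block_pages > 0);
	return (tail_pages + block_pages - 1) / block_pages;
}
```

Under these assumptions a tail of just under one section, PAGES_PER_SECTION - 1 pages, costs 32 serial steps per range, which adds up as the number of ranges grows.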

> > This should accomplish the same thing, but much more efficiently.
>
> Well, more cleanly. I'll give it a try.

I agree I am not sure if it will make a big difference on x86, however
the more ranges you have to process the faster this approach should be
as it stays parallel the entire time rather than having to drop out
and process the last section one max order block at a time.

> > The
> > only thing you really lose is the tracking of nr_pages which really
> > doesn't add anything anyway since the value could shift around
> > depending on how many times deferred_grow_zone got called anyway.
> >
> > Also the spfn should already be section aligned, or at least be in a
> > new section unrelated to the one we just scheduled, so there is no
> > need for the extra checks you had.
>
> I was doing it to be robust to future changes. Otherwise epfn_align - spfn
> could be huge when aligning down, but with aligning up it won't matter and can
> be removed.

Right. So that was to catch the case where you could potentially align
down below spfn.

2020-05-21 21:16:59

by Daniel Jordan

Subject: Re: [PATCH v2 5/7] mm: parallelize deferred_init_memmap()

On Thu, May 21, 2020 at 09:46:35AM -0700, Alexander Duyck wrote:
> It is more about not bothering with the extra tracking. We don't
> really need it and having it doesn't really add much in the way of
> value.

Yeah, it can probably go.

> > > > @@ -1863,11 +1892,32 @@ static int __init deferred_init_memmap(void *data)
> > > > goto zone_empty;
> > > >
> > > > /*
> > > > - * Initialize and free pages in MAX_ORDER sized increments so
> > > > - * that we can avoid introducing any issues with the buddy
> > > > - * allocator.
> > > > + * More CPUs always led to greater speedups on tested systems, up to
> > > > + * all the nodes' CPUs. Use all since the system is otherwise idle now.
> > > > */
> > > > + max_threads = max(cpumask_weight(cpumask), 1u);
> > > > +
> > > > while (spfn < epfn) {
> > > > + epfn_align = ALIGN_DOWN(epfn, PAGES_PER_SECTION);
> > > > +
> > > > + if (IS_ALIGNED(spfn, PAGES_PER_SECTION) &&
> > > > + epfn_align - spfn >= PAGES_PER_SECTION) {
> > > > + struct definit_args arg = { zone, ATOMIC_LONG_INIT(0) };
> > > > + struct padata_mt_job job = {
> > > > + .thread_fn = deferred_init_memmap_chunk,
> > > > + .fn_arg = &arg,
> > > > + .start = spfn,
> > > > + .size = epfn_align - spfn,
> > > > + .align = PAGES_PER_SECTION,
> > > > + .min_chunk = PAGES_PER_SECTION,
> > > > + .max_threads = max_threads,
> > > > + };
> > > > +
> > > > + padata_do_multithreaded(&job);
> > > > + nr_pages += atomic_long_read(&arg.nr_pages);
> > > > + spfn = epfn_align;
> > > > + }
> > > > +
> > > > nr_pages += deferred_init_maxorder(&i, zone, &spfn, &epfn);
> > > > cond_resched();
> > > > }
> > >
> > > This doesn't look right. You are basically adding threads in addition
> > > to calls to deferred_init_maxorder.
> >
> > The deferred_init_maxorder call is there to do the remaining, non-section
> > aligned part of a range. It doesn't have to be done this way.
>
> It is also doing the advancing though isn't it?

Yes. Not sure what you're getting at. There's the 'spfn = epfn_align' before
so nothing is skipped. It's true that the nonaligned part is done outside of
padata when it could be done by a thread that'd otherwise be waiting or idle,
which should be addressed in the next version.

> I think I resolved this with the fix for it I described in the other
> email. We just need to swap out spfn for epfn and make sure we align
> spfn with epfn_align. Then I think that takes care of possible skips.

Right, though your fix looks a lot like deferred_init_mem_pfn_range_in_zone().
Seems better to just use that and not repeat ourselves. Lame that it's
starting at the beginning of the ranges every time, maybe it could be
generalized somehow, but I think it should be fast enough.

> > We could use deferred_init_mem_pfn_range_in_zone() instead of the for_each
> > loop.
> >
> > What I was trying to avoid by aligning down is creating a discontiguous pfn
> > range that gets passed to padata. We already discussed how those are handled
> > by the zone iterator in the thread function, but job->size can be exaggerated
> > to include parts of the range that are never touched. Thinking more about it
> > though, it's a small fraction of the total work and shouldn't matter.
>
> So the problem with aligning down is that you are going to be slowed
> up as you have to go single threaded to initialize whatever remains.
> So worst case scenario is that you have a section aligned block and
> you will process all but 1 section in parallel, and then have to
> process the remaining section one max order block at a time.

Yes, aligning up is better.

> > > This should accomplish the same thing, but much more efficiently.
> >
> > Well, more cleanly. I'll give it a try.
>
> I agree I am not sure if it will make a big difference on x86, however
> the more ranges you have to process the faster this approach should be
> as it stays parallel the entire time rather than having to drop out
> and process the last section one max order block at a time.

Right.