2013-07-12 02:04:06

by Robin Holt

[permalink] [raw]
Subject: [RFC 0/4] Transparent on-demand struct page initialization embedded in the buddy allocator

We have been working on this since we returned from shutdown and have
something to discuss now. We restricted ourselves to 2MiB initialization
to keep the patch set a little smaller and more clear.

First, I want to propose getting rid of the page flag. If I knew of a
concrete way to determine that a page has not been initialized, this
patch series would look different. If there is no definitive way to
determine that a struct page has been initialized, aside from checking
that the entire page struct is zero, then I would suggest we change the
page flag so that it indicates the page has been initialized rather than
that it has not.
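
For illustration only, the kind of check I mean (not part of this series,
and the helper name below is made up) would be something along the lines of:

static inline bool page_struct_is_uninitialized(struct page *page)
{
        /* sketch: treat an all-zero struct page as "never initialized" */
        return memchr_inv(page, 0, sizeof(struct page)) == NULL;
}

which is obviously too expensive to do on every reference.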

The heart of the problem, as I see it, comes from expand(). We nearly
always see the first reference to a struct page somewhere in the middle
of the 2MiB region, so the unlikely() check that was originally proposed
ends up referencing a different page entirely. Because of that, we did
not introduce an unlikely(); instead we refactored the patches to put
that check inside a static inline function. Also, given the strong
warning at the head of expand(), we did not feel experienced enough to
refactor it so that it always references the 2MiB-aligned page first.
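
In other words, any such check first has to hop back to the 2MiB-aligned
head page before it can decide anything. Roughly, the static inline boils
down to this (a sketch using the names from patch 4, pfn_valid() checks
omitted):

static inline void ensure_page_is_initialized(struct page *page)
{
        unsigned long pfn = page_to_pfn(page);
        struct page *head = pfn_to_page(pfn & ~(PTRS_PER_PMD - 1));

        if (PageUninitialized2Mib(head))
                expand_page_initialization(head);
}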

With this patch, we did boot a 16TiB machine. Without the patches,
the v3.10 kernel with the same configuration took 407 seconds for
free_all_bootmem. With the patches and operating on 2MiB pages instead
of 1GiB, it took 26 seconds so performance was improved. I have no feel
for how the 1GiB chunk size will perform.

I am on vacation for the next three days, so I am sorry in advance for
my infrequent or non-existent responses.


Signed-off-by: Robin Holt <[email protected]>
Signed-off-by: Nate Zimmer <[email protected]>
To: "H. Peter Anvin" <[email protected]>
To: Ingo Molnar <[email protected]>
Cc: Linux Kernel <[email protected]>
Cc: Linux MM <[email protected]>
Cc: Rob Landley <[email protected]>
Cc: Mike Travis <[email protected]>
Cc: Daniel J Blueman <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Greg KH <[email protected]>
Cc: Yinghai Lu <[email protected]>
Cc: Mel Gorman <[email protected]>


2013-07-12 02:04:08

by Robin Holt

[permalink] [raw]
Subject: [RFC 1/4] memblock: Introduce a for_each_reserved_mem_region iterator.

As part of initializing struct pages in 2MiB chunks, we noticed that
at the end of free_all_bootmem() there was nothing that had forced
the reserved/allocated 4KiB pages to be initialized.

This helper function will be used for that expansion.
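
The intended use (patch 4 does exactly this in free_low_memory_core_early())
is roughly:

        phys_addr_t start, end;
        u64 i;

        for_each_reserved_mem_region(i, &start, &end)
                __reserve_bootmem_region(start, end);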

Signed-off-by: Robin Holt <[email protected]>
Signed-off-by: Nate Zimmer <[email protected]>
To: "H. Peter Anvin" <[email protected]>
To: Ingo Molnar <[email protected]>
Cc: Linux Kernel <[email protected]>
Cc: Linux MM <[email protected]>
Cc: Rob Landley <[email protected]>
Cc: Mike Travis <[email protected]>
Cc: Daniel J Blueman <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Greg KH <[email protected]>
Cc: Yinghai Lu <[email protected]>
Cc: Mel Gorman <[email protected]>
---
include/linux/memblock.h | 18 ++++++++++++++++++
mm/memblock.c | 32 ++++++++++++++++++++++++++++++++
2 files changed, 50 insertions(+)

diff --git a/include/linux/memblock.h b/include/linux/memblock.h
index f388203..e99bbd1 100644
--- a/include/linux/memblock.h
+++ b/include/linux/memblock.h
@@ -118,6 +118,24 @@ void __next_free_mem_range_rev(u64 *idx, int nid, phys_addr_t *out_start,
i != (u64)ULLONG_MAX; \
__next_free_mem_range_rev(&i, nid, p_start, p_end, p_nid))

+void __next_reserved_mem_region(u64 *idx, phys_addr_t *out_start,
+ phys_addr_t *out_end);
+
+/**
+ * for_each_reserved_mem_region - iterate over all reserved memblock areas
+ * @i: u64 used as loop variable
+ * @p_start: ptr to phys_addr_t for start address of the range, can be %NULL
+ * @p_end: ptr to phys_addr_t for end address of the range, can be %NULL
+ *
+ * Walks over reserved areas of memblock. Available as soon as memblock
+ * is initialized.
+ */
+#define for_each_reserved_mem_region(i, p_start, p_end) \
+ for (i = 0UL, \
+ __next_reserved_mem_region(&i, p_start, p_end); \
+ i != (u64)ULLONG_MAX; \
+ __next_reserved_mem_region(&i, p_start, p_end))
+
#ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
int memblock_set_node(phys_addr_t base, phys_addr_t size, int nid);

diff --git a/mm/memblock.c b/mm/memblock.c
index c5fad93..0d7d6e7 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -564,6 +564,38 @@ int __init_memblock memblock_reserve(phys_addr_t base, phys_addr_t size)
}

/**
+ * __next_reserved_mem_region - next function for for_each_reserved_mem_region()
+ * @idx: pointer to u64 loop variable
+ * @out_start: ptr to phys_addr_t for start address of the region, can be %NULL
+ * @out_end: ptr to phys_addr_t for end address of the region, can be %NULL
+ *
+ * Iterate over all reserved memory regions.
+ */
+void __init_memblock __next_reserved_mem_region(u64 *idx,
+ phys_addr_t *out_start,
+ phys_addr_t *out_end)
+{
+ struct memblock_type *rsv = &memblock.reserved;
+
+ if (*idx >= 0 && *idx < rsv->cnt) {
+ struct memblock_region *r = &rsv->regions[*idx];
+ phys_addr_t base = r->base;
+ phys_addr_t size = r->size;
+
+ if (out_start)
+ *out_start = base;
+ if (out_end)
+ *out_end = base + size - 1;
+
+ *idx += 1;
+ return;
+ }
+
+ /* signal end of iteration */
+ *idx = ULLONG_MAX;
+}
+
+/**
* __next_free_mem_range - next function for for_each_free_mem_range()
* @idx: pointer to u64 loop variable
* @nid: nid: node selector, %MAX_NUMNODES for all nodes
--
1.8.2.1

2013-07-12 02:04:18

by Robin Holt

[permalink] [raw]
Subject: [RFC 4/4] Sparse initialization of struct page array.

During boot of large memory machines, a significant portion of boot
is spent initializing the struct page array. The vast majority of
those pages are not referenced during boot.

Change this over to only initializing the pages when they are
actually allocated.

Besides the advantage of boot speed, this allows us the chance to
use normal performance monitoring tools to determine where the bulk
of time is spent during page initialization.

Signed-off-by: Robin Holt <[email protected]>
Signed-off-by: Nate Zimmer <[email protected]>
To: "H. Peter Anvin" <[email protected]>
To: Ingo Molnar <[email protected]>
Cc: Linux Kernel <[email protected]>
Cc: Linux MM <[email protected]>
Cc: Rob Landley <[email protected]>
Cc: Mike Travis <[email protected]>
Cc: Daniel J Blueman <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Greg KH <[email protected]>
Cc: Yinghai Lu <[email protected]>
Cc: Mel Gorman <[email protected]>
---
include/linux/mm.h | 11 +++++
include/linux/page-flags.h | 5 +-
mm/nobootmem.c | 5 ++
mm/page_alloc.c | 117 +++++++++++++++++++++++++++++++++++++++++++--
4 files changed, 132 insertions(+), 6 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index e0c8528..3de08b5 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1330,8 +1330,19 @@ static inline void __free_reserved_page(struct page *page)
__free_page(page);
}

+extern void __reserve_bootmem_region(phys_addr_t start, phys_addr_t end);
+
+static inline void __reserve_bootmem_page(struct page *page)
+{
+ phys_addr_t start = page_to_pfn(page) << PAGE_SHIFT;
+ phys_addr_t end = start + PAGE_SIZE;
+
+ __reserve_bootmem_region(start, end);
+}
+
static inline void free_reserved_page(struct page *page)
{
+ __reserve_bootmem_page(page);
__free_reserved_page(page);
adjust_managed_page_count(page, 1);
}
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 6d53675..79e8eb7 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -83,6 +83,7 @@ enum pageflags {
PG_owner_priv_1, /* Owner use. If pagecache, fs may use*/
PG_arch_1,
PG_reserved,
+ PG_uninitialized2mib, /* Is this the right spot? ntz - Yes - rmh */
PG_private, /* If pagecache, has fs-private data */
PG_private_2, /* If pagecache, has fs aux data */
PG_writeback, /* Page is under writeback */
@@ -211,6 +212,8 @@ PAGEFLAG(SwapBacked, swapbacked) __CLEARPAGEFLAG(SwapBacked, swapbacked)

__PAGEFLAG(SlobFree, slob_free)

+PAGEFLAG(Uninitialized2Mib, uninitialized2mib)
+
/*
* Private page markings that may be used by the filesystem that owns the page
* for its own purposes.
@@ -499,7 +502,7 @@ static inline void ClearPageSlabPfmemalloc(struct page *page)
#define PAGE_FLAGS_CHECK_AT_FREE \
(1 << PG_lru | 1 << PG_locked | \
1 << PG_private | 1 << PG_private_2 | \
- 1 << PG_writeback | 1 << PG_reserved | \
+ 1 << PG_writeback | 1 << PG_reserved | 1 << PG_uninitialized2mib | \
1 << PG_slab | 1 << PG_swapcache | 1 << PG_active | \
1 << PG_unevictable | __PG_MLOCKED | __PG_HWPOISON | \
__PG_COMPOUND_LOCK)
diff --git a/mm/nobootmem.c b/mm/nobootmem.c
index 3b512ca..e3a386d 100644
--- a/mm/nobootmem.c
+++ b/mm/nobootmem.c
@@ -126,6 +126,9 @@ static unsigned long __init free_low_memory_core_early(void)
phys_addr_t start, end, size;
u64 i;

+ for_each_reserved_mem_region(i, &start, &end)
+ __reserve_bootmem_region(start, end);
+
for_each_free_mem_range(i, MAX_NUMNODES, &start, &end, NULL)
count += __free_memory_core(start, end);

@@ -162,6 +165,8 @@ unsigned long __init free_all_bootmem(void)
{
struct pglist_data *pgdat;

+ memblock_dump_all();
+
for_each_online_pgdat(pgdat)
reset_node_lowmem_managed_pages(pgdat);

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 635b131..fe51eb9 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -740,6 +740,54 @@ static void __init_single_page(struct page *page, unsigned long zone, int nid, i
#endif
}

+static void expand_page_initialization(struct page *basepage)
+{
+ unsigned long pfn = page_to_pfn(basepage);
+ unsigned long end_pfn = pfn + PTRS_PER_PMD;
+ unsigned long zone = page_zonenum(basepage);
+ int reserved = PageReserved(basepage);
+ int nid = page_to_nid(basepage);
+
+ ClearPageUninitialized2Mib(basepage);
+
+ for( pfn++; pfn < end_pfn; pfn++ )
+ __init_single_page(pfn_to_page(pfn), zone, nid, reserved);
+}
+
+void ensure_pages_are_initialized(unsigned long start_pfn,
+ unsigned long end_pfn)
+{
+ unsigned long aligned_start_pfn = start_pfn & ~(PTRS_PER_PMD - 1);
+ unsigned long aligned_end_pfn;
+ struct page *page;
+
+ aligned_end_pfn = end_pfn & ~(PTRS_PER_PMD - 1);
+ aligned_end_pfn += PTRS_PER_PMD;
+ while (aligned_start_pfn < aligned_end_pfn) {
+ if (pfn_valid(aligned_start_pfn)) {
+ page = pfn_to_page(aligned_start_pfn);
+
+ if(PageUninitialized2Mib(page))
+ expand_page_initialization(page);
+ }
+
+ aligned_start_pfn += PTRS_PER_PMD;
+ }
+}
+
+void __reserve_bootmem_region(phys_addr_t start, phys_addr_t end)
+{
+ unsigned long start_pfn = PFN_DOWN(start);
+ unsigned long end_pfn = PFN_UP(end);
+
+ ensure_pages_are_initialized(start_pfn, end_pfn);
+}
+
+static inline void ensure_page_is_initialized(struct page *page)
+{
+ __reserve_bootmem_page(page);
+}
+
static bool free_pages_prepare(struct page *page, unsigned int order)
{
int i;
@@ -751,7 +799,10 @@ static bool free_pages_prepare(struct page *page, unsigned int order)
if (PageAnon(page))
page->mapping = NULL;
for (i = 0; i < (1 << order); i++)
- bad += free_pages_check(page + i);
+ if (PageUninitialized2Mib(page + i))
+ i += PTRS_PER_PMD - 1;
+ else
+ bad += free_pages_check(page + i);
if (bad)
return false;

@@ -795,13 +846,22 @@ void __meminit __free_pages_bootmem(struct page *page, unsigned int order)
unsigned int loop;

prefetchw(page);
- for (loop = 0; loop < nr_pages; loop++) {
+ for (loop = 0; loop < nr_pages; ) {
struct page *p = &page[loop];

if (loop + 1 < nr_pages)
prefetchw(p + 1);
+
+ if ((PageUninitialized2Mib(p)) &&
+ ((loop + PTRS_PER_PMD) > nr_pages))
+ ensure_page_is_initialized(p);
+
__ClearPageReserved(p);
set_page_count(p, 0);
+ if (PageUninitialized2Mib(p))
+ loop += PTRS_PER_PMD;
+ else
+ loop += 1;
}

page_zone(page)->managed_pages += 1 << order;
@@ -856,6 +916,7 @@ static inline void expand(struct zone *zone, struct page *page,
area--;
high--;
size >>= 1;
+ ensure_page_is_initialized(page);
VM_BUG_ON(bad_range(zone, &page[size]));

#ifdef CONFIG_DEBUG_PAGEALLOC
@@ -903,6 +964,10 @@ static int prep_new_page(struct page *page, int order, gfp_t gfp_flags)

for (i = 0; i < (1 << order); i++) {
struct page *p = page + i;
+
+ if (PageUninitialized2Mib(p))
+ expand_page_initialization(page);
+
if (unlikely(check_new_page(p)))
return 1;
}
@@ -985,6 +1050,7 @@ int move_freepages(struct zone *zone,
unsigned long order;
int pages_moved = 0;

+ ensure_pages_are_initialized(page_to_pfn(start_page), page_to_pfn(end_page));
#ifndef CONFIG_HOLES_IN_ZONE
/*
* page_zone is not safe to call in this context when
@@ -3859,6 +3925,9 @@ static int pageblock_is_reserved(unsigned long start_pfn, unsigned long end_pfn)
for (pfn = start_pfn; pfn < end_pfn; pfn++) {
if (!pfn_valid_within(pfn) || PageReserved(pfn_to_page(pfn)))
return 1;
+
+ if (PageUninitialized2Mib(pfn_to_page(pfn)))
+ pfn += PTRS_PER_PMD;
}
return 0;
}
@@ -3947,6 +4016,29 @@ static void setup_zone_migrate_reserve(struct zone *zone)
}
}

+int __meminit pfn_range_init_avail(unsigned long pfn, unsigned long end_pfn,
+ unsigned long size, int nid)
+{
+ unsigned long validate_end_pfn = pfn + size;
+
+ if (pfn & (size - 1))
+ return 1;
+
+ if (pfn + size >= end_pfn)
+ return 1;
+
+ while (pfn < validate_end_pfn)
+ {
+ if (!early_pfn_valid(pfn))
+ return 1;
+ if (!early_pfn_in_nid(pfn, nid))
+ return 1;
+ pfn++;
+ }
+
+ return size;
+}
+
/*
* Initially all pages are reserved - free ones are freed
* up by free_all_bootmem() once the early boot process is
@@ -3964,20 +4056,34 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
highest_memmap_pfn = end_pfn - 1;

z = &NODE_DATA(nid)->node_zones[zone];
- for (pfn = start_pfn; pfn < end_pfn; pfn++) {
+ for (pfn = start_pfn; pfn < end_pfn; ) {
/*
* There can be holes in boot-time mem_map[]s
* handed to this function. They do not
* exist on hotplugged memory.
*/
+ int pfns = 1;
if (context == MEMMAP_EARLY) {
- if (!early_pfn_valid(pfn))
+ if (!early_pfn_valid(pfn)) {
+ pfn++;
continue;
- if (!early_pfn_in_nid(pfn, nid))
+ }
+ if (!early_pfn_in_nid(pfn, nid)) {
+ pfn++;
continue;
+ }
+
+ pfns = pfn_range_init_avail(pfn, end_pfn,
+ PTRS_PER_PMD, nid);
}
+
page = pfn_to_page(pfn);
__init_single_page(page, zone, nid, 1);
+
+ if (pfns > 1)
+ SetPageUninitialized2Mib(page);
+
+ pfn += pfns;
}
}

@@ -6196,6 +6302,7 @@ static const struct trace_print_flags pageflag_names[] = {
{1UL << PG_owner_priv_1, "owner_priv_1" },
{1UL << PG_arch_1, "arch_1" },
{1UL << PG_reserved, "reserved" },
+ {1UL << PG_uninitialized2mib, "Uninit_2MiB" },
{1UL << PG_private, "private" },
{1UL << PG_private_2, "private_2" },
{1UL << PG_writeback, "writeback" },
--
1.8.2.1

2013-07-12 02:04:15

by Robin Holt

[permalink] [raw]
Subject: [RFC 3/4] Separate page initialization into a separate function.

Currently, memmap_init_zone() has all the smarts for initializing a
single page. When we convert to initializing pages in a 2MiB chunk,
we will need to do this equivalent work from two separate places
so we are breaking out a helper function.
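
The second caller is the 2MiB expansion path added in patch 4, which ends
up doing roughly this for the tail pages of a chunk (sketch):

        for (pfn++; pfn < end_pfn; pfn++)
                __init_single_page(pfn_to_page(pfn), zone, nid, reserved);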

Signed-off-by: Robin Holt <[email protected]>
Signed-off-by: Nate Zimmer <[email protected]>
To: "H. Peter Anvin" <[email protected]>
To: Ingo Molnar <[email protected]>
Cc: Linux Kernel <[email protected]>
Cc: Linux MM <[email protected]>
Cc: Rob Landley <[email protected]>
Cc: Mike Travis <[email protected]>
Cc: Daniel J Blueman <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Greg KH <[email protected]>
Cc: Yinghai Lu <[email protected]>
Cc: Mel Gorman <[email protected]>
---
mm/mm_init.c | 2 +-
mm/page_alloc.c | 75 +++++++++++++++++++++++++++++++++------------------------
2 files changed, 45 insertions(+), 32 deletions(-)

diff --git a/mm/mm_init.c b/mm/mm_init.c
index c280a02..be8a539 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -128,7 +128,7 @@ void __init mminit_verify_pageflags_layout(void)
BUG_ON(or_mask != add_mask);
}

-void __meminit mminit_verify_page_links(struct page *page, enum zone_type zone,
+void mminit_verify_page_links(struct page *page, enum zone_type zone,
unsigned long nid, unsigned long pfn)
{
BUG_ON(page_to_nid(page) != nid);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index c3edb62..635b131 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -697,6 +697,49 @@ static void free_one_page(struct zone *zone, struct page *page, int order,
spin_unlock(&zone->lock);
}

+static void __init_single_page(struct page *page, unsigned long zone, int nid, int reserved)
+{
+ unsigned long pfn = page_to_pfn(page);
+ struct zone *z = &NODE_DATA(nid)->node_zones[zone];
+
+ set_page_links(page, zone, nid, pfn);
+ mminit_verify_page_links(page, zone, nid, pfn);
+ init_page_count(page);
+ page_mapcount_reset(page);
+ page_nid_reset_last(page);
+ if (reserved) {
+ SetPageReserved(page);
+ } else {
+ ClearPageReserved(page);
+ set_page_count(page, 0);
+ }
+ /*
+ * Mark the block movable so that blocks are reserved for
+ * movable at startup. This will force kernel allocations
+ * to reserve their blocks rather than leaking throughout
+ * the address space during boot when many long-lived
+ * kernel allocations are made. Later some blocks near
+ * the start are marked MIGRATE_RESERVE by
+ * setup_zone_migrate_reserve()
+ *
+ * bitmap is created for zone's valid pfn range. but memmap
+ * can be created for invalid pages (for alignment)
+ * check here not to call set_pageblock_migratetype() against
+ * pfn out of zone.
+ */
+ if ((z->zone_start_pfn <= pfn)
+ && (pfn < zone_end_pfn(z))
+ && !(pfn & (pageblock_nr_pages - 1)))
+ set_pageblock_migratetype(page, MIGRATE_MOVABLE);
+
+ INIT_LIST_HEAD(&page->lru);
+#ifdef WANT_PAGE_VIRTUAL
+ /* The shift won't overflow because ZONE_NORMAL is below 4G. */
+ if (!is_highmem_idx(zone))
+ set_page_address(page, __va(pfn << PAGE_SHIFT));
+#endif
+}
+
static bool free_pages_prepare(struct page *page, unsigned int order)
{
int i;
@@ -3934,37 +3977,7 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
continue;
}
page = pfn_to_page(pfn);
- set_page_links(page, zone, nid, pfn);
- mminit_verify_page_links(page, zone, nid, pfn);
- init_page_count(page);
- page_mapcount_reset(page);
- page_nid_reset_last(page);
- SetPageReserved(page);
- /*
- * Mark the block movable so that blocks are reserved for
- * movable at startup. This will force kernel allocations
- * to reserve their blocks rather than leaking throughout
- * the address space during boot when many long-lived
- * kernel allocations are made. Later some blocks near
- * the start are marked MIGRATE_RESERVE by
- * setup_zone_migrate_reserve()
- *
- * bitmap is created for zone's valid pfn range. but memmap
- * can be created for invalid pages (for alignment)
- * check here not to call set_pageblock_migratetype() against
- * pfn out of zone.
- */
- if ((z->zone_start_pfn <= pfn)
- && (pfn < zone_end_pfn(z))
- && !(pfn & (pageblock_nr_pages - 1)))
- set_pageblock_migratetype(page, MIGRATE_MOVABLE);
-
- INIT_LIST_HEAD(&page->lru);
-#ifdef WANT_PAGE_VIRTUAL
- /* The shift won't overflow because ZONE_NORMAL is below 4G. */
- if (!is_highmem_idx(zone))
- set_page_address(page, __va(pfn << PAGE_SHIFT));
-#endif
+ __init_single_page(page, zone, nid, 1);
}
}

--
1.8.2.1

2013-07-12 02:04:12

by Robin Holt

[permalink] [raw]
Subject: [RFC 2/4] Have __free_pages_memory() free in larger chunks.

Currently, when free_all_bootmem() calls __free_pages_memory(), the
number of contiguous pages that __free_pages_memory() passes to the
buddy allocator is limited to BITS_PER_LONG. In order to be able to
free only the first page of a 2MiB chunk, we need that to be increased
to PTRS_PER_PMD.
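
On x86_64 that works out to (sketch of the arithmetic only):

        /* x86_64: BITS_PER_LONG = 64, PTRS_PER_PMD = 512 */
        int order = ilog2(max(BITS_PER_LONG, PTRS_PER_PMD));   /* = 9 */
        /* 1 << order = 512 pages = 2MiB per __free_pages_bootmem() call */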

Signed-off-by: Robin Holt <[email protected]>
Signed-off-by: Nate Zimmer <[email protected]>
To: "H. Peter Anvin" <[email protected]>
To: Ingo Molnar <[email protected]>
Cc: Linux Kernel <[email protected]>
Cc: Linux MM <[email protected]>
Cc: Rob Landley <[email protected]>
Cc: Mike Travis <[email protected]>
Cc: Daniel J Blueman <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Greg KH <[email protected]>
Cc: Yinghai Lu <[email protected]>
Cc: Mel Gorman <[email protected]>
---
mm/nobootmem.c | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/mm/nobootmem.c b/mm/nobootmem.c
index bdd3fa2..3b512ca 100644
--- a/mm/nobootmem.c
+++ b/mm/nobootmem.c
@@ -83,10 +83,10 @@ void __init free_bootmem_late(unsigned long addr, unsigned long size)
static void __init __free_pages_memory(unsigned long start, unsigned long end)
{
unsigned long i, start_aligned, end_aligned;
- int order = ilog2(BITS_PER_LONG);
+ int order = ilog2(max(BITS_PER_LONG, PTRS_PER_PMD));

- start_aligned = (start + (BITS_PER_LONG - 1)) & ~(BITS_PER_LONG - 1);
- end_aligned = end & ~(BITS_PER_LONG - 1);
+ start_aligned = (start + ((1UL << order) - 1)) & ~((1UL << order) - 1);
+ end_aligned = end & ~((1UL << order) - 1);

if (end_aligned <= start_aligned) {
for (i = start; i < end; i++)
@@ -98,7 +98,7 @@ static void __init __free_pages_memory(unsigned long start, unsigned long end)
for (i = start; i < start_aligned; i++)
__free_pages_bootmem(pfn_to_page(i), 0);

- for (i = start_aligned; i < end_aligned; i += BITS_PER_LONG)
+ for (i = start_aligned; i < end_aligned; i += 1 << order)
__free_pages_bootmem(pfn_to_page(i), order);

for (i = end_aligned; i < end; i++)
--
1.8.2.1

2013-07-12 07:46:02

by Robin Holt

[permalink] [raw]
Subject: Re: [RFC 2/4] Have __free_pages_memory() free in larger chunks.

After sleeping on this, why can't we change __free_pages_bootmem
to not take an order, but rather nr_pages? If we did that,
then __free_pages_memory could just calculate nr_pages and call
__free_pages_bootmem one time.

I don't see why any of the callers of __free_pages_bootmem would not
easily support that change. Maybe I will work that up as part of a -v2
and see if it boots/runs.

At the very least, I think we could change to:
static void __init __free_pages_memory(unsigned long start, unsigned long end)
{
        int order;

        while (start < end) {
                order = ffs(start);

                while (start + (1UL << order) > end)
                        order--;

                __free_pages_bootmem(start, order);

                start += (1UL << order);
        }
}


Robin

On Thu, Jul 11, 2013 at 09:03:53PM -0500, Robin Holt wrote:
> Currently, when free_all_bootmem() calls __free_pages_memory(), the
> number of contiguous pages that __free_pages_memory() passes to the
> buddy allocator is limited to BITS_PER_LONG. In order to be able to
> free only the first page of a 2MiB chunk, we need that to be increased
> to PTRS_PER_PMD.
>
> Signed-off-by: Robin Holt <[email protected]>
> Signed-off-by: Nate Zimmer <[email protected]>
> To: "H. Peter Anvin" <[email protected]>
> To: Ingo Molnar <[email protected]>
> Cc: Linux Kernel <[email protected]>
> Cc: Linux MM <[email protected]>
> Cc: Rob Landley <[email protected]>
> Cc: Mike Travis <[email protected]>
> Cc: Daniel J Blueman <[email protected]>
> Cc: Andrew Morton <[email protected]>
> Cc: Greg KH <[email protected]>
> Cc: Yinghai Lu <[email protected]>
> Cc: Mel Gorman <[email protected]>
> ---
> mm/nobootmem.c | 8 ++++----
> 1 file changed, 4 insertions(+), 4 deletions(-)
>
> diff --git a/mm/nobootmem.c b/mm/nobootmem.c
> index bdd3fa2..3b512ca 100644
> --- a/mm/nobootmem.c
> +++ b/mm/nobootmem.c
> @@ -83,10 +83,10 @@ void __init free_bootmem_late(unsigned long addr, unsigned long size)
> static void __init __free_pages_memory(unsigned long start, unsigned long end)
> {
> unsigned long i, start_aligned, end_aligned;
> - int order = ilog2(BITS_PER_LONG);
> + int order = ilog2(max(BITS_PER_LONG, PTRS_PER_PMD));
>
> - start_aligned = (start + (BITS_PER_LONG - 1)) & ~(BITS_PER_LONG - 1);
> - end_aligned = end & ~(BITS_PER_LONG - 1);
> + start_aligned = (start + ((1UL << order) - 1)) & ~((1UL << order) - 1);
> + end_aligned = end & ~((1UL << order) - 1);
>
> if (end_aligned <= start_aligned) {
> for (i = start; i < end; i++)
> @@ -98,7 +98,7 @@ static void __init __free_pages_memory(unsigned long start, unsigned long end)
> for (i = start; i < start_aligned; i++)
> __free_pages_bootmem(pfn_to_page(i), 0);
>
> - for (i = start_aligned; i < end_aligned; i += BITS_PER_LONG)
> + for (i = start_aligned; i < end_aligned; i += 1 << order)
> __free_pages_bootmem(pfn_to_page(i), order);
>
> for (i = end_aligned; i < end; i++)
> --
> 1.8.2.1

2013-07-12 08:28:07

by Ingo Molnar

[permalink] [raw]
Subject: Re: [RFC 0/4] Transparent on-demand struct page initialization embedded in the buddy allocator


* Robin Holt <[email protected]> wrote:

> [...]
>
> With this patch, we did boot a 16TiB machine. Without the patches, the
> v3.10 kernel with the same configuration took 407 seconds for
> free_all_bootmem. With the patches and operating on 2MiB pages instead
> of 1GiB, it took 26 seconds so performance was improved. I have no feel
> for how the 1GiB chunk size will perform.

That's pretty impressive.

It's still a 15x speedup instead of a 512x speedup, so I'd say there's
something else being the current bottleneck, besides page init
granularity.

Can you boot with just a few gigs of RAM and stuff the rest into hotplug
memory, and then hot-add that memory? That would allow easy profiling of
remaining overhead.

Side note:

Robert Richter and Boris Petkov are working on 'persistent events' support
for perf, which will eventually allow boot time profiling - I'm not sure
if the patches and the tooling support is ready enough yet for your
purposes.

Robert, Boris, the following workflow would be pretty intuitive:

- kernel developer sets boot flag: perf=boot,freq=1khz,size=16MB

- we'd get a single (cycles?) event running once the perf subsystem is up
and running, with a sampling frequency of 1 KHz, sending profiling
trace events to a sufficiently sized profiling buffer of 16 MB per
CPU.

- once the system reaches SYSTEM_RUNNING, profiling is stopped either
automatically - or the user stops it via a new tooling command.

- the profiling buffer is extracted into a regular perf.data via a
special 'perf record' call or some other, new perf tooling
solution/variant.

[ Alternatively the kernel could attempt to construct a 'virtual'
perf.data from the persistent buffer, available via /sys/debug or
elsewhere in /sys - just like the kernel constructs a 'virtual'
/proc/kcore, etc. That file could be copied or used directly. ]

- from that point on this workflow joins the regular profiling workflow:
perf report, perf script et al can be used to analyze the resulting
boot profile.

Thanks,

Ingo

2013-07-12 08:47:36

by Borislav Petkov

[permalink] [raw]
Subject: boot tracing

On Fri, Jul 12, 2013 at 10:27:56AM +0200, Ingo Molnar wrote:
> Robert Richter and Boris Petkov are working on 'persistent events'
> support for perf, which will eventually allow boot time profiling -
> I'm not sure if the patches and the tooling support is ready enough
> yet for your purposes.

Nope, not yet but we're getting there.

> Robert, Boris, the following workflow would be pretty intuitive:
>
> - kernel developer sets boot flag: perf=boot,freq=1khz,size=16MB

What does perf=boot mean? I assume boot tracing.

If so, does it mean we want to enable *all* tracepoints and collect
whatever hits us?

What makes more sense to me is to hijack what the function tracer does -
i.e. simply collect all function calls.
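
Something along the lines of the existing boot-time knobs, e.g. - just a
sketch of the kind of command line I mean:

        ftrace=function ftrace_filter=free_all_bootmem,memmap_init_zone

though that traces function entries rather than taking PMU samples.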

> - we'd get a single (cycles?) event running once the perf subsystem is up
> and running, with a sampling frequency of 1 KHz, sending profiling
> trace events to a sufficiently sized profiling buffer of 16 MB per
> CPU.

Right, what would those trace events be?

> - once the system reaches SYSTEM_RUNNING, profiling is stopped either
> automatically - or the user stops it via a new tooling command.

Ok.

> - the profiling buffer is extracted into a regular perf.data via a
> special 'perf record' call or some other, new perf tooling
> solution/variant.
>
> [ Alternatively the kernel could attempt to construct a 'virtual'
> perf.data from the persistent buffer, available via /sys/debug or
> elsewhere in /sys - just like the kernel constructs a 'virtual'
> /proc/kcore, etc. That file could be copied or used directly. ]

Yeah, that.

> - from that point on this workflow joins the regular profiling workflow:
> perf report, perf script et al can be used to analyze the resulting
> boot profile.

Agreed.

--
Regards/Gruss,
Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

2013-07-12 08:53:48

by Ingo Molnar

[permalink] [raw]
Subject: Re: boot tracing


* Borislav Petkov <[email protected]> wrote:

> On Fri, Jul 12, 2013 at 10:27:56AM +0200, Ingo Molnar wrote:
> > Robert Richter and Boris Petkov are working on 'persistent events'
> > support for perf, which will eventually allow boot time profiling -
> > I'm not sure if the patches and the tooling support is ready enough
> > yet for your purposes.
>
> Nope, not yet but we're getting there.
>
> > Robert, Boris, the following workflow would be pretty intuitive:
> >
> > - kernel developer sets boot flag: perf=boot,freq=1khz,size=16MB
>
> What does perf=boot mean? I assume boot tracing.

In this case it would mean boot profiling - i.e. a cycles hardware-PMU
event collecting into a perf trace buffer as usual.

Essentially a 'perf record -a' work-alike, just one that gets activated as
early as practical, and which would allow the profiling of memory
initialization.

Now, one extra complication here is that to be able to profile buddy
allocator this persistent event would have to work before the buddy
allocator is active :-/ So this sort of profiling would have to use
memblock_alloc().
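
Very roughly, the per-CPU buffer would have to come from something like the
following - a sketch only, using the memblock_alloc() that hands back a
physical address:

        phys_addr_t phys = memblock_alloc(16 << 20, PAGE_SIZE);
        void *buf = __va(phys);

allocated per CPU before the buddy allocator takes over.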

Just wanted to highlight this usecase, we might eventually want to support
it.

[ Note that this is different from boot tracing of one or more trace
events - but it's a conceptually pretty close cousin. ]

Thanks,

Ingo

2013-07-12 09:19:20

by Robert Richter

[permalink] [raw]
Subject: Re: [RFC 0/4] Transparent on-demand struct page initialization embedded in the buddy allocator

On 12.07.13 10:27:56, Ingo Molnar wrote:
>
> * Robin Holt <[email protected]> wrote:
>
> > [...]
> >
> > With this patch, we did boot a 16TiB machine. Without the patches, the
> > v3.10 kernel with the same configuration took 407 seconds for
> > free_all_bootmem. With the patches and operating on 2MiB pages instead
> > of 1GiB, it took 26 seconds so performance was improved. I have no feel
> > for how the 1GiB chunk size will perform.
>
> That's pretty impressive.
>
> It's still a 15x speedup instead of a 512x speedup, so I'd say there's
> something else being the current bottleneck, besides page init
> granularity.
>
> Can you boot with just a few gigs of RAM and stuff the rest into hotplug
> memory, and then hot-add that memory? That would allow easy profiling of
> remaining overhead.
>
> Side note:
>
> Robert Richter and Boris Petkov are working on 'persistent events' support
> for perf, which will eventually allow boot time profiling - I'm not sure
> if the patches and the tooling support is ready enough yet for your
> purposes.

The latest patch set is still this:

git://git.kernel.org/pub/scm/linux/kernel/git/rric/oprofile.git persistent-v2

It requires the perf subsystem to be initialized first, which might be
too late; see perf_event_init() in start_kernel(). The patch set is
currently also limited to tracepoints only.

If this is sufficient for you, you might register persistent events
with the function perf_add_persistent_event_by_id(); see mcheck_init_tp()
for how to do this. Later you can fetch all samples with:

# perf record -e persistent/<tracepoint>/ sleep 1

> Robert, Boris, the following workflow would be pretty intuitive:
>
> - kernel developer sets boot flag: perf=boot,freq=1khz,size=16MB
>
> - we'd get a single (cycles?) event running once the perf subsystem is up
> and running, with a sampling frequency of 1 KHz, sending profiling
> trace events to a sufficiently sized profiling buffer of 16 MB per
> CPU.

I am not sure about the event you want to set up here; if it is a
tracepoint, the sample_period should always be 1. The buffer size
parameter looks interesting; for now it is 512kB per cpu by default
(as the perf tools set up the buffer).

>
> - once the system reaches SYSTEM_RUNNING, profiling is stopped either
> automatically - or the user stops it via a new tooling command.
>
> - the profiling buffer is extracted into a regular perf.data via a
> special 'perf record' call or some other, new perf tooling
> solution/variant.

See the perf-record command above...

>
> [ Alternatively the kernel could attempt to construct a 'virtual'
> perf.data from the persistent buffer, available via /sys/debug or
> elsewhere in /sys - just like the kernel constructs a 'virtual'
> /proc/kcore, etc. That file could be copied or used directly. ]
>
> - from that point on this workflow joins the regular profiling workflow:
> perf report, perf script et al can be used to analyze the resulting
> boot profile.

Ingo, thanks for outlining this workflow. We will look at how this could
fit into the new version of persistent events we are currently working on.

Thanks,

-Robert

2013-07-13 03:06:55

by Yinghai Lu

[permalink] [raw]
Subject: Re: [RFC 3/4] Separate page initialization into a separate function.

On Thu, Jul 11, 2013 at 7:03 PM, Robin Holt <[email protected]> wrote:
> Currently, memmap_init_zone() has all the smarts for initializing a
> single page. When we convert to initializing pages in a 2MiB chunk,
> we will need to do this equivalent work from two separate places
> so we are breaking out a helper function.
>
> Signed-off-by: Robin Holt <[email protected]>
> Signed-off-by: Nate Zimmer <[email protected]>
> To: "H. Peter Anvin" <[email protected]>
> To: Ingo Molnar <[email protected]>
> Cc: Linux Kernel <[email protected]>
> Cc: Linux MM <[email protected]>
> Cc: Rob Landley <[email protected]>
> Cc: Mike Travis <[email protected]>
> Cc: Daniel J Blueman <[email protected]>
> Cc: Andrew Morton <[email protected]>
> Cc: Greg KH <[email protected]>
> Cc: Yinghai Lu <[email protected]>
> Cc: Mel Gorman <[email protected]>
> ---
> mm/mm_init.c | 2 +-
> mm/page_alloc.c | 75 +++++++++++++++++++++++++++++++++------------------------
> 2 files changed, 45 insertions(+), 32 deletions(-)
>
> diff --git a/mm/mm_init.c b/mm/mm_init.c
> index c280a02..be8a539 100644
> --- a/mm/mm_init.c
> +++ b/mm/mm_init.c
> @@ -128,7 +128,7 @@ void __init mminit_verify_pageflags_layout(void)
> BUG_ON(or_mask != add_mask);
> }
>
> -void __meminit mminit_verify_page_links(struct page *page, enum zone_type zone,
> +void mminit_verify_page_links(struct page *page, enum zone_type zone,
> unsigned long nid, unsigned long pfn)
> {
> BUG_ON(page_to_nid(page) != nid);
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index c3edb62..635b131 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -697,6 +697,49 @@ static void free_one_page(struct zone *zone, struct page *page, int order,
> spin_unlock(&zone->lock);
> }
>
> +static void __init_single_page(struct page *page, unsigned long zone, int nid, int reserved)
> +{
> + unsigned long pfn = page_to_pfn(page);
> + struct zone *z = &NODE_DATA(nid)->node_zones[zone];
> +
> + set_page_links(page, zone, nid, pfn);
> + mminit_verify_page_links(page, zone, nid, pfn);
> + init_page_count(page);
> + page_mapcount_reset(page);
> + page_nid_reset_last(page);
> + if (reserved) {
> + SetPageReserved(page);
> + } else {
> + ClearPageReserved(page);
> + set_page_count(page, 0);
> + }
> + /*
> + * Mark the block movable so that blocks are reserved for
> + * movable at startup. This will force kernel allocations
> + * to reserve their blocks rather than leaking throughout
> + * the address space during boot when many long-lived
> + * kernel allocations are made. Later some blocks near
> + * the start are marked MIGRATE_RESERVE by
> + * setup_zone_migrate_reserve()
> + *
> + * bitmap is created for zone's valid pfn range. but memmap
> + * can be created for invalid pages (for alignment)
> + * check here not to call set_pageblock_migratetype() against
> + * pfn out of zone.
> + */
> + if ((z->zone_start_pfn <= pfn)
> + && (pfn < zone_end_pfn(z))
> + && !(pfn & (pageblock_nr_pages - 1)))
> + set_pageblock_migratetype(page, MIGRATE_MOVABLE);
> +
> + INIT_LIST_HEAD(&page->lru);
> +#ifdef WANT_PAGE_VIRTUAL
> + /* The shift won't overflow because ZONE_NORMAL is below 4G. */
> + if (!is_highmem_idx(zone))
> + set_page_address(page, __va(pfn << PAGE_SHIFT));
> +#endif
> +}
> +
> static bool free_pages_prepare(struct page *page, unsigned int order)
> {
> int i;
> @@ -3934,37 +3977,7 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
> continue;
> }
> page = pfn_to_page(pfn);
> - set_page_links(page, zone, nid, pfn);
> - mminit_verify_page_links(page, zone, nid, pfn);
> - init_page_count(page);
> - page_mapcount_reset(page);
> - page_nid_reset_last(page);
> - SetPageReserved(page);
> - /*
> - * Mark the block movable so that blocks are reserved for
> - * movable at startup. This will force kernel allocations
> - * to reserve their blocks rather than leaking throughout
> - * the address space during boot when many long-lived
> - * kernel allocations are made. Later some blocks near
> - * the start are marked MIGRATE_RESERVE by
> - * setup_zone_migrate_reserve()
> - *
> - * bitmap is created for zone's valid pfn range. but memmap
> - * can be created for invalid pages (for alignment)
> - * check here not to call set_pageblock_migratetype() against
> - * pfn out of zone.
> - */
> - if ((z->zone_start_pfn <= pfn)
> - && (pfn < zone_end_pfn(z))
> - && !(pfn & (pageblock_nr_pages - 1)))
> - set_pageblock_migratetype(page, MIGRATE_MOVABLE);
> -
> - INIT_LIST_HEAD(&page->lru);
> -#ifdef WANT_PAGE_VIRTUAL
> - /* The shift won't overflow because ZONE_NORMAL is below 4G. */
> - if (!is_highmem_idx(zone))
> - set_page_address(page, __va(pfn << PAGE_SHIFT));
> -#endif
> + __init_single_page(page, zone, nid, 1);

Can you
move page = pfn_to_page(pfn) into __init_single_page
and pass pfn directly?
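
i.e. something like (sketch only):

static void __init_single_page(unsigned long pfn, unsigned long zone,
                               int nid, int reserved)
{
        struct page *page = pfn_to_page(pfn);

        set_page_links(page, zone, nid, pfn);
        /* ... rest as before ... */
}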

Yinghai

2013-07-13 03:08:57

by Yinghai Lu

[permalink] [raw]
Subject: Re: [RFC 2/4] Have __free_pages_memory() free in larger chunks.

On Fri, Jul 12, 2013 at 12:45 AM, Robin Holt <[email protected]> wrote:

> At the very least, I think we could change to:
> static void __init __free_pages_memory(unsigned long start, unsigned long end)
> {
>         int order;
>
>         while (start < end) {
>                 order = ffs(start);
>
>                 while (start + (1UL << order) > end)
>                         order--;
>
>                 __free_pages_bootmem(start, order);
>
>                 start += (1UL << order);
>         }
> }

should work, but need to make sure order < MAX_ORDER.
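
Something like the following, maybe (untested sketch: clamp at MAX_ORDER - 1,
use __ffs() so the order is zero-based, and note __free_pages_bootmem()
takes a struct page *, not a pfn; pfn 0 would need special casing since
__ffs(0) is undefined):

static void __init __free_pages_memory(unsigned long start, unsigned long end)
{
        unsigned long order;

        while (start < end) {
                /* largest aligned chunk that fits, capped at MAX_ORDER - 1 */
                order = min(MAX_ORDER - 1UL, __ffs(start));

                while (start + (1UL << order) > end)
                        order--;

                __free_pages_bootmem(pfn_to_page(start), order);

                start += (1UL << order);
        }
}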

Yinghai

2013-07-13 04:19:15

by Yinghai Lu

[permalink] [raw]
Subject: Re: [RFC 4/4] Sparse initialization of struct page array.

On Thu, Jul 11, 2013 at 7:03 PM, Robin Holt <[email protected]> wrote:
> During boot of large memory machines, a significant portion of boot
> is spent initializing the struct page array. The vast majority of
> those pages are not referenced during boot.
>
> Change this over to only initializing the pages when they are
> actually allocated.
>
> Besides the advantage of boot speed, this allows us the chance to
> use normal performance monitoring tools to determine where the bulk
> of time is spent during page initialization.
>
> Signed-off-by: Robin Holt <[email protected]>
> Signed-off-by: Nate Zimmer <[email protected]>
> To: "H. Peter Anvin" <[email protected]>
> To: Ingo Molnar <[email protected]>
> Cc: Linux Kernel <[email protected]>
> Cc: Linux MM <[email protected]>
> Cc: Rob Landley <[email protected]>
> Cc: Mike Travis <[email protected]>
> Cc: Daniel J Blueman <[email protected]>
> Cc: Andrew Morton <[email protected]>
> Cc: Greg KH <[email protected]>
> Cc: Yinghai Lu <[email protected]>
> Cc: Mel Gorman <[email protected]>
> ---
> include/linux/mm.h | 11 +++++
> include/linux/page-flags.h | 5 +-
> mm/nobootmem.c | 5 ++
> mm/page_alloc.c | 117 +++++++++++++++++++++++++++++++++++++++++++--
> 4 files changed, 132 insertions(+), 6 deletions(-)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index e0c8528..3de08b5 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1330,8 +1330,19 @@ static inline void __free_reserved_page(struct page *page)
> __free_page(page);
> }
>
> +extern void __reserve_bootmem_region(phys_addr_t start, phys_addr_t end);
> +
> +static inline void __reserve_bootmem_page(struct page *page)
> +{
> + phys_addr_t start = page_to_pfn(page) << PAGE_SHIFT;
> + phys_addr_t end = start + PAGE_SIZE;
> +
> + __reserve_bootmem_region(start, end);
> +}
> +
> static inline void free_reserved_page(struct page *page)
> {
> + __reserve_bootmem_page(page);
> __free_reserved_page(page);
> adjust_managed_page_count(page, 1);
> }
> diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
> index 6d53675..79e8eb7 100644
> --- a/include/linux/page-flags.h
> +++ b/include/linux/page-flags.h
> @@ -83,6 +83,7 @@ enum pageflags {
> PG_owner_priv_1, /* Owner use. If pagecache, fs may use*/
> PG_arch_1,
> PG_reserved,
> + PG_uninitialized2mib, /* Is this the right spot? ntz - Yes - rmh */
> PG_private, /* If pagecache, has fs-private data */
> PG_private_2, /* If pagecache, has fs aux data */
> PG_writeback, /* Page is under writeback */
> @@ -211,6 +212,8 @@ PAGEFLAG(SwapBacked, swapbacked) __CLEARPAGEFLAG(SwapBacked, swapbacked)
>
> __PAGEFLAG(SlobFree, slob_free)
>
> +PAGEFLAG(Uninitialized2Mib, uninitialized2mib)
> +
> /*
> * Private page markings that may be used by the filesystem that owns the page
> * for its own purposes.
> @@ -499,7 +502,7 @@ static inline void ClearPageSlabPfmemalloc(struct page *page)
> #define PAGE_FLAGS_CHECK_AT_FREE \
> (1 << PG_lru | 1 << PG_locked | \
> 1 << PG_private | 1 << PG_private_2 | \
> - 1 << PG_writeback | 1 << PG_reserved | \
> + 1 << PG_writeback | 1 << PG_reserved | 1 << PG_uninitialized2mib | \
> 1 << PG_slab | 1 << PG_swapcache | 1 << PG_active | \
> 1 << PG_unevictable | __PG_MLOCKED | __PG_HWPOISON | \
> __PG_COMPOUND_LOCK)
> diff --git a/mm/nobootmem.c b/mm/nobootmem.c
> index 3b512ca..e3a386d 100644
> --- a/mm/nobootmem.c
> +++ b/mm/nobootmem.c
> @@ -126,6 +126,9 @@ static unsigned long __init free_low_memory_core_early(void)
> phys_addr_t start, end, size;
> u64 i;
>
> + for_each_reserved_mem_region(i, &start, &end)
> + __reserve_bootmem_region(start, end);
> +

How about holes that are not in memblock.reserved?

before this patch:
free_area_init_node/free_area_init_core/memmap_init_zone
will mark all pages in the node range as Reserved in struct page at first.

but those holes are not mapped via the kernel low mapping,
so it should be ok not to touch "struct page" for them.

Now you only mark reserved for memblock.reserved at first, and later
mark {memblock.memory} - {memblock.reserved} to be available.
And that is ok.

but you should split that change into a separate patch and add some comments
and a changelog for the change,
in case there is some user like UEFI etc. doing weird things.

> for_each_free_mem_range(i, MAX_NUMNODES, &start, &end, NULL)
> count += __free_memory_core(start, end);
>
> @@ -162,6 +165,8 @@ unsigned long __init free_all_bootmem(void)
> {
> struct pglist_data *pgdat;
>
> + memblock_dump_all();
> +

Not needed.

> for_each_online_pgdat(pgdat)
> reset_node_lowmem_managed_pages(pgdat);
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 635b131..fe51eb9 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -740,6 +740,54 @@ static void __init_single_page(struct page *page, unsigned long zone, int nid, i
> #endif
> }
>
> +static void expand_page_initialization(struct page *basepage)
> +{
> + unsigned long pfn = page_to_pfn(basepage);
> + unsigned long end_pfn = pfn + PTRS_PER_PMD;
> + unsigned long zone = page_zonenum(basepage);
> + int reserved = PageReserved(basepage);
> + int nid = page_to_nid(basepage);
> +
> + ClearPageUninitialized2Mib(basepage);
> +
> + for( pfn++; pfn < end_pfn; pfn++ )
> + __init_single_page(pfn_to_page(pfn), zone, nid, reserved);
> +}
> +
> +void ensure_pages_are_initialized(unsigned long start_pfn,
> + unsigned long end_pfn)
> +{
> + unsigned long aligned_start_pfn = start_pfn & ~(PTRS_PER_PMD - 1);
> + unsigned long aligned_end_pfn;
> + struct page *page;
> +
> + aligned_end_pfn = end_pfn & ~(PTRS_PER_PMD - 1);
> + aligned_end_pfn += PTRS_PER_PMD;
> + while (aligned_start_pfn < aligned_end_pfn) {
> + if (pfn_valid(aligned_start_pfn)) {
> + page = pfn_to_page(aligned_start_pfn);
> +
> + if(PageUninitialized2Mib(page))
> + expand_page_initialization(page);
> + }
> +
> + aligned_start_pfn += PTRS_PER_PMD;
> + }
> +}
> +
> +void __reserve_bootmem_region(phys_addr_t start, phys_addr_t end)
> +{
> + unsigned long start_pfn = PFN_DOWN(start);
> + unsigned long end_pfn = PFN_UP(end);
> +
> + ensure_pages_are_initialized(start_pfn, end_pfn);
> +}

that name is confusing; actually it is only setting the struct pages to Reserved.
maybe __reserve_pages_bootmem(), to be aligned with free_pages_bootmem()?

> +
> +static inline void ensure_page_is_initialized(struct page *page)
> +{
> + __reserve_bootmem_page(page);
> +}

how about using __reserve_page_bootmem directly and adding a comment at the call sites?

> +
> static bool free_pages_prepare(struct page *page, unsigned int order)
> {
> int i;
> @@ -751,7 +799,10 @@ static bool free_pages_prepare(struct page *page, unsigned int order)
> if (PageAnon(page))
> page->mapping = NULL;
> for (i = 0; i < (1 << order); i++)
> - bad += free_pages_check(page + i);
> + if (PageUninitialized2Mib(page + i))
> + i += PTRS_PER_PMD - 1;
> + else
> + bad += free_pages_check(page + i);
> if (bad)
> return false;
>
> @@ -795,13 +846,22 @@ void __meminit __free_pages_bootmem(struct page *page, unsigned int order)
> unsigned int loop;
>
> prefetchw(page);
> - for (loop = 0; loop < nr_pages; loop++) {
> + for (loop = 0; loop < nr_pages; ) {
> struct page *p = &page[loop];
>
> if (loop + 1 < nr_pages)
> prefetchw(p + 1);
> +
> + if ((PageUninitialized2Mib(p)) &&
> + ((loop + PTRS_PER_PMD) > nr_pages))
> + ensure_page_is_initialized(p);
> +
> __ClearPageReserved(p);
> set_page_count(p, 0);
> + if (PageUninitialized2Mib(p))
> + loop += PTRS_PER_PMD;
> + else
> + loop += 1;
> }
>
> page_zone(page)->managed_pages += 1 << order;
> @@ -856,6 +916,7 @@ static inline void expand(struct zone *zone, struct page *page,
> area--;
> high--;
> size >>= 1;
> + ensure_page_is_initialized(page);
> VM_BUG_ON(bad_range(zone, &page[size]));
>
> #ifdef CONFIG_DEBUG_PAGEALLOC
> @@ -903,6 +964,10 @@ static int prep_new_page(struct page *page, int order, gfp_t gfp_flags)
>
> for (i = 0; i < (1 << order); i++) {
> struct page *p = page + i;
> +
> + if (PageUninitialized2Mib(p))
> + expand_page_initialization(page);
> +
> if (unlikely(check_new_page(p)))
> return 1;
> }
> @@ -985,6 +1050,7 @@ int move_freepages(struct zone *zone,
> unsigned long order;
> int pages_moved = 0;
>
> + ensure_pages_are_initialized(page_to_pfn(start_page), page_to_pfn(end_page));
> #ifndef CONFIG_HOLES_IN_ZONE
> /*
> * page_zone is not safe to call in this context when
> @@ -3859,6 +3925,9 @@ static int pageblock_is_reserved(unsigned long start_pfn, unsigned long end_pfn)
> for (pfn = start_pfn; pfn < end_pfn; pfn++) {
> if (!pfn_valid_within(pfn) || PageReserved(pfn_to_page(pfn)))
> return 1;
> +
> + if (PageUninitialized2Mib(pfn_to_page(pfn)))
> + pfn += PTRS_PER_PMD;
> }
> return 0;
> }
> @@ -3947,6 +4016,29 @@ static void setup_zone_migrate_reserve(struct zone *zone)
> }
> }
>
> +int __meminit pfn_range_init_avail(unsigned long pfn, unsigned long end_pfn,
> + unsigned long size, int nid)
why not use static? it seems there is no outside user.
> +{
> + unsigned long validate_end_pfn = pfn + size;
> +
> + if (pfn & (size - 1))
> + return 1;
> +
> + if (pfn + size >= end_pfn)
> + return 1;
> +
> + while (pfn < validate_end_pfn)
> + {
> + if (!early_pfn_valid(pfn))
> + return 1;
> + if (!early_pfn_in_nid(pfn, nid))
> + return 1;
> + pfn++;
> + }
> +
> + return size;
> +}
> +
> /*
> * Initially all pages are reserved - free ones are freed
> * up by free_all_bootmem() once the early boot process is
> @@ -3964,20 +4056,34 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
> highest_memmap_pfn = end_pfn - 1;
>
> z = &NODE_DATA(nid)->node_zones[zone];
> - for (pfn = start_pfn; pfn < end_pfn; pfn++) {
> + for (pfn = start_pfn; pfn < end_pfn; ) {
> /*
> * There can be holes in boot-time mem_map[]s
> * handed to this function. They do not
> * exist on hotplugged memory.
> */
> + int pfns = 1;
> if (context == MEMMAP_EARLY) {
> - if (!early_pfn_valid(pfn))
> + if (!early_pfn_valid(pfn)) {
> + pfn++;
> continue;
> - if (!early_pfn_in_nid(pfn, nid))
> + }
> + if (!early_pfn_in_nid(pfn, nid)) {
> + pfn++;
> continue;
> + }
> +
> + pfns = pfn_range_init_avail(pfn, end_pfn,
> + PTRS_PER_PMD, nid);
> }

maybe you could update memmap_init_zone() to only iterate {memblock.memory} -
{memblock.reserved}, so you would not need to check avail ....

as memmap_init_zone() does not need to handle holes to mark them reserved.
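
i.e. drive the loop from the free ranges instead of a bare pfn walk - just
a sketch, keeping the reserved regions for the __reserve_bootmem_region()
pass:

        phys_addr_t m_start, m_end;
        u64 i;

        for_each_free_mem_range(i, nid, &m_start, &m_end, NULL) {
                unsigned long pfn = PFN_UP(m_start);
                unsigned long last = PFN_DOWN(m_end);

                for ( ; pfn < last; pfn++)
                        __init_single_page(pfn_to_page(pfn), zone, nid, 1);
        }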

> +
> page = pfn_to_page(pfn);
> __init_single_page(page, zone, nid, 1);
> +
> + if (pfns > 1)
> + SetPageUninitialized2Mib(page);
> +
> + pfn += pfns;
> }
> }
>
> @@ -6196,6 +6302,7 @@ static const struct trace_print_flags pageflag_names[] = {
> {1UL << PG_owner_priv_1, "owner_priv_1" },
> {1UL << PG_arch_1, "arch_1" },
> {1UL << PG_reserved, "reserved" },
> + {1UL << PG_uninitialized2mib, "Uninit_2MiB" },

PG_uninitialized_2m ?

> {1UL << PG_private, "private" },
> {1UL << PG_private_2, "private_2" },
> {1UL << PG_writeback, "writeback" },

Yinghai

2013-07-13 04:40:05

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [RFC 4/4] Sparse initialization of struct page array.

On 07/12/2013 09:19 PM, Yinghai Lu wrote:
>> PG_reserved,
>> + PG_uninitialized2mib, /* Is this the right spot? ntz - Yes - rmh */
>> PG_private, /* If pagecache, has fs-private data */

The comment here is WTF...

-hpa

2013-07-13 05:31:18

by Yinghai Lu

[permalink] [raw]
Subject: Re: [RFC 4/4] Sparse initialization of struct page array.

On Fri, Jul 12, 2013 at 9:39 PM, H. Peter Anvin <[email protected]> wrote:
> On 07/12/2013 09:19 PM, Yinghai Lu wrote:
>>> PG_reserved,
>>> + PG_uninitialized2mib, /* Is this the right spot? ntz - Yes - rmh */
>>> PG_private, /* If pagecache, has fs-private data */
>
> The comment here is WTF...

ntz: Nate Zimmer?
rmh: Robin Holt?

2013-07-13 05:39:19

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [RFC 4/4] Sparse initialization of struct page array.

On 07/12/2013 10:31 PM, Yinghai Lu wrote:
> On Fri, Jul 12, 2013 at 9:39 PM, H. Peter Anvin <[email protected]> wrote:
>> On 07/12/2013 09:19 PM, Yinghai Lu wrote:
>>>> PG_reserved,
>>>> + PG_uninitialized2mib, /* Is this the right spot? ntz - Yes - rmh */
>>>> PG_private, /* If pagecache, has fs-private data */
>>
>> The comment here is WTF...
>
> ntz: Nate Zimmer?
> rmh: Robin Holt?
>

This kind of conversation doesn't really belong in a code comment, though.

-hpa

2013-07-15 01:38:52

by Sam Ben

[permalink] [raw]
Subject: Re: boot tracing

On 07/12/2013 04:53 PM, Ingo Molnar wrote:
> * Borislav Petkov <[email protected]> wrote:
>
>> On Fri, Jul 12, 2013 at 10:27:56AM +0200, Ingo Molnar wrote:
>>> Robert Richter and Boris Petkov are working on 'persistent events'
>>> support for perf, which will eventually allow boot time profiling -
>>> I'm not sure if the patches and the tooling support is ready enough
>>> yet for your purposes.
>> Nope, not yet but we're getting there.
>>
>>> Robert, Boris, the following workflow would be pretty intuitive:
>>>
>>> - kernel developer sets boot flag: perf=boot,freq=1khz,size=16MB
>> What does perf=boot mean? I assume boot tracing.
> In this case it would mean boot profiling - i.e. a cycles hardware-PMU
> event collecting into a perf trace buffer as usual.
>
> Essentially a 'perf record -a' work-alike, just one that gets activated as
> early as practical, and which would allow the profiling of memory
> initialization.
>
> Now, one extra complication here is that to be able to profile buddy
> allocator this persistent event would have to work before the buddy
> allocator is active :-/ So this sort of profiling would have to use
> memblock_alloc().

Could perf=boot be used to sample the performance of the memblock subsystem?
I think the perf subsystem is initialized too late to monitor this.

>
> Just wanted to highlight this usecase, we might eventually want to support
> it.
>
> [ Note that this is different from boot tracing of one or more trace
> events - but it's a conceptually pretty close cousin. ]
>
> Thanks,
>
> Ingo

2013-07-15 03:19:37

by Robin Holt

[permalink] [raw]
Subject: Re: [RFC 3/4] Separate page initialization into a separate function.

On Fri, Jul 12, 2013 at 08:06:52PM -0700, Yinghai Lu wrote:
> On Thu, Jul 11, 2013 at 7:03 PM, Robin Holt <[email protected]> wrote:
> > Currently, memmap_init_zone() has all the smarts for initializing a
> > single page. When we convert to initializing pages in a 2MiB chunk,
> > we will need to do this equivalent work from two separate places
> > so we are breaking out a helper function.
> >
> > Signed-off-by: Robin Holt <[email protected]>
> > Signed-off-by: Nate Zimmer <[email protected]>
> > To: "H. Peter Anvin" <[email protected]>
> > To: Ingo Molnar <[email protected]>
> > Cc: Linux Kernel <[email protected]>
> > Cc: Linux MM <[email protected]>
> > Cc: Rob Landley <[email protected]>
> > Cc: Mike Travis <[email protected]>
> > Cc: Daniel J Blueman <[email protected]>
> > Cc: Andrew Morton <[email protected]>
> > Cc: Greg KH <[email protected]>
> > Cc: Yinghai Lu <[email protected]>
> > Cc: Mel Gorman <[email protected]>
> > ---
> > mm/mm_init.c | 2 +-
> > mm/page_alloc.c | 75 +++++++++++++++++++++++++++++++++------------------------
> > 2 files changed, 45 insertions(+), 32 deletions(-)
> >
> > diff --git a/mm/mm_init.c b/mm/mm_init.c
> > index c280a02..be8a539 100644
> > --- a/mm/mm_init.c
> > +++ b/mm/mm_init.c
> > @@ -128,7 +128,7 @@ void __init mminit_verify_pageflags_layout(void)
> > BUG_ON(or_mask != add_mask);
> > }
> >
> > -void __meminit mminit_verify_page_links(struct page *page, enum zone_type zone,
> > +void mminit_verify_page_links(struct page *page, enum zone_type zone,
> > unsigned long nid, unsigned long pfn)
> > {
> > BUG_ON(page_to_nid(page) != nid);
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index c3edb62..635b131 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -697,6 +697,49 @@ static void free_one_page(struct zone *zone, struct page *page, int order,
> > spin_unlock(&zone->lock);
> > }
> >
> > +static void __init_single_page(struct page *page, unsigned long zone, int nid, int reserved)
> > +{
> > + unsigned long pfn = page_to_pfn(page);
> > + struct zone *z = &NODE_DATA(nid)->node_zones[zone];
> > +
> > + set_page_links(page, zone, nid, pfn);
> > + mminit_verify_page_links(page, zone, nid, pfn);
> > + init_page_count(page);
> > + page_mapcount_reset(page);
> > + page_nid_reset_last(page);
> > + if (reserved) {
> > + SetPageReserved(page);
> > + } else {
> > + ClearPageReserved(page);
> > + set_page_count(page, 0);
> > + }
> > + /*
> > + * Mark the block movable so that blocks are reserved for
> > + * movable at startup. This will force kernel allocations
> > + * to reserve their blocks rather than leaking throughout
> > + * the address space during boot when many long-lived
> > + * kernel allocations are made. Later some blocks near
> > + * the start are marked MIGRATE_RESERVE by
> > + * setup_zone_migrate_reserve()
> > + *
> > + * bitmap is created for zone's valid pfn range. but memmap
> > + * can be created for invalid pages (for alignment)
> > + * check here not to call set_pageblock_migratetype() against
> > + * pfn out of zone.
> > + */
> > + if ((z->zone_start_pfn <= pfn)
> > + && (pfn < zone_end_pfn(z))
> > + && !(pfn & (pageblock_nr_pages - 1)))
> > + set_pageblock_migratetype(page, MIGRATE_MOVABLE);
> > +
> > + INIT_LIST_HEAD(&page->lru);
> > +#ifdef WANT_PAGE_VIRTUAL
> > + /* The shift won't overflow because ZONE_NORMAL is below 4G. */
> > + if (!is_highmem_idx(zone))
> > + set_page_address(page, __va(pfn << PAGE_SHIFT));
> > +#endif
> > +}
> > +
> > static bool free_pages_prepare(struct page *page, unsigned int order)
> > {
> > int i;
> > @@ -3934,37 +3977,7 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
> > continue;
> > }
> > page = pfn_to_page(pfn);
> > - set_page_links(page, zone, nid, pfn);
> > - mminit_verify_page_links(page, zone, nid, pfn);
> > - init_page_count(page);
> > - page_mapcount_reset(page);
> > - page_nid_reset_last(page);
> > - SetPageReserved(page);
> > - /*
> > - * Mark the block movable so that blocks are reserved for
> > - * movable at startup. This will force kernel allocations
> > - * to reserve their blocks rather than leaking throughout
> > - * the address space during boot when many long-lived
> > - * kernel allocations are made. Later some blocks near
> > - * the start are marked MIGRATE_RESERVE by
> > - * setup_zone_migrate_reserve()
> > - *
> > - * bitmap is created for zone's valid pfn range. but memmap
> > - * can be created for invalid pages (for alignment)
> > - * check here not to call set_pageblock_migratetype() against
> > - * pfn out of zone.
> > - */
> > - if ((z->zone_start_pfn <= pfn)
> > - && (pfn < zone_end_pfn(z))
> > - && !(pfn & (pageblock_nr_pages - 1)))
> > - set_pageblock_migratetype(page, MIGRATE_MOVABLE);
> > -
> > - INIT_LIST_HEAD(&page->lru);
> > -#ifdef WANT_PAGE_VIRTUAL
> > - /* The shift won't overflow because ZONE_NORMAL is below 4G. */
> > - if (!is_highmem_idx(zone))
> > - set_page_address(page, __va(pfn << PAGE_SHIFT));
> > -#endif
> > + __init_single_page(page, zone, nid, 1);
>
> Can you
> move page = pfn_to_page(pfn) into __init_single_page
> and pass pfn directly?

Sure, but then I don't care for the name so much; I can think on
that some more too.

I think the feedback I was most hoping to receive was pertaining to a
means for removing the PG_uninitialized2Mib flag entirely. If I could
get rid of that and have some page-local way of knowing if it has not
been initialized, I think this patch set would be much better.
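
The only page-local test I can come up with today is the heavy-handed
"is the whole struct page still zero" comparison, roughly like the
untested sketch below (illustrative only, not part of the posted
patches):

	static inline bool page_struct_is_zero(struct page *page)
	{
		unsigned long *word = (unsigned long *)page;
		size_t i;

		for (i = 0; i < sizeof(struct page) / sizeof(*word); i++)
			if (word[i])
				return false;
		return true;
	}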

Robin

2013-07-15 14:08:54

by Nathan Zimmer

[permalink] [raw]
Subject: Re: [RFC 4/4] Sparse initialization of struct page array.

On 07/13/2013 12:31 AM, Yinghai Lu wrote:
> On Fri, Jul 12, 2013 at 9:39 PM, H. Peter Anvin <[email protected]> wrote:
>> On 07/12/2013 09:19 PM, Yinghai Lu wrote:
>>>> PG_reserved,
>>>> + PG_uninitialized2mib, /* Is this the right spot? ntz - Yes - rmh */
>>>> PG_private, /* If pagecache, has fs-private data */
>> The comment here is WTF...
> ntz: Nate Zimmer?
> rmh: Robin Holt?

Yea that comment was supposed to be removed.
Sorry about that.

Nate

2013-07-15 15:00:42

by Robin Holt

[permalink] [raw]
Subject: Re: [RFC 0/4] Transparent on-demand struct page initialization embedded in the buddy allocator

On Thu, Jul 11, 2013 at 09:03:51PM -0500, Robin Holt wrote:
> We have been working on this since we returned from shutdown and have
> something to discuss now. We restricted ourselves to 2MiB initialization
> to keep the patch set a little smaller and more clear.
>
> First, I think I want to propose getting rid of the page flag. If I knew
> of a concrete way to determine that the page has not been initialized,
> this patch series would look different. If there is no definitive
> way to determine that the struct page has been initialized aside from
> checking the entire page struct is zero, then I think I would suggest
> we change the page flag to indicate the page has been initialized.

Ingo or HPA,

Did I implement this wrong or is there a way to get rid of the page flag
which is not going to impact normal operation? I don't want to put too
much more effort into this until I know we are stuck going this direction.
Currently, the expand() function has a relatively expensive check
against the 2MiB-aligned pfn's struct page.  I do not know of a way to
eliminate that check against the other page because the first reference
we see for a page is in the middle of that 2MiB-aligned range.
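
For reference, the check in question boils down to roughly this
(illustrative sketch, not the exact code from the posted patch):

	/* the Uninitialized flag lives only on the 2MiB-aligned head page */
	unsigned long head_pfn = page_to_pfn(page) & ~((unsigned long)PTRS_PER_PMD - 1);
	struct page *head = pfn_to_page(head_pfn);

	if (PageUninitialized2Mib(head))
		expand_page_initialization(head);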

To identify this as an area of concern, we had booted with a simulator,
setting watch points on the struct page array region once the
Uninitialized flag was set and maintaining that until it was cleared.

Thanks,
Robin

2013-07-15 15:16:26

by Robin Holt

[permalink] [raw]
Subject: Re: [RFC 0/4] Transparent on-demand struct page initialization embedded in the buddy allocator

On Fri, Jul 12, 2013 at 10:27:56AM +0200, Ingo Molnar wrote:
>
> * Robin Holt <[email protected]> wrote:
>
> > [...]
> >
> > With this patch, we did boot a 16TiB machine. Without the patches, the
> > v3.10 kernel with the same configuration took 407 seconds for
> > free_all_bootmem. With the patches and operating on 2MiB pages instead
> > of 1GiB, it took 26 seconds so performance was improved. I have no feel
> > for how the 1GiB chunk size will perform.
>
> That's pretty impressive.

And WRONG!

That is a 15x speedup in the freeing of memory at the free_all_bootmem
point. That is _NOT_ the speedup from memmap_init_zone. I forgot to
take that into account as Nate pointed out this morning in a hallway
discussion. Before, on the 16TiB machine, memmap_init_zone took 1152
seconds.  After, it took 50.  If it were a straight 1/512th, we would
have expected that 1152 to come down to something more on the order of
2-3, so there is still significant room for improvement.

Sorry for the confusion.

> It's still a 15x speedup instead of a 512x speedup, so I'd say there's
> something else being the current bottleneck, besides page init
> granularity.
>
> Can you boot with just a few gigs of RAM and stuff the rest into hotplug
> memory, and then hot-add that memory? That would allow easy profiling of
> remaining overhead.

Nate and I will be working on other things for the next few hours hoping
there is a better answer to the first question we asked about there
being a way to test a page other than comparing against all zeroes to
see if it has been initialized.

Thanks,
Robin

2013-07-15 17:45:59

by Nathan Zimmer

[permalink] [raw]
Subject: Re: [RFC 4/4] Sparse initialization of struct page array.

On Fri, Jul 12, 2013 at 09:19:12PM -0700, Yinghai Lu wrote:
> On Thu, Jul 11, 2013 at 7:03 PM, Robin Holt <[email protected]> wrote:
> > +
> > page = pfn_to_page(pfn);
> > __init_single_page(page, zone, nid, 1);
> > +
> > + if (pfns > 1)
> > + SetPageUninitialized2Mib(page);
> > +
> > + pfn += pfns;
> > }
> > }
> >
> > @@ -6196,6 +6302,7 @@ static const struct trace_print_flags pageflag_names[] = {
> > {1UL << PG_owner_priv_1, "owner_priv_1" },
> > {1UL << PG_arch_1, "arch_1" },
> > {1UL << PG_reserved, "reserved" },
> > + {1UL << PG_uninitialized2mib, "Uninit_2MiB" },
>
> PG_uninitialized_2m ?
>
> > {1UL << PG_private, "private" },
> > {1UL << PG_private_2, "private_2" },
> > {1UL << PG_writeback, "writeback" },
>
> Yinghai


I hadn't actually been very happy with having a PG_uninitialized2mib flag.
It implies that if we want to jump to 1GiB pages we would need a second
flag, PG_uninitialized1gb, for that.  I was thinking of changing it to
PG_uninitialized and setting page->private to the correct order.
Thoughts?
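
Roughly what I have in mind (untested sketch; the flag accessors are
illustrative and not from the posted patches):

	/* mark the head page as covering an uninitialized 2^order chunk */
	static inline void set_page_uninitialized(struct page *page,
						  unsigned int order)
	{
		SetPageUninitialized(page);
		set_page_private(page, order);
	}

	static inline unsigned long page_uninitialized_order(struct page *page)
	{
		return PageUninitialized(page) ? page_private(page) : 0;
	}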

Nate

2013-07-15 17:55:00

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [RFC 4/4] Sparse initialization of struct page array.

On 07/15/2013 10:45 AM, Nathan Zimmer wrote:
>
> I hadn't actually been very happy with having a PG_uninitialized2mib flag.
> It implies if we want to jump to 1Gb pages we would need a second flag,
> PG_uninitialized1gb, for that. I was thinking of changing it to
> PG_uninitialized and setting page->private to the correct order.
> Thoughts?
>

Seems straightforward. The bigger issue is the amount of overhead we
cause by having to check upstack for the initialization status of the
superpages.

I'm concerned, obviously, about lingering overhead that is "forever".
That being said, in the absolute worst case we could keep a counter of
the number of uninitialized pages and, when it hits zero, do a static
switch that patches out the initialization code (would have to be undone
on memory hotplug, of course.)
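
A very rough sketch of that idea (illustrative only; a static_key
could replace the plain flag to truly patch the check out):

	static atomic_long_t nr_uninit_chunks = ATOMIC_LONG_INIT(0);
	static bool deferred_init_done;	/* hot-add would need to clear this */

	static inline void maybe_expand_chunk(struct page *head)
	{
		if (deferred_init_done || !PageUninitialized2Mib(head))
			return;
		expand_page_initialization(head);
		if (atomic_long_dec_and_test(&nr_uninit_chunks))
			deferred_init_done = true;
	}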

-hpa

2013-07-15 18:26:18

by Robin Holt

[permalink] [raw]
Subject: Re: [RFC 4/4] Sparse initialization of struct page array.

On Mon, Jul 15, 2013 at 10:54:38AM -0700, H. Peter Anvin wrote:
> On 07/15/2013 10:45 AM, Nathan Zimmer wrote:
> >
> > I hadn't actually been very happy with having a PG_uninitialized2mib flag.
> > It implies if we want to jump to 1Gb pages we would need a second flag,
> > PG_uninitialized1gb, for that. I was thinking of changing it to
> > PG_uninitialized and setting page->private to the correct order.
> > Thoughts?
> >
>
> Seems straightforward. The bigger issue is the amount of overhead we
> cause by having to check upstack for the initialization status of the
> superpages.
>
> I'm concerned, obviously, about lingering overhead that is "forever".
> That being said, in the absolutely worst case we could have a counter to
> the number of uninitialized pages which when it hits zero we do a static
> switch and switch out the initialization code (would have to be undone
> on memory hotplug, of course.)

Is there a fairly cheap way to determine definitively that the struct
page is not initialized?

I think this patch set can change fairly drastically if we have that.
I think I will start working up those changes and code a heavy-handed
check until I hear of an alternative way to cheaply check.

Thanks,
Robin

2013-07-15 18:29:52

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [RFC 4/4] Sparse initialization of struct page array.

On 07/15/2013 11:26 AM, Robin Holt wrote:
> Is there a fairly cheap way to determine definitively that the struct
> page is not initialized?

By definition I would assume no. The only way I can think of would be
to unmap the memory associated with the struct page in the TLB and
initialize the struct pages at trap time.

> I think this patch set can change fairly drastically if we have that.
> I think I will start working up those changes and code a heavy-handed
> check until I hear of an alternative way to cheaply check.

-hpa

2013-07-15 21:30:40

by Andrew Morton

[permalink] [raw]
Subject: Re: [RFC 4/4] Sparse initialization of struct page array.

On Thu, 11 Jul 2013 21:03:55 -0500 Robin Holt <[email protected]> wrote:

> During boot of large memory machines, a significant portion of boot
> is spent initializing the struct page array. The vast majority of
> those pages are not referenced during boot.
>
> Change this over to only initializing the pages when they are
> actually allocated.
>
> Besides the advantage of boot speed, this allows us the chance to
> use normal performance monitoring tools to determine where the bulk
> of time is spent during page initialization.
>
> ...
>
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1330,8 +1330,19 @@ static inline void __free_reserved_page(struct page *page)
> __free_page(page);
> }
>
> +extern void __reserve_bootmem_region(phys_addr_t start, phys_addr_t end);
> +
> +static inline void __reserve_bootmem_page(struct page *page)
> +{
> + phys_addr_t start = page_to_pfn(page) << PAGE_SHIFT;
> + phys_addr_t end = start + PAGE_SIZE;
> +
> + __reserve_bootmem_region(start, end);
> +}

It isn't obvious that this needed to be inlined?

> static inline void free_reserved_page(struct page *page)
> {
> + __reserve_bootmem_page(page);
> __free_reserved_page(page);
> adjust_managed_page_count(page, 1);
> }
> diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
> index 6d53675..79e8eb7 100644
> --- a/include/linux/page-flags.h
> +++ b/include/linux/page-flags.h
> @@ -83,6 +83,7 @@ enum pageflags {
> PG_owner_priv_1, /* Owner use. If pagecache, fs may use*/
> PG_arch_1,
> PG_reserved,
> + PG_uninitialized2mib, /* Is this the right spot? ntz - Yes - rmh */

"mib" creeps me out too. And it makes me think of SNMP, which I'd
prefer not to think about.

We've traditionally had fears of running out of page flags, but I've
lost track of how close we are to that happening. IIRC the answer
depends on whether you believe there is such a thing as a 32-bit NUMA
system.

Can this be avoided anyway? I suspect there's some idiotic combination
of flags we could use to indicate the state. PG_reserved|PG_lru or
something.
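
(Illustrative only: one otherwise-impossible combination could be
overloaded along these lines.)

	/* PG_reserved|PG_lru never occurs together on a sane page */
	static inline bool page_marked_uninitialized(struct page *page)
	{
		unsigned long mask = (1UL << PG_reserved) | (1UL << PG_lru);

		return (page->flags & mask) == mask;
	}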

"2MB" sounds terribly arch-specific. Shouldn't we make it more generic
for when the hexagon64 port wants to use 4MB?

That conversational code comment was already commented on, but it's
still there?

>
> ...
>
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -740,6 +740,54 @@ static void __init_single_page(struct page *page, unsigned long zone, int nid, i
> #endif
> }
>
> +static void expand_page_initialization(struct page *basepage)
> +{
> + unsigned long pfn = page_to_pfn(basepage);
> + unsigned long end_pfn = pfn + PTRS_PER_PMD;
> + unsigned long zone = page_zonenum(basepage);
> + int reserved = PageReserved(basepage);
> + int nid = page_to_nid(basepage);
> +
> + ClearPageUninitialized2Mib(basepage);
> +
> + for( pfn++; pfn < end_pfn; pfn++ )
> + __init_single_page(pfn_to_page(pfn), zone, nid, reserved);
> +}
> +
> +void ensure_pages_are_initialized(unsigned long start_pfn,
> + unsigned long end_pfn)

I think this can be made static. I hope so, as it's a somewhat
odd-sounding identifier for a global.

> +{
> + unsigned long aligned_start_pfn = start_pfn & ~(PTRS_PER_PMD - 1);
> + unsigned long aligned_end_pfn;
> + struct page *page;
> +
> + aligned_end_pfn = end_pfn & ~(PTRS_PER_PMD - 1);
> + aligned_end_pfn += PTRS_PER_PMD;
> + while (aligned_start_pfn < aligned_end_pfn) {
> + if (pfn_valid(aligned_start_pfn)) {
> + page = pfn_to_page(aligned_start_pfn);
> +
> + if(PageUninitialized2Mib(page))

checkpatch them, please.

> + expand_page_initialization(page);
> + }
> +
> + aligned_start_pfn += PTRS_PER_PMD;
> + }
> +}

Some nice code comments for the above two functions would be helpful.

>
> ...
>
> +int __meminit pfn_range_init_avail(unsigned long pfn, unsigned long end_pfn,
> + unsigned long size, int nid)
> +{
> + unsigned long validate_end_pfn = pfn + size;
> +
> + if (pfn & (size - 1))
> + return 1;
> +
> + if (pfn + size >= end_pfn)
> + return 1;
> +
> + while (pfn < validate_end_pfn)
> + {
> + if (!early_pfn_valid(pfn))
> + return 1;
> + if (!early_pfn_in_nid(pfn, nid))
> + return 1;
> + pfn++;
> + }
> +
> + return size;
> +}

Document it, please. The return value semantics look odd, so don't
forget to explain all that as well.

>
> ...
>
> @@ -6196,6 +6302,7 @@ static const struct trace_print_flags pageflag_names[] = {
> {1UL << PG_owner_priv_1, "owner_priv_1" },
> {1UL << PG_arch_1, "arch_1" },
> {1UL << PG_reserved, "reserved" },
> + {1UL << PG_uninitialized2mib, "Uninit_2MiB" },

It would be better if the name which is visible in procfs matches the
name in the kernel source code.

> {1UL << PG_private, "private" },
> {1UL << PG_private_2, "private_2" },
> {1UL << PG_writeback, "writeback" },

2013-07-16 08:55:04

by Joonsoo Kim

[permalink] [raw]
Subject: Re: [RFC 0/4] Transparent on-demand struct page initialization embedded in the buddy allocator

On Fri, Jul 12, 2013 at 10:27:56AM +0200, Ingo Molnar wrote:
>
> * Robin Holt <[email protected]> wrote:
>
> > [...]
> >
> > With this patch, we did boot a 16TiB machine. Without the patches, the
> > v3.10 kernel with the same configuration took 407 seconds for
> > free_all_bootmem. With the patches and operating on 2MiB pages instead
> > of 1GiB, it took 26 seconds so performance was improved. I have no feel
> > for how the 1GiB chunk size will perform.
>
> That's pretty impressive.
>
> It's still a 15x speedup instead of a 512x speedup, so I'd say there's
> something else being the current bottleneck, besides page init
> granularity.
>
> Can you boot with just a few gigs of RAM and stuff the rest into hotplug
> memory, and then hot-add that memory? That would allow easy profiling of
> remaining overhead.
>
> Side note:
>
> Robert Richter and Boris Petkov are working on 'persistent events' support
> for perf, which will eventually allow boot time profiling - I'm not sure
> if the patches and the tooling support is ready enough yet for your
> purposes.
>
> Robert, Boris, the following workflow would be pretty intuitive:
>
> - kernel developer sets boot flag: perf=boot,freq=1khz,size=16MB
>
> - we'd get a single (cycles?) event running once the perf subsystem is up
> and running, with a sampling frequency of 1 KHz, sending profiling
> trace events to a sufficiently sized profiling buffer of 16 MB per
> CPU.
>
> - once the system reaches SYSTEM_RUNNING, profiling is stopped either
> automatically - or the user stops it via a new tooling command.
>
> - the profiling buffer is extracted into a regular perf.data via a
> special 'perf record' call or some other, new perf tooling
> solution/variant.
>
> [ Alternatively the kernel could attempt to construct a 'virtual'
> perf.data from the persistent buffer, available via /sys/debug or
> elsewhere in /sys - just like the kernel constructs a 'virtual'
> /proc/kcore, etc. That file could be copied or used directly. ]

Hello, Robert, Boris, Ingo.

How about executing perf via a usermodehelper and collecting the output
in tmpfs? With this approach we can only start perf after rootfs
initialization, because we need the perf binary at least, but we can use
almost all of perf's functionality. If anyone has interest in
this approach, I will send patches implementing this idea.
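
A rough sketch of what I mean (illustrative only; the paths and perf
options are placeholders):

	static int __init boot_perf_record(void)
	{
		static char *argv[] = { "/usr/bin/perf", "record", "-a",
					"-o", "/tmp/perf-boot.data", NULL };
		static char *envp[] = { "HOME=/",
					"PATH=/sbin:/bin:/usr/sbin:/usr/bin",
					NULL };

		/* kick off system-wide profiling once rootfs is available */
		return call_usermodehelper(argv[0], argv, envp, UMH_WAIT_EXEC);
	}
	late_initcall(boot_perf_record);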

Thanks.

2013-07-16 09:08:32

by Borislav Petkov

[permalink] [raw]
Subject: Re: [RFC 0/4] Transparent on-demand struct page initialization embedded in the buddy allocator

On Tue, Jul 16, 2013 at 05:55:02PM +0900, Joonsoo Kim wrote:
> How about executing a perf in usermodehelper and collecting output
> in tmpfs? Using this approach, we can start a perf after rootfs
> initialization,

What for if we can start logging to buffers much earlier? *Reading*
from those buffers can be done much later, at our own leisure with full
userspace up.

--
Regards/Gruss,
Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

2013-07-16 10:26:22

by Robin Holt

[permalink] [raw]
Subject: Re: [RFC 4/4] Sparse initialization of struct page array.

On Fri, Jul 12, 2013 at 09:19:12PM -0700, Yinghai Lu wrote:
> On Thu, Jul 11, 2013 at 7:03 PM, Robin Holt <[email protected]> wrote:
> > During boot of large memory machines, a significant portion of boot
> > is spent initializing the struct page array. The vast majority of
> > those pages are not referenced during boot.
> >
> > Change this over to only initializing the pages when they are
> > actually allocated.
> >
> > Besides the advantage of boot speed, this allows us the chance to
> > use normal performance monitoring tools to determine where the bulk
> > of time is spent during page initialization.
> >
> > Signed-off-by: Robin Holt <[email protected]>
> > Signed-off-by: Nate Zimmer <[email protected]>
> > To: "H. Peter Anvin" <[email protected]>
> > To: Ingo Molnar <[email protected]>
> > Cc: Linux Kernel <[email protected]>
> > Cc: Linux MM <[email protected]>
> > Cc: Rob Landley <[email protected]>
> > Cc: Mike Travis <[email protected]>
> > Cc: Daniel J Blueman <[email protected]>
> > Cc: Andrew Morton <[email protected]>
> > Cc: Greg KH <[email protected]>
> > Cc: Yinghai Lu <[email protected]>
> > Cc: Mel Gorman <[email protected]>
> > ---
> > include/linux/mm.h | 11 +++++
> > include/linux/page-flags.h | 5 +-
> > mm/nobootmem.c | 5 ++
> > mm/page_alloc.c | 117 +++++++++++++++++++++++++++++++++++++++++++--
> > 4 files changed, 132 insertions(+), 6 deletions(-)
> >
> > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > index e0c8528..3de08b5 100644
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -1330,8 +1330,19 @@ static inline void __free_reserved_page(struct page *page)
> > __free_page(page);
> > }
> >
> > +extern void __reserve_bootmem_region(phys_addr_t start, phys_addr_t end);
> > +
> > +static inline void __reserve_bootmem_page(struct page *page)
> > +{
> > + phys_addr_t start = page_to_pfn(page) << PAGE_SHIFT;
> > + phys_addr_t end = start + PAGE_SIZE;
> > +
> > + __reserve_bootmem_region(start, end);
> > +}
> > +
> > static inline void free_reserved_page(struct page *page)
> > {
> > + __reserve_bootmem_page(page);
> > __free_reserved_page(page);
> > adjust_managed_page_count(page, 1);
> > }
> > diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
> > index 6d53675..79e8eb7 100644
> > --- a/include/linux/page-flags.h
> > +++ b/include/linux/page-flags.h
> > @@ -83,6 +83,7 @@ enum pageflags {
> > PG_owner_priv_1, /* Owner use. If pagecache, fs may use*/
> > PG_arch_1,
> > PG_reserved,
> > + PG_uninitialized2mib, /* Is this the right spot? ntz - Yes - rmh */
> > PG_private, /* If pagecache, has fs-private data */
> > PG_private_2, /* If pagecache, has fs aux data */
> > PG_writeback, /* Page is under writeback */
> > @@ -211,6 +212,8 @@ PAGEFLAG(SwapBacked, swapbacked) __CLEARPAGEFLAG(SwapBacked, swapbacked)
> >
> > __PAGEFLAG(SlobFree, slob_free)
> >
> > +PAGEFLAG(Uninitialized2Mib, uninitialized2mib)
> > +
> > /*
> > * Private page markings that may be used by the filesystem that owns the page
> > * for its own purposes.
> > @@ -499,7 +502,7 @@ static inline void ClearPageSlabPfmemalloc(struct page *page)
> > #define PAGE_FLAGS_CHECK_AT_FREE \
> > (1 << PG_lru | 1 << PG_locked | \
> > 1 << PG_private | 1 << PG_private_2 | \
> > - 1 << PG_writeback | 1 << PG_reserved | \
> > + 1 << PG_writeback | 1 << PG_reserved | 1 << PG_uninitialized2mib | \
> > 1 << PG_slab | 1 << PG_swapcache | 1 << PG_active | \
> > 1 << PG_unevictable | __PG_MLOCKED | __PG_HWPOISON | \
> > __PG_COMPOUND_LOCK)
> > diff --git a/mm/nobootmem.c b/mm/nobootmem.c
> > index 3b512ca..e3a386d 100644
> > --- a/mm/nobootmem.c
> > +++ b/mm/nobootmem.c
> > @@ -126,6 +126,9 @@ static unsigned long __init free_low_memory_core_early(void)
> > phys_addr_t start, end, size;
> > u64 i;
> >
> > + for_each_reserved_mem_region(i, &start, &end)
> > + __reserve_bootmem_region(start, end);
> > +
>
> How about holes that are not in memblock.reserved?
>
> before this patch:
> free_area_init_node/free_area_init_core/memmap_init_zone
> will mark all pages in the node range as Reserved in struct page, at first.
>
> but those holes are not mapped via the kernel low mapping,
> so it should be ok not to touch "struct page" for them.
>
> Now you only mark reserved for memblock.reserved at first, and later
> mark {memblock.memory} - {memblock.reserved} to be available.
> And that is ok.
>
> but you should split that change into another patch and add some comment
> and change log for the change,
> in case there is some user like UEFI etc. doing weird things.

I will split out a separate patch for this.

> > for_each_free_mem_range(i, MAX_NUMNODES, &start, &end, NULL)
> > count += __free_memory_core(start, end);
> >
> > @@ -162,6 +165,8 @@ unsigned long __init free_all_bootmem(void)
> > {
> > struct pglist_data *pgdat;
> >
> > + memblock_dump_all();
> > +
>
> Not needed.

Left over debug in the rush to ask our question.

> > for_each_online_pgdat(pgdat)
> > reset_node_lowmem_managed_pages(pgdat);
> >
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 635b131..fe51eb9 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -740,6 +740,54 @@ static void __init_single_page(struct page *page, unsigned long zone, int nid, i
> > #endif
> > }
> >
> > +static void expand_page_initialization(struct page *basepage)
> > +{
> > + unsigned long pfn = page_to_pfn(basepage);
> > + unsigned long end_pfn = pfn + PTRS_PER_PMD;
> > + unsigned long zone = page_zonenum(basepage);
> > + int reserved = PageReserved(basepage);
> > + int nid = page_to_nid(basepage);
> > +
> > + ClearPageUninitialized2Mib(basepage);
> > +
> > + for( pfn++; pfn < end_pfn; pfn++ )
> > + __init_single_page(pfn_to_page(pfn), zone, nid, reserved);
> > +}
> > +
> > +void ensure_pages_are_initialized(unsigned long start_pfn,
> > + unsigned long end_pfn)
> > +{
> > + unsigned long aligned_start_pfn = start_pfn & ~(PTRS_PER_PMD - 1);
> > + unsigned long aligned_end_pfn;
> > + struct page *page;
> > +
> > + aligned_end_pfn = end_pfn & ~(PTRS_PER_PMD - 1);
> > + aligned_end_pfn += PTRS_PER_PMD;
> > + while (aligned_start_pfn < aligned_end_pfn) {
> > + if (pfn_valid(aligned_start_pfn)) {
> > + page = pfn_to_page(aligned_start_pfn);
> > +
> > + if(PageUninitialized2Mib(page))
> > + expand_page_initialization(page);
> > + }
> > +
> > + aligned_start_pfn += PTRS_PER_PMD;
> > + }
> > +}
> > +
> > +void __reserve_bootmem_region(phys_addr_t start, phys_addr_t end)
> > +{
> > + unsigned long start_pfn = PFN_DOWN(start);
> > + unsigned long end_pfn = PFN_UP(end);
> > +
> > + ensure_pages_are_initialized(start_pfn, end_pfn);
> > +}
>
> that name is confusing; actually it is setting the struct page to Reserved only.
> maybe __reserve_pages_bootmem() to be aligned with free_pages_bootmem?

Done.

> > +
> > +static inline void ensure_page_is_initialized(struct page *page)
> > +{
> > + __reserve_bootmem_page(page);
> > +}
>
> how about using __reserve_page_bootmem directly and adding a comment at the caller's site?

I really dislike that.  The inline function makes a comment
unnecessary in my opinion and leaves the implementation localized to
the one-line function. Those wanting to understand why can quickly see
that the intended functionality is accomplished by the other function.
I would really like to leave this as-is.

> > +
> > static bool free_pages_prepare(struct page *page, unsigned int order)
> > {
> > int i;
> > @@ -751,7 +799,10 @@ static bool free_pages_prepare(struct page *page, unsigned int order)
> > if (PageAnon(page))
> > page->mapping = NULL;
> > for (i = 0; i < (1 << order); i++)
> > - bad += free_pages_check(page + i);
> > + if (PageUninitialized2Mib(page + i))
> > + i += PTRS_PER_PMD - 1;
> > + else
> > + bad += free_pages_check(page + i);
> > if (bad)
> > return false;
> >
> > @@ -795,13 +846,22 @@ void __meminit __free_pages_bootmem(struct page *page, unsigned int order)
> > unsigned int loop;
> >
> > prefetchw(page);
> > - for (loop = 0; loop < nr_pages; loop++) {
> > + for (loop = 0; loop < nr_pages; ) {
> > struct page *p = &page[loop];
> >
> > if (loop + 1 < nr_pages)
> > prefetchw(p + 1);
> > +
> > + if ((PageUninitialized2Mib(p)) &&
> > + ((loop + PTRS_PER_PMD) > nr_pages))
> > + ensure_page_is_initialized(p);
> > +
> > __ClearPageReserved(p);
> > set_page_count(p, 0);
> > + if (PageUninitialized2Mib(p))
> > + loop += PTRS_PER_PMD;
> > + else
> > + loop += 1;
> > }
> >
> > page_zone(page)->managed_pages += 1 << order;
> > @@ -856,6 +916,7 @@ static inline void expand(struct zone *zone, struct page *page,
> > area--;
> > high--;
> > size >>= 1;
> > + ensure_page_is_initialized(page);
> > VM_BUG_ON(bad_range(zone, &page[size]));
> >
> > #ifdef CONFIG_DEBUG_PAGEALLOC
> > @@ -903,6 +964,10 @@ static int prep_new_page(struct page *page, int order, gfp_t gfp_flags)
> >
> > for (i = 0; i < (1 << order); i++) {
> > struct page *p = page + i;
> > +
> > + if (PageUninitialized2Mib(p))
> > + expand_page_initialization(page);
> > +
> > if (unlikely(check_new_page(p)))
> > return 1;
> > }
> > @@ -985,6 +1050,7 @@ int move_freepages(struct zone *zone,
> > unsigned long order;
> > int pages_moved = 0;
> >
> > + ensure_pages_are_initialized(page_to_pfn(start_page), page_to_pfn(end_page));
> > #ifndef CONFIG_HOLES_IN_ZONE
> > /*
> > * page_zone is not safe to call in this context when
> > @@ -3859,6 +3925,9 @@ static int pageblock_is_reserved(unsigned long start_pfn, unsigned long end_pfn)
> > for (pfn = start_pfn; pfn < end_pfn; pfn++) {
> > if (!pfn_valid_within(pfn) || PageReserved(pfn_to_page(pfn)))
> > return 1;
> > +
> > + if (PageUninitialized2Mib(pfn_to_page(pfn)))
> > + pfn += PTRS_PER_PMD;
> > }
> > return 0;
> > }
> > @@ -3947,6 +4016,29 @@ static void setup_zone_migrate_reserve(struct zone *zone)
> > }
> > }
> >
> > +int __meminit pfn_range_init_avail(unsigned long pfn, unsigned long end_pfn,
> > + unsigned long size, int nid)
> why not use static?  it seems there is no outside user.

Left over from early patch series where we were using this from
mm/nobootmem.c. Fixed.

> > +{
> > + unsigned long validate_end_pfn = pfn + size;
> > +
> > + if (pfn & (size - 1))
> > + return 1;
> > +
> > + if (pfn + size >= end_pfn)
> > + return 1;
> > +
> > + while (pfn < validate_end_pfn)
> > + {
> > + if (!early_pfn_valid(pfn))
> > + return 1;
> > + if (!early_pfn_in_nid(pfn, nid))
> > + return 1;
> > + pfn++;
> > + }
> > +
> > + return size;
> > +}
> > +
> > /*
> > * Initially all pages are reserved - free ones are freed
> > * up by free_all_bootmem() once the early boot process is
> > @@ -3964,20 +4056,34 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
> > highest_memmap_pfn = end_pfn - 1;
> >
> > z = &NODE_DATA(nid)->node_zones[zone];
> > - for (pfn = start_pfn; pfn < end_pfn; pfn++) {
> > + for (pfn = start_pfn; pfn < end_pfn; ) {
> > /*
> > * There can be holes in boot-time mem_map[]s
> > * handed to this function. They do not
> > * exist on hotplugged memory.
> > */
> > + int pfns = 1;
> > if (context == MEMMAP_EARLY) {
> > - if (!early_pfn_valid(pfn))
> > + if (!early_pfn_valid(pfn)) {
> > + pfn++;
> > continue;
> > - if (!early_pfn_in_nid(pfn, nid))
> > + }
> > + if (!early_pfn_in_nid(pfn, nid)) {
> > + pfn++;
> > continue;
> > + }
> > +
> > + pfns = pfn_range_init_avail(pfn, end_pfn,
> > + PTRS_PER_PMD, nid);
> > }
>
> maybe you could update memmap_init_zone() to only iterate {memblock.memory} -
> {memblock.reserved}, so you do not need to check avail ....
>
> as memmap_init_zone would not need to handle holes to mark them reserved.

Maybe I can change pfn_range_init_avail in such a way that the
__reserve_pages_bootmem() work above is not needed. I will dig into
that some more before the next patch submission.

>
> > +
> > page = pfn_to_page(pfn);
> > __init_single_page(page, zone, nid, 1);
> > +
> > + if (pfns > 1)
> > + SetPageUninitialized2Mib(page);
> > +
> > + pfn += pfns;
> > }
> > }
> >
> > @@ -6196,6 +6302,7 @@ static const struct trace_print_flags pageflag_names[] = {
> > {1UL << PG_owner_priv_1, "owner_priv_1" },
> > {1UL << PG_arch_1, "arch_1" },
> > {1UL << PG_reserved, "reserved" },
> > + {1UL << PG_uninitialized2mib, "Uninit_2MiB" },
>
> PG_uninitialized_2m ?

Done.

> > {1UL << PG_private, "private" },
> > {1UL << PG_private_2, "private_2" },
> > {1UL << PG_writeback, "writeback" },
>
> Yinghai

2013-07-16 10:39:00

by Robin Holt

[permalink] [raw]
Subject: Re: [RFC 4/4] Sparse initialization of struct page array.

On Mon, Jul 15, 2013 at 02:30:37PM -0700, Andrew Morton wrote:
> On Thu, 11 Jul 2013 21:03:55 -0500 Robin Holt <[email protected]> wrote:
>
> > During boot of large memory machines, a significant portion of boot
> > is spent initializing the struct page array. The vast majority of
> > those pages are not referenced during boot.
> >
> > Change this over to only initializing the pages when they are
> > actually allocated.
> >
> > Besides the advantage of boot speed, this allows us the chance to
> > use normal performance monitoring tools to determine where the bulk
> > of time is spent during page initialization.
> >
> > ...
> >
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -1330,8 +1330,19 @@ static inline void __free_reserved_page(struct page *page)
> > __free_page(page);
> > }
> >
> > +extern void __reserve_bootmem_region(phys_addr_t start, phys_addr_t end);
> > +
> > +static inline void __reserve_bootmem_page(struct page *page)
> > +{
> > + phys_addr_t start = page_to_pfn(page) << PAGE_SHIFT;
> > + phys_addr_t end = start + PAGE_SIZE;
> > +
> > + __reserve_bootmem_region(start, end);
> > +}
>
> It isn't obvious that this needed to be inlined?

It is being declared in a header file. All the other functions I came
across in that header file are declared as inline (or __always_inline).
It feels to me like this is right. Can I leave it as-is?

>
> > static inline void free_reserved_page(struct page *page)
> > {
> > + __reserve_bootmem_page(page);
> > __free_reserved_page(page);
> > adjust_managed_page_count(page, 1);
> > }
> > diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
> > index 6d53675..79e8eb7 100644
> > --- a/include/linux/page-flags.h
> > +++ b/include/linux/page-flags.h
> > @@ -83,6 +83,7 @@ enum pageflags {
> > PG_owner_priv_1, /* Owner use. If pagecache, fs may use*/
> > PG_arch_1,
> > PG_reserved,
> > + PG_uninitialized2mib, /* Is this the right spot? ntz - Yes - rmh */
>
> "mib" creeps me out too. And it makes me think of SNMP, which I'd
> prefer not to think about.
>
> We've traditionally had fears of running out of page flags, but I've
> lost track of how close we are to that happening. IIRC the answer
> depends on whether you believe there is such a thing as a 32-bit NUMA
> system.
>
> Can this be avoided anyway? I suspect there's some idiotic combination
> of flags we could use to indicate the state. PG_reserved|PG_lru or
> something.
>
> "2MB" sounds terribly arch-specific. Shouldn't we make it more generic
> for when the hexagon64 port wants to use 4MB?
>
> That conversational code comment was already commented on, but it's
> still there?

I am going to work on making it non-2m based over the course of this
week, so expect the _2m (current name based on Yinghai's comments)
to go away entirely.

> >
> > ...
> >
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -740,6 +740,54 @@ static void __init_single_page(struct page *page, unsigned long zone, int nid, i
> > #endif
> > }
> >
> > +static void expand_page_initialization(struct page *basepage)
> > +{
> > + unsigned long pfn = page_to_pfn(basepage);
> > + unsigned long end_pfn = pfn + PTRS_PER_PMD;
> > + unsigned long zone = page_zonenum(basepage);
> > + int reserved = PageReserved(basepage);
> > + int nid = page_to_nid(basepage);
> > +
> > + ClearPageUninitialized2Mib(basepage);
> > +
> > + for( pfn++; pfn < end_pfn; pfn++ )
> > + __init_single_page(pfn_to_page(pfn), zone, nid, reserved);
> > +}
> > +
> > +void ensure_pages_are_initialized(unsigned long start_pfn,
> > + unsigned long end_pfn)
>
> I think this can be made static. I hope so, as it's a somewhat
> odd-sounding identifier for a global.

Done.

> > +{
> > + unsigned long aligned_start_pfn = start_pfn & ~(PTRS_PER_PMD - 1);
> > + unsigned long aligned_end_pfn;
> > + struct page *page;
> > +
> > + aligned_end_pfn = end_pfn & ~(PTRS_PER_PMD - 1);
> > + aligned_end_pfn += PTRS_PER_PMD;
> > + while (aligned_start_pfn < aligned_end_pfn) {
> > + if (pfn_valid(aligned_start_pfn)) {
> > + page = pfn_to_page(aligned_start_pfn);
> > +
> > + if(PageUninitialized2Mib(page))
>
> checkpatch them, please.

Will certainly do.

> > + expand_page_initialization(page);
> > + }
> > +
> > + aligned_start_pfn += PTRS_PER_PMD;
> > + }
> > +}
>
> Some nice code comments for the above two functions would be helpful.

Will do.

> >
> > ...
> >
> > +int __meminit pfn_range_init_avail(unsigned long pfn, unsigned long end_pfn,
> > + unsigned long size, int nid)
> > +{
> > + unsigned long validate_end_pfn = pfn + size;
> > +
> > + if (pfn & (size - 1))
> > + return 1;
> > +
> > + if (pfn + size >= end_pfn)
> > + return 1;
> > +
> > + while (pfn < validate_end_pfn)
> > + {
> > + if (!early_pfn_valid(pfn))
> > + return 1;
> > + if (!early_pfn_in_nid(pfn, nid))
> > + return 1;
> > + pfn++;
> > + }
> > +
> > + return size;
> > +}
>
> Document it, please. The return value semantics look odd, so don't
> forget to explain all that as well.

Will do. Will also work on the name to make it more clear what we
are returning.
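
For reference, a sketch of what that documentation might say (wording
illustrative only):

	/*
	 * pfn_range_init_avail - check whether the chunk starting at @pfn can
	 * have its struct page initialization deferred.
	 *
	 * Returns @size when @pfn is @size-aligned, [@pfn, @pfn + @size) ends
	 * below @end_pfn, and every pfn in the chunk is valid and on @nid; in
	 * that case only the first struct page needs initializing now.
	 * Returns 1 otherwise, telling the caller to initialize just this one
	 * pfn and move on.
	 */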

> >
> > ...
> >
> > @@ -6196,6 +6302,7 @@ static const struct trace_print_flags pageflag_names[] = {
> > {1UL << PG_owner_priv_1, "owner_priv_1" },
> > {1UL << PG_arch_1, "arch_1" },
> > {1UL << PG_reserved, "reserved" },
> > + {1UL << PG_uninitialized2mib, "Uninit_2MiB" },
>
> It would be better if the name which is visible in procfs matches the
> name in the kernel source code.

Done and will try to maintain the consistency.

> > {1UL << PG_private, "private" },
> > {1UL << PG_private_2, "private_2" },
> > {1UL << PG_writeback, "writeback" },

Robin

2013-07-16 13:03:05

by Sam Ben

[permalink] [raw]
Subject: Re: [RFC 2/4] Have __free_pages_memory() free in larger chunks.

Hi Robin,
On 07/12/2013 10:03 AM, Robin Holt wrote:
> Currently, when free_all_bootmem() calls __free_pages_memory(), the
> number of contiguous pages that __free_pages_memory() passes to the
> buddy allocator is limited to BITS_PER_LONG. In order to be able to

I fail to understand this. Why is the original page number BITS_PER_LONG?

> free only the first page of a 2MiB chunk, we need that to be increased
> to PTRS_PER_PMD.
>
> Signed-off-by: Robin Holt <[email protected]>
> Signed-off-by: Nate Zimmer <[email protected]>
> To: "H. Peter Anvin" <[email protected]>
> To: Ingo Molnar <[email protected]>
> Cc: Linux Kernel <[email protected]>
> Cc: Linux MM <[email protected]>
> Cc: Rob Landley <[email protected]>
> Cc: Mike Travis <[email protected]>
> Cc: Daniel J Blueman <[email protected]>
> Cc: Andrew Morton <[email protected]>
> Cc: Greg KH <[email protected]>
> Cc: Yinghai Lu <[email protected]>
> Cc: Mel Gorman <[email protected]>
> ---
> mm/nobootmem.c | 8 ++++----
> 1 file changed, 4 insertions(+), 4 deletions(-)
>
> diff --git a/mm/nobootmem.c b/mm/nobootmem.c
> index bdd3fa2..3b512ca 100644
> --- a/mm/nobootmem.c
> +++ b/mm/nobootmem.c
> @@ -83,10 +83,10 @@ void __init free_bootmem_late(unsigned long addr, unsigned long size)
> static void __init __free_pages_memory(unsigned long start, unsigned long end)
> {
> unsigned long i, start_aligned, end_aligned;
> - int order = ilog2(BITS_PER_LONG);
> + int order = ilog2(max(BITS_PER_LONG, PTRS_PER_PMD));
>
> - start_aligned = (start + (BITS_PER_LONG - 1)) & ~(BITS_PER_LONG - 1);
> - end_aligned = end & ~(BITS_PER_LONG - 1);
> + start_aligned = (start + ((1UL << order) - 1)) & ~((1UL << order) - 1);
> + end_aligned = end & ~((1UL << order) - 1);
>
> if (end_aligned <= start_aligned) {
> for (i = start; i < end; i++)
> @@ -98,7 +98,7 @@ static void __init __free_pages_memory(unsigned long start, unsigned long end)
> for (i = start; i < start_aligned; i++)
> __free_pages_bootmem(pfn_to_page(i), 0);
>
> - for (i = start_aligned; i < end_aligned; i += BITS_PER_LONG)
> + for (i = start_aligned; i < end_aligned; i += 1 << order)
> __free_pages_bootmem(pfn_to_page(i), order);
>
> for (i = end_aligned; i < end; i++)

2013-07-17 05:18:02

by Sam Ben

[permalink] [raw]
Subject: Re: [RFC 0/4] Transparent on-demand struct page initialization embedded in the buddy allocator

On 07/12/2013 10:03 AM, Robin Holt wrote:
> We have been working on this since we returned from shutdown and have
> something to discuss now. We restricted ourselves to 2MiB initialization
> to keep the patch set a little smaller and more clear.
>
> First, I think I want to propose getting rid of the page flag. If I knew
> of a concrete way to determine that the page has not been initialized,
> this patch series would look different. If there is no definitive
> way to determine that the struct page has been initialized aside from
> checking the entire page struct is zero, then I think I would suggest
> we change the page flag to indicate the page has been initialized.
>
> The heart of the problem as I see it comes from expand(). We nearly
> always see a first reference to a struct page which is in the middle
> of the 2MiB region. Due to that access, the unlikely() check that was
> originally proposed really ends up referencing a different page entirely.
> We actually did not introduce an unlikely and refactor the patches to
> make that unlikely inside a static inline function. Also, given the
> strong warning at the head of expand(), we did not feel experienced
> enough to refactor it to make things always reference the 2MiB page
> first.
>
> With this patch, we did boot a 16TiB machine. Without the patches,
> the v3.10 kernel with the same configuration took 407 seconds for
> free_all_bootmem. With the patches and operating on 2MiB pages instead
> of 1GiB, it took 26 seconds so performance was improved. I have no feel
> for how the 1GiB chunk size will perform.

How do you test how much time is spent in free_all_bootmem?

>
> I am on vacation for the next three days so I am sorry in advance for
> my infrequent or non-existant responses.
>
>
> Signed-off-by: Robin Holt <[email protected]>
> Signed-off-by: Nate Zimmer <[email protected]>
> To: "H. Peter Anvin" <[email protected]>
> To: Ingo Molnar <[email protected]>
> Cc: Linux Kernel <[email protected]>
> Cc: Linux MM <[email protected]>
> Cc: Rob Landley <[email protected]>
> Cc: Mike Travis <[email protected]>
> Cc: Daniel J Blueman <[email protected]>
> Cc: Andrew Morton <[email protected]>
> Cc: Greg KH <[email protected]>
> Cc: Yinghai Lu <[email protected]>
> Cc: Mel Gorman <[email protected]>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/

2013-07-17 09:30:55

by Robin Holt

[permalink] [raw]
Subject: Re: [RFC 0/4] Transparent on-demand struct page initialization embedded in the buddy allocator

On Wed, Jul 17, 2013 at 01:17:44PM +0800, Sam Ben wrote:
> On 07/12/2013 10:03 AM, Robin Holt wrote:
> >We have been working on this since we returned from shutdown and have
> >something to discuss now. We restricted ourselves to 2MiB initialization
> >to keep the patch set a little smaller and more clear.
> >
> >First, I think I want to propose getting rid of the page flag. If I knew
> >of a concrete way to determine that the page has not been initialized,
> >this patch series would look different. If there is no definitive
> >way to determine that the struct page has been initialized aside from
> >checking the entire page struct is zero, then I think I would suggest
> >we change the page flag to indicate the page has been initialized.
> >
> >The heart of the problem as I see it comes from expand(). We nearly
> >always see a first reference to a struct page which is in the middle
> >of the 2MiB region. Due to that access, the unlikely() check that was
> >originally proposed really ends up referencing a different page entirely.
> >We actually did not introduce an unlikely and refactor the patches to
> >make that unlikely inside a static inline function. Also, given the
> >strong warning at the head of expand(), we did not feel experienced
> >enough to refactor it to make things always reference the 2MiB page
> >first.
> >
> >With this patch, we did boot a 16TiB machine. Without the patches,
> >the v3.10 kernel with the same configuration took 407 seconds for
> >free_all_bootmem. With the patches and operating on 2MiB pages instead
> >of 1GiB, it took 26 seconds so performance was improved. I have no feel
> >for how the 1GiB chunk size will perform.
>
> How to test how much time spend on free_all_bootmem?

We had put a pr_emerg at the beginning and end of free_all_bootmem and
then used a modified version of script(1) which records the time in
usecs at the beginning of each line of output.
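
Roughly, the instrumentation looked like this (illustrative
reconstruction, not the exact patch we used):

	unsigned long __init free_all_bootmem(void)
	{
		struct pglist_data *pgdat;
		unsigned long pages;

		pr_emerg("free_all_bootmem() start\n");

		for_each_online_pgdat(pgdat)
			reset_node_lowmem_managed_pages(pgdat);

		pages = free_low_memory_core_early();

		pr_emerg("free_all_bootmem() end\n");
		return pages;
	}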

Robin

>
> >
> >I am on vacation for the next three days so I am sorry in advance for
> >my infrequent or non-existant responses.
> >
> >
> >Signed-off-by: Robin Holt <[email protected]>
> >Signed-off-by: Nate Zimmer <[email protected]>
> >To: "H. Peter Anvin" <[email protected]>
> >To: Ingo Molnar <[email protected]>
> >Cc: Linux Kernel <[email protected]>
> >Cc: Linux MM <[email protected]>
> >Cc: Rob Landley <[email protected]>
> >Cc: Mike Travis <[email protected]>
> >Cc: Daniel J Blueman <[email protected]>
> >Cc: Andrew Morton <[email protected]>
> >Cc: Greg KH <[email protected]>
> >Cc: Yinghai Lu <[email protected]>
> >Cc: Mel Gorman <[email protected]>
> >--
> >To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> >the body of a message to [email protected]
> >More majordomo info at http://vger.kernel.org/majordomo-info.html
> >Please read the FAQ at http://www.tux.org/lkml/

2013-07-19 23:51:52

by Yinghai Lu

[permalink] [raw]
Subject: Re: [RFC 0/4] Transparent on-demand struct page initialization embedded in the buddy allocator

On Wed, Jul 17, 2013 at 2:30 AM, Robin Holt <[email protected]> wrote:
> On Wed, Jul 17, 2013 at 01:17:44PM +0800, Sam Ben wrote:
>> >With this patch, we did boot a 16TiB machine. Without the patches,
>> >the v3.10 kernel with the same configuration took 407 seconds for
>> >free_all_bootmem. With the patches and operating on 2MiB pages instead
>> >of 1GiB, it took 26 seconds so performance was improved. I have no feel
>> >for how the 1GiB chunk size will perform.
>>
>> How to test how much time spend on free_all_bootmem?
>
> We had put a pr_emerg at the beginning and end of free_all_bootmem and
> then used a modified version of script which record the time in uSecs
> at the beginning of each line of output.

Using two patches, I found a 3TiB system will take 100s before slub is ready.

There are about three portions:
1. sparse vmemmap buf allocation: it goes through the bootmem wrapper, so
clearing those struct page areas takes about 30s.
2. memmap_init_zone: takes about 25s.
3. mem_init/free_all_bootmem: about 30s.

So I still wonder why 16TiB would need hours.

Also, your patches look like they only address 2 and 3.

Yinghai


Attachments:
printk_time_tsc_0.patch (2.56 kB)
printk_time_tsc_1.patch (1.17 kB)

2013-07-22 06:13:46

by Robin Holt

[permalink] [raw]
Subject: Re: [RFC 0/4] Transparent on-demand struct page initialization embedded in the buddy allocator

On Fri, Jul 19, 2013 at 04:51:49PM -0700, Yinghai Lu wrote:
> On Wed, Jul 17, 2013 at 2:30 AM, Robin Holt <[email protected]> wrote:
> > On Wed, Jul 17, 2013 at 01:17:44PM +0800, Sam Ben wrote:
> >> >With this patch, we did boot a 16TiB machine. Without the patches,
> >> >the v3.10 kernel with the same configuration took 407 seconds for
> >> >free_all_bootmem. With the patches and operating on 2MiB pages instead
> >> >of 1GiB, it took 26 seconds so performance was improved. I have no feel
> >> >for how the 1GiB chunk size will perform.
> >>
> >> How to test how much time spend on free_all_bootmem?
> >
> > We had put a pr_emerg at the beginning and end of free_all_bootmem and
> > then used a modified version of script which record the time in uSecs
> > at the beginning of each line of output.
>
> used two patches, found 3TiB system will take 100s before slub is ready.
>
> about three portions:
> 1. sparse vmemap buf allocation, it is with bootmem wrapper, so clear those
> struct page area take about 30s.
> 2. memmap_init_zone: take about 25s
> 3. mem_init/free_all_bootmem about 30s.
>
> so still wonder why 16TiB will need hours.

I don't know where you got the figure of hours for memory initialization.
That is likely for a 32TiB boot and includes the entire boot, not just
getting the memory allocator initialized.

For a 16 TiB boot (times in seconds):
1) 344
2) 1151
3) 407

I hope that illustrates why we chose to address memmap_init_zone first,
which had the nice side effect of also reducing the free_all_bootmem
slowdown.

With these patches, those numbers are currently:
1) 344
2) 49
3) 26

> also your patches looks like only address 2 and 3.

Right, but I thought that was the normal way to do things. Address
one thing at a time and work toward a better kernel. I don't see a
relationship between the work we are doing here and the sparse vmemmap
buffer allocation. Have I missed something?

Did you happen to time a boot with these patches applied to see how
long it took and how much impact they had on a smaller config?

Robin

2013-07-23 08:19:06

by Ingo Molnar

[permalink] [raw]
Subject: Re: boot tracing


* Sam Ben <[email protected]> wrote:

> On 07/12/2013 04:53 PM, Ingo Molnar wrote:
> >* Borislav Petkov <[email protected]> wrote:
> >
> >>On Fri, Jul 12, 2013 at 10:27:56AM +0200, Ingo Molnar wrote:
> >>>Robert Richter and Boris Petkov are working on 'persistent events'
> >>>support for perf, which will eventually allow boot time profiling -
> >>>I'm not sure if the patches and the tooling support is ready enough
> >>>yet for your purposes.
> >>Nope, not yet but we're getting there.
> >>
> >>>Robert, Boris, the following workflow would be pretty intuitive:
> >>>
> >>> - kernel developer sets boot flag: perf=boot,freq=1khz,size=16MB
> >>What does perf=boot mean? I assume boot tracing.
> >In this case it would mean boot profiling - i.e. a cycles hardware-PMU
> >event collecting into a perf trace buffer as usual.
> >
> >Essentially a 'perf record -a' work-alike, just one that gets activated as
> >early as practical, and which would allow the profiling of memory
> >initialization.
> >
> >Now, one extra complication here is that to be able to profile buddy
> >allocator this persistent event would have to work before the buddy
> >allocator is active :-/ So this sort of profiling would have to use
> >memblock_alloc().
>
> Could perf=boot be used to sample the performance of the memblock
> subsystem? I think the perf subsystem is initialized too late to
> monitor this.

Yes, that would be a useful facility as well, for things with many events
where printk is not necessarily practical. Any tracepoint could be
utilized.

Thanks,

Ingo

2013-07-23 08:20:47

by Ingo Molnar

[permalink] [raw]
Subject: Re: [RFC 0/4] Transparent on-demand struct page initialization embedded in the buddy allocator


* Borislav Petkov <[email protected]> wrote:

> On Tue, Jul 16, 2013 at 05:55:02PM +0900, Joonsoo Kim wrote:
> > How about executing a perf in usermodehelper and collecting output
> > in tmpfs? Using this approach, we can start a perf after rootfs
> > initialization,
>
> What for if we can start logging to buffers much earlier? *Reading*
> from those buffers can be done much later, at our own leisure with full
> userspace up.

Yeah, agreed, I think this needs to be more integrated into the kernel, so
that people don't have to worry about "when does userspace start up the
earliest" details.

Fundamentally all perf really needs here is some memory to initialize and
buffer into.

Thanks,

Ingo

2013-07-23 08:32:19

by Ingo Molnar

[permalink] [raw]
Subject: Re: [RFC 4/4] Sparse initialization of struct page array.


* H. Peter Anvin <[email protected]> wrote:

> On 07/15/2013 11:26 AM, Robin Holt wrote:
>
> > Is there a fairly cheap way to determine definitively that the struct
> > page is not initialized?
>
> By definition I would assume no. The only way I can think of would be
> to unmap the memory associated with the struct page in the TLB and
> initialize the struct pages at trap time.

But ... the only fastpath impact I can see of delayed initialization right
now is this piece of logic in prep_new_page():

@@ -903,6 +964,10 @@ static int prep_new_page(struct page *page, int order, gfp_t gfp_flags)

 	for (i = 0; i < (1 << order); i++) {
 		struct page *p = page + i;
+
+		if (PageUninitialized2Mib(p))
+			expand_page_initialization(page);
+
 		if (unlikely(check_new_page(p)))
 			return 1;

That is where I think it can be made zero overhead in the
already-initialized case, because page-flags are already used in
check_new_page():

static inline int check_new_page(struct page *page)
{
	if (unlikely(page_mapcount(page) |
		     (page->mapping != NULL) |
		     (atomic_read(&page->_count) != 0) |
		     (page->flags & PAGE_FLAGS_CHECK_AT_PREP) |
		     (mem_cgroup_bad_page_check(page)))) {
		bad_page(page);
		return 1;

see that PAGE_FLAGS_CHECK_AT_PREP flag? That always gets checked for every
struct page on allocation.

We can micro-optimize that low overhead to zero-overhead, by integrating
the PageUninitialized2Mib() check into check_new_page(). This can be done
by adding PG_uninitialized2mib to PAGE_FLAGS_CHECK_AT_PREP and doing:


	if (unlikely(page->flags & PAGE_FLAGS_CHECK_AT_PREP)) {
		if (PageUninitialized2Mib(p))
			expand_page_initialization(page);
		...
	}

	if (unlikely(page_mapcount(page) |
		     (page->mapping != NULL) |
		     (atomic_read(&page->_count) != 0) |
		     (mem_cgroup_bad_page_check(page)))) {
		bad_page(page);

		return 1;

this will result in making it essentially zero-overhead, the
expand_page_initialization() logic is now in a slowpath.

Am I missing anything here?

Thanks,

Ingo

2013-07-23 11:09:51

by Robin Holt

[permalink] [raw]
Subject: Re: [RFC 4/4] Sparse initialization of struct page array.

On Tue, Jul 23, 2013 at 10:32:11AM +0200, Ingo Molnar wrote:
>
> * H. Peter Anvin <[email protected]> wrote:
>
> > On 07/15/2013 11:26 AM, Robin Holt wrote:
> >
> > > Is there a fairly cheap way to determine definitively that the struct
> > > page is not initialized?
> >
> > By definition I would assume no. The only way I can think of would be
> > to unmap the memory associated with the struct page in the TLB and
> > initialize the struct pages at trap time.
>
> But ... the only fastpath impact I can see of delayed initialization right
> now is this piece of logic in prep_new_page():
>
> @@ -903,6 +964,10 @@ static int prep_new_page(struct page *page, int order, gfp_t gfp_flags)
>
> for (i = 0; i < (1 << order); i++) {
> struct page *p = page + i;
> +
> + if (PageUninitialized2Mib(p))
> + expand_page_initialization(page);
> +
> if (unlikely(check_new_page(p)))
> return 1;
>
> That is where I think it can be made zero overhead in the
> already-initialized case, because page-flags are already used in
> check_new_page():

The problem I see here is that the page flags we need to check for the
uninitialized flag are in the "other" page (the page aligned at the
2MiB virtual address), not the page currently being referenced.

Let me try a version of the patch where we set the PG_uninitialized_2m
flag on all pages, including the aligned pages, and see what that does
to performance.
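
Roughly, the memmap_init_zone() change I have in mind looks like this
(untested sketch of the loop body):

	page = pfn_to_page(pfn);
	__init_single_page(page, zone, nid, 1);

	if (pfns > 1) {
		int i;

		/* mark every page in the chunk, not just the aligned head,
		 * so the fast paths can test the page they already hold */
		for (i = 0; i < pfns; i++)
			SetPageUninitialized2Mib(page + i);
	}

	pfn += pfns;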

Robin

>
> static inline int check_new_page(struct page *page)
> {
> if (unlikely(page_mapcount(page) |
> (page->mapping != NULL) |
> (atomic_read(&page->_count) != 0) |
> (page->flags & PAGE_FLAGS_CHECK_AT_PREP) |
> (mem_cgroup_bad_page_check(page)))) {
> bad_page(page);
> return 1;
>
> see that PAGE_FLAGS_CHECK_AT_PREP flag? That always gets checked for every
> struct page on allocation.
>
> We can micro-optimize that low overhead to zero-overhead, by integrating
> the PageUninitialized2Mib() check into check_new_page(). This can be done
> by adding PG_uninitialized2mib to PAGE_FLAGS_CHECK_AT_PREP and doing:
>
>
> if (unlikely(page->flags & PAGE_FLAGS_CHECK_AT_PREP)) {
> if (PageUninitialized2Mib(p))
> expand_page_initialization(page);
> ...
> }
>
> if (unlikely(page_mapcount(page) |
> (page->mapping != NULL) |
> (atomic_read(&page->_count) != 0) |
> (mem_cgroup_bad_page_check(page)))) {
> bad_page(page);
>
> return 1;
>
> this will result in making it essentially zero-overhead, the
> expand_page_initialization() logic is now in a slowpath.
>
> Am I missing anything here?
>
> Thanks,
>
> Ingo

2013-07-23 11:15:51

by Robin Holt

[permalink] [raw]
Subject: Re: [RFC 4/4] Sparse initialization of struct page array.

I think the other critical path which is affected is in expand().
There, we just call ensure_page_is_initialized() blindly, which does
the check against the other page. The change below is a nearly
zero-cost addition. Sorry for the confusion. My morning coffee has not
kicked in yet.

Robin

On Tue, Jul 23, 2013 at 06:09:47AM -0500, Robin Holt wrote:
> On Tue, Jul 23, 2013 at 10:32:11AM +0200, Ingo Molnar wrote:
> >
> > * H. Peter Anvin <[email protected]> wrote:
> >
> > > On 07/15/2013 11:26 AM, Robin Holt wrote:
> > >
> > > > Is there a fairly cheap way to determine definitively that the struct
> > > > page is not initialized?
> > >
> > > By definition I would assume no. The only way I can think of would be
> > > to unmap the memory associated with the struct page in the TLB and
> > > initialize the struct pages at trap time.
> >
> > But ... the only fastpath impact I can see of delayed initialization right
> > now is this piece of logic in prep_new_page():
> >
> > @@ -903,6 +964,10 @@ static int prep_new_page(struct page *page, int order, gfp_t gfp_flags)
> >
> > for (i = 0; i < (1 << order); i++) {
> > struct page *p = page + i;
> > +
> > + if (PageUninitialized2Mib(p))
> > + expand_page_initialization(page);
> > +
> > if (unlikely(check_new_page(p)))
> > return 1;
> >
> > That is where I think it can be made zero overhead in the
> > already-initialized case, because page-flags are already used in
> > check_new_page():
>
> The problem I see here is that the page flags we need to check for the
> uninitialized flag are in the "other" page for the page aligned at the
> 2MiB virtual address, not the page currently being referenced.
>
> Let me try a version of the patch where we set the PG_unintialized_2m
> flag on all pages, including the aligned pages and see what that does
> to performance.
>
> Robin
>
> >
> > static inline int check_new_page(struct page *page)
> > {
> > if (unlikely(page_mapcount(page) |
> > (page->mapping != NULL) |
> > (atomic_read(&page->_count) != 0) |
> > (page->flags & PAGE_FLAGS_CHECK_AT_PREP) |
> > (mem_cgroup_bad_page_check(page)))) {
> > bad_page(page);
> > return 1;
> >
> > see that PAGE_FLAGS_CHECK_AT_PREP flag? That always gets checked for every
> > struct page on allocation.
> >
> > We can micro-optimize that low overhead to zero-overhead, by integrating
> > the PageUninitialized2Mib() check into check_new_page(). This can be done
> > by adding PG_uninitialized2mib to PAGE_FLAGS_CHECK_AT_PREP and doing:
> >
> >
> > if (unlikely(page->flags & PAGE_FLAGS_CHECK_AT_PREP)) {
> > if (PageUninitialized2Mib(p))
> > expand_page_initialization(page);
> > ...
> > }
> >
> > if (unlikely(page_mapcount(page) |
> > (page->mapping != NULL) |
> > (atomic_read(&page->_count) != 0) |
> > (mem_cgroup_bad_page_check(page)))) {
> > bad_page(page);
> >
> > return 1;
> >
> > this will result in making it essentially zero-overhead, the
> > expand_page_initialization() logic is now in a slowpath.
> >
> > Am I missing anything here?
> >
> > Thanks,
> >
> > Ingo

2013-07-23 11:41:54

by Robin Holt

[permalink] [raw]
Subject: Re: [RFC 4/4] Sparse initialization of struct page array.

On Tue, Jul 23, 2013 at 06:15:49AM -0500, Robin Holt wrote:
> I think the other critical path which is affected is in expand().
> There, we just call ensure_page_is_initialized() blindly which does
> the check against the other page. The below is a nearly zero addition.
> Sorry for the confusion. My morning coffee has not kicked in yet.

I don't have access to the 16TiB system until Thursday unless the other
testing on it fails early. I did boot a 2TiB system with a change
which sets the Uninitialized_2m flag on all pages in each 2MiB range
during memmap_init_zone. That makes the expand check test against
the referenced page instead of having to go back to the 2MiB page.
It appears to have added less than a second to the 2TiB boot, so I hope
it has equally little impact on the 16TiB boot.
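
Roughly, that experiment amounts to something like the following in
memmap_init_zone() (a hypothetical sketch, not the actual change, using
the v1 flag name):

	if (pfns > 1) {
		unsigned long i;

		/* tag every page of the 2MiB block, not just the head,
		 * so a caller can test the page it already holds */
		for (i = 0; i < pfns; i++)
			SetPageUninitialized2Mib(pfn_to_page(pfn + i));
	}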

I will clean up this patch some more and resend the currently untested
set later today.

Thanks,
Robin

>
> Robin
>
> On Tue, Jul 23, 2013 at 06:09:47AM -0500, Robin Holt wrote:
> > On Tue, Jul 23, 2013 at 10:32:11AM +0200, Ingo Molnar wrote:
> > >
> > > * H. Peter Anvin <[email protected]> wrote:
> > >
> > > > On 07/15/2013 11:26 AM, Robin Holt wrote:
> > > >
> > > > > Is there a fairly cheap way to determine definitively that the struct
> > > > > page is not initialized?
> > > >
> > > > By definition I would assume no. The only way I can think of would be
> > > > to unmap the memory associated with the struct page in the TLB and
> > > > initialize the struct pages at trap time.
> > >
> > > But ... the only fastpath impact I can see of delayed initialization right
> > > now is this piece of logic in prep_new_page():
> > >
> > > @@ -903,6 +964,10 @@ static int prep_new_page(struct page *page, int order, gfp_t gfp_flags)
> > >
> > > for (i = 0; i < (1 << order); i++) {
> > > struct page *p = page + i;
> > > +
> > > + if (PageUninitialized2Mib(p))
> > > + expand_page_initialization(page);
> > > +
> > > if (unlikely(check_new_page(p)))
> > > return 1;
> > >
> > > That is where I think it can be made zero overhead in the
> > > already-initialized case, because page-flags are already used in
> > > check_new_page():
> >
> > The problem I see here is that the page flags we need to check for the
> > uninitialized flag are in the "other" page for the page aligned at the
> > 2MiB virtual address, not the page currently being referenced.
> >
> > Let me try a version of the patch where we set the PG_unintialized_2m
> > flag on all pages, including the aligned pages and see what that does
> > to performance.
> >
> > Robin
> >
> > >
> > > static inline int check_new_page(struct page *page)
> > > {
> > > if (unlikely(page_mapcount(page) |
> > > (page->mapping != NULL) |
> > > (atomic_read(&page->_count) != 0) |
> > > (page->flags & PAGE_FLAGS_CHECK_AT_PREP) |
> > > (mem_cgroup_bad_page_check(page)))) {
> > > bad_page(page);
> > > return 1;
> > >
> > > see that PAGE_FLAGS_CHECK_AT_PREP flag? That always gets checked for every
> > > struct page on allocation.
> > >
> > > We can micro-optimize that low overhead to zero-overhead, by integrating
> > > the PageUninitialized2Mib() check into check_new_page(). This can be done
> > > by adding PG_uninitialized2mib to PAGE_FLAGS_CHECK_AT_PREP and doing:
> > >
> > >
> > > if (unlikely(page->flags & PAGE_FLAGS_CHECK_AT_PREP)) {
> > > if (PageUninitialized2Mib(p))
> > > expand_page_initialization(page);
> > > ...
> > > }
> > >
> > > if (unlikely(page_mapcount(page) |
> > > (page->mapping != NULL) |
> > > (atomic_read(&page->_count) != 0) |
> > > (mem_cgroup_bad_page_check(page)))) {
> > > bad_page(page);
> > >
> > > return 1;
> > >
> > > this will result in making it essentially zero-overhead, the
> > > expand_page_initialization() logic is now in a slowpath.
> > >
> > > Am I missing anything here?
> > >
> > > Thanks,
> > >
> > > Ingo

2013-07-23 11:50:25

by Robin Holt

[permalink] [raw]
Subject: Re: [RFC 4/4] Sparse initialization of struct page array.

On Tue, Jul 23, 2013 at 06:41:50AM -0500, Robin Holt wrote:
> On Tue, Jul 23, 2013 at 06:15:49AM -0500, Robin Holt wrote:
> > I think the other critical path which is affected is in expand().
> > There, we just call ensure_page_is_initialized() blindly which does
> > the check against the other page. The below is a nearly zero addition.
> > Sorry for the confusion. My morning coffee has not kicked in yet.
>
> I don't have access to the 16TiB system until Thursday unless the other
> testing on it fails early. I did boot a 2TiB system with the a change
> which set the Unitialized_2m flag on all pages in that 2MiB range
> during memmap_init_zone. That makes the expand check test against
> the referenced page instead of having to go back to the 2MiB page.
> It appears to have added less than a second to the 2TiB boot so I hope
> it has equally little impact to the 16TiB boot.

I was wrong. One of the two logs I looked at was the wrong one.
Setting that Uninitialized_2m flag on all pages added 30 seconds to the
2TiB boot's memmap_init_zone(). Please disregard.

That brings me back to the belief that we need a better solution for
the expand() path.

Robin

>
> I will clean up this patch some more and resend the currently untested
> set later today.
>
> Thanks,
> Robin
>
> >
> > Robin
> >
> > On Tue, Jul 23, 2013 at 06:09:47AM -0500, Robin Holt wrote:
> > > On Tue, Jul 23, 2013 at 10:32:11AM +0200, Ingo Molnar wrote:
> > > >
> > > > * H. Peter Anvin <[email protected]> wrote:
> > > >
> > > > > On 07/15/2013 11:26 AM, Robin Holt wrote:
> > > > >
> > > > > > Is there a fairly cheap way to determine definitively that the struct
> > > > > > page is not initialized?
> > > > >
> > > > > By definition I would assume no. The only way I can think of would be
> > > > > to unmap the memory associated with the struct page in the TLB and
> > > > > initialize the struct pages at trap time.
> > > >
> > > > But ... the only fastpath impact I can see of delayed initialization right
> > > > now is this piece of logic in prep_new_page():
> > > >
> > > > @@ -903,6 +964,10 @@ static int prep_new_page(struct page *page, int order, gfp_t gfp_flags)
> > > >
> > > > for (i = 0; i < (1 << order); i++) {
> > > > struct page *p = page + i;
> > > > +
> > > > + if (PageUninitialized2Mib(p))
> > > > + expand_page_initialization(page);
> > > > +
> > > > if (unlikely(check_new_page(p)))
> > > > return 1;
> > > >
> > > > That is where I think it can be made zero overhead in the
> > > > already-initialized case, because page-flags are already used in
> > > > check_new_page():
> > >
> > > The problem I see here is that the page flags we need to check for the
> > > uninitialized flag are in the "other" page for the page aligned at the
> > > 2MiB virtual address, not the page currently being referenced.
> > >
> > > Let me try a version of the patch where we set the PG_unintialized_2m
> > > flag on all pages, including the aligned pages and see what that does
> > > to performance.
> > >
> > > Robin
> > >
> > > >
> > > > static inline int check_new_page(struct page *page)
> > > > {
> > > > if (unlikely(page_mapcount(page) |
> > > > (page->mapping != NULL) |
> > > > (atomic_read(&page->_count) != 0) |
> > > > (page->flags & PAGE_FLAGS_CHECK_AT_PREP) |
> > > > (mem_cgroup_bad_page_check(page)))) {
> > > > bad_page(page);
> > > > return 1;
> > > >
> > > > see that PAGE_FLAGS_CHECK_AT_PREP flag? That always gets checked for every
> > > > struct page on allocation.
> > > >
> > > > We can micro-optimize that low overhead to zero-overhead, by integrating
> > > > the PageUninitialized2Mib() check into check_new_page(). This can be done
> > > > by adding PG_uninitialized2mib to PAGE_FLAGS_CHECK_AT_PREP and doing:
> > > >
> > > >
> > > > if (unlikely(page->flags & PAGE_FLAGS_CHECK_AT_PREP)) {
> > > > if (PageUninitialized2Mib(p))
> > > > expand_page_initialization(page);
> > > > ...
> > > > }
> > > >
> > > > if (unlikely(page_mapcount(page) |
> > > > (page->mapping != NULL) |
> > > > (atomic_read(&page->_count) != 0) |
> > > > (mem_cgroup_bad_page_check(page)))) {
> > > > bad_page(page);
> > > >
> > > > return 1;
> > > >
> > > > this will result in making it essentially zero-overhead, the
> > > > expand_page_initialization() logic is now in a slowpath.
> > > >
> > > > Am I missing anything here?
> > > >
> > > > Thanks,
> > > >
> > > > Ingo

2013-07-23 15:33:14

by Johannes Weiner

[permalink] [raw]
Subject: Re: [RFC 2/4] Have __free_pages_memory() free in larger chunks.

On Tue, Jul 16, 2013 at 09:02:53PM +0800, Sam Ben wrote:
> Hi Robin,
> On 07/12/2013 10:03 AM, Robin Holt wrote:
> >Currently, when free_all_bootmem() calls __free_pages_memory(), the
> >number of contiguous pages that __free_pages_memory() passes to the
> >buddy allocator is limited to BITS_PER_LONG. In order to be able to
>
> I fail to understand this. Why the original page number is BITS_PER_LONG?

The mm/bootmem.c implementation uses a bitmap to keep track of
free/reserved pages. It walks that bitmap in BITS_PER_LONG steps
because that is the biggest chunk for which it is still trivial and
cheap to check that all pages in it are free (chunk == ~0UL).
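
A minimal sketch of that check (illustrative only, not the actual
mm/bootmem.c code; it assumes a bitmap indexed from pfn 0 in which a
set bit means the page is free, and the variable names are made up):

	unsigned long i;

	for (i = start_aligned; i < end_aligned; i += BITS_PER_LONG) {
		unsigned long vec = bitmap[i / BITS_PER_LONG];

		if (vec == ~0UL) {
			/* all BITS_PER_LONG pages free: hand the whole
			 * chunk to the buddy allocator in one call */
			__free_pages_bootmem(pfn_to_page(i),
					     ilog2(BITS_PER_LONG));
		} else {
			/* otherwise fall back to page-by-page checks */
		}
	}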

nobootmem.c was written based on the bootmem.c interface, so it was
probably adapted to keep things similar between the two, short of a
pressing reason not to.

2013-07-25 02:25:50

by Robin Holt

[permalink] [raw]
Subject: Re: [RFC 4/4] Sparse initialization of struct page array.

On Fri, Jul 12, 2013 at 09:19:12PM -0700, Yinghai Lu wrote:
> On Thu, Jul 11, 2013 at 7:03 PM, Robin Holt <[email protected]> wrote:
> > During boot of large memory machines, a significant portion of boot
> > is spent initializing the struct page array. The vast majority of
> > those pages are not referenced during boot.
> >
> > Change this over to only initializing the pages when they are
> > actually allocated.
> >
> > Besides the advantage of boot speed, this allows us the chance to
> > use normal performance monitoring tools to determine where the bulk
> > of time is spent during page initialization.
> >
> > Signed-off-by: Robin Holt <[email protected]>
> > Signed-off-by: Nate Zimmer <[email protected]>
> > To: "H. Peter Anvin" <[email protected]>
> > To: Ingo Molnar <[email protected]>
> > Cc: Linux Kernel <[email protected]>
> > Cc: Linux MM <[email protected]>
> > Cc: Rob Landley <[email protected]>
> > Cc: Mike Travis <[email protected]>
> > Cc: Daniel J Blueman <[email protected]>
> > Cc: Andrew Morton <[email protected]>
> > Cc: Greg KH <[email protected]>
> > Cc: Yinghai Lu <[email protected]>
> > Cc: Mel Gorman <[email protected]>
> > ---
> > include/linux/mm.h | 11 +++++
> > include/linux/page-flags.h | 5 +-
> > mm/nobootmem.c | 5 ++
> > mm/page_alloc.c | 117 +++++++++++++++++++++++++++++++++++++++++++--
> > 4 files changed, 132 insertions(+), 6 deletions(-)
> >
> > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > index e0c8528..3de08b5 100644
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -1330,8 +1330,19 @@ static inline void __free_reserved_page(struct page *page)
> > __free_page(page);
> > }
> >
> > +extern void __reserve_bootmem_region(phys_addr_t start, phys_addr_t end);
> > +
> > +static inline void __reserve_bootmem_page(struct page *page)
> > +{
> > + phys_addr_t start = page_to_pfn(page) << PAGE_SHIFT;
> > + phys_addr_t end = start + PAGE_SIZE;
> > +
> > + __reserve_bootmem_region(start, end);
> > +}
> > +
> > static inline void free_reserved_page(struct page *page)
> > {
> > + __reserve_bootmem_page(page);
> > __free_reserved_page(page);
> > adjust_managed_page_count(page, 1);
> > }
> > diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
> > index 6d53675..79e8eb7 100644
> > --- a/include/linux/page-flags.h
> > +++ b/include/linux/page-flags.h
> > @@ -83,6 +83,7 @@ enum pageflags {
> > PG_owner_priv_1, /* Owner use. If pagecache, fs may use*/
> > PG_arch_1,
> > PG_reserved,
> > + PG_uninitialized2mib, /* Is this the right spot? ntz - Yes - rmh */
> > PG_private, /* If pagecache, has fs-private data */
> > PG_private_2, /* If pagecache, has fs aux data */
> > PG_writeback, /* Page is under writeback */
> > @@ -211,6 +212,8 @@ PAGEFLAG(SwapBacked, swapbacked) __CLEARPAGEFLAG(SwapBacked, swapbacked)
> >
> > __PAGEFLAG(SlobFree, slob_free)
> >
> > +PAGEFLAG(Uninitialized2Mib, uninitialized2mib)
> > +
> > /*
> > * Private page markings that may be used by the filesystem that owns the page
> > * for its own purposes.
> > @@ -499,7 +502,7 @@ static inline void ClearPageSlabPfmemalloc(struct page *page)
> > #define PAGE_FLAGS_CHECK_AT_FREE \
> > (1 << PG_lru | 1 << PG_locked | \
> > 1 << PG_private | 1 << PG_private_2 | \
> > - 1 << PG_writeback | 1 << PG_reserved | \
> > + 1 << PG_writeback | 1 << PG_reserved | 1 << PG_uninitialized2mib | \
> > 1 << PG_slab | 1 << PG_swapcache | 1 << PG_active | \
> > 1 << PG_unevictable | __PG_MLOCKED | __PG_HWPOISON | \
> > __PG_COMPOUND_LOCK)
> > diff --git a/mm/nobootmem.c b/mm/nobootmem.c
> > index 3b512ca..e3a386d 100644
> > --- a/mm/nobootmem.c
> > +++ b/mm/nobootmem.c
> > @@ -126,6 +126,9 @@ static unsigned long __init free_low_memory_core_early(void)
> > phys_addr_t start, end, size;
> > u64 i;
> >
> > + for_each_reserved_mem_region(i, &start, &end)
> > + __reserve_bootmem_region(start, end);
> > +
>
> How about holes that is not in memblock.reserved?
>
> before this patch:
> free_area_init_node/free_area_init_core/memmap_init_zone
> will mark all page in node range to Reserved in struct page, at first.
>
> but those holes is not mapped via kernel low mapping.
> so it should be ok not touch "struct page" for them.
>
> Now you only mark reserved for memblock.reserved at first, and later
> mark {memblock.memory} - { memblock.reserved} to be available.
> And that is ok.
>
> but should split that change to another patch and add some comment
> and change log for the change.
> in case there is some user like UEFI etc do weird thing.

Nate and I talked this over today. Sorry for the delay, but it was the
first time we were both free. Neither of us quite understands what you
are asking for here. My interpretation is that you would like us to
change the use of the PageReserved flag such that during boot, we do not
set the flag at all from memmap_init_zone, and then only set it on pages
which, at the time of free_all_bootmem, have been allocated/reserved in
the memblock allocator. Is that correct? I will start to work that up
on the assumption that is what you are asking for.

Robin

>
> > for_each_free_mem_range(i, MAX_NUMNODES, &start, &end, NULL)
> > count += __free_memory_core(start, end);
> >
> > @@ -162,6 +165,8 @@ unsigned long __init free_all_bootmem(void)
> > {
> > struct pglist_data *pgdat;
> >
> > + memblock_dump_all();
> > +
>
> Not needed.
>
> > for_each_online_pgdat(pgdat)
> > reset_node_lowmem_managed_pages(pgdat);
> >
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 635b131..fe51eb9 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -740,6 +740,54 @@ static void __init_single_page(struct page *page, unsigned long zone, int nid, i
> > #endif
> > }
> >
> > +static void expand_page_initialization(struct page *basepage)
> > +{
> > + unsigned long pfn = page_to_pfn(basepage);
> > + unsigned long end_pfn = pfn + PTRS_PER_PMD;
> > + unsigned long zone = page_zonenum(basepage);
> > + int reserved = PageReserved(basepage);
> > + int nid = page_to_nid(basepage);
> > +
> > + ClearPageUninitialized2Mib(basepage);
> > +
> > + for( pfn++; pfn < end_pfn; pfn++ )
> > + __init_single_page(pfn_to_page(pfn), zone, nid, reserved);
> > +}
> > +
> > +void ensure_pages_are_initialized(unsigned long start_pfn,
> > + unsigned long end_pfn)
> > +{
> > + unsigned long aligned_start_pfn = start_pfn & ~(PTRS_PER_PMD - 1);
> > + unsigned long aligned_end_pfn;
> > + struct page *page;
> > +
> > + aligned_end_pfn = end_pfn & ~(PTRS_PER_PMD - 1);
> > + aligned_end_pfn += PTRS_PER_PMD;
> > + while (aligned_start_pfn < aligned_end_pfn) {
> > + if (pfn_valid(aligned_start_pfn)) {
> > + page = pfn_to_page(aligned_start_pfn);
> > +
> > + if(PageUninitialized2Mib(page))
> > + expand_page_initialization(page);
> > + }
> > +
> > + aligned_start_pfn += PTRS_PER_PMD;
> > + }
> > +}
> > +
> > +void __reserve_bootmem_region(phys_addr_t start, phys_addr_t end)
> > +{
> > + unsigned long start_pfn = PFN_DOWN(start);
> > + unsigned long end_pfn = PFN_UP(end);
> > +
> > + ensure_pages_are_initialized(start_pfn, end_pfn);
> > +}
>
> that name is confusing, actually it is setting to struct page to Reserved only.
> maybe __reserve_pages_bootmem() to be aligned to free_pages_bootmem ?
>
> > +
> > +static inline void ensure_page_is_initialized(struct page *page)
> > +{
> > + __reserve_bootmem_page(page);
> > +}
>
> how about use __reserve_page_bootmem directly and add comment in callers site ?
>
> > +
> > static bool free_pages_prepare(struct page *page, unsigned int order)
> > {
> > int i;
> > @@ -751,7 +799,10 @@ static bool free_pages_prepare(struct page *page, unsigned int order)
> > if (PageAnon(page))
> > page->mapping = NULL;
> > for (i = 0; i < (1 << order); i++)
> > - bad += free_pages_check(page + i);
> > + if (PageUninitialized2Mib(page + i))
> > + i += PTRS_PER_PMD - 1;
> > + else
> > + bad += free_pages_check(page + i);
> > if (bad)
> > return false;
> >
> > @@ -795,13 +846,22 @@ void __meminit __free_pages_bootmem(struct page *page, unsigned int order)
> > unsigned int loop;
> >
> > prefetchw(page);
> > - for (loop = 0; loop < nr_pages; loop++) {
> > + for (loop = 0; loop < nr_pages; ) {
> > struct page *p = &page[loop];
> >
> > if (loop + 1 < nr_pages)
> > prefetchw(p + 1);
> > +
> > + if ((PageUninitialized2Mib(p)) &&
> > + ((loop + PTRS_PER_PMD) > nr_pages))
> > + ensure_page_is_initialized(p);
> > +
> > __ClearPageReserved(p);
> > set_page_count(p, 0);
> > + if (PageUninitialized2Mib(p))
> > + loop += PTRS_PER_PMD;
> > + else
> > + loop += 1;
> > }
> >
> > page_zone(page)->managed_pages += 1 << order;
> > @@ -856,6 +916,7 @@ static inline void expand(struct zone *zone, struct page *page,
> > area--;
> > high--;
> > size >>= 1;
> > + ensure_page_is_initialized(page);
> > VM_BUG_ON(bad_range(zone, &page[size]));
> >
> > #ifdef CONFIG_DEBUG_PAGEALLOC
> > @@ -903,6 +964,10 @@ static int prep_new_page(struct page *page, int order, gfp_t gfp_flags)
> >
> > for (i = 0; i < (1 << order); i++) {
> > struct page *p = page + i;
> > +
> > + if (PageUninitialized2Mib(p))
> > + expand_page_initialization(page);
> > +
> > if (unlikely(check_new_page(p)))
> > return 1;
> > }
> > @@ -985,6 +1050,7 @@ int move_freepages(struct zone *zone,
> > unsigned long order;
> > int pages_moved = 0;
> >
> > + ensure_pages_are_initialized(page_to_pfn(start_page), page_to_pfn(end_page));
> > #ifndef CONFIG_HOLES_IN_ZONE
> > /*
> > * page_zone is not safe to call in this context when
> > @@ -3859,6 +3925,9 @@ static int pageblock_is_reserved(unsigned long start_pfn, unsigned long end_pfn)
> > for (pfn = start_pfn; pfn < end_pfn; pfn++) {
> > if (!pfn_valid_within(pfn) || PageReserved(pfn_to_page(pfn)))
> > return 1;
> > +
> > + if (PageUninitialized2Mib(pfn_to_page(pfn)))
> > + pfn += PTRS_PER_PMD;
> > }
> > return 0;
> > }
> > @@ -3947,6 +4016,29 @@ static void setup_zone_migrate_reserve(struct zone *zone)
> > }
> > }
> >
> > +int __meminit pfn_range_init_avail(unsigned long pfn, unsigned long end_pfn,
> > + unsigned long size, int nid)
> why not use static ? it seems there is not outside user.
> > +{
> > + unsigned long validate_end_pfn = pfn + size;
> > +
> > + if (pfn & (size - 1))
> > + return 1;
> > +
> > + if (pfn + size >= end_pfn)
> > + return 1;
> > +
> > + while (pfn < validate_end_pfn)
> > + {
> > + if (!early_pfn_valid(pfn))
> > + return 1;
> > + if (!early_pfn_in_nid(pfn, nid))
> > + return 1;
> > + pfn++;
> > + }
> > +
> > + return size;
> > +}
> > +
> > /*
> > * Initially all pages are reserved - free ones are freed
> > * up by free_all_bootmem() once the early boot process is
> > @@ -3964,20 +4056,34 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
> > highest_memmap_pfn = end_pfn - 1;
> >
> > z = &NODE_DATA(nid)->node_zones[zone];
> > - for (pfn = start_pfn; pfn < end_pfn; pfn++) {
> > + for (pfn = start_pfn; pfn < end_pfn; ) {
> > /*
> > * There can be holes in boot-time mem_map[]s
> > * handed to this function. They do not
> > * exist on hotplugged memory.
> > */
> > + int pfns = 1;
> > if (context == MEMMAP_EARLY) {
> > - if (!early_pfn_valid(pfn))
> > + if (!early_pfn_valid(pfn)) {
> > + pfn++;
> > continue;
> > - if (!early_pfn_in_nid(pfn, nid))
> > + }
> > + if (!early_pfn_in_nid(pfn, nid)) {
> > + pfn++;
> > continue;
> > + }
> > +
> > + pfns = pfn_range_init_avail(pfn, end_pfn,
> > + PTRS_PER_PMD, nid);
> > }
>
> maybe could update memmap_init_zone() only iterate {memblock.memory} -
> {memblock.reserved}, so you do need to check avail ....
>
> as memmap_init_zone do not need to handle holes to mark reserve for them.
>
> > +
> > page = pfn_to_page(pfn);
> > __init_single_page(page, zone, nid, 1);
> > +
> > + if (pfns > 1)
> > + SetPageUninitialized2Mib(page);
> > +
> > + pfn += pfns;
> > }
> > }
> >
> > @@ -6196,6 +6302,7 @@ static const struct trace_print_flags pageflag_names[] = {
> > {1UL << PG_owner_priv_1, "owner_priv_1" },
> > {1UL << PG_arch_1, "arch_1" },
> > {1UL << PG_reserved, "reserved" },
> > + {1UL << PG_uninitialized2mib, "Uninit_2MiB" },
>
> PG_uninitialized_2m ?
>
> > {1UL << PG_private, "private" },
> > {1UL << PG_private_2, "private_2" },
> > {1UL << PG_writeback, "writeback" },
>
> Yinghai

2013-07-25 12:50:59

by Yinghai Lu

[permalink] [raw]
Subject: Re: [RFC 4/4] Sparse initialization of struct page array.

On Wed, Jul 24, 2013 at 7:25 PM, Robin Holt <[email protected]> wrote:
>>
>> How about holes that is not in memblock.reserved?
>>
>> before this patch:
>> free_area_init_node/free_area_init_core/memmap_init_zone
>> will mark all page in node range to Reserved in struct page, at first.
>>
>> but those holes is not mapped via kernel low mapping.
>> so it should be ok not touch "struct page" for them.
>>
>> Now you only mark reserved for memblock.reserved at first, and later
>> mark {memblock.memory} - { memblock.reserved} to be available.
>> And that is ok.
>>
>> but should split that change to another patch and add some comment
>> and change log for the change.
>> in case there is some user like UEFI etc do weird thing.
>
> Nate and I talked this over today. Sorry for the delay, but it was the
> first time we were both free. Neither of us quite understands what you
> are asking for here. My interpretation is that you would like us to
> change the use of the PageReserved flag such that during boot, we do not
> set the flag at all from memmap_init_zone, and then only set it on pages
> which, at the time of free_all_bootmem, have been allocated/reserved in
> the memblock allocator. Is that correct? I will start to work that up
> on the assumption that is what you are asking for.

Not exactly.

Your change should be right, but there is a subtle change in how holes
are handled.

Before, holes between the memory ranges in memblock.memory still got
struct pages, and they were initialized and marked Reserved in
memmap_init_zone. Those holes are not in memblock.reserved, so with
your patches their struct pages will still be all zero.

Please separate the change that sets pages Reserved according to
memblock.reserved into another patch.

Thanks

Yinghai

2013-07-25 13:42:31

by Robin Holt

[permalink] [raw]
Subject: Re: [RFC 4/4] Sparse initialization of struct page array.

On Thu, Jul 25, 2013 at 05:50:57AM -0700, Yinghai Lu wrote:
> On Wed, Jul 24, 2013 at 7:25 PM, Robin Holt <[email protected]> wrote:
> >>
> >> How about holes that is not in memblock.reserved?
> >>
> >> before this patch:
> >> free_area_init_node/free_area_init_core/memmap_init_zone
> >> will mark all page in node range to Reserved in struct page, at first.
> >>
> >> but those holes is not mapped via kernel low mapping.
> >> so it should be ok not touch "struct page" for them.
> >>
> >> Now you only mark reserved for memblock.reserved at first, and later
> >> mark {memblock.memory} - { memblock.reserved} to be available.
> >> And that is ok.
> >>
> >> but should split that change to another patch and add some comment
> >> and change log for the change.
> >> in case there is some user like UEFI etc do weird thing.
> >
> > Nate and I talked this over today. Sorry for the delay, but it was the
> > first time we were both free. Neither of us quite understands what you
> > are asking for here. My interpretation is that you would like us to
> > change the use of the PageReserved flag such that during boot, we do not
> > set the flag at all from memmap_init_zone, and then only set it on pages
> > which, at the time of free_all_bootmem, have been allocated/reserved in
> > the memblock allocator. Is that correct? I will start to work that up
> > on the assumption that is what you are asking for.
>
> Not exactly.
>
> your change should be right, but there is some subtle change about
> holes handling.
>
> before mem holes between memory ranges in memblock.memory, get struct page,
> and initialized with to Reserved in memmap_init_zone.
> Those holes is not in memory.reserved, with your patches, those hole's
> struct page
> will still have all 0.
>
> Please separate change about set page to reserved according to memory.reserved
> to another patch.


Just want to make sure this is where you want me to go. Here is my
currently untested patch. Is that what you were expecting to have done?
One thing I don't like about this patch is that it seems to slow down
boot in my simulator environment. I think I am going to look at
restructuring things a bit to see if I can eliminate that performance
penalty. Otherwise, I think I am following your directions.

Thanks,
Robin Holt

>From bdd2fefa59af18f283af6f066bc644ddfa5c7da8 Mon Sep 17 00:00:00 2001
From: Robin Holt <[email protected]>
Date: Thu, 25 Jul 2013 04:25:15 -0500
Subject: [RFC -v2-pre2 4/5] ZZZ Only SetPageReserved() on memblock reserved
pages.

---
include/linux/mm.h | 2 ++
mm/nobootmem.c | 3 +++
mm/page_alloc.c | 18 +++++++++++-------
3 files changed, 16 insertions(+), 7 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index e0c8528..b264a26 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1322,6 +1322,8 @@ static inline void adjust_managed_page_count(struct page *page, long count)
totalram_pages += count;
}

+extern void reserve_bootmem_region(unsigned long start, unsigned long end);
+
/* Free the reserved page into the buddy system, so it gets managed. */
static inline void __free_reserved_page(struct page *page)
{
diff --git a/mm/nobootmem.c b/mm/nobootmem.c
index 2159e68..0840af2 100644
--- a/mm/nobootmem.c
+++ b/mm/nobootmem.c
@@ -117,6 +117,9 @@ static unsigned long __init free_low_memory_core_early(void)
phys_addr_t start, end, size;
u64 i;

+ for_each_reserved_mem_region(i, &start, &end)
+ reserve_bootmem_region(start, end);
+
for_each_free_mem_range(i, MAX_NUMNODES, &start, &end, NULL)
count += __free_memory_core(start, end);

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 048e166..3aa30b7 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -698,7 +698,7 @@ static void free_one_page(struct zone *zone, struct page *page, int order,
}

static void __init_single_page(unsigned long pfn, unsigned long zone,
- int nid, int reserved)
+ int nid, int page_count)
{
struct page *page = pfn_to_page(pfn);
struct zone *z = &NODE_DATA(nid)->node_zones[zone];
@@ -708,12 +708,9 @@ static void __init_single_page(unsigned long pfn, unsigned long zone,
init_page_count(page);
page_mapcount_reset(page);
page_nid_reset_last(page);
- if (reserved) {
- SetPageReserved(page);
- } else {
- ClearPageReserved(page);
- set_page_count(page, 0);
- }
+ ClearPageReserved(page);
+ set_page_count(page, page_count);
+
/*
* Mark the block movable so that blocks are reserved for
* movable at startup. This will force kernel allocations
@@ -741,6 +738,13 @@ static void __init_single_page(unsigned long pfn, unsigned long zone,
#endif
}

+void reserve_bootmem_region(unsigned long start, unsigned long end)
+{
+ for (; start < end; start++)
+ if (pfn_valid(start))
+ SetPageReserved(pfn_to_page(start));
+}
+
static bool free_pages_prepare(struct page *page, unsigned int order)
{
int i;
--
1.8.2.1

2013-07-25 13:52:20

by Yinghai Lu

[permalink] [raw]
Subject: Re: [RFC 4/4] Sparse initialization of struct page array.

On Thu, Jul 25, 2013 at 6:42 AM, Robin Holt <[email protected]> wrote:
> On Thu, Jul 25, 2013 at 05:50:57AM -0700, Yinghai Lu wrote:
>> On Wed, Jul 24, 2013 at 7:25 PM, Robin Holt <[email protected]> wrote:
>> >>
>> >> How about holes that is not in memblock.reserved?
>> >>
>> >> before this patch:
>> >> free_area_init_node/free_area_init_core/memmap_init_zone
>> >> will mark all page in node range to Reserved in struct page, at first.
>> >>
>> >> but those holes is not mapped via kernel low mapping.
>> >> so it should be ok not touch "struct page" for them.
>> >>
>> >> Now you only mark reserved for memblock.reserved at first, and later
>> >> mark {memblock.memory} - { memblock.reserved} to be available.
>> >> And that is ok.
>> >>
>> >> but should split that change to another patch and add some comment
>> >> and change log for the change.
>> >> in case there is some user like UEFI etc do weird thing.
>> >
>> > Nate and I talked this over today. Sorry for the delay, but it was the
>> > first time we were both free. Neither of us quite understands what you
>> > are asking for here. My interpretation is that you would like us to
>> > change the use of the PageReserved flag such that during boot, we do not
>> > set the flag at all from memmap_init_zone, and then only set it on pages
>> > which, at the time of free_all_bootmem, have been allocated/reserved in
>> > the memblock allocator. Is that correct? I will start to work that up
>> > on the assumption that is what you are asking for.
>>
>> Not exactly.
>>
>> your change should be right, but there is some subtle change about
>> holes handling.
>>
>> before mem holes between memory ranges in memblock.memory, get struct page,
>> and initialized with to Reserved in memmap_init_zone.
>> Those holes is not in memory.reserved, with your patches, those hole's
>> struct page
>> will still have all 0.
>>
>> Please separate change about set page to reserved according to memory.reserved
>> to another patch.
>
>
> Just want to make sure this is where you want me to go. Here is my
> currently untested patch. Is that what you were expecting to have done?
> One thing I don't like about this patch is it seems to slow down boot in
> my simulator environment. I think I am going to look at restructuring
> things a bit to see if I can eliminate that performance penalty.
> Otherwise, I think I am following your directions.

>
> From bdd2fefa59af18f283af6f066bc644ddfa5c7da8 Mon Sep 17 00:00:00 2001
> From: Robin Holt <[email protected]>
> Date: Thu, 25 Jul 2013 04:25:15 -0500
> Subject: [RFC -v2-pre2 4/5] ZZZ Only SegPageReserved() on memblock reserved
> pages.

yes.

>
> ---
> include/linux/mm.h | 2 ++
> mm/nobootmem.c | 3 +++
> mm/page_alloc.c | 18 +++++++++++-------
> 3 files changed, 16 insertions(+), 7 deletions(-)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index e0c8528..b264a26 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1322,6 +1322,8 @@ static inline void adjust_managed_page_count(struct page *page, long count)
> totalram_pages += count;
> }
>
> +extern void reserve_bootmem_region(unsigned long start, unsigned long end);
> +
> /* Free the reserved page into the buddy system, so it gets managed. */
> static inline void __free_reserved_page(struct page *page)
> {
> diff --git a/mm/nobootmem.c b/mm/nobootmem.c
> index 2159e68..0840af2 100644
> --- a/mm/nobootmem.c
> +++ b/mm/nobootmem.c
> @@ -117,6 +117,9 @@ static unsigned long __init free_low_memory_core_early(void)
> phys_addr_t start, end, size;
> u64 i;
>
> + for_each_reserved_mem_region(i, &start, &end)
> + reserve_bootmem_region(start, end);
> +
> for_each_free_mem_range(i, MAX_NUMNODES, &start, &end, NULL)
> count += __free_memory_core(start, end);
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 048e166..3aa30b7 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -698,7 +698,7 @@ static void free_one_page(struct zone *zone, struct page *page, int order,
> }
>
> static void __init_single_page(unsigned long pfn, unsigned long zone,
> - int nid, int reserved)
> + int nid, int page_count)
> {
> struct page *page = pfn_to_page(pfn);
> struct zone *z = &NODE_DATA(nid)->node_zones[zone];
> @@ -708,12 +708,9 @@ static void __init_single_page(unsigned long pfn, unsigned long zone,
> init_page_count(page);
> page_mapcount_reset(page);
> page_nid_reset_last(page);
> - if (reserved) {
> - SetPageReserved(page);
> - } else {
> - ClearPageReserved(page);
> - set_page_count(page, 0);
> - }
> + ClearPageReserved(page);
> + set_page_count(page, page_count);
> +
> /*
> * Mark the block movable so that blocks are reserved for
> * movable at startup. This will force kernel allocations
> @@ -741,6 +738,13 @@ static void __init_single_page(unsigned long pfn, unsigned long zone,
> #endif
> }
>
> +void reserve_bootmem_region(unsigned long start, unsigned long end)
> +{
> + for (; start < end; start++)
> + if (pfn_valid(start))
> + SetPageReserved(pfn_to_page(start));
> +}
> +
> static bool free_pages_prepare(struct page *page, unsigned int order)
> {
> int i;
> --
> 1.8.2.1
>

2013-08-02 17:44:44

by Nathan Zimmer

[permalink] [raw]
Subject: [RFC v2 1/5] memblock: Introduce a for_each_reserved_mem_region iterator.

From: Robin Holt <[email protected]>

As part of initializing struct page's in 2MiB chunks, we noticed that
at the end of free_all_bootmem(), there was nothing which had forced
the reserved/allocated 4KiB pages to be initialized.

This helper function will be used for that expansion.

Signed-off-by: Robin Holt <[email protected]>
Signed-off-by: Nathan Zimmer <[email protected]>
To: "H. Peter Anvin" <[email protected]>
To: Ingo Molnar <[email protected]>
Cc: Linux Kernel <[email protected]>
Cc: Linux MM <[email protected]>
Cc: Rob Landley <[email protected]>
Cc: Mike Travis <[email protected]>
Cc: Daniel J Blueman <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Greg KH <[email protected]>
Cc: Yinghai Lu <[email protected]>
Cc: Mel Gorman <[email protected]>
---
include/linux/memblock.h | 18 ++++++++++++++++++
mm/memblock.c | 32 ++++++++++++++++++++++++++++++++
2 files changed, 50 insertions(+)

diff --git a/include/linux/memblock.h b/include/linux/memblock.h
index f388203..e99bbd1 100644
--- a/include/linux/memblock.h
+++ b/include/linux/memblock.h
@@ -118,6 +118,24 @@ void __next_free_mem_range_rev(u64 *idx, int nid, phys_addr_t *out_start,
i != (u64)ULLONG_MAX; \
__next_free_mem_range_rev(&i, nid, p_start, p_end, p_nid))

+void __next_reserved_mem_region(u64 *idx, phys_addr_t *out_start,
+ phys_addr_t *out_end);
+
+/**
+ * for_each_reserved_mem_region - iterate over all reserved memblock areas
+ * @i: u64 used as loop variable
+ * @p_start: ptr to phys_addr_t for start address of the range, can be %NULL
+ * @p_end: ptr to phys_addr_t for end address of the range, can be %NULL
+ *
+ * Walks over reserved areas of memblock. Available as soon as memblock
+ * is initialized.
+ */
+#define for_each_reserved_mem_region(i, p_start, p_end) \
+ for (i = 0UL, \
+ __next_reserved_mem_region(&i, p_start, p_end); \
+ i != (u64)ULLONG_MAX; \
+ __next_reserved_mem_region(&i, p_start, p_end))
+
#ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
int memblock_set_node(phys_addr_t base, phys_addr_t size, int nid);

diff --git a/mm/memblock.c b/mm/memblock.c
index c5fad93..0d7d6e7 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -564,6 +564,38 @@ int __init_memblock memblock_reserve(phys_addr_t base, phys_addr_t size)
}

/**
+ * __next_reserved_mem_region - next function for for_each_reserved_mem_region()
+ * @idx: pointer to u64 loop variable
+ * @out_start: ptr to phys_addr_t for start address of the region, can be %NULL
+ * @out_end: ptr to phys_addr_t for end address of the region, can be %NULL
+ *
+ * Iterate over all reserved memory regions.
+ */
+void __init_memblock __next_reserved_mem_region(u64 *idx,
+ phys_addr_t *out_start,
+ phys_addr_t *out_end)
+{
+ struct memblock_type *rsv = &memblock.reserved;
+
+ if (*idx >= 0 && *idx < rsv->cnt) {
+ struct memblock_region *r = &rsv->regions[*idx];
+ phys_addr_t base = r->base;
+ phys_addr_t size = r->size;
+
+ if (out_start)
+ *out_start = base;
+ if (out_end)
+ *out_end = base + size - 1;
+
+ *idx += 1;
+ return;
+ }
+
+ /* signal end of iteration */
+ *idx = ULLONG_MAX;
+}
+
+/**
* __next_free_mem_range - next function for for_each_free_mem_range()
* @idx: pointer to u64 loop variable
* @nid: nid: node selector, %MAX_NUMNODES for all nodes
--
1.8.2.1
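
A minimal usage sketch for the new iterator (a hypothetical caller, not
part of the patch), printing every reserved range:

	static void __init dump_reserved_regions(void)	/* hypothetical */
	{
		phys_addr_t start, end;
		u64 i;

		for_each_reserved_mem_region(i, &start, &end)
			pr_info("memblock reserved: [%pa..%pa]\n",
				&start, &end);
	}

The series itself uses the iterator in free_low_memory_core_early() to
call reserve_bootmem_region() on each reserved range before the free
ranges are handed to the buddy allocator.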

2013-08-02 17:44:45

by Nathan Zimmer

[permalink] [raw]
Subject: [RFC v2 2/5] Have __free_pages_memory() free in larger chunks.

From: Robin Holt <[email protected]>

Currently, when free_all_bootmem() calls __free_pages_memory(), the
number of contiguous pages that __free_pages_memory() passes to the
buddy allocator is limited to BITS_PER_LONG. In order to be able to
free only the first page of a 2MiB chunk, we need that limit to be
increased. We are increasing it to the maximum size available.

Signed-off-by: Robin Holt <[email protected]>
Signed-off-by: Nathan Zimmer <[email protected]>
To: "H. Peter Anvin" <[email protected]>
To: Ingo Molnar <[email protected]>
Cc: Linux Kernel <[email protected]>
Cc: Linux MM <[email protected]>
Cc: Rob Landley <[email protected]>
Cc: Mike Travis <[email protected]>
Cc: Daniel J Blueman <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Greg KH <[email protected]>
Cc: Yinghai Lu <[email protected]>
Cc: Mel Gorman <[email protected]>
---
mm/nobootmem.c | 25 ++++++++-----------------
1 file changed, 8 insertions(+), 17 deletions(-)

diff --git a/mm/nobootmem.c b/mm/nobootmem.c
index bdd3fa2..2159e68 100644
--- a/mm/nobootmem.c
+++ b/mm/nobootmem.c
@@ -82,27 +82,18 @@ void __init free_bootmem_late(unsigned long addr, unsigned long size)

static void __init __free_pages_memory(unsigned long start, unsigned long end)
{
- unsigned long i, start_aligned, end_aligned;
- int order = ilog2(BITS_PER_LONG);
+ int order;

- start_aligned = (start + (BITS_PER_LONG - 1)) & ~(BITS_PER_LONG - 1);
- end_aligned = end & ~(BITS_PER_LONG - 1);
+ while (start < end) {
+ order = min(MAX_ORDER - 1, __ffs(start));

- if (end_aligned <= start_aligned) {
- for (i = start; i < end; i++)
- __free_pages_bootmem(pfn_to_page(i), 0);
+ while (start + (1UL << order) > end)
+ order--;

- return;
- }
-
- for (i = start; i < start_aligned; i++)
- __free_pages_bootmem(pfn_to_page(i), 0);
+ __free_pages_bootmem(pfn_to_page(start), order);

- for (i = start_aligned; i < end_aligned; i += BITS_PER_LONG)
- __free_pages_bootmem(pfn_to_page(i), order);
-
- for (i = end_aligned; i < end; i++)
- __free_pages_bootmem(pfn_to_page(i), 0);
+ start += (1UL << order);
+ }
}

static unsigned long __init __free_memory_core(phys_addr_t start,
--
1.8.2.1
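
To see how the new loop carves up a pfn range, here is a small
standalone sketch of the same order-selection logic (user-space, with
made-up pfn values; __builtin_ctzl() plays the role of __ffs()):

	#include <stdio.h>

	#define MAX_ORDER 11	/* default kernel value */

	int main(void)
	{
		unsigned long start = 5, end = 23;	/* hypothetical pfns */

		while (start < end) {
			/* largest order allowed by the alignment of start,
			 * capped at MAX_ORDER - 1 */
			int order = start ? __builtin_ctzl(start) : MAX_ORDER - 1;

			if (order > MAX_ORDER - 1)
				order = MAX_ORDER - 1;
			/* shrink until the block does not run past end */
			while (start + (1UL << order) > end)
				order--;

			printf("free pfn %lu as order %d (%lu pages)\n",
			       start, order, 1UL << order);
			start += 1UL << order;
		}
		return 0;
	}

This prints order 0, 1, 3, 2, 1, 0 blocks for pfns 5..22, i.e. the
biggest naturally aligned blocks that fit in the range.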

2013-08-02 17:44:42

by Nathan Zimmer

[permalink] [raw]
Subject: [RFC v2 4/5] Only set page reserved in the memblock region

Currently, each page struct is set as reserved when it is initialized.
This changes to starting with the reserved bit clear and then only
setting the bit for pages in the memblock reserved regions.

I could restructure a bit to eliminate the performance hit, but I wanted
to make sure I am on track first.

Signed-off-by: Robin Holt <[email protected]>
Signed-off-by: Nathan Zimmer <[email protected]>
To: "H. Peter Anvin" <[email protected]>
To: Ingo Molnar <[email protected]>
Cc: Linux Kernel <[email protected]>
Cc: Linux MM <[email protected]>
Cc: Rob Landley <[email protected]>
Cc: Mike Travis <[email protected]>
Cc: Daniel J Blueman <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Greg KH <[email protected]>
Cc: Yinghai Lu <[email protected]>
Cc: Mel Gorman <[email protected]>
---
include/linux/mm.h | 2 ++
mm/nobootmem.c | 3 +++
mm/page_alloc.c | 16 ++++++++++++----
3 files changed, 17 insertions(+), 4 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index e0c8528..b264a26 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1322,6 +1322,8 @@ static inline void adjust_managed_page_count(struct page *page, long count)
totalram_pages += count;
}

+extern void reserve_bootmem_region(unsigned long start, unsigned long end);
+
/* Free the reserved page into the buddy system, so it gets managed. */
static inline void __free_reserved_page(struct page *page)
{
diff --git a/mm/nobootmem.c b/mm/nobootmem.c
index 2159e68..0840af2 100644
--- a/mm/nobootmem.c
+++ b/mm/nobootmem.c
@@ -117,6 +117,9 @@ static unsigned long __init free_low_memory_core_early(void)
phys_addr_t start, end, size;
u64 i;

+ for_each_reserved_mem_region(i, &start, &end)
+ reserve_bootmem_region(start, end);
+
for_each_free_mem_range(i, MAX_NUMNODES, &start, &end, NULL)
count += __free_memory_core(start, end);

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index df3ec13..382223e 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -697,17 +697,18 @@ static void free_one_page(struct zone *zone, struct page *page, int order,
spin_unlock(&zone->lock);
}

-static void __init_single_page(unsigned long pfn, unsigned long zone, int nid)
+static void __init_single_page(unsigned long pfn, unsigned long zone,
+ int nid, int page_count)
{
struct page *page = pfn_to_page(pfn);
struct zone *z = &NODE_DATA(nid)->node_zones[zone];

set_page_links(page, zone, nid, pfn);
mminit_verify_page_links(page, zone, nid, pfn);
- init_page_count(page);
page_mapcount_reset(page);
page_nid_reset_last(page);
- SetPageReserved(page);
+ set_page_count(page, page_count);
+ ClearPageReserved(page);

/*
* Mark the block movable so that blocks are reserved for
@@ -736,6 +737,13 @@ static void __init_single_page(unsigned long pfn, unsigned long zone, int nid)
#endif
}

+void reserve_bootmem_region(unsigned long start, unsigned long end)
+{
+ for (; start < end; start++)
+ if (pfn_valid(start))
+ SetPageReserved(pfn_to_page(start));
+}
+
static bool free_pages_prepare(struct page *page, unsigned int order)
{
int i;
@@ -4010,7 +4018,7 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
if (!early_pfn_in_nid(pfn, nid))
continue;
}
- __init_single_page(pfn, zone, nid);
+ __init_single_page(pfn, zone, nid, 1);
}
}

--
1.8.2.1

2013-08-02 17:46:05

by Nathan Zimmer

[permalink] [raw]
Subject: [RFC v2 3/5] Move page initialization into a separate function.

From: Robin Holt <[email protected]>

Currently, memmap_init_zone() has all the smarts for initializing a
single page. When we convert to initializing pages in a 2MiB chunk,
we will need to do this equivalent work from two separate places
so we are breaking out a helper function.

Signed-off-by: Robin Holt <[email protected]>
Signed-off-by: Nathan Zimmer <[email protected]>
To: "H. Peter Anvin" <[email protected]>
To: Ingo Molnar <[email protected]>
Cc: Linux Kernel <[email protected]>
Cc: Linux MM <[email protected]>
Cc: Rob Landley <[email protected]>
Cc: Mike Travis <[email protected]>
Cc: Daniel J Blueman <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Greg KH <[email protected]>
Cc: Yinghai Lu <[email protected]>
Cc: Mel Gorman <[email protected]>
---
mm/mm_init.c | 2 +-
mm/page_alloc.c | 73 +++++++++++++++++++++++++++++++--------------------------
2 files changed, 41 insertions(+), 34 deletions(-)

diff --git a/mm/mm_init.c b/mm/mm_init.c
index c280a02..be8a539 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -128,7 +128,7 @@ void __init mminit_verify_pageflags_layout(void)
BUG_ON(or_mask != add_mask);
}

-void __meminit mminit_verify_page_links(struct page *page, enum zone_type zone,
+void mminit_verify_page_links(struct page *page, enum zone_type zone,
unsigned long nid, unsigned long pfn)
{
BUG_ON(page_to_nid(page) != nid);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 5adf81e..df3ec13 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -697,6 +697,45 @@ static void free_one_page(struct zone *zone, struct page *page, int order,
spin_unlock(&zone->lock);
}

+static void __init_single_page(unsigned long pfn, unsigned long zone, int nid)
+{
+ struct page *page = pfn_to_page(pfn);
+ struct zone *z = &NODE_DATA(nid)->node_zones[zone];
+
+ set_page_links(page, zone, nid, pfn);
+ mminit_verify_page_links(page, zone, nid, pfn);
+ init_page_count(page);
+ page_mapcount_reset(page);
+ page_nid_reset_last(page);
+ SetPageReserved(page);
+
+ /*
+ * Mark the block movable so that blocks are reserved for
+ * movable at startup. This will force kernel allocations
+ * to reserve their blocks rather than leaking throughout
+ * the address space during boot when many long-lived
+ * kernel allocations are made. Later some blocks near
+ * the start are marked MIGRATE_RESERVE by
+ * setup_zone_migrate_reserve()
+ *
+ * bitmap is created for zone's valid pfn range. but memmap
+ * can be created for invalid pages (for alignment)
+ * check here not to call set_pageblock_migratetype() against
+ * pfn out of zone.
+ */
+ if ((z->zone_start_pfn <= pfn)
+ && (pfn < zone_end_pfn(z))
+ && !(pfn & (pageblock_nr_pages - 1)))
+ set_pageblock_migratetype(page, MIGRATE_MOVABLE);
+
+ INIT_LIST_HEAD(&page->lru);
+#ifdef WANT_PAGE_VIRTUAL
+ /* The shift won't overflow because ZONE_NORMAL is below 4G. */
+ if (!is_highmem_idx(zone))
+ set_page_address(page, __va(pfn << PAGE_SHIFT));
+#endif
+}
+
static bool free_pages_prepare(struct page *page, unsigned int order)
{
int i;
@@ -3951,7 +3990,6 @@ static void setup_zone_migrate_reserve(struct zone *zone)
void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
unsigned long start_pfn, enum memmap_context context)
{
- struct page *page;
unsigned long end_pfn = start_pfn + size;
unsigned long pfn;
struct zone *z;
@@ -3972,38 +4010,7 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
if (!early_pfn_in_nid(pfn, nid))
continue;
}
- page = pfn_to_page(pfn);
- set_page_links(page, zone, nid, pfn);
- mminit_verify_page_links(page, zone, nid, pfn);
- init_page_count(page);
- page_mapcount_reset(page);
- page_nid_reset_last(page);
- SetPageReserved(page);
- /*
- * Mark the block movable so that blocks are reserved for
- * movable at startup. This will force kernel allocations
- * to reserve their blocks rather than leaking throughout
- * the address space during boot when many long-lived
- * kernel allocations are made. Later some blocks near
- * the start are marked MIGRATE_RESERVE by
- * setup_zone_migrate_reserve()
- *
- * bitmap is created for zone's valid pfn range. but memmap
- * can be created for invalid pages (for alignment)
- * check here not to call set_pageblock_migratetype() against
- * pfn out of zone.
- */
- if ((z->zone_start_pfn <= pfn)
- && (pfn < zone_end_pfn(z))
- && !(pfn & (pageblock_nr_pages - 1)))
- set_pageblock_migratetype(page, MIGRATE_MOVABLE);
-
- INIT_LIST_HEAD(&page->lru);
-#ifdef WANT_PAGE_VIRTUAL
- /* The shift won't overflow because ZONE_NORMAL is below 4G. */
- if (!is_highmem_idx(zone))
- set_page_address(page, __va(pfn << PAGE_SHIFT));
-#endif
+ __init_single_page(pfn, zone, nid);
}
}

--
1.8.2.1

2013-08-02 17:46:03

by Nathan Zimmer

[permalink] [raw]
Subject: [RFC v2 0/5] Transparent on-demand struct page initialization embedded in the buddy allocator

We are still restricting ourselves to 2MiB initialization to keep the
patch set a little smaller and more clear.

We are still struggling with expand(). Nearly always, the first reference
is to a struct page in the middle of the 2MiB region, and we were unable to
find a good solution. Also, given the strong warning at the head of expand(),
we did not feel experienced enough to refactor it to make things always
reference the 2MiB page first.
The only other fastpath impact left is the expansion in prep_new_page.

With this patch set, we did boot a 16TiB machine.
The two main areas that benefit from this patch set are free_all_bootmem and
memmap_init_zone. Without the patches they took 407 seconds and 1151 seconds
respectively. With the patches they took 220 and 49 seconds respectively.
This is a total savings of 1289 seconds (about 21 minutes).
These times were acquired using a modified version of script which records
the time in usecs at the beginning of each line of output.

The previous patch set was faster through free_all_bootmem, but I wanted to
include Yinghai's suggestion. Hopefully I didn't miss the mark too much with
that patch, and yes, I do still need to optimize it.

I know there are still some rough parts, but I wanted to check in with the
full patch set.

Nathan Zimmer (1):
Only set page reserved in the memblock region

Robin Holt (4):
memblock: Introduce a for_each_reserved_mem_region iterator.
Have __free_pages_memory() free in larger chunks.
Move page initialization into a separate function.
Sparse initialization of struct page array.

include/linux/memblock.h | 18 +++++
include/linux/mm.h | 2 +
include/linux/page-flags.h | 5 +-
mm/memblock.c | 32 ++++++++
mm/mm_init.c | 2 +-
mm/nobootmem.c | 28 +++----
mm/page_alloc.c | 194 ++++++++++++++++++++++++++++++++++++---------
7 files changed, 225 insertions(+), 56 deletions(-)

--
1.8.2.1

2013-08-02 17:46:01

by Nathan Zimmer

[permalink] [raw]
Subject: [RFC v2 5/5] Sparse initialization of struct page array.

From: Robin Holt <[email protected]>

During boot of large memory machines, a significant portion of boot
is spent initializing the struct page array. The vast majority of
those pages are not referenced during boot.

Change this over to only initializing the pages when they are
actually allocated.

Besides the advantage of boot speed, this allows us the chance to
use normal performance monitoring tools to determine where the bulk
of time is spent during page initialization.

Signed-off-by: Robin Holt <[email protected]>
Signed-off-by: Nathan Zimmer <[email protected]>
To: "H. Peter Anvin" <[email protected]>
To: Ingo Molnar <[email protected]>
Cc: Linux Kernel <[email protected]>
Cc: Linux MM <[email protected]>
Cc: Rob Landley <[email protected]>
Cc: Mike Travis <[email protected]>
Cc: Daniel J Blueman <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Greg KH <[email protected]>
Cc: Yinghai Lu <[email protected]>
Cc: Mel Gorman <[email protected]>
---
include/linux/page-flags.h | 5 +-
mm/page_alloc.c | 120 +++++++++++++++++++++++++++++++++++++++++++--
2 files changed, 119 insertions(+), 6 deletions(-)

diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 6d53675..d592065 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -83,6 +83,7 @@ enum pageflags {
PG_owner_priv_1, /* Owner use. If pagecache, fs may use*/
PG_arch_1,
PG_reserved,
+ PG_uninitialized_2m,
PG_private, /* If pagecache, has fs-private data */
PG_private_2, /* If pagecache, has fs aux data */
PG_writeback, /* Page is under writeback */
@@ -211,6 +212,8 @@ PAGEFLAG(SwapBacked, swapbacked) __CLEARPAGEFLAG(SwapBacked, swapbacked)

__PAGEFLAG(SlobFree, slob_free)

+PAGEFLAG(Uninitialized2m, uninitialized_2m)
+
/*
* Private page markings that may be used by the filesystem that owns the page
* for its own purposes.
@@ -499,7 +502,7 @@ static inline void ClearPageSlabPfmemalloc(struct page *page)
#define PAGE_FLAGS_CHECK_AT_FREE \
(1 << PG_lru | 1 << PG_locked | \
1 << PG_private | 1 << PG_private_2 | \
- 1 << PG_writeback | 1 << PG_reserved | \
+ 1 << PG_writeback | 1 << PG_reserved | 1 << PG_uninitialized_2m | \
1 << PG_slab | 1 << PG_swapcache | 1 << PG_active | \
1 << PG_unevictable | __PG_MLOCKED | __PG_HWPOISON | \
__PG_COMPOUND_LOCK)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 382223e..c2fd03a0c 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -737,8 +737,53 @@ static void __init_single_page(unsigned long pfn, unsigned long zone,
#endif
}

+static void expand_page_initialization(struct page *basepage)
+{
+ unsigned long pfn = page_to_pfn(basepage);
+ unsigned long end_pfn = pfn + PTRS_PER_PMD;
+ unsigned long zone = page_zonenum(basepage);
+ int count = page_count(basepage);
+ int nid = page_to_nid(basepage);
+
+ ClearPageUninitialized2m(basepage);
+
+ for (pfn++; pfn < end_pfn; pfn++)
+ __init_single_page(pfn, zone, nid, count);
+}
+
+static void ensure_pages_are_initialized(unsigned long start_pfn,
+ unsigned long end_pfn)
+{
+ unsigned long aligned_start_pfn = start_pfn & ~(PTRS_PER_PMD - 1);
+ unsigned long aligned_end_pfn;
+ struct page *page;
+
+ aligned_end_pfn = end_pfn & ~(PTRS_PER_PMD - 1);
+ aligned_end_pfn += PTRS_PER_PMD;
+ while (aligned_start_pfn < aligned_end_pfn) {
+ if (pfn_valid(aligned_start_pfn)) {
+ page = pfn_to_page(aligned_start_pfn);
+
+ if (PageUninitialized2m(page))
+ expand_page_initialization(page);
+ }
+
+ aligned_start_pfn += PTRS_PER_PMD;
+ }
+}
+
+static inline void ensure_page_is_initialized(struct page *page)
+{
+ ensure_pages_are_initialized(page_to_pfn(page), page_to_pfn(page));
+}
+
void reserve_bootmem_region(unsigned long start, unsigned long end)
{
+ unsigned long start_pfn = PFN_DOWN(start);
+ unsigned long end_pfn = PFN_UP(end);
+
+ ensure_pages_are_initialized(start_pfn, end_pfn);
+
for (; start < end; start++)
if (pfn_valid(start))
SetPageReserved(pfn_to_page(start));
@@ -755,7 +800,10 @@ static bool free_pages_prepare(struct page *page, unsigned int order)
if (PageAnon(page))
page->mapping = NULL;
for (i = 0; i < (1 << order); i++)
- bad += free_pages_check(page + i);
+ if (PageUninitialized2m(page + i))
+ i += PTRS_PER_PMD - 1;
+ else
+ bad += free_pages_check(page + i);
if (bad)
return false;

@@ -799,13 +847,22 @@ void __meminit __free_pages_bootmem(struct page *page, unsigned int order)
unsigned int loop;

prefetchw(page);
- for (loop = 0; loop < nr_pages; loop++) {
+ for (loop = 0; loop < nr_pages; ) {
struct page *p = &page[loop];

if (loop + 1 < nr_pages)
prefetchw(p + 1);
+
+ if ((PageUninitialized2m(p)) &&
+ ((loop + PTRS_PER_PMD) > nr_pages))
+ ensure_page_is_initialized(p);
+
__ClearPageReserved(p);
set_page_count(p, 0);
+ if (PageUninitialized2m(p))
+ loop += PTRS_PER_PMD;
+ else
+ loop += 1;
}

page_zone(page)->managed_pages += 1 << order;
@@ -860,6 +917,7 @@ static inline void expand(struct zone *zone, struct page *page,
area--;
high--;
size >>= 1;
+ ensure_page_is_initialized(page);
VM_BUG_ON(bad_range(zone, &page[size]));

#ifdef CONFIG_DEBUG_PAGEALLOC
@@ -907,6 +965,10 @@ static int prep_new_page(struct page *page, int order, gfp_t gfp_flags)

for (i = 0; i < (1 << order); i++) {
struct page *p = page + i;
+
+ if (PageUninitialized2m(p))
+ expand_page_initialization(page);
+
if (unlikely(check_new_page(p)))
return 1;
}
@@ -989,6 +1051,8 @@ int move_freepages(struct zone *zone,
unsigned long order;
int pages_moved = 0;

+ ensure_pages_are_initialized(page_to_pfn(start_page),
+ page_to_pfn(end_page));
#ifndef CONFIG_HOLES_IN_ZONE
/*
* page_zone is not safe to call in this context when
@@ -3902,6 +3966,9 @@ static int pageblock_is_reserved(unsigned long start_pfn, unsigned long end_pfn)
for (pfn = start_pfn; pfn < end_pfn; pfn++) {
if (!pfn_valid_within(pfn) || PageReserved(pfn_to_page(pfn)))
return 1;
+
+ if (PageUninitialized2m(pfn_to_page(pfn)))
+ pfn += PTRS_PER_PMD;
}
return 0;
}
@@ -3991,6 +4058,34 @@ static void setup_zone_migrate_reserve(struct zone *zone)
}

/*
+ * This function tells us whether a naturally aligned block of 'size' pfns
+ * is available, available meaning valid and on the specified node.
+ * It returns size if that many pfns are available, 1 otherwise.
+ */
+static int __meminit pfn_range_init_avail(unsigned long pfn,
+ unsigned long end_pfn,
+ unsigned long size, int nid)
+{
+ unsigned long validate_end_pfn = pfn + size;
+
+ if (pfn & (size - 1))
+ return 1;
+
+ if (pfn + size >= end_pfn)
+ return 1;
+
+ while (pfn < validate_end_pfn) {
+ if (!early_pfn_valid(pfn))
+ return 1;
+ if (!early_pfn_in_nid(pfn, nid))
+ return 1;
+ pfn++;
+ }
+
+ return size;
+}
+
+/*
* Initially all pages are reserved - free ones are freed
* up by free_all_bootmem() once the early boot process is
* done. Non-atomic initialization, single-pass.
@@ -4006,19 +4101,33 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
highest_memmap_pfn = end_pfn - 1;

z = &NODE_DATA(nid)->node_zones[zone];
- for (pfn = start_pfn; pfn < end_pfn; pfn++) {
+ for (pfn = start_pfn; pfn < end_pfn; ) {
/*
* There can be holes in boot-time mem_map[]s
* handed to this function. They do not
* exist on hotplugged memory.
*/
+ int pfns = 1;
if (context == MEMMAP_EARLY) {
- if (!early_pfn_valid(pfn))
+ if (!early_pfn_valid(pfn)) {
+ pfn++;
continue;
- if (!early_pfn_in_nid(pfn, nid))
+ }
+ if (!early_pfn_in_nid(pfn, nid)) {
+ pfn++;
continue;
+ }
+
+ pfns = pfn_range_init_avail(pfn, end_pfn,
+ PTRS_PER_PMD, nid);
}
+
__init_single_page(pfn, zone, nid, 1);
+
+ if (pfns > 1)
+ SetPageUninitialized2m(pfn_to_page(pfn));
+
+ pfn += pfns;
}
}

@@ -6237,6 +6346,7 @@ static const struct trace_print_flags pageflag_names[] = {
{1UL << PG_owner_priv_1, "owner_priv_1" },
{1UL << PG_arch_1, "arch_1" },
{1UL << PG_reserved, "reserved" },
+ {1UL << PG_uninitialized_2m, "uninitialized_2m" },
{1UL << PG_private, "private" },
{1UL << PG_private_2, "private_2" },
{1UL << PG_writeback, "writeback" },
--
1.8.2.1

2013-08-03 20:04:58

by Nathan Zimmer

[permalink] [raw]
Subject: Re: [RFC v2 4/5] Only set page reserved in the memblock region

On Fri, Aug 02, 2013 at 12:44:26PM -0500, Nathan Zimmer wrote:
> Currently, each page struct is set as reserved upon initialization.
> This changes to starting with the reserved bit clear and then only
> setting the bit in the reserved region.
>
> I could restructure a bit to eliminate the performance hit. But I wanted
> to make sure I am on track first.
>
> Signed-off-by: Robin Holt <[email protected]>
> Signed-off-by: Nathan Zimmer <[email protected]>
> To: "H. Peter Anvin" <[email protected]>
> To: Ingo Molnar <[email protected]>
> Cc: Linux Kernel <[email protected]>
> Cc: Linux MM <[email protected]>
> Cc: Rob Landley <[email protected]>
> Cc: Mike Travis <[email protected]>
> Cc: Daniel J Blueman <[email protected]>
> Cc: Andrew Morton <[email protected]>
> Cc: Greg KH <[email protected]>
> Cc: Yinghai Lu <[email protected]>
> Cc: Mel Gorman <[email protected]>
> ---
> include/linux/mm.h | 2 ++
> mm/nobootmem.c | 3 +++
> mm/page_alloc.c | 16 ++++++++++++----
> 3 files changed, 17 insertions(+), 4 deletions(-)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index e0c8528..b264a26 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1322,6 +1322,8 @@ static inline void adjust_managed_page_count(struct page *page, long count)
> totalram_pages += count;
> }
>
> +extern void reserve_bootmem_region(unsigned long start, unsigned long end);
> +
> /* Free the reserved page into the buddy system, so it gets managed. */
> static inline void __free_reserved_page(struct page *page)
> {
> diff --git a/mm/nobootmem.c b/mm/nobootmem.c
> index 2159e68..0840af2 100644
> --- a/mm/nobootmem.c
> +++ b/mm/nobootmem.c
> @@ -117,6 +117,9 @@ static unsigned long __init free_low_memory_core_early(void)
> phys_addr_t start, end, size;
> u64 i;
>
> + for_each_reserved_mem_region(i, &start, &end)
> + reserve_bootmem_region(start, end);
> +
> for_each_free_mem_range(i, MAX_NUMNODES, &start, &end, NULL)
> count += __free_memory_core(start, end);
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index df3ec13..382223e 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -697,17 +697,18 @@ static void free_one_page(struct zone *zone, struct page *page, int order,
> spin_unlock(&zone->lock);
> }
>
> -static void __init_single_page(unsigned long pfn, unsigned long zone, int nid)
> +static void __init_single_page(unsigned long pfn, unsigned long zone,
> + int nid, int page_count)
> {
> struct page *page = pfn_to_page(pfn);
> struct zone *z = &NODE_DATA(nid)->node_zones[zone];
>
> set_page_links(page, zone, nid, pfn);
> mminit_verify_page_links(page, zone, nid, pfn);
> - init_page_count(page);
> page_mapcount_reset(page);
> page_nid_reset_last(page);
> - SetPageReserved(page);
> + set_page_count(page, page_count);
> + ClearPageReserved(page);
>
> /*
> * Mark the block movable so that blocks are reserved for
> @@ -736,6 +737,13 @@ static void __init_single_page(unsigned long pfn, unsigned long zone, int nid)
> #endif
> }
>
> +void reserve_bootmem_region(unsigned long start, unsigned long end)
> +{
> + for (; start < end; start++)
> + if (pfn_valid(start))
> + SetPageReserved(pfn_to_page(start));
> +}
> +
> static bool free_pages_prepare(struct page *page, unsigned int order)
> {
> int i;
> @@ -4010,7 +4018,7 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
> if (!early_pfn_in_nid(pfn, nid))
> continue;
> }
> - __init_single_page(pfn, zone, nid);
> + __init_single_page(pfn, zone, nid, 1);
> }
> }
>
> --
> 1.8.2.1
>
Actually I believe reserve_bootmem_region is wrong. I am passing in phys_addr_t values and not pfns.

It should be:
void reserve_bootmem_region(unsigned long start, unsigned long end)
{
unsigned long start_pfn = PFN_DOWN(start);
unsigned long end_pfn = PFN_UP(end);

for (; start_pfn < end_pfn; start_pfn++)
if (pfn_valid(start_pfn))
SetPageReserved(pfn_to_page(start_pfn));
}

That also brings the timings back in line with the previous patch set.

Nate

2013-08-05 09:58:17

by Ingo Molnar

[permalink] [raw]
Subject: Re: [RFC v2 0/5] Transparent on-demand struct page initialization embedded in the buddy allocator


* Nathan Zimmer <[email protected]> wrote:

> We are still restricting ourselves to 2MiB initialization to keep the
> patch set a little smaller and more clear.
>
> We are still struggling with expand(). Nearly always the first reference
> is to a struct page which is in the middle of the 2MiB region.
> We were unable to find a good solution. Also, given the strong warning
> at the head of expand(), we did not feel experienced enough to refactor
> it to make things always reference the 2MiB page first. The only other
> fastpath impact left is the expansion in prep_new_page.

I suppose it's about this chunk:

@@ -860,6 +917,7 @@ static inline void expand(struct zone *zone, struct page *page,
area--;
high--;
size >>= 1;
+ ensure_page_is_initialized(page);
VM_BUG_ON(bad_range(zone, &page[size]));

where ensure_page_is_initialized() does, in essence:

+ while (aligned_start_pfn < aligned_end_pfn) {
+ if (pfn_valid(aligned_start_pfn)) {
+ page = pfn_to_page(aligned_start_pfn);
+
+ if (PageUninitialized2m(page))
+ expand_page_initialization(page);
+ }
+
+ aligned_start_pfn += PTRS_PER_PMD;
+ }

where aligned_start_pfn is 2MB rounded down.

which looks like an expensive loop to execute for a single page: there are
512 pages in a 2MB range, so on average this iterates 256 times, for every
single page of allocation. Right?

I might be missing something, but why not just represent the
initialization state in 2MB chunks: it is either fully uninitialized, or
fully initialized. If any page in the 'middle' gets allocated, all page
heads have to get initialized.

That should make the fast path test fairly cheap, basically just
PageUninitialized2m(page) has to be tested - and that will fail in the
post-initialization fastpath.

Thanks,

Ingo
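
For illustration only (not code from the thread): a minimal sketch of the
fast-path shape Ingo describes, reusing the series' PageUninitialized2m()
and expand_page_initialization(); the 2MiB block-head lookup and the helper
name are assumed details.

static inline void maybe_expand_block(struct page *page)
{
	/* head of the 2MiB-aligned block this page belongs to */
	struct page *head = pfn_to_page(page_to_pfn(page) &
					~(PTRS_PER_PMD - 1));

	/*
	 * Post-initialization fast path: a single flag test that fails
	 * once the block has been expanded.
	 */
	if (unlikely(PageUninitialized2m(head)))
		expand_page_initialization(head);
}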

2013-08-12 21:54:48

by Nathan Zimmer

[permalink] [raw]
Subject: [RFC v3 4/5] Only set page reserved in the memblock region

Currently, each page struct is set as reserved upon initialization.
This changes to starting with the reserved bit clear and then only
setting the bit in the reserved region.

Signed-off-by: Robin Holt <[email protected]>
Signed-off-by: Nathan Zimmer <[email protected]>
To: "H. Peter Anvin" <[email protected]>
To: Ingo Molnar <[email protected]>
Cc: Linux Kernel <[email protected]>
Cc: Linux MM <[email protected]>
Cc: Rob Landley <[email protected]>
Cc: Mike Travis <[email protected]>
Cc: Daniel J Blueman <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Greg KH <[email protected]>
Cc: Yinghai Lu <[email protected]>
Cc: Mel Gorman <[email protected]>
---
include/linux/mm.h | 2 ++
mm/nobootmem.c | 3 +++
mm/page_alloc.c | 19 +++++++++++++++----
3 files changed, 20 insertions(+), 4 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index e0c8528..b264a26 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1322,6 +1322,8 @@ static inline void adjust_managed_page_count(struct page *page, long count)
totalram_pages += count;
}

+extern void reserve_bootmem_region(unsigned long start, unsigned long end);
+
/* Free the reserved page into the buddy system, so it gets managed. */
static inline void __free_reserved_page(struct page *page)
{
diff --git a/mm/nobootmem.c b/mm/nobootmem.c
index 2159e68..0840af2 100644
--- a/mm/nobootmem.c
+++ b/mm/nobootmem.c
@@ -117,6 +117,9 @@ static unsigned long __init free_low_memory_core_early(void)
phys_addr_t start, end, size;
u64 i;

+ for_each_reserved_mem_region(i, &start, &end)
+ reserve_bootmem_region(start, end);
+
for_each_free_mem_range(i, MAX_NUMNODES, &start, &end, NULL)
count += __free_memory_core(start, end);

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index df3ec13..227bd39 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -697,17 +697,18 @@ static void free_one_page(struct zone *zone, struct page *page, int order,
spin_unlock(&zone->lock);
}

-static void __init_single_page(unsigned long pfn, unsigned long zone, int nid)
+static void __init_single_page(unsigned long pfn, unsigned long zone,
+ int nid, int page_count)
{
struct page *page = pfn_to_page(pfn);
struct zone *z = &NODE_DATA(nid)->node_zones[zone];

set_page_links(page, zone, nid, pfn);
mminit_verify_page_links(page, zone, nid, pfn);
- init_page_count(page);
page_mapcount_reset(page);
page_nid_reset_last(page);
- SetPageReserved(page);
+ set_page_count(page, page_count);
+ ClearPageReserved(page);

/*
* Mark the block movable so that blocks are reserved for
@@ -736,6 +737,16 @@ static void __init_single_page(unsigned long pfn, unsigned long zone, int nid)
#endif
}

+void reserve_bootmem_region(unsigned long start, unsigned long end)
+{
+ unsigned long start_pfn = PFN_DOWN(start);
+ unsigned long end_pfn = PFN_UP(end);
+
+ for (; start_pfn < end_pfn; start_pfn++)
+ if (pfn_valid(start_pfn))
+ SetPageReserved(pfn_to_page(start_pfn));
+}
+
static bool free_pages_prepare(struct page *page, unsigned int order)
{
int i;
@@ -4010,7 +4021,7 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
if (!early_pfn_in_nid(pfn, nid))
continue;
}
- __init_single_page(pfn, zone, nid);
+ __init_single_page(pfn, zone, nid, 1);
}
}

--
1.8.2.1

2013-08-12 21:54:46

by Nathan Zimmer

[permalink] [raw]
Subject: [RFC v3 1/5] memblock: Introduce a for_each_reserved_mem_region iterator.

From: Robin Holt <[email protected]>

As part of initializing struct page's in 2MiB chunks, we noticed that
at the end of free_all_bootmem(), there was nothing which had forced
the reserved/allocated 4KiB pages to be initialized.

This helper function will be used for that expansion.

Signed-off-by: Robin Holt <[email protected]>
Signed-off-by: Nathan Zimmer <[email protected]>
To: "H. Peter Anvin" <[email protected]>
To: Ingo Molnar <[email protected]>
Cc: Linux Kernel <[email protected]>
Cc: Linux MM <[email protected]>
Cc: Rob Landley <[email protected]>
Cc: Mike Travis <[email protected]>
Cc: Daniel J Blueman <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Greg KH <[email protected]>
Cc: Yinghai Lu <[email protected]>
Cc: Mel Gorman <[email protected]>
---
include/linux/memblock.h | 18 ++++++++++++++++++
mm/memblock.c | 32 ++++++++++++++++++++++++++++++++
2 files changed, 50 insertions(+)

diff --git a/include/linux/memblock.h b/include/linux/memblock.h
index f388203..e99bbd1 100644
--- a/include/linux/memblock.h
+++ b/include/linux/memblock.h
@@ -118,6 +118,24 @@ void __next_free_mem_range_rev(u64 *idx, int nid, phys_addr_t *out_start,
i != (u64)ULLONG_MAX; \
__next_free_mem_range_rev(&i, nid, p_start, p_end, p_nid))

+void __next_reserved_mem_region(u64 *idx, phys_addr_t *out_start,
+ phys_addr_t *out_end);
+
+/**
+ * for_each_reserved_mem_region - iterate over all reserved memblock areas
+ * @i: u64 used as loop variable
+ * @p_start: ptr to phys_addr_t for start address of the range, can be %NULL
+ * @p_end: ptr to phys_addr_t for end address of the range, can be %NULL
+ *
+ * Walks over reserved areas of memblock. Available as soon as memblock
+ * is initialized.
+ */
+#define for_each_reserved_mem_region(i, p_start, p_end) \
+ for (i = 0UL, \
+ __next_reserved_mem_region(&i, p_start, p_end); \
+ i != (u64)ULLONG_MAX; \
+ __next_reserved_mem_region(&i, p_start, p_end))
+
#ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
int memblock_set_node(phys_addr_t base, phys_addr_t size, int nid);

diff --git a/mm/memblock.c b/mm/memblock.c
index c5fad93..0d7d6e7 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -564,6 +564,38 @@ int __init_memblock memblock_reserve(phys_addr_t base, phys_addr_t size)
}

/**
+ * __next_reserved_mem_region - next function for for_each_reserved_region()
+ * @idx: pointer to u64 loop variable
+ * @out_start: ptr to phys_addr_t for start address of the region, can be %NULL
+ * @out_end: ptr to phys_addr_t for end address of the region, can be %NULL
+ *
+ * Iterate over all reserved memory regions.
+ */
+void __init_memblock __next_reserved_mem_region(u64 *idx,
+ phys_addr_t *out_start,
+ phys_addr_t *out_end)
+{
+ struct memblock_type *rsv = &memblock.reserved;
+
+ if (*idx >= 0 && *idx < rsv->cnt) {
+ struct memblock_region *r = &rsv->regions[*idx];
+ phys_addr_t base = r->base;
+ phys_addr_t size = r->size;
+
+ if (out_start)
+ *out_start = base;
+ if (out_end)
+ *out_end = base + size - 1;
+
+ *idx += 1;
+ return;
+ }
+
+ /* signal end of iteration */
+ *idx = ULLONG_MAX;
+}
+
+/**
* __next_free_mem_range - next function for for_each_free_mem_range()
* @idx: pointer to u64 loop variable
* @nid: nid: node selector, %MAX_NUMNODES for all nodes
--
1.8.2.1

2013-08-12 21:54:45

by Nathan Zimmer

[permalink] [raw]
Subject: [RFC v3 3/5] Move page initialization into a separate function.

From: Robin Holt <[email protected]>

Currently, memmap_init_zone() has all the smarts for initializing a
single page. When we convert to initializing pages in a 2MiB chunk,
we will need to do this equivalent work from two separate places
so we are breaking out a helper function.

Signed-off-by: Robin Holt <[email protected]>
Signed-off-by: Nathan Zimmer <[email protected]>
To: "H. Peter Anvin" <[email protected]>
To: Ingo Molnar <[email protected]>
Cc: Linux Kernel <[email protected]>
Cc: Linux MM <[email protected]>
Cc: Rob Landley <[email protected]>
Cc: Mike Travis <[email protected]>
Cc: Daniel J Blueman <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Greg KH <[email protected]>
Cc: Yinghai Lu <[email protected]>
Cc: Mel Gorman <[email protected]>
---
mm/mm_init.c | 2 +-
mm/page_alloc.c | 73 +++++++++++++++++++++++++++++++--------------------------
2 files changed, 41 insertions(+), 34 deletions(-)

diff --git a/mm/mm_init.c b/mm/mm_init.c
index c280a02..be8a539 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -128,7 +128,7 @@ void __init mminit_verify_pageflags_layout(void)
BUG_ON(or_mask != add_mask);
}

-void __meminit mminit_verify_page_links(struct page *page, enum zone_type zone,
+void mminit_verify_page_links(struct page *page, enum zone_type zone,
unsigned long nid, unsigned long pfn)
{
BUG_ON(page_to_nid(page) != nid);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 5adf81e..df3ec13 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -697,6 +697,45 @@ static void free_one_page(struct zone *zone, struct page *page, int order,
spin_unlock(&zone->lock);
}

+static void __init_single_page(unsigned long pfn, unsigned long zone, int nid)
+{
+ struct page *page = pfn_to_page(pfn);
+ struct zone *z = &NODE_DATA(nid)->node_zones[zone];
+
+ set_page_links(page, zone, nid, pfn);
+ mminit_verify_page_links(page, zone, nid, pfn);
+ init_page_count(page);
+ page_mapcount_reset(page);
+ page_nid_reset_last(page);
+ SetPageReserved(page);
+
+ /*
+ * Mark the block movable so that blocks are reserved for
+ * movable at startup. This will force kernel allocations
+ * to reserve their blocks rather than leaking throughout
+ * the address space during boot when many long-lived
+ * kernel allocations are made. Later some blocks near
+ * the start are marked MIGRATE_RESERVE by
+ * setup_zone_migrate_reserve()
+ *
+ * bitmap is created for zone's valid pfn range. but memmap
+ * can be created for invalid pages (for alignment)
+ * check here not to call set_pageblock_migratetype() against
+ * pfn out of zone.
+ */
+ if ((z->zone_start_pfn <= pfn)
+ && (pfn < zone_end_pfn(z))
+ && !(pfn & (pageblock_nr_pages - 1)))
+ set_pageblock_migratetype(page, MIGRATE_MOVABLE);
+
+ INIT_LIST_HEAD(&page->lru);
+#ifdef WANT_PAGE_VIRTUAL
+ /* The shift won't overflow because ZONE_NORMAL is below 4G. */
+ if (!is_highmem_idx(zone))
+ set_page_address(page, __va(pfn << PAGE_SHIFT));
+#endif
+}
+
static bool free_pages_prepare(struct page *page, unsigned int order)
{
int i;
@@ -3951,7 +3990,6 @@ static void setup_zone_migrate_reserve(struct zone *zone)
void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
unsigned long start_pfn, enum memmap_context context)
{
- struct page *page;
unsigned long end_pfn = start_pfn + size;
unsigned long pfn;
struct zone *z;
@@ -3972,38 +4010,7 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
if (!early_pfn_in_nid(pfn, nid))
continue;
}
- page = pfn_to_page(pfn);
- set_page_links(page, zone, nid, pfn);
- mminit_verify_page_links(page, zone, nid, pfn);
- init_page_count(page);
- page_mapcount_reset(page);
- page_nid_reset_last(page);
- SetPageReserved(page);
- /*
- * Mark the block movable so that blocks are reserved for
- * movable at startup. This will force kernel allocations
- * to reserve their blocks rather than leaking throughout
- * the address space during boot when many long-lived
- * kernel allocations are made. Later some blocks near
- * the start are marked MIGRATE_RESERVE by
- * setup_zone_migrate_reserve()
- *
- * bitmap is created for zone's valid pfn range. but memmap
- * can be created for invalid pages (for alignment)
- * check here not to call set_pageblock_migratetype() against
- * pfn out of zone.
- */
- if ((z->zone_start_pfn <= pfn)
- && (pfn < zone_end_pfn(z))
- && !(pfn & (pageblock_nr_pages - 1)))
- set_pageblock_migratetype(page, MIGRATE_MOVABLE);
-
- INIT_LIST_HEAD(&page->lru);
-#ifdef WANT_PAGE_VIRTUAL
- /* The shift won't overflow because ZONE_NORMAL is below 4G. */
- if (!is_highmem_idx(zone))
- set_page_address(page, __va(pfn << PAGE_SHIFT));
-#endif
+ __init_single_page(pfn, zone, nid);
}
}

--
1.8.2.1

2013-08-12 21:55:58

by Nathan Zimmer

[permalink] [raw]
Subject: [RFC v3 2/5] Have __free_pages_memory() free in larger chunks.

From: Robin Holt <[email protected]>

Currently, when free_all_bootmem() calls __free_pages_memory(), the
number of contiguous pages that __free_pages_memory() passes to the
buddy allocator is limited to BITS_PER_LONG. In order to be able to
free only the first page of a 2MiB chunk, we need that to be increased.
We are increasing to the maximum size available.

Signed-off-by: Robin Holt <[email protected]>
Signed-off-by: Nathan Zimmer <[email protected]>
To: "H. Peter Anvin" <[email protected]>
To: Ingo Molnar <[email protected]>
Cc: Linux Kernel <[email protected]>
Cc: Linux MM <[email protected]>
Cc: Rob Landley <[email protected]>
Cc: Mike Travis <[email protected]>
Cc: Daniel J Blueman <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Greg KH <[email protected]>
Cc: Yinghai Lu <[email protected]>
Cc: Mel Gorman <[email protected]>
---
mm/nobootmem.c | 25 ++++++++-----------------
1 file changed, 8 insertions(+), 17 deletions(-)

diff --git a/mm/nobootmem.c b/mm/nobootmem.c
index bdd3fa2..2159e68 100644
--- a/mm/nobootmem.c
+++ b/mm/nobootmem.c
@@ -82,27 +82,18 @@ void __init free_bootmem_late(unsigned long addr, unsigned long size)

static void __init __free_pages_memory(unsigned long start, unsigned long end)
{
- unsigned long i, start_aligned, end_aligned;
- int order = ilog2(BITS_PER_LONG);
+ int order;

- start_aligned = (start + (BITS_PER_LONG - 1)) & ~(BITS_PER_LONG - 1);
- end_aligned = end & ~(BITS_PER_LONG - 1);
+ while (start < end) {
+ order = min(MAX_ORDER - 1, __ffs(start));

- if (end_aligned <= start_aligned) {
- for (i = start; i < end; i++)
- __free_pages_bootmem(pfn_to_page(i), 0);
+ while (start + (1UL << order) > end)
+ order--;

- return;
- }
-
- for (i = start; i < start_aligned; i++)
- __free_pages_bootmem(pfn_to_page(i), 0);
+ __free_pages_bootmem(pfn_to_page(start), order);

- for (i = start_aligned; i < end_aligned; i += BITS_PER_LONG)
- __free_pages_bootmem(pfn_to_page(i), order);
-
- for (i = end_aligned; i < end; i++)
- __free_pages_bootmem(pfn_to_page(i), 0);
+ start += (1UL << order);
+ }
}

static unsigned long __init __free_memory_core(phys_addr_t start,
--
1.8.2.1

2013-08-12 21:55:56

by Nathan Zimmer

[permalink] [raw]
Subject: [RFC v3 0/5] Transparent on-demand struct page initialization embedded in the buddy allocator

We are still restricting ourselves to 2MiB initialization.
This was initially to keep the patch set a little smaller and more clear.
However, given how well it is currently performing, I don't see how much
better it could be with 2GiB chunks.

As far as extra overhead goes, we incur an extra function call to
ensure_page_is_initialized, but that is only really expensive when we find
uninitialized pages; otherwise it is a flag check once every PTRS_PER_PMD.
To get a better feel for this we ran two quick tests.

The first was simply timing some memhogs.
This showed no measurable difference, so we made a more granular test.
We spawned N threads, started a timer, had each thread malloc totalmem/N and
then write to its memory to induce page faults, and stopped the timer.
In this case each thread had just under 4GB of ram to fault in.
This showed a measurable difference in the page faulting.
The baseline took an average of 2.68 seconds; the new version took an
average of 2.75 seconds, which is 0.07s slower, or 2.6%.
Are there some other tests I should run?
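
For reference, a rough reconstruction of the fault-timing test described
above (not the actual test program; the thread count and per-thread size
are illustrative placeholders):

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define NTHREADS	32			/* "N" in the description above */
#define PER_THREAD	(4UL << 30)		/* ~4GB per thread, as in the test */
#define PAGE_SZ		4096UL

static void *toucher(void *arg)
{
	char *buf = malloc(PER_THREAD);
	unsigned long off;

	if (!buf)
		return NULL;
	for (off = 0; off < PER_THREAD; off += PAGE_SZ)
		buf[off] = 1;		/* one write per page => one page fault */
	return buf;
}

int main(void)
{
	pthread_t tid[NTHREADS];
	struct timespec t0, t1;
	int i;

	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (i = 0; i < NTHREADS; i++)
		pthread_create(&tid[i], NULL, toucher, NULL);
	for (i = 0; i < NTHREADS; i++)
		pthread_join(tid[i], NULL);
	clock_gettime(CLOCK_MONOTONIC, &t1);

	printf("fault time: %.2fs\n", (t1.tv_sec - t0.tv_sec) +
	       (t1.tv_nsec - t0.tv_nsec) / 1e9);
	return 0;
}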

With this patch, we did boot a 16TiB machine.
The two main areas that benefit from this patch are free_all_bootmem and
memmap_init_zone. Without the patches they took 407 seconds and 1151 seconds
respectively. With the patches they took 13 and 39 seconds respectively.
This is a total savings of 1506 seconds (25 minutes).
These times were acquired using a modified version of script which records
the time in uSecs at the beginning of each line of output.

Overall I am fairly happy with the patch set at the moment. It improves boot
times without noticeable runtime overhead.
I am, as always, open to suggestions.

v2: included Yinghai's suggestion to not set the reserved bit until later.

v3: Corrected my first attempt at moving the reserved bit.
__expand_page_initialization should only be called by ensure_pages_are_initialized

Nathan Zimmer (1):
Only set page reserved in the memblock region

Robin Holt (4):
memblock: Introduce a for_each_reserved_mem_region iterator.
Have __free_pages_memory() free in larger chunks.
Move page initialization into a separate function.
Sparse initialization of struct page array.

include/linux/memblock.h | 18 +++++
include/linux/mm.h | 2 +
include/linux/page-flags.h | 5 +-
mm/memblock.c | 32 ++++++++
mm/mm_init.c | 2 +-
mm/nobootmem.c | 28 +++----
mm/page_alloc.c | 198 ++++++++++++++++++++++++++++++++++++---------
7 files changed, 229 insertions(+), 56 deletions(-)

--
1.8.2.1

2013-08-12 21:55:55

by Nathan Zimmer

[permalink] [raw]
Subject: [RFC v3 5/5] Sparse initialization of struct page array.

From: Robin Holt <[email protected]>

During boot of large memory machines, a significant portion of boot
is spent initializing the struct page array. The vast majority of
those pages are not referenced during boot.

Change this over to only initializing the pages when they are
actually allocated.

Besides the advantage of boot speed, this allows us the chance to
use normal performance monitoring tools to determine where the bulk
of time is spent during page initialization.

Signed-off-by: Robin Holt <[email protected]>
Signed-off-by: Nathan Zimmer <[email protected]>
To: "H. Peter Anvin" <[email protected]>
To: Ingo Molnar <[email protected]>
Cc: Linux Kernel <[email protected]>
Cc: Linux MM <[email protected]>
Cc: Rob Landley <[email protected]>
Cc: Mike Travis <[email protected]>
Cc: Daniel J Blueman <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Greg KH <[email protected]>
Cc: Yinghai Lu <[email protected]>
Cc: Mel Gorman <[email protected]>
---
include/linux/page-flags.h | 5 +-
mm/page_alloc.c | 116 +++++++++++++++++++++++++++++++++++++++++++--
2 files changed, 115 insertions(+), 6 deletions(-)

diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 6d53675..d592065 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -83,6 +83,7 @@ enum pageflags {
PG_owner_priv_1, /* Owner use. If pagecache, fs may use*/
PG_arch_1,
PG_reserved,
+ PG_uninitialized_2m,
PG_private, /* If pagecache, has fs-private data */
PG_private_2, /* If pagecache, has fs aux data */
PG_writeback, /* Page is under writeback */
@@ -211,6 +212,8 @@ PAGEFLAG(SwapBacked, swapbacked) __CLEARPAGEFLAG(SwapBacked, swapbacked)

__PAGEFLAG(SlobFree, slob_free)

+PAGEFLAG(Uninitialized2m, uninitialized_2m)
+
/*
* Private page markings that may be used by the filesystem that owns the page
* for its own purposes.
@@ -499,7 +502,7 @@ static inline void ClearPageSlabPfmemalloc(struct page *page)
#define PAGE_FLAGS_CHECK_AT_FREE \
(1 << PG_lru | 1 << PG_locked | \
1 << PG_private | 1 << PG_private_2 | \
- 1 << PG_writeback | 1 << PG_reserved | \
+ 1 << PG_writeback | 1 << PG_reserved | 1 << PG_uninitialized_2m | \
1 << PG_slab | 1 << PG_swapcache | 1 << PG_active | \
1 << PG_unevictable | __PG_MLOCKED | __PG_HWPOISON | \
__PG_COMPOUND_LOCK)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 227bd39..6c35a58 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -737,11 +737,53 @@ static void __init_single_page(unsigned long pfn, unsigned long zone,
#endif
}

+static void __expand_page_initialization(struct page *basepage)
+{
+ unsigned long pfn = page_to_pfn(basepage);
+ unsigned long end_pfn = pfn + PTRS_PER_PMD;
+ unsigned long zone = page_zonenum(basepage);
+ int count = page_count(basepage);
+ int nid = page_to_nid(basepage);
+
+ ClearPageUninitialized2m(basepage);
+
+ for (pfn++; pfn < end_pfn; pfn++)
+ __init_single_page(pfn, zone, nid, count);
+}
+
+static void ensure_pages_are_initialized(unsigned long start_pfn,
+ unsigned long end_pfn)
+{
+ unsigned long aligned_start_pfn = start_pfn & ~(PTRS_PER_PMD - 1);
+ unsigned long aligned_end_pfn;
+ struct page *page;
+
+ aligned_end_pfn = end_pfn & ~(PTRS_PER_PMD - 1);
+ aligned_end_pfn += PTRS_PER_PMD;
+ while (aligned_start_pfn < aligned_end_pfn) {
+ if (pfn_valid(aligned_start_pfn)) {
+ page = pfn_to_page(aligned_start_pfn);
+
+ if (PageUninitialized2m(page))
+ __expand_page_initialization(page);
+ }
+
+ aligned_start_pfn += PTRS_PER_PMD;
+ }
+}
+
+static inline void ensure_page_is_initialized(struct page *page)
+{
+ ensure_pages_are_initialized(page_to_pfn(page), page_to_pfn(page));
+}
+
void reserve_bootmem_region(unsigned long start, unsigned long end)
{
unsigned long start_pfn = PFN_DOWN(start);
unsigned long end_pfn = PFN_UP(end);

+ ensure_pages_are_initialized(start_pfn, end_pfn);
+
for (; start_pfn < end_pfn; start_pfn++)
if (pfn_valid(start_pfn))
SetPageReserved(pfn_to_page(start_pfn));
@@ -758,7 +800,10 @@ static bool free_pages_prepare(struct page *page, unsigned int order)
if (PageAnon(page))
page->mapping = NULL;
for (i = 0; i < (1 << order); i++)
- bad += free_pages_check(page + i);
+ if (PageUninitialized2m(page + i))
+ i += PTRS_PER_PMD - 1;
+ else
+ bad += free_pages_check(page + i);
if (bad)
return false;

@@ -802,13 +847,22 @@ void __meminit __free_pages_bootmem(struct page *page, unsigned int order)
unsigned int loop;

prefetchw(page);
- for (loop = 0; loop < nr_pages; loop++) {
+ for (loop = 0; loop < nr_pages; ) {
struct page *p = &page[loop];

if (loop + 1 < nr_pages)
prefetchw(p + 1);
+
+ if ((PageUninitialized2m(p)) &&
+ ((loop + PTRS_PER_PMD) > nr_pages))
+ ensure_page_is_initialized(p);
+
__ClearPageReserved(p);
set_page_count(p, 0);
+ if (PageUninitialized2m(p))
+ loop += PTRS_PER_PMD;
+ else
+ loop += 1;
}

page_zone(page)->managed_pages += 1 << order;
@@ -863,6 +917,7 @@ static inline void expand(struct zone *zone, struct page *page,
area--;
high--;
size >>= 1;
+ ensure_page_is_initialized(&page[size]);
VM_BUG_ON(bad_range(zone, &page[size]));

#ifdef CONFIG_DEBUG_PAGEALLOC
@@ -908,8 +963,11 @@ static int prep_new_page(struct page *page, int order, gfp_t gfp_flags)
{
int i;

+ ensure_pages_are_initialized(page_to_pfn(page),
+ page_to_pfn(page+(1<<order)-1));
for (i = 0; i < (1 << order); i++) {
struct page *p = page + i;
+
if (unlikely(check_new_page(p)))
return 1;
}
@@ -992,6 +1050,8 @@ int move_freepages(struct zone *zone,
unsigned long order;
int pages_moved = 0;

+ ensure_pages_are_initialized(page_to_pfn(start_page),
+ page_to_pfn(end_page));
#ifndef CONFIG_HOLES_IN_ZONE
/*
* page_zone is not safe to call in this context when
@@ -3905,6 +3965,9 @@ static int pageblock_is_reserved(unsigned long start_pfn, unsigned long end_pfn)
for (pfn = start_pfn; pfn < end_pfn; pfn++) {
if (!pfn_valid_within(pfn) || PageReserved(pfn_to_page(pfn)))
return 1;
+
+ if (PageUninitialized2m(pfn_to_page(pfn)))
+ pfn += PTRS_PER_PMD;
}
return 0;
}
@@ -3994,6 +4057,34 @@ static void setup_zone_migrate_reserve(struct zone *zone)
}

/*
+ * This function tells us whether the next 'size' pfns are available,
+ * meaning valid and on the specified node.
+ * It returns 'size' if that many pfns are available, 1 otherwise.
+ */
+static int __meminit pfn_range_init_avail(unsigned long pfn,
+ unsigned long end_pfn,
+ unsigned long size, int nid)
+{
+ unsigned long validate_end_pfn = pfn + size;
+
+ if (pfn & (size - 1))
+ return 1;
+
+ if (pfn + size >= end_pfn)
+ return 1;
+
+ while (pfn < validate_end_pfn) {
+ if (!early_pfn_valid(pfn))
+ return 1;
+ if (!early_pfn_in_nid(pfn, nid))
+ return 1;
+ pfn++;
+ }
+
+ return size;
+}
+
+/*
* Initially all pages are reserved - free ones are freed
* up by free_all_bootmem() once the early boot process is
* done. Non-atomic initialization, single-pass.
@@ -4009,19 +4100,33 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
highest_memmap_pfn = end_pfn - 1;

z = &NODE_DATA(nid)->node_zones[zone];
- for (pfn = start_pfn; pfn < end_pfn; pfn++) {
+ for (pfn = start_pfn; pfn < end_pfn; ) {
/*
* There can be holes in boot-time mem_map[]s
* handed to this function. They do not
* exist on hotplugged memory.
*/
+ int pfns = 1;
if (context == MEMMAP_EARLY) {
- if (!early_pfn_valid(pfn))
+ if (!early_pfn_valid(pfn)) {
+ pfn++;
continue;
- if (!early_pfn_in_nid(pfn, nid))
+ }
+ if (!early_pfn_in_nid(pfn, nid)) {
+ pfn++;
continue;
+ }
+
+ pfns = pfn_range_init_avail(pfn, end_pfn,
+ PTRS_PER_PMD, nid);
}
+
__init_single_page(pfn, zone, nid, 1);
+
+ if (pfns > 1)
+ SetPageUninitialized2m(pfn_to_page(pfn));
+
+ pfn += pfns;
}
}

@@ -6240,6 +6345,7 @@ static const struct trace_print_flags pageflag_names[] = {
{1UL << PG_owner_priv_1, "owner_priv_1" },
{1UL << PG_arch_1, "arch_1" },
{1UL << PG_reserved, "reserved" },
+ {1UL << PG_uninitialized_2m, "uninitialized_2m" },
{1UL << PG_private, "private" },
{1UL << PG_private_2, "private_2" },
{1UL << PG_writeback, "writeback" },
--
1.8.2.1

2013-08-13 10:58:52

by Ingo Molnar

[permalink] [raw]
Subject: Re: [RFC v3 0/5] Transparent on-demand struct page initialization embedded in the buddy allocator


* Nathan Zimmer <[email protected]> wrote:

> We are still restricting ourselves to 2MiB initialization.
> This was initially to keep the patch set a little smaller and more
> clear. However, given how well it is currently performing, I don't see
> how much better it could be with 2GiB chunks.
>
> As far as extra overhead goes, we incur an extra function call to
> ensure_page_is_initialized, but that is only really expensive when we
> find uninitialized pages; otherwise it is a flag check once every
> PTRS_PER_PMD. [...]

Mind expanding on this in more detail?

The main fastpath overhead we are really interested in is the 'memory is
already fully initialized and we reallocate a second time' case - i.e. the
*second* (and subsequent), post-initialization allocation of any page
range.

Those allocations are the ones that matter most: they will occur again and
again, for the lifetime of the booted up system.

What extra overhead is there in that case? Only a flag check that is
merged into an existing flag check (in free_pages_check()) and thus is
essentially zero overhead? Or is it more involved - if yes, why?

One would naively think that nothing but the flags check is needed in this
case: if all 512 pages in an aligned 2MB block are fully initialized, and
marked as initialized in all the 512 page heads, then no other runtime
check will be needed in the future.

Thanks,

Ingo

2013-08-13 17:09:37

by Linus Torvalds

[permalink] [raw]
Subject: Re: [RFC v3 0/5] Transparent on-demand struct page initialization embedded in the buddy allocator

On Mon, Aug 12, 2013 at 2:54 PM, Nathan Zimmer <[email protected]> wrote:
>
> As far as extra overhead goes, we incur an extra function call to
> ensure_page_is_initialized, but that is only really expensive when we find
> uninitialized pages; otherwise it is a flag check once every PTRS_PER_PMD.
> To get a better feel for this we ran two quick tests.

Sorry for coming into this late and for this last version of the
patch, but I have to say that I'd *much* rather see this delayed
initialization using another data structure than hooking into the
basic page allocation ones..

I understand that you want to do delayed initialization on some TB+
memory machines, but what I don't understand is why it has to be done
when the pages have already been added to the memory management free
list.

Could we not do this much simpler: make the early boot insert the
first few gigs of memory (initialized) synchronously into the free
lists, and then have a background thread that goes through the rest?

That way the MM layer would never see the uninitialized pages.

And I bet that *nobody* cares if you "only" have a few gigs of ram
during the first few minutes of boot, and you mysteriously end up
getting more and more memory for a while until all the RAM has been
initialized.

IOW, just don't call __free_pages_bootmem() on all the pages all at
once. If we have to remove a few __init markers to be able to do some
of it later, does anybody really care?

I really really dislike this "let's check if memory is initialized at
runtime" approach.

Linus
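
For illustration only (not from the thread), a sketch of the background
initialization Linus outlines above; the deferred_ranges[] bookkeeping, its
fields and the thread name are hypothetical, and as he notes a few
__init/__meminit markers would have to be relaxed:

struct deferred_range {			/* hypothetical bookkeeping */
	unsigned long start_pfn, end_pfn;
	unsigned long zone;
	int nid;
};
static struct deferred_range deferred_ranges[MAX_NUMNODES];
static int nr_deferred_ranges;

static int deferred_init_thread(void *unused)
{
	int i;
	unsigned long pfn;

	for (i = 0; i < nr_deferred_ranges; i++) {
		struct deferred_range *r = &deferred_ranges[i];

		for (pfn = r->start_pfn; pfn < r->end_pfn; pfn++) {
			if (!pfn_valid(pfn))
				continue;
			/* set up the struct page, then hand it to the buddy */
			__init_single_page(pfn, r->zone, r->nid, 1);
			__free_pages_bootmem(pfn_to_page(pfn), 0);
			if (!(pfn & 0x3ff))
				cond_resched();	/* keep this low priority */
		}
	}
	return 0;
}

static int __init start_deferred_init(void)
{
	kthread_run(deferred_init_thread, NULL, "meminit");
	return 0;
}
late_initcall(start_deferred_init);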

2013-08-13 17:24:24

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [RFC v3 0/5] Transparent on-demand struct page initialization embedded in the buddy allocator

On 08/13/2013 10:09 AM, Linus Torvalds wrote:
>
> I really really dislike this "let's check if memory is initialized at
> runtime" approach.
>

It does seem to be getting messy, doesn't it...

The one potential serious concern is if that will end up mucking with
NUMA affinity in a way that has lasting effects past boot.

-hpa

2013-08-13 17:33:56

by Mike Travis

[permalink] [raw]
Subject: Re: [RFC v3 0/5] Transparent on-demand struct page initialization embedded in the buddy allocator



On 8/13/2013 10:09 AM, Linus Torvalds wrote:
> On Mon, Aug 12, 2013 at 2:54 PM, Nathan Zimmer <[email protected]> wrote:
>>
>> As far as extra overhead goes, we incur an extra function call to
>> ensure_page_is_initialized, but that is only really expensive when we find
>> uninitialized pages; otherwise it is a flag check once every PTRS_PER_PMD.
>> To get a better feel for this we ran two quick tests.
>
> Sorry for coming into this late and for this last version of the
> patch, but I have to say that I'd *much* rather see this delayed
> initialization using another data structure than hooking into the
> basic page allocation ones..
>
> I understand that you want to do delayed initialization on some TB+
> memory machines, but what I don't understand is why it has to be done
> when the pages have already been added to the memory management free
> list.
>
> Could we not do this much simpler: make the early boot insert the
> first few gigs of memory (initialized) synchronously into the free
> lists, and then have a background thread that goes through the rest?
>
> That way the MM layer would never see the uninitialized pages.
>
> And I bet that *nobody* cares if you "only" have a few gigs of ram
> during the first few minutes of boot, and you mysteriously end up
> getting more and more memory for a while until all the RAM has been
> initialized.
>
>> IOW, just don't call __free_pages_bootmem() on all the pages all at
> once. If we have to remove a few __init markers to be able to do some
> of it later, does anybody really care?
>
> I really really dislike this "let's check if memory is initialized at
> runtime" approach.
>
> Linus
>

Initially this patch set consisted of diverting a major portion of the
memory to an "absent" list during e820 processing. A very late initcall
was then used to dispatch a cpu per node to add that node's absent
memory. By nature these ran in parallel so Nathan did the work to
"parallelize" various global resource locks to become per node locks.

This sped up insertion considerably. And by disabling the "auto-start"
of the insertion process and using a manual start command, you could
monitor the insertion process and find hot spots in the memory
initialization code.

There were also small updates to the sys/devices/{memory,node} drivers to
display the amount of memory still "absent".

-Mike

2013-08-13 17:51:40

by Linus Torvalds

[permalink] [raw]
Subject: Re: [RFC v3 0/5] Transparent on-demand struct page initialization embedded in the buddy allocator

On Tue, Aug 13, 2013 at 10:33 AM, Mike Travis <[email protected]> wrote:
>
> Initially this patch set consisted of diverting a major portion of the
> memory to an "absent" list during e820 processing. A very late initcall
> was then used to dispatch a cpu per node to add that nodes's absent
> memory. By nature these ran in parallel so Nathan did the work to
> "parallelize" various global resource locks to become per node locks.

So quite frankly, I'm not sure how worthwhile it even is to
parallelize the thing. I realize that some environments may care about
getting up to full memory population very quickly, but I think it would
be very rare and specialized, and shouldn't necessarily be part of the
initial patches.

And it really doesn't have to be an initcall at all - at least not a
synchronous one. A late initcall to get the process *started*, but the
process itself could easily be done with a separate thread
asynchronously, and let the machine boot up while that thread is
going.

And in fact, I'd argue that instead of trying to make it fast and
parallelize things excessively, you might want to make the memory
initialization *slow*, and make all the rest of the bootup have higher
priority.

At that point, who cares if it takes 400 seconds to get all memory
initialized? In fact, who cares if it takes twice that? Let's assume
that the rest of the boot takes 30s (which is pretty aggressive for
some big server with terabytes of memory), even if the memory
initialization was running in the background and only during idle time
for probing, I'm sure you'd have a few hundred gigs of RAM initialized
by the time you can log in. And if it then takes another ten minutes
until you have the full 16TB initialized, and some things might be a
tad slower early on, does anybody really care? The machine will be up
and running with plenty of memory, even if it may not be *all* the
memory yet.

I realize that benchmarking cares, and yes, I also realize that some
benchmarks actually want to reboot the machine between some runs just
to get repeatability, but if you're benchmarking a 16TB machine I'm
guessing any serious benchmark that actually uses that much memory is
going to take many hours to a few days to run anyway? Having some way
to wait until the memory is all done (which might even be just a silly
shell script that does "ps" and waits for the kernel threads to all go
away) isn't going to kill the benchmark - and the benchmark itself
will then not have to worry about hitting the "oops, I need to
initialize 2GB of RAM now because I hit an uninitialized page".

Ok, so I don't know all the issues, and in many ways I don't even
really care. You could do it other ways, I don't think this is a big
deal. The part I hate is the runtime hook into the core MM page
allocation code, so I'm just throwing out any random thing that comes
to my mind that could be used to avoid that part.

Linus

2013-08-13 18:04:09

by Mike Travis

[permalink] [raw]
Subject: Re: [RFC v3 0/5] Transparent on-demand struct page initialization embedded in the buddy allocator



On 8/13/2013 10:51 AM, Linus Torvalds wrote:
> by the time you can log in. And if it then takes another ten minutes
> until you have the full 16TB initialized, and some things might be a
> tad slower early on, does anybody really care? The machine will be up
> and running with plenty of memory, even if it may not be *all* the
> memory yet.

Before the patches adding memory took ~45 mins for 16TB and almost 2 hours
for 32TB. Adding it late sped up early boot but late insertion was still
very slow, where the full 32TB was still not fully inserted after an hour.
Doing it in parallel along with the memory hotplug lock per node, we got
it down to the 10-15 minute range.

2013-08-13 19:06:26

by Mike Travis

[permalink] [raw]
Subject: Re: [RFC v3 0/5] Transparent on-demand struct page initialization embedded in the buddy allocator



On 8/13/2013 11:04 AM, Mike Travis wrote:
>
>
> On 8/13/2013 10:51 AM, Linus Torvalds wrote:
>> by the time you can log in. And if it then takes another ten minutes
>> until you have the full 16TB initialized, and some things might be a
>> tad slower early on, does anybody really care? The machine will be up
>> and running with plenty of memory, even if it may not be *all* the
>> memory yet.
>
> Before the patches adding memory took ~45 mins for 16TB and almost 2 hours
> for 32TB. Adding it late sped up early boot but late insertion was still
> very slow, where the full 32TB was still not fully inserted after an hour.
> Doing it in parallel along with the memory hotplug lock per node, we got
> it down to the 10-15 minute range.
>

FYI, the system at this time had 128 nodes each with 256GB of memory.
About 252GB was inserted into the absent list from nodes 1 .. 126.
Memory on nodes 0 and 128 was left fully present.

2013-08-13 20:24:36

by Yinghai Lu

[permalink] [raw]
Subject: Re: [RFC v3 0/5] Transparent on-demand struct page initialization embedded in the buddy allocator

On Tue, Aug 13, 2013 at 12:06 PM, Mike Travis <[email protected]> wrote:
>
>
> On 8/13/2013 11:04 AM, Mike Travis wrote:
>>
>>
>> On 8/13/2013 10:51 AM, Linus Torvalds wrote:
>>> by the time you can log in. And if it then takes another ten minutes
>>> until you have the full 16TB initialized, and some things might be a
>>> tad slower early on, does anybody really care? The machine will be up
>>> and running with plenty of memory, even if it may not be *all* the
>>> memory yet.
>>
>> Before the patches adding memory took ~45 mins for 16TB and almost 2 hours
>> for 32TB. Adding it late sped up early boot but late insertion was still
>> very slow, where the full 32TB was still not fully inserted after an hour.
>> Doing it in parallel along with the memory hotplug lock per node, we got
>> it down to the 10-15 minute range.
>>
>
> FYI, the system at this time had 128 nodes each with 256GB of memory.
> About 252GB was inserted into the absent list from nodes 1 .. 126.
> Memory on nodes 0 and 128 was left fully present.

Can we have one topic about those boot time issues at this year's kernel summit?

There will be more 32-socket x86 systems, and they will have lots of
memory, PCI chains and CPU cores.

current kernel/smp.c::smp_init(), we still have
| /* FIXME: This should be done in userspace --RR */
| for_each_present_cpu(cpu) {
| if (num_online_cpus() >= setup_max_cpus)
| break;
| if (!cpu_online(cpu))
| cpu_up(cpu);
| }

Solutions would be:
1. delay some memory, PCI chains, or CPU cores;
2. or initialize them in parallel during boot;
3. or add them in parallel after booting.

Thanks

Yinghai

2013-08-13 20:38:00

by Mike Travis

[permalink] [raw]
Subject: Re: [RFC v3 0/5] Transparent on-demand struct page initialization embedded in the buddy allocator



On 8/13/2013 1:24 PM, Yinghai Lu wrote:
>> > FYI, the system at this time had 128 nodes each with 256GB of memory.
>> > About 252GB was inserted into the absent list from nodes 1 .. 126.
>> > Memory on nodes 0 and 128 was left fully present.

Actually, I was corrected: it was 256 nodes with 128GB (8 * 16GB DIMMs,
which are just now coming out). So there were 254 concurrent initialization
processes running.

2013-08-13 21:35:07

by Nathan Zimmer

[permalink] [raw]
Subject: Re: [RFC v3 0/5] Transparent on-demand struct page initialization embedded in the buddy allocator

On 08/13/2013 01:04 PM, Mike Travis wrote:
>
> On 8/13/2013 10:51 AM, Linus Torvalds wrote:
>> by the time you can log in. And if it then takes another ten minutes
>> until you have the full 16TB initialized, and some things might be a
>> tad slower early on, does anybody really care? The machine will be up
>> and running with plenty of memory, even if it may not be *all* the
>> memory yet.
> Before the patches adding memory took ~45 mins for 16TB and almost 2 hours
> for 32TB. Adding it late sped up early boot but late insertion was still
> very slow, where the full 32TB was still not fully inserted after an hour.
> Doing it in parallel along with the memory hotplug lock per node, we got
> it down to the 10-15 minute range.
Yes, but to get it to the 10-15 minute range I had to change a number of
system locks: the system_sleep and memory_hotplug locks, the zonelist_mutex,
and there was some general alteration to various wmark routines.
I don't know if some of those fixes would stand up to proper scrutiny;
they were quick and dirty hacks to allow for progress.

Nate

2013-08-13 23:10:24

by Nathan Zimmer

[permalink] [raw]
Subject: Re: [RFC v3 0/5] Transparent on-demand struct page initialization embedded in the buddy allocator

On Tue, Aug 13, 2013 at 10:51:37AM -0700, Linus Torvalds wrote:
> I realize that benchmarking cares, and yes, I also realize that some
> benchmarks actually want to reboot the machine between some runs just
> to get repeatability, but if you're benchmarking a 16TB machine I'm
> guessing any serious benchmark that actually uses that much memory is
> going to take many hours to a few days to run anyway? Having some way
> to wait until the memory is all done (which might even be just a silly
> shell script that does "ps" and waits for the kernel threads to all go
> away) isn't going to kill the benchmark - and the benchmark itself
> will then not have to worry about hitting the "oops, I need to
> initialize 2GB of RAM now because I hit an uninitialized page".
>
I am not overly concerned with the cost of having to set up a page struct on
first touch, but what I need to avoid is adding more permanent cost to page
faults on a system that is already "primed".

> Ok, so I don't know all the issues, and in many ways I don't even
> really care. You could do it other ways, I don't think this is a big
> deal. The part I hate is the runtime hook into the core MM page
> allocation code, so I'm just throwing out any random thing that comes
> to my mind that could be used to avoid that part.
>

The only mm structure we are adding to is a new flag in page->flags.
That didn't seem too much.

I had hoped to restrict the core mm changes to check_new_page and
free_pages_check but I haven't gotten there yet.

Not putting uninitialized pages onto the lru would work, but then I
would be concerned about any calculations based on totalpages. I might be
too paranoid there, but having that be incorrect until after a system is
booted worries me.


Nate

2013-08-13 23:55:24

by Linus Torvalds

[permalink] [raw]
Subject: Re: [RFC v3 0/5] Transparent on-demand struct page initialization embedded in the buddy allocator

On Tue, Aug 13, 2013 at 4:10 PM, Nathan Zimmer <[email protected]> wrote:
>
> The only mm structure we are adding to is a new flag in page->flags.
> That didn't seem too much.

I don't agree.

I see only downsides, and no upsides. Doing the same thing *without*
the downsides seems straightforward, so I simply see no reason for any
extra flags or tests at runtime.

Linus

2013-08-14 11:06:06

by Ingo Molnar

[permalink] [raw]
Subject: Re: [RFC v3 0/5] Transparent on-demand struct page initialization embedded in the buddy allocator


* Linus Torvalds <[email protected]> wrote:

> [...]
>
> Ok, so I don't know all the issues, and in many ways I don't even really
> care. You could do it other ways, I don't think this is a big deal. The
> part I hate is the runtime hook into the core MM page allocation code,
> so I'm just throwing out any random thing that comes to my mind that
> could be used to avoid that part.

So, my hope was that it's possible to have a single, simple, zero-cost
runtime check [zero cost for already initialized pages], because it can be
merged into already existing page flag mask checks present here and
executed for every freshly allocated page:

static inline int check_new_page(struct page *page)
{
if (unlikely(page_mapcount(page) |
(page->mapping != NULL) |
(atomic_read(&page->_count) != 0) |
(page->flags & PAGE_FLAGS_CHECK_AT_PREP) |
(mem_cgroup_bad_page_check(page)))) {
bad_page(page);
return 1;
}
return 0;
}

We already run this for every new page allocated and the initialization
check could hide in PAGE_FLAGS_CHECK_AT_PREP in a zero-cost fashion.
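
A minimal sketch of what that might look like (not proposed code; it assumes
the uninitialized state lives in a page flag covered by
PAGE_FLAGS_CHECK_AT_PREP, and init_uninitialized_page() is a hypothetical
helper):

static inline int check_new_page(struct page *page)
{
	if (unlikely(page_mapcount(page) |
		(page->mapping != NULL) |
		(atomic_read(&page->_count) != 0) |
		(page->flags & PAGE_FLAGS_CHECK_AT_PREP) |
		(mem_cgroup_bad_page_check(page)))) {
		/*
		 * Only the unlikely slow path changes: an uninitialized
		 * page trips the existing flags test, gets set up here,
		 * and is then treated as a perfectly good page.
		 */
		if (PageUninitialized2m(page)) {
			init_uninitialized_page(page);	/* hypothetical */
			return 0;
		}
		bad_page(page);
		return 1;
	}
	return 0;
}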

I'd not do any of the ensure_page_is_initialized() or
__expand_page_initialization() complications in this patch-set - each page
head represents itself and gets iterated when check_new_page() is done.

During regular bootup we'd initialize like before, except we don't set up
the page heads but memset() them to zero. With each page head 32 bytes
this would mean 8 GB of page head memory to clear per 1 TB - with 16 TB
that's 128 GB to clear - that ought to be possible to do rather quickly,
perhaps with some smart SMP cross-call approach that makes sure that each
memset is done in a node-local fashion. [*]

Such an approach should IMO be far smaller and less invasive than the
patches presented so far: it should be below 100 lines or so.

I don't know why there's such a big difference between the theory I
outlined and the invasive patch-set implemented so far in practice,
perhaps I'm missing some complication. I was trying to probe that
difference, before giving up on the idea and punting back to the async
hotplug-ish approach which would obviously work well too.

All in all, I think async init just hides the real problem - there's no
way memory init should take this long.

Thanks,

Ingo

[*] alternatively maybe the main performance problem is that node-local
memory is set up on a remote (boot) node? In that case I'd try to
optimize it by migrating the memory init code's current node by using
set_cpus_allowed() to live migrate from node to node, tracking the
node whose struct page array is being initialized.
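
For illustration only (not from the thread), a sketch of the footnote's
node-to-node live migration; it assumes a node-contiguous memmap and that
the scheduler is already far enough along to honour set_cpus_allowed_ptr():

static void __init clear_memmaps_node_local(void)
{
	int nid;

	for_each_online_node(nid) {
		pg_data_t *pgdat = NODE_DATA(nid);

		/* hop onto a CPU of the node whose memmap we are clearing */
		set_cpus_allowed_ptr(current, cpumask_of_node(nid));

		memset(pfn_to_page(pgdat->node_start_pfn), 0,
		       pgdat->node_spanned_pages * sizeof(struct page));
	}

	/* allow running anywhere again once the clearing is done */
	set_cpus_allowed_ptr(current, cpu_online_mask);
}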

2013-08-14 11:27:47

by Ingo Molnar

[permalink] [raw]
Subject: Re: [RFC v3 0/5] Transparent on-demand struct page initialization embedded in the buddy allocator


* Linus Torvalds <[email protected]> wrote:

> On Tue, Aug 13, 2013 at 4:10 PM, Nathan Zimmer <[email protected]> wrote:
> >
> > The only mm structure we are adding to is a new flag in page->flags.
> > That didn't seem too much.
>
> I don't agree.
>
> I see only downsides, and no upsides. Doing the same thing *without* the
> downsides seems straightforward, so I simply see no reason for any extra
> flags or tests at runtime.

The code as presented clearly looks more involved and neither simple nor
zero-cost - I was hoping for a much simpler approach.

I see three solutions:

- Speed up the synchronous memory init code: live migrate to the node
being set up via set_cpus_allowed(), to make sure the init is always
fast and local.

Pros: if it solves the problem then mem init is still synchronous,
deterministic and essentially equivalent to what we do today - so
relatively simple and well-tested, with no 'large machine' special
path.

Cons: it might not be enough and we might not have scheduling
enabled on the affected nodes yet.

- Speed up the synchronous memory init code by parallelizing the key,
most expensive initialization portion of setting up the page head
arrays to per node, via SMP function-calls.

Pros: by far the fastest synchronous option. (It will also test the
power budget and the mains fuses right during bootup.)

Cons: more complex and depends on SMP cross-calls being available at
mem init time. Not necessarily hotplug friendly.

- Avoid the problem by punting to async mem init.

Pros: it gets us to a minimal working system quickly and leaves the
memory code relatively untouched.

Disadvantages: makes memory state asynchronous and non-deterministic.
Stats either fluctuate shortly after bootup or have to be faked.

Thanks,

Ingo

2013-08-14 22:15:12

by Nathan Zimmer

[permalink] [raw]
Subject: Re: [RFC v3 0/5] Transparent on-demand struct page initialization embedded in the buddy allocator

On Wed, Aug 14, 2013 at 01:05:56PM +0200, Ingo Molnar wrote:
>
> * Linus Torvalds <[email protected]> wrote:
>
> > [...]
> >
> > Ok, so I don't know all the issues, and in many ways I don't even really
> > care. You could do it other ways, I don't think this is a big deal. The
> > part I hate is the runtime hook into the core MM page allocation code,
> > so I'm just throwing out any random thing that comes to my mind that
> > could be used to avoid that part.
>
> So, my hope was that it's possible to have a single, simple, zero-cost
> runtime check [zero cost for already initialized pages], because it can be
> merged into already existing page flag mask checks present here and
> executed for every freshly allocated page:
>
> static inline int check_new_page(struct page *page)
> {
> if (unlikely(page_mapcount(page) |
> (page->mapping != NULL) |
> (atomic_read(&page->_count) != 0) |
> (page->flags & PAGE_FLAGS_CHECK_AT_PREP) |
> (mem_cgroup_bad_page_check(page)))) {
> bad_page(page);
> return 1;
> }
> return 0;
> }
>
> We already run this for every new page allocated and the initialization
> check could hide in PAGE_FLAGS_CHECK_AT_PREP in a zero-cost fashion.
>
> I'd not do any of the ensure_page_is_initialized() or
> __expand_page_initialization() complications in this patch-set - each page
> head represents itself and gets iterated when check_new_page() is done.
>
> During regular bootup we'd initialize like before, except we don't set up
> the page heads but memset() them to zero. With each page head 32 bytes
> this would mean 8 GB of page head memory to clear per 1 TB - with 16 TB
> that's 128 GB to clear - that ought to be possible to do rather quickly,
> perhaps with some smart SMP cross-call approach that makes sure that each
> memset is done in a node-local fashion. [*]
>
> Such an approach should IMO be far smaller and less invasive than the
> patches presented so far: it should be below 100 lines or so.
>
> I don't know why there's such a big difference between the theory I
> outlined and the invasive patch-set implemented so far in practice,
> perhaps I'm missing some complication. I was trying to probe that
> difference, before giving up on the idea and punting back to the async
> hotplug-ish approach which would obviously work well too.
>

The reason, which I failed to mention, is that once we pull a page off the lru
in either __rmqueue_fallback or __rmqueue_smallest, the first thing we do with
it is expand() or sometimes move_freepages(). These then trip over some BUG_ON and
VM_BUG_ON.
Those BUG_ONs are what keep causing me to delve into the ensure/expand foolishness.

Nate

2013-08-16 16:36:51

by Dave Hansen

[permalink] [raw]
Subject: Re: [RFC v3 0/5] Transparent on-demand struct page initialization embedded in the buddy allocator

Hey Nathan,

Could you post your boot timing patches? My machines are much smaller
than yours, but I'm curious how things behave here as well.

I did some very imprecise timings (strace -t on a telnet attached to the
serial console). The 'struct page' initializations take about a minute
of boot time for me to do 1TB across 8 NUMA nodes (this is a glueless
QPI system[1]). My _quick_ calculations look like it's 2x as fast to
initialize node0's memory vs. the other nodes, and boot time is
increased by a second for about every 30G of memory we add.

So even with nothing else fancy, we could get some serious improvements
from just doing the initialization locally.

[1] We call anything using pure QPI without any other circuitry for the
NUMA interconnects "glueless".