This patchset is essentially a refactor of the page initialization logic
that is meant to enable better code reuse while providing a
significant improvement in deferred page initialization performance.
In my testing I have seen about a 60% reduction in the time needed for
deferred memory initialization on two different x86_64 based test systems.
In addition this provides a slight improvement for persistent memory
initialization, about 15% from what I can tell, mostly due to folding the
setting of the reserved flag into the other page->flags values that can be
constructed outside of the main initialization loop itself.
The biggest gains of this patchset come from not having to test each pfn
multiple times to see if it is valid and if it is actually a part of the
node being initialized.
v1->v2:
Fixed build issue on PowerPC due to page struct size being 56
Added new patch that removed __SetPageReserved call for hotplug
v2->v3:
Removed patch that had removed __SetPageReserved call from init
Tweaked __init_pageblock to use start_pfn to get section_nr instead of pfn
Added patch that folded __SetPageReserved into set_page_links
Rebased on latest linux-next
---
Alexander Duyck (6):
mm: Use mm_zero_struct_page from SPARC on all 64b architectures
mm: Drop meminit_pfn_in_nid as it is redundant
mm: Use memblock/zone specific iterator for handling deferred page init
mm: Move hot-plug specific memory init into separate functions and optimize
mm: Use common iterator for deferred_init_pages and deferred_free_pages
mm: Add reserved flag setting to set_page_links
arch/sparc/include/asm/pgtable_64.h | 30 --
include/linux/memblock.h | 58 ++++
include/linux/mm.h | 43 +++
mm/memblock.c | 63 ++++
mm/page_alloc.c | 572 +++++++++++++++++++++--------------
5 files changed, 510 insertions(+), 256 deletions(-)
--
This change makes it so that we use the same approach that was already in
use on SPARC on all the architectures that support a 64b long.
This is mostly motivated by the fact that 8 to 10 store/move instructions
are likely always going to be faster than having to call into a function
that is not specialized for handling page init.
An added advantage to doing it this way is that the compiler can get away
with combining writes in the __init_single_page call. As a result the
memset call will be reduced to only about 4 write operations, or at least
that is what I am seeing with GCC 6.2, as the flags, LRU pointers, and
count/mapcount seem to be cancelling out at least 4 of the 8 assignments on
my system.
One change I had to make to the function was to reduce the minimum
supported struct page size to 56 bytes for some powerpc64 configurations.
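To see why the switch collapses to plain stores, here is a minimal user-space sketch of the same pattern; struct fake_page and zero_fake_page are illustrative stand-ins for the kernel's struct page and mm_zero_struct_page, not the kernel code itself:

```c
#include <assert.h>
#include <string.h>

/* Stand-in for an 80-byte struct page: 10 unsigned longs on 64-bit. */
struct fake_page {
	unsigned long words[10];
};

/*
 * Because sizeof() is a compile-time constant, the compiler resolves the
 * switch statically and emits only the stores for the selected case plus
 * the ones it falls through to -- no function call, no loop.
 */
static void zero_fake_page(struct fake_page *page)
{
	unsigned long *_pp = (unsigned long *)page;

	switch (sizeof(struct fake_page)) {
	case 80:
		_pp[9] = 0; /* fallthrough */
	case 72:
		_pp[8] = 0; /* fallthrough */
	default:
		_pp[7] = 0; /* fallthrough */
	case 56:
		_pp[6] = 0;
		_pp[5] = 0;
		_pp[4] = 0;
		_pp[3] = 0;
		_pp[2] = 0;
		_pp[1] = 0;
		_pp[0] = 0;
	}
}
```

With a 10-word struct the case 80 label is taken and falls through all the way down, so the generated code is just the zeroing stores, which the optimizer is then free to merge into wider writes.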
Signed-off-by: Alexander Duyck <[email protected]>
---
arch/sparc/include/asm/pgtable_64.h | 30 ------------------------------
include/linux/mm.h | 34 ++++++++++++++++++++++++++++++++++
2 files changed, 34 insertions(+), 30 deletions(-)
diff --git a/arch/sparc/include/asm/pgtable_64.h b/arch/sparc/include/asm/pgtable_64.h
index 1393a8ac596b..22500c3be7a9 100644
--- a/arch/sparc/include/asm/pgtable_64.h
+++ b/arch/sparc/include/asm/pgtable_64.h
@@ -231,36 +231,6 @@
extern struct page *mem_map_zero;
#define ZERO_PAGE(vaddr) (mem_map_zero)
-/* This macro must be updated when the size of struct page grows above 80
- * or reduces below 64.
- * The idea that compiler optimizes out switch() statement, and only
- * leaves clrx instructions
- */
-#define mm_zero_struct_page(pp) do { \
- unsigned long *_pp = (void *)(pp); \
- \
- /* Check that struct page is either 64, 72, or 80 bytes */ \
- BUILD_BUG_ON(sizeof(struct page) & 7); \
- BUILD_BUG_ON(sizeof(struct page) < 64); \
- BUILD_BUG_ON(sizeof(struct page) > 80); \
- \
- switch (sizeof(struct page)) { \
- case 80: \
- _pp[9] = 0; /* fallthrough */ \
- case 72: \
- _pp[8] = 0; /* fallthrough */ \
- default: \
- _pp[7] = 0; \
- _pp[6] = 0; \
- _pp[5] = 0; \
- _pp[4] = 0; \
- _pp[3] = 0; \
- _pp[2] = 0; \
- _pp[1] = 0; \
- _pp[0] = 0; \
- } \
-} while (0)
-
/* PFNs are real physical page numbers. However, mem_map only begins to record
* per-page information starting at pfn_base. This is to handle systems where
* the first physical page in the machine is at some huge physical address,
diff --git a/include/linux/mm.h b/include/linux/mm.h
index bb0de406f8e7..ec6e57a0c14e 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -102,8 +102,42 @@ static inline void set_max_mapnr(unsigned long limit) { }
* zeroing by defining this macro in <asm/pgtable.h>.
*/
#ifndef mm_zero_struct_page
+#if BITS_PER_LONG == 64
+/* This function must be updated when the size of struct page grows above 80
+ * or reduces below 56. The idea is that the compiler optimizes out the
+ * switch() statement and leaves only the move/store instructions.
+ */
+#define mm_zero_struct_page(pp) __mm_zero_struct_page(pp)
+static inline void __mm_zero_struct_page(struct page *page)
+{
+ unsigned long *_pp = (void *)page;
+
+ /* Check that struct page is either 56, 64, 72, or 80 bytes */
+ BUILD_BUG_ON(sizeof(struct page) & 7);
+ BUILD_BUG_ON(sizeof(struct page) < 56);
+ BUILD_BUG_ON(sizeof(struct page) > 80);
+
+ switch (sizeof(struct page)) {
+ case 80:
+ _pp[9] = 0; /* fallthrough */
+ case 72:
+ _pp[8] = 0; /* fallthrough */
+ default:
+ _pp[7] = 0; /* fallthrough */
+ case 56:
+ _pp[6] = 0;
+ _pp[5] = 0;
+ _pp[4] = 0;
+ _pp[3] = 0;
+ _pp[2] = 0;
+ _pp[1] = 0;
+ _pp[0] = 0;
+ }
+}
+#else
#define mm_zero_struct_page(pp) ((void)memset((pp), 0, sizeof(struct page)))
#endif
+#endif
/*
* Default maximum number of active map areas, this limits the number of vmas
As best as I can tell the meminit_pfn_in_nid call is completely redundant.
The deferred memory initialization is already making use of
for_each_free_mem_range which in turn will call into __next_mem_range which
will only return a memory range if it matches the node ID provided assuming
it is not NUMA_NO_NODE.
I am operating on the assumption that there are no zones or pg_data_t
structures that have a NUMA node of NUMA_NO_NODE associated with them. If
that is the case then __next_mem_range will never return a memory range
that doesn't match the zone's node ID, and as such the check is redundant.
One piece I would like to verify is whether this works for ia64.
Technically it was using a different approach to get the node ID, but it
seems to have the node ID also encoded into the memblock. So I am
assuming this is okay, but would like to get confirmation on that.
Signed-off-by: Alexander Duyck <[email protected]>
---
mm/page_alloc.c | 50 ++++++++++++++------------------------------------
1 file changed, 14 insertions(+), 36 deletions(-)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 4bd858d1c3ba..a766a15fad81 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1301,36 +1301,22 @@ int __meminit early_pfn_to_nid(unsigned long pfn)
#endif
#ifdef CONFIG_NODES_SPAN_OTHER_NODES
-static inline bool __meminit __maybe_unused
-meminit_pfn_in_nid(unsigned long pfn, int node,
- struct mminit_pfnnid_cache *state)
+/* Only safe to use early in boot when initialisation is single-threaded */
+static inline bool __meminit early_pfn_in_nid(unsigned long pfn, int node)
{
int nid;
- nid = __early_pfn_to_nid(pfn, state);
+ nid = __early_pfn_to_nid(pfn, &early_pfnnid_cache);
if (nid >= 0 && nid != node)
return false;
return true;
}
-/* Only safe to use early in boot when initialisation is single-threaded */
-static inline bool __meminit early_pfn_in_nid(unsigned long pfn, int node)
-{
- return meminit_pfn_in_nid(pfn, node, &early_pfnnid_cache);
-}
-
#else
-
static inline bool __meminit early_pfn_in_nid(unsigned long pfn, int node)
{
return true;
}
-static inline bool __meminit __maybe_unused
-meminit_pfn_in_nid(unsigned long pfn, int node,
- struct mminit_pfnnid_cache *state)
-{
- return true;
-}
#endif
@@ -1459,21 +1445,13 @@ static inline void __init pgdat_init_report_one_done(void)
*
* Then, we check if a current large page is valid by only checking the validity
* of the head pfn.
- *
- * Finally, meminit_pfn_in_nid is checked on systems where pfns can interleave
- * within a node: a pfn is between start and end of a node, but does not belong
- * to this memory node.
*/
-static inline bool __init
-deferred_pfn_valid(int nid, unsigned long pfn,
- struct mminit_pfnnid_cache *nid_init_state)
+static inline bool __init deferred_pfn_valid(unsigned long pfn)
{
if (!pfn_valid_within(pfn))
return false;
if (!(pfn & (pageblock_nr_pages - 1)) && !pfn_valid(pfn))
return false;
- if (!meminit_pfn_in_nid(pfn, nid, nid_init_state))
- return false;
return true;
}
@@ -1481,15 +1459,14 @@ static inline void __init pgdat_init_report_one_done(void)
* Free pages to buddy allocator. Try to free aligned pages in
* pageblock_nr_pages sizes.
*/
-static void __init deferred_free_pages(int nid, int zid, unsigned long pfn,
+static void __init deferred_free_pages(unsigned long pfn,
unsigned long end_pfn)
{
- struct mminit_pfnnid_cache nid_init_state = { };
unsigned long nr_pgmask = pageblock_nr_pages - 1;
unsigned long nr_free = 0;
for (; pfn < end_pfn; pfn++) {
- if (!deferred_pfn_valid(nid, pfn, &nid_init_state)) {
+ if (!deferred_pfn_valid(pfn)) {
deferred_free_range(pfn - nr_free, nr_free);
nr_free = 0;
} else if (!(pfn & nr_pgmask)) {
@@ -1509,17 +1486,18 @@ static void __init deferred_free_pages(int nid, int zid, unsigned long pfn,
* by performing it only once every pageblock_nr_pages.
* Return number of pages initialized.
*/
-static unsigned long __init deferred_init_pages(int nid, int zid,
+static unsigned long __init deferred_init_pages(struct zone *zone,
unsigned long pfn,
unsigned long end_pfn)
{
- struct mminit_pfnnid_cache nid_init_state = { };
unsigned long nr_pgmask = pageblock_nr_pages - 1;
+ int nid = zone_to_nid(zone);
unsigned long nr_pages = 0;
+ int zid = zone_idx(zone);
struct page *page = NULL;
for (; pfn < end_pfn; pfn++) {
- if (!deferred_pfn_valid(nid, pfn, &nid_init_state)) {
+ if (!deferred_pfn_valid(pfn)) {
page = NULL;
continue;
} else if (!page || !(pfn & nr_pgmask)) {
@@ -1582,12 +1560,12 @@ static int __init deferred_init_memmap(void *data)
for_each_free_mem_range(i, nid, MEMBLOCK_NONE, &spa, &epa, NULL) {
spfn = max_t(unsigned long, first_init_pfn, PFN_UP(spa));
epfn = min_t(unsigned long, zone_end_pfn(zone), PFN_DOWN(epa));
- nr_pages += deferred_init_pages(nid, zid, spfn, epfn);
+ nr_pages += deferred_init_pages(zone, spfn, epfn);
}
for_each_free_mem_range(i, nid, MEMBLOCK_NONE, &spa, &epa, NULL) {
spfn = max_t(unsigned long, first_init_pfn, PFN_UP(spa));
epfn = min_t(unsigned long, zone_end_pfn(zone), PFN_DOWN(epa));
- deferred_free_pages(nid, zid, spfn, epfn);
+ deferred_free_pages(spfn, epfn);
}
pgdat_resize_unlock(pgdat, &flags);
@@ -1676,7 +1654,7 @@ static int __init deferred_init_memmap(void *data)
while (spfn < epfn && nr_pages < nr_pages_needed) {
t = ALIGN(spfn + PAGES_PER_SECTION, PAGES_PER_SECTION);
first_deferred_pfn = min(t, epfn);
- nr_pages += deferred_init_pages(nid, zid, spfn,
+ nr_pages += deferred_init_pages(zone, spfn,
first_deferred_pfn);
spfn = first_deferred_pfn;
}
@@ -1688,7 +1666,7 @@ static int __init deferred_init_memmap(void *data)
for_each_free_mem_range(i, nid, MEMBLOCK_NONE, &spa, &epa, NULL) {
spfn = max_t(unsigned long, first_init_pfn, PFN_UP(spa));
epfn = min_t(unsigned long, first_deferred_pfn, PFN_DOWN(epa));
- deferred_free_pages(nid, zid, spfn, epfn);
+ deferred_free_pages(spfn, epfn);
if (first_deferred_pfn == epfn)
break;
This patch introduces a new iterator, for_each_free_mem_pfn_range_in_zone.
This iterator will take care of making sure a given memory range provided
is in fact contained within a zone. It takes care of all the bounds checking
we were doing in deferred_grow_zone and deferred_init_memmap. In addition
it should help to speed up the search a bit by iterating until the end of a
range is greater than the start of the zone pfn range, and will exit
completely if the start is beyond the end of the zone.
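The bounds checking the iterator centralizes amounts to clamping each memblock range against the zone's pfn span and deciding whether to skip or stop. A rough user-space sketch of that check (clamp_range_to_zone is a hypothetical helper, not the kernel function):

```c
#include <assert.h>

/*
 * Clamp [*spfn, *epfn) against the zone span [zone_start, zone_end).
 * Returns 1 if some part of the range lies in the zone, 0 if the range
 * contributes nothing -- either it ends before the zone starts (skip and
 * keep searching) or it starts after the zone ends (stop the walk), the
 * same two outcomes handled by __next_mem_pfn_range_in_zone.
 */
static int clamp_range_to_zone(unsigned long *spfn, unsigned long *epfn,
			       unsigned long zone_start, unsigned long zone_end)
{
	if (*epfn <= zone_start || zone_end <= *spfn)
		return 0;
	if (*spfn < zone_start)
		*spfn = zone_start;
	if (zone_end < *epfn)
		*epfn = zone_end;
	return 1;
}
```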
This patch adds yet another iterator,
for_each_free_mem_pfn_range_in_zone_from, and then uses it to support
initializing and freeing pages in groups no larger than MAX_ORDER_NR_PAGES.
By doing this we can greatly improve the cache locality of the pages while
we do several loops over them in the init and freeing process.
We are able to tighten the loops as a result since we only really need the
checks for first_init_pfn in our first iteration, and after that we can
assume that all future values will be greater than it. So I have added a
function called deferred_init_mem_pfn_range_in_zone that primes the
iterator and, if no valid range is found, lets us just exit.
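The MAX_ORDER-sized stepping can be sketched in user space as follows; MAX_ORDER_NR_PAGES_DEMO and next_maxorder_chunk are illustrative names, and 1024 pages is only a plausible example value:

```c
#include <assert.h>

#define MAX_ORDER_NR_PAGES_DEMO 1024UL

/*
 * Advance *spfn to the next MAX_ORDER boundary (or epfn, whichever comes
 * first) and return how many pfns that step covered. Repeated calls walk
 * [start, end) in at-most-MAX_ORDER_NR_PAGES chunks -- the same shape as
 * deferred_init_maxorder, which inits and then frees one such chunk per
 * call so the pages are still cache-warm when they are freed.
 */
static unsigned long next_maxorder_chunk(unsigned long *spfn, unsigned long epfn)
{
	/* Equivalent to the kernel's ALIGN(*spfn + 1, MAX_ORDER_NR_PAGES). */
	unsigned long mo = (*spfn + MAX_ORDER_NR_PAGES_DEMO) &
			   ~(MAX_ORDER_NR_PAGES_DEMO - 1);
	unsigned long end = mo < epfn ? mo : epfn;
	unsigned long n = end - *spfn;

	*spfn = end;
	return n;
}
```

Note that after the first call every subsequent chunk starts on a MAX_ORDER boundary, which is what keeps the buddy page computation from reaching into uninitialized memory.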
Signed-off-by: Alexander Duyck <[email protected]>
---
include/linux/memblock.h | 58 +++++++++++++++
mm/memblock.c | 63 ++++++++++++++++
mm/page_alloc.c | 176 ++++++++++++++++++++++++++++++++--------------
3 files changed, 242 insertions(+), 55 deletions(-)
diff --git a/include/linux/memblock.h b/include/linux/memblock.h
index aee299a6aa76..d62b95dba94e 100644
--- a/include/linux/memblock.h
+++ b/include/linux/memblock.h
@@ -178,6 +178,25 @@ void __next_reserved_mem_region(u64 *idx, phys_addr_t *out_start,
p_start, p_end, p_nid))
/**
+ * for_each_mem_range_from - iterate through memblock areas from type_a and
+ * not included in type_b. Or just type_a if type_b is NULL.
+ * @i: u64 used as loop variable
+ * @type_a: ptr to memblock_type to iterate
+ * @type_b: ptr to memblock_type which excludes from the iteration
+ * @nid: node selector, %NUMA_NO_NODE for all nodes
+ * @flags: pick from blocks based on memory attributes
+ * @p_start: ptr to phys_addr_t for start address of the range, can be %NULL
+ * @p_end: ptr to phys_addr_t for end address of the range, can be %NULL
+ * @p_nid: ptr to int for nid of the range, can be %NULL
+ */
+#define for_each_mem_range_from(i, type_a, type_b, nid, flags, \
+ p_start, p_end, p_nid) \
+ for (i = 0, __next_mem_range(&i, nid, flags, type_a, type_b, \
+ p_start, p_end, p_nid); \
+ i != (u64)ULLONG_MAX; \
+ __next_mem_range(&i, nid, flags, type_a, type_b, \
+ p_start, p_end, p_nid))
+/**
* for_each_mem_range_rev - reverse iterate through memblock areas from
* type_a and not included in type_b. Or just type_a if type_b is NULL.
* @i: u64 used as loop variable
@@ -248,6 +267,45 @@ void __next_mem_pfn_range(int *idx, int nid, unsigned long *out_start_pfn,
i >= 0; __next_mem_pfn_range(&i, nid, p_start, p_end, p_nid))
#endif /* CONFIG_HAVE_MEMBLOCK_NODE_MAP */
+#ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
+void __next_mem_pfn_range_in_zone(u64 *idx, struct zone *zone,
+ unsigned long *out_spfn,
+ unsigned long *out_epfn);
+/**
+ * for_each_free_mem_pfn_range_in_zone - iterate through zone specific
+ * free memblock areas
+ * @i: u64 used as loop variable
+ * @zone: zone in which all of the memory blocks reside
+ * @p_start: ptr to phys_addr_t for start address of the range, can be %NULL
+ * @p_end: ptr to phys_addr_t for end address of the range, can be %NULL
+ *
+ * Walks over free (memory && !reserved) areas of memblock in a specific
+ * zone. Available as soon as memblock is initialized.
+ */
+#define for_each_free_mem_pfn_range_in_zone(i, zone, p_start, p_end) \
+ for (i = 0, \
+ __next_mem_pfn_range_in_zone(&i, zone, p_start, p_end); \
+ i != (u64)ULLONG_MAX; \
+ __next_mem_pfn_range_in_zone(&i, zone, p_start, p_end))
+
+/**
+ * for_each_free_mem_pfn_range_in_zone_from - iterate through zone specific
+ * free memblock areas from a given point
+ * @i: u64 used as loop variable
+ * @zone: zone in which all of the memory blocks reside
+ * @p_start: ptr to phys_addr_t for start address of the range, can be %NULL
+ * @p_end: ptr to phys_addr_t for end address of the range, can be %NULL
+ *
+ * Walks over free (memory && !reserved) areas of memblock in a specific
+ * zone, continuing from current position. Available as soon as memblock is
+ * initialized.
+ */
+#define for_each_free_mem_pfn_range_in_zone_from(i, zone, p_start, p_end) \
+ for (; i != (u64)ULLONG_MAX; \
+ __next_mem_pfn_range_in_zone(&i, zone, p_start, p_end))
+
+#endif /* CONFIG_DEFERRED_STRUCT_PAGE_INIT */
+
/**
* for_each_free_mem_range - iterate through free memblock areas
* @i: u64 used as loop variable
diff --git a/mm/memblock.c b/mm/memblock.c
index 5fefc70253ee..dc6e28e7f869 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -1239,6 +1239,69 @@ int __init_memblock memblock_set_node(phys_addr_t base, phys_addr_t size,
return 0;
}
#endif /* CONFIG_HAVE_MEMBLOCK_NODE_MAP */
+#ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
+/**
+ * __next_mem_pfn_range_in_zone - iterator for for_each_*_range_in_zone()
+ *
+ * @idx: pointer to u64 loop variable
+ * @zone: zone in which all of the memory blocks reside
+ * @out_start: ptr to ulong for start pfn of the range, can be %NULL
+ * @out_end: ptr to ulong for end pfn of the range, can be %NULL
+ *
+ * This function is meant to be a zone/pfn specific wrapper for the
+ * for_each_mem_range type iterators. Specifically they are used in the
+ * deferred memory init routines and as such we were duplicating much of
+ * this logic throughout the code. So instead of having it in multiple
+ * locations it seemed like it would make more sense to centralize this to
+ * one new iterator that does everything they need.
+ */
+void __init_memblock
+__next_mem_pfn_range_in_zone(u64 *idx, struct zone *zone,
+ unsigned long *out_spfn, unsigned long *out_epfn)
+{
+ int zone_nid = zone_to_nid(zone);
+ phys_addr_t spa, epa;
+ int nid;
+
+ __next_mem_range(idx, zone_nid, MEMBLOCK_NONE,
+ &memblock.memory, &memblock.reserved,
+ &spa, &epa, &nid);
+
+ while (*idx != ULLONG_MAX) {
+ unsigned long epfn = PFN_DOWN(epa);
+ unsigned long spfn = PFN_UP(spa);
+
+ /*
+ * Verify the end is at least past the start of the zone and
+ * that we have at least one PFN to initialize.
+ */
+ if (zone->zone_start_pfn < epfn && spfn < epfn) {
+ /* if we went too far just stop searching */
+ if (zone_end_pfn(zone) <= spfn)
+ break;
+
+ if (out_spfn)
+ *out_spfn = max(zone->zone_start_pfn, spfn);
+ if (out_epfn)
+ *out_epfn = min(zone_end_pfn(zone), epfn);
+
+ return;
+ }
+
+ __next_mem_range(idx, zone_nid, MEMBLOCK_NONE,
+ &memblock.memory, &memblock.reserved,
+ &spa, &epa, &nid);
+ }
+
+ /* signal end of iteration */
+ *idx = ULLONG_MAX;
+ if (out_spfn)
+ *out_spfn = ULONG_MAX;
+ if (out_epfn)
+ *out_epfn = 0;
+}
+
+#endif /* CONFIG_DEFERRED_STRUCT_PAGE_INIT */
#ifdef CONFIG_HAVE_MEMBLOCK_PFN_VALID
unsigned long __init_memblock memblock_next_valid_pfn(unsigned long pfn)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index a766a15fad81..20e9eb35d75d 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1512,19 +1512,103 @@ static unsigned long __init deferred_init_pages(struct zone *zone,
return (nr_pages);
}
+/*
+ * This function is meant to pre-load the iterator for the zone init.
+ * Specifically it walks through the ranges until we are caught up to the
+ * first_init_pfn value and exits there. If we never encounter the value we
+ * return false indicating there are no valid ranges left.
+ */
+static bool __init
+deferred_init_mem_pfn_range_in_zone(u64 *i, struct zone *zone,
+ unsigned long *spfn, unsigned long *epfn,
+ unsigned long first_init_pfn)
+{
+ u64 j;
+
+ /*
+ * Start out by walking through the ranges in this zone that have
+ * already been initialized. We don't need to do anything with them
+ * so we just need to flush them out of the system.
+ */
+ for_each_free_mem_pfn_range_in_zone(j, zone, spfn, epfn) {
+ if (*epfn <= first_init_pfn)
+ continue;
+ if (*spfn < first_init_pfn)
+ *spfn = first_init_pfn;
+ *i = j;
+ return true;
+ }
+
+ return false;
+}
+
+/*
+ * Initialize and free pages. We do it in two loops: first we initialize
+ * struct page, than free to buddy allocator, because while we are
+ * freeing pages we can access pages that are ahead (computing buddy
+ * page in __free_one_page()).
+ *
+ * In order to try and keep some memory in the cache we have the loop
+ * broken along max page order boundaries. This way we will not cause
+ * any issues with the buddy page computation.
+ */
+static unsigned long __init
+deferred_init_maxorder(u64 *i, struct zone *zone, unsigned long *start_pfn,
+ unsigned long *end_pfn)
+{
+ unsigned long mo_pfn = ALIGN(*start_pfn + 1, MAX_ORDER_NR_PAGES);
+ unsigned long spfn = *start_pfn, epfn = *end_pfn;
+ unsigned long nr_pages = 0;
+ u64 j = *i;
+
+ /* First we loop through and initialize the page values */
+ for_each_free_mem_pfn_range_in_zone_from(j, zone, &spfn, &epfn) {
+ unsigned long t;
+
+ if (mo_pfn <= spfn)
+ break;
+
+ t = min(mo_pfn, epfn);
+ nr_pages += deferred_init_pages(zone, spfn, t);
+
+ if (mo_pfn <= epfn)
+ break;
+ }
+
+ /* Reset values and now loop through freeing pages as needed */
+ j = *i;
+
+ for_each_free_mem_pfn_range_in_zone_from(j, zone, start_pfn, end_pfn) {
+ unsigned long t;
+
+ if (mo_pfn <= *start_pfn)
+ break;
+
+ t = min(mo_pfn, *end_pfn);
+ deferred_free_pages(*start_pfn, t);
+ *start_pfn = t;
+
+ if (mo_pfn < *end_pfn)
+ break;
+ }
+
+ /* Store our current values to be reused on the next iteration */
+ *i = j;
+
+ return nr_pages;
+}
+
/* Initialise remaining memory on a node */
static int __init deferred_init_memmap(void *data)
{
pg_data_t *pgdat = data;
- int nid = pgdat->node_id;
+ const struct cpumask *cpumask = cpumask_of_node(pgdat->node_id);
+ unsigned long spfn = 0, epfn = 0, nr_pages = 0;
+ unsigned long first_init_pfn, flags;
unsigned long start = jiffies;
- unsigned long nr_pages = 0;
- unsigned long spfn, epfn, first_init_pfn, flags;
- phys_addr_t spa, epa;
- int zid;
struct zone *zone;
- const struct cpumask *cpumask = cpumask_of_node(pgdat->node_id);
u64 i;
+ int zid;
/* Bind memory initialisation thread to a local node if possible */
if (!cpumask_empty(cpumask))
@@ -1549,31 +1633,30 @@ static int __init deferred_init_memmap(void *data)
if (first_init_pfn < zone_end_pfn(zone))
break;
}
- first_init_pfn = max(zone->zone_start_pfn, first_init_pfn);
+
+ /* If the zone is empty somebody else may have cleared out the zone */
+ if (!deferred_init_mem_pfn_range_in_zone(&i, zone, &spfn, &epfn,
+ first_init_pfn)) {
+ pgdat_resize_unlock(pgdat, &flags);
+ pgdat_init_report_one_done();
+ return 0;
+ }
/*
- * Initialize and free pages. We do it in two loops: first we initialize
- * struct page, than free to buddy allocator, because while we are
- * freeing pages we can access pages that are ahead (computing buddy
- * page in __free_one_page()).
+ * Initialize and free pages in MAX_ORDER sized increments so
+ * that we can avoid introducing any issues with the buddy
+ * allocator.
*/
- for_each_free_mem_range(i, nid, MEMBLOCK_NONE, &spa, &epa, NULL) {
- spfn = max_t(unsigned long, first_init_pfn, PFN_UP(spa));
- epfn = min_t(unsigned long, zone_end_pfn(zone), PFN_DOWN(epa));
- nr_pages += deferred_init_pages(zone, spfn, epfn);
- }
- for_each_free_mem_range(i, nid, MEMBLOCK_NONE, &spa, &epa, NULL) {
- spfn = max_t(unsigned long, first_init_pfn, PFN_UP(spa));
- epfn = min_t(unsigned long, zone_end_pfn(zone), PFN_DOWN(epa));
- deferred_free_pages(spfn, epfn);
- }
+ while (spfn < epfn)
+ nr_pages += deferred_init_maxorder(&i, zone, &spfn, &epfn);
+
pgdat_resize_unlock(pgdat, &flags);
/* Sanity check that the next zone really is unpopulated */
WARN_ON(++zid < MAX_NR_ZONES && populated_zone(++zone));
- pr_info("node %d initialised, %lu pages in %ums\n", nid, nr_pages,
- jiffies_to_msecs(jiffies - start));
+ pr_info("node %d initialised, %lu pages in %ums\n",
+ pgdat->node_id, nr_pages, jiffies_to_msecs(jiffies - start));
pgdat_init_report_one_done();
return 0;
@@ -1604,14 +1687,11 @@ static int __init deferred_init_memmap(void *data)
static noinline bool __init
deferred_grow_zone(struct zone *zone, unsigned int order)
{
- int zid = zone_idx(zone);
- int nid = zone_to_nid(zone);
- pg_data_t *pgdat = NODE_DATA(nid);
unsigned long nr_pages_needed = ALIGN(1 << order, PAGES_PER_SECTION);
- unsigned long nr_pages = 0;
- unsigned long first_init_pfn, spfn, epfn, t, flags;
+ pg_data_t *pgdat = zone->zone_pgdat;
unsigned long first_deferred_pfn = pgdat->first_deferred_pfn;
- phys_addr_t spa, epa;
+ unsigned long spfn, epfn, flags;
+ unsigned long nr_pages = 0;
u64 i;
/* Only the last zone may have deferred pages */
@@ -1640,37 +1720,23 @@ static int __init deferred_init_memmap(void *data)
return true;
}
- first_init_pfn = max(zone->zone_start_pfn, first_deferred_pfn);
-
- if (first_init_pfn >= pgdat_end_pfn(pgdat)) {
+ /* If the zone is empty somebody else may have cleared out the zone */
+ if (!deferred_init_mem_pfn_range_in_zone(&i, zone, &spfn, &epfn,
+ first_deferred_pfn)) {
pgdat_resize_unlock(pgdat, &flags);
- return false;
+ return true;
}
- for_each_free_mem_range(i, nid, MEMBLOCK_NONE, &spa, &epa, NULL) {
- spfn = max_t(unsigned long, first_init_pfn, PFN_UP(spa));
- epfn = min_t(unsigned long, zone_end_pfn(zone), PFN_DOWN(epa));
-
- while (spfn < epfn && nr_pages < nr_pages_needed) {
- t = ALIGN(spfn + PAGES_PER_SECTION, PAGES_PER_SECTION);
- first_deferred_pfn = min(t, epfn);
- nr_pages += deferred_init_pages(zone, spfn,
- first_deferred_pfn);
- spfn = first_deferred_pfn;
- }
-
- if (nr_pages >= nr_pages_needed)
- break;
+ /*
+ * Initialize and free pages in MAX_ORDER sized increments so
+ * that we can avoid introducing any issues with the buddy
+ * allocator.
+ */
+ while (spfn < epfn && nr_pages < nr_pages_needed) {
+ nr_pages += deferred_init_maxorder(&i, zone, &spfn, &epfn);
+ first_deferred_pfn = spfn;
}
- for_each_free_mem_range(i, nid, MEMBLOCK_NONE, &spa, &epa, NULL) {
- spfn = max_t(unsigned long, first_init_pfn, PFN_UP(spa));
- epfn = min_t(unsigned long, first_deferred_pfn, PFN_DOWN(epa));
- deferred_free_pages(spfn, epfn);
-
- if (first_deferred_pfn == epfn)
- break;
- }
pgdat->first_deferred_pfn = first_deferred_pfn;
pgdat_resize_unlock(pgdat, &flags);
This patch combines the bits in memmap_init_zone and
memmap_init_zone_device that are related to hotplug into a single function
called __memmap_init_hotplug.
I also took the opportunity to integrate __init_single_page's functionality
into this function. In doing so I can get rid of some of the redundancy,
such as the LRU pointers versus the pgmap.
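The new helper walks the range from the top down, one pageblock-sized stride at a time, so that set_pageblock_migratetype finds the block head still warm in the cache. A user-space sketch of that stride computation (walk_pageblocks_reverse and the 512-page block size are illustrative, not kernel values everywhere):

```c
#include <assert.h>

#define PAGEBLOCK_NR_PAGES_DEMO 512UL

/*
 * Walk [start_pfn, start_pfn + size) from the top down, one pageblock at
 * a time, recording each stride -- the same shape as the loop in
 * __memmap_init_hotplug, where each stride becomes one __init_pageblock
 * call followed by cond_resched(). The first and last strides may be
 * partial blocks; the middle ones are always full pageblocks.
 */
static int walk_pageblocks_reverse(unsigned long start_pfn, unsigned long size,
				   unsigned long *strides, int max)
{
	unsigned long pfn = start_pfn + size;
	int n = 0;

	while (pfn != start_pfn && n < max) {
		unsigned long stride = pfn;
		/* ALIGN_DOWN(pfn - 1, PAGEBLOCK_NR_PAGES), clamped to start. */
		unsigned long down = (pfn - 1) &
				     ~(PAGEBLOCK_NR_PAGES_DEMO - 1);

		pfn = down > start_pfn ? down : start_pfn;
		stride -= pfn;
		strides[n++] = stride;
	}
	return n;
}
```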
Signed-off-by: Alexander Duyck <[email protected]>
---
mm/page_alloc.c | 232 ++++++++++++++++++++++++++++++++++++++-----------------
1 file changed, 159 insertions(+), 73 deletions(-)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 20e9eb35d75d..92375e7867ba 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1192,6 +1192,94 @@ static void __meminit __init_single_page(struct page *page, unsigned long pfn,
#endif
}
+static void __meminit __init_pageblock(unsigned long start_pfn,
+ unsigned long nr_pages,
+ unsigned long zone, int nid,
+ struct dev_pagemap *pgmap,
+ bool is_reserved)
+{
+ unsigned long nr_pgmask = pageblock_nr_pages - 1;
+ struct page *start_page = pfn_to_page(start_pfn);
+ unsigned long pfn = start_pfn + nr_pages - 1;
+#ifdef WANT_PAGE_VIRTUAL
+ bool is_highmem = is_highmem_idx(zone);
+#endif
+ struct page *page;
+
+ /*
+ * Enforce the following requirements:
+ * size > 0
+ * size < pageblock_nr_pages
+ * start_pfn -> pfn does not cross pageblock_nr_pages boundary
+ */
+ VM_BUG_ON(((start_pfn ^ pfn) | (nr_pages - 1)) > nr_pgmask);
+
+ /*
+ * Work from highest page to lowest, this way we will still be
+ * warm in the cache when we call set_pageblock_migratetype
+ * below.
+ *
+ * The loop is based around the page pointer as the main index
+ * instead of the pfn because pfn is not used inside the loop if
+ * the section number is not in page flags and WANT_PAGE_VIRTUAL
+ * is not defined.
+ */
+ for (page = start_page + nr_pages; page-- != start_page; pfn--) {
+ mm_zero_struct_page(page);
+
+ /*
+ * We use the start_pfn instead of pfn in the set_page_links
+ * call because of the fact that the pfn number is used to
+ * get the section_nr and this function should not be
+ * spanning more than a single section.
+ */
+ set_page_links(page, zone, nid, start_pfn);
+ init_page_count(page);
+ page_mapcount_reset(page);
+ page_cpupid_reset_last(page);
+
+ /*
+ * We can use the non-atomic __set_bit operation for setting
+ * the flag as we are still initializing the pages.
+ */
+ if (is_reserved)
+ __SetPageReserved(page);
+
+ /*
+ * ZONE_DEVICE pages union ->lru with a ->pgmap back
+ * pointer and hmm_data. It is a bug if a ZONE_DEVICE
+ * page is ever freed or placed on a driver-private list.
+ */
+ page->pgmap = pgmap;
+ if (!pgmap)
+ INIT_LIST_HEAD(&page->lru);
+
+#ifdef WANT_PAGE_VIRTUAL
+ /* The shift won't overflow because ZONE_NORMAL is below 4G. */
+ if (!is_highmem)
+ set_page_address(page, __va(pfn << PAGE_SHIFT));
+#endif
+ }
+
+ /*
+ * Mark the block movable so that blocks are reserved for
+ * movable at startup. This will force kernel allocations
+ * to reserve their blocks rather than leaking throughout
+ * the address space during boot when many long-lived
+ * kernel allocations are made.
+ *
+ * bitmap is created for zone's valid pfn range. but memmap
+ * can be created for invalid pages (for alignment)
+ * check here not to call set_pageblock_migratetype() against
+ * pfn out of zone.
+ *
+ * Please note that MEMMAP_HOTPLUG path doesn't clear memmap
+ * because this is done early in sparse_add_one_section
+ */
+ if (!(start_pfn & nr_pgmask))
+ set_pageblock_migratetype(start_page, MIGRATE_MOVABLE);
+}
+
#ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
static void __meminit init_reserved_page(unsigned long pfn)
{
@@ -5513,6 +5601,36 @@ void __ref build_all_zonelists(pg_data_t *pgdat)
return false;
}
+static void __meminit __memmap_init_hotplug(unsigned long size, int nid,
+ unsigned long zone,
+ unsigned long start_pfn,
+ struct dev_pagemap *pgmap)
+{
+ unsigned long pfn = start_pfn + size;
+
+ while (pfn != start_pfn) {
+ unsigned long stride = pfn;
+
+ pfn = max(ALIGN_DOWN(pfn - 1, pageblock_nr_pages), start_pfn);
+ stride -= pfn;
+
+ /*
+ * The last argument of __init_pageblock is a boolean
+ * value indicating if the page will be marked as reserved.
+ *
+ * Mark page reserved as it will need to wait for onlining
+ * phase for it to be fully associated with a zone.
+ *
+ * Under certain circumstances ZONE_DEVICE pages may not
+ * need to be marked as reserved, however there is still
+ * code that is depending on this being set for now.
+ */
+ __init_pageblock(pfn, stride, zone, nid, pgmap, true);
+
+ cond_resched();
+ }
+}
+
/*
* Initially all pages are reserved - free ones are freed
* up by memblock_free_all() once the early boot process is
@@ -5523,51 +5641,61 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
struct vmem_altmap *altmap)
{
unsigned long pfn, end_pfn = start_pfn + size;
- struct page *page;
if (highest_memmap_pfn < end_pfn - 1)
highest_memmap_pfn = end_pfn - 1;
+ if (context == MEMMAP_HOTPLUG) {
#ifdef CONFIG_ZONE_DEVICE
- /*
- * Honor reservation requested by the driver for this ZONE_DEVICE
- * memory. We limit the total number of pages to initialize to just
- * those that might contain the memory mapping. We will defer the
- * ZONE_DEVICE page initialization until after we have released
- * the hotplug lock.
- */
- if (zone == ZONE_DEVICE) {
- if (!altmap)
- return;
+ /*
+ * Honor reservation requested by the driver for this
+ * ZONE_DEVICE memory. We limit the total number of pages to
+ * initialize to just those that might contain the memory
+ * mapping. We will defer the ZONE_DEVICE page initialization
+ * until after we have released the hotplug lock.
+ */
+ if (zone == ZONE_DEVICE) {
+ if (!altmap)
+ return;
+
+ if (start_pfn == altmap->base_pfn)
+ start_pfn += altmap->reserve;
+ end_pfn = altmap->base_pfn +
+ vmem_altmap_offset(altmap);
+ }
+#endif
+ /*
+ * For these ZONE_DEVICE pages we don't need to record the
+ * pgmap as they should represent only those pages used to
+ * store the memory map. The actual ZONE_DEVICE pages will
+ * be initialized later.
+ */
+ __memmap_init_hotplug(end_pfn - start_pfn, nid, zone,
+ start_pfn, NULL);
- if (start_pfn == altmap->base_pfn)
- start_pfn += altmap->reserve;
- end_pfn = altmap->base_pfn + vmem_altmap_offset(altmap);
+ return;
}
-#endif
for (pfn = start_pfn; pfn < end_pfn; pfn++) {
+ struct page *page;
+
/*
* There can be holes in boot-time mem_map[]s handed to this
* function. They do not exist on hotplugged memory.
*/
- if (context == MEMMAP_EARLY) {
- if (!early_pfn_valid(pfn)) {
- pfn = next_valid_pfn(pfn) - 1;
- continue;
- }
- if (!early_pfn_in_nid(pfn, nid))
- continue;
- if (overlap_memmap_init(zone, &pfn))
- continue;
- if (defer_init(nid, pfn, end_pfn))
- break;
+ if (!early_pfn_valid(pfn)) {
+ pfn = next_valid_pfn(pfn) - 1;
+ continue;
}
+ if (!early_pfn_in_nid(pfn, nid))
+ continue;
+ if (overlap_memmap_init(zone, &pfn))
+ continue;
+ if (defer_init(nid, pfn, end_pfn))
+ break;
page = pfn_to_page(pfn);
__init_single_page(page, pfn, zone, nid);
- if (context == MEMMAP_HOTPLUG)
- __SetPageReserved(page);
/*
* Mark the block movable so that blocks are reserved for
@@ -5594,14 +5722,12 @@ void __ref memmap_init_zone_device(struct zone *zone,
unsigned long size,
struct dev_pagemap *pgmap)
{
- unsigned long pfn, end_pfn = start_pfn + size;
struct pglist_data *pgdat = zone->zone_pgdat;
unsigned long zone_idx = zone_idx(zone);
unsigned long start = jiffies;
int nid = pgdat->node_id;
- if (WARN_ON_ONCE(!pgmap || !is_dev_zone(zone)))
- return;
+ VM_BUG_ON(!is_dev_zone(zone));
/*
* The call to memmap_init_zone should have already taken care
@@ -5610,53 +5736,13 @@ void __ref memmap_init_zone_device(struct zone *zone,
*/
if (pgmap->altmap_valid) {
struct vmem_altmap *altmap = &pgmap->altmap;
+ unsigned long end_pfn = start_pfn + size;
start_pfn = altmap->base_pfn + vmem_altmap_offset(altmap);
size = end_pfn - start_pfn;
}
- for (pfn = start_pfn; pfn < end_pfn; pfn++) {
- struct page *page = pfn_to_page(pfn);
-
- __init_single_page(page, pfn, zone_idx, nid);
-
- /*
- * Mark page reserved as it will need to wait for onlining
- * phase for it to be fully associated with a zone.
- *
- * We can use the non-atomic __set_bit operation for setting
- * the flag as we are still initializing the pages.
- */
- __SetPageReserved(page);
-
- /*
- * ZONE_DEVICE pages union ->lru with a ->pgmap back
- * pointer and hmm_data. It is a bug if a ZONE_DEVICE
- * page is ever freed or placed on a driver-private list.
- */
- page->pgmap = pgmap;
- page->hmm_data = 0;
-
- /*
- * Mark the block movable so that blocks are reserved for
- * movable at startup. This will force kernel allocations
- * to reserve their blocks rather than leaking throughout
- * the address space during boot when many long-lived
- * kernel allocations are made.
- *
- * bitmap is created for zone's valid pfn range. but memmap
- * can be created for invalid pages (for alignment)
- * check here not to call set_pageblock_migratetype() against
- * pfn out of zone.
- *
- * Please note that MEMMAP_HOTPLUG path doesn't clear memmap
- * because this is done early in sparse_add_one_section
- */
- if (!(pfn & (pageblock_nr_pages - 1))) {
- set_pageblock_migratetype(page, MIGRATE_MOVABLE);
- cond_resched();
- }
- }
+ __memmap_init_hotplug(size, nid, zone_idx, start_pfn, pgmap);
pr_info("%s initialised, %lu pages in %ums\n", dev_name(pgmap->dev),
size, jiffies_to_msecs(jiffies - start));
This patch creates a common iterator to be used by both deferred_init_pages
and deferred_free_pages. By doing this we can cut down a bit on code
overhead as they will likely both be inlined into the same function anyway.
This new approach also allows deferred_init_pages to make use of
__init_pageblock, reducing code size by sharing code between the hotplug
and deferred memory init code paths.
An additional benefit of this approach is improved cache locality during
memory init, as we can focus on the memory areas related to identifying
whether a given PFN is valid and keep that data warm in the cache until we
transition to a region of a different type. So we will stream through a
chunk of valid blocks before we turn to initializing page structs.
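The contract of the new range iterator can be illustrated with a simplified
userspace sketch. This is a toy model, not the kernel code: the pageblock
size, the bitmap-backed pfn_valid stand-in, and all toy_* names are
invented for illustration only.

```c
#include <assert.h>

/* Toy model: pageblock of 8 pfns, validity given by a caller-supplied bitmap. */
#define TOY_PAGEBLOCK 8UL

static int toy_pfn_valid(const unsigned char *valid, unsigned long pfn)
{
	return valid[pfn / 8] & (1 << (pfn % 8));
}

/*
 * Advance *i past the next run of valid pfns (capped at a pageblock
 * boundary) and return the length of that run, or 0 when the range is
 * exhausted -- mirroring the contract of __next_pfn_valid_range() above.
 * The caller recovers the start of the run as *i - count, which is how
 * the for_each_deferred_pfn_valid_range() macro derives pfn.
 */
static unsigned long toy_next_valid_range(const unsigned char *valid,
					  unsigned long *i,
					  unsigned long end_pfn)
{
	unsigned long pfn = *i;
	unsigned long count;

	while (pfn < end_pfn) {
		unsigned long block_end = (pfn / TOY_PAGEBLOCK + 1) * TOY_PAGEBLOCK;

		if (block_end > end_pfn)
			block_end = end_pfn;

		/* Count contiguous valid pfns within this pageblock. */
		for (count = 0; pfn < block_end; pfn++) {
			if (toy_pfn_valid(valid, pfn)) {
				count++;
				continue;
			}
			if (count)
				break;
		}

		if (!count)
			continue;

		*i = pfn;
		return count;
	}
	return 0;
}
```

Because each call hands back one contiguous run of valid pfns, the caller
can initialize or free the whole run in one pass, which is what gives the
cache-locality benefit described above.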
Signed-off-by: Alexander Duyck <[email protected]>
---
mm/page_alloc.c | 134 +++++++++++++++++++++++++++----------------------------
1 file changed, 65 insertions(+), 69 deletions(-)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 92375e7867ba..f145063615a7 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1488,32 +1488,6 @@ void clear_zone_contiguous(struct zone *zone)
}
#ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
-static void __init deferred_free_range(unsigned long pfn,
- unsigned long nr_pages)
-{
- struct page *page;
- unsigned long i;
-
- if (!nr_pages)
- return;
-
- page = pfn_to_page(pfn);
-
- /* Free a large naturally-aligned chunk if possible */
- if (nr_pages == pageblock_nr_pages &&
- (pfn & (pageblock_nr_pages - 1)) == 0) {
- set_pageblock_migratetype(page, MIGRATE_MOVABLE);
- __free_pages_core(page, pageblock_order);
- return;
- }
-
- for (i = 0; i < nr_pages; i++, page++, pfn++) {
- if ((pfn & (pageblock_nr_pages - 1)) == 0)
- set_pageblock_migratetype(page, MIGRATE_MOVABLE);
- __free_pages_core(page, 0);
- }
-}
-
/* Completion tracking for deferred_init_memmap() threads */
static atomic_t pgdat_init_n_undone __initdata;
static __initdata DECLARE_COMPLETION(pgdat_init_all_done_comp);
@@ -1525,48 +1499,77 @@ static inline void __init pgdat_init_report_one_done(void)
}
/*
- * Returns true if page needs to be initialized or freed to buddy allocator.
+ * Returns count if page range needs to be initialized or freed
*
- * First we check if pfn is valid on architectures where it is possible to have
- * holes within pageblock_nr_pages. On systems where it is not possible, this
- * function is optimized out.
+ * First, we check if a current large page is valid by only checking the
+ * validity of the head pfn.
*
- * Then, we check if a current large page is valid by only checking the validity
- * of the head pfn.
+ * Then we check if the contiguous pfns are valid on architectures where it
+ * is possible to have holes within pageblock_nr_pages. On systems where it
+ * is not possible, this function is optimized out.
*/
-static inline bool __init deferred_pfn_valid(unsigned long pfn)
+static unsigned long __next_pfn_valid_range(unsigned long *i,
+ unsigned long end_pfn)
{
- if (!pfn_valid_within(pfn))
- return false;
- if (!(pfn & (pageblock_nr_pages - 1)) && !pfn_valid(pfn))
- return false;
- return true;
+ unsigned long pfn = *i;
+ unsigned long count;
+
+ while (pfn < end_pfn) {
+ unsigned long t = ALIGN(pfn + 1, pageblock_nr_pages);
+ unsigned long pageblock_pfn = min(t, end_pfn);
+
+#ifndef CONFIG_HOLES_IN_ZONE
+ count = pageblock_pfn - pfn;
+ pfn = pageblock_pfn;
+ if (!pfn_valid(pfn))
+ continue;
+#else
+ for (count = 0; pfn < pageblock_pfn; pfn++) {
+ if (pfn_valid_within(pfn)) {
+ count++;
+ continue;
+ }
+
+ if (count)
+ break;
+ }
+
+ if (!count)
+ continue;
+#endif
+ *i = pfn;
+ return count;
+ }
+
+ return 0;
}
+#define for_each_deferred_pfn_valid_range(i, start_pfn, end_pfn, pfn, count) \
+ for (i = (start_pfn), \
+ count = __next_pfn_valid_range(&i, (end_pfn)); \
+ count && ({ pfn = i - count; 1; }); \
+ count = __next_pfn_valid_range(&i, (end_pfn)))
/*
* Free pages to buddy allocator. Try to free aligned pages in
* pageblock_nr_pages sizes.
*/
-static void __init deferred_free_pages(unsigned long pfn,
+static void __init deferred_free_pages(unsigned long start_pfn,
unsigned long end_pfn)
{
- unsigned long nr_pgmask = pageblock_nr_pages - 1;
- unsigned long nr_free = 0;
-
- for (; pfn < end_pfn; pfn++) {
- if (!deferred_pfn_valid(pfn)) {
- deferred_free_range(pfn - nr_free, nr_free);
- nr_free = 0;
- } else if (!(pfn & nr_pgmask)) {
- deferred_free_range(pfn - nr_free, nr_free);
- nr_free = 1;
- touch_nmi_watchdog();
+ unsigned long i, pfn, count;
+
+ for_each_deferred_pfn_valid_range(i, start_pfn, end_pfn, pfn, count) {
+ struct page *page = pfn_to_page(pfn);
+
+ if (count == pageblock_nr_pages) {
+ __free_pages_core(page, pageblock_order);
} else {
- nr_free++;
+ while (count--)
+ __free_pages_core(page++, 0);
}
+
+ touch_nmi_watchdog();
}
- /* Free the last block of pages to allocator */
- deferred_free_range(pfn - nr_free, nr_free);
}
/*
@@ -1575,29 +1578,22 @@ static void __init deferred_free_pages(unsigned long pfn,
* Return number of pages initialized.
*/
static unsigned long __init deferred_init_pages(struct zone *zone,
- unsigned long pfn,
+ unsigned long start_pfn,
unsigned long end_pfn)
{
- unsigned long nr_pgmask = pageblock_nr_pages - 1;
+ unsigned long i, pfn, count;
int nid = zone_to_nid(zone);
unsigned long nr_pages = 0;
int zid = zone_idx(zone);
- struct page *page = NULL;
- for (; pfn < end_pfn; pfn++) {
- if (!deferred_pfn_valid(pfn)) {
- page = NULL;
- continue;
- } else if (!page || !(pfn & nr_pgmask)) {
- page = pfn_to_page(pfn);
- touch_nmi_watchdog();
- } else {
- page++;
- }
- __init_single_page(page, pfn, zid, nid);
- nr_pages++;
+ for_each_deferred_pfn_valid_range(i, start_pfn, end_pfn, pfn, count) {
+ nr_pages += count;
+ __init_pageblock(pfn, count, zid, nid, NULL, false);
+
+ touch_nmi_watchdog();
}
- return (nr_pages);
+
+ return nr_pages;
}
/*
This patch modifies the set_page_links function to include the setting of
the reserved flag via a simple AND and OR operation. The motivation for
this is the fact that the existing __set_bit call still has a measurable
effect on performance: replacing the call with the AND and OR can reduce
initialization time by as much as 4 seconds, ~26s down to ~22s, per NUMA
node on a 4 socket system with 3TB of persistent memory per node.
Looking over the assembly code before and after the change, the main
difference is that the reserved bit is now stored in a value that is
generated outside of the main initialization loop and is then written with
the other flags field values in one write to the page->flags value.
Previously the generated value was written and then a btsq instruction was
issued after testing the is_reserved value.
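The clear-then-OR pattern can be sketched in isolation. This is a hedged
toy model: TOY_PG_RESERVED and the toy_* names are illustrative stand-ins,
and the real PG_reserved bit position is config specific.

```c
#include <assert.h>

/* Illustrative bit position; the real PG_reserved value differs. */
#define TOY_PG_RESERVED 10

/*
 * Branchless set/clear of a single flag, in the style of the
 * set_page_reserved() helper added by this patch: always clear the bit,
 * then OR in 0 or 1 shifted into place. With a compile-time-known
 * 'reserved', the whole thing folds into the constant flags value built
 * outside the loop, avoiding the test + btsq sequence that __set_bit()
 * produced.
 */
static inline unsigned long toy_set_reserved(unsigned long flags,
					     int reserved)
{
	flags &= ~(1UL << TOY_PG_RESERVED);
	flags |= (unsigned long)(!!reserved) << TOY_PG_RESERVED;
	return flags;
}
```

Note the !! normalization: any nonzero input sets exactly one bit, and a
zero input leaves the bit cleared regardless of its previous state.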
Signed-off-by: Alexander Duyck <[email protected]>
---
include/linux/mm.h | 9 ++++++++-
mm/page_alloc.c | 14 +++++---------
2 files changed, 13 insertions(+), 10 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index ec6e57a0c14e..31d374279b90 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1152,11 +1152,18 @@ static inline void set_page_node(struct page *page, unsigned long node)
page->flags |= (node & NODES_MASK) << NODES_PGSHIFT;
}
+static inline void set_page_reserved(struct page *page, bool reserved)
+{
+ page->flags &= ~(1ul << PG_reserved);
+ page->flags |= (unsigned long)(!!reserved) << PG_reserved;
+}
+
static inline void set_page_links(struct page *page, enum zone_type zone,
- unsigned long node, unsigned long pfn)
+ unsigned long node, unsigned long pfn, bool reserved)
{
set_page_zone(page, zone);
set_page_node(page, node);
+ set_page_reserved(page, reserved);
#ifdef SECTION_IN_PAGE_FLAGS
set_page_section(page, pfn_to_section_nr(pfn));
#endif
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index f145063615a7..5c2f545c7669 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1179,7 +1179,7 @@ static void __meminit __init_single_page(struct page *page, unsigned long pfn,
unsigned long zone, int nid)
{
mm_zero_struct_page(page);
- set_page_links(page, zone, nid, pfn);
+ set_page_links(page, zone, nid, pfn, false);
init_page_count(page);
page_mapcount_reset(page);
page_cpupid_reset_last(page);
@@ -1232,20 +1232,16 @@ static void __meminit __init_pageblock(unsigned long start_pfn,
* call because of the fact that the pfn number is used to
* get the section_nr and this function should not be
* spanning more than a single section.
+ *
+ * We can use a non-atomic operation for setting the
+ * PG_reserved flag as we are still initializing the pages.
*/
- set_page_links(page, zone, nid, start_pfn);
+ set_page_links(page, zone, nid, start_pfn, is_reserved);
init_page_count(page);
page_mapcount_reset(page);
page_cpupid_reset_last(page);
/*
- * We can use the non-atomic __set_bit operation for setting
- * the flag as we are still initializing the pages.
- */
- if (is_reserved)
- __SetPageReserved(page);
-
- /*
* ZONE_DEVICE pages union ->lru with a ->pgmap back
* pointer and hmm_data. It is a bug if a ZONE_DEVICE
* page is ever freed or placed on a driver-private list.
On 10/15/18 4:26 PM, Alexander Duyck wrote:
> This change makes it so that we use the same approach that was already in
> use on Sparc on all the architectures that support a 64b long.
>
> This is mostly motivated by the fact that 8 to 10 store/move instructions
> are likely always going to be faster than having to call into a function
> that is not specialized for handling page init.
>
> An added advantage to doing it this way is that the compiler can get away
> with combining writes in the __init_single_page call. As a result the
> memset call will be reduced to only about 4 write operations, or at least
> that is what I am seeing with GCC 6.2 as the flags, LRU pointers, and
> count/mapcount seem to be cancelling out at least 4 of the 8 assignments on
> my system.
>
> One change I had to make to the function was to reduce the minimum page
> size to 56 to support some powerpc64 configurations.
>
> Signed-off-by: Alexander Duyck <[email protected]>
I have tested on Broadcom's Stingray CPU with 48G RAM:
__init_single_page() takes 19.30ns / 64-byte struct page
With the change it takes 17.33ns / 64-byte struct page
Please add this data and also the data from Intel to the description.
Thank you,
Pavel
> ---
> arch/sparc/include/asm/pgtable_64.h | 30 ------------------------------
> include/linux/mm.h | 34 ++++++++++++++++++++++++++++++++++
> 2 files changed, 34 insertions(+), 30 deletions(-)
>
> diff --git a/arch/sparc/include/asm/pgtable_64.h b/arch/sparc/include/asm/pgtable_64.h
> index 1393a8ac596b..22500c3be7a9 100644
> --- a/arch/sparc/include/asm/pgtable_64.h
> +++ b/arch/sparc/include/asm/pgtable_64.h
> @@ -231,36 +231,6 @@
> extern struct page *mem_map_zero;
> #define ZERO_PAGE(vaddr) (mem_map_zero)
>
> -/* This macro must be updated when the size of struct page grows above 80
> - * or reduces below 64.
> - * The idea that compiler optimizes out switch() statement, and only
> - * leaves clrx instructions
> - */
> -#define mm_zero_struct_page(pp) do { \
> - unsigned long *_pp = (void *)(pp); \
> - \
> - /* Check that struct page is either 64, 72, or 80 bytes */ \
> - BUILD_BUG_ON(sizeof(struct page) & 7); \
> - BUILD_BUG_ON(sizeof(struct page) < 64); \
> - BUILD_BUG_ON(sizeof(struct page) > 80); \
> - \
> - switch (sizeof(struct page)) { \
> - case 80: \
> - _pp[9] = 0; /* fallthrough */ \
> - case 72: \
> - _pp[8] = 0; /* fallthrough */ \
> - default: \
> - _pp[7] = 0; \
> - _pp[6] = 0; \
> - _pp[5] = 0; \
> - _pp[4] = 0; \
> - _pp[3] = 0; \
> - _pp[2] = 0; \
> - _pp[1] = 0; \
> - _pp[0] = 0; \
> - } \
> -} while (0)
> -
> /* PFNs are real physical page numbers. However, mem_map only begins to record
> * per-page information starting at pfn_base. This is to handle systems where
> * the first physical page in the machine is at some huge physical address,
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index bb0de406f8e7..ec6e57a0c14e 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -102,8 +102,42 @@ static inline void set_max_mapnr(unsigned long limit) { }
> * zeroing by defining this macro in <asm/pgtable.h>.
> */
The comment above becomes outdated. Please update it: we now use the
optimized mm_zero_struct_page on every 64-bit platform.
> #ifndef mm_zero_struct_page
> +#if BITS_PER_LONG == 64
> +/* This function must be updated when the size of struct page grows above 80
> + * or reduces below 64. The idea that compiler optimizes out switch()
> + * statement, and only leaves move/store instructions
> + */
> +#define mm_zero_struct_page(pp) __mm_zero_struct_page(pp)
> +static inline void __mm_zero_struct_page(struct page *page)
> +{
> + unsigned long *_pp = (void *)page;
> +
> + /* Check that struct page is either 56, 64, 72, or 80 bytes */
> + BUILD_BUG_ON(sizeof(struct page) & 7);
> + BUILD_BUG_ON(sizeof(struct page) < 56);
> + BUILD_BUG_ON(sizeof(struct page) > 80);
> +
> + switch (sizeof(struct page)) {
> + case 80:
> + _pp[9] = 0; /* fallthrough */
> + case 72:
> + _pp[8] = 0; /* fallthrough */
> + default:
> + _pp[7] = 0; /* fallthrough */
> + case 56:
> + _pp[6] = 0;
> + _pp[5] = 0;
> + _pp[4] = 0;
> + _pp[3] = 0;
> + _pp[2] = 0;
> + _pp[1] = 0;
> + _pp[0] = 0;
> + }
> +}
> +#else
> #define mm_zero_struct_page(pp) ((void)memset((pp), 0, sizeof(struct page)))
> #endif
> +#endif
>
> /*
> * Default maximum number of active map areas, this limits the number of vmas
>
On 10/15/18 4:27 PM, Alexander Duyck wrote:
> As best as I can tell the meminit_pfn_in_nid call is completely redundant.
> The deferred memory initialization is already making use of
> for_each_free_mem_range which in turn will call into __next_mem_range which
> will only return a memory range if it matches the node ID provided assuming
> it is not NUMA_NO_NODE.
>
> I am operating on the assumption that there are no zones or pgdata_t
> structures that have a NUMA node of NUMA_NO_NODE associated with them. If
> that is the case then __next_mem_range will never return a memory range
> that doesn't match the zone's node ID and as such the check is redundant.
>
> So one piece I would like to verify on this is whether this works for ia64.
> Technically it was using a different approach to get the node ID, but it
> seems to have the node ID also encoded into the memblock. So I am
> assuming this is okay, but would like to get confirmation on that.
>
> Signed-off-by: Alexander Duyck <[email protected]>
If I am not mistaken, this code is for systems with memory interleaving.
A quick look shows that x86, powerpc, s390, and sparc have it set.
I am not sure about other arches, but at least on SPARC, there are some
processors with memory interleaving feature:
http://www.fujitsu.com/global/products/computing/servers/unix/sparc-enterprise/technology/performance/memory.html
Pavel
> ---
> mm/page_alloc.c | 50 ++++++++++++++------------------------------------
> 1 file changed, 14 insertions(+), 36 deletions(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 4bd858d1c3ba..a766a15fad81 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1301,36 +1301,22 @@ int __meminit early_pfn_to_nid(unsigned long pfn)
> #endif
>
> #ifdef CONFIG_NODES_SPAN_OTHER_NODES
> -static inline bool __meminit __maybe_unused
> -meminit_pfn_in_nid(unsigned long pfn, int node,
> - struct mminit_pfnnid_cache *state)
> +/* Only safe to use early in boot when initialisation is single-threaded */
> +static inline bool __meminit early_pfn_in_nid(unsigned long pfn, int node)
> {
> int nid;
>
> - nid = __early_pfn_to_nid(pfn, state);
> + nid = __early_pfn_to_nid(pfn, &early_pfnnid_cache);
> if (nid >= 0 && nid != node)
> return false;
> return true;
> }
>
> -/* Only safe to use early in boot when initialisation is single-threaded */
> -static inline bool __meminit early_pfn_in_nid(unsigned long pfn, int node)
> -{
> - return meminit_pfn_in_nid(pfn, node, &early_pfnnid_cache);
> -}
> -
> #else
> -
> static inline bool __meminit early_pfn_in_nid(unsigned long pfn, int node)
> {
> return true;
> }
> -static inline bool __meminit __maybe_unused
> -meminit_pfn_in_nid(unsigned long pfn, int node,
> - struct mminit_pfnnid_cache *state)
> -{
> - return true;
> -}
> #endif
>
>
> @@ -1459,21 +1445,13 @@ static inline void __init pgdat_init_report_one_done(void)
> *
> * Then, we check if a current large page is valid by only checking the validity
> * of the head pfn.
> - *
> - * Finally, meminit_pfn_in_nid is checked on systems where pfns can interleave
> - * within a node: a pfn is between start and end of a node, but does not belong
> - * to this memory node.
> */
> -static inline bool __init
> -deferred_pfn_valid(int nid, unsigned long pfn,
> - struct mminit_pfnnid_cache *nid_init_state)
> +static inline bool __init deferred_pfn_valid(unsigned long pfn)
> {
> if (!pfn_valid_within(pfn))
> return false;
> if (!(pfn & (pageblock_nr_pages - 1)) && !pfn_valid(pfn))
> return false;
> - if (!meminit_pfn_in_nid(pfn, nid, nid_init_state))
> - return false;
> return true;
> }
>
> @@ -1481,15 +1459,14 @@ static inline void __init pgdat_init_report_one_done(void)
> * Free pages to buddy allocator. Try to free aligned pages in
> * pageblock_nr_pages sizes.
> */
> -static void __init deferred_free_pages(int nid, int zid, unsigned long pfn,
> +static void __init deferred_free_pages(unsigned long pfn,
> unsigned long end_pfn)
> {
> - struct mminit_pfnnid_cache nid_init_state = { };
> unsigned long nr_pgmask = pageblock_nr_pages - 1;
> unsigned long nr_free = 0;
>
> for (; pfn < end_pfn; pfn++) {
> - if (!deferred_pfn_valid(nid, pfn, &nid_init_state)) {
> + if (!deferred_pfn_valid(pfn)) {
> deferred_free_range(pfn - nr_free, nr_free);
> nr_free = 0;
> } else if (!(pfn & nr_pgmask)) {
> @@ -1509,17 +1486,18 @@ static void __init deferred_free_pages(int nid, int zid, unsigned long pfn,
> * by performing it only once every pageblock_nr_pages.
> * Return number of pages initialized.
> */
> -static unsigned long __init deferred_init_pages(int nid, int zid,
> +static unsigned long __init deferred_init_pages(struct zone *zone,
> unsigned long pfn,
> unsigned long end_pfn)
> {
> - struct mminit_pfnnid_cache nid_init_state = { };
> unsigned long nr_pgmask = pageblock_nr_pages - 1;
> + int nid = zone_to_nid(zone);
> unsigned long nr_pages = 0;
> + int zid = zone_idx(zone);
> struct page *page = NULL;
>
> for (; pfn < end_pfn; pfn++) {
> - if (!deferred_pfn_valid(nid, pfn, &nid_init_state)) {
> + if (!deferred_pfn_valid(pfn)) {
> page = NULL;
> continue;
> } else if (!page || !(pfn & nr_pgmask)) {
> @@ -1582,12 +1560,12 @@ static int __init deferred_init_memmap(void *data)
> for_each_free_mem_range(i, nid, MEMBLOCK_NONE, &spa, &epa, NULL) {
> spfn = max_t(unsigned long, first_init_pfn, PFN_UP(spa));
> epfn = min_t(unsigned long, zone_end_pfn(zone), PFN_DOWN(epa));
> - nr_pages += deferred_init_pages(nid, zid, spfn, epfn);
> + nr_pages += deferred_init_pages(zone, spfn, epfn);
> }
> for_each_free_mem_range(i, nid, MEMBLOCK_NONE, &spa, &epa, NULL) {
> spfn = max_t(unsigned long, first_init_pfn, PFN_UP(spa));
> epfn = min_t(unsigned long, zone_end_pfn(zone), PFN_DOWN(epa));
> - deferred_free_pages(nid, zid, spfn, epfn);
> + deferred_free_pages(spfn, epfn);
> }
> pgdat_resize_unlock(pgdat, &flags);
>
> @@ -1676,7 +1654,7 @@ static int __init deferred_init_memmap(void *data)
> while (spfn < epfn && nr_pages < nr_pages_needed) {
> t = ALIGN(spfn + PAGES_PER_SECTION, PAGES_PER_SECTION);
> first_deferred_pfn = min(t, epfn);
> - nr_pages += deferred_init_pages(nid, zid, spfn,
> + nr_pages += deferred_init_pages(zone, spfn,
> first_deferred_pfn);
> spfn = first_deferred_pfn;
> }
> @@ -1688,7 +1666,7 @@ static int __init deferred_init_memmap(void *data)
> for_each_free_mem_range(i, nid, MEMBLOCK_NONE, &spa, &epa, NULL) {
> spfn = max_t(unsigned long, first_init_pfn, PFN_UP(spa));
> epfn = min_t(unsigned long, first_deferred_pfn, PFN_DOWN(epa));
> - deferred_free_pages(nid, zid, spfn, epfn);
> + deferred_free_pages(spfn, epfn);
>
> if (first_deferred_pfn == epfn)
> break;
>
On 10/16/2018 1:33 PM, Pavel Tatashin wrote:
>
>
> On 10/15/18 4:27 PM, Alexander Duyck wrote:
>> As best as I can tell the meminit_pfn_in_nid call is completely redundant.
>> The deferred memory initialization is already making use of
>> for_each_free_mem_range which in turn will call into __next_mem_range which
>> will only return a memory range if it matches the node ID provided assuming
>> it is not NUMA_NO_NODE.
>>
>> I am operating on the assumption that there are no zones or pgdata_t
>> structures that have a NUMA node of NUMA_NO_NODE associated with them. If
>> that is the case then __next_mem_range will never return a memory range
>> that doesn't match the zone's node ID and as such the check is redundant.
>>
>> So one piece I would like to verify on this is whether this works for ia64.
>> Technically it was using a different approach to get the node ID, but it
>> seems to have the node ID also encoded into the memblock. So I am
>> assuming this is okay, but would like to get confirmation on that.
>>
>> Signed-off-by: Alexander Duyck <[email protected]>
>
> If I am not mistaken, this code is for systems with memory interleaving.
> A quick look shows that x86, powerpc, s390, and sparc have it set.
>
> I am not sure about other arches, but at least on SPARC, there are some
> processors with memory interleaving feature:
>
> http://www.fujitsu.com/global/products/computing/servers/unix/sparc-enterprise/technology/performance/memory.html
>
> Pavel
I get what it is for. However as best I can tell the check is actually
redundant. In the case of the deferred page initialization we are
already pulling the memory regions via "for_each_free_mem_range". That
function is already passed a NUMA node ID. Because of that we are
already checking the memory range to determine if it is in the node or
not. As such it doesn't really make sense to go through for each PFN and
then go back to the memory range and see if the node matches or not.
You can take a look at __next_mem_range which is called by
for_each_free_mem_range and passed &memblock.memory and
&memblock.reserved to avoid:
https://elixir.bootlin.com/linux/latest/source/mm/memblock.c#L899
Then you can work your way through:
meminit_pfn_in_nid(pfn, node, state)
__early_pfn_to_nid(pfn, state)
memblock_search_pfn_nid(pfn, &start_pfn, &end_pfn)
memblock_search(&memblock.memory, pfn)
From what I can tell the deferred init is going back through the
memblock.memory list we pulled this range from and just validating it
against itself. This makes sense for the standard init as that is just
going from start_pfn->end_pfn, but for the deferred init we are pulling
the memory ranges ahead of time so we shouldn't need to re-validate the
memory that is contained within that range.
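The filtering argument above can be modeled with a toy userspace sketch.
This is illustrative only: the toy_range structure and toy_next_range
helper are invented stand-ins for memblock_region and the
for_each_free_mem_range() machinery, not the kernel's actual API.

```c
#include <assert.h>
#include <stddef.h>

/* Toy memblock: ranges tagged with a node id. */
struct toy_range {
	unsigned long start, end;	/* [start, end) in pfns */
	int nid;
};

/*
 * Sketch of the node filtering: only ranges whose nid matches (or, when
 * nid < 0, any range) are handed to the caller. Every pfn inside a
 * returned range therefore already belongs to the requested node, which
 * is why a per-pfn meminit_pfn_in_nid()-style recheck inside the
 * deferred-init loop adds no information.
 */
static const struct toy_range *toy_next_range(const struct toy_range *r,
					      size_t n, size_t *i, int nid)
{
	while (*i < n) {
		const struct toy_range *cur = &r[(*i)++];

		if (nid < 0 || cur->nid == nid)
			return cur;
	}
	return NULL;
}
```

In this model, iterating with nid == 0 simply never yields a range from
another node, so a second per-pfn node check would always pass.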
On 10/16/18 4:49 PM, Alexander Duyck wrote:
> On 10/16/2018 1:33 PM, Pavel Tatashin wrote:
>>
>>
>> On 10/15/18 4:27 PM, Alexander Duyck wrote:
>>> As best as I can tell the meminit_pfn_in_nid call is completely
>>> redundant.
>>> The deferred memory initialization is already making use of
>>> for_each_free_mem_range which in turn will call into __next_mem_range
>>> which
>>> will only return a memory range if it matches the node ID provided
>>> assuming
>>> it is not NUMA_NO_NODE.
>>>
>>> I am operating on the assumption that there are no zones or pgdata_t
>>> structures that have a NUMA node of NUMA_NO_NODE associated with
>>> them. If
>>> that is the case then __next_mem_range will never return a memory range
>>> that doesn't match the zone's node ID and as such the check is
>>> redundant.
>>>
>>> So one piece I would like to verify on this is whether this works for ia64.
>>> Technically it was using a different approach to get the node ID, but it
>>> seems to have the node ID also encoded into the memblock. So I am
>>> assuming this is okay, but would like to get confirmation on that.
>>>
>>> Signed-off-by: Alexander Duyck <[email protected]>
>>
>> If I am not mistaken, this code is for systems with memory interleaving.
>> A quick look shows that x86, powerpc, s390, and sparc have it set.
>>
>> I am not sure about other arches, but at least on SPARC, there are some
>> processors with memory interleaving feature:
>>
>> http://www.fujitsu.com/global/products/computing/servers/unix/sparc-enterprise/technology/performance/memory.html
>>
>>
>> Pavel
>
> I get what it is for. However as best I can tell the check is actually
> redundant. In the case of the deferred page initialization we are
> already pulling the memory regions via "for_each_free_mem_range". That
> function is already passed a NUMA node ID. Because of that we are
> already checking the memory range to determine if it is in the node or
> not. As such it doesn't really make sense to go through for each PFN and
> then go back to the memory range and see if the node matches or not.
>
Agree, it looks redundant, nice clean-up, I like it.
Reviewed-by: Pavel Tatashin <[email protected]>
Thank you,
Pavel
> You can take a look at __next_mem_range which is called by
> for_each_free_mem_range and passed &memblock.memory and
> &memblock.reserved to avoid:
> https://elixir.bootlin.com/linux/latest/source/mm/memblock.c#L899
>
> Then you can work your way through:
> meminit_pfn_in_nid(pfn, node, state)
> __early_pfn_to_nid(pfn, state)
> memblock_search_pfn_nid(pfn, &start_pfn, &end_pfn)
> memblock_search(&memblock.memory, pfn)
>
> From what I can tell the deferred init is going back through the
> memblock.memory list we pulled this range from and just validating it
> against itself. This makes sense for the standard init as that is just
> going from start_pfn->end_pfn, but for the deferred init we are pulling
> the memory ranges ahead of time so we shouldn't need to re-validate the
> memory that is contained within that range.
On Tue, Oct 16, 2018 at 03:01:11PM -0400, Pavel Tatashin wrote:
>
>
> On 10/15/18 4:26 PM, Alexander Duyck wrote:
> > This change makes it so that we use the same approach that was already in
> > use on Sparc on all the architectures that support a 64b long.
> >
> > This is mostly motivated by the fact that 8 to 10 store/move instructions
> > are likely always going to be faster than having to call into a function
> > that is not specialized for handling page init.
> >
> > An added advantage to doing it this way is that the compiler can get away
> > with combining writes in the __init_single_page call. As a result the
> > memset call will be reduced to only about 4 write operations, or at least
> > that is what I am seeing with GCC 6.2 as the flags, LRU pointers, and
> > count/mapcount seem to be cancelling out at least 4 of the 8 assignments on
> > my system.
> >
> > One change I had to make to the function was to reduce the minimum page
> > size to 56 to support some powerpc64 configurations.
> >
> > Signed-off-by: Alexander Duyck <[email protected]>
>
>
> I have tested on Broadcom's Stingray cpu with 48G RAM:
> __init_single_page() takes 19.30ns / 64-byte struct page
> With the change it takes 17.33ns / 64-byte struct page
I gave it a run on an OpenPower (S812LC 8348-21C) with a Power8 processor
and 128G of RAM. My results for a 64-byte struct page were:
before: 4.6788ns
after: 4.5882ns
My two cents :)
> Please add this data and also the data from Intel to the description.
>
> Thank you,
> Pavel
>
> > ---
> > arch/sparc/include/asm/pgtable_64.h | 30 ------------------------------
> > include/linux/mm.h | 34 ++++++++++++++++++++++++++++++++++
> > 2 files changed, 34 insertions(+), 30 deletions(-)
> >
> > diff --git a/arch/sparc/include/asm/pgtable_64.h b/arch/sparc/include/asm/pgtable_64.h
> > index 1393a8ac596b..22500c3be7a9 100644
> > --- a/arch/sparc/include/asm/pgtable_64.h
> > +++ b/arch/sparc/include/asm/pgtable_64.h
> > @@ -231,36 +231,6 @@
> > extern struct page *mem_map_zero;
> > #define ZERO_PAGE(vaddr) (mem_map_zero)
> >
> > -/* This macro must be updated when the size of struct page grows above 80
> > - * or reduces below 64.
> > - * The idea that compiler optimizes out switch() statement, and only
> > - * leaves clrx instructions
> > - */
> > -#define mm_zero_struct_page(pp) do { \
> > - unsigned long *_pp = (void *)(pp); \
> > - \
> > - /* Check that struct page is either 64, 72, or 80 bytes */ \
> > - BUILD_BUG_ON(sizeof(struct page) & 7); \
> > - BUILD_BUG_ON(sizeof(struct page) < 64); \
> > - BUILD_BUG_ON(sizeof(struct page) > 80); \
> > - \
> > - switch (sizeof(struct page)) { \
> > - case 80: \
> > - _pp[9] = 0; /* fallthrough */ \
> > - case 72: \
> > - _pp[8] = 0; /* fallthrough */ \
> > - default: \
> > - _pp[7] = 0; \
> > - _pp[6] = 0; \
> > - _pp[5] = 0; \
> > - _pp[4] = 0; \
> > - _pp[3] = 0; \
> > - _pp[2] = 0; \
> > - _pp[1] = 0; \
> > - _pp[0] = 0; \
> > - } \
> > -} while (0)
> > -
> > /* PFNs are real physical page numbers. However, mem_map only begins to record
> > * per-page information starting at pfn_base. This is to handle systems where
> > * the first physical page in the machine is at some huge physical address,
> > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > index bb0de406f8e7..ec6e57a0c14e 100644
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -102,8 +102,42 @@ static inline void set_max_mapnr(unsigned long limit) { }
> > * zeroing by defining this macro in <asm/pgtable.h>.
> > */
>
> The comment above becomes outdated. Please change it: we now use the
> optimized mm_zero_struct_page on every 64-bit platform.
>
> > #ifndef mm_zero_struct_page
> > +#if BITS_PER_LONG == 64
> > +/* This function must be updated when the size of struct page grows above 80
> > + * or reduces below 64. The idea that compiler optimizes out switch()
> > + * statement, and only leaves move/store instructions
> > + */
> > +#define mm_zero_struct_page(pp) __mm_zero_struct_page(pp)
> > +static inline void __mm_zero_struct_page(struct page *page)
> > +{
> > + unsigned long *_pp = (void *)page;
> > +
> > + /* Check that struct page is either 56, 64, 72, or 80 bytes */
> > + BUILD_BUG_ON(sizeof(struct page) & 7);
> > + BUILD_BUG_ON(sizeof(struct page) < 56);
> > + BUILD_BUG_ON(sizeof(struct page) > 80);
> > +
> > + switch (sizeof(struct page)) {
> > + case 80:
> > + _pp[9] = 0; /* fallthrough */
> > + case 72:
> > + _pp[8] = 0; /* fallthrough */
> > + default:
> > + _pp[7] = 0; /* fallthrough */
> > + case 56:
> > + _pp[6] = 0;
> > + _pp[5] = 0;
> > + _pp[4] = 0;
> > + _pp[3] = 0;
> > + _pp[2] = 0;
> > + _pp[1] = 0;
> > + _pp[0] = 0;
> > + }
> > +}
> > +#else
> > #define mm_zero_struct_page(pp) ((void)memset((pp), 0, sizeof(struct page)))
> > #endif
> > +#endif
> >
> > /*
> > * Default maximum number of active map areas, this limits the number of vmas
> >
>
--
Sincerely yours,
Mike.
On Mon 15-10-18 13:26:56, Alexander Duyck wrote:
> This change makes it so that we use the same approach that was already in
> use on Sparc on all the architectures that support a 64b long.
>
> This is mostly motivated by the fact that 8 to 10 store/move instructions
> are likely always going to be faster than having to call into a function
> that is not specialized for handling page init.
>
> An added advantage to doing it this way is that the compiler can get away
> with combining writes in the __init_single_page call. As a result the
> memset call will be reduced to only about 4 write operations, or at least
> that is what I am seeing with GCC 6.2 as the flags, LRU pointers, and
> count/mapcount seem to be cancelling out at least 4 of the 8 assignments on
> my system.
>
> One change I had to make to the function was to reduce the minimum page
> size to 56 to support some powerpc64 configurations.
This really begs for numbers. I do not mind the change itself with some
minor comments below.
[...]
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index bb0de406f8e7..ec6e57a0c14e 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -102,8 +102,42 @@ static inline void set_max_mapnr(unsigned long limit) { }
> * zeroing by defining this macro in <asm/pgtable.h>.
> */
> #ifndef mm_zero_struct_page
Do we still need this ifdef? I guess we can wait for an arch which
doesn't like this change and then add the override. I would rather go
simple if possible.
> +#if BITS_PER_LONG == 64
> +/* This function must be updated when the size of struct page grows above 80
> + * or reduces below 64. The idea that compiler optimizes out switch()
> + * statement, and only leaves move/store instructions
> + */
> +#define mm_zero_struct_page(pp) __mm_zero_struct_page(pp)
> +static inline void __mm_zero_struct_page(struct page *page)
> +{
> + unsigned long *_pp = (void *)page;
> +
> + /* Check that struct page is either 56, 64, 72, or 80 bytes */
> + BUILD_BUG_ON(sizeof(struct page) & 7);
> + BUILD_BUG_ON(sizeof(struct page) < 56);
> + BUILD_BUG_ON(sizeof(struct page) > 80);
> +
> + switch (sizeof(struct page)) {
> + case 80:
> + _pp[9] = 0; /* fallthrough */
> + case 72:
> + _pp[8] = 0; /* fallthrough */
> + default:
> + _pp[7] = 0; /* fallthrough */
> + case 56:
> + _pp[6] = 0;
> + _pp[5] = 0;
> + _pp[4] = 0;
> + _pp[3] = 0;
> + _pp[2] = 0;
> + _pp[1] = 0;
> + _pp[0] = 0;
> + }
This just hit my eyes. I have to confess I have never seen default: placed
anywhere but last in a switch. Can we have case 64 instead, or does gcc
complain? I would be surprised if it did, given the set of BUILD_BUG_ONs.
> +}
> +#else
> #define mm_zero_struct_page(pp) ((void)memset((pp), 0, sizeof(struct page)))
> #endif
> +#endif
>
> /*
> * Default maximum number of active map areas, this limits the number of vmas
>
--
Michal Hocko
SUSE Labs
On Mon 15-10-18 13:27:03, Alexander Duyck wrote:
> As best as I can tell the meminit_pfn_in_nid call is completely redundant.
> The deferred memory initialization is already making use of
> for_each_free_mem_range which in turn will call into __next_mem_range which
> will only return a memory range if it matches the node ID provided assuming
> it is not NUMA_NO_NODE.
>
> I am operating on the assumption that there are no zones or pg_data_t
> structures that have a NUMA node of NUMA_NO_NODE associated with them. If
> that is the case then __next_mem_range will never return a memory range
> that doesn't match the zone's node ID and as such the check is redundant.
>
> So one piece I would like to verify on this is whether this works for ia64.
> Technically it was using a different approach to get the node ID, but it
> seems to have the node ID also encoded into the memblock. So I am
> assuming this is okay, but would like to get confirmation on that.
Good catch. Both for_each_free_mem_range and __early_pfn_to_nid rely on
the memblock layer to properly map ranges to nids. I haven't checked too
closely whether this was really necessary in the original deferred
implementation by Mel, but it is much clearer that it is not needed
now with the clear iterator.
> Signed-off-by: Alexander Duyck <[email protected]>
Acked-by: Michal Hocko <[email protected]>
> ---
> mm/page_alloc.c | 50 ++++++++++++++------------------------------------
> 1 file changed, 14 insertions(+), 36 deletions(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 4bd858d1c3ba..a766a15fad81 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1301,36 +1301,22 @@ int __meminit early_pfn_to_nid(unsigned long pfn)
> #endif
>
> #ifdef CONFIG_NODES_SPAN_OTHER_NODES
> -static inline bool __meminit __maybe_unused
> -meminit_pfn_in_nid(unsigned long pfn, int node,
> - struct mminit_pfnnid_cache *state)
> +/* Only safe to use early in boot when initialisation is single-threaded */
> +static inline bool __meminit early_pfn_in_nid(unsigned long pfn, int node)
> {
> int nid;
>
> - nid = __early_pfn_to_nid(pfn, state);
> + nid = __early_pfn_to_nid(pfn, &early_pfnnid_cache);
> if (nid >= 0 && nid != node)
> return false;
> return true;
> }
>
> -/* Only safe to use early in boot when initialisation is single-threaded */
> -static inline bool __meminit early_pfn_in_nid(unsigned long pfn, int node)
> -{
> - return meminit_pfn_in_nid(pfn, node, &early_pfnnid_cache);
> -}
> -
> #else
> -
> static inline bool __meminit early_pfn_in_nid(unsigned long pfn, int node)
> {
> return true;
> }
> -static inline bool __meminit __maybe_unused
> -meminit_pfn_in_nid(unsigned long pfn, int node,
> - struct mminit_pfnnid_cache *state)
> -{
> - return true;
> -}
> #endif
>
>
> @@ -1459,21 +1445,13 @@ static inline void __init pgdat_init_report_one_done(void)
> *
> * Then, we check if a current large page is valid by only checking the validity
> * of the head pfn.
> - *
> - * Finally, meminit_pfn_in_nid is checked on systems where pfns can interleave
> - * within a node: a pfn is between start and end of a node, but does not belong
> - * to this memory node.
> */
> -static inline bool __init
> -deferred_pfn_valid(int nid, unsigned long pfn,
> - struct mminit_pfnnid_cache *nid_init_state)
> +static inline bool __init deferred_pfn_valid(unsigned long pfn)
> {
> if (!pfn_valid_within(pfn))
> return false;
> if (!(pfn & (pageblock_nr_pages - 1)) && !pfn_valid(pfn))
> return false;
> - if (!meminit_pfn_in_nid(pfn, nid, nid_init_state))
> - return false;
> return true;
> }
>
> @@ -1481,15 +1459,14 @@ static inline void __init pgdat_init_report_one_done(void)
> * Free pages to buddy allocator. Try to free aligned pages in
> * pageblock_nr_pages sizes.
> */
> -static void __init deferred_free_pages(int nid, int zid, unsigned long pfn,
> +static void __init deferred_free_pages(unsigned long pfn,
> unsigned long end_pfn)
> {
> - struct mminit_pfnnid_cache nid_init_state = { };
> unsigned long nr_pgmask = pageblock_nr_pages - 1;
> unsigned long nr_free = 0;
>
> for (; pfn < end_pfn; pfn++) {
> - if (!deferred_pfn_valid(nid, pfn, &nid_init_state)) {
> + if (!deferred_pfn_valid(pfn)) {
> deferred_free_range(pfn - nr_free, nr_free);
> nr_free = 0;
> } else if (!(pfn & nr_pgmask)) {
> @@ -1509,17 +1486,18 @@ static void __init deferred_free_pages(int nid, int zid, unsigned long pfn,
> * by performing it only once every pageblock_nr_pages.
> * Return number of pages initialized.
> */
> -static unsigned long __init deferred_init_pages(int nid, int zid,
> +static unsigned long __init deferred_init_pages(struct zone *zone,
> unsigned long pfn,
> unsigned long end_pfn)
> {
> - struct mminit_pfnnid_cache nid_init_state = { };
> unsigned long nr_pgmask = pageblock_nr_pages - 1;
> + int nid = zone_to_nid(zone);
> unsigned long nr_pages = 0;
> + int zid = zone_idx(zone);
> struct page *page = NULL;
>
> for (; pfn < end_pfn; pfn++) {
> - if (!deferred_pfn_valid(nid, pfn, &nid_init_state)) {
> + if (!deferred_pfn_valid(pfn)) {
> page = NULL;
> continue;
> } else if (!page || !(pfn & nr_pgmask)) {
> @@ -1582,12 +1560,12 @@ static int __init deferred_init_memmap(void *data)
> for_each_free_mem_range(i, nid, MEMBLOCK_NONE, &spa, &epa, NULL) {
> spfn = max_t(unsigned long, first_init_pfn, PFN_UP(spa));
> epfn = min_t(unsigned long, zone_end_pfn(zone), PFN_DOWN(epa));
> - nr_pages += deferred_init_pages(nid, zid, spfn, epfn);
> + nr_pages += deferred_init_pages(zone, spfn, epfn);
> }
> for_each_free_mem_range(i, nid, MEMBLOCK_NONE, &spa, &epa, NULL) {
> spfn = max_t(unsigned long, first_init_pfn, PFN_UP(spa));
> epfn = min_t(unsigned long, zone_end_pfn(zone), PFN_DOWN(epa));
> - deferred_free_pages(nid, zid, spfn, epfn);
> + deferred_free_pages(spfn, epfn);
> }
> pgdat_resize_unlock(pgdat, &flags);
>
> @@ -1676,7 +1654,7 @@ static int __init deferred_init_memmap(void *data)
> while (spfn < epfn && nr_pages < nr_pages_needed) {
> t = ALIGN(spfn + PAGES_PER_SECTION, PAGES_PER_SECTION);
> first_deferred_pfn = min(t, epfn);
> - nr_pages += deferred_init_pages(nid, zid, spfn,
> + nr_pages += deferred_init_pages(zone, spfn,
> first_deferred_pfn);
> spfn = first_deferred_pfn;
> }
> @@ -1688,7 +1666,7 @@ static int __init deferred_init_memmap(void *data)
> for_each_free_mem_range(i, nid, MEMBLOCK_NONE, &spa, &epa, NULL) {
> spfn = max_t(unsigned long, first_init_pfn, PFN_UP(spa));
> epfn = min_t(unsigned long, first_deferred_pfn, PFN_DOWN(epa));
> - deferred_free_pages(nid, zid, spfn, epfn);
> + deferred_free_pages(spfn, epfn);
>
> if (first_deferred_pfn == epfn)
> break;
>
--
Michal Hocko
SUSE Labs
On Mon 15-10-18 13:27:09, Alexander Duyck wrote:
> This patch introduces a new iterator for_each_free_mem_pfn_range_in_zone.
>
> This iterator will take care of making sure a given memory range provided
> is in fact contained within a zone. It takes care of all the bounds checking
> we were doing in deferred_grow_zone, and deferred_init_memmap. In addition
> it should help to speed up the search a bit by iterating until the end of a
> range is greater than the start of the zone pfn range, and will exit
> completely if the start is beyond the end of the zone.
>
> This patch adds yet another iterator called
> for_each_free_mem_range_in_zone_from and then uses it to support
> initializing and freeing pages in groups no larger than MAX_ORDER_NR_PAGES.
> By doing this we can greatly improve the cache locality of the pages while
> we do several loops over them in the init and freeing process.
>
> We are able to tighten the loops as a result since we only really need the
> checks for first_init_pfn in our first iteration and after that we can
> assume that all future values will be greater than this. So I have added a
> function called deferred_init_mem_pfn_range_in_zone that primes the
> iterators and if it fails we can just exit.
Numbers please.
Besides that, this adds a lot of code and I am not convinced the result
is so much better to justify that.
> Signed-off-by: Alexander Duyck <[email protected]>
> ---
> include/linux/memblock.h | 58 +++++++++++++++
> mm/memblock.c | 63 ++++++++++++++++
> mm/page_alloc.c | 176 ++++++++++++++++++++++++++++++++--------------
> 3 files changed, 242 insertions(+), 55 deletions(-)
--
Michal Hocko
SUSE Labs
On Mon 15-10-18 13:27:16, Alexander Duyck wrote:
> This patch is going through and combining the bits in memmap_init_zone and
> memmap_init_zone_device that are related to hotplug into a single function
> called __memmap_init_hotplug.
>
> I also took the opportunity to integrate __init_single_page's functionality
> into this function. In doing so I can get rid of some of the redundancy
> such as the LRU pointers versus the pgmap.
This patch depends on [1] which I've had some concerns about. It adds
more code on top. I am still not convinced this is the right direction.
[1] http://lkml.kernel.org/r/[email protected]
>
> Signed-off-by: Alexander Duyck <[email protected]>
> ---
> mm/page_alloc.c | 232 ++++++++++++++++++++++++++++++++++++++-----------------
> 1 file changed, 159 insertions(+), 73 deletions(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 20e9eb35d75d..92375e7867ba 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1192,6 +1192,94 @@ static void __meminit __init_single_page(struct page *page, unsigned long pfn,
> #endif
> }
>
> +static void __meminit __init_pageblock(unsigned long start_pfn,
> + unsigned long nr_pages,
> + unsigned long zone, int nid,
> + struct dev_pagemap *pgmap,
> + bool is_reserved)
> +{
> + unsigned long nr_pgmask = pageblock_nr_pages - 1;
> + struct page *start_page = pfn_to_page(start_pfn);
> + unsigned long pfn = start_pfn + nr_pages - 1;
> +#ifdef WANT_PAGE_VIRTUAL
> + bool is_highmem = is_highmem_idx(zone);
> +#endif
> + struct page *page;
> +
> + /*
> + * Enforce the following requirements:
> + * size > 0
> + * size < pageblock_nr_pages
> + * start_pfn -> pfn does not cross pageblock_nr_pages boundary
> + */
> + VM_BUG_ON(((start_pfn ^ pfn) | (nr_pages - 1)) > nr_pgmask);
> +
> + /*
> + * Work from highest page to lowest, this way we will still be
> + * warm in the cache when we call set_pageblock_migratetype
> + * below.
> + *
> + * The loop is based around the page pointer as the main index
> + * instead of the pfn because pfn is not used inside the loop if
> + * the section number is not in page flags and WANT_PAGE_VIRTUAL
> + * is not defined.
> + */
> + for (page = start_page + nr_pages; page-- != start_page; pfn--) {
> + mm_zero_struct_page(page);
> +
> + /*
> + * We use the start_pfn instead of pfn in the set_page_links
> + * call because of the fact that the pfn number is used to
> + * get the section_nr and this function should not be
> + * spanning more than a single section.
> + */
> + set_page_links(page, zone, nid, start_pfn);
> + init_page_count(page);
> + page_mapcount_reset(page);
> + page_cpupid_reset_last(page);
> +
> + /*
> + * We can use the non-atomic __set_bit operation for setting
> + * the flag as we are still initializing the pages.
> + */
> + if (is_reserved)
> + __SetPageReserved(page);
> +
> + /*
> + * ZONE_DEVICE pages union ->lru with a ->pgmap back
> + * pointer and hmm_data. It is a bug if a ZONE_DEVICE
> + * page is ever freed or placed on a driver-private list.
> + */
> + page->pgmap = pgmap;
> + if (!pgmap)
> + INIT_LIST_HEAD(&page->lru);
> +
> +#ifdef WANT_PAGE_VIRTUAL
> + /* The shift won't overflow because ZONE_NORMAL is below 4G. */
> + if (!is_highmem)
> + set_page_address(page, __va(pfn << PAGE_SHIFT));
> +#endif
> + }
> +
> + /*
> + * Mark the block movable so that blocks are reserved for
> + * movable at startup. This will force kernel allocations
> + * to reserve their blocks rather than leaking throughout
> + * the address space during boot when many long-lived
> + * kernel allocations are made.
> + *
> + * bitmap is created for zone's valid pfn range. but memmap
> + * can be created for invalid pages (for alignment)
> + * check here not to call set_pageblock_migratetype() against
> + * pfn out of zone.
> + *
> + * Please note that MEMMAP_HOTPLUG path doesn't clear memmap
> + * because this is done early in sparse_add_one_section
> + */
> + if (!(start_pfn & nr_pgmask))
> + set_pageblock_migratetype(start_page, MIGRATE_MOVABLE);
> +}
> +
> #ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
> static void __meminit init_reserved_page(unsigned long pfn)
> {
> @@ -5513,6 +5601,36 @@ void __ref build_all_zonelists(pg_data_t *pgdat)
> return false;
> }
>
> +static void __meminit __memmap_init_hotplug(unsigned long size, int nid,
> + unsigned long zone,
> + unsigned long start_pfn,
> + struct dev_pagemap *pgmap)
> +{
> + unsigned long pfn = start_pfn + size;
> +
> + while (pfn != start_pfn) {
> + unsigned long stride = pfn;
> +
> + pfn = max(ALIGN_DOWN(pfn - 1, pageblock_nr_pages), start_pfn);
> + stride -= pfn;
> +
> + /*
> + * The last argument of __init_pageblock is a boolean
> + * value indicating if the page will be marked as reserved.
> + *
> + * Mark page reserved as it will need to wait for onlining
> + * phase for it to be fully associated with a zone.
> + *
> + * Under certain circumstances ZONE_DEVICE pages may not
> + * need to be marked as reserved, however there is still
> + * code that is depending on this being set for now.
> + */
> + __init_pageblock(pfn, stride, zone, nid, pgmap, true);
> +
> + cond_resched();
> + }
> +}
> +
> /*
> * Initially all pages are reserved - free ones are freed
> * up by memblock_free_all() once the early boot process is
> @@ -5523,51 +5641,61 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
> struct vmem_altmap *altmap)
> {
> unsigned long pfn, end_pfn = start_pfn + size;
> - struct page *page;
>
> if (highest_memmap_pfn < end_pfn - 1)
> highest_memmap_pfn = end_pfn - 1;
>
> + if (context == MEMMAP_HOTPLUG) {
> #ifdef CONFIG_ZONE_DEVICE
> - /*
> - * Honor reservation requested by the driver for this ZONE_DEVICE
> - * memory. We limit the total number of pages to initialize to just
> - * those that might contain the memory mapping. We will defer the
> - * ZONE_DEVICE page initialization until after we have released
> - * the hotplug lock.
> - */
> - if (zone == ZONE_DEVICE) {
> - if (!altmap)
> - return;
> + /*
> + * Honor reservation requested by the driver for this
> + * ZONE_DEVICE memory. We limit the total number of pages to
> + * initialize to just those that might contain the memory
> + * mapping. We will defer the ZONE_DEVICE page initialization
> + * until after we have released the hotplug lock.
> + */
> + if (zone == ZONE_DEVICE) {
> + if (!altmap)
> + return;
> +
> + if (start_pfn == altmap->base_pfn)
> + start_pfn += altmap->reserve;
> + end_pfn = altmap->base_pfn +
> + vmem_altmap_offset(altmap);
> + }
> +#endif
> + /*
> + * For these ZONE_DEVICE pages we don't need to record the
> + * pgmap as they should represent only those pages used to
> + * store the memory map. The actual ZONE_DEVICE pages will
> + * be initialized later.
> + */
> + __memmap_init_hotplug(end_pfn - start_pfn, nid, zone,
> + start_pfn, NULL);
>
> - if (start_pfn == altmap->base_pfn)
> - start_pfn += altmap->reserve;
> - end_pfn = altmap->base_pfn + vmem_altmap_offset(altmap);
> + return;
> }
> -#endif
>
> for (pfn = start_pfn; pfn < end_pfn; pfn++) {
> + struct page *page;
> +
> /*
> * There can be holes in boot-time mem_map[]s handed to this
> * function. They do not exist on hotplugged memory.
> */
> - if (context == MEMMAP_EARLY) {
> - if (!early_pfn_valid(pfn)) {
> - pfn = next_valid_pfn(pfn) - 1;
> - continue;
> - }
> - if (!early_pfn_in_nid(pfn, nid))
> - continue;
> - if (overlap_memmap_init(zone, &pfn))
> - continue;
> - if (defer_init(nid, pfn, end_pfn))
> - break;
> + if (!early_pfn_valid(pfn)) {
> + pfn = next_valid_pfn(pfn) - 1;
> + continue;
> }
> + if (!early_pfn_in_nid(pfn, nid))
> + continue;
> + if (overlap_memmap_init(zone, &pfn))
> + continue;
> + if (defer_init(nid, pfn, end_pfn))
> + break;
>
> page = pfn_to_page(pfn);
> __init_single_page(page, pfn, zone, nid);
> - if (context == MEMMAP_HOTPLUG)
> - __SetPageReserved(page);
>
> /*
> * Mark the block movable so that blocks are reserved for
> @@ -5594,14 +5722,12 @@ void __ref memmap_init_zone_device(struct zone *zone,
> unsigned long size,
> struct dev_pagemap *pgmap)
> {
> - unsigned long pfn, end_pfn = start_pfn + size;
> struct pglist_data *pgdat = zone->zone_pgdat;
> unsigned long zone_idx = zone_idx(zone);
> unsigned long start = jiffies;
> int nid = pgdat->node_id;
>
> - if (WARN_ON_ONCE(!pgmap || !is_dev_zone(zone)))
> - return;
> + VM_BUG_ON(!is_dev_zone(zone));
>
> /*
> * The call to memmap_init_zone should have already taken care
> @@ -5610,53 +5736,13 @@ void __ref memmap_init_zone_device(struct zone *zone,
> */
> if (pgmap->altmap_valid) {
> struct vmem_altmap *altmap = &pgmap->altmap;
> + unsigned long end_pfn = start_pfn + size;
>
> start_pfn = altmap->base_pfn + vmem_altmap_offset(altmap);
> size = end_pfn - start_pfn;
> }
>
> - for (pfn = start_pfn; pfn < end_pfn; pfn++) {
> - struct page *page = pfn_to_page(pfn);
> -
> - __init_single_page(page, pfn, zone_idx, nid);
> -
> - /*
> - * Mark page reserved as it will need to wait for onlining
> - * phase for it to be fully associated with a zone.
> - *
> - * We can use the non-atomic __set_bit operation for setting
> - * the flag as we are still initializing the pages.
> - */
> - __SetPageReserved(page);
> -
> - /*
> - * ZONE_DEVICE pages union ->lru with a ->pgmap back
> - * pointer and hmm_data. It is a bug if a ZONE_DEVICE
> - * page is ever freed or placed on a driver-private list.
> - */
> - page->pgmap = pgmap;
> - page->hmm_data = 0;
> -
> - /*
> - * Mark the block movable so that blocks are reserved for
> - * movable at startup. This will force kernel allocations
> - * to reserve their blocks rather than leaking throughout
> - * the address space during boot when many long-lived
> - * kernel allocations are made.
> - *
> - * bitmap is created for zone's valid pfn range. but memmap
> - * can be created for invalid pages (for alignment)
> - * check here not to call set_pageblock_migratetype() against
> - * pfn out of zone.
> - *
> - * Please note that MEMMAP_HOTPLUG path doesn't clear memmap
> - * because this is done early in sparse_add_one_section
> - */
> - if (!(pfn & (pageblock_nr_pages - 1))) {
> - set_pageblock_migratetype(page, MIGRATE_MOVABLE);
> - cond_resched();
> - }
> - }
> + __memmap_init_hotplug(size, nid, zone_idx, start_pfn, pgmap);
>
> pr_info("%s initialised, %lu pages in %ums\n", dev_name(pgmap->dev),
> size, jiffies_to_msecs(jiffies - start));
>
--
Michal Hocko
SUSE Labs
On 10/17/2018 12:30 AM, Mike Rapoport wrote:
> On Tue, Oct 16, 2018 at 03:01:11PM -0400, Pavel Tatashin wrote:
>>
>>
>> On 10/15/18 4:26 PM, Alexander Duyck wrote:
>>> This change makes it so that we use the same approach that was already in
>>> use on Sparc on all the architectures that support a 64b long.
>>>
>>> This is mostly motivated by the fact that 8 to 10 store/move instructions
>>> are likely always going to be faster than having to call into a function
>>> that is not specialized for handling page init.
>>>
>>> An added advantage to doing it this way is that the compiler can get away
>>> with combining writes in the __init_single_page call. As a result the
>>> memset call will be reduced to only about 4 write operations, or at least
>>> that is what I am seeing with GCC 6.2 as the flags, LRU pointers, and
>>> count/mapcount seem to be cancelling out at least 4 of the 8 assignments on
>>> my system.
>>>
>>> One change I had to make to the function was to reduce the minimum page
>>> size to 56 to support some powerpc64 configurations.
>>>
>>> Signed-off-by: Alexander Duyck <[email protected]>
>>
>>
>> I have tested on Broadcom's Stingray CPU with 48G RAM:
>> __init_single_page() takes 19.30ns / 64-byte struct page
>> With the change it takes 17.33ns / 64-byte struct page
>
> I gave it a run on an OpenPower (S812LC 8348-21C) with a Power8 processor
> and 128G of RAM. My results for 64-byte struct page were:
>
> before: 4.6788ns
> after: 4.5882ns
>
> My two cents :)
Thanks. I will add this and Pavel's data to the patch description.
- Alex
On 10/17/2018 1:47 AM, Michal Hocko wrote:
> On Mon 15-10-18 13:26:56, Alexander Duyck wrote:
>> This change makes it so that we use the same approach that was already in
>> use on Sparc on all the architectures that support a 64b long.
>>
>> This is mostly motivated by the fact that 8 to 10 store/move instructions
>> are likely always going to be faster than having to call into a function
>> that is not specialized for handling page init.
>>
>> An added advantage to doing it this way is that the compiler can get away
>> with combining writes in the __init_single_page call. As a result the
>> memset call will be reduced to only about 4 write operations, or at least
>> that is what I am seeing with GCC 6.2 as the flags, LRU pointers, and
>> count/mapcount seem to be cancelling out at least 4 of the 8 assignments on
>> my system.
>>
>> One change I had to make to the function was to reduce the minimum page
>> size to 56 to support some powerpc64 configurations.
>
> This really begs for numbers. I do not mind the change itself with some
> minor comments below.
>
> [...]
>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>> index bb0de406f8e7..ec6e57a0c14e 100644
>> --- a/include/linux/mm.h
>> +++ b/include/linux/mm.h
>> @@ -102,8 +102,42 @@ static inline void set_max_mapnr(unsigned long limit) { }
>> * zeroing by defining this macro in <asm/pgtable.h>.
>> */
>> #ifndef mm_zero_struct_page
>
> Do we still need this ifdef? I guess we can wait for an arch which
> doesn't like this change and then add the override. I would rather go
> simple if possible.
We probably don't, but as soon as I remove it somebody will probably
complain somewhere. I guess I could drop it for now and see if anybody
screams. Adding it back should be pretty straightforward since it would
only be 2 lines.
>> +#if BITS_PER_LONG == 64
>> +/* This function must be updated when the size of struct page grows above 80
>> + * or reduces below 64. The idea that compiler optimizes out switch()
>> + * statement, and only leaves move/store instructions
>> + */
>> +#define mm_zero_struct_page(pp) __mm_zero_struct_page(pp)
>> +static inline void __mm_zero_struct_page(struct page *page)
>> +{
>> + unsigned long *_pp = (void *)page;
>> +
>> + /* Check that struct page is either 56, 64, 72, or 80 bytes */
>> + BUILD_BUG_ON(sizeof(struct page) & 7);
>> + BUILD_BUG_ON(sizeof(struct page) < 56);
>> + BUILD_BUG_ON(sizeof(struct page) > 80);
>> +
>> + switch (sizeof(struct page)) {
>> + case 80:
>> + _pp[9] = 0; /* fallthrough */
>> + case 72:
>> + _pp[8] = 0; /* fallthrough */
>> + default:
>> + _pp[7] = 0; /* fallthrough */
>> + case 56:
>> + _pp[6] = 0;
>> + _pp[5] = 0;
>> + _pp[4] = 0;
>> + _pp[3] = 0;
>> + _pp[2] = 0;
>> + _pp[1] = 0;
>> + _pp[0] = 0;
>> + }
>
> This just hit my eyes. I have to confess I have never seen default: placed
> anywhere but last in a switch. Can we have case 64 instead, or does gcc
> complain? I would be surprised if it did, given the set of BUILD_BUG_ONs.
I can probably just replace the "default:" with "case 64:". I think I
have seen other switch statements in the kernel without a default so
odds are it should be okay.
On 10/17/18 11:07 AM, Alexander Duyck wrote:
> On 10/17/2018 1:47 AM, Michal Hocko wrote:
>> On Mon 15-10-18 13:26:56, Alexander Duyck wrote:
>>> This change makes it so that we use the same approach that was
>>> already in
>>> use on Sparc on all the architectures that support a 64b long.
>>>
>>> This is mostly motivated by the fact that 8 to 10 store/move
>>> instructions
>>> are likely always going to be faster than having to call into a function
>>> that is not specialized for handling page init.
>>>
>>> An added advantage to doing it this way is that the compiler can get
>>> away
>>> with combining writes in the __init_single_page call. As a result the
>>> memset call will be reduced to only about 4 write operations, or at
>>> least
>>> that is what I am seeing with GCC 6.2 as the flags, LRU pointers, and
>>> count/mapcount seem to be cancelling out at least 4 of the 8
>>> assignments on
>>> my system.
>>>
>>> One change I had to make to the function was to reduce the minimum page
>>> size to 56 to support some powerpc64 configurations.
>>
>> This really begs for numbers. I do not mind the change itself with some
>> minor comments below.
>>
>> [...]
>>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>>> index bb0de406f8e7..ec6e57a0c14e 100644
>>> --- a/include/linux/mm.h
>>> +++ b/include/linux/mm.h
>>> @@ -102,8 +102,42 @@ static inline void set_max_mapnr(unsigned long
>>> limit) { }
>>> * zeroing by defining this macro in <asm/pgtable.h>.
>>> */
>>> #ifndef mm_zero_struct_page
>>
>> Do we still need this ifdef? I guess we can wait for an arch which
>> doesn't like this change and then add the override. I would rather go
>> simple if possible.
>
> We probably don't, but as soon as I remove it somebody will probably
> complain somewhere. I guess I could drop it for now and see if anybody
> screams. Adding it back should be pretty straightforward since it would
> only be 2 lines.
>
>>> +#if BITS_PER_LONG == 64
>>> +/* This function must be updated when the size of struct page grows
>>> above 80
>>> + * or reduces below 64. The idea that compiler optimizes out switch()
>>> + * statement, and only leaves move/store instructions
>>> + */
>>> +#define mm_zero_struct_page(pp) __mm_zero_struct_page(pp)
>>> +static inline void __mm_zero_struct_page(struct page *page)
>>> +{
>>> + unsigned long *_pp = (void *)page;
>>> +
>>> + /* Check that struct page is either 56, 64, 72, or 80 bytes */
>>> + BUILD_BUG_ON(sizeof(struct page) & 7);
>>> + BUILD_BUG_ON(sizeof(struct page) < 56);
>>> + BUILD_BUG_ON(sizeof(struct page) > 80);
>>> +
>>> + switch (sizeof(struct page)) {
>>> + case 80:
>>> + _pp[9] = 0; /* fallthrough */
>>> + case 72:
>>> + _pp[8] = 0; /* fallthrough */
>>> + default:
>>> + _pp[7] = 0; /* fallthrough */
>>> + case 56:
>>> + _pp[6] = 0;
>>> + _pp[5] = 0;
>>> + _pp[4] = 0;
>>> + _pp[3] = 0;
>>> + _pp[2] = 0;
>>> + _pp[1] = 0;
>>> + _pp[0] = 0;
>>> + }
>>
>> This just hit my eyes. I have to confess I have never seen default:
>> placed anywhere but last in a switch. Can we have case 64 instead, or
>> does gcc complain? I would be surprised if it did, given the set of
>> BUILD_BUG_ONs.
It was me. C does not really care where default is placed; I was trying
to keep the stores sequential for better cache locality, but "case 64"
should be OK, and even better for this purpose.
Pavel
>
> I can probably just replace the "default:" with "case 64:". I think I
> have seen other switch statements in the kernel without a default so
> odds are it should be okay.
>
On 10/17/2018 2:11 AM, Michal Hocko wrote:
> On Mon 15-10-18 13:27:09, Alexander Duyck wrote:
>> This patch introduces a new iterator for_each_free_mem_pfn_range_in_zone.
>>
>> This iterator will take care of making sure a given memory range provided
>> is in fact contained within a zone. It takes are of all the bounds checking
>> we were doing in deferred_grow_zone, and deferred_init_memmap. In addition
>> it should help to speed up the search a bit by iterating until the end of a
>> range is greater than the start of the zone pfn range, and will exit
>> completely if the start is beyond the end of the zone.
>>
>> This patch adds yet another iterator called
>> for_each_free_mem_range_in_zone_from and then uses it to support
>> initializing and freeing pages in groups no larger than MAX_ORDER_NR_PAGES.
>> By doing this we can greatly improve the cache locality of the pages while
>> we do several loops over them in the init and freeing process.
>>
>> We are able to tighten the loops as a result since we only really need the
>> checks for first_init_pfn in our first iteration and after that we can
>> assume that all future values will be greater than this. So I have added a
>> function called deferred_init_mem_pfn_range_in_zone that primes the
>> iterators and if it fails we can just exit.
>
> Numbers please.
>
> Besides that, this adds a lot of code and I am not convinced the result
> is so much better to justify that.
If I recall correctly, most of the gains are due to better cache locality.
Instead of running through all of memory once for init and once for freeing,
this patch has us doing it in MAX_ORDER_NR_PAGES sized chunks. So the
advantage is that we can keep most of the pages structs in the L2 cache
at least on x86 processors to avoid having to go to memory as much.
I'll run performance numbers per patch today and try to make certain I
have a line mentioning the delta for each patch in the v4 patch set.
- Alex
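The chunking Alex describes can be sketched in plain C; the chunk size and
the init/free stand-ins below are illustrative assumptions, not the kernel
helpers:

```c
#include <stddef.h>

#define DEMO_CHUNK 4	/* stand-in for MAX_ORDER_NR_PAGES */

/* Stand-in for struct page initialization. */
static void demo_init(int *vals, size_t start, size_t end)
{
	for (size_t i = start; i < end; i++)
		vals[i] = 1;		/* "initialized" */
}

/* Stand-in for freeing pages to the buddy allocator. */
static void demo_free(int *vals, size_t start, size_t end)
{
	for (size_t i = start; i < end; i++)
		vals[i] = 2;		/* "freed" */
}

/*
 * Two passes per chunk rather than two passes over the whole range, so
 * the free pass finds the chunk still warm in cache.
 */
static size_t demo_deferred_init(int *vals, size_t nr)
{
	size_t done = 0;

	for (size_t spfn = 0; spfn < nr; spfn += DEMO_CHUNK) {
		size_t epfn = spfn + DEMO_CHUNK < nr ? spfn + DEMO_CHUNK : nr;

		demo_init(vals, spfn, epfn);
		demo_free(vals, spfn, epfn);
		done += epfn - spfn;
	}
	return done;
}
```

With DEMO_CHUNK = 4 a ten-element range is handled as 4 + 4 + 2; the end
state is the same as two full sweeps, only the access order differs.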
On 10/17/2018 2:18 AM, Michal Hocko wrote:
> On Mon 15-10-18 13:27:16, Alexander Duyck wrote:
>> This patch is going through and combining the bits in memmap_init_zone and
>> memmap_init_zone_device that are related to hotplug into a single function
>> called __memmap_init_hotplug.
>>
>> I also took the opportunity to integrate __init_single_page's functionality
>> into this function. In doing so I can get rid of some of the redundancy
>> such as the LRU pointers versus the pgmap.
>
> This patch depends on [1] which I've had some concerns about. It adds
> more code on top. I am still not convinced this is the right direction.
>
> [1] http://lkml.kernel.org/r/[email protected]
I wouldn't say this patch depends on it. The fact is I could easily make
this patch work with the code as it was before that change. This was
actually something I had started on first and then decided to move until
later since I knew this was more invasive than the
memmap_init_zone_device function was.
With that said I am also wondering if a possible solution to the
complaints you had would be to look at just exporting the
__init_pageblock function later and moving the call to
memmap_init_zone_device out to the memremap or hotplug code when Dan
gets the refactoring for HMM and memremap all sorted out.
>>
>> Signed-off-by: Alexander Duyck <[email protected]>
>> ---
>> mm/page_alloc.c | 232 ++++++++++++++++++++++++++++++++++++++-----------------
>> 1 file changed, 159 insertions(+), 73 deletions(-)
>>
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> index 20e9eb35d75d..92375e7867ba 100644
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -1192,6 +1192,94 @@ static void __meminit __init_single_page(struct page *page, unsigned long pfn,
>> #endif
>> }
>>
>> +static void __meminit __init_pageblock(unsigned long start_pfn,
>> + unsigned long nr_pages,
>> + unsigned long zone, int nid,
>> + struct dev_pagemap *pgmap,
>> + bool is_reserved)
>> +{
>> + unsigned long nr_pgmask = pageblock_nr_pages - 1;
>> + struct page *start_page = pfn_to_page(start_pfn);
>> + unsigned long pfn = start_pfn + nr_pages - 1;
>> +#ifdef WANT_PAGE_VIRTUAL
>> + bool is_highmem = is_highmem_idx(zone);
>> +#endif
>> + struct page *page;
>> +
>> + /*
>> + * Enforce the following requirements:
>> + * size > 0
>> + * size < pageblock_nr_pages
>> + * start_pfn -> pfn does not cross pageblock_nr_pages boundary
>> + */
>> + VM_BUG_ON(((start_pfn ^ pfn) | (nr_pages - 1)) > nr_pgmask);
>> +
>> + /*
>> + * Work from highest page to lowest, this way we will still be
>> + * warm in the cache when we call set_pageblock_migratetype
>> + * below.
>> + *
>> + * The loop is based around the page pointer as the main index
>> + * instead of the pfn because pfn is not used inside the loop if
>> + * the section number is not in page flags and WANT_PAGE_VIRTUAL
>> + * is not defined.
>> + */
>> + for (page = start_page + nr_pages; page-- != start_page; pfn--) {
>> + mm_zero_struct_page(page);
>> +
>> + /*
>> + * We use the start_pfn instead of pfn in the set_page_links
>> + * call because of the fact that the pfn number is used to
>> + * get the section_nr and this function should not be
>> + * spanning more than a single section.
>> + */
>> + set_page_links(page, zone, nid, start_pfn);
>> + init_page_count(page);
>> + page_mapcount_reset(page);
>> + page_cpupid_reset_last(page);
>> +
>> + /*
>> + * We can use the non-atomic __set_bit operation for setting
>> + * the flag as we are still initializing the pages.
>> + */
>> + if (is_reserved)
>> + __SetPageReserved(page);
>> +
>> + /*
>> + * ZONE_DEVICE pages union ->lru with a ->pgmap back
>> + * pointer and hmm_data. It is a bug if a ZONE_DEVICE
>> + * page is ever freed or placed on a driver-private list.
>> + */
>> + page->pgmap = pgmap;
>> + if (!pgmap)
>> + INIT_LIST_HEAD(&page->lru);
>> +
>> +#ifdef WANT_PAGE_VIRTUAL
>> + /* The shift won't overflow because ZONE_NORMAL is below 4G. */
>> + if (!is_highmem)
>> + set_page_address(page, __va(pfn << PAGE_SHIFT));
>> +#endif
>> + }
>> +
>> + /*
>> + * Mark the block movable so that blocks are reserved for
>> + * movable at startup. This will force kernel allocations
>> + * to reserve their blocks rather than leaking throughout
>> + * the address space during boot when many long-lived
>> + * kernel allocations are made.
>> + *
>> + * bitmap is created for zone's valid pfn range. but memmap
>> + * can be created for invalid pages (for alignment)
>> + * check here not to call set_pageblock_migratetype() against
>> + * pfn out of zone.
>> + *
>> + * Please note that MEMMAP_HOTPLUG path doesn't clear memmap
>> + * because this is done early in sparse_add_one_section
>> + */
>> + if (!(start_pfn & nr_pgmask))
>> + set_pageblock_migratetype(start_page, MIGRATE_MOVABLE);
>> +}
>> +
>> #ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
>> static void __meminit init_reserved_page(unsigned long pfn)
>> {
>> @@ -5513,6 +5601,36 @@ void __ref build_all_zonelists(pg_data_t *pgdat)
>> return false;
>> }
>>
>> +static void __meminit __memmap_init_hotplug(unsigned long size, int nid,
>> + unsigned long zone,
>> + unsigned long start_pfn,
>> + struct dev_pagemap *pgmap)
>> +{
>> + unsigned long pfn = start_pfn + size;
>> +
>> + while (pfn != start_pfn) {
>> + unsigned long stride = pfn;
>> +
>> + pfn = max(ALIGN_DOWN(pfn - 1, pageblock_nr_pages), start_pfn);
>> + stride -= pfn;
>> +
>> + /*
>> + * The last argument of __init_pageblock is a boolean
>> + * value indicating if the page will be marked as reserved.
>> + *
>> + * Mark page reserved as it will need to wait for onlining
>> + * phase for it to be fully associated with a zone.
>> + *
>> + * Under certain circumstances ZONE_DEVICE pages may not
>> + * need to be marked as reserved, however there is still
>> + * code that is depending on this being set for now.
>> + */
>> + __init_pageblock(pfn, stride, zone, nid, pgmap, true);
>> +
>> + cond_resched();
>> + }
>> +}
>> +
>> /*
>> * Initially all pages are reserved - free ones are freed
>> * up by memblock_free_all() once the early boot process is
>> @@ -5523,51 +5641,61 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
>> struct vmem_altmap *altmap)
>> {
>> unsigned long pfn, end_pfn = start_pfn + size;
>> - struct page *page;
>>
>> if (highest_memmap_pfn < end_pfn - 1)
>> highest_memmap_pfn = end_pfn - 1;
>>
>> + if (context == MEMMAP_HOTPLUG) {
>> #ifdef CONFIG_ZONE_DEVICE
>> - /*
>> - * Honor reservation requested by the driver for this ZONE_DEVICE
>> - * memory. We limit the total number of pages to initialize to just
>> - * those that might contain the memory mapping. We will defer the
>> - * ZONE_DEVICE page initialization until after we have released
>> - * the hotplug lock.
>> - */
>> - if (zone == ZONE_DEVICE) {
>> - if (!altmap)
>> - return;
>> + /*
>> + * Honor reservation requested by the driver for this
>> + * ZONE_DEVICE memory. We limit the total number of pages to
>> + * initialize to just those that might contain the memory
>> + * mapping. We will defer the ZONE_DEVICE page initialization
>> + * until after we have released the hotplug lock.
>> + */
>> + if (zone == ZONE_DEVICE) {
>> + if (!altmap)
>> + return;
>> +
>> + if (start_pfn == altmap->base_pfn)
>> + start_pfn += altmap->reserve;
>> + end_pfn = altmap->base_pfn +
>> + vmem_altmap_offset(altmap);
>> + }
>> +#endif
>> + /*
>> + * For these ZONE_DEVICE pages we don't need to record the
>> + * pgmap as they should represent only those pages used to
>> + * store the memory map. The actual ZONE_DEVICE pages will
>> + * be initialized later.
>> + */
>> + __memmap_init_hotplug(end_pfn - start_pfn, nid, zone,
>> + start_pfn, NULL);
>>
>> - if (start_pfn == altmap->base_pfn)
>> - start_pfn += altmap->reserve;
>> - end_pfn = altmap->base_pfn + vmem_altmap_offset(altmap);
>> + return;
>> }
>> -#endif
>>
>> for (pfn = start_pfn; pfn < end_pfn; pfn++) {
>> + struct page *page;
>> +
>> /*
>> * There can be holes in boot-time mem_map[]s handed to this
>> * function. They do not exist on hotplugged memory.
>> */
>> - if (context == MEMMAP_EARLY) {
>> - if (!early_pfn_valid(pfn)) {
>> - pfn = next_valid_pfn(pfn) - 1;
>> - continue;
>> - }
>> - if (!early_pfn_in_nid(pfn, nid))
>> - continue;
>> - if (overlap_memmap_init(zone, &pfn))
>> - continue;
>> - if (defer_init(nid, pfn, end_pfn))
>> - break;
>> + if (!early_pfn_valid(pfn)) {
>> + pfn = next_valid_pfn(pfn) - 1;
>> + continue;
>> }
>> + if (!early_pfn_in_nid(pfn, nid))
>> + continue;
>> + if (overlap_memmap_init(zone, &pfn))
>> + continue;
>> + if (defer_init(nid, pfn, end_pfn))
>> + break;
>>
>> page = pfn_to_page(pfn);
>> __init_single_page(page, pfn, zone, nid);
>> - if (context == MEMMAP_HOTPLUG)
>> - __SetPageReserved(page);
>>
>> /*
>> * Mark the block movable so that blocks are reserved for
>> @@ -5594,14 +5722,12 @@ void __ref memmap_init_zone_device(struct zone *zone,
>> unsigned long size,
>> struct dev_pagemap *pgmap)
>> {
>> - unsigned long pfn, end_pfn = start_pfn + size;
>> struct pglist_data *pgdat = zone->zone_pgdat;
>> unsigned long zone_idx = zone_idx(zone);
>> unsigned long start = jiffies;
>> int nid = pgdat->node_id;
>>
>> - if (WARN_ON_ONCE(!pgmap || !is_dev_zone(zone)))
>> - return;
>> + VM_BUG_ON(!is_dev_zone(zone));
>>
>> /*
>> * The call to memmap_init_zone should have already taken care
>> @@ -5610,53 +5736,13 @@ void __ref memmap_init_zone_device(struct zone *zone,
>> */
>> if (pgmap->altmap_valid) {
>> struct vmem_altmap *altmap = &pgmap->altmap;
>> + unsigned long end_pfn = start_pfn + size;
>>
>> start_pfn = altmap->base_pfn + vmem_altmap_offset(altmap);
>> size = end_pfn - start_pfn;
>> }
>>
>> - for (pfn = start_pfn; pfn < end_pfn; pfn++) {
>> - struct page *page = pfn_to_page(pfn);
>> -
>> - __init_single_page(page, pfn, zone_idx, nid);
>> -
>> - /*
>> - * Mark page reserved as it will need to wait for onlining
>> - * phase for it to be fully associated with a zone.
>> - *
>> - * We can use the non-atomic __set_bit operation for setting
>> - * the flag as we are still initializing the pages.
>> - */
>> - __SetPageReserved(page);
>> -
>> - /*
>> - * ZONE_DEVICE pages union ->lru with a ->pgmap back
>> - * pointer and hmm_data. It is a bug if a ZONE_DEVICE
>> - * page is ever freed or placed on a driver-private list.
>> - */
>> - page->pgmap = pgmap;
>> - page->hmm_data = 0;
>> -
>> - /*
>> - * Mark the block movable so that blocks are reserved for
>> - * movable at startup. This will force kernel allocations
>> - * to reserve their blocks rather than leaking throughout
>> - * the address space during boot when many long-lived
>> - * kernel allocations are made.
>> - *
>> - * bitmap is created for zone's valid pfn range. but memmap
>> - * can be created for invalid pages (for alignment)
>> - * check here not to call set_pageblock_migratetype() against
>> - * pfn out of zone.
>> - *
>> - * Please note that MEMMAP_HOTPLUG path doesn't clear memmap
>> - * because this is done early in sparse_add_one_section
>> - */
>> - if (!(pfn & (pageblock_nr_pages - 1))) {
>> - set_pageblock_migratetype(page, MIGRATE_MOVABLE);
>> - cond_resched();
>> - }
>> - }
>> + __memmap_init_hotplug(size, nid, zone_idx, start_pfn, pgmap);
>>
>> pr_info("%s initialised, %lu pages in %ums\n", dev_name(pgmap->dev),
>> size, jiffies_to_msecs(jiffies - start));
>>
>
From: Pavel Tatashin
> Sent: 17 October 2018 16:12
> On 10/17/18 11:07 AM, Alexander Duyck wrote:
> > On 10/17/2018 1:47 AM, Michal Hocko wrote:
> >> On Mon 15-10-18 13:26:56, Alexander Duyck wrote:
> >>> This change makes it so that we use the same approach that was
> >>> already in
> >>> use on Sparc on all the architectures that support a 64b long.
> >>>
> >>> This is mostly motivated by the fact that 8 to 10 store/move
> >>> instructions
> >>> are likely always going to be faster than having to call into a function
> >>> that is not specialized for handling page init.
> >>>
> >>> An added advantage to doing it this way is that the compiler can get
> >>> away
> >>> with combining writes in the __init_single_page call. As a result the
> >>> memset call will be reduced to only about 4 write operations, or at
> >>> least
> >>> that is what I am seeing with GCC 6.2 as the flags, LRU pointers, and
> >>> count/mapcount seem to be cancelling out at least 4 of the 8
> >>> assignments on
> >>> my system.
> >>>
> >>> One change I had to make to the function was to reduce the minimum page
> >>> size to 56 to support some powerpc64 configurations.
> >>
> >> This really begs for numbers. I do not mind the change itself with some
> >> minor comments below.
> >>
> >> [...]
> >>> diff --git a/include/linux/mm.h b/include/linux/mm.h
> >>> index bb0de406f8e7..ec6e57a0c14e 100644
> >>> --- a/include/linux/mm.h
> >>> +++ b/include/linux/mm.h
> >>> @@ -102,8 +102,42 @@ static inline void set_max_mapnr(unsigned long
> >>> limit) { }
> >>> * zeroing by defining this macro in <asm/pgtable.h>.
> >>> */
> >>> #ifndef mm_zero_struct_page
> >>
> >> Do we still need this ifdef? I guess we can wait for an arch which
> >> doesn't like this change and then add the override. I would rather go
> >> simple if possible.
> >
> > We probably don't, but as soon as I remove it somebody will probably
> > complain somewhere. I guess I could drop it for now and see if anybody
> > screams. Adding it back should be pretty straight forward since it would
> > only be 2 lines.
> >
> >>> +#if BITS_PER_LONG == 64
> >>> +/* This function must be updated when the size of struct page grows
> >>> above 80
> >>> + * or shrinks below 64. The idea is that the compiler optimizes out
> >>> + * the switch() statement, leaving only move/store instructions
> >>> + */
> >>> +#define mm_zero_struct_page(pp) __mm_zero_struct_page(pp)
> >>> +static inline void __mm_zero_struct_page(struct page *page)
> >>> +{
> >>> + unsigned long *_pp = (void *)page;
> >>> +
> >>> + /* Check that struct page is either 56, 64, 72, or 80 bytes */
> >>> + BUILD_BUG_ON(sizeof(struct page) & 7);
> >>> + BUILD_BUG_ON(sizeof(struct page) < 56);
> >>> + BUILD_BUG_ON(sizeof(struct page) > 80);
> >>> +
> >>> + switch (sizeof(struct page)) {
> >>> + case 80:
> >>> + _pp[9] = 0; /* fallthrough */
> >>> + case 72:
> >>> + _pp[8] = 0; /* fallthrough */
> >>> + default:
> >>> + _pp[7] = 0; /* fallthrough */
> >>> + case 56:
> >>> + _pp[6] = 0;
> >>> + _pp[5] = 0;
> >>> + _pp[4] = 0;
> >>> + _pp[3] = 0;
> >>> + _pp[2] = 0;
> >>> + _pp[1] = 0;
> >>> + _pp[0] = 0;
> >>> + }
> >>
> >> This just hit my eyes. I have to confess I have never seen default: to
> >> be not the last one in the switch. Can we have case 64 instead or does
> >> gcc
> >> complain? I would be surprised with the set of BUILD_BUG_ONs.
>
> It was me, C does not really care where default is placed, I was trying
> to keep stores sequential for better cache locality, but "case 64"
> should be OK, and even better for this purpose.
You'd need to put memory barriers between them to force sequential stores.
I'm also surprised that gcc doesn't inline the memset().
David
On 10/17/2018 8:40 AM, David Laight wrote:
> From: Pavel Tatashin
>> Sent: 17 October 2018 16:12
>> On 10/17/18 11:07 AM, Alexander Duyck wrote:
>>> On 10/17/2018 1:47 AM, Michal Hocko wrote:
>>>> On Mon 15-10-18 13:26:56, Alexander Duyck wrote:
>>>>> This change makes it so that we use the same approach that was
>>>>> already in
>>>>> use on Sparc on all the architectures that support a 64b long.
>>>>>
>>>>> This is mostly motivated by the fact that 8 to 10 store/move
>>>>> instructions
>>>>> are likely always going to be faster than having to call into a function
>>>>> that is not specialized for handling page init.
>>>>>
>>>>> An added advantage to doing it this way is that the compiler can get
>>>>> away
>>>>> with combining writes in the __init_single_page call. As a result the
>>>>> memset call will be reduced to only about 4 write operations, or at
>>>>> least
>>>>> that is what I am seeing with GCC 6.2 as the flags, LRU pointers, and
>>>>> count/mapcount seem to be cancelling out at least 4 of the 8
>>>>> assignments on
>>>>> my system.
>>>>>
>>>>> One change I had to make to the function was to reduce the minimum page
>>>>> size to 56 to support some powerpc64 configurations.
>>>>
>>>> This really begs for numbers. I do not mind the change itself with some
>>>> minor comments below.
>>>>
>>>> [...]
>>>>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>>>>> index bb0de406f8e7..ec6e57a0c14e 100644
>>>>> --- a/include/linux/mm.h
>>>>> +++ b/include/linux/mm.h
>>>>> @@ -102,8 +102,42 @@ static inline void set_max_mapnr(unsigned long
>>>>> limit) { }
>>>>> * zeroing by defining this macro in <asm/pgtable.h>.
>>>>> */
>>>>> #ifndef mm_zero_struct_page
>>>>
>>>> Do we still need this ifdef? I guess we can wait for an arch which
>>>> doesn't like this change and then add the override. I would rather go
>>>> simple if possible.
>>>
>>> We probably don't, but as soon as I remove it somebody will probably
>>> complain somewhere. I guess I could drop it for now and see if anybody
>>> screams. Adding it back should be pretty straight forward since it would
>>> only be 2 lines.
>>>
>>>>> +#if BITS_PER_LONG == 64
>>>>> +/* This function must be updated when the size of struct page grows
>>>>> above 80
>>>>> + * or shrinks below 64. The idea is that the compiler optimizes out
>>>>> + * the switch() statement, leaving only move/store instructions
>>>>> + */
>>>>> +#define mm_zero_struct_page(pp) __mm_zero_struct_page(pp)
>>>>> +static inline void __mm_zero_struct_page(struct page *page)
>>>>> +{
>>>>> + unsigned long *_pp = (void *)page;
>>>>> +
>>>>> + /* Check that struct page is either 56, 64, 72, or 80 bytes */
>>>>> + BUILD_BUG_ON(sizeof(struct page) & 7);
>>>>> + BUILD_BUG_ON(sizeof(struct page) < 56);
>>>>> + BUILD_BUG_ON(sizeof(struct page) > 80);
>>>>> +
>>>>> + switch (sizeof(struct page)) {
>>>>> + case 80:
>>>>> + _pp[9] = 0; /* fallthrough */
>>>>> + case 72:
>>>>> + _pp[8] = 0; /* fallthrough */
>>>>> + default:
>>>>> + _pp[7] = 0; /* fallthrough */
>>>>> + case 56:
>>>>> + _pp[6] = 0;
>>>>> + _pp[5] = 0;
>>>>> + _pp[4] = 0;
>>>>> + _pp[3] = 0;
>>>>> + _pp[2] = 0;
>>>>> + _pp[1] = 0;
>>>>> + _pp[0] = 0;
>>>>> + }
>>>>
>>>> This just hit my eyes. I have to confess I have never seen default: to
>>>> be not the last one in the switch. Can we have case 64 instead or does
>>>> gcc
>>>> complain? I would be surprised with the set of BUILD_BUG_ONs.
>>
>> It was me, C does not really care where default is placed, I was trying
>> to keep stores sequential for better cache locality, but "case 64"
>> should be OK, and even better for this purpose.
>
> You'd need to put memory barriers between them to force sequential stores.
> I'm also surprised that gcc doesn't inline the memset().
>
> David
We don't need them to be sequential. The general idea is we have to
fill a given amount of space with 0s. After that we have some calls that
are initializing the memory that doesn't have to be zero. Ideally the
compiler is smart enough to realize that since we don't have barriers
and we are performing assignments after the assignment of zero, it can
just combine the two writes into one and drop the zero assignment.
- Alex
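A hedged sketch of that point: with no barriers between them, a zero store
followed by an unconditional field assignment leaves the zero as a dead
store the compiler may drop; the observable result is the same either way.
The struct and field names below are invented for illustration:

```c
#include <string.h>

/* Invented stand-in for struct page; not the kernel layout. */
struct demo_page {
	unsigned long flags;
	unsigned long count;
	unsigned long pad[6];
};

/*
 * Zero first, then assign. With no intervening barriers the compiler is
 * free to elide the zeroing of fields that are unconditionally written
 * again below; semantically the result is identical either way.
 */
static void demo_init_single_page(struct demo_page *p)
{
	memset(p, 0, sizeof(*p));	/* the mm_zero_struct_page step */
	p->flags = 0x5;			/* later init overwrites the zero */
	p->count = 1;
}
```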
On Wed 17-10-18 08:07:06, Alexander Duyck wrote:
> On 10/17/2018 1:47 AM, Michal Hocko wrote:
> > On Mon 15-10-18 13:26:56, Alexander Duyck wrote:
[...]
> > > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > > index bb0de406f8e7..ec6e57a0c14e 100644
> > > --- a/include/linux/mm.h
> > > +++ b/include/linux/mm.h
> > > @@ -102,8 +102,42 @@ static inline void set_max_mapnr(unsigned long limit) { }
> > > * zeroing by defining this macro in <asm/pgtable.h>.
> > > */
> > > #ifndef mm_zero_struct_page
> >
> > Do we still need this ifdef? I guess we can wait for an arch which
> > doesn't like this change and then add the override. I would rather go
> > simple if possible.
>
> We probably don't, but as soon as I remove it somebody will probably
> complain somewhere. I guess I could drop it for now and see if anybody
> screams. Adding it back should be pretty straight forward since it would
> only be 2 lines.
Let's make it simpler please. If somebody really cares then this is
trivial to add later.
> > > +#if BITS_PER_LONG == 64
> > > +/* This function must be updated when the size of struct page grows above 80
> > > + * or shrinks below 64. The idea is that the compiler optimizes out
> > > + * the switch() statement, leaving only move/store instructions
> > > + */
> > > +#define mm_zero_struct_page(pp) __mm_zero_struct_page(pp)
> > > +static inline void __mm_zero_struct_page(struct page *page)
> > > +{
> > > + unsigned long *_pp = (void *)page;
> > > +
> > > + /* Check that struct page is either 56, 64, 72, or 80 bytes */
> > > + BUILD_BUG_ON(sizeof(struct page) & 7);
> > > + BUILD_BUG_ON(sizeof(struct page) < 56);
> > > + BUILD_BUG_ON(sizeof(struct page) > 80);
> > > +
> > > + switch (sizeof(struct page)) {
> > > + case 80:
> > > + _pp[9] = 0; /* fallthrough */
> > > + case 72:
> > > + _pp[8] = 0; /* fallthrough */
> > > + default:
> > > + _pp[7] = 0; /* fallthrough */
> > > + case 56:
> > > + _pp[6] = 0;
> > > + _pp[5] = 0;
> > > + _pp[4] = 0;
> > > + _pp[3] = 0;
> > > + _pp[2] = 0;
> > > + _pp[1] = 0;
> > > + _pp[0] = 0;
> > > + }
> >
> > This just hit my eyes. I have to confess I have never seen default: to
> > be not the last one in the switch. Can we have case 64 instead or does gcc
> > complain? I would be surprised with the set of BUILD_BUG_ONs.
>
> I can probably just replace the "default:" with "case 64:". I think I have
> seen other switch statements in the kernel without a default so odds are it
> should be okay.
Please do, there shouldn't be any need to obfuscate the code more than
necessary.
--
Michal Hocko
SUSE Labs
On Mon, Oct 15, 2018 at 01:27:09PM -0700, Alexander Duyck wrote:
> This patch introduces a new iterator for_each_free_mem_pfn_range_in_zone.
>
> This iterator will take care of making sure a given memory range provided
> is in fact contained within a zone. It takes care of all the bounds checking
> we were doing in deferred_grow_zone, and deferred_init_memmap. In addition
> it should help to speed up the search a bit by iterating until the end of a
> range is greater than the start of the zone pfn range, and will exit
> completely if the start is beyond the end of the zone.
>
> This patch adds yet another iterator called
> for_each_free_mem_range_in_zone_from and then uses it to support
> initializing and freeing pages in groups no larger than MAX_ORDER_NR_PAGES.
> By doing this we can greatly improve the cache locality of the pages while
> we do several loops over them in the init and freeing process.
>
> We are able to tighten the loops as a result since we only really need the
> checks for first_init_pfn in our first iteration and after that we can
> assume that all future values will be greater than this. So I have added a
> function called deferred_init_mem_pfn_range_in_zone that primes the
> iterators and if it fails we can just exit.
>
> Signed-off-by: Alexander Duyck <[email protected]>
> ---
> include/linux/memblock.h | 58 +++++++++++++++
> mm/memblock.c | 63 ++++++++++++++++
> mm/page_alloc.c | 176 ++++++++++++++++++++++++++++++++--------------
> 3 files changed, 242 insertions(+), 55 deletions(-)
>
> diff --git a/include/linux/memblock.h b/include/linux/memblock.h
> index aee299a6aa76..d62b95dba94e 100644
> --- a/include/linux/memblock.h
> +++ b/include/linux/memblock.h
> @@ -178,6 +178,25 @@ void __next_reserved_mem_region(u64 *idx, phys_addr_t *out_start,
> p_start, p_end, p_nid))
>
> /**
> + * for_each_mem_range - iterate through memblock areas from type_a and not
nit: for_each_mem_range_from
> + * included in type_b. Or just type_a if type_b is NULL.
> + * @i: u64 used as loop variable
> + * @type_a: ptr to memblock_type to iterate
> + * @type_b: ptr to memblock_type which excludes from the iteration
> + * @nid: node selector, %NUMA_NO_NODE for all nodes
> + * @flags: pick from blocks based on memory attributes
> + * @p_start: ptr to phys_addr_t for start address of the range, can be %NULL
> + * @p_end: ptr to phys_addr_t for end address of the range, can be %NULL
> + * @p_nid: ptr to int for nid of the range, can be %NULL
> + */
> +#define for_each_mem_range_from(i, type_a, type_b, nid, flags, \
> + p_start, p_end, p_nid) \
> + for (i = 0, __next_mem_range(&i, nid, flags, type_a, type_b, \
> + p_start, p_end, p_nid); \
> + i != (u64)ULLONG_MAX; \
> + __next_mem_range(&i, nid, flags, type_a, type_b, \
> + p_start, p_end, p_nid))
> +/**
> * for_each_mem_range_rev - reverse iterate through memblock areas from
> * type_a and not included in type_b. Or just type_a if type_b is NULL.
> * @i: u64 used as loop variable
> @@ -248,6 +267,45 @@ void __next_mem_pfn_range(int *idx, int nid, unsigned long *out_start_pfn,
> i >= 0; __next_mem_pfn_range(&i, nid, p_start, p_end, p_nid))
> #endif /* CONFIG_HAVE_MEMBLOCK_NODE_MAP */
>
> +#ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
> +void __next_mem_pfn_range_in_zone(u64 *idx, struct zone *zone,
> + unsigned long *out_spfn,
> + unsigned long *out_epfn);
> +/**
> + * for_each_free_mem_pfn_range_in_zone - iterate through zone specific free
> + * memblock areas
> + * @i: u64 used as loop variable
> + * @zone: zone in which all of the memory blocks reside
> + * @p_start: ptr to phys_addr_t for start address of the range, can be %NULL
> + * @p_end: ptr to phys_addr_t for end address of the range, can be %NULL
> + *
> + * Walks over free (memory && !reserved) areas of memblock in a specific
> + * zone. Available as soon as memblock is initialized.
> + */
> +#define for_each_free_mem_pfn_range_in_zone(i, zone, p_start, p_end) \
> + for (i = 0, \
> + __next_mem_pfn_range_in_zone(&i, zone, p_start, p_end); \
> + i != (u64)ULLONG_MAX; \
> + __next_mem_pfn_range_in_zone(&i, zone, p_start, p_end))
> +
> +/**
> + * for_each_free_mem_pfn_range_in_zone_from - iterate through zone specific
> + * free memblock areas from a given point
> + * @i: u64 used as loop variable
> + * @zone: zone in which all of the memory blocks reside
> + * @p_start: ptr to phys_addr_t for start address of the range, can be %NULL
> + * @p_end: ptr to phys_addr_t for end address of the range, can be %NULL
> + *
> + * Walks over free (memory && !reserved) areas of memblock in a specific
> + * zone, continuing from current position. Available as soon as memblock is
> + * initialized.
> + */
> +#define for_each_free_mem_pfn_range_in_zone_from(i, zone, p_start, p_end) \
> + for (; i != (u64)ULLONG_MAX; \
> + __next_mem_pfn_range_in_zone(&i, zone, p_start, p_end))
> +
> +#endif /* CONFIG_DEFERRED_STRUCT_PAGE_INIT */
> +
> /**
> * for_each_free_mem_range - iterate through free memblock areas
> * @i: u64 used as loop variable
> diff --git a/mm/memblock.c b/mm/memblock.c
> index 5fefc70253ee..dc6e28e7f869 100644
> --- a/mm/memblock.c
> +++ b/mm/memblock.c
> @@ -1239,6 +1239,69 @@ int __init_memblock memblock_set_node(phys_addr_t base, phys_addr_t size,
> return 0;
> }
> #endif /* CONFIG_HAVE_MEMBLOCK_NODE_MAP */
> +#ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
> +/**
> + * __next_mem_pfn_range_in_zone - iterator for for_each_*_range_in_zone()
> + *
> + * @idx: pointer to u64 loop variable
> + * @zone: zone in which all of the memory blocks reside
> + * @out_start: ptr to ulong for start pfn of the range, can be %NULL
> + * @out_end: ptr to ulong for end pfn of the range, can be %NULL
> + *
> + * This function is meant to be a zone/pfn specific wrapper for the
> + * for_each_mem_range type iterators. Specifically they are used in the
> + * deferred memory init routines and as such we were duplicating much of
> + * this logic throughout the code. So instead of having it in multiple
> + * locations it seemed like it would make more sense to centralize this to
> + * one new iterator that does everything they need.
> + */
> +void __init_memblock
> +__next_mem_pfn_range_in_zone(u64 *idx, struct zone *zone,
> + unsigned long *out_spfn, unsigned long *out_epfn)
> +{
> + int zone_nid = zone_to_nid(zone);
> + phys_addr_t spa, epa;
> + int nid;
> +
> + __next_mem_range(idx, zone_nid, MEMBLOCK_NONE,
> + &memblock.memory, &memblock.reserved,
> + &spa, &epa, &nid);
> +
> + while (*idx != ULLONG_MAX) {
> + unsigned long epfn = PFN_DOWN(epa);
> + unsigned long spfn = PFN_UP(spa);
> +
> + /*
> + * Verify the end is at least past the start of the zone and
> + * that we have at least one PFN to initialize.
> + */
> + if (zone->zone_start_pfn < epfn && spfn < epfn) {
> + /* if we went too far just stop searching */
> + if (zone_end_pfn(zone) <= spfn)
> + break;
> +
> + if (out_spfn)
> + *out_spfn = max(zone->zone_start_pfn, spfn);
> + if (out_epfn)
> + *out_epfn = min(zone_end_pfn(zone), epfn);
> +
> + return;
> + }
> +
> + __next_mem_range(idx, zone_nid, MEMBLOCK_NONE,
> + &memblock.memory, &memblock.reserved,
> + &spa, &epa, &nid);
> + }
> +
> + /* signal end of iteration */
> + *idx = ULLONG_MAX;
> + if (out_spfn)
> + *out_spfn = ULONG_MAX;
> + if (out_epfn)
> + *out_epfn = 0;
> +}
> +
> +#endif /* CONFIG_DEFERRED_STRUCT_PAGE_INIT */
>
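For reference, the clamping logic above can be exercised in isolation: skip
ranges that end at or before the zone start, stop once a range starts at or
past the zone end, and clamp survivors to the zone bounds. Everything below
is an illustrative stand-in for the memblock walk, not the kernel code:

```c
#include <stdbool.h>

struct demo_range { unsigned long spfn, epfn; };

/*
 * Clamp one candidate range against [zone_start, zone_end).
 * Returns true and writes the clamped range if it intersects the zone,
 * false if the caller should skip it (or stop, when *past_zone is set).
 */
static bool demo_clamp_to_zone(const struct demo_range *r,
			       unsigned long zone_start, unsigned long zone_end,
			       unsigned long *out_spfn, unsigned long *out_epfn,
			       bool *past_zone)
{
	*past_zone = false;

	/* range ends at/before the zone start, or is empty: skip it */
	if (r->epfn <= zone_start || r->spfn >= r->epfn)
		return false;

	/* range starts at/past the zone end: nothing further can match */
	if (r->spfn >= zone_end) {
		*past_zone = true;
		return false;
	}

	*out_spfn = r->spfn > zone_start ? r->spfn : zone_start;
	*out_epfn = r->epfn < zone_end ? r->epfn : zone_end;
	return true;
}
```

This mirrors the `zone->zone_start_pfn < epfn && spfn < epfn` test and the
`zone_end_pfn(zone) <= spfn` early exit in the hunk above.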
> #ifdef CONFIG_HAVE_MEMBLOCK_PFN_VALID
> unsigned long __init_memblock memblock_next_valid_pfn(unsigned long pfn)
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index a766a15fad81..20e9eb35d75d 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1512,19 +1512,103 @@ static unsigned long __init deferred_init_pages(struct zone *zone,
> return (nr_pages);
> }
>
> +/*
> + * This function is meant to pre-load the iterator for the zone init.
> + * Specifically it walks through the ranges until we are caught up to the
> + * first_init_pfn value and exits there. If we never encounter the value we
> + * return false indicating there are no valid ranges left.
> + */
> +static bool __init
> +deferred_init_mem_pfn_range_in_zone(u64 *i, struct zone *zone,
> + unsigned long *spfn, unsigned long *epfn,
> + unsigned long first_init_pfn)
> +{
> + u64 j;
> +
> + /*
> + * Start out by walking through the ranges in this zone that have
> + * already been initialized. We don't need to do anything with them
> + * so we just need to flush them out of the system.
> + */
> + for_each_free_mem_pfn_range_in_zone(j, zone, spfn, epfn) {
> + if (*epfn <= first_init_pfn)
> + continue;
> + if (*spfn < first_init_pfn)
> + *spfn = first_init_pfn;
> + *i = j;
> + return true;
> + }
> +
> + return false;
> +}
> +
> +/*
> + * Initialize and free pages. We do it in two loops: first we initialize
> + * struct page, than free to buddy allocator, because while we are
> + * freeing pages we can access pages that are ahead (computing buddy
> + * page in __free_one_page()).
> + *
> + * In order to try and keep some memory in the cache we have the loop
> + * broken along max page order boundaries. This way we will not cause
> + * any issues with the buddy page computation.
> + */
> +static unsigned long __init
> +deferred_init_maxorder(u64 *i, struct zone *zone, unsigned long *start_pfn,
> + unsigned long *end_pfn)
> +{
> + unsigned long mo_pfn = ALIGN(*start_pfn + 1, MAX_ORDER_NR_PAGES);
> + unsigned long spfn = *start_pfn, epfn = *end_pfn;
> + unsigned long nr_pages = 0;
> + u64 j = *i;
> +
> + /* First we loop through and initialize the page values */
> + for_each_free_mem_pfn_range_in_zone_from(j, zone, &spfn, &epfn) {
> + unsigned long t;
> +
> + if (mo_pfn <= spfn)
> + break;
> +
> + t = min(mo_pfn, epfn);
> + nr_pages += deferred_init_pages(zone, spfn, t);
> +
> + if (mo_pfn <= epfn)
> + break;
> + }
> +
> + /* Reset values and now loop through freeing pages as needed */
> + j = *i;
> +
> + for_each_free_mem_pfn_range_in_zone_from(j, zone, start_pfn, end_pfn) {
> + unsigned long t;
> +
> + if (mo_pfn <= *start_pfn)
> + break;
> +
> + t = min(mo_pfn, *end_pfn);
> + deferred_free_pages(*start_pfn, t);
> + *start_pfn = t;
> +
> + if (mo_pfn < *end_pfn)
> + break;
> + }
> +
> + /* Store our current values to be reused on the next iteration */
> + *i = j;
> +
> + return nr_pages;
> +}
> +
> /* Initialise remaining memory on a node */
> static int __init deferred_init_memmap(void *data)
> {
> pg_data_t *pgdat = data;
> - int nid = pgdat->node_id;
> + const struct cpumask *cpumask = cpumask_of_node(pgdat->node_id);
> + unsigned long spfn = 0, epfn = 0, nr_pages = 0;
> + unsigned long first_init_pfn, flags;
> unsigned long start = jiffies;
> - unsigned long nr_pages = 0;
> - unsigned long spfn, epfn, first_init_pfn, flags;
> - phys_addr_t spa, epa;
> - int zid;
> struct zone *zone;
> - const struct cpumask *cpumask = cpumask_of_node(pgdat->node_id);
> u64 i;
> + int zid;
>
> /* Bind memory initialisation thread to a local node if possible */
> if (!cpumask_empty(cpumask))
> @@ -1549,31 +1633,30 @@ static int __init deferred_init_memmap(void *data)
> if (first_init_pfn < zone_end_pfn(zone))
> break;
> }
> - first_init_pfn = max(zone->zone_start_pfn, first_init_pfn);
> +
> + /* If the zone is empty somebody else may have cleared out the zone */
> + if (!deferred_init_mem_pfn_range_in_zone(&i, zone, &spfn, &epfn,
> + first_init_pfn)) {
> + pgdat_resize_unlock(pgdat, &flags);
> + pgdat_init_report_one_done();
> + return 0;
> + }
>
> /*
> - * Initialize and free pages. We do it in two loops: first we initialize
> - * struct page, than free to buddy allocator, because while we are
> - * freeing pages we can access pages that are ahead (computing buddy
> - * page in __free_one_page()).
> + * Initialize and free pages in MAX_ORDER sized increments so
> + * that we can avoid introducing any issues with the buddy
> + * allocator.
> */
> - for_each_free_mem_range(i, nid, MEMBLOCK_NONE, &spa, &epa, NULL) {
> - spfn = max_t(unsigned long, first_init_pfn, PFN_UP(spa));
> - epfn = min_t(unsigned long, zone_end_pfn(zone), PFN_DOWN(epa));
> - nr_pages += deferred_init_pages(zone, spfn, epfn);
> - }
> - for_each_free_mem_range(i, nid, MEMBLOCK_NONE, &spa, &epa, NULL) {
> - spfn = max_t(unsigned long, first_init_pfn, PFN_UP(spa));
> - epfn = min_t(unsigned long, zone_end_pfn(zone), PFN_DOWN(epa));
> - deferred_free_pages(spfn, epfn);
> - }
> + while (spfn < epfn)
> + nr_pages += deferred_init_maxorder(&i, zone, &spfn, &epfn);
> +
> pgdat_resize_unlock(pgdat, &flags);
>
> /* Sanity check that the next zone really is unpopulated */
> WARN_ON(++zid < MAX_NR_ZONES && populated_zone(++zone));
>
> - pr_info("node %d initialised, %lu pages in %ums\n", nid, nr_pages,
> - jiffies_to_msecs(jiffies - start));
> + pr_info("node %d initialised, %lu pages in %ums\n",
> + pgdat->node_id, nr_pages, jiffies_to_msecs(jiffies - start));
>
> pgdat_init_report_one_done();
> return 0;
> @@ -1604,14 +1687,11 @@ static int __init deferred_init_memmap(void *data)
> static noinline bool __init
> deferred_grow_zone(struct zone *zone, unsigned int order)
> {
> - int zid = zone_idx(zone);
> - int nid = zone_to_nid(zone);
> - pg_data_t *pgdat = NODE_DATA(nid);
> unsigned long nr_pages_needed = ALIGN(1 << order, PAGES_PER_SECTION);
> - unsigned long nr_pages = 0;
> - unsigned long first_init_pfn, spfn, epfn, t, flags;
> + pg_data_t *pgdat = zone->zone_pgdat;
> unsigned long first_deferred_pfn = pgdat->first_deferred_pfn;
> - phys_addr_t spa, epa;
> + unsigned long spfn, epfn, flags;
> + unsigned long nr_pages = 0;
> u64 i;
>
> /* Only the last zone may have deferred pages */
> @@ -1640,37 +1720,23 @@ static int __init deferred_init_memmap(void *data)
> return true;
> }
>
> - first_init_pfn = max(zone->zone_start_pfn, first_deferred_pfn);
> -
> - if (first_init_pfn >= pgdat_end_pfn(pgdat)) {
> + /* If the zone is empty somebody else may have cleared out the zone */
> + if (!deferred_init_mem_pfn_range_in_zone(&i, zone, &spfn, &epfn,
> + first_deferred_pfn)) {
> pgdat_resize_unlock(pgdat, &flags);
> - return false;
> + return true;
> }
>
> - for_each_free_mem_range(i, nid, MEMBLOCK_NONE, &spa, &epa, NULL) {
> - spfn = max_t(unsigned long, first_init_pfn, PFN_UP(spa));
> - epfn = min_t(unsigned long, zone_end_pfn(zone), PFN_DOWN(epa));
> -
> - while (spfn < epfn && nr_pages < nr_pages_needed) {
> - t = ALIGN(spfn + PAGES_PER_SECTION, PAGES_PER_SECTION);
> - first_deferred_pfn = min(t, epfn);
> - nr_pages += deferred_init_pages(zone, spfn,
> - first_deferred_pfn);
> - spfn = first_deferred_pfn;
> - }
> -
> - if (nr_pages >= nr_pages_needed)
> - break;
> + /*
> + * Initialize and free pages in MAX_ORDER sized increments so
> + * that we can avoid introducing any issues with the buddy
> + * allocator.
> + */
> + while (spfn < epfn && nr_pages < nr_pages_needed) {
> + nr_pages += deferred_init_maxorder(&i, zone, &spfn, &epfn);
> + first_deferred_pfn = spfn;
> }
>
> - for_each_free_mem_range(i, nid, MEMBLOCK_NONE, &spa, &epa, NULL) {
> - spfn = max_t(unsigned long, first_init_pfn, PFN_UP(spa));
> - epfn = min_t(unsigned long, first_deferred_pfn, PFN_DOWN(epa));
> - deferred_free_pages(spfn, epfn);
> -
> - if (first_deferred_pfn == epfn)
> - break;
> - }
> pgdat->first_deferred_pfn = first_deferred_pfn;
> pgdat_resize_unlock(pgdat, &flags);
>
>
--
Sincerely yours,
Mike.
On 10/17/18 12:31 PM, Alexander Duyck wrote:
> On 10/17/2018 8:40 AM, David Laight wrote:
>> From: Pavel Tatashin
>>> Sent: 17 October 2018 16:12
>>> On 10/17/18 11:07 AM, Alexander Duyck wrote:
>>>> On 10/17/2018 1:47 AM, Michal Hocko wrote:
>>>>> On Mon 15-10-18 13:26:56, Alexander Duyck wrote:
>>>>>> This change makes it so that we use the same approach that was
>>>>>> already in
> >>>>>> use on Sparc on all the architectures that support a 64b long.
>>>>>>
>>>>>> This is mostly motivated by the fact that 8 to 10 store/move
>>>>>> instructions
>>>>>> are likely always going to be faster than having to call into a
>>>>>> function
>>>>>> that is not specialized for handling page init.
>>>>>>
>>>>>> An added advantage to doing it this way is that the compiler can get
>>>>>> away
>>>>>> with combining writes in the __init_single_page call. As a result the
>>>>>> memset call will be reduced to only about 4 write operations, or at
>>>>>> least
> >>>>>> that is what I am seeing with GCC 6.2 as the flags, LRU pointers, and
>>>>>> count/mapcount seem to be cancelling out at least 4 of the 8
>>>>>> assignments on
>>>>>> my system.
>>>>>>
>>>>>> One change I had to make to the function was to reduce the minimum
>>>>>> page
>>>>>> size to 56 to support some powerpc64 configurations.
>>>>>
>>>>> This really begs for numbers. I do not mind the change itself with
>>>>> some
>>>>> minor comments below.
>>>>>
>>>>> [...]
>>>>>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>>>>>> index bb0de406f8e7..ec6e57a0c14e 100644
>>>>>> --- a/include/linux/mm.h
>>>>>> +++ b/include/linux/mm.h
>>>>>> @@ -102,8 +102,42 @@ static inline void set_max_mapnr(unsigned long
>>>>>> limit) { }
>>>>>> * zeroing by defining this macro in <asm/pgtable.h>.
>>>>>> */
>>>>>> #ifndef mm_zero_struct_page
>>>>>
>>>>> Do we still need this ifdef? I guess we can wait for an arch which
>>>>> doesn't like this change and then add the override. I would rather go
>>>>> simple if possible.
>>>>
>>>> We probably don't, but as soon as I remove it somebody will probably
>>>> complain somewhere. I guess I could drop it for now and see if anybody
>>>> screams. Adding it back should be pretty straight forward since it
>>>> would
>>>> only be 2 lines.
>>>>
>>>>>> +#if BITS_PER_LONG == 64
>>>>>> +/* This function must be updated when the size of struct page grows
>>>>>> above 80
>>>>>> + * or reduces below 64. The idea that compiler optimizes out
>>>>>> switch()
>>>>>> + * statement, and only leaves move/store instructions
>>>>>> + */
>>>>>> +#define mm_zero_struct_page(pp) __mm_zero_struct_page(pp)
>>>>>> +static inline void __mm_zero_struct_page(struct page *page)
>>>>>> +{
>>>>>> + unsigned long *_pp = (void *)page;
>>>>>> +
>>>>>> + /* Check that struct page is either 56, 64, 72, or 80 bytes */
>>>>>> + BUILD_BUG_ON(sizeof(struct page) & 7);
>>>>>> + BUILD_BUG_ON(sizeof(struct page) < 56);
>>>>>> + BUILD_BUG_ON(sizeof(struct page) > 80);
>>>>>> +
>>>>>> + switch (sizeof(struct page)) {
>>>>>> + case 80:
>>>>>> + _pp[9] = 0; /* fallthrough */
>>>>>> + case 72:
>>>>>> + _pp[8] = 0; /* fallthrough */
>>>>>> + default:
>>>>>> + _pp[7] = 0; /* fallthrough */
>>>>>> + case 56:
>>>>>> + _pp[6] = 0;
>>>>>> + _pp[5] = 0;
>>>>>> + _pp[4] = 0;
>>>>>> + _pp[3] = 0;
>>>>>> + _pp[2] = 0;
>>>>>> + _pp[1] = 0;
>>>>>> + _pp[0] = 0;
>>>>>> + }
>>>>>
>>>>> This just hit my eyes. I have to confess I have never seen default: to
>>>>> be not the last one in the switch. Can we have case 64 instead or does
>>>>> gcc
>>>>> complain? I would be surprised with the set of BUILD_BUG_ONs.
>>>
>>> It was me, C does not really care where default is placed, I was trying
>>> to keep stores sequential for better cache locality, but "case 64"
>>> should be OK, and even better for this purpose.
>>
>> You'd need to put memory barriers between them to force sequential
>> stores.
>> I'm also surprised that gcc doesn't inline the memset().
I meant sequential only as a hint; there is no reason for them to be
strictly sequential. The barrier is one of the reasons why memset() is
slower compared to having these stores here, as most memset()
implementations include one. As Alex said, the compiler will most likely
drop some unnecessary stores anyway because of the inlines in
__init_single_page().
Pavel
>>
>> David
>
> We don't need them to be sequential. The general idea is we have to
> fill a given amount of space with 0s. After that we have some calls that
> are initializing the memory that doesn't have to be zero. Ideally the
> compiler is smart enough to realize that since we don't have barriers
> and we are performing assignments after the assignment of zero it can
> just combine the two writes into one and drop the zero assignment.
>
> - Alex
On Wed 17-10-18 08:26:20, Alexander Duyck wrote:
[...]
> With that said I am also wondering if a possible solution to the complaints
> you had would be to look at just exporting the __init_pageblock function
> later and moving the call to memmap_init_zone_device out to the memremap or
> hotplug code when Dan gets the refactoring for HMM and memremap all sorted
> out.
Why can we not simply provide a constructor for each page by the caller
if there are special requirements? We currently have the altmap to do
the struct page allocation, but nothing really prevents making it more
generic and controlling both allocation and initialization to suit a
specific use case. I really do not want to make special cases here and
there.
--
Michal Hocko
SUSE Labs
On Wed, 2018-10-24 at 14:36 +0200, Michal Hocko wrote:
> On Wed 17-10-18 08:26:20, Alexander Duyck wrote:
> [...]
> > With that said I am also wondering if a possible solution to the
> > complaints
> > you had would be to look at just exporting the __init_pageblock
> > function
> > later and moving the call to memmap_init_zone_device out to the
> > memremap or
> > hotplug code when Dan gets the refactoring for HMM and memremap all
> > sorted
> > out.
>
> Why cannot we simply provide a constructor for each page by the
> caller if there are special requirements? we currently have alt_map
> to do struct page allocation but nothing really prevents to make it
> more generic and control both allocation and initialization whatever
> suits a specific use case. I really do not want to make special cases
> here and there.
The advantage to the current __init_pageblock function is that we end
up constructing everything we are going to write outside of the main
loop and then are focused only on init.
If we end up having to call a per-page constructor, that is going to
slow things down. For example, just the reserved bit setting was adding
something like 4 seconds to the init time for the system. By allowing
it to be configured as part of the flags outside of the loop I was
able to avoid taking that performance hit.
- Alex
On Wed 24-10-18 08:08:41, Alexander Duyck wrote:
> On Wed, 2018-10-24 at 14:36 +0200, Michal Hocko wrote:
> > On Wed 17-10-18 08:26:20, Alexander Duyck wrote:
> > [...]
> > > With that said I am also wondering if a possible solution to the
> > > complaints
> > > you had would be to look at just exporting the __init_pageblock
> > > function
> > > later and moving the call to memmap_init_zone_device out to the
> > > memremap or
> > > hotplug code when Dan gets the refactoring for HMM and memremap all
> > > sorted
> > > out.
> >
> > Why cannot we simply provide a constructor for each page by the
> > caller if there are special requirements? we currently have alt_map
> > to do struct page allocation but nothing really prevents to make it
> > more generic and control both allocation and initialization whatever
> > suits a specific use case. I really do not want to make special cases
> > here and there.
>
> The advantage to the current __init_pageblock function is that we end
> up constructing everything we are going to write outside of the main
> loop and then are focused only on init.
But we do really want move_pfn_range_to_zone to provide a usable pfn
range without any additional tweaks. If there are potential
optimizations to be done there then let's do it but please do not try to
micro optimize to the point that the interface doesn't make any sense
anymore.
--
Michal Hocko
SUSE Labs
On Wed, 2018-10-24 at 17:27 +0200, Michal Hocko wrote:
> On Wed 24-10-18 08:08:41, Alexander Duyck wrote:
> > On Wed, 2018-10-24 at 14:36 +0200, Michal Hocko wrote:
> > > On Wed 17-10-18 08:26:20, Alexander Duyck wrote:
> > > [...]
> > > > With that said I am also wondering if a possible solution to
> > > > the complaints you had would be to look at just exporting the
> > > > __init_pageblock function later and moving the call to
> > > > memmap_init_zone_device out to the memremap or hotplug code
> > > > when Dan gets the refactoring for HMM and memremap all sorted
> > > > out.
> > >
> > > Why cannot we simply provide a constructor for each page by the
> > > caller if there are special requirements? we currently have
> > > alt_map
> > > to do struct page allocation but nothing really prevents to make
> > > it
> > > more generic and control both allocation and initialization
> > > whatever
> > > suits a specific use case. I really do not want to make special cases
> > > here and there.
> >
> > The advantage to the current __init_pageblock function is that we
> > end up constructing everything we are going to write outside of the
> > main loop and then are focused only on init.
>
> But we do really want move_pfn_range_to_zone to provide a usable pfn
> range without any additional tweaks. If there are potential
> optimizations to be done there then let's do it but please do not try
> to micro optimize to the point that the interface doesn't make any
> sense anymore.
The actual difference between the two setups is not all that great.
From the sound of things the ultimate difference between the
ZONE_DEVICE pages and regular pages is the pgmap and whether we want
the reserved bit set or not.
What I am providing with __init_pageblock at this point is a function
that is flexible enough for us to be able to do either one and then
just expose a different front end on it for the specific type of page
we have to initialize. It works for regular hotplug, ZONE_DEVICE, and
deferred memory initialization. The way I view it is that this function
is a high performance multi-tasker, not something that is micro-
optimized for any one specific function.
Thanks.
- Alex
On Wed 24-10-18 10:35:09, Alexander Duyck wrote:
> On Wed, 2018-10-24 at 17:27 +0200, Michal Hocko wrote:
> > On Wed 24-10-18 08:08:41, Alexander Duyck wrote:
> > > On Wed, 2018-10-24 at 14:36 +0200, Michal Hocko wrote:
> > > > On Wed 17-10-18 08:26:20, Alexander Duyck wrote:
> > > > [...]
> > > > > With that said I am also wondering if a possible solution to
> > > > > the complaints you had would be to look at just exporting the
> > > > > __init_pageblock function later and moving the call to
> > > > > memmap_init_zone_device out to the memremap or hotplug code
> > > > > when Dan gets the refactoring for HMM and memremap all sorted
> > > > > out.
> > > >
> > > > Why cannot we simply provide a constructor for each page by the
> > > > caller if there are special requirements? we currently have
> > > > alt_map
> > > > to do struct page allocation but nothing really prevents to make
> > > > it
> > > > more generic and control both allocation and initialization
> > > > whatever
> > > > suits a specific use case. I really do not want to make special cases
> > > > here and there.
> > >
> > > The advantage to the current __init_pageblock function is that we
> > > end up constructing everything we are going to write outside of the
> > > main loop and then are focused only on init.
> >
> > But we do really want move_pfn_range_to_zone to provide a usable pfn
> > range without any additional tweaks. If there are potential
> > optimizations to be done there then let's do it but please do not try
> > to micro optimize to the point that the interface doesn't make any
> > sense anymore.
>
> The actual difference between the two setups is not all that great.
> From the sound of things the ultimate difference between the
> ZONE_DEVICE pages and regular pages is the pgmap and if we want the
> reserved bit set or not.
So once again, why do you need PageReserved in the first place? Also,
it seems that pgmap can hold both the struct page allocator and the
constructor.
> What I am providing with __init_pageblock at this point is a function
> that is flexible enough for us to be able to do either one and then
> just expose a different front end on it for the specific type of page
> we have to initialize. It works for regular hotplug, ZONE_DEVICE, and
> deferred memory initialization. The way I view it is that this function
> is a high performance multi-tasker, not something that is micro-
> optimized for any one specific function.
All that I am arguing is that all of this should live inside the core
hotplug code. If you can plumb it in there without the hacks I have
seen earlier then I do not mind, but let's try to keep the core
infrastructure as clean as possible.
--
Michal Hocko
SUSE Labs