This is v4 post. V3 can be found here:
https://lkml.org/lkml/2018/2/27/928
V1 can be found here:
https://www.spinics.net/lists/linux-mm/msg144486.html
In sparse_init(), two temporary pointer arrays, usemap_map and map_map,
are allocated with the size of NR_MEM_SECTIONS. They are used to store
each memory section's usemap and mem map if it is marked as present. In
5-level paging mode this will cost 512M of memory, even though they are
released at the end of sparse_init(). Systems with little memory, such as
a kdump kernel which usually has only about 256M, will fail to boot
because of the allocation failure if CONFIG_X86_5LEVEL=y.
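As a rough sketch of where the 512M comes from (assuming the usual x86_64
values SECTION_SIZE_BITS = 27, MAX_PHYSMEM_BITS = 52 for 5-level and 46 for
4-level paging, and 8-byte pointers; these constants are my assumption, not
spelled out in this posting):

    NR_MEM_SECTIONS = 1 << (MAX_PHYSMEM_BITS - SECTION_SIZE_BITS)

    5-level: 1 << (52 - 27) = 32M sections
             32M entries * 8 bytes per pointer = 256M per array
             usemap_map + map_map              = 512M total

    4-level: 1 << (46 - 27) = 512K sections
             512K entries * 8 bytes per pointer = 4M per array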
This patchset optimizes the memmap allocation code so that usemap_map
and map_map are only allocated with the size of nr_present_sections. This
makes the kdump kernel boot up with a normal crashkernel='' setting when
CONFIG_X86_5LEVEL=y.
Change log:
v4->v5:
Improve patch 3/4 log according to Dave's suggestion.
Correct the mistaken copy&paste that turned 'nr_consumed_maps' into
'alloc_usemap_and_memmap', which was pointed out by Dave in the
patch 4/4 code comment.
Otherwise, no code change in this version.
v3->v4:
Improve according to Dave's three concerns which are in patch 0004:
Rename variable 'idx_present' to 'nr_consumed_maps', which is used to
index the memmap and usemap of present sections.
Add a check in case 'nr_consumed_maps' goes beyond nr_present_sections.
Add a code comment above the final for_each_present_section_nr() to
explain why 'nr_consumed_maps' needs to be increased in each iteration,
whether or not 'ms->section_mem_map' needs to be cleared.
v2->v3:
Change nr_present_sections to __initdata and add a code comment,
according to Andrew's suggestion.
Change the local variable 'i' to 'idx_present', which loops over the
present sections, and improve the code. These were suggested by
Dave and Pankaj.
Add a new patch 0003 which adds a new parameter 'data_unit_size' to
alloc_usemap_and_memmap(), inside which we need to update 'data' to
make it point at the new position. However, its type 'void *' can't
give us the info needed to do that, so the unit size has to be passed
in. Change the code in patch 0004 accordingly. This is a bug fix found
when testing memory deployed on multiple nodes.
v1->v2:
Split out the addition of nr_present_sections as a separate patch for
easier reviewing.
Rewrite the patch log according to Dave's suggestion.
Fix a code bug in patch 0002 reported by the test robot.
Baoquan He (4):
mm/sparse: Add a static variable nr_present_sections
mm/sparsemem: Defer the ms->section_mem_map clearing
mm/sparse: Add a new parameter 'data_unit_size' for
alloc_usemap_and_memmap
mm/sparse: Optimize memmap allocation during sparse_init()
mm/sparse-vmemmap.c | 6 ++---
mm/sparse.c | 72 +++++++++++++++++++++++++++++++++++++++++------------
2 files changed, 59 insertions(+), 19 deletions(-)
--
2.13.6
In sparse_init(), if CONFIG_SPARSEMEM_ALLOC_MEM_MAP_TOGETHER=y, the system
will allocate one contiguous memory chunk for the mem maps on one node and
populate the relevant page tables to map the memory sections one by one. If
populating a certain mem section fails, a warning is printed and its
->section_mem_map is cleared to cancel its marking as present.
As a result, the number of mem sections marked as present could decrease
during sparse_init() execution.
Here, defer the clearing of ms->section_mem_map when populating its page
tables fails until the last for_each_present_section_nr() loop. This is in
preparation for optimizing the mem map allocation later.
Signed-off-by: Baoquan He <[email protected]>
---
mm/sparse-vmemmap.c | 1 -
mm/sparse.c | 12 ++++++++----
2 files changed, 8 insertions(+), 5 deletions(-)
diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
index bd0276d5f66b..640e68f8324b 100644
--- a/mm/sparse-vmemmap.c
+++ b/mm/sparse-vmemmap.c
@@ -303,7 +303,6 @@ void __init sparse_mem_maps_populate_node(struct page **map_map,
ms = __nr_to_section(pnum);
pr_err("%s: sparsemem memory map backing failed some memory will not be available\n",
__func__);
- ms->section_mem_map = 0;
}
if (vmemmap_buf_start) {
diff --git a/mm/sparse.c b/mm/sparse.c
index 6314303130b0..71ad53da2cd1 100644
--- a/mm/sparse.c
+++ b/mm/sparse.c
@@ -451,7 +451,6 @@ void __init sparse_mem_maps_populate_node(struct page **map_map,
ms = __nr_to_section(pnum);
pr_err("%s: sparsemem memory map backing failed some memory will not be available\n",
__func__);
- ms->section_mem_map = 0;
}
}
#endif /* !CONFIG_SPARSEMEM_VMEMMAP */
@@ -479,7 +478,6 @@ static struct page __init *sparse_early_mem_map_alloc(unsigned long pnum)
pr_err("%s: sparsemem memory map backing failed some memory will not be available\n",
__func__);
- ms->section_mem_map = 0;
return NULL;
}
#endif
@@ -583,17 +581,23 @@ void __init sparse_init(void)
#endif
for_each_present_section_nr(0, pnum) {
+ struct mem_section *ms;
+ ms = __nr_to_section(pnum);
usemap = usemap_map[pnum];
- if (!usemap)
+ if (!usemap) {
+ ms->section_mem_map = 0;
continue;
+ }
#ifdef CONFIG_SPARSEMEM_ALLOC_MEM_MAP_TOGETHER
map = map_map[pnum];
#else
map = sparse_early_mem_map_alloc(pnum);
#endif
- if (!map)
+ if (!map) {
+ ms->section_mem_map = 0;
continue;
+ }
sparse_init_one_section(__nr_to_section(pnum), pnum, map,
usemap);
--
2.13.6
It's used to record how many memory sections are marked as present
during system boot-up, and will be used in a later patch.
Signed-off-by: Baoquan He <[email protected]>
---
mm/sparse.c | 7 +++++++
1 file changed, 7 insertions(+)
diff --git a/mm/sparse.c b/mm/sparse.c
index f13f2723950a..6314303130b0 100644
--- a/mm/sparse.c
+++ b/mm/sparse.c
@@ -200,6 +200,12 @@ static inline int next_present_section_nr(int section_nr)
(section_nr <= __highest_present_section_nr)); \
section_nr = next_present_section_nr(section_nr))
+/*
+ * Record how many memory sections are marked as present
+ * during system bootup.
+ */
+static int __initdata nr_present_sections;
+
/* Record a memory area against a node. */
void __init memory_present(int nid, unsigned long start, unsigned long end)
{
@@ -229,6 +235,7 @@ void __init memory_present(int nid, unsigned long start, unsigned long end)
ms->section_mem_map = sparse_encode_early_nid(nid) |
SECTION_IS_ONLINE;
section_mark_present(ms);
+ nr_present_sections++;
}
}
}
--
2.13.6
On 06/27/18 at 09:31am, Baoquan He wrote:
> This is v4 post. V3 can be found here:
> https://lkml.org/lkml/2018/2/27/928
Sorry, I forgot to update this part.
This is v5 post, v4 can be found here:
http://lkml.kernel.org/r/[email protected]
>
> V1 can be found here:
> https://www.spinics.net/lists/linux-mm/msg144486.html
>
> In sparse_init(), two temporary pointer arrays, usemap_map and map_map,
> are allocated with the size of NR_MEM_SECTIONS. They are used to store
> each memory section's usemap and mem map if it is marked as present. In
> 5-level paging mode this will cost 512M of memory, even though they are
> released at the end of sparse_init(). Systems with little memory, such as
> a kdump kernel which usually has only about 256M, will fail to boot
> because of the allocation failure if CONFIG_X86_5LEVEL=y.
>
> This patchset optimizes the memmap allocation code so that usemap_map
> and map_map are only allocated with the size of nr_present_sections. This
> makes the kdump kernel boot up with a normal crashkernel='' setting when
> CONFIG_X86_5LEVEL=y.
>
> Change log:
> v4->v5:
> Improve patch 3/4 log according to Dave's suggestion.
>
> Correct the mistaken copy&paste that turned 'nr_consumed_maps' into
> 'alloc_usemap_and_memmap', which was pointed out by Dave in the
> patch 4/4 code comment.
>
> Otherwise, no code change in this version.
> v3->v4:
> Improve according to Dave's three concerns which are in patch 0004:
>
> Rename variable 'idx_present' to 'nr_consumed_maps', which is used to
> index the memmap and usemap of present sections.
>
> Add a check in case 'nr_consumed_maps' goes beyond nr_present_sections.
>
> Add a code comment above the final for_each_present_section_nr() to
> explain why 'nr_consumed_maps' needs to be increased in each iteration,
> whether or not 'ms->section_mem_map' needs to be cleared.
>
> v2->v3:
> Change nr_present_sections to __initdata and add a code comment,
> according to Andrew's suggestion.
>
> Change the local variable 'i' to 'idx_present', which loops over the
> present sections, and improve the code. These were suggested by
> Dave and Pankaj.
>
> Add a new patch 0003 which adds a new parameter 'data_unit_size' to
> alloc_usemap_and_memmap(), inside which we need to update 'data' to
> make it point at the new position. However, its type 'void *' can't
> give us the info needed to do that, so the unit size has to be passed
> in. Change the code in patch 0004 accordingly. This is a bug fix found
> when testing memory deployed on multiple nodes.
>
> v1->v2:
> Split out the addition of nr_present_sections as a separate patch for
> easier reviewing.
>
> Rewrite the patch log according to Dave's suggestion.
>
> Fix a code bug in patch 0002 reported by the test robot.
>
> Baoquan He (4):
> mm/sparse: Add a static variable nr_present_sections
> mm/sparsemem: Defer the ms->section_mem_map clearing
> mm/sparse: Add a new parameter 'data_unit_size' for
> alloc_usemap_and_memmap
> mm/sparse: Optimize memmap allocation during sparse_init()
>
> mm/sparse-vmemmap.c | 6 ++---
> mm/sparse.c | 72 +++++++++++++++++++++++++++++++++++++++++------------
> 2 files changed, 59 insertions(+), 19 deletions(-)
>
> --
> 2.13.6
>
alloc_usemap_and_memmap() is passed a "void *" that points to
usemap_map or map_map. In the next patch we will change both map
allocations from taking 'NR_MEM_SECTIONS' as the length to taking
'nr_present_sections' as the length. After that, the passed-in 'void *'
needs to be updated as things get consumed. But it knows only the
quantity of objects consumed and not their type. This patch effectively
tells it enough about the type to let it update the pointer as
objects are consumed.
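A minimal sketch of the idea, with a hypothetical helper name that is not
part of the patch:

	/*
	 * Illustrative only: with just a 'void *' there is no element size
	 * to scale by, so the caller has to pass the unit size before the
	 * pointer can be stepped over the entries one node has consumed.
	 */
	static void __init skip_consumed(void **data, unsigned long map_count,
					 int data_unit_size)
	{
		/*
		 * data points into usemap_map[] or map_map[]; data_unit_size
		 * is sizeof(usemap_map[0]) or sizeof(map_map[0]).
		 */
		*data = (char *)*data + map_count * data_unit_size;
	}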
Signed-off-by: Baoquan He <[email protected]>
---
mm/sparse.c | 10 +++++++---
1 file changed, 7 insertions(+), 3 deletions(-)
diff --git a/mm/sparse.c b/mm/sparse.c
index 71ad53da2cd1..b2848cc6e32a 100644
--- a/mm/sparse.c
+++ b/mm/sparse.c
@@ -489,10 +489,12 @@ void __weak __meminit vmemmap_populate_print_last(void)
/**
* alloc_usemap_and_memmap - memory alloction for pageblock flags and vmemmap
* @map: usemap_map for pageblock flags or mmap_map for vmemmap
+ * @unit_size: size of map unit
*/
static void __init alloc_usemap_and_memmap(void (*alloc_func)
(void *, unsigned long, unsigned long,
- unsigned long, int), void *data)
+ unsigned long, int), void *data,
+ int data_unit_size)
{
unsigned long pnum;
unsigned long map_count;
@@ -569,7 +571,8 @@ void __init sparse_init(void)
if (!usemap_map)
panic("can not allocate usemap_map\n");
alloc_usemap_and_memmap(sparse_early_usemaps_alloc_node,
- (void *)usemap_map);
+ (void *)usemap_map,
+ sizeof(usemap_map[0]));
#ifdef CONFIG_SPARSEMEM_ALLOC_MEM_MAP_TOGETHER
size2 = sizeof(struct page *) * NR_MEM_SECTIONS;
@@ -577,7 +580,8 @@ void __init sparse_init(void)
if (!map_map)
panic("can not allocate map_map\n");
alloc_usemap_and_memmap(sparse_early_mem_maps_alloc_node,
- (void *)map_map);
+ (void *)map_map,
+ sizeof(map_map[0]));
#endif
for_each_present_section_nr(0, pnum) {
--
2.13.6
In sparse_init(), two temporary pointer arrays, usemap_map and map_map,
are allocated with the size of NR_MEM_SECTIONS. They are used to store
each memory section's usemap and mem map if it is marked as present. With
the help of these two arrays, a contiguous memory chunk is allocated for
the usemaps and memmaps of the memory sections on one node. This avoids
excessive memory fragmentation. In the diagram below, '1' indicates a
present memory section and '0' an absent one. The number 'n' could be much
smaller than NR_MEM_SECTIONS on most systems.
|1|1|1|1|0|0|0|0|1|1|0|0|...|1|0||1|0|...|1||0|1|...|0|
-------------------------------------------------------
0 1 2 3 4 5 i i+1 n-1 n
If populating the page tables to map one section's memmap fails, its
->section_mem_map will finally be cleared to indicate that it's not present.
After use, these two arrays are released at the end of sparse_init().
In 4-level paging mode each array costs 4M, which is negligible. But in
5-level paging they cost 256M each, 512M altogether. A kdump kernel
usually reserves very little memory, e.g. 256M. So even though they are
only temporarily allocated, this is still not acceptable.
In fact, there's no need to allocate them with the size of NR_MEM_SECTIONS.
Since the ->section_mem_map clearing has been deferred to the end, the
number of present memory sections stays the same during sparse_init()
until we finally clear out a memory section's ->section_mem_map if its
usemap or memmap was not correctly handled. Thus, whenever the
for_each_present_section_nr() loop is taken in the middle, the i-th
present memory section is always the same one.
Here, only allocate usemap_map and map_map with the size of
'nr_present_sections'. For the i-th present memory section, install its
usemap and memmap into usemap_map[i] and map_map[i] during allocation. Then,
in the last for_each_present_section_nr() loop which clears the failed
memory section's ->section_mem_map, fetch the usemap and memmap from the
usemap_map[] and map_map[] arrays and set them into mem_section[]
accordingly.
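A condensed sketch of the resulting indexing (illustrative only; the real
change, including the failure handling, is in the diff below):

	for_each_present_section_nr(0, pnum) {
		/*
		 * pnum can jump over holes, but nr_consumed_maps advances by
		 * one for every present section, so the i-th present section
		 * always pairs with usemap_map[i] and map_map[i].
		 */
		usemap = usemap_map[nr_consumed_maps];
		map    = map_map[nr_consumed_maps];
		nr_consumed_maps++;
		/* ... install map/usemap into __nr_to_section(pnum) ... */
	}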
Signed-off-by: Baoquan He <[email protected]>
Signed-off-by: Baoquan He <[email protected]>
---
mm/sparse-vmemmap.c | 5 +++--
mm/sparse.c | 43 ++++++++++++++++++++++++++++++++++---------
2 files changed, 37 insertions(+), 11 deletions(-)
diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
index 640e68f8324b..a98ec2fb6915 100644
--- a/mm/sparse-vmemmap.c
+++ b/mm/sparse-vmemmap.c
@@ -281,6 +281,7 @@ void __init sparse_mem_maps_populate_node(struct page **map_map,
unsigned long pnum;
unsigned long size = sizeof(struct page) * PAGES_PER_SECTION;
void *vmemmap_buf_start;
+ int nr_consumed_maps = 0;
size = ALIGN(size, PMD_SIZE);
vmemmap_buf_start = __earlyonly_bootmem_alloc(nodeid, size * map_count,
@@ -297,8 +298,8 @@ void __init sparse_mem_maps_populate_node(struct page **map_map,
if (!present_section_nr(pnum))
continue;
- map_map[pnum] = sparse_mem_map_populate(pnum, nodeid, NULL);
- if (map_map[pnum])
+ map_map[nr_consumed_maps] = sparse_mem_map_populate(pnum, nodeid, NULL);
+ if (map_map[nr_consumed_maps++])
continue;
ms = __nr_to_section(pnum);
pr_err("%s: sparsemem memory map backing failed some memory will not be available\n",
diff --git a/mm/sparse.c b/mm/sparse.c
index b2848cc6e32a..2eb8ee72e44d 100644
--- a/mm/sparse.c
+++ b/mm/sparse.c
@@ -386,6 +386,7 @@ static void __init sparse_early_usemaps_alloc_node(void *data,
unsigned long pnum;
unsigned long **usemap_map = (unsigned long **)data;
int size = usemap_size();
+ int nr_consumed_maps = 0;
usemap = sparse_early_usemaps_alloc_pgdat_section(NODE_DATA(nodeid),
size * usemap_count);
@@ -397,9 +398,10 @@ static void __init sparse_early_usemaps_alloc_node(void *data,
for (pnum = pnum_begin; pnum < pnum_end; pnum++) {
if (!present_section_nr(pnum))
continue;
- usemap_map[pnum] = usemap;
+ usemap_map[nr_consumed_maps] = usemap;
usemap += size;
- check_usemap_section_nr(nodeid, usemap_map[pnum]);
+ check_usemap_section_nr(nodeid, usemap_map[nr_consumed_maps]);
+ nr_consumed_maps++;
}
}
@@ -424,29 +426,33 @@ void __init sparse_mem_maps_populate_node(struct page **map_map,
void *map;
unsigned long pnum;
unsigned long size = sizeof(struct page) * PAGES_PER_SECTION;
+ int nr_consumed_maps;
size = PAGE_ALIGN(size);
map = memblock_virt_alloc_try_nid_raw(size * map_count,
PAGE_SIZE, __pa(MAX_DMA_ADDRESS),
BOOTMEM_ALLOC_ACCESSIBLE, nodeid);
if (map) {
+ nr_consumed_maps = 0;
for (pnum = pnum_begin; pnum < pnum_end; pnum++) {
if (!present_section_nr(pnum))
continue;
- map_map[pnum] = map;
+ map_map[nr_consumed_maps] = map;
map += size;
+ nr_consumed_maps++;
}
return;
}
/* fallback */
+ nr_consumed_maps = 0;
for (pnum = pnum_begin; pnum < pnum_end; pnum++) {
struct mem_section *ms;
if (!present_section_nr(pnum))
continue;
- map_map[pnum] = sparse_mem_map_populate(pnum, nodeid, NULL);
- if (map_map[pnum])
+ map_map[nr_consumed_maps] = sparse_mem_map_populate(pnum, nodeid, NULL);
+ if (map_map[nr_consumed_maps++])
continue;
ms = __nr_to_section(pnum);
pr_err("%s: sparsemem memory map backing failed some memory will not be available\n",
@@ -526,6 +532,7 @@ static void __init alloc_usemap_and_memmap(void (*alloc_func)
/* new start, update count etc*/
nodeid_begin = nodeid;
pnum_begin = pnum;
+ data += map_count * data_unit_size;
map_count = 1;
}
/* ok, last chunk */
@@ -544,6 +551,7 @@ void __init sparse_init(void)
unsigned long *usemap;
unsigned long **usemap_map;
int size;
+ int nr_consumed_maps = 0;
#ifdef CONFIG_SPARSEMEM_ALLOC_MEM_MAP_TOGETHER
int size2;
struct page **map_map;
@@ -566,7 +574,7 @@ void __init sparse_init(void)
* powerpc need to call sparse_init_one_section right after each
* sparse_early_mem_map_alloc, so allocate usemap_map at first.
*/
- size = sizeof(unsigned long *) * NR_MEM_SECTIONS;
+ size = sizeof(unsigned long *) * nr_present_sections;
usemap_map = memblock_virt_alloc(size, 0);
if (!usemap_map)
panic("can not allocate usemap_map\n");
@@ -575,7 +583,7 @@ void __init sparse_init(void)
sizeof(usemap_map[0]));
#ifdef CONFIG_SPARSEMEM_ALLOC_MEM_MAP_TOGETHER
- size2 = sizeof(struct page *) * NR_MEM_SECTIONS;
+ size2 = sizeof(struct page *) * nr_present_sections;
map_map = memblock_virt_alloc(size2, 0);
if (!map_map)
panic("can not allocate map_map\n");
@@ -584,27 +592,44 @@ void __init sparse_init(void)
sizeof(map_map[0]));
#endif
+ /* The numner of present sections stored in nr_present_sections
+ * are kept the same since mem sections are marked as present in
+ * memory_present(). In this for loop, we need check which sections
+ * failed to allocate memmap or usemap, then clear its
+ * ->section_mem_map accordingly. During this process, we need
+ * increase 'nr_consumed_maps' whether its allocation of memmap
+ * or usemap failed or not, so that after we handle the i-th
+ * memory section, can get memmap and usemap of (i+1)-th section
+ * correctly. */
for_each_present_section_nr(0, pnum) {
struct mem_section *ms;
+
+ if (nr_consumed_maps >= nr_present_sections) {
+ pr_err("nr_consumed_maps goes beyond nr_present_sections\n");
+ break;
+ }
ms = __nr_to_section(pnum);
- usemap = usemap_map[pnum];
+ usemap = usemap_map[nr_consumed_maps];
if (!usemap) {
ms->section_mem_map = 0;
+ nr_consumed_maps++;
continue;
}
#ifdef CONFIG_SPARSEMEM_ALLOC_MEM_MAP_TOGETHER
- map = map_map[pnum];
+ map = map_map[nr_consumed_maps];
#else
map = sparse_early_mem_map_alloc(pnum);
#endif
if (!map) {
ms->section_mem_map = 0;
+ nr_consumed_maps++;
continue;
}
sparse_init_one_section(__nr_to_section(pnum), pnum, map,
usemap);
+ nr_consumed_maps++;
}
vmemmap_populate_print_last();
--
2.13.6
On Wed, Jun 27, 2018 at 09:31:14AM +0800, Baoquan He wrote:
> In sparse_init(), if CONFIG_SPARSEMEM_ALLOC_MEM_MAP_TOGETHER=y, the system
> will allocate one contiguous memory chunk for the mem maps on one node and
> populate the relevant page tables to map the memory sections one by one. If
> populating a certain mem section fails, a warning is printed and its
> ->section_mem_map is cleared to cancel its marking as present.
> As a result, the number of mem sections marked as present could decrease
> during sparse_init() execution.
>
> Here, defer the clearing of ms->section_mem_map when populating its page
> tables fails until the last for_each_present_section_nr() loop. This is in
> preparation for optimizing the mem map allocation later.
>
> Signed-off-by: Baoquan He <[email protected]>
> ---
> mm/sparse-vmemmap.c | 1 -
> mm/sparse.c | 12 ++++++++----
> 2 files changed, 8 insertions(+), 5 deletions(-)
>
> diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
> index bd0276d5f66b..640e68f8324b 100644
> --- a/mm/sparse-vmemmap.c
> +++ b/mm/sparse-vmemmap.c
> @@ -303,7 +303,6 @@ void __init sparse_mem_maps_populate_node(struct page **map_map,
> ms = __nr_to_section(pnum);
> pr_err("%s: sparsemem memory map backing failed some memory will not be available\n",
> __func__);
> - ms->section_mem_map = 0;
Since we are deferring the clearing of section_mem_map, I guess we do not need
struct mem_section *ms;
ms = __nr_to_section(pnum);
anymore, right?
> }
>
> if (vmemmap_buf_start) {
> diff --git a/mm/sparse.c b/mm/sparse.c
> index 6314303130b0..71ad53da2cd1 100644
> --- a/mm/sparse.c
> +++ b/mm/sparse.c
> @@ -451,7 +451,6 @@ void __init sparse_mem_maps_populate_node(struct page **map_map,
> ms = __nr_to_section(pnum);
> pr_err("%s: sparsemem memory map backing failed some memory will not be available\n",
> __func__);
> - ms->section_mem_map = 0;
The same goes here.
--
Oscar Salvador
SUSE L3
This work made me think: why do we even have
CONFIG_SPARSEMEM_ALLOC_MEM_MAP_TOGETHER? This really should be the
default behavior for all systems. Yet, it is enabled only on x86_64.
We could clean up an already messy sparse.c if we removed this config
and enabled its path for all arches. We would not break anything,
because if we cannot allocate one large mmap_map we still fall back to
allocating a page at a time, the same as what happens when
CONFIG_SPARSEMEM_ALLOC_MEM_MAP_TOGETHER=n.
Pavel
On 06/27/18 at 11:54am, Oscar Salvador wrote:
> On Wed, Jun 27, 2018 at 09:31:14AM +0800, Baoquan He wrote:
> > In sparse_init(), if CONFIG_SPARSEMEM_ALLOC_MEM_MAP_TOGETHER=y, the system
> > will allocate one contiguous memory chunk for the mem maps on one node and
> > populate the relevant page tables to map the memory sections one by one. If
> > populating a certain mem section fails, a warning is printed and its
> > ->section_mem_map is cleared to cancel its marking as present.
> > As a result, the number of mem sections marked as present could decrease
> > during sparse_init() execution.
> >
> > Here, defer the clearing of ms->section_mem_map when populating its page
> > tables fails until the last for_each_present_section_nr() loop. This is in
> > preparation for optimizing the mem map allocation later.
> >
> > Signed-off-by: Baoquan He <[email protected]>
> > ---
> > mm/sparse-vmemmap.c | 1 -
> > mm/sparse.c | 12 ++++++++----
> > 2 files changed, 8 insertions(+), 5 deletions(-)
> >
> > diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
> > index bd0276d5f66b..640e68f8324b 100644
> > --- a/mm/sparse-vmemmap.c
> > +++ b/mm/sparse-vmemmap.c
> > @@ -303,7 +303,6 @@ void __init sparse_mem_maps_populate_node(struct page **map_map,
> > ms = __nr_to_section(pnum);
> > pr_err("%s: sparsemem memory map backing failed some memory will not be available\n",
> > __func__);
> > - ms->section_mem_map = 0;
>
> Since we are deferring the clearing of section_mem_map, I guess we do not need
>
> struct mem_section *ms;
> ms = __nr_to_section(pnum);
>
> anymore, right?
Right, good catch, thanks.
I will post a new round to fix this.
>
> > }
> >
> > if (vmemmap_buf_start) {
> > diff --git a/mm/sparse.c b/mm/sparse.c
> > index 6314303130b0..71ad53da2cd1 100644
> > --- a/mm/sparse.c
> > +++ b/mm/sparse.c
> > @@ -451,7 +451,6 @@ void __init sparse_mem_maps_populate_node(struct page **map_map,
> > ms = __nr_to_section(pnum);
> > pr_err("%s: sparsemem memory map backing failed some memory will not be available\n",
> > __func__);
> > - ms->section_mem_map = 0;
>
> The same goes here.
>
>
>
> --
> Oscar Salvador
> SUSE L3
Hi Pavel,
On 06/27/18 at 01:47pm, Pavel Tatashin wrote:
> This work made me think: why do we even have
> CONFIG_SPARSEMEM_ALLOC_MEM_MAP_TOGETHER? This really should be the
> default behavior for all systems. Yet, it is enabled only on x86_64.
> We could clean up an already messy sparse.c if we removed this config
> and enabled its path for all arches. We would not break anything,
> because if we cannot allocate one large mmap_map we still fall back to
> allocating a page at a time, the same as what happens when
> CONFIG_SPARSEMEM_ALLOC_MEM_MAP_TOGETHER=n.
Thanks for your idea.
It seems the common arches all have ARCH_SPARSEMEM_ENABLE, such as x86,
arm/arm64, power, s390 and mips; the others don't. For them, removing
CONFIG_SPARSEMEM_ALLOC_MEM_MAP_TOGETHER makes sense.
I will make a clean-up patch to do this, but I can only test it on x86.
If the test robot or anyone else reports an issue with this clean-up
patch, Andrew can help pick only the current 4 patches after updating,
and then we can continue discussing the clean-up patch. From the current
code, it should be OK for all arches.
Thanks
Baoquan
Once you remove the ms mentioned by Oscar:
Reviewed-by: Pavel Tatashin <[email protected]>
On Wed, Jun 27, 2018 at 6:59 PM Baoquan He <[email protected]> wrote:
>
> On 06/27/18 at 11:54am, Oscar Salvador wrote:
> > On Wed, Jun 27, 2018 at 09:31:14AM +0800, Baoquan He wrote:
> > > In sparse_init(), if CONFIG_SPARSEMEM_ALLOC_MEM_MAP_TOGETHER=y, the system
> > > will allocate one contiguous memory chunk for the mem maps on one node and
> > > populate the relevant page tables to map the memory sections one by one. If
> > > populating a certain mem section fails, a warning is printed and its
> > > ->section_mem_map is cleared to cancel its marking as present.
> > > As a result, the number of mem sections marked as present could decrease
> > > during sparse_init() execution.
> > >
> > > Here, defer the clearing of ms->section_mem_map when populating its page
> > > tables fails until the last for_each_present_section_nr() loop. This is in
> > > preparation for optimizing the mem map allocation later.
> > >
> > > Signed-off-by: Baoquan He <[email protected]>
> > > ---
> > > mm/sparse-vmemmap.c | 1 -
> > > mm/sparse.c | 12 ++++++++----
> > > 2 files changed, 8 insertions(+), 5 deletions(-)
> > >
> > > diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
> > > index bd0276d5f66b..640e68f8324b 100644
> > > --- a/mm/sparse-vmemmap.c
> > > +++ b/mm/sparse-vmemmap.c
> > > @@ -303,7 +303,6 @@ void __init sparse_mem_maps_populate_node(struct page **map_map,
> > > ms = __nr_to_section(pnum);
> > > pr_err("%s: sparsemem memory map backing failed some memory will not be available\n",
> > > __func__);
> > > - ms->section_mem_map = 0;
> >
> > Since we are deferring the clearing of section_mem_map, I guess we do not need
> >
> > struct mem_section *ms;
> > ms = __nr_to_section(pnum);
> >
> > anymore, right?
>
> Right, good catch, thanks.
>
> I will post a new round to fix this.
>
> >
> > > }
> > >
> > > if (vmemmap_buf_start) {
> > > diff --git a/mm/sparse.c b/mm/sparse.c
> > > index 6314303130b0..71ad53da2cd1 100644
> > > --- a/mm/sparse.c
> > > +++ b/mm/sparse.c
> > > @@ -451,7 +451,6 @@ void __init sparse_mem_maps_populate_node(struct page **map_map,
> > > ms = __nr_to_section(pnum);
> > > pr_err("%s: sparsemem memory map backing failed some memory will not be available\n",
> > > __func__);
> > > - ms->section_mem_map = 0;
> >
> > The same goes here.
> >
> >
> >
> > --
> > Oscar Salvador
> > SUSE L3
>
Honestly, I do not like this new argument, but it will do for now. I
could not think of a better way without rewriting everything.
Reviewed-by: Pavel Tatashin <[email protected]>
However, I will submit a series of patches to cleanup sparse.c and
completely remove large and confusing temporary buffers: map_map, and
usemap_map. In those patches, I will remove alloc_usemap_and_memmap().
On Tue, Jun 26, 2018 at 9:31 PM Baoquan He <[email protected]> wrote:
>
> alloc_usemap_and_memmap() is passed a "void *" that points to
> usemap_map or map_map. In the next patch we will change both map
> allocations from taking 'NR_MEM_SECTIONS' as the length to taking
> 'nr_present_sections' as the length. After that, the passed-in 'void *'
> needs to be updated as things get consumed. But it knows only the
> quantity of objects consumed and not their type. This patch effectively
> tells it enough about the type to let it update the pointer as
> objects are consumed.
>
> Signed-off-by: Baoquan He <[email protected]>
> ---
> mm/sparse.c | 10 +++++++---
> 1 file changed, 7 insertions(+), 3 deletions(-)
>
> diff --git a/mm/sparse.c b/mm/sparse.c
> index 71ad53da2cd1..b2848cc6e32a 100644
> --- a/mm/sparse.c
> +++ b/mm/sparse.c
> @@ -489,10 +489,12 @@ void __weak __meminit vmemmap_populate_print_last(void)
> /**
> * alloc_usemap_and_memmap - memory alloction for pageblock flags and vmemmap
> * @map: usemap_map for pageblock flags or mmap_map for vmemmap
> + * @unit_size: size of map unit
> */
> static void __init alloc_usemap_and_memmap(void (*alloc_func)
> (void *, unsigned long, unsigned long,
> - unsigned long, int), void *data)
> + unsigned long, int), void *data,
> + int data_unit_size)
> {
> unsigned long pnum;
> unsigned long map_count;
> @@ -569,7 +571,8 @@ void __init sparse_init(void)
> if (!usemap_map)
> panic("can not allocate usemap_map\n");
> alloc_usemap_and_memmap(sparse_early_usemaps_alloc_node,
> - (void *)usemap_map);
> + (void *)usemap_map,
> + sizeof(usemap_map[0]));
>
> #ifdef CONFIG_SPARSEMEM_ALLOC_MEM_MAP_TOGETHER
> size2 = sizeof(struct page *) * NR_MEM_SECTIONS;
> @@ -577,7 +580,8 @@ void __init sparse_init(void)
> if (!map_map)
> panic("can not allocate map_map\n");
> alloc_usemap_and_memmap(sparse_early_mem_maps_alloc_node,
> - (void *)map_map);
> + (void *)map_map,
> + sizeof(map_map[0]));
> #endif
>
> for_each_present_section_nr(0, pnum) {
> --
> 2.13.6
>
> Signed-off-by: Baoquan He <[email protected]>
>
> Signed-off-by: Baoquan He <[email protected]>
Please remove duplicated signed-off
> if (!usemap) {
> ms->section_mem_map = 0;
> + nr_consumed_maps++;
Currently, we do not set ms->section_mem_map to 0 when we fail to
allocate the usemap; we only set section_mem_map to 0 when we fail to
allocate the memmap. I think this is an existing bug.
Reviewed-by: Pavel Tatashin <[email protected]>
Reviewed-by: Pavel Tatashin <[email protected]>
On Tue, Jun 26, 2018 at 9:31 PM Baoquan He <[email protected]> wrote:
>
> It's used to record how many memory sections are marked as present
> during system boot-up, and will be used in a later patch.
>
> Signed-off-by: Baoquan He <[email protected]>
> ---
> mm/sparse.c | 7 +++++++
> 1 file changed, 7 insertions(+)
>
> diff --git a/mm/sparse.c b/mm/sparse.c
> index f13f2723950a..6314303130b0 100644
> --- a/mm/sparse.c
> +++ b/mm/sparse.c
> @@ -200,6 +200,12 @@ static inline int next_present_section_nr(int section_nr)
> (section_nr <= __highest_present_section_nr)); \
> section_nr = next_present_section_nr(section_nr))
>
> +/*
> + * Record how many memory sections are marked as present
> + * during system bootup.
> + */
> +static int __initdata nr_present_sections;
> +
> /* Record a memory area against a node. */
> void __init memory_present(int nid, unsigned long start, unsigned long end)
> {
> @@ -229,6 +235,7 @@ void __init memory_present(int nid, unsigned long start, unsigned long end)
> ms->section_mem_map = sparse_encode_early_nid(nid) |
> SECTION_IS_ONLINE;
> section_mark_present(ms);
> + nr_present_sections++;
> }
> }
> }
> --
> 2.13.6
>
On 06/27/18 at 11:19pm, Pavel Tatashin wrote:
> > Signed-off-by: Baoquan He <[email protected]>
> >
> > Signed-off-by: Baoquan He <[email protected]>
>
> Please remove duplicated signed-off
Done.
>
> > if (!usemap) {
> > ms->section_mem_map = 0;
> > + nr_consumed_maps++;
>
> Currently, we do not set ms->section_mem_map to 0 when fail to
> allocate usemap, only when fail to allocate mmap we set
> section_mem_map to 0. I think this is an existing bug.
Yes, I found it when changing the code. Later in sparse_init(), a check
was added to see whether the usemap is available; if not, we also do
"ms->section_mem_map = 0;" to clear its ->section_mem_map.
If we wanted to be perfect here, we might need to free the relevant
memmap, because the usemap is allocated together for a node while the
memmap could be allocated section by section. I didn't do that because
the usemap allocation is the smaller one; if even that allocation fails
at this early system initialization stage, the kernel won't live long,
so there is no point complicating the code for it.
>
> Reviewed-by: Pavel Tatashin <[email protected]>
On 06/27/18 at 11:14pm, Pavel Tatashin wrote:
> Honestly, I do not like this new argument, but it will do for now. I
> could not think of a better way without rewriting everything.
>
> Reviewed-by: Pavel Tatashin <[email protected]>
>
> However, I will submit a series of patches to cleanup sparse.c and
> completely remove large and confusing temporary buffers: map_map, and
> usemap_map. In those patches, I will remove alloc_usemap_and_memmap().
Great, I look forward to seeing them; I can help review.
> On Tue, Jun 26, 2018 at 9:31 PM Baoquan He <[email protected]> wrote:
> >
> > alloc_usemap_and_memmap() is passed a "void *" that points to
> > usemap_map or map_map. In the next patch we will change both map
> > allocations from taking 'NR_MEM_SECTIONS' as the length to taking
> > 'nr_present_sections' as the length. After that, the passed-in 'void *'
> > needs to be updated as things get consumed. But it knows only the
> > quantity of objects consumed and not their type. This patch effectively
> > tells it enough about the type to let it update the pointer as
> > objects are consumed.
> >
> > Signed-off-by: Baoquan He <[email protected]>
> > ---
> > mm/sparse.c | 10 +++++++---
> > 1 file changed, 7 insertions(+), 3 deletions(-)
> >
> > diff --git a/mm/sparse.c b/mm/sparse.c
> > index 71ad53da2cd1..b2848cc6e32a 100644
> > --- a/mm/sparse.c
> > +++ b/mm/sparse.c
> > @@ -489,10 +489,12 @@ void __weak __meminit vmemmap_populate_print_last(void)
> > /**
> > * alloc_usemap_and_memmap - memory alloction for pageblock flags and vmemmap
> > * @map: usemap_map for pageblock flags or mmap_map for vmemmap
> > + * @unit_size: size of map unit
> > */
> > static void __init alloc_usemap_and_memmap(void (*alloc_func)
> > (void *, unsigned long, unsigned long,
> > - unsigned long, int), void *data)
> > + unsigned long, int), void *data,
> > + int data_unit_size)
> > {
> > unsigned long pnum;
> > unsigned long map_count;
> > @@ -569,7 +571,8 @@ void __init sparse_init(void)
> > if (!usemap_map)
> > panic("can not allocate usemap_map\n");
> > alloc_usemap_and_memmap(sparse_early_usemaps_alloc_node,
> > - (void *)usemap_map);
> > + (void *)usemap_map,
> > + sizeof(usemap_map[0]));
> >
> > #ifdef CONFIG_SPARSEMEM_ALLOC_MEM_MAP_TOGETHER
> > size2 = sizeof(struct page *) * NR_MEM_SECTIONS;
> > @@ -577,7 +580,8 @@ void __init sparse_init(void)
> > if (!map_map)
> > panic("can not allocate map_map\n");
> > alloc_usemap_and_memmap(sparse_early_mem_maps_alloc_node,
> > - (void *)map_map);
> > + (void *)map_map,
> > + sizeof(map_map[0]));
> > #endif
> >
> > for_each_present_section_nr(0, pnum) {
> > --
> > 2.13.6
> >
> + /* The numner of present sections stored in nr_present_sections
^ "number", please.
This comment needs CodingStyle love.
> + * are kept the same since mem sections are marked as present in
^ s/are/is/
This sentence is really odd to me. What is the point? Just that we are
not making sections present? Could we just say that instead of
referring to functions and variable names?
> + * memory_present(). In this for loop, we need check which sections
> + * failed to allocate memmap or usemap, then clear its
> + * ->section_mem_map accordingly.
Rather than referring to the for loop, how about we actually comment the
code that is doing this operation?
> During this process, we need
This is missing a "to": "we need _to_ increase".
> + * increase 'nr_consumed_maps' whether its allocation of memmap
> + * or usemap failed or not, so that after we handle the i-th
> + * memory section, can get memmap and usemap of (i+1)-th section
> + * correctly. */
This makes no sense to me. Why are we incrementing 'nr_consumed_maps'
when we do not consume one?
You say that we increment it so that things will work, but not how or
why it makes things work. I'm confused.
> > + * increase 'nr_consumed_maps' whether its allocation of memmap
> > + * or usemap failed or not, so that after we handle the i-th
> > + * memory section, can get memmap and usemap of (i+1)-th section
> > + * correctly. */
>
> This makes no sense to me. Why are we incrementing 'nr_consumed_maps'
> when we do not consume one?
>
> You say that we increment it so that things will work, but not how or
> why it makes things work. I'm confused.
Hi Dave,
nr_consumed_maps is a local counter. map_map contains struct pages for
each section. In order to assign them to the correct sections, this local
counter must be incremented even when some parts of map_map are empty.
Here is an example:
Node1:
map_map[0] -> Struct pages ...
map_map[1] -> NULL
Node2:
map_map[2] -> Struct pages ...
We always want to configure section from Node2 with struct pages from
Node2. Even, if there are holes in-between. The same with usemap.
Pavel
On 06/29/2018 10:48 AM, Pavel Tatashin wrote:
> Here is example:
> Node1:
> map_map[0] -> Struct pages ...
> map_map[1] -> NULL
> Node2:
> map_map[2] -> Struct pages ...
>
> We always want to configure section from Node2 with struct pages from
> Node2. Even, if there are holes in-between. The same with usemap.
Right... But your example consumes two mem_map[]s.
But, from scanning the code, we increment nr_consumed_maps three times.
Correct?
On 06/29/2018 01:52 PM, Dave Hansen wrote:
> On 06/29/2018 10:48 AM, Pavel Tatashin wrote:
>> Here is example:
>> Node1:
>> map_map[0] -> Struct pages ...
>> map_map[1] -> NULL
>> Node2:
>> map_map[2] -> Struct pages ...
>>
>> We always want to configure section from Node2 with struct pages from
>> Node2. Even, if there are holes in-between. The same with usemap.
>
> Right... But your example consumes two mem_map[]s.
>
> But, from scanning the code, we increment nr_consumed_maps three times.
> Correct?
Correct: it should be incremented on every iteration of the loop, no matter whether the entries contained valid data or NULLs. So we increment in three places:
if map_map[] has an invalid entry: increment, continue
if usemap_map[] has an invalid entry: increment, continue
at the end of the loop, when everything was valid, we increment it
This is done so nr_consumed_maps does not get out of sync with the current pnum. pnum is not equal to nr_consumed_maps, as there may be holes in the pnums, but there is a one-to-one correlation.
Pavel
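To make that correlation concrete, here is a made-up layout (the section
numbers are purely illustrative):

	/*
	 * Hypothetical example: present sections 0, 1 and 5 (sections 2-4
	 * are holes), and populating section 1 failed:
	 *
	 *   pnum:              0       1       5
	 *   nr_consumed_maps:  0       1       2
	 *   map_map[i]:        pages   NULL    pages
	 *
	 * If the loop skipped the increment for the NULL entry, section 5
	 * would be paired with map_map[1] (NULL) instead of map_map[2].
	 */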
On 06/29/2018 11:01 AM, Pavel Tatashin wrote:
> Correct: it should be incremented on every iteration of the loop, no matter whether the entries contained valid data or NULLs. So we increment in three places:
>
> if map_map[] has an invalid entry: increment, continue
> if usemap_map[] has an invalid entry: increment, continue
> at the end of the loop, when everything was valid, we increment it
>
> This is done so nr_consumed_maps does not get out of sync with the
> current pnum. pnum is not equal to nr_consumed_maps, as there may be
> holes in the pnums, but there is a one-to-one correlation.
Can this be made more clear in the code?
> > This is done so nr_consumed_maps does not get out of sync with the
> > current pnum. pnum is not equal to nr_consumed_maps, as there may be
> > holes in the pnums, but there is a one-to-one correlation.
> Can this be made more clear in the code?
Absolutely. I've done it here:
http://lkml.kernel.org/r/[email protected]
Pavel