2019-02-12 02:13:24

by Wei Yang

[permalink] [raw]
Subject: [PATCH 0/6] x86, mm: refine split_mem_range a little

split_mem_range is used to prepare range before mapping kernel page table.
Here are some patches to refine it.

Patch [1-2] are trivial clean up.
Patch [3] add some comment to illustrate the split process and prepare for
following refine work.
Patch [4-6] are the interesting refine.

There are totally 3 kinds of ranges, while 1G range is only valid on x86_64.
Because of this, the code needs to handle two different configurations. The
conditional macro makes it a little difficult to understand the logic.

We could simplify the logic with a different perception of those ranges, this
is done in Patch [4]. The change is a little, while the result is neat and
clean.

After this, the refreshed code shows we could skip some redundant work when
the range doesn't span PMD or PUD. This is in Patch [4-5].

Wei Yang (6):
x86, mm: remove second argument of split_mem_range()
x86, mm: remove check in save_mr
x86, mm: add comment for split_mem_range to help understanding
x86, mm: make split_mem_range() more easy to read
x86, mm: skip 1G range if the range doesn't span PUD
x86, mm: x86, mm: jump to split only 4K range when range doesn't span
PMD

arch/x86/mm/init.c | 87 ++++++++++++++++++++++++++++++++--------------
1 file changed, 61 insertions(+), 26 deletions(-)

--
2.19.1



2019-02-12 02:13:28

by Wei Yang

[permalink] [raw]
Subject: [PATCH 1/6] x86, mm: remove second argument of split_mem_range()

The second argument is always zero.

Signed-off-by: Wei Yang <[email protected]>
---
arch/x86/mm/init.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
index ef99f3892e1f..1b980b70911a 100644
--- a/arch/x86/mm/init.c
+++ b/arch/x86/mm/init.c
@@ -330,13 +330,13 @@ static const char *page_size_string(struct map_range *mr)
return str_4k;
}

-static int __meminit split_mem_range(struct map_range *mr, int nr_range,
+static int __meminit split_mem_range(struct map_range *mr,
unsigned long start,
unsigned long end)
{
unsigned long start_pfn, end_pfn, limit_pfn;
unsigned long pfn;
- int i;
+ int i, nr_range = 0;

limit_pfn = PFN_DOWN(end);

@@ -474,7 +474,7 @@ unsigned long __ref init_memory_mapping(unsigned long start,
start, end - 1);

memset(mr, 0, sizeof(mr));
- nr_range = split_mem_range(mr, 0, start, end);
+ nr_range = split_mem_range(mr, start, end);

for (i = 0; i < nr_range; i++)
ret = kernel_physical_mapping_init(mr[i].start, mr[i].end,
--
2.19.1


2019-02-12 02:13:47

by Wei Yang

[permalink] [raw]
Subject: [PATCH 2/6] x86, mm: remove check in save_mr

This check has been validated before calling save_mr. It is not
necessary to do this again.

Signed-off-by: Wei Yang <[email protected]>
---
arch/x86/mm/init.c | 14 ++++++--------
1 file changed, 6 insertions(+), 8 deletions(-)

diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
index 1b980b70911a..6fb84be79c7c 100644
--- a/arch/x86/mm/init.c
+++ b/arch/x86/mm/init.c
@@ -259,14 +259,12 @@ static int __meminit save_mr(struct map_range *mr, int nr_range,
unsigned long start_pfn, unsigned long end_pfn,
unsigned long page_size_mask)
{
- if (start_pfn < end_pfn) {
- if (nr_range >= NR_RANGE_MR)
- panic("run out of range for init_memory_mapping\n");
- mr[nr_range].start = start_pfn<<PAGE_SHIFT;
- mr[nr_range].end = end_pfn<<PAGE_SHIFT;
- mr[nr_range].page_size_mask = page_size_mask;
- nr_range++;
- }
+ if (nr_range >= NR_RANGE_MR)
+ panic("run out of range for init_memory_mapping\n");
+ mr[nr_range].start = start_pfn<<PAGE_SHIFT;
+ mr[nr_range].end = end_pfn<<PAGE_SHIFT;
+ mr[nr_range].page_size_mask = page_size_mask;
+ nr_range++;

return nr_range;
}
--
2.19.1


2019-02-12 02:14:01

by Wei Yang

[permalink] [raw]
Subject: [PATCH 6/6] x86, mm: x86, mm: jump to split only 4K range when range doesn't span PMD

In case the first attempt to round_up pfn with PMD_SIZE exceed the
limit_pfn, this means we could only have 4K range.

This patch jumps to split only 4K range for this case.

Signed-off-by: Wei Yang <[email protected]>
---
arch/x86/mm/init.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
index cbb105388f24..bf422a637035 100644
--- a/arch/x86/mm/init.c
+++ b/arch/x86/mm/init.c
@@ -383,7 +383,7 @@ static int __meminit split_mem_range(struct map_range *mr,
end_pfn = round_up(pfn, PFN_DOWN(PMD_SIZE));
#endif
if (end_pfn > limit_pfn)
- end_pfn = limit_pfn;
+ goto only_4K_range;
if (start_pfn < end_pfn) {
nr_range = save_mr(mr, nr_range, start_pfn, end_pfn, 0);
pfn = end_pfn;
@@ -435,6 +435,7 @@ static int __meminit split_mem_range(struct map_range *mr,
* Range (E):
* tail is not big page (2M) alignment
*/
+only_4K_range:
start_pfn = pfn;
end_pfn = limit_pfn;
nr_range = save_mr(mr, nr_range, start_pfn, end_pfn, 0);
--
2.19.1


2019-02-12 02:14:39

by Wei Yang

[permalink] [raw]
Subject: [PATCH 5/6] x86, mm: skip 1G range if the range doesn't span PUD

If the 1G aligned pfn exceed the range, we are sure there won't be
possible 1G range. so we can just jump to split 2M range directly.

Signed-off-by: Wei Yang <[email protected]>
---
arch/x86/mm/init.c | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
index 87275238dbb0..cbb105388f24 100644
--- a/arch/x86/mm/init.c
+++ b/arch/x86/mm/init.c
@@ -394,10 +394,10 @@ static int __meminit split_mem_range(struct map_range *mr,
* Range (B):
* big page (2M) range
*/
- start_pfn = round_up(pfn, PFN_DOWN(PMD_SIZE));
end_pfn = round_up(pfn, PFN_DOWN(PUD_SIZE));
if (end_pfn > round_down(limit_pfn, PFN_DOWN(PMD_SIZE)))
- end_pfn = round_down(limit_pfn, PFN_DOWN(PMD_SIZE));
+ goto no_1G_range;
+ start_pfn = round_up(pfn, PFN_DOWN(PMD_SIZE));
if (start_pfn < end_pfn) {
nr_range = save_mr(mr, nr_range, start_pfn, end_pfn,
page_size_mask & (1<<PG_LEVEL_2M));
@@ -416,6 +416,7 @@ static int __meminit split_mem_range(struct map_range *mr,
((1<<PG_LEVEL_2M)|(1<<PG_LEVEL_1G)));
pfn = end_pfn;
}
+no_1G_range:
#endif

/*
--
2.19.1


2019-02-12 02:14:39

by Wei Yang

[permalink] [raw]
Subject: [PATCH 3/6] x86, mm: add comment for split_mem_range to help understanding

Describing the possible ranges in split and marking ranges with name to
help audience understand the logic.

Also this prepares to illustrate a code refine next.

Signed-off-by: Wei Yang <[email protected]>
---
arch/x86/mm/init.c | 51 ++++++++++++++++++++++++++++++++++++++++------
1 file changed, 45 insertions(+), 6 deletions(-)

diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
index 6fb84be79c7c..2b782dcd6d71 100644
--- a/arch/x86/mm/init.c
+++ b/arch/x86/mm/init.c
@@ -328,6 +328,31 @@ static const char *page_size_string(struct map_range *mr)
return str_4k;
}

+/*
+ * There are 3 types of ranges:
+ *
+ * k : 4K size
+ * m : 2M size
+ * G : 1G size
+ *
+ * 1G size is only valid when CONFIG_X86_64 is set.
+ *
+ * So we can describe the possible ranges like below:
+ *
+ * kkkmmmGGGmmmkkk
+ * (A)(B)(C)(D)(E)
+ *
+ * This means there are at most:
+ *
+ * 3 ranges when CONFIG_X86_32 is set
+ * 5 ranges when CONFIG_X86_64 is set
+ *
+ * which corresponds to the definition of NR_RANGE_MR.
+ *
+ * split_mem_range() does the split from low to high. By naming these ranges
+ * to A, B, C, D, E respectively and marking the name in following comment, it
+ * may help you to understand how ranges are split.
+ */
static int __meminit split_mem_range(struct map_range *mr,
unsigned long start,
unsigned long end)
@@ -338,7 +363,10 @@ static int __meminit split_mem_range(struct map_range *mr,

limit_pfn = PFN_DOWN(end);

- /* head if not big page alignment ? */
+ /*
+ * Range (A):
+ * head if not big page alignment ?
+ */
pfn = start_pfn = PFN_DOWN(start);
#ifdef CONFIG_X86_32
/*
@@ -361,7 +389,10 @@ static int __meminit split_mem_range(struct map_range *mr,
pfn = end_pfn;
}

- /* big page (2M) range */
+ /*
+ * Range (B):
+ * big page (2M) range
+ */
start_pfn = round_up(pfn, PFN_DOWN(PMD_SIZE));
#ifdef CONFIG_X86_32
end_pfn = round_down(limit_pfn, PFN_DOWN(PMD_SIZE));
@@ -370,7 +401,6 @@ static int __meminit split_mem_range(struct map_range *mr,
if (end_pfn > round_down(limit_pfn, PFN_DOWN(PMD_SIZE)))
end_pfn = round_down(limit_pfn, PFN_DOWN(PMD_SIZE));
#endif
-
if (start_pfn < end_pfn) {
nr_range = save_mr(mr, nr_range, start_pfn, end_pfn,
page_size_mask & (1<<PG_LEVEL_2M));
@@ -378,7 +408,10 @@ static int __meminit split_mem_range(struct map_range *mr,
}

#ifdef CONFIG_X86_64
- /* big page (1G) range */
+ /*
+ * Range (C):
+ * big page (1G) range
+ */
start_pfn = round_up(pfn, PFN_DOWN(PUD_SIZE));
end_pfn = round_down(limit_pfn, PFN_DOWN(PUD_SIZE));
if (start_pfn < end_pfn) {
@@ -388,7 +421,10 @@ static int __meminit split_mem_range(struct map_range *mr,
pfn = end_pfn;
}

- /* tail is not big page (1G) alignment */
+ /*
+ * Range (D):
+ * big page (2M) range
+ */
start_pfn = round_up(pfn, PFN_DOWN(PMD_SIZE));
end_pfn = round_down(limit_pfn, PFN_DOWN(PMD_SIZE));
if (start_pfn < end_pfn) {
@@ -398,7 +434,10 @@ static int __meminit split_mem_range(struct map_range *mr,
}
#endif

- /* tail is not big page (2M) alignment */
+ /*
+ * Range (E):
+ * tail is not big page (2M) alignment
+ */
start_pfn = pfn;
end_pfn = limit_pfn;
nr_range = save_mr(mr, nr_range, start_pfn, end_pfn, 0);
--
2.19.1


2019-02-12 02:15:48

by Wei Yang

[permalink] [raw]
Subject: [PATCH 4/6] x86, mm: make split_mem_range() more easy to read

As the comment explains, there are at most 5 possible ranges:

kkkmmmGGGmmmkkk
(A)(B)(C)(D)(E)

While there are two ways to perceive:

* C & D are extra ranges on X86_64
* B & C are extra ranges on X86_64

Current implementation takes the first way, which leads to handling
end_pfn of B differently:

* align to PMD on X86_32
* align to PUD on X86_64

If we take the second way, we don't need to handle it differently.

* the end_pfn of B only need to align to PUD
* the end_pfn of D only need to align to PMD

This patch changes the implementation from the first perception to the
second to reduce one different handling on end_pfn. After doing so, the
code is easier to read.

Signed-off-by: Wei Yang <[email protected]>
---
arch/x86/mm/init.c | 8 ++------
1 file changed, 2 insertions(+), 6 deletions(-)

diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
index 2b782dcd6d71..87275238dbb0 100644
--- a/arch/x86/mm/init.c
+++ b/arch/x86/mm/init.c
@@ -389,25 +389,21 @@ static int __meminit split_mem_range(struct map_range *mr,
pfn = end_pfn;
}

+#ifdef CONFIG_X86_64
/*
* Range (B):
* big page (2M) range
*/
start_pfn = round_up(pfn, PFN_DOWN(PMD_SIZE));
-#ifdef CONFIG_X86_32
- end_pfn = round_down(limit_pfn, PFN_DOWN(PMD_SIZE));
-#else /* CONFIG_X86_64 */
end_pfn = round_up(pfn, PFN_DOWN(PUD_SIZE));
if (end_pfn > round_down(limit_pfn, PFN_DOWN(PMD_SIZE)))
end_pfn = round_down(limit_pfn, PFN_DOWN(PMD_SIZE));
-#endif
if (start_pfn < end_pfn) {
nr_range = save_mr(mr, nr_range, start_pfn, end_pfn,
page_size_mask & (1<<PG_LEVEL_2M));
pfn = end_pfn;
}

-#ifdef CONFIG_X86_64
/*
* Range (C):
* big page (1G) range
@@ -420,6 +416,7 @@ static int __meminit split_mem_range(struct map_range *mr,
((1<<PG_LEVEL_2M)|(1<<PG_LEVEL_1G)));
pfn = end_pfn;
}
+#endif

/*
* Range (D):
@@ -432,7 +429,6 @@ static int __meminit split_mem_range(struct map_range *mr,
page_size_mask & (1<<PG_LEVEL_2M));
pfn = end_pfn;
}
-#endif

/*
* Range (E):
--
2.19.1


2019-03-24 14:31:04

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH 4/6] x86, mm: make split_mem_range() more easy to read

Wei,

On Tue, 12 Feb 2019, Wei Yang wrote:
>
> This patch changes the implementation from the first perception to the
> second to reduce one different handling on end_pfn. After doing so, the
> code is easier to read.

It's maybe slightly easier to read, but it's still completely unreadable
garbage.

Not your fault, it was garbage before.

But refining garbage still results in garbage. Just the smell is slightly
different.

Why?

1) Doing all the calculations PFN based is just a pointless
indirection. Just look at all the rounding magic and back and forth
conversions.

All of that can be done purely address/size based which makes the code
truly readable.

2) The 5(3) sections are more or less copied code with a few changes of
constants, except for the first section (A) which has an extra quirk
for 32bit. Plus all the 64bit vs. 32bit #ifdeffery which is not making
it any better.

This copied mess can be avoided by using helper functions and proper
loops.

3) During the bootmem phase the code tries to preserve large mappings
_AFTER_ splitting them up and then it tries to merge the resulting
overlaps.

This is completely backwards because the expansion of the range can be
tried right away when then mapping of a large page is attempted. Surely
not with the current mess, but with a proper loop based approach it can
be done nicely.

Btw, there is a bug in that expansion code which could result in
undoing the enforced 4K mapping of the lower 2M/4M range on 32bit. It's
probably not an issue in practice because the low range is usually not
contiguous. But it works by chance not by design.

4) The debug print string retrieval function is just silly.

The logic for rewriting this is pretty obvious:

while (addr < end) {
setup_mr(mr, addr, end);
for_each_map_size(1G, 2M, 4K) {
if (try_map(mr, size))
break;
}
addr = mr->end;
}

setup_mr() takes care of the 32bit 0-2/4M range by limiting the
range and setting the allowed size mask in mr to 4k only.

try_map() does:

1) Try to expand the range to preserve large pages during bootmem

2) If the range start is not aligned, limit the end to the large
size boundary, so the next lower map size will only cover the
unaligned portion.

3) If the range end is not aligned, fit as much large
size as possible.

No magic A-E required, because it just falls into place naturally and
the expansion is done at the right place and not after the fact.

Find a mostly untested patch which implements this below. I just booted it
in a 64bit guest and it did not explode.

It removes 55 lines of code instead of adding 35 and reduces the binary
size by 408 bytes on 64bit and 128 bytes on 32bit.

Note, it's a combo of changes (including your patch 1/6) and needs to be
split up. It would be nice if you have time to split it up into separate
patches, add proper changelogs and test the heck out of it on both 32 and
64 bit. If you don't have time, please let me know.

Thanks,

tglx

8<---------------
arch/x86/mm/init.c | 259 ++++++++++++++++++++---------------------------------
1 file changed, 102 insertions(+), 157 deletions(-)

--- a/arch/x86/mm/init.c
+++ b/arch/x86/mm/init.c
@@ -157,17 +157,28 @@ void __init early_alloc_pgt_buf(void)
pgt_buf_top = pgt_buf_start + (tables >> PAGE_SHIFT);
}

-int after_bootmem;
+int after_bootmem __ro_after_init;

early_param_on_off("gbpages", "nogbpages", direct_gbpages, CONFIG_X86_DIRECT_GBPAGES);

struct map_range {
- unsigned long start;
- unsigned long end;
- unsigned page_size_mask;
+ unsigned long start;
+ unsigned long end;
+ unsigned int page_size_mask;
};

-static int page_size_mask;
+#ifdef CONFIG_X86_32
+#define NR_RANGE_MR 3
+#else
+#define NR_RANGE_MR 5
+#endif
+
+struct mapinfo {
+ unsigned int mask;
+ unsigned int size;
+};
+
+static unsigned int page_size_mask __ro_after_init;

static void __init probe_page_size_mask(void)
{
@@ -177,7 +188,7 @@ static void __init probe_page_size_mask(
* large pages into small in interrupt context, etc.
*/
if (boot_cpu_has(X86_FEATURE_PSE) && !debug_pagealloc_enabled())
- page_size_mask |= 1 << PG_LEVEL_2M;
+ page_size_mask |= 1U << PG_LEVEL_2M;
else
direct_gbpages = 0;

@@ -201,7 +212,7 @@ static void __init probe_page_size_mask(
/* Enable 1 GB linear kernel mappings if available: */
if (direct_gbpages && boot_cpu_has(X86_FEATURE_GBPAGES)) {
printk(KERN_INFO "Using GB pages for direct mapping\n");
- page_size_mask |= 1 << PG_LEVEL_1G;
+ page_size_mask |= 1U << PG_LEVEL_1G;
} else {
direct_gbpages = 0;
}
@@ -249,185 +260,119 @@ static void setup_pcid(void)
}
}

-#ifdef CONFIG_X86_32
-#define NR_RANGE_MR 3
-#else /* CONFIG_X86_64 */
-#define NR_RANGE_MR 5
+static void __meminit mr_print(struct map_range *mr, unsigned int maxidx)
+{
+#if defined(CONFIG_X86_32) && !defined(CONFIG_X86_PAE)
+ static const char *sz[2] = { "4k", "4M" };
+#else
+ static const char *sz[4] = { "4k", "2M", "1G", "" };
#endif
+ unsigned int idx, s;

-static int __meminit save_mr(struct map_range *mr, int nr_range,
- unsigned long start_pfn, unsigned long end_pfn,
- unsigned long page_size_mask)
-{
- if (start_pfn < end_pfn) {
- if (nr_range >= NR_RANGE_MR)
- panic("run out of range for init_memory_mapping\n");
- mr[nr_range].start = start_pfn<<PAGE_SHIFT;
- mr[nr_range].end = end_pfn<<PAGE_SHIFT;
- mr[nr_range].page_size_mask = page_size_mask;
- nr_range++;
+ for (idx = 0; idx < maxidx; idx++, mr++) {
+ s = (mr->page_size_mask >> PG_LEVEL_2M) & (ARRAY_SIZE(sz) -1);
+ pr_debug(" [mem %#010lx-%#010lx] page size %s\n",
+ mr->start, mr->end - 1, sz[s]);
}
-
- return nr_range;
}

/*
- * adjust the page_size_mask for small range to go with
- * big page size instead small one if nearby are ram too.
+ * Try to preserve large mappings during bootmem by expanding the current
+ * range to large page mapping of @size and verifying that the result is
+ * within a memory region.
*/
-static void __ref adjust_range_page_size_mask(struct map_range *mr,
- int nr_range)
+static void __meminit mr_expand(struct map_range *mr, unsigned int size)
{
- int i;
-
- for (i = 0; i < nr_range; i++) {
- if ((page_size_mask & (1<<PG_LEVEL_2M)) &&
- !(mr[i].page_size_mask & (1<<PG_LEVEL_2M))) {
- unsigned long start = round_down(mr[i].start, PMD_SIZE);
- unsigned long end = round_up(mr[i].end, PMD_SIZE);
+ unsigned long start = round_down(mr->start, size);
+ unsigned long end = round_up(mr->end, size);

-#ifdef CONFIG_X86_32
- if ((end >> PAGE_SHIFT) > max_low_pfn)
- continue;
-#endif
+ if (IS_ENABLED(CONFIG_X86_32) && (end >> PAGE_SHIFT) > max_low_pfn)
+ return;

- if (memblock_is_region_memory(start, end - start))
- mr[i].page_size_mask |= 1<<PG_LEVEL_2M;
- }
- if ((page_size_mask & (1<<PG_LEVEL_1G)) &&
- !(mr[i].page_size_mask & (1<<PG_LEVEL_1G))) {
- unsigned long start = round_down(mr[i].start, PUD_SIZE);
- unsigned long end = round_up(mr[i].end, PUD_SIZE);
-
- if (memblock_is_region_memory(start, end - start))
- mr[i].page_size_mask |= 1<<PG_LEVEL_1G;
- }
+ if (memblock_is_region_memory(start, end - start)) {
+ mr->start = start;
+ mr->end = end;
}
}

-static const char *page_size_string(struct map_range *mr)
+static bool __meminit mr_try_map(struct map_range *mr, const struct mapinfo *mi)
{
- static const char str_1g[] = "1G";
- static const char str_2m[] = "2M";
- static const char str_4m[] = "4M";
- static const char str_4k[] = "4k";
+ unsigned long len;

- if (mr->page_size_mask & (1<<PG_LEVEL_1G))
- return str_1g;
- /*
- * 32-bit without PAE has a 4M large page size.
- * PG_LEVEL_2M is misnamed, but we can at least
- * print out the right size in the string.
- */
- if (IS_ENABLED(CONFIG_X86_32) &&
- !IS_ENABLED(CONFIG_X86_PAE) &&
- mr->page_size_mask & (1<<PG_LEVEL_2M))
- return str_4m;
+ /* Check whether the map size is supported. PAGE_SIZE always is. */
+ if (mi->mask && !(mr->page_size_mask & mi->mask))
+ return false;

- if (mr->page_size_mask & (1<<PG_LEVEL_2M))
- return str_2m;
+ if (!after_bootmem)
+ mr_expand(mr, mi->size);

- return str_4k;
-}
+ if (!IS_ALIGNED(mr->start, mi->size)) {
+ /* Limit the range to the next boundary of this size. */
+ mr->end = min_t(unsigned long, mr->end,
+ round_up(mr->start, mi->size));
+ return false;
+ }

-static int __meminit split_mem_range(struct map_range *mr, int nr_range,
- unsigned long start,
- unsigned long end)
-{
- unsigned long start_pfn, end_pfn, limit_pfn;
- unsigned long pfn;
- int i;
+ if (!IS_ALIGNED(mr->end, mi->size)) {
+ /* Try to fit as much as possible */
+ len = round_down(mr->end - mr->start, mi->size);
+ if (!len)
+ return false;
+ mr->end = mr->start + len;
+ }

- limit_pfn = PFN_DOWN(end);
+ /* Store the effective page size mask */
+ mr->page_size_mask = mi->mask;
+ return true;
+}

- /* head if not big page alignment ? */
- pfn = start_pfn = PFN_DOWN(start);
-#ifdef CONFIG_X86_32
+static void __meminit mr_setup(struct map_range *mr, unsigned long start,
+ unsigned long end)
+{
/*
- * Don't use a large page for the first 2/4MB of memory
- * because there are often fixed size MTRRs in there
- * and overlapping MTRRs into large pages can cause
- * slowdowns.
+ * On 32bit the first 2/4MB are often covered by fixed size MTRRs.
+ * Overlapping MTRRs on large pages can cause slowdowns. Force 4k
+ * mappings.
*/
- if (pfn == 0)
- end_pfn = PFN_DOWN(PMD_SIZE);
- else
- end_pfn = round_up(pfn, PFN_DOWN(PMD_SIZE));
-#else /* CONFIG_X86_64 */
- end_pfn = round_up(pfn, PFN_DOWN(PMD_SIZE));
-#endif
- if (end_pfn > limit_pfn)
- end_pfn = limit_pfn;
- if (start_pfn < end_pfn) {
- nr_range = save_mr(mr, nr_range, start_pfn, end_pfn, 0);
- pfn = end_pfn;
- }
-
- /* big page (2M) range */
- start_pfn = round_up(pfn, PFN_DOWN(PMD_SIZE));
-#ifdef CONFIG_X86_32
- end_pfn = round_down(limit_pfn, PFN_DOWN(PMD_SIZE));
-#else /* CONFIG_X86_64 */
- end_pfn = round_up(pfn, PFN_DOWN(PUD_SIZE));
- if (end_pfn > round_down(limit_pfn, PFN_DOWN(PMD_SIZE)))
- end_pfn = round_down(limit_pfn, PFN_DOWN(PMD_SIZE));
-#endif
-
- if (start_pfn < end_pfn) {
- nr_range = save_mr(mr, nr_range, start_pfn, end_pfn,
- page_size_mask & (1<<PG_LEVEL_2M));
- pfn = end_pfn;
+ if (IS_ENABLED(CONFIG_X86_32) && start < PMD_SIZE) {
+ mr->page_size_mask = 0;
+ mr->end = min_t(unsigned long, end, PMD_SIZE);
+ } else {
+ /* Set the possible mapping sizes and allow full range. */
+ mr->page_size_mask = page_size_mask;
+ mr->end = end;
}
+ mr->start = start;
+}

+static int __meminit split_mem_range(struct map_range *mr, unsigned long start,
+ unsigned long end)
+{
+ static const struct mapinfo mapinfos[] = {
#ifdef CONFIG_X86_64
- /* big page (1G) range */
- start_pfn = round_up(pfn, PFN_DOWN(PUD_SIZE));
- end_pfn = round_down(limit_pfn, PFN_DOWN(PUD_SIZE));
- if (start_pfn < end_pfn) {
- nr_range = save_mr(mr, nr_range, start_pfn, end_pfn,
- page_size_mask &
- ((1<<PG_LEVEL_2M)|(1<<PG_LEVEL_1G)));
- pfn = end_pfn;
- }
-
- /* tail is not big page (1G) alignment */
- start_pfn = round_up(pfn, PFN_DOWN(PMD_SIZE));
- end_pfn = round_down(limit_pfn, PFN_DOWN(PMD_SIZE));
- if (start_pfn < end_pfn) {
- nr_range = save_mr(mr, nr_range, start_pfn, end_pfn,
- page_size_mask & (1<<PG_LEVEL_2M));
- pfn = end_pfn;
- }
+ { .mask = 1U << PG_LEVEL_1G, .size = PUD_SIZE },
#endif
+ { .mask = 1U << PG_LEVEL_2M, .size = PMD_SIZE },
+ { .mask = 0, .size = PAGE_SIZE },
+ };
+ const struct mapinfo *mi;
+ struct map_range *curmr;
+ unsigned long addr;
+ int idx;
+
+ for (idx = 0, addr = start, curmr = mr; addr < end; idx++, curmr++) {
+ BUG_ON(idx == NR_RANGE_MR);

- /* tail is not big page (2M) alignment */
- start_pfn = pfn;
- end_pfn = limit_pfn;
- nr_range = save_mr(mr, nr_range, start_pfn, end_pfn, 0);
+ mr_setup(curmr, addr, end);

- if (!after_bootmem)
- adjust_range_page_size_mask(mr, nr_range);
+ /* Try map sizes top down. PAGE_SIZE will always succeed. */
+ for (mi = mapinfos; !mr_try_map(curmr, mi); mi++);

- /* try to merge same page size and continuous */
- for (i = 0; nr_range > 1 && i < nr_range - 1; i++) {
- unsigned long old_start;
- if (mr[i].end != mr[i+1].start ||
- mr[i].page_size_mask != mr[i+1].page_size_mask)
- continue;
- /* move it */
- old_start = mr[i].start;
- memmove(&mr[i], &mr[i+1],
- (nr_range - 1 - i) * sizeof(struct map_range));
- mr[i--].start = old_start;
- nr_range--;
+ /* Get the start address for the next range */
+ addr = curmr->end;
}
-
- for (i = 0; i < nr_range; i++)
- pr_debug(" [mem %#010lx-%#010lx] page %s\n",
- mr[i].start, mr[i].end - 1,
- page_size_string(&mr[i]));
-
- return nr_range;
+ mr_print(mr, idx);
+ return idx;
}

struct range pfn_mapped[E820_MAX_ENTRIES];
@@ -474,7 +419,7 @@ unsigned long __ref init_memory_mapping(
start, end - 1);

memset(mr, 0, sizeof(mr));
- nr_range = split_mem_range(mr, 0, start, end);
+ nr_range = split_mem_range(mr, start, end);

for (i = 0; i < nr_range; i++)
ret = kernel_physical_mapping_init(mr[i].start, mr[i].end,

2019-03-24 14:39:21

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH 1/6] x86, mm: remove second argument of split_mem_range()

Wei,

On Tue, 12 Feb 2019, Wei Yang wrote:

Vs. the subject line. 'x86, mm:' is not the proper prefix.

# git log path/to/file

gives you usually a pretty good hint, i.e. in this case: 'x86/mm:'

Also the sentence after the colon starts with an upper case letter. So the
full subject would be:

x86/mm: Remove ......

Other than that. This looks good.

Thanks,

tglx

2019-03-25 01:19:52

by Wei Yang

[permalink] [raw]
Subject: Re: [PATCH 1/6] x86, mm: remove second argument of split_mem_range()

On Sun, Mar 24, 2019 at 03:38:24PM +0100, Thomas Gleixner wrote:
>Wei,
>
>On Tue, 12 Feb 2019, Wei Yang wrote:
>
>Vs. the subject line. 'x86, mm:' is not the proper prefix.
>
># git log path/to/file
>
>gives you usually a pretty good hint, i.e. in this case: 'x86/mm:'
>
>Also the sentence after the colon starts with an upper case letter. So the
>full subject would be:
>
> x86/mm: Remove ......
>

Ah, thanks for your comment! Will change this in next version.

>Other than that. This looks good.
>
>Thanks,
>
> tglx

--
Wei Yang
Help you, Help me

2019-03-27 22:06:18

by Wei Yang

[permalink] [raw]
Subject: Re: [PATCH 4/6] x86, mm: make split_mem_range() more easy to read

On Sun, Mar 24, 2019 at 03:29:04PM +0100, Thomas Gleixner wrote:
>Wei,
>
>On Tue, 12 Feb 2019, Wei Yang wrote:
>>
>> This patch changes the implementation from the first perception to the
>> second to reduce one different handling on end_pfn. After doing so, the
>> code is easier to read.
>
>It's maybe slightly easier to read, but it's still completely unreadable
>garbage.
>
> Not your fault, it was garbage before.
>
>But refining garbage still results in garbage. Just the smell is slightly
>different.
>
>Why?
>
> 1) Doing all the calculations PFN based is just a pointless
> indirection. Just look at all the rounding magic and back and forth
> conversions.
>
> All of that can be done purely address/size based which makes the code
> truly readable.
>
> 2) The 5(3) sections are more or less copied code with a few changes of
> constants, except for the first section (A) which has an extra quirk
> for 32bit. Plus all the 64bit vs. 32bit #ifdeffery which is not making
> it any better.
>
> This copied mess can be avoided by using helper functions and proper
> loops.
>
> 3) During the bootmem phase the code tries to preserve large mappings
> _AFTER_ splitting them up and then it tries to merge the resulting
> overlaps.
>
> This is completely backwards because the expansion of the range can be
> tried right away when then mapping of a large page is attempted. Surely
> not with the current mess, but with a proper loop based approach it can
> be done nicely.
>
> Btw, there is a bug in that expansion code which could result in
> undoing the enforced 4K mapping of the lower 2M/4M range on 32bit. It's
> probably not an issue in practice because the low range is usually not
> contiguous. But it works by chance not by design.
>
> 4) The debug print string retrieval function is just silly.
>
>The logic for rewriting this is pretty obvious:
>
> while (addr < end) {
> setup_mr(mr, addr, end);
> for_each_map_size(1G, 2M, 4K) {
> if (try_map(mr, size))
> break;
> }
> addr = mr->end;
> }
>
> setup_mr() takes care of the 32bit 0-2/4M range by limiting the
> range and setting the allowed size mask in mr to 4k only.
>
> try_map() does:
>
> 1) Try to expand the range to preserve large pages during bootmem
>
> 2) If the range start is not aligned, limit the end to the large
> size boundary, so the next lower map size will only cover the
> unaligned portion.
>
> 3) If the range end is not aligned, fit as much large
> size as possible.
>
> No magic A-E required, because it just falls into place naturally and
> the expansion is done at the right place and not after the fact.
>
>Find a mostly untested patch which implements this below. I just booted it
>in a 64bit guest and it did not explode.
>
>It removes 55 lines of code instead of adding 35 and reduces the binary
>size by 408 bytes on 64bit and 128 bytes on 32bit.
>
>Note, it's a combo of changes (including your patch 1/6) and needs to be
>split up. It would be nice if you have time to split it up into separate
>patches, add proper changelogs and test the heck out of it on both 32 and
>64 bit. If you don't have time, please let me know.
>

Thanks for your suggestions :-)

Just get my head up, will try to understand the code and test on both
arch.

BTW, do you have some suggestions in the test? Currently I just use
bootup test. Basicly I think this is fine to cover the cases. Maybe you
would have some better idea.

>Thanks,
>
> tglx
>
>8<---------------
> arch/x86/mm/init.c | 259 ++++++++++++++++++++---------------------------------
> 1 file changed, 102 insertions(+), 157 deletions(-)
>
>--- a/arch/x86/mm/init.c
>+++ b/arch/x86/mm/init.c
>@@ -157,17 +157,28 @@ void __init early_alloc_pgt_buf(void)
> pgt_buf_top = pgt_buf_start + (tables >> PAGE_SHIFT);
> }
>
>-int after_bootmem;
>+int after_bootmem __ro_after_init;
>
> early_param_on_off("gbpages", "nogbpages", direct_gbpages, CONFIG_X86_DIRECT_GBPAGES);
>
> struct map_range {
>- unsigned long start;
>- unsigned long end;
>- unsigned page_size_mask;
>+ unsigned long start;
>+ unsigned long end;
>+ unsigned int page_size_mask;
> };
>
>-static int page_size_mask;
>+#ifdef CONFIG_X86_32
>+#define NR_RANGE_MR 3
>+#else
>+#define NR_RANGE_MR 5
>+#endif
>+
>+struct mapinfo {
>+ unsigned int mask;
>+ unsigned int size;
>+};
>+
>+static unsigned int page_size_mask __ro_after_init;
>
> static void __init probe_page_size_mask(void)
> {
>@@ -177,7 +188,7 @@ static void __init probe_page_size_mask(
> * large pages into small in interrupt context, etc.
> */
> if (boot_cpu_has(X86_FEATURE_PSE) && !debug_pagealloc_enabled())
>- page_size_mask |= 1 << PG_LEVEL_2M;
>+ page_size_mask |= 1U << PG_LEVEL_2M;
> else
> direct_gbpages = 0;
>
>@@ -201,7 +212,7 @@ static void __init probe_page_size_mask(
> /* Enable 1 GB linear kernel mappings if available: */
> if (direct_gbpages && boot_cpu_has(X86_FEATURE_GBPAGES)) {
> printk(KERN_INFO "Using GB pages for direct mapping\n");
>- page_size_mask |= 1 << PG_LEVEL_1G;
>+ page_size_mask |= 1U << PG_LEVEL_1G;
> } else {
> direct_gbpages = 0;
> }
>@@ -249,185 +260,119 @@ static void setup_pcid(void)
> }
> }
>
>-#ifdef CONFIG_X86_32
>-#define NR_RANGE_MR 3
>-#else /* CONFIG_X86_64 */
>-#define NR_RANGE_MR 5
>+static void __meminit mr_print(struct map_range *mr, unsigned int maxidx)
>+{
>+#if defined(CONFIG_X86_32) && !defined(CONFIG_X86_PAE)
>+ static const char *sz[2] = { "4k", "4M" };
>+#else
>+ static const char *sz[4] = { "4k", "2M", "1G", "" };
> #endif
>+ unsigned int idx, s;
>
>-static int __meminit save_mr(struct map_range *mr, int nr_range,
>- unsigned long start_pfn, unsigned long end_pfn,
>- unsigned long page_size_mask)
>-{
>- if (start_pfn < end_pfn) {
>- if (nr_range >= NR_RANGE_MR)
>- panic("run out of range for init_memory_mapping\n");
>- mr[nr_range].start = start_pfn<<PAGE_SHIFT;
>- mr[nr_range].end = end_pfn<<PAGE_SHIFT;
>- mr[nr_range].page_size_mask = page_size_mask;
>- nr_range++;
>+ for (idx = 0; idx < maxidx; idx++, mr++) {
>+ s = (mr->page_size_mask >> PG_LEVEL_2M) & (ARRAY_SIZE(sz) -1);
>+ pr_debug(" [mem %#010lx-%#010lx] page size %s\n",
>+ mr->start, mr->end - 1, sz[s]);
> }
>-
>- return nr_range;
> }
>
> /*
>- * adjust the page_size_mask for small range to go with
>- * big page size instead small one if nearby are ram too.
>+ * Try to preserve large mappings during bootmem by expanding the current
>+ * range to large page mapping of @size and verifying that the result is
>+ * within a memory region.
> */
>-static void __ref adjust_range_page_size_mask(struct map_range *mr,
>- int nr_range)
>+static void __meminit mr_expand(struct map_range *mr, unsigned int size)
> {
>- int i;
>-
>- for (i = 0; i < nr_range; i++) {
>- if ((page_size_mask & (1<<PG_LEVEL_2M)) &&
>- !(mr[i].page_size_mask & (1<<PG_LEVEL_2M))) {
>- unsigned long start = round_down(mr[i].start, PMD_SIZE);
>- unsigned long end = round_up(mr[i].end, PMD_SIZE);
>+ unsigned long start = round_down(mr->start, size);
>+ unsigned long end = round_up(mr->end, size);
>
>-#ifdef CONFIG_X86_32
>- if ((end >> PAGE_SHIFT) > max_low_pfn)
>- continue;
>-#endif
>+ if (IS_ENABLED(CONFIG_X86_32) && (end >> PAGE_SHIFT) > max_low_pfn)
>+ return;
>
>- if (memblock_is_region_memory(start, end - start))
>- mr[i].page_size_mask |= 1<<PG_LEVEL_2M;
>- }
>- if ((page_size_mask & (1<<PG_LEVEL_1G)) &&
>- !(mr[i].page_size_mask & (1<<PG_LEVEL_1G))) {
>- unsigned long start = round_down(mr[i].start, PUD_SIZE);
>- unsigned long end = round_up(mr[i].end, PUD_SIZE);
>-
>- if (memblock_is_region_memory(start, end - start))
>- mr[i].page_size_mask |= 1<<PG_LEVEL_1G;
>- }
>+ if (memblock_is_region_memory(start, end - start)) {
>+ mr->start = start;
>+ mr->end = end;
> }
> }
>
>-static const char *page_size_string(struct map_range *mr)
>+static bool __meminit mr_try_map(struct map_range *mr, const struct mapinfo *mi)
> {
>- static const char str_1g[] = "1G";
>- static const char str_2m[] = "2M";
>- static const char str_4m[] = "4M";
>- static const char str_4k[] = "4k";
>+ unsigned long len;
>
>- if (mr->page_size_mask & (1<<PG_LEVEL_1G))
>- return str_1g;
>- /*
>- * 32-bit without PAE has a 4M large page size.
>- * PG_LEVEL_2M is misnamed, but we can at least
>- * print out the right size in the string.
>- */
>- if (IS_ENABLED(CONFIG_X86_32) &&
>- !IS_ENABLED(CONFIG_X86_PAE) &&
>- mr->page_size_mask & (1<<PG_LEVEL_2M))
>- return str_4m;
>+ /* Check whether the map size is supported. PAGE_SIZE always is. */
>+ if (mi->mask && !(mr->page_size_mask & mi->mask))
>+ return false;
>
>- if (mr->page_size_mask & (1<<PG_LEVEL_2M))
>- return str_2m;
>+ if (!after_bootmem)
>+ mr_expand(mr, mi->size);
>
>- return str_4k;
>-}
>+ if (!IS_ALIGNED(mr->start, mi->size)) {
>+ /* Limit the range to the next boundary of this size. */
>+ mr->end = min_t(unsigned long, mr->end,
>+ round_up(mr->start, mi->size));
>+ return false;
>+ }
>
>-static int __meminit split_mem_range(struct map_range *mr, int nr_range,
>- unsigned long start,
>- unsigned long end)
>-{
>- unsigned long start_pfn, end_pfn, limit_pfn;
>- unsigned long pfn;
>- int i;
>+ if (!IS_ALIGNED(mr->end, mi->size)) {
>+ /* Try to fit as much as possible */
>+ len = round_down(mr->end - mr->start, mi->size);
>+ if (!len)
>+ return false;
>+ mr->end = mr->start + len;
>+ }
>
>- limit_pfn = PFN_DOWN(end);
>+ /* Store the effective page size mask */
>+ mr->page_size_mask = mi->mask;
>+ return true;
>+}
>
>- /* head if not big page alignment ? */
>- pfn = start_pfn = PFN_DOWN(start);
>-#ifdef CONFIG_X86_32
>+static void __meminit mr_setup(struct map_range *mr, unsigned long start,
>+ unsigned long end)
>+{
> /*
>- * Don't use a large page for the first 2/4MB of memory
>- * because there are often fixed size MTRRs in there
>- * and overlapping MTRRs into large pages can cause
>- * slowdowns.
>+ * On 32bit the first 2/4MB are often covered by fixed size MTRRs.
>+ * Overlapping MTRRs on large pages can cause slowdowns. Force 4k
>+ * mappings.
> */
>- if (pfn == 0)
>- end_pfn = PFN_DOWN(PMD_SIZE);
>- else
>- end_pfn = round_up(pfn, PFN_DOWN(PMD_SIZE));
>-#else /* CONFIG_X86_64 */
>- end_pfn = round_up(pfn, PFN_DOWN(PMD_SIZE));
>-#endif
>- if (end_pfn > limit_pfn)
>- end_pfn = limit_pfn;
>- if (start_pfn < end_pfn) {
>- nr_range = save_mr(mr, nr_range, start_pfn, end_pfn, 0);
>- pfn = end_pfn;
>- }
>-
>- /* big page (2M) range */
>- start_pfn = round_up(pfn, PFN_DOWN(PMD_SIZE));
>-#ifdef CONFIG_X86_32
>- end_pfn = round_down(limit_pfn, PFN_DOWN(PMD_SIZE));
>-#else /* CONFIG_X86_64 */
>- end_pfn = round_up(pfn, PFN_DOWN(PUD_SIZE));
>- if (end_pfn > round_down(limit_pfn, PFN_DOWN(PMD_SIZE)))
>- end_pfn = round_down(limit_pfn, PFN_DOWN(PMD_SIZE));
>-#endif
>-
>- if (start_pfn < end_pfn) {
>- nr_range = save_mr(mr, nr_range, start_pfn, end_pfn,
>- page_size_mask & (1<<PG_LEVEL_2M));
>- pfn = end_pfn;
>+ if (IS_ENABLED(CONFIG_X86_32) && start < PMD_SIZE) {
>+ mr->page_size_mask = 0;
>+ mr->end = min_t(unsigned long, end, PMD_SIZE);
>+ } else {
>+ /* Set the possible mapping sizes and allow full range. */
>+ mr->page_size_mask = page_size_mask;
>+ mr->end = end;
> }
>+ mr->start = start;
>+}
>
>+static int __meminit split_mem_range(struct map_range *mr, unsigned long start,
>+ unsigned long end)
>+{
>+ static const struct mapinfo mapinfos[] = {
> #ifdef CONFIG_X86_64
>- /* big page (1G) range */
>- start_pfn = round_up(pfn, PFN_DOWN(PUD_SIZE));
>- end_pfn = round_down(limit_pfn, PFN_DOWN(PUD_SIZE));
>- if (start_pfn < end_pfn) {
>- nr_range = save_mr(mr, nr_range, start_pfn, end_pfn,
>- page_size_mask &
>- ((1<<PG_LEVEL_2M)|(1<<PG_LEVEL_1G)));
>- pfn = end_pfn;
>- }
>-
>- /* tail is not big page (1G) alignment */
>- start_pfn = round_up(pfn, PFN_DOWN(PMD_SIZE));
>- end_pfn = round_down(limit_pfn, PFN_DOWN(PMD_SIZE));
>- if (start_pfn < end_pfn) {
>- nr_range = save_mr(mr, nr_range, start_pfn, end_pfn,
>- page_size_mask & (1<<PG_LEVEL_2M));
>- pfn = end_pfn;
>- }
>+ { .mask = 1U << PG_LEVEL_1G, .size = PUD_SIZE },
> #endif
>+ { .mask = 1U << PG_LEVEL_2M, .size = PMD_SIZE },
>+ { .mask = 0, .size = PAGE_SIZE },
>+ };
>+ const struct mapinfo *mi;
>+ struct map_range *curmr;
>+ unsigned long addr;
>+ int idx;
>+
>+ for (idx = 0, addr = start, curmr = mr; addr < end; idx++, curmr++) {
>+ BUG_ON(idx == NR_RANGE_MR);
>
>- /* tail is not big page (2M) alignment */
>- start_pfn = pfn;
>- end_pfn = limit_pfn;
>- nr_range = save_mr(mr, nr_range, start_pfn, end_pfn, 0);
>+ mr_setup(curmr, addr, end);
>
>- if (!after_bootmem)
>- adjust_range_page_size_mask(mr, nr_range);
>+ /* Try map sizes top down. PAGE_SIZE will always succeed. */
>+ for (mi = mapinfos; !mr_try_map(curmr, mi); mi++);
>
>- /* try to merge same page size and continuous */
>- for (i = 0; nr_range > 1 && i < nr_range - 1; i++) {
>- unsigned long old_start;
>- if (mr[i].end != mr[i+1].start ||
>- mr[i].page_size_mask != mr[i+1].page_size_mask)
>- continue;
>- /* move it */
>- old_start = mr[i].start;
>- memmove(&mr[i], &mr[i+1],
>- (nr_range - 1 - i) * sizeof(struct map_range));
>- mr[i--].start = old_start;
>- nr_range--;
>+ /* Get the start address for the next range */
>+ addr = curmr->end;
> }
>-
>- for (i = 0; i < nr_range; i++)
>- pr_debug(" [mem %#010lx-%#010lx] page %s\n",
>- mr[i].start, mr[i].end - 1,
>- page_size_string(&mr[i]));
>-
>- return nr_range;
>+ mr_print(mr, idx);
>+ return idx;
> }
>
> struct range pfn_mapped[E820_MAX_ENTRIES];
>@@ -474,7 +419,7 @@ unsigned long __ref init_memory_mapping(
> start, end - 1);
>
> memset(mr, 0, sizeof(mr));
>- nr_range = split_mem_range(mr, 0, start, end);
>+ nr_range = split_mem_range(mr, start, end);
>
> for (i = 0; i < nr_range; i++)
> ret = kernel_physical_mapping_init(mr[i].start, mr[i].end,

--
Wei Yang
Help you, Help me

2019-03-28 00:17:08

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH 4/6] x86, mm: make split_mem_range() more easy to read

Wei,

On Wed, 27 Mar 2019, Wei Yang wrote:
> On Sun, Mar 24, 2019 at 03:29:04PM +0100, Thomas Gleixner wrote:
> >Note, it's a combo of changes (including your patch 1/6) and needs to be
> >split up. It would be nice if you have time to split it up into separate
> >patches, add proper changelogs and test the heck out of it on both 32 and
> >64 bit. If you don't have time, please let me know.
> >
>
> Thanks for your suggestions :-)
>
> Just get my head up, will try to understand the code and test on both
> arch.
>
> BTW, do you have some suggestions in the test? Currently I just use
> bootup test. Basicly I think this is fine to cover the cases. Maybe you
> would have some better idea.

This is about bootup in the first place. After that memory hotplug which
you can emulate with qemu/kvm IIRC.

The important part about testing is to have systems which expose a wide
variety memory layouts.

Thanks,

tglx


2019-03-28 00:26:52

by Wei Yang

[permalink] [raw]
Subject: Re: [PATCH 4/6] x86, mm: make split_mem_range() more easy to read

On Thu, Mar 28, 2019 at 01:16:08AM +0100, Thomas Gleixner wrote:
>Wei,
>
>On Wed, 27 Mar 2019, Wei Yang wrote:
>> On Sun, Mar 24, 2019 at 03:29:04PM +0100, Thomas Gleixner wrote:
>> >Note, it's a combo of changes (including your patch 1/6) and needs to be
>> >split up. It would be nice if you have time to split it up into separate
>> >patches, add proper changelogs and test the heck out of it on both 32 and
>> >64 bit. If you don't have time, please let me know.
>> >
>>
>> Thanks for your suggestions :-)
>>
>> Just get my head up, will try to understand the code and test on both
>> arch.
>>
>> BTW, do you have some suggestions in the test? Currently I just use
>> bootup test. Basicly I think this is fine to cover the cases. Maybe you
>> would have some better idea.
>
>This is about bootup in the first place. After that memory hotplug which
>you can emulate with qemu/kvm IIRC.
>

Ok, this is not difficult.

>The important part about testing is to have systems which expose a wide
>variety memory layouts.
>

May be I can add some test with nvdimm.

>Thanks,
>
> tglx

--
Wei Yang
Help you, Help me

2019-03-28 03:37:29

by Wei Yang

[permalink] [raw]
Subject: Re: [PATCH 4/6] x86, mm: make split_mem_range() more easy to read

On Sun, Mar 24, 2019 at 03:29:04PM +0100, Thomas Gleixner wrote:
>Wei,
>
>On Tue, 12 Feb 2019, Wei Yang wrote:
>>
>> This patch changes the implementation from the first perception to the
>> second to reduce one different handling on end_pfn. After doing so, the
>> code is easier to read.
>
>It's maybe slightly easier to read, but it's still completely unreadable
>garbage.
>
> Not your fault, it was garbage before.
>
>But refining garbage still results in garbage. Just the smell is slightly
>different.
>
>Why?
>
> 1) Doing all the calculations PFN based is just a pointless
> indirection. Just look at all the rounding magic and back and forth
> conversions.
>
> All of that can be done purely address/size based which makes the code
> truly readable.
>
> 2) The 5(3) sections are more or less copied code with a few changes of
> constants, except for the first section (A) which has an extra quirk
> for 32bit. Plus all the 64bit vs. 32bit #ifdeffery which is not making
> it any better.
>
> This copied mess can be avoided by using helper functions and proper
> loops.
>
> 3) During the bootmem phase the code tries to preserve large mappings
> _AFTER_ splitting them up and then it tries to merge the resulting
> overlaps.
>
> This is completely backwards because the expansion of the range can be
> tried right away when then mapping of a large page is attempted. Surely
> not with the current mess, but with a proper loop based approach it can
> be done nicely.
>
> Btw, there is a bug in that expansion code which could result in
> undoing the enforced 4K mapping of the lower 2M/4M range on 32bit. It's
> probably not an issue in practice because the low range is usually not
> contiguous. But it works by chance not by design.

Hi, Thomas

I want to confirm with you, this bug is in adjust_range_page_size_mask(),
right?

--
Wei Yang
Help you, Help me

2019-03-28 07:21:50

by Wei Yang

[permalink] [raw]
Subject: Re: [PATCH 4/6] x86, mm: make split_mem_range() more easy to read

On Sun, Mar 24, 2019 at 03:29:04PM +0100, Thomas Gleixner wrote:
>
>+static int __meminit split_mem_range(struct map_range *mr, unsigned long start,
>+ unsigned long end)
>+{
>+ static const struct mapinfo mapinfos[] = {
> #ifdef CONFIG_X86_64
>+ { .mask = 1U << PG_LEVEL_1G, .size = PUD_SIZE },
> #endif
>+ { .mask = 1U << PG_LEVEL_2M, .size = PMD_SIZE },
>+ { .mask = 0, .size = PAGE_SIZE },
>+ };
>+ const struct mapinfo *mi;
>+ struct map_range *curmr;
>+ unsigned long addr;
>+ int idx;
>+
>+ for (idx = 0, addr = start, curmr = mr; addr < end; idx++, curmr++) {
>+ BUG_ON(idx == NR_RANGE_MR);
>+ mr_setup(curmr, addr, end);
>
>+ /* Try map sizes top down. PAGE_SIZE will always succeed. */
>+ for (mi = mapinfos; !mr_try_map(curmr, mi); mi++);
>
>+ /* Get the start address for the next range */
>+ addr = curmr->end;
> }

I re-arrange the code to make split_mem_range() here easy to read.

My question is to the for loop.

For example, we have a range

+--+---------+-----------------------+
^ 128M 1G 2G
128M - 4K

If my understanding is correct, the original behavior will split this into
three ranges:

4K size: [128M - 4K, 128M]
2M size: [128M, 1G]
1G size: [1G, 2G]

While after your change, it will split this into two ranges:

?? size: [128M - 4K, 1G]
2M size: [1G, 2G]

The question mark here is because you leave the page_size_mask unchanged in
this case.

Is my understanding correct? Or I missed something?

--
Wei Yang
Help you, Help me

2019-03-28 08:05:31

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH 4/6] x86, mm: make split_mem_range() more easy to read

Wei,

On Thu, 28 Mar 2019, Wei Yang wrote:

please trim your replies. It's annoying if one has to search the content in
the middle of a large useless quote.

> On Sun, Mar 24, 2019 at 03:29:04PM +0100, Thomas Gleixner wrote:
> >Wei,
> >-static int __meminit split_mem_range(struct map_range *mr, int nr_range,
> >- unsigned long start,
> >- unsigned long end)
> >-{
> >- unsigned long start_pfn, end_pfn, limit_pfn;
> >- unsigned long pfn;
> >- int i;
> >+ if (!IS_ALIGNED(mr->end, mi->size)) {
> >+ /* Try to fit as much as possible */
> >+ len = round_down(mr->end - mr->start, mi->size);
> >+ if (!len)
> >+ return false;
> >+ mr->end = mr->start + len;
> >+ }
> >
> >- limit_pfn = PFN_DOWN(end);
> >+ /* Store the effective page size mask */
> >+ mr->page_size_mask = mi->mask;
>
> I don't get the point here. Why store the effective page size mask just for
> the "middle" range.
>
> The original behavior will set the "head" and "tail" range with a lower level
> page size mask.

What has this to do with the middle range? Nothing. This is the path where
the current map level (1g, 2m, 4k) is applied from mr->start to
mr->end. That's the effective mapping of this map_range entry.

Thanks,

tglx

2019-03-28 08:09:37

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH 4/6] x86, mm: make split_mem_range() more easy to read

On Thu, 28 Mar 2019, Wei Yang wrote:
> On Sun, Mar 24, 2019 at 03:29:04PM +0100, Thomas Gleixner wrote:
> My question is to the for loop.
>
> For example, we have a range
>
> +--+---------+-----------------------+
> ^ 128M 1G 2G
> 128M - 4K
>
> If my understanding is correct, the original behavior will split this into
> three ranges:
>
> 4K size: [128M - 4K, 128M]
> 2M size: [128M, 1G]
> 1G size: [1G, 2G]
>
> While after your change, it will split this into two ranges:
>
> ?? size: [128M - 4K, 1G]
> 2M size: [1G, 2G]
>
> The question mark here is because you leave the page_size_mask unchanged in
> this case.
>
> Is my understanding correct? Or I missed something?

Yes. You misread mr_try_map().

Thanks,

tglx

2019-03-29 03:39:21

by Wei Yang

[permalink] [raw]
Subject: Re: [PATCH 4/6] x86, mm: make split_mem_range() more easy to read

On Thu, Mar 28, 2019 at 09:08:43AM +0100, Thomas Gleixner wrote:
>On Thu, 28 Mar 2019, Wei Yang wrote:
>> On Sun, Mar 24, 2019 at 03:29:04PM +0100, Thomas Gleixner wrote:
>> My question is to the for loop.
>>
>> For example, we have a range
>>
>> +--+---------+-----------------------+
>> ^ 128M 1G 2G
>> 128M - 4K
>>
>
>Yes. You misread mr_try_map().

You are right, I misunderstand the functionality of mr_try_map().

I went through the code and this looks nice to me. I have to say you are
genius.

Thanks for your code and I really learned a lot from this.

BTW, for the test cases, I thinks mem-hotplug may be introduce layout
diversity. Since mem-hotplug's range has to be 128M aligned.

>
>Thanks,
>
> tglx

--
Wei Yang
Help you, Help me

2019-04-12 03:10:47

by Wei Yang

[permalink] [raw]
Subject: Re: [PATCH 4/6] x86, mm: make split_mem_range() more easy to read

On Sun, Mar 24, 2019 at 03:29:04PM +0100, Thomas Gleixner wrote:

>Find a mostly untested patch which implements this below. I just booted it
>in a 64bit guest and it did not explode.
>
>It removes 55 lines of code instead of adding 35 and reduces the binary
>size by 408 bytes on 64bit and 128 bytes on 32bit.
>
>Note, it's a combo of changes (including your patch 1/6) and needs to be
>split up. It would be nice if you have time to split it up into separate
>patches, add proper changelogs and test the heck out of it on both 32 and
>64 bit. If you don't have time, please let me know.

I tried to test mem hotplug with this applied.

On x86_64, this looks good. It split ranges well. While I found some problem
on testing mem hotplug on x86_32.

I start up a guest memory and trying to plug 1G memory.

The original memory split range looks good:

[ 0.004260] ywtest: [mem 0x00000000-0x00100000]
[ 0.004261] ywtest: [mem 0x00000000-0x000fffff] page size 4k
[ 0.004268] ywtest: [mem 0x37200000-0x37400000]
[ 0.004269] ywtest: [mem 0x37200000-0x373fffff] page size 2M
[ 0.004271] ywtest: [mem 0x20000000-0x37200000]
[ 0.004272] ywtest: [mem 0x20000000-0x371fffff] page size 2M
[ 0.004280] ywtest: [mem 0x00100000-0x20000000]
[ 0.004281] ywtest: [mem 0x00100000-0x001fffff] page size 4k
[ 0.004281] ywtest: [mem 0x00200000-0x1fffffff] page size 2M
[ 0.004292] ywtest: [mem 0x37400000-0x375fe000]
[ 0.004293] ywtest: [mem 0x37400000-0x375fdfff] page size 4k

While I can't online the new added memory device. And the new memory device's
range is out of 4G, which is a little bit strange.

>
>Thanks,
>
> tglx
>
--
Wei Yang
Help you, Help me