2012-08-29 19:05:08

by Jacob Shin

Subject: [PATCH V5 0/6] x86: Create direct mappings for E820_RAM only

This is the 5th revision of the patchset, which aims to create direct
mappings only for E820_RAM memory ranges. The problem description and
justification can be found in patch 4/6.

Previous discussion history can be found in the following threads:

* https://lkml.org/lkml/2012/8/24/474
* https://lkml.org/lkml/2012/8/22/680
* https://lkml.org/lkml/2012/8/13/512
* https://lkml.org/lkml/2012/8/9/536
* https://lkml.org/lkml/2011/10/20/323

Jacob Shin (4):
x86/mm: find_early_table_space based on memory ranges that are being
mapped
x86: Only direct map addresses that are marked as E820_RAM
x86: Fixup code testing if a pfn is direct mapped
x86: if kernel .text .data .bss are not marked as E820_RAM, complain
and fix

Yinghai Lu (2):
x86, mm: Add page_size_mask()
x86, mm: Split out split_mem_range

arch/x86/include/asm/page_types.h | 9 +++
arch/x86/include/asm/pgtable.h | 1 +
arch/x86/kernel/cpu/amd.c | 6 +-
arch/x86/kernel/setup.c | 115 ++++++++++++++++++++++----
arch/x86/mm/init.c | 162 ++++++++++++++++++++-----------------
arch/x86/mm/init_64.c | 6 +-
arch/x86/platform/efi/efi.c | 8 +-
7 files changed, 207 insertions(+), 100 deletions(-)

--
1.7.9.5


2012-08-29 19:04:37

by Jacob Shin

Subject: [PATCH 2/6] x86, mm: Split out split_mem_range

From: Yinghai Lu <[email protected]>

Split split_mem_range() out from init_memory_mapping() to make
init_memory_mapping() more readable.
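
After the split, the skeleton of init_memory_mapping() is roughly the
following (a condensed sketch of the result for orientation, not an
additional change; the real hunks are below):

	unsigned long __init_refok init_memory_mapping(unsigned long start,
						       unsigned long end)
	{
		struct map_range mr[NR_RANGE_MR];
		int nr_range;

		memset(mr, 0, sizeof(mr));
		nr_range = split_mem_range(mr, 0, start, end);

		/* ... find table space, then map each mr[i] ... */
	}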

Suggested-by: Ingo Molnar <[email protected]>
Signed-off-by: Yinghai Lu <[email protected]>
---
arch/x86/mm/init.c | 42 ++++++++++++++++++++++++++----------------
1 file changed, 26 insertions(+), 16 deletions(-)

diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
index 838e9bc..41e615b 100644
--- a/arch/x86/mm/init.c
+++ b/arch/x86/mm/init.c
@@ -143,25 +143,13 @@ static int __meminit save_mr(struct map_range *mr, int nr_range,
return nr_range;
}

-/*
- * Setup the direct mapping of the physical memory at PAGE_OFFSET.
- * This runs before bootmem is initialized and gets pages directly from
- * the physical memory. To access them they are temporarily mapped.
- */
-unsigned long __init_refok init_memory_mapping(unsigned long start,
- unsigned long end)
+static int __meminit split_mem_range(struct map_range *mr, int nr_range,
+ unsigned long start,
+ unsigned long end)
{
unsigned long start_pfn, end_pfn;
- unsigned long ret = 0;
unsigned long pos;
- struct map_range mr[NR_RANGE_MR];
- int nr_range, i;
-
- printk(KERN_INFO "init_memory_mapping: [mem %#010lx-%#010lx]\n",
- start, end - 1);
-
- memset(mr, 0, sizeof(mr));
- nr_range = 0;
+ int i;

/* head if not big page alignment ? */
start_pfn = start >> PAGE_SHIFT;
@@ -255,6 +243,28 @@ unsigned long __init_refok init_memory_mapping(unsigned long start,
(mr[i].page_size_mask & (1<<PG_LEVEL_1G))?"1G":(
(mr[i].page_size_mask & (1<<PG_LEVEL_2M))?"2M":"4k"));

+ return nr_range;
+}
+
+/*
+ * Setup the direct mapping of the physical memory at PAGE_OFFSET.
+ * This runs before bootmem is initialized and gets pages directly from
+ * the physical memory. To access them they are temporarily mapped.
+ */
+unsigned long __init_refok init_memory_mapping(unsigned long start,
+ unsigned long end)
+{
+ struct map_range mr[NR_RANGE_MR];
+ unsigned long ret = 0;
+ int nr_range, i;
+
+ printk(KERN_INFO "init_memory_mapping: [mem %#010lx-%#010lx]\n",
+ start, end - 1);
+
+ memset(mr, 0, sizeof(mr));
+ nr_range = 0;
+ nr_range = split_mem_range(mr, nr_range, start, end);
+
/*
* Find space for the kernel direct mapping tables.
*
--
1.7.9.5

2012-08-29 19:04:35

by Jacob Shin

Subject: [PATCH 1/6] x86, mm: Add page_size_mask()

From: Yinghai Lu <[email protected]>

Detect whether 1G or 2M pages should be used and store the result in
page_size_mask.

Only probe once.

Suggested-by: Ingo Molnar <[email protected]>
Signed-off-by: Yinghai Lu <[email protected]>
---
arch/x86/include/asm/pgtable.h | 1 +
arch/x86/kernel/setup.c | 1 +
arch/x86/mm/init.c | 66 +++++++++++++++++++---------------------
3 files changed, 33 insertions(+), 35 deletions(-)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 49afb3f..e47e4db 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -597,6 +597,7 @@ static inline int pgd_none(pgd_t pgd)
#ifndef __ASSEMBLY__

extern int direct_gbpages;
+void probe_page_size_mask(void);

/* local pte updates need not use xchg for locking */
static inline pte_t native_local_ptep_get_and_clear(pte_t *ptep)
diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index f4b9b80..d6e8c03 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -912,6 +912,7 @@ void __init setup_arch(char **cmdline_p)
setup_real_mode();

init_gbpages();
+ probe_page_size_mask();

/* max_pfn_mapped is updated here */
max_low_pfn_mapped = init_memory_mapping(0, max_low_pfn<<PAGE_SHIFT);
diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
index e0e6990..838e9bc 100644
--- a/arch/x86/mm/init.c
+++ b/arch/x86/mm/init.c
@@ -35,8 +35,10 @@ struct map_range {
unsigned page_size_mask;
};

-static void __init find_early_table_space(struct map_range *mr, unsigned long end,
- int use_pse, int use_gbpages)
+static int page_size_mask;
+
+static void __init find_early_table_space(struct map_range *mr,
+ unsigned long end)
{
unsigned long puds, pmds, ptes, tables, start = 0, good_end = end;
phys_addr_t base;
@@ -44,7 +46,7 @@ static void __init find_early_table_space(struct map_range *mr, unsigned long en
puds = (end + PUD_SIZE - 1) >> PUD_SHIFT;
tables = roundup(puds * sizeof(pud_t), PAGE_SIZE);

- if (use_gbpages) {
+ if (page_size_mask & (1 << PG_LEVEL_1G)) {
unsigned long extra;

extra = end - ((end>>PUD_SHIFT) << PUD_SHIFT);
@@ -54,7 +56,7 @@ static void __init find_early_table_space(struct map_range *mr, unsigned long en

tables += roundup(pmds * sizeof(pmd_t), PAGE_SIZE);

- if (use_pse) {
+ if (page_size_mask & (1 << PG_LEVEL_2M)) {
unsigned long extra;

extra = end - ((end>>PMD_SHIFT) << PMD_SHIFT);
@@ -90,6 +92,30 @@ static void __init find_early_table_space(struct map_range *mr, unsigned long en
(pgt_buf_top << PAGE_SHIFT) - 1);
}

+void probe_page_size_mask(void)
+{
+#if !defined(CONFIG_DEBUG_PAGEALLOC) && !defined(CONFIG_KMEMCHECK)
+ /*
+ * For CONFIG_DEBUG_PAGEALLOC, identity mapping will use small pages.
+ * This will simplify cpa(), which otherwise needs to support splitting
+ * large pages into small in interrupt context, etc.
+ */
+ if (direct_gbpages)
+ page_size_mask |= 1 << PG_LEVEL_1G;
+ if (cpu_has_pse)
+ page_size_mask |= 1 << PG_LEVEL_2M;
+#endif
+
+ /* Enable PSE if available */
+ if (cpu_has_pse)
+ set_in_cr4(X86_CR4_PSE);
+
+ /* Enable PGE if available */
+ if (cpu_has_pge) {
+ set_in_cr4(X86_CR4_PGE);
+ __supported_pte_mask |= _PAGE_GLOBAL;
+ }
+}
void __init native_pagetable_reserve(u64 start, u64 end)
{
memblock_reserve(start, end - start);
@@ -125,45 +151,15 @@ static int __meminit save_mr(struct map_range *mr, int nr_range,
unsigned long __init_refok init_memory_mapping(unsigned long start,
unsigned long end)
{
- unsigned long page_size_mask = 0;
unsigned long start_pfn, end_pfn;
unsigned long ret = 0;
unsigned long pos;
-
struct map_range mr[NR_RANGE_MR];
int nr_range, i;
- int use_pse, use_gbpages;

printk(KERN_INFO "init_memory_mapping: [mem %#010lx-%#010lx]\n",
start, end - 1);

-#if defined(CONFIG_DEBUG_PAGEALLOC) || defined(CONFIG_KMEMCHECK)
- /*
- * For CONFIG_DEBUG_PAGEALLOC, identity mapping will use small pages.
- * This will simplify cpa(), which otherwise needs to support splitting
- * large pages into small in interrupt context, etc.
- */
- use_pse = use_gbpages = 0;
-#else
- use_pse = cpu_has_pse;
- use_gbpages = direct_gbpages;
-#endif
-
- /* Enable PSE if available */
- if (cpu_has_pse)
- set_in_cr4(X86_CR4_PSE);
-
- /* Enable PGE if available */
- if (cpu_has_pge) {
- set_in_cr4(X86_CR4_PGE);
- __supported_pte_mask |= _PAGE_GLOBAL;
- }
-
- if (use_gbpages)
- page_size_mask |= 1 << PG_LEVEL_1G;
- if (use_pse)
- page_size_mask |= 1 << PG_LEVEL_2M;
-
memset(mr, 0, sizeof(mr));
nr_range = 0;

@@ -267,7 +263,7 @@ unsigned long __init_refok init_memory_mapping(unsigned long start,
* nodes are discovered.
*/
if (!after_bootmem)
- find_early_table_space(&mr[0], end, use_pse, use_gbpages);
+ find_early_table_space(&mr[0], end);

for (i = 0; i < nr_range; i++)
ret = kernel_physical_mapping_init(mr[i].start, mr[i].end,
--
1.7.9.5

2012-08-29 19:04:34

by Jacob Shin

Subject: [PATCH 3/6] x86/mm: find_early_table_space based on memory ranges that are being mapped

The current logic finds enough space for the direct mapping page tables
from 0 to end. Instead, we only need to find enough space to cover
mr[0].start to mr[nr_range - 1].end -- the range that is actually being
mapped by init_memory_mapping().
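
As a rough illustration of the per-range accounting (a hypothetical,
self-contained example using the usual x86_64 shift values; the numbers
are not from the patch):

	/* one 1G-aligned range of 892GB, mapped with 1G pages */
	unsigned long range = 892UL << 30;
	unsigned long puds = (range + PUD_SIZE - 1) >> PUD_SHIFT;	/* 892 */
	unsigned long extra = range - ((range >> PUD_SHIFT) << PUD_SHIFT);
	unsigned long pmds = (extra + PMD_SIZE - 1) >> PMD_SHIFT;	/* 0 */
	unsigned long tables = roundup(puds * sizeof(pud_t), PAGE_SIZE);
	/* tables == 8KB; sizing against a flat [0, end) would also have
	   charged for every hole below this range */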

This patch also reportedly fixes the suspend/resume issue reported in:

https://lkml.org/lkml/2012/8/11/83

Signed-off-by: Jacob Shin <[email protected]>
---
arch/x86/mm/init.c | 62 +++++++++++++++++++++++++++++-----------------------
1 file changed, 35 insertions(+), 27 deletions(-)

diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
index 41e615b..916b15b 100644
--- a/arch/x86/mm/init.c
+++ b/arch/x86/mm/init.c
@@ -37,40 +37,48 @@ struct map_range {

static int page_size_mask;

-static void __init find_early_table_space(struct map_range *mr,
- unsigned long end)
+/*
+ * First calculate space needed for kernel direct mapping page tables to cover
+ * mr[0].start to mr[nr_range - 1].end, while accounting for possible 2M and 1GB
+ * pages. Then find enough contiguous space for those page tables.
+ */
+static void __init find_early_table_space(struct map_range *mr, int nr_range)
{
- unsigned long puds, pmds, ptes, tables, start = 0, good_end = end;
+ int i;
+ unsigned long puds = 0, pmds = 0, ptes = 0, tables;
+ unsigned long start = 0, good_end;
phys_addr_t base;

- puds = (end + PUD_SIZE - 1) >> PUD_SHIFT;
- tables = roundup(puds * sizeof(pud_t), PAGE_SIZE);
-
- if (page_size_mask & (1 << PG_LEVEL_1G)) {
- unsigned long extra;
+ for (i = 0; i < nr_range; i++) {
+ unsigned long range, extra;

- extra = end - ((end>>PUD_SHIFT) << PUD_SHIFT);
- pmds = (extra + PMD_SIZE - 1) >> PMD_SHIFT;
- } else
- pmds = (end + PMD_SIZE - 1) >> PMD_SHIFT;
+ range = mr[i].end - mr[i].start;
+ puds += (range + PUD_SIZE - 1) >> PUD_SHIFT;

- tables += roundup(pmds * sizeof(pmd_t), PAGE_SIZE);
+ if (mr[i].page_size_mask & (1 << PG_LEVEL_1G)) {
+ extra = range - ((range >> PUD_SHIFT) << PUD_SHIFT);
+ pmds += (extra + PMD_SIZE - 1) >> PMD_SHIFT;
+ } else {
+ pmds += (range + PMD_SIZE - 1) >> PMD_SHIFT;
+ }

- if (page_size_mask & (1 << PG_LEVEL_2M)) {
- unsigned long extra;
-
- extra = end - ((end>>PMD_SHIFT) << PMD_SHIFT);
+ if (mr[i].page_size_mask & (1 << PG_LEVEL_2M)) {
+ extra = range - ((range >> PMD_SHIFT) << PMD_SHIFT);
#ifdef CONFIG_X86_32
- extra += PMD_SIZE;
+ extra += PMD_SIZE;
#endif
- /* The first 2/4M doesn't use large pages. */
- if (mr->start < PMD_SIZE)
- extra += mr->end - mr->start;
-
- ptes = (extra + PAGE_SIZE - 1) >> PAGE_SHIFT;
- } else
- ptes = (end + PAGE_SIZE - 1) >> PAGE_SHIFT;
+ /* The first 2/4M doesn't use large pages. */
+ if (mr[i].start < PMD_SIZE)
+ extra += range;
+
+ ptes += (extra + PAGE_SIZE - 1) >> PAGE_SHIFT;
+ } else {
+ ptes += (range + PAGE_SIZE - 1) >> PAGE_SHIFT;
+ }
+ }

+ tables = roundup(puds * sizeof(pud_t), PAGE_SIZE);
+ tables += roundup(pmds * sizeof(pmd_t), PAGE_SIZE);
tables += roundup(ptes * sizeof(pte_t), PAGE_SIZE);

#ifdef CONFIG_X86_32
@@ -88,7 +96,7 @@ static void __init find_early_table_space(struct map_range *mr,
pgt_buf_top = pgt_buf_start + (tables >> PAGE_SHIFT);

printk(KERN_DEBUG "kernel direct mapping tables up to %#lx @ [mem %#010lx-%#010lx]\n",
- end - 1, pgt_buf_start << PAGE_SHIFT,
+ mr[nr_range - 1].end - 1, pgt_buf_start << PAGE_SHIFT,
(pgt_buf_top << PAGE_SHIFT) - 1);
}

@@ -273,7 +281,7 @@ unsigned long __init_refok init_memory_mapping(unsigned long start,
* nodes are discovered.
*/
if (!after_bootmem)
- find_early_table_space(&mr[0], end);
+ find_early_table_space(mr, nr_range);

for (i = 0; i < nr_range; i++)
ret = kernel_physical_mapping_init(mr[i].start, mr[i].end,
--
1.7.9.5

2012-08-29 19:05:10

by Jacob Shin

Subject: [PATCH 4/6] x86: Only direct map addresses that are marked as E820_RAM

Currently direct mappings are created for [ 0 to max_low_pfn<<PAGE_SHIFT )
and [ 4GB to max_pfn<<PAGE_SHIFT ), which may include regions that are not
backed by actual DRAM. This is fine for holes under 4GB, which fixed and
variable range MTRRs cover as UC. However, we run into trouble at higher
memory addresses, which cannot be covered by MTRRs.

Our system with 1TB of RAM has an e820 that looks like this:

BIOS-e820: [mem 0x0000000000000000-0x00000000000983ff] usable
BIOS-e820: [mem 0x0000000000098400-0x000000000009ffff] reserved
BIOS-e820: [mem 0x00000000000d0000-0x00000000000fffff] reserved
BIOS-e820: [mem 0x0000000000100000-0x00000000c7ebffff] usable
BIOS-e820: [mem 0x00000000c7ec0000-0x00000000c7ed7fff] ACPI data
BIOS-e820: [mem 0x00000000c7ed8000-0x00000000c7ed9fff] ACPI NVS
BIOS-e820: [mem 0x00000000c7eda000-0x00000000c7ffffff] reserved
BIOS-e820: [mem 0x00000000fec00000-0x00000000fec0ffff] reserved
BIOS-e820: [mem 0x00000000fee00000-0x00000000fee00fff] reserved
BIOS-e820: [mem 0x00000000fff00000-0x00000000ffffffff] reserved
BIOS-e820: [mem 0x0000000100000000-0x000000e037ffffff] usable
BIOS-e820: [mem 0x000000e038000000-0x000000fcffffffff] reserved
BIOS-e820: [mem 0x0000010000000000-0x0000011ffeffffff] usable

and so direct mappings are created for the huge memory hole between
0x000000e038000000 and 0x0000010000000000. Even though the kernel never
generates memory accesses in that region, since the page tables incorrectly
mark it as WB, our (AMD) processor ends up causing an MCE while doing some
memory bookkeeping/optimizations around that area.

This patch iterates through the e820 map, direct maps only the ranges that
are marked as E820_RAM, and keeps track of those pfn ranges. Depending on
the alignment of the E820 ranges, this may result in parts of the mapping
using smaller pages (i.e. 4K instead of 2M or 1G).
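
With this logic, the example e820 above would (illustratively) produce
init_memory_mapping() calls for just the usable ranges:

	init_memory_mapping(0, ISA_END_ADDRESS);	   /* ISA, always */
	init_memory_mapping(0x100000, 0xc7ec0000);	   /* usable < 4GB */
	init_memory_mapping(0x100000000, 0xe038000000);	   /* usable > 4GB */
	init_memory_mapping(0x10000000000, 0x11fff000000); /* usable > 4GB */

leaving the hole between 0xe038000000 and 0x10000000000 with no WB
mappings at all.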

Signed-off-by: Jacob Shin <[email protected]>
---
arch/x86/include/asm/page_types.h | 9 ++++
arch/x86/kernel/setup.c | 100 +++++++++++++++++++++++++++++++------
arch/x86/mm/init.c | 2 +
arch/x86/mm/init_64.c | 6 +--
4 files changed, 99 insertions(+), 18 deletions(-)

diff --git a/arch/x86/include/asm/page_types.h b/arch/x86/include/asm/page_types.h
index e21fdd1..409047a 100644
--- a/arch/x86/include/asm/page_types.h
+++ b/arch/x86/include/asm/page_types.h
@@ -3,6 +3,7 @@

#include <linux/const.h>
#include <linux/types.h>
+#include <asm/e820.h>

/* PAGE_SHIFT determines the page size */
#define PAGE_SHIFT 12
@@ -40,12 +41,20 @@
#endif /* CONFIG_X86_64 */

#ifndef __ASSEMBLY__
+#include <linux/range.h>

extern int devmem_is_allowed(unsigned long pagenr);

extern unsigned long max_low_pfn_mapped;
extern unsigned long max_pfn_mapped;

+extern struct range pfn_mapped[E820_X_MAX];
+extern int nr_pfn_mapped;
+
+extern void add_pfn_range_mapped(unsigned long start_pfn, unsigned long end_pfn);
+extern bool pfn_range_is_mapped(unsigned long start_pfn, unsigned long end_pfn);
+extern bool pfn_is_mapped(unsigned long pfn);
+
static inline phys_addr_t get_max_mapped(void)
{
return (phys_addr_t)max_pfn_mapped << PAGE_SHIFT;
diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index d6e8c03..a2e392e 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -115,13 +115,47 @@
#include <asm/prom.h>

/*
- * end_pfn only includes RAM, while max_pfn_mapped includes all e820 entries.
- * The direct mapping extends to max_pfn_mapped, so that we can directly access
- * apertures, ACPI and other tables without having to play with fixmaps.
+ * max_low_pfn_mapped: highest direct mapped pfn under 4GB
+ * max_pfn_mapped: highest direct mapped pfn over 4GB
+ *
+ * The direct mapping only covers E820_RAM regions, so the ranges and gaps are
+ * represented by pfn_mapped
*/
unsigned long max_low_pfn_mapped;
unsigned long max_pfn_mapped;

+struct range pfn_mapped[E820_X_MAX];
+int nr_pfn_mapped;
+
+void add_pfn_range_mapped(unsigned long start_pfn, unsigned long end_pfn)
+{
+ nr_pfn_mapped = add_range_with_merge(pfn_mapped, E820_X_MAX,
+ nr_pfn_mapped, start_pfn, end_pfn);
+ nr_pfn_mapped = clean_sort_range(pfn_mapped, E820_X_MAX);
+
+ max_pfn_mapped = max(max_pfn_mapped, end_pfn);
+
+ if (end_pfn <= (1UL << (32 - PAGE_SHIFT)))
+ max_low_pfn_mapped = max(max_low_pfn_mapped, end_pfn);
+}
+
+bool pfn_range_is_mapped(unsigned long start_pfn, unsigned long end_pfn)
+{
+ int i;
+
+ for (i = 0; i < nr_pfn_mapped; i++)
+ if ((start_pfn >= pfn_mapped[i].start) &&
+ (end_pfn <= pfn_mapped[i].end))
+ return true;
+
+ return false;
+}
+
+bool pfn_is_mapped(unsigned long pfn)
+{
+ return pfn_range_is_mapped(pfn, pfn + 1);
+}
+
#ifdef CONFIG_DMI
RESERVE_BRK(dmi_alloc, 65536);
#endif
@@ -296,6 +330,54 @@ static void __init cleanup_highmap(void)
}
#endif

+/*
+ * Iterate through E820 memory map and create direct mappings for only E820_RAM
+ * regions. We cannot simply create direct mappings for all pfns from
+ * [0 to max_low_pfn) and [4GB to max_pfn) because of possible memory holes in
+ * high addresses that cannot be marked as UC by fixed/variable range MTRRs.
+ * Depending on the alignment of E820 ranges, this may possibly result in using
+ * smaller size (i.e. 4K instead of 2M or 1G) page tables.
+ */
+static void __init init_direct_mapping(void)
+{
+ int i;
+
+ /* the ISA range is always mapped regardless of memory holes */
+ init_memory_mapping(0, ISA_END_ADDRESS);
+
+ for (i = 0; i < e820.nr_map; i++) {
+ struct e820entry *ei = &e820.map[i];
+ u64 start = ei->addr;
+ u64 end = ei->addr + ei->size;
+
+ /* we only map E820_RAM */
+ if (ei->type != E820_RAM)
+ continue;
+
+ if (end <= ISA_END_ADDRESS)
+ continue;
+
+ if (start < ISA_END_ADDRESS)
+ start = ISA_END_ADDRESS;
+#ifdef CONFIG_X86_32
+ /* on 32 bit, we only map up to max_low_pfn */
+ if ((start >> PAGE_SHIFT) >= max_low_pfn)
+ continue;
+
+ if ((end >> PAGE_SHIFT) > max_low_pfn)
+ end = max_low_pfn << PAGE_SHIFT;
+#endif
+ init_memory_mapping(start, end);
+ }
+
+#ifdef CONFIG_X86_64
+ if (max_pfn > max_low_pfn) {
+ /* can we preseve max_low_pfn ?*/
+ max_low_pfn = max_pfn;
+ }
+#endif
+}
+
static void __init reserve_brk(void)
{
if (_brk_end > _brk_start)
@@ -914,18 +996,8 @@ void __init setup_arch(char **cmdline_p)
init_gbpages();
probe_page_size_mask();

- /* max_pfn_mapped is updated here */
- max_low_pfn_mapped = init_memory_mapping(0, max_low_pfn<<PAGE_SHIFT);
- max_pfn_mapped = max_low_pfn_mapped;
+ init_direct_mapping();

-#ifdef CONFIG_X86_64
- if (max_pfn > max_low_pfn) {
- max_pfn_mapped = init_memory_mapping(1UL<<32,
- max_pfn<<PAGE_SHIFT);
- /* can we preseve max_low_pfn ?*/
- max_low_pfn = max_pfn;
- }
-#endif
memblock.current_limit = get_max_mapped();
dma_contiguous_reserve(0);

diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
index 916b15b..4ec6968 100644
--- a/arch/x86/mm/init.c
+++ b/arch/x86/mm/init.c
@@ -317,6 +317,8 @@ unsigned long __init_refok init_memory_mapping(unsigned long start,
if (!after_bootmem)
early_memtest(start, end);

+ add_pfn_range_mapped(start >> PAGE_SHIFT, ret >> PAGE_SHIFT);
+
return ret >> PAGE_SHIFT;
}

diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index 2b6b4a3..ab558eb 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -657,13 +657,11 @@ int arch_add_memory(int nid, u64 start, u64 size)
{
struct pglist_data *pgdat = NODE_DATA(nid);
struct zone *zone = pgdat->node_zones + ZONE_NORMAL;
- unsigned long last_mapped_pfn, start_pfn = start >> PAGE_SHIFT;
+ unsigned long start_pfn = start >> PAGE_SHIFT;
unsigned long nr_pages = size >> PAGE_SHIFT;
int ret;

- last_mapped_pfn = init_memory_mapping(start, start + size);
- if (last_mapped_pfn > max_pfn_mapped)
- max_pfn_mapped = last_mapped_pfn;
+ init_memory_mapping(start, start + size);

ret = __add_pages(nid, zone, start_pfn, nr_pages);
WARN_ON_ONCE(ret);
--
1.7.9.5

2012-08-29 19:05:22

by Jacob Shin

Subject: [PATCH 5/6] x86: Fixup code testing if a pfn is direct mapped

Update code that previously assumed pfns [ 0 - max_low_pfn_mapped ) and
[ 4GB - max_pfn_mapped ) were always direct mapped, to now look up
pfn_mapped ranges instead.
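
In effect, the open-coded watermark test becomes an explicit range lookup
(a sketch of the two forms; the actual hunks follow below):

	/* before: two global watermarks, special-casing the sub-4GB gap */
	if (end_pfn <= max_low_pfn_mapped
	    || (end_pfn > (1UL << (32 - PAGE_SHIFT))
		&& end_pfn <= max_pfn_mapped))
		...

	/* after: consult the tracked pfn_mapped ranges */
	if (pfn_range_is_mapped(start_pfn, end_pfn))
		...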

Signed-off-by: Jacob Shin <[email protected]>
---
arch/x86/kernel/cpu/amd.c | 6 +-----
arch/x86/platform/efi/efi.c | 8 ++++----
2 files changed, 5 insertions(+), 9 deletions(-)

diff --git a/arch/x86/kernel/cpu/amd.c b/arch/x86/kernel/cpu/amd.c
index 9d92e19..554ccfc 100644
--- a/arch/x86/kernel/cpu/amd.c
+++ b/arch/x86/kernel/cpu/amd.c
@@ -677,11 +677,7 @@ static void __cpuinit init_amd(struct cpuinfo_x86 *c)
*/
if (!rdmsrl_safe(MSR_K8_TSEG_ADDR, &tseg)) {
printk(KERN_DEBUG "tseg: %010llx\n", tseg);
- if ((tseg>>PMD_SHIFT) <
- (max_low_pfn_mapped>>(PMD_SHIFT-PAGE_SHIFT)) ||
- ((tseg>>PMD_SHIFT) <
- (max_pfn_mapped>>(PMD_SHIFT-PAGE_SHIFT)) &&
- (tseg>>PMD_SHIFT) >= (1ULL<<(32 - PMD_SHIFT))))
+ if (pfn_is_mapped(tseg))
set_memory_4k((unsigned long)__va(tseg), 1);
}
}
diff --git a/arch/x86/platform/efi/efi.c b/arch/x86/platform/efi/efi.c
index 92660eda..f1facde 100644
--- a/arch/x86/platform/efi/efi.c
+++ b/arch/x86/platform/efi/efi.c
@@ -776,7 +776,7 @@ void __init efi_enter_virtual_mode(void)
efi_memory_desc_t *md, *prev_md = NULL;
efi_status_t status;
unsigned long size;
- u64 end, systab, addr, npages, end_pfn;
+ u64 end, systab, addr, npages, start_pfn, end_pfn;
void *p, *va, *new_memmap = NULL;
int count = 0;

@@ -827,10 +827,10 @@ void __init efi_enter_virtual_mode(void)
size = md->num_pages << EFI_PAGE_SHIFT;
end = md->phys_addr + size;

+ start_pfn = PFN_DOWN(md->phys_addr);
end_pfn = PFN_UP(end);
- if (end_pfn <= max_low_pfn_mapped
- || (end_pfn > (1UL << (32 - PAGE_SHIFT))
- && end_pfn <= max_pfn_mapped))
+
+ if (pfn_range_is_mapped(start_pfn, end_pfn))
va = __va(md->phys_addr);
else
va = efi_ioremap(md->phys_addr, size, md->type);
--
1.7.9.5

2012-08-29 19:06:41

by Jacob Shin

Subject: [PATCH 6/6] x86: if kernel .text .data .bss are not marked as E820_RAM, complain and fix

There could be cases where user-supplied memmap=exactmap memory
mappings do not mark the region where the kernel .text, .data, and
.bss reside as E820_RAM, as reported here:

https://lkml.org/lkml/2012/8/14/86

Handle it by complaining and adding the range back into the e820 map.
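
For instance (a hypothetical command line, not the one from the report
above), a kernel booted with:

	memmap=exactmap memmap=640K@0 memmap=128M@768M

and loaded at 16MB would have its .text/.data/.bss on ranges that are
not E820_RAM, which the previous patches would then never direct map.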

Signed-off-by: Jacob Shin <[email protected]>
---
arch/x86/kernel/setup.c | 14 ++++++++++++++
1 file changed, 14 insertions(+)

diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index a2e392e..68f82d2 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -913,6 +913,20 @@ void __init setup_arch(char **cmdline_p)
insert_resource(&iomem_resource, &data_resource);
insert_resource(&iomem_resource, &bss_resource);

+ /*
+ * Complain if .text .data and .bss are not marked as E820_RAM and
+ * attempt to fix it by adding the range. We may have a confused BIOS,
+ * or the user may have incorrectly supplied it via memmap=exactmap. If
+ * we really are running on top of non-RAM, we will crash later anyways.
+ */
+ if (!e820_all_mapped(code_resource.start, __pa(__brk_limit), E820_RAM)) {
+ pr_warn(".text .data .bss are not marked as E820_RAM!\n");
+
+ e820_add_region(code_resource.start,
+ __pa(__brk_limit) - code_resource.start + 1,
+ E820_RAM);
+ }
+
trim_bios_range();
#ifdef CONFIG_X86_32
if (ppro_with_ram_bug()) {
--
1.7.9.5

2012-08-29 21:02:42

by Yinghai Lu

Subject: Re: [PATCH 5/6] x86: Fixup code testing if a pfn is direct mapped

On Wed, Aug 29, 2012 at 12:04 PM, Jacob Shin <[email protected]> wrote:
> Update code that previously assumed pfns [ 0 - max_low_pfn_mapped ) and
> [ 4GB - max_pfn_mapped ) were always direct mapped, to now look up
> pfn_mapped ranges instead.

Please swap the applying sequence of patch 5 and patch 4.

That is, we should have

[PATCH 4/6] x86: Fixup code testing if a pfn is direct mapped

and it should carry a dummy function:

bool pfn_range_is_mapped(u64 start_pfn, u64 end_pfn)
{
	return end_pfn <= max_low_pfn_mapped
	       || (end_pfn > (1UL << (32 - PAGE_SHIFT))
		   && end_pfn <= max_pfn_mapped);
}

Then

[PATCH 5/6] x86: Only direct map addresses that are marked as E820_RAM

will update pfn_range_is_mapped() accordingly.

Thanks

Yinghai

2012-08-29 21:17:57

by Yinghai Lu

Subject: Re: [PATCH 4/6] x86: Only direct map addresses that are marked as E820_RAM

On Wed, Aug 29, 2012 at 12:04 PM, Jacob Shin <[email protected]> wrote:
> Currently direct mappings are created for [ 0 to max_low_pfn<<PAGE_SHIFT )
> and [ 4GB to max_pfn<<PAGE_SHIFT ), which may include regions that are not
> backed by actual DRAM. This is fine for holes under 4GB, which fixed and
> variable range MTRRs cover as UC. However, we run into trouble at higher
> memory addresses, which cannot be covered by MTRRs.
>
> Our system with 1TB of RAM has an e820 that looks like this:
>
> BIOS-e820: [mem 0x0000000000000000-0x00000000000983ff] usable
> BIOS-e820: [mem 0x0000000000098400-0x000000000009ffff] reserved
> BIOS-e820: [mem 0x00000000000d0000-0x00000000000fffff] reserved
> BIOS-e820: [mem 0x0000000000100000-0x00000000c7ebffff] usable
> BIOS-e820: [mem 0x00000000c7ec0000-0x00000000c7ed7fff] ACPI data
> BIOS-e820: [mem 0x00000000c7ed8000-0x00000000c7ed9fff] ACPI NVS
> BIOS-e820: [mem 0x00000000c7eda000-0x00000000c7ffffff] reserved
> BIOS-e820: [mem 0x00000000fec00000-0x00000000fec0ffff] reserved
> BIOS-e820: [mem 0x00000000fee00000-0x00000000fee00fff] reserved
> BIOS-e820: [mem 0x00000000fff00000-0x00000000ffffffff] reserved
> BIOS-e820: [mem 0x0000000100000000-0x000000e037ffffff] usable
> BIOS-e820: [mem 0x000000e038000000-0x000000fcffffffff] reserved
> BIOS-e820: [mem 0x0000010000000000-0x0000011ffeffffff] usable
>
> and so direct mappings are created for the huge memory hole between
> 0x000000e038000000 and 0x0000010000000000. Even though the kernel never
> generates memory accesses in that region, since the page tables incorrectly
> mark it as WB, our (AMD) processor ends up causing an MCE while doing some
> memory bookkeeping/optimizations around that area.
>
> This patch iterates through the e820 map, direct maps only the ranges that
> are marked as E820_RAM, and keeps track of those pfn ranges. Depending on
> the alignment of the E820 ranges, this may result in parts of the mapping
> using smaller pages (i.e. 4K instead of 2M or 1G).
>
> Signed-off-by: Jacob Shin <[email protected]>
> ---
> arch/x86/include/asm/page_types.h | 9 ++++
> arch/x86/kernel/setup.c | 100 +++++++++++++++++++++++++++++++------
> arch/x86/mm/init.c | 2 +
> arch/x86/mm/init_64.c | 6 +--
> 4 files changed, 99 insertions(+), 18 deletions(-)
>
> diff --git a/arch/x86/include/asm/page_types.h b/arch/x86/include/asm/page_types.h
> index e21fdd1..409047a 100644
> --- a/arch/x86/include/asm/page_types.h
> +++ b/arch/x86/include/asm/page_types.h
> @@ -3,6 +3,7 @@
>
> #include <linux/const.h>
> #include <linux/types.h>
> +#include <asm/e820.h>
>
> /* PAGE_SHIFT determines the page size */
> #define PAGE_SHIFT 12
> @@ -40,12 +41,20 @@
> #endif /* CONFIG_X86_64 */
>
> #ifndef __ASSEMBLY__
> +#include <linux/range.h>
>
> extern int devmem_is_allowed(unsigned long pagenr);
>
> extern unsigned long max_low_pfn_mapped;
> extern unsigned long max_pfn_mapped;
>
> +extern struct range pfn_mapped[E820_X_MAX];
> +extern int nr_pfn_mapped;
> +
> +extern void add_pfn_range_mapped(unsigned long start_pfn, unsigned long end_pfn);
> +extern bool pfn_range_is_mapped(unsigned long start_pfn, unsigned long end_pfn);
> +extern bool pfn_is_mapped(unsigned long pfn);
> +
> static inline phys_addr_t get_max_mapped(void)
> {
> return (phys_addr_t)max_pfn_mapped << PAGE_SHIFT;
> diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
> index d6e8c03..a2e392e 100644
> --- a/arch/x86/kernel/setup.c
> +++ b/arch/x86/kernel/setup.c
> @@ -115,13 +115,47 @@
> #include <asm/prom.h>
>
> /*
> - * end_pfn only includes RAM, while max_pfn_mapped includes all e820 entries.
> - * The direct mapping extends to max_pfn_mapped, so that we can directly access
> - * apertures, ACPI and other tables without having to play with fixmaps.
> + * max_low_pfn_mapped: highest direct mapped pfn under 4GB
> + * max_pfn_mapped: highest direct mapped pfn over 4GB
> + *
> + * The direct mapping only covers E820_RAM regions, so the ranges and gaps are
> + * represented by pfn_mapped
> */
> unsigned long max_low_pfn_mapped;
> unsigned long max_pfn_mapped;
>
> +struct range pfn_mapped[E820_X_MAX];
> +int nr_pfn_mapped;

change to static?

> +
> +void add_pfn_range_mapped(unsigned long start_pfn, unsigned long end_pfn)
> +{
> + nr_pfn_mapped = add_range_with_merge(pfn_mapped, E820_X_MAX,
> + nr_pfn_mapped, start_pfn, end_pfn);
> + nr_pfn_mapped = clean_sort_range(pfn_mapped, E820_X_MAX);
> +
> + max_pfn_mapped = max(max_pfn_mapped, end_pfn);
> +
> + if (end_pfn <= (1UL << (32 - PAGE_SHIFT)))
> + max_low_pfn_mapped = max(max_low_pfn_mapped, end_pfn);
> +}
> +
> +bool pfn_range_is_mapped(unsigned long start_pfn, unsigned long end_pfn)
> +{
> + int i;
> +
> + for (i = 0; i < nr_pfn_mapped; i++)
> + if ((start_pfn >= pfn_mapped[i].start) &&
> + (end_pfn <= pfn_mapped[i].end))
> + return true;
> +
> + return false;
> +}
> +
> +bool pfn_is_mapped(unsigned long pfn)
> +{
> + return pfn_range_is_mapped(pfn, pfn + 1);
> +}

I wonder if those functions have to be in arch/x86/kernel/setup.c.

Also, do we need to update the tracking array when we do memory hot-remove?

Thanks

Yinghai

2012-08-29 21:33:02

by Borislav Petkov

Subject: Re: [PATCH 4/6] x86: Only direct map addresses that are marked as E820_RAM

On Wed, Aug 29, 2012 at 02:17:51PM -0700, Yinghai Lu wrote:
> > +struct range pfn_mapped[E820_X_MAX];
> > +int nr_pfn_mapped;
>
> change to static?
>
> > +
> > +void add_pfn_range_mapped(unsigned long start_pfn, unsigned long end_pfn)
> > +{
> > + nr_pfn_mapped = add_range_with_merge(pfn_mapped, E820_X_MAX,
> > + nr_pfn_mapped, start_pfn, end_pfn);
> > + nr_pfn_mapped = clean_sort_range(pfn_mapped, E820_X_MAX);
> > +
> > + max_pfn_mapped = max(max_pfn_mapped, end_pfn);
> > +
> > + if (end_pfn <= (1UL << (32 - PAGE_SHIFT)))
> > + max_low_pfn_mapped = max(max_low_pfn_mapped, end_pfn);
> > +}
> > +
> > +bool pfn_range_is_mapped(unsigned long start_pfn, unsigned long end_pfn)
> > +{
> > + int i;
> > +
> > + for (i = 0; i < nr_pfn_mapped; i++)
> > + if ((start_pfn >= pfn_mapped[i].start) &&
> > + (end_pfn <= pfn_mapped[i].end))
> > + return true;
> > +
> > + return false;
> > +}
> > +
> > +bool pfn_is_mapped(unsigned long pfn)
> > +{
> > + return pfn_range_is_mapped(pfn, pfn + 1);
> > +}
>
> I wonder if those functions have to be in arch/x86/kernel/setup.c.
>
> Also, do we need to update the tracking array when we do memory hot-remove?

Would you please make sure you've reviewed this whole patchset
thoroughly so that Jacob can do all changes at once and not keep
resending them twice a week.

Thanks a lot!

--
Regards/Gruss,
Boris.

Advanced Micro Devices GmbH
Einsteinring 24, 85609 Dornach
GM: Alberto Bozzo
Reg: Dornach, Landkreis Muenchen
HRB Nr. 43632 WEEE Registernr: 129 19551

2012-08-29 21:46:38

by Jacob Shin

Subject: Re: [PATCH 4/6] x86: Only direct map addresses that are marked as E820_RAM

On Wed, Aug 29, 2012 at 02:17:51PM -0700, Yinghai Lu wrote:
> On Wed, Aug 29, 2012 at 12:04 PM, Jacob Shin <[email protected]> wrote:
> > Currently direct mappings are created for [ 0 to max_low_pfn<<PAGE_SHIFT )
> > and [ 4GB to max_pfn<<PAGE_SHIFT ), which may include regions that are not
> > backed by actual DRAM. This is fine for holes under 4GB, which fixed and
> > variable range MTRRs cover as UC. However, we run into trouble at higher
> > memory addresses, which cannot be covered by MTRRs.
> >
> > Our system with 1TB of RAM has an e820 that looks like this:
> >
> > BIOS-e820: [mem 0x0000000000000000-0x00000000000983ff] usable
> > BIOS-e820: [mem 0x0000000000098400-0x000000000009ffff] reserved
> > BIOS-e820: [mem 0x00000000000d0000-0x00000000000fffff] reserved
> > BIOS-e820: [mem 0x0000000000100000-0x00000000c7ebffff] usable
> > BIOS-e820: [mem 0x00000000c7ec0000-0x00000000c7ed7fff] ACPI data
> > BIOS-e820: [mem 0x00000000c7ed8000-0x00000000c7ed9fff] ACPI NVS
> > BIOS-e820: [mem 0x00000000c7eda000-0x00000000c7ffffff] reserved
> > BIOS-e820: [mem 0x00000000fec00000-0x00000000fec0ffff] reserved
> > BIOS-e820: [mem 0x00000000fee00000-0x00000000fee00fff] reserved
> > BIOS-e820: [mem 0x00000000fff00000-0x00000000ffffffff] reserved
> > BIOS-e820: [mem 0x0000000100000000-0x000000e037ffffff] usable
> > BIOS-e820: [mem 0x000000e038000000-0x000000fcffffffff] reserved
> > BIOS-e820: [mem 0x0000010000000000-0x0000011ffeffffff] usable
> >
> > and so direct mappings are created for the huge memory hole between
> > 0x000000e038000000 and 0x0000010000000000. Even though the kernel never
> > generates memory accesses in that region, since the page tables incorrectly
> > mark it as WB, our (AMD) processor ends up causing an MCE while doing some
> > memory bookkeeping/optimizations around that area.
> >
> > This patch iterates through the e820 map, direct maps only the ranges that
> > are marked as E820_RAM, and keeps track of those pfn ranges. Depending on
> > the alignment of the E820 ranges, this may result in parts of the mapping
> > using smaller pages (i.e. 4K instead of 2M or 1G).
> >
> > Signed-off-by: Jacob Shin <[email protected]>
> > ---
> > arch/x86/include/asm/page_types.h | 9 ++++
> > arch/x86/kernel/setup.c | 100 +++++++++++++++++++++++++++++++------
> > arch/x86/mm/init.c | 2 +
> > arch/x86/mm/init_64.c | 6 +--
> > 4 files changed, 99 insertions(+), 18 deletions(-)
> >
> > diff --git a/arch/x86/include/asm/page_types.h b/arch/x86/include/asm/page_types.h
> > index e21fdd1..409047a 100644
> > --- a/arch/x86/include/asm/page_types.h
> > +++ b/arch/x86/include/asm/page_types.h
> > @@ -3,6 +3,7 @@
> >
> > #include <linux/const.h>
> > #include <linux/types.h>
> > +#include <asm/e820.h>
> >
> > /* PAGE_SHIFT determines the page size */
> > #define PAGE_SHIFT 12
> > @@ -40,12 +41,20 @@
> > #endif /* CONFIG_X86_64 */
> >
> > #ifndef __ASSEMBLY__
> > +#include <linux/range.h>
> >
> > extern int devmem_is_allowed(unsigned long pagenr);
> >
> > extern unsigned long max_low_pfn_mapped;
> > extern unsigned long max_pfn_mapped;
> >
> > +extern struct range pfn_mapped[E820_X_MAX];
> > +extern int nr_pfn_mapped;
> > +
> > +extern void add_pfn_range_mapped(unsigned long start_pfn, unsigned long end_pfn);
> > +extern bool pfn_range_is_mapped(unsigned long start_pfn, unsigned long end_pfn);
> > +extern bool pfn_is_mapped(unsigned long pfn);
> > +
> > static inline phys_addr_t get_max_mapped(void)
> > {
> > return (phys_addr_t)max_pfn_mapped << PAGE_SHIFT;
> > diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
> > index d6e8c03..a2e392e 100644
> > --- a/arch/x86/kernel/setup.c
> > +++ b/arch/x86/kernel/setup.c
> > @@ -115,13 +115,47 @@
> > #include <asm/prom.h>
> >
> > /*
> > - * end_pfn only includes RAM, while max_pfn_mapped includes all e820 entries.
> > - * The direct mapping extends to max_pfn_mapped, so that we can directly access
> > - * apertures, ACPI and other tables without having to play with fixmaps.
> > + * max_low_pfn_mapped: highest direct mapped pfn under 4GB
> > + * max_pfn_mapped: highest direct mapped pfn over 4GB
> > + *
> > + * The direct mapping only covers E820_RAM regions, so the ranges and gaps are
> > + * represented by pfn_mapped
> > */
> > unsigned long max_low_pfn_mapped;
> > unsigned long max_pfn_mapped;
> >
> > +struct range pfn_mapped[E820_X_MAX];
> > +int nr_pfn_mapped;
>
> change to static?

Hm .. yeah, I guess we could. The initial reason I didn't make it
static was that max_pfn_mapped was not static. But as long as everyone
down the line uses pfn_range_is_mapped() to test for direct mappings,
we can change it to static.

>
> > +
> > +void add_pfn_range_mapped(unsigned long start_pfn, unsigned long end_pfn)
> > +{
> > + nr_pfn_mapped = add_range_with_merge(pfn_mapped, E820_X_MAX,
> > + nr_pfn_mapped, start_pfn, end_pfn);
> > + nr_pfn_mapped = clean_sort_range(pfn_mapped, E820_X_MAX);
> > +
> > + max_pfn_mapped = max(max_pfn_mapped, end_pfn);
> > +
> > + if (end_pfn <= (1UL << (32 - PAGE_SHIFT)))
> > + max_low_pfn_mapped = max(max_low_pfn_mapped, end_pfn);
> > +}
> > +
> > +bool pfn_range_is_mapped(unsigned long start_pfn, unsigned long end_pfn)
> > +{
> > + int i;
> > +
> > + for (i = 0; i < nr_pfn_mapped; i++)
> > + if ((start_pfn >= pfn_mapped[i].start) &&
> > + (end_pfn <= pfn_mapped[i].end))
> > + return true;
> > +
> > + return false;
> > +}
> > +
> > +bool pfn_is_mapped(unsigned long pfn)
> > +{
> > + return pfn_range_is_mapped(pfn, pfn + 1);
> > +}
>
> I wonder if those functions have to be in arch/x86/kernel/setup.c.

Where do you suggest we move it to?

>
> Also, do we need to update the tracking array when we do memory hot-remove?

Hm .. how is it handled right now? Does hot-remove tear down the direct
mappings? If it does, I guess we could hook the range-removal code where
that happens ..



>
> Thanks
>
> Yinghai
>

2012-08-30 06:28:38

by H. Peter Anvin

Subject: Re: [PATCH 4/6] x86: Only direct map addresses that are marked as E820_RAM

This week is Kernel Summit & Plumbers... people are kind of distracted.

Borislav Petkov <[email protected]> wrote:

>On Wed, Aug 29, 2012 at 02:17:51PM -0700, Yinghai Lu wrote:
>> > +struct range pfn_mapped[E820_X_MAX];
>> > +int nr_pfn_mapped;
>>
>> change to static?
>>
>> > +
>> > +void add_pfn_range_mapped(unsigned long start_pfn, unsigned long end_pfn)
>> > +{
>> > + nr_pfn_mapped = add_range_with_merge(pfn_mapped, E820_X_MAX,
>> > + nr_pfn_mapped, start_pfn, end_pfn);
>> > + nr_pfn_mapped = clean_sort_range(pfn_mapped, E820_X_MAX);
>> > +
>> > + max_pfn_mapped = max(max_pfn_mapped, end_pfn);
>> > +
>> > + if (end_pfn <= (1UL << (32 - PAGE_SHIFT)))
>> > + max_low_pfn_mapped = max(max_low_pfn_mapped, end_pfn);
>> > +}
>> > +
>> > +bool pfn_range_is_mapped(unsigned long start_pfn, unsigned long end_pfn)
>> > +{
>> > + int i;
>> > +
>> > + for (i = 0; i < nr_pfn_mapped; i++)
>> > + if ((start_pfn >= pfn_mapped[i].start) &&
>> > + (end_pfn <= pfn_mapped[i].end))
>> > + return true;
>> > +
>> > + return false;
>> > +}
>> > +
>> > +bool pfn_is_mapped(unsigned long pfn)
>> > +{
>> > + return pfn_range_is_mapped(pfn, pfn + 1);
>> > +}
>>
>> I wonder if those functions have to be in arch/x86/kernel/setup.c.
>>
>> Also, do we need to update the tracking array when we do memory hot-remove?
>
>Would you please make sure you've reviewed this whole patchset
>thoroughly so that Jacob can do all changes at once and not keep
>resending them twice a week.
>
>Thanks a lot!
>
>--
>Regards/Gruss,
>Boris.
>
>Advanced Micro Devices GmbH
>Einsteinring 24, 85609 Dornach
>GM: Alberto Bozzo
>Reg: Dornach, Landkreis Muenchen
>HRB Nr. 43632 WEEE Registernr: 129 19551

--
Sent from my mobile phone. Please excuse brevity and lack of formatting.

2012-08-30 23:06:54

by Yinghai Lu

Subject: [PATCH 0/8] x86, mm: init_memory_mapping cleanup

Only create mappings for E820_RAM and E820_RESERVED_KERN.

Also separate find_early_page_table out from init_memory_mapping.

Jacob Shin (3):
x86: if kernel .text .data .bss are not marked as E820_RAM, complain
and fix
x86: Fixup code testing if a pfn is direct mapped
x86: Only direct map addresses that are marked as E820_RAM

Yinghai Lu (5):
x86, mm: Add global page_size_mask
x86, mm: Split out split_mem_range
x86, mm: Moving init_memory_mapping calling
x86, mm: Revert back good_end setting for 64bit
x86, mm: Find early page table only one time

arch/x86/include/asm/init.h | 1 -
arch/x86/include/asm/page_types.h | 3 +
arch/x86/include/asm/pgtable.h | 1 +
arch/x86/kernel/cpu/amd.c | 8 +-
arch/x86/kernel/setup.c | 34 ++++---
arch/x86/mm/init.c | 225 ++++++++++++++++++++++++++-----------
arch/x86/mm/init_64.c | 6 +-
arch/x86/platform/efi/efi.c | 8 +-
8 files changed, 191 insertions(+), 95 deletions(-)

--
1.7.7

2012-08-30 23:07:00

by Yinghai Lu

Subject: [PATCH 1/8] x86, mm: Add global page_size_mask

Detect whether 1G or 2M pages should be used and store the result in
page_size_mask.

Only probe once.

Suggested-by: Ingo Molnar <[email protected]>
Signed-off-by: Yinghai Lu <[email protected]>
---
arch/x86/include/asm/pgtable.h | 1 +
arch/x86/kernel/setup.c | 1 +
arch/x86/mm/init.c | 66 +++++++++++++++++++---------------------
3 files changed, 33 insertions(+), 35 deletions(-)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 49afb3f..e47e4db 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -597,6 +597,7 @@ static inline int pgd_none(pgd_t pgd)
#ifndef __ASSEMBLY__

extern int direct_gbpages;
+void probe_page_size_mask(void);

/* local pte updates need not use xchg for locking */
static inline pte_t native_local_ptep_get_and_clear(pte_t *ptep)
diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index f4b9b80..d6e8c03 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -912,6 +912,7 @@ void __init setup_arch(char **cmdline_p)
setup_real_mode();

init_gbpages();
+ probe_page_size_mask();

/* max_pfn_mapped is updated here */
max_low_pfn_mapped = init_memory_mapping(0, max_low_pfn<<PAGE_SHIFT);
diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
index e0e6990..838e9bc 100644
--- a/arch/x86/mm/init.c
+++ b/arch/x86/mm/init.c
@@ -35,8 +35,10 @@ struct map_range {
unsigned page_size_mask;
};

-static void __init find_early_table_space(struct map_range *mr, unsigned long end,
- int use_pse, int use_gbpages)
+static int page_size_mask;
+
+static void __init find_early_table_space(struct map_range *mr,
+ unsigned long end)
{
unsigned long puds, pmds, ptes, tables, start = 0, good_end = end;
phys_addr_t base;
@@ -44,7 +46,7 @@ static void __init find_early_table_space(struct map_range *mr, unsigned long en
puds = (end + PUD_SIZE - 1) >> PUD_SHIFT;
tables = roundup(puds * sizeof(pud_t), PAGE_SIZE);

- if (use_gbpages) {
+ if (page_size_mask & (1 << PG_LEVEL_1G)) {
unsigned long extra;

extra = end - ((end>>PUD_SHIFT) << PUD_SHIFT);
@@ -54,7 +56,7 @@ static void __init find_early_table_space(struct map_range *mr, unsigned long en

tables += roundup(pmds * sizeof(pmd_t), PAGE_SIZE);

- if (use_pse) {
+ if (page_size_mask & (1 << PG_LEVEL_2M)) {
unsigned long extra;

extra = end - ((end>>PMD_SHIFT) << PMD_SHIFT);
@@ -90,6 +92,30 @@ static void __init find_early_table_space(struct map_range *mr, unsigned long en
(pgt_buf_top << PAGE_SHIFT) - 1);
}

+void probe_page_size_mask(void)
+{
+#if !defined(CONFIG_DEBUG_PAGEALLOC) && !defined(CONFIG_KMEMCHECK)
+ /*
+ * For CONFIG_DEBUG_PAGEALLOC, identity mapping will use small pages.
+ * This will simplify cpa(), which otherwise needs to support splitting
+ * large pages into small in interrupt context, etc.
+ */
+ if (direct_gbpages)
+ page_size_mask |= 1 << PG_LEVEL_1G;
+ if (cpu_has_pse)
+ page_size_mask |= 1 << PG_LEVEL_2M;
+#endif
+
+ /* Enable PSE if available */
+ if (cpu_has_pse)
+ set_in_cr4(X86_CR4_PSE);
+
+ /* Enable PGE if available */
+ if (cpu_has_pge) {
+ set_in_cr4(X86_CR4_PGE);
+ __supported_pte_mask |= _PAGE_GLOBAL;
+ }
+}
void __init native_pagetable_reserve(u64 start, u64 end)
{
memblock_reserve(start, end - start);
@@ -125,45 +151,15 @@ static int __meminit save_mr(struct map_range *mr, int nr_range,
unsigned long __init_refok init_memory_mapping(unsigned long start,
unsigned long end)
{
- unsigned long page_size_mask = 0;
unsigned long start_pfn, end_pfn;
unsigned long ret = 0;
unsigned long pos;
-
struct map_range mr[NR_RANGE_MR];
int nr_range, i;
- int use_pse, use_gbpages;

printk(KERN_INFO "init_memory_mapping: [mem %#010lx-%#010lx]\n",
start, end - 1);

-#if defined(CONFIG_DEBUG_PAGEALLOC) || defined(CONFIG_KMEMCHECK)
- /*
- * For CONFIG_DEBUG_PAGEALLOC, identity mapping will use small pages.
- * This will simplify cpa(), which otherwise needs to support splitting
- * large pages into small in interrupt context, etc.
- */
- use_pse = use_gbpages = 0;
-#else
- use_pse = cpu_has_pse;
- use_gbpages = direct_gbpages;
-#endif
-
- /* Enable PSE if available */
- if (cpu_has_pse)
- set_in_cr4(X86_CR4_PSE);
-
- /* Enable PGE if available */
- if (cpu_has_pge) {
- set_in_cr4(X86_CR4_PGE);
- __supported_pte_mask |= _PAGE_GLOBAL;
- }
-
- if (use_gbpages)
- page_size_mask |= 1 << PG_LEVEL_1G;
- if (use_pse)
- page_size_mask |= 1 << PG_LEVEL_2M;
-
memset(mr, 0, sizeof(mr));
nr_range = 0;

@@ -267,7 +263,7 @@ unsigned long __init_refok init_memory_mapping(unsigned long start,
* nodes are discovered.
*/
if (!after_bootmem)
- find_early_table_space(&mr[0], end, use_pse, use_gbpages);
+ find_early_table_space(&mr[0], end);

for (i = 0; i < nr_range; i++)
ret = kernel_physical_mapping_init(mr[i].start, mr[i].end,
--
1.7.7

2012-08-30 23:07:11

by Yinghai Lu

Subject: [PATCH 2/8] x86, mm: Split out split_mem_range

Split split_mem_range() out from init_memory_mapping() to make
init_memory_mapping() more readable.

Suggested-by: Ingo Molnar <[email protected]>
Signed-off-by: Yinghai Lu <[email protected]>
---
arch/x86/mm/init.c | 42 ++++++++++++++++++++++++++----------------
1 files changed, 26 insertions(+), 16 deletions(-)

diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
index 838e9bc..7d05e28 100644
--- a/arch/x86/mm/init.c
+++ b/arch/x86/mm/init.c
@@ -143,25 +143,13 @@ static int __meminit save_mr(struct map_range *mr, int nr_range,
return nr_range;
}

-/*
- * Setup the direct mapping of the physical memory at PAGE_OFFSET.
- * This runs before bootmem is initialized and gets pages directly from
- * the physical memory. To access them they are temporarily mapped.
- */
-unsigned long __init_refok init_memory_mapping(unsigned long start,
- unsigned long end)
+static int __meminit split_mem_range(struct map_range *mr, int nr_range,
+ unsigned long start,
+ unsigned long end)
{
unsigned long start_pfn, end_pfn;
- unsigned long ret = 0;
unsigned long pos;
- struct map_range mr[NR_RANGE_MR];
- int nr_range, i;
-
- printk(KERN_INFO "init_memory_mapping: [mem %#010lx-%#010lx]\n",
- start, end - 1);
-
- memset(mr, 0, sizeof(mr));
- nr_range = 0;
+ int i;

/* head if not big page alignment ? */
start_pfn = start >> PAGE_SHIFT;
@@ -255,6 +243,28 @@ unsigned long __init_refok init_memory_mapping(unsigned long start,
(mr[i].page_size_mask & (1<<PG_LEVEL_1G))?"1G":(
(mr[i].page_size_mask & (1<<PG_LEVEL_2M))?"2M":"4k"));

+ return nr_range;
+}
+
+/*
+ * Setup the direct mapping of the physical memory at PAGE_OFFSET.
+ * This runs before bootmem is initialized and gets pages directly from
+ * the physical memory. To access them they are temporarily mapped.
+ */
+unsigned long __init_refok init_memory_mapping(unsigned long start,
+ unsigned long end)
+{
+ struct map_range mr[NR_RANGE_MR];
+ unsigned long ret = 0;
+ int nr_range, i;
+
+ pr_info("init_memory_mapping: [mem %#010lx-%#010lx]\n",
+ start, end - 1);
+
+ memset(mr, 0, sizeof(mr));
+ nr_range = 0;
+ nr_range = split_mem_range(mr, nr_range, start, end);
+
/*
* Find space for the kernel direct mapping tables.
*
--
1.7.7

2012-08-30 23:07:18

by Yinghai Lu

Subject: [PATCH 4/8] x86, mm: Revert back good_end setting for 64bit

So we can put the page tables high again for 64-bit.

Signed-off-by: Yinghai Lu <[email protected]>
---
arch/x86/mm/init.c | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
index 15a6a38..cca9b7d 100644
--- a/arch/x86/mm/init.c
+++ b/arch/x86/mm/init.c
@@ -76,8 +76,8 @@ static void __init find_early_table_space(struct map_range *mr,
#ifdef CONFIG_X86_32
/* for fixmap */
tables += roundup(__end_of_fixed_addresses * sizeof(pte_t), PAGE_SIZE);
-#endif
good_end = max_pfn_mapped << PAGE_SHIFT;
+#endif

base = memblock_find_in_range(start, good_end, tables, PAGE_SIZE);
if (!base)
--
1.7.7

2012-08-30 23:07:25

by Yinghai Lu

Subject: [PATCH 5/8] x86, mm: Find early page table only one time

We should not do that on every call of init_memory_mapping();
in early boot it only needs to be done once.

Also move early_memtest() down.
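
The resulting early boot flow is roughly (a sketch of the 64-bit case,
assuming init_mem_mapping() is the single entry point called from
setup_arch(), as introduced earlier in this series):

	init_mem_mapping()
		probe_page_size_mask();
		find_early_table_space(0, max_pfn << PAGE_SHIFT);	/* once */
		max_low_pfn_mapped = init_memory_mapping(0, max_low_pfn << PAGE_SHIFT);
		max_pfn_mapped = init_memory_mapping(1UL << 32, max_pfn << PAGE_SHIFT);
		x86_init.mapping.pagetable_reserve(...);	/* free unused tables */
		early_memtest(0, max_pfn << PAGE_SHIFT);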

Signed-off-by: Yinghai Lu <[email protected]>
---
arch/x86/mm/init.c | 71 +++++++++++++++++++++++++++-------------------------
1 files changed, 37 insertions(+), 34 deletions(-)

diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
index cca9b7d..c3e4341 100644
--- a/arch/x86/mm/init.c
+++ b/arch/x86/mm/init.c
@@ -37,7 +37,7 @@ struct map_range {

static int page_size_mask;

-static void __init find_early_table_space(struct map_range *mr,
+static void __init find_early_table_space(unsigned long begin,
unsigned long end)
{
unsigned long puds, pmds, ptes, tables, start = 0, good_end = end;
@@ -64,8 +64,8 @@ static void __init find_early_table_space(struct map_range *mr,
extra += PMD_SIZE;
#endif
/* The first 2/4M doesn't use large pages. */
- if (mr->start < PMD_SIZE)
- extra += mr->end - mr->start;
+ if (begin < PMD_SIZE)
+ extra += (PMD_SIZE - start) >> PAGE_SHIFT;

ptes = (extra + PAGE_SIZE - 1) >> PAGE_SHIFT;
} else
@@ -265,15 +265,6 @@ unsigned long __init_refok init_memory_mapping(unsigned long start,
nr_range = 0;
nr_range = split_mem_range(mr, nr_range, start, end);

- /*
- * Find space for the kernel direct mapping tables.
- *
- * Later we should allocate these tables in the local node of the
- * memory mapped. Unfortunately this is done currently before the
- * nodes are discovered.
- */
- if (!after_bootmem)
- find_early_table_space(&mr[0], end);

for (i = 0; i < nr_range; i++)
ret = kernel_physical_mapping_init(mr[i].start, mr[i].end,
@@ -287,6 +278,36 @@ unsigned long __init_refok init_memory_mapping(unsigned long start,

__flush_tlb_all();

+ return ret >> PAGE_SHIFT;
+}
+
+void __init init_mem_mapping(void)
+{
+ probe_page_size_mask();
+
+ /*
+ * Find space for the kernel direct mapping tables.
+ *
+ * Later we should allocate these tables in the local node of the
+ * memory mapped. Unfortunately this is done currently before the
+ * nodes are discovered.
+ */
+#ifdef CONFIG_X86_64
+ find_early_table_space(0, max_pfn<<PAGE_SHIFT);
+#else
+ find_early_table_space(0, max_low_pfn<<PAGE_SHIFT);
+#endif
+ max_low_pfn_mapped = init_memory_mapping(0, max_low_pfn<<PAGE_SHIFT);
+ max_pfn_mapped = max_low_pfn_mapped;
+
+#ifdef CONFIG_X86_64
+ if (max_pfn > max_low_pfn) {
+ max_pfn_mapped = init_memory_mapping(1UL<<32,
+ max_pfn<<PAGE_SHIFT);
+ /* can we preseve max_low_pfn ?*/
+ max_low_pfn = max_pfn;
+ }
+#endif
/*
* Reserve the kernel pagetable pages we used (pgt_buf_start -
* pgt_buf_end) and free the other ones (pgt_buf_end - pgt_buf_top)
@@ -302,32 +323,14 @@ unsigned long __init_refok init_memory_mapping(unsigned long start,
* RO all the pagetable pages, including the ones that are beyond
* pgt_buf_end at that time.
*/
- if (!after_bootmem && pgt_buf_end > pgt_buf_start)
+ if (pgt_buf_end > pgt_buf_start)
x86_init.mapping.pagetable_reserve(PFN_PHYS(pgt_buf_start),
PFN_PHYS(pgt_buf_end));

- if (!after_bootmem)
- early_memtest(start, end);
-
- return ret >> PAGE_SHIFT;
-}
-
-void __init init_mem_mapping(void)
-{
- probe_page_size_mask();
-
- /* max_pfn_mapped is updated here */
- max_low_pfn_mapped = init_memory_mapping(0, max_low_pfn<<PAGE_SHIFT);
- max_pfn_mapped = max_low_pfn_mapped;
+ /* stop the wrong using */
+ pgt_buf_top = 0;

-#ifdef CONFIG_X86_64
- if (max_pfn > max_low_pfn) {
- max_pfn_mapped = init_memory_mapping(1UL<<32,
- max_pfn<<PAGE_SHIFT);
- /* can we preseve max_low_pfn ?*/
- max_low_pfn = max_pfn;
- }
-#endif
+ early_memtest(0, max_pfn << PAGE_SHIFT);
}

/*
--
1.7.7

2012-08-30 23:07:33

by Yinghai Lu

Subject: [PATCH 6/8] x86: if kernel .text .data .bss are not marked as E820_RAM, complain and fix

From: Jacob Shin <[email protected]>

There could be cases where user-supplied memmap=exactmap memory
mappings do not mark the region where the kernel .text, .data, and
.bss reside as E820_RAM, as reported here:

https://lkml.org/lkml/2012/8/14/86

Handle it by complaining and adding the range back into the e820 map.

Signed-off-by: Jacob Shin <[email protected]>
---
arch/x86/kernel/setup.c | 14 ++++++++++++++
1 files changed, 14 insertions(+), 0 deletions(-)

diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index c30c78c..587dcd9 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -831,6 +831,20 @@ void __init setup_arch(char **cmdline_p)
insert_resource(&iomem_resource, &data_resource);
insert_resource(&iomem_resource, &bss_resource);

+ /*
+ * Complain if .text .data and .bss are not marked as E820_RAM and
+ * attempt to fix it by adding the range. We may have a confused BIOS,
+ * or the user may have incorrectly supplied it via memmap=exactmap. If
+ * we really are running on top of non-RAM, we will crash later anyways.
+ */
+ if (!e820_all_mapped(code_resource.start, __pa(__brk_limit), E820_RAM)) {
+ pr_warn(".text .data .bss are not marked as E820_RAM!\n");
+
+ e820_add_region(code_resource.start,
+ __pa(__brk_limit) - code_resource.start + 1,
+ E820_RAM);
+ }
+
trim_bios_range();
#ifdef CONFIG_X86_32
if (ppro_with_ram_bug()) {
--
1.7.7

2012-08-30 23:07:43

by Yinghai Lu

Subject: [PATCH 7/8] x86: Fixup code testing if a pfn is direct mapped

From: Jacob Shin <[email protected]>

Update code that previously assumed pfns [ 0 - max_low_pfn_mapped ) and
[ 4GB - max_pfn_mapped ) were always direct mapped, to now look up
pfn_mapped ranges instead.


-v2: change the applying sequence to keep git bisection working,
so add a dummy pfn_range_is_mapped(). - Yinghai Lu

Signed-off-by: Jacob Shin <[email protected]>
---
arch/x86/include/asm/page_types.h | 8 ++++++++
arch/x86/kernel/cpu/amd.c | 8 +++-----
arch/x86/platform/efi/efi.c | 8 ++++----
3 files changed, 15 insertions(+), 9 deletions(-)

diff --git a/arch/x86/include/asm/page_types.h b/arch/x86/include/asm/page_types.h
index e21fdd1..45aae6e 100644
--- a/arch/x86/include/asm/page_types.h
+++ b/arch/x86/include/asm/page_types.h
@@ -51,6 +51,14 @@ static inline phys_addr_t get_max_mapped(void)
return (phys_addr_t)max_pfn_mapped << PAGE_SHIFT;
}

+static inline bool pfn_range_is_mapped(unsigned long start_pfn,
+ unsigned long end_pfn)
+{
+ return end_pfn <= max_low_pfn_mapped ||
+ (end_pfn > (1UL << (32 - PAGE_SHIFT)) &&
+ end_pfn <= max_pfn_mapped);
+}
+
extern unsigned long init_memory_mapping(unsigned long start,
unsigned long end);

diff --git a/arch/x86/kernel/cpu/amd.c b/arch/x86/kernel/cpu/amd.c
index 9d92e19..4235553 100644
--- a/arch/x86/kernel/cpu/amd.c
+++ b/arch/x86/kernel/cpu/amd.c
@@ -676,12 +676,10 @@ static void __cpuinit init_amd(struct cpuinfo_x86 *c)
* benefit in doing so.
*/
if (!rdmsrl_safe(MSR_K8_TSEG_ADDR, &tseg)) {
+ unsigned long pfn = tseg >> PAGE_SHIFT;
+
printk(KERN_DEBUG "tseg: %010llx\n", tseg);
- if ((tseg>>PMD_SHIFT) <
- (max_low_pfn_mapped>>(PMD_SHIFT-PAGE_SHIFT)) ||
- ((tseg>>PMD_SHIFT) <
- (max_pfn_mapped>>(PMD_SHIFT-PAGE_SHIFT)) &&
- (tseg>>PMD_SHIFT) >= (1ULL<<(32 - PMD_SHIFT))))
+ if (pfn_range_is_mapped(pfn, pfn + 1))
set_memory_4k((unsigned long)__va(tseg), 1);
}
}
diff --git a/arch/x86/platform/efi/efi.c b/arch/x86/platform/efi/efi.c
index 92660eda..f1facde 100644
--- a/arch/x86/platform/efi/efi.c
+++ b/arch/x86/platform/efi/efi.c
@@ -776,7 +776,7 @@ void __init efi_enter_virtual_mode(void)
efi_memory_desc_t *md, *prev_md = NULL;
efi_status_t status;
unsigned long size;
- u64 end, systab, addr, npages, end_pfn;
+ u64 end, systab, addr, npages, start_pfn, end_pfn;
void *p, *va, *new_memmap = NULL;
int count = 0;

@@ -827,10 +827,10 @@ void __init efi_enter_virtual_mode(void)
size = md->num_pages << EFI_PAGE_SHIFT;
end = md->phys_addr + size;

+ start_pfn = PFN_DOWN(md->phys_addr);
end_pfn = PFN_UP(end);
- if (end_pfn <= max_low_pfn_mapped
- || (end_pfn > (1UL << (32 - PAGE_SHIFT))
- && end_pfn <= max_pfn_mapped))
+
+ if (pfn_range_is_mapped(start_pfn, end_pfn))
va = __va(md->phys_addr);
else
va = efi_ioremap(md->phys_addr, size, md->type);
--
1.7.7

2012-08-30 23:07:48

by Yinghai Lu

Subject: [PATCH 8/8] x86: Only direct map addresses that are marked as E820_RAM

From: Jacob Shin <[email protected]>

Currently direct mappings are created for [ 0 to max_low_pfn<<PAGE_SHIFT )
and [ 4GB to max_pfn<<PAGE_SHIFT ), which may include regions that are not
backed by actual DRAM. This is fine for holes under 4GB, which fixed and
variable range MTRRs cover as UC. However, we run into trouble at higher
memory addresses, which cannot be covered by MTRRs.

Our system with 1TB of RAM has an e820 that looks like this:

BIOS-e820: [mem 0x0000000000000000-0x00000000000983ff] usable
BIOS-e820: [mem 0x0000000000098400-0x000000000009ffff] reserved
BIOS-e820: [mem 0x00000000000d0000-0x00000000000fffff] reserved
BIOS-e820: [mem 0x0000000000100000-0x00000000c7ebffff] usable
BIOS-e820: [mem 0x00000000c7ec0000-0x00000000c7ed7fff] ACPI data
BIOS-e820: [mem 0x00000000c7ed8000-0x00000000c7ed9fff] ACPI NVS
BIOS-e820: [mem 0x00000000c7eda000-0x00000000c7ffffff] reserved
BIOS-e820: [mem 0x00000000fec00000-0x00000000fec0ffff] reserved
BIOS-e820: [mem 0x00000000fee00000-0x00000000fee00fff] reserved
BIOS-e820: [mem 0x00000000fff00000-0x00000000ffffffff] reserved
BIOS-e820: [mem 0x0000000100000000-0x000000e037ffffff] usable
BIOS-e820: [mem 0x000000e038000000-0x000000fcffffffff] reserved
BIOS-e820: [mem 0x0000010000000000-0x0000011ffeffffff] usable

and so direct mappings are created for the huge memory hole between
0x000000e038000000 and 0x0000010000000000 (roughly 127GB). Even though
the kernel never explicitly accesses that region, the page tables mark
it incorrectly as WB, and our (AMD) processor ends up raising an MCE
while doing memory bookkeeping/optimizations around that area.

This patch iterates through the e820 map and direct maps only the ranges
that are marked as E820_RAM, keeping track of those pfn ranges. Depending
on the alignment of the E820 ranges, this may result in some ranges being
mapped with smaller pages (e.g. 4K instead of 2M or 1G).

-v2: move changes from setup.c to mm/init.c, also use for_each_mem_pfn_range
instead. - Yinghai Lu
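
To make the bookkeeping concrete, here is a stand-alone C model of the
pfn_mapped tracking. It is a sketch only: the kernel merges and sorts
ranges via add_range_with_merge()/clean_sort_range(), which this model
skips, and the pfn values in main() are derived from the e820 map above
for illustration:

#include <stdio.h>
#include <stdbool.h>

#define MAX_RANGES 16

struct range {
	unsigned long start;	/* inclusive pfn */
	unsigned long end;	/* exclusive pfn */
};

static struct range pfn_mapped[MAX_RANGES];
static int nr_pfn_mapped;

/* record a direct-mapped pfn range (no merge/sort, unlike the kernel) */
static void add_pfn_range_mapped(unsigned long start_pfn,
				 unsigned long end_pfn)
{
	if (nr_pfn_mapped >= MAX_RANGES)
		return;
	pfn_mapped[nr_pfn_mapped].start = start_pfn;
	pfn_mapped[nr_pfn_mapped].end = end_pfn;
	nr_pfn_mapped++;
}

/* true only if [start_pfn, end_pfn) lies inside one recorded range */
static bool pfn_range_is_mapped(unsigned long start_pfn,
				unsigned long end_pfn)
{
	int i;

	for (i = 0; i < nr_pfn_mapped; i++)
		if (start_pfn >= pfn_mapped[i].start &&
		    end_pfn <= pfn_mapped[i].end)
			return true;
	return false;
}

int main(void)
{
	/* E820_RAM ranges from the e820 map above, expressed as pfns */
	add_pfn_range_mapped(0x100, 0xc7ec0);		/* 1MB - ~3.1GB */
	add_pfn_range_mapped(0x100000, 0xe038000);	/* 4GB - ~897GB */

	printf("%d\n", pfn_range_is_mapped(0x100000, 0x100010));  /* 1 */
	/* the e038000000 - 10000000000 hole is no longer "mapped" */
	printf("%d\n", pfn_range_is_mapped(0xe038000, 0xe038010)); /* 0 */
	return 0;
}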

Signed-off-by: Jacob Shin <[email protected]>
---
arch/x86/include/asm/page_types.h | 11 +----
arch/x86/kernel/setup.c | 8 ++-
arch/x86/mm/init.c | 85 ++++++++++++++++++++++++++++++++----
arch/x86/mm/init_64.c | 6 +--
4 files changed, 85 insertions(+), 25 deletions(-)

diff --git a/arch/x86/include/asm/page_types.h b/arch/x86/include/asm/page_types.h
index 45aae6e..fbf5cc4 100644
--- a/arch/x86/include/asm/page_types.h
+++ b/arch/x86/include/asm/page_types.h
@@ -46,19 +46,14 @@ extern int devmem_is_allowed(unsigned long pagenr);
extern unsigned long max_low_pfn_mapped;
extern unsigned long max_pfn_mapped;

+void add_pfn_range_mapped(unsigned long start_pfn, unsigned long end_pfn);
+bool pfn_range_is_mapped(unsigned long start_pfn, unsigned long end_pfn);
+
static inline phys_addr_t get_max_mapped(void)
{
return (phys_addr_t)max_pfn_mapped << PAGE_SHIFT;
}

-static inline bool pfn_range_is_mapped(unsigned long start_pfn,
- unsigned long end_pfn)
-{
- return end_pfn <= max_low_pfn_mapped ||
- (end_pfn > (1UL << (32 - PAGE_SHIFT)) &&
- end_pfn <= max_pfn_mapped);
-}
-
extern unsigned long init_memory_mapping(unsigned long start,
unsigned long end);

diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index 587dcd9..2eb91b7 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -115,9 +115,11 @@
#include <asm/prom.h>

/*
- * end_pfn only includes RAM, while max_pfn_mapped includes all e820 entries.
- * The direct mapping extends to max_pfn_mapped, so that we can directly access
- * apertures, ACPI and other tables without having to play with fixmaps.
+ * max_low_pfn_mapped: highest direct mapped pfn under 4GB
+ * max_pfn_mapped: highest direct mapped pfn over 4GB
+ *
+ * The direct mapping only covers E820_RAM regions, so both the mapped
+ * ranges and the gaps between them are represented by pfn_mapped.
*/
unsigned long max_low_pfn_mapped;
unsigned long max_pfn_mapped;
diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
index c3e4341..9b871d0 100644
--- a/arch/x86/mm/init.c
+++ b/arch/x86/mm/init.c
@@ -246,6 +246,33 @@ static int __meminit split_mem_range(struct map_range *mr, int nr_range,
return nr_range;
}

+static struct range pfn_mapped[E820_X_MAX];
+static int nr_pfn_mapped;
+
+void add_pfn_range_mapped(unsigned long start_pfn, unsigned long end_pfn)
+{
+ nr_pfn_mapped = add_range_with_merge(pfn_mapped, E820_X_MAX,
+ nr_pfn_mapped, start_pfn, end_pfn);
+ nr_pfn_mapped = clean_sort_range(pfn_mapped, E820_X_MAX);
+
+ max_pfn_mapped = max(max_pfn_mapped, end_pfn);
+
+ if (end_pfn <= (1UL << (32 - PAGE_SHIFT)))
+ max_low_pfn_mapped = max(max_low_pfn_mapped, end_pfn);
+}
+
+bool pfn_range_is_mapped(unsigned long start_pfn, unsigned long end_pfn)
+{
+ int i;
+
+ for (i = 0; i < nr_pfn_mapped; i++)
+ if ((start_pfn >= pfn_mapped[i].start) &&
+ (end_pfn <= pfn_mapped[i].end))
+ return true;
+
+ return false;
+}
+
/*
* Setup the direct mapping of the physical memory at PAGE_OFFSET.
* This runs before bootmem is initialized and gets pages directly from
@@ -278,9 +305,55 @@ unsigned long __init_refok init_memory_mapping(unsigned long start,

__flush_tlb_all();

+ add_pfn_range_mapped(start >> PAGE_SHIFT, ret >> PAGE_SHIFT);
+
return ret >> PAGE_SHIFT;
}

+/*
+ * Iterate through E820 memory map and create direct mappings for only E820_RAM
+ * regions. We cannot simply create direct mappings for all pfns from
+ * [0 to max_low_pfn) and [4GB to max_pfn) because of possible memory holes in
+ * high addresses that cannot be marked as UC by fixed/variable range MTRRs.
+ * Depending on the alignment of E820 ranges, this may result in mapping some
+ * ranges with smaller pages (e.g. 4K instead of 2M or 1G).
+ */
+static void __init __init_mem_mapping(void)
+{
+ unsigned long start_pfn, end_pfn;
+ int i;
+
+ /* the ISA range is always mapped regardless of memory holes */
+ init_memory_mapping(0, ISA_END_ADDRESS);
+
+ for_each_mem_pfn_range(i, MAX_NUMNODES, &start_pfn, &end_pfn, NULL) {
+ u64 start = (u64)start_pfn << PAGE_SHIFT;
+ u64 end = (u64)end_pfn << PAGE_SHIFT;
+
+ if (end <= ISA_END_ADDRESS)
+ continue;
+
+ if (start < ISA_END_ADDRESS)
+ start = ISA_END_ADDRESS;
+#ifdef CONFIG_X86_32
+ /* on 32 bit, we only map up to max_low_pfn */
+ if ((start >> PAGE_SHIFT) >= max_low_pfn)
+ continue;
+
+ if ((end >> PAGE_SHIFT) > max_low_pfn)
+ end = max_low_pfn << PAGE_SHIFT;
+#endif
+ init_memory_mapping(start, end);
+ }
+
+#ifdef CONFIG_X86_64
+ if (max_pfn > max_low_pfn) {
+ /* can we preserve max_low_pfn? */
+ max_low_pfn = max_pfn;
+ }
+#endif
+}
+
void __init init_mem_mapping(void)
{
probe_page_size_mask();
@@ -297,17 +370,9 @@ void __init init_mem_mapping(void)
#else
find_early_table_space(0, max_low_pfn<<PAGE_SHIFT);
#endif
- max_low_pfn_mapped = init_memory_mapping(0, max_low_pfn<<PAGE_SHIFT);
- max_pfn_mapped = max_low_pfn_mapped;

-#ifdef CONFIG_X86_64
- if (max_pfn > max_low_pfn) {
- max_pfn_mapped = init_memory_mapping(1UL<<32,
- max_pfn<<PAGE_SHIFT);
- /* can we preseve max_low_pfn ?*/
- max_low_pfn = max_pfn;
- }
-#endif
+ __init_mem_mapping();
+
/*
* Reserve the kernel pagetable pages we used (pgt_buf_start -
* pgt_buf_end) and free the other ones (pgt_buf_end - pgt_buf_top)
diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index 2b6b4a3..ab558eb 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -657,13 +657,11 @@ int arch_add_memory(int nid, u64 start, u64 size)
{
struct pglist_data *pgdat = NODE_DATA(nid);
struct zone *zone = pgdat->node_zones + ZONE_NORMAL;
- unsigned long last_mapped_pfn, start_pfn = start >> PAGE_SHIFT;
+ unsigned long start_pfn = start >> PAGE_SHIFT;
unsigned long nr_pages = size >> PAGE_SHIFT;
int ret;

- last_mapped_pfn = init_memory_mapping(start, start + size);
- if (last_mapped_pfn > max_pfn_mapped)
- max_pfn_mapped = last_mapped_pfn;
+ init_memory_mapping(start, start + size);

ret = __add_pages(nid, zone, start_pfn, nr_pages);
WARN_ON_ONCE(ret);
--
1.7.7

2012-08-30 23:08:57

by Yinghai Lu

[permalink] [raw]
Subject: [PATCH 3/8] x86, mm: Moving init_memory_mapping calling

from setup.c to mm/init.c

So we can update all the related call sites together.

Signed-off-by: Yinghai Lu <[email protected]>
---
arch/x86/include/asm/init.h | 1 -
arch/x86/include/asm/pgtable.h | 2 +-
arch/x86/kernel/setup.c | 13 +------------
arch/x86/mm/init.c | 19 ++++++++++++++++++-
4 files changed, 20 insertions(+), 15 deletions(-)

diff --git a/arch/x86/include/asm/init.h b/arch/x86/include/asm/init.h
index adcc0ae..4f13998 100644
--- a/arch/x86/include/asm/init.h
+++ b/arch/x86/include/asm/init.h
@@ -12,7 +12,6 @@ kernel_physical_mapping_init(unsigned long start,
unsigned long end,
unsigned long page_size_mask);

-
extern unsigned long __initdata pgt_buf_start;
extern unsigned long __meminitdata pgt_buf_end;
extern unsigned long __meminitdata pgt_buf_top;
diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index e47e4db..ae2cabb 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -597,7 +597,7 @@ static inline int pgd_none(pgd_t pgd)
#ifndef __ASSEMBLY__

extern int direct_gbpages;
-void probe_page_size_mask(void);
+void init_mem_mapping(void);

/* local pte updates need not use xchg for locking */
static inline pte_t native_local_ptep_get_and_clear(pte_t *ptep)
diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index d6e8c03..c30c78c 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -912,20 +912,9 @@ void __init setup_arch(char **cmdline_p)
setup_real_mode();

init_gbpages();
- probe_page_size_mask();

- /* max_pfn_mapped is updated here */
- max_low_pfn_mapped = init_memory_mapping(0, max_low_pfn<<PAGE_SHIFT);
- max_pfn_mapped = max_low_pfn_mapped;
+ init_mem_mapping();

-#ifdef CONFIG_X86_64
- if (max_pfn > max_low_pfn) {
- max_pfn_mapped = init_memory_mapping(1UL<<32,
- max_pfn<<PAGE_SHIFT);
- /* can we preseve max_low_pfn ?*/
- max_low_pfn = max_pfn;
- }
-#endif
memblock.current_limit = get_max_mapped();
dma_contiguous_reserve(0);

diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
index 7d05e28..15a6a38 100644
--- a/arch/x86/mm/init.c
+++ b/arch/x86/mm/init.c
@@ -92,7 +92,7 @@ static void __init find_early_table_space(struct map_range *mr,
(pgt_buf_top << PAGE_SHIFT) - 1);
}

-void probe_page_size_mask(void)
+static void __init probe_page_size_mask(void)
{
#if !defined(CONFIG_DEBUG_PAGEALLOC) && !defined(CONFIG_KMEMCHECK)
/*
@@ -312,6 +312,23 @@ unsigned long __init_refok init_memory_mapping(unsigned long start,
return ret >> PAGE_SHIFT;
}

+void __init init_mem_mapping(void)
+{
+ probe_page_size_mask();
+
+ /* max_pfn_mapped is updated here */
+ max_low_pfn_mapped = init_memory_mapping(0, max_low_pfn<<PAGE_SHIFT);
+ max_pfn_mapped = max_low_pfn_mapped;
+
+#ifdef CONFIG_X86_64
+ if (max_pfn > max_low_pfn) {
+ max_pfn_mapped = init_memory_mapping(1UL<<32,
+ max_pfn<<PAGE_SHIFT);
+ /* can we preseve max_low_pfn ?*/
+ max_low_pfn = max_pfn;
+ }
+#endif
+}

/*
* devmem_is_allowed() checks to see if /dev/mem access to a certain address
--
1.7.7

2012-08-30 23:14:52

by Yinghai Lu

[permalink] [raw]
Subject: Re: [PATCH 0/8] x86, mm: init_memory_mapping cleanup

On Thu, Aug 30, 2012 at 4:06 PM, Yinghai Lu <[email protected]> wrote:
> Only create mappings for E820_RAM and E820_RESERVED_KERN.
>
> Also separate find_early_page_table out from init_memory_mapping.
>
> Jacob Shin (3):
> x86: if kernel .text .data .bss are not marked as E820_RAM, complain
> and fix
> x86: Fixup code testing if a pfn is direct mapped
> x86: Only direct map addresses that are marked as E820_RAM
>
> Yinghai Lu (5):
> x86, mm: Add global page_size_mask
> x86, mm: Split out split_mem_range
> x86, mm: Moving init_memory_mapping calling
> x86, mm: Revert back good_end setting for 64bit
> x86, mm: Find early page table only one time
>
> arch/x86/include/asm/init.h | 1 -
> arch/x86/include/asm/page_types.h | 3 +
> arch/x86/include/asm/pgtable.h | 1 +
> arch/x86/kernel/cpu/amd.c | 8 +-
> arch/x86/kernel/setup.c | 34 ++++---
> arch/x86/mm/init.c | 225 ++++++++++++++++++++++++++-----------
> arch/x86/mm/init_64.c | 6 +-
> arch/x86/platform/efi/efi.c | 8 +-
> 8 files changed, 191 insertions(+), 95 deletions(-)

It can be found at:

git://git.kernel.org/pub/scm/linux/kernel/git/yinghai/linux-yinghai.git
for-x86-mm

2012-08-30 23:22:21

by Jacob Shin

[permalink] [raw]
Subject: Re: [PATCH 0/8] x86, mm: init_memory_mapping cleanup

On Thu, Aug 30, 2012 at 04:06:07PM -0700, Yinghai Lu wrote:
> Only create mappings for E820_RAM and E820_RESERVED_KERN.
>
> Also separate find_early_page_table out from init_memory_mapping.
>
> Jacob Shin (3):
> x86: if kernel .text .data .bss are not marked as E820_RAM, complain
> and fix
> x86: Fixup code testing if a pfn is direct mapped
> x86: Only direct map addresses that are marked as E820_RAM
>
> Yinghai Lu (5):
> x86, mm: Add global page_size_mask
> x86, mm: Split out split_mem_range
> x86, mm: Moving init_memory_mapping calling
> x86, mm: Revert back good_end setting for 64bit
> x86, mm: Find early page table only one time
>
> arch/x86/include/asm/init.h | 1 -
> arch/x86/include/asm/page_types.h | 3 +
> arch/x86/include/asm/pgtable.h | 1 +
> arch/x86/kernel/cpu/amd.c | 8 +-
> arch/x86/kernel/setup.c | 34 ++++---
> arch/x86/mm/init.c | 225 ++++++++++++++++++++++++++-----------
> arch/x86/mm/init_64.c | 6 +-
> arch/x86/platform/efi/efi.c | 8 +-
> 8 files changed, 191 insertions(+), 95 deletions(-)
>
> --
> 1.7.7
>
>

I'll be out of the office tomorrow, and Monday is a holiday, so I'll test it
on our machines on Tuesday.

Thanks,

-Jacob