2012-08-14 22:39:30

by Jacob Shin

Subject: [PATCH V3 0/4] x86: Create direct mappings for E820_RAM only

Currently, kernel direct mappings are created for all pfns between
[ 0 to max_low_pfn ) and [ 4GB to max_pfn ). When there are memory
holes, we end up mapping memory ranges that are not backed by physical
DRAM. This is fine for lower memory addresses, which can be marked as UC
by fixed/variable range MTRRs, but we run into trouble at high
addresses.

The following patchset creates direct mappings only for E820_RAM regions
between 0 ~ max_low_pfn and 4GB ~ max_pfn, and leaves non-E820_RAM
regions and memory holes unmapped.

This third revision of the patchset attempts to resolve comments and
concerns from the following threads:

* https://lkml.org/lkml/2012/8/13/512
* https://lkml.org/lkml/2012/8/9/536
* https://lkml.org/lkml/2011/10/20/323

Jacob Shin (4):
x86: Move enabling of PSE and PGE out of init_memory_mapping
x86: find_early_table_space based on memory ranges that are being
mapped
x86: Only direct map addresses that are marked as E820_RAM
x86: Fixup code testing if a pfn is direct mapped

arch/x86/include/asm/page_types.h | 9 +++
arch/x86/kernel/cpu/amd.c | 6 +-
arch/x86/kernel/setup.c | 121 +++++++++++++++++++++++++++++++------
arch/x86/mm/init.c | 74 +++++++++++------------
arch/x86/mm/init_64.c | 6 +-
arch/x86/platform/efi/efi.c | 8 +--
6 files changed, 155 insertions(+), 69 deletions(-)

--
1.7.9.5


2012-08-14 22:39:28

by Jacob Shin

Subject: [PATCH 3/4] x86: Only direct map addresses that are marked as E820_RAM

Currently direct mappings are created for [ 0 to max_low_pfn<<PAGE_SHIFT )
and [ 4GB to max_pfn<<PAGE_SHIFT ), which may include regions that are not
backed by actual DRAM. This is fine for holes under 4GB, which are covered
by fixed and variable range MTRRs and marked UC. However, we run into
trouble at higher memory addresses, which cannot be covered by MTRRs.

Our system with 1TB of RAM has an e820 that looks like this:

BIOS-e820: [mem 0x0000000000000000-0x00000000000983ff] usable
BIOS-e820: [mem 0x0000000000098400-0x000000000009ffff] reserved
BIOS-e820: [mem 0x00000000000d0000-0x00000000000fffff] reserved
BIOS-e820: [mem 0x0000000000100000-0x00000000c7ebffff] usable
BIOS-e820: [mem 0x00000000c7ec0000-0x00000000c7ed7fff] ACPI data
BIOS-e820: [mem 0x00000000c7ed8000-0x00000000c7ed9fff] ACPI NVS
BIOS-e820: [mem 0x00000000c7eda000-0x00000000c7ffffff] reserved
BIOS-e820: [mem 0x00000000fec00000-0x00000000fec0ffff] reserved
BIOS-e820: [mem 0x00000000fee00000-0x00000000fee00fff] reserved
BIOS-e820: [mem 0x00000000fff00000-0x00000000ffffffff] reserved
BIOS-e820: [mem 0x0000000100000000-0x000000e037ffffff] usable
BIOS-e820: [mem 0x000000e038000000-0x000000fcffffffff] reserved
BIOS-e820: [mem 0x0000010000000000-0x0000011ffeffffff] usable

and so direct mappings are created for the huge memory hole between
0x000000e038000000 and 0x0000010000000000. Even though the kernel never
generates memory accesses in that region, since the page tables mark
it incorrectly as being WB, our (AMD) processor ends up raising an MCE
while doing some memory bookkeeping/optimizations around that area.

This patch iterates through the e820 map and direct maps only the ranges
that are marked as E820_RAM, and keeps track of those pfn ranges.
Depending on the alignment of the E820 ranges, this may result in using
smaller page sizes (i.e. 4K instead of 2M or 1G).

Signed-off-by: Jacob Shin <[email protected]>
---
arch/x86/include/asm/page_types.h | 9 +++
arch/x86/kernel/setup.c | 131 +++++++++++++++++++++++++++++--------
arch/x86/mm/init.c | 2 +
arch/x86/mm/init_64.c | 6 +-
4 files changed, 115 insertions(+), 33 deletions(-)

diff --git a/arch/x86/include/asm/page_types.h b/arch/x86/include/asm/page_types.h
index e21fdd1..409047a 100644
--- a/arch/x86/include/asm/page_types.h
+++ b/arch/x86/include/asm/page_types.h
@@ -3,6 +3,7 @@

#include <linux/const.h>
#include <linux/types.h>
+#include <asm/e820.h>

/* PAGE_SHIFT determines the page size */
#define PAGE_SHIFT 12
@@ -40,12 +41,20 @@
#endif /* CONFIG_X86_64 */

#ifndef __ASSEMBLY__
+#include <linux/range.h>

extern int devmem_is_allowed(unsigned long pagenr);

extern unsigned long max_low_pfn_mapped;
extern unsigned long max_pfn_mapped;

+extern struct range pfn_mapped[E820_X_MAX];
+extern int nr_pfn_mapped;
+
+extern void add_pfn_range_mapped(unsigned long start_pfn, unsigned long end_pfn);
+extern bool pfn_range_is_mapped(unsigned long start_pfn, unsigned long end_pfn);
+extern bool pfn_is_mapped(unsigned long pfn);
+
static inline phys_addr_t get_max_mapped(void)
{
return (phys_addr_t)max_pfn_mapped << PAGE_SHIFT;
diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index 751e020..cba080e 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -115,13 +115,46 @@
#include <asm/prom.h>

/*
- * end_pfn only includes RAM, while max_pfn_mapped includes all e820 entries.
- * The direct mapping extends to max_pfn_mapped, so that we can directly access
- * apertures, ACPI and other tables without having to play with fixmaps.
+ * max_low_pfn_mapped: highest direct mapped pfn under 4GB
+ * max_pfn_mapped: highest direct mapped pfn over 4GB
+ *
+ * The direct mapping only covers E820_RAM regions, so the ranges and gaps are
+ * represented by pfn_mapped
*/
unsigned long max_low_pfn_mapped;
unsigned long max_pfn_mapped;

+struct range pfn_mapped[E820_X_MAX];
+int nr_pfn_mapped;
+
+void add_pfn_range_mapped(unsigned long start_pfn, unsigned long end_pfn)
+{
+ nr_pfn_mapped = add_range_with_merge(pfn_mapped, E820_X_MAX,
+ nr_pfn_mapped, start_pfn, end_pfn);
+
+ max_pfn_mapped = max(max_pfn_mapped, end_pfn);
+
+ if (end_pfn <= (1UL << (32 - PAGE_SHIFT)))
+ max_low_pfn_mapped = max(max_low_pfn_mapped, end_pfn);
+}
+
+bool pfn_range_is_mapped(unsigned long start_pfn, unsigned long end_pfn)
+{
+ int i;
+
+ for (i = 0; i < nr_pfn_mapped; i++)
+ if ((start_pfn >= pfn_mapped[i].start) &&
+ (end_pfn <= pfn_mapped[i].end))
+ return true;
+
+ return false;
+}
+
+bool pfn_is_mapped(unsigned long pfn)
+{
+ return pfn_range_is_mapped(pfn, pfn + 1);
+}
+
#ifdef CONFIG_DMI
RESERVE_BRK(dmi_alloc, 65536);
#endif
@@ -296,6 +329,71 @@ static void __init cleanup_highmap(void)
}
#endif

+/*
+ * Iterate through E820 memory map and create direct mappings for only E820_RAM
+ * regions. We cannot simply create direct mappings for all pfns from
+ * [0 to max_low_pfn) and [4GB to max_pfn) because of possible memory holes in
+ * high addresses that cannot be marked as UC by fixed/variable range MTRRs.
+ * Depending on the alignment of E820 ranges, this may possibly result in using
+ * smaller size (i.e. 4K instead of 2M or 1G) page tables.
+ */
+static void __init init_memory(void)
+{
+ int i;
+ unsigned long init_pfn = max_pfn_mapped;
+
+ /* add initial memory already mapped by early boot process */
+ add_pfn_range_mapped(0, init_pfn);
+
+ pr_info("initial memory mapped: [mem 0x00000000-%#010lx]\n",
+ (init_pfn<<PAGE_SHIFT) - 1);
+
+ init_gbpages();
+
+ /* Enable PSE if available */
+ if (cpu_has_pse)
+ set_in_cr4(X86_CR4_PSE);
+
+ /* Enable PGE if available */
+ if (cpu_has_pge) {
+ set_in_cr4(X86_CR4_PGE);
+ __supported_pte_mask |= _PAGE_GLOBAL;
+ }
+
+ for (i = 0; i < e820.nr_map; i++) {
+ struct e820entry *ei = &e820.map[i];
+ u64 start = ei->addr;
+ u64 end = ei->addr + ei->size;
+
+ /* we only map E820_RAM */
+ if (ei->type != E820_RAM)
+ continue;
+
+ /* skip over initial memory mapped */
+ if ((end >> PAGE_SHIFT) <= init_pfn)
+ continue;
+
+ if ((start >> PAGE_SHIFT) < init_pfn)
+ start = init_pfn << PAGE_SHIFT;
+#ifdef CONFIG_X86_32
+ /* on 32 bit, we only map up to max_low_pfn */
+ if ((start >> PAGE_SHIFT) >= max_low_pfn)
+ continue;
+
+ if ((end >> PAGE_SHIFT) > max_low_pfn)
+ end = max_low_pfn << PAGE_SHIFT;
+#endif
+ init_memory_mapping(start, end);
+ }
+
+#ifdef CONFIG_X86_64
+ if (max_pfn > max_low_pfn) {
+ /* can we preserve max_low_pfn? */
+ max_low_pfn = max_pfn;
+ }
+#endif
+}
+
static void __init reserve_brk(void)
{
if (_brk_end > _brk_start)
@@ -906,35 +1004,10 @@ void __init setup_arch(char **cmdline_p)
setup_bios_corruption_check();
#endif

- printk(KERN_DEBUG "initial memory mapped: [mem 0x00000000-%#010lx]\n",
- (max_pfn_mapped<<PAGE_SHIFT) - 1);
-
setup_real_mode();

- init_gbpages();
+ init_memory();

- /* Enable PSE if available */
- if (cpu_has_pse)
- set_in_cr4(X86_CR4_PSE);
-
- /* Enable PGE if available */
- if (cpu_has_pge) {
- set_in_cr4(X86_CR4_PGE);
- __supported_pte_mask |= _PAGE_GLOBAL;
- }
-
- /* max_pfn_mapped is updated here */
- max_low_pfn_mapped = init_memory_mapping(0, max_low_pfn<<PAGE_SHIFT);
- max_pfn_mapped = max_low_pfn_mapped;
-
-#ifdef CONFIG_X86_64
- if (max_pfn > max_low_pfn) {
- max_pfn_mapped = init_memory_mapping(1UL<<32,
- max_pfn<<PAGE_SHIFT);
- /* can we preseve max_low_pfn ?*/
- max_low_pfn = max_pfn;
- }
-#endif
memblock.current_limit = get_max_mapped();
dma_contiguous_reserve(0);

diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
index e2b21e0..d7b24da 100644
--- a/arch/x86/mm/init.c
+++ b/arch/x86/mm/init.c
@@ -301,6 +301,8 @@ unsigned long __init_refok init_memory_mapping(unsigned long start,
if (!after_bootmem)
early_memtest(start, end);

+ add_pfn_range_mapped(start >> PAGE_SHIFT, ret >> PAGE_SHIFT);
+
return ret >> PAGE_SHIFT;
}

diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index 2b6b4a3..ab558eb 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -657,13 +657,11 @@ int arch_add_memory(int nid, u64 start, u64 size)
{
struct pglist_data *pgdat = NODE_DATA(nid);
struct zone *zone = pgdat->node_zones + ZONE_NORMAL;
- unsigned long last_mapped_pfn, start_pfn = start >> PAGE_SHIFT;
+ unsigned long start_pfn = start >> PAGE_SHIFT;
unsigned long nr_pages = size >> PAGE_SHIFT;
int ret;

- last_mapped_pfn = init_memory_mapping(start, start + size);
- if (last_mapped_pfn > max_pfn_mapped)
- max_pfn_mapped = last_mapped_pfn;
+ init_memory_mapping(start, start + size);

ret = __add_pages(nid, zone, start_pfn, nr_pages);
WARN_ON_ONCE(ret);
--
1.7.9.5

2012-08-14 22:39:32

by Jacob Shin

Subject: [PATCH 1/4] x86: Move enabling of PSE and PGE out of init_memory_mapping

Depending on the platform, init_memory_mapping() may be called multiple
times. Move the enabling of PSE and PGE out to setup_arch() to avoid
writing to cr4 on every call.

Signed-off-by: Jacob Shin <[email protected]>
---
arch/x86/kernel/setup.c | 10 ++++++++++
arch/x86/mm/init.c | 10 ----------
2 files changed, 10 insertions(+), 10 deletions(-)

diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index f4b9b80..751e020 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -913,6 +913,16 @@ void __init setup_arch(char **cmdline_p)

init_gbpages();

+ /* Enable PSE if available */
+ if (cpu_has_pse)
+ set_in_cr4(X86_CR4_PSE);
+
+ /* Enable PGE if available */
+ if (cpu_has_pge) {
+ set_in_cr4(X86_CR4_PGE);
+ __supported_pte_mask |= _PAGE_GLOBAL;
+ }
+
/* max_pfn_mapped is updated here */
max_low_pfn_mapped = init_memory_mapping(0, max_low_pfn<<PAGE_SHIFT);
max_pfn_mapped = max_low_pfn_mapped;
diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
index e0e6990..2f07e09 100644
--- a/arch/x86/mm/init.c
+++ b/arch/x86/mm/init.c
@@ -149,16 +149,6 @@ unsigned long __init_refok init_memory_mapping(unsigned long start,
use_gbpages = direct_gbpages;
#endif

- /* Enable PSE if available */
- if (cpu_has_pse)
- set_in_cr4(X86_CR4_PSE);
-
- /* Enable PGE if available */
- if (cpu_has_pge) {
- set_in_cr4(X86_CR4_PGE);
- __supported_pte_mask |= _PAGE_GLOBAL;
- }
-
if (use_gbpages)
page_size_mask |= 1 << PG_LEVEL_1G;
if (use_pse)
--
1.7.9.5

2012-08-14 22:39:35

by Jacob Shin

Subject: [PATCH 2/4] x86: find_early_table_space based on memory ranges that are being mapped

The current logic finds enough space for direct mapping page tables from 0
to end. Instead, we only need to find enough space to cover mr[0].start
to mr[nr_range - 1].end -- the range that is actually being mapped by
init_memory_mapping().

This patch also reportedly fixes the suspend/resume issue reported in:

https://lkml.org/lkml/2012/8/11/83

Signed-off-by: Jacob Shin <[email protected]>
---
arch/x86/mm/init.c | 62 +++++++++++++++++++++++++++++-----------------------
1 file changed, 35 insertions(+), 27 deletions(-)

diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
index 2f07e09..e2b21e0 100644
--- a/arch/x86/mm/init.c
+++ b/arch/x86/mm/init.c
@@ -35,40 +35,48 @@ struct map_range {
unsigned page_size_mask;
};

-static void __init find_early_table_space(struct map_range *mr, unsigned long end,
- int use_pse, int use_gbpages)
+/*
+ * First calculate space needed for kernel direct mapping page tables to cover
+ * mr[0].start to mr[nr_range - 1].end, while accounting for possible 2M and 1GB
+ * pages. Then find enough contiguous space for those page tables.
+ */
+static void __init find_early_table_space(struct map_range *mr, int nr_range)
{
- unsigned long puds, pmds, ptes, tables, start = 0, good_end = end;
+ int i;
+ unsigned long puds = 0, pmds = 0, ptes = 0, tables;
+ unsigned long start = 0, good_end;
phys_addr_t base;

- puds = (end + PUD_SIZE - 1) >> PUD_SHIFT;
- tables = roundup(puds * sizeof(pud_t), PAGE_SIZE);
-
- if (use_gbpages) {
- unsigned long extra;
-
- extra = end - ((end>>PUD_SHIFT) << PUD_SHIFT);
- pmds = (extra + PMD_SIZE - 1) >> PMD_SHIFT;
- } else
- pmds = (end + PMD_SIZE - 1) >> PMD_SHIFT;
+ for (i = 0; i < nr_range; i++) {
+ unsigned long range, extra;

- tables += roundup(pmds * sizeof(pmd_t), PAGE_SIZE);
+ range = mr[i].end - mr[i].start;
+ puds += (range + PUD_SIZE - 1) >> PUD_SHIFT;

- if (use_pse) {
- unsigned long extra;
+ if (mr[i].page_size_mask & (1 << PG_LEVEL_1G)) {
+ extra = range - ((range >> PUD_SHIFT) << PUD_SHIFT);
+ pmds += (extra + PMD_SIZE - 1) >> PMD_SHIFT;
+ } else {
+ pmds += (range + PMD_SIZE - 1) >> PMD_SHIFT;
+ }

- extra = end - ((end>>PMD_SHIFT) << PMD_SHIFT);
+ if (mr[i].page_size_mask & (1 << PG_LEVEL_2M)) {
+ extra = range - ((range >> PMD_SHIFT) << PMD_SHIFT);
#ifdef CONFIG_X86_32
- extra += PMD_SIZE;
+ extra += PMD_SIZE;
#endif
- /* The first 2/4M doesn't use large pages. */
- if (mr->start < PMD_SIZE)
- extra += mr->end - mr->start;
-
- ptes = (extra + PAGE_SIZE - 1) >> PAGE_SHIFT;
- } else
- ptes = (end + PAGE_SIZE - 1) >> PAGE_SHIFT;
+ /* The first 2/4M doesn't use large pages. */
+ if (mr[i].start < PMD_SIZE)
+ extra += range;
+
+ ptes += (extra + PAGE_SIZE - 1) >> PAGE_SHIFT;
+ } else {
+ ptes += (range + PAGE_SIZE - 1) >> PAGE_SHIFT;
+ }
+ }

+ tables = roundup(puds * sizeof(pud_t), PAGE_SIZE);
+ tables += roundup(pmds * sizeof(pmd_t), PAGE_SIZE);
tables += roundup(ptes * sizeof(pte_t), PAGE_SIZE);

#ifdef CONFIG_X86_32
@@ -86,7 +94,7 @@ static void __init find_early_table_space(struct map_range *mr, unsigned long en
pgt_buf_top = pgt_buf_start + (tables >> PAGE_SHIFT);

printk(KERN_DEBUG "kernel direct mapping tables up to %#lx @ [mem %#010lx-%#010lx]\n",
- end - 1, pgt_buf_start << PAGE_SHIFT,
+ mr[nr_range - 1].end - 1, pgt_buf_start << PAGE_SHIFT,
(pgt_buf_top << PAGE_SHIFT) - 1);
}

@@ -257,7 +265,7 @@ unsigned long __init_refok init_memory_mapping(unsigned long start,
* nodes are discovered.
*/
if (!after_bootmem)
- find_early_table_space(&mr[0], end, use_pse, use_gbpages);
+ find_early_table_space(mr, nr_range);

for (i = 0; i < nr_range; i++)
ret = kernel_physical_mapping_init(mr[i].start, mr[i].end,
--
1.7.9.5

2012-08-14 22:40:24

by Jacob Shin

Subject: [PATCH 4/4] x86: Fixup code testing if a pfn is direct mapped

Update code that previously assumed pfns [ 0 - max_low_pfn_mapped ) and
[ 4GB - max_pfn_mapped ) were always direct mapped, to instead look up
the pfn_mapped ranges.

Signed-off-by: Jacob Shin <[email protected]>
---
arch/x86/kernel/cpu/amd.c | 6 +-----
arch/x86/platform/efi/efi.c | 8 ++++----
2 files changed, 5 insertions(+), 9 deletions(-)

diff --git a/arch/x86/kernel/cpu/amd.c b/arch/x86/kernel/cpu/amd.c
index 9d92e19..554ccfc 100644
--- a/arch/x86/kernel/cpu/amd.c
+++ b/arch/x86/kernel/cpu/amd.c
@@ -677,11 +677,7 @@ static void __cpuinit init_amd(struct cpuinfo_x86 *c)
*/
if (!rdmsrl_safe(MSR_K8_TSEG_ADDR, &tseg)) {
printk(KERN_DEBUG "tseg: %010llx\n", tseg);
- if ((tseg>>PMD_SHIFT) <
- (max_low_pfn_mapped>>(PMD_SHIFT-PAGE_SHIFT)) ||
- ((tseg>>PMD_SHIFT) <
- (max_pfn_mapped>>(PMD_SHIFT-PAGE_SHIFT)) &&
- (tseg>>PMD_SHIFT) >= (1ULL<<(32 - PMD_SHIFT))))
+ if (pfn_is_mapped(tseg))
set_memory_4k((unsigned long)__va(tseg), 1);
}
}
diff --git a/arch/x86/platform/efi/efi.c b/arch/x86/platform/efi/efi.c
index 2dc29f5..4810ab3 100644
--- a/arch/x86/platform/efi/efi.c
+++ b/arch/x86/platform/efi/efi.c
@@ -754,7 +754,7 @@ void __init efi_enter_virtual_mode(void)
efi_memory_desc_t *md, *prev_md = NULL;
efi_status_t status;
unsigned long size;
- u64 end, systab, addr, npages, end_pfn;
+ u64 end, systab, addr, npages, start_pfn, end_pfn;
void *p, *va, *new_memmap = NULL;
int count = 0;

@@ -805,10 +805,10 @@ void __init efi_enter_virtual_mode(void)
size = md->num_pages << EFI_PAGE_SHIFT;
end = md->phys_addr + size;

+ start_pfn = PFN_DOWN(md->phys_addr);
end_pfn = PFN_UP(end);
- if (end_pfn <= max_low_pfn_mapped
- || (end_pfn > (1UL << (32 - PAGE_SHIFT))
- && end_pfn <= max_pfn_mapped))
+
+ if (pfn_range_is_mapped(start_pfn, end_pfn))
va = __va(md->phys_addr);
else
va = efi_ioremap(md->phys_addr, size, md->type);
--
1.7.9.5

2012-08-22 23:31:04

by H. Peter Anvin

Subject: Re: [PATCH 3/4] x86: Only direct map addresses that are marked as E820_RAM

On 08/14/2012 03:39 PM, Jacob Shin wrote:
> Currently direct mappings are created for [ 0 to max_low_pfn<<PAGE_SHIFT )
> and [ 4GB to max_pfn<<PAGE_SHIFT ), which may include regions that are not
> backed by actual DRAM. This is fine for holes under 4GB which are covered
> by fixed and variable range MTRRs to be UC. However, we run into trouble
> on higher memory addresses which cannot be covered by MTRRs.
>
> Our system with 1TB of RAM has an e820 that looks like this:
>
> BIOS-e820: [mem 0x0000000000000000-0x00000000000983ff] usable
> BIOS-e820: [mem 0x0000000000098400-0x000000000009ffff] reserved
> BIOS-e820: [mem 0x00000000000d0000-0x00000000000fffff] reserved
> BIOS-e820: [mem 0x0000000000100000-0x00000000c7ebffff] usable
> BIOS-e820: [mem 0x00000000c7ec0000-0x00000000c7ed7fff] ACPI data
> BIOS-e820: [mem 0x00000000c7ed8000-0x00000000c7ed9fff] ACPI NVS
> BIOS-e820: [mem 0x00000000c7eda000-0x00000000c7ffffff] reserved
> BIOS-e820: [mem 0x00000000fec00000-0x00000000fec0ffff] reserved
> BIOS-e820: [mem 0x00000000fee00000-0x00000000fee00fff] reserved
> BIOS-e820: [mem 0x00000000fff00000-0x00000000ffffffff] reserved
> BIOS-e820: [mem 0x0000000100000000-0x000000e037ffffff] usable
> BIOS-e820: [mem 0x000000e038000000-0x000000fcffffffff] reserved
> BIOS-e820: [mem 0x0000010000000000-0x0000011ffeffffff] usable
>
> and so direct mappings are created for huge memory hole between
> 0x000000e038000000 to 0x0000010000000000. Even though the kernel never
> generates memory accesses in that region, since the page tables mark
> them incorrectly as being WB, our (AMD) processor ends up causing a MCE
> while doing some memory bookkeeping/optimizations around that area.
>
> This patch iterates through e820 and only direct maps ranges that are
> marked as E820_RAM, and keeps track of those pfn ranges. Depending on
> the alignment of E820 ranges, this may possibly result in using smaller
> size (i.e. 4K instead of 2M or 1G) page tables.
>
> Signed-off-by: Jacob Shin <[email protected]>

I have one concern with this, which is that it leaves in place the mapping
below the initial max_pfn_mapped. Although that neatly resolves the
legacy area (0-1 MiB) issues, it really isn't right above the 1 MiB
point. Any way I could get you to seek out and unmap any such ranges?
We have already seen some Dell machines which put memory holes in low
RAM, and perhaps there are still some machines out there with an I/O
hole at 15 MiB.

-hpa

2012-08-23 14:51:00

by Jacob Shin

Subject: Re: [PATCH 3/4] x86: Only direct map addresses that are marked as E820_RAM

On Wed, Aug 22, 2012 at 04:30:49PM -0700, H. Peter Anvin wrote:
> On 08/14/2012 03:39 PM, Jacob Shin wrote:
> > Currently direct mappings are created for [ 0 to max_low_pfn<<PAGE_SHIFT )
> > and [ 4GB to max_pfn<<PAGE_SHIFT ), which may include regions that are not
> > backed by actual DRAM. This is fine for holes under 4GB which are covered
> > by fixed and variable range MTRRs to be UC. However, we run into trouble
> > on higher memory addresses which cannot be covered by MTRRs.
> >
> > Our system with 1TB of RAM has an e820 that looks like this:
> >
> > BIOS-e820: [mem 0x0000000000000000-0x00000000000983ff] usable
> > BIOS-e820: [mem 0x0000000000098400-0x000000000009ffff] reserved
> > BIOS-e820: [mem 0x00000000000d0000-0x00000000000fffff] reserved
> > BIOS-e820: [mem 0x0000000000100000-0x00000000c7ebffff] usable
> > BIOS-e820: [mem 0x00000000c7ec0000-0x00000000c7ed7fff] ACPI data
> > BIOS-e820: [mem 0x00000000c7ed8000-0x00000000c7ed9fff] ACPI NVS
> > BIOS-e820: [mem 0x00000000c7eda000-0x00000000c7ffffff] reserved
> > BIOS-e820: [mem 0x00000000fec00000-0x00000000fec0ffff] reserved
> > BIOS-e820: [mem 0x00000000fee00000-0x00000000fee00fff] reserved
> > BIOS-e820: [mem 0x00000000fff00000-0x00000000ffffffff] reserved
> > BIOS-e820: [mem 0x0000000100000000-0x000000e037ffffff] usable
> > BIOS-e820: [mem 0x000000e038000000-0x000000fcffffffff] reserved
> > BIOS-e820: [mem 0x0000010000000000-0x0000011ffeffffff] usable
> >
> > and so direct mappings are created for huge memory hole between
> > 0x000000e038000000 to 0x0000010000000000. Even though the kernel never
> > generates memory accesses in that region, since the page tables mark
> > them incorrectly as being WB, our (AMD) processor ends up causing a MCE
> > while doing some memory bookkeeping/optimizations around that area.
> >
> > This patch iterates through e820 and only direct maps ranges that are
> > marked as E820_RAM, and keeps track of those pfn ranges. Depending on
> > the alignment of E820 ranges, this may possibly result in using smaller
> > size (i.e. 4K instead of 2M or 1G) page tables.
> >
> > Signed-off-by: Jacob Shin <[email protected]>
>
> I have one concern with this, which is that it leaves in place mapping
> below the initial max_pfn_mapped. Although that neatly resolves the
> legacy area (0-1 MiB) issues, it really isn't right above the 1 MiB
> point. Any way I could get you to seek out and unmap any such ranges?
> We have already seen some Dell machines which put memory holes in low
> RAM, and perhaps there are still some machines out there with an I/O
> hole at 15 MiB.

So I believe this was done in V2 of the patchset; however, Dave Young
from Red Hat reported that it broke their KVM guest with a user-supplied
memory map that looked like this:

>> [ 0.000000] e820: user-defined physical RAM map:
>> [ 0.000000] user: [mem 0x0000000000010000-0x000000000009dbff] usable
>> [ 0.000000] user: [mem 0x0000000024000000-0x0000000033f6bfff] usable

And looking into that scenario, the early boot code seems to allocate
space for the fixmap right under the initial max_pfn_mapped, which is no
longer direct mapped with my patch, and that seems to cause problems for
later APIC code that initializes the APIC base address into the fixmap
area.

So I guess to address your concern, we need to go back to V2 and try to
resolve the fixmap problem with user-supplied memory maps that reserve
memory below the initial max_pfn_mapped?

>
> -hpa
>
>

2012-08-23 15:39:28

by H. Peter Anvin

Subject: Re: [PATCH 3/4] x86: Only direct map addresses that are marked as E820_RAM

On 08/23/2012 07:50 AM, Jacob Shin wrote:
>>
>> I have one concern with this, which is that it leaves in place mapping
>> below the initial max_pfn_mapped. Although that neatly resolves the
>> legacy area (0-1 MiB) issues, it really isn't right above the 1 MiB
>> point. Any way I could get you to seek out and unmap any such ranges?
>> We have already seen some Dell machines which put memory holes in low
>> RAM, and perhaps there are still some machines out there with an I/O
>> hole at 15 MiB.
>
> So I believe in V2 of the patchset this was done, however, Dave Young
> from redhat reported that it broke their KVM guest with a user supplied
> memory map that looked like this:
>
>>> [ 0.000000] e820: user-defined physical RAM map:
>>> [ 0.000000] user: [mem 0x0000000000010000-0x000000000009dbff] usable
>>> [ 0.000000] user: [mem 0x0000000024000000-0x0000000033f6bfff] usable
>
> And looking into that scenario, the early boot code seems to allocates
> space for fixmap right under initial max_pfn_mapped, which is no longer
> direct mapped with my patch, and that seems to cause problems for later
> APIC code that initializes APIC base address into the fixmap area.
>
> So I guess to address your concern, we need to go back to V2 and try to
> resolve the fixmap problem with user supplied memory map that reserves
> memory below initial max_pfn_mapped ?
>

Okay... I think I need to grok that a bit better. For memory
allocations, we probably should just use brk allocations; for virtual
space allocations, it is called the fixmap for a reason (even though the
Xen people managed to break that on 32 bits, sigh!)

I guess I need to go back and look at David's bug report...

-hpa


--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel. I don't speak on their behalf.

2012-08-23 20:22:36

by Konrad Rzeszutek Wilk

Subject: Re: [PATCH 3/4] x86: Only direct map addresses that are marked as E820_RAM

On Thu, Aug 23, 2012 at 08:39:10AM -0700, H. Peter Anvin wrote:
> On 08/23/2012 07:50 AM, Jacob Shin wrote:
> >>
> >>I have one concern with this, which is that it leaves in place mapping
> >>below the initial max_pfn_mapped. Although that neatly resolves the
> >>legacy area (0-1 MiB) issues, it really isn't right above the 1 MiB
> >>point. Any way I could get you to seek out and unmap any such ranges?
> >>We have already seen some Dell machines which put memory holes in low
> >>RAM, and perhaps there are still some machines out there with an I/O
> >>hole at 15 MiB.
> >
> >So I believe in V2 of the patchset this was done, however, Dave Young
> >from redhat reported that it broke their KVM guest with a user supplied
> >memory map that looked like this:
> >
> >>>[ 0.000000] e820: user-defined physical RAM map:
> >>>[ 0.000000] user: [mem 0x0000000000010000-0x000000000009dbff] usable
> >>>[ 0.000000] user: [mem 0x0000000024000000-0x0000000033f6bfff] usable
> >
> >And looking into that scenario, the early boot code seems to allocates
> >space for fixmap right under initial max_pfn_mapped, which is no longer
> >direct mapped with my patch, and that seems to cause problems for later
> >APIC code that initializes APIC base address into the fixmap area.
> >
> >So I guess to address your concern, we need to go back to V2 and try to
> >resolve the fixmap problem with user supplied memory map that reserves
> >memory below initial max_pfn_mapped ?
> >
>
> Okay... I think I need to grok that a bit better. For memory
> allocations, we probably should just use brk allocations, for
> virtual space allocations it is called the fixmap for a reason (even
> though the Xen people managed to break that on 32 bits, sigh!)

Can you rope me in on that? Was that added, ooh, years ago?
>
> I guess I need to go back and look at David's bug report...
>
> -hpa
>
>
> --
> H. Peter Anvin, Intel Open Source Technology Center
> I work for Intel. I don't speak on their behalf.
>

2012-08-23 21:24:10

by H. Peter Anvin

Subject: Re: [PATCH 3/4] x86: Only direct map addresses that are marked as E820_RAM

On 08/23/2012 01:12 PM, Konrad Rzeszutek Wilk wrote:
>>
>> Okay... I think I need to grok that a bit better. For memory
>> allocations, we probably should just use brk allocations, for
>> virtual space allocations it is called the fixmap for a reason (even
>> though the Xen people managed to break that on 32 bits, sigh!)
>
> Can you rope me in on that? Was that added ooh years ago ?

Yeah, it dates back to the original paravirt work.

The issue is that Xen likes to live at the top of the virtual address
space, which is where the fixmap lives. Suddenly the fixmap is no
longer at a fixed address, which was sort of the whole idea.

The problem is that on i386 there are very few unmovable boundaries left
in the virtual address space. The only other one left, really, is at
the very beginning of kernel space.

At this point people's appetite for large changes on 32 bits is limited,
of course.

-hpa

2012-08-23 22:35:28

by Jacob Shin

Subject: Re: [PATCH 3/4] x86: Only direct map addresses that are marked as E820_RAM

On Thu, Aug 23, 2012 at 08:39:10AM -0700, H. Peter Anvin wrote:
> On 08/23/2012 07:50 AM, Jacob Shin wrote:
> >>
> >>I have one concern with this, which is that it leaves in place mapping
> >>below the initial max_pfn_mapped. Although that neatly resolves the
> >>legacy area (0-1 MiB) issues, it really isn't right above the 1 MiB
> >>point. Any way I could get you to seek out and unmap any such ranges?
> >>We have already seen some Dell machines which put memory holes in low
> >>RAM, and perhaps there are still some machines out there with an I/O
> >>hole at 15 MiB.
> >
> >So I believe in V2 of the patchset this was done, however, Dave Young
> >from redhat reported that it broke their KVM guest with a user supplied
> >memory map that looked like this:
> >
> >>>[ 0.000000] e820: user-defined physical RAM map:
> >>>[ 0.000000] user: [mem 0x0000000000010000-0x000000000009dbff] usable
> >>>[ 0.000000] user: [mem 0x0000000024000000-0x0000000033f6bfff] usable
> >

I looked into this a bit more, and I think what's happening is that this
user-defined memory map leaves out the region where the kernel is loaded
during the boot process. The kernel and the direct mapped page tables up
to the initial max_pfn_mapped reside somewhere under 512M
(KERNEL_IMAGE_SIZE); I guess it depends on how big your uncompressed
kernel is.

And the first attempt to set_fixmap_nocache(FIX_APIC_BASE, address) in
arch/x86/apic/apic.c:register_lapic_address() runs into badness, because
the memory region where the initial page tables live is no longer mapped
under the above user-supplied memory map.

So I guess there is a disconnect between the really early code, which
relies on the boot loader as to where in physical memory it resides and
where its initial page tables live, and the later memory initialization
code, which looks at the E820 (where the user can interject their own
memory map using command line arguments).

Not really sure how to handle this case... any advice?

2012-08-23 23:47:02

by Yinghai Lu

Subject: Re: [PATCH 3/4] x86: Only direct map addresses that are marked as E820_RAM

On Thu, Aug 23, 2012 at 3:35 PM, Jacob Shin <[email protected]> wrote:
>> >So I believe in V2 of the patchset this was done, however, Dave Young
>> >from redhat reported that it broke their KVM guest with a user supplied
>> >memory map that looked like this:
>> >
>> >>>[ 0.000000] e820: user-defined physical RAM map:
>> >>>[ 0.000000] user: [mem 0x0000000000010000-0x000000000009dbff] usable
>> >>>[ 0.000000] user: [mem 0x0000000024000000-0x0000000033f6bfff] usable
>> >
>
> I looked into this a bit more, and I think what's happening is that this
> user defined memory map leaves out the region where the kernel is loaded on
> to during the boot process. The kernel and the direct mapped page tables up
> to initial max_pfn_mapped reside somwhere under 512M (KERNEL_IMAGE_SIZE),
> I guess it depends on how big your uncompressed kernel is.
>
> And at the first attempt to set_fixmap_nocache(FIX_APIC_BASE, address) in
> arch/x86/apic/apic.c: register_lapic_address runs into badness because the
> memory region where the initial page tables live is no longer mapped
> because of the above user supplied memory map.
>
> So I guess there is a disconnect between really early code that seems to
> rely on the boot loader as to where in physical memory it resides and its
> initial page tables live, and the later memory initialization code where
> it looks at the E820 (and here user can interject their own memory map
> using the command line arguments)
>
> Not really sure how to handle this case .. any advice?

Maybe you can just put the kernel range back into the E820 as RAM?

diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index f4b9b80..e1a1f28 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -830,6 +830,9 @@ void __init setup_arch(char **cmdline_p)
insert_resource(&iomem_resource, &code_resource);
insert_resource(&iomem_resource, &data_resource);
insert_resource(&iomem_resource, &bss_resource);
+ e820_add_range(code_resource.start, code_resource.end - code_resource.start + 1, E820_RAM);
+ e820_add_range(data_resource.start, data_resource.end - data_resource.start + 1, E820_RAM);
+ e820_add_range(bss_resource.start, bss_resource.end - bss_resource.start + 1, E820_RAM);

trim_bios_range();
#ifdef CONFIG_X86_32

2012-08-23 23:54:13

by H. Peter Anvin

Subject: Re: [PATCH 3/4] x86: Only direct map addresses that are marked as E820_RAM

On 08/23/2012 03:35 PM, Jacob Shin wrote:
>
> I looked into this a bit more, and I think what's happening is that this
> user defined memory map leaves out the region where the kernel is loaded on
> to during the boot process. The kernel and the direct mapped page tables up
> to initial max_pfn_mapped reside somwhere under 512M (KERNEL_IMAGE_SIZE),
> I guess it depends on how big your uncompressed kernel is.
>
> And at the first attempt to set_fixmap_nocache(FIX_APIC_BASE, address) in
> arch/x86/apic/apic.c: register_lapic_address runs into badness because the
> memory region where the initial page tables live is no longer mapped
> because of the above user supplied memory map.
>
> So I guess there is a disconnect between really early code that seems to
> rely on the boot loader as to where in physical memory it resides and its
> initial page tables live, and the later memory initialization code where
> it looks at the E820 (and here user can interject their own memory map
> using the command line arguments)
>
> Not really sure how to handle this case .. any advice?
>

We have two options: one, scream really loud and die, assuming the
bootloader actually loaded us on top of non-memory and we're going to
die anyway; or, scream really loud but try to continue (i.e. override
the memory type). I would suggest doing the latter in the near term,
and shifting to the former a bit further down the line.

-hpa