2012-08-13 21:47:32

by Jacob Shin

Subject: [PATCH V2 0/5] x86: Create direct mappings for E820_RAM only

Currently kernel direct mappings are created for all pfns between
[ 0 to max_low_pfn ) and [ 4GB to max_pfn ). When we introduce memory
holes, we end up mapping memory ranges that are not backed by physical
DRAM. This is fine for lower memory addresses, which can be marked as UC
by fixed/variable range MTRRs; however, we run into trouble with high
addresses.

The following patchset creates direct mappings only for E820_RAM regions
between 0 ~ max_low_pfn and 4GB ~ max_pfn, and leaves non-E820_RAM regions
and memory holes unmapped.

This revision of the patchset attempts to resolve comments and concerns
from the following threads:

https://lkml.org/lkml/2012/8/11/95

and

https://lkml.org/lkml/2011/12/16/486

Jacob Shin (5):
x86: Only direct map addresses that are marked as E820_RAM
x86: find_early_table_space based on memory ranges that are being
mapped
x86: Keep track of direct mapped pfn ranges
x86: Fixup code testing if a pfn is direct mapped
x86: Move enabling of PSE and PGE out of init_memory_mapping

arch/x86/include/asm/page_types.h | 9 +++
arch/x86/kernel/amd_gart_64.c | 4 +-
arch/x86/kernel/cpu/amd.c | 6 +-
arch/x86/kernel/setup.c | 118 ++++++++++++++++++++++++++++++++-----
arch/x86/mm/init.c | 72 +++++++++++-----------
arch/x86/mm/init_64.c | 3 +-
arch/x86/platform/efi/efi.c | 8 +--
arch/x86/platform/efi/efi_64.c | 2 +
8 files changed, 157 insertions(+), 65 deletions(-)

--
1.7.9.5


2012-08-13 21:47:36

by Jacob Shin

Subject: [PATCH 5/5] x86: Move enabling of PSE and PGE out of init_memory_mapping

Since we now call init_memory_mapping for each E820_RAM region in a
loop, move cr4 writes out to setup_arch.

Signed-off-by: Jacob Shin <[email protected]>
---
arch/x86/kernel/setup.c | 10 ++++++++++
arch/x86/mm/init.c | 10 ----------
2 files changed, 10 insertions(+), 10 deletions(-)

diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index f71fa310..69b43f2 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -342,6 +342,16 @@ static void __init init_memory(void)

init_gbpages();

+ /* Enable PSE if available */
+ if (cpu_has_pse)
+ set_in_cr4(X86_CR4_PSE);
+
+ /* Enable PGE if available */
+ if (cpu_has_pge) {
+ set_in_cr4(X86_CR4_PGE);
+ __supported_pte_mask |= _PAGE_GLOBAL;
+ }
+
for (i = 0; i < e820.nr_map; i++) {
struct e820entry *ei = &e820.map[i];
u64 start = ei->addr;
diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
index df8baaa..e2b21e0 100644
--- a/arch/x86/mm/init.c
+++ b/arch/x86/mm/init.c
@@ -157,16 +157,6 @@ unsigned long __init_refok init_memory_mapping(unsigned long start,
use_gbpages = direct_gbpages;
#endif

- /* Enable PSE if available */
- if (cpu_has_pse)
- set_in_cr4(X86_CR4_PSE);
-
- /* Enable PGE if available */
- if (cpu_has_pge) {
- set_in_cr4(X86_CR4_PGE);
- __supported_pte_mask |= _PAGE_GLOBAL;
- }
-
if (use_gbpages)
page_size_mask |= 1 << PG_LEVEL_1G;
if (use_pse)
--
1.7.9.5

2012-08-13 21:47:34

by Jacob Shin

Subject: [PATCH 1/5] x86: Only direct map addresses that are marked as E820_RAM

Currently direct mappings are created for [ 0 to max_low_pfn<<PAGE_SHIFT )
and [ 4GB to max_pfn<<PAGE_SHIFT ), which may include regions that are not
backed by actual DRAM. This is fine for holes under 4GB which are covered
by fixed and variable range MTRRs to be UC. However, we run into trouble
on higher memory addresses which cannot be covered by MTRRs.

This patch iterates through e820 and only direct maps ranges that are
marked as E820_RAM, and keeps track of those pfn ranges.

Signed-off-by: Jacob Shin <[email protected]>
---
arch/x86/include/asm/page_types.h | 9 ++++
arch/x86/kernel/setup.c | 108 +++++++++++++++++++++++++++++++------
2 files changed, 101 insertions(+), 16 deletions(-)

diff --git a/arch/x86/include/asm/page_types.h b/arch/x86/include/asm/page_types.h
index e21fdd1..409047a 100644
--- a/arch/x86/include/asm/page_types.h
+++ b/arch/x86/include/asm/page_types.h
@@ -3,6 +3,7 @@

#include <linux/const.h>
#include <linux/types.h>
+#include <asm/e820.h>

/* PAGE_SHIFT determines the page size */
#define PAGE_SHIFT 12
@@ -40,12 +41,20 @@
#endif /* CONFIG_X86_64 */

#ifndef __ASSEMBLY__
+#include <linux/range.h>

extern int devmem_is_allowed(unsigned long pagenr);

extern unsigned long max_low_pfn_mapped;
extern unsigned long max_pfn_mapped;

+extern struct range pfn_mapped[E820_X_MAX];
+extern int nr_pfn_mapped;
+
+extern void add_pfn_range_mapped(unsigned long start_pfn, unsigned long end_pfn);
+extern bool pfn_range_is_mapped(unsigned long start_pfn, unsigned long end_pfn);
+extern bool pfn_is_mapped(unsigned long pfn);
+
static inline phys_addr_t get_max_mapped(void)
{
return (phys_addr_t)max_pfn_mapped << PAGE_SHIFT;
diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index f4b9b80..f71fa310 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -115,13 +115,46 @@
#include <asm/prom.h>

/*
- * end_pfn only includes RAM, while max_pfn_mapped includes all e820 entries.
- * The direct mapping extends to max_pfn_mapped, so that we can directly access
- * apertures, ACPI and other tables without having to play with fixmaps.
+ * max_low_pfn_mapped: highest direct mapped pfn under 4GB
+ * max_pfn_mapped: highest direct mapped pfn over 4GB
+ *
+ * The direct mapping only covers E820_RAM regions, so the ranges and gaps are
+ * represented by pfn_mapped
*/
unsigned long max_low_pfn_mapped;
unsigned long max_pfn_mapped;

+struct range pfn_mapped[E820_X_MAX];
+int nr_pfn_mapped;
+
+void add_pfn_range_mapped(unsigned long start_pfn, unsigned long end_pfn)
+{
+ nr_pfn_mapped = add_range_with_merge(pfn_mapped, E820_X_MAX,
+ nr_pfn_mapped, start_pfn, end_pfn);
+
+ max_pfn_mapped = max(max_pfn_mapped, end_pfn);
+
+ if (end_pfn <= (1UL << (32 - PAGE_SHIFT)))
+ max_low_pfn_mapped = max(max_low_pfn_mapped, end_pfn);
+}
+
+bool pfn_range_is_mapped(unsigned long start_pfn, unsigned long end_pfn)
+{
+ int i;
+
+ for (i = 0; i < nr_pfn_mapped; i++)
+ if ((start_pfn >= pfn_mapped[i].start) &&
+ (end_pfn <= pfn_mapped[i].end))
+ return true;
+
+ return false;
+}
+
+bool pfn_is_mapped(unsigned long pfn)
+{
+ return pfn_range_is_mapped(pfn, pfn + 1);
+}
+
#ifdef CONFIG_DMI
RESERVE_BRK(dmi_alloc, 65536);
#endif
@@ -296,6 +329,61 @@ static void __init cleanup_highmap(void)
}
#endif

+/*
+ * Iterate through E820 memory map and create direct mappings for only E820_RAM
+ * regions. We cannot simply create direct mappings for all pfns from
+ * [0 to max_low_pfn) and [4GB to max_pfn) because of possible memory holes in
+ * high addresses that cannot be marked as UC by fixed/variable range MTRRs.
+ */
+static void __init init_memory(void)
+{
+ int i;
+ unsigned long pfn;
+
+ init_gbpages();
+
+ for (i = 0; i < e820.nr_map; i++) {
+ struct e820entry *ei = &e820.map[i];
+ u64 start = ei->addr;
+ u64 end = ei->addr + ei->size;
+
+ /* we only map E820_RAM */
+ if (ei->type != E820_RAM)
+ continue;
+
+ /* except we need to ignore gaps under 1MB */
+ if (end <= ISA_END_ADDRESS)
+ continue;
+
+ /* expand the first entry that spans 1MB to start at 0 */
+ if (start <= ISA_END_ADDRESS)
+ start = 0;
+#ifdef CONFIG_X86_32
+ /* on 32 bit, we only map up to max_low_pfn */
+ if ((start >> PAGE_SHIFT) >= max_low_pfn)
+ continue;
+
+ if ((end >> PAGE_SHIFT) > max_low_pfn)
+ end = max_low_pfn << PAGE_SHIFT;
+#endif
+ pfn = init_memory_mapping(start, end);
+ add_pfn_range_mapped(start >> PAGE_SHIFT, pfn);
+ }
+
+ /* map 0 to 1MB if we haven't already */
+ if (!pfn_range_is_mapped(0, ISA_END_ADDRESS >> PAGE_SHIFT)) {
+ pfn = init_memory_mapping(0, ISA_END_ADDRESS);
+ add_pfn_range_mapped(0, pfn);
+ }
+
+#ifdef CONFIG_X86_64
+ if (max_pfn > max_low_pfn) {
+ /* can we preserve max_low_pfn ?*/
+ max_low_pfn = max_pfn;
+ }
+#endif
+}
+
static void __init reserve_brk(void)
{
if (_brk_end > _brk_start)
@@ -911,20 +999,8 @@ void __init setup_arch(char **cmdline_p)

setup_real_mode();

- init_gbpages();
-
- /* max_pfn_mapped is updated here */
- max_low_pfn_mapped = init_memory_mapping(0, max_low_pfn<<PAGE_SHIFT);
- max_pfn_mapped = max_low_pfn_mapped;
+ init_memory();

-#ifdef CONFIG_X86_64
- if (max_pfn > max_low_pfn) {
- max_pfn_mapped = init_memory_mapping(1UL<<32,
- max_pfn<<PAGE_SHIFT);
- /* can we preseve max_low_pfn ?*/
- max_low_pfn = max_pfn;
- }
-#endif
memblock.current_limit = get_max_mapped();
dma_contiguous_reserve(0);

--
1.7.9.5

2012-08-13 21:47:30

by Jacob Shin

Subject: [PATCH 3/5] x86: Keep track of direct mapped pfn ranges

Update later calls to init_memory_mapping to keep track of direct mapped
pfn ranges so that at any point in time we can accurately represent what
memory ranges are direct mapped or not.

Signed-off-by: Jacob Shin <[email protected]>
---
arch/x86/kernel/amd_gart_64.c | 4 +++-
arch/x86/mm/init_64.c | 3 +--
arch/x86/platform/efi/efi_64.c | 2 ++
3 files changed, 6 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kernel/amd_gart_64.c b/arch/x86/kernel/amd_gart_64.c
index e663112..5ac26b9 100644
--- a/arch/x86/kernel/amd_gart_64.c
+++ b/arch/x86/kernel/amd_gart_64.c
@@ -770,7 +770,9 @@ int __init gart_iommu_init(void)

if (end_pfn > max_low_pfn_mapped) {
start_pfn = (aper_base>>PAGE_SHIFT);
- init_memory_mapping(start_pfn<<PAGE_SHIFT, end_pfn<<PAGE_SHIFT);
+ end_pfn = init_memory_mapping(start_pfn<<PAGE_SHIFT,
+ end_pfn<<PAGE_SHIFT);
+ add_pfn_range_mapped(start_pfn, end_pfn);
}

pr_info("PCI-DMA: using GART IOMMU.\n");
diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index 2b6b4a3..cbed965 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -662,8 +662,7 @@ int arch_add_memory(int nid, u64 start, u64 size)
int ret;

last_mapped_pfn = init_memory_mapping(start, start + size);
- if (last_mapped_pfn > max_pfn_mapped)
- max_pfn_mapped = last_mapped_pfn;
+ add_pfn_range_mapped(start_pfn, last_mapped_pfn);

ret = __add_pages(nid, zone, start_pfn, nr_pages);
WARN_ON_ONCE(ret);
diff --git a/arch/x86/platform/efi/efi_64.c b/arch/x86/platform/efi/efi_64.c
index ac3aa54..e822c89 100644
--- a/arch/x86/platform/efi/efi_64.c
+++ b/arch/x86/platform/efi/efi_64.c
@@ -90,6 +90,8 @@ void __iomem *__init efi_ioremap(unsigned long phys_addr, unsigned long size,
return ioremap(phys_addr, size);

last_map_pfn = init_memory_mapping(phys_addr, phys_addr + size);
+ add_pfn_range_mapped(phys_addr >> PAGE_SHIFT, last_map_pfn);
+
if ((last_map_pfn << PAGE_SHIFT) < phys_addr + size) {
unsigned long top = last_map_pfn << PAGE_SHIFT;
efi_ioremap(top, size - (top - phys_addr), type);
--
1.7.9.5

2012-08-13 21:48:26

by Jacob Shin

Subject: [PATCH 4/5] x86: Fixup code testing if a pfn is direct mapped

Update code that previously assumed pfns [ 0 - max_low_pfn_mapped ) and
[ 4GB - max_pfn_mapped ) were always direct mapped, to now look up
pfn_mapped ranges instead.

Signed-off-by: Jacob Shin <[email protected]>
---
arch/x86/kernel/cpu/amd.c | 6 +-----
arch/x86/platform/efi/efi.c | 8 ++++----
2 files changed, 5 insertions(+), 9 deletions(-)

diff --git a/arch/x86/kernel/cpu/amd.c b/arch/x86/kernel/cpu/amd.c
index 9d92e19..554ccfc 100644
--- a/arch/x86/kernel/cpu/amd.c
+++ b/arch/x86/kernel/cpu/amd.c
@@ -677,11 +677,7 @@ static void __cpuinit init_amd(struct cpuinfo_x86 *c)
*/
if (!rdmsrl_safe(MSR_K8_TSEG_ADDR, &tseg)) {
printk(KERN_DEBUG "tseg: %010llx\n", tseg);
- if ((tseg>>PMD_SHIFT) <
- (max_low_pfn_mapped>>(PMD_SHIFT-PAGE_SHIFT)) ||
- ((tseg>>PMD_SHIFT) <
- (max_pfn_mapped>>(PMD_SHIFT-PAGE_SHIFT)) &&
- (tseg>>PMD_SHIFT) >= (1ULL<<(32 - PMD_SHIFT))))
+ if (pfn_is_mapped(tseg >> PAGE_SHIFT))
set_memory_4k((unsigned long)__va(tseg), 1);
}
}
diff --git a/arch/x86/platform/efi/efi.c b/arch/x86/platform/efi/efi.c
index 2dc29f5..4810ab3 100644
--- a/arch/x86/platform/efi/efi.c
+++ b/arch/x86/platform/efi/efi.c
@@ -754,7 +754,7 @@ void __init efi_enter_virtual_mode(void)
efi_memory_desc_t *md, *prev_md = NULL;
efi_status_t status;
unsigned long size;
- u64 end, systab, addr, npages, end_pfn;
+ u64 end, systab, addr, npages, start_pfn, end_pfn;
void *p, *va, *new_memmap = NULL;
int count = 0;

@@ -805,10 +805,10 @@ void __init efi_enter_virtual_mode(void)
size = md->num_pages << EFI_PAGE_SHIFT;
end = md->phys_addr + size;

+ start_pfn = PFN_DOWN(md->phys_addr);
end_pfn = PFN_UP(end);
- if (end_pfn <= max_low_pfn_mapped
- || (end_pfn > (1UL << (32 - PAGE_SHIFT))
- && end_pfn <= max_pfn_mapped))
+
+ if (pfn_range_is_mapped(start_pfn, end_pfn))
va = __va(md->phys_addr);
else
va = efi_ioremap(md->phys_addr, size, md->type);
--
1.7.9.5

2012-08-13 21:48:24

by Jacob Shin

Subject: [PATCH 2/5] x86: find_early_table_space based on memory ranges that are being mapped

The current logic finds enough space for the page tables needed to cover
0 to end. Instead, we only need to find enough space to cover mr[0].start
to mr[nr_range - 1].end.

Signed-off-by: Jacob Shin <[email protected]>
---
arch/x86/mm/init.c | 62 +++++++++++++++++++++++++++++-----------------------
1 file changed, 35 insertions(+), 27 deletions(-)

diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
index e0e6990..df8baaa 100644
--- a/arch/x86/mm/init.c
+++ b/arch/x86/mm/init.c
@@ -35,40 +35,48 @@ struct map_range {
unsigned page_size_mask;
};

-static void __init find_early_table_space(struct map_range *mr, unsigned long end,
- int use_pse, int use_gbpages)
+/*
+ * First calculate space needed for kernel direct mapping page tables to cover
+ * mr[0].start to mr[nr_range - 1].end, while accounting for possible 2M and 1GB
+ * pages. Then find enough contiguous space for those page tables.
+ */
+static void __init find_early_table_space(struct map_range *mr, int nr_range)
{
- unsigned long puds, pmds, ptes, tables, start = 0, good_end = end;
+ int i;
+ unsigned long puds = 0, pmds = 0, ptes = 0, tables;
+ unsigned long start = 0, good_end;
phys_addr_t base;

- puds = (end + PUD_SIZE - 1) >> PUD_SHIFT;
- tables = roundup(puds * sizeof(pud_t), PAGE_SIZE);
-
- if (use_gbpages) {
- unsigned long extra;
-
- extra = end - ((end>>PUD_SHIFT) << PUD_SHIFT);
- pmds = (extra + PMD_SIZE - 1) >> PMD_SHIFT;
- } else
- pmds = (end + PMD_SIZE - 1) >> PMD_SHIFT;
+ for (i = 0; i < nr_range; i++) {
+ unsigned long range, extra;

- tables += roundup(pmds * sizeof(pmd_t), PAGE_SIZE);
+ range = mr[i].end - mr[i].start;
+ puds += (range + PUD_SIZE - 1) >> PUD_SHIFT;

- if (use_pse) {
- unsigned long extra;
+ if (mr[i].page_size_mask & (1 << PG_LEVEL_1G)) {
+ extra = range - ((range >> PUD_SHIFT) << PUD_SHIFT);
+ pmds += (extra + PMD_SIZE - 1) >> PMD_SHIFT;
+ } else {
+ pmds += (range + PMD_SIZE - 1) >> PMD_SHIFT;
+ }

- extra = end - ((end>>PMD_SHIFT) << PMD_SHIFT);
+ if (mr[i].page_size_mask & (1 << PG_LEVEL_2M)) {
+ extra = range - ((range >> PMD_SHIFT) << PMD_SHIFT);
#ifdef CONFIG_X86_32
- extra += PMD_SIZE;
+ extra += PMD_SIZE;
#endif
- /* The first 2/4M doesn't use large pages. */
- if (mr->start < PMD_SIZE)
- extra += mr->end - mr->start;
-
- ptes = (extra + PAGE_SIZE - 1) >> PAGE_SHIFT;
- } else
- ptes = (end + PAGE_SIZE - 1) >> PAGE_SHIFT;
+ /* The first 2/4M doesn't use large pages. */
+ if (mr[i].start < PMD_SIZE)
+ extra += range;
+
+ ptes += (extra + PAGE_SIZE - 1) >> PAGE_SHIFT;
+ } else {
+ ptes += (range + PAGE_SIZE - 1) >> PAGE_SHIFT;
+ }
+ }

+ tables = roundup(puds * sizeof(pud_t), PAGE_SIZE);
+ tables += roundup(pmds * sizeof(pmd_t), PAGE_SIZE);
tables += roundup(ptes * sizeof(pte_t), PAGE_SIZE);

#ifdef CONFIG_X86_32
@@ -86,7 +94,7 @@ static void __init find_early_table_space(struct map_range *mr, unsigned long en
pgt_buf_top = pgt_buf_start + (tables >> PAGE_SHIFT);

printk(KERN_DEBUG "kernel direct mapping tables up to %#lx @ [mem %#010lx-%#010lx]\n",
- end - 1, pgt_buf_start << PAGE_SHIFT,
+ mr[nr_range - 1].end - 1, pgt_buf_start << PAGE_SHIFT,
(pgt_buf_top << PAGE_SHIFT) - 1);
}

@@ -267,7 +275,7 @@ unsigned long __init_refok init_memory_mapping(unsigned long start,
* nodes are discovered.
*/
if (!after_bootmem)
- find_early_table_space(&mr[0], end, use_pse, use_gbpages);
+ find_early_table_space(mr, nr_range);

for (i = 0; i < nr_range; i++)
ret = kernel_physical_mapping_init(mr[i].start, mr[i].end,
--
1.7.9.5

2012-08-13 21:58:35

by Tejun Heo

Subject: Re: [PATCH 1/5] x86: Only direct map addresses that are marked as E820_RAM

Hello,

On Mon, Aug 13, 2012 at 04:47:00PM -0500, Jacob Shin wrote:
> Currently direct mappings are created for [ 0 to max_low_pfn<<PAGE_SHIFT )
> and [ 4GB to max_pfn<<PAGE_SHIFT ), which may include regions that are not
> backed by actual DRAM. This is fine for holes under 4GB which are covered
> by fixed and variable range MTRRs to be UC. However, we run into trouble
> on higher memory addresses which cannot be covered by MTRRs.

I presume one of the problems is the mysterious reboot on S4 resume?
Please be a bit more detailed. Let's say someone discovers a
performance regression on an obscure machine, say, two years from now,
which isn't too crazy given how enterprises roll. Somebody bisects it
to this commit. Then what? It's very difficult to assess whether the
said "problem" is something which we should avoid at the cost of the
regression or it was just something somebody thought might be a
problem and created the patch assuming the change wouldn't affect
anything.

So, *please* explain what the problems are, preferably with
LKML-References or links to bugzilla bugs if there are any.

> This patch iterates through e820 and only direct maps ranges that are
> marked as E820_RAM, and keeps track of those pfn ranges.

Also, please mention the possibility of using smaller size memory
mappings if e820 didn't align physical memory to GB boundary.

Thanks.

--
tejun

2012-08-13 22:00:32

by H. Peter Anvin

Subject: Re: [PATCH 1/5] x86: Only direct map addresses that are marked as E820_RAM

On 08/13/2012 02:58 PM, Tejun Heo wrote:
>
> Also, please mention the possibility of using smaller size memory
> mappings if e820 didn't align physical memory to GB boundary.
>

... as it generally won't.

-hpa

2012-08-13 22:09:19

by Tejun Heo

Subject: Re: [PATCH 3/5] x86: Keep track of direct mapped pfn ranges

On Mon, Aug 13, 2012 at 04:47:02PM -0500, Jacob Shin wrote:
> Update later calls to init_memory_mapping to keep track of direct mapped
> pfn ranges so that at any point in time we can accurately represent what
> memory ranges are direct mapped or not.

Maybe we want to roll add_pfn_range_mapped() call into
init_memory_mapping()?

Thanks.

--
tejun

2012-08-13 23:02:30

by Tejun Heo

Subject: Re: [PATCH 5/5] x86: Move enabling of PSE and PGE out of init_memory_mapping

On Mon, Aug 13, 2012 at 04:47:04PM -0500, Jacob Shin wrote:
> Since we now call init_memory_mapping for each E820_RAM region in a
> loop, move cr4 writes out to setup_arch.

Wouldn't it be better if this happens *before* init_memory_mapping()
is called multiple times?

Thanks.

--
tejun

2012-08-14 08:38:35

by Dave Young

Subject: Re: [PATCH V2 0/5] x86: Create direct mappings for E820_RAM only

On 08/14/2012 05:46 AM, Jacob Shin wrote:

> Currently kernel direct mappings are created for all pfns between
> [ 0 to max_low_pfn ) and [ 4GB to max_pfn ). When we introduce memory
> holes, we end up mapping memory ranges that are not backed by physical
> DRAM. This is fine for lower memory addresses which can be marked as UC
> by fixed/variable range MTRRs; however, we run into trouble with high
> addresses.
>
> The following patchset creates direct mappings only for E820_RAM regions
> between 0 ~ max_low_pfn and 4GB ~ max_pfn, and leaves non-E820_RAM regions
> and memory holes unmapped.


Hi,

Chaowang did some kdump testing in a KVM guest with this patchset. The 2nd
kernel just reboots after some ACPI printk output; see the dmesg of the 2nd
kernel below:

After a crash:
[snip]
I'm in purgatory
[ 0.000000] Initializing cgroup subsys cpuset
[ 0.000000] Initializing cgroup subsys cpu
[ 0.000000] Linux version 3.5.0-0.23.el7.bz846531.x86_64 (mockbuild@) (gcc version 4.7.1 20120720 (Red Hat 4.7.1-5) (GCC) ) #1 SMP Mon Aug 13 22:17:46 EDT 2012
[ 0.000000] Command line: BOOT_IMAGE=/vmlinuz-3.5.0-0.23.el7.bz846531.x86_64 root=/dev/mapper/vg_none-lv_root ro rd.md=0 rd.lvm.lv=vg_none/lv_swap rd.dm=0 rd.lvm.lv=vg_none/lv_root rd.luks=0 LANG=en_US.UTF-8 console=ttyS0,115200 SYSFONT=True KEYTABLE=us earlyprintk=serial irqpoll nr_cpus=1 reset_devices cgroup_disable=memory mce=off earlyprintk=serial memmap=exactmap memmap=567K@64K memmap=261552K@589824K elfcorehdr=851376K
[ 0.000000] e820: BIOS-provided physical RAM map:
[ 0.000000] BIOS-e820: [mem 0x0000000000000100-0x000000000009dbff] usable
[ 0.000000] BIOS-e820: [mem 0x000000000009dc00-0x000000000009ffff] reserved
[ 0.000000] BIOS-e820: [mem 0x00000000000f0000-0x00000000000fffff] reserved
[ 0.000000] BIOS-e820: [mem 0x0000000000100000-0x000000003fffdfff] usable
[ 0.000000] BIOS-e820: [mem 0x000000003fffe000-0x000000003fffffff] reserved
[ 0.000000] BIOS-e820: [mem 0x00000000feffc000-0x00000000feffffff] reserved
[ 0.000000] BIOS-e820: [mem 0x00000000fffc0000-0x00000000ffffffff] reserved
[ 0.000000] bootconsole [earlyser0] enabled
[ 0.000000] ERROR: earlyprintk= earlyser already used
[ 0.000000] e820: last_pfn = 0x3fffe max_arch_pfn = 0x400000000
[ 0.000000] NX (Execute Disable) protection: active
[ 0.000000] e820: user-defined physical RAM map:
[ 0.000000] user: [mem 0x0000000000010000-0x000000000009dbff] usable
[ 0.000000] user: [mem 0x0000000024000000-0x0000000033f6bfff] usable
[ 0.000000] DMI 2.4 present.
[ 0.000000] No AGP bridge found
[ 0.000000] e820: last_pfn = 0x33f6c max_arch_pfn = 0x400000000
[ 0.000000] PAT not supported by CPU.
[ 0.000000] found SMP MP-table at [mem 0x000fdae0-0x000fdaef] mapped at [ffff8800000fdae0]
[ 0.000000] init_memory_mapping: [mem 0x24000000-0x33f6bfff]
[ 0.000000] init_memory_mapping: [mem 0x00000000-0x000fffff]
[ 0.000000] RAMDISK: [mem 0x3378a000-0x33f58fff]
[ 0.000000] ACPI: RSDP 00000000000fd980 00014 (v00 BOCHS )
[ 0.000000] ACPI: RSDT 000000003fffe5b0 00038 (v01 BOCHS BXPCRSDT 00000001 BXPC 00000001)
[ 0.000000] ACPI: FACP 000000003fffff80 00074 (v01 BOCHS BXPCFACP 00000001 BXPC 00000001)
[ 0.000000] ACPI: DSDT 000000003fffe5f0 01121 (v01 BXPC BXDSDT 00000001 INTL 20100528)
[ 0.000000] ACPI: FACS 000000003fffff40 00040
[ 0.000000] ACPI: SSDT 000000003ffffea0 0009E (v01 BOCHS BXPCSSDT 00000001 BXPC 00000001)
[ 0.000000] ACPI: APIC 000000003ffffdb0 00078 (v01 BOCHS BXPCAPIC 00000001 BXPC 00000001)
[ 0.000000] ACPI: HPET 000000003ffffd70 00038 (v01 BOCHS BXPCHPET 00000001 BXPC 00000001)
[ 0.000000] ACPI: SSDT 000000003ffff720 00644 (v01 BXPC BXSSDTPC 00000001 INTL 20100528)

====2nd kernel reboot here=====


>
> This revision of the patchset attempts to resolve comments and concerns
> from the following threads:
>
> https://lkml.org/lkml/2012/8/11/95
>
> and
>
> https://lkml.org/lkml/2011/12/16/486
>
> Jacob Shin (5):
> x86: Only direct map addresses that are marked as E820_RAM
> x86: find_early_table_space based on memory ranges that are being
> mapped
> x86: Keep track of direct mapped pfn ranges
> x86: Fixup code testing if a pfn is direct mapped
> x86: Move enabling of PSE and PGE out of init_memory_mapping
>
> arch/x86/include/asm/page_types.h | 9 +++
> arch/x86/kernel/amd_gart_64.c | 4 +-
> arch/x86/kernel/cpu/amd.c | 6 +-
> arch/x86/kernel/setup.c | 118 ++++++++++++++++++++++++++++++++-----
> arch/x86/mm/init.c | 72 +++++++++++-----------
> arch/x86/mm/init_64.c | 3 +-
> arch/x86/platform/efi/efi.c | 8 +--
> arch/x86/platform/efi/efi_64.c | 2 +
> 8 files changed, 157 insertions(+), 65 deletions(-)
>



--
Thanks
Dave

2012-08-14 09:10:01

by Dave Young

Subject: Re: [PATCH V2 0/5] x86: Create direct mappings for E820_RAM only

On 08/14/2012 04:34 PM, Dave Young wrote:

> On 08/14/2012 05:46 AM, Jacob Shin wrote:
>
>> Currently kernel direct mappings are created for all pfns between
>> [ 0 to max_low_pfn ) and [ 4GB to max_pfn ). When we introduce memory
>> holes, we end up mapping memory ranges that are not backed by physical
>> DRAM. This is fine for lower memory addresses which can be marked as UC
>> by fixed/variable range MTRRs; however, we run into trouble with high
>> addresses.
>>
>> The following patchset creates direct mappings only for E820_RAM regions
>> between 0 ~ max_low_pfn and 4GB ~ max_pfn, and leaves non-E820_RAM regions
>> and memory holes unmapped.
>
>
> Hi,
>
> Chaowang did some kdump testing in a KVM guest with this patchset. The 2nd
> kernel just reboots after some ACPI printk output; see the dmesg of the 2nd
> kernel below:

>

> After a crash:
> [snip]
> I'm in purgatory
> [ 0.000000] Initializing cgroup subsys cpuset
> [ 0.000000] Initializing cgroup subsys cpu
> [ 0.000000] Linux version 3.5.0-0.23.el7.bz846531.x86_64 (mockbuild@)
> (gcc version 4.7.1 20120720 (Red Hat 4.7.1-5) (GCC) ) #1 SMP Mon Aug 13
> 22:17:46 EDT 2012
> [ 0.000000] Command line:
> BOOT_IMAGE=/vmlinuz-3.5.0-0.23.el7.bz846531.x86_64
> root=/dev/mapper/vg_none-lv_root ro rd.md=0 rd.lvm.lv=vg_none/lv_swap
> rd.dm=0 rd.lvm.lv=vg_none/lv_root rd.luks=0 LANG=en_US.UTF-8
> console=ttyS0,115200 SYSFONT=True KEYTABLE=us earlyprintk=serial
> irqpoll nr_cpus=1 reset_devices cgroup_disable=memory mce=off
> earlyprintk=serial memmap=exactmap memmap=567K@64K
> memmap=261552K@589824K elfcorehdr=851376K
> [ 0.000000] e820: BIOS-provided physical RAM map:
> [ 0.000000] BIOS-e820: [mem 0x0000000000000100-0x000000000009dbff] usable
> [ 0.000000] BIOS-e820: [mem 0x000000000009dc00-0x000000000009ffff]
> reserved
> [ 0.000000] BIOS-e820: [mem 0x00000000000f0000-0x00000000000fffff]
> reserved
> [ 0.000000] BIOS-e820: [mem 0x0000000000100000-0x000000003fffdfff] usable
> [ 0.000000] BIOS-e820: [mem 0x000000003fffe000-0x000000003fffffff]
> reserved
> [ 0.000000] BIOS-e820: [mem 0x00000000feffc000-0x00000000feffffff]
> reserved
> [ 0.000000] BIOS-e820: [mem 0x00000000fffc0000-0x00000000ffffffff]
> reserved
> [ 0.000000] bootconsole [earlyser0] enabled
> [ 0.000000] ERROR: earlyprintk= earlyser already used
> [ 0.000000] e820: last_pfn = 0x3fffe max_arch_pfn = 0x400000000
> [ 0.000000] NX (Execute Disable) protection: active
> [ 0.000000] e820: user-defined physical RAM map:
> [ 0.000000] user: [mem 0x0000000000010000-0x000000000009dbff] usable
> [ 0.000000] user: [mem 0x0000000024000000-0x0000000033f6bfff] usable
> [ 0.000000] DMI 2.4 present.
> [ 0.000000] No AGP bridge found
> [ 0.000000] e820: last_pfn = 0x33f6c max_arch_pfn = 0x400000000
> [ 0.000000] PAT not supported by CPU.
> [ 0.000000] found SMP MP-table at [mem 0x000fdae0-0x000fdaef] mapped
> at [ffff8800000fdae0]
> [ 0.000000] init_memory_mapping: [mem 0x24000000-0x33f6bfff]
> [ 0.000000] init_memory_mapping: [mem 0x00000000-0x000fffff]
> [ 0.000000] RAMDISK: [mem 0x3378a000-0x33f58fff]
> [ 0.000000] ACPI: RSDP 00000000000fd980 00014 (v00 BOCHS )
> [ 0.000000] ACPI: RSDT 000000003fffe5b0 00038 (v01 BOCHS BXPCRSDT
> 00000001 BXPC 00000001)
> [ 0.000000] ACPI: FACP 000000003fffff80 00074 (v01 BOCHS BXPCFACP
> 00000001 BXPC 00000001)
> [ 0.000000] ACPI: DSDT 000000003fffe5f0 01121 (v01 BXPC BXDSDT
> 00000001 INTL 20100528)
> [ 0.000000] ACPI: FACS 000000003fffff40 00040
> [ 0.000000] ACPI: SSDT 000000003ffffea0 0009E (v01 BOCHS BXPCSSDT
> 00000001 BXPC 00000001)
> [ 0.000000] ACPI: APIC 000000003ffffdb0 00078 (v01 BOCHS BXPCAPIC
> 00000001 BXPC 00000001)
> [ 0.000000] ACPI: HPET 000000003ffffd70 00038 (v01 BOCHS BXPCHPET
> 00000001 BXPC 00000001)
> [ 0.000000] ACPI: SSDT 000000003ffff720 00644 (v01 BXPC BXSSDTPC
> 00000001 INTL 20100528)
>
> ====2nd kernel reboot here=====


The above is copied from a RHEL7 3.5.0 test kernel; with Linus' tree plus
this patch set, the output is nearly the same.

>
>
>>
>> This revision of the patchset attempts to resolve comments and concerns
>> from the following threads:
>>
>> https://lkml.org/lkml/2012/8/11/95
>>
>> and
>>
>> https://lkml.org/lkml/2011/12/16/486
>>
>> Jacob Shin (5):
>> x86: Only direct map addresses that are marked as E820_RAM
>> x86: find_early_table_space based on memory ranges that are being
>> mapped
>> x86: Keep track of direct mapped pfn ranges
>> x86: Fixup code testing if a pfn is direct mapped
>> x86: Move enabling of PSE and PGE out of init_memory_mapping
>>
>> arch/x86/include/asm/page_types.h | 9 +++
>> arch/x86/kernel/amd_gart_64.c | 4 +-
>> arch/x86/kernel/cpu/amd.c | 6 +-
>> arch/x86/kernel/setup.c | 118 ++++++++++++++++++++++++++++++++-----
>> arch/x86/mm/init.c | 72 +++++++++++-----------
>> arch/x86/mm/init_64.c | 3 +-
>> arch/x86/platform/efi/efi.c | 8 +--
>> arch/x86/platform/efi/efi_64.c | 2 +
>> 8 files changed, 157 insertions(+), 65 deletions(-)
>>
>
>
>



--
Thanks
Dave

2012-08-14 22:54:30

by Jacob Shin

Subject: Re: [PATCH V2 0/5] x86: Create direct mappings for E820_RAM only

On Tue, Aug 14, 2012 at 04:34:39PM +0800, Dave Young wrote:
> On 08/14/2012 05:46 AM, Jacob Shin wrote:
>
> > Currently kernel direct mappings are created for all pfns between
> > [ 0 to max_low_pfn ) and [ 4GB to max_pfn ). When we introduce memory
> > holes, we end up mapping memory ranges that are not backed by physical
> > DRAM. This is fine for lower memory addresses which can be marked as UC
> > by fixed/variable range MTRRs; however, we run into trouble with high
> > addresses.
> >
> > The following patchset creates direct mappings only for E820_RAM regions
> > between 0 ~ max_low_pfn and 4GB ~ max_pfn, and leaves non-E820_RAM regions
> > and memory holes unmapped.
>
>
> Hi,
>
> Chaowang did some kdump testing in a KVM guest with this patchset. The 2nd
> kernel just reboots after some ACPI printk output; see the dmesg of the 2nd
> kernel below:

Hello, thanks for testing. I have not tested under KVM, and I also have not
tested passing in user-supplied memory maps as your kernel log suggests.

Looking into this, it seems like we get a page fault while trying to set up
the fixmap for the APIC. I think the fixmap is set up even before we get to
setup_arch(), and it is sitting in memory that is not marked as usable by
your user-supplied e820.

Could you give V3 a try? I just sent it out a minute ago. This version
won't try to remap what has already been mapped as part of the boot process
before we get to setup_arch(); it'll just take what it's given.

>
> After a crash:
> [snip]
> I'm in purgatory
> [ 0.000000] Initializing cgroup subsys cpuset
> [ 0.000000] Initializing cgroup subsys cpu
> [ 0.000000] Linux version 3.5.0-0.23.el7.bz846531.x86_64 (mockbuild@)
> (gcc version 4.7.1 20120720 (Red Hat 4.7.1-5) (GCC) ) #1 SMP Mon Aug 13
> 22:17:46 EDT 2012
> [ 0.000000] Command line:
> BOOT_IMAGE=/vmlinuz-3.5.0-0.23.el7.bz846531.x86_64
> root=/dev/mapper/vg_none-lv_root ro rd.md=0 rd.lvm.lv=vg_none/lv_swap
> rd.dm=0 rd.lvm.lv=vg_none/lv_root rd.luks=0 LANG=en_US.UTF-8
> console=ttyS0,115200 SYSFONT=True KEYTABLE=us earlyprintk=serial
> irqpoll nr_cpus=1 reset_devices cgroup_disable=memory mce=off
> earlyprintk=serial memmap=exactmap memmap=567K@64K
> memmap=261552K@589824K elfcorehdr=851376K
> [ 0.000000] e820: BIOS-provided physical RAM map:
> [ 0.000000] BIOS-e820: [mem 0x0000000000000100-0x000000000009dbff] usable
> [ 0.000000] BIOS-e820: [mem 0x000000000009dc00-0x000000000009ffff]
> reserved
> [ 0.000000] BIOS-e820: [mem 0x00000000000f0000-0x00000000000fffff]
> reserved
> [ 0.000000] BIOS-e820: [mem 0x0000000000100000-0x000000003fffdfff] usable
> [ 0.000000] BIOS-e820: [mem 0x000000003fffe000-0x000000003fffffff]
> reserved
> [ 0.000000] BIOS-e820: [mem 0x00000000feffc000-0x00000000feffffff]
> reserved
> [ 0.000000] BIOS-e820: [mem 0x00000000fffc0000-0x00000000ffffffff]
> reserved
> [ 0.000000] bootconsole [earlyser0] enabled
> [ 0.000000] ERROR: earlyprintk= earlyser already used
> [ 0.000000] e820: last_pfn = 0x3fffe max_arch_pfn = 0x400000000
> [ 0.000000] NX (Execute Disable) protection: active
> [ 0.000000] e820: user-defined physical RAM map:
> [ 0.000000] user: [mem 0x0000000000010000-0x000000000009dbff] usable
> [ 0.000000] user: [mem 0x0000000024000000-0x0000000033f6bfff] usable
> [ 0.000000] DMI 2.4 present.
> [ 0.000000] No AGP bridge found
> [ 0.000000] e820: last_pfn = 0x33f6c max_arch_pfn = 0x400000000
> [ 0.000000] PAT not supported by CPU.
> [ 0.000000] found SMP MP-table at [mem 0x000fdae0-0x000fdaef] mapped
> at [ffff8800000fdae0]
> [ 0.000000] init_memory_mapping: [mem 0x24000000-0x33f6bfff]
> [ 0.000000] init_memory_mapping: [mem 0x00000000-0x000fffff]
> [ 0.000000] RAMDISK: [mem 0x3378a000-0x33f58fff]
> [ 0.000000] ACPI: RSDP 00000000000fd980 00014 (v00 BOCHS )
> [ 0.000000] ACPI: RSDT 000000003fffe5b0 00038 (v01 BOCHS BXPCRSDT
> 00000001 BXPC 00000001)
> [ 0.000000] ACPI: FACP 000000003fffff80 00074 (v01 BOCHS BXPCFACP
> 00000001 BXPC 00000001)
> [ 0.000000] ACPI: DSDT 000000003fffe5f0 01121 (v01 BXPC BXDSDT
> 00000001 INTL 20100528)
> [ 0.000000] ACPI: FACS 000000003fffff40 00040
> [ 0.000000] ACPI: SSDT 000000003ffffea0 0009E (v01 BOCHS BXPCSSDT
> 00000001 BXPC 00000001)
> [ 0.000000] ACPI: APIC 000000003ffffdb0 00078 (v01 BOCHS BXPCAPIC
> 00000001 BXPC 00000001)
> [ 0.000000] ACPI: HPET 000000003ffffd70 00038 (v01 BOCHS BXPCHPET
> 00000001 BXPC 00000001)
> [ 0.000000] ACPI: SSDT 000000003ffff720 00644 (v01 BXPC BXSSDTPC
> 00000001 INTL 20100528)
>
> ====2nd kernel reboot here=====
>
>
> >
> > This revision of the patchset attempts to resolve comments and concerns
> > from the following threads:
> >
> > https://lkml.org/lkml/2012/8/11/95
> >
> > and
> >
> > https://lkml.org/lkml/2011/12/16/486
> >
> > Jacob Shin (5):
> > x86: Only direct map addresses that are marked as E820_RAM
> > x86: find_early_table_space based on memory ranges that are being
> > mapped
> > x86: Keep track of direct mapped pfn ranges
> > x86: Fixup code testing if a pfn is direct mapped
> > x86: Move enabling of PSE and PGE out of init_memory_mapping
> >
> > arch/x86/include/asm/page_types.h | 9 +++
> > arch/x86/kernel/amd_gart_64.c | 4 +-
> > arch/x86/kernel/cpu/amd.c | 6 +-
> > arch/x86/kernel/setup.c | 118 ++++++++++++++++++++++++++++++++-----
> > arch/x86/mm/init.c | 72 +++++++++++-----------
> > arch/x86/mm/init_64.c | 3 +-
> > arch/x86/platform/efi/efi.c | 8 +--
> > arch/x86/platform/efi/efi_64.c | 2 +
> > 8 files changed, 157 insertions(+), 65 deletions(-)
> >
>
>
>
> --
> Thanks
> Dave
>

2012-08-15 05:52:06

by WANG Chao

Subject: Re: [PATCH V2 0/5] x86: Create direct mappings for E820_RAM only

On 08/15/2012 06:54 AM, Jacob Shin wrote:
> On Tue, Aug 14, 2012 at 04:34:39PM +0800, Dave Young wrote:
>> On 08/14/2012 05:46 AM, Jacob Shin wrote:
>>
>>> Currently kernel direct mappings are created for all pfns between
>>> [ 0 to max_low_pfn ) and [ 4GB to max_pfn ). When we introduce memory
>>> holes, we end up mapping memory ranges that are not backed by physical
>>> DRAM. This is fine for lower memory addresses which can be marked as UC
>>> by fixed/variable range MTRRs, however we run into trouble with high
>>> addresses.
>>>
>>> The following patchset creates direct mappings only for E820_RAM regions
>>> between 0 ~ max_low_pfn and 4GB ~ max_pfn. And leaves non-E820_RAM and
>>> memory holes unmapped.
>>
>>
>> Hi,
>>
>> Chaowang did some kdump testing in a KVM guest with this patchset. The 2nd
>> kernel just reboots after some ACPI printk output; see the 2nd kernel's
>> dmesg below:
>
> Hello, thanks for testing, since I have not tested under KVM. I also have
> not tested passing in user-supplied memory maps, as your kernel log suggests.
>
> Looking into this, it seems like we get a page fault while trying to set up
> fixmap for the APIC. I think the fixmap is set up even before we get to
> setup_arch(), and it is sitting in memory that is not marked as usable by
> your user-supplied e820.
>
> Could you give V3 a try? I just sent it out a minute ago. This version
> won't try to remap what has already been mapped as part of the boot process
> before we get to setup_arch(); it'll just take what it's given.
>

Hi, Jacob

I just tried the V3 patchset in my x86_64 KVM guest. It booted
successfully and the issue mentioned above is gone.

-WANG Chao