2012-08-09 21:24:37

by Jacob Shin

Subject: [PATCH 0/5] Only create direct mappings for E820_RAM regions

This is a revision of an earlier attempt, with suggestions and concerns
from the previous conversation (https://lkml.org/lkml/2011/12/16/486) taken
into account.

The following patchset creates direct mappings only for E820_RAM regions
between 0 ~ max_low_pfn and 4GB ~ max_pfn, and leaves non-E820_RAM regions
unmapped.
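
In outline, the mapping step in setup_arch() becomes the following (a
condensed sketch of patch 1; the clamping against max_low_pfn and the
already-mapped low range is elided here):

        for (i = 0; i < e820.nr_map; i++) {
                struct e820entry *ei = &e820.map[i];

                /* skip holes, reserved, ACPI and other non-RAM entries */
                if (ei->type != E820_RAM)
                        continue;

                pfn = init_memory_mapping(ei->addr, ei->addr + ei->size);
                add_pfn_range_mapped(ei->addr >> PAGE_SHIFT, pfn);
        }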

Jacob Shin (5):
x86: Only direct map addresses that are marked as E820_RAM
x86: find_early_table_space based on memory ranges that are being
mapped
x86: Keep track of direct mapped pfn ranges
x86: Fixup code testing if a pfn is direct mapped
x86: Move enabling of PSE and PGE out of init_memory_mapping

arch/x86/include/asm/page_types.h | 9 ++++
arch/x86/kernel/amd_gart_64.c | 4 +-
arch/x86/kernel/cpu/amd.c | 6 +--
arch/x86/kernel/setup.c | 97 ++++++++++++++++++++++++++++++++++---
arch/x86/mm/init.c | 67 ++++++++++++-------------
arch/x86/mm/init_64.c | 3 +-
arch/x86/platform/efi/efi.c | 8 +--
arch/x86/platform/efi/efi_64.c | 2 +
8 files changed, 139 insertions(+), 57 deletions(-)

--
1.7.9.5


2012-08-09 21:23:34

by Jacob Shin

Subject: [PATCH 3/5] x86: Keep track of direct mapped pfn ranges

Update later calls to init_memory_mapping to keep track of direct mapped
pfn ranges.

Signed-off-by: Jacob Shin <[email protected]>
---
arch/x86/kernel/amd_gart_64.c | 4 +++-
arch/x86/mm/init_64.c | 3 +--
arch/x86/platform/efi/efi_64.c | 2 ++
3 files changed, 6 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kernel/amd_gart_64.c b/arch/x86/kernel/amd_gart_64.c
index e663112..5ac26b9 100644
--- a/arch/x86/kernel/amd_gart_64.c
+++ b/arch/x86/kernel/amd_gart_64.c
@@ -770,7 +770,9 @@ int __init gart_iommu_init(void)

if (end_pfn > max_low_pfn_mapped) {
start_pfn = (aper_base>>PAGE_SHIFT);
- init_memory_mapping(start_pfn<<PAGE_SHIFT, end_pfn<<PAGE_SHIFT);
+ end_pfn = init_memory_mapping(start_pfn<<PAGE_SHIFT,
+ end_pfn<<PAGE_SHIFT);
+ add_pfn_range_mapped(start_pfn, end_pfn);
}

pr_info("PCI-DMA: using GART IOMMU.\n");
diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index 2b6b4a3..cbed965 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -662,8 +662,7 @@ int arch_add_memory(int nid, u64 start, u64 size)
int ret;

last_mapped_pfn = init_memory_mapping(start, start + size);
- if (last_mapped_pfn > max_pfn_mapped)
- max_pfn_mapped = last_mapped_pfn;
+ add_pfn_range_mapped(start_pfn, last_mapped_pfn);

ret = __add_pages(nid, zone, start_pfn, nr_pages);
WARN_ON_ONCE(ret);
diff --git a/arch/x86/platform/efi/efi_64.c b/arch/x86/platform/efi/efi_64.c
index ac3aa54..e822c89 100644
--- a/arch/x86/platform/efi/efi_64.c
+++ b/arch/x86/platform/efi/efi_64.c
@@ -90,6 +90,8 @@ void __iomem *__init efi_ioremap(unsigned long phys_addr, unsigned long size,
return ioremap(phys_addr, size);

last_map_pfn = init_memory_mapping(phys_addr, phys_addr + size);
+ add_pfn_range_mapped(phys_addr >> PAGE_SHIFT, last_map_pfn);
+
if ((last_map_pfn << PAGE_SHIFT) < phys_addr + size) {
unsigned long top = last_map_pfn << PAGE_SHIFT;
efi_ioremap(top, size - (top - phys_addr), type);
--
1.7.9.5

2012-08-09 21:23:32

by Jacob Shin

Subject: [PATCH 5/5] x86: Move enabling of PSE and PGE out of init_memory_mapping

Since we now call init_memory_mapping for each E820_RAM region in a
loop, move cr4 writes out to setup_arch.

Signed-off-by: Jacob Shin <[email protected]>
---
arch/x86/kernel/setup.c | 10 ++++++++++
arch/x86/mm/init.c | 10 ----------
2 files changed, 10 insertions(+), 10 deletions(-)

diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index 4f26944..5dfeb8f 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -958,6 +958,16 @@ void __init setup_arch(char **cmdline_p)

init_gbpages();

+ /* Enable PSE if available */
+ if (cpu_has_pse)
+ set_in_cr4(X86_CR4_PSE);
+
+ /* Enable PGE if available */
+ if (cpu_has_pge) {
+ set_in_cr4(X86_CR4_PGE);
+ __supported_pte_mask |= _PAGE_GLOBAL;
+ }
+
init_pfn = max_pfn_mapped;

memset(pfn_mapped, 0, sizeof(pfn_mapped));
diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
index d4e01a7..99f111e 100644
--- a/arch/x86/mm/init.c
+++ b/arch/x86/mm/init.c
@@ -152,16 +152,6 @@ unsigned long __init_refok init_memory_mapping(unsigned long start,
use_gbpages = direct_gbpages;
#endif

- /* Enable PSE if available */
- if (cpu_has_pse)
- set_in_cr4(X86_CR4_PSE);
-
- /* Enable PGE if available */
- if (cpu_has_pge) {
- set_in_cr4(X86_CR4_PGE);
- __supported_pte_mask |= _PAGE_GLOBAL;
- }
-
if (use_gbpages)
page_size_mask |= 1 << PG_LEVEL_1G;
if (use_pse)
--
1.7.9.5

2012-08-09 21:23:29

by Jacob Shin

Subject: [PATCH 2/5] x86: find_early_table_space based on memory ranges that are being mapped

Current logic sizes the page-table allocation to cover everything from 0 to
end. Instead, we only need enough space to cover from mr[0].start to
mr[nr_range - 1].end.
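
As a worked example (numbers hypothetical, not from this patch): a single
map_range covering [ 4GB, 5GB ) with 2M pages on 64-bit contributes

        range = mr[0].end - mr[0].start;                /* 0x40000000 = 1GB */
        puds += (range + PUD_SIZE - 1) >> PUD_SHIFT;    /* 1 pud entry */
        pmds += (range + PMD_SIZE - 1) >> PMD_SHIFT;    /* 512 pmd entries */
        /* range is 2M-aligned and starts above PMD_SIZE, so ptes += 0 */

        tables  = roundup(1 * sizeof(pud_t), PAGE_SIZE);        /* 4K */
        tables += roundup(512 * sizeof(pmd_t), PAGE_SIZE);      /* 4K */

i.e. about 8K of table space, where the old code would have sized the
allocation for the entire 0 to 5GB span.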

Signed-off-by: Jacob Shin <[email protected]>
---
arch/x86/mm/init.c | 57 +++++++++++++++++++++++++++-------------------------
1 file changed, 30 insertions(+), 27 deletions(-)

diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
index e0e6990..d4e01a7 100644
--- a/arch/x86/mm/init.c
+++ b/arch/x86/mm/init.c
@@ -35,40 +35,43 @@ struct map_range {
unsigned page_size_mask;
};

-static void __init find_early_table_space(struct map_range *mr, unsigned long end,
- int use_pse, int use_gbpages)
+static void __init find_early_table_space(struct map_range *mr, int nr_range)
{
- unsigned long puds, pmds, ptes, tables, start = 0, good_end = end;
+ int i;
+ unsigned long puds = 0, pmds = 0, ptes = 0, tables;
+ unsigned long start = 0, good_end;
phys_addr_t base;

- puds = (end + PUD_SIZE - 1) >> PUD_SHIFT;
- tables = roundup(puds * sizeof(pud_t), PAGE_SIZE);
-
- if (use_gbpages) {
- unsigned long extra;
+ for (i = 0; i < nr_range; i++) {
+ unsigned long range, extra;

- extra = end - ((end>>PUD_SHIFT) << PUD_SHIFT);
- pmds = (extra + PMD_SIZE - 1) >> PMD_SHIFT;
- } else
- pmds = (end + PMD_SIZE - 1) >> PMD_SHIFT;
-
- tables += roundup(pmds * sizeof(pmd_t), PAGE_SIZE);
+ range = mr[i].end - mr[i].start;
+ puds += (range + PUD_SIZE - 1) >> PUD_SHIFT;

- if (use_pse) {
- unsigned long extra;
+ if (mr[i].page_size_mask & (1 << PG_LEVEL_1G)) {
+ extra = range - ((range >> PUD_SHIFT) << PUD_SHIFT);
+ pmds += (extra + PMD_SIZE - 1) >> PMD_SHIFT;
+ } else {
+ pmds += (range + PMD_SIZE - 1) >> PMD_SHIFT;
+ }

- extra = end - ((end>>PMD_SHIFT) << PMD_SHIFT);
+ if (mr[i].page_size_mask & (1 << PG_LEVEL_2M)) {
+ extra = range - ((range >> PMD_SHIFT) << PMD_SHIFT);
#ifdef CONFIG_X86_32
- extra += PMD_SIZE;
+ extra += PMD_SIZE;
#endif
- /* The first 2/4M doesn't use large pages. */
- if (mr->start < PMD_SIZE)
- extra += mr->end - mr->start;
-
- ptes = (extra + PAGE_SIZE - 1) >> PAGE_SHIFT;
- } else
- ptes = (end + PAGE_SIZE - 1) >> PAGE_SHIFT;
+ /* The first 2/4M doesn't use large pages. */
+ if (mr[i].start < PMD_SIZE)
+ extra += range;
+
+ ptes += (extra + PAGE_SIZE - 1) >> PAGE_SHIFT;
+ } else {
+ ptes += (range + PAGE_SIZE - 1) >> PAGE_SHIFT;
+ }
+ }

+ tables = roundup(puds * sizeof(pud_t), PAGE_SIZE);
+ tables += roundup(pmds * sizeof(pmd_t), PAGE_SIZE);
tables += roundup(ptes * sizeof(pte_t), PAGE_SIZE);

#ifdef CONFIG_X86_32
@@ -86,7 +89,7 @@ static void __init find_early_table_space(struct map_range *mr, unsigned long en
pgt_buf_top = pgt_buf_start + (tables >> PAGE_SHIFT);

printk(KERN_DEBUG "kernel direct mapping tables up to %#lx @ [mem %#010lx-%#010lx]\n",
- end - 1, pgt_buf_start << PAGE_SHIFT,
+ mr[nr_range - 1].end - 1, pgt_buf_start << PAGE_SHIFT,
(pgt_buf_top << PAGE_SHIFT) - 1);
}

@@ -267,7 +270,7 @@ unsigned long __init_refok init_memory_mapping(unsigned long start,
* nodes are discovered.
*/
if (!after_bootmem)
- find_early_table_space(&mr[0], end, use_pse, use_gbpages);
+ find_early_table_space(mr, nr_range);

for (i = 0; i < nr_range; i++)
ret = kernel_physical_mapping_init(mr[i].start, mr[i].end,
--
1.7.9.5

2012-08-09 21:24:04

by Jacob Shin

Subject: [PATCH 1/5] x86: Only direct map addresses that are marked as E820_RAM

Currently direct mappings are created for [ 0 to max_low_pfn<<PAGE_SHIFT )
and [ 4GB to max_pfn<<PAGE_SHIFT ), which may include regions that are not
backed by actual DRAM. This is fine for holes under 4GB, which the fixed and
variable range MTRRs cover as UC. However, we run into trouble at higher
memory addresses, which cannot be covered by MTRRs.

This patch iterates through e820 and only direct maps ranges that are
marked as E820_RAM, and keeps track of those pfn ranges.

Signed-off-by: Jacob Shin <[email protected]>
---
arch/x86/include/asm/page_types.h | 9 ++++
arch/x86/kernel/setup.c | 87 +++++++++++++++++++++++++++++++++----
2 files changed, 88 insertions(+), 8 deletions(-)

diff --git a/arch/x86/include/asm/page_types.h b/arch/x86/include/asm/page_types.h
index e21fdd1..0b8aa52 100644
--- a/arch/x86/include/asm/page_types.h
+++ b/arch/x86/include/asm/page_types.h
@@ -3,6 +3,7 @@

#include <linux/const.h>
#include <linux/types.h>
+#include <asm/e820.h>

/* PAGE_SHIFT determines the page size */
#define PAGE_SHIFT 12
@@ -40,12 +41,20 @@
#endif /* CONFIG_X86_64 */

#ifndef __ASSEMBLY__
+#include <linux/range.h>

extern int devmem_is_allowed(unsigned long pagenr);

extern unsigned long max_low_pfn_mapped;
extern unsigned long max_pfn_mapped;

+extern struct range pfn_mapped[E820_X_MAX];
+extern int nr_pfn_mapped;
+
+extern void add_pfn_range_mapped(unsigned long start_pfn, unsigned long end_pfn);
+extern int pfn_range_is_mapped(unsigned long start_pfn, unsigned long end_pfn);
+extern int pfn_is_mapped(unsigned long pfn);
+
static inline phys_addr_t get_max_mapped(void)
{
return (phys_addr_t)max_pfn_mapped << PAGE_SHIFT;
diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index f4b9b80..4f26944 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -115,13 +115,55 @@
#include <asm/prom.h>

/*
- * end_pfn only includes RAM, while max_pfn_mapped includes all e820 entries.
- * The direct mapping extends to max_pfn_mapped, so that we can directly access
- * apertures, ACPI and other tables without having to play with fixmaps.
+ * max_low_pfn_mapped: highest direct mapped pfn under 4GB
+ * max_pfn_mapped: highest direct mapped pfn over 4GB
+ *
+ * The direct mapping only covers E820_RAM regions, so the ranges and gaps are
+ * represented by pfn_mapped
*/
unsigned long max_low_pfn_mapped;
unsigned long max_pfn_mapped;

+struct range pfn_mapped[E820_X_MAX];
+int nr_pfn_mapped;
+
+void add_pfn_range_mapped(unsigned long start_pfn, unsigned long end_pfn)
+{
+ nr_pfn_mapped = add_range_with_merge(pfn_mapped, E820_X_MAX,
+ nr_pfn_mapped, start_pfn, end_pfn);
+
+ if (end_pfn > max_pfn_mapped)
+ max_pfn_mapped = end_pfn;
+
+ if ((end_pfn <= (1UL << (32 - PAGE_SHIFT))) &&
+ (end_pfn > max_low_pfn_mapped))
+ max_low_pfn_mapped = end_pfn;
+}
+
+int pfn_range_is_mapped(unsigned long start_pfn, unsigned long end_pfn)
+{
+ int i;
+
+ for (i = 0; i < nr_pfn_mapped; i++)
+ if ((start_pfn >= pfn_mapped[i].start) &&
+ (end_pfn <= pfn_mapped[i].end))
+ break;
+
+ return i < nr_pfn_mapped;
+}
+
+int pfn_is_mapped(unsigned long pfn)
+{
+ int i;
+
+ for (i = 0; i < nr_pfn_mapped; i++)
+ if ((pfn >= pfn_mapped[i].start) &&
+ (pfn < pfn_mapped[i].end))
+ break;
+
+ return i < nr_pfn_mapped;
+}
+
#ifdef CONFIG_DMI
RESERVE_BRK(dmi_alloc, 65536);
#endif
@@ -673,6 +715,9 @@ early_param("reservelow", parse_reservelow);

void __init setup_arch(char **cmdline_p)
{
+ int i;
+ unsigned long init_pfn, pfn;
+
#ifdef CONFIG_X86_32
memcpy(&boot_cpu_data, &new_cpu_data, sizeof(new_cpu_data));
visws_early_detect();
@@ -913,14 +958,40 @@ void __init setup_arch(char **cmdline_p)

init_gbpages();

- /* max_pfn_mapped is updated here */
- max_low_pfn_mapped = init_memory_mapping(0, max_low_pfn<<PAGE_SHIFT);
- max_pfn_mapped = max_low_pfn_mapped;
+ init_pfn = max_pfn_mapped;
+
+ memset(pfn_mapped, 0, sizeof(pfn_mapped));
+ nr_pfn_mapped = 0;
+
+ add_pfn_range_mapped(0, max_pfn_mapped);
+
+ for (i = 0; i < e820.nr_map; i++) {
+ struct e820entry *ei = &e820.map[i];
+ u64 start = ei->addr;
+ u64 end = ei->addr + ei->size;
+
+ if (ei->type != E820_RAM)
+ continue;
+
+ if (end <= (init_pfn << PAGE_SHIFT))
+ continue;
+
+ if (start < (init_pfn << PAGE_SHIFT))
+ start = init_pfn << PAGE_SHIFT;
+
+#ifdef CONFIG_X86_32
+ if ((start >> PAGE_SHIFT) >= max_low_pfn)
+ continue;
+
+ if ((end >> PAGE_SHIFT) > max_low_pfn)
+ end = max_low_pfn << PAGE_SHIFT;
+#endif
+ pfn = init_memory_mapping(start, end);
+ add_pfn_range_mapped(start >> PAGE_SHIFT, pfn);
+ }

#ifdef CONFIG_X86_64
if (max_pfn > max_low_pfn) {
- max_pfn_mapped = init_memory_mapping(1UL<<32,
- max_pfn<<PAGE_SHIFT);
/* can we preseve max_low_pfn ?*/
max_low_pfn = max_pfn;
}
--
1.7.9.5

2012-08-09 21:24:01

by Jacob Shin

Subject: [PATCH 4/5] x86: Fixup code testing if a pfn is direct mapped

Update code that previously assumed pfns [ 0, max_low_pfn_mapped ) and
[ 4GB, max_pfn_mapped ) were always direct mapped, to instead look up the
tracked pfn_mapped ranges.

Signed-off-by: Jacob Shin <[email protected]>
---
arch/x86/kernel/cpu/amd.c | 6 +-----
arch/x86/platform/efi/efi.c | 8 ++++----
2 files changed, 5 insertions(+), 9 deletions(-)

diff --git a/arch/x86/kernel/cpu/amd.c b/arch/x86/kernel/cpu/amd.c
index 9d92e19..554ccfc 100644
--- a/arch/x86/kernel/cpu/amd.c
+++ b/arch/x86/kernel/cpu/amd.c
@@ -677,11 +677,7 @@ static void __cpuinit init_amd(struct cpuinfo_x86 *c)
*/
if (!rdmsrl_safe(MSR_K8_TSEG_ADDR, &tseg)) {
printk(KERN_DEBUG "tseg: %010llx\n", tseg);
- if ((tseg>>PMD_SHIFT) <
- (max_low_pfn_mapped>>(PMD_SHIFT-PAGE_SHIFT)) ||
- ((tseg>>PMD_SHIFT) <
- (max_pfn_mapped>>(PMD_SHIFT-PAGE_SHIFT)) &&
- (tseg>>PMD_SHIFT) >= (1ULL<<(32 - PMD_SHIFT))))
+ if (pfn_is_mapped(tseg))
set_memory_4k((unsigned long)__va(tseg), 1);
}
}
diff --git a/arch/x86/platform/efi/efi.c b/arch/x86/platform/efi/efi.c
index 2dc29f5..4810ab3 100644
--- a/arch/x86/platform/efi/efi.c
+++ b/arch/x86/platform/efi/efi.c
@@ -754,7 +754,7 @@ void __init efi_enter_virtual_mode(void)
efi_memory_desc_t *md, *prev_md = NULL;
efi_status_t status;
unsigned long size;
- u64 end, systab, addr, npages, end_pfn;
+ u64 end, systab, addr, npages, start_pfn, end_pfn;
void *p, *va, *new_memmap = NULL;
int count = 0;

@@ -805,10 +805,10 @@ void __init efi_enter_virtual_mode(void)
size = md->num_pages << EFI_PAGE_SHIFT;
end = md->phys_addr + size;

+ start_pfn = PFN_DOWN(md->phys_addr);
end_pfn = PFN_UP(end);
- if (end_pfn <= max_low_pfn_mapped
- || (end_pfn > (1UL << (32 - PAGE_SHIFT))
- && end_pfn <= max_pfn_mapped))
+
+ if (pfn_range_is_mapped(start_pfn, end_pfn))
va = __va(md->phys_addr);
else
va = efi_ioremap(md->phys_addr, size, md->type);
--
1.7.9.5

2012-08-09 22:03:52

by Yinghai Lu

Subject: Re: [PATCH 1/5] x86: Only direct map addresses that are marked as E820_RAM

On Thu, Aug 9, 2012 at 2:23 PM, Jacob Shin <[email protected]> wrote:
> Currently direct mappings are created for [ 0 to max_low_pfn<<PAGE_SHIFT )
> and [ 4GB to max_pfn<<PAGE_SHIFT ), which may include regions that are not
> backed by actual DRAM. This is fine for holes under 4GB, which the fixed and
> variable range MTRRs cover as UC. However, we run into trouble at higher
> memory addresses, which cannot be covered by MTRRs.
>
> This patch iterates through e820 and only direct maps ranges that are
> marked as E820_RAM, and keeps track of those pfn ranges.
>
> Signed-off-by: Jacob Shin <[email protected]>
> ---
> arch/x86/include/asm/page_types.h | 9 ++++
> arch/x86/kernel/setup.c | 87 +++++++++++++++++++++++++++++++++----
> 2 files changed, 88 insertions(+), 8 deletions(-)
>
> diff --git a/arch/x86/include/asm/page_types.h b/arch/x86/include/asm/page_types.h
> index e21fdd1..0b8aa52 100644
> --- a/arch/x86/include/asm/page_types.h
> +++ b/arch/x86/include/asm/page_types.h
> @@ -3,6 +3,7 @@
>
> #include <linux/const.h>
> #include <linux/types.h>
> +#include <asm/e820.h>
>
> /* PAGE_SHIFT determines the page size */
> #define PAGE_SHIFT 12
> @@ -40,12 +41,20 @@
> #endif /* CONFIG_X86_64 */
>
> #ifndef __ASSEMBLY__
> +#include <linux/range.h>
>
> extern int devmem_is_allowed(unsigned long pagenr);
>
> extern unsigned long max_low_pfn_mapped;
> extern unsigned long max_pfn_mapped;
>
> +extern struct range pfn_mapped[E820_X_MAX];
> +extern int nr_pfn_mapped;
> +
> +extern void add_pfn_range_mapped(unsigned long start_pfn, unsigned long end_pfn);
> +extern int pfn_range_is_mapped(unsigned long start_pfn, unsigned long end_pfn);
> +extern int pfn_is_mapped(unsigned long pfn);
> +
> static inline phys_addr_t get_max_mapped(void)
> {
> return (phys_addr_t)max_pfn_mapped << PAGE_SHIFT;
> diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
> index f4b9b80..4f26944 100644
> --- a/arch/x86/kernel/setup.c
> +++ b/arch/x86/kernel/setup.c
> @@ -115,13 +115,55 @@
> #include <asm/prom.h>
>
> /*
> - * end_pfn only includes RAM, while max_pfn_mapped includes all e820 entries.
> - * The direct mapping extends to max_pfn_mapped, so that we can directly access
> - * apertures, ACPI and other tables without having to play with fixmaps.
> + * max_low_pfn_mapped: highest direct mapped pfn under 4GB
> + * max_pfn_mapped: highest direct mapped pfn over 4GB
> + *
> + * The direct mapping only covers E820_RAM regions, so the ranges and gaps are
> + * represented by pfn_mapped
> */
> unsigned long max_low_pfn_mapped;
> unsigned long max_pfn_mapped;
>
> +struct range pfn_mapped[E820_X_MAX];
> +int nr_pfn_mapped;
> +
> +void add_pfn_range_mapped(unsigned long start_pfn, unsigned long end_pfn)
> +{
> + nr_pfn_mapped = add_range_with_merge(pfn_mapped, E820_X_MAX,
> + nr_pfn_mapped, start_pfn, end_pfn);
> +
> + if (end_pfn > max_pfn_mapped)
> + max_pfn_mapped = end_pfn;
> +
> + if ((end_pfn <= (1UL << (32 - PAGE_SHIFT))) &&
> + (end_pfn > max_low_pfn_mapped))
> + max_low_pfn_mapped = end_pfn;
> +}
> +
> +int pfn_range_is_mapped(unsigned long start_pfn, unsigned long end_pfn)
> +{
> + int i;
> +
> + for (i = 0; i < nr_pfn_mapped; i++)
> + if ((start_pfn >= pfn_mapped[i].start) &&
> + (end_pfn <= pfn_mapped[i].end))
> + break;
> +
> + return i < nr_pfn_mapped;
> +}
> +
> +int pfn_is_mapped(unsigned long pfn)
> +{
> + int i;
> +
> + for (i = 0; i < nr_pfn_mapped; i++)
> + if ((pfn >= pfn_mapped[i].start) &&
> + (pfn < pfn_mapped[i].end))
> + break;
> +
> + return i < nr_pfn_mapped;
> +}
> +
> #ifdef CONFIG_DMI
> RESERVE_BRK(dmi_alloc, 65536);
> #endif
> @@ -673,6 +715,9 @@ early_param("reservelow", parse_reservelow);
>
> void __init setup_arch(char **cmdline_p)
> {
> + int i;
> + unsigned long init_pfn, pfn;
> +
> #ifdef CONFIG_X86_32
> memcpy(&boot_cpu_data, &new_cpu_data, sizeof(new_cpu_data));
> visws_early_detect();
> @@ -913,14 +958,40 @@ void __init setup_arch(char **cmdline_p)
>
> init_gbpages();
>
> - /* max_pfn_mapped is updated here */
> - max_low_pfn_mapped = init_memory_mapping(0, max_low_pfn<<PAGE_SHIFT);
> - max_pfn_mapped = max_low_pfn_mapped;
> + init_pfn = max_pfn_mapped;
> +
> + memset(pfn_mapped, 0, sizeof(pfn_mapped));
> + nr_pfn_mapped = 0;
> +
> + add_pfn_range_mapped(0, max_pfn_mapped);
> +
> + for (i = 0; i < e820.nr_map; i++) {
> + struct e820entry *ei = &e820.map[i];
> + u64 start = ei->addr;
> + u64 end = ei->addr + ei->size;
> +
> + if (ei->type != E820_RAM)
> + continue;
> +
> + if (end <= (init_pfn << PAGE_SHIFT))
> + continue;
> +
> + if (start < (init_pfn << PAGE_SHIFT))
> + start = init_pfn << PAGE_SHIFT;
> +
> +#ifdef CONFIG_X86_32
> + if ((start >> PAGE_SHIFT) >= max_low_pfn)
> + continue;
> +
> + if ((end >> PAGE_SHIFT) > max_low_pfn)
> + end = max_low_pfn << PAGE_SHIFT;
> +#endif
> + pfn = init_memory_mapping(start, end);
> + add_pfn_range_mapped(start >> PAGE_SHIFT, pfn);
> + }

can you put those lines in another function?

setup_arch is way too big now.
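
For instance — helper name hypothetical — something like:

        static void __init init_mem_mapping(void)
        {
                int i;
                unsigned long init_pfn = max_pfn_mapped, pfn;

                add_pfn_range_mapped(0, max_pfn_mapped);

                for (i = 0; i < e820.nr_map; i++) {
                        /* the E820_RAM walk from this patch, unchanged */
                }
        }

with setup_arch() just calling init_mem_mapping() after init_gbpages().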

Also, add tj to the CC list, because last time my separating the page
tables for every node upset him.

Thanks

Yinghai Lu

>
> #ifdef CONFIG_X86_64
> if (max_pfn > max_low_pfn) {
> - max_pfn_mapped = init_memory_mapping(1UL<<32,
> - max_pfn<<PAGE_SHIFT);
> /* can we preseve max_low_pfn ?*/
> max_low_pfn = max_pfn;
> }
> --
> 1.7.9.5
>
>

2012-08-11 18:36:44

by H. Peter Anvin

Subject: Re: [PATCH 1/5] x86: Only direct map addresses that are marked as E820_RAM

On 08/09/2012 03:03 PM, Yinghai Lu wrote:
>
> can you put those lines in another function?
>
> setup_arch is way too big now.
>

I agree with this ... Jacob, could you rev the patch?

-hpa


--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel. I don't speak on their behalf.

2012-08-11 19:49:54

by Tejun Heo

Subject: Re: [PATCH 1/5] x86: Only direct map addresses that are marked as E820_RAM

Hello, Jacob.

On Thu, Aug 09, 2012 at 04:23:05PM -0500, Jacob Shin wrote:
> +struct range pfn_mapped[E820_X_MAX];
> +int nr_pfn_mapped;

Why aren't these __initdata? Are they gonna be used for other
purposes?

> +void add_pfn_range_mapped(unsigned long start_pfn, unsigned long end_pfn)
> +{
> + nr_pfn_mapped = add_range_with_merge(pfn_mapped, E820_X_MAX,
> + nr_pfn_mapped, start_pfn, end_pfn);
> +
> + if (end_pfn > max_pfn_mapped)
> + max_pfn_mapped = end_pfn;

Maybe use max()?

> + if ((end_pfn <= (1UL << (32 - PAGE_SHIFT))) &&
> + (end_pfn > max_low_pfn_mapped))
> + max_low_pfn_mapped = end_pfn;
> +}
> +
> +int pfn_range_is_mapped(unsigned long start_pfn, unsigned long end_pfn)

bool?

> +{
> + int i;
> +
> + for (i = 0; i < nr_pfn_mapped; i++)
> + if ((start_pfn >= pfn_mapped[i].start) &&
> + (end_pfn <= pfn_mapped[i].end))
> + break;
> +
> + return i < nr_pfn_mapped;
> +}

for (...)
if (xxx)
return true;
return false;

> +int pfn_is_mapped(unsigned long pfn)
> +{
> + int i;
> +
> + for (i = 0; i < nr_pfn_mapped; i++)
> + if ((pfn >= pfn_mapped[i].start) &&
> + (pfn < pfn_mapped[i].end))
> + break;
> +
> + return i < nr_pfn_mapped;
> +}

How about...

return pfn_range_is_mapped(pfn, pfn + 1);
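
Taken together, the helpers would then look something like this (a sketch
with the above suggestions applied, not the final form):

        void add_pfn_range_mapped(unsigned long start_pfn, unsigned long end_pfn)
        {
                nr_pfn_mapped = add_range_with_merge(pfn_mapped, E820_X_MAX,
                                        nr_pfn_mapped, start_pfn, end_pfn);

                max_pfn_mapped = max(max_pfn_mapped, end_pfn);

                if (end_pfn <= (1UL << (32 - PAGE_SHIFT)))
                        max_low_pfn_mapped = max(max_low_pfn_mapped, end_pfn);
        }

        bool pfn_range_is_mapped(unsigned long start_pfn, unsigned long end_pfn)
        {
                int i;

                for (i = 0; i < nr_pfn_mapped; i++)
                        if (start_pfn >= pfn_mapped[i].start &&
                            end_pfn <= pfn_mapped[i].end)
                                return true;

                return false;
        }

        bool pfn_is_mapped(unsigned long pfn)
        {
                return pfn_range_is_mapped(pfn, pfn + 1);
        }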

> @@ -913,14 +958,40 @@ void __init setup_arch(char **cmdline_p)
>
> init_gbpages();
>
> - /* max_pfn_mapped is updated here */
> - max_low_pfn_mapped = init_memory_mapping(0, max_low_pfn<<PAGE_SHIFT);
> - max_pfn_mapped = max_low_pfn_mapped;
> + init_pfn = max_pfn_mapped;
> +
> + memset(pfn_mapped, 0, sizeof(pfn_mapped));
> + nr_pfn_mapped = 0;

Are these necessary? We clear .bss way before control reaches here.

> +
> + add_pfn_range_mapped(0, max_pfn_mapped);
> +
> + for (i = 0; i < e820.nr_map; i++) {
> + struct e820entry *ei = &e820.map[i];
> + u64 start = ei->addr;
> + u64 end = ei->addr + ei->size;
> +
> + if (ei->type != E820_RAM)
> + continue;
> +
> + if (end <= (init_pfn << PAGE_SHIFT))
> + continue;
> +
> + if (start < (init_pfn << PAGE_SHIFT))
> + start = init_pfn << PAGE_SHIFT;
> +
> +#ifdef CONFIG_X86_32
> + if ((start >> PAGE_SHIFT) >= max_low_pfn)
> + continue;
> +
> + if ((end >> PAGE_SHIFT) > max_low_pfn)
> + end = max_low_pfn << PAGE_SHIFT;
> +#endif
> + pfn = init_memory_mapping(start, end);
> + add_pfn_range_mapped(start >> PAGE_SHIFT, pfn);
> + }

Some comments please? Also, while this may be the right thing to do,
if I'm not mistaken it is also likely to make the linear mapping use
smaller page sizes depending on how the physical memory is laid out.
That could be a trade-off we're willing to make, but it *should* be
explicit. Please describe what's going on and provide the rationale.
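
For example (layout hypothetical), with an e820 like

        RAM  [         0, 0x20000000)
        hole [0x20000000, 0x20100000)
        RAM  [0x20100000, 0x80000000)

a single init_memory_mapping(0, 0x80000000) call could previously map the
whole span, hole included, with 2M pages; mapping each RAM range separately
means the second range starts at the non-2M-aligned 0x20100000, so
[0x20100000, 0x20200000) has to fall back to 4K pages.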

Thanks.

--
tejun

2012-08-11 19:51:31

by Tejun Heo

Subject: Re: [PATCH 0/5] Only create direct mappings for E820_RAM regions

On Thu, Aug 09, 2012 at 04:23:04PM -0500, Jacob Shin wrote:
> This is a revision of an earlier attempt, with suggestions and concerns
> from the previous conversation (https://lkml.org/lkml/2011/12/16/486) taken
> into account.
>
> The following patchset creates direct mappings only for E820_RAM regions
> between 0 ~ max_low_pfn and 4GB ~ max_pfn, and leaves non-E820_RAM
> regions unmapped.

Can you please explain why this is necessary here and in patch
descriptions?

Thanks.

--
tejun

2012-08-11 19:56:22

by Tejun Heo

Subject: Re: [PATCH 2/5] x86: find_early_table_space based on memory ranges that are being mapped

Hello,

On Thu, Aug 09, 2012 at 04:23:06PM -0500, Jacob Shin wrote:
> Current logic sizes the page-table allocation to cover everything from 0 to
> end. Instead, we only need enough space to cover from mr[0].start to
> mr[nr_range - 1].end.
>
> Signed-off-by: Jacob Shin <[email protected]>
> ---
> arch/x86/mm/init.c | 57 +++++++++++++++++++++++++++-------------------------
> 1 file changed, 30 insertions(+), 27 deletions(-)
>
> diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
> index e0e6990..d4e01a7 100644
> --- a/arch/x86/mm/init.c
> +++ b/arch/x86/mm/init.c
> @@ -35,40 +35,43 @@ struct map_range {
> unsigned page_size_mask;
> };
>
> -static void __init find_early_table_space(struct map_range *mr, unsigned long end,
> - int use_pse, int use_gbpages)
> +static void __init find_early_table_space(struct map_range *mr, int nr_range)
> {
> - unsigned long puds, pmds, ptes, tables, start = 0, good_end = end;
> + int i;
> + unsigned long puds = 0, pmds = 0, ptes = 0, tables;
> + unsigned long start = 0, good_end;
> phys_addr_t base;
>
> - puds = (end + PUD_SIZE - 1) >> PUD_SHIFT;
> - tables = roundup(puds * sizeof(pud_t), PAGE_SIZE);
> -
> - if (use_gbpages) {
> - unsigned long extra;
> + for (i = 0; i < nr_range; i++) {
> + unsigned long range, extra;
>
> - extra = end - ((end>>PUD_SHIFT) << PUD_SHIFT);
> - pmds = (extra + PMD_SIZE - 1) >> PMD_SHIFT;
> - } else
> - pmds = (end + PMD_SIZE - 1) >> PMD_SHIFT;
> -
> - tables += roundup(pmds * sizeof(pmd_t), PAGE_SIZE);
> + range = mr[i].end - mr[i].start;
> + puds += (range + PUD_SIZE - 1) >> PUD_SHIFT;
>
> - if (use_pse) {
> - unsigned long extra;
> + if (mr[i].page_size_mask & (1 << PG_LEVEL_1G)) {
> + extra = range - ((range >> PUD_SHIFT) << PUD_SHIFT);
> + pmds += (extra + PMD_SIZE - 1) >> PMD_SHIFT;
> + } else {
> + pmds += (range + PMD_SIZE - 1) >> PMD_SHIFT;
> + }
>
> - extra = end - ((end>>PMD_SHIFT) << PMD_SHIFT);
> + if (mr[i].page_size_mask & (1 << PG_LEVEL_2M)) {
> + extra = range - ((range >> PMD_SHIFT) << PMD_SHIFT);
> #ifdef CONFIG_X86_32
> - extra += PMD_SIZE;
> + extra += PMD_SIZE;
> #endif
> - /* The first 2/4M doesn't use large pages. */
> - if (mr->start < PMD_SIZE)
> - extra += mr->end - mr->start;
> -
> - ptes = (extra + PAGE_SIZE - 1) >> PAGE_SHIFT;
> - } else
> - ptes = (end + PAGE_SIZE - 1) >> PAGE_SHIFT;
> + /* The first 2/4M doesn't use large pages. */
> + if (mr[i].start < PMD_SIZE)
> + extra += range;
> +
> + ptes += (extra + PAGE_SIZE - 1) >> PAGE_SHIFT;
> + } else {
> + ptes += (range + PAGE_SIZE - 1) >> PAGE_SHIFT;
> + }
> + }
>
> + tables = roundup(puds * sizeof(pud_t), PAGE_SIZE);
> + tables += roundup(pmds * sizeof(pmd_t), PAGE_SIZE);

It would really be great if you could sprinkle in some comments explaining
what the function is trying to do and how it does it. It can be pretty
difficult to extract the higher-level concept out of functions like this.

Thanks.

--
tejun

2012-08-11 19:57:36

by Tejun Heo

Subject: Re: [PATCH 3/5] x86: Keep track of direct mapped pfn ranges

On Thu, Aug 09, 2012 at 04:23:07PM -0500, Jacob Shin wrote:
> Update later calls to init_memory_mapping to keep track of direct mapped
> pfn ranges.

so that....

--
tejun

2012-08-11 20:01:39

by Tejun Heo

Subject: Re: [PATCH 5/5] x86: Move enabling of PSE and PGE out of init_memory_mapping

Hello,

On Thu, Aug 09, 2012 at 04:23:09PM -0500, Jacob Shin wrote:
> Since we now call init_memory_mapping for each E820_RAM region in a
> loop, move cr4 writes out to setup_arch.

Shouldn't this happen before init_memory_mapping() is called multiple
times? Also, there seems to be other stuff that needs to be moved out.

Thanks.

--
tejun

2012-08-13 07:34:46

by Borislav Petkov

Subject: Re: [PATCH 5/5] x86: Move enabling of PSE and PGE out of init_memory_mapping

Hi Tejun,

On Sat, Aug 11, 2012 at 01:01:33PM -0700, Tejun Heo wrote:
> Shouldn't this happen before init_memory_mapping() is called multiple
> times?

It does.

Those CR4 flags are set before the loop which calls init_memory_mapping().

> Also, there seems to be other stuff that needs to be moved out.

Which are those pls?

Thanks.

--
Regards/Gruss,
Boris.

2012-08-13 14:11:30

by Jacob Shin

Subject: Re: [PATCH 1/5] x86: Only direct map addresses that are marked as E820_RAM

On Sat, Aug 11, 2012 at 11:36:39AM -0700, H. Peter Anvin wrote:
> On 08/09/2012 03:03 PM, Yinghai Lu wrote:
> >
> >can you put those lines in another function?
> >
> >setup_arch is way too big now.
> >
>
> I agree with this ... Jacob, could you rev the patch?

Yes, of course. I'll send out the revised patchset later today.

-Jacob

>
> -hpa
>
>
> --
> H. Peter Anvin, Intel Open Source Technology Center
> I work for Intel. I don't speak on their behalf.
>
>

2012-08-13 14:32:09

by Jacob Shin

Subject: Re: [PATCH 1/5] x86: Only direct map addresses that are marked as E820_RAM

On Sat, Aug 11, 2012 at 12:49:48PM -0700, Tejun Heo wrote:
> Hello, Jacob.

Hi,

>
> On Thu, Aug 09, 2012 at 04:23:05PM -0500, Jacob Shin wrote:
> > +struct range pfn_mapped[E820_X_MAX];
> > +int nr_pfn_mapped;
>
> Why aren't these __initdata? Are they gonna be used for other
> purposes?

Yes, the thought was that later code may want to know which pfns are direct
mapped and which are not. For example, memory hotplug has to call
init_memory_mapping() and update the direct mapping.

>
> > +void add_pfn_range_mapped(unsigned long start_pfn, unsigned long end_pfn)
> > +{
> > + nr_pfn_mapped = add_range_with_merge(pfn_mapped, E820_X_MAX,
> > + nr_pfn_mapped, start_pfn, end_pfn);
> > +
> > + if (end_pfn > max_pfn_mapped)
> > + max_pfn_mapped = end_pfn;
>
> Maybe use max()?

Okay,

>
> > + if ((end_pfn <= (1UL << (32 - PAGE_SHIFT))) &&
> > + (end_pfn > max_low_pfn_mapped))
> > + max_low_pfn_mapped = end_pfn;
> > +}
> > +
> > +int pfn_range_is_mapped(unsigned long start_pfn, unsigned long end_pfn)
>
> bool?

Okay, will change to bool.

>
> > +{
> > + int i;
> > +
> > + for (i = 0; i < nr_pfn_mapped; i++)
> > + if ((start_pfn >= pfn_mapped[i].start) &&
> > + (end_pfn <= pfn_mapped[i].end))
> > + break;
> > +
> > + return i < nr_pfn_mapped;
> > +}
>
> for (...)
> if (xxx)
> return true;
> return false;
>
> > +int pfn_is_mapped(unsigned long pfn)
> > +{
> > + int i;
> > +
> > + for (i = 0; i < nr_pfn_mapped; i++)
> > + if ((pfn >= pfn_mapped[i].start) &&
> > + (pfn < pfn_mapped[i].end))
> > + break;
> > +
> > + return i < nr_pfn_mapped;
> > +}
>
> How about...
>
> return pfn_range_is_mapped(pfn, pfn + 1);

Okay,

>
> > @@ -913,14 +958,40 @@ void __init setup_arch(char **cmdline_p)
> >
> > init_gbpages();
> >
> > - /* max_pfn_mapped is updated here */
> > - max_low_pfn_mapped = init_memory_mapping(0, max_low_pfn<<PAGE_SHIFT);
> > - max_pfn_mapped = max_low_pfn_mapped;
> > + init_pfn = max_pfn_mapped;
> > +
> > + memset(pfn_mapped, 0, sizeof(pfn_mapped));
> > + nr_pfn_mapped = 0;
>
> Are these necessary? We clear .bss way before control reaches here.

Ah, okay. I'll remove them and test to double-check.

>
> > +
> > + add_pfn_range_mapped(0, max_pfn_mapped);
> > +
> > + for (i = 0; i < e820.nr_map; i++) {
> > + struct e820entry *ei = &e820.map[i];
> > + u64 start = ei->addr;
> > + u64 end = ei->addr + ei->size;
> > +
> > + if (ei->type != E820_RAM)
> > + continue;
> > +
> > + if (end <= (init_pfn << PAGE_SHIFT))
> > + continue;
> > +
> > + if (start < (init_pfn << PAGE_SHIFT))
> > + start = init_pfn << PAGE_SHIFT;
> > +
> > +#ifdef CONFIG_X86_32
> > + if ((start >> PAGE_SHIFT) >= max_low_pfn)
> > + continue;
> > +
> > + if ((end >> PAGE_SHIFT) > max_low_pfn)
> > + end = max_low_pfn << PAGE_SHIFT;
> > +#endif
> > + pfn = init_memory_mapping(start, end);
> > + add_pfn_range_mapped(start >> PAGE_SHIFT, pfn);
> > + }
>
> Some comments please? Also, while this may be the right thing to do,
> if I'm not mistaken it is also likely to make the linear mapping use
> smaller page sizes depending on how the physical memory is laid out.
> That could be a trade-off we're willing to make, but it *should* be
> explicit. Please describe what's going on and provide the rationale.

Ah .. okay, so you are concerned about BIOSes whose E820 breaks up a large
linear memory range into two different E820 entries? If I'm not mistaken,
though, the E820 code already does some cleansing of the values it gets from
the BIOS, in arch/x86/kernel/e820.c:sanitize_e820_map().

But yes, I'll add comments, and break this logic out into its own function
as Yinghai suggested.

Thanks!

-Jacob

>
> Thanks.
>
> --
> tejun
>

2012-08-13 23:20:47

by Tejun Heo

Subject: Re: [PATCH 5/5] x86: Move enabling of PSE and PGE out of init_memory_mapping

Hello,

On Mon, Aug 13, 2012 at 09:34:40AM +0200, Borislav Petkov wrote:
> On Sat, Aug 11, 2012 at 01:01:33PM -0700, Tejun Heo wrote:
> > Shouldn't this happen before init_memory_mapping() is called multiple
> > times?
>
> It does.
>
> Those CR4 flags are set before the loop which calls init_memory_mapping().

I meant the patch should come before the patch making multiple calls
to init_memory_mapping().

> > Also, there seems to be other stuff that needs to be moved out.
>
> Which are those pls?

Rebuilding pgtable on each invocation?

Thanks.

--
tejun

2012-08-14 08:49:24

by Borislav Petkov

Subject: Re: [PATCH 5/5] x86: Move enabling of PSE and PGE out of init_memory_mapping

On Mon, Aug 13, 2012 at 04:20:40PM -0700, Tejun Heo wrote:
> I meant the patch should come before the patch making multiple calls
> to init_memory_mapping().

Ah, this makes sense.

> > > Also, there seems to be other stuff that needs to be moved out.
> >
> > Which are those pls?
>
> Rebuilding pgtable on each invocation?

You mean pagetable_reserve(), right?

Yes, Jacob: you basically might want to carve out all the functionality in
init_memory_mapping() that is independent of its start and end args, and do
the carving in a pre-patch or two before 1/5.
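
For example, a pre-patch could hoist the page size probing into a
setup-time helper (name and shape hypothetical):

        static unsigned long __init probe_page_size_mask(void)
        {
                unsigned long page_size_mask = 0;

                /* the PSE/PGE enabling from patch 5 would move here too */
                if (direct_gbpages)
                        page_size_mask |= 1 << PG_LEVEL_1G;
                if (cpu_has_pse)
                        page_size_mask |= 1 << PG_LEVEL_2M;

                return page_size_mask;
        }

init_memory_mapping() would then take the mask as given instead of
recomputing it on every call.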

Thanks.

--
Regards/Gruss,
Boris.

2012-08-14 19:52:57

by Jacob Shin

Subject: Re: [PATCH 5/5] x86: Move enabling of PSE and PGE out of init_memory_mapping

On Tue, Aug 14, 2012 at 10:49:16AM +0200, Borislav Petkov wrote:
> On Mon, Aug 13, 2012 at 04:20:40PM -0700, Tejun Heo wrote:
> > I meant the patch should come before the patch making multiple calls
> > to init_memory_mapping().

Ah, okay .. got it.

Hm .. for some reason Tejun's emails are no longer making it into my inbox;
there must be some filtering going on at the corporate IT level. I'll
inquire with IT about it.

Sorry, :-(

>
> Ah, this makes sense.
>
> > > > Also, there seem to be other stuff which need to be moved out.
> > >
> > > Which are those pls?
> >
> > Rebuilding pgtable on each invocation?
>
> You mean pagetable_reserve(), right?

This is actually needed on every call to init_memory_mapping().

My patch 2/5 changes find_early_table_space() to find just enough space to
map start to end, and pagetable_reserve() then reserves only what we
actually used. Since init_memory_mapping() is called again and again with
different start/end ranges, we find space for the page tables and reserve
them on every call.
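
The per-call flow is roughly (heavily condensed from arch/x86/mm/init.c):

        unsigned long __init init_memory_mapping(unsigned long start,
                                                 unsigned long end)
        {
                /* split [start, end) into mr[0..nr_range) by page size */

                if (!after_bootmem)
                        find_early_table_space(mr, nr_range);   /* worst case */

                for (i = 0; i < nr_range; i++)
                        ret = kernel_physical_mapping_init(mr[i].start,
                                        mr[i].end, mr[i].page_size_mask);

                if (!after_bootmem && pgt_buf_end > pgt_buf_start)
                        x86_init.mapping.pagetable_reserve(
                                        PFN_PHYS(pgt_buf_start),
                                        PFN_PHYS(pgt_buf_end)); /* only what was used */

                return ret >> PAGE_SHIFT;
        }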

>
> Yes, Jacob, you basically might want to carve out all functionality
> from init_memory_mapping() which is independent from its start and end
> args and do the carving in a pre-patch or two to 1/5.
>
> Thanks.
>
> --
> Regards/Gruss,
> Boris.
>

2012-08-14 20:20:46

by Tejun Heo

Subject: Re: [PATCH 5/5] x86: Move enabling of PSE and PGE out of init_memory_mapping

Hello,

On Tue, Aug 14, 2012 at 02:52:48PM -0500, Jacob Shin wrote:
> > You mean pagetable_reserve(), right?
>
> This is actually needed on every call to init_memory_mapping(),
>
> My patch 2/5 changes find_early_table_space() to find just enough space
> to map start to end. The pagetable_reserve() will then reserve what we
> actually used. Since init_memory_mapping() is called again and again
> with different start to end ranges, we find space for the page tables
> and reserve them every time.

I thought the function was rebuilding the page tables for the whole memory
area each time. Maybe I misread. Does it only build the part that is newly
being mapped?

Thanks.

--
tejun