2012-10-09 23:59:08

by Yinghai Lu

Subject: [PATCH -v3 0/7] x86: Use BRK to pre mapping page table to make xen happy

on top of tip/x86/mm2, but please zap last patch in that branch.

1. use brk to mapping first PMD_SIZE range.
2. top down to initialize page table range by range.
3. get rid of calculate page table, and find_early_page_table.
4. remove early_ioremap in page table accessing.

v2: changes, update xen interface about pagetable_reserve, so not
use pgt_buf_* in xen code directly.
v3: use range top-down to initialize page table, so will not use
calculating/find early table anymore.
also reorder the patches sequence.

could be found at:
git://git.kernel.org/pub/scm/linux/kernel/git/yinghai/linux-yinghai.git for-x86-mm

later we could get rid of workaround about xen_mapping_pagetable_reserve, that
could kill another 50 lines codes. --- will do that later because x86/mm2 is
not updated to linus/master yet. If we do that now, will have merge conflicts.

Thanks

Yinghai Lu

Yinghai Lu (7):
x86, mm: align start address to correct big page size
x86, mm: Use big page size for small memory range
x86, mm: Don't clear page table if next range is ram
x86, mm: only keep initial mapping for ram
x86, mm: Break down init_all_memory_mapping
x86, mm: setup page table from top-down
x86, mm: Remove early_memremap workaround for page table accessing

arch/x86/include/asm/page_types.h | 1 +
arch/x86/include/asm/pgtable.h | 1 +
arch/x86/kernel/setup.c | 3 +
arch/x86/mm/init.c | 251 ++++++++++++------------------------
arch/x86/mm/init_32.c | 18 +++-
arch/x86/mm/init_64.c | 100 ++++++---------
6 files changed, 144 insertions(+), 230 deletions(-)

--
1.7.7


2012-10-09 23:58:59

by Yinghai Lu

Subject: [PATCH 3/7] x86, mm: Don't clear page table if next range is ram

After we add code that uses BRK to map the buffer for the final page table, it
should be safe to remove early_memremap for page table accessing.

But with that change we get a panic.

It turns out we wrongly clear initial page table entries for the next range
when ranges are separated by holes, and it only happens when we map the
ranges one by one.

After adding a check before clearing the related page table entries, that
panic does not happen anymore.
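
For illustration, a hedged sketch of the pattern this patch introduces at each
level (it mirrors the PTE-level hunk below); an entry past 'end' is only
zeroed when the region it covers has nothing mapped in the e820 table at all:

	next = (addr & PAGE_MASK) + PAGE_SIZE;
	if (addr >= end) {
		/* keep the entry if it still covers a mapped e820 range,
		 * e.g. the tail of a higher range mapped in an earlier pass */
		if (!after_bootmem &&
		    !e820_any_mapped(addr & PAGE_MASK, next, 0))
			set_pte(pte, __pte(0));
		continue;
	}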

Signed-off-by: Yinghai Lu <[email protected]>
---
arch/x86/mm/init_64.c | 37 ++++++++++++++++---------------------
1 files changed, 16 insertions(+), 21 deletions(-)

diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index f40f383..61b3c44 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -363,20 +363,19 @@ static unsigned long __meminit
phys_pte_init(pte_t *pte_page, unsigned long addr, unsigned long end,
pgprot_t prot)
{
- unsigned pages = 0;
+ unsigned long pages = 0, next;
unsigned long last_map_addr = end;
int i;

pte_t *pte = pte_page + pte_index(addr);

- for(i = pte_index(addr); i < PTRS_PER_PTE; i++, addr += PAGE_SIZE, pte++) {
-
+ for (i = pte_index(addr); i < PTRS_PER_PTE; i++, addr = next, pte++) {
+ next = (addr & PAGE_MASK) + PAGE_SIZE;
if (addr >= end) {
- if (!after_bootmem) {
- for(; i < PTRS_PER_PTE; i++, pte++)
- set_pte(pte, __pte(0));
- }
- break;
+ if (!after_bootmem &&
+ !e820_any_mapped(addr & PAGE_MASK, next, 0))
+ set_pte(pte, __pte(0));
+ continue;
}

/*
@@ -418,16 +417,14 @@ phys_pmd_init(pmd_t *pmd_page, unsigned long address, unsigned long end,
pte_t *pte;
pgprot_t new_prot = prot;

+ next = (address & PMD_MASK) + PMD_SIZE;
if (address >= end) {
- if (!after_bootmem) {
- for (; i < PTRS_PER_PMD; i++, pmd++)
- set_pmd(pmd, __pmd(0));
- }
- break;
+ if (!after_bootmem &&
+ !e820_any_mapped(address & PMD_MASK, next, 0))
+ set_pmd(pmd, __pmd(0));
+ continue;
}

- next = (address & PMD_MASK) + PMD_SIZE;
-
if (pmd_val(*pmd)) {
if (!pmd_large(*pmd)) {
spin_lock(&init_mm.page_table_lock);
@@ -494,13 +491,11 @@ phys_pud_init(pud_t *pud_page, unsigned long addr, unsigned long end,
pmd_t *pmd;
pgprot_t prot = PAGE_KERNEL;

- if (addr >= end)
- break;
-
next = (addr & PUD_MASK) + PUD_SIZE;
-
- if (!after_bootmem && !e820_any_mapped(addr, next, 0)) {
- set_pud(pud, __pud(0));
+ if (addr >= end) {
+ if (!after_bootmem &&
+ !e820_any_mapped(addr & PUD_MASK, next, 0))
+ set_pud(pud, __pud(0));
continue;
}

--
1.7.7

2012-10-09 23:59:05

by Yinghai Lu

Subject: [PATCH 7/7] x86, mm: Remove early_memremap workaround for page table accessing

This workaround is not needed anymore after the earlier patches that pre-map
the page table buffer and stop clearing the initial page table wrongly.
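
A hedged before/after sketch of the allocator change (it mirrors the first
hunk below): page-table pages now always come from memory that is already in
the direct mapping, so the temporary fixmap-based mapping can be dropped.

	/* before: the pfn may lie above max_pfn_mapped, map it temporarily */
	adr = early_memremap(pfn * PAGE_SIZE, PAGE_SIZE);

	/* after: the pfn is guaranteed to be mapped already */
	adr = __va(pfn * PAGE_SIZE);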

Signed-off-by: Yinghai Lu <[email protected]>
---
arch/x86/mm/init_64.c | 38 ++++----------------------------------
1 files changed, 4 insertions(+), 34 deletions(-)

diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index 7dfa69b..4e6873f 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -340,36 +340,12 @@ static __ref void *alloc_low_page(unsigned long *phys)
} else
pfn = pgt_buf_end++;

- adr = early_memremap(pfn * PAGE_SIZE, PAGE_SIZE);
+ adr = __va(pfn * PAGE_SIZE);
clear_page(adr);
*phys = pfn * PAGE_SIZE;
return adr;
}

-static __ref void *map_low_page(void *virt)
-{
- void *adr;
- unsigned long phys, left;
-
- if (after_bootmem)
- return virt;
-
- phys = __pa(virt);
- left = phys & (PAGE_SIZE - 1);
- adr = early_memremap(phys & PAGE_MASK, PAGE_SIZE);
- adr = (void *)(((unsigned long)adr) | left);
-
- return adr;
-}
-
-static __ref void unmap_low_page(void *adr)
-{
- if (after_bootmem)
- return;
-
- early_iounmap((void *)((unsigned long)adr & PAGE_MASK), PAGE_SIZE);
-}
-
static unsigned long __meminit
phys_pte_init(pte_t *pte_page, unsigned long addr, unsigned long end,
pgprot_t prot)
@@ -441,10 +417,9 @@ phys_pmd_init(pmd_t *pmd_page, unsigned long address, unsigned long end,
if (pmd_val(*pmd)) {
if (!pmd_large(*pmd)) {
spin_lock(&init_mm.page_table_lock);
- pte = map_low_page((pte_t *)pmd_page_vaddr(*pmd));
+ pte = (pte_t *)pmd_page_vaddr(*pmd);
last_map_addr = phys_pte_init(pte, address,
end, prot);
- unmap_low_page(pte);
spin_unlock(&init_mm.page_table_lock);
continue;
}
@@ -480,7 +455,6 @@ phys_pmd_init(pmd_t *pmd_page, unsigned long address, unsigned long end,

pte = alloc_low_page(&pte_phys);
last_map_addr = phys_pte_init(pte, address, end, new_prot);
- unmap_low_page(pte);

spin_lock(&init_mm.page_table_lock);
pmd_populate_kernel(&init_mm, pmd, __va(pte_phys));
@@ -515,10 +489,9 @@ phys_pud_init(pud_t *pud_page, unsigned long addr, unsigned long end,

if (pud_val(*pud)) {
if (!pud_large(*pud)) {
- pmd = map_low_page(pmd_offset(pud, 0));
+ pmd = pmd_offset(pud, 0);
last_map_addr = phys_pmd_init(pmd, addr, end,
page_size_mask, prot);
- unmap_low_page(pmd);
__flush_tlb_all();
continue;
}
@@ -555,7 +528,6 @@ phys_pud_init(pud_t *pud_page, unsigned long addr, unsigned long end,
pmd = alloc_low_page(&pmd_phys);
last_map_addr = phys_pmd_init(pmd, addr, end, page_size_mask,
prot);
- unmap_low_page(pmd);

spin_lock(&init_mm.page_table_lock);
pud_populate(&init_mm, pud, __va(pmd_phys));
@@ -591,17 +563,15 @@ kernel_physical_mapping_init(unsigned long start,
next = end;

if (pgd_val(*pgd)) {
- pud = map_low_page((pud_t *)pgd_page_vaddr(*pgd));
+ pud = (pud_t *)pgd_page_vaddr(*pgd);
last_map_addr = phys_pud_init(pud, __pa(start),
__pa(end), page_size_mask);
- unmap_low_page(pud);
continue;
}

pud = alloc_low_page(&pud_phys);
last_map_addr = phys_pud_init(pud, __pa(start), __pa(next),
page_size_mask);
- unmap_low_page(pud);

spin_lock(&init_mm.page_table_lock);
pgd_populate(&init_mm, pgd, __va(pud_phys));
--
1.7.7

2012-10-09 23:59:11

by Yinghai Lu

Subject: [PATCH 1/7] x86, mm: align start address to correct big page size

We are going to use a buffer in BRK to pre-map the final page table buffer.

The final page table buffer may be only page aligned, but the memory around
it is still RAM, so we can map it with big pages instead of falling back to
small pages.

The next patch will probe and adjust page_size_mask so that big page sizes
can be used even for small RAM ranges.

Before that, this patch aligns the start address down according to the big
page size; otherwise the page table entry will not have the correct value.
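
As a hedged, standalone illustration of the arithmetic in the 2M (PMD) case,
with a made-up address:

	#include <stdio.h>

	#define PAGE_SHIFT	12
	#define PMD_SHIFT	21
	#define PMD_SIZE	(1UL << PMD_SHIFT)
	#define PMD_MASK	(~(PMD_SIZE - 1))

	int main(void)
	{
		/* address inside the 2M page that starts at 0x200000 */
		unsigned long address = 0x00201000UL;

		/* old code: pfn 0x201, not 2M aligned, wrong for a PSE entry */
		printf("unaligned pfn: %#lx\n", address >> PAGE_SHIFT);
		/* patched code: pfn 0x200, the 2M-aligned frame */
		printf("aligned pfn:   %#lx\n",
		       (address & PMD_MASK) >> PAGE_SHIFT);
		return 0;
	}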

Signed-off-by: Yinghai Lu <[email protected]>
---
arch/x86/mm/init_32.c | 1 +
arch/x86/mm/init_64.c | 5 +++--
2 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/arch/x86/mm/init_32.c b/arch/x86/mm/init_32.c
index 11a5800..27f7fc6 100644
--- a/arch/x86/mm/init_32.c
+++ b/arch/x86/mm/init_32.c
@@ -310,6 +310,7 @@ repeat:
__pgprot(PTE_IDENT_ATTR |
_PAGE_PSE);

+ pfn &= PMD_MASK >> PAGE_SHIFT;
addr2 = (pfn + PTRS_PER_PTE-1) * PAGE_SIZE +
PAGE_OFFSET + PAGE_SIZE-1;

diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index ab558eb..f40f383 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -461,7 +461,7 @@ phys_pmd_init(pmd_t *pmd_page, unsigned long address, unsigned long end,
pages++;
spin_lock(&init_mm.page_table_lock);
set_pte((pte_t *)pmd,
- pfn_pte(address >> PAGE_SHIFT,
+ pfn_pte((address & PMD_MASK) >> PAGE_SHIFT,
__pgprot(pgprot_val(prot) | _PAGE_PSE)));
spin_unlock(&init_mm.page_table_lock);
last_map_addr = next;
@@ -536,7 +536,8 @@ phys_pud_init(pud_t *pud_page, unsigned long addr, unsigned long end,
pages++;
spin_lock(&init_mm.page_table_lock);
set_pte((pte_t *)pud,
- pfn_pte(addr >> PAGE_SHIFT, PAGE_KERNEL_LARGE));
+ pfn_pte((addr & PUD_MASK) >> PAGE_SHIFT,
+ PAGE_KERNEL_LARGE));
spin_unlock(&init_mm.page_table_lock);
last_map_addr = next;
continue;
--
1.7.7

2012-10-09 23:59:15

by Yinghai Lu

Subject: [PATCH 5/7] x86, mm: Break down init_all_memory_mapping

Will replace that will top-down page table initialization.
new one need to take range.
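
A hedged sketch of the resulting call pattern, taken from the hunks below: the
ISA mapping and the 64-bit max_low_pfn fixup move into the caller, and the
helper now only walks the RAM ranges clamped to the [all_start, all_end)
window it is given.

	/* caller side (init_mem_mapping), per the diff below */
	init_memory_mapping(0, ISA_END_ADDRESS);
	init_all_memory_mapping(ISA_END_ADDRESS, end);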

Signed-off-by: Yinghai Lu <[email protected]>
---
arch/x86/mm/init.c | 41 +++++++++++++++++++----------------------
1 files changed, 19 insertions(+), 22 deletions(-)

diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
index ebf76ce..23ce4db 100644
--- a/arch/x86/mm/init.c
+++ b/arch/x86/mm/init.c
@@ -393,40 +393,30 @@ unsigned long __init_refok init_memory_mapping(unsigned long start,
* Depending on the alignment of E820 ranges, this may possibly result in using
* smaller size (i.e. 4K instead of 2M or 1G) page tables.
*/
-static void __init init_all_memory_mapping(void)
+static void __init init_all_memory_mapping(unsigned long all_start,
+ unsigned long all_end)
{
unsigned long start_pfn, end_pfn;
int i;

- /* the ISA range is always mapped regardless of memory holes */
- init_memory_mapping(0, ISA_END_ADDRESS);
-
for_each_mem_pfn_range(i, MAX_NUMNODES, &start_pfn, &end_pfn, NULL) {
u64 start = start_pfn << PAGE_SHIFT;
u64 end = end_pfn << PAGE_SHIFT;

- if (end <= ISA_END_ADDRESS)
+ if (end <= all_start)
continue;

- if (start < ISA_END_ADDRESS)
- start = ISA_END_ADDRESS;
-#ifdef CONFIG_X86_32
- /* on 32 bit, we only map up to max_low_pfn */
- if ((start >> PAGE_SHIFT) >= max_low_pfn)
+ if (start < all_start)
+ start = all_start;
+
+ if (start >= all_end)
continue;

- if ((end >> PAGE_SHIFT) > max_low_pfn)
- end = max_low_pfn << PAGE_SHIFT;
-#endif
- init_memory_mapping(start, end);
- }
+ if (end > all_end)
+ end = all_end;

-#ifdef CONFIG_X86_64
- if (max_pfn > max_low_pfn) {
- /* can we preseve max_low_pfn ?*/
- max_low_pfn = max_pfn;
+ init_memory_mapping(start, end);
}
-#endif
}

void __init init_mem_mapping(void)
@@ -456,8 +446,15 @@ void __init init_mem_mapping(void)
(pgt_buf_top << PAGE_SHIFT) - 1);

max_pfn_mapped = 0; /* will get exact value next */
- init_all_memory_mapping();
-
+ /* the ISA range is always mapped regardless of memory holes */
+ init_memory_mapping(0, ISA_END_ADDRESS);
+ init_all_memory_mapping(ISA_END_ADDRESS, end);
+#ifdef CONFIG_X86_64
+ if (max_pfn > max_low_pfn) {
+ /* can we preseve max_low_pfn ?*/
+ max_low_pfn = max_pfn;
+ }
+#endif
/*
* Reserve the kernel pagetable pages we used (pgt_buf_start -
* pgt_buf_end) and free the other ones (pgt_buf_end - pgt_buf_top)
--
1.7.7

2012-10-09 23:58:57

by Yinghai Lu

Subject: [PATCH 4/7] x86, mm: only keep initial mapping for ram

0 mean any e820 type will be kept, and only hole is removed.

change to E820_RAM and E820_RESERVED_KERN only.
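
A hedged note on the semantics (based on how e820_any_mapped() treats a zero
type; not spelled out in the patch): passing 0 as the type matches any e820
entry, so even reserved or ACPI regions kept their stale entries. Restricting
the check to E820_RAM and E820_RESERVED_KERN keeps only entries that cover
memory we actually map:

	/* keep the entry only if real RAM (or kernel-reserved RAM) is there */
	if (!e820_any_mapped(addr & PAGE_MASK, next, E820_RAM) &&
	    !e820_any_mapped(addr & PAGE_MASK, next, E820_RESERVED_KERN))
		set_pte(pte, __pte(0));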

Signed-off-by: Yinghai Lu <[email protected]>
---
arch/x86/mm/init_64.c | 9 ++++++---
1 files changed, 6 insertions(+), 3 deletions(-)

diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index 61b3c44..4898e80 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -373,7 +373,8 @@ phys_pte_init(pte_t *pte_page, unsigned long addr, unsigned long end,
next = (addr & PAGE_MASK) + PAGE_SIZE;
if (addr >= end) {
if (!after_bootmem &&
- !e820_any_mapped(addr & PAGE_MASK, next, 0))
+ !e820_any_mapped(addr & PAGE_MASK, next, E820_RAM) &&
+ !e820_any_mapped(addr & PAGE_MASK, next, E820_RESERVED_KERN))
set_pte(pte, __pte(0));
continue;
}
@@ -420,7 +421,8 @@ phys_pmd_init(pmd_t *pmd_page, unsigned long address, unsigned long end,
next = (address & PMD_MASK) + PMD_SIZE;
if (address >= end) {
if (!after_bootmem &&
- !e820_any_mapped(address & PMD_MASK, next, 0))
+ !e820_any_mapped(address & PMD_MASK, next, E820_RAM) &&
+ !e820_any_mapped(address & PMD_MASK, next, E820_RESERVED_KERN))
set_pmd(pmd, __pmd(0));
continue;
}
@@ -494,7 +496,8 @@ phys_pud_init(pud_t *pud_page, unsigned long addr, unsigned long end,
next = (addr & PUD_MASK) + PUD_SIZE;
if (addr >= end) {
if (!after_bootmem &&
- !e820_any_mapped(addr & PUD_MASK, next, 0))
+ !e820_any_mapped(addr & PUD_MASK, next, E820_RAM) &&
+ !e820_any_mapped(addr & PUD_MASK, next, E820_RESERVED_KERN))
set_pud(pud, __pud(0));
continue;
}
--
1.7.7

2012-10-10 00:00:27

by Yinghai Lu

Subject: [PATCH 2/7] x86, mm: Use big page size for small memory range

We may map a small range in the middle of a big range first, so we should use
the big page size for it from the start, to avoid breaking the page table
down into small pages later.

The big page bit can only be set when the area around that range is RAM as
well.
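
A hedged, standalone sketch of the rounding check with made-up numbers: the
sub-2M range is rounded out to PMD_SIZE boundaries, and the 2M bit is set
only if memblock says the whole rounded region is memory.

	#include <stdio.h>

	#define PMD_SHIFT	21
	#define PMD_SIZE	(1UL << PMD_SHIFT)
	#define round_down(x, y)	((x) & ~((y) - 1))
	#define round_up(x, y)		(((x) + (y) - 1) & ~((y) - 1))

	int main(void)
	{
		/* made-up sub-2M range sitting in the middle of RAM */
		unsigned long start = 0x00b7f000UL, end = 0x00c00000UL;

		printf("probe [%#lx, %#lx) for a 2M mapping\n",
		       round_down(start, PMD_SIZE), round_up(end, PMD_SIZE));
		/* in the patch, memblock_is_region_memory() on this rounded
		 * region decides whether PG_LEVEL_2M may be set for mr[i] */
		return 0;
	}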

Signed-off-by: Yinghai Lu <[email protected]>
---
arch/x86/mm/init.c | 32 ++++++++++++++++++++++++++++++++
1 files changed, 32 insertions(+), 0 deletions(-)

diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
index c12dfd5..ebf76ce 100644
--- a/arch/x86/mm/init.c
+++ b/arch/x86/mm/init.c
@@ -88,6 +88,35 @@ static int __meminit save_mr(struct map_range *mr, int nr_range,
return nr_range;
}

+/*
+ * adjust the page_size_mask for small range to go with
+ * big page size instead small one if nearby are ram too.
+ */
+static void __init_refok adjust_range_page_size_mask(struct map_range *mr,
+ int nr_range)
+{
+ int i;
+
+ for (i = 0; i < nr_range; i++) {
+ if ((page_size_mask & (1<<PG_LEVEL_2M)) &&
+ !(mr[i].page_size_mask & (1<<PG_LEVEL_2M))) {
+ unsigned long start = round_down(mr[i].start, PMD_SIZE);
+ unsigned long end = round_up(mr[i].end, PMD_SIZE);
+
+ if (memblock_is_region_memory(start, end - start))
+ mr[i].page_size_mask |= 1<<PG_LEVEL_2M;
+ }
+ if ((page_size_mask & (1<<PG_LEVEL_1G)) &&
+ !(mr[i].page_size_mask & (1<<PG_LEVEL_1G))) {
+ unsigned long start = round_down(mr[i].start, PUD_SIZE);
+ unsigned long end = round_up(mr[i].end, PUD_SIZE);
+
+ if (memblock_is_region_memory(start, end - start))
+ mr[i].page_size_mask |= 1<<PG_LEVEL_1G;
+ }
+ }
+}
+
static int __meminit split_mem_range(struct map_range *mr, int nr_range,
unsigned long start,
unsigned long end)
@@ -182,6 +211,9 @@ static int __meminit split_mem_range(struct map_range *mr, int nr_range,
nr_range--;
}

+ if (!after_bootmem)
+ adjust_range_page_size_mask(mr, nr_range);
+
for (i = 0; i < nr_range; i++)
printk(KERN_DEBUG " [mem %#010lx-%#010lx] page %s\n",
mr[i].start, mr[i].end - 1,
--
1.7.7

2012-10-10 00:00:24

by Yinghai Lu

Subject: [PATCH 6/7] x86, mm: setup page table from top-down

Get pgt_buf early from BRK, and use it to map PMD_SIZE to top at first.
then use page from PMD_SIZE to map next blow range.

alloc_low_page will use page from BRK at first, then will switch to use
to memblock to find and reserve page for page table usage.

At last we could get rid of calculation and find early pgt related code.
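
A hedged, standalone sketch of the top-down walk added to init_mem_mapping()
in this patch (the 4G end value is made up): each step maps a window that
ends where the previous one started, and the window grows 32x per step
because the freshly mapped RAM can hold more page-table pages.

	#include <stdio.h>

	#define PMD_SHIFT	21
	#define PMD_SIZE	(1UL << PMD_SHIFT)
	#define ISA_END_ADDRESS	0x100000UL
	#define round_down(x, y)	((x) & ~((y) - 1))

	int main(void)
	{
		unsigned long end = 0x100000000UL;	/* pretend 4G of RAM */
		unsigned long last_start = end, start, step_size = PMD_SIZE;

		while (last_start > ISA_END_ADDRESS) {
			if (last_start > step_size) {
				start = round_down(last_start - 1, step_size);
				if (start < ISA_END_ADDRESS)
					start = ISA_END_ADDRESS;
			} else
				start = ISA_END_ADDRESS;
			printf("map [%#010lx, %#010lx)\n", start, last_start);
			last_start = start;
			step_size <<= 5;	/* 2M, 64M, 2G, ... */
		}
		return 0;
	}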

Suggested-by: "H. Peter Anvin" <[email protected]>
Signed-off-by: Yinghai Lu <[email protected]>
---
arch/x86/include/asm/page_types.h | 1 +
arch/x86/include/asm/pgtable.h | 1 +
arch/x86/kernel/setup.c | 3 +
arch/x86/mm/init.c | 188 ++++++++-----------------------------
arch/x86/mm/init_32.c | 17 +++-
arch/x86/mm/init_64.c | 17 +++-
6 files changed, 71 insertions(+), 156 deletions(-)

diff --git a/arch/x86/include/asm/page_types.h b/arch/x86/include/asm/page_types.h
index 54c9787..9f6f3e6 100644
--- a/arch/x86/include/asm/page_types.h
+++ b/arch/x86/include/asm/page_types.h
@@ -45,6 +45,7 @@ extern int devmem_is_allowed(unsigned long pagenr);

extern unsigned long max_low_pfn_mapped;
extern unsigned long max_pfn_mapped;
+extern unsigned long min_pfn_mapped;

static inline phys_addr_t get_max_mapped(void)
{
diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 52d40a1..25fa5bb 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -599,6 +599,7 @@ static inline int pgd_none(pgd_t pgd)

extern int direct_gbpages;
void init_mem_mapping(void);
+void early_alloc_pgt_buf(void);

/* local pte updates need not use xchg for locking */
static inline pte_t native_local_ptep_get_and_clear(pte_t *ptep)
diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index 4989f80..3987daa 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -123,6 +123,7 @@
*/
unsigned long max_low_pfn_mapped;
unsigned long max_pfn_mapped;
+unsigned long min_pfn_mapped;

#ifdef CONFIG_DMI
RESERVE_BRK(dmi_alloc, 65536);
@@ -896,6 +897,8 @@ void __init setup_arch(char **cmdline_p)

reserve_ibft_region();

+ early_alloc_pgt_buf();
+
/*
* Need to conclude brk, before memblock_x86_fill()
* it could use memblock_find_in_range, could overlap with
diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
index 23ce4db..a060381 100644
--- a/arch/x86/mm/init.c
+++ b/arch/x86/mm/init.c
@@ -223,105 +223,6 @@ static int __meminit split_mem_range(struct map_range *mr, int nr_range,
return nr_range;
}

-static unsigned long __init calculate_table_space_size(unsigned long start,
- unsigned long end)
-{
- unsigned long puds = 0, pmds = 0, ptes = 0, tables;
- struct map_range mr[NR_RANGE_MR];
- int nr_range, i;
-
- pr_info("calculate_table_space_size: [mem %#010lx-%#010lx]\n",
- start, end - 1);
-
- memset(mr, 0, sizeof(mr));
- nr_range = 0;
- nr_range = split_mem_range(mr, nr_range, start, end);
-
- for (i = 0; i < nr_range; i++) {
- unsigned long range, extra;
-
- range = mr[i].end - mr[i].start;
- puds += (range + PUD_SIZE - 1) >> PUD_SHIFT;
-
- if (mr[i].page_size_mask & (1 << PG_LEVEL_1G)) {
- extra = range - ((range >> PUD_SHIFT) << PUD_SHIFT);
- pmds += (extra + PMD_SIZE - 1) >> PMD_SHIFT;
- } else
- pmds += (range + PMD_SIZE - 1) >> PMD_SHIFT;
-
- if (mr[i].page_size_mask & (1 << PG_LEVEL_2M)) {
- extra = range - ((range >> PMD_SHIFT) << PMD_SHIFT);
-#ifdef CONFIG_X86_32
- extra += PMD_SIZE;
-#endif
- /* The first 2/4M doesn't use large pages. */
- if (mr[i].start < PMD_SIZE)
- extra += PMD_SIZE - mr[i].start;
-
- ptes += (extra + PAGE_SIZE - 1) >> PAGE_SHIFT;
- } else
- ptes += (range + PAGE_SIZE - 1) >> PAGE_SHIFT;
- }
-
- tables = roundup(puds * sizeof(pud_t), PAGE_SIZE);
- tables += roundup(pmds * sizeof(pmd_t), PAGE_SIZE);
- tables += roundup(ptes * sizeof(pte_t), PAGE_SIZE);
-
-#ifdef CONFIG_X86_32
- /* for fixmap */
- tables += roundup(__end_of_fixed_addresses * sizeof(pte_t), PAGE_SIZE);
-#endif
-
- return tables;
-}
-
-static unsigned long __init calculate_all_table_space_size(void)
-{
- unsigned long start_pfn, end_pfn;
- unsigned long tables;
- int i;
-
- /* the ISA range is always mapped regardless of memory holes */
- tables = calculate_table_space_size(0, ISA_END_ADDRESS);
-
- for_each_mem_pfn_range(i, MAX_NUMNODES, &start_pfn, &end_pfn, NULL) {
- u64 start = start_pfn << PAGE_SHIFT;
- u64 end = end_pfn << PAGE_SHIFT;
-
- if (end <= ISA_END_ADDRESS)
- continue;
-
- if (start < ISA_END_ADDRESS)
- start = ISA_END_ADDRESS;
-#ifdef CONFIG_X86_32
- /* on 32 bit, we only map up to max_low_pfn */
- if ((start >> PAGE_SHIFT) >= max_low_pfn)
- continue;
-
- if ((end >> PAGE_SHIFT) > max_low_pfn)
- end = max_low_pfn << PAGE_SHIFT;
-#endif
- tables += calculate_table_space_size(start, end);
- }
-
- return tables;
-}
-
-static void __init find_early_table_space(unsigned long start,
- unsigned long good_end,
- unsigned long tables)
-{
- phys_addr_t base;
-
- base = memblock_find_in_range(start, good_end, tables, PAGE_SIZE);
- if (!base)
- panic("Cannot find space for the kernel page tables");
-
- pgt_buf_start = base >> PAGE_SHIFT;
- pgt_buf_end = pgt_buf_start;
- pgt_buf_top = pgt_buf_start + (tables >> PAGE_SHIFT);
-}
-
static struct range pfn_mapped[E820_X_MAX];
static int nr_pfn_mapped;

@@ -386,22 +287,17 @@ unsigned long __init_refok init_memory_mapping(unsigned long start,
}

/*
- * Iterate through E820 memory map and create direct mappings for only E820_RAM
- * regions. We cannot simply create direct mappings for all pfns from
- * [0 to max_low_pfn) and [4GB to max_pfn) because of possible memory holes in
- * high addresses that cannot be marked as UC by fixed/variable range MTRRs.
- * Depending on the alignment of E820 ranges, this may possibly result in using
- * smaller size (i.e. 4K instead of 2M or 1G) page tables.
+ * this one could take range with hole in it
*/
-static void __init init_all_memory_mapping(unsigned long all_start,
+static void __init init_range_memory_mapping(unsigned long all_start,
unsigned long all_end)
{
unsigned long start_pfn, end_pfn;
int i;

for_each_mem_pfn_range(i, MAX_NUMNODES, &start_pfn, &end_pfn, NULL) {
- u64 start = start_pfn << PAGE_SHIFT;
- u64 end = end_pfn << PAGE_SHIFT;
+ u64 start = (u64)start_pfn << PAGE_SHIFT;
+ u64 end = (u64)end_pfn << PAGE_SHIFT;

if (end <= all_start)
continue;
@@ -421,67 +317,59 @@ static void __init init_all_memory_mapping(unsigned long all_start,

void __init init_mem_mapping(void)
{
- unsigned long tables, good_end, end;
+ unsigned long end, start, last_start;
+ unsigned long step_size;

probe_page_size_mask();

- /*
- * Find space for the kernel direct mapping tables.
- *
- * Later we should allocate these tables in the local node of the
- * memory mapped. Unfortunately this is done currently before the
- * nodes are discovered.
- */
#ifdef CONFIG_X86_64
end = max_pfn << PAGE_SHIFT;
- good_end = end;
#else
end = max_low_pfn << PAGE_SHIFT;
- good_end = max_pfn_mapped << PAGE_SHIFT;
#endif
- tables = calculate_all_table_space_size();
- find_early_table_space(0, good_end, tables);
- printk(KERN_DEBUG "kernel direct mapping tables up to %#lx @ [mem %#010lx-%#010lx] prealloc\n",
- end - 1, pgt_buf_start << PAGE_SHIFT,
- (pgt_buf_top << PAGE_SHIFT) - 1);

- max_pfn_mapped = 0; /* will get exact value next */
/* the ISA range is always mapped regardless of memory holes */
init_memory_mapping(0, ISA_END_ADDRESS);
- init_all_memory_mapping(ISA_END_ADDRESS, end);
+
+ /* step_size need to be small so pgt_buf from BRK could cover it */
+ step_size = PMD_SIZE;
+ max_pfn_mapped = 0; /* will get exact value next */
+ min_pfn_mapped = end >> PAGE_SHIFT;
+ last_start = start = end;
+ while (last_start > ISA_END_ADDRESS) {
+ if (last_start > step_size) {
+ start = round_down(last_start - 1, step_size);
+ if (start < ISA_END_ADDRESS)
+ start = ISA_END_ADDRESS;
+ } else
+ start = ISA_END_ADDRESS;
+ init_all_memory_mapping(start, last_start);
+ last_start = start;
+ min_pfn_mapped = last_start >> PAGE_SHIFT;
+ step_size <<= 5;
+ }
+
#ifdef CONFIG_X86_64
if (max_pfn > max_low_pfn) {
/* can we preseve max_low_pfn ?*/
max_low_pfn = max_pfn;
}
#endif
- /*
- * Reserve the kernel pagetable pages we used (pgt_buf_start -
- * pgt_buf_end) and free the other ones (pgt_buf_end - pgt_buf_top)
- * so that they can be reused for other purposes.
- *
- * On native it just means calling memblock_reserve, on Xen it also
- * means marking RW the pagetable pages that we allocated before
- * but that haven't been used.
- *
- * In fact on xen we mark RO the whole range pgt_buf_start -
- * pgt_buf_top, because we have to make sure that when
- * init_memory_mapping reaches the pagetable pages area, it maps
- * RO all the pagetable pages, including the ones that are beyond
- * pgt_buf_end at that time.
- */
- if (pgt_buf_end > pgt_buf_start) {
- printk(KERN_DEBUG "kernel direct mapping tables up to %#lx @ [mem %#010lx-%#010lx] final\n",
- end - 1, pgt_buf_start << PAGE_SHIFT,
- (pgt_buf_end << PAGE_SHIFT) - 1);
- x86_init.mapping.pagetable_reserve(PFN_PHYS(pgt_buf_start),
- PFN_PHYS(pgt_buf_end));
- }
+ early_memtest(0, max_pfn_mapped << PAGE_SHIFT);
+}

- /* stop the wrong using */
- pgt_buf_top = 0;
+/* need 3 4k for initial PMD_SIZE, 4k for 0-ISA_END_ADDRESS */
+RESERVE_BRK(early_pgt_alloc, 16384);
+void __init early_alloc_pgt_buf(void)
+{
+ unsigned long tables = 16384;
+ phys_addr_t base;

- early_memtest(0, max_pfn_mapped << PAGE_SHIFT);
+ base = __pa(extend_brk(tables, PAGE_SIZE));
+
+ pgt_buf_start = base >> PAGE_SHIFT;
+ pgt_buf_end = pgt_buf_start;
+ pgt_buf_top = pgt_buf_start + (tables >> PAGE_SHIFT);
}

/*
diff --git a/arch/x86/mm/init_32.c b/arch/x86/mm/init_32.c
index 27f7fc6..7bb1106 100644
--- a/arch/x86/mm/init_32.c
+++ b/arch/x86/mm/init_32.c
@@ -61,11 +61,22 @@ bool __read_mostly __vmalloc_start_set = false;

static __init void *alloc_low_page(void)
{
- unsigned long pfn = pgt_buf_end++;
+ unsigned long pfn;
void *adr;

- if (pfn >= pgt_buf_top)
- panic("alloc_low_page: ran out of memory");
+ if ((pgt_buf_end + 1) >= pgt_buf_top) {
+ unsigned long ret;
+ if (min_pfn_mapped >= max_pfn_mapped)
+ panic("alloc_low_page: ran out of memory");
+ ret = memblock_find_in_range(min_pfn_mapped << PAGE_SHIFT,
+ max_pfn_mapped << PAGE_SHIFT,
+ PAGE_SIZE, PAGE_SIZE);
+ if (!ret)
+ panic("alloc_low_page: can not alloc memory");
+ memblock_reserve(ret, PAGE_SIZE);
+ pfn = ret >> PAGE_SHIFT;
+ } else
+ pfn = pgt_buf_end++;

adr = __va(pfn * PAGE_SIZE);
clear_page(adr);
diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index 4898e80..7dfa69b 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -316,7 +316,7 @@ void __init cleanup_highmap(void)

static __ref void *alloc_low_page(unsigned long *phys)
{
- unsigned long pfn = pgt_buf_end++;
+ unsigned long pfn;
void *adr;

if (after_bootmem) {
@@ -326,8 +326,19 @@ static __ref void *alloc_low_page(unsigned long *phys)
return adr;
}

- if (pfn >= pgt_buf_top)
- panic("alloc_low_page: ran out of memory");
+ if ((pgt_buf_end + 1) >= pgt_buf_top) {
+ unsigned long ret;
+ if (min_pfn_mapped >= max_pfn_mapped)
+ panic("alloc_low_page: ran out of memory");
+ ret = memblock_find_in_range(min_pfn_mapped << PAGE_SHIFT,
+ max_pfn_mapped << PAGE_SHIFT,
+ PAGE_SIZE, PAGE_SIZE);
+ if (!ret)
+ panic("alloc_low_page: can not alloc memory");
+ memblock_reserve(ret, PAGE_SIZE);
+ pfn = ret >> PAGE_SHIFT;
+ } else
+ pfn = pgt_buf_end++;

adr = early_memremap(pfn * PAGE_SIZE, PAGE_SIZE);
clear_page(adr);
--
1.7.7

2012-10-10 01:53:36

by Yinghai Lu

Subject: Re: [PATCH 6/7] x86, mm: setup page table from top-down

On Tue, Oct 9, 2012 at 4:58 PM, Yinghai Lu <[email protected]> wrote:
> Get pgt_buf early from BRK, and use it to map PMD_SIZE to top at first.
> then use page from PMD_SIZE to map next blow range.
>
> alloc_low_page will use page from BRK at first, then will switch to use
> to memblock to find and reserve page for page table usage.
>
> At last we could get rid of calculation and find early pgt related code.
>
> Suggested-by: "H. Peter Anvin" <[email protected]>
> Signed-off-by: Yinghai Lu <[email protected]>

Sorry, there is one typo in this patch; please use the attached one instead.


Attachments:
fix_max_pfn_xx_13.patch (11.60 kB)

2012-10-10 13:59:18

by Konrad Rzeszutek Wilk

Subject: Re: [PATCH -v3 0/7] x86: Use BRK to pre mapping page table to make xen happy

On Tue, Oct 09, 2012 at 04:58:28PM -0700, Yinghai Lu wrote:
> on top of tip/x86/mm2, but please zap last patch in that branch.

So while I appreciate you actively looking at this and iteratively
sending snapshots of the progress - I think it would be easier if
you posted a patchset that has rework done completly per what
Peter described.

That way folks who are reviewing will know when they can focus their
time on reviewing the whole thing in one go instead of doing it
step by step - as some of the patches still haven't addressed the
review comments that were given the first time.


> 1. use brk to mapping first PMD_SIZE range.
> 2. top down to initialize page table range by range.
> 3. get rid of calculate page table, and find_early_page_table.
> 4. remove early_ioremap in page table accessing.
>
> v2: changes, update xen interface about pagetable_reserve, so not
> use pgt_buf_* in xen code directly.
> v3: use range top-down to initialize page table, so will not use
> calculating/find early table anymore.
> also reorder the patches sequence.
>
> could be found at:
> git://git.kernel.org/pub/scm/linux/kernel/git/yinghai/linux-yinghai.git for-x86-mm
>
> later we could get rid of workaround about xen_mapping_pagetable_reserve, that
> could kill another 50 lines codes. --- will do that later because x86/mm2 is
> not updated to linus/master yet. If we do that now, will have merge conflicts.

I am confused. Why do you need x86/mm2? If you do need it, you need to
describe in this writeup why you depend on it, and what is there.

You can't base it on linus's tree?

>
> Thanks
>
> Yinghai Lu
>
> Yinghai Lu (7):
> x86, mm: align start address to correct big page size
> x86, mm: Use big page size for small memory range
> x86, mm: Don't clear page table if next range is ram
> x86, mm: only keep initial mapping for ram
> x86, mm: Break down init_all_memory_mapping
> x86, mm: setup page table from top-down
> x86, mm: Remove early_memremap workaround for page table accessing
>
> arch/x86/include/asm/page_types.h | 1 +
> arch/x86/include/asm/pgtable.h | 1 +
> arch/x86/kernel/setup.c | 3 +
> arch/x86/mm/init.c | 251 ++++++++++++------------------------
> arch/x86/mm/init_32.c | 18 +++-
> arch/x86/mm/init_64.c | 100 ++++++---------
> 6 files changed, 144 insertions(+), 230 deletions(-)
>
> --
> 1.7.7
>

2012-10-10 14:55:41

by Yinghai Lu

Subject: Re: [PATCH -v3 0/7] x86: Use BRK to pre mapping page table to make xen happy

On Wed, Oct 10, 2012 at 6:47 AM, Konrad Rzeszutek Wilk
<[email protected]> wrote:
> On Tue, Oct 09, 2012 at 04:58:28PM -0700, Yinghai Lu wrote:
>> on top of tip/x86/mm2, but please zap last patch in that branch.
>
> So while I appreciate you actively looking at this and iteratively
> sending snapshots of the progress - I think it would be easier if
> you posted a patchset that has rework done completly per what
> Peter described.

It should be done this time, so I asked Stefano to test it again.

>
> That way folks who are reviewing will know when they can focus their
> time on reviewing the whole thing in one go instead of doing it
> step by step - as some of the patches still haven't addressed the
> review comments that were given the first time.

I addressed the ones that I could.

>
>
>> 1. use brk to mapping first PMD_SIZE range.
>> 2. top down to initialize page table range by range.
>> 3. get rid of calculate page table, and find_early_page_table.
>> 4. remove early_ioremap in page table accessing.
>>
>> v2: changes, update xen interface about pagetable_reserve, so not
>> use pgt_buf_* in xen code directly.
>> v3: use range top-down to initialize page table, so will not use
>> calculating/find early table anymore.
>> also reorder the patches sequence.
>>
>> could be found at:
>> git://git.kernel.org/pub/scm/linux/kernel/git/yinghai/linux-yinghai.git for-x86-mm
>>
>> later we could get rid of workaround about xen_mapping_pagetable_reserve, that
>> could kill another 50 lines codes. --- will do that later because x86/mm2 is
>> not updated to linus/master yet. If we do that now, will have merge conflicts.
>
> I am confused. Why do you need x86/mm2? If you do need it, you need to
> describe in this writeup why you depend on it, and what is there.
>
> You can't base it on linus's tree?

No, some init_memory_mapping-related cleanups are in x86/mm2.

The whole story: while reviewing x86/mm2, Stefano said it would be better to
get rid of ioremap for accessing the page table area. So, following hpa's
concept, I use some pages in BRK to pre-map the page table at first.
In -v2, Stefano found a panic with Xen on a system with more than 4G.
To address that panic, we came up with -v3, which maps memory top-down and
grows the range size for each step from PMD_SIZE upward.

Thanks

Yinghai

2012-10-10 14:07:23

by Konrad Rzeszutek Wilk

Subject: Re: [PATCH 5/7] x86, mm: Break down init_all_memory_mapping

On Tue, Oct 09, 2012 at 04:58:33PM -0700, Yinghai Lu wrote:
> Will replace that will top-down page table initialization.

s/will/with?

> new one need to take range.

Huh? I have no idea what this patch does from your description.
It says it will replace something (not identified) with
top-down table initialization.

And the code is not that simple - you should explain _how_ you
are changing it. From what it did before to what it does _now_.

Think of the audience of kernel engineers who have memory loss
and will forget everything in three months - the perfect time when
the merge window opens and bugs start appearing. They (ok, maybe
it is only me who needs this) need the documentation to figure
out what went wrong. Please explain it.


>
> Signed-off-by: Yinghai Lu <[email protected]>
> ---
> arch/x86/mm/init.c | 41 +++++++++++++++++++----------------------
> 1 files changed, 19 insertions(+), 22 deletions(-)
>
> diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
> index ebf76ce..23ce4db 100644
> --- a/arch/x86/mm/init.c
> +++ b/arch/x86/mm/init.c
> @@ -393,40 +393,30 @@ unsigned long __init_refok init_memory_mapping(unsigned long start,
> * Depending on the alignment of E820 ranges, this may possibly result in using
> * smaller size (i.e. 4K instead of 2M or 1G) page tables.
> */
> -static void __init init_all_memory_mapping(void)
> +static void __init init_all_memory_mapping(unsigned long all_start,
> + unsigned long all_end)
> {
> unsigned long start_pfn, end_pfn;
> int i;
>
> - /* the ISA range is always mapped regardless of memory holes */
> - init_memory_mapping(0, ISA_END_ADDRESS);
> -
> for_each_mem_pfn_range(i, MAX_NUMNODES, &start_pfn, &end_pfn, NULL) {
> u64 start = start_pfn << PAGE_SHIFT;
> u64 end = end_pfn << PAGE_SHIFT;
>
> - if (end <= ISA_END_ADDRESS)
> + if (end <= all_start)
> continue;
>
> - if (start < ISA_END_ADDRESS)
> - start = ISA_END_ADDRESS;
> -#ifdef CONFIG_X86_32
> - /* on 32 bit, we only map up to max_low_pfn */
> - if ((start >> PAGE_SHIFT) >= max_low_pfn)
> + if (start < all_start)
> + start = all_start;
> +
> + if (start >= all_end)
> continue;
>
> - if ((end >> PAGE_SHIFT) > max_low_pfn)
> - end = max_low_pfn << PAGE_SHIFT;
> -#endif
> - init_memory_mapping(start, end);
> - }
> + if (end > all_end)
> + end = all_end;
>
> -#ifdef CONFIG_X86_64
> - if (max_pfn > max_low_pfn) {
> - /* can we preseve max_low_pfn ?*/
> - max_low_pfn = max_pfn;
> + init_memory_mapping(start, end);
> }
> -#endif
> }
>
> void __init init_mem_mapping(void)
> @@ -456,8 +446,15 @@ void __init init_mem_mapping(void)
> (pgt_buf_top << PAGE_SHIFT) - 1);
>
> max_pfn_mapped = 0; /* will get exact value next */
> - init_all_memory_mapping();
> -
> + /* the ISA range is always mapped regardless of memory holes */
> + init_memory_mapping(0, ISA_END_ADDRESS);
> + init_all_memory_mapping(ISA_END_ADDRESS, end);
> +#ifdef CONFIG_X86_64
> + if (max_pfn > max_low_pfn) {
> + /* can we preseve max_low_pfn ?*/
> + max_low_pfn = max_pfn;
> + }
> +#endif
> /*
> * Reserve the kernel pagetable pages we used (pgt_buf_start -
> * pgt_buf_end) and free the other ones (pgt_buf_end - pgt_buf_top)
> --
> 1.7.7
>

2012-10-10 17:07:54

by Yinghai Lu

Subject: Re: [PATCH 6/7] x86, mm: setup page table from top-down

On Wed, Oct 10, 2012 at 9:38 AM, Stefano Stabellini
<[email protected]> wrote:
>> - if (pfn >= pgt_buf_top)
>> - panic("alloc_low_page: ran out of memory");
>> + if ((pgt_buf_end + 1) >= pgt_buf_top) {
>> + unsigned long ret;
>> + if (min_pfn_mapped >= max_pfn_mapped)
>> + panic("alloc_low_page: ran out of memory");
>> + ret = memblock_find_in_range(min_pfn_mapped << PAGE_SHIFT,
>> + max_pfn_mapped << PAGE_SHIFT,
>> + PAGE_SIZE, PAGE_SIZE);
>> + if (!ret)
>> + panic("alloc_low_page: can not alloc memory");
>> + memblock_reserve(ret, PAGE_SIZE);
>> + pfn = ret >> PAGE_SHIFT;
>
> This cannot be right: you are allocating another page to be used as
> pagetable page, outside the range pgt_buf_start-pgt_buf_top.
>
> When that page is going to be hooked into the live pagetable, the kernel
> is going to panic on Xen because the page wasn't marked RO.
>
> If you want to do that you need to tell the Xen subsystem of the new
> page. pagetable_reserve is not the right call, we need a new one (see
> past emails).

ok, will change that interface and call in from alloc_low_page.

how about the pages from BRK, do we need to call xen hooks to mark it as RO?

2012-10-10 17:26:21

by Stefano Stabellini

Subject: Re: [PATCH 6/7] x86, mm: setup page table from top-down

On Wed, 10 Oct 2012, Yinghai Lu wrote:
> On Wed, Oct 10, 2012 at 9:38 AM, Stefano Stabellini
> <[email protected]> wrote:
> >> - if (pfn >= pgt_buf_top)
> >> - panic("alloc_low_page: ran out of memory");
> >> + if ((pgt_buf_end + 1) >= pgt_buf_top) {
> >> + unsigned long ret;
> >> + if (min_pfn_mapped >= max_pfn_mapped)
> >> + panic("alloc_low_page: ran out of memory");
> >> + ret = memblock_find_in_range(min_pfn_mapped << PAGE_SHIFT,
> >> + max_pfn_mapped << PAGE_SHIFT,
> >> + PAGE_SIZE, PAGE_SIZE);
> >> + if (!ret)
> >> + panic("alloc_low_page: can not alloc memory");
> >> + memblock_reserve(ret, PAGE_SIZE);
> >> + pfn = ret >> PAGE_SHIFT;
> >
> > This cannot be right: you are allocating another page to be used as
> > pagetable page, outside the range pgt_buf_start-pgt_buf_top.
> >
> > When that page is going to be hooked into the live pagetable, the kernel
> > is going to panic on Xen because the page wasn't marked RO.
> >
> > If you want to do that you need to tell the Xen subsystem of the new
> > page. pagetable_reserve is not the right call, we need a new one (see
> > past emails).
>
> ok, will change that interface and call in from alloc_low_page.
>
> how about the pages from BRK, do we need to call xen hooks to mark it as RO?
>

It doesn't matter whether they come from BRK or other memory: Xen
assumes that all the pagetable pages come from
pgt_buf_start-pgt_buf_top, so if you are going to use another range you
need to tell Xen about it.

Alternatively, you can follow Peter's suggestion and replace the current
hooks with a new one with a more precise and well defined semantic.
Something along the lines of "this pagetable page is about to be hooked
into the live pagetable". Xen would use the hook to mark it RO.
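
For illustration only, a minimal sketch of the shape such a hook could take
(the -v4 series later adds one named mapping_mark_page_ro; the signature and
placement here are assumptions, not taken from any posted patch):

	/* hypothetical pvop: a no-op on native, marks the page read-only on
	 * Xen; called right before a freshly allocated page-table page is
	 * hooked into the live page table */
	void (*mark_page_ro)(unsigned long pfn);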

2012-10-10 14:59:49

by Yinghai Lu

Subject: Re: [PATCH 4/7] x86, mm: only keep initial mapping for ram

On Wed, Oct 10, 2012 at 6:48 AM, Konrad Rzeszutek Wilk
<[email protected]> wrote:
> On Tue, Oct 09, 2012 at 04:58:32PM -0700, Yinghai Lu wrote:
>> 0 mean any e820 type will be kept, and only hole is removed.
>>
>> change to E820_RAM and E820_RESERVED_KERN only.
>>
>
> This is good candidate for squashing in-to the previous patch, with
> verbose explanation why you care only about those two types.

Two patches should be safer.

The new series makes sure we only map RAM/RESERVED_KERN, so we should only
keep the initial page table entries for those types.

2012-10-10 17:38:40

by Yinghai Lu

Subject: Re: [PATCH 6/7] x86, mm: setup page table from top-down

On Wed, Oct 10, 2012 at 10:26 AM, Stefano Stabellini
<[email protected]> wrote:
> On Wed, 10 Oct 2012, Yinghai Lu wrote:
>
> It doesn't matter whether they come from BRK or other memory: Xen
> assumes that all the pagetable pages come from
> pgt_buf_start-pgt_buf_top, so if you are going to use another range you
> need to tell Xen about it.
>
> Alternatively, you can follow Peter's suggestion and replace the current
> hooks with a new one with a more precise and well defined semantic.
> Something along the lines of "this pagetable page is about to be hooked
> into the live pagetable". Xen would use the hook to mark it RO.

Would the attached patch, on top of this patch, fix the problem?


Attachments:
fix_xen.patch (3.28 kB)

2012-10-10 16:38:43

by Stefano Stabellini

Subject: Re: [PATCH 6/7] x86, mm: setup page table from top-down

On Wed, 10 Oct 2012, Yinghai Lu wrote:
> Get pgt_buf early from BRK, and use it to map PMD_SIZE to top at first.
> then use page from PMD_SIZE to map next blow range.
>
> alloc_low_page will use page from BRK at first, then will switch to use
> to memblock to find and reserve page for page table usage.
>
> At last we could get rid of calculation and find early pgt related code.
>
> Suggested-by: "H. Peter Anvin" <[email protected]>
> Signed-off-by: Yinghai Lu <[email protected]>
> ---
> arch/x86/include/asm/page_types.h | 1 +
> arch/x86/include/asm/pgtable.h | 1 +
> arch/x86/kernel/setup.c | 3 +
> arch/x86/mm/init.c | 188 ++++++++-----------------------------
> arch/x86/mm/init_32.c | 17 +++-
> arch/x86/mm/init_64.c | 17 +++-
> 6 files changed, 71 insertions(+), 156 deletions(-)
>
> diff --git a/arch/x86/include/asm/page_types.h b/arch/x86/include/asm/page_types.h
> index 54c9787..9f6f3e6 100644
> --- a/arch/x86/include/asm/page_types.h
> +++ b/arch/x86/include/asm/page_types.h
> @@ -45,6 +45,7 @@ extern int devmem_is_allowed(unsigned long pagenr);
>
> extern unsigned long max_low_pfn_mapped;
> extern unsigned long max_pfn_mapped;
> +extern unsigned long min_pfn_mapped;
>
> static inline phys_addr_t get_max_mapped(void)
> {
> diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
> index 52d40a1..25fa5bb 100644
> --- a/arch/x86/include/asm/pgtable.h
> +++ b/arch/x86/include/asm/pgtable.h
> @@ -599,6 +599,7 @@ static inline int pgd_none(pgd_t pgd)
>
> extern int direct_gbpages;
> void init_mem_mapping(void);
> +void early_alloc_pgt_buf(void);
>
> /* local pte updates need not use xchg for locking */
> static inline pte_t native_local_ptep_get_and_clear(pte_t *ptep)
> diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
> index 4989f80..3987daa 100644
> --- a/arch/x86/kernel/setup.c
> +++ b/arch/x86/kernel/setup.c
> @@ -123,6 +123,7 @@
> */
> unsigned long max_low_pfn_mapped;
> unsigned long max_pfn_mapped;
> +unsigned long min_pfn_mapped;
>
> #ifdef CONFIG_DMI
> RESERVE_BRK(dmi_alloc, 65536);
> @@ -896,6 +897,8 @@ void __init setup_arch(char **cmdline_p)
>
> reserve_ibft_region();
>
> + early_alloc_pgt_buf();
> +
> /*
> * Need to conclude brk, before memblock_x86_fill()
> * it could use memblock_find_in_range, could overlap with
> diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
> index 23ce4db..a060381 100644
> --- a/arch/x86/mm/init.c
> +++ b/arch/x86/mm/init.c
> @@ -223,105 +223,6 @@ static int __meminit split_mem_range(struct map_range *mr, int nr_range,
> return nr_range;
> }
>
> -static unsigned long __init calculate_table_space_size(unsigned long start,
> - unsigned long end)
> -{
> - unsigned long puds = 0, pmds = 0, ptes = 0, tables;
> - struct map_range mr[NR_RANGE_MR];
> - int nr_range, i;
> -
> - pr_info("calculate_table_space_size: [mem %#010lx-%#010lx]\n",
> - start, end - 1);
> -
> - memset(mr, 0, sizeof(mr));
> - nr_range = 0;
> - nr_range = split_mem_range(mr, nr_range, start, end);
> -
> - for (i = 0; i < nr_range; i++) {
> - unsigned long range, extra;
> -
> - range = mr[i].end - mr[i].start;
> - puds += (range + PUD_SIZE - 1) >> PUD_SHIFT;
> -
> - if (mr[i].page_size_mask & (1 << PG_LEVEL_1G)) {
> - extra = range - ((range >> PUD_SHIFT) << PUD_SHIFT);
> - pmds += (extra + PMD_SIZE - 1) >> PMD_SHIFT;
> - } else
> - pmds += (range + PMD_SIZE - 1) >> PMD_SHIFT;
> -
> - if (mr[i].page_size_mask & (1 << PG_LEVEL_2M)) {
> - extra = range - ((range >> PMD_SHIFT) << PMD_SHIFT);
> -#ifdef CONFIG_X86_32
> - extra += PMD_SIZE;
> -#endif
> - /* The first 2/4M doesn't use large pages. */
> - if (mr[i].start < PMD_SIZE)
> - extra += PMD_SIZE - mr[i].start;
> -
> - ptes += (extra + PAGE_SIZE - 1) >> PAGE_SHIFT;
> - } else
> - ptes += (range + PAGE_SIZE - 1) >> PAGE_SHIFT;
> - }
> -
> - tables = roundup(puds * sizeof(pud_t), PAGE_SIZE);
> - tables += roundup(pmds * sizeof(pmd_t), PAGE_SIZE);
> - tables += roundup(ptes * sizeof(pte_t), PAGE_SIZE);
> -
> -#ifdef CONFIG_X86_32
> - /* for fixmap */
> - tables += roundup(__end_of_fixed_addresses * sizeof(pte_t), PAGE_SIZE);
> -#endif
> -
> - return tables;
> -}
> -
> -static unsigned long __init calculate_all_table_space_size(void)
> -{
> - unsigned long start_pfn, end_pfn;
> - unsigned long tables;
> - int i;
> -
> - /* the ISA range is always mapped regardless of memory holes */
> - tables = calculate_table_space_size(0, ISA_END_ADDRESS);
> -
> - for_each_mem_pfn_range(i, MAX_NUMNODES, &start_pfn, &end_pfn, NULL) {
> - u64 start = start_pfn << PAGE_SHIFT;
> - u64 end = end_pfn << PAGE_SHIFT;
> -
> - if (end <= ISA_END_ADDRESS)
> - continue;
> -
> - if (start < ISA_END_ADDRESS)
> - start = ISA_END_ADDRESS;
> -#ifdef CONFIG_X86_32
> - /* on 32 bit, we only map up to max_low_pfn */
> - if ((start >> PAGE_SHIFT) >= max_low_pfn)
> - continue;
> -
> - if ((end >> PAGE_SHIFT) > max_low_pfn)
> - end = max_low_pfn << PAGE_SHIFT;
> -#endif
> - tables += calculate_table_space_size(start, end);
> - }
> -
> - return tables;
> -}
> -
> -static void __init find_early_table_space(unsigned long start,
> - unsigned long good_end,
> - unsigned long tables)
> -{
> - phys_addr_t base;
> -
> - base = memblock_find_in_range(start, good_end, tables, PAGE_SIZE);
> - if (!base)
> - panic("Cannot find space for the kernel page tables");
> -
> - pgt_buf_start = base >> PAGE_SHIFT;
> - pgt_buf_end = pgt_buf_start;
> - pgt_buf_top = pgt_buf_start + (tables >> PAGE_SHIFT);
> -}
> -
> static struct range pfn_mapped[E820_X_MAX];
> static int nr_pfn_mapped;
>
> @@ -386,22 +287,17 @@ unsigned long __init_refok init_memory_mapping(unsigned long start,
> }
>
> /*
> - * Iterate through E820 memory map and create direct mappings for only E820_RAM
> - * regions. We cannot simply create direct mappings for all pfns from
> - * [0 to max_low_pfn) and [4GB to max_pfn) because of possible memory holes in
> - * high addresses that cannot be marked as UC by fixed/variable range MTRRs.
> - * Depending on the alignment of E820 ranges, this may possibly result in using
> - * smaller size (i.e. 4K instead of 2M or 1G) page tables.
> + * this one could take range with hole in it
> */
> -static void __init init_all_memory_mapping(unsigned long all_start,
> +static void __init init_range_memory_mapping(unsigned long all_start,
> unsigned long all_end)
> {
> unsigned long start_pfn, end_pfn;
> int i;
>
> for_each_mem_pfn_range(i, MAX_NUMNODES, &start_pfn, &end_pfn, NULL) {
> - u64 start = start_pfn << PAGE_SHIFT;
> - u64 end = end_pfn << PAGE_SHIFT;
> + u64 start = (u64)start_pfn << PAGE_SHIFT;
> + u64 end = (u64)end_pfn << PAGE_SHIFT;
>
> if (end <= all_start)
> continue;
> @@ -421,67 +317,59 @@ static void __init init_all_memory_mapping(unsigned long all_start,
>
> void __init init_mem_mapping(void)
> {
> - unsigned long tables, good_end, end;
> + unsigned long end, start, last_start;
> + unsigned long step_size;
>
> probe_page_size_mask();
>
> - /*
> - * Find space for the kernel direct mapping tables.
> - *
> - * Later we should allocate these tables in the local node of the
> - * memory mapped. Unfortunately this is done currently before the
> - * nodes are discovered.
> - */
> #ifdef CONFIG_X86_64
> end = max_pfn << PAGE_SHIFT;
> - good_end = end;
> #else
> end = max_low_pfn << PAGE_SHIFT;
> - good_end = max_pfn_mapped << PAGE_SHIFT;
> #endif
> - tables = calculate_all_table_space_size();
> - find_early_table_space(0, good_end, tables);
> - printk(KERN_DEBUG "kernel direct mapping tables up to %#lx @ [mem %#010lx-%#010lx] prealloc\n",
> - end - 1, pgt_buf_start << PAGE_SHIFT,
> - (pgt_buf_top << PAGE_SHIFT) - 1);
>
> - max_pfn_mapped = 0; /* will get exact value next */
> /* the ISA range is always mapped regardless of memory holes */
> init_memory_mapping(0, ISA_END_ADDRESS);
> - init_all_memory_mapping(ISA_END_ADDRESS, end);
> +
> + /* step_size need to be small so pgt_buf from BRK could cover it */
> + step_size = PMD_SIZE;
> + max_pfn_mapped = 0; /* will get exact value next */
> + min_pfn_mapped = end >> PAGE_SHIFT;
> + last_start = start = end;
> + while (last_start > ISA_END_ADDRESS) {
> + if (last_start > step_size) {
> + start = round_down(last_start - 1, step_size);
> + if (start < ISA_END_ADDRESS)
> + start = ISA_END_ADDRESS;
> + } else
> + start = ISA_END_ADDRESS;
> + init_all_memory_mapping(start, last_start);
> + last_start = start;
> + min_pfn_mapped = last_start >> PAGE_SHIFT;
> + step_size <<= 5;
> + }
> +
> #ifdef CONFIG_X86_64
> if (max_pfn > max_low_pfn) {
> /* can we preseve max_low_pfn ?*/
> max_low_pfn = max_pfn;
> }
> #endif
> - /*
> - * Reserve the kernel pagetable pages we used (pgt_buf_start -
> - * pgt_buf_end) and free the other ones (pgt_buf_end - pgt_buf_top)
> - * so that they can be reused for other purposes.
> - *
> - * On native it just means calling memblock_reserve, on Xen it also
> - * means marking RW the pagetable pages that we allocated before
> - * but that haven't been used.
> - *
> - * In fact on xen we mark RO the whole range pgt_buf_start -
> - * pgt_buf_top, because we have to make sure that when
> - * init_memory_mapping reaches the pagetable pages area, it maps
> - * RO all the pagetable pages, including the ones that are beyond
> - * pgt_buf_end at that time.
> - */
> - if (pgt_buf_end > pgt_buf_start) {
> - printk(KERN_DEBUG "kernel direct mapping tables up to %#lx @ [mem %#010lx-%#010lx] final\n",
> - end - 1, pgt_buf_start << PAGE_SHIFT,
> - (pgt_buf_end << PAGE_SHIFT) - 1);
> - x86_init.mapping.pagetable_reserve(PFN_PHYS(pgt_buf_start),
> - PFN_PHYS(pgt_buf_end));
> - }
> + early_memtest(0, max_pfn_mapped << PAGE_SHIFT);
> +}
>
> - /* stop the wrong using */
> - pgt_buf_top = 0;
> +/* need 3 4k for initial PMD_SIZE, 4k for 0-ISA_END_ADDRESS */
> +RESERVE_BRK(early_pgt_alloc, 16384);
> +void __init early_alloc_pgt_buf(void)
> +{
> + unsigned long tables = 16384;
> + phys_addr_t base;
>
> - early_memtest(0, max_pfn_mapped << PAGE_SHIFT);
> + base = __pa(extend_brk(tables, PAGE_SIZE));
> +
> + pgt_buf_start = base >> PAGE_SHIFT;
> + pgt_buf_end = pgt_buf_start;
> + pgt_buf_top = pgt_buf_start + (tables >> PAGE_SHIFT);
> }
>
> /*
> diff --git a/arch/x86/mm/init_32.c b/arch/x86/mm/init_32.c
> index 27f7fc6..7bb1106 100644
> --- a/arch/x86/mm/init_32.c
> +++ b/arch/x86/mm/init_32.c
> @@ -61,11 +61,22 @@ bool __read_mostly __vmalloc_start_set = false;
>
> static __init void *alloc_low_page(void)
> {
> - unsigned long pfn = pgt_buf_end++;
> + unsigned long pfn;
> void *adr;
>
> - if (pfn >= pgt_buf_top)
> - panic("alloc_low_page: ran out of memory");
> + if ((pgt_buf_end + 1) >= pgt_buf_top) {
> + unsigned long ret;
> + if (min_pfn_mapped >= max_pfn_mapped)
> + panic("alloc_low_page: ran out of memory");
> + ret = memblock_find_in_range(min_pfn_mapped << PAGE_SHIFT,
> + max_pfn_mapped << PAGE_SHIFT,
> + PAGE_SIZE, PAGE_SIZE);
> + if (!ret)
> + panic("alloc_low_page: can not alloc memory");
> + memblock_reserve(ret, PAGE_SIZE);
> + pfn = ret >> PAGE_SHIFT;
> + } else
> + pfn = pgt_buf_end++;
>
> adr = __va(pfn * PAGE_SIZE);
> clear_page(adr);
> diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
> index 4898e80..7dfa69b 100644
> --- a/arch/x86/mm/init_64.c
> +++ b/arch/x86/mm/init_64.c
> @@ -316,7 +316,7 @@ void __init cleanup_highmap(void)
>
> static __ref void *alloc_low_page(unsigned long *phys)
> {
> - unsigned long pfn = pgt_buf_end++;
> + unsigned long pfn;
> void *adr;
>
> if (after_bootmem) {
> @@ -326,8 +326,19 @@ static __ref void *alloc_low_page(unsigned long *phys)
> return adr;
> }
>
> - if (pfn >= pgt_buf_top)
> - panic("alloc_low_page: ran out of memory");
> + if ((pgt_buf_end + 1) >= pgt_buf_top) {
> + unsigned long ret;
> + if (min_pfn_mapped >= max_pfn_mapped)
> + panic("alloc_low_page: ran out of memory");
> + ret = memblock_find_in_range(min_pfn_mapped << PAGE_SHIFT,
> + max_pfn_mapped << PAGE_SHIFT,
> + PAGE_SIZE, PAGE_SIZE);
> + if (!ret)
> + panic("alloc_low_page: can not alloc memory");
> + memblock_reserve(ret, PAGE_SIZE);
> + pfn = ret >> PAGE_SHIFT;

This cannot be right: you are allocating another page to be used as
pagetable page, outside the range pgt_buf_start-pgt_buf_top.

When that page is going to be hooked into the live pagetable, the kernel
is going to panic on Xen because the page wasn't marked RO.

If you want to do that you need to tell the Xen subsystem of the new
page. pagetable_reserve is not the right call, we need a new one (see
past emails).


> + } else
> + pfn = pgt_buf_end++;
>
> adr = early_memremap(pfn * PAGE_SIZE, PAGE_SIZE);
> clear_page(adr);
> --
> 1.7.7
>

2012-10-10 16:41:00

by Stefano Stabellini

Subject: Re: [PATCH -v3 0/7] x86: Use BRK to pre mapping page table to make xen happy

On Wed, 10 Oct 2012, Yinghai Lu wrote:
> on top of tip/x86/mm2, but please zap last patch in that branch.
>
> 1. use brk to mapping first PMD_SIZE range.
> 2. top down to initialize page table range by range.
> 3. get rid of calculate page table, and find_early_page_table.
> 4. remove early_ioremap in page table accessing.
>
> v2: changes, update xen interface about pagetable_reserve, so not
> use pgt_buf_* in xen code directly.
> v3: use range top-down to initialize page table, so will not use
> calculating/find early table anymore.
> also reorder the patches sequence.
>
> could be found at:
> git://git.kernel.org/pub/scm/linux/kernel/git/yinghai/linux-yinghai.git for-x86-mm
>
> later we could get rid of workaround about xen_mapping_pagetable_reserve, that
> could kill another 50 lines codes. --- will do that later because x86/mm2 is
> not updated to linus/master yet. If we do that now, will have merge conflicts.

I don't think you can change the x86 code without changing the xen code
at the same time, otherwise you'll be really likely to break Xen. That
is unless you don't change any of the pvops but I thought that it was
one of the points of this series.



> Yinghai Lu
>
> Yinghai Lu (7):
> x86, mm: align start address to correct big page size
> x86, mm: Use big page size for small memory range
> x86, mm: Don't clear page table if next range is ram
> x86, mm: only keep initial mapping for ram
> x86, mm: Break down init_all_memory_mapping
> x86, mm: setup page table from top-down
> x86, mm: Remove early_memremap workaround for page table accessing
>
> arch/x86/include/asm/page_types.h | 1 +
> arch/x86/include/asm/pgtable.h | 1 +
> arch/x86/kernel/setup.c | 3 +
> arch/x86/mm/init.c | 251 ++++++++++++------------------------
> arch/x86/mm/init_32.c | 18 +++-
> arch/x86/mm/init_64.c | 100 ++++++---------
> 6 files changed, 144 insertions(+), 230 deletions(-)

So you are missing the Xen patches entirely in this iteration of the
series?

2012-10-10 14:00:46

by Konrad Rzeszutek Wilk

Subject: Re: [PATCH 4/7] x86, mm: only keep initial mapping for ram

On Tue, Oct 09, 2012 at 04:58:32PM -0700, Yinghai Lu wrote:
> 0 mean any e820 type will be kept, and only hole is removed.
>
> change to E820_RAM and E820_RESERVED_KERN only.
>

This is good candidate for squashing in-to the previous patch, with
verbose explanation why you care only about those two types.

> Signed-off-by: Yinghai Lu <[email protected]>
> ---
> arch/x86/mm/init_64.c | 9 ++++++---
> 1 files changed, 6 insertions(+), 3 deletions(-)
>
> diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
> index 61b3c44..4898e80 100644
> --- a/arch/x86/mm/init_64.c
> +++ b/arch/x86/mm/init_64.c
> @@ -373,7 +373,8 @@ phys_pte_init(pte_t *pte_page, unsigned long addr, unsigned long end,
> next = (addr & PAGE_MASK) + PAGE_SIZE;
> if (addr >= end) {
> if (!after_bootmem &&
> - !e820_any_mapped(addr & PAGE_MASK, next, 0))
> + !e820_any_mapped(addr & PAGE_MASK, next, E820_RAM) &&
> + !e820_any_mapped(addr & PAGE_MASK, next, E820_RESERVED_KERN))
> set_pte(pte, __pte(0));
> continue;
> }
> @@ -420,7 +421,8 @@ phys_pmd_init(pmd_t *pmd_page, unsigned long address, unsigned long end,
> next = (address & PMD_MASK) + PMD_SIZE;
> if (address >= end) {
> if (!after_bootmem &&
> - !e820_any_mapped(address & PMD_MASK, next, 0))
> + !e820_any_mapped(address & PMD_MASK, next, E820_RAM) &&
> + !e820_any_mapped(address & PMD_MASK, next, E820_RESERVED_KERN))
> set_pmd(pmd, __pmd(0));
> continue;
> }
> @@ -494,7 +496,8 @@ phys_pud_init(pud_t *pud_page, unsigned long addr, unsigned long end,
> next = (addr & PUD_MASK) + PUD_SIZE;
> if (addr >= end) {
> if (!after_bootmem &&
> - !e820_any_mapped(addr & PUD_MASK, next, 0))
> + !e820_any_mapped(addr & PUD_MASK, next, E820_RAM) &&
> + !e820_any_mapped(addr & PUD_MASK, next, E820_RESERVED_KERN))
> set_pud(pud, __pud(0));
> continue;
> }
> --
> 1.7.7
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
>
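
The kind of verbose explanation Konrad is asking for might look like the
comment below, placed above the checks in phys_pte_init()/phys_pmd_init()/
phys_pud_init(). The check itself is the one from the quoted hunks; the
comment wording is only a suggestion, not text from the actual patch.

	/*
	 * Only E820_RAM and E820_RESERVED_KERN ranges are ever direct-mapped,
	 * so a pte covering a page that intersects neither type cannot belong
	 * to any present or future mapping and is safe to clear; every other
	 * e820 type (reserved, ACPI, holes) is intentionally left unmapped.
	 */
	if (!after_bootmem &&
	    !e820_any_mapped(addr & PAGE_MASK, next, E820_RAM) &&
	    !e820_any_mapped(addr & PAGE_MASK, next, E820_RESERVED_KERN))
		set_pte(pte, __pte(0));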

2012-10-10 15:43:00

by Yinghai Lu

[permalink] [raw]
Subject: Re: [PATCH 5/7] x86, mm: Break down init_all_memory_mapping

On Wed, Oct 10, 2012 at 6:55 AM, Konrad Rzeszutek Wilk
<[email protected]> wrote:
> On Tue, Oct 09, 2012 at 04:58:33PM -0700, Yinghai Lu wrote:
>> Will replace that will top-down page table initialization.
>
> s/will/with?
>
>> new one need to take range.
>
> Huh? I have no idea what this patch does from your description.
> It says it will replace something (not identified) with
> top-down table initialization.
>
> And the code is not that simple - you should explain _how_ you
> are changing it. From what it did before to what it does _now_.

I tried to make the next patch a little smaller, so I split this change out.

2012-10-11 06:13:48

by Yinghai Lu

[permalink] [raw]
Subject: Re: [PATCH -v3 0/7] x86: Use BRK to pre mapping page table to make xen happy

On Wed, Oct 10, 2012 at 9:40 AM, Stefano Stabellini
<[email protected]> wrote:
>
> So you are missing the Xen patches entirely in this iteration of the
> series?

please check updated for-x86-mm branch.

[PATCH -v4 00/15] x86: Use BRK to pre mapping page table to make xen happy

on top of current linus/master and tip/x86/mm2, but please zap last
patch in that branch.

1. use brk to mapping first PMD_SIZE range.
2. top down to initialize page table range by range.
3. get rid of calculate page table, and find_early_page_table.
4. remove early_ioremap in page table accessing.

v2: changes, update xen interface about pagetable_reserve, so not
use pgt_buf_* in xen code directly.
v3: use range top-down to initialize page table, so will not use
calculating/find early table anymore.
also reorder the patches sequence.
v4: add mapping_mark_page_ro to fix xen, also move pgt_buf_* to init.c
and merge alloc_low_page()

could be found at:
git://git.kernel.org/pub/scm/linux/kernel/git/yinghai/linux-yinghai.git
for-x86-mm

Yinghai Lu (15):
x86, mm: Align start address to correct big page size
x86, mm: Use big page size for small memory range
x86, mm: Don't clear page table if next range is ram
x86, mm: only keep initial mapping for ram
x86, mm: Break down init_all_memory_mapping
x86, xen: Add xen_mapping_mark_page_ro
x86, mm: setup page table in top-down
x86, mm: Remove early_memremap workaround for page table accessing on 64bit
x86, mm: Remove parameter in alloc_low_page for 64bit
x86, mm: Merge alloc_low_page between 64bit and 32bit
x86, mm: Move min_pfn_mapped back to mm/init.c
x86, mm, xen: Remove mapping_pagatable_reserve
x86, mm: Add alloc_low_pages(num)
x86, mm: only call early_ioremap_page_range_init() one time on 32bit
x86, mm: Move back pgt_buf_* to mm/init.c

arch/x86/include/asm/init.h | 4 -
arch/x86/include/asm/pgtable.h | 1 +
arch/x86/include/asm/pgtable_types.h | 1 -
arch/x86/include/asm/x86_init.h | 2 +-
arch/x86/kernel/setup.c | 2 +
arch/x86/kernel/x86_init.c | 3 +-
arch/x86/mm/init.c | 321 +++++++++++++++-------------------
arch/x86/mm/init_32.c | 76 ++++++--
arch/x86/mm/init_64.c | 119 ++++---------
arch/x86/mm/mm_internal.h | 11 ++
arch/x86/xen/mmu.c | 29 +---
11 files changed, 249 insertions(+), 320 deletions(-)
create mode 100644 arch/x86/mm/mm_internal.h

2012-10-11 23:04:47

by Yinghai Lu

[permalink] [raw]
Subject: Re: [PATCH -v3 0/7] x86: Use BRK to pre mapping page table to make xen happy

On Wed, Oct 10, 2012 at 11:13 PM, Yinghai Lu <[email protected]> wrote:
> On Wed, Oct 10, 2012 at 9:40 AM, Stefano Stabellini
> <[email protected]> wrote:
>>
>> So you are missing the Xen patches entirely in this iteration of the
>> series?
>
> please check updated for-x86-mm branch.
>
> [PATCH -v4 00/15] x86: Use BRK to pre mapping page table to make xen happy
>
> could be found at:
> git://git.kernel.org/pub/scm/linux/kernel/git/yinghai/linux-yinghai.git
> for-x86-mm

Stefano,

Can you try -v4 to see if xen works with the changes?

Thanks

Yinghai

2012-10-18 16:17:55

by Stefano Stabellini

[permalink] [raw]
Subject: Re: [PATCH -v3 0/7] x86: Use BRK to pre mapping page table to make xen happy

On Thu, 11 Oct 2012, Yinghai Lu wrote:
> On Wed, Oct 10, 2012 at 9:40 AM, Stefano Stabellini
> <[email protected]> wrote:
> >
> > So you are missing the Xen patches entirely in this iteration of the
> > series?
>
> please check updated for-x86-mm branch.
>
> [PATCH -v4 00/15] x86: Use BRK to pre mapping page table to make xen happy
>
> on top of current linus/master and tip/x86/mm2, but please zap last
> patch in that branch.
>
> 1. use brk to mapping first PMD_SIZE range.
> 2. top down to initialize page table range by range.
> 3. get rid of calculate page table, and find_early_page_table.
> 4. remove early_ioremap in page table accessing.
>
> v2: changes, update xen interface about pagetable_reserve, so not
> use pgt_buf_* in xen code directly.
> v3: use range top-down to initialize page table, so will not use
> calculating/find early table anymore.
> also reorder the patches sequence.
> v4: add mapping_mark_page_ro to fix xen, also move pgt_buf_* to init.c
> and merge alloc_low_page()
>
> could be found at:
> git://git.kernel.org/pub/scm/linux/kernel/git/yinghai/linux-yinghai.git
> for-x86-mm

I find that patch series are easier to review than having to check out
your code and read the commit messages. Please post your patch series to
the LKML next time.

In any case, regarding "x86, xen: Add xen_mapping_mark_page_ro": please
take Peter's feedback into account; mark_page_ro is not a good name for
a pvops.
Also I don't believe that this call is actually needed, see below.

Regarding "x86, mm: setup page table in top-down": if you mark the
pagetable page RO in alloc_low_page, won't the entire thing crash as
soon as you try to write to it? You are supposed to mark it RO after
filling up the pagetable page and before hooking it into the live
pagetable.
However, contrary to my expectations, I did a quick test and it seems to
be working; that is probably due to a bug: maybe __pa or lookup_address
don't work correctly when called so early?

In any case we don't care about that because if we assume that
alloc_low_pages will always return a page from a range that is already
mapped, then we do not need this pvop anymore. That's because
xen_alloc_pte_init, xen_alloc_pmd_init, etc., will automatically mark the
page RO before hooking it into the pagetable.
Sorry I misled you last time.

Let me repeat it again:
Can we assume that the page returned by alloc_low_pages is already mapped?

Yes? In that case let's get rid of mark_page_ro and everything should
work.

It is worth stating it in clear letters in a comment on top of
alloc_low_pages:

"This function always returns a page from a memory range already
mapped."

2012-10-18 16:26:21

by Jacob Shin

[permalink] [raw]
Subject: Re: [PATCH -v3 0/7] x86: Use BRK to pre mapping page table to make xen happy

On Thu, Oct 18, 2012 at 05:17:28PM +0100, Stefano Stabellini wrote:
> On Thu, 11 Oct 2012, Yinghai Lu wrote:
> > On Wed, Oct 10, 2012 at 9:40 AM, Stefano Stabellini
> > <[email protected]> wrote:
> > >
> > > So you are missing the Xen patches entirely in this iteration of the
> > > series?
> >
> > please check updated for-x86-mm branch.
> >
> > [PATCH -v4 00/15] x86: Use BRK to pre mapping page table to make xen happy
> >
> > on top of current linus/master and tip/x86/mm2, but please zap last
> > patch in that branch.
> >
> > 1. use brk to mapping first PMD_SIZE range.
> > 2. top down to initialize page table range by range.
> > 3. get rid of calculate page table, and find_early_page_table.
> > 4. remove early_ioremap in page table accessing.
> >
> > v2: changes, update xen interface about pagetable_reserve, so not
> > use pgt_buf_* in xen code directly.
> > v3: use range top-down to initialize page table, so will not use
> > calculating/find early table anymore.
> > also reorder the patches sequence.
> > v4: add mapping_mark_page_ro to fix xen, also move pgt_buf_* to init.c
> > and merge alloc_low_page()
> >
> > could be found at:
> > git://git.kernel.org/pub/scm/linux/kernel/git/yinghai/linux-yinghai.git
> > for-x86-mm
>
> I find that patch series are easier to review than having to checkout
> your code and read the commit messages. Please post your patch series to
> the LKML next time.
>
> In any case, regarding "x86, xen: Add xen_mapping_mark_page_ro": please
> take Peter's feedback into account; mark_page_ro is not a good name for
> a pvops.
> Also I don't believe that this call is actually needed, see below.
>
> Regarding "x86, mm: setup page table in top-down": if you mark the
> pagetable page RO in alloc_low_page, won't the entire thing crash as
> soon as you try to write to it? You are supposed to mark it RO after
> filling up the pagetable page and before hooking it into the live
> pagetable.
> However contrary to my expectations, I did a quick test and it seems to
> be working, that is probably due to a bug: maybe __pa or lookup_address
> don't work correctly when called so early?

Hi Yinghai, I just tried it and dom0 died during init_memory_mapping; here
is the Oops snippet, full logs are attached:

e820: last_pfn = 0x22f000 max_arch_pfn = 0x400000000
e820: last_pfn = 0xcff00 max_arch_pfn = 0x400000000
initial memory mapped: [mem 0x00000000-0x022affff]
Base memory trampoline at [ffff880000096000] 96000 size 24576
init_memory_mapping: [mem 0x00000000-0x000fffff]
[mem 0x00000000-0x000fffff] page 4k
init_memory_mapping: [mem 0x21fe00000-0x21fe77fff]
[mem 0x21fe00000-0x21fe77fff] page 4k
init_memory_mapping: [mem 0x21c000000-0x21fdfffff]
[mem 0x21c000000-0x21fdfffff] page 4k
init_memory_mapping: [mem 0x200000000-0x21bffffff]
[mem 0x200000000-0x21bffffff] page 4k
init_memory_mapping: [mem 0x00100000-0xcec6dfff]
[mem 0x00100000-0xcec6dfff] page 4k
init_memory_mapping: [mem 0xcf5f4000-0xcf5f4fff]
[mem 0xcf5f4000-0xcf5f4fff] page 4k
init_memory_mapping: [mem 0xcf7fb000-0xcfc19fff]
[mem 0xcf7fb000-0xcfc19fff] page 4k
init_memory_mapping: [mem 0xcfef4000-0xcfefffff]
[mem 0xcfef4000-0xcfefffff] page 4k
init_memory_mapping: [mem 0x100001000-0x1ffffffff]
[mem 0x100001000-0x1ffffffff] page 4k
PGD 0
Oops: 0003 [#1] SMP
Modules linked in:
CPU 0
Pid: 0, comm: swapper Not tainted 3.6.0+ #3 AMD Pike/Pike
RIP: e030:[<ffffffff81cd778a>] [<ffffffff81cd778a>] xen_set_pte_init+0x1/0x9
RSP: e02b:ffffffff81c01ae8 EFLAGS: 00010086
RAX: 80000001913d1063 RBX: ffff88021f63b1c8 RCX: 8000000000000163
RDX: 00000000000001ff RSI: 80000001913d1063 RDI: ffff88021f63b1c8
RBP: ffffffff81c01af8 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: 000000011b039000
R13: 000000000000003a R14: 000000011b03a000 R15: 000000011b039000
FS: 0000000000000000(0000) GS:ffffffff81cbe000(0000) knlGS:0000000000000000
CS: e033 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000000 CR3: 0000000001c0b000 CR4: 0000000000000660
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 0000000000000000 DR7: 0000000000000000
Process swapper (pid: 0, threadinfo ffffffff81c00000, task ffffffff81c13410)
Stack:
ffffffff81c01af8 ffffffff810330f5 ffffffff81c01b58 ffffffff816aa9f3
8000000000000163 ffff88021f63c000 0000000200000000 000000011b039000
ffffffff81c01b38 0000000000000000 000000011b000000 ffff88021f7146c0
Call Trace:
[<ffffffff810330f5>] ? set_pte+0xb/0xd
[<ffffffff816aa9f3>] phys_pte_init+0xd4/0x106
[<ffffffff816aabe0>] phys_pmd_init+0x1bb/0x215
[<ffffffff816aadf3>] phys_pud_init+0x1b9/0x218
[<ffffffff816aaeff>] kernel_physical_mapping_init+0xad/0x14a
[<ffffffff81682a1a>] init_memory_mapping+0x275/0x303
[<ffffffff81ce6e62>] init_range_memory_mapping+0x8b/0xc8
[<ffffffff81ce6f91>] init_mem_mapping+0xf2/0x162
[<ffffffff81cd9d74>] setup_arch+0x682/0xaac
[<ffffffff816af4ab>] ? printk+0x48/0x4a
[<ffffffff81cd3868>] start_kernel+0x8e/0x3d8
[<ffffffff81cd32d3>] x86_64_start_reservations+0xae/0xb2
[<ffffffff81cd6dbc>] xen_start_kernel+0x63d/0x63f
Code: 00 00 48 c7 c7 f2 a8 aa 81 e8 ee 61 36 ff c7 05 59 10 06 00 01 00 00 00 5d c3 55 48 89 f7 48 89 e5 e8 95 cf 32 ff 31 c0 5d c3 55 <48> 89 37 48 89 e5 5d c3 55 48 89 e5 41 57 41 56 45 31 f6 41 55
RSP <ffffffff81c01ae8>
CR2: 0000000000000000
---[ end trace c2b54da46b5614cf ]---

>
> In any case we don't care about that because if we assume that
> alloc_low_pages will always return a page from a range that is already
> mapped, then we do not need this pvop anymore. That's because
> xen_alloc_pte_init, xen_alloc_pmd_init, etc, will mark the page RO
> before hooking it into the pagetable automatically.
> Sorry for I misled you last time.
>
> Let me repeat it again:
> Can we assume that the page returned by alloc_low_pages is already mapped?
>
> Yes? In that case let's get rid of mark_page_ro and everything should
> work.
>
> It is worth stating it in clear letters in a comment on top of
> alloc_low_pages:
>
> "This function always returns a page from a memory range already
> mapped."
>


Attachments:
dom0.txt (5.23 kB)
xen.txt (11.71 kB)

2012-10-18 16:58:26

by Stefano Stabellini

[permalink] [raw]
Subject: Re: [PATCH -v3 0/7] x86: Use BRK to pre mapping page table to make xen happy

Jacob,
thanks for testing!

Yinghai, it might be useful for you to try your patch series on Xen.
It is pretty simple, considering that you only need the hypervisor.
Just follow these steps:

- clone the xen-unstable git mirror
git clone git://xenbits.xen.org/xen.git

- compile and install xen
make xen
cp xen/xen.gz /boot

- add an entry to grub2.cfg
see the following example, note the multiboot line and the placeholder
argument after vmlinuz:

menuentry 'GNU/Linux, with Linux 2.6.32.40-pv' --class gnu-linux --class gnu --class os {
recordfail
insmod part_msdos
insmod ext2
search --no-floppy --fs-uuid --set=root 016e7c8a-4bdd-4873-92dd-d71171a49d6d
set root='(/dev/sda,msdos2)'
search --no-floppy --fs-uuid --set=root 016e7c8a-4bdd-4873-92dd-d71171a49d6d
multiboot /boot/xen-4.2-unstable.gz
module /boot/vmlinuz-2.6.32.40-pv placeholder root=UUID=016e7c8a-4bdd-4873-92dd-d71171a49d6d dom0_mem=1024 console=tty quiet splash vt.handoff=7
module /boot/initrd.img-2.6.32.40-pv
}

- reboot and enjoy!



On Thu, 18 Oct 2012, Jacob Shin wrote:
> On Thu, Oct 18, 2012 at 05:17:28PM +0100, Stefano Stabellini wrote:
> > On Thu, 11 Oct 2012, Yinghai Lu wrote:
> > > On Wed, Oct 10, 2012 at 9:40 AM, Stefano Stabellini
> > > <[email protected]> wrote:
> > > >
> > > > So you are missing the Xen patches entirely in this iteration of the
> > > > series?
> > >
> > > please check updated for-x86-mm branch.
> > >
> > > [PATCH -v4 00/15] x86: Use BRK to pre mapping page table to make xen happy
> > >
> > > on top of current linus/master and tip/x86/mm2, but please zap last
> > > patch in that branch.
> > >
> > > 1. use brk to mapping first PMD_SIZE range.
> > > 2. top down to initialize page table range by range.
> > > 3. get rid of calculate page table, and find_early_page_table.
> > > 4. remove early_ioremap in page table accessing.
> > >
> > > v2: changes, update xen interface about pagetable_reserve, so not
> > > use pgt_buf_* in xen code directly.
> > > v3: use range top-down to initialize page table, so will not use
> > > calculating/find early table anymore.
> > > also reorder the patches sequence.
> > > v4: add mapping_mark_page_ro to fix xen, also move pgt_buf_* to init.c
> > > and merge alloc_low_page()
> > >
> > > could be found at:
> > > git://git.kernel.org/pub/scm/linux/kernel/git/yinghai/linux-yinghai.git
> > > for-x86-mm
> >
> > I find that patch series are easier to review than having to checkout
> > your code and read the commit messages. Please post your patch series to
> > the LKML next time.
> >
> > In any case, regarding "x86, xen: Add xen_mapping_mark_page_ro": please
> > take Peter's feedback into account; mark_page_ro is not a good name for
> > a pvops.
> > Also I don't believe that this call is actually needed, see below.
> >
> > Regarding "x86, mm: setup page table in top-down": if you mark the
> > pagetable page RO in alloc_low_page, won't the entire thing crash as
> > soon as you try to write to it? You are supposed to mark it RO after
> > filling up the pagetable page and before hooking it into the live
> > pagetable.
> > However contrary to my expectations, I did a quick test and it seems to
> > be working, that is probably due to a bug: maybe __pa or lookup_address
> > don't work correctly when called so early?
>
> Hi Yinghai, I just tried it and dom0 died during init_memory_mapping, here
> is the Oops snippet, full logs are attached:
>
> e820: last_pfn = 0x22f000 max_arch_pfn = 0x400000000
> e820: last_pfn = 0xcff00 max_arch_pfn = 0x400000000
> initial memory mapped: [mem 0x00000000-0x022affff]
> Base memory trampoline at [ffff880000096000] 96000 size 24576
> init_memory_mapping: [mem 0x00000000-0x000fffff]
> [mem 0x00000000-0x000fffff] page 4k
> init_memory_mapping: [mem 0x21fe00000-0x21fe77fff]
> [mem 0x21fe00000-0x21fe77fff] page 4k
> init_memory_mapping: [mem 0x21c000000-0x21fdfffff]
> [mem 0x21c000000-0x21fdfffff] page 4k
> init_memory_mapping: [mem 0x200000000-0x21bffffff]
> [mem 0x200000000-0x21bffffff] page 4k
> init_memory_mapping: [mem 0x00100000-0xcec6dfff]
> [mem 0x00100000-0xcec6dfff] page 4k
> init_memory_mapping: [mem 0xcf5f4000-0xcf5f4fff]
> [mem 0xcf5f4000-0xcf5f4fff] page 4k
> init_memory_mapping: [mem 0xcf7fb000-0xcfc19fff]
> [mem 0xcf7fb000-0xcfc19fff] page 4k
> init_memory_mapping: [mem 0xcfef4000-0xcfefffff]
> [mem 0xcfef4000-0xcfefffff] page 4k
> init_memory_mapping: [mem 0x100001000-0x1ffffffff]
> [mem 0x100001000-0x1ffffffff] page 4k
> PGD 0
> Oops: 0003 [#1] SMP
> Modules linked in:
> CPU 0
> Pid: 0, comm: swapper Not tainted 3.6.0+ #3 AMD Pike/Pike
> RIP: e030:[<ffffffff81cd778a>] [<ffffffff81cd778a>] xen_set_pte_init+0x1/0x9
> RSP: e02b:ffffffff81c01ae8 EFLAGS: 00010086
> RAX: 80000001913d1063 RBX: ffff88021f63b1c8 RCX: 8000000000000163
> RDX: 00000000000001ff RSI: 80000001913d1063 RDI: ffff88021f63b1c8
> RBP: ffffffff81c01af8 R08: 0000000000000000 R09: 0000000000000000
> R10: 0000000000000000 R11: 0000000000000000 R12: 000000011b039000
> R13: 000000000000003a R14: 000000011b03a000 R15: 000000011b039000
> FS: 0000000000000000(0000) GS:ffffffff81cbe000(0000) knlGS:0000000000000000
> CS: e033 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 0000000000000000 CR3: 0000000001c0b000 CR4: 0000000000000660
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 0000000000000000 DR7: 0000000000000000
> Process swapper (pid: 0, threadinfo ffffffff81c00000, task ffffffff81c13410)
> Stack:
> ffffffff81c01af8 ffffffff810330f5 ffffffff81c01b58 ffffffff816aa9f3
> 8000000000000163 ffff88021f63c000 0000000200000000 000000011b039000
> ffffffff81c01b38 0000000000000000 000000011b000000 ffff88021f7146c0
> Call Trace:
> [<ffffffff810330f5>] ? set_pte+0xb/0xd
> [<ffffffff816aa9f3>] phys_pte_init+0xd4/0x106
> [<ffffffff816aabe0>] phys_pmd_init+0x1bb/0x215
> [<ffffffff816aadf3>] phys_pud_init+0x1b9/0x218
> [<ffffffff816aaeff>] kernel_physical_mapping_init+0xad/0x14a
> [<ffffffff81682a1a>] init_memory_mapping+0x275/0x303
> [<ffffffff81ce6e62>] init_range_memory_mapping+0x8b/0xc8
> [<ffffffff81ce6f91>] init_mem_mapping+0xf2/0x162
> [<ffffffff81cd9d74>] setup_arch+0x682/0xaac
> [<ffffffff816af4ab>] ? printk+0x48/0x4a
> [<ffffffff81cd3868>] start_kernel+0x8e/0x3d8
> [<ffffffff81cd32d3>] x86_64_start_reservations+0xae/0xb2
> [<ffffffff81cd6dbc>] xen_start_kernel+0x63d/0x63f
> Code: 00 00 48 c7 c7 f2 a8 aa 81 e8 ee 61 36 ff c7 05 59 10 06 00 01 00 00 00 5d c3 55 48 89 f7 48 89 e5 e8 95 cf 32 ff 31 c0 5d c3 55 <48> 89 37 48 89 e5 5d c3 55 48 89 e5 41 57 41 56 45 31 f6 41 55
> RSP <ffffffff81c01ae8>
> CR2: 0000000000000000
> ---[ end trace c2b54da46b5614cf ]---

2012-10-18 20:36:08

by Yinghai Lu

[permalink] [raw]
Subject: Re: [PATCH -v3 0/7] x86: Use BRK to pre mapping page table to make xen happy

On Thu, Oct 18, 2012 at 9:57 AM, Stefano Stabellini
<[email protected]> wrote:
> Jacob,
> thanks for testing!
>
> Yinghai, it might be useful for you to try your patch series on Xen.
> It is pretty simple, considering that you only need the hypervisor.
> Just follow these steps:
>
> - clone the xen-unstable git mirror
> git clone git://xenbits.xen.org/xen.git
>
> - compile and install xen
> make xen
> cp xen/xen.gz /boot

Yes, I tried dom0 with Xen 32-bit and 64-bit; all boot well.

But I did not test with domU.

Thanks

Yinghai

2012-10-18 20:40:24

by Yinghai Lu

[permalink] [raw]
Subject: Re: [PATCH -v3 0/7] x86: Use BRK to pre mapping page table to make xen happy

On Thu, Oct 18, 2012 at 9:26 AM, Jacob Shin <[email protected]> wrote:
>
> Hi Yinghai, I just tried it and dom0 died during init_memory_mapping, here
> is the Oops snippet, full logs are attached:
>
> e820: last_pfn = 0x22f000 max_arch_pfn = 0x400000000
> e820: last_pfn = 0xcff00 max_arch_pfn = 0x400000000
> initial memory mapped: [mem 0x00000000-0x022affff]
> Base memory trampoline at [ffff880000096000] 96000 size 24576
> init_memory_mapping: [mem 0x00000000-0x000fffff]
> [mem 0x00000000-0x000fffff] page 4k
> init_memory_mapping: [mem 0x21fe00000-0x21fe77fff]
> [mem 0x21fe00000-0x21fe77fff] page 4k
> init_memory_mapping: [mem 0x21c000000-0x21fdfffff]
> [mem 0x21c000000-0x21fdfffff] page 4k
> init_memory_mapping: [mem 0x200000000-0x21bffffff]
> [mem 0x200000000-0x21bffffff] page 4k
> init_memory_mapping: [mem 0x00100000-0xcec6dfff]
> [mem 0x00100000-0xcec6dfff] page 4k
> init_memory_mapping: [mem 0xcf5f4000-0xcf5f4fff]
> [mem 0xcf5f4000-0xcf5f4fff] page 4k
> init_memory_mapping: [mem 0xcf7fb000-0xcfc19fff]
> [mem 0xcf7fb000-0xcfc19fff] page 4k
> init_memory_mapping: [mem 0xcfef4000-0xcfefffff]
> [mem 0xcfef4000-0xcfefffff] page 4k
> init_memory_mapping: [mem 0x100001000-0x1ffffffff]
> [mem 0x100001000-0x1ffffffff] page 4k
> PGD 0
> Oops: 0003 [#1] SMP
> Modules linked in:
> CPU 0
> Pid: 0, comm: swapper Not tainted 3.6.0+ #3 AMD Pike/Pike
> RIP: e030:[<ffffffff81cd778a>] [<ffffffff81cd778a>] xen_set_pte_init+0x1/0x9
> RSP: e02b:ffffffff81c01ae8 EFLAGS: 00010086
> RAX: 80000001913d1063 RBX: ffff88021f63b1c8 RCX: 8000000000000163
> RDX: 00000000000001ff RSI: 80000001913d1063 RDI: ffff88021f63b1c8
> RBP: ffffffff81c01af8 R08: 0000000000000000 R09: 0000000000000000
> R10: 0000000000000000 R11: 0000000000000000 R12: 000000011b039000
> R13: 000000000000003a R14: 000000011b03a000 R15: 000000011b039000
> FS: 0000000000000000(0000) GS:ffffffff81cbe000(0000) knlGS:0000000000000000
> CS: e033 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 0000000000000000 CR3: 0000000001c0b000 CR4: 0000000000000660
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 0000000000000000 DR7: 0000000000000000
> Process swapper (pid: 0, threadinfo ffffffff81c00000, task ffffffff81c13410)
> Stack:
> ffffffff81c01af8 ffffffff810330f5 ffffffff81c01b58 ffffffff816aa9f3
> 8000000000000163 ffff88021f63c000 0000000200000000 000000011b039000
> ffffffff81c01b38 0000000000000000 000000011b000000 ffff88021f7146c0
> Call Trace:
> [<ffffffff810330f5>] ? set_pte+0xb/0xd
> [<ffffffff816aa9f3>] phys_pte_init+0xd4/0x106
> [<ffffffff816aabe0>] phys_pmd_init+0x1bb/0x215
> [<ffffffff816aadf3>] phys_pud_init+0x1b9/0x218
> [<ffffffff816aaeff>] kernel_physical_mapping_init+0xad/0x14a
> [<ffffffff81682a1a>] init_memory_mapping+0x275/0x303
> [<ffffffff81ce6e62>] init_range_memory_mapping+0x8b/0xc8
> [<ffffffff81ce6f91>] init_mem_mapping+0xf2/0x162
> [<ffffffff81cd9d74>] setup_arch+0x682/0xaac
> [<ffffffff816af4ab>] ? printk+0x48/0x4a
> [<ffffffff81cd3868>] start_kernel+0x8e/0x3d8
> [<ffffffff81cd32d3>] x86_64_start_reservations+0xae/0xb2
> [<ffffffff81cd6dbc>] xen_start_kernel+0x63d/0x63f
> Code: 00 00 48 c7 c7 f2 a8 aa 81 e8 ee 61 36 ff c7 05 59 10 06 00 01 00 00 00 5d c3 55 48 89 f7 48 89 e5 e8 95 cf 32 ff 31 c0 5d c3 55 <48> 89 37 48 89 e5 5d c3 55 48 89 e5 41 57 41 56 45 31 f6 41 55
> RSP <ffffffff81c01ae8>
> CR2: 0000000000000000
> ---[ end trace c2b54da46b5614cf ]---

I tested the dom0 config on 32-bit and 64-bit; they are both working.

And just now I tried with mem=8944m, and it is still working.

Anyway, can you please try the updated -v5 for-x86-mm branch?
I removed the mark_page_ro workaround as Stefano suggested.

Thanks

Yinghai

2012-10-18 20:43:45

by Yinghai Lu

[permalink] [raw]
Subject: Re: [PATCH -v3 0/7] x86: Use BRK to pre mapping page table to make xen happy

On Thu, Oct 18, 2012 at 9:17 AM, Stefano Stabellini
<[email protected]> wrote:
>
> I find that patch series are easier to review than having to checkout
> your code and read the commit messages. Please post your patch series to
> the LKML next time.

ok, will post -v5 to the list.

>
> In any case, regarding "x86, xen: Add xen_mapping_mark_page_ro": please
> take Peter's feedback into account; mark_page_ro is not a good name for
> a pvops.
> Also I don't believe that this call is actually needed, see below.
>
> Regarding "x86, mm: setup page table in top-down": if you mark the
> pagetable page RO in alloc_low_page, won't the entire thing crash as
> soon as you try to write to it? You are supposed to mark it RO after
> filling up the pagetable page and before hooking it into the live
> pagetable.
> However contrary to my expectations, I did a quick test and it seems to
> be working, that is probably due to a bug: maybe __pa or lookup_address
> don't work correctly when called so early?
>
> In any case we don't care about that because if we assume that
> alloc_low_pages will always return a page from a range that is already
> mapped, then we do not need this pvop anymore. That's because
> xen_alloc_pte_init, xen_alloc_pmd_init, etc, will mark the page RO
> before hooking it into the pagetable automatically.
> Sorry for I misled you last time.

Yeah, I was confused about how Xen could mark pagetable pages RO.

>
> Let me repeat it again:
> Can we assume that the page returned by alloc_low_pages is already mapped?
>
> Yes? In that case let's get rid of mark_page_ro and everything should
> work.

yes, removed.

>
> It is worth stating it in clear letters in a comment on top of
> alloc_low_pages:
>
> "This function always returns a page from a memory range already
> mapped."

Actually, alloc_... should always return mapped pages.

Thanks

Yinghai

2012-10-18 21:57:19

by Jacob Shin

[permalink] [raw]
Subject: Re: [PATCH -v3 0/7] x86: Use BRK to pre mapping page table to make xen happy

On Thu, Oct 18, 2012 at 01:40:21PM -0700, Yinghai Lu wrote:
> On Thu, Oct 18, 2012 at 9:26 AM, Jacob Shin <[email protected]> wrote:
>
> i tested dom0 conf on 32 bit and 64 bit, they are all working.
>
> and just now I tried with mem=8944m, and they are still working.
>
> anyway, can you please updated -v5 for-x86-mm branch ?
> I removed mark_page_ro workaround according to stefano

Just tested -v5 with Xen, and Dom0 no longer dies. Also tried it on an HVM DomU,
and it also boots fine.

Also tested on our 1TB machine, and it looks good; only E820_RAM ranges are mapped:

[ 0.000000] e820: BIOS-provided physical RAM map:
[ 0.000000] BIOS-e820: [mem 0x0000000000000000-0x00000000000983ff] usable
[ 0.000000] BIOS-e820: [mem 0x0000000000098400-0x000000000009ffff] reserved
[ 0.000000] BIOS-e820: [mem 0x00000000000d2000-0x00000000000fffff] reserved
[ 0.000000] BIOS-e820: [mem 0x0000000000100000-0x00000000c7ebffff] usable
[ 0.000000] BIOS-e820: [mem 0x00000000c7ec0000-0x00000000c7ed7fff] ACPI data
[ 0.000000] BIOS-e820: [mem 0x00000000c7ed8000-0x00000000c7ed9fff] ACPI NVS
[ 0.000000] BIOS-e820: [mem 0x00000000c7eda000-0x00000000c7ffffff] reserved
[ 0.000000] BIOS-e820: [mem 0x00000000fec00000-0x00000000fec0ffff] reserved
[ 0.000000] BIOS-e820: [mem 0x00000000fee00000-0x00000000fee00fff] reserved
[ 0.000000] BIOS-e820: [mem 0x00000000fff00000-0x00000000ffffffff] reserved
[ 0.000000] BIOS-e820: [mem 0x0000000100000000-0x000000e037ffffff] usable
[ 0.000000] BIOS-e820: [mem 0x000000e038000000-0x000000ffffffffff] reserved
[ 0.000000] BIOS-e820: [mem 0x0000010000000000-0x0000011ffeffffff] usable

...

[ 0.000000] e820: last_pfn = 0xc7ec0 max_arch_pfn = 0x400000000
[ 0.000000] initial memory mapped: [mem 0x00000000-0x1fffffff]
[ 0.000000] Base memory trampoline at [ffff880000092000] 92000 size 24576
[ 0.000000] Using GB pages for direct mapping
[ 0.000000] init_memory_mapping: [mem 0x00000000-0x000fffff]
[ 0.000000] [mem 0x00000000-0x000fffff] page 4k
[ 0.000000] init_memory_mapping: [mem 0x11ffee00000-0x11ffeffffff]
[ 0.000000] [mem 0x11ffee00000-0x11ffeffffff] page 2M
[ 0.000000] init_memory_mapping: [mem 0x11ffc000000-0x11ffedfffff]
[ 0.000000] [mem 0x11ffc000000-0x11ffedfffff] page 2M
[ 0.000000] init_memory_mapping: [mem 0x11f80000000-0x11ffbffffff]
[ 0.000000] [mem 0x11f80000000-0x11fbfffffff] page 1G
[ 0.000000] [mem 0x11fc0000000-0x11ffbffffff] page 2M
[ 0.000000] init_memory_mapping: [mem 0x11000000000-0x11f7fffffff]
[ 0.000000] [mem 0x11000000000-0x11f7fffffff] page 1G
[ 0.000000] init_memory_mapping: [mem 0x00100000-0xc7ebffff]
[ 0.000000] [mem 0x00100000-0x001fffff] page 4k
[ 0.000000] [mem 0x00200000-0x3fffffff] page 2M
[ 0.000000] [mem 0x40000000-0xbfffffff] page 1G
[ 0.000000] [mem 0xc0000000-0xc7dfffff] page 2M
[ 0.000000] [mem 0xc7e00000-0xc7ebffff] page 4k
[ 0.000000] init_memory_mapping: [mem 0x100000000-0xe037ffffff]
[ 0.000000] [mem 0x100000000-0xdfffffffff] page 1G
[ 0.000000] [mem 0xe000000000-0xe037ffffff] page 2M
[ 0.000000] init_memory_mapping: [mem 0x10000000000-0x10fffffffff]
[ 0.000000] [mem 0x10000000000-0x10fffffffff] page 1G

Thanks!!

>
> Thanks
>
> Yinghai
>

2012-10-30 13:57:16

by Konrad Rzeszutek Wilk

[permalink] [raw]
Subject: Re: [PATCH -v3 0/7] x86: Use BRK to pre mapping page table to make xen happy

On Wed, Oct 10, 2012 at 11:13:45PM -0700, Yinghai Lu wrote:
> On Wed, Oct 10, 2012 at 9:40 AM, Stefano Stabellini
> <[email protected]> wrote:
> >
> > So you are missing the Xen patches entirely in this iteration of the
> > series?
>
> please check updated for-x86-mm branch.
>
> [PATCH -v4 00/15] x86: Use BRK to pre mapping page table to make xen happy
>
> on top of current linus/master and tip/x86/mm2, but please zap last
> patch in that branch.

FYI, the way to do that is to revert the offending patch.

So which branch should I try out? Do you have one with all of the
required patches so I can just do a 3.7-rc3 'git pull' and try it out?

>
> 1. use brk to mapping first PMD_SIZE range.
> 2. top down to initialize page table range by range.
> 3. get rid of calculate page table, and find_early_page_table.
> 4. remove early_ioremap in page table accessing.
>
> v2: changes, update xen interface about pagetable_reserve, so not
> use pgt_buf_* in xen code directly.
> v3: use range top-down to initialize page table, so will not use
> calculating/find early table anymore.
> also reorder the patches sequence.
> v4: add mapping_mark_page_ro to fix xen, also move pgt_buf_* to init.c
> and merge alloc_low_page()
>
> could be found at:
> git://git.kernel.org/pub/scm/linux/kernel/git/yinghai/linux-yinghai.git
> for-x86-mm
>
> Yinghai Lu (15):
> x86, mm: Align start address to correct big page size
> x86, mm: Use big page size for small memory range
> x86, mm: Don't clear page table if next range is ram
> x86, mm: only keep initial mapping for ram
> x86, mm: Break down init_all_memory_mapping
> x86, xen: Add xen_mapping_mark_page_ro
> x86, mm: setup page table in top-down
> x86, mm: Remove early_memremap workaround for page table accessing on 64bit
> x86, mm: Remove parameter in alloc_low_page for 64bit
> x86, mm: Merge alloc_low_page between 64bit and 32bit
> x86, mm: Move min_pfn_mapped back to mm/init.c
> x86, mm, xen: Remove mapping_pagatable_reserve
> x86, mm: Add alloc_low_pages(num)
> x86, mm: only call early_ioremap_page_range_init() one time on 32bit
> x86, mm: Move back pgt_buf_* to mm/init.c
>
> arch/x86/include/asm/init.h | 4 -
> arch/x86/include/asm/pgtable.h | 1 +
> arch/x86/include/asm/pgtable_types.h | 1 -
> arch/x86/include/asm/x86_init.h | 2 +-
> arch/x86/kernel/setup.c | 2 +
> arch/x86/kernel/x86_init.c | 3 +-
> arch/x86/mm/init.c | 321 +++++++++++++++-------------------
> arch/x86/mm/init_32.c | 76 ++++++--
> arch/x86/mm/init_64.c | 119 ++++---------
> arch/x86/mm/mm_internal.h | 11 ++
> arch/x86/xen/mmu.c | 29 +---
> 11 files changed, 249 insertions(+), 320 deletions(-)
> create mode 100644 arch/x86/mm/mm_internal.h
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
>

2012-10-30 14:47:41

by Yinghai Lu

[permalink] [raw]
Subject: Re: [PATCH -v3 0/7] x86: Use BRK to pre mapping page table to make xen happy

On Tue, Oct 30, 2012 at 6:44 AM, Konrad Rzeszutek Wilk
<[email protected]> wrote:
> On Wed, Oct 10, 2012 at 11:13:45PM -0700, Yinghai Lu wrote:
> So which branch should I try out? Do you have one with all of the
> required patches so I can just do a 3.7-rc3 'git pull' and try it out?

I added a for-x86-mm-test branch; it is based on 3.7-rc3 and merged
with the for-x86-mm branch.

So you can try
git://git.kernel.org/pub/scm/linux/kernel/git/yinghai/linux-yinghai.git
for-x86-mm-test

There are conflicts between for-x86-mm and 3.7-rc3, and the attached patch
can be used to fix them.

Thanks

Yinghai


Attachments:
merge.patch (5.38 kB)

2012-11-03 21:35:57

by Yinghai Lu

[permalink] [raw]
Subject: Re: [PATCH -v3 0/7] x86: Use BRK to pre mapping page table to make xen happy

On Tue, Oct 30, 2012 at 7:47 AM, Yinghai Lu <[email protected]> wrote:
> On Tue, Oct 30, 2012 at 6:44 AM, Konrad Rzeszutek Wilk
> <[email protected]> wrote:
>> On Wed, Oct 10, 2012 at 11:13:45PM -0700, Yinghai Lu wrote:
>> So which branch should I try out? Do you have one with all of the
>> required patches so I can just do a 3.7-rc3 'git pull' and try it out?
>
> add for-x86-mm-test branch, and it is based on 3.7-rc3, and merged
> with for-x86-mm branch.
>
> so you can try
> git://git.kernel.org/pub/scm/linux/kernel/git/yinghai/linux-yinghai.git
> for-x86-mm-test
>
> there is conflicts between for-x86-mm and 3.7-rc3, and attached patch
> could be used to fix them
>

Peter/Ingo,

can you put for-x86-mm-test into tip for more testing?

Or do you want to rebase the whole patchset?

Thanks

Yinghai

2012-11-03 21:37:49

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [PATCH -v3 0/7] x86: Use BRK to pre mapping page table to make xen happy

I am travelling at the moment... I hope to be able to look at it Sunday.

Yinghai Lu <[email protected]> wrote:

>On Tue, Oct 30, 2012 at 7:47 AM, Yinghai Lu <[email protected]> wrote:
>> On Tue, Oct 30, 2012 at 6:44 AM, Konrad Rzeszutek Wilk
>> <[email protected]> wrote:
>>> On Wed, Oct 10, 2012 at 11:13:45PM -0700, Yinghai Lu wrote:
>>> So which branch should I try out? Do you have one with all of the
>>> required patches so I can just do a 3.7-rc3 'git pull' and try it
>out?
>>
>> add for-x86-mm-test branch, and it is based on 3.7-rc3, and merged
>> with for-x86-mm branch.
>>
>> so you can try
>>
>git://git.kernel.org/pub/scm/linux/kernel/git/yinghai/linux-yinghai.git
>> for-x86-mm-test
>>
>> there is conflicts between for-x86-mm and 3.7-rc3, and attached patch
>> could be used to fix them
>>
>
>Peter/Ingo,
>
>can you put for-x86-mm-test to tip for more testing?
>
>or you want to rebase the whole patchset?
>
>Thanks
>
>Yinghai

--
Sent from my mobile phone. Please excuse brevity and lack of formatting.

2012-11-05 20:25:16

by Yinghai Lu

[permalink] [raw]
Subject: Re: [PATCH -v3 0/7] x86: Use BRK to pre mapping page table to make xen happy

On Sat, Nov 3, 2012 at 2:37 PM, H. Peter Anvin <[email protected]> wrote:
> I am travelling at the moment... I hope to be able to look at it Sunday.
>>
>>can you put for-x86-mm-test to tip for more testing?
>>
>>or you want to rebase the whole patchset?

All,

please check the branch that is rebased on top of 3.7-rc4:
git://git.kernel.org/pub/scm/linux/kernel/git/yinghai/linux-yinghai.git
for-x86-mm

Stefano, Konrad,
please check the xen parts, and give your Acked-by or
Reviewed-by accordingly.

Thanks

Yinghai


>From 1b66ccf15ff4bd0200567e8d70446a8763f96ee7 Mon Sep 17 00:00:00 2001
From: Yinghai Lu <[email protected]>
Date: Mon, 5 Nov 2012 11:54:12 -0800
Subject: [PATCH v6 00/42] x86, mm: map ram from top-down with BRK and memblock.

This rebases the patchset together with tip/x86/mm2 on top of Linus v3.7-rc4,

so it includes the patchset "x86, mm: init_memory_mapping cleanup"
from tip/x86/mm2.
---
The current kernel init memory mapping covers [0, TOML) and [4G, TOMH).
Some AMD systems have a memory hole between 4G and TOMH, around 1T.
According to HPA, we should only map RAM ranges:
1. Separate calculate_table_space_size and find_early_page_table out from
init_memory_mapping.
2. For all ranges, allocate the page tables one time.
3. Init the mapping for RAM ranges one by one.
---

pre mapping page table patcheset includes:
1. use brk to mapping first PMD_SIZE range under end of ram.
2. top down to initialize page table range by range.
3. get rid of calculate_page_table, and find_early_page_table.
4. remove early_ioremap in page table accessing.
5. remove workaround in xen to mark page RO.
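
To make items 1 and 2 above concrete, here is a much simplified sketch of the
top-down scheme. init_range_memory_mapping(), min_pfn_mapped and PMD_SIZE are
names from the discussion, but the fixed step size and the loop shape are
simplifications; the real code in the series grows its step and handles
alignment, so treat this purely as an illustration.

static void __init top_down_map_sketch(unsigned long map_start, unsigned long map_end)
{
	unsigned long last = map_end;

	/*
	 * Walk downward from the end of RAM. Each new chunk sits just below
	 * memory that is already mapped, so its pagetable pages can always be
	 * allocated from mapped memory: first from the BRK buffer, later from
	 * the freshly mapped ranges above.
	 */
	while (last > map_start) {
		unsigned long start;

		if (last > map_start + PMD_SIZE)
			start = last - PMD_SIZE;
		else
			start = map_start;

		init_range_memory_mapping(start, last);	/* maps only E820_RAM here */
		min_pfn_mapped = start >> PAGE_SHIFT;	/* pagetable pages come from above this */
		last = start;
	}
}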

v2: changes, update xen interface about pagetable_reserve, so not
use pgt_buf_* in xen code directly.
v3: use range top-down to initialize page table, so will not use
calculating/find early table anymore.
also reorder the patches sequence.
v4: add mapping_mark_page_ro to fix xen, also move pgt_buf_* to init.c
and merge alloc_low_page(), and for 32bit need to add alloc_low_pages
to fix 32bit kmap setting.
v5: remove mark_page_ro workaround and add another 5 cleanup patches.
v6: rebase on v3.7-rc4 and add 4 cleanup patches.

could be found at:
git://git.kernel.org/pub/scm/linux/kernel/git/yinghai/linux-yinghai.git
for-x86-mm

1b66ccf: mm: Kill NO_BOOTMEM version free_all_bootmem_node()
0332736: sparc, mm: Remove calling of free_all_bootmem_node()
0f88d27: x86, mm: kill numa_64.h
10c4c68: x86, mm: kill numa_free_all_bootmem()
0187d6e: x86, mm: Let "memmap=" take more entries one time
2c20fd0: x86, mm: Use clamp_t() in init_range_memory_mapping
770db30: x86, mm: Move after_bootmem to mm_internel.h
d9bd282: x86, mm: Unifying after_bootmem for 32bit and 64bit
003d654: x86, mm: use limit_pfn for end pfn
29f27b8: x86, mm: use pfn instead of pos in split_mem_range
8edaab8: x86, mm: use PFN_DOWN in split_mem_range()
e652c73: x86, mm: use round_up/down in split_mem_range()
5fd1391: x86, mm: Add check before clear pte above max_low_pfn on 32bit
f4fd136: x86, mm: Move function declaration into mm_internal.h
ea421df: x86, mm: change low/hignmem_pfn_init to static on 32bit
867525b: x86, mm: Move init_gbpages() out of setup.c
97fb23a: x86, mm: Move back pgt_buf_* to mm/init.c
ba55b2f: x86, mm: only call early_ioremap_page_table_range_init() once
bb6d2d9: x86, mm: Add pointer about Xen mmu requirement for alloc_low_pages
6bc7e9f: x86, mm: Add alloc_low_pages(num)
d90df25: x86, mm, Xen: Remove mapping_pagetable_reserve()
f6699d2: x86, mm: Move min_pfn_mapped back to mm/init.c
675697b: x86, mm: Merge alloc_low_page between 64bit and 32bit
a14a382: x86, mm: Remove parameter in alloc_low_page for 64bit
a0d7a52: x86, mm: Remove early_memremap workaround for page table
accessing on 64bit
384a79d: x86, mm: setup page table in top-down
16252f2: x86, mm: Break down init_all_memory_mapping
255a0bf: x86, mm: Don't clear page table if range is ram
4bc70c5: x86, mm: Use big page size for small memory range
dc62aa8: x86, mm: Align start address to correct big page size
18f374c: x86: Only direct map addresses that are marked as E820_RAM
150387c: x86: Fixup code testing if a pfn is direct mapped
45cb049: x86: if kernel .text .data .bss are not marked as E820_RAM,
complain and fix
8d0b8bd: x86, mm: Set memblock initial limit to 1M
6c712cf: x86, mm: Separate out calculate_table_space_size()
33b0520: x86, mm: Find early page table buffer together
a95ff4b: x86, mm: Change find_early_table_space() paramters
ea564f8: x86, mm: Revert back good_end setting for 64bit
912c686: x86, mm: Move init_memory_mapping calling out of setup.c
47cfa63: x86, mm: Move down find_early_table_space()
060ed6c: x86, mm: Split out split_mem_range from init_memory_mapping
2ad38b5: x86, mm: Add global page_size_mask and probe one time only

arch/sparc/mm/init_64.c | 24 +-
arch/x86/include/asm/init.h | 21 +--
arch/x86/include/asm/numa.h | 2 -
arch/x86/include/asm/numa_64.h | 6 -
arch/x86/include/asm/page_types.h | 2 +
arch/x86/include/asm/pgtable.h | 2 +
arch/x86/include/asm/pgtable_types.h | 1 -
arch/x86/include/asm/x86_init.h | 12 -
arch/x86/kernel/acpi/boot.c | 1 -
arch/x86/kernel/cpu/amd.c | 9 +-
arch/x86/kernel/cpu/intel.c | 1 -
arch/x86/kernel/e820.c | 16 ++-
arch/x86/kernel/setup.c | 70 ++----
arch/x86/kernel/x86_init.c | 4 -
arch/x86/mm/init.c | 445 ++++++++++++++++++++++------------
arch/x86/mm/init_32.c | 106 +++++---
arch/x86/mm/init_64.c | 140 ++++-------
arch/x86/mm/mm_internal.h | 19 ++
arch/x86/mm/numa_64.c | 13 -
arch/x86/platform/efi/efi.c | 7 +-
arch/x86/xen/mmu.c | 28 ---
include/linux/mm.h | 1 -
mm/nobootmem.c | 14 -
23 files changed, 480 insertions(+), 464 deletions(-)
delete mode 100644 arch/x86/include/asm/numa_64.h
create mode 100644 arch/x86/mm/mm_internal.h

--
1.7.7

2012-11-05 20:28:16

by Yinghai Lu

[permalink] [raw]
Subject: [PATCH 39/42] x86, mm: kill numa_free_all_bootmem()

The NO_BOOTMEM version of free_all_bootmem_node() does not really
do free_bootmem at all; it only calls register_page_bootmem_info_node
instead.

That is confusing, so we want to kill free_all_bootmem_node().

Before that, this patch removes numa_free_all_bootmem().

That function can be replaced with register_page_bootmem_info() and
free_all_bootmem().

Signed-off-by: Yinghai Lu <[email protected]>
---
arch/x86/include/asm/numa_64.h | 2 --
arch/x86/mm/init_64.c | 15 +++++++++++----
arch/x86/mm/numa_64.c | 13 -------------
3 files changed, 11 insertions(+), 19 deletions(-)

diff --git a/arch/x86/include/asm/numa_64.h b/arch/x86/include/asm/numa_64.h
index 0c05f7a..fe4d2d4 100644
--- a/arch/x86/include/asm/numa_64.h
+++ b/arch/x86/include/asm/numa_64.h
@@ -1,6 +1,4 @@
#ifndef _ASM_X86_NUMA_64_H
#define _ASM_X86_NUMA_64_H

-extern unsigned long numa_free_all_bootmem(void);
-
#endif /* _ASM_X86_NUMA_64_H */
diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index 1d53def..4178530 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -629,6 +629,16 @@ EXPORT_SYMBOL_GPL(arch_add_memory);

static struct kcore_list kcore_vsyscall;

+static void __init register_page_bootmem_info(void)
+{
+#ifdef CONFIG_NUMA
+ int i;
+
+ for_each_online_node(i)
+ register_page_bootmem_info_node(NODE_DATA(i));
+#endif
+}
+
void __init mem_init(void)
{
long codesize, reservedpages, datasize, initsize;
@@ -641,11 +651,8 @@ void __init mem_init(void)
reservedpages = 0;

/* this will put all low memory onto the freelists */
-#ifdef CONFIG_NUMA
- totalram_pages = numa_free_all_bootmem();
-#else
+ register_page_bootmem_info();
totalram_pages = free_all_bootmem();
-#endif

absent_pages = absent_pages_in_range(0, max_pfn);
reservedpages = max_pfn - totalram_pages - absent_pages;
diff --git a/arch/x86/mm/numa_64.c b/arch/x86/mm/numa_64.c
index 92e2711..9405ffc 100644
--- a/arch/x86/mm/numa_64.c
+++ b/arch/x86/mm/numa_64.c
@@ -10,16 +10,3 @@ void __init initmem_init(void)
{
x86_numa_init();
}
-
-unsigned long __init numa_free_all_bootmem(void)
-{
- unsigned long pages = 0;
- int i;
-
- for_each_online_node(i)
- pages += free_all_bootmem_node(NODE_DATA(i));
-
- pages += free_low_memory_core_early(MAX_NUMNODES);
-
- return pages;
-}
--
1.7.7

2012-11-05 20:28:18

by Yinghai Lu

[permalink] [raw]
Subject: [PATCH 40/42] x86, mm: kill numa_64.h

Signed-off-by: Yinghai Lu <[email protected]>
---
arch/x86/include/asm/numa.h | 2 --
arch/x86/include/asm/numa_64.h | 4 ----
arch/x86/kernel/acpi/boot.c | 1 -
arch/x86/kernel/cpu/amd.c | 1 -
arch/x86/kernel/cpu/intel.c | 1 -
arch/x86/kernel/setup.c | 3 ---
6 files changed, 0 insertions(+), 12 deletions(-)
delete mode 100644 arch/x86/include/asm/numa_64.h

diff --git a/arch/x86/include/asm/numa.h b/arch/x86/include/asm/numa.h
index 49119fc..52560a2 100644
--- a/arch/x86/include/asm/numa.h
+++ b/arch/x86/include/asm/numa.h
@@ -54,8 +54,6 @@ static inline int numa_cpu_node(int cpu)

#ifdef CONFIG_X86_32
# include <asm/numa_32.h>
-#else
-# include <asm/numa_64.h>
#endif

#ifdef CONFIG_NUMA
diff --git a/arch/x86/include/asm/numa_64.h b/arch/x86/include/asm/numa_64.h
deleted file mode 100644
index fe4d2d4..0000000
--- a/arch/x86/include/asm/numa_64.h
+++ /dev/null
@@ -1,4 +0,0 @@
-#ifndef _ASM_X86_NUMA_64_H
-#define _ASM_X86_NUMA_64_H
-
-#endif /* _ASM_X86_NUMA_64_H */
diff --git a/arch/x86/kernel/acpi/boot.c b/arch/x86/kernel/acpi/boot.c
index e651f7a..4b23aa1 100644
--- a/arch/x86/kernel/acpi/boot.c
+++ b/arch/x86/kernel/acpi/boot.c
@@ -51,7 +51,6 @@ EXPORT_SYMBOL(acpi_disabled);

#ifdef CONFIG_X86_64
# include <asm/proto.h>
-# include <asm/numa_64.h>
#endif /* X86 */

#define BAD_MADT_ENTRY(entry, end) ( \
diff --git a/arch/x86/kernel/cpu/amd.c b/arch/x86/kernel/cpu/amd.c
index 9619ba6..913f94f 100644
--- a/arch/x86/kernel/cpu/amd.c
+++ b/arch/x86/kernel/cpu/amd.c
@@ -12,7 +12,6 @@
#include <asm/pci-direct.h>

#ifdef CONFIG_X86_64
-# include <asm/numa_64.h>
# include <asm/mmconfig.h>
# include <asm/cacheflush.h>
#endif
diff --git a/arch/x86/kernel/cpu/intel.c b/arch/x86/kernel/cpu/intel.c
index 198e019..3b547cc 100644
--- a/arch/x86/kernel/cpu/intel.c
+++ b/arch/x86/kernel/cpu/intel.c
@@ -17,7 +17,6 @@

#ifdef CONFIG_X86_64
#include <linux/topology.h>
-#include <asm/numa_64.h>
#endif

#include "cpu.h"
diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index 457de1d..8b0fb6fd 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -108,9 +108,6 @@
#include <asm/topology.h>
#include <asm/apicdef.h>
#include <asm/amd_nb.h>
-#ifdef CONFIG_X86_64
-#include <asm/numa_64.h>
-#endif
#include <asm/mce.h>
#include <asm/alternative.h>
#include <asm/prom.h>
--
1.7.7

2012-11-05 20:28:25

by Yinghai Lu

[permalink] [raw]
Subject: [PATCH 42/42] mm: Kill NO_BOOTMEM version free_all_bootmem_node()

The NO_BOOTMEM version of free_all_bootmem_node() does not really
do free_bootmem at all; it only calls register_page_bootmem_info_node
for online nodes instead.

That is confusing.

We can kill free_all_bootmem_node() after removing its two callers
in x86 and sparc.

Signed-off-by: Yinghai Lu <[email protected]>
---
mm/nobootmem.c | 14 --------------
1 files changed, 0 insertions(+), 14 deletions(-)

diff --git a/mm/nobootmem.c b/mm/nobootmem.c
index 714d5d6..f22c228 100644
--- a/mm/nobootmem.c
+++ b/mm/nobootmem.c
@@ -141,20 +141,6 @@ unsigned long __init free_low_memory_core_early(int nodeid)
}

/**
- * free_all_bootmem_node - release a node's free pages to the buddy allocator
- * @pgdat: node to be released
- *
- * Returns the number of pages actually released.
- */
-unsigned long __init free_all_bootmem_node(pg_data_t *pgdat)
-{
- register_page_bootmem_info_node(pgdat);
-
- /* free_low_memory_core_early(MAX_NUMNODES) will be called later */
- return 0;
-}
-
-/**
* free_all_bootmem - release free pages to the buddy allocator
*
* Returns the number of pages actually released.
--
1.7.7

2012-11-05 20:28:26

by Yinghai Lu

[permalink] [raw]
Subject: [PATCH 41/42] sparc, mm: Remove calling of free_all_bootmem_node()

The NO_BOOTMEM version of free_all_bootmem_node() does not really
do free_bootmem at all; it only calls
register_page_bootmem_info_node instead.

That is confusing, so we want to kill free_all_bootmem_node().

Before that, this patch removes the call to free_all_bootmem_node().

We add register_page_bootmem_info() to call register_page_bootmem_info_node
directly.

We can also use free_all_bootmem() for the NUMA case, since it is just
the same as free_low_memory_core_early().

Signed-off-by: Yinghai Lu <[email protected]>
Cc: "David S. Miller" <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: [email protected]
---
arch/sparc/mm/init_64.c | 24 +++++++++++-------------
1 files changed, 11 insertions(+), 13 deletions(-)

diff --git a/arch/sparc/mm/init_64.c b/arch/sparc/mm/init_64.c
index 9e28a11..b24bac2 100644
--- a/arch/sparc/mm/init_64.c
+++ b/arch/sparc/mm/init_64.c
@@ -2021,6 +2021,16 @@ static void __init patch_tlb_miss_handler_bitmap(void)
flushi(&valid_addr_bitmap_insn[0]);
}

+static void __init register_page_bootmem_info(void)
+{
+#ifdef CONFIG_NEED_MULTIPLE_NODES
+ int i;
+
+ for_each_online_node(i)
+ if (NODE_DATA(i)->node_spanned_pages)
+ register_page_bootmem_info_node(NODE_DATA(i));
+#endif
+}
void __init mem_init(void)
{
unsigned long codepages, datapages, initpages;
@@ -2038,20 +2048,8 @@ void __init mem_init(void)

high_memory = __va(last_valid_pfn << PAGE_SHIFT);

-#ifdef CONFIG_NEED_MULTIPLE_NODES
- {
- int i;
- for_each_online_node(i) {
- if (NODE_DATA(i)->node_spanned_pages != 0) {
- totalram_pages +=
- free_all_bootmem_node(NODE_DATA(i));
- }
- }
- totalram_pages += free_low_memory_core_early(MAX_NUMNODES);
- }
-#else
+ register_page_bootmem_info();
totalram_pages = free_all_bootmem();
-#endif

/* We subtract one to account for the mem_map_zero page
* allocated below.
--
1.7.7

2012-11-06 17:44:20

by David Miller

[permalink] [raw]
Subject: Re: [PATCH 41/42] sparc, mm: Remove calling of free_all_bootmem_node()

From: Yinghai Lu <[email protected]>
Date: Mon, 5 Nov 2012 12:27:42 -0800

> Now NO_BOOTMEM version free_all_bootmem_node() does not really
> do free_bootmem at all, and it only call
> register_page_bootmem_info_node instead.
>
> That is confusing, try to kill that free_all_bootmem_node().
>
> Before that, this patch will remove calling of free_all_bootmem_node()
>
> We add register_page_bootmem_info() to call register_page_bootmem_info_node
> directly.
>
> Also could use free_all_bootmem() for numa case, and it is just
> the same as free_low_memory_core_early().
>
> Signed-off-by: Yinghai Lu <[email protected]>

Acked-by: David S. Miller <[email protected]>

2012-11-07 16:24:54

by Konrad Rzeszutek Wilk

[permalink] [raw]
Subject: Re: [PATCH -v3 0/7] x86: Use BRK to pre mapping page table to make xen happy

On Mon, Nov 05, 2012 at 12:25:12PM -0800, Yinghai Lu wrote:
> On Sat, Nov 3, 2012 at 2:37 PM, H. Peter Anvin <[email protected]> wrote:
> > I am travelling at the moment... I hope to be able to look at it Sunday.
> >>
> >>can you put for-x86-mm-test to tip for more testing?
> >>
> >>or you want to rebase the whole patchset?
>
> All,
>
> please check branch that is rebased on top of 3.7-rc4.
> git://git.kernel.org/pub/scm/linux/kernel/git/yinghai/linux-yinghai.git
> for-x86-mm

I did an overnight test and it passed. Let me do a couple more bootups
with various memory configurations (1GB, 2GB, 3333M, 4G, 8GB, 16GB, 32GB) to
get a good feel.

Thanks!
>
> Stefano, Konrad,
> please check them with xen parts, and give your acked-by or
> Reviewed-by accordingly.
>
> Thanks
>
> Yinghai
>
>
> >From 1b66ccf15ff4bd0200567e8d70446a8763f96ee7 Mon Sep 17 00:00:00 2001
> From: Yinghai Lu <[email protected]>
> Date: Mon, 5 Nov 2012 11:54:12 -0800
> Subject: [PATCH v6 00/42] x86, mm: map ram from top-down with BRK and memblock.
>
> rebase patchset together tip/x86/mm2 on top of linus v3.7-rc4
>
> so this one include patchset : x86, mm: init_memory_mapping cleanup
> in tip/x86/mm2
> ---
> Current kernel init memory mapping between [0, TOML) and [4G, TOMH)
> Some AMD systems have mem hole between 4G and TOMH, around 1T.
> According to HPA, we should only mapping ram range.
> 1. Seperate calculate_table_space_size and find_early_page_table out with
> init_memory_mapping.
> 2. For all ranges, will allocate page table one time
> 3. init mapping for ram range one by one.
> ---
>
> pre mapping page table patcheset includes:
> 1. use brk to mapping first PMD_SIZE range under end of ram.
> 2. top down to initialize page table range by range.
> 3. get rid of calculate_page_table, and find_early_page_table.
> 4. remove early_ioremap in page table accessing.
> 5. remove workaround in xen to mark page RO.
>
> v2: changes, update xen interface about pagetable_reserve, so not
> use pgt_buf_* in xen code directly.
> v3: use range top-down to initialize page table, so will not use
> calculating/find early table anymore.
> also reorder the patches sequence.
> v4: add mapping_mark_page_ro to fix xen, also move pgt_buf_* to init.c
> and merge alloc_low_page(), and for 32bit need to add alloc_low_pages
> to fix 32bit kmap setting.
> v5: remove mark_page_ro workaround and add another 5 cleanup patches.
> v6: rebase on v3.7-rc4 and add 4 cleanup patches.
>
> could be found at:
> git://git.kernel.org/pub/scm/linux/kernel/git/yinghai/linux-yinghai.git
> for-x86-mm
>
> 1b66ccf: mm: Kill NO_BOOTMEM version free_all_bootmem_node()
> 0332736: sparc, mm: Remove calling of free_all_bootmem_node()
> 0f88d27: x86, mm: kill numa_64.h
> 10c4c68: x86, mm: kill numa_free_all_bootmem()
> 0187d6e: x86, mm: Let "memmap=" take more entries one time
> 2c20fd0: x86, mm: Use clamp_t() in init_range_memory_mapping
> 770db30: x86, mm: Move after_bootmem to mm_internel.h
> d9bd282: x86, mm: Unifying after_bootmem for 32bit and 64bit
> 003d654: x86, mm: use limit_pfn for end pfn
> 29f27b8: x86, mm: use pfn instead of pos in split_mem_range
> 8edaab8: x86, mm: use PFN_DOWN in split_mem_range()
> e652c73: x86, mm: use round_up/down in split_mem_range()
> 5fd1391: x86, mm: Add check before clear pte above max_low_pfn on 32bit
> f4fd136: x86, mm: Move function declaration into mm_internal.h
> ea421df: x86, mm: change low/hignmem_pfn_init to static on 32bit
> 867525b: x86, mm: Move init_gbpages() out of setup.c
> 97fb23a: x86, mm: Move back pgt_buf_* to mm/init.c
> ba55b2f: x86, mm: only call early_ioremap_page_table_range_init() once
> bb6d2d9: x86, mm: Add pointer about Xen mmu requirement for alloc_low_pages
> 6bc7e9f: x86, mm: Add alloc_low_pages(num)
> d90df25: x86, mm, Xen: Remove mapping_pagetable_reserve()
> f6699d2: x86, mm: Move min_pfn_mapped back to mm/init.c
> 675697b: x86, mm: Merge alloc_low_page between 64bit and 32bit
> a14a382: x86, mm: Remove parameter in alloc_low_page for 64bit
> a0d7a52: x86, mm: Remove early_memremap workaround for page table
> accessing on 64bit
> 384a79d: x86, mm: setup page table in top-down
> 16252f2: x86, mm: Break down init_all_memory_mapping
> 255a0bf: x86, mm: Don't clear page table if range is ram
> 4bc70c5: x86, mm: Use big page size for small memory range
> dc62aa8: x86, mm: Align start address to correct big page size
> 18f374c: x86: Only direct map addresses that are marked as E820_RAM
> 150387c: x86: Fixup code testing if a pfn is direct mapped
> 45cb049: x86: if kernel .text .data .bss are not marked as E820_RAM,
> complain and fix
> 8d0b8bd: x86, mm: Set memblock initial limit to 1M
> 6c712cf: x86, mm: Separate out calculate_table_space_size()
> 33b0520: x86, mm: Find early page table buffer together
> a95ff4b: x86, mm: Change find_early_table_space() paramters
> ea564f8: x86, mm: Revert back good_end setting for 64bit
> 912c686: x86, mm: Move init_memory_mapping calling out of setup.c
> 47cfa63: x86, mm: Move down find_early_table_space()
> 060ed6c: x86, mm: Split out split_mem_range from init_memory_mapping
> 2ad38b5: x86, mm: Add global page_size_mask and probe one time only
>
> arch/sparc/mm/init_64.c | 24 +-
> arch/x86/include/asm/init.h | 21 +--
> arch/x86/include/asm/numa.h | 2 -
> arch/x86/include/asm/numa_64.h | 6 -
> arch/x86/include/asm/page_types.h | 2 +
> arch/x86/include/asm/pgtable.h | 2 +
> arch/x86/include/asm/pgtable_types.h | 1 -
> arch/x86/include/asm/x86_init.h | 12 -
> arch/x86/kernel/acpi/boot.c | 1 -
> arch/x86/kernel/cpu/amd.c | 9 +-
> arch/x86/kernel/cpu/intel.c | 1 -
> arch/x86/kernel/e820.c | 16 ++-
> arch/x86/kernel/setup.c | 70 ++----
> arch/x86/kernel/x86_init.c | 4 -
> arch/x86/mm/init.c | 445 ++++++++++++++++++++++------------
> arch/x86/mm/init_32.c | 106 +++++---
> arch/x86/mm/init_64.c | 140 ++++-------
> arch/x86/mm/mm_internal.h | 19 ++
> arch/x86/mm/numa_64.c | 13 -
> arch/x86/platform/efi/efi.c | 7 +-
> arch/x86/xen/mmu.c | 28 ---
> include/linux/mm.h | 1 -
> mm/nobootmem.c | 14 -
> 23 files changed, 480 insertions(+), 464 deletions(-)
> delete mode 100644 arch/x86/include/asm/numa_64.h
> create mode 100644 arch/x86/mm/mm_internal.h
>
> --
> 1.7.7

2012-11-08 01:54:37

by Konrad Rzeszutek Wilk

[permalink] [raw]
Subject: Re: [PATCH -v3 0/7] x86: Use BRK to pre mapping page table to make xen happy

On Wed, Nov 07, 2012 at 11:11:44AM -0500, Konrad Rzeszutek Wilk wrote:
> On Mon, Nov 05, 2012 at 12:25:12PM -0800, Yinghai Lu wrote:
> > On Sat, Nov 3, 2012 at 2:37 PM, H. Peter Anvin <[email protected]> wrote:
> > > I am travelling at the moment... I hope to be able to look at it Sunday.
> > >>
> > >>can you put for-x86-mm-test to tip for more testing?
> > >>
> > >>or you want to rebase the whole patchset?
> >
> > All,
> >
> > please check branch that is rebased on top of 3.7-rc4.
> > git://git.kernel.org/pub/scm/linux/kernel/git/yinghai/linux-yinghai.git
> > for-x86-mm
>
> I did an overnight test and it passed. Let me do a couple more bootups
> with various memory configurations (1GB, 2GB, 3333M, 4G, 8GB, 16GB, 32GB) to
> get a good feel.

I ran into a problem launching an 8GB guest. When launching a 4GB guest it worked
fine, but with 8GB I get:

[    0.000000] init_memory_mapping: [mem 0x1f4000000-0x1f47fffff]
[ 0.000000] [mem 0x1f4000000-0x1f47fffff] page 4k
[ 0.000000] memblock_reserve: [0x000001f311c000-0x000001f311d000] alloc_low_pages+0x103/0x130
[ 0.000000] memblock_reserve: [0x000001f311b000-0x000001f311c000] alloc_low_pages+0x103/0x130
[ 0.000000] memblock_reserve: [0x000001f311a000-0x000001f311b000] alloc_low_pages+0x103/0x130
[ 0.000000] memblock_reserve: [0x000001f3119000-0x000001f311a000] alloc_low_pages+0x103/0x130
[ 0.000000] Kernel panic - not syncing: initrd too large to handle, disabling initrd (348401664 needed, 524288 available)
[ 0.000000]
[ 0.000000] Pid: 0, comm: swapper Not tainted 3.7.0-rc4upstream-00042-g1b66ccf #1
[ 0.000000] Call Trace:
[ 0.000000] [<ffffffff81633a5c>] panic+0xbf/0x1df
[ 0.000000] [<ffffffff81ac00e1>] setup_arch+0x728/0xb29
[ 0.000000] [<ffffffff81633c48>] ? printk+0x48/0x4a
[ 0.000000] [<ffffffff81aba897>] start_kernel+0x90/0x39e
[ 0.000000] [<ffffffff81aba356>] x86_64_start_reservations+0x131/0x136
[ 0.000000] [<ffffffff81abca38>] xen_start_kernel+0x546/0x548

Attaching the full log.


Attachments:
8gb-pv-domu.log (369.04 kB)

2012-11-08 04:06:24

by Yinghai Lu

[permalink] [raw]
Subject: Re: [PATCH -v3 0/7] x86: Use BRK to pre mapping page table to make xen happy

On Wed, Nov 7, 2012 at 5:40 PM, Konrad Rzeszutek Wilk
<[email protected]> wrote:
>
> I ran into a problem launching an 8GB guest. When launching a 4GB guest it worked
> fine, but with 8GB I get:
>
> [    0.000000] init_memory_mapping: [mem 0x1f4000000-0x1f47fffff]
> [ 0.000000] [mem 0x1f4000000-0x1f47fffff] page 4k
> [ 0.000000] memblock_reserve: [0x000001f311c000-0x000001f311d000] alloc_low_pages+0x103/0x130
> [ 0.000000] memblock_reserve: [0x000001f311b000-0x000001f311c000] alloc_low_pages+0x103/0x130
> [ 0.000000] memblock_reserve: [0x000001f311a000-0x000001f311b000] alloc_low_pages+0x103/0x130
> [ 0.000000] memblock_reserve: [0x000001f3119000-0x000001f311a000] alloc_low_pages+0x103/0x130
> [ 0.000000] Kernel panic - not syncing: initrd too large to handle, disabling initrd (348401664 needed, 524288 available)
> [ 0.000000]
> [ 0.000000] Pid: 0, comm: swapper Not tainted 3.7.0-rc4upstream-00042-g1b66ccf #1
> [ 0.000000] Call Trace:
> [ 0.000000] [<ffffffff81633a5c>] panic+0xbf/0x1df
> [ 0.000000] [<ffffffff81ac00e1>] setup_arch+0x728/0xb29
> [ 0.000000] [<ffffffff81633c48>] ? printk+0x48/0x4a
> [ 0.000000] [<ffffffff81aba897>] start_kernel+0x90/0x39e
> [ 0.000000] [<ffffffff81aba356>] x86_64_start_reservations+0x131/0x136
> [ 0.000000] [<ffffffff81abca38>] xen_start_kernel+0x546/0x548
>

xen memmap

[ 0.000000] Xen: [mem 0x0000000000000000-0x000000000009ffff] usable
[ 0.000000] Xen: [mem 0x00000000000a0000-0x00000000000fffff] reserved
[ 0.000000] Xen: [mem 0x0000000000100000-0x00000001f47fffff] usable

there is no hole under 4G, which triggers the bug in the max_low_pfn_mapped
update in add_pfn_range_mapped().

please check the attached patch; it should fix the problem.

then I will fold it into the corresponding commit that introduces
add_pfn_range_mapped().
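
For reference, a minimal sketch of the kind of guard add_pfn_range_mapped()
needs (the attached patch is the authoritative fix; the helper names below
follow the for-x86-mm branch and the exact form may differ):

	static void add_pfn_range_mapped(unsigned long start_pfn,
					 unsigned long end_pfn)
	{
		/* record the newly mapped pfn range */
		nr_pfn_mapped = add_range_with_merge(pfn_mapped, E820_X_MAX,
						     nr_pfn_mapped,
						     start_pfn, end_pfn);

		max_pfn_mapped = max(max_pfn_mapped, end_pfn);

		/*
		 * only bump max_low_pfn_mapped for the part below 4G,
		 * even when the memmap has no hole under 4G
		 */
		if (start_pfn < (1UL << (32 - PAGE_SHIFT)))
			max_low_pfn_mapped = max(max_low_pfn_mapped,
						 min(end_pfn,
						     1UL << (32 - PAGE_SHIFT)));
	}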

Thanks

Yinghai


Attachments:
fix_max_low_pfn_mapped.patch (692.00 B)

2012-11-09 20:31:15

by Yinghai Lu

[permalink] [raw]
Subject: Re: [PATCH -v3 0/7] x86: Use BRK to pre mapping page table to make xen happy

On Wed, Nov 7, 2012 at 8:06 PM, Yinghai Lu <[email protected]> wrote:
> On Wed, Nov 7, 2012 at 5:40 PM, Konrad Rzeszutek Wilk
> <[email protected]> wrote:
>>
>> I ran into a problem launching an 8GB guest. When launching a 4GB guest it worked
>> fine, but with 8GB I get:
>>
>> [    0.000000] init_memory_mapping: [mem 0x1f4000000-0x1f47fffff]
>> [ 0.000000] [mem 0x1f4000000-0x1f47fffff] page 4k
>> [ 0.000000] memblock_reserve: [0x000001f311c000-0x000001f311d000] alloc_low_pages+0x103/0x130
>> [ 0.000000] memblock_reserve: [0x000001f311b000-0x000001f311c000] alloc_low_pages+0x103/0x130
>> [ 0.000000] memblock_reserve: [0x000001f311a000-0x000001f311b000] alloc_low_pages+0x103/0x130
>> [ 0.000000] memblock_reserve: [0x000001f3119000-0x000001f311a000] alloc_low_pages+0x103/0x130
>> [ 0.000000] Kernel panic - not syncing: initrd too large to handle, disabling initrd (348401664 needed, 524288 available)
>> [ 0.000000]
>> [ 0.000000] Pid: 0, comm: swapper Not tainted 3.7.0-rc4upstream-00042-g1b66ccf #1
>> [ 0.000000] Call Trace:
>> [ 0.000000] [<ffffffff81633a5c>] panic+0xbf/0x1df
>> [ 0.000000] [<ffffffff81ac00e1>] setup_arch+0x728/0xb29
>> [ 0.000000] [<ffffffff81633c48>] ? printk+0x48/0x4a
>> [ 0.000000] [<ffffffff81aba897>] start_kernel+0x90/0x39e
>> [ 0.000000] [<ffffffff81aba356>] x86_64_start_reservations+0x131/0x136
>> [ 0.000000] [<ffffffff81abca38>] xen_start_kernel+0x546/0x548
>>
>
> xen memmap
>
> [ 0.000000] Xen: [mem 0x0000000000000000-0x000000000009ffff] usable
> [ 0.000000] Xen: [mem 0x00000000000a0000-0x00000000000fffff] reserved
> [ 0.000000] Xen: [mem 0x0000000000100000-0x00000001f47fffff] usable
>
> there is no hole under 4G, which triggers the bug in the max_low_pfn_mapped
> update in add_pfn_range_mapped().
>
> please check the attached patch; it should fix the problem.
>
> then I will fold it into the corresponding commit that introduces
> add_pfn_range_mapped().
>

If you have not tried the patch yet, please pull for-x86-mm again.
I folded that patch into the commit that introduces add_pfn_range_mapped().

2012-11-12 19:30:52

by Konrad Rzeszutek Wilk

[permalink] [raw]
Subject: Re: [PATCH -v3 0/7] x86: Use BRK to pre mapping page table to make xen happy

On Sat, Nov 03, 2012 at 02:35:54PM -0700, Yinghai Lu wrote:
> On Tue, Oct 30, 2012 at 7:47 AM, Yinghai Lu <[email protected]> wrote:
> > On Tue, Oct 30, 2012 at 6:44 AM, Konrad Rzeszutek Wilk
> > <[email protected]> wrote:
> >> On Wed, Oct 10, 2012 at 11:13:45PM -0700, Yinghai Lu wrote:
> >> So which branch should I try out? Do you have one with all of the
> >> required patches so I can just do a 3.7-rc3 'git pull' and try it out?
> >
> > add for-x86-mm-test branch, and it is based on 3.7-rc3, and merged
> > with for-x86-mm branch.
> >
> > so you can try
> > git://git.kernel.org/pub/scm/linux/kernel/git/yinghai/linux-yinghai.git
> > for-x86-mm-test
> >
> > there is conflicts between for-x86-mm and 3.7-rc3, and attached patch
> > could be used to fix them
> >
>
> Peter/Ingo,
>
> can you put for-x86-mm-test to tip for more testing?

Yinghai,

I've tested the patches with the modifications and they are
working.

Can you repost them once more so that I can review them one last time
and provide the proper acks, etc.?


>
> or you want to rebase the whole patchset?
>
> Thanks
>
> Yinghai
>

2012-11-12 21:19:28

by Yinghai Lu

[permalink] [raw]
Subject: [PATCH 17/46] x86, mm: Align start address to correct big page size

We are going to use a buffer in BRK to map a small range just under the memory
top, and then use that newly mapped RAM to map the RAM ranges under it.

The RAM range that gets mapped first may be only page aligned, but the ranges
around it are RAM too, so we can use a bigger page size to map it and avoid
ending up with small pages.

We will adjust page_size_mask in the following patch:
x86, mm: Use big page size for small memory range
to use a big page size for small RAM ranges.

Before that patch, this patch makes sure the start address is aligned down
according to the bigger page size; otherwise the entry in the page table
will not have the correct value.
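
For illustration only (the address value is made up; pmd and prot are as in
phys_pmd_init() in the diff below), the difference the alignment makes when a
2M PSE entry covers a start address that is only 4K aligned:

	unsigned long address = 0x00200000UL + 0x1000UL;	/* 2M + 4K */

	/* without the fix: pte points at pfn 0x201, so the 2M mapping is shifted */
	set_pte((pte_t *)pmd, pfn_pte(address >> PAGE_SHIFT,
				      __pgprot(pgprot_val(prot) | _PAGE_PSE)));

	/* with the fix: align down to the 2M boundary, pfn 0x200 */
	set_pte((pte_t *)pmd, pfn_pte((address & PMD_MASK) >> PAGE_SHIFT,
				      __pgprot(pgprot_val(prot) | _PAGE_PSE)));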

Signed-off-by: Yinghai Lu <[email protected]>
---
arch/x86/mm/init_32.c | 1 +
arch/x86/mm/init_64.c | 5 +++--
2 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/arch/x86/mm/init_32.c b/arch/x86/mm/init_32.c
index 11a5800..27f7fc6 100644
--- a/arch/x86/mm/init_32.c
+++ b/arch/x86/mm/init_32.c
@@ -310,6 +310,7 @@ repeat:
__pgprot(PTE_IDENT_ATTR |
_PAGE_PSE);

+ pfn &= PMD_MASK >> PAGE_SHIFT;
addr2 = (pfn + PTRS_PER_PTE-1) * PAGE_SIZE +
PAGE_OFFSET + PAGE_SIZE-1;

diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index 32c7e38..869372a 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -464,7 +464,7 @@ phys_pmd_init(pmd_t *pmd_page, unsigned long address, unsigned long end,
pages++;
spin_lock(&init_mm.page_table_lock);
set_pte((pte_t *)pmd,
- pfn_pte(address >> PAGE_SHIFT,
+ pfn_pte((address & PMD_MASK) >> PAGE_SHIFT,
__pgprot(pgprot_val(prot) | _PAGE_PSE)));
spin_unlock(&init_mm.page_table_lock);
last_map_addr = next;
@@ -541,7 +541,8 @@ phys_pud_init(pud_t *pud_page, unsigned long addr, unsigned long end,
pages++;
spin_lock(&init_mm.page_table_lock);
set_pte((pte_t *)pud,
- pfn_pte(addr >> PAGE_SHIFT, PAGE_KERNEL_LARGE));
+ pfn_pte((addr & PUD_MASK) >> PAGE_SHIFT,
+ PAGE_KERNEL_LARGE));
spin_unlock(&init_mm.page_table_lock);
last_map_addr = next;
continue;
--
1.7.7

2012-11-12 21:19:35

by Yinghai Lu

[permalink] [raw]
Subject: [PATCH v7 00/46] x86, mm: map ram from top-down with BRK and memblock.

rebase patchset together with tip/x86/mm2 on top of Linus' v3.7-rc4

so this one includes the patchset: x86, mm: init_memory_mapping cleanup
in tip/x86/mm2
---
Currently the kernel init memory mapping covers [0, TOML) and [4G, TOMH).
Some AMD systems have a memory hole between 4G and TOMH, around 1T.
According to HPA, we should only map RAM ranges.
1. Separate calculate_table_space_size and find_early_page_table out of
init_memory_mapping.
2. For all ranges, allocate the page table one time.
3. Init the mapping for RAM ranges one by one.
---

The pre-mapping page table patchset includes:
1. Use BRK to map the first PMD_SIZE range under the end of RAM.
2. Top-down initialization of the page table, range by range.
3. Get rid of calculate_page_table and find_early_page_table.
4. Remove early_ioremap in page table accessing.
5. Remove the workaround in Xen to mark pages RO.

v2: update the Xen interface about pagetable_reserve, so pgt_buf_* is not
used in Xen code directly.
v3: use ranges top-down to initialize the page table, so the calculate/find
early table code is not used anymore.
Also reorder the patch sequence.
v4: add mapping_mark_page_ro to fix Xen, also move pgt_buf_* to init.c
and merge alloc_low_page(); for 32bit, add alloc_low_pages
to fix the 32bit kmap setting.
v5: remove the mark_page_ro workaround and add another 5 cleanup patches.
v6: rebase on v3.7-rc4 and add 4 cleanup patches.
v7: fix max_low_pfn_mapped for a Xen domU memmap that has no hole under 4G;
add pfn_range_is_mapped() calls for the leftovers.

could be found at:
git://git.kernel.org/pub/scm/linux/kernel/git/yinghai/linux-yinghai.git for-x86-mm

4d02fa2: x86, mm: Let "memmap=" take more entries one time
69c9485: mm: Kill NO_BOOTMEM version free_all_bootmem_node()
27a6151: sparc, mm: Remove calling of free_all_bootmem_node()
60d9772: x86, mm: kill numa_64.h
37c4eb8: x86, mm: kill numa_free_all_bootmem()
96e6c74: x86, mm: Use clamp_t() in init_range_memory_mapping
714535a: x86, mm: Move after_bootmem to mm_internel.h
5b10dbc: x86, mm: Unifying after_bootmem for 32bit and 64bit
84c1df0: x86, mm: use limit_pfn for end pfn
1108331: x86, mm: use pfn instead of pos in split_mem_range
7c1bf23: x86, mm: use PFN_DOWN in split_mem_range()
3ba0781: x86, mm: use round_up/down in split_mem_range()
34fb23f: x86, mm: Add check before clear pte above max_low_pfn on 32bit
df4a7d9: x86, mm: Move function declaration into mm_internal.h
c9b0822: x86, mm: change low/hignmem_pfn_init to static on 32bit
0467f80: x86, mm: Move init_gbpages() out of setup.c
28170b7: x86, mm: Move back pgt_buf_* to mm/init.c
b678b7c: x86, mm: only call early_ioremap_page_table_range_init() once
c31ef78: x86, mm: Add pointer about Xen mmu requirement for alloc_low_pages
ef4d350: x86, mm: Add alloc_low_pages(num)
ceaa6ce: x86, mm, Xen: Remove mapping_pagetable_reserve()
8782e42: x86, mm: Move min_pfn_mapped back to mm/init.c
fd3fb05: x86, mm: Merge alloc_low_page between 64bit and 32bit
2c0d92c: x86, mm: Remove parameter in alloc_low_page for 64bit
2a0c505: x86, mm: Remove early_memremap workaround for page table accessing on 64bit
e14b94f: x86, mm: setup page table in top-down
6db7bfb: x86, mm: Break down init_all_memory_mapping
2f799be: x86, mm: Don't clear page table if range is ram
686f1c4: x86, mm: Use big page size for small memory range
a473cf6: x86, mm: Align start address to correct big page size
114b025: x86, mm: relocate initrd under all mem for 64bit
bb3c507: x86, mm: Only direct map addresses that are marked as E820_RAM
7d59f08: x86, mm: use pfn_range_is_mapped() with reserve_initrd
2d2a11e: x86, mm: use pfn_range_is_mapped() with gart
e108072: x86, mm: use pfn_range_is_mapped() with CPA
4894260: x86, mm: Fixup code testing if a pfn is direct mapped
b0771c3: x86, mm: if kernel .text .data .bss are not marked as E820_RAM, complain and fix
3159e6b: x86, mm: Set memblock initial limit to 1M
593cf88: x86, mm: Separate out calculate_table_space_size()
fba94e2: x86, mm: Find early page table buffer together
e1585d2: x86, mm: Change find_early_table_space() paramters
6a93a89: x86, mm: Revert back good_end setting for 64bit
306c44a: x86, mm: Move init_memory_mapping calling out of setup.c
fb40d13: x86, mm: Move down find_early_table_space()
e748645: x86, mm: Split out split_mem_range from init_memory_mapping
e419542: x86, mm: Add global page_size_mask and probe one time only

arch/sparc/mm/init_64.c | 24 +-
arch/x86/include/asm/init.h | 21 +--
arch/x86/include/asm/numa.h | 2 -
arch/x86/include/asm/numa_64.h | 6 -
arch/x86/include/asm/page_types.h | 2 +
arch/x86/include/asm/pgtable.h | 2 +
arch/x86/include/asm/pgtable_types.h | 1 -
arch/x86/include/asm/x86_init.h | 12 -
arch/x86/kernel/acpi/boot.c | 1 -
arch/x86/kernel/amd_gart_64.c | 5 +-
arch/x86/kernel/cpu/amd.c | 9 +-
arch/x86/kernel/cpu/intel.c | 1 -
arch/x86/kernel/e820.c | 16 ++-
arch/x86/kernel/setup.c | 121 ++++------
arch/x86/kernel/x86_init.c | 4 -
arch/x86/mm/init.c | 446 ++++++++++++++++++++++------------
arch/x86/mm/init_32.c | 106 +++++---
arch/x86/mm/init_64.c | 140 ++++-------
arch/x86/mm/mm_internal.h | 19 ++
arch/x86/mm/numa_64.c | 13 -
arch/x86/mm/pageattr.c | 16 +-
arch/x86/platform/efi/efi.c | 7 +-
arch/x86/xen/mmu.c | 28 ---
include/linux/mm.h | 1 -
mm/nobootmem.c | 14 -
25 files changed, 513 insertions(+), 504 deletions(-)
delete mode 100644 arch/x86/include/asm/numa_64.h
create mode 100644 arch/x86/mm/mm_internal.h

--
1.7.7

2012-11-12 21:19:44

by Yinghai Lu

[permalink] [raw]
Subject: [PATCH 39/46] x86, mm: Unifying after_bootmem for 32bit and 64bit

after_bootmem has a different meaning in 32bit and 64bit:
32bit: after bootmem is ready
64bit: after bootmem is destroyed
Let's merge them and make 32bit the same as 64bit.

For 32bit, the code mixes alloc_bootmem_pages and alloc_low_page depending on
whether after_bootmem is set or not.

alloc_bootmem is just a wrapper around memblock for x86.

Now we have alloc_low_page() backed by memblock too. We can drop the bootmem
path now, and use alloc_low_page() only.

At the same time, we make alloc_low_page() handle the real after_bootmem case
for 32bit, because alloc_bootmem_pages can fall back to slab too.

At last, move the point where after_bootmem is set for 32bit to be the same as
64bit.

Signed-off-by: Yinghai Lu <[email protected]>
---
arch/x86/mm/init.c | 2 --
arch/x86/mm/init_32.c | 21 ++++-----------------
2 files changed, 4 insertions(+), 19 deletions(-)

diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
index a0f579a..028a129 100644
--- a/arch/x86/mm/init.c
+++ b/arch/x86/mm/init.c
@@ -35,7 +35,6 @@ __ref void *alloc_low_pages(unsigned int num)
unsigned long pfn;
int i;

-#ifdef CONFIG_X86_64
if (after_bootmem) {
unsigned int order;

@@ -43,7 +42,6 @@ __ref void *alloc_low_pages(unsigned int num)
return (void *)__get_free_pages(GFP_ATOMIC | __GFP_NOTRACK |
__GFP_ZERO, order);
}
-#endif

if ((pgt_buf_end + num) >= pgt_buf_top) {
unsigned long ret;
diff --git a/arch/x86/mm/init_32.c b/arch/x86/mm/init_32.c
index 19ef9f0..f4fc4a2 100644
--- a/arch/x86/mm/init_32.c
+++ b/arch/x86/mm/init_32.c
@@ -73,10 +73,7 @@ static pmd_t * __init one_md_table_init(pgd_t *pgd)

#ifdef CONFIG_X86_PAE
if (!(pgd_val(*pgd) & _PAGE_PRESENT)) {
- if (after_bootmem)
- pmd_table = (pmd_t *)alloc_bootmem_pages(PAGE_SIZE);
- else
- pmd_table = (pmd_t *)alloc_low_page();
+ pmd_table = (pmd_t *)alloc_low_page();
paravirt_alloc_pmd(&init_mm, __pa(pmd_table) >> PAGE_SHIFT);
set_pgd(pgd, __pgd(__pa(pmd_table) | _PAGE_PRESENT));
pud = pud_offset(pgd, 0);
@@ -98,17 +95,7 @@ static pmd_t * __init one_md_table_init(pgd_t *pgd)
static pte_t * __init one_page_table_init(pmd_t *pmd)
{
if (!(pmd_val(*pmd) & _PAGE_PRESENT)) {
- pte_t *page_table = NULL;
-
- if (after_bootmem) {
-#if defined(CONFIG_DEBUG_PAGEALLOC) || defined(CONFIG_KMEMCHECK)
- page_table = (pte_t *) alloc_bootmem_pages(PAGE_SIZE);
-#endif
- if (!page_table)
- page_table =
- (pte_t *)alloc_bootmem_pages(PAGE_SIZE);
- } else
- page_table = (pte_t *)alloc_low_page();
+ pte_t *page_table = (pte_t *)alloc_low_page();

paravirt_alloc_pte(&init_mm, __pa(page_table) >> PAGE_SHIFT);
set_pmd(pmd, __pmd(__pa(page_table) | _PAGE_TABLE));
@@ -708,8 +695,6 @@ void __init setup_bootmem_allocator(void)
printk(KERN_INFO " mapped low ram: 0 - %08lx\n",
max_pfn_mapped<<PAGE_SHIFT);
printk(KERN_INFO " low ram: 0 - %08lx\n", max_low_pfn<<PAGE_SHIFT);
-
- after_bootmem = 1;
}

/*
@@ -795,6 +780,8 @@ void __init mem_init(void)
if (page_is_ram(tmp) && PageReserved(pfn_to_page(tmp)))
reservedpages++;

+ after_bootmem = 1;
+
codesize = (unsigned long) &_etext - (unsigned long) &_text;
datasize = (unsigned long) &_edata - (unsigned long) &_etext;
initsize = (unsigned long) &__init_end - (unsigned long) &__init_begin;
--
1.7.7

2012-11-12 21:19:43

by Yinghai Lu

[permalink] [raw]
Subject: [PATCH 37/46] x86, mm: use pfn instead of pos in split_mem_range

This saves some bit-shifting operations.
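
The change relies on a simple identity: pos == PFN_PHYS(pfn) is always page
aligned and PMD_SIZE/PUD_SIZE are multiples of PAGE_SIZE, so rounding in pfn
units gives the same result with fewer shifts, e.g.:

	/* old: round in bytes, then convert to a pfn */
	end_pfn = PFN_DOWN(round_up(pos, PMD_SIZE));

	/* new: convert once, round in pfn units -- same value */
	end_pfn = round_up(pfn, PFN_DOWN(PMD_SIZE));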

Signed-off-by: Yinghai Lu <[email protected]>
---
arch/x86/mm/init.c | 29 ++++++++++++++---------------
1 files changed, 14 insertions(+), 15 deletions(-)

diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
index a4fdf31..e430f1e 100644
--- a/arch/x86/mm/init.c
+++ b/arch/x86/mm/init.c
@@ -200,12 +200,11 @@ static int __meminit split_mem_range(struct map_range *mr, int nr_range,
unsigned long end)
{
unsigned long start_pfn, end_pfn;
- unsigned long pos;
+ unsigned long pfn;
int i;

/* head if not big page alignment ? */
- start_pfn = PFN_DOWN(start);
- pos = PFN_PHYS(start_pfn);
+ pfn = start_pfn = PFN_DOWN(start);
#ifdef CONFIG_X86_32
/*
* Don't use a large page for the first 2/4MB of memory
@@ -213,26 +212,26 @@ static int __meminit split_mem_range(struct map_range *mr, int nr_range,
* and overlapping MTRRs into large pages can cause
* slowdowns.
*/
- if (pos == 0)
+ if (pfn == 0)
end_pfn = PFN_DOWN(PMD_SIZE);
else
- end_pfn = PFN_DOWN(round_up(pos, PMD_SIZE));
+ end_pfn = round_up(pfn, PFN_DOWN(PMD_SIZE));
#else /* CONFIG_X86_64 */
- end_pfn = PFN_DOWN(round_up(pos, PMD_SIZE));
+ end_pfn = round_up(pfn, PFN_DOWN(PMD_SIZE));
#endif
if (end_pfn > PFN_DOWN(end))
end_pfn = PFN_DOWN(end);
if (start_pfn < end_pfn) {
nr_range = save_mr(mr, nr_range, start_pfn, end_pfn, 0);
- pos = PFN_PHYS(end_pfn);
+ pfn = end_pfn;
}

/* big page (2M) range */
- start_pfn = PFN_DOWN(round_up(pos, PMD_SIZE));
+ start_pfn = round_up(pfn, PFN_DOWN(PMD_SIZE));
#ifdef CONFIG_X86_32
end_pfn = PFN_DOWN(round_down(end, PMD_SIZE));
#else /* CONFIG_X86_64 */
- end_pfn = PFN_DOWN(round_up(pos, PUD_SIZE));
+ end_pfn = round_up(pfn, PFN_DOWN(PUD_SIZE));
if (end_pfn > PFN_DOWN(round_down(end, PMD_SIZE)))
end_pfn = PFN_DOWN(round_down(end, PMD_SIZE));
#endif
@@ -240,32 +239,32 @@ static int __meminit split_mem_range(struct map_range *mr, int nr_range,
if (start_pfn < end_pfn) {
nr_range = save_mr(mr, nr_range, start_pfn, end_pfn,
page_size_mask & (1<<PG_LEVEL_2M));
- pos = PFN_PHYS(end_pfn);
+ pfn = end_pfn;
}

#ifdef CONFIG_X86_64
/* big page (1G) range */
- start_pfn = PFN_DOWN(round_up(pos, PUD_SIZE));
+ start_pfn = round_up(pfn, PFN_DOWN(PUD_SIZE));
end_pfn = PFN_DOWN(round_down(end, PUD_SIZE));
if (start_pfn < end_pfn) {
nr_range = save_mr(mr, nr_range, start_pfn, end_pfn,
page_size_mask &
((1<<PG_LEVEL_2M)|(1<<PG_LEVEL_1G)));
- pos = PFN_PHYS(end_pfn);
+ pfn = end_pfn;
}

/* tail is not big page (1G) alignment */
- start_pfn = PFN_DOWN(round_up(pos, PMD_SIZE));
+ start_pfn = round_up(pfn, PFN_DOWN(PMD_SIZE));
end_pfn = PFN_DOWN(round_down(end, PMD_SIZE));
if (start_pfn < end_pfn) {
nr_range = save_mr(mr, nr_range, start_pfn, end_pfn,
page_size_mask & (1<<PG_LEVEL_2M));
- pos = PFN_PHYS(end_pfn);
+ pfn = end_pfn;
}
#endif

/* tail is not big page (2M) alignment */
- start_pfn = PFN_DOWN(pos);
+ start_pfn = pfn;
end_pfn = PFN_DOWN(end);
nr_range = save_mr(mr, nr_range, start_pfn, end_pfn, 0);

--
1.7.7

2012-11-12 21:19:41

by Yinghai Lu

[permalink] [raw]
Subject: [PATCH 45/46] mm: Kill NO_BOOTMEM version free_all_bootmem_node()

Now the NO_BOOTMEM version of free_all_bootmem_node() does not really
free bootmem at all; it only calls register_page_bootmem_info_node
for online nodes instead.

That is confusing.

We can kill free_all_bootmem_node() after we kill its two callers
in x86 and sparc.

Signed-off-by: Yinghai Lu <[email protected]>
---
mm/nobootmem.c | 14 --------------
1 files changed, 0 insertions(+), 14 deletions(-)

diff --git a/mm/nobootmem.c b/mm/nobootmem.c
index 714d5d6..f22c228 100644
--- a/mm/nobootmem.c
+++ b/mm/nobootmem.c
@@ -141,20 +141,6 @@ unsigned long __init free_low_memory_core_early(int nodeid)
}

/**
- * free_all_bootmem_node - release a node's free pages to the buddy allocator
- * @pgdat: node to be released
- *
- * Returns the number of pages actually released.
- */
-unsigned long __init free_all_bootmem_node(pg_data_t *pgdat)
-{
- register_page_bootmem_info_node(pgdat);
-
- /* free_low_memory_core_early(MAX_NUMNODES) will be called later */
- return 0;
-}
-
-/**
* free_all_bootmem - release free pages to the buddy allocator
*
* Returns the number of pages actually released.
--
1.7.7

2012-11-12 21:20:21

by Yinghai Lu

[permalink] [raw]
Subject: [PATCH 33/46] x86, mm: Move function declaration into mm_internal.h

They are only for mm/init*.c.

Signed-off-by: Yinghai Lu <[email protected]>
---
arch/x86/include/asm/init.h | 16 +++-------------
arch/x86/mm/mm_internal.h | 7 +++++++
2 files changed, 10 insertions(+), 13 deletions(-)

diff --git a/arch/x86/include/asm/init.h b/arch/x86/include/asm/init.h
index 626ea8d..bac770b 100644
--- a/arch/x86/include/asm/init.h
+++ b/arch/x86/include/asm/init.h
@@ -1,15 +1,5 @@
-#ifndef _ASM_X86_INIT_32_H
-#define _ASM_X86_INIT_32_H
+#ifndef _ASM_X86_INIT_H
+#define _ASM_X86_INIT_H

-#ifdef CONFIG_X86_32
-extern void __init early_ioremap_page_table_range_init(void);
-#endif

-extern void __init zone_sizes_init(void);
-
-extern unsigned long __init
-kernel_physical_mapping_init(unsigned long start,
- unsigned long end,
- unsigned long page_size_mask);
-
-#endif /* _ASM_X86_INIT_32_H */
+#endif /* _ASM_X86_INIT_H */
diff --git a/arch/x86/mm/mm_internal.h b/arch/x86/mm/mm_internal.h
index 7e3b88e..dc79ac1 100644
--- a/arch/x86/mm/mm_internal.h
+++ b/arch/x86/mm/mm_internal.h
@@ -7,4 +7,11 @@ static inline void *alloc_low_page(void)
return alloc_low_pages(1);
}

+void early_ioremap_page_table_range_init(void);
+
+unsigned long kernel_physical_mapping_init(unsigned long start,
+ unsigned long end,
+ unsigned long page_size_mask);
+void zone_sizes_init(void);
+
#endif /* __X86_MM_INTERNAL_H */
--
1.7.7

2012-11-12 21:20:20

by Yinghai Lu

[permalink] [raw]
Subject: [PATCH 02/46] x86, mm: Split out split_mem_range from init_memory_mapping

This makes init_memory_mapping() smaller and more readable.

Suggested-by: Ingo Molnar <[email protected]>
Signed-off-by: Yinghai Lu <[email protected]>
Reviewed-by: Pekka Enberg <[email protected]>
---
arch/x86/mm/init.c | 42 ++++++++++++++++++++++++++----------------
1 files changed, 26 insertions(+), 16 deletions(-)

diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
index aa5b0da..6d8e102 100644
--- a/arch/x86/mm/init.c
+++ b/arch/x86/mm/init.c
@@ -146,25 +146,13 @@ static int __meminit save_mr(struct map_range *mr, int nr_range,
return nr_range;
}

-/*
- * Setup the direct mapping of the physical memory at PAGE_OFFSET.
- * This runs before bootmem is initialized and gets pages directly from
- * the physical memory. To access them they are temporarily mapped.
- */
-unsigned long __init_refok init_memory_mapping(unsigned long start,
- unsigned long end)
+static int __meminit split_mem_range(struct map_range *mr, int nr_range,
+ unsigned long start,
+ unsigned long end)
{
unsigned long start_pfn, end_pfn;
- unsigned long ret = 0;
unsigned long pos;
- struct map_range mr[NR_RANGE_MR];
- int nr_range, i;
-
- printk(KERN_INFO "init_memory_mapping: [mem %#010lx-%#010lx]\n",
- start, end - 1);
-
- memset(mr, 0, sizeof(mr));
- nr_range = 0;
+ int i;

/* head if not big page alignment ? */
start_pfn = start >> PAGE_SHIFT;
@@ -258,6 +246,28 @@ unsigned long __init_refok init_memory_mapping(unsigned long start,
(mr[i].page_size_mask & (1<<PG_LEVEL_1G))?"1G":(
(mr[i].page_size_mask & (1<<PG_LEVEL_2M))?"2M":"4k"));

+ return nr_range;
+}
+
+/*
+ * Setup the direct mapping of the physical memory at PAGE_OFFSET.
+ * This runs before bootmem is initialized and gets pages directly from
+ * the physical memory. To access them they are temporarily mapped.
+ */
+unsigned long __init_refok init_memory_mapping(unsigned long start,
+ unsigned long end)
+{
+ struct map_range mr[NR_RANGE_MR];
+ unsigned long ret = 0;
+ int nr_range, i;
+
+ pr_info("init_memory_mapping: [mem %#010lx-%#010lx]\n",
+ start, end - 1);
+
+ memset(mr, 0, sizeof(mr));
+ nr_range = 0;
+ nr_range = split_mem_range(mr, nr_range, start, end);
+
/*
* Find space for the kernel direct mapping tables.
*
--
1.7.7

2012-11-12 21:20:18

by Yinghai Lu

[permalink] [raw]
Subject: [PATCH 08/46] x86, mm: Separate out calculate_table_space_size()

It should take the physical address range that will need to be mapped.
find_early_table_space should take the range that the pgt buffer should be in.

Separate the page table size calculation from finding the early page table, to
reduce confusion.
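
As a rough illustration of what calculate_table_space_size() has to account
for (plain x86-64 paging arithmetic, not the function's exact bookkeeping):

	unsigned long ram = 8UL << 30;				/* 8GB to map */
	unsigned long pud_pages = DIV_ROUND_UP(ram, PGDIR_SIZE);	/* 1 PUD page */
	unsigned long pmd_pages = DIV_ROUND_UP(ram, PUD_SIZE);	/* 8 PMD pages */
	unsigned long pte_pages = DIV_ROUND_UP(ram, PMD_SIZE);	/* 4096 PTE pages,
								   only needed when
								   2M pages are off */

So with 2M pages available the table space is a handful of pages, while a
pure 4K mapping of the same 8GB needs around 16MB of page tables.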

Signed-off-by: Yinghai Lu <[email protected]>
Reviewed-by: Pekka Enberg <[email protected]>
---
arch/x86/mm/init.c | 38 +++++++++++++++++++++++++++-----------
1 files changed, 27 insertions(+), 11 deletions(-)

diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
index c273edb..2b8091c 100644
--- a/arch/x86/mm/init.c
+++ b/arch/x86/mm/init.c
@@ -196,12 +196,10 @@ static int __meminit split_mem_range(struct map_range *mr, int nr_range,
* mr[0].start to mr[nr_range - 1].end, while accounting for possible 2M and 1GB
* pages. Then find enough contiguous space for those page tables.
*/
-static void __init find_early_table_space(unsigned long start, unsigned long end)
+static unsigned long __init calculate_table_space_size(unsigned long start, unsigned long end)
{
int i;
unsigned long puds = 0, pmds = 0, ptes = 0, tables;
- unsigned long good_end;
- phys_addr_t base;
struct map_range mr[NR_RANGE_MR];
int nr_range;

@@ -240,9 +238,17 @@ static void __init find_early_table_space(unsigned long start, unsigned long end
#ifdef CONFIG_X86_32
/* for fixmap */
tables += roundup(__end_of_fixed_addresses * sizeof(pte_t), PAGE_SIZE);
- good_end = max_pfn_mapped << PAGE_SHIFT;
#endif

+ return tables;
+}
+
+static void __init find_early_table_space(unsigned long start,
+ unsigned long good_end,
+ unsigned long tables)
+{
+ phys_addr_t base;
+
base = memblock_find_in_range(start, good_end, tables, PAGE_SIZE);
if (!base)
panic("Cannot find space for the kernel page tables");
@@ -250,10 +256,6 @@ static void __init find_early_table_space(unsigned long start, unsigned long end
pgt_buf_start = base >> PAGE_SHIFT;
pgt_buf_end = pgt_buf_start;
pgt_buf_top = pgt_buf_start + (tables >> PAGE_SHIFT);
-
- printk(KERN_DEBUG "kernel direct mapping tables up to %#lx @ [mem %#010lx-%#010lx]\n",
- mr[nr_range - 1].end - 1, pgt_buf_start << PAGE_SHIFT,
- (pgt_buf_top << PAGE_SHIFT) - 1);
}

/*
@@ -292,6 +294,8 @@ unsigned long __init_refok init_memory_mapping(unsigned long start,

void __init init_mem_mapping(void)
{
+ unsigned long tables, good_end, end;
+
probe_page_size_mask();

/*
@@ -302,10 +306,18 @@ void __init init_mem_mapping(void)
* nodes are discovered.
*/
#ifdef CONFIG_X86_64
- find_early_table_space(0, max_pfn<<PAGE_SHIFT);
+ end = max_pfn << PAGE_SHIFT;
+ good_end = end;
#else
- find_early_table_space(0, max_low_pfn<<PAGE_SHIFT);
+ end = max_low_pfn << PAGE_SHIFT;
+ good_end = max_pfn_mapped << PAGE_SHIFT;
#endif
+ tables = calculate_table_space_size(0, end);
+ find_early_table_space(0, good_end, tables);
+ printk(KERN_DEBUG "kernel direct mapping tables up to %#lx @ [mem %#010lx-%#010lx] prealloc\n",
+ end - 1, pgt_buf_start << PAGE_SHIFT,
+ (pgt_buf_top << PAGE_SHIFT) - 1);
+
max_low_pfn_mapped = init_memory_mapping(0, max_low_pfn<<PAGE_SHIFT);
max_pfn_mapped = max_low_pfn_mapped;

@@ -332,9 +344,13 @@ void __init init_mem_mapping(void)
* RO all the pagetable pages, including the ones that are beyond
* pgt_buf_end at that time.
*/
- if (pgt_buf_end > pgt_buf_start)
+ if (pgt_buf_end > pgt_buf_start) {
+ printk(KERN_DEBUG "kernel direct mapping tables up to %#lx @ [mem %#010lx-%#010lx] final\n",
+ end - 1, pgt_buf_start << PAGE_SHIFT,
+ (pgt_buf_end << PAGE_SHIFT) - 1);
x86_init.mapping.pagetable_reserve(PFN_PHYS(pgt_buf_start),
PFN_PHYS(pgt_buf_end));
+ }

/* stop the wrong using */
pgt_buf_top = 0;
--
1.7.7

2012-11-12 21:20:16

by Yinghai Lu

[permalink] [raw]
Subject: [PATCH 01/46] x86, mm: Add global page_size_mask and probe one time only

Now we pass around use_gbpages and use_pse for calculating the page table size.
Later we will need to call init_memory_mapping for every RAM range one by one,
which means those calculations would be done several times.

That information is the same for all RAM ranges and can be stored in
page_size_mask and probed one time only.

Move that probing code out of init_memory_mapping into a separate function,
probe_page_size_mask(), and call it before all init_memory_mapping calls.

Suggested-by: Ingo Molnar <[email protected]>
Signed-off-by: Yinghai Lu <[email protected]>
Reviewed-by: Pekka Enberg <[email protected]>
---
arch/x86/include/asm/pgtable.h | 1 +
arch/x86/kernel/setup.c | 1 +
arch/x86/mm/init.c | 55 ++++++++++++++++++---------------------
3 files changed, 27 insertions(+), 30 deletions(-)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index a1f780d..98ac76d 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -602,6 +602,7 @@ static inline int pgd_none(pgd_t pgd)
#ifndef __ASSEMBLY__

extern int direct_gbpages;
+void probe_page_size_mask(void);

/* local pte updates need not use xchg for locking */
static inline pte_t native_local_ptep_get_and_clear(pte_t *ptep)
diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index ca45696..01fb5f9 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -913,6 +913,7 @@ void __init setup_arch(char **cmdline_p)
setup_real_mode();

init_gbpages();
+ probe_page_size_mask();

/* max_pfn_mapped is updated here */
max_low_pfn_mapped = init_memory_mapping(0, max_low_pfn<<PAGE_SHIFT);
diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
index d7aea41..aa5b0da 100644
--- a/arch/x86/mm/init.c
+++ b/arch/x86/mm/init.c
@@ -35,6 +35,7 @@ struct map_range {
unsigned page_size_mask;
};

+static int page_size_mask;
/*
* First calculate space needed for kernel direct mapping page tables to cover
* mr[0].start to mr[nr_range - 1].end, while accounting for possible 2M and 1GB
@@ -94,6 +95,30 @@ static void __init find_early_table_space(struct map_range *mr, int nr_range)
(pgt_buf_top << PAGE_SHIFT) - 1);
}

+void probe_page_size_mask(void)
+{
+#if !defined(CONFIG_DEBUG_PAGEALLOC) && !defined(CONFIG_KMEMCHECK)
+ /*
+ * For CONFIG_DEBUG_PAGEALLOC, identity mapping will use small pages.
+ * This will simplify cpa(), which otherwise needs to support splitting
+ * large pages into small in interrupt context, etc.
+ */
+ if (direct_gbpages)
+ page_size_mask |= 1 << PG_LEVEL_1G;
+ if (cpu_has_pse)
+ page_size_mask |= 1 << PG_LEVEL_2M;
+#endif
+
+ /* Enable PSE if available */
+ if (cpu_has_pse)
+ set_in_cr4(X86_CR4_PSE);
+
+ /* Enable PGE if available */
+ if (cpu_has_pge) {
+ set_in_cr4(X86_CR4_PGE);
+ __supported_pte_mask |= _PAGE_GLOBAL;
+ }
+}
void __init native_pagetable_reserve(u64 start, u64 end)
{
memblock_reserve(start, end - start);
@@ -129,45 +154,15 @@ static int __meminit save_mr(struct map_range *mr, int nr_range,
unsigned long __init_refok init_memory_mapping(unsigned long start,
unsigned long end)
{
- unsigned long page_size_mask = 0;
unsigned long start_pfn, end_pfn;
unsigned long ret = 0;
unsigned long pos;
-
struct map_range mr[NR_RANGE_MR];
int nr_range, i;
- int use_pse, use_gbpages;

printk(KERN_INFO "init_memory_mapping: [mem %#010lx-%#010lx]\n",
start, end - 1);

-#if defined(CONFIG_DEBUG_PAGEALLOC) || defined(CONFIG_KMEMCHECK)
- /*
- * For CONFIG_DEBUG_PAGEALLOC, identity mapping will use small pages.
- * This will simplify cpa(), which otherwise needs to support splitting
- * large pages into small in interrupt context, etc.
- */
- use_pse = use_gbpages = 0;
-#else
- use_pse = cpu_has_pse;
- use_gbpages = direct_gbpages;
-#endif
-
- /* Enable PSE if available */
- if (cpu_has_pse)
- set_in_cr4(X86_CR4_PSE);
-
- /* Enable PGE if available */
- if (cpu_has_pge) {
- set_in_cr4(X86_CR4_PGE);
- __supported_pte_mask |= _PAGE_GLOBAL;
- }
-
- if (use_gbpages)
- page_size_mask |= 1 << PG_LEVEL_1G;
- if (use_pse)
- page_size_mask |= 1 << PG_LEVEL_2M;
-
memset(mr, 0, sizeof(mr));
nr_range = 0;

--
1.7.7

2012-11-12 21:20:14

by Yinghai Lu

[permalink] [raw]
Subject: [PATCH 40/46] x86, mm: Move after_bootmem to mm_internel.h

It is only used in arch/x86/mm/init*.c.

Signed-off-by: Yinghai Lu <[email protected]>
---
arch/x86/mm/mm_internal.h | 2 ++
include/linux/mm.h | 1 -
2 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/arch/x86/mm/mm_internal.h b/arch/x86/mm/mm_internal.h
index dc79ac1..6b563a1 100644
--- a/arch/x86/mm/mm_internal.h
+++ b/arch/x86/mm/mm_internal.h
@@ -14,4 +14,6 @@ unsigned long kernel_physical_mapping_init(unsigned long start,
unsigned long page_size_mask);
void zone_sizes_init(void);

+extern int after_bootmem;
+
#endif /* __X86_MM_INTERNAL_H */
diff --git a/include/linux/mm.h b/include/linux/mm.h
index fa06804..390bd14 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1355,7 +1355,6 @@ extern void __init mmap_init(void);
extern void show_mem(unsigned int flags);
extern void si_meminfo(struct sysinfo * val);
extern void si_meminfo_node(struct sysinfo *val, int nid);
-extern int after_bootmem;

extern __printf(3, 4)
void warn_alloc_failed(gfp_t gfp_mask, int order, const char *fmt, ...);
--
1.7.7

2012-11-12 21:22:06

by Yinghai Lu

[permalink] [raw]
Subject: [PATCH 32/46] x86, mm: change low/hignmem_pfn_init to static on 32bit

Signed-off-by: Yinghai Lu <[email protected]>
---
arch/x86/mm/init_32.c | 4 ++--
1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/x86/mm/init_32.c b/arch/x86/mm/init_32.c
index 0ae1ba8..322ee56 100644
--- a/arch/x86/mm/init_32.c
+++ b/arch/x86/mm/init_32.c
@@ -575,7 +575,7 @@ early_param("highmem", parse_highmem);
* artificially via the highmem=x boot parameter then create
* it:
*/
-void __init lowmem_pfn_init(void)
+static void __init lowmem_pfn_init(void)
{
/* max_low_pfn is 0, we already have early_res support */
max_low_pfn = max_pfn;
@@ -611,7 +611,7 @@ void __init lowmem_pfn_init(void)
* We have more RAM than fits into lowmem - we try to put it into
* highmem, also taking the highmem=x boot parameter into account:
*/
-void __init highmem_pfn_init(void)
+static void __init highmem_pfn_init(void)
{
max_low_pfn = MAXMEM_PFN;

--
1.7.7

2012-11-12 21:22:04

by Yinghai Lu

[permalink] [raw]
Subject: [PATCH 41/46] x86, mm: Use clamp_t() in init_range_memory_mapping

Save some lines and make the code more readable.
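
clamp_val(v, lo, hi) is just min(max(v, lo), hi) evaluated in v's own type,
so a memblock range entirely outside [r_start, r_end) collapses to an empty
interval and is skipped by the start >= end check. For example (values are
illustrative only):

	/* memblock range [2M, 6M), mapping window [4M, 1G) */
	u64 start = clamp_val(2ULL << 20, 4ULL << 20, 1ULL << 30);	/* -> 4M */
	u64 end   = clamp_val(6ULL << 20, 4ULL << 20, 1ULL << 30);	/* -> 6M */
	/* a range entirely below 4M would clamp to (4M, 4M) and be skipped */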

Signed-off-by: Yinghai Lu <[email protected]>
---
arch/x86/mm/init.c | 21 +++++----------------
1 files changed, 5 insertions(+), 16 deletions(-)

diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
index 028a129..3c48114 100644
--- a/arch/x86/mm/init.c
+++ b/arch/x86/mm/init.c
@@ -354,31 +354,20 @@ unsigned long __init_refok init_memory_mapping(unsigned long start,
* this one could take range with hole in it.
*/
static unsigned long __init init_range_memory_mapping(
- unsigned long range_start,
- unsigned long range_end)
+ unsigned long r_start,
+ unsigned long r_end)
{
unsigned long start_pfn, end_pfn;
unsigned long mapped_ram_size = 0;
int i;

for_each_mem_pfn_range(i, MAX_NUMNODES, &start_pfn, &end_pfn, NULL) {
- u64 start = (u64)start_pfn << PAGE_SHIFT;
- u64 end = (u64)end_pfn << PAGE_SHIFT;
-
- if (end <= range_start)
- continue;
-
- if (start < range_start)
- start = range_start;
-
- if (start >= range_end)
+ u64 start = clamp_val(PFN_PHYS(start_pfn), r_start, r_end);
+ u64 end = clamp_val(PFN_PHYS(end_pfn), r_start, r_end);
+ if (start >= end)
continue;

- if (end > range_end)
- end = range_end;
-
init_memory_mapping(start, end);
-
mapped_ram_size += end - start;
}

--
1.7.7

2012-11-12 21:22:01

by Yinghai Lu

[permalink] [raw]
Subject: [PATCH 43/46] x86, mm: kill numa_64.h

Signed-off-by: Yinghai Lu <[email protected]>
---
arch/x86/include/asm/numa.h | 2 --
arch/x86/include/asm/numa_64.h | 4 ----
arch/x86/kernel/acpi/boot.c | 1 -
arch/x86/kernel/cpu/amd.c | 1 -
arch/x86/kernel/cpu/intel.c | 1 -
arch/x86/kernel/setup.c | 3 ---
6 files changed, 0 insertions(+), 12 deletions(-)
delete mode 100644 arch/x86/include/asm/numa_64.h

diff --git a/arch/x86/include/asm/numa.h b/arch/x86/include/asm/numa.h
index 49119fc..52560a2 100644
--- a/arch/x86/include/asm/numa.h
+++ b/arch/x86/include/asm/numa.h
@@ -54,8 +54,6 @@ static inline int numa_cpu_node(int cpu)

#ifdef CONFIG_X86_32
# include <asm/numa_32.h>
-#else
-# include <asm/numa_64.h>
#endif

#ifdef CONFIG_NUMA
diff --git a/arch/x86/include/asm/numa_64.h b/arch/x86/include/asm/numa_64.h
deleted file mode 100644
index fe4d2d4..0000000
--- a/arch/x86/include/asm/numa_64.h
+++ /dev/null
@@ -1,4 +0,0 @@
-#ifndef _ASM_X86_NUMA_64_H
-#define _ASM_X86_NUMA_64_H
-
-#endif /* _ASM_X86_NUMA_64_H */
diff --git a/arch/x86/kernel/acpi/boot.c b/arch/x86/kernel/acpi/boot.c
index e651f7a..4b23aa1 100644
--- a/arch/x86/kernel/acpi/boot.c
+++ b/arch/x86/kernel/acpi/boot.c
@@ -51,7 +51,6 @@ EXPORT_SYMBOL(acpi_disabled);

#ifdef CONFIG_X86_64
# include <asm/proto.h>
-# include <asm/numa_64.h>
#endif /* X86 */

#define BAD_MADT_ENTRY(entry, end) ( \
diff --git a/arch/x86/kernel/cpu/amd.c b/arch/x86/kernel/cpu/amd.c
index 9619ba6..913f94f 100644
--- a/arch/x86/kernel/cpu/amd.c
+++ b/arch/x86/kernel/cpu/amd.c
@@ -12,7 +12,6 @@
#include <asm/pci-direct.h>

#ifdef CONFIG_X86_64
-# include <asm/numa_64.h>
# include <asm/mmconfig.h>
# include <asm/cacheflush.h>
#endif
diff --git a/arch/x86/kernel/cpu/intel.c b/arch/x86/kernel/cpu/intel.c
index 198e019..3b547cc 100644
--- a/arch/x86/kernel/cpu/intel.c
+++ b/arch/x86/kernel/cpu/intel.c
@@ -17,7 +17,6 @@

#ifdef CONFIG_X86_64
#include <linux/topology.h>
-#include <asm/numa_64.h>
#endif

#include "cpu.h"
diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index 85b62f1..6d29d1f 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -108,9 +108,6 @@
#include <asm/topology.h>
#include <asm/apicdef.h>
#include <asm/amd_nb.h>
-#ifdef CONFIG_X86_64
-#include <asm/numa_64.h>
-#endif
#include <asm/mce.h>
#include <asm/alternative.h>
#include <asm/prom.h>
--
1.7.7

2012-11-12 21:21:59

by Yinghai Lu

[permalink] [raw]
Subject: [PATCH 13/46] x86, mm: use pfn_range_is_mapped() with gart

We are going to map RAM only, so being under max_low_pfn_mapped, or
between 4G and max_pfn_mapped, does not mean a pfn is mapped at all.

Use pfn_range_is_mapped() directly.

Signed-off-by: Yinghai Lu <[email protected]>
---
arch/x86/kernel/amd_gart_64.c | 5 ++---
1 files changed, 2 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kernel/amd_gart_64.c b/arch/x86/kernel/amd_gart_64.c
index e663112..b574b29 100644
--- a/arch/x86/kernel/amd_gart_64.c
+++ b/arch/x86/kernel/amd_gart_64.c
@@ -768,10 +768,9 @@ int __init gart_iommu_init(void)
aper_base = info.aper_base;
end_pfn = (aper_base>>PAGE_SHIFT) + (aper_size>>PAGE_SHIFT);

- if (end_pfn > max_low_pfn_mapped) {
- start_pfn = (aper_base>>PAGE_SHIFT);
+ start_pfn = PFN_DOWN(aper_base);
+ if (!pfn_range_is_mapped(start_pfn, end_pfn))
init_memory_mapping(start_pfn<<PAGE_SHIFT, end_pfn<<PAGE_SHIFT);
- }

pr_info("PCI-DMA: using GART IOMMU.\n");
iommu_size = check_iommu_size(info.aper_base, aper_size);
--
1.7.7

2012-11-12 21:23:18

by Yinghai Lu

[permalink] [raw]
Subject: [PATCH 42/46] x86, mm: kill numa_free_all_bootmem()

Now the NO_BOOTMEM version of free_all_bootmem_node() does not really
free bootmem at all; it only calls register_page_bootmem_info_node
instead.

That is confusing, so try to kill that free_all_bootmem_node().

Before that, this patch removes numa_free_all_bootmem().

That function can be replaced with register_page_bootmem_info() and
free_all_bootmem().

Signed-off-by: Yinghai Lu <[email protected]>
---
arch/x86/include/asm/numa_64.h | 2 --
arch/x86/mm/init_64.c | 15 +++++++++++----
arch/x86/mm/numa_64.c | 13 -------------
3 files changed, 11 insertions(+), 19 deletions(-)

diff --git a/arch/x86/include/asm/numa_64.h b/arch/x86/include/asm/numa_64.h
index 0c05f7a..fe4d2d4 100644
--- a/arch/x86/include/asm/numa_64.h
+++ b/arch/x86/include/asm/numa_64.h
@@ -1,6 +1,4 @@
#ifndef _ASM_X86_NUMA_64_H
#define _ASM_X86_NUMA_64_H

-extern unsigned long numa_free_all_bootmem(void);
-
#endif /* _ASM_X86_NUMA_64_H */
diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index 1d53def..4178530 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -629,6 +629,16 @@ EXPORT_SYMBOL_GPL(arch_add_memory);

static struct kcore_list kcore_vsyscall;

+static void __init register_page_bootmem_info(void)
+{
+#ifdef CONFIG_NUMA
+ int i;
+
+ for_each_online_node(i)
+ register_page_bootmem_info_node(NODE_DATA(i));
+#endif
+}
+
void __init mem_init(void)
{
long codesize, reservedpages, datasize, initsize;
@@ -641,11 +651,8 @@ void __init mem_init(void)
reservedpages = 0;

/* this will put all low memory onto the freelists */
-#ifdef CONFIG_NUMA
- totalram_pages = numa_free_all_bootmem();
-#else
+ register_page_bootmem_info();
totalram_pages = free_all_bootmem();
-#endif

absent_pages = absent_pages_in_range(0, max_pfn);
reservedpages = max_pfn - totalram_pages - absent_pages;
diff --git a/arch/x86/mm/numa_64.c b/arch/x86/mm/numa_64.c
index 92e2711..9405ffc 100644
--- a/arch/x86/mm/numa_64.c
+++ b/arch/x86/mm/numa_64.c
@@ -10,16 +10,3 @@ void __init initmem_init(void)
{
x86_numa_init();
}
-
-unsigned long __init numa_free_all_bootmem(void)
-{
- unsigned long pages = 0;
- int i;
-
- for_each_online_node(i)
- pages += free_all_bootmem_node(NODE_DATA(i));
-
- pages += free_low_memory_core_early(MAX_NUMNODES);
-
- return pages;
-}
--
1.7.7

2012-11-12 21:23:16

by Yinghai Lu

[permalink] [raw]
Subject: [PATCH 46/46] x86, mm: Let "memmap=" take more entries one time

Current "memmap=" only can take one entry every time.
when we have more entries, we have to use memmap= for each of them.

For pxe booting, we have command line length limitation, those extra
"memmap=" would waste too much space.

This patch make memmap= could take several entries one time,
and those entries will be split with ','
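
For example (the addresses below are only illustrative), instead of passing
two separate parameters:

	memmap=4G@0x100000000 memmap=16M$0xfe000000

one can now pass a single parameter with the entries joined by ',':

	memmap=4G@0x100000000,16M$0xfe000000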

Signed-off-by: Yinghai Lu <[email protected]>
---
arch/x86/kernel/e820.c | 16 +++++++++++++++-
1 files changed, 15 insertions(+), 1 deletions(-)

diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c
index df06ade..d32abea 100644
--- a/arch/x86/kernel/e820.c
+++ b/arch/x86/kernel/e820.c
@@ -835,7 +835,7 @@ static int __init parse_memopt(char *p)
}
early_param("mem", parse_memopt);

-static int __init parse_memmap_opt(char *p)
+static int __init parse_memmap_one(char *p)
{
char *oldp;
u64 start_at, mem_size;
@@ -877,6 +877,20 @@ static int __init parse_memmap_opt(char *p)

return *p == '\0' ? 0 : -EINVAL;
}
+static int __init parse_memmap_opt(char *str)
+{
+ while (str) {
+ char *k = strchr(str, ',');
+
+ if (k)
+ *k++ = 0;
+
+ parse_memmap_one(str);
+ str = k;
+ }
+
+ return 0;
+}
early_param("memmap", parse_memmap_opt);

void __init finish_e820_parsing(void)
--
1.7.7

2012-11-12 21:23:14

by Yinghai Lu

[permalink] [raw]
Subject: [PATCH 44/46] sparc, mm: Remove calling of free_all_bootmem_node()

Now the NO_BOOTMEM version of free_all_bootmem_node() does not really
free bootmem at all; it only calls
register_page_bootmem_info_node instead.

That is confusing, so try to kill that free_all_bootmem_node().

Before that, this patch removes the calling of free_all_bootmem_node().

We add register_page_bootmem_info() to call register_page_bootmem_info_node
directly.

Also, we can use free_all_bootmem() for the NUMA case; it is just
the same as free_low_memory_core_early().

Signed-off-by: Yinghai Lu <[email protected]>
Cc: "David S. Miller" <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: [email protected]
Acked-by: "David S. Miller" <[email protected]>
---
arch/sparc/mm/init_64.c | 24 +++++++++++-------------
1 files changed, 11 insertions(+), 13 deletions(-)

diff --git a/arch/sparc/mm/init_64.c b/arch/sparc/mm/init_64.c
index 9e28a11..b24bac2 100644
--- a/arch/sparc/mm/init_64.c
+++ b/arch/sparc/mm/init_64.c
@@ -2021,6 +2021,16 @@ static void __init patch_tlb_miss_handler_bitmap(void)
flushi(&valid_addr_bitmap_insn[0]);
}

+static void __init register_page_bootmem_info(void)
+{
+#ifdef CONFIG_NEED_MULTIPLE_NODES
+ int i;
+
+ for_each_online_node(i)
+ if (NODE_DATA(i)->node_spanned_pages)
+ register_page_bootmem_info_node(NODE_DATA(i));
+#endif
+}
void __init mem_init(void)
{
unsigned long codepages, datapages, initpages;
@@ -2038,20 +2048,8 @@ void __init mem_init(void)

high_memory = __va(last_valid_pfn << PAGE_SHIFT);

-#ifdef CONFIG_NEED_MULTIPLE_NODES
- {
- int i;
- for_each_online_node(i) {
- if (NODE_DATA(i)->node_spanned_pages != 0) {
- totalram_pages +=
- free_all_bootmem_node(NODE_DATA(i));
- }
- }
- totalram_pages += free_low_memory_core_early(MAX_NUMNODES);
- }
-#else
+ register_page_bootmem_info();
totalram_pages = free_all_bootmem();
-#endif

/* We subtract one to account for the mem_map_zero page
* allocated below.
--
1.7.7

2012-11-12 21:24:02

by Yinghai Lu

[permalink] [raw]
Subject: [PATCH 11/46] x86, mm: Fixup code testing if a pfn is direct mapped

From: Jacob Shin <[email protected]>

Update code that previously assumed pfns [ 0 - max_low_pfn_mapped ) and
[ 4GB - max_pfn_mapped ) were always direct mapped, to now look up
pfn_mapped ranges instead.

-v2: change the applying sequence to keep git bisecting working,
so add a dummy pfn_range_is_mapped(). - Yinghai Lu
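
For reference, a sketch of what the final (non-dummy) helper looks like later
in this series, once the mapped pfn ranges are recorded in pfn_mapped[] (the
array and counter names are assumed from the patchset):

	bool pfn_range_is_mapped(unsigned long start_pfn, unsigned long end_pfn)
	{
		int i;

		/* mapped iff one recorded range covers [start_pfn, end_pfn) */
		for (i = 0; i < nr_pfn_mapped; i++)
			if (start_pfn >= pfn_mapped[i].start &&
			    end_pfn <= pfn_mapped[i].end)
				return true;

		return false;
	}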

Signed-off-by: Jacob Shin <[email protected]>
Signed-off-by: Yinghai Lu <[email protected]>
---
arch/x86/include/asm/page_types.h | 8 ++++++++
arch/x86/kernel/cpu/amd.c | 8 +++-----
arch/x86/platform/efi/efi.c | 7 +++----
3 files changed, 14 insertions(+), 9 deletions(-)

diff --git a/arch/x86/include/asm/page_types.h b/arch/x86/include/asm/page_types.h
index e21fdd1..45aae6e 100644
--- a/arch/x86/include/asm/page_types.h
+++ b/arch/x86/include/asm/page_types.h
@@ -51,6 +51,14 @@ static inline phys_addr_t get_max_mapped(void)
return (phys_addr_t)max_pfn_mapped << PAGE_SHIFT;
}

+static inline bool pfn_range_is_mapped(unsigned long start_pfn,
+ unsigned long end_pfn)
+{
+ return end_pfn <= max_low_pfn_mapped ||
+ (end_pfn > (1UL << (32 - PAGE_SHIFT)) &&
+ end_pfn <= max_pfn_mapped);
+}
+
extern unsigned long init_memory_mapping(unsigned long start,
unsigned long end);

diff --git a/arch/x86/kernel/cpu/amd.c b/arch/x86/kernel/cpu/amd.c
index f7e98a2..9619ba6 100644
--- a/arch/x86/kernel/cpu/amd.c
+++ b/arch/x86/kernel/cpu/amd.c
@@ -676,12 +676,10 @@ static void __cpuinit init_amd(struct cpuinfo_x86 *c)
* benefit in doing so.
*/
if (!rdmsrl_safe(MSR_K8_TSEG_ADDR, &tseg)) {
+ unsigned long pfn = tseg >> PAGE_SHIFT;
+
printk(KERN_DEBUG "tseg: %010llx\n", tseg);
- if ((tseg>>PMD_SHIFT) <
- (max_low_pfn_mapped>>(PMD_SHIFT-PAGE_SHIFT)) ||
- ((tseg>>PMD_SHIFT) <
- (max_pfn_mapped>>(PMD_SHIFT-PAGE_SHIFT)) &&
- (tseg>>PMD_SHIFT) >= (1ULL<<(32 - PMD_SHIFT))))
+ if (pfn_range_is_mapped(pfn, pfn + 1))
set_memory_4k((unsigned long)__va(tseg), 1);
}
}
diff --git a/arch/x86/platform/efi/efi.c b/arch/x86/platform/efi/efi.c
index ad44391..36e53f0 100644
--- a/arch/x86/platform/efi/efi.c
+++ b/arch/x86/platform/efi/efi.c
@@ -835,7 +835,7 @@ void __init efi_enter_virtual_mode(void)
efi_memory_desc_t *md, *prev_md = NULL;
efi_status_t status;
unsigned long size;
- u64 end, systab, end_pfn;
+ u64 end, systab, start_pfn, end_pfn;
void *p, *va, *new_memmap = NULL;
int count = 0;

@@ -888,10 +888,9 @@ void __init efi_enter_virtual_mode(void)
size = md->num_pages << EFI_PAGE_SHIFT;
end = md->phys_addr + size;

+ start_pfn = PFN_DOWN(md->phys_addr);
end_pfn = PFN_UP(end);
- if (end_pfn <= max_low_pfn_mapped
- || (end_pfn > (1UL << (32 - PAGE_SHIFT))
- && end_pfn <= max_pfn_mapped)) {
+ if (pfn_range_is_mapped(start_pfn, end_pfn)) {
va = __va(md->phys_addr);

if (!(md->attribute & EFI_MEMORY_WB))
--
1.7.7

2012-11-12 21:19:33

by Yinghai Lu

[permalink] [raw]
Subject: [PATCH 07/46] x86, mm: Find early page table buffer together

We should not do that in every call of init_memory_mapping.

At the same time, we need to move early_memtest down, and we can remove the
after_bootmem check.

-v2: fix one early_memtest call for 32bit by passing max_pfn_mapped instead.

Signed-off-by: Yinghai Lu <[email protected]>
---
arch/x86/mm/init.c | 66 ++++++++++++++++++++++++++-------------------------
1 files changed, 34 insertions(+), 32 deletions(-)

diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
index 00089bf..c273edb 100644
--- a/arch/x86/mm/init.c
+++ b/arch/x86/mm/init.c
@@ -275,16 +275,6 @@ unsigned long __init_refok init_memory_mapping(unsigned long start,
nr_range = 0;
nr_range = split_mem_range(mr, nr_range, start, end);

- /*
- * Find space for the kernel direct mapping tables.
- *
- * Later we should allocate these tables in the local node of the
- * memory mapped. Unfortunately this is done currently before the
- * nodes are discovered.
- */
- if (!after_bootmem)
- find_early_table_space(start, end);
-
for (i = 0; i < nr_range; i++)
ret = kernel_physical_mapping_init(mr[i].start, mr[i].end,
mr[i].page_size_mask);
@@ -297,6 +287,36 @@ unsigned long __init_refok init_memory_mapping(unsigned long start,

__flush_tlb_all();

+ return ret >> PAGE_SHIFT;
+}
+
+void __init init_mem_mapping(void)
+{
+ probe_page_size_mask();
+
+ /*
+ * Find space for the kernel direct mapping tables.
+ *
+ * Later we should allocate these tables in the local node of the
+ * memory mapped. Unfortunately this is done currently before the
+ * nodes are discovered.
+ */
+#ifdef CONFIG_X86_64
+ find_early_table_space(0, max_pfn<<PAGE_SHIFT);
+#else
+ find_early_table_space(0, max_low_pfn<<PAGE_SHIFT);
+#endif
+ max_low_pfn_mapped = init_memory_mapping(0, max_low_pfn<<PAGE_SHIFT);
+ max_pfn_mapped = max_low_pfn_mapped;
+
+#ifdef CONFIG_X86_64
+ if (max_pfn > max_low_pfn) {
+ max_pfn_mapped = init_memory_mapping(1UL<<32,
+ max_pfn<<PAGE_SHIFT);
+ /* can we preseve max_low_pfn ?*/
+ max_low_pfn = max_pfn;
+ }
+#endif
/*
* Reserve the kernel pagetable pages we used (pgt_buf_start -
* pgt_buf_end) and free the other ones (pgt_buf_end - pgt_buf_top)
@@ -312,32 +332,14 @@ unsigned long __init_refok init_memory_mapping(unsigned long start,
* RO all the pagetable pages, including the ones that are beyond
* pgt_buf_end at that time.
*/
- if (!after_bootmem && pgt_buf_end > pgt_buf_start)
+ if (pgt_buf_end > pgt_buf_start)
x86_init.mapping.pagetable_reserve(PFN_PHYS(pgt_buf_start),
PFN_PHYS(pgt_buf_end));

- if (!after_bootmem)
- early_memtest(start, end);
+ /* stop the wrong using */
+ pgt_buf_top = 0;

- return ret >> PAGE_SHIFT;
-}
-
-void __init init_mem_mapping(void)
-{
- probe_page_size_mask();
-
- /* max_pfn_mapped is updated here */
- max_low_pfn_mapped = init_memory_mapping(0, max_low_pfn<<PAGE_SHIFT);
- max_pfn_mapped = max_low_pfn_mapped;
-
-#ifdef CONFIG_X86_64
- if (max_pfn > max_low_pfn) {
- max_pfn_mapped = init_memory_mapping(1UL<<32,
- max_pfn<<PAGE_SHIFT);
- /* can we preseve max_low_pfn ?*/
- max_low_pfn = max_pfn;
- }
-#endif
+ early_memtest(0, max_pfn_mapped << PAGE_SHIFT);
}

/*
--
1.7.7

2012-11-12 21:24:28

by Yinghai Lu

[permalink] [raw]
Subject: [PATCH 23/46] x86, mm: Remove parameter in alloc_low_page for 64bit

Now all page table buffers are pre-mapped, and we can use virtual addresses
directly, so we don't need to remember physical addresses anymore.

Remove the phys pointer from alloc_low_page(); that will allow us to merge
alloc_low_page between 64bit and 32bit.

Signed-off-by: Yinghai Lu <[email protected]>
---
arch/x86/mm/init_64.c | 19 +++++++------------
1 files changed, 7 insertions(+), 12 deletions(-)

diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index 5ee9242..1960820 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -314,14 +314,13 @@ void __init cleanup_highmap(void)
}
}

-static __ref void *alloc_low_page(unsigned long *phys)
+static __ref void *alloc_low_page(void)
{
unsigned long pfn;
void *adr;

if (after_bootmem) {
adr = (void *)get_zeroed_page(GFP_ATOMIC | __GFP_NOTRACK);
- *phys = __pa(adr);

return adr;
}
@@ -342,7 +341,6 @@ static __ref void *alloc_low_page(unsigned long *phys)

adr = __va(pfn * PAGE_SIZE);
clear_page(adr);
- *phys = pfn * PAGE_SIZE;
return adr;
}

@@ -401,7 +399,6 @@ phys_pmd_init(pmd_t *pmd_page, unsigned long address, unsigned long end,
int i = pmd_index(address);

for (; i < PTRS_PER_PMD; i++, address = next) {
- unsigned long pte_phys;
pmd_t *pmd = pmd_page + pmd_index(address);
pte_t *pte;
pgprot_t new_prot = prot;
@@ -456,11 +453,11 @@ phys_pmd_init(pmd_t *pmd_page, unsigned long address, unsigned long end,
continue;
}

- pte = alloc_low_page(&pte_phys);
+ pte = alloc_low_page();
last_map_addr = phys_pte_init(pte, address, end, new_prot);

spin_lock(&init_mm.page_table_lock);
- pmd_populate_kernel(&init_mm, pmd, __va(pte_phys));
+ pmd_populate_kernel(&init_mm, pmd, pte);
spin_unlock(&init_mm.page_table_lock);
}
update_page_count(PG_LEVEL_2M, pages);
@@ -476,7 +473,6 @@ phys_pud_init(pud_t *pud_page, unsigned long addr, unsigned long end,
int i = pud_index(addr);

for (; i < PTRS_PER_PUD; i++, addr = next) {
- unsigned long pmd_phys;
pud_t *pud = pud_page + pud_index(addr);
pmd_t *pmd;
pgprot_t prot = PAGE_KERNEL;
@@ -530,12 +526,12 @@ phys_pud_init(pud_t *pud_page, unsigned long addr, unsigned long end,
continue;
}

- pmd = alloc_low_page(&pmd_phys);
+ pmd = alloc_low_page();
last_map_addr = phys_pmd_init(pmd, addr, end, page_size_mask,
prot);

spin_lock(&init_mm.page_table_lock);
- pud_populate(&init_mm, pud, __va(pmd_phys));
+ pud_populate(&init_mm, pud, pmd);
spin_unlock(&init_mm.page_table_lock);
}
__flush_tlb_all();
@@ -560,7 +556,6 @@ kernel_physical_mapping_init(unsigned long start,

for (; start < end; start = next) {
pgd_t *pgd = pgd_offset_k(start);
- unsigned long pud_phys;
pud_t *pud;

next = (start + PGDIR_SIZE) & PGDIR_MASK;
@@ -574,12 +569,12 @@ kernel_physical_mapping_init(unsigned long start,
continue;
}

- pud = alloc_low_page(&pud_phys);
+ pud = alloc_low_page();
last_map_addr = phys_pud_init(pud, __pa(start), __pa(next),
page_size_mask);

spin_lock(&init_mm.page_table_lock);
- pgd_populate(&init_mm, pgd, __va(pud_phys));
+ pgd_populate(&init_mm, pgd, pud);
spin_unlock(&init_mm.page_table_lock);
pgd_changed = true;
}
--
1.7.7

2012-11-12 21:24:35

by Yinghai Lu

[permalink] [raw]
Subject: [PATCH 21/46] x86, mm: setup page table in top-down

Get pgt_buf early from BRK, and use it to map PMD_SIZE from the top at first.
Then use the already-mapped pages to map more ranges below, and keep looping
until all pages get mapped.

alloc_low_page() will use pages from BRK at first; after that buffer is used
up, it will use memblock to find and reserve pages for page table use.

Introduce min_pfn_mapped to make sure we only look for new pages in already
mapped ranges; it is updated as lower pages get mapped.

Also add step_size to make sure we don't try to map too big a range with the
limited mapped pages available initially, and increase step_size as we get
more mapped pages on hand.

At last we can get rid of the page table size calculation and the
find-early-page-table code.

-v2: update to after fix_xen change,
also use MACRO for initial pgt_buf size and add comments with it.
-v3: skip big reserved range in memblock.reserved near end.
-v4: don't need fix_xen change now.
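
The loop described above can be sketched as a stand-alone userspace
program (illustrative only: the constants, the round_down_ul() helper and
the map_range() stub are made up, and every range is pretended to be RAM).
Each pass maps the window just below what is already mapped, and step_size
only grows after a pass has mapped more than all previous passes combined.

#include <stdio.h>

#define ISA_END		0x100000UL		/* 1M */
#define PMD_SIZE	0x200000UL		/* 2M */
#define STEP_SIZE_SHIFT	5

static unsigned long round_down_ul(unsigned long x, unsigned long a)
{
	return x & ~(a - 1);			/* a must be a power of two */
}

static unsigned long map_range(unsigned long start, unsigned long end)
{
	printf("map [%#lx, %#lx)\n", start, end);
	return end - start;			/* pretend it is all RAM */
}

int main(void)
{
	unsigned long real_end = 0x100000000UL;	/* pretend top of RAM: 4G */
	unsigned long last_start = real_end, start;
	unsigned long step_size = PMD_SIZE;
	unsigned long mapped = 0, new_mapped;

	while (last_start > ISA_END) {
		if (last_start > step_size) {
			start = round_down_ul(last_start - 1, step_size);
			if (start < ISA_END)
				start = ISA_END;
		} else
			start = ISA_END;
		new_mapped = map_range(start, last_start);
		last_start = start;
		if (new_mapped > mapped)	/* only grow after a big pass */
			step_size <<= STEP_SIZE_SHIFT;
		mapped += new_mapped;
	}
	return 0;
}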

Suggested-by: "H. Peter Anvin" <[email protected]>
Signed-off-by: Yinghai Lu <[email protected]>
---
arch/x86/include/asm/page_types.h | 1 +
arch/x86/include/asm/pgtable.h | 1 +
arch/x86/kernel/setup.c | 3 +
arch/x86/mm/init.c | 210 +++++++++++--------------------------
arch/x86/mm/init_32.c | 17 +++-
arch/x86/mm/init_64.c | 17 +++-
6 files changed, 94 insertions(+), 155 deletions(-)

diff --git a/arch/x86/include/asm/page_types.h b/arch/x86/include/asm/page_types.h
index 54c9787..9f6f3e6 100644
--- a/arch/x86/include/asm/page_types.h
+++ b/arch/x86/include/asm/page_types.h
@@ -45,6 +45,7 @@ extern int devmem_is_allowed(unsigned long pagenr);

extern unsigned long max_low_pfn_mapped;
extern unsigned long max_pfn_mapped;
+extern unsigned long min_pfn_mapped;

static inline phys_addr_t get_max_mapped(void)
{
diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index dd1a888..6991a3e 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -603,6 +603,7 @@ static inline int pgd_none(pgd_t pgd)

extern int direct_gbpages;
void init_mem_mapping(void);
+void early_alloc_pgt_buf(void);

/* local pte updates need not use xchg for locking */
static inline pte_t native_local_ptep_get_and_clear(pte_t *ptep)
diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index 94f922a..f7634092 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -124,6 +124,7 @@
*/
unsigned long max_low_pfn_mapped;
unsigned long max_pfn_mapped;
+unsigned long min_pfn_mapped;

#ifdef CONFIG_DMI
RESERVE_BRK(dmi_alloc, 65536);
@@ -900,6 +901,8 @@ void __init setup_arch(char **cmdline_p)

reserve_ibft_region();

+ early_alloc_pgt_buf();
+
/*
* Need to conclude brk, before memblock_x86_fill()
* it could use memblock_find_in_range, could overlap with
diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
index 47a1ba2..76a6e82 100644
--- a/arch/x86/mm/init.c
+++ b/arch/x86/mm/init.c
@@ -21,6 +21,21 @@ unsigned long __initdata pgt_buf_start;
unsigned long __meminitdata pgt_buf_end;
unsigned long __meminitdata pgt_buf_top;

+/* need 4 4k for initial PMD_SIZE, 4k for 0-ISA_END_ADDRESS */
+#define INIT_PGT_BUF_SIZE (5 * PAGE_SIZE)
+RESERVE_BRK(early_pgt_alloc, INIT_PGT_BUF_SIZE);
+void __init early_alloc_pgt_buf(void)
+{
+ unsigned long tables = INIT_PGT_BUF_SIZE;
+ phys_addr_t base;
+
+ base = __pa(extend_brk(tables, PAGE_SIZE));
+
+ pgt_buf_start = base >> PAGE_SHIFT;
+ pgt_buf_end = pgt_buf_start;
+ pgt_buf_top = pgt_buf_start + (tables >> PAGE_SHIFT);
+}
+
int after_bootmem;

int direct_gbpages
@@ -228,105 +243,6 @@ static int __meminit split_mem_range(struct map_range *mr, int nr_range,
return nr_range;
}

-/*
- * First calculate space needed for kernel direct mapping page tables to cover
- * mr[0].start to mr[nr_range - 1].end, while accounting for possible 2M and 1GB
- * pages. Then find enough contiguous space for those page tables.
- */
-static unsigned long __init calculate_table_space_size(unsigned long start, unsigned long end)
-{
- int i;
- unsigned long puds = 0, pmds = 0, ptes = 0, tables;
- struct map_range mr[NR_RANGE_MR];
- int nr_range;
-
- memset(mr, 0, sizeof(mr));
- nr_range = 0;
- nr_range = split_mem_range(mr, nr_range, start, end);
-
- for (i = 0; i < nr_range; i++) {
- unsigned long range, extra;
-
- range = mr[i].end - mr[i].start;
- puds += (range + PUD_SIZE - 1) >> PUD_SHIFT;
-
- if (mr[i].page_size_mask & (1 << PG_LEVEL_1G)) {
- extra = range - ((range >> PUD_SHIFT) << PUD_SHIFT);
- pmds += (extra + PMD_SIZE - 1) >> PMD_SHIFT;
- } else {
- pmds += (range + PMD_SIZE - 1) >> PMD_SHIFT;
- }
-
- if (mr[i].page_size_mask & (1 << PG_LEVEL_2M)) {
- extra = range - ((range >> PMD_SHIFT) << PMD_SHIFT);
-#ifdef CONFIG_X86_32
- extra += PMD_SIZE;
-#endif
- ptes += (extra + PAGE_SIZE - 1) >> PAGE_SHIFT;
- } else {
- ptes += (range + PAGE_SIZE - 1) >> PAGE_SHIFT;
- }
- }
-
- tables = roundup(puds * sizeof(pud_t), PAGE_SIZE);
- tables += roundup(pmds * sizeof(pmd_t), PAGE_SIZE);
- tables += roundup(ptes * sizeof(pte_t), PAGE_SIZE);
-
-#ifdef CONFIG_X86_32
- /* for fixmap */
- tables += roundup(__end_of_fixed_addresses * sizeof(pte_t), PAGE_SIZE);
-#endif
-
- return tables;
-}
-
-static unsigned long __init calculate_all_table_space_size(void)
-{
- unsigned long start_pfn, end_pfn;
- unsigned long tables;
- int i;
-
- /* the ISA range is always mapped regardless of memory holes */
- tables = calculate_table_space_size(0, ISA_END_ADDRESS);
-
- for_each_mem_pfn_range(i, MAX_NUMNODES, &start_pfn, &end_pfn, NULL) {
- u64 start = start_pfn << PAGE_SHIFT;
- u64 end = end_pfn << PAGE_SHIFT;
-
- if (end <= ISA_END_ADDRESS)
- continue;
-
- if (start < ISA_END_ADDRESS)
- start = ISA_END_ADDRESS;
-#ifdef CONFIG_X86_32
- /* on 32 bit, we only map up to max_low_pfn */
- if ((start >> PAGE_SHIFT) >= max_low_pfn)
- continue;
-
- if ((end >> PAGE_SHIFT) > max_low_pfn)
- end = max_low_pfn << PAGE_SHIFT;
-#endif
- tables += calculate_table_space_size(start, end);
- }
-
- return tables;
-}
-
-static void __init find_early_table_space(unsigned long start,
- unsigned long good_end,
- unsigned long tables)
-{
- phys_addr_t base;
-
- base = memblock_find_in_range(start, good_end, tables, PAGE_SIZE);
- if (!base)
- panic("Cannot find space for the kernel page tables");
-
- pgt_buf_start = base >> PAGE_SHIFT;
- pgt_buf_end = pgt_buf_start;
- pgt_buf_top = pgt_buf_start + (tables >> PAGE_SHIFT);
-}
-
static struct range pfn_mapped[E820_X_MAX];
static int nr_pfn_mapped;

@@ -392,17 +308,14 @@ unsigned long __init_refok init_memory_mapping(unsigned long start,
}

/*
- * Iterate through E820 memory map and create direct mappings for only E820_RAM
- * regions. We cannot simply create direct mappings for all pfns from
- * [0 to max_low_pfn) and [4GB to max_pfn) because of possible memory holes in
- * high addresses that cannot be marked as UC by fixed/variable range MTRRs.
- * Depending on the alignment of E820 ranges, this may possibly result in using
- * smaller size (i.e. 4K instead of 2M or 1G) page tables.
+ * this one could take range with hole in it.
*/
-static void __init init_range_memory_mapping(unsigned long range_start,
+static unsigned long __init init_range_memory_mapping(
+ unsigned long range_start,
unsigned long range_end)
{
unsigned long start_pfn, end_pfn;
+ unsigned long mapped_ram_size = 0;
int i;

for_each_mem_pfn_range(i, MAX_NUMNODES, &start_pfn, &end_pfn, NULL) {
@@ -422,71 +335,70 @@ static void __init init_range_memory_mapping(unsigned long range_start,
end = range_end;

init_memory_mapping(start, end);
+
+ mapped_ram_size += end - start;
}
+
+ return mapped_ram_size;
}

+/* (PUD_SHIFT-PMD_SHIFT)/2 */
+#define STEP_SIZE_SHIFT 5
void __init init_mem_mapping(void)
{
- unsigned long tables, good_end, end;
+ unsigned long end, real_end, start, last_start;
+ unsigned long step_size;
+ unsigned long addr;
+ unsigned long mapped_ram_size = 0;
+ unsigned long new_mapped_ram_size;

probe_page_size_mask();

- /*
- * Find space for the kernel direct mapping tables.
- *
- * Later we should allocate these tables in the local node of the
- * memory mapped. Unfortunately this is done currently before the
- * nodes are discovered.
- */
#ifdef CONFIG_X86_64
end = max_pfn << PAGE_SHIFT;
- good_end = end;
#else
end = max_low_pfn << PAGE_SHIFT;
- good_end = max_pfn_mapped << PAGE_SHIFT;
#endif
- tables = calculate_all_table_space_size();
- find_early_table_space(0, good_end, tables);
- printk(KERN_DEBUG "kernel direct mapping tables up to %#lx @ [mem %#010lx-%#010lx] prealloc\n",
- end - 1, pgt_buf_start << PAGE_SHIFT,
- (pgt_buf_top << PAGE_SHIFT) - 1);

- max_pfn_mapped = 0; /* will get exact value next */
/* the ISA range is always mapped regardless of memory holes */
init_memory_mapping(0, ISA_END_ADDRESS);
- init_range_memory_mapping(ISA_END_ADDRESS, end);
+
+ /* xen has big range in reserved near end of ram, skip it at first */
+ addr = memblock_find_in_range(ISA_END_ADDRESS, end, PMD_SIZE,
+ PAGE_SIZE);
+ real_end = addr + PMD_SIZE;
+
+ /* step_size need to be small so pgt_buf from BRK could cover it */
+ step_size = PMD_SIZE;
+ max_pfn_mapped = 0; /* will get exact value next */
+ min_pfn_mapped = real_end >> PAGE_SHIFT;
+ last_start = start = real_end;
+ while (last_start > ISA_END_ADDRESS) {
+ if (last_start > step_size) {
+ start = round_down(last_start - 1, step_size);
+ if (start < ISA_END_ADDRESS)
+ start = ISA_END_ADDRESS;
+ } else
+ start = ISA_END_ADDRESS;
+ new_mapped_ram_size = init_range_memory_mapping(start,
+ last_start);
+ last_start = start;
+ min_pfn_mapped = last_start >> PAGE_SHIFT;
+ /* only increase step_size after big range get mapped */
+ if (new_mapped_ram_size > mapped_ram_size)
+ step_size <<= STEP_SIZE_SHIFT;
+ mapped_ram_size += new_mapped_ram_size;
+ }
+
+ if (real_end < end)
+ init_range_memory_mapping(real_end, end);
+
#ifdef CONFIG_X86_64
if (max_pfn > max_low_pfn) {
/* can we preseve max_low_pfn ?*/
max_low_pfn = max_pfn;
}
#endif
- /*
- * Reserve the kernel pagetable pages we used (pgt_buf_start -
- * pgt_buf_end) and free the other ones (pgt_buf_end - pgt_buf_top)
- * so that they can be reused for other purposes.
- *
- * On native it just means calling memblock_reserve, on Xen it also
- * means marking RW the pagetable pages that we allocated before
- * but that haven't been used.
- *
- * In fact on xen we mark RO the whole range pgt_buf_start -
- * pgt_buf_top, because we have to make sure that when
- * init_memory_mapping reaches the pagetable pages area, it maps
- * RO all the pagetable pages, including the ones that are beyond
- * pgt_buf_end at that time.
- */
- if (pgt_buf_end > pgt_buf_start) {
- printk(KERN_DEBUG "kernel direct mapping tables up to %#lx @ [mem %#010lx-%#010lx] final\n",
- end - 1, pgt_buf_start << PAGE_SHIFT,
- (pgt_buf_end << PAGE_SHIFT) - 1);
- x86_init.mapping.pagetable_reserve(PFN_PHYS(pgt_buf_start),
- PFN_PHYS(pgt_buf_end));
- }
-
- /* stop the wrong using */
- pgt_buf_top = 0;
-
early_memtest(0, max_pfn_mapped << PAGE_SHIFT);
}

diff --git a/arch/x86/mm/init_32.c b/arch/x86/mm/init_32.c
index 27f7fc6..7bb1106 100644
--- a/arch/x86/mm/init_32.c
+++ b/arch/x86/mm/init_32.c
@@ -61,11 +61,22 @@ bool __read_mostly __vmalloc_start_set = false;

static __init void *alloc_low_page(void)
{
- unsigned long pfn = pgt_buf_end++;
+ unsigned long pfn;
void *adr;

- if (pfn >= pgt_buf_top)
- panic("alloc_low_page: ran out of memory");
+ if ((pgt_buf_end + 1) >= pgt_buf_top) {
+ unsigned long ret;
+ if (min_pfn_mapped >= max_pfn_mapped)
+ panic("alloc_low_page: ran out of memory");
+ ret = memblock_find_in_range(min_pfn_mapped << PAGE_SHIFT,
+ max_pfn_mapped << PAGE_SHIFT,
+ PAGE_SIZE, PAGE_SIZE);
+ if (!ret)
+ panic("alloc_low_page: can not alloc memory");
+ memblock_reserve(ret, PAGE_SIZE);
+ pfn = ret >> PAGE_SHIFT;
+ } else
+ pfn = pgt_buf_end++;

adr = __va(pfn * PAGE_SIZE);
clear_page(adr);
diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index fa28e3e..eefaea6 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -316,7 +316,7 @@ void __init cleanup_highmap(void)

static __ref void *alloc_low_page(unsigned long *phys)
{
- unsigned long pfn = pgt_buf_end++;
+ unsigned long pfn;
void *adr;

if (after_bootmem) {
@@ -326,8 +326,19 @@ static __ref void *alloc_low_page(unsigned long *phys)
return adr;
}

- if (pfn >= pgt_buf_top)
- panic("alloc_low_page: ran out of memory");
+ if ((pgt_buf_end + 1) >= pgt_buf_top) {
+ unsigned long ret;
+ if (min_pfn_mapped >= max_pfn_mapped)
+ panic("alloc_low_page: ran out of memory");
+ ret = memblock_find_in_range(min_pfn_mapped << PAGE_SHIFT,
+ max_pfn_mapped << PAGE_SHIFT,
+ PAGE_SIZE, PAGE_SIZE);
+ if (!ret)
+ panic("alloc_low_page: can not alloc memory");
+ memblock_reserve(ret, PAGE_SIZE);
+ pfn = ret >> PAGE_SHIFT;
+ } else
+ pfn = pgt_buf_end++;

adr = early_memremap(pfn * PAGE_SIZE, PAGE_SIZE);
clear_page(adr);
--
1.7.7

2012-11-12 21:24:33

by Yinghai Lu

[permalink] [raw]
Subject: [PATCH 12/46] x86, mm: use pfn_range_is_mapped() with CPA

We are going to map RAM only, so a pfn being below max_low_pfn_mapped, or
between 4G and max_pfn_mapped, no longer means it is mapped at all.

Use pfn_range_is_mapped() directly.
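
A stand-alone sketch of the idea behind pfn_range_is_mapped() (not the
kernel implementation; the pfn_mapped table below is a toy): the series
records every directly mapped pfn range, and the predicate checks
containment in one of those ranges instead of comparing against
max_low_pfn_mapped/max_pfn_mapped.

#include <stdbool.h>
#include <stdio.h>

struct range { unsigned long start, end; };	/* [start, end) in pfns */

static const struct range pfn_mapped[] = {
	{ 0x000, 0x100 },		/* 0 - 1M */
	{ 0x100, 0x800 },		/* 1M - 8M */
	{ 0x100000, 0x140000 },		/* 4G - 5G */
};

static bool pfn_range_is_mapped(unsigned long start, unsigned long end)
{
	for (unsigned i = 0; i < sizeof(pfn_mapped)/sizeof(pfn_mapped[0]); i++)
		if (start >= pfn_mapped[i].start && end <= pfn_mapped[i].end)
			return true;
	return false;
}

int main(void)
{
	/* a pfn at 16M is below max_pfn_mapped here, yet not mapped at all */
	printf("2M: %d, 16M: %d\n", pfn_range_is_mapped(0x200, 0x201),
	       pfn_range_is_mapped(0x1000, 0x1001));
	return 0;
}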

Signed-off-by: Yinghai Lu <[email protected]>
---
arch/x86/mm/pageattr.c | 16 +++-------------
1 files changed, 3 insertions(+), 13 deletions(-)

diff --git a/arch/x86/mm/pageattr.c b/arch/x86/mm/pageattr.c
index a718e0d..44acfcd 100644
--- a/arch/x86/mm/pageattr.c
+++ b/arch/x86/mm/pageattr.c
@@ -551,16 +551,10 @@ static int split_large_page(pte_t *kpte, unsigned long address)
for (i = 0; i < PTRS_PER_PTE; i++, pfn += pfninc)
set_pte(&pbase[i], pfn_pte(pfn, ref_prot));

- if (address >= (unsigned long)__va(0) &&
- address < (unsigned long)__va(max_low_pfn_mapped << PAGE_SHIFT))
+ if (pfn_range_is_mapped(PFN_DOWN(__pa(address)),
+ PFN_DOWN(__pa(address)) + 1))
split_page_count(level);

-#ifdef CONFIG_X86_64
- if (address >= (unsigned long)__va(1UL<<32) &&
- address < (unsigned long)__va(max_pfn_mapped << PAGE_SHIFT))
- split_page_count(level);
-#endif
-
/*
* Install the new, split up pagetable.
*
@@ -729,13 +723,9 @@ static int cpa_process_alias(struct cpa_data *cpa)
unsigned long vaddr;
int ret;

- if (cpa->pfn >= max_pfn_mapped)
+ if (!pfn_range_is_mapped(cpa->pfn, cpa->pfn + 1))
return 0;

-#ifdef CONFIG_X86_64
- if (cpa->pfn >= max_low_pfn_mapped && cpa->pfn < (1UL<<(32-PAGE_SHIFT)))
- return 0;
-#endif
/*
* No need to redo, when the primary call touched the direct
* mapping already:
--
1.7.7

2012-11-12 21:24:31

by Yinghai Lu

[permalink] [raw]
Subject: [PATCH 24/46] x86, mm: Merge alloc_low_page between 64bit and 32bit

They are almost the same, except that the 64-bit version needs to handle the
after_bootmem case.

Add mm_internal.h so that alloc_low_page() is only accessible from
arch/x86/mm/init*.c.

Signed-off-by: Yinghai Lu <[email protected]>
---
arch/x86/mm/init.c | 34 ++++++++++++++++++++++++++++++++++
arch/x86/mm/init_32.c | 26 ++------------------------
arch/x86/mm/init_64.c | 32 ++------------------------------
arch/x86/mm/mm_internal.h | 6 ++++++
4 files changed, 44 insertions(+), 54 deletions(-)
create mode 100644 arch/x86/mm/mm_internal.h

diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
index 76a6e82..ffbb7af 100644
--- a/arch/x86/mm/init.c
+++ b/arch/x86/mm/init.c
@@ -17,10 +17,44 @@
#include <asm/proto.h>
#include <asm/dma.h> /* for MAX_DMA_PFN */

+#include "mm_internal.h"
+
unsigned long __initdata pgt_buf_start;
unsigned long __meminitdata pgt_buf_end;
unsigned long __meminitdata pgt_buf_top;

+__ref void *alloc_low_page(void)
+{
+ unsigned long pfn;
+ void *adr;
+
+#ifdef CONFIG_X86_64
+ if (after_bootmem) {
+ adr = (void *)get_zeroed_page(GFP_ATOMIC | __GFP_NOTRACK);
+
+ return adr;
+ }
+#endif
+
+ if ((pgt_buf_end + 1) >= pgt_buf_top) {
+ unsigned long ret;
+ if (min_pfn_mapped >= max_pfn_mapped)
+ panic("alloc_low_page: ran out of memory");
+ ret = memblock_find_in_range(min_pfn_mapped << PAGE_SHIFT,
+ max_pfn_mapped << PAGE_SHIFT,
+ PAGE_SIZE, PAGE_SIZE);
+ if (!ret)
+ panic("alloc_low_page: can not alloc memory");
+ memblock_reserve(ret, PAGE_SIZE);
+ pfn = ret >> PAGE_SHIFT;
+ } else
+ pfn = pgt_buf_end++;
+
+ adr = __va(pfn * PAGE_SIZE);
+ clear_page(adr);
+ return adr;
+}
+
/* need 4 4k for initial PMD_SIZE, 4k for 0-ISA_END_ADDRESS */
#define INIT_PGT_BUF_SIZE (5 * PAGE_SIZE)
RESERVE_BRK(early_pgt_alloc, INIT_PGT_BUF_SIZE);
diff --git a/arch/x86/mm/init_32.c b/arch/x86/mm/init_32.c
index 7bb1106..a7f2df1 100644
--- a/arch/x86/mm/init_32.c
+++ b/arch/x86/mm/init_32.c
@@ -53,36 +53,14 @@
#include <asm/page_types.h>
#include <asm/init.h>

+#include "mm_internal.h"
+
unsigned long highstart_pfn, highend_pfn;

static noinline int do_test_wp_bit(void);

bool __read_mostly __vmalloc_start_set = false;

-static __init void *alloc_low_page(void)
-{
- unsigned long pfn;
- void *adr;
-
- if ((pgt_buf_end + 1) >= pgt_buf_top) {
- unsigned long ret;
- if (min_pfn_mapped >= max_pfn_mapped)
- panic("alloc_low_page: ran out of memory");
- ret = memblock_find_in_range(min_pfn_mapped << PAGE_SHIFT,
- max_pfn_mapped << PAGE_SHIFT,
- PAGE_SIZE, PAGE_SIZE);
- if (!ret)
- panic("alloc_low_page: can not alloc memory");
- memblock_reserve(ret, PAGE_SIZE);
- pfn = ret >> PAGE_SHIFT;
- } else
- pfn = pgt_buf_end++;
-
- adr = __va(pfn * PAGE_SIZE);
- clear_page(adr);
- return adr;
-}
-
/*
* Creates a middle page table and puts a pointer to it in the
* given global directory entry. This only returns the gd entry
diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index 1960820..1d53def 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -54,6 +54,8 @@
#include <asm/uv/uv.h>
#include <asm/setup.h>

+#include "mm_internal.h"
+
static int __init parse_direct_gbpages_off(char *arg)
{
direct_gbpages = 0;
@@ -314,36 +316,6 @@ void __init cleanup_highmap(void)
}
}

-static __ref void *alloc_low_page(void)
-{
- unsigned long pfn;
- void *adr;
-
- if (after_bootmem) {
- adr = (void *)get_zeroed_page(GFP_ATOMIC | __GFP_NOTRACK);
-
- return adr;
- }
-
- if ((pgt_buf_end + 1) >= pgt_buf_top) {
- unsigned long ret;
- if (min_pfn_mapped >= max_pfn_mapped)
- panic("alloc_low_page: ran out of memory");
- ret = memblock_find_in_range(min_pfn_mapped << PAGE_SHIFT,
- max_pfn_mapped << PAGE_SHIFT,
- PAGE_SIZE, PAGE_SIZE);
- if (!ret)
- panic("alloc_low_page: can not alloc memory");
- memblock_reserve(ret, PAGE_SIZE);
- pfn = ret >> PAGE_SHIFT;
- } else
- pfn = pgt_buf_end++;
-
- adr = __va(pfn * PAGE_SIZE);
- clear_page(adr);
- return adr;
-}
-
static unsigned long __meminit
phys_pte_init(pte_t *pte_page, unsigned long addr, unsigned long end,
pgprot_t prot)
diff --git a/arch/x86/mm/mm_internal.h b/arch/x86/mm/mm_internal.h
new file mode 100644
index 0000000..b3f993a
--- /dev/null
+++ b/arch/x86/mm/mm_internal.h
@@ -0,0 +1,6 @@
+#ifndef __X86_MM_INTERNAL_H
+#define __X86_MM_INTERNAL_H
+
+void *alloc_low_page(void);
+
+#endif /* __X86_MM_INTERNAL_H */
--
1.7.7

2012-11-12 21:26:58

by Yinghai Lu

[permalink] [raw]
Subject: [PATCH 05/46] x86, mm: Revert back good_end setting for 64bit

After

| commit 8548c84da2f47e71bbbe300f55edb768492575f7
| Author: Takashi Iwai <[email protected]>
| Date: Sun Oct 23 23:19:12 2011 +0200
|
| x86: Fix S4 regression
|
| Commit 4b239f458 ("x86-64, mm: Put early page table high") causes a S4
| regression since 2.6.39, namely the machine reboots occasionally at S4
| resume. It doesn't happen always, overall rate is about 1/20. But,
| like other bugs, once when this happens, it continues to happen.
|
| This patch fixes the problem by essentially reverting the memory
| assignment in the older way.

That change left some page tables around 512M again, which prevents kdump
from finding a 512M block under 768M.

We need to revert that revert, so we can put the page tables high again for
64-bit.

Takashi agreed that the S4 regression could have another cause.

https://lkml.org/lkml/2012/6/15/182

Signed-off-by: Yinghai Lu <[email protected]>
---
arch/x86/mm/init.c | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
index 8927276..8f57b12 100644
--- a/arch/x86/mm/init.c
+++ b/arch/x86/mm/init.c
@@ -234,8 +234,8 @@ static void __init find_early_table_space(struct map_range *mr, int nr_range)
#ifdef CONFIG_X86_32
/* for fixmap */
tables += roundup(__end_of_fixed_addresses * sizeof(pte_t), PAGE_SIZE);
-#endif
good_end = max_pfn_mapped << PAGE_SHIFT;
+#endif

base = memblock_find_in_range(start, good_end, tables, PAGE_SIZE);
if (!base)
--
1.7.7

2012-11-12 21:26:59

by Yinghai Lu

[permalink] [raw]
Subject: [PATCH 38/46] x86, mm: use limit_pfn for end pfn

Carry the end pfn around as limit_pfn instead of shifting end every time to
get it.
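
A quick sanity check (userspace, with made-up page/PMD constants) that
rounding the pfn directly matches rounding the byte address first and
shifting afterwards, which is what makes it safe to carry limit_pfn
around:

#include <assert.h>

#define PAGE_SHIFT	12
#define PMD_SIZE	(1UL << 21)
#define PFN_DOWN(x)	((x) >> PAGE_SHIFT)
#define round_down(x, a) ((x) & ~((a) - 1))

int main(void)
{
	for (unsigned long end = 0; end < (1UL << 26); end += 4096 + 123)
		assert(PFN_DOWN(round_down(end, PMD_SIZE)) ==
		       round_down(PFN_DOWN(end), PFN_DOWN(PMD_SIZE)));
	return 0;
}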

Signed-off-by: Yinghai Lu <[email protected]>
---
arch/x86/mm/init.c | 20 +++++++++++---------
1 files changed, 11 insertions(+), 9 deletions(-)

diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
index e430f1e..a0f579a 100644
--- a/arch/x86/mm/init.c
+++ b/arch/x86/mm/init.c
@@ -199,10 +199,12 @@ static int __meminit split_mem_range(struct map_range *mr, int nr_range,
unsigned long start,
unsigned long end)
{
- unsigned long start_pfn, end_pfn;
+ unsigned long start_pfn, end_pfn, limit_pfn;
unsigned long pfn;
int i;

+ limit_pfn = PFN_DOWN(end);
+
/* head if not big page alignment ? */
pfn = start_pfn = PFN_DOWN(start);
#ifdef CONFIG_X86_32
@@ -219,8 +221,8 @@ static int __meminit split_mem_range(struct map_range *mr, int nr_range,
#else /* CONFIG_X86_64 */
end_pfn = round_up(pfn, PFN_DOWN(PMD_SIZE));
#endif
- if (end_pfn > PFN_DOWN(end))
- end_pfn = PFN_DOWN(end);
+ if (end_pfn > limit_pfn)
+ end_pfn = limit_pfn;
if (start_pfn < end_pfn) {
nr_range = save_mr(mr, nr_range, start_pfn, end_pfn, 0);
pfn = end_pfn;
@@ -229,11 +231,11 @@ static int __meminit split_mem_range(struct map_range *mr, int nr_range,
/* big page (2M) range */
start_pfn = round_up(pfn, PFN_DOWN(PMD_SIZE));
#ifdef CONFIG_X86_32
- end_pfn = PFN_DOWN(round_down(end, PMD_SIZE));
+ end_pfn = round_down(limit_pfn, PFN_DOWN(PMD_SIZE));
#else /* CONFIG_X86_64 */
end_pfn = round_up(pfn, PFN_DOWN(PUD_SIZE));
- if (end_pfn > PFN_DOWN(round_down(end, PMD_SIZE)))
- end_pfn = PFN_DOWN(round_down(end, PMD_SIZE));
+ if (end_pfn > round_down(limit_pfn, PFN_DOWN(PMD_SIZE)))
+ end_pfn = round_down(limit_pfn, PFN_DOWN(PMD_SIZE));
#endif

if (start_pfn < end_pfn) {
@@ -245,7 +247,7 @@ static int __meminit split_mem_range(struct map_range *mr, int nr_range,
#ifdef CONFIG_X86_64
/* big page (1G) range */
start_pfn = round_up(pfn, PFN_DOWN(PUD_SIZE));
- end_pfn = PFN_DOWN(round_down(end, PUD_SIZE));
+ end_pfn = round_down(limit_pfn, PFN_DOWN(PUD_SIZE));
if (start_pfn < end_pfn) {
nr_range = save_mr(mr, nr_range, start_pfn, end_pfn,
page_size_mask &
@@ -255,7 +257,7 @@ static int __meminit split_mem_range(struct map_range *mr, int nr_range,

/* tail is not big page (1G) alignment */
start_pfn = round_up(pfn, PFN_DOWN(PMD_SIZE));
- end_pfn = PFN_DOWN(round_down(end, PMD_SIZE));
+ end_pfn = round_down(limit_pfn, PFN_DOWN(PMD_SIZE));
if (start_pfn < end_pfn) {
nr_range = save_mr(mr, nr_range, start_pfn, end_pfn,
page_size_mask & (1<<PG_LEVEL_2M));
@@ -265,7 +267,7 @@ static int __meminit split_mem_range(struct map_range *mr, int nr_range,

/* tail is not big page (2M) alignment */
start_pfn = pfn;
- end_pfn = PFN_DOWN(end);
+ end_pfn = limit_pfn;
nr_range = save_mr(mr, nr_range, start_pfn, end_pfn, 0);

/* try to merge same page size and continuous */
--
1.7.7

2012-11-12 21:26:56

by Yinghai Lu

[permalink] [raw]
Subject: [PATCH 10/46] x86, mm: if kernel .text .data .bss are not marked as E820_RAM, complain and fix

From: Jacob Shin <[email protected]>

There could be cases where user supplied memmap=exactmap memory
mappings do not mark the region where the kernel .text .data and
.bss reside as E820_RAM, as reported here:

https://lkml.org/lkml/2012/8/14/86

Handle it by complaining, and adding the range back into the e820.

Signed-off-by: Jacob Shin <[email protected]>
Signed-off-by: Yinghai Lu <[email protected]>
Reviewed-by: Pekka Enberg <[email protected]>
---
arch/x86/kernel/setup.c | 14 ++++++++++++++
1 files changed, 14 insertions(+), 0 deletions(-)

diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index 4bd8921..d85cbd9 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -832,6 +832,20 @@ void __init setup_arch(char **cmdline_p)
insert_resource(&iomem_resource, &data_resource);
insert_resource(&iomem_resource, &bss_resource);

+ /*
+ * Complain if .text .data and .bss are not marked as E820_RAM and
+ * attempt to fix it by adding the range. We may have a confused BIOS,
+ * or the user may have incorrectly supplied it via memmap=exactmap. If
+ * we really are running on top non-RAM, we will crash later anyways.
+ */
+ if (!e820_all_mapped(code_resource.start, __pa(__brk_limit), E820_RAM)) {
+ pr_warn(".text .data .bss are not marked as E820_RAM!\n");
+
+ e820_add_region(code_resource.start,
+ __pa(__brk_limit) - code_resource.start + 1,
+ E820_RAM);
+ }
+
trim_bios_range();
#ifdef CONFIG_X86_32
if (ppro_with_ram_bug()) {
--
1.7.7

2012-11-12 21:26:55

by Yinghai Lu

[permalink] [raw]
Subject: [PATCH 16/46] x86, mm: relocate initrd under all mem for 64bit

Relocate it under all mapped memory instead of under 4G.

For 64-bit, we can use any mapped memory instead of just low memory.

Signed-off-by: Yinghai Lu <[email protected]>
---
arch/x86/kernel/setup.c | 7 +++----
1 files changed, 3 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index 68dffec..94f922a 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -324,7 +324,7 @@ static void __init relocate_initrd(void)
char *p, *q;

/* We need to move the initrd down into directly mapped mem */
- ramdisk_here = memblock_find_in_range(0, PFN_PHYS(max_low_pfn_mapped),
+ ramdisk_here = memblock_find_in_range(0, PFN_PHYS(max_pfn_mapped),
area_size, PAGE_SIZE);

if (!ramdisk_here)
@@ -392,7 +392,7 @@ static void __init reserve_initrd(void)

initrd_start = 0;

- mapped_size = get_mem_size(max_low_pfn_mapped);
+ mapped_size = get_mem_size(max_pfn_mapped);
if (ramdisk_size >= (mapped_size>>1))
panic("initrd too large to handle, "
"disabling initrd (%lld needed, %lld available)\n",
@@ -401,8 +401,7 @@ static void __init reserve_initrd(void)
printk(KERN_INFO "RAMDISK: [mem %#010llx-%#010llx]\n", ramdisk_image,
ramdisk_end - 1);

- if (ramdisk_end <= (max_low_pfn_mapped<<PAGE_SHIFT) &&
- pfn_range_is_mapped(PFN_DOWN(ramdisk_image),
+ if (pfn_range_is_mapped(PFN_DOWN(ramdisk_image),
PFN_DOWN(ramdisk_end))) {
/* All are mapped, easy case */
/*
--
1.7.7

2012-11-12 21:26:52

by Yinghai Lu

[permalink] [raw]
Subject: [PATCH 20/46] x86, mm: Break down init_all_memory_mapping

We will replace it with top-down page table initialization.
The new API needs to take a range: init_range_memory_mapping().
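
A userspace sketch of the clipping that init_range_memory_mapping() does
(illustrative only; the RAM table and the map_range() stub are invented):
each memory region is intersected with [range_start, range_end) and only
the overlap is handed to the mapper, so a range with holes in it is fine.

#include <stdio.h>

struct region { unsigned long start, end; };

static const struct region ram[] = {
	{ 0x00000000, 0x0009f000 },
	{ 0x00100000, 0x7f000000 },
	{ 0x100000000UL, 0x180000000UL },
};

static void map_range(unsigned long s, unsigned long e)
{
	printf("  map [%#lx, %#lx)\n", s, e);
}

static void init_range_memory_mapping(unsigned long range_start,
				      unsigned long range_end)
{
	for (unsigned i = 0; i < sizeof(ram)/sizeof(ram[0]); i++) {
		unsigned long s = ram[i].start, e = ram[i].end;

		if (e <= range_start || s >= range_end)
			continue;		/* no overlap with request */
		if (s < range_start)
			s = range_start;
		if (e > range_end)
			e = range_end;
		map_range(s, e);
	}
}

int main(void)
{
	init_range_memory_mapping(0x100000, 0x80000000UL);	/* 1M - 2G */
	return 0;
}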

Signed-off-by: Yinghai Lu <[email protected]>
---
arch/x86/mm/init.c | 41 +++++++++++++++++++----------------------
1 files changed, 19 insertions(+), 22 deletions(-)

diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
index daea254..47a1ba2 100644
--- a/arch/x86/mm/init.c
+++ b/arch/x86/mm/init.c
@@ -399,40 +399,30 @@ unsigned long __init_refok init_memory_mapping(unsigned long start,
* Depending on the alignment of E820 ranges, this may possibly result in using
* smaller size (i.e. 4K instead of 2M or 1G) page tables.
*/
-static void __init init_all_memory_mapping(void)
+static void __init init_range_memory_mapping(unsigned long range_start,
+ unsigned long range_end)
{
unsigned long start_pfn, end_pfn;
int i;

- /* the ISA range is always mapped regardless of memory holes */
- init_memory_mapping(0, ISA_END_ADDRESS);
-
for_each_mem_pfn_range(i, MAX_NUMNODES, &start_pfn, &end_pfn, NULL) {
u64 start = (u64)start_pfn << PAGE_SHIFT;
u64 end = (u64)end_pfn << PAGE_SHIFT;

- if (end <= ISA_END_ADDRESS)
+ if (end <= range_start)
continue;

- if (start < ISA_END_ADDRESS)
- start = ISA_END_ADDRESS;
-#ifdef CONFIG_X86_32
- /* on 32 bit, we only map up to max_low_pfn */
- if ((start >> PAGE_SHIFT) >= max_low_pfn)
+ if (start < range_start)
+ start = range_start;
+
+ if (start >= range_end)
continue;

- if ((end >> PAGE_SHIFT) > max_low_pfn)
- end = max_low_pfn << PAGE_SHIFT;
-#endif
- init_memory_mapping(start, end);
- }
+ if (end > range_end)
+ end = range_end;

-#ifdef CONFIG_X86_64
- if (max_pfn > max_low_pfn) {
- /* can we preseve max_low_pfn ?*/
- max_low_pfn = max_pfn;
+ init_memory_mapping(start, end);
}
-#endif
}

void __init init_mem_mapping(void)
@@ -462,8 +452,15 @@ void __init init_mem_mapping(void)
(pgt_buf_top << PAGE_SHIFT) - 1);

max_pfn_mapped = 0; /* will get exact value next */
- init_all_memory_mapping();
-
+ /* the ISA range is always mapped regardless of memory holes */
+ init_memory_mapping(0, ISA_END_ADDRESS);
+ init_range_memory_mapping(ISA_END_ADDRESS, end);
+#ifdef CONFIG_X86_64
+ if (max_pfn > max_low_pfn) {
+ /* can we preseve max_low_pfn ?*/
+ max_low_pfn = max_pfn;
+ }
+#endif
/*
* Reserve the kernel pagetable pages we used (pgt_buf_start -
* pgt_buf_end) and free the other ones (pgt_buf_end - pgt_buf_top)
--
1.7.7

2012-11-12 21:26:50

by Yinghai Lu

[permalink] [raw]
Subject: [PATCH 14/46] x86, mm: use pfn_range_is_mapped() with reserve_initrd

We are going to map RAM only, so a pfn being below max_low_pfn_mapped, or
between 4G and max_pfn_mapped, no longer means it is mapped at all.

Use pfn_range_is_mapped() to find out whether the initrd range is mapped.

It can happen that the bootloader puts the initrd in such a range but the
user carves part of that range out with memmap=.

Also, during copying we need to use early_memremap() to map the original
initrd for access.
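
A small illustration of the new size check in reserve_initrd() (made-up
numbers): the limit is now half of the RAM that is actually mapped, as
returned by get_mem_size(), rather than half of the old lowmem boundary.

#include <stdio.h>

int main(void)
{
	unsigned long long mapped_size  = 2ULL << 30;	/* say 2G is mapped */
	unsigned long long ramdisk_size = 900ULL << 20;	/* 900M initrd */

	if (ramdisk_size >= (mapped_size >> 1))
		printf("would panic: initrd too large (%llu needed, %llu available)\n",
		       ramdisk_size, mapped_size >> 1);
	else
		printf("initrd fits: %llu < %llu\n",
		       ramdisk_size, mapped_size >> 1);
	return 0;
}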

Signed-off-by: Yinghai Lu <[email protected]>
---
arch/x86/kernel/setup.c | 52 +++++++++++++++++++++++++---------------------
1 files changed, 28 insertions(+), 24 deletions(-)

diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index d85cbd9..bd52f9d 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -317,20 +317,19 @@ static void __init relocate_initrd(void)
u64 ramdisk_image = boot_params.hdr.ramdisk_image;
u64 ramdisk_size = boot_params.hdr.ramdisk_size;
u64 area_size = PAGE_ALIGN(ramdisk_size);
- u64 end_of_lowmem = max_low_pfn_mapped << PAGE_SHIFT;
u64 ramdisk_here;
unsigned long slop, clen, mapaddr;
char *p, *q;

- /* We need to move the initrd down into lowmem */
- ramdisk_here = memblock_find_in_range(0, end_of_lowmem, area_size,
- PAGE_SIZE);
+ /* We need to move the initrd down into directly mapped mem */
+ ramdisk_here = memblock_find_in_range(0, PFN_PHYS(max_low_pfn_mapped),
+ area_size, PAGE_SIZE);

if (!ramdisk_here)
panic("Cannot find place for new RAMDISK of size %lld\n",
ramdisk_size);

- /* Note: this includes all the lowmem currently occupied by
+ /* Note: this includes all the mem currently occupied by
the initrd, we rely on that fact to keep the data intact. */
memblock_reserve(ramdisk_here, area_size);
initrd_start = ramdisk_here + PAGE_OFFSET;
@@ -340,17 +339,7 @@ static void __init relocate_initrd(void)

q = (char *)initrd_start;

- /* Copy any lowmem portion of the initrd */
- if (ramdisk_image < end_of_lowmem) {
- clen = end_of_lowmem - ramdisk_image;
- p = (char *)__va(ramdisk_image);
- memcpy(q, p, clen);
- q += clen;
- ramdisk_image += clen;
- ramdisk_size -= clen;
- }
-
- /* Copy the highmem portion of the initrd */
+ /* Copy the initrd */
while (ramdisk_size) {
slop = ramdisk_image & ~PAGE_MASK;
clen = ramdisk_size;
@@ -364,7 +353,7 @@ static void __init relocate_initrd(void)
ramdisk_image += clen;
ramdisk_size -= clen;
}
- /* high pages is not converted by early_res_to_bootmem */
+
ramdisk_image = boot_params.hdr.ramdisk_image;
ramdisk_size = boot_params.hdr.ramdisk_size;
printk(KERN_INFO "Move RAMDISK from [mem %#010llx-%#010llx] to"
@@ -373,13 +362,27 @@ static void __init relocate_initrd(void)
ramdisk_here, ramdisk_here + ramdisk_size - 1);
}

+static u64 __init get_mem_size(unsigned long limit_pfn)
+{
+ int i;
+ u64 mapped_pages = 0;
+ unsigned long start_pfn, end_pfn;
+
+ for_each_mem_pfn_range(i, MAX_NUMNODES, &start_pfn, &end_pfn, NULL) {
+ start_pfn = min_t(unsigned long, start_pfn, limit_pfn);
+ end_pfn = min_t(unsigned long, end_pfn, limit_pfn);
+ mapped_pages += end_pfn - start_pfn;
+ }
+
+ return mapped_pages << PAGE_SHIFT;
+}
static void __init reserve_initrd(void)
{
/* Assume only end is not page aligned */
u64 ramdisk_image = boot_params.hdr.ramdisk_image;
u64 ramdisk_size = boot_params.hdr.ramdisk_size;
u64 ramdisk_end = PAGE_ALIGN(ramdisk_image + ramdisk_size);
- u64 end_of_lowmem = max_low_pfn_mapped << PAGE_SHIFT;
+ u64 mapped_size;

if (!boot_params.hdr.type_of_loader ||
!ramdisk_image || !ramdisk_size)
@@ -387,18 +390,19 @@ static void __init reserve_initrd(void)

initrd_start = 0;

- if (ramdisk_size >= (end_of_lowmem>>1)) {
+ mapped_size = get_mem_size(max_low_pfn_mapped);
+ if (ramdisk_size >= (mapped_size>>1))
panic("initrd too large to handle, "
"disabling initrd (%lld needed, %lld available)\n",
- ramdisk_size, end_of_lowmem>>1);
- }
+ ramdisk_size, mapped_size>>1);

printk(KERN_INFO "RAMDISK: [mem %#010llx-%#010llx]\n", ramdisk_image,
ramdisk_end - 1);

-
- if (ramdisk_end <= end_of_lowmem) {
- /* All in lowmem, easy case */
+ if (ramdisk_end <= (max_low_pfn_mapped<<PAGE_SHIFT) &&
+ pfn_range_is_mapped(PFN_DOWN(ramdisk_image),
+ PFN_DOWN(ramdisk_end))) {
+ /* All are mapped, easy case */
/*
* don't need to reserve again, already reserved early
* in i386_start_kernel
--
1.7.7

2012-11-12 21:32:07

by Yinghai Lu

[permalink] [raw]
Subject: [PATCH 22/46] x86, mm: Remove early_memremap workaround for page table accessing on 64bit

We used to put the page tables high to make room for kdump, and at that time
those ranges were not mapped yet, so we had to use early_memremap() to access
them.

Now, after the patch that pre-maps the page tables top-down,
x86, mm: setup page table in top-down
we do not need that workaround anymore.

Just use __va() to return the direct-mapping address.

Signed-off-by: Yinghai Lu <[email protected]>
---
arch/x86/mm/init_64.c | 38 ++++----------------------------------
1 files changed, 4 insertions(+), 34 deletions(-)

diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index eefaea6..5ee9242 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -340,36 +340,12 @@ static __ref void *alloc_low_page(unsigned long *phys)
} else
pfn = pgt_buf_end++;

- adr = early_memremap(pfn * PAGE_SIZE, PAGE_SIZE);
+ adr = __va(pfn * PAGE_SIZE);
clear_page(adr);
*phys = pfn * PAGE_SIZE;
return adr;
}

-static __ref void *map_low_page(void *virt)
-{
- void *adr;
- unsigned long phys, left;
-
- if (after_bootmem)
- return virt;
-
- phys = __pa(virt);
- left = phys & (PAGE_SIZE - 1);
- adr = early_memremap(phys & PAGE_MASK, PAGE_SIZE);
- adr = (void *)(((unsigned long)adr) | left);
-
- return adr;
-}
-
-static __ref void unmap_low_page(void *adr)
-{
- if (after_bootmem)
- return;
-
- early_iounmap((void *)((unsigned long)adr & PAGE_MASK), PAGE_SIZE);
-}
-
static unsigned long __meminit
phys_pte_init(pte_t *pte_page, unsigned long addr, unsigned long end,
pgprot_t prot)
@@ -442,10 +418,9 @@ phys_pmd_init(pmd_t *pmd_page, unsigned long address, unsigned long end,
if (pmd_val(*pmd)) {
if (!pmd_large(*pmd)) {
spin_lock(&init_mm.page_table_lock);
- pte = map_low_page((pte_t *)pmd_page_vaddr(*pmd));
+ pte = (pte_t *)pmd_page_vaddr(*pmd);
last_map_addr = phys_pte_init(pte, address,
end, prot);
- unmap_low_page(pte);
spin_unlock(&init_mm.page_table_lock);
continue;
}
@@ -483,7 +458,6 @@ phys_pmd_init(pmd_t *pmd_page, unsigned long address, unsigned long end,

pte = alloc_low_page(&pte_phys);
last_map_addr = phys_pte_init(pte, address, end, new_prot);
- unmap_low_page(pte);

spin_lock(&init_mm.page_table_lock);
pmd_populate_kernel(&init_mm, pmd, __va(pte_phys));
@@ -518,10 +492,9 @@ phys_pud_init(pud_t *pud_page, unsigned long addr, unsigned long end,

if (pud_val(*pud)) {
if (!pud_large(*pud)) {
- pmd = map_low_page(pmd_offset(pud, 0));
+ pmd = pmd_offset(pud, 0);
last_map_addr = phys_pmd_init(pmd, addr, end,
page_size_mask, prot);
- unmap_low_page(pmd);
__flush_tlb_all();
continue;
}
@@ -560,7 +533,6 @@ phys_pud_init(pud_t *pud_page, unsigned long addr, unsigned long end,
pmd = alloc_low_page(&pmd_phys);
last_map_addr = phys_pmd_init(pmd, addr, end, page_size_mask,
prot);
- unmap_low_page(pmd);

spin_lock(&init_mm.page_table_lock);
pud_populate(&init_mm, pud, __va(pmd_phys));
@@ -596,17 +568,15 @@ kernel_physical_mapping_init(unsigned long start,
next = end;

if (pgd_val(*pgd)) {
- pud = map_low_page((pud_t *)pgd_page_vaddr(*pgd));
+ pud = (pud_t *)pgd_page_vaddr(*pgd);
last_map_addr = phys_pud_init(pud, __pa(start),
__pa(end), page_size_mask);
- unmap_low_page(pud);
continue;
}

pud = alloc_low_page(&pud_phys);
last_map_addr = phys_pud_init(pud, __pa(start), __pa(next),
page_size_mask);
- unmap_low_page(pud);

spin_lock(&init_mm.page_table_lock);
pgd_populate(&init_mm, pgd, __va(pud_phys));
--
1.7.7

2012-11-12 21:32:06

by Yinghai Lu

[permalink] [raw]
Subject: [PATCH 18/46] x86, mm: Use big page size for small memory range

We could map a small range in the middle of a big range first, so we should
use a big page size from the start, to avoid having to break the page table
down into small pages later.

We can only set the big-page bit when the surrounding range is RAM as well.

-v2: fix 32-bit boundary checking. We cannot count RAM above max_low_pfn
for 32-bit.
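
A toy model of adjust_range_page_size_mask() (userspace; the RAM table and
is_region_memory() are stand-ins for memblock_is_region_memory()): a
sub-2M request whose surrounding 2M-aligned window is entirely RAM gets
promoted to 2M pages.

#include <stdbool.h>
#include <stdio.h>

#define PMD_SIZE	(1UL << 21)
#define round_down(x, a) ((x) & ~((a) - 1))
#define round_up(x, a)   round_down((x) + (a) - 1, (a))

struct region { unsigned long start, end; };
static const struct region ram[] = { { 0x100000, 0x7f000000 } };

static bool is_region_memory(unsigned long start, unsigned long end)
{
	for (unsigned i = 0; i < sizeof(ram)/sizeof(ram[0]); i++)
		if (start >= ram[i].start && end <= ram[i].end)
			return true;
	return false;
}

int main(void)
{
	/* a 512K request in the middle of RAM, not 2M aligned */
	unsigned long start = 0x6480000, end = 0x6500000;
	unsigned long big_start = round_down(start, PMD_SIZE);
	unsigned long big_end   = round_up(end, PMD_SIZE);

	printf("[%#lx, %#lx) -> window [%#lx, %#lx), 2M pages: %s\n",
	       start, end, big_start, big_end,
	       is_region_memory(big_start, big_end) ? "yes" : "no");
	return 0;
}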

Signed-off-by: Yinghai Lu <[email protected]>
---
arch/x86/mm/init.c | 37 +++++++++++++++++++++++++++++++++++++
1 files changed, 37 insertions(+), 0 deletions(-)

diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
index 99d584c..daea254 100644
--- a/arch/x86/mm/init.c
+++ b/arch/x86/mm/init.c
@@ -88,6 +88,40 @@ static int __meminit save_mr(struct map_range *mr, int nr_range,
return nr_range;
}

+/*
+ * adjust the page_size_mask for small range to go with
+ * big page size instead small one if nearby are ram too.
+ */
+static void __init_refok adjust_range_page_size_mask(struct map_range *mr,
+ int nr_range)
+{
+ int i;
+
+ for (i = 0; i < nr_range; i++) {
+ if ((page_size_mask & (1<<PG_LEVEL_2M)) &&
+ !(mr[i].page_size_mask & (1<<PG_LEVEL_2M))) {
+ unsigned long start = round_down(mr[i].start, PMD_SIZE);
+ unsigned long end = round_up(mr[i].end, PMD_SIZE);
+
+#ifdef CONFIG_X86_32
+ if ((end >> PAGE_SHIFT) > max_low_pfn)
+ continue;
+#endif
+
+ if (memblock_is_region_memory(start, end - start))
+ mr[i].page_size_mask |= 1<<PG_LEVEL_2M;
+ }
+ if ((page_size_mask & (1<<PG_LEVEL_1G)) &&
+ !(mr[i].page_size_mask & (1<<PG_LEVEL_1G))) {
+ unsigned long start = round_down(mr[i].start, PUD_SIZE);
+ unsigned long end = round_up(mr[i].end, PUD_SIZE);
+
+ if (memblock_is_region_memory(start, end - start))
+ mr[i].page_size_mask |= 1<<PG_LEVEL_1G;
+ }
+ }
+}
+
static int __meminit split_mem_range(struct map_range *mr, int nr_range,
unsigned long start,
unsigned long end)
@@ -182,6 +216,9 @@ static int __meminit split_mem_range(struct map_range *mr, int nr_range,
nr_range--;
}

+ if (!after_bootmem)
+ adjust_range_page_size_mask(mr, nr_range);
+
for (i = 0; i < nr_range; i++)
printk(KERN_DEBUG " [mem %#010lx-%#010lx] page %s\n",
mr[i].start, mr[i].end - 1,
--
1.7.7

2012-11-12 21:32:40

by Yinghai Lu

[permalink] [raw]
Subject: [PATCH 27/46] x86, mm: Add alloc_low_pages(num)

The 32-bit kmap mapping needs its pages to be handed out from low to high.
At this point those pages still come from pgt_buf_* in BRK, so that is fine
for now.
But we want to move early_ioremap_page_table_range_init() out of
init_memory_mapping() and call it only once later; that will make
page_table_range_init()/page_table_kmap_check()/alloc_low_page() use
memblock to get pages.

memblock allocates pages from high to low, so we would hit the panic in
page_table_kmap_check(), which has a BUG_ON that checks the ordering.

This patch adds alloc_low_pages() to make it possible to allocate several
pages at once and hand them out one by one from low to high.

-v2: add a one-line comment about the xen requirements.
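
A toy demonstration of why the batching helps (userspace; topdown_alloc()
is an invented stand-in for memblock's top-down allocation): single-page
allocations come back in descending address order, which is exactly what
the kmap ordering BUG_ON objects to, while one batched allocation handed
out from its start is naturally ascending.

#include <stdio.h>

static unsigned long top = 0x8000;		/* toy "highest free pfn" */

static unsigned long topdown_alloc(unsigned long pages)
{
	top -= pages;				/* memblock-style: high to low */
	return top;
}

int main(void)
{
	int i;

	puts("one page at a time (descending):");
	for (i = 0; i < 3; i++)
		printf("  pfn %#lx\n", topdown_alloc(1));

	puts("one batch, handed out low to high (ascending):");
	unsigned long base = topdown_alloc(3);
	for (i = 0; i < 3; i++)
		printf("  pfn %#lx\n", base + i);
	return 0;
}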

Signed-off-by: Yinghai Lu <[email protected]>
Cc: Andrew Morton <[email protected]>
---
arch/x86/mm/init.c | 33 +++++++++++++++++++++------------
arch/x86/mm/mm_internal.h | 6 +++++-
2 files changed, 26 insertions(+), 13 deletions(-)

diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
index 9d51af72..f5e0120 100644
--- a/arch/x86/mm/init.c
+++ b/arch/x86/mm/init.c
@@ -25,36 +25,45 @@ unsigned long __meminitdata pgt_buf_top;

static unsigned long min_pfn_mapped;

-__ref void *alloc_low_page(void)
+__ref void *alloc_low_pages(unsigned int num)
{
unsigned long pfn;
- void *adr;
+ int i;

#ifdef CONFIG_X86_64
if (after_bootmem) {
- adr = (void *)get_zeroed_page(GFP_ATOMIC | __GFP_NOTRACK);
+ unsigned int order;

- return adr;
+ order = get_order((unsigned long)num << PAGE_SHIFT);
+ return (void *)__get_free_pages(GFP_ATOMIC | __GFP_NOTRACK |
+ __GFP_ZERO, order);
}
#endif

- if ((pgt_buf_end + 1) >= pgt_buf_top) {
+ if ((pgt_buf_end + num) >= pgt_buf_top) {
unsigned long ret;
if (min_pfn_mapped >= max_pfn_mapped)
panic("alloc_low_page: ran out of memory");
ret = memblock_find_in_range(min_pfn_mapped << PAGE_SHIFT,
max_pfn_mapped << PAGE_SHIFT,
- PAGE_SIZE, PAGE_SIZE);
+ PAGE_SIZE * num , PAGE_SIZE);
if (!ret)
panic("alloc_low_page: can not alloc memory");
- memblock_reserve(ret, PAGE_SIZE);
+ memblock_reserve(ret, PAGE_SIZE * num);
pfn = ret >> PAGE_SHIFT;
- } else
- pfn = pgt_buf_end++;
+ } else {
+ pfn = pgt_buf_end;
+ pgt_buf_end += num;
+ }
+
+ for (i = 0; i < num; i++) {
+ void *adr;
+
+ adr = __va((pfn + i) << PAGE_SHIFT);
+ clear_page(adr);
+ }

- adr = __va(pfn * PAGE_SIZE);
- clear_page(adr);
- return adr;
+ return __va(pfn << PAGE_SHIFT);
}

/* need 4 4k for initial PMD_SIZE, 4k for 0-ISA_END_ADDRESS */
diff --git a/arch/x86/mm/mm_internal.h b/arch/x86/mm/mm_internal.h
index b3f993a..7e3b88e 100644
--- a/arch/x86/mm/mm_internal.h
+++ b/arch/x86/mm/mm_internal.h
@@ -1,6 +1,10 @@
#ifndef __X86_MM_INTERNAL_H
#define __X86_MM_INTERNAL_H

-void *alloc_low_page(void);
+void *alloc_low_pages(unsigned int num);
+static inline void *alloc_low_page(void)
+{
+ return alloc_low_pages(1);
+}

#endif /* __X86_MM_INTERNAL_H */
--
1.7.7

2012-11-12 21:19:30

by Yinghai Lu

[permalink] [raw]
Subject: [PATCH 09/46] x86, mm: Set memblock initial limit to 1M

memblock_x86_fill() could double the memory array.
If we set memblock.current_limit to 512M, that array could end up around
512M, and then kdump cannot get a big range (like 512M) under 1024M.

Try to put it down under 1M instead; it only needs about 4k or so, and that
is a limited amount.

Signed-off-by: Yinghai Lu <[email protected]>
---
arch/x86/kernel/setup.c | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index 23b079f..4bd8921 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -890,7 +890,7 @@ void __init setup_arch(char **cmdline_p)

cleanup_highmap();

- memblock.current_limit = get_max_mapped();
+ memblock.current_limit = ISA_END_ADDRESS;
memblock_x86_fill();

/*
--
1.7.7

2012-11-12 21:32:58

by Yinghai Lu

[permalink] [raw]
Subject: [PATCH 26/46] x86, mm, Xen: Remove mapping_pagetable_reserve()

The page table area is pre-mapped now, after:
x86, mm: setup page table in top-down
x86, mm: Remove early_memremap workaround for page table accessing on 64bit

mapping_pagetable_reserve() is not used anymore, so remove it.

Signed-off-by: Yinghai Lu <[email protected]>
---
arch/x86/include/asm/pgtable_types.h | 1 -
arch/x86/include/asm/x86_init.h | 12 ------------
arch/x86/kernel/x86_init.c | 4 ----
arch/x86/mm/init.c | 4 ----
arch/x86/xen/mmu.c | 28 ----------------------------
5 files changed, 0 insertions(+), 49 deletions(-)

diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h
index ec8a1fc..79738f2 100644
--- a/arch/x86/include/asm/pgtable_types.h
+++ b/arch/x86/include/asm/pgtable_types.h
@@ -301,7 +301,6 @@ int phys_mem_access_prot_allowed(struct file *file, unsigned long pfn,
/* Install a pte for a particular vaddr in kernel space. */
void set_pte_vaddr(unsigned long vaddr, pte_t pte);

-extern void native_pagetable_reserve(u64 start, u64 end);
#ifdef CONFIG_X86_32
extern void native_pagetable_init(void);
#else
diff --git a/arch/x86/include/asm/x86_init.h b/arch/x86/include/asm/x86_init.h
index 5769349..3b2ce8f 100644
--- a/arch/x86/include/asm/x86_init.h
+++ b/arch/x86/include/asm/x86_init.h
@@ -69,17 +69,6 @@ struct x86_init_oem {
};

/**
- * struct x86_init_mapping - platform specific initial kernel pagetable setup
- * @pagetable_reserve: reserve a range of addresses for kernel pagetable usage
- *
- * For more details on the purpose of this hook, look in
- * init_memory_mapping and the commit that added it.
- */
-struct x86_init_mapping {
- void (*pagetable_reserve)(u64 start, u64 end);
-};
-
-/**
* struct x86_init_paging - platform specific paging functions
* @pagetable_init: platform specific paging initialization call to setup
* the kernel pagetables and prepare accessors functions.
@@ -136,7 +125,6 @@ struct x86_init_ops {
struct x86_init_mpparse mpparse;
struct x86_init_irqs irqs;
struct x86_init_oem oem;
- struct x86_init_mapping mapping;
struct x86_init_paging paging;
struct x86_init_timers timers;
struct x86_init_iommu iommu;
diff --git a/arch/x86/kernel/x86_init.c b/arch/x86/kernel/x86_init.c
index 7a3d075..50cf83e 100644
--- a/arch/x86/kernel/x86_init.c
+++ b/arch/x86/kernel/x86_init.c
@@ -62,10 +62,6 @@ struct x86_init_ops x86_init __initdata = {
.banner = default_banner,
},

- .mapping = {
- .pagetable_reserve = native_pagetable_reserve,
- },
-
.paging = {
.pagetable_init = native_pagetable_init,
},
diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
index 7a6669e..9d51af72 100644
--- a/arch/x86/mm/init.c
+++ b/arch/x86/mm/init.c
@@ -112,10 +112,6 @@ static void __init probe_page_size_mask(void)
__supported_pte_mask |= _PAGE_GLOBAL;
}
}
-void __init native_pagetable_reserve(u64 start, u64 end)
-{
- memblock_reserve(start, end - start);
-}

#ifdef CONFIG_X86_32
#define NR_RANGE_MR 3
diff --git a/arch/x86/xen/mmu.c b/arch/x86/xen/mmu.c
index dcf5f2d..bbb883f 100644
--- a/arch/x86/xen/mmu.c
+++ b/arch/x86/xen/mmu.c
@@ -1178,20 +1178,6 @@ static void xen_exit_mmap(struct mm_struct *mm)

static void xen_post_allocator_init(void);

-static __init void xen_mapping_pagetable_reserve(u64 start, u64 end)
-{
- /* reserve the range used */
- native_pagetable_reserve(start, end);
-
- /* set as RW the rest */
- printk(KERN_DEBUG "xen: setting RW the range %llx - %llx\n", end,
- PFN_PHYS(pgt_buf_top));
- while (end < PFN_PHYS(pgt_buf_top)) {
- make_lowmem_page_readwrite(__va(end));
- end += PAGE_SIZE;
- }
-}
-
#ifdef CONFIG_X86_64
static void __init xen_cleanhighmap(unsigned long vaddr,
unsigned long vaddr_end)
@@ -1503,19 +1489,6 @@ static pte_t __init mask_rw_pte(pte_t *ptep, pte_t pte)
#else /* CONFIG_X86_64 */
static pte_t __init mask_rw_pte(pte_t *ptep, pte_t pte)
{
- unsigned long pfn = pte_pfn(pte);
-
- /*
- * If the new pfn is within the range of the newly allocated
- * kernel pagetable, and it isn't being mapped into an
- * early_ioremap fixmap slot as a freshly allocated page, make sure
- * it is RO.
- */
- if (((!is_early_ioremap_ptep(ptep) &&
- pfn >= pgt_buf_start && pfn < pgt_buf_top)) ||
- (is_early_ioremap_ptep(ptep) && pfn != (pgt_buf_end - 1)))
- pte = pte_wrprotect(pte);
-
return pte;
}
#endif /* CONFIG_X86_64 */
@@ -2197,7 +2170,6 @@ static const struct pv_mmu_ops xen_mmu_ops __initconst = {

void __init xen_init_mmu_ops(void)
{
- x86_init.mapping.pagetable_reserve = xen_mapping_pagetable_reserve;
x86_init.paging.pagetable_init = xen_pagetable_init;
pv_mmu_ops = xen_mmu_ops;

--
1.7.7

2012-11-12 21:32:56

by Yinghai Lu

[permalink] [raw]
Subject: [PATCH 35/46] x86, mm: use round_up/down in split_mem_range()

Replace our own inline versions of rounding up and down with
round_up()/round_down().
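
A quick sanity check (userspace, made-up constants) that the open-coded
pfn rounding being removed is exactly round_up() followed by a page shift;
the round_down cases work out the same way.

#include <assert.h>

#define PAGE_SHIFT	12
#define PMD_SHIFT	21
#define PMD_SIZE	(1UL << PMD_SHIFT)
#define round_down(x, a) ((x) & ~((a) - 1))
#define round_up(x, a)   round_down((x) + (a) - 1, (a))

int main(void)
{
	for (unsigned long pos = 0; pos < (1UL << 25); pos += 4096)
		assert((((pos + (PMD_SIZE - 1)) >> PMD_SHIFT)
				<< (PMD_SHIFT - PAGE_SHIFT)) ==
		       (round_up(pos, PMD_SIZE) >> PAGE_SHIFT));
	return 0;
}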

Signed-off-by: Yinghai Lu <[email protected]>
---
arch/x86/mm/init.c | 30 ++++++++++++------------------
1 files changed, 12 insertions(+), 18 deletions(-)

diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
index 61734b4..ae3d642 100644
--- a/arch/x86/mm/init.c
+++ b/arch/x86/mm/init.c
@@ -214,13 +214,11 @@ static int __meminit split_mem_range(struct map_range *mr, int nr_range,
* slowdowns.
*/
if (pos == 0)
- end_pfn = 1<<(PMD_SHIFT - PAGE_SHIFT);
+ end_pfn = PMD_SIZE >> PAGE_SHIFT;
else
- end_pfn = ((pos + (PMD_SIZE - 1))>>PMD_SHIFT)
- << (PMD_SHIFT - PAGE_SHIFT);
+ end_pfn = round_up(pos, PMD_SIZE) >> PAGE_SHIFT;
#else /* CONFIG_X86_64 */
- end_pfn = ((pos + (PMD_SIZE - 1)) >> PMD_SHIFT)
- << (PMD_SHIFT - PAGE_SHIFT);
+ end_pfn = round_up(pos, PMD_SIZE) >> PAGE_SHIFT;
#endif
if (end_pfn > (end >> PAGE_SHIFT))
end_pfn = end >> PAGE_SHIFT;
@@ -230,15 +228,13 @@ static int __meminit split_mem_range(struct map_range *mr, int nr_range,
}

/* big page (2M) range */
- start_pfn = ((pos + (PMD_SIZE - 1))>>PMD_SHIFT)
- << (PMD_SHIFT - PAGE_SHIFT);
+ start_pfn = round_up(pos, PMD_SIZE) >> PAGE_SHIFT;
#ifdef CONFIG_X86_32
- end_pfn = (end>>PMD_SHIFT) << (PMD_SHIFT - PAGE_SHIFT);
+ end_pfn = round_down(end, PMD_SIZE) >> PAGE_SHIFT;
#else /* CONFIG_X86_64 */
- end_pfn = ((pos + (PUD_SIZE - 1))>>PUD_SHIFT)
- << (PUD_SHIFT - PAGE_SHIFT);
- if (end_pfn > ((end>>PMD_SHIFT)<<(PMD_SHIFT - PAGE_SHIFT)))
- end_pfn = ((end>>PMD_SHIFT)<<(PMD_SHIFT - PAGE_SHIFT));
+ end_pfn = round_up(pos, PUD_SIZE) >> PAGE_SHIFT;
+ if (end_pfn > (round_down(end, PMD_SIZE) >> PAGE_SHIFT))
+ end_pfn = round_down(end, PMD_SIZE) >> PAGE_SHIFT;
#endif

if (start_pfn < end_pfn) {
@@ -249,9 +245,8 @@ static int __meminit split_mem_range(struct map_range *mr, int nr_range,

#ifdef CONFIG_X86_64
/* big page (1G) range */
- start_pfn = ((pos + (PUD_SIZE - 1))>>PUD_SHIFT)
- << (PUD_SHIFT - PAGE_SHIFT);
- end_pfn = (end >> PUD_SHIFT) << (PUD_SHIFT - PAGE_SHIFT);
+ start_pfn = round_up(pos, PUD_SIZE) >> PAGE_SHIFT;
+ end_pfn = round_down(end, PUD_SIZE) >> PAGE_SHIFT;
if (start_pfn < end_pfn) {
nr_range = save_mr(mr, nr_range, start_pfn, end_pfn,
page_size_mask &
@@ -260,9 +255,8 @@ static int __meminit split_mem_range(struct map_range *mr, int nr_range,
}

/* tail is not big page (1G) alignment */
- start_pfn = ((pos + (PMD_SIZE - 1))>>PMD_SHIFT)
- << (PMD_SHIFT - PAGE_SHIFT);
- end_pfn = (end >> PMD_SHIFT) << (PMD_SHIFT - PAGE_SHIFT);
+ start_pfn = round_up(pos, PMD_SIZE) >> PAGE_SHIFT;
+ end_pfn = round_down(end, PMD_SIZE) >> PAGE_SHIFT;
if (start_pfn < end_pfn) {
nr_range = save_mr(mr, nr_range, start_pfn, end_pfn,
page_size_mask & (1<<PG_LEVEL_2M));
--
1.7.7

2012-11-12 21:32:55

by Yinghai Lu

[permalink] [raw]
Subject: [PATCH 34/46] x86, mm: Add check before clear pte above max_low_pfn on 32bit

While testing the patch that adjusts page_size_mask to map small RAM ranges
with a big page size, we found the page table was set up wrongly for 32-bit:
native_pagetable_init() wrongly cleared ptes for a pmd with large page
support.

1. Add more comments about why we expect a pte here.

2. Add BUG checking, so that next time we can find the problem earlier when
we mess up the page table setup again.

3. max_low_pfn is not an inclusive boundary for the low memory mapping.
We should start checking from max_low_pfn instead of max_low_pfn + 1.

4. Print a message when a pte really does get cleared; or should we use
WARN() to find out why something above max_low_pfn got mapped, so we can
fix it?

Signed-off-by: Yinghai Lu <[email protected]>
---
arch/x86/mm/init_32.c | 18 ++++++++++++++++--
1 files changed, 16 insertions(+), 2 deletions(-)

diff --git a/arch/x86/mm/init_32.c b/arch/x86/mm/init_32.c
index 322ee56..19ef9f0 100644
--- a/arch/x86/mm/init_32.c
+++ b/arch/x86/mm/init_32.c
@@ -480,9 +480,14 @@ void __init native_pagetable_init(void)

/*
* Remove any mappings which extend past the end of physical
- * memory from the boot time page table:
+ * memory from the boot time page table.
+ * In virtual address space, we should have at least two pages
+ * from VMALLOC_END to pkmap or fixmap according to VMALLOC_END
+ * definition. And max_low_pfn is set to VMALLOC_END physical
+ * address. If initial memory mapping is doing right job, we
+ * should have pte used near max_low_pfn or one pmd is not present.
*/
- for (pfn = max_low_pfn + 1; pfn < 1<<(32-PAGE_SHIFT); pfn++) {
+ for (pfn = max_low_pfn; pfn < 1<<(32-PAGE_SHIFT); pfn++) {
va = PAGE_OFFSET + (pfn<<PAGE_SHIFT);
pgd = base + pgd_index(va);
if (!pgd_present(*pgd))
@@ -493,10 +498,19 @@ void __init native_pagetable_init(void)
if (!pmd_present(*pmd))
break;

+ /* should not be large page here */
+ if (pmd_large(*pmd)) {
+ pr_warn("try to clear pte for ram above max_low_pfn: pfn: %lx pmd: %p pmd phys: %lx, but pmd is big page and is not using pte !\n",
+ pfn, pmd, __pa(pmd));
+ BUG_ON(1);
+ }
+
pte = pte_offset_kernel(pmd, va);
if (!pte_present(*pte))
break;

+ printk(KERN_DEBUG "clearing pte for ram above max_low_pfn: pfn: %lx pmd: %p pmd phys: %lx pte: %p pte phys: %lx\n",
+ pfn, pmd, __pa(pmd), pte, __pa(pte));
pte_clear(NULL, va, pte);
}
paravirt_alloc_pmd(&init_mm, __pa(base) >> PAGE_SHIFT);
--
1.7.7

2012-11-12 21:34:12

by Yinghai Lu

[permalink] [raw]
Subject: [PATCH 19/46] x86, mm: Don't clear page table if range is ram

After the following patch adds code that uses a buffer in BRK to pre-map the
page table buffer:
x86, mm: setup page table in top-down
it should be safe to remove early_memremap for page table accessing.
Instead we get a panic with that.

It turns out that we wrongly clear the initial page table for the next range
when it is separated by holes, and it only happens when we are trying to map
the RAM ranges one by one.

We need to check whether the range is RAM before clearing the page table.

Change the loop structure to drop the extra little loop and use one loop
only; in that loop, calculate next first, then check whether [addr, next) is
covered by E820_RAM.

-v2: E820_RESERVED_KERN is treated as E820_RAM. EFI changes some E820_RAM
ranges to that type, so the next kernel booted via kexec will know those
ranges are already in use.
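
A stand-alone sketch of that check (userspace; the e820 table and the
types here are invented, and any_mapped() plays the role of the kernel's
e820_any_mapped()): the entry is cleared only when [addr, next) overlaps
no RAM-like range.

#include <stdbool.h>
#include <stdio.h>

enum { E820_RAM = 1, E820_RESERVED_KERN = 128 };

struct e820_entry { unsigned long long start, end; int type; };

static const struct e820_entry e820[] = {
	{ 0x0000000, 0x009f000, E820_RAM },
	{ 0x0100000, 0x6ff0000, E820_RAM },
	{ 0x6ff0000, 0x7000000, E820_RESERVED_KERN },
};

static bool any_mapped(unsigned long long start, unsigned long long end,
		       int type)
{
	for (unsigned i = 0; i < sizeof(e820)/sizeof(e820[0]); i++)
		if (e820[i].type == type &&
		    e820[i].start < end && start < e820[i].end)
			return true;
	return false;
}

int main(void)
{
	unsigned long long addr = 0x7200000, next = 0x7400000;	/* a hole */

	if (!any_mapped(addr, next, E820_RAM) &&
	    !any_mapped(addr, next, E820_RESERVED_KERN))
		printf("[%#llx, %#llx) is a hole: safe to clear the entry\n",
		       addr, next);
	return 0;
}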

Signed-off-by: Yinghai Lu <[email protected]>
---
arch/x86/mm/init_64.c | 40 +++++++++++++++++++---------------------
1 files changed, 19 insertions(+), 21 deletions(-)

diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index 869372a..fa28e3e 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -363,20 +363,20 @@ static unsigned long __meminit
phys_pte_init(pte_t *pte_page, unsigned long addr, unsigned long end,
pgprot_t prot)
{
- unsigned pages = 0;
+ unsigned long pages = 0, next;
unsigned long last_map_addr = end;
int i;

pte_t *pte = pte_page + pte_index(addr);

- for(i = pte_index(addr); i < PTRS_PER_PTE; i++, addr += PAGE_SIZE, pte++) {
-
+ for (i = pte_index(addr); i < PTRS_PER_PTE; i++, addr = next, pte++) {
+ next = (addr & PAGE_MASK) + PAGE_SIZE;
if (addr >= end) {
- if (!after_bootmem) {
- for(; i < PTRS_PER_PTE; i++, pte++)
- set_pte(pte, __pte(0));
- }
- break;
+ if (!after_bootmem &&
+ !e820_any_mapped(addr & PAGE_MASK, next, E820_RAM) &&
+ !e820_any_mapped(addr & PAGE_MASK, next, E820_RESERVED_KERN))
+ set_pte(pte, __pte(0));
+ continue;
}

/*
@@ -419,16 +419,15 @@ phys_pmd_init(pmd_t *pmd_page, unsigned long address, unsigned long end,
pte_t *pte;
pgprot_t new_prot = prot;

+ next = (address & PMD_MASK) + PMD_SIZE;
if (address >= end) {
- if (!after_bootmem) {
- for (; i < PTRS_PER_PMD; i++, pmd++)
- set_pmd(pmd, __pmd(0));
- }
- break;
+ if (!after_bootmem &&
+ !e820_any_mapped(address & PMD_MASK, next, E820_RAM) &&
+ !e820_any_mapped(address & PMD_MASK, next, E820_RESERVED_KERN))
+ set_pmd(pmd, __pmd(0));
+ continue;
}

- next = (address & PMD_MASK) + PMD_SIZE;
-
if (pmd_val(*pmd)) {
if (!pmd_large(*pmd)) {
spin_lock(&init_mm.page_table_lock);
@@ -497,13 +496,12 @@ phys_pud_init(pud_t *pud_page, unsigned long addr, unsigned long end,
pmd_t *pmd;
pgprot_t prot = PAGE_KERNEL;

- if (addr >= end)
- break;
-
next = (addr & PUD_MASK) + PUD_SIZE;
-
- if (!after_bootmem && !e820_any_mapped(addr, next, 0)) {
- set_pud(pud, __pud(0));
+ if (addr >= end) {
+ if (!after_bootmem &&
+ !e820_any_mapped(addr & PUD_MASK, next, E820_RAM) &&
+ !e820_any_mapped(addr & PUD_MASK, next, E820_RESERVED_KERN))
+ set_pud(pud, __pud(0));
continue;
}

--
1.7.7

2012-11-12 21:34:10

by Yinghai Lu

[permalink] [raw]
Subject: [PATCH 30/46] x86, mm: Move back pgt_buf_* to mm/init.c

Also change them to static.

Signed-off-by: Yinghai Lu <[email protected]>
---
arch/x86/include/asm/init.h | 4 ----
arch/x86/mm/init.c | 6 +++---
2 files changed, 3 insertions(+), 7 deletions(-)

diff --git a/arch/x86/include/asm/init.h b/arch/x86/include/asm/init.h
index 4f13998..626ea8d 100644
--- a/arch/x86/include/asm/init.h
+++ b/arch/x86/include/asm/init.h
@@ -12,8 +12,4 @@ kernel_physical_mapping_init(unsigned long start,
unsigned long end,
unsigned long page_size_mask);

-extern unsigned long __initdata pgt_buf_start;
-extern unsigned long __meminitdata pgt_buf_end;
-extern unsigned long __meminitdata pgt_buf_top;
-
#endif /* _ASM_X86_INIT_32_H */
diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
index d2df52c..5caddf9 100644
--- a/arch/x86/mm/init.c
+++ b/arch/x86/mm/init.c
@@ -19,9 +19,9 @@

#include "mm_internal.h"

-unsigned long __initdata pgt_buf_start;
-unsigned long __meminitdata pgt_buf_end;
-unsigned long __meminitdata pgt_buf_top;
+static unsigned long __initdata pgt_buf_start;
+static unsigned long __initdata pgt_buf_end;
+static unsigned long __initdata pgt_buf_top;

static unsigned long min_pfn_mapped;

--
1.7.7

2012-11-12 21:34:08

by Yinghai Lu

[permalink] [raw]
Subject: [PATCH 36/46] x86, mm: use PFN_DOWN in split_mem_range()

Use it (and PFN_PHYS()) to replace the open-coded shifts by PAGE_SHIFT.
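
For reference, the helpers from <linux/pfn.h> that replace the shifts are
roughly (shown here for context, not part of the patch):

#define PFN_DOWN(x)	((x) >> PAGE_SHIFT)		/* address -> pfn, rounded down */
#define PFN_PHYS(x)	((phys_addr_t)(x) << PAGE_SHIFT)	/* pfn -> physical address */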

Signed-off-by: Yinghai Lu <[email protected]>
---
arch/x86/mm/init.c | 44 ++++++++++++++++++++++----------------------
1 files changed, 22 insertions(+), 22 deletions(-)

diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
index ae3d642..a4fdf31 100644
--- a/arch/x86/mm/init.c
+++ b/arch/x86/mm/init.c
@@ -204,8 +204,8 @@ static int __meminit split_mem_range(struct map_range *mr, int nr_range,
int i;

/* head if not big page alignment ? */
- start_pfn = start >> PAGE_SHIFT;
- pos = start_pfn << PAGE_SHIFT;
+ start_pfn = PFN_DOWN(start);
+ pos = PFN_PHYS(start_pfn);
#ifdef CONFIG_X86_32
/*
* Don't use a large page for the first 2/4MB of memory
@@ -214,59 +214,59 @@ static int __meminit split_mem_range(struct map_range *mr, int nr_range,
* slowdowns.
*/
if (pos == 0)
- end_pfn = PMD_SIZE >> PAGE_SHIFT;
+ end_pfn = PFN_DOWN(PMD_SIZE);
else
- end_pfn = round_up(pos, PMD_SIZE) >> PAGE_SHIFT;
+ end_pfn = PFN_DOWN(round_up(pos, PMD_SIZE));
#else /* CONFIG_X86_64 */
- end_pfn = round_up(pos, PMD_SIZE) >> PAGE_SHIFT;
+ end_pfn = PFN_DOWN(round_up(pos, PMD_SIZE));
#endif
- if (end_pfn > (end >> PAGE_SHIFT))
- end_pfn = end >> PAGE_SHIFT;
+ if (end_pfn > PFN_DOWN(end))
+ end_pfn = PFN_DOWN(end);
if (start_pfn < end_pfn) {
nr_range = save_mr(mr, nr_range, start_pfn, end_pfn, 0);
- pos = end_pfn << PAGE_SHIFT;
+ pos = PFN_PHYS(end_pfn);
}

/* big page (2M) range */
- start_pfn = round_up(pos, PMD_SIZE) >> PAGE_SHIFT;
+ start_pfn = PFN_DOWN(round_up(pos, PMD_SIZE));
#ifdef CONFIG_X86_32
- end_pfn = round_down(end, PMD_SIZE) >> PAGE_SHIFT;
+ end_pfn = PFN_DOWN(round_down(end, PMD_SIZE));
#else /* CONFIG_X86_64 */
- end_pfn = round_up(pos, PUD_SIZE) >> PAGE_SHIFT;
- if (end_pfn > (round_down(end, PMD_SIZE) >> PAGE_SHIFT))
- end_pfn = round_down(end, PMD_SIZE) >> PAGE_SHIFT;
+ end_pfn = PFN_DOWN(round_up(pos, PUD_SIZE));
+ if (end_pfn > PFN_DOWN(round_down(end, PMD_SIZE)))
+ end_pfn = PFN_DOWN(round_down(end, PMD_SIZE));
#endif

if (start_pfn < end_pfn) {
nr_range = save_mr(mr, nr_range, start_pfn, end_pfn,
page_size_mask & (1<<PG_LEVEL_2M));
- pos = end_pfn << PAGE_SHIFT;
+ pos = PFN_PHYS(end_pfn);
}

#ifdef CONFIG_X86_64
/* big page (1G) range */
- start_pfn = round_up(pos, PUD_SIZE) >> PAGE_SHIFT;
- end_pfn = round_down(end, PUD_SIZE) >> PAGE_SHIFT;
+ start_pfn = PFN_DOWN(round_up(pos, PUD_SIZE));
+ end_pfn = PFN_DOWN(round_down(end, PUD_SIZE));
if (start_pfn < end_pfn) {
nr_range = save_mr(mr, nr_range, start_pfn, end_pfn,
page_size_mask &
((1<<PG_LEVEL_2M)|(1<<PG_LEVEL_1G)));
- pos = end_pfn << PAGE_SHIFT;
+ pos = PFN_PHYS(end_pfn);
}

/* tail is not big page (1G) alignment */
- start_pfn = round_up(pos, PMD_SIZE) >> PAGE_SHIFT;
- end_pfn = round_down(end, PMD_SIZE) >> PAGE_SHIFT;
+ start_pfn = PFN_DOWN(round_up(pos, PMD_SIZE));
+ end_pfn = PFN_DOWN(round_down(end, PMD_SIZE));
if (start_pfn < end_pfn) {
nr_range = save_mr(mr, nr_range, start_pfn, end_pfn,
page_size_mask & (1<<PG_LEVEL_2M));
- pos = end_pfn << PAGE_SHIFT;
+ pos = PFN_PHYS(end_pfn);
}
#endif

/* tail is not big page (2M) alignment */
- start_pfn = pos>>PAGE_SHIFT;
- end_pfn = end>>PAGE_SHIFT;
+ start_pfn = PFN_DOWN(pos);
+ end_pfn = PFN_DOWN(end);
nr_range = save_mr(mr, nr_range, start_pfn, end_pfn, 0);

/* try to merge same page size and continuous */
--
1.7.7

2012-11-12 21:34:06

by Yinghai Lu

[permalink] [raw]
Subject: [PATCH 06/46] x86, mm: Change find_early_table_space() parameters

Call split_mem_range() inside the function, so callers only pass start
and end instead of a precomputed map_range array.

Signed-off-by: Yinghai Lu <[email protected]>
---
arch/x86/mm/init.c | 12 +++++++++---
1 files changed, 9 insertions(+), 3 deletions(-)

diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
index 8f57b12..00089bf 100644
--- a/arch/x86/mm/init.c
+++ b/arch/x86/mm/init.c
@@ -196,12 +196,18 @@ static int __meminit split_mem_range(struct map_range *mr, int nr_range,
* mr[0].start to mr[nr_range - 1].end, while accounting for possible 2M and 1GB
* pages. Then find enough contiguous space for those page tables.
*/
-static void __init find_early_table_space(struct map_range *mr, int nr_range)
+static void __init find_early_table_space(unsigned long start, unsigned long end)
{
int i;
unsigned long puds = 0, pmds = 0, ptes = 0, tables;
- unsigned long start = 0, good_end;
+ unsigned long good_end;
phys_addr_t base;
+ struct map_range mr[NR_RANGE_MR];
+ int nr_range;
+
+ memset(mr, 0, sizeof(mr));
+ nr_range = 0;
+ nr_range = split_mem_range(mr, nr_range, start, end);

for (i = 0; i < nr_range; i++) {
unsigned long range, extra;
@@ -277,7 +283,7 @@ unsigned long __init_refok init_memory_mapping(unsigned long start,
* nodes are discovered.
*/
if (!after_bootmem)
- find_early_table_space(mr, nr_range);
+ find_early_table_space(start, end);

for (i = 0; i < nr_range; i++)
ret = kernel_physical_mapping_init(mr[i].start, mr[i].end,
--
1.7.7

2012-11-12 21:34:04

by Yinghai Lu

[permalink] [raw]
Subject: [PATCH 31/46] x86, mm: Move init_gbpages() out of setup.c

Put it in mm/init.c and call it from probe_page_size_mask().
init_mem_mapping() calls probe_page_size_mask() first, so the calling
sequence is not changed.

Signed-off-by: Yinghai Lu <[email protected]>
---
arch/x86/kernel/setup.c | 15 +--------------
arch/x86/mm/init.c | 12 ++++++++++++
2 files changed, 13 insertions(+), 14 deletions(-)

diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index 2015194..85b62f1 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -282,18 +282,7 @@ void * __init extend_brk(size_t size, size_t align)
return ret;
}

-#ifdef CONFIG_X86_64
-static void __init init_gbpages(void)
-{
- if (direct_gbpages && cpu_has_gbpages)
- printk(KERN_INFO "Using GB pages for direct mapping\n");
- else
- direct_gbpages = 0;
-}
-#else
-static inline void init_gbpages(void)
-{
-}
+#ifdef CONFIG_X86_32
static void __init cleanup_highmap(void)
{
}
@@ -933,8 +922,6 @@ void __init setup_arch(char **cmdline_p)

setup_real_mode();

- init_gbpages();
-
init_mem_mapping();

memblock.current_limit = get_max_mapped();
diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
index 5caddf9..61734b4 100644
--- a/arch/x86/mm/init.c
+++ b/arch/x86/mm/init.c
@@ -94,6 +94,16 @@ int direct_gbpages
#endif
;

+static void __init init_gbpages(void)
+{
+#ifdef CONFIG_X86_64
+ if (direct_gbpages && cpu_has_gbpages)
+ printk(KERN_INFO "Using GB pages for direct mapping\n");
+ else
+ direct_gbpages = 0;
+#endif
+}
+
struct map_range {
unsigned long start;
unsigned long end;
@@ -104,6 +114,8 @@ static int page_size_mask;

static void __init probe_page_size_mask(void)
{
+ init_gbpages();
+
#if !defined(CONFIG_DEBUG_PAGEALLOC) && !defined(CONFIG_KMEMCHECK)
/*
* For CONFIG_DEBUG_PAGEALLOC, identity mapping will use small pages.
--
1.7.7

2012-11-12 21:35:56

by Yinghai Lu

[permalink] [raw]
Subject: [PATCH 15/46] x86, mm: Only direct map addresses that are marked as E820_RAM

From: Jacob Shin <[email protected]>

Currently direct mappings are created for [ 0 to max_low_pfn<<PAGE_SHIFT )
and [ 4GB to max_pfn<<PAGE_SHIFT ), which may include regions that are not
backed by actual DRAM. This is fine for holes under 4GB which are covered
by fixed and variable range MTRRs to be UC. However, we run into trouble
on higher memory addresses which cannot be covered by MTRRs.

Our system with 1TB of RAM has an e820 that looks like this:

BIOS-e820: [mem 0x0000000000000000-0x00000000000983ff] usable
BIOS-e820: [mem 0x0000000000098400-0x000000000009ffff] reserved
BIOS-e820: [mem 0x00000000000d0000-0x00000000000fffff] reserved
BIOS-e820: [mem 0x0000000000100000-0x00000000c7ebffff] usable
BIOS-e820: [mem 0x00000000c7ec0000-0x00000000c7ed7fff] ACPI data
BIOS-e820: [mem 0x00000000c7ed8000-0x00000000c7ed9fff] ACPI NVS
BIOS-e820: [mem 0x00000000c7eda000-0x00000000c7ffffff] reserved
BIOS-e820: [mem 0x00000000fec00000-0x00000000fec0ffff] reserved
BIOS-e820: [mem 0x00000000fee00000-0x00000000fee00fff] reserved
BIOS-e820: [mem 0x00000000fff00000-0x00000000ffffffff] reserved
BIOS-e820: [mem 0x0000000100000000-0x000000e037ffffff] usable
BIOS-e820: [mem 0x000000e038000000-0x000000fcffffffff] reserved
BIOS-e820: [mem 0x0000010000000000-0x0000011ffeffffff] usable

and so direct mappings are created for the huge memory hole between
0x000000e038000000 and 0x0000010000000000. Even though the kernel never
generates memory accesses in that region, since the page tables mark
it incorrectly as being WB, our (AMD) processor ends up causing an MCE
while doing some memory bookkeeping/optimizations around that area.

This patch iterates through e820 and only direct maps ranges that are
marked as E820_RAM, and keeps track of those pfn ranges. Depending on
the alignment of E820 ranges, this may possibly result in using smaller
size (i.e. 4K instead of 2M or 1G) page tables.

-v2: move changes from setup.c to mm/init.c, also use for_each_mem_pfn_range
instead. - Yinghai Lu
-v3: add calculate_all_table_space_size() to get correct needed page table
size. - Yinghai Lu
-v4: fix add_pfn_range_mapped() to get correct max_low_pfn_mapped when the
memory map has a hole under 4GB, as found by Konrad on a Xen domU with
8GB of RAM. - Yinghai
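
With the pfn_mapped[] tracking in place, callers can ask whether a pfn
range is actually covered by the direct mapping before touching it via
__va(). A usage sketch (not part of the patch; clear_if_mapped() is a
hypothetical caller):

static void __init clear_if_mapped(unsigned long start_pfn,
				   unsigned long end_pfn)
{
	/* skip holes and unmapped reserved areas */
	if (!pfn_range_is_mapped(start_pfn, end_pfn))
		return;

	/* safe: the whole range is covered by the direct mapping */
	memset(__va(PFN_PHYS(start_pfn)), 0,
	       (end_pfn - start_pfn) << PAGE_SHIFT);
}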

Signed-off-by: Jacob Shin <[email protected]>
Signed-off-by: Yinghai Lu <[email protected]>
Reviewed-by: Pekka Enberg <[email protected]>
---
arch/x86/include/asm/page_types.h | 8 +--
arch/x86/kernel/setup.c | 8 ++-
arch/x86/mm/init.c | 120 +++++++++++++++++++++++++++++++++----
arch/x86/mm/init_64.c | 6 +-
4 files changed, 117 insertions(+), 25 deletions(-)

diff --git a/arch/x86/include/asm/page_types.h b/arch/x86/include/asm/page_types.h
index 45aae6e..54c9787 100644
--- a/arch/x86/include/asm/page_types.h
+++ b/arch/x86/include/asm/page_types.h
@@ -51,13 +51,7 @@ static inline phys_addr_t get_max_mapped(void)
return (phys_addr_t)max_pfn_mapped << PAGE_SHIFT;
}

-static inline bool pfn_range_is_mapped(unsigned long start_pfn,
- unsigned long end_pfn)
-{
- return end_pfn <= max_low_pfn_mapped ||
- (end_pfn > (1UL << (32 - PAGE_SHIFT)) &&
- end_pfn <= max_pfn_mapped);
-}
+bool pfn_range_is_mapped(unsigned long start_pfn, unsigned long end_pfn);

extern unsigned long init_memory_mapping(unsigned long start,
unsigned long end);
diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index bd52f9d..68dffec 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -116,9 +116,11 @@
#include <asm/prom.h>

/*
- * end_pfn only includes RAM, while max_pfn_mapped includes all e820 entries.
- * The direct mapping extends to max_pfn_mapped, so that we can directly access
- * apertures, ACPI and other tables without having to play with fixmaps.
+ * max_low_pfn_mapped: highest direct mapped pfn under 4GB
+ * max_pfn_mapped: highest direct mapped pfn over 4GB
+ *
+ * The direct mapping only covers E820_RAM regions, so the ranges and gaps are
+ * represented by pfn_mapped
*/
unsigned long max_low_pfn_mapped;
unsigned long max_pfn_mapped;
diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
index 2b8091c..99d584c 100644
--- a/arch/x86/mm/init.c
+++ b/arch/x86/mm/init.c
@@ -243,6 +243,38 @@ static unsigned long __init calculate_table_space_size(unsigned long start, unsi
return tables;
}

+static unsigned long __init calculate_all_table_space_size(void)
+{
+ unsigned long start_pfn, end_pfn;
+ unsigned long tables;
+ int i;
+
+ /* the ISA range is always mapped regardless of memory holes */
+ tables = calculate_table_space_size(0, ISA_END_ADDRESS);
+
+ for_each_mem_pfn_range(i, MAX_NUMNODES, &start_pfn, &end_pfn, NULL) {
+ u64 start = start_pfn << PAGE_SHIFT;
+ u64 end = end_pfn << PAGE_SHIFT;
+
+ if (end <= ISA_END_ADDRESS)
+ continue;
+
+ if (start < ISA_END_ADDRESS)
+ start = ISA_END_ADDRESS;
+#ifdef CONFIG_X86_32
+ /* on 32 bit, we only map up to max_low_pfn */
+ if ((start >> PAGE_SHIFT) >= max_low_pfn)
+ continue;
+
+ if ((end >> PAGE_SHIFT) > max_low_pfn)
+ end = max_low_pfn << PAGE_SHIFT;
+#endif
+ tables += calculate_table_space_size(start, end);
+ }
+
+ return tables;
+}
+
static void __init find_early_table_space(unsigned long start,
unsigned long good_end,
unsigned long tables)
@@ -258,6 +290,34 @@ static void __init find_early_table_space(unsigned long start,
pgt_buf_top = pgt_buf_start + (tables >> PAGE_SHIFT);
}

+static struct range pfn_mapped[E820_X_MAX];
+static int nr_pfn_mapped;
+
+static void add_pfn_range_mapped(unsigned long start_pfn, unsigned long end_pfn)
+{
+ nr_pfn_mapped = add_range_with_merge(pfn_mapped, E820_X_MAX,
+ nr_pfn_mapped, start_pfn, end_pfn);
+ nr_pfn_mapped = clean_sort_range(pfn_mapped, E820_X_MAX);
+
+ max_pfn_mapped = max(max_pfn_mapped, end_pfn);
+
+ if (start_pfn < (1UL<<(32-PAGE_SHIFT)))
+ max_low_pfn_mapped = max(max_low_pfn_mapped,
+ min(end_pfn, 1UL<<(32-PAGE_SHIFT)));
+}
+
+bool pfn_range_is_mapped(unsigned long start_pfn, unsigned long end_pfn)
+{
+ int i;
+
+ for (i = 0; i < nr_pfn_mapped; i++)
+ if ((start_pfn >= pfn_mapped[i].start) &&
+ (end_pfn <= pfn_mapped[i].end))
+ return true;
+
+ return false;
+}
+
/*
* Setup the direct mapping of the physical memory at PAGE_OFFSET.
* This runs before bootmem is initialized and gets pages directly from
@@ -289,9 +349,55 @@ unsigned long __init_refok init_memory_mapping(unsigned long start,

__flush_tlb_all();

+ add_pfn_range_mapped(start >> PAGE_SHIFT, ret >> PAGE_SHIFT);
+
return ret >> PAGE_SHIFT;
}

+/*
+ * Iterate through E820 memory map and create direct mappings for only E820_RAM
+ * regions. We cannot simply create direct mappings for all pfns from
+ * [0 to max_low_pfn) and [4GB to max_pfn) because of possible memory holes in
+ * high addresses that cannot be marked as UC by fixed/variable range MTRRs.
+ * Depending on the alignment of E820 ranges, this may possibly result in using
+ * smaller size (i.e. 4K instead of 2M or 1G) page tables.
+ */
+static void __init init_all_memory_mapping(void)
+{
+ unsigned long start_pfn, end_pfn;
+ int i;
+
+ /* the ISA range is always mapped regardless of memory holes */
+ init_memory_mapping(0, ISA_END_ADDRESS);
+
+ for_each_mem_pfn_range(i, MAX_NUMNODES, &start_pfn, &end_pfn, NULL) {
+ u64 start = (u64)start_pfn << PAGE_SHIFT;
+ u64 end = (u64)end_pfn << PAGE_SHIFT;
+
+ if (end <= ISA_END_ADDRESS)
+ continue;
+
+ if (start < ISA_END_ADDRESS)
+ start = ISA_END_ADDRESS;
+#ifdef CONFIG_X86_32
+ /* on 32 bit, we only map up to max_low_pfn */
+ if ((start >> PAGE_SHIFT) >= max_low_pfn)
+ continue;
+
+ if ((end >> PAGE_SHIFT) > max_low_pfn)
+ end = max_low_pfn << PAGE_SHIFT;
+#endif
+ init_memory_mapping(start, end);
+ }
+
+#ifdef CONFIG_X86_64
+ if (max_pfn > max_low_pfn) {
+ /* can we preseve max_low_pfn ?*/
+ max_low_pfn = max_pfn;
+ }
+#endif
+}
+
void __init init_mem_mapping(void)
{
unsigned long tables, good_end, end;
@@ -312,23 +418,15 @@ void __init init_mem_mapping(void)
end = max_low_pfn << PAGE_SHIFT;
good_end = max_pfn_mapped << PAGE_SHIFT;
#endif
- tables = calculate_table_space_size(0, end);
+ tables = calculate_all_table_space_size();
find_early_table_space(0, good_end, tables);
printk(KERN_DEBUG "kernel direct mapping tables up to %#lx @ [mem %#010lx-%#010lx] prealloc\n",
end - 1, pgt_buf_start << PAGE_SHIFT,
(pgt_buf_top << PAGE_SHIFT) - 1);

- max_low_pfn_mapped = init_memory_mapping(0, max_low_pfn<<PAGE_SHIFT);
- max_pfn_mapped = max_low_pfn_mapped;
+ max_pfn_mapped = 0; /* will get exact value next */
+ init_all_memory_mapping();

-#ifdef CONFIG_X86_64
- if (max_pfn > max_low_pfn) {
- max_pfn_mapped = init_memory_mapping(1UL<<32,
- max_pfn<<PAGE_SHIFT);
- /* can we preseve max_low_pfn ?*/
- max_low_pfn = max_pfn;
- }
-#endif
/*
* Reserve the kernel pagetable pages we used (pgt_buf_start -
* pgt_buf_end) and free the other ones (pgt_buf_end - pgt_buf_top)
diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index 3baff25..32c7e38 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -662,13 +662,11 @@ int arch_add_memory(int nid, u64 start, u64 size)
{
struct pglist_data *pgdat = NODE_DATA(nid);
struct zone *zone = pgdat->node_zones + ZONE_NORMAL;
- unsigned long last_mapped_pfn, start_pfn = start >> PAGE_SHIFT;
+ unsigned long start_pfn = start >> PAGE_SHIFT;
unsigned long nr_pages = size >> PAGE_SHIFT;
int ret;

- last_mapped_pfn = init_memory_mapping(start, start + size);
- if (last_mapped_pfn > max_pfn_mapped)
- max_pfn_mapped = last_mapped_pfn;
+ init_memory_mapping(start, start + size);

ret = __add_pages(nid, zone, start_pfn, nr_pages);
WARN_ON_ONCE(ret);
--
1.7.7

2012-11-12 21:35:55

by Yinghai Lu

[permalink] [raw]
Subject: [PATCH 25/46] x86, mm: Move min_pfn_mapped back to mm/init.c

Also change it to static.

Signed-off-by: Yinghai Lu <[email protected]>
---
arch/x86/include/asm/page_types.h | 1 -
arch/x86/kernel/setup.c | 1 -
arch/x86/mm/init.c | 2 ++
3 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/page_types.h b/arch/x86/include/asm/page_types.h
index 9f6f3e6..54c9787 100644
--- a/arch/x86/include/asm/page_types.h
+++ b/arch/x86/include/asm/page_types.h
@@ -45,7 +45,6 @@ extern int devmem_is_allowed(unsigned long pagenr);

extern unsigned long max_low_pfn_mapped;
extern unsigned long max_pfn_mapped;
-extern unsigned long min_pfn_mapped;

static inline phys_addr_t get_max_mapped(void)
{
diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index f7634092..2015194 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -124,7 +124,6 @@
*/
unsigned long max_low_pfn_mapped;
unsigned long max_pfn_mapped;
-unsigned long min_pfn_mapped;

#ifdef CONFIG_DMI
RESERVE_BRK(dmi_alloc, 65536);
diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
index ffbb7af..7a6669e 100644
--- a/arch/x86/mm/init.c
+++ b/arch/x86/mm/init.c
@@ -23,6 +23,8 @@ unsigned long __initdata pgt_buf_start;
unsigned long __meminitdata pgt_buf_end;
unsigned long __meminitdata pgt_buf_top;

+static unsigned long min_pfn_mapped;
+
__ref void *alloc_low_page(void)
{
unsigned long pfn;
--
1.7.7

2012-11-12 21:35:53

by Yinghai Lu

[permalink] [raw]
Subject: [PATCH 29/46] x86, mm: only call early_ioremap_page_table_range_init() once

On 32bit, before the patchset that only sets up page tables for ram, we
only called it one time.

Now we are calling it during every init_memory_mapping() when there are
holes under max_low_pfn.

We should only call it one time, after all ranges under max_low_pfn get
mapped, just like we did before.

That also avoids the risk of running out of pgt_buf in BRK.

page_table_range_init() needs to be updated to count the pages for the
kmap page tables first, and to use the newly added alloc_low_pages() to
get pages in sequence, as sketched below. That conforms to the requirement
that the pages be handed out in low-to-high order.
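
The count-then-consume pattern looks roughly like this (a sketch using the
helpers introduced in the diff below, not a verbatim excerpt):

	unsigned long count = page_table_range_init_count(start, end);
	void *adr = NULL;

	if (count)
		adr = alloc_low_pages(count);	/* one low-to-high chunk */

	/* ... then, for each kmap PMD encountered during the walk: */
	/*	newpte = adr; adr += PAGE_SIZE; */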

Signed-off-by: Yinghai Lu <[email protected]>
---
arch/x86/mm/init.c | 13 +++++--------
arch/x86/mm/init_32.c | 47 +++++++++++++++++++++++++++++++++++++++++------
2 files changed, 46 insertions(+), 14 deletions(-)

diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
index a7939ed..d2df52c 100644
--- a/arch/x86/mm/init.c
+++ b/arch/x86/mm/init.c
@@ -340,14 +340,6 @@ unsigned long __init_refok init_memory_mapping(unsigned long start,
ret = kernel_physical_mapping_init(mr[i].start, mr[i].end,
mr[i].page_size_mask);

-#ifdef CONFIG_X86_32
- early_ioremap_page_table_range_init();
-
- load_cr3(swapper_pg_dir);
-#endif
-
- __flush_tlb_all();
-
add_pfn_range_mapped(start >> PAGE_SHIFT, ret >> PAGE_SHIFT);

return ret >> PAGE_SHIFT;
@@ -444,7 +436,12 @@ void __init init_mem_mapping(void)
/* can we preseve max_low_pfn ?*/
max_low_pfn = max_pfn;
}
+#else
+ early_ioremap_page_table_range_init();
+ load_cr3(swapper_pg_dir);
+ __flush_tlb_all();
#endif
+
early_memtest(0, max_pfn_mapped << PAGE_SHIFT);
}

diff --git a/arch/x86/mm/init_32.c b/arch/x86/mm/init_32.c
index a7f2df1..0ae1ba8 100644
--- a/arch/x86/mm/init_32.c
+++ b/arch/x86/mm/init_32.c
@@ -135,8 +135,39 @@ pte_t * __init populate_extra_pte(unsigned long vaddr)
return one_page_table_init(pmd) + pte_idx;
}

+static unsigned long __init
+page_table_range_init_count(unsigned long start, unsigned long end)
+{
+ unsigned long count = 0;
+#ifdef CONFIG_HIGHMEM
+ int pmd_idx_kmap_begin = fix_to_virt(FIX_KMAP_END) >> PMD_SHIFT;
+ int pmd_idx_kmap_end = fix_to_virt(FIX_KMAP_BEGIN) >> PMD_SHIFT;
+ int pgd_idx, pmd_idx;
+ unsigned long vaddr;
+
+ if (pmd_idx_kmap_begin == pmd_idx_kmap_end)
+ return 0;
+
+ vaddr = start;
+ pgd_idx = pgd_index(vaddr);
+
+ for ( ; (pgd_idx < PTRS_PER_PGD) && (vaddr != end); pgd_idx++) {
+ for (; (pmd_idx < PTRS_PER_PMD) && (vaddr != end);
+ pmd_idx++) {
+ if ((vaddr >> PMD_SHIFT) >= pmd_idx_kmap_begin &&
+ (vaddr >> PMD_SHIFT) <= pmd_idx_kmap_end)
+ count++;
+ vaddr += PMD_SIZE;
+ }
+ pmd_idx = 0;
+ }
+#endif
+ return count;
+}
+
static pte_t *__init page_table_kmap_check(pte_t *pte, pmd_t *pmd,
- unsigned long vaddr, pte_t *lastpte)
+ unsigned long vaddr, pte_t *lastpte,
+ void **adr)
{
#ifdef CONFIG_HIGHMEM
/*
@@ -150,16 +181,15 @@ static pte_t *__init page_table_kmap_check(pte_t *pte, pmd_t *pmd,

if (pmd_idx_kmap_begin != pmd_idx_kmap_end
&& (vaddr >> PMD_SHIFT) >= pmd_idx_kmap_begin
- && (vaddr >> PMD_SHIFT) <= pmd_idx_kmap_end
- && ((__pa(pte) >> PAGE_SHIFT) < pgt_buf_start
- || (__pa(pte) >> PAGE_SHIFT) >= pgt_buf_end)) {
+ && (vaddr >> PMD_SHIFT) <= pmd_idx_kmap_end) {
pte_t *newpte;
int i;

BUG_ON(after_bootmem);
- newpte = alloc_low_page();
+ newpte = *adr;
for (i = 0; i < PTRS_PER_PTE; i++)
set_pte(newpte + i, pte[i]);
+ *adr = (void *)(((unsigned long)(*adr)) + PAGE_SIZE);

paravirt_alloc_pte(&init_mm, __pa(newpte) >> PAGE_SHIFT);
set_pmd(pmd, __pmd(__pa(newpte)|_PAGE_TABLE));
@@ -193,6 +223,11 @@ page_table_range_init(unsigned long start, unsigned long end, pgd_t *pgd_base)
pgd_t *pgd;
pmd_t *pmd;
pte_t *pte = NULL;
+ unsigned long count = page_table_range_init_count(start, end);
+ void *adr = NULL;
+
+ if (count)
+ adr = alloc_low_pages(count);

vaddr = start;
pgd_idx = pgd_index(vaddr);
@@ -205,7 +240,7 @@ page_table_range_init(unsigned long start, unsigned long end, pgd_t *pgd_base)
for (; (pmd_idx < PTRS_PER_PMD) && (vaddr != end);
pmd++, pmd_idx++) {
pte = page_table_kmap_check(one_page_table_init(pmd),
- pmd, vaddr, pte);
+ pmd, vaddr, pte, &adr);

vaddr += PMD_SIZE;
}
--
1.7.7

2012-11-12 21:35:51

by Yinghai Lu

[permalink] [raw]
Subject: [PATCH 28/46] x86, mm: Add pointer about Xen mmu requirement for alloc_low_pages

From: Stefano Stabellini <[email protected]>

Add a link to commit 279b706 for more information.

Signed-off-by: Yinghai Lu <[email protected]>
---
arch/x86/mm/init.c | 5 +++++
1 files changed, 5 insertions(+), 0 deletions(-)

diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
index f5e0120..a7939ed 100644
--- a/arch/x86/mm/init.c
+++ b/arch/x86/mm/init.c
@@ -25,6 +25,11 @@ unsigned long __meminitdata pgt_buf_top;

static unsigned long min_pfn_mapped;

+/*
+ * Pages returned are already directly mapped.
+ *
+ * Changing that is likely to break Xen, see commit 279b706 for detail info.
+ */
__ref void *alloc_low_pages(unsigned int num)
{
unsigned long pfn;
--
1.7.7

2012-11-12 21:37:55

by Yinghai Lu

[permalink] [raw]
Subject: [PATCH 04/46] x86, mm: Move init_memory_mapping calling out of setup.c

Now init_memory_mapping() is called two times; later it will be called for
every ram range.

Put all the related init_memory_mapping() calls together and move them out
of setup.c.

This effectively reverts commit 1bbbbe7
x86: Exclude E820_RESERVED regions and memory holes above 4 GB from direct mapping.
That will be addressed later with a complete solution that includes
handling holes under 4GB.

Signed-off-by: Yinghai Lu <[email protected]>
Reviewed-by: Pekka Enberg <[email protected]>
---
arch/x86/include/asm/init.h | 1 -
arch/x86/include/asm/pgtable.h | 2 +-
arch/x86/kernel/setup.c | 27 +--------------------------
arch/x86/mm/init.c | 19 ++++++++++++++++++-
4 files changed, 20 insertions(+), 29 deletions(-)

diff --git a/arch/x86/include/asm/init.h b/arch/x86/include/asm/init.h
index adcc0ae..4f13998 100644
--- a/arch/x86/include/asm/init.h
+++ b/arch/x86/include/asm/init.h
@@ -12,7 +12,6 @@ kernel_physical_mapping_init(unsigned long start,
unsigned long end,
unsigned long page_size_mask);

-
extern unsigned long __initdata pgt_buf_start;
extern unsigned long __meminitdata pgt_buf_end;
extern unsigned long __meminitdata pgt_buf_top;
diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 98ac76d..dd1a888 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -602,7 +602,7 @@ static inline int pgd_none(pgd_t pgd)
#ifndef __ASSEMBLY__

extern int direct_gbpages;
-void probe_page_size_mask(void);
+void init_mem_mapping(void);

/* local pte updates need not use xchg for locking */
static inline pte_t native_local_ptep_get_and_clear(pte_t *ptep)
diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index 01fb5f9..23b079f 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -913,34 +913,9 @@ void __init setup_arch(char **cmdline_p)
setup_real_mode();

init_gbpages();
- probe_page_size_mask();

- /* max_pfn_mapped is updated here */
- max_low_pfn_mapped = init_memory_mapping(0, max_low_pfn<<PAGE_SHIFT);
- max_pfn_mapped = max_low_pfn_mapped;
+ init_mem_mapping();

-#ifdef CONFIG_X86_64
- if (max_pfn > max_low_pfn) {
- int i;
- unsigned long start, end;
- unsigned long start_pfn, end_pfn;
-
- for_each_mem_pfn_range(i, MAX_NUMNODES, &start_pfn, &end_pfn,
- NULL) {
-
- end = PFN_PHYS(end_pfn);
- if (end <= (1UL<<32))
- continue;
-
- start = PFN_PHYS(start_pfn);
- max_pfn_mapped = init_memory_mapping(
- max((1UL<<32), start), end);
- }
-
- /* can we preseve max_low_pfn ?*/
- max_low_pfn = max_pfn;
- }
-#endif
memblock.current_limit = get_max_mapped();
dma_contiguous_reserve(0);

diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
index 4a372d7..8927276 100644
--- a/arch/x86/mm/init.c
+++ b/arch/x86/mm/init.c
@@ -37,7 +37,7 @@ struct map_range {

static int page_size_mask;

-void probe_page_size_mask(void)
+static void __init probe_page_size_mask(void)
{
#if !defined(CONFIG_DEBUG_PAGEALLOC) && !defined(CONFIG_KMEMCHECK)
/*
@@ -316,6 +316,23 @@ unsigned long __init_refok init_memory_mapping(unsigned long start,
return ret >> PAGE_SHIFT;
}

+void __init init_mem_mapping(void)
+{
+ probe_page_size_mask();
+
+ /* max_pfn_mapped is updated here */
+ max_low_pfn_mapped = init_memory_mapping(0, max_low_pfn<<PAGE_SHIFT);
+ max_pfn_mapped = max_low_pfn_mapped;
+
+#ifdef CONFIG_X86_64
+ if (max_pfn > max_low_pfn) {
+ max_pfn_mapped = init_memory_mapping(1UL<<32,
+ max_pfn<<PAGE_SHIFT);
+ /* can we preseve max_low_pfn ?*/
+ max_low_pfn = max_pfn;
+ }
+#endif
+}

/*
* devmem_is_allowed() checks to see if /dev/mem access to a certain address
--
1.7.7

2012-11-12 21:37:54

by Yinghai Lu

[permalink] [raw]
Subject: [PATCH 03/46] x86, mm: Move down find_early_table_space()

find_early_table_space() will need to call split_mem_range().
Move it below that function to avoid an extra forward declaration.

Signed-off-by: Yinghai Lu <[email protected]>
---
arch/x86/mm/init.c | 117 ++++++++++++++++++++++++++--------------------------
1 files changed, 59 insertions(+), 58 deletions(-)

diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
index 6d8e102..4a372d7 100644
--- a/arch/x86/mm/init.c
+++ b/arch/x86/mm/init.c
@@ -36,64 +36,6 @@ struct map_range {
};

static int page_size_mask;
-/*
- * First calculate space needed for kernel direct mapping page tables to cover
- * mr[0].start to mr[nr_range - 1].end, while accounting for possible 2M and 1GB
- * pages. Then find enough contiguous space for those page tables.
- */
-static void __init find_early_table_space(struct map_range *mr, int nr_range)
-{
- int i;
- unsigned long puds = 0, pmds = 0, ptes = 0, tables;
- unsigned long start = 0, good_end;
- phys_addr_t base;
-
- for (i = 0; i < nr_range; i++) {
- unsigned long range, extra;
-
- range = mr[i].end - mr[i].start;
- puds += (range + PUD_SIZE - 1) >> PUD_SHIFT;
-
- if (mr[i].page_size_mask & (1 << PG_LEVEL_1G)) {
- extra = range - ((range >> PUD_SHIFT) << PUD_SHIFT);
- pmds += (extra + PMD_SIZE - 1) >> PMD_SHIFT;
- } else {
- pmds += (range + PMD_SIZE - 1) >> PMD_SHIFT;
- }
-
- if (mr[i].page_size_mask & (1 << PG_LEVEL_2M)) {
- extra = range - ((range >> PMD_SHIFT) << PMD_SHIFT);
-#ifdef CONFIG_X86_32
- extra += PMD_SIZE;
-#endif
- ptes += (extra + PAGE_SIZE - 1) >> PAGE_SHIFT;
- } else {
- ptes += (range + PAGE_SIZE - 1) >> PAGE_SHIFT;
- }
- }
-
- tables = roundup(puds * sizeof(pud_t), PAGE_SIZE);
- tables += roundup(pmds * sizeof(pmd_t), PAGE_SIZE);
- tables += roundup(ptes * sizeof(pte_t), PAGE_SIZE);
-
-#ifdef CONFIG_X86_32
- /* for fixmap */
- tables += roundup(__end_of_fixed_addresses * sizeof(pte_t), PAGE_SIZE);
-#endif
- good_end = max_pfn_mapped << PAGE_SHIFT;
-
- base = memblock_find_in_range(start, good_end, tables, PAGE_SIZE);
- if (!base)
- panic("Cannot find space for the kernel page tables");
-
- pgt_buf_start = base >> PAGE_SHIFT;
- pgt_buf_end = pgt_buf_start;
- pgt_buf_top = pgt_buf_start + (tables >> PAGE_SHIFT);
-
- printk(KERN_DEBUG "kernel direct mapping tables up to %#lx @ [mem %#010lx-%#010lx]\n",
- mr[nr_range - 1].end - 1, pgt_buf_start << PAGE_SHIFT,
- (pgt_buf_top << PAGE_SHIFT) - 1);
-}

void probe_page_size_mask(void)
{
@@ -250,6 +192,65 @@ static int __meminit split_mem_range(struct map_range *mr, int nr_range,
}

/*
+ * First calculate space needed for kernel direct mapping page tables to cover
+ * mr[0].start to mr[nr_range - 1].end, while accounting for possible 2M and 1GB
+ * pages. Then find enough contiguous space for those page tables.
+ */
+static void __init find_early_table_space(struct map_range *mr, int nr_range)
+{
+ int i;
+ unsigned long puds = 0, pmds = 0, ptes = 0, tables;
+ unsigned long start = 0, good_end;
+ phys_addr_t base;
+
+ for (i = 0; i < nr_range; i++) {
+ unsigned long range, extra;
+
+ range = mr[i].end - mr[i].start;
+ puds += (range + PUD_SIZE - 1) >> PUD_SHIFT;
+
+ if (mr[i].page_size_mask & (1 << PG_LEVEL_1G)) {
+ extra = range - ((range >> PUD_SHIFT) << PUD_SHIFT);
+ pmds += (extra + PMD_SIZE - 1) >> PMD_SHIFT;
+ } else {
+ pmds += (range + PMD_SIZE - 1) >> PMD_SHIFT;
+ }
+
+ if (mr[i].page_size_mask & (1 << PG_LEVEL_2M)) {
+ extra = range - ((range >> PMD_SHIFT) << PMD_SHIFT);
+#ifdef CONFIG_X86_32
+ extra += PMD_SIZE;
+#endif
+ ptes += (extra + PAGE_SIZE - 1) >> PAGE_SHIFT;
+ } else {
+ ptes += (range + PAGE_SIZE - 1) >> PAGE_SHIFT;
+ }
+ }
+
+ tables = roundup(puds * sizeof(pud_t), PAGE_SIZE);
+ tables += roundup(pmds * sizeof(pmd_t), PAGE_SIZE);
+ tables += roundup(ptes * sizeof(pte_t), PAGE_SIZE);
+
+#ifdef CONFIG_X86_32
+ /* for fixmap */
+ tables += roundup(__end_of_fixed_addresses * sizeof(pte_t), PAGE_SIZE);
+#endif
+ good_end = max_pfn_mapped << PAGE_SHIFT;
+
+ base = memblock_find_in_range(start, good_end, tables, PAGE_SIZE);
+ if (!base)
+ panic("Cannot find space for the kernel page tables");
+
+ pgt_buf_start = base >> PAGE_SHIFT;
+ pgt_buf_end = pgt_buf_start;
+ pgt_buf_top = pgt_buf_start + (tables >> PAGE_SHIFT);
+
+ printk(KERN_DEBUG "kernel direct mapping tables up to %#lx @ [mem %#010lx-%#010lx]\n",
+ mr[nr_range - 1].end - 1, pgt_buf_start << PAGE_SHIFT,
+ (pgt_buf_top << PAGE_SHIFT) - 1);
+}
+
+/*
* Setup the direct mapping of the physical memory at PAGE_OFFSET.
* This runs before bootmem is initialized and gets pages directly from
* the physical memory. To access them they are temporarily mapped.
--
1.7.7

2012-11-13 05:51:59

by Yasuaki Ishimatsu

[permalink] [raw]
Subject: Re: [PATCH 02/46] x86, mm: Split out split_mem_range from init_memory_mapping

2012/11/13 6:17, Yinghai Lu wrote:
> So make init_memory_mapping smaller and readable.
>
> Suggested-by: Ingo Molnar <[email protected]>
> Signed-off-by: Yinghai Lu <[email protected]>
> Reviewed-by: Pekka Enberg <[email protected]>
> ---
> arch/x86/mm/init.c | 42 ++++++++++++++++++++++++++----------------
> 1 files changed, 26 insertions(+), 16 deletions(-)
>
> diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
> index aa5b0da..6d8e102 100644
> --- a/arch/x86/mm/init.c
> +++ b/arch/x86/mm/init.c
> @@ -146,25 +146,13 @@ static int __meminit save_mr(struct map_range *mr, int nr_range,
> return nr_range;
> }
>
> -/*
> - * Setup the direct mapping of the physical memory at PAGE_OFFSET.
> - * This runs before bootmem is initialized and gets pages directly from
> - * the physical memory. To access them they are temporarily mapped.
> - */
> -unsigned long __init_refok init_memory_mapping(unsigned long start,
> - unsigned long end)
> +static int __meminit split_mem_range(struct map_range *mr, int nr_range,
> + unsigned long start,
> + unsigned long end)
> {
> unsigned long start_pfn, end_pfn;
> - unsigned long ret = 0;
> unsigned long pos;
> - struct map_range mr[NR_RANGE_MR];
> - int nr_range, i;
> -
> - printk(KERN_INFO "init_memory_mapping: [mem %#010lx-%#010lx]\n",
> - start, end - 1);
> -
> - memset(mr, 0, sizeof(mr));
> - nr_range = 0;
> + int i;
>
> /* head if not big page alignment ? */
> start_pfn = start >> PAGE_SHIFT;
> @@ -258,6 +246,28 @@ unsigned long __init_refok init_memory_mapping(unsigned long start,
> (mr[i].page_size_mask & (1<<PG_LEVEL_1G))?"1G":(
> (mr[i].page_size_mask & (1<<PG_LEVEL_2M))?"2M":"4k"));
>
> + return nr_range;
> +}
> +
> +/*
> + * Setup the direct mapping of the physical memory at PAGE_OFFSET.
> + * This runs before bootmem is initialized and gets pages directly from
> + * the physical memory. To access them they are temporarily mapped.
> + */
> +unsigned long __init_refok init_memory_mapping(unsigned long start,
> + unsigned long end)
> +{
> + struct map_range mr[NR_RANGE_MR];
> + unsigned long ret = 0;
> + int nr_range, i;
> +
> + pr_info("init_memory_mapping: [mem %#010lx-%#010lx]\n",
> + start, end - 1);
> +
> + memset(mr, 0, sizeof(mr));

> + nr_range = 0;

This is unnecessary since it is set just below.

> + nr_range = split_mem_range(mr, nr_range, start, end);

Thanks,
Yasuaki Ishimatsu

> +
> /*
> * Find space for the kernel direct mapping tables.
> *
>

2012-11-13 06:20:24

by Yinghai Lu

[permalink] [raw]
Subject: Re: [PATCH 02/46] x86, mm: Split out split_mem_range from init_memory_mapping

On Mon, Nov 12, 2012 at 9:51 PM, Yasuaki Ishimatsu
<[email protected]> wrote:
> 2012/11/13 6:17, Yinghai Lu wrote:
>> + nr_range = 0;
>
> This is unnecessary since it is set in the below.
>
>> + nr_range = split_mem_range(mr, nr_range, start, end);
^^^^^^^^

2012-11-13 07:04:01

by bigboy

[permalink] [raw]
Subject: Re: [PATCH 02/46] x86, mm: Split out split_mem_range from init_memory_mapping

On 2012年11月13日 14:20, Yinghai Lu wrote:
> On Mon, Nov 12, 2012 at 9:51 PM, Yasuaki Ishimatsu
> <[email protected]> wrote:
>> 2012/11/13 6:17, Yinghai Lu wrote:
>>> + nr_range = 0;
>> This is unnecessary since it is set in the below.
>>
>>> + nr_range = split_mem_range(mr, nr_range, start, end);
> ^^^^^^^^
Why not use 0 directly?

nr_range = split_mem_range(mr, 0, start, end);

2012-11-13 16:53:01

by Stefano Stabellini

[permalink] [raw]
Subject: Re: [PATCH 22/46] x86, mm: Remove early_memremap workaround for page table accessing on 64bit

On Mon, 12 Nov 2012, Yinghai Lu wrote:
> We try to put page table high to make room for kdump, and at that time
> those ranges are not mapped yet, and have to use ioremap to access it.
>
> Now after patch that pre-map page table top down.
> x86, mm: setup page table in top-down
> We do not need that workaround anymore.
>
> Just use __va to return directly mapping address.
>
> Signed-off-by: Yinghai Lu <[email protected]>


Acked-by: Stefano Stabellini <[email protected]>

> arch/x86/mm/init_64.c | 38 ++++----------------------------------
> 1 files changed, 4 insertions(+), 34 deletions(-)
>
> diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
> index eefaea6..5ee9242 100644
> --- a/arch/x86/mm/init_64.c
> +++ b/arch/x86/mm/init_64.c
> @@ -340,36 +340,12 @@ static __ref void *alloc_low_page(unsigned long *phys)
> } else
> pfn = pgt_buf_end++;
>
> - adr = early_memremap(pfn * PAGE_SIZE, PAGE_SIZE);
> + adr = __va(pfn * PAGE_SIZE);
> clear_page(adr);
> *phys = pfn * PAGE_SIZE;
> return adr;
> }
>
> -static __ref void *map_low_page(void *virt)
> -{
> - void *adr;
> - unsigned long phys, left;
> -
> - if (after_bootmem)
> - return virt;
> -
> - phys = __pa(virt);
> - left = phys & (PAGE_SIZE - 1);
> - adr = early_memremap(phys & PAGE_MASK, PAGE_SIZE);
> - adr = (void *)(((unsigned long)adr) | left);
> -
> - return adr;
> -}
> -
> -static __ref void unmap_low_page(void *adr)
> -{
> - if (after_bootmem)
> - return;
> -
> - early_iounmap((void *)((unsigned long)adr & PAGE_MASK), PAGE_SIZE);
> -}
> -
> static unsigned long __meminit
> phys_pte_init(pte_t *pte_page, unsigned long addr, unsigned long end,
> pgprot_t prot)
> @@ -442,10 +418,9 @@ phys_pmd_init(pmd_t *pmd_page, unsigned long address, unsigned long end,
> if (pmd_val(*pmd)) {
> if (!pmd_large(*pmd)) {
> spin_lock(&init_mm.page_table_lock);
> - pte = map_low_page((pte_t *)pmd_page_vaddr(*pmd));
> + pte = (pte_t *)pmd_page_vaddr(*pmd);
> last_map_addr = phys_pte_init(pte, address,
> end, prot);
> - unmap_low_page(pte);
> spin_unlock(&init_mm.page_table_lock);
> continue;
> }
> @@ -483,7 +458,6 @@ phys_pmd_init(pmd_t *pmd_page, unsigned long address, unsigned long end,
>
> pte = alloc_low_page(&pte_phys);
> last_map_addr = phys_pte_init(pte, address, end, new_prot);
> - unmap_low_page(pte);
>
> spin_lock(&init_mm.page_table_lock);
> pmd_populate_kernel(&init_mm, pmd, __va(pte_phys));
> @@ -518,10 +492,9 @@ phys_pud_init(pud_t *pud_page, unsigned long addr, unsigned long end,
>
> if (pud_val(*pud)) {
> if (!pud_large(*pud)) {
> - pmd = map_low_page(pmd_offset(pud, 0));
> + pmd = pmd_offset(pud, 0);
> last_map_addr = phys_pmd_init(pmd, addr, end,
> page_size_mask, prot);
> - unmap_low_page(pmd);
> __flush_tlb_all();
> continue;
> }
> @@ -560,7 +533,6 @@ phys_pud_init(pud_t *pud_page, unsigned long addr, unsigned long end,
> pmd = alloc_low_page(&pmd_phys);
> last_map_addr = phys_pmd_init(pmd, addr, end, page_size_mask,
> prot);
> - unmap_low_page(pmd);
>
> spin_lock(&init_mm.page_table_lock);
> pud_populate(&init_mm, pud, __va(pmd_phys));
> @@ -596,17 +568,15 @@ kernel_physical_mapping_init(unsigned long start,
> next = end;
>
> if (pgd_val(*pgd)) {
> - pud = map_low_page((pud_t *)pgd_page_vaddr(*pgd));
> + pud = (pud_t *)pgd_page_vaddr(*pgd);
> last_map_addr = phys_pud_init(pud, __pa(start),
> __pa(end), page_size_mask);
> - unmap_low_page(pud);
> continue;
> }
>
> pud = alloc_low_page(&pud_phys);
> last_map_addr = phys_pud_init(pud, __pa(start), __pa(next),
> page_size_mask);
> - unmap_low_page(pud);
>
> spin_lock(&init_mm.page_table_lock);
> pgd_populate(&init_mm, pgd, __va(pud_phys));
> --
> 1.7.7
>

2012-11-13 16:53:27

by Stefano Stabellini

[permalink] [raw]
Subject: Re: [PATCH 27/46] x86, mm: Add alloc_low_pages(num)

On Mon, 12 Nov 2012, Yinghai Lu wrote:
> 32bit kmap mapping needs pages to be used for low to high.
> At this point those pages are still from pgt_buf_* from BRK, so it is
> ok now.
> But we want to move early_ioremap_page_table_range_init() out of
> init_memory_mapping() and only call it one time later, that will
> make page_table_range_init/page_table_kmap_check/alloc_low_page to
> use memblock to get page.
>
> memblock allocation for pages are from high to low.
> So will get panic from page_table_kmap_check() that has BUG_ON to do
> ordering checking.
>
> This patch add alloc_low_pages to make it possible to allocate serveral
> pages at first, and hand out pages one by one from low to high.
>
> -v2: add one line comment about xen requirements.

where is it?

> Signed-off-by: Yinghai Lu <[email protected]>
> Cc: Andrew Morton <[email protected]>
> ---
> arch/x86/mm/init.c | 33 +++++++++++++++++++++------------
> arch/x86/mm/mm_internal.h | 6 +++++-
> 2 files changed, 26 insertions(+), 13 deletions(-)
>
> diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
> index 9d51af72..f5e0120 100644
> --- a/arch/x86/mm/init.c
> +++ b/arch/x86/mm/init.c
> @@ -25,36 +25,45 @@ unsigned long __meminitdata pgt_buf_top;
>
> static unsigned long min_pfn_mapped;
>
> -__ref void *alloc_low_page(void)
> +__ref void *alloc_low_pages(unsigned int num)
> {
> unsigned long pfn;
> - void *adr;
> + int i;
>
> #ifdef CONFIG_X86_64
> if (after_bootmem) {
> - adr = (void *)get_zeroed_page(GFP_ATOMIC | __GFP_NOTRACK);
> + unsigned int order;
>
> - return adr;
> + order = get_order((unsigned long)num << PAGE_SHIFT);
> + return (void *)__get_free_pages(GFP_ATOMIC | __GFP_NOTRACK |
> + __GFP_ZERO, order);
> }
> #endif
>
> - if ((pgt_buf_end + 1) >= pgt_buf_top) {
> + if ((pgt_buf_end + num) >= pgt_buf_top) {
> unsigned long ret;
> if (min_pfn_mapped >= max_pfn_mapped)
> panic("alloc_low_page: ran out of memory");
> ret = memblock_find_in_range(min_pfn_mapped << PAGE_SHIFT,
> max_pfn_mapped << PAGE_SHIFT,
> - PAGE_SIZE, PAGE_SIZE);
> + PAGE_SIZE * num , PAGE_SIZE);
> if (!ret)
> panic("alloc_low_page: can not alloc memory");
> - memblock_reserve(ret, PAGE_SIZE);
> + memblock_reserve(ret, PAGE_SIZE * num);
> pfn = ret >> PAGE_SHIFT;
> - } else
> - pfn = pgt_buf_end++;
> + } else {
> + pfn = pgt_buf_end;
> + pgt_buf_end += num;
> + }
> +
> + for (i = 0; i < num; i++) {
> + void *adr;
> +
> + adr = __va((pfn + i) << PAGE_SHIFT);
> + clear_page(adr);
> + }
>
> - adr = __va(pfn * PAGE_SIZE);
> - clear_page(adr);
> - return adr;
> + return __va(pfn << PAGE_SHIFT);
> }
>
> /* need 4 4k for initial PMD_SIZE, 4k for 0-ISA_END_ADDRESS */
> diff --git a/arch/x86/mm/mm_internal.h b/arch/x86/mm/mm_internal.h
> index b3f993a..7e3b88e 100644
> --- a/arch/x86/mm/mm_internal.h
> +++ b/arch/x86/mm/mm_internal.h
> @@ -1,6 +1,10 @@
> #ifndef __X86_MM_INTERNAL_H
> #define __X86_MM_INTERNAL_H
>
> -void *alloc_low_page(void);
> +void *alloc_low_pages(unsigned int num);
> +static inline void *alloc_low_page(void)
> +{
> + return alloc_low_pages(1);
> +}
>
> #endif /* __X86_MM_INTERNAL_H */
> --
> 1.7.7
>

2012-11-13 16:53:33

by Stefano Stabellini

[permalink] [raw]
Subject: Re: [PATCH 26/46] x86, mm, Xen: Remove mapping_pagetable_reserve()

On Mon, 12 Nov 2012, Yinghai Lu wrote:
> Page table area are pre-mapped now after
> x86, mm: setup page table in top-down
> x86, mm: Remove early_memremap workaround for page table accessing on 64bit
>
> mapping_pagetable_reserve is not used anymore, so remove it.

You should mention in the description of the patch that you are
removing mask_rw_pte too.

The reason why you can do that safely is that you previously modified
alloc_low_page to always return pages that are already mapped; moreover,
xen_alloc_pte_init, xen_alloc_pmd_init, etc., will mark the page RO
before hooking it into the pagetable automatically.

[ ... ]

> diff --git a/arch/x86/xen/mmu.c b/arch/x86/xen/mmu.c
> index dcf5f2d..bbb883f 100644
> --- a/arch/x86/xen/mmu.c
> +++ b/arch/x86/xen/mmu.c
> @@ -1178,20 +1178,6 @@ static void xen_exit_mmap(struct mm_struct *mm)
>
> static void xen_post_allocator_init(void);
>
> -static __init void xen_mapping_pagetable_reserve(u64 start, u64 end)
> -{
> - /* reserve the range used */
> - native_pagetable_reserve(start, end);
> -
> - /* set as RW the rest */
> - printk(KERN_DEBUG "xen: setting RW the range %llx - %llx\n", end,
> - PFN_PHYS(pgt_buf_top));
> - while (end < PFN_PHYS(pgt_buf_top)) {
> - make_lowmem_page_readwrite(__va(end));
> - end += PAGE_SIZE;
> - }
> -}
> -
> #ifdef CONFIG_X86_64
> static void __init xen_cleanhighmap(unsigned long vaddr,
> unsigned long vaddr_end)
> @@ -1503,19 +1489,6 @@ static pte_t __init mask_rw_pte(pte_t *ptep, pte_t pte)
> #else /* CONFIG_X86_64 */
> static pte_t __init mask_rw_pte(pte_t *ptep, pte_t pte)
> {
> - unsigned long pfn = pte_pfn(pte);
> -
> - /*
> - * If the new pfn is within the range of the newly allocated
> - * kernel pagetable, and it isn't being mapped into an
> - * early_ioremap fixmap slot as a freshly allocated page, make sure
> - * it is RO.
> - */
> - if (((!is_early_ioremap_ptep(ptep) &&
> - pfn >= pgt_buf_start && pfn < pgt_buf_top)) ||
> - (is_early_ioremap_ptep(ptep) && pfn != (pgt_buf_end - 1)))
> - pte = pte_wrprotect(pte);
> -
> return pte;

you should just get rid of mask_rw_pte completely

2012-11-13 16:53:26

by Stefano Stabellini

[permalink] [raw]
Subject: Re: [PATCH 28/46] x86, mm: Add pointer about Xen mmu requirement for alloc_low_pages

On Mon, 12 Nov 2012, Yinghai Lu wrote:
> From: Stefano Stabellini <[email protected]>
>
> Add link to commit 279b706 for more information
>
> Signed-off-by: Yinghai Lu <[email protected]>

ah, here it is, OK then

> arch/x86/mm/init.c | 5 +++++
> 1 files changed, 5 insertions(+), 0 deletions(-)
>
> diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
> index f5e0120..a7939ed 100644
> --- a/arch/x86/mm/init.c
> +++ b/arch/x86/mm/init.c
> @@ -25,6 +25,11 @@ unsigned long __meminitdata pgt_buf_top;
>
> static unsigned long min_pfn_mapped;
>
> +/*
> + * Pages returned are already directly mapped.
> + *
> + * Changing that is likely to break Xen, see commit 279b706 for detail info.
^detailed

> + */
> __ref void *alloc_low_pages(unsigned int num)
> {
> unsigned long pfn;
> --
> 1.7.7
>

2012-11-13 17:26:57

by Stefano Stabellini

[permalink] [raw]
Subject: Re: [PATCH 21/46] x86, mm: setup page table in top-down

On Mon, 12 Nov 2012, Yinghai Lu wrote:
> Get pgt_buf early from BRK, and use it to map PMD_SIZE from top at first.
> Then use mapped pages to map more ranges below, and keep looping until
> all pages get mapped.
>
> alloc_low_page will use page from BRK at first, after that buffer is used
> up, will use memblock to find and reserve pages for page table usage.
>
> Introduce min_pfn_mapped to make sure find new pages from mapped ranges,
> that will be updated when lower pages get mapped.
>
> Also add step_size to make sure that don't try to map too big range with
> limited mapped pages initially, and increase the step_size when we have
> more mapped pages on hand.
>
> At last we can get rid of calculation and find early pgt related code.
>
> -v2: update to after fix_xen change,
> also use MACRO for initial pgt_buf size and add comments with it.
> -v3: skip big reserved range in memblock.reserved near end.
> -v4: don't need fix_xen change now.
>
> Suggested-by: "H. Peter Anvin" <[email protected]>
> Signed-off-by: Yinghai Lu <[email protected]>

The changes to alloc_low_page and early_alloc_pgt_buf look OK to me.

The changes to init_mem_mapping are a bit iffy but they aren't too
unreasonable.
Overall the patch is OK even though I would certainly appreciate more
comments and better variable names (real_end?), see below.


> arch/x86/include/asm/page_types.h | 1 +
> arch/x86/include/asm/pgtable.h | 1 +
> arch/x86/kernel/setup.c | 3 +
> arch/x86/mm/init.c | 210 +++++++++++--------------------------
> arch/x86/mm/init_32.c | 17 +++-
> arch/x86/mm/init_64.c | 17 +++-
> 6 files changed, 94 insertions(+), 155 deletions(-)
>
> diff --git a/arch/x86/include/asm/page_types.h b/arch/x86/include/asm/page_types.h
> index 54c9787..9f6f3e6 100644
> --- a/arch/x86/include/asm/page_types.h
> +++ b/arch/x86/include/asm/page_types.h
> @@ -45,6 +45,7 @@ extern int devmem_is_allowed(unsigned long pagenr);
>
> extern unsigned long max_low_pfn_mapped;
> extern unsigned long max_pfn_mapped;
> +extern unsigned long min_pfn_mapped;
>
> static inline phys_addr_t get_max_mapped(void)
> {
> diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
> index dd1a888..6991a3e 100644
> --- a/arch/x86/include/asm/pgtable.h
> +++ b/arch/x86/include/asm/pgtable.h
> @@ -603,6 +603,7 @@ static inline int pgd_none(pgd_t pgd)
>
> extern int direct_gbpages;
> void init_mem_mapping(void);
> +void early_alloc_pgt_buf(void);
>
> /* local pte updates need not use xchg for locking */
> static inline pte_t native_local_ptep_get_and_clear(pte_t *ptep)
> diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
> index 94f922a..f7634092 100644
> --- a/arch/x86/kernel/setup.c
> +++ b/arch/x86/kernel/setup.c
> @@ -124,6 +124,7 @@
> */
> unsigned long max_low_pfn_mapped;
> unsigned long max_pfn_mapped;
> +unsigned long min_pfn_mapped;
>
> #ifdef CONFIG_DMI
> RESERVE_BRK(dmi_alloc, 65536);
> @@ -900,6 +901,8 @@ void __init setup_arch(char **cmdline_p)
>
> reserve_ibft_region();
>
> + early_alloc_pgt_buf();
> +
> /*
> * Need to conclude brk, before memblock_x86_fill()
> * it could use memblock_find_in_range, could overlap with
> diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
> index 47a1ba2..76a6e82 100644
> --- a/arch/x86/mm/init.c
> +++ b/arch/x86/mm/init.c
> @@ -21,6 +21,21 @@ unsigned long __initdata pgt_buf_start;
> unsigned long __meminitdata pgt_buf_end;
> unsigned long __meminitdata pgt_buf_top;
>
> +/* need 4 4k for initial PMD_SIZE, 4k for 0-ISA_END_ADDRESS */
> +#define INIT_PGT_BUF_SIZE (5 * PAGE_SIZE)
> +RESERVE_BRK(early_pgt_alloc, INIT_PGT_BUF_SIZE);
> +void __init early_alloc_pgt_buf(void)
> +{
> + unsigned long tables = INIT_PGT_BUF_SIZE;
> + phys_addr_t base;
> +
> + base = __pa(extend_brk(tables, PAGE_SIZE));
> +
> + pgt_buf_start = base >> PAGE_SHIFT;
> + pgt_buf_end = pgt_buf_start;
> + pgt_buf_top = pgt_buf_start + (tables >> PAGE_SHIFT);
> +}
> +
> int after_bootmem;
>
> int direct_gbpages
> @@ -228,105 +243,6 @@ static int __meminit split_mem_range(struct map_range *mr, int nr_range,
> return nr_range;
> }
>
> -/*
> - * First calculate space needed for kernel direct mapping page tables to cover
> - * mr[0].start to mr[nr_range - 1].end, while accounting for possible 2M and 1GB
> - * pages. Then find enough contiguous space for those page tables.
> - */
> -static unsigned long __init calculate_table_space_size(unsigned long start, unsigned long end)
> -{
> - int i;
> - unsigned long puds = 0, pmds = 0, ptes = 0, tables;
> - struct map_range mr[NR_RANGE_MR];
> - int nr_range;
> -
> - memset(mr, 0, sizeof(mr));
> - nr_range = 0;
> - nr_range = split_mem_range(mr, nr_range, start, end);
> -
> - for (i = 0; i < nr_range; i++) {
> - unsigned long range, extra;
> -
> - range = mr[i].end - mr[i].start;
> - puds += (range + PUD_SIZE - 1) >> PUD_SHIFT;
> -
> - if (mr[i].page_size_mask & (1 << PG_LEVEL_1G)) {
> - extra = range - ((range >> PUD_SHIFT) << PUD_SHIFT);
> - pmds += (extra + PMD_SIZE - 1) >> PMD_SHIFT;
> - } else {
> - pmds += (range + PMD_SIZE - 1) >> PMD_SHIFT;
> - }
> -
> - if (mr[i].page_size_mask & (1 << PG_LEVEL_2M)) {
> - extra = range - ((range >> PMD_SHIFT) << PMD_SHIFT);
> -#ifdef CONFIG_X86_32
> - extra += PMD_SIZE;
> -#endif
> - ptes += (extra + PAGE_SIZE - 1) >> PAGE_SHIFT;
> - } else {
> - ptes += (range + PAGE_SIZE - 1) >> PAGE_SHIFT;
> - }
> - }
> -
> - tables = roundup(puds * sizeof(pud_t), PAGE_SIZE);
> - tables += roundup(pmds * sizeof(pmd_t), PAGE_SIZE);
> - tables += roundup(ptes * sizeof(pte_t), PAGE_SIZE);
> -
> -#ifdef CONFIG_X86_32
> - /* for fixmap */
> - tables += roundup(__end_of_fixed_addresses * sizeof(pte_t), PAGE_SIZE);
> -#endif
> -
> - return tables;
> -}
> -
> -static unsigned long __init calculate_all_table_space_size(void)
> -{
> - unsigned long start_pfn, end_pfn;
> - unsigned long tables;
> - int i;
> -
> - /* the ISA range is always mapped regardless of memory holes */
> - tables = calculate_table_space_size(0, ISA_END_ADDRESS);
> -
> - for_each_mem_pfn_range(i, MAX_NUMNODES, &start_pfn, &end_pfn, NULL) {
> - u64 start = start_pfn << PAGE_SHIFT;
> - u64 end = end_pfn << PAGE_SHIFT;
> -
> - if (end <= ISA_END_ADDRESS)
> - continue;
> -
> - if (start < ISA_END_ADDRESS)
> - start = ISA_END_ADDRESS;
> -#ifdef CONFIG_X86_32
> - /* on 32 bit, we only map up to max_low_pfn */
> - if ((start >> PAGE_SHIFT) >= max_low_pfn)
> - continue;
> -
> - if ((end >> PAGE_SHIFT) > max_low_pfn)
> - end = max_low_pfn << PAGE_SHIFT;
> -#endif
> - tables += calculate_table_space_size(start, end);
> - }
> -
> - return tables;
> -}
> -
> -static void __init find_early_table_space(unsigned long start,
> - unsigned long good_end,
> - unsigned long tables)
> -{
> - phys_addr_t base;
> -
> - base = memblock_find_in_range(start, good_end, tables, PAGE_SIZE);
> - if (!base)
> - panic("Cannot find space for the kernel page tables");
> -
> - pgt_buf_start = base >> PAGE_SHIFT;
> - pgt_buf_end = pgt_buf_start;
> - pgt_buf_top = pgt_buf_start + (tables >> PAGE_SHIFT);
> -}
> -
> static struct range pfn_mapped[E820_X_MAX];
> static int nr_pfn_mapped;
>
> @@ -392,17 +308,14 @@ unsigned long __init_refok init_memory_mapping(unsigned long start,
> }
>
> /*
> - * Iterate through E820 memory map and create direct mappings for only E820_RAM
> - * regions. We cannot simply create direct mappings for all pfns from
> - * [0 to max_low_pfn) and [4GB to max_pfn) because of possible memory holes in
> - * high addresses that cannot be marked as UC by fixed/variable range MTRRs.
> - * Depending on the alignment of E820 ranges, this may possibly result in using
> - * smaller size (i.e. 4K instead of 2M or 1G) page tables.
> + * this one could take range with hole in it.
> */

this comment in particular is not very satisfactory


> -static void __init init_range_memory_mapping(unsigned long range_start,
> +static unsigned long __init init_range_memory_mapping(
> + unsigned long range_start,
> unsigned long range_end)
> {
> unsigned long start_pfn, end_pfn;
> + unsigned long mapped_ram_size = 0;
> int i;
>
> for_each_mem_pfn_range(i, MAX_NUMNODES, &start_pfn, &end_pfn, NULL) {
> @@ -422,71 +335,70 @@ static void __init init_range_memory_mapping(unsigned long range_start,
> end = range_end;
>
> init_memory_mapping(start, end);
> +
> + mapped_ram_size += end - start;
> }
> +
> + return mapped_ram_size;
> }
>
> +/* (PUD_SHIFT-PMD_SHIFT)/2 */
> +#define STEP_SIZE_SHIFT 5
> void __init init_mem_mapping(void)
> {
> - unsigned long tables, good_end, end;
> + unsigned long end, real_end, start, last_start;
> + unsigned long step_size;
> + unsigned long addr;
> + unsigned long mapped_ram_size = 0;
> + unsigned long new_mapped_ram_size;
>
> probe_page_size_mask();
>
> - /*
> - * Find space for the kernel direct mapping tables.
> - *
> - * Later we should allocate these tables in the local node of the
> - * memory mapped. Unfortunately this is done currently before the
> - * nodes are discovered.
> - */
> #ifdef CONFIG_X86_64
> end = max_pfn << PAGE_SHIFT;
> - good_end = end;
> #else
> end = max_low_pfn << PAGE_SHIFT;
> - good_end = max_pfn_mapped << PAGE_SHIFT;
> #endif
> - tables = calculate_all_table_space_size();
> - find_early_table_space(0, good_end, tables);
> - printk(KERN_DEBUG "kernel direct mapping tables up to %#lx @ [mem %#010lx-%#010lx] prealloc\n",
> - end - 1, pgt_buf_start << PAGE_SHIFT,
> - (pgt_buf_top << PAGE_SHIFT) - 1);
>
> - max_pfn_mapped = 0; /* will get exact value next */
> /* the ISA range is always mapped regardless of memory holes */
> init_memory_mapping(0, ISA_END_ADDRESS);
> - init_range_memory_mapping(ISA_END_ADDRESS, end);
> +
> + /* xen has big range in reserved near end of ram, skip it at first */
> + addr = memblock_find_in_range(ISA_END_ADDRESS, end, PMD_SIZE,
> + PAGE_SIZE);
> + real_end = addr + PMD_SIZE;
> +
> + /* step_size need to be small so pgt_buf from BRK could cover it */
> + step_size = PMD_SIZE;
> + max_pfn_mapped = 0; /* will get exact value next */
> + min_pfn_mapped = real_end >> PAGE_SHIFT;
> + last_start = start = real_end;
> + while (last_start > ISA_END_ADDRESS) {
> + if (last_start > step_size) {
> + start = round_down(last_start - 1, step_size);
> + if (start < ISA_END_ADDRESS)
> + start = ISA_END_ADDRESS;
> + } else
> + start = ISA_END_ADDRESS;
> + new_mapped_ram_size = init_range_memory_mapping(start,
> + last_start);
> + last_start = start;
> + min_pfn_mapped = last_start >> PAGE_SHIFT;
> + /* only increase step_size after big range get mapped */
> + if (new_mapped_ram_size > mapped_ram_size)
> + step_size <<= STEP_SIZE_SHIFT;
> + mapped_ram_size += new_mapped_ram_size;
> + }
> +
> + if (real_end < end)
> + init_range_memory_mapping(real_end, end);
> +
> #ifdef CONFIG_X86_64
> if (max_pfn > max_low_pfn) {
> /* can we preseve max_low_pfn ?*/
> max_low_pfn = max_pfn;
> }
> #endif
> - /*
> - * Reserve the kernel pagetable pages we used (pgt_buf_start -
> - * pgt_buf_end) and free the other ones (pgt_buf_end - pgt_buf_top)
> - * so that they can be reused for other purposes.
> - *
> - * On native it just means calling memblock_reserve, on Xen it also
> - * means marking RW the pagetable pages that we allocated before
> - * but that haven't been used.
> - *
> - * In fact on xen we mark RO the whole range pgt_buf_start -
> - * pgt_buf_top, because we have to make sure that when
> - * init_memory_mapping reaches the pagetable pages area, it maps
> - * RO all the pagetable pages, including the ones that are beyond
> - * pgt_buf_end at that time.
> - */
> - if (pgt_buf_end > pgt_buf_start) {
> - printk(KERN_DEBUG "kernel direct mapping tables up to %#lx @ [mem %#010lx-%#010lx] final\n",
> - end - 1, pgt_buf_start << PAGE_SHIFT,
> - (pgt_buf_end << PAGE_SHIFT) - 1);
> - x86_init.mapping.pagetable_reserve(PFN_PHYS(pgt_buf_start),
> - PFN_PHYS(pgt_buf_end));
> - }
> -
> - /* stop the wrong using */
> - pgt_buf_top = 0;
> -
> early_memtest(0, max_pfn_mapped << PAGE_SHIFT);
> }

you should say why we don't need to call pagetable_reserve anymore: is
it because alloc_low_page is going to reserve each page that it
allocates?


> diff --git a/arch/x86/mm/init_32.c b/arch/x86/mm/init_32.c
> index 27f7fc6..7bb1106 100644
> --- a/arch/x86/mm/init_32.c
> +++ b/arch/x86/mm/init_32.c
> @@ -61,11 +61,22 @@ bool __read_mostly __vmalloc_start_set = false;
>
> static __init void *alloc_low_page(void)
> {
> - unsigned long pfn = pgt_buf_end++;
> + unsigned long pfn;
> void *adr;
>
> - if (pfn >= pgt_buf_top)
> - panic("alloc_low_page: ran out of memory");
> + if ((pgt_buf_end + 1) >= pgt_buf_top) {
> + unsigned long ret;
> + if (min_pfn_mapped >= max_pfn_mapped)
> + panic("alloc_low_page: ran out of memory");
> + ret = memblock_find_in_range(min_pfn_mapped << PAGE_SHIFT,
> + max_pfn_mapped << PAGE_SHIFT,
> + PAGE_SIZE, PAGE_SIZE);
> + if (!ret)
> + panic("alloc_low_page: can not alloc memory");
> + memblock_reserve(ret, PAGE_SIZE);
> + pfn = ret >> PAGE_SHIFT;
> + } else
> + pfn = pgt_buf_end++;
>
> adr = __va(pfn * PAGE_SIZE);
> clear_page(adr);
> diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
> index fa28e3e..eefaea6 100644
> --- a/arch/x86/mm/init_64.c
> +++ b/arch/x86/mm/init_64.c
> @@ -316,7 +316,7 @@ void __init cleanup_highmap(void)
>
> static __ref void *alloc_low_page(unsigned long *phys)
> {
> - unsigned long pfn = pgt_buf_end++;
> + unsigned long pfn;
> void *adr;
>
> if (after_bootmem) {
> @@ -326,8 +326,19 @@ static __ref void *alloc_low_page(unsigned long *phys)
> return adr;
> }
>
> - if (pfn >= pgt_buf_top)
> - panic("alloc_low_page: ran out of memory");
> + if ((pgt_buf_end + 1) >= pgt_buf_top) {
> + unsigned long ret;
> + if (min_pfn_mapped >= max_pfn_mapped)
> + panic("alloc_low_page: ran out of memory");
> + ret = memblock_find_in_range(min_pfn_mapped << PAGE_SHIFT,
> + max_pfn_mapped << PAGE_SHIFT,
> + PAGE_SIZE, PAGE_SIZE);
> + if (!ret)
> + panic("alloc_low_page: can not alloc memory");
> + memblock_reserve(ret, PAGE_SIZE);
> + pfn = ret >> PAGE_SHIFT;
> + } else
> + pfn = pgt_buf_end++;
>
> adr = early_memremap(pfn * PAGE_SIZE, PAGE_SIZE);
> clear_page(adr);
> --
> 1.7.7
>

2012-11-13 17:49:46

by Stefano Stabellini

[permalink] [raw]
Subject: Re: [PATCH 23/46] x86, mm: Remove parameter in alloc_low_page for 64bit

On Mon, 12 Nov 2012, Yinghai Lu wrote:
> Now all page table buf are pre-mapped, and could use virtual address directly.
> So don't need to remember physical address anymore.
>
> Remove that phys pointer in alloc_low_page(), and that will allow us to merge
> alloc_low_page between 64bit and 32bit.
>
> Signed-off-by: Yinghai Lu <[email protected]>

Acked-by: Stefano Stabellini <[email protected]>


> arch/x86/mm/init_64.c | 19 +++++++------------
> 1 files changed, 7 insertions(+), 12 deletions(-)
>
> diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
> index 5ee9242..1960820 100644
> --- a/arch/x86/mm/init_64.c
> +++ b/arch/x86/mm/init_64.c
> @@ -314,14 +314,13 @@ void __init cleanup_highmap(void)
> }
> }
>
> -static __ref void *alloc_low_page(unsigned long *phys)
> +static __ref void *alloc_low_page(void)
> {
> unsigned long pfn;
> void *adr;
>
> if (after_bootmem) {
> adr = (void *)get_zeroed_page(GFP_ATOMIC | __GFP_NOTRACK);
> - *phys = __pa(adr);
>
> return adr;
> }
> @@ -342,7 +341,6 @@ static __ref void *alloc_low_page(unsigned long *phys)
>
> adr = __va(pfn * PAGE_SIZE);
> clear_page(adr);
> - *phys = pfn * PAGE_SIZE;
> return adr;
> }
>
> @@ -401,7 +399,6 @@ phys_pmd_init(pmd_t *pmd_page, unsigned long address, unsigned long end,
> int i = pmd_index(address);
>
> for (; i < PTRS_PER_PMD; i++, address = next) {
> - unsigned long pte_phys;
> pmd_t *pmd = pmd_page + pmd_index(address);
> pte_t *pte;
> pgprot_t new_prot = prot;
> @@ -456,11 +453,11 @@ phys_pmd_init(pmd_t *pmd_page, unsigned long address, unsigned long end,
> continue;
> }
>
> - pte = alloc_low_page(&pte_phys);
> + pte = alloc_low_page();
> last_map_addr = phys_pte_init(pte, address, end, new_prot);
>
> spin_lock(&init_mm.page_table_lock);
> - pmd_populate_kernel(&init_mm, pmd, __va(pte_phys));
> + pmd_populate_kernel(&init_mm, pmd, pte);
> spin_unlock(&init_mm.page_table_lock);
> }
> update_page_count(PG_LEVEL_2M, pages);
> @@ -476,7 +473,6 @@ phys_pud_init(pud_t *pud_page, unsigned long addr, unsigned long end,
> int i = pud_index(addr);
>
> for (; i < PTRS_PER_PUD; i++, addr = next) {
> - unsigned long pmd_phys;
> pud_t *pud = pud_page + pud_index(addr);
> pmd_t *pmd;
> pgprot_t prot = PAGE_KERNEL;
> @@ -530,12 +526,12 @@ phys_pud_init(pud_t *pud_page, unsigned long addr, unsigned long end,
> continue;
> }
>
> - pmd = alloc_low_page(&pmd_phys);
> + pmd = alloc_low_page();
> last_map_addr = phys_pmd_init(pmd, addr, end, page_size_mask,
> prot);
>
> spin_lock(&init_mm.page_table_lock);
> - pud_populate(&init_mm, pud, __va(pmd_phys));
> + pud_populate(&init_mm, pud, pmd);
> spin_unlock(&init_mm.page_table_lock);
> }
> __flush_tlb_all();
> @@ -560,7 +556,6 @@ kernel_physical_mapping_init(unsigned long start,
>
> for (; start < end; start = next) {
> pgd_t *pgd = pgd_offset_k(start);
> - unsigned long pud_phys;
> pud_t *pud;
>
> next = (start + PGDIR_SIZE) & PGDIR_MASK;
> @@ -574,12 +569,12 @@ kernel_physical_mapping_init(unsigned long start,
> continue;
> }
>
> - pud = alloc_low_page(&pud_phys);
> + pud = alloc_low_page();
> last_map_addr = phys_pud_init(pud, __pa(start), __pa(next),
> page_size_mask);
>
> spin_lock(&init_mm.page_table_lock);
> - pgd_populate(&init_mm, pgd, __va(pud_phys));
> + pgd_populate(&init_mm, pgd, pud);
> spin_unlock(&init_mm.page_table_lock);
> pgd_changed = true;
> }
> --
> 1.7.7
>

2012-11-13 17:56:58

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [PATCH 28/46] x86, mm: Add pointer about Xen mmu requirement for alloc_low_pages

On 11/13/2012 08:38 AM, Stefano Stabellini wrote:
> On Mon, 12 Nov 2012, Yinghai Lu wrote:
>> From: Stefano Stabellini <[email protected]>
>>
>> Add link to commit 279b706 for more information
>>
>> Signed-off-by: Yinghai Lu <[email protected]>
>
> ah, here it is, OK then
>
>> arch/x86/mm/init.c | 5 +++++
>> 1 files changed, 5 insertions(+), 0 deletions(-)
>>
>> diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
>> index f5e0120..a7939ed 100644
>> --- a/arch/x86/mm/init.c
>> +++ b/arch/x86/mm/init.c
>> @@ -25,6 +25,11 @@ unsigned long __meminitdata pgt_buf_top;
>>
>> static unsigned long min_pfn_mapped;
>>
>> +/*
>> + * Pages returned are already directly mapped.
>> + *
>> + * Changing that is likely to break Xen, see commit 279b706 for detail info.
> ^detailed
>

When making references to git commits, please include the title of the
commit:

*
* Changing that is likely to break Xen, see commit:
*
* 279b706 x86,xen: introduce x86_init.mapping.pagetable_reserve
*
* for detailed information.

Otherwise it might be very confusing if we have an abbreviated hash
collision or end up having to change git to use another hash system.

-hpa

--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel. I don't speak on their behalf.

2012-11-13 18:34:30

by Yinghai Lu

[permalink] [raw]
Subject: Re: [PATCH 02/46] x86, mm: Split out split_mem_range from init_memory_mapping

On Mon, Nov 12, 2012 at 11:03 PM, <[email protected]> wrote:
> On 13 Nov 2012 14:20, Yinghai Lu wrote:
>>> 2012/11/13 6:17, Yinghai Lu wrote:
>>>>
>>>> + nr_range = 0;
>>>
>>> This is unnecessary since it is set in the below.
>>>
>>>> + nr_range = split_mem_range(mr, nr_range, start, end);
>>
>> ^^^^^^^^
>
> Why not use 0 directly?
>
> nr_range = split_mem_range(mr, 0, start, end);

yes, we could even remove nr_range from split_mem_range.

but the original intent was to "really" chop the code into two parts (the further simplification would look like the sketch below).
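
To make the comparison concrete, here are the three call-site variants being discussed; the last one is hypothetical and is not part of the posted series:

	/* as posted: */
	nr_range = 0;
	nr_range = split_mem_range(mr, nr_range, start, end);

	/* the suggested simplification: */
	nr_range = split_mem_range(mr, 0, start, end);

	/* going further, nr_range could be dropped from the signature: */
	nr_range = split_mem_range(mr, start, end);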

2012-11-13 18:52:03

by Yinghai Lu

[permalink] [raw]
Subject: Re: [PATCH 26/46] x86, mm, Xen: Remove mapping_pagetable_reserve()

On Tue, Nov 13, 2012 at 8:36 AM, Stefano Stabellini
<[email protected]> wrote:
> On Mon, 12 Nov 2012, Yinghai Lu wrote:
>> Page table areas are pre-mapped now after
>> x86, mm: setup page table in top-down
>> x86, mm: Remove early_memremap workaround for page table accessing on 64bit
>>
>> mapping_pagetable_reserve is not used anymore, so remove it.
>
> You should mention in the description of the patch that you are
> removing mask_rw_pte too.
>
> The reason why you can do that safely is that you previously modified
> alloc_low_page to always return pages that are already mapped, moreover
> xen_alloc_pte_init, xen_alloc_pmd_init, etc, will mark the page RO
> before hooking it into the pagetable automatically.

updated change log:
---
x86, mm, Xen: Remove mapping_pagetable_reserve()

Page table areas are pre-mapped now after
x86, mm: setup page table in top-down
x86, mm: Remove early_memremap workaround for page table
accessing on 64bit

mapping_pagetable_reserve is not used anymore, so remove it.

Also remove the operation in mask_rw_pte(), as the modified alloc_low_page
always returns pages that are already mapped; moreover,
xen_alloc_pte_init, xen_alloc_pmd_init, etc. will mark the page RO
before hooking it into the pagetable automatically.

-v2: add changelog about mask_rw_pte() from Stefano.
-----


>
> [ ... ]
>
> you should just get rid of mask_rw_pte completely

then how about 32bit mask_rw_pte? Maybe you can clean up that later?

Thanks

Yinghai

2012-11-13 18:53:09

by Yinghai Lu

[permalink] [raw]
Subject: Re: [PATCH 27/46] x86, mm: Add alloc_low_pages(num)

On Tue, Nov 13, 2012 at 8:37 AM, Stefano Stabellini
<[email protected]> wrote:
> On Mon, 12 Nov 2012, Yinghai Lu wrote:
>> 32bit kmap mapping needs pages to be used from low to high.
>> At this point those pages are still from pgt_buf_* from BRK, so it is
>> ok now.
>> But we want to move early_ioremap_page_table_range_init() out of
>> init_memory_mapping() and only call it one time later, that will
>> make page_table_range_init/page_table_kmap_check/alloc_low_page to
>> use memblock to get page.
>>
>> memblock allocation for pages is from high to low.
>> So we would get a panic from page_table_kmap_check(), which has a BUG_ON
>> to check the ordering.
>>
>> This patch adds alloc_low_pages to make it possible to allocate several
>> pages at first, and hand out pages one by one from low to high.
>>
>> -v2: add one line comment about xen requirements.
>
> where is it?

removed.
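
For reference, here is a minimal sketch of an allocator with the behaviour described in that changelog. It is only an illustration distilled from the description above, not the actual patch; the after_bootmem path and other details are omitted:

__ref void *alloc_low_pages(unsigned int num)
{
	unsigned long pfn;
	int i;

	if ((pgt_buf_end + num) > pgt_buf_top) {
		unsigned long ret;

		if (min_pfn_mapped >= max_pfn_mapped)
			panic("alloc_low_pages: ran out of memory");
		/* only search in ranges that are already mapped */
		ret = memblock_find_in_range(min_pfn_mapped << PAGE_SHIFT,
					     max_pfn_mapped << PAGE_SHIFT,
					     PAGE_SIZE * num, PAGE_SIZE);
		if (!ret)
			panic("alloc_low_pages: can not alloc memory");
		memblock_reserve(ret, PAGE_SIZE * num);
		pfn = ret >> PAGE_SHIFT;
	} else {
		pfn = pgt_buf_end;
		pgt_buf_end += num;
	}

	/* callers consume the pages from low address to high address */
	for (i = 0; i < num; i++)
		clear_page(__va((pfn + i) << PAGE_SHIFT));

	return __va(pfn << PAGE_SHIFT);
}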

2012-11-13 18:58:37

by Yinghai Lu

[permalink] [raw]
Subject: Re: [PATCH 28/46] x86, mm: Add pointer about Xen mmu requirement for alloc_low_pages

On Tue, Nov 13, 2012 at 9:56 AM, H. Peter Anvin <[email protected]> wrote:
> When making references to git commits, please include the title of the
> commit:
>
> *
> * Changing that is likely to break Xen, see commit:
> *
> * 279b706 x86,xen: introduce x86_init.mapping.pagetable_reserve
> *
> * for detailed information.
>
> Otherwise it might be very confusing if we have an abbreviated hash
> collision or end up having to change git to use another hash system.

updated to your version.

2012-11-13 19:59:42

by Yinghai Lu

[permalink] [raw]
Subject: Re: [PATCH 21/46] x86, mm: setup page table in top-down

On Tue, Nov 13, 2012 at 9:26 AM, Stefano Stabellini
<[email protected]> wrote:
> On Mon, 12 Nov 2012, Yinghai Lu wrote:
>> Get pgt_buf early from BRK, and use it to map PMD_SIZE from top at first.
>> Then use mapped pages to map more ranges below, and keep looping until
>> all pages get mapped.
>>
>> alloc_low_page will use pages from BRK at first; after that buffer is used
>> up, it will use memblock to find and reserve pages for page table usage.
>>
>> Introduce min_pfn_mapped to make sure we find new pages from mapped ranges;
>> it will be updated when lower pages get mapped.
>>
>> Also add step_size to make sure we don't try to map too big a range with
>> limited mapped pages initially, and increase the step_size when we have
>> more mapped pages on hand.
>>
>> At last we can get rid of calculation and find early pgt related code.
>>
>> -v2: update to after fix_xen change,
>> also use MACRO for initial pgt_buf size and add comments with it.
>> -v3: skip big reserved range in memblock.reserved near end.
>> -v4: don't need fix_xen change now.
>>
>> Suggested-by: "H. Peter Anvin" <[email protected]>
>> Signed-off-by: Yinghai Lu <[email protected]>
>
> The changes to alloc_low_page and early_alloc_pgt_buf look OK to me.
>
> The changes to init_mem_mapping are a bit iffy but they aren't too
> unreasonable.
> Overall the patch is OK even though I would certainly appreciate more
> comments and better variable names (real_end?), see below.

real_end is not good?

xen put big reserved range between real_end and end.

that real_end is some kind of end of real usable areas.

so change to real_usable_end or usable_end?

2012-11-13 20:01:44

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [PATCH 21/46] x86, mm: setup page table in top-down

On 11/13/2012 11:59 AM, Yinghai Lu wrote:
>>
>> The changes to init_mem_mapping are a bit iffy but they aren't too
>> unreasonable.
>> Overall the patch is OK even though I would certainly appreciate more
>> comments and better variable names (real_end?), see below.
>
> real_end is not good?
>
> xen put big reserved range between real_end and end.
>
> that real_end is some kind of end of real usable areas.
>
> so change to real_usable_end or usable_end?
>

A description of a variable that includes the words "some kind of"
clearly indicates major confusion. We need to know what the semantics
are, here.

-hpa

2012-11-13 20:36:30

by Yinghai Lu

[permalink] [raw]
Subject: Re: [PATCH 21/46] x86, mm: setup page table in top-down

On Tue, Nov 13, 2012 at 12:01 PM, H. Peter Anvin <[email protected]> wrote:
> On 11/13/2012 11:59 AM, Yinghai Lu wrote:
>>>
>>> The changes to init_mem_mapping are a bit iffy but they aren't too
>>> unreasonable.
>>> Overall the patch is OK even though I would certainly appreciate more
>>> comments and better variable names (real_end?), see below.
>>
>> real_end is not good?
>>
>> xen put big reserved range between real_end and end.
>>
>> that real_end is some kind of end of real usable areas.
>>
>> so change to real_usable_end or usable_end?
>>
>
> A description of a variable that includes the words "some kind of"
> clearly indicates major confusion. We need to know what the semantics
> are, here.

originally, we map range in this sequence:
1. map [0, 1M],
2. map 2M near max_pfn. and end is max_pfn<<PAGE_SHIFT
3. use new mapped area, to map low range.
4. change step size if more ram get mapped.
5. goto 3, until reach 1M.

now for xen, there is a big chunk of the range under max_pfn that is reserved
in memblock.reserved.
so even if we map it, we still cannot use it for page tables.

so 2 become:
2a. find real_end under max_pfn that we can use for page tables
2b, map [real_end - 2M, real_end)

and need to add
6. map [real_end, max_pfn<<PAGE_SHIFT)
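
A condensed sketch of that sequence, tying the step numbers to the init_mem_mapping() hunk quoted earlier in the thread (simplified: max() stands in for the open-coded clamp, and local variable names are shortened):

	init_memory_mapping(0, ISA_END_ADDRESS);	/* step 1 */

	/* step 2a: find usable space below xen's big reserved chunk */
	addr = memblock_find_in_range(ISA_END_ADDRESS, end, PMD_SIZE,
				      PAGE_SIZE);
	real_end = addr + PMD_SIZE;

	step_size = PMD_SIZE;
	last_start = start = real_end;
	/* the first pass below maps [real_end - 2M, real_end): step 2b */
	while (last_start > ISA_END_ADDRESS) {
		start = max(round_down(last_start - 1, step_size),
			    (unsigned long)ISA_END_ADDRESS);
		new_size = init_range_memory_mapping(start, last_start); /* step 3 */
		last_start = start;
		min_pfn_mapped = last_start >> PAGE_SHIFT;
		if (new_size > mapped_size)
			step_size <<= STEP_SIZE_SHIFT;	/* step 4 */
		mapped_size += new_size;
	}						/* step 5: loop until 1M */

	if (real_end < end)				/* step 6 */
		init_range_memory_mapping(real_end, end);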

2012-11-14 11:19:42

by Stefano Stabellini

[permalink] [raw]
Subject: Re: [PATCH 26/46] x86, mm, Xen: Remove mapping_pagetable_reserve()

On Tue, 13 Nov 2012, Yinghai Lu wrote:
> On Tue, Nov 13, 2012 at 8:36 AM, Stefano Stabellini
> <[email protected]> wrote:
> > On Mon, 12 Nov 2012, Yinghai Lu wrote:
> >> Page table areas are pre-mapped now after
> >> x86, mm: setup page table in top-down
> >> x86, mm: Remove early_memremap workaround for page table accessing on 64bit
> >>
> >> mapping_pagetable_reserve is not used anymore, so remove it.
> >
> > You should mention in the description of the patch that you are
> > removing mask_rw_pte too.
> >
> > The reason why you can do that safely is that you previously modified
> > alloc_low_page to always return pages that are already mapped, moreover
> > xen_alloc_pte_init, xen_alloc_pmd_init, etc, will mark the page RO
> > before hooking it into the pagetable automatically.
>
> updated change log:
> ---
> x86, mm, Xen: Remove mapping_pagetable_reserve()
>
> Page table areas are pre-mapped now after
> x86, mm: setup page table in top-down
> x86, mm: Remove early_memremap workaround for page table
> accessing on 64bit
>
> mapping_pagetable_reserve is not used anymore, so remove it.
>
> Also remove the operation in mask_rw_pte(), as the modified alloc_low_page
> always returns pages that are already mapped; moreover,
> xen_alloc_pte_init, xen_alloc_pmd_init, etc. will mark the page RO
> before hooking it into the pagetable automatically.
>
> -v2: add changelog about mask_rw_pte() from Stefano.

Thanks


> >
> > [ ... ]
> >
> > you should just get rid of mask_rw_pte completely
>
> then how about 32bit mask_rw_pte? Maybe you can clean up that later?

Yes, I can clean it up later.

However it is trivial: mask_rw_pte is only called by xen_set_pte_init
and in the 32bit case it already returns pte without modifications.

I would just remove the call to mask_rw_pte in xen_set_pte_init.
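
In diff form the suggested cleanup would be roughly the following; the body of xen_set_pte_init() shown here is an assumption about its current shape in arch/x86/xen/mmu.c, and the remaining setter call is written generically:

 static void __init xen_set_pte_init(pte_t *ptep, pte_t pte)
 {
-	pte = mask_rw_pte(ptep, pte);
 	set_pte(ptep, pte);
 }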

2012-11-15 19:28:23

by Yinghai Lu

[permalink] [raw]
Subject: Re: [PATCH 21/46] x86, mm: setup page table in top-down

On Tue, Nov 13, 2012 at 12:36 PM, Yinghai Lu <[email protected]> wrote:
> On Tue, Nov 13, 2012 at 12:01 PM, H. Peter Anvin <[email protected]> wrote:
>> On 11/13/2012 11:59 AM, Yinghai Lu wrote:
>>>>
>>>> The changes to init_mem_mapping are a bit iffy but they aren't too
>>>> unreasonable.
>>>> Overall the patch is OK even though I would certainly appreciate more
>>>> comments and better variable names (real_end?), see below.
>>>
>>> real_end is not good?
>>>
>>> xen put big reserved range between real_end and end.
>>>
>>> that real_end is some kind of end of real usable areas.
>>>
>>> so change to real_usable_end or usable_end?
>>>
>>
>> A description of a variable that includes the words "some kind of"
>> clearly indicates major confusion. We need to know what the semantics
>> are, here.
>
> originally, we map range in this sequence:
> 1. map [0, 1M],
> 2. map 2M near max_pfn. and end is max_pfn<<PAGE_SHIFT
> 3. use new mapped area, to map low range.
> 4. change step size if more ram get mapped.
> 5. goto 3, until reach 1M.
>
> now for xen, there is a big chunk of the range under max_pfn that is reserved
> in memblock.reserved.
> so even if we map it, we still cannot use it for page tables.
>
> so 2 become:
> 2a. find real_end under max_pfn that we can use for page tables
> 2b, map [real_end - 2M, real_end)
>
> and need to add
> 6. map [real_end, max_pfn<<PAGE_SHIFT)

hi, peter,

I updated the for-x86-mm branch at
git://git.kernel.org/pub/scm/linux/kernel/git/yinghai/linux-yinghai.git
for-x86-mm

with updated changelogs and Acked-by tags from Stefano for some patches.

2012-11-16 17:14:43

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [PATCH 6/7] x86, mm: setup page table from top-down

On 10/10/2012 10:38 AM, Yinghai Lu wrote:
> On Wed, Oct 10, 2012 at 10:26 AM, Stefano Stabellini
> <[email protected]> wrote:
>> On Wed, 10 Oct 2012, Yinghai Lu wrote:
>>
>> It doesn't matter whether they come from BRK or other memory: Xen
>> assumes that all the pagetable pages come from
>> pgt_buf_start-pgt_buf_top, so if you are going to use another range you
>> need to tell Xen about it.
>>
>> Alternatively, you can follow Peter's suggestion and replace the current
>> hooks with a new one with a more precise and well defined semantic.
>> Something along the lines of "this pagetable page is about to be hooked
>> into the live pagetable". Xen would use the hook to mark it RO.
>
> attached patch on top of this patch will fix the problem?
>

.mapping = {
- .pagetable_reserve = native_pagetable_reserve,
+ .mark_page_ro = mark_page_ro_noop;
},

I have already objected to this naming in the past, because it describes
an implementation ("hypervisor make readonly") as opposed to a semantic
function "make this page permissible to use as a page table". I would
call it pagetable_prepare or something like that.

-hpa

--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel. I don't speak on their behalf.
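
For illustration only, a hook named for its semantics could look something like this; the names pagetable_prepare and native_pagetable_prepare are invented for this sketch and are not from any posted patch:

/* called on a page that is about to be installed as a page table;
 * native needs nothing, Xen would convert its mapping to RO */
struct x86_init_mapping {
	void (*pagetable_prepare)(unsigned long pfn);
};

static void native_pagetable_prepare(unsigned long pfn)
{
	/* nothing to do on bare metal */
}

	/* and in x86_init: */
	.mapping = {
		.pagetable_prepare = native_pagetable_prepare,
	},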

2012-11-16 17:16:50

by Yinghai Lu

[permalink] [raw]
Subject: Re: [PATCH 6/7] x86, mm: setup page table from top-down

On Fri, Nov 16, 2012 at 9:14 AM, H. Peter Anvin <[email protected]> wrote:
> On 10/10/2012 10:38 AM, Yinghai Lu wrote:
>> attached patch on top of this patch will fix the problem?
>>
>
> .mapping = {
> - .pagetable_reserve = native_pagetable_reserve,
> + .mark_page_ro = mark_page_ro_noop;
> },
>
> I have already objected to this naming in the past, because it describes an
> implementation ("hypervisor make readonly") as opposed to a semantic
> function "make this page permissible to use as a page table". I would call
> it pagetable_prepare or something like that.

please check the v7 series with /46 in the title.