on top of tip/x86/mm2
1. Use BRK to map the final page table.
2. Remove early_ioremap from page table accessing.
v1-v2 changes: update the xen interface around pagetable_reserve so that
xen code no longer uses pgt_buf_* directly.
The series can be found at:
git://git.kernel.org/pub/scm/linux/kernel/git/yinghai/linux-yinghai.git for-x86-mm
Yinghai Lu (10):
x86, mm: align start address to correct big page size
x86, mm: Use big page size for small memory range
x86, mm: get early page table from BRK
x86, mm: Don't clear page table if next range is ram
x86, mm: Remove early_memremap workaround for page table accessing
x86, mm: only keep initial mapping for ram
x86, xen, mm: Do not need to check if page table is ioremap
x86, xen, mm: fix mapping_pagetable_reserve logic
x86, mm: Hide pgt_buf_* into internal to xen
x86, mm: Add early_pgt_buf_*
arch/x86/include/asm/init.h | 5 ++
arch/x86/include/asm/pgtable.h | 1 +
arch/x86/include/asm/pgtable_types.h | 1 -
arch/x86/include/asm/x86_init.h | 2 +-
arch/x86/kernel/setup.c | 2 +
arch/x86/kernel/x86_init.c | 3 +-
arch/x86/mm/init.c | 73 ++++++++++++++++++++++++---
arch/x86/mm/init_32.c | 9 +++-
arch/x86/mm/init_64.c | 91 ++++++++++++----------------------
arch/x86/xen/mmu.c | 26 +++------
10 files changed, 125 insertions(+), 88 deletions(-)
--
1.7.7
We are going to use a buffer in BRK to pre-map the final page table buffer.
That final buffer may only be page aligned, but the memory around it is
still RAM, so we can use a bigger page size to map it and avoid small pages.
We will probe and adjust page_size_mask in the next patch so that the big
page size can be used with such a small RAM range.
Before that, this patch makes the start address be aligned down according
to the bigger page size; otherwise the entry in the page table will not
have the correct value.
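A minimal illustration of the alignment problem (the address below is made
up purely for illustration; the real change is in the diff that follows):

	unsigned long address = 0x201000UL;	/* small range starting at 2M + 4K */
	unsigned long wrong = address >> PAGE_SHIFT;			/* 0x201 */
	unsigned long right = (address & PMD_MASK) >> PAGE_SHIFT;	/* 0x200 */

A 2M (PSE) entry must reference the 2M-aligned frame, so the pfn has to be
masked with PMD_MASK (aligned down) before it is written into the entry.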
Signed-off-by: Yinghai Lu <[email protected]>
---
arch/x86/mm/init_32.c | 1 +
arch/x86/mm/init_64.c | 5 +++--
2 files changed, 4 insertions(+), 2 deletions(-)
diff --git a/arch/x86/mm/init_32.c b/arch/x86/mm/init_32.c
index 11a5800..27f7fc6 100644
--- a/arch/x86/mm/init_32.c
+++ b/arch/x86/mm/init_32.c
@@ -310,6 +310,7 @@ repeat:
__pgprot(PTE_IDENT_ATTR |
_PAGE_PSE);
+ pfn &= PMD_MASK >> PAGE_SHIFT;
addr2 = (pfn + PTRS_PER_PTE-1) * PAGE_SIZE +
PAGE_OFFSET + PAGE_SIZE-1;
diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index ab558eb..f40f383 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -461,7 +461,7 @@ phys_pmd_init(pmd_t *pmd_page, unsigned long address, unsigned long end,
pages++;
spin_lock(&init_mm.page_table_lock);
set_pte((pte_t *)pmd,
- pfn_pte(address >> PAGE_SHIFT,
+ pfn_pte((address & PMD_MASK) >> PAGE_SHIFT,
__pgprot(pgprot_val(prot) | _PAGE_PSE)));
spin_unlock(&init_mm.page_table_lock);
last_map_addr = next;
@@ -536,7 +536,8 @@ phys_pud_init(pud_t *pud_page, unsigned long addr, unsigned long end,
pages++;
spin_lock(&init_mm.page_table_lock);
set_pte((pte_t *)pud,
- pfn_pte(addr >> PAGE_SHIFT, PAGE_KERNEL_LARGE));
+ pfn_pte((addr & PUD_MASK) >> PAGE_SHIFT,
+ PAGE_KERNEL_LARGE));
spin_unlock(&init_mm.page_table_lock);
last_map_addr = next;
continue;
--
1.7.7
We may map a small range in the middle of a big range first. That small
range should already use the big page size, otherwise mapping the
surrounding big range later would have to break the page table down into
small pages.
The big page bit can only be set when the area around that range is RAM
as well.
Signed-off-by: Yinghai Lu <[email protected]>
---
arch/x86/mm/init.c | 32 ++++++++++++++++++++++++++++++++
1 files changed, 32 insertions(+), 0 deletions(-)
diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
index cf662ba..0c58f2d 100644
--- a/arch/x86/mm/init.c
+++ b/arch/x86/mm/init.c
@@ -88,6 +88,35 @@ static int __meminit save_mr(struct map_range *mr, int nr_range,
return nr_range;
}
+/*
+ * adjust the page_size_mask for small range to go with
+ * big page size instead small one if nearby are ram too.
+ */
+static void __init_refok adjust_range_page_size_mask(struct map_range *mr,
+ int nr_range)
+{
+ int i;
+
+ for (i = 0; i < nr_range; i++) {
+ if ((page_size_mask & (1<<PG_LEVEL_2M)) &&
+ !(mr[i].page_size_mask & (1<<PG_LEVEL_2M))) {
+ unsigned long start = round_down(mr[i].start, PMD_SIZE);
+ unsigned long end = round_up(mr[i].end, PMD_SIZE);
+
+ if (memblock_is_region_memory(start, end - start))
+ mr[i].page_size_mask |= 1<<PG_LEVEL_2M;
+ }
+ if ((page_size_mask & (1<<PG_LEVEL_1G)) &&
+ !(mr[i].page_size_mask & (1<<PG_LEVEL_1G))) {
+ unsigned long start = round_down(mr[i].start, PUD_SIZE);
+ unsigned long end = round_up(mr[i].end, PUD_SIZE);
+
+ if (memblock_is_region_memory(start, end - start))
+ mr[i].page_size_mask |= 1<<PG_LEVEL_1G;
+ }
+ }
+}
+
static int __meminit split_mem_range(struct map_range *mr, int nr_range,
unsigned long start,
unsigned long end)
@@ -182,6 +211,9 @@ static int __meminit split_mem_range(struct map_range *mr, int nr_range,
nr_range--;
}
+ if (!after_bootmem)
+ adjust_range_page_size_mask(mr, nr_range);
+
for (i = 0; i < nr_range; i++)
printk(KERN_DEBUG " [mem %#010lx-%#010lx] page %s\n",
mr[i].start, mr[i].end - 1,
--
1.7.7
Get pgt_buf early from BRK, and use it to map the page table at first.
Also use what is left in it first, then use a new one.
-v2: add an extra xen callback for that new range.
-v3: fix a compile problem with the %#llx printout format, reported by Fengguang.
-v4: remove the early_pgt_buf_* stuff, because the xen interface is not ready yet.
We still need to run x86_init.mapping.pagetable_reserve separately,
so we will waste some pages in BRK; that will be addressed later.
Signed-off-by: Yinghai Lu <[email protected]>
---
arch/x86/include/asm/pgtable.h | 1 +
arch/x86/kernel/setup.c | 2 ++
arch/x86/mm/init.c | 26 +++++++++++++++++++++++++-
3 files changed, 28 insertions(+), 1 deletions(-)
diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 52d40a1..25fa5bb 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -599,6 +599,7 @@ static inline int pgd_none(pgd_t pgd)
extern int direct_gbpages;
void init_mem_mapping(void);
+void early_alloc_pgt_buf(void);
/* local pte updates need not use xchg for locking */
static inline pte_t native_local_ptep_get_and_clear(pte_t *ptep)
diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index 4989f80..7eb6855 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -896,6 +896,8 @@ void __init setup_arch(char **cmdline_p)
reserve_ibft_region();
+ early_alloc_pgt_buf();
+
/*
* Need to conclude brk, before memblock_x86_fill()
* it could use memblock_find_in_range, could overlap with
diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
index 0c58f2d..a89f485 100644
--- a/arch/x86/mm/init.c
+++ b/arch/x86/mm/init.c
@@ -317,12 +317,22 @@ static void __init find_early_table_space(unsigned long start,
unsigned long good_end,
unsigned long tables)
{
- phys_addr_t base;
+ unsigned long base;
base = memblock_find_in_range(start, good_end, tables, PAGE_SIZE);
if (!base)
panic("Cannot find space for the kernel page tables");
+ init_memory_mapping(base, base + tables);
+ if (pgt_buf_end > pgt_buf_start) {
+ printk(KERN_DEBUG "kernel direct mapping tables from %#lx to %#lx @ [mem %#010lx-%#010lx]\n",
+ base, base + tables - 1, pgt_buf_start << PAGE_SHIFT,
+ (pgt_buf_end << PAGE_SHIFT) - 1);
+
+ x86_init.mapping.pagetable_reserve(PFN_PHYS(pgt_buf_start),
+ PFN_PHYS(pgt_buf_end));
+ }
+
pgt_buf_start = base >> PAGE_SHIFT;
pgt_buf_end = pgt_buf_start;
pgt_buf_top = pgt_buf_start + (tables >> PAGE_SHIFT);
@@ -469,6 +479,20 @@ void __init init_mem_mapping(void)
early_memtest(0, max_pfn_mapped << PAGE_SHIFT);
}
+RESERVE_BRK(early_pgt_alloc, 16384);
+
+void __init early_alloc_pgt_buf(void)
+{
+ unsigned long tables = 16384;
+ phys_addr_t base;
+
+ base = __pa(extend_brk(tables, PAGE_SIZE));
+
+ pgt_buf_start = base >> PAGE_SHIFT;
+ pgt_buf_end = pgt_buf_start;
+ pgt_buf_top = pgt_buf_start + (tables >> PAGE_SHIFT);
+}
+
/*
* devmem_is_allowed() checks to see if /dev/mem access to a certain address
* is valid. The argument is a physical page number.
--
1.7.7
After the patch that adds code to use BRK to map the buffer for the final
page table, it should be safe to remove early_memremap from page table
accessing. But we get a panic with that.
It turns out we clear the initial page table wrongly for the next range
when the ranges are separated by holes, and it only happens when we try to
map the ranges one by one, separately.
After adding a check before clearing the related page table entries, that
panic does not happen anymore.
Signed-off-by: Yinghai Lu <[email protected]>
---
arch/x86/mm/init_64.c | 37 ++++++++++++++++---------------------
1 files changed, 16 insertions(+), 21 deletions(-)
diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index f40f383..61b3c44 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -363,20 +363,19 @@ static unsigned long __meminit
phys_pte_init(pte_t *pte_page, unsigned long addr, unsigned long end,
pgprot_t prot)
{
- unsigned pages = 0;
+ unsigned long pages = 0, next;
unsigned long last_map_addr = end;
int i;
pte_t *pte = pte_page + pte_index(addr);
- for(i = pte_index(addr); i < PTRS_PER_PTE; i++, addr += PAGE_SIZE, pte++) {
-
+ for (i = pte_index(addr); i < PTRS_PER_PTE; i++, addr = next, pte++) {
+ next = (addr & PAGE_MASK) + PAGE_SIZE;
if (addr >= end) {
- if (!after_bootmem) {
- for(; i < PTRS_PER_PTE; i++, pte++)
- set_pte(pte, __pte(0));
- }
- break;
+ if (!after_bootmem &&
+ !e820_any_mapped(addr & PAGE_MASK, next, 0))
+ set_pte(pte, __pte(0));
+ continue;
}
/*
@@ -418,16 +417,14 @@ phys_pmd_init(pmd_t *pmd_page, unsigned long address, unsigned long end,
pte_t *pte;
pgprot_t new_prot = prot;
+ next = (address & PMD_MASK) + PMD_SIZE;
if (address >= end) {
- if (!after_bootmem) {
- for (; i < PTRS_PER_PMD; i++, pmd++)
- set_pmd(pmd, __pmd(0));
- }
- break;
+ if (!after_bootmem &&
+ !e820_any_mapped(address & PMD_MASK, next, 0))
+ set_pmd(pmd, __pmd(0));
+ continue;
}
- next = (address & PMD_MASK) + PMD_SIZE;
-
if (pmd_val(*pmd)) {
if (!pmd_large(*pmd)) {
spin_lock(&init_mm.page_table_lock);
@@ -494,13 +491,11 @@ phys_pud_init(pud_t *pud_page, unsigned long addr, unsigned long end,
pmd_t *pmd;
pgprot_t prot = PAGE_KERNEL;
- if (addr >= end)
- break;
-
next = (addr & PUD_MASK) + PUD_SIZE;
-
- if (!after_bootmem && !e820_any_mapped(addr, next, 0)) {
- set_pud(pud, __pud(0));
+ if (addr >= end) {
+ if (!after_bootmem &&
+ !e820_any_mapped(addr & PUD_MASK, next, 0))
+ set_pud(pud, __pud(0));
continue;
}
--
1.7.7
The early_memremap workaround is not needed anymore after the patches that
pre-map the page table buffer and stop clearing the initial page table
wrongly.
Signed-off-by: Yinghai Lu <[email protected]>
---
arch/x86/mm/init_64.c | 38 ++++----------------------------------
1 files changed, 4 insertions(+), 34 deletions(-)
diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index 61b3c44..4b21950 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -329,36 +329,12 @@ static __ref void *alloc_low_page(unsigned long *phys)
if (pfn >= pgt_buf_top)
panic("alloc_low_page: ran out of memory");
- adr = early_memremap(pfn * PAGE_SIZE, PAGE_SIZE);
+ adr = __va(pfn * PAGE_SIZE);
clear_page(adr);
*phys = pfn * PAGE_SIZE;
return adr;
}
-static __ref void *map_low_page(void *virt)
-{
- void *adr;
- unsigned long phys, left;
-
- if (after_bootmem)
- return virt;
-
- phys = __pa(virt);
- left = phys & (PAGE_SIZE - 1);
- adr = early_memremap(phys & PAGE_MASK, PAGE_SIZE);
- adr = (void *)(((unsigned long)adr) | left);
-
- return adr;
-}
-
-static __ref void unmap_low_page(void *adr)
-{
- if (after_bootmem)
- return;
-
- early_iounmap((void *)((unsigned long)adr & PAGE_MASK), PAGE_SIZE);
-}
-
static unsigned long __meminit
phys_pte_init(pte_t *pte_page, unsigned long addr, unsigned long end,
pgprot_t prot)
@@ -428,10 +404,9 @@ phys_pmd_init(pmd_t *pmd_page, unsigned long address, unsigned long end,
if (pmd_val(*pmd)) {
if (!pmd_large(*pmd)) {
spin_lock(&init_mm.page_table_lock);
- pte = map_low_page((pte_t *)pmd_page_vaddr(*pmd));
+ pte = (pte_t *)pmd_page_vaddr(*pmd);
last_map_addr = phys_pte_init(pte, address,
end, prot);
- unmap_low_page(pte);
spin_unlock(&init_mm.page_table_lock);
continue;
}
@@ -467,7 +442,6 @@ phys_pmd_init(pmd_t *pmd_page, unsigned long address, unsigned long end,
pte = alloc_low_page(&pte_phys);
last_map_addr = phys_pte_init(pte, address, end, new_prot);
- unmap_low_page(pte);
spin_lock(&init_mm.page_table_lock);
pmd_populate_kernel(&init_mm, pmd, __va(pte_phys));
@@ -501,10 +475,9 @@ phys_pud_init(pud_t *pud_page, unsigned long addr, unsigned long end,
if (pud_val(*pud)) {
if (!pud_large(*pud)) {
- pmd = map_low_page(pmd_offset(pud, 0));
+ pmd = pmd_offset(pud, 0);
last_map_addr = phys_pmd_init(pmd, addr, end,
page_size_mask, prot);
- unmap_low_page(pmd);
__flush_tlb_all();
continue;
}
@@ -541,7 +514,6 @@ phys_pud_init(pud_t *pud_page, unsigned long addr, unsigned long end,
pmd = alloc_low_page(&pmd_phys);
last_map_addr = phys_pmd_init(pmd, addr, end, page_size_mask,
prot);
- unmap_low_page(pmd);
spin_lock(&init_mm.page_table_lock);
pud_populate(&init_mm, pud, __va(pmd_phys));
@@ -577,17 +549,15 @@ kernel_physical_mapping_init(unsigned long start,
next = end;
if (pgd_val(*pgd)) {
- pud = map_low_page((pud_t *)pgd_page_vaddr(*pgd));
+ pud = (pud_t *)pgd_page_vaddr(*pgd);
last_map_addr = phys_pud_init(pud, __pa(start),
__pa(end), page_size_mask);
- unmap_low_page(pud);
continue;
}
pud = alloc_low_page(&pud_phys);
last_map_addr = phys_pud_init(pud, __pa(start), __pa(next),
page_size_mask);
- unmap_low_page(pud);
spin_lock(&init_mm.page_table_lock);
pgd_populate(&init_mm, pgd, __va(pud_phys));
--
1.7.7
Passing type 0 to e820_any_mapped() means any e820 type keeps its mapping
and only holes are removed.
Change it to keep the mapping only for E820_RAM and E820_RESERVED_KERN.
Signed-off-by: Yinghai Lu <[email protected]>
---
arch/x86/mm/init_64.c | 9 ++++++---
1 files changed, 6 insertions(+), 3 deletions(-)
diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index 4b21950..aab1fc1 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -349,7 +349,8 @@ phys_pte_init(pte_t *pte_page, unsigned long addr, unsigned long end,
next = (addr & PAGE_MASK) + PAGE_SIZE;
if (addr >= end) {
if (!after_bootmem &&
- !e820_any_mapped(addr & PAGE_MASK, next, 0))
+ !e820_any_mapped(addr & PAGE_MASK, next, E820_RAM) &&
+ !e820_any_mapped(addr & PAGE_MASK, next, E820_RESERVED_KERN))
set_pte(pte, __pte(0));
continue;
}
@@ -396,7 +397,8 @@ phys_pmd_init(pmd_t *pmd_page, unsigned long address, unsigned long end,
next = (address & PMD_MASK) + PMD_SIZE;
if (address >= end) {
if (!after_bootmem &&
- !e820_any_mapped(address & PMD_MASK, next, 0))
+ !e820_any_mapped(address & PMD_MASK, next, E820_RAM) &&
+ !e820_any_mapped(address & PMD_MASK, next, E820_RESERVED_KERN))
set_pmd(pmd, __pmd(0));
continue;
}
@@ -468,7 +470,8 @@ phys_pud_init(pud_t *pud_page, unsigned long addr, unsigned long end,
next = (addr & PUD_MASK) + PUD_SIZE;
if (addr >= end) {
if (!after_bootmem &&
- !e820_any_mapped(addr & PUD_MASK, next, 0))
+ !e820_any_mapped(addr & PUD_MASK, next, E820_RAM) &&
+ !e820_any_mapped(addr & PUD_MASK, next, E820_RESERVED_KERN))
set_pud(pud, __pud(0));
continue;
}
--
1.7.7
All page table buffers are pre-mapped now and can be accessed with _va.
Remove the checking that is no longer needed.
Signed-off-by: Yinghai Lu <[email protected]>
Cc: Konrad Rzeszutek Wilk <[email protected]>
Cc: Jeremy Fitzhardinge <[email protected]>
---
arch/x86/xen/mmu.c | 8 ++------
1 files changed, 2 insertions(+), 6 deletions(-)
diff --git a/arch/x86/xen/mmu.c b/arch/x86/xen/mmu.c
index 7a769b7..9c0956c 100644
--- a/arch/x86/xen/mmu.c
+++ b/arch/x86/xen/mmu.c
@@ -1412,13 +1412,9 @@ static pte_t __init mask_rw_pte(pte_t *ptep, pte_t pte)
/*
* If the new pfn is within the range of the newly allocated
- * kernel pagetable, and it isn't being mapped into an
- * early_ioremap fixmap slot as a freshly allocated page, make sure
- * it is RO.
+ * kernel pagetable, make sure it is RO.
*/
- if (((!is_early_ioremap_ptep(ptep) &&
- pfn >= pgt_buf_start && pfn < pgt_buf_top)) ||
- (is_early_ioremap_ptep(ptep) && pfn != (pgt_buf_end - 1)))
+ if (pfn >= pgt_buf_start && pfn < pgt_buf_top)
pte = pte_wrprotect(pte);
return pte;
--
1.7.7
The current code has a hidden dependency on pgt_buf_top, so we can not call
the hook repeatedly with different pgt_buf_top values.
Actually its main purpose is to set some pages back to RW.
Split that out into make_range_readwrite, a name that reflects what the
function really does.
Signed-off-by: Yinghai Lu <[email protected]>
Cc: Konrad Rzeszutek Wilk <[email protected]>
Cc: Jeremy Fitzhardinge <[email protected]>
---
arch/x86/include/asm/pgtable_types.h | 1 -
arch/x86/include/asm/x86_init.h | 2 +-
arch/x86/kernel/x86_init.c | 3 ++-
arch/x86/mm/init.c | 16 ++++++++--------
arch/x86/xen/mmu.c | 18 +++++++-----------
5 files changed, 18 insertions(+), 22 deletions(-)
diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h
index db8fec6..b1a7107 100644
--- a/arch/x86/include/asm/pgtable_types.h
+++ b/arch/x86/include/asm/pgtable_types.h
@@ -301,7 +301,6 @@ int phys_mem_access_prot_allowed(struct file *file, unsigned long pfn,
/* Install a pte for a particular vaddr in kernel space. */
void set_pte_vaddr(unsigned long vaddr, pte_t pte);
-extern void native_pagetable_reserve(u64 start, u64 end);
#ifdef CONFIG_X86_32
extern void native_pagetable_init(void);
#else
diff --git a/arch/x86/include/asm/x86_init.h b/arch/x86/include/asm/x86_init.h
index 5769349..357d055 100644
--- a/arch/x86/include/asm/x86_init.h
+++ b/arch/x86/include/asm/x86_init.h
@@ -76,7 +76,7 @@ struct x86_init_oem {
* init_memory_mapping and the commit that added it.
*/
struct x86_init_mapping {
- void (*pagetable_reserve)(u64 start, u64 end);
+ void (*make_range_readwrite)(u64 start, u64 end);
};
/**
diff --git a/arch/x86/kernel/x86_init.c b/arch/x86/kernel/x86_init.c
index 7a3d075..dee4021 100644
--- a/arch/x86/kernel/x86_init.c
+++ b/arch/x86/kernel/x86_init.c
@@ -28,6 +28,7 @@ void __cpuinit x86_init_noop(void) { }
void __init x86_init_uint_noop(unsigned int unused) { }
int __init iommu_init_noop(void) { return 0; }
void iommu_shutdown_noop(void) { }
+static void make_range_readwrite_noop(u64 start, u64 end) { }
/*
* The platform setup functions are preset with the default functions
@@ -63,7 +64,7 @@ struct x86_init_ops x86_init __initdata = {
},
.mapping = {
- .pagetable_reserve = native_pagetable_reserve,
+ .make_range_readwrite = make_range_readwrite_noop,
},
.paging = {
diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
index a89f485..6622d35 100644
--- a/arch/x86/mm/init.c
+++ b/arch/x86/mm/init.c
@@ -61,10 +61,6 @@ static void __init probe_page_size_mask(void)
__supported_pte_mask |= _PAGE_GLOBAL;
}
}
-void __init native_pagetable_reserve(u64 start, u64 end)
-{
- memblock_reserve(start, end - start);
-}
#ifdef CONFIG_X86_32
#define NR_RANGE_MR 3
@@ -329,9 +325,11 @@ static void __init find_early_table_space(unsigned long start,
base, base + tables - 1, pgt_buf_start << PAGE_SHIFT,
(pgt_buf_end << PAGE_SHIFT) - 1);
- x86_init.mapping.pagetable_reserve(PFN_PHYS(pgt_buf_start),
- PFN_PHYS(pgt_buf_end));
+ memblock_reserve(PFN_PHYS(pgt_buf_start),
+ PFN_PHYS(pgt_buf_end) - PFN_PHYS(pgt_buf_start));
}
+ x86_init.mapping.make_range_readwrite(PFN_PHYS(pgt_buf_end),
+ PFN_PHYS(pgt_buf_top));
pgt_buf_start = base >> PAGE_SHIFT;
pgt_buf_end = pgt_buf_start;
@@ -469,9 +467,11 @@ void __init init_mem_mapping(void)
printk(KERN_DEBUG "kernel direct mapping tables up to %#lx @ [mem %#010lx-%#010lx] final\n",
end - 1, pgt_buf_start << PAGE_SHIFT,
(pgt_buf_end << PAGE_SHIFT) - 1);
- x86_init.mapping.pagetable_reserve(PFN_PHYS(pgt_buf_start),
- PFN_PHYS(pgt_buf_end));
+ memblock_reserve(PFN_PHYS(pgt_buf_start),
+ PFN_PHYS(pgt_buf_end) - PFN_PHYS(pgt_buf_start));
}
+ x86_init.mapping.make_range_readwrite(PFN_PHYS(pgt_buf_end),
+ PFN_PHYS(pgt_buf_top));
/* stop the wrong using */
pgt_buf_top = 0;
diff --git a/arch/x86/xen/mmu.c b/arch/x86/xen/mmu.c
index 9c0956c..7607a33 100644
--- a/arch/x86/xen/mmu.c
+++ b/arch/x86/xen/mmu.c
@@ -1183,17 +1183,13 @@ static void __init xen_pagetable_init(void)
xen_post_allocator_init();
}
-static __init void xen_mapping_pagetable_reserve(u64 start, u64 end)
+static __init void xen_make_range_readwrite(u64 start, u64 end)
{
- /* reserve the range used */
- native_pagetable_reserve(start, end);
-
- /* set as RW the rest */
- printk(KERN_DEBUG "xen: setting RW the range %llx - %llx\n", end,
- PFN_PHYS(pgt_buf_top));
- while (end < PFN_PHYS(pgt_buf_top)) {
- make_lowmem_page_readwrite(__va(end));
- end += PAGE_SIZE;
+ printk(KERN_DEBUG "xen: setting RW the range %llx - %llx\n",
+ start, end);
+ while (start < end) {
+ make_lowmem_page_readwrite(__va(start));
+ start += PAGE_SIZE;
}
}
@@ -2060,7 +2056,7 @@ static const struct pv_mmu_ops xen_mmu_ops __initconst = {
void __init xen_init_mmu_ops(void)
{
- x86_init.mapping.pagetable_reserve = xen_mapping_pagetable_reserve;
+ x86_init.mapping.make_range_readwrite = xen_make_range_readwrite;
x86_init.paging.pagetable_init = xen_pagetable_init;
pv_mmu_ops = xen_mmu_ops;
--
1.7.7
pgt_buf_* should only be used by init.c, init_32.c and init_64.c.
Signed-off-by: Yinghai Lu <[email protected]>
Cc: Konrad Rzeszutek Wilk <[email protected]>
Cc: Jeremy Fitzhardinge <[email protected]>
---
arch/x86/include/asm/init.h | 2 ++
arch/x86/mm/init.c | 5 +++++
arch/x86/xen/mmu.c | 2 +-
3 files changed, 8 insertions(+), 1 deletions(-)
diff --git a/arch/x86/include/asm/init.h b/arch/x86/include/asm/init.h
index 4f13998..b69e537 100644
--- a/arch/x86/include/asm/init.h
+++ b/arch/x86/include/asm/init.h
@@ -16,4 +16,6 @@ extern unsigned long __initdata pgt_buf_start;
extern unsigned long __meminitdata pgt_buf_end;
extern unsigned long __meminitdata pgt_buf_top;
+bool is_pfn_in_early_pgt_buf(unsigned long pfn);
+
#endif /* _ASM_X86_INIT_32_H */
diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
index 6622d35..1becfbd 100644
--- a/arch/x86/mm/init.c
+++ b/arch/x86/mm/init.c
@@ -21,6 +21,11 @@ unsigned long __initdata pgt_buf_start;
unsigned long __meminitdata pgt_buf_end;
unsigned long __meminitdata pgt_buf_top;
+bool __init is_pfn_in_early_pgt_buf(unsigned long pfn)
+{
+ return pfn >= pgt_buf_start && pfn < pgt_buf_top;
+}
+
int after_bootmem;
int direct_gbpages
diff --git a/arch/x86/xen/mmu.c b/arch/x86/xen/mmu.c
index 7607a33..25f68bc 100644
--- a/arch/x86/xen/mmu.c
+++ b/arch/x86/xen/mmu.c
@@ -1410,7 +1410,7 @@ static pte_t __init mask_rw_pte(pte_t *ptep, pte_t pte)
* If the new pfn is within the range of the newly allocated
* kernel pagetable, make sure it is RO.
*/
- if (pfn >= pgt_buf_start && pfn < pgt_buf_top)
+ if (is_pfn_in_early_pgt_buf(pfn))
pte = pte_wrprotect(pte);
return pte;
--
1.7.7
So we can use what is left of the early pgt buf in BRK first, and only then
use the new one; that way we avoid wasting pages in BRK.
Also we do not need to memblock_reserve that buffer in BRK again, because
the whole BRK area is already reserved earlier.
Signed-off-by: Yinghai Lu <[email protected]>
---
arch/x86/include/asm/init.h | 3 +++
arch/x86/mm/init.c | 24 +++++++++++-------------
arch/x86/mm/init_32.c | 8 ++++++--
arch/x86/mm/init_64.c | 8 ++++++--
4 files changed, 26 insertions(+), 17 deletions(-)
diff --git a/arch/x86/include/asm/init.h b/arch/x86/include/asm/init.h
index b69e537..6ce8718 100644
--- a/arch/x86/include/asm/init.h
+++ b/arch/x86/include/asm/init.h
@@ -15,6 +15,9 @@ kernel_physical_mapping_init(unsigned long start,
extern unsigned long __initdata pgt_buf_start;
extern unsigned long __meminitdata pgt_buf_end;
extern unsigned long __meminitdata pgt_buf_top;
+extern unsigned long __initdata early_pgt_buf_start;
+extern unsigned long __meminitdata early_pgt_buf_end;
+extern unsigned long __meminitdata early_pgt_buf_top;
bool is_pfn_in_early_pgt_buf(unsigned long pfn);
diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
index 1becfbd..34d47c3 100644
--- a/arch/x86/mm/init.c
+++ b/arch/x86/mm/init.c
@@ -20,10 +20,14 @@
unsigned long __initdata pgt_buf_start;
unsigned long __meminitdata pgt_buf_end;
unsigned long __meminitdata pgt_buf_top;
+unsigned long __initdata early_pgt_buf_start;
+unsigned long __meminitdata early_pgt_buf_end;
+unsigned long __meminitdata early_pgt_buf_top;
bool __init is_pfn_in_early_pgt_buf(unsigned long pfn)
{
- return pfn >= pgt_buf_start && pfn < pgt_buf_top;
+ return (pfn >= early_pgt_buf_start && pfn < early_pgt_buf_top) ||
+ (pfn >= pgt_buf_start && pfn < pgt_buf_top);
}
int after_bootmem;
@@ -325,16 +329,10 @@ static void __init find_early_table_space(unsigned long start,
panic("Cannot find space for the kernel page tables");
init_memory_mapping(base, base + tables);
- if (pgt_buf_end > pgt_buf_start) {
+ if (early_pgt_buf_end > early_pgt_buf_start)
printk(KERN_DEBUG "kernel direct mapping tables from %#lx to %#lx @ [mem %#010lx-%#010lx]\n",
- base, base + tables - 1, pgt_buf_start << PAGE_SHIFT,
- (pgt_buf_end << PAGE_SHIFT) - 1);
-
- memblock_reserve(PFN_PHYS(pgt_buf_start),
- PFN_PHYS(pgt_buf_end) - PFN_PHYS(pgt_buf_start));
- }
- x86_init.mapping.make_range_readwrite(PFN_PHYS(pgt_buf_end),
- PFN_PHYS(pgt_buf_top));
+ base, base + tables - 1, early_pgt_buf_start << PAGE_SHIFT,
+ (early_pgt_buf_end << PAGE_SHIFT) - 1);
pgt_buf_start = base >> PAGE_SHIFT;
pgt_buf_end = pgt_buf_start;
@@ -493,9 +491,9 @@ void __init early_alloc_pgt_buf(void)
base = __pa(extend_brk(tables, PAGE_SIZE));
- pgt_buf_start = base >> PAGE_SHIFT;
- pgt_buf_end = pgt_buf_start;
- pgt_buf_top = pgt_buf_start + (tables >> PAGE_SHIFT);
+ early_pgt_buf_start = base >> PAGE_SHIFT;
+ early_pgt_buf_end = early_pgt_buf_start;
+ early_pgt_buf_top = early_pgt_buf_start + (tables >> PAGE_SHIFT);
}
/*
diff --git a/arch/x86/mm/init_32.c b/arch/x86/mm/init_32.c
index 27f7fc6..70fa732 100644
--- a/arch/x86/mm/init_32.c
+++ b/arch/x86/mm/init_32.c
@@ -61,10 +61,14 @@ bool __read_mostly __vmalloc_start_set = false;
static __init void *alloc_low_page(void)
{
- unsigned long pfn = pgt_buf_end++;
+ unsigned long pfn;
void *adr;
- if (pfn >= pgt_buf_top)
+ if (early_pgt_buf_end < early_pgt_buf_top)
+ pfn = early_pgt_buf_end++;
+ else if (pgt_buf_end < pgt_buf_top)
+ pfn = pgt_buf_end++;
+ else
panic("alloc_low_page: ran out of memory");
adr = __va(pfn * PAGE_SIZE);
diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index aab1fc1..110c1e4 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -316,7 +316,7 @@ void __init cleanup_highmap(void)
static __ref void *alloc_low_page(unsigned long *phys)
{
- unsigned long pfn = pgt_buf_end++;
+ unsigned long pfn;
void *adr;
if (after_bootmem) {
@@ -326,7 +326,11 @@ static __ref void *alloc_low_page(unsigned long *phys)
return adr;
}
- if (pfn >= pgt_buf_top)
+ if (early_pgt_buf_end < early_pgt_buf_top)
+ pfn = early_pgt_buf_end++;
+ else if (pgt_buf_end < pgt_buf_top)
+ pfn = pgt_buf_end++;
+ else
panic("alloc_low_page: ran out of memory");
adr = __va(pfn * PAGE_SIZE);
--
1.7.7
On 10/09/2012 12:39 PM, Yinghai Lu wrote:
> on top of tip/x86/mm2
Hi Yinghai,
This patchset doesn't apply on top of tip:x86/mm2, starting at patch 08/10.
Furthermore, a lot of the descriptions are both incomplete and
incomprehensible, which makes the patchset very very hard to review.
You should think of the descriptions as writing a (short) paper on your
patchset, where you describe not just what you are doing, but how you
accomplish the objective and why that objective is the right thing to
accomplish.
Overall, this patchset seems to be going in the right direction, but I
have several concerns with details.
-hpa
On 10/09/2012 12:39 PM, Yinghai Lu wrote:
> */
> struct x86_init_mapping {
> - void (*pagetable_reserve)(u64 start, u64 end);
> + void (*make_range_readwrite)(u64 start, u64 end);
> };
>
Here you go from one misleading name to another. Another classic case
of "why hooks suck."
make_range_readwrite is particularly toxic, though, because it makes it
sound like it is something along the lines of set_memory_rw(), which it
most distinctly is not.
Furthermore, the naming makes me really puzzled why it is there at all.
It makes me suspect, but because the patchset is so messy, it is hard
for me to immediately prove, that you're still missing something important.
However, for example:
> unsigned long __initdata pgt_buf_start;
> unsigned long __meminitdata pgt_buf_end;
> unsigned long __meminitdata pgt_buf_top;
> +unsigned long __initdata early_pgt_buf_start;
> +unsigned long __meminitdata early_pgt_buf_end;
> +unsigned long __meminitdata early_pgt_buf_top;
>
> bool __init is_pfn_in_early_pgt_buf(unsigned long pfn)
> {
> - return pfn >= pgt_buf_start && pfn < pgt_buf_top;
> + return (pfn >= early_pgt_buf_start && pfn < early_pgt_buf_top) ||
> + (pfn >= pgt_buf_start && pfn < pgt_buf_top);
> }
Magic variables augmented with more magic variables. Why? This also
seems to assume that we still do all the kernel page tables in one
chunk, which is exactly what we don't want to do.
-hpa
On Mon, Oct 8, 2012 at 11:07 PM, H. Peter Anvin <[email protected]> wrote:
> On 10/09/2012 12:39 PM, Yinghai Lu wrote:
>> on top of tip/x86/mm2
>
> Hi Yinghai,
>
> This patchset doesn't apply on top of tip:x86/mm2, starting at patch 08/10.
Sorry for that. I refreshed my base to the current linus/master and
tip/master, so there could be some changes there.
We could solve it in one of two ways:
1. update x86/mm2 to linus/master, or
2. I resend the patches with x86/mm2 as the base.
Please let me know which one you prefer.
>
> Furthermore, a lot of the descriptions are both incomplete and
> incomprehensible, which makes the patchset very very hard to review.
>
> You should think of the descriptions as writing a (short) paper on your
> patchset, where you describe not just what you are doing, but how you
> accomplish the objective and why that objective is the right thing do
> accomplish.
Actually I spent a while on the changelogs of this patchset.
I will spend more time on them for the next revision.
>
> Overall, this patchset seems to be going in the right direction, but I
> have several concerns with details.
Good, I will update the patches accordingly.
Thanks
Yinghai
On 10/09/2012 02:21 PM, Yinghai Lu wrote:
>>
>> Hi Yinghai,
>>
>> This patchset doesn't apply on top of tip:x86/mm2, starting at patch 08/10.
>
> sorry for that. I refresh my base to current linus/master and tip/master.
>
> could be some change there.
>
> could solve it:
> 1. update x86/mm2 to linus/master
> 2. or i resend the patch again with x86/mm2 as base.
>
> please let me know which one you like.
>
If it doesn't have cross-dependencies on other code, x86/mm2 is probably
the best base; if there are other dependencies then that's fine, but we
need to know what they are. We don't want to create unnecessary tangles
among topic branches; it makes Linus very unhappy.
-hpa
On Mon, Oct 8, 2012 at 11:12 PM, H. Peter Anvin <[email protected]> wrote:
> On 10/09/2012 12:39 PM, Yinghai Lu wrote:
>> */
>> struct x86_init_mapping {
>> - void (*pagetable_reserve)(u64 start, u64 end);
>> + void (*make_range_readwrite)(u64 start, u64 end);
>> };
>>
>
> Here you go from one misleading name to another. Another classic case
> of "why hooks suck."
I ran out of ideas for a name for it.
>
> make_range_readwrite is particularly toxic, though, because it makes it
> sound like it something along the lines of set_memory_rw(), which it
> most distinctly is not.
It just changes some page range from RO back to RW.
So how about update_range_ro_to_rw?
>
> Furthermore, the naming makes me really puzzled why it is there at all.
> It makes me suspect, but because the patchset is so messy, it is hard
> for me to immediately prove, that you're still missing something important.
>
> However, for example:
>
>> unsigned long __initdata pgt_buf_start;
>> unsigned long __meminitdata pgt_buf_end;
>> unsigned long __meminitdata pgt_buf_top;
>> +unsigned long __initdata early_pgt_buf_start;
>> +unsigned long __meminitdata early_pgt_buf_end;
>> +unsigned long __meminitdata early_pgt_buf_top;
>>
>> bool __init is_pfn_in_early_pgt_buf(unsigned long pfn)
>> {
>> - return pfn >= pgt_buf_start && pfn < pgt_buf_top;
>> + return (pfn >= early_pgt_buf_start && pfn < early_pgt_buf_top) ||
>> + (pfn >= pgt_buf_start && pfn < pgt_buf_top);
>> }
>
> Magic variables augmented with more magic variables. Why? This also
> seems to assume that we still do all the kernel page tables in one
> chunk, which is exactly what we don't want to do.
For 64-bit, the page table ends up in three parts:
1. the initial page table from arch/x86/kernel/head_64.S
2. the page table from BRK
3. pages near the end of RAM
I verified from /sys/kernel/debug/kernel_page_tables that only E820_RAM
ranges are mapped, and the initial page table entries for the holes in
[0, 1G) get cleared too.
Thanks
Yinghai
On Mon, Oct 8, 2012 at 11:25 PM, H. Peter Anvin <[email protected]> wrote:
> On 10/09/2012 02:21 PM, Yinghai Lu wrote:
>>>
>>> Hi Yinghai,
>>>
>>> This patchset doesn't apply on top of tip:x86/mm2, starting at patch 08/10.
>>
>> sorry for that. I refresh my base to current linus/master and tip/master.
>>
>> could be some change there.
>>
>> could solve it:
>> 1. update x86/mm2 to linus/master
>> 2. or i resend the patch again with x86/mm2 as base.
>>
>> please let me know which one you like.
>>
>
> If it doesn't have cross-dependencies on other code, x86/mm2 is probably
> the best base; if there are other dependencies then that's fine, but we
> need to know what they are. We don't want to create unnecessary tangles
> among topic branches, it makes Linus very unhappy.
old one in tip/x86/mm2:
void __init xen_init_mmu_ops(void)
{
x86_init.mapping.pagetable_reserve = xen_mapping_pagetable_reserve;
x86_init.paging.pagetable_setup_start = xen_pagetable_setup_start;
x86_init.paging.pagetable_setup_done = xen_pagetable_setup_done;
pv_mmu_ops = xen_mmu_ops;
in linus/master
2178 void __init xen_init_mmu_ops(void)
2179 {
2180 x86_init.mapping.pagetable_reserve = xen_mapping_pagetable_reserve;
2181 x86_init.paging.pagetable_init = xen_pagetable_init;
2182 pv_mmu_ops = xen_mmu_ops;
2183
2184 memset(dummy_mapping, 0xff, PAGE_SIZE);
2185 }
caused by commit:
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux.git;a=commitdiff;h=c711288727a62f74d48032e56e51333dd104bf58
rej:
--- arch/x86/xen/mmu.c
+++ arch/x86/xen/mmu.c
@@ -2169,7 +2165,7 @@
void __init xen_init_mmu_ops(void)
{
- x86_init.mapping.pagetable_reserve = xen_mapping_pagetable_reserve;
+ x86_init.mapping.make_range_readwrite = xen_make_range_readwrite;
x86_init.paging.pagetable_init = xen_pagetable_init;
pv_mmu_ops = xen_mmu_ops;
On 10/09/2012 02:33 PM, Yinghai Lu wrote:
>>
>> make_range_readwrite is particularly toxic, though, because it makes it
>> sound like it something along the lines of set_memory_rw(), which it
>> most distinctly is not.
>
> it just change some page range from RO back to RW.
>
> so how about update_range_ro_to_rw?
>
You're focusing on the low-level mechanics of one particular
implementation (Xen) of the hook, and then trying to make that describe
the hook itself.
What the hook does, if I am reading it correctly, is take a range that
used to be page tables and turn it back to "ordinary memory". As such,
assuming I'm following the logic correctly, something like
pagetable_unreserve() seems like a reasonable name.
However, why, during initialization, do we have to unreserve memory
that has already been reserved for pagetables? (Keep in mind there may
very well be an entirely sensible answer for that question -- I just
can't tell from the patchset without a much more in-depth analysis.
Keep in mind that that in-depth analysis sucks up time, and it doesn't
scale to expect the maintainer to have to do that.)
>>
>> Magic variables augmented with more magic variables. Why? This also
>> seems to assume that we still do all the kernel page tables in one
>> chunk, which is exactly what we don't want to do.
>
> for 64bit, page table will be three parts
> 1. initial page table from arch/x86/kernel/head_64.S
> 2. page table from BRK.
> 3. page near end of RAM.
>
> verified from /sys/kernel/debug/kernel_page_tables
>
> only range E820_RAM is mapped.
>
> all initial page table for hole between [0, 1G) get cleared too.
>
No, this is wrong, and more importantly, your choice of data structures
encodes this. There should not be any requirement for the page tables
near the end of RAM to be contiguous -- consider the case of a memory
hole near the end of RAM, or a large-memory machine where memory is
highly discontiguous and we have to use more than one chunk before we
run out. Then the questions become:
1. do we *have* to have this tracking at all? Obviously we have to know
this memory is in use, but memblock reserve should take care of that.
2. if we do, please use a data structure which can handle an arbitrary
number of ranges.
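For instance (an illustrative sketch only; all names here are made up):

	/* track an arbitrary number of early page-table buffer ranges
	 * instead of a single pgt_buf_start/pgt_buf_top pair */
	struct pgt_buf_range {
		unsigned long start_pfn;
		unsigned long end_pfn;
	};
	static struct pgt_buf_range pgt_buf_ranges[MAX_PGT_BUF_RANGES];
	static int nr_pgt_buf_ranges;

	bool is_pfn_in_early_pgt_buf(unsigned long pfn)
	{
		int i;

		for (i = 0; i < nr_pgt_buf_ranges; i++)
			if (pfn >= pgt_buf_ranges[i].start_pfn &&
			    pfn <  pgt_buf_ranges[i].end_pfn)
				return true;
		return false;
	}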
-hpa
On Tue, 9 Oct 2012, Yinghai Lu wrote:
> on top of tip/x86/mm2
>
> 1. use brk to mapping final page table
> 2. remove early_ioremap in page table accessing.
>
> v1-v2: changes, update xen interface about pagetable_reserve, so not
> use pgt_buf_* in xen code directly.
>
> could be found at:
> git://git.kernel.org/pub/scm/linux/kernel/git/yinghai/linux-yinghai.git for-x86-mm
>
> Yinghai Lu (10):
> x86, mm: align start address to correct big page size
> x86, mm: Use big page size for small memory range
> x86, mm: get early page table from BRK
> x86, mm: Don't clear page table if next range is ram
> x86, mm: Remove early_memremap workaround for page table accessing
> x86, mm: only keep initial mapping for ram
> x86, xen, mm: Do not need to check if page table is ioremap
> x86, xen, mm: fix mapping_pagetable_reserve logic
> x86, mm: Hide pgt_buf_* into internal to xen
> x86, mm: Add early_pgt_buf_*
>
> arch/x86/include/asm/init.h | 5 ++
> arch/x86/include/asm/pgtable.h | 1 +
> arch/x86/include/asm/pgtable_types.h | 1 -
> arch/x86/include/asm/x86_init.h | 2 +-
> arch/x86/kernel/setup.c | 2 +
> arch/x86/kernel/x86_init.c | 3 +-
> arch/x86/mm/init.c | 73 ++++++++++++++++++++++++---
> arch/x86/mm/init_32.c | 9 +++-
> arch/x86/mm/init_64.c | 91 ++++++++++++----------------------
> arch/x86/xen/mmu.c | 26 +++------
> 10 files changed, 125 insertions(+), 88 deletions(-)
>
> --
> 1.7.7
>
I agree with Peter that this series is going in the right direction.
However if I give more than 4G of RAM to the VM I get a panic at boot:
[ 0.000000] Linux version 3.6.0-rc7+ (sstabellini@st22) (gcc version 4.4.5 (Debian 4.4.5-8) ) #532 SMP Tue Oct 9 12:15:32 UTC 2012
[ 0.000000] Command line: root=/dev/xvda1 textmode=1 xencons=xvc0 debug loglevel=9 earlyprintk=xen
[ 0.000000] ACPI in unprivileged domain disabled
[ 0.000000] e820: BIOS-provided physical RAM map:
[ 0.000000] Xen: [mem 0x0000000000000000-0x000000000009ffff] usable
[ 0.000000] Xen: [mem 0x00000000000a0000-0x00000000000fffff] reserved
[ 0.000000] Xen: [mem 0x0000000000100000-0x00000001387fffff] usable
[ 0.000000] bootconsole [xenboot0] enabled
[ 0.000000] NX (Execute Disable) protection: active
[ 0.000000] DMI not present or invalid.
[ 0.000000] e820: update [mem 0x00000000-0x0000ffff] usable ==> reserved
[ 0.000000] e820: remove [mem 0x000a0000-0x000fffff] usable
[ 0.000000] No AGP bridge found
[ 0.000000] e820: last_pfn = 0x138800 max_arch_pfn = 0x400000000
[ 0.000000] e820: last_pfn = 0x100000 max_arch_pfn = 0x400000000
[ 0.000000] initial memory mapped: [mem 0x00000000-0x02781fff]
[ 0.000000] Base memory trampoline at [ffff88000009a000] 9a000 size 24576
[ 0.000000] calculate_table_space_size: [mem 0x00000000-0x000fffff]
[ 0.000000] [mem 0x00000000-0x000fffff] page 4k
[ 0.000000] calculate_table_space_size: [mem 0x00100000-0x1387fffff]
[ 0.000000] [mem 0x00100000-0x1387fffff] page 4k
[ 0.000000] init_memory_mapping: [mem 0x137e33000-0x1387fffff]
[ 0.000000] [mem 0x137e33000-0x1387fffff] page 4k
[ 0.000000] Kernel panic - not syncing: alloc_low_page: ran out of memory
[ 0.000000] Pid: 0, comm: swapper Not tainted 3.6.0-rc7+ #532
[ 0.000000] Call Trace:
[ 0.000000] [<ffffffff8196f86d>] panic+0xc4/0x1da
[ 0.000000] [<ffffffff81005429>] ? pte_mfn_to_pfn+0x19/0x100
[ 0.000000] [<ffffffff819150c1>] alloc_low_page+0xc1/0xd0
[ 0.000000] [<ffffffff8218b197>] phys_pmd_init+0x1f5/0x2af
[ 0.000000] [<ffffffff8100987f>] ? __raw_callee_save_xen_restore_fl+0x11/0x1e
[ 0.000000] [<ffffffff81005125>] ? __raw_callee_save_xen_pud_val+0x11/0x1e
[ 0.000000] [<ffffffff8218b44a>] phys_pud_init+0x1f9/0x2af
[ 0.000000] [<ffffffff8100a8f9>] ? get_phys_to_machine+0x9/0x70
[ 0.000000] [<ffffffff8100508f>] ? __raw_callee_save_xen_pgd_val+0x11/0x1e
[ 0.000000] [<ffffffff8218b5c8>] kernel_physical_mapping_init+0xc8/0x187
[ 0.000000] [<ffffffff8218aba0>] ? T.620+0x208/0x21c
[ 0.000000] [<ffffffff81914f41>] init_memory_mapping+0x81/0x140
[ 0.000000] [<ffffffff8215ab94>] init_mem_mapping+0x119/0x242
[ 0.000000] [<ffffffff8214c3bc>] setup_arch+0x707/0xb42
[ 0.000000] [<ffffffff8196fa59>] ? printk+0x4d/0x4f
[ 0.000000] [<ffffffff82145ab1>] start_kernel+0xd9/0x3b8
[ 0.000000] [<ffffffff82145356>] x86_64_start_reservations+0x131/0x136
[ 0.000000] [<ffffffff8214931e>] xen_start_kernel+0x68f/0x696
On Tue, 9 Oct 2012, H. Peter Anvin wrote:
> On 10/09/2012 02:33 PM, Yinghai Lu wrote:
> >>
> >> make_range_readwrite is particularly toxic, though, because it makes it
> >> sound like it something along the lines of set_memory_rw(), which it
> >> most distinctly is not.
> >
> > it just change some page range from RO back to RW.
> >
> > so how about update_range_ro_to_rw?
> >
>
> You're focusing on what the low-level mechanics of one particular
> implementation (Xen) of the hook, and then try to make it describe the
> hook itself.
>
> What the hook does, if I am reading it correctly, is take a range that
> used to be page tables and turn it back to "ordinary memory". As such,
> assuming I'm following the logic correctly, something like
> pagetable_unreserve() seems like a reasonable name.
>
> However, why during initialization, why do we have to unreserve memory
> that has already been reserved for pagetables? (Keep in mind there may
> very well be an entirely sensible answer for that question -- I just
> can't tell from the patchset without a much more in-depth analysis.
> Keep in mind that that in-depth analysis sucks up time, and it doesn't
> scale to expect the maintainer to have to do that.)
I can give some hints: find_early_table_space allocates more memory than
needed for pagetable pages, the allocated range is
pgt_buf_start-pgt_buf_top.
In fact init_memory_mapping doesn't use all the memory in that range,
and after building the pagetables it calls pagetable_reserve to reserve
only the range of memory that is actually used.
On native it is just a memblock_reserve, to make sure that pagetable pages
are not reused as normal memory.
On Xen, in addition to that, we also make RW the rest of the range
pgt_buf_start-pgt_buf_top.
The first problem that I can see with this scheme is that
pgt_buf_start-pgt_buf_top is assumed to be the pagetable pages range by
the Xen subsystem while it should be exported with a more appropriate
hook.
If we had a hook in find_early_table_space called "pagetable_reserve"
that tells everybody what is the pagetable pages range in memory, then
we could have a "pagetable_unreserve" hook to release the unused
portions of that memory range. This would make more sense to me.
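A rough sketch of the hook pair described above (the names and exact shape
are only illustrative, nothing here is taken from the patchset):

	struct x86_init_mapping {
		/* tell everybody which range holds the early page tables */
		void (*pagetable_reserve)(u64 start, u64 end);
		/* give the unused tail of that range back as ordinary memory */
		void (*pagetable_unreserve)(u64 start, u64 end);
	};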
On 10/09/2012 08:22 PM, Stefano Stabellini wrote:
>
> I can give some hints: find_early_table_space allocates more memory than
> needed for pagetable pages, the allocated range is
> pgt_buf_start-pgt_buf_top.
>
This is why I keep telling you that we should *get rid of*
find_early_table_space and instead build the page tables incrementally.
There simply is no sane way to make find_early_table_space, or anything
else that requires pre-calculating the page table sizes, work robustly.
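To illustrate one way "incrementally" could look (a sketch only, with
illustrative names such as STEP_SIZE; this is not code from the patchset):

	/* map memory top-down in bounded steps; each step's table pages
	 * come from memory mapped by the previous step (the very first
	 * step can use the small BRK buffer) */
	unsigned long top = end;

	while (top > ISA_END_ADDRESS) {
		unsigned long step_start;

		step_start = (top - ISA_END_ADDRESS > STEP_SIZE) ?
				top - STEP_SIZE : ISA_END_ADDRESS;
		init_memory_mapping(step_start, top);
		top = step_start;
	}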
-hpa
On Mon, Oct 08, 2012 at 09:39:11PM -0700, Yinghai Lu wrote:
> Get pgt_buf early from BRK, and use it to map page table at first.
>
> also use the left at first, then use new one.
Left?
>
> -v2: extra xen call back for that new range.
> -v3: fix compiling about #llx in print out that is reported by Fengguang
> -v4: remove the early_pgt_buf_* stuff, because xen interface is ready yet.
> still need to run x86_init.mapping.pagetable_reserve separately.
> so we will waste some pages in BRK, will address that later.
>
> Signed-off-by: Yinghai Lu <[email protected]>
> ---
> arch/x86/include/asm/pgtable.h | 1 +
> arch/x86/kernel/setup.c | 2 ++
> arch/x86/mm/init.c | 26 +++++++++++++++++++++++++-
> 3 files changed, 28 insertions(+), 1 deletions(-)
>
> diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
> index 52d40a1..25fa5bb 100644
> --- a/arch/x86/include/asm/pgtable.h
> +++ b/arch/x86/include/asm/pgtable.h
> @@ -599,6 +599,7 @@ static inline int pgd_none(pgd_t pgd)
>
> extern int direct_gbpages;
> void init_mem_mapping(void);
> +void early_alloc_pgt_buf(void);
>
> /* local pte updates need not use xchg for locking */
> static inline pte_t native_local_ptep_get_and_clear(pte_t *ptep)
> diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
> index 4989f80..7eb6855 100644
> --- a/arch/x86/kernel/setup.c
> +++ b/arch/x86/kernel/setup.c
> @@ -896,6 +896,8 @@ void __init setup_arch(char **cmdline_p)
>
> reserve_ibft_region();
>
> + early_alloc_pgt_buf();
> +
> /*
> * Need to conclude brk, before memblock_x86_fill()
> * it could use memblock_find_in_range, could overlap with
> diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
> index 0c58f2d..a89f485 100644
> --- a/arch/x86/mm/init.c
> +++ b/arch/x86/mm/init.c
> @@ -317,12 +317,22 @@ static void __init find_early_table_space(unsigned long start,
> unsigned long good_end,
> unsigned long tables)
> {
> - phys_addr_t base;
> + unsigned long base;
>
> base = memblock_find_in_range(start, good_end, tables, PAGE_SIZE);
> if (!base)
> panic("Cannot find space for the kernel page tables");
>
> + init_memory_mapping(base, base + tables);
> + if (pgt_buf_end > pgt_buf_start) {
> + printk(KERN_DEBUG "kernel direct mapping tables from %#lx to %#lx @ [mem %#010lx-%#010lx]\n",
> + base, base + tables - 1, pgt_buf_start << PAGE_SHIFT,
> + (pgt_buf_end << PAGE_SHIFT) - 1);
> +
> + x86_init.mapping.pagetable_reserve(PFN_PHYS(pgt_buf_start),
> + PFN_PHYS(pgt_buf_end));
> + }
> +
> pgt_buf_start = base >> PAGE_SHIFT;
> pgt_buf_end = pgt_buf_start;
> pgt_buf_top = pgt_buf_start + (tables >> PAGE_SHIFT);
> @@ -469,6 +479,20 @@ void __init init_mem_mapping(void)
> early_memtest(0, max_pfn_mapped << PAGE_SHIFT);
> }
>
> +RESERVE_BRK(early_pgt_alloc, 16384);
How did you come up with 16KB being the right size? What is this
based on? Can you provide a comment explaining why 16KB is the
right value on 32-bit and 64-bit machines?
> +
> +void __init early_alloc_pgt_buf(void)
> +{
> + unsigned long tables = 16384;
> + phys_addr_t base;
> +
> + base = __pa(extend_brk(tables, PAGE_SIZE));
> +
> + pgt_buf_start = base >> PAGE_SHIFT;
> + pgt_buf_end = pgt_buf_start;
> + pgt_buf_top = pgt_buf_start + (tables >> PAGE_SHIFT);
> +}
> +
> /*
> * devmem_is_allowed() checks to see if /dev/mem access to a certain address
> * is valid. The argument is a physical page number.
> --
> 1.7.7
>
> --
On Mon, Oct 08, 2012 at 09:39:12PM -0700, Yinghai Lu wrote:
> After we add code use BRK to map buffer for final page table,
.. mention the name of the patch that adds this.
What is 'final page table'? Isn't this just the existing
bootup tables modified to cover more memory?
>
> It should be safe to remove early_memmap for page table accessing.
>
> But we get panic with that.
Can you rewrite that please. For example it can be:
"With that it should have been safe to remove the call to early_memmap
but sadly it panics. The panic is due to .."
>
> It turns out we clear the initial page table wrongly for next range that is
> separated by holes.
How do we clean it wrongly?
> And it only happens when we are trying to map range one by one range separately.
>
> After we add checking before clearing the related page table, that panic will
> not happen anymore.
So we do not clean the pages anymore. Is the cleaning of the pages
addressed somewhere?
>
> Signed-off-by: Yinghai Lu <[email protected]>
> ---
> arch/x86/mm/init_64.c | 37 ++++++++++++++++---------------------
> 1 files changed, 16 insertions(+), 21 deletions(-)
>
> diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
> index f40f383..61b3c44 100644
> --- a/arch/x86/mm/init_64.c
> +++ b/arch/x86/mm/init_64.c
> @@ -363,20 +363,19 @@ static unsigned long __meminit
> phys_pte_init(pte_t *pte_page, unsigned long addr, unsigned long end,
> pgprot_t prot)
> {
> - unsigned pages = 0;
> + unsigned long pages = 0, next;
> unsigned long last_map_addr = end;
> int i;
>
> pte_t *pte = pte_page + pte_index(addr);
>
> - for(i = pte_index(addr); i < PTRS_PER_PTE; i++, addr += PAGE_SIZE, pte++) {
> -
> + for (i = pte_index(addr); i < PTRS_PER_PTE; i++, addr = next, pte++) {
> + next = (addr & PAGE_MASK) + PAGE_SIZE;
> if (addr >= end) {
> - if (!after_bootmem) {
> - for(; i < PTRS_PER_PTE; i++, pte++)
> - set_pte(pte, __pte(0));
> - }
> - break;
> + if (!after_bootmem &&
> + !e820_any_mapped(addr & PAGE_MASK, next, 0))
> + set_pte(pte, __pte(0));
> + continue;
> }
>
> /*
> @@ -418,16 +417,14 @@ phys_pmd_init(pmd_t *pmd_page, unsigned long address, unsigned long end,
> pte_t *pte;
> pgprot_t new_prot = prot;
>
> + next = (address & PMD_MASK) + PMD_SIZE;
> if (address >= end) {
> - if (!after_bootmem) {
> - for (; i < PTRS_PER_PMD; i++, pmd++)
> - set_pmd(pmd, __pmd(0));
> - }
> - break;
> + if (!after_bootmem &&
> + !e820_any_mapped(address & PMD_MASK, next, 0))
> + set_pmd(pmd, __pmd(0));
> + continue;
> }
>
> - next = (address & PMD_MASK) + PMD_SIZE;
> -
> if (pmd_val(*pmd)) {
> if (!pmd_large(*pmd)) {
> spin_lock(&init_mm.page_table_lock);
> @@ -494,13 +491,11 @@ phys_pud_init(pud_t *pud_page, unsigned long addr, unsigned long end,
> pmd_t *pmd;
> pgprot_t prot = PAGE_KERNEL;
>
> - if (addr >= end)
> - break;
> -
> next = (addr & PUD_MASK) + PUD_SIZE;
> -
> - if (!after_bootmem && !e820_any_mapped(addr, next, 0)) {
> - set_pud(pud, __pud(0));
> + if (addr >= end) {
> + if (!after_bootmem &&
> + !e820_any_mapped(addr & PUD_MASK, next, 0))
> + set_pud(pud, __pud(0));
> continue;
> }
>
> --
> 1.7.7
>
> --
On Mon, Oct 08, 2012 at 09:39:15PM -0700, Yinghai Lu wrote:
> All page table buf are pre-mapped, and could use _va to access them.
"__va", not "_va". And can you also include the name of the patch
that makes them pre-mapped?
>
> Remove the not needed checking.
<sigh>
"Remove the ioremap check."
>
> Signed-off-by: Yinghai Lu <[email protected]>
> Cc: Konrad Rzeszutek Wilk <[email protected]>
> Cc: Jeremy Fitzhardinge <[email protected]>
> ---
> arch/x86/xen/mmu.c | 8 ++------
> 1 files changed, 2 insertions(+), 6 deletions(-)
>
> diff --git a/arch/x86/xen/mmu.c b/arch/x86/xen/mmu.c
> index 7a769b7..9c0956c 100644
> --- a/arch/x86/xen/mmu.c
> +++ b/arch/x86/xen/mmu.c
> @@ -1412,13 +1412,9 @@ static pte_t __init mask_rw_pte(pte_t *ptep, pte_t pte)
>
> /*
> * If the new pfn is within the range of the newly allocated
> - * kernel pagetable, and it isn't being mapped into an
> - * early_ioremap fixmap slot as a freshly allocated page, make sure
> - * it is RO.
> + * kernel pagetable, make sure it is RO.
> */
> - if (((!is_early_ioremap_ptep(ptep) &&
> - pfn >= pgt_buf_start && pfn < pgt_buf_top)) ||
> - (is_early_ioremap_ptep(ptep) && pfn != (pgt_buf_end - 1)))
> + if (pfn >= pgt_buf_start && pfn < pgt_buf_top)
> pte = pte_wrprotect(pte);
>
> return pte;
> --
> 1.7.7
>
> --
On Tue, Oct 9, 2012 at 5:19 AM, Stefano Stabellini
<[email protected]> wrote:
>
> I agree with Peter that this series is going in the right direction.
> However if I give more than 4G of RAM to the VM I get a panic at boot:
>
> [ 0.000000] Linux version 3.6.0-rc7+ (sstabellini@st22) (gcc version 4.4.5 (Debian 4.4.5-8) ) #532 SMP Tue Oct 9 12:15:32 UTC 2012
> [ 0.000000] Command line: root=/dev/xvda1 textmode=1 xencons=xvc0 debug loglevel=9 earlyprintk=xen
> [ 0.000000] ACPI in unprivileged domain disabled
> [ 0.000000] e820: BIOS-provided physical RAM map:
> [ 0.000000] Xen: [mem 0x0000000000000000-0x000000000009ffff] usable
> [ 0.000000] Xen: [mem 0x00000000000a0000-0x00000000000fffff] reserved
> [ 0.000000] Xen: [mem 0x0000000000100000-0x00000001387fffff] usable
> [ 0.000000] bootconsole [xenboot0] enabled
> [ 0.000000] NX (Execute Disable) protection: active
> [ 0.000000] DMI not present or invalid.
> [ 0.000000] e820: update [mem 0x00000000-0x0000ffff] usable ==> reserved
> [ 0.000000] e820: remove [mem 0x000a0000-0x000fffff] usable
> [ 0.000000] No AGP bridge found
> [ 0.000000] e820: last_pfn = 0x138800 max_arch_pfn = 0x400000000
> [ 0.000000] e820: last_pfn = 0x100000 max_arch_pfn = 0x400000000
> [ 0.000000] initial memory mapped: [mem 0x00000000-0x02781fff]
> [ 0.000000] Base memory trampoline at [ffff88000009a000] 9a000 size 24576
> [ 0.000000] calculate_table_space_size: [mem 0x00000000-0x000fffff]
> [ 0.000000] [mem 0x00000000-0x000fffff] page 4k
> [ 0.000000] calculate_table_space_size: [mem 0x00100000-0x1387fffff]
> [ 0.000000] [mem 0x00100000-0x1387fffff] page 4k
> [ 0.000000] init_memory_mapping: [mem 0x137e33000-0x1387fffff]
> [ 0.000000] [mem 0x137e33000-0x1387fffff] page 4k
Great, thanks for testing it.
I will come out with a new version that will get rid of the
find_early_table_space / calculate_table_space_size stuff.
Thanks
Yinghai
On Tue, Oct 9, 2012 at 11:27 AM, Yinghai Lu <[email protected]> wrote:
> On Tue, Oct 9, 2012 at 5:19 AM, Stefano Stabellini
> <[email protected]> wrote:
>>
>> I agree with Peter that this series is going in the right direction.
>> However if I give more than 4G of RAM to the VM I get a panic at boot:
I just updated my branch for-x86-mm, and it should fix the panic problem.
Please try it again:
git://git.kernel.org/pub/scm/linux/kernel/git/yinghai/linux-yinghai.git
for-x86-mm
Thanks a lot!
Yinghai
On Tue, Oct 9, 2012 at 9:01 AM, Konrad Rzeszutek Wilk <[email protected]> wrote:
>> +RESERVE_BRK(early_pgt_alloc, 16384);
>
> How did you come up with 16KB being the right size? What is this
> based on? Can you provide a comment explaining why 16KB is the
> right value on 32-bit and 64-bit machines?
Good point, I added a one-line comment to the revised patch:
+/* need 3 4k for initial PMD_SIZE, 4k for 0-ISA_END_ADDRESS */
On Tue, Oct 9, 2012 at 9:04 AM, Konrad Rzeszutek Wilk <[email protected]> wrote:
> How do we clean it wrongly?
>> And it only happens when we are trying to map range one by one range separately.
>>
>> After we add checking before clearing the related page table, that panic will
>> not happen anymore.
>
> So we do not clean the pages anymore. Is the cleaning of the pages
> addressed somewhere?
That is related to the calling sequence.
The old design assumed we would call that several times, with ranges under 1G.
On Tue, Oct 09, 2012 at 06:03:33PM -0700, Yinghai Lu wrote:
> On Tue, Oct 9, 2012 at 9:01 AM, Konrad Rzeszutek Wilk <[email protected]> wrote:
> >> +RESERVE_BRK(early_pgt_alloc, 16384);
> >
> > How did you come up with 16KB being the right size? What is this
> > based on? Can you provide a comment explaining why 16KB is the
> > right value on 32-bit and 64-bit machines?
>
> good point, i add one line comment to the revised patch.
>
> +/* need 3 4k for initial PMD_SIZE, 4k for 0-ISA_END_ADDRESS */
Can you explain the math please? It sounds like this is based on
the assumption that you are using huge pages. But what if you
are not? What if you can only do 4KB pages?
> --
On Wed, Oct 10, 2012 at 7:17 AM, Konrad Rzeszutek Wilk
<[email protected]> wrote:
> On Tue, Oct 09, 2012 at 06:03:33PM -0700, Yinghai Lu wrote:
>> good point, i add one line comment to the revised patch.
>>
>> +/* need 3 4k for initial PMD_SIZE, 4k for 0-ISA_END_ADDRESS */
>
> Can you explain the math please? It sounds like this is based on
> the assumption that this you are using huge pages. But what if you
> are not? What if you can only do 4KB pages?
Yes, that covers an extreme case:
assume the 2M range is [2T-2M, 2T); then we need one page for the PGD level
covering [2T-512G, 2T), one for the PUD level covering [2T-1G, 2T), and one
for the PMD level covering [2T-2M, 2T).
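Summing up the comment in the revised patch (a back-of-the-envelope check,
not part of the original mail):

	  3 pages for the initial PMD_SIZE mapping (one per level, as above)
	+ 1 page  for the 4K mapping of 0 - ISA_END_ADDRESS
	= 4 pages * 4096 bytes = 16384 bytes

which matches RESERVE_BRK(early_pgt_alloc, 16384).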
On Wed, Oct 10, 2012 at 08:49:05AM -0700, Yinghai Lu wrote:
> On Wed, Oct 10, 2012 at 7:17 AM, Konrad Rzeszutek Wilk
> <[email protected]> wrote:
> > On Tue, Oct 09, 2012 at 06:03:33PM -0700, Yinghai Lu wrote:
>
> >> good point, i add one line comment to the revised patch.
> >>
> >> +/* need 3 4k for initial PMD_SIZE, 4k for 0-ISA_END_ADDRESS */
> >
> > Can you explain the math please? It sounds like this is based on
> > the assumption that this you are using huge pages. But what if you
> > are not? What if you can only do 4KB pages?
>
> yes that is some extreme case:
> assume that 2M range is [2T-2M, 2T),
What is T in here? Terabyte? Is the '[' vs ')' of significance in your
explanation? Should it be '[2T-2M, 2T]' ?
> one page for PGD-[2T-512g, 2T), one for PUD - [2T-1G, 2T), and one for
> PMD -: [2T-2M, 2T)
On Thu, Oct 11, 2012 at 7:41 AM, Konrad Rzeszutek Wilk
<[email protected]> wrote:
> On Wed, Oct 10, 2012 at 08:49:05AM -0700, Yinghai Lu wrote:
>>
>> yes that is some extreme case:
>> assume that 2M range is [2T-2M, 2T),
>
> What is T in here? Terabyte? Is the '[' vs ')' a significance in your
> explanation? Should it be '[2T-2M, 2T]' ?
Yes, T is terabyte.
[2T-2M, 2T) is equal to [2T-2M, 2T-1];
the ')' means the upper boundary is not included.
On 10/12/2012 06:55 AM, Yinghai Lu wrote:
> On Thu, Oct 11, 2012 at 7:41 AM, Konrad Rzeszutek Wilk
> <[email protected]> wrote:
>> On Wed, Oct 10, 2012 at 08:49:05AM -0700, Yinghai Lu wrote:
>>>
>>> yes that is some extreme case:
>>> assume that 2M range is [2T-2M, 2T),
>>
>> What is T in here? Terabyte? Is the '[' vs ')' a significance in your
>> explanation? Should it be '[2T-2M, 2T]' ?
>
> yes, T is terabyte
>
> [2T-2M, 2T) is equal to [2T-2M, 2T-1]
>
> ) mean the boundary is not included.
>
Right... and just to defend Yinghai here, this is a standard
mathematical notation.
-hpa