On 01/31/2011 07:18 AM, Stefano Stabellini wrote:
> x86/mm/init: respect memblock reserved regions when destroying mappings
>
> In init_memory_mapping we are destroying all the mappings between
> _brk_end and _end, no matter if some memory areas in that range have
> been reserved using memblock_x86_reserve_range.
> Besides if _end is not pmd aligned we might destroy the
> mappings for valid memory between _end and the following pmd.
>
> In order to avoid this problem, before clearing any pmds we check if the
> corresponding memory area has been reserved and we only destroy the
> mapping if it hasn't.
>
> We found this problem because under Xen we have a valid mapping at _end,
> and if _end is not pmd aligned the current code destroys the initial
> part of it.
>
> In practice this fix does not have any impact on native.
>
> Signed-off-by: Stefano Stabellini <[email protected]>
How on Earth would you end up with a reserved region *inside the BRK*?
-hpa
--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel. I don't speak on their behalf.
On Thu, 3 Feb 2011, H. Peter Anvin wrote:
> On 01/31/2011 07:18 AM, Stefano Stabellini wrote:
> > x86/mm/init: respect memblock reserved regions when destroying mappings
> >
> > In init_memory_mapping we are destroying all the mappings between
> > _brk_end and _end, no matter if some memory areas in that range have
> > been reserved using memblock_x86_reserve_range.
> > Besides if _end is not pmd aligned we might destroy the
> > mappings for valid memory between _end and the following pmd.
> >
> > In order to avoid this problem, before clearing any pmds we check if the
> > corresponding memory area has been reserved and we only destroy the
> > mapping if it hasn't.
> >
> > We found this problem because under Xen we have a valid mapping at _end,
> > and if _end is not pmd aligned the current code destroys the initial
> > part of it.
> >
> > In practice this fix does not have any impact on native.
> >
> > Signed-off-by: Stefano Stabellini <[email protected]>
>
> How on Earth would you end up with a reserved region *inside the BRK*?
I think in practice you cannot, but you can have reserved regions at
_end, that is the main problem I am trying to solve.
If we have a reserved region at _end and _end is not PMD aligned, then
we have a problem.
I thought that checking for reserved regions before destroying the
mapping would be a decent solution (because it wouldn't affect the
normal case); so I ended up checking between _brk_end and _end too.
Other alternative solutions I thought about but that I discarded because
they also affect the normal case are:
- never destroy mappings that could go over _end;
- always PMD align _end.
If none of the above are acceptable, I welcome other suggestions :-)
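To make the first idea concrete, here is a minimal sketch (not the actual patch, which is posted further down in this thread) of the check being proposed: a PMD maps PMD_SIZE (2M) of address space, so a PMD is only safe to clear if memblock reports the whole 2M physical range behind it as free. The helper name below is made up for illustration; the real patch open-codes the same test with memblock_x86_find_in_range_size().

/* illustrative helper only -- not part of any posted patch */
static bool __init pmd_range_is_clear(unsigned long addr)
{
	u64 size;
	u64 free = memblock_x86_find_in_range_size(__pa(addr), &size, PMD_SIZE);

	/*
	 * memblock_x86_find_in_range_size() returns the first unreserved
	 * address at or above __pa(addr) together with the size of that
	 * free hole; the 2M range is only clear if the hole starts exactly
	 * at __pa(addr) and is at least PMD_SIZE long.
	 */
	return free == (u64)__pa(addr) && size >= PMD_SIZE;
}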
On 02/03/2011 03:25 AM, Stefano Stabellini wrote:
>>
>> How on Earth would you end up with a reserved region *inside the BRK*?
>
> I think in practice you cannot, but you can have reserved regions at
> _end, that is the main problem I am trying to solve.
> If we have a reserved region at _end and _end is not PMD aligned, then
> we have a problem.
>
> I thought that checking for reserved regions before destroying the
> mapping would be a decent solution (because it wouldn't affect the
> normal case); so I ended up checking between _brk_end and _end too.
>
> Other alternative solutions I thought about but that I discarded because
> they also affect the normal case are:
>
> - never destroy mappings that could go over _end;
> - always PMD align _end.
>
> If none of the above are acceptable, I welcome other suggestions :-)
>
Sounds like the code does the right thing, but the description needs to
be improved.
-hpa
--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel. I don't speak on their behalf.
On Thu, 3 Feb 2011, H. Peter Anvin wrote:
> On 02/03/2011 03:25 AM, Stefano Stabellini wrote:
> >>
> >> How on Earth would you end up with a reserved region *inside the BRK*?
> >
> > I think in practice you cannot, but you can have reserved regions at
> > _end, that is the main problem I am trying to solve.
> > If we have a reserved region at _end and _end is not PMD aligned, then
> > we have a problem.
> >
> > I thought that checking for reserved regions before destroying the
> > mapping would be a decent solution (because it wouldn't affect the
> > normal case); so I ended up checking between _brk_end and _end too.
> >
> > Other alternative solutions I thought about but that I discarded because
> > they also affect the normal case are:
> >
> > - never destroy mappings that could go over _end;
> > - always PMD align _end.
> >
> > If none of the above are acceptable, I welcome other suggestions :-)
> >
>
> Sounds like the code does the right thing, but the description needs to
> be improved.
>
I tried to improve both the commit message and the comments within the
code, this is the result:
commit d0136be7b48953f27202dbde285a7379d06cfe98
Author: Stefano Stabellini <[email protected]>
Date: Tue Jan 25 12:05:11 2011 +0000
x86/mm/init: respect memblock reserved regions when destroying mappings
In init_memory_mapping we destroy the mappings between _brk_end and
_end, but if _end is not PMD aligned we also destroy mappings for
potentially reserved regions between _end and the following PMD.
In order to avoid this problem, before clearing any PMDs we check if the
corresponding memory area has been reserved and we only destroy the
mapping if it hasn't.
We found this issue because under Xen we have a valid mapping at _end,
and if _end is not PMD aligned the current code destroys the initial
part of it.
In practice this fix does not have any impact on native.
Signed-off-by: Stefano Stabellini <[email protected]>
diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
index 947f42a..65c34f4 100644
--- a/arch/x86/mm/init.c
+++ b/arch/x86/mm/init.c
@@ -283,6 +283,8 @@ unsigned long __init_refok init_memory_mapping(unsigned long start,
if (!after_bootmem && !start) {
pud_t *pud;
pmd_t *pmd;
+ unsigned long addr;
+ u64 size, memblock_addr;
mmu_cr4_features = read_cr4();
@@ -291,11 +293,22 @@ unsigned long __init_refok init_memory_mapping(unsigned long start,
* located on different 2M pages. cleanup_highmap(), however,
* can only consider _end when it runs, so destroy any
* mappings beyond _brk_end here.
+ *
+ * If _end is not PMD aligned, we also destroy the mapping of
+ * the memory area between _end and the next PMD, so before clearing
+ * the PMD we make sure that the corresponding memory region has
+ * not been reserved.
*/
pud = pud_offset(pgd_offset_k(_brk_end), _brk_end);
pmd = pmd_offset(pud, _brk_end - 1);
- while (++pmd <= pmd_offset(pud, (unsigned long)_end - 1))
- pmd_clear(pmd);
+ addr = (_brk_end + PMD_SIZE - 1) & PMD_MASK;
+ while (++pmd <= pmd_offset(pud, (unsigned long)_end - 1)) {
+ memblock_addr = memblock_x86_find_in_range_size(__pa(addr),
+ &size, PMD_SIZE);
+ if (memblock_addr == (u64) __pa(addr) && size >= PMD_SIZE)
+ pmd_clear(pmd);
+ addr += PMD_SIZE;
+ }
}
#endif
__flush_tlb_all();
On 02/04/2011 03:35 AM, Stefano Stabellini wrote:
> On Thu, 3 Feb 2011, H. Peter Anvin wrote:
>> On 02/03/2011 03:25 AM, Stefano Stabellini wrote:
>>>> How on Earth would you end up with a reserved region *inside the BRK*?
>>> I think in practice you cannot, but you can have reserved regions at
>>> _end, that is the main problem I am trying to solve.
>>> If we have a reserved region at _end and _end is not PMD aligned, then
>>> we have a problem.
>>>
>>> I thought that checking for reserved regions before destroying the
>>> mapping would be a decent solution (because it wouldn't affect the
>>> normal case); so I ended up checking between _brk_end and _end too.
>>>
>>> Other alternative solutions I thought about but that I discarded because
>>> they also affect the normal case are:
>>>
>>> - never destroy mappings that could go over _end;
>>> - always PMD align _end.
>>>
>>> If none of the above are acceptable, I welcome other suggestions :-)
>>>
>> Sounds like the code does the right thing, but the description needs to
>> be improved.
>>
> I tried to improve both the commit message and the comments within the
> code, this is the result:
>
>
> commit d0136be7b48953f27202dbde285a7379d06cfe98
> Author: Stefano Stabellini <[email protected]>
> Date: Tue Jan 25 12:05:11 2011 +0000
>
> x86/mm/init: respect memblock reserved regions when destroying mappings
>
> In init_memory_mapping we destroy the mappings between _brk_end and
> _end, but if _end is not PMD aligned we also destroy mappings for
> potentially reserved regions between _end and the following PMD.
>
> In order to avoid this problem, before clearing any PMDs we check if the
> corresponding memory area has been reserved and we only destroy the
> mapping if it hasn't.
>
> We found this issue because under Xen we have a valid mapping at _end,
> and if _end is not PMD aligned the current code destroys the initial
> part of it.
This description makes more sense, even if the code does exactly the
same thing.
>
> In practice this fix does not have any impact on native.
>
> Signed-off-by: Stefano Stabellini <[email protected]>
>
> diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
> index 947f42a..65c34f4 100644
> --- a/arch/x86/mm/init.c
> +++ b/arch/x86/mm/init.c
> @@ -283,6 +283,8 @@ unsigned long __init_refok init_memory_mapping(unsigned long start,
> if (!after_bootmem && !start) {
> pud_t *pud;
> pmd_t *pmd;
> + unsigned long addr;
> + u64 size, memblock_addr;
>
> mmu_cr4_features = read_cr4();
>
> @@ -291,11 +293,22 @@ unsigned long __init_refok init_memory_mapping(unsigned long start,
> * located on different 2M pages. cleanup_highmap(), however,
> * can only consider _end when it runs, so destroy any
> * mappings beyond _brk_end here.
> + *
> + * If _end is not PMD aligned, we also destroy the mapping of
> + * the memory area between _end and the next PMD, so before clearing
> + * the PMD we make sure that the corresponding memory region has
> + * not been reserved.
> */
> pud = pud_offset(pgd_offset_k(_brk_end), _brk_end);
> pmd = pmd_offset(pud, _brk_end - 1);
> - while (++pmd <= pmd_offset(pud, (unsigned long)_end - 1))
> - pmd_clear(pmd);
> + addr = (_brk_end + PMD_SIZE - 1) & PMD_MASK;
I guess it's OK if this is >_end, because the pmd offset will be greater
than _end's. But is there an edge condition if the pmd_offset goes off
the end of the pud, and the pud page itself happens to be at the end of the
address space and it wraps?
> + while (++pmd <= pmd_offset(pud, (unsigned long)_end - 1)) {
Could _end be in a different pud from _brk_end? Could this walk off the
pud page?
Or is it moot, and there's a guarantee that the whole space is mapped
out of the same pud page? I guess the original code has the same
concern, so this patch leaves the status quo unchanged.
J
> + memblock_addr = memblock_x86_find_in_range_size(__pa(addr),
> + &size, PMD_SIZE);
> + if (memblock_addr == (u64) __pa(addr) && size >= PMD_SIZE)
> + pmd_clear(pmd);
> + addr += PMD_SIZE;
> + }
> }
> #endif
> __flush_tlb_all();
On Fri, Feb 4, 2011 at 5:18 PM, Jeremy Fitzhardinge <[email protected]> wrote:
> On 02/04/2011 03:35 AM, Stefano Stabellini wrote:
>> On Thu, 3 Feb 2011, H. Peter Anvin wrote:
>>> On 02/03/2011 03:25 AM, Stefano Stabellini wrote:
>>>>> How on Earth would you end up with a reserved region *inside the BRK*?
>>>> I think in practice you cannot, but you can have reserved regions at
>>>> _end, that is the main problem I am trying to solve.
>>>> If we have a reserved region at _end and _end is not PMD aligned, then
>>>> we have a problem.
>>>>
>>>> I thought that checking for reserved regions before destroying the
>>>> mapping would be a decent solution (because it wouldn't affect the
>>>> normal case); so I ended up checking between _brk_end and _end too.
>>>>
>>>> Other alternative solutions I thought about but that I discarded because
>>>> they also affect the normal case are:
>>>>
>>>> - never destroy mappings that could go over _end;
>>>> - always PMD align _end.
>>>>
>>>> If none of the above are acceptable, I welcome other suggestions :-)
>>>>
>>> Sounds like the code does the right thing, but the description needs to
>>> be improved.
>>>
>> I tried to improve both the commit message and the comments within the
>> code, this is the result:
>>
>>
>> commit d0136be7b48953f27202dbde285a7379d06cfe98
>> Author: Stefano Stabellini <[email protected]>
>> Date:   Tue Jan 25 12:05:11 2011 +0000
>>
>>     x86/mm/init: respect memblock reserved regions when destroying mappings
>>
>>     In init_memory_mapping we destroy the mappings between _brk_end and
>>     _end, but if _end is not PMD aligned we also destroy mappings for
>>     potentially reserved regions between _end and the following PMD.
>>
>>     In order to avoid this problem, before clearing any PMDs we check if the
>>     corresponding memory area has been reserved and we only destroy the
>>     mapping if it hasn't.
>>
>>     We found this issue because under Xen we have a valid mapping at _end,
>>     and if _end is not PMD aligned the current code destroys the initial
>>     part of it.
>
> This description makes more sense, even if the code does exactly the
> same thing.
>
>>
>>     In practice this fix does not have any impact on native.
>>
>>     Signed-off-by: Stefano Stabellini <[email protected]>
>>
>> diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
>> index 947f42a..65c34f4 100644
>> --- a/arch/x86/mm/init.c
>> +++ b/arch/x86/mm/init.c
>> @@ -283,6 +283,8 @@ unsigned long __init_refok init_memory_mapping(unsigned long start,
>>       if (!after_bootmem && !start) {
>>               pud_t *pud;
>>               pmd_t *pmd;
>> +             unsigned long addr;
>> +             u64 size, memblock_addr;
>>
>>               mmu_cr4_features = read_cr4();
>>
>> @@ -291,11 +293,22 @@ unsigned long __init_refok init_memory_mapping(unsigned long start,
>>                * located on different 2M pages. cleanup_highmap(), however,
>>                * can only consider _end when it runs, so destroy any
>>                * mappings beyond _brk_end here.
>> +              *
>> +              * If _end is not PMD aligned, we also destroy the mapping of
>> +              * the memory area between _end and the next PMD, so before clearing
>> +              * the PMD we make sure that the corresponding memory region has
>> +              * not been reserved.
>>                */
>>               pud = pud_offset(pgd_offset_k(_brk_end), _brk_end);
>>               pmd = pmd_offset(pud, _brk_end - 1);
>> -             while (++pmd <= pmd_offset(pud, (unsigned long)_end - 1))
>> -                     pmd_clear(pmd);
>> +             addr = (_brk_end + PMD_SIZE - 1) & PMD_MASK;
>
> I guess it's OK if this is >_end, because the pmd offset will be greater
> than _end's. But is there an edge condition if the pmd_offset goes off
> the end of the pud, and the pud page itself happens to be at the end of the
> address space and it wraps?
>
>> +             while (++pmd <= pmd_offset(pud, (unsigned long)_end - 1)) {
> Could _end be in a different pud from _brk_end? Could this walk off the
> pud page?
>
> Or is it moot, and there's a guarantee that the whole space is mapped
> out of the same pud page? I guess the original code has the same
> concern, so this patch leaves the status quo unchanged.
>
>    J
>
>
>> +                     memblock_addr = memblock_x86_find_in_range_size(__pa(addr),
>> +                                             &size, PMD_SIZE);
>> +                     if (memblock_addr == (u64) __pa(addr) && size >= PMD_SIZE)
>> +                             pmd_clear(pmd);
>> +                     addr += PMD_SIZE;
>> +             }
>>       }
>>  #endif
>>       __flush_tlb_all();
why not just move calling cleanup_highmap down?
something like attached patch.
On 02/05/2011 11:02 PM, Yinghai Lu wrote:
> why not just move calling cleanup_highmap down?
>
> something like attached patch.
This patch looks very clean and looks on the surface of it like it is
removing some ugly ad hoc code, but (as always) it needs a description
about the problem it solves and why it is correct.
-hpa
--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel. I don't speak on their behalf.
On 02/05/2011 11:30 PM, H. Peter Anvin wrote:
> On 02/05/2011 11:02 PM, Yinghai Lu wrote:
>> why not just move calling cleanup_highmap down?
>>
>> something like attached patch.
>
> This patch looks very clean and looks on the surface of it like it is
> removing some ugly ad hoc code, but (as always) it needs a description
> about the problem it solves and why it is correct.
>
It's just a cleanup on the native platform,
but I wonder if it could cause problems for xen memory hotplug support.
Thanks
Yinghai
On 02/05/2011 11:30 PM, H. Peter Anvin wrote:
> On 02/05/2011 11:02 PM, Yinghai Lu wrote:
>> why not just move calling cleanup_highmap down?
>>
>> something like attached patch.
>
> This patch looks very clean and looks on the surface of it like it is
> removing some ugly ad hoc code, but (as always) it needs a description
> about the problem it solves and why it is correct.
Sure.
Jeremy and xen guys, can you please check if it works well with xen ?
Thanks
[PATCH] x86: Cleanup highmap after brk is concluded
Now the highmap cleanup is actually done in two steps. One is early in head64.c,
and it only clears above _end; later we try to clean from _brk_end to _end,
which needs to double-check whether those boundaries are PMD_SIZE aligned...
Also, init_memory_mapping() is called several times for NUMA or memory hotplug,
and it deals with the real memory mapping instead of the initial kernel mapping.
We really should not handle the initial kernel mapping there.
We can move the cleanup_highmap() call down to after _brk_end is settled and pass
_brk_end to it.
Signed-off-by: Yinghai Lu <[email protected]>
---
arch/x86/include/asm/pgtable_64.h | 2 +-
arch/x86/kernel/head64.c | 3 ---
arch/x86/kernel/setup.c | 6 ++++++
arch/x86/mm/init.c | 19 -------------------
arch/x86/mm/init_64.c | 5 +++--
5 files changed, 10 insertions(+), 25 deletions(-)
Index: linux-2.6/arch/x86/include/asm/pgtable_64.h
===================================================================
--- linux-2.6.orig/arch/x86/include/asm/pgtable_64.h
+++ linux-2.6/arch/x86/include/asm/pgtable_64.h
@@ -165,7 +165,7 @@ static inline int pgd_large(pgd_t pgd) {
#define __swp_entry_to_pte(x) ((pte_t) { .pte = (x).val })
extern int kern_addr_valid(unsigned long addr);
-extern void cleanup_highmap(void);
+extern void cleanup_highmap(unsigned long end);
#define HAVE_ARCH_UNMAPPED_AREA
#define HAVE_ARCH_UNMAPPED_AREA_TOPDOWN
Index: linux-2.6/arch/x86/kernel/head64.c
===================================================================
--- linux-2.6.orig/arch/x86/kernel/head64.c
+++ linux-2.6/arch/x86/kernel/head64.c
@@ -77,9 +77,6 @@ void __init x86_64_start_kernel(char * r
/* Make NULL pointers segfault */
zap_identity_mappings();
- /* Cleanup the over mapped high alias */
- cleanup_highmap();
-
max_pfn_mapped = KERNEL_IMAGE_SIZE >> PAGE_SHIFT;
for (i = 0; i < NUM_EXCEPTION_VECTORS; i++) {
Index: linux-2.6/arch/x86/kernel/setup.c
===================================================================
--- linux-2.6.orig/arch/x86/kernel/setup.c
+++ linux-2.6/arch/x86/kernel/setup.c
@@ -297,6 +297,9 @@ static void __init init_gbpages(void)
static inline void init_gbpages(void)
{
}
+static void __init cleanup_highmap(unsigned long end)
+{
+}
#endif
static void __init reserve_brk(void)
@@ -922,6 +925,9 @@ void __init setup_arch(char **cmdline_p)
*/
reserve_brk();
+ /* Cleanup the over mapped high alias after _brk_end*/
+ cleanup_highmap(_brk_end);
+
memblock.current_limit = get_max_mapped();
memblock_x86_fill();
Index: linux-2.6/arch/x86/mm/init.c
===================================================================
--- linux-2.6.orig/arch/x86/mm/init.c
+++ linux-2.6/arch/x86/mm/init.c
@@ -279,25 +279,6 @@ unsigned long __init_refok init_memory_m
load_cr3(swapper_pg_dir);
#endif
-#ifdef CONFIG_X86_64
- if (!after_bootmem && !start) {
- pud_t *pud;
- pmd_t *pmd;
-
- mmu_cr4_features = read_cr4();
-
- /*
- * _brk_end cannot change anymore, but it and _end may be
- * located on different 2M pages. cleanup_highmap(), however,
- * can only consider _end when it runs, so destroy any
- * mappings beyond _brk_end here.
- */
- pud = pud_offset(pgd_offset_k(_brk_end), _brk_end);
- pmd = pmd_offset(pud, _brk_end - 1);
- while (++pmd <= pmd_offset(pud, (unsigned long)_end - 1))
- pmd_clear(pmd);
- }
-#endif
__flush_tlb_all();
if (!after_bootmem && e820_table_end > e820_table_start)
Index: linux-2.6/arch/x86/mm/init_64.c
===================================================================
--- linux-2.6.orig/arch/x86/mm/init_64.c
+++ linux-2.6/arch/x86/mm/init_64.c
@@ -297,13 +297,14 @@ void __init init_extra_mapping_uc(unsign
* rounded up to the 2MB boundary. This catches the invalid pmds as
* well, as they are located before _text:
*/
-void __init cleanup_highmap(void)
+void __init cleanup_highmap(unsigned long end)
{
unsigned long vaddr = __START_KERNEL_map;
- unsigned long end = roundup((unsigned long)_end, PMD_SIZE) - 1;
pmd_t *pmd = level2_kernel_pgt;
pmd_t *last_pmd = pmd + PTRS_PER_PMD;
+ end = roundup(end, PMD_SIZE) - 1;
+
for (; pmd < last_pmd; pmd++, vaddr += PMD_SIZE) {
if (pmd_none(*pmd))
continue;
On Sat, 5 Feb 2011, Jeremy Fitzhardinge wrote:
> > pud = pud_offset(pgd_offset_k(_brk_end), _brk_end);
> > pmd = pmd_offset(pud, _brk_end - 1);
> > - while (++pmd <= pmd_offset(pud, (unsigned long)_end - 1))
> > - pmd_clear(pmd);
> > + addr = (_brk_end + PMD_SIZE - 1) & PMD_MASK;
>
> I guess it's OK if this is >_end, because the pmd offset will be greater
> than _end's. But is there an edge condition if the pmd_offset goes off
> the end of the pud, and the pud page itself happens to be at the end of the
> address space and it wraps?
>
> > + while (++pmd <= pmd_offset(pud, (unsigned long)_end - 1)) {
> Could _end be in a different pud from _brk_end? Could this walk off the
> pud page?
>
> Or is it moot, and there's a guarantee that the whole space is mapped
> out of the same pud page? I guess the original code has the same
> concern, so this patch leaves the status quo unchanged.
>
Indeed. I had this doubt myself, but I thought there must have been a
smart reason why _brk_end and _end must be on the same pud, so I left it
unchanged.
On Sun, 6 Feb 2011, Yinghai Lu wrote:
> On 02/05/2011 11:30 PM, H. Peter Anvin wrote:
> > On 02/05/2011 11:02 PM, Yinghai Lu wrote:
> >> why not just move calling cleanup_highmap down?
> >>
> >> something like attached patch.
> >
> > This patch looks very clean and looks on the surface of it like it is
> > removing some ugly ad hoc code, but (as always) it needs a description
> > about the problem it solves and why it is correct.
>
> Sure.
>
>
> Jeremy and xen guys, can you please check if it works well with xen ?
>
Actually this patch makes things worse on xen, because before
cleanup_highmap() wasn't called at all on xen (on purpose) and now it
is, fully destroying all the mappings we have at _end.
Can we add a check on memblock reserved regions in cleanup_highmap()?
Otherwise could we avoid calling cleanup_highmap() at all on xen?
On 02/07/2011 08:50 AM, Stefano Stabellini wrote:
> On Sun, 6 Feb 2011, Yinghai Lu wrote:
>> On 02/05/2011 11:30 PM, H. Peter Anvin wrote:
>>> On 02/05/2011 11:02 PM, Yinghai Lu wrote:
>>>> why not just move calling cleanup_highmap down?
>>>>
>>>> something like attached patch.
>>>
>>> This patch looks very clean and looks on the surface of it like it is
>>> removing some ugly ad hoc code, but (as always) it needs a description
>>> about the problem it solves and why it is correct.
>>
>> Sure.
>>
>>
>> Jeremy and xen guys, can you please check if it works well with xen ?
>>
>
> Actually this patch makes things worse on xen, because before
> cleanup_highmap() wasn't called at all on xen (on purpose) and now it
> is, fully destroying all the mappings we have at _end.
>
> Can we add a check on memblock reserved regions in cleanup_highmap()?
> Otherwise could we avoid calling cleanup_highmap() at all on xen?
why DOES xen need the over-mapped kernel initial mapping?
what is in that range after _end up to 512M?
Yinghai
On Mon, 7 Feb 2011, Yinghai Lu wrote:
> On 02/07/2011 08:50 AM, Stefano Stabellini wrote:
> > On Sun, 6 Feb 2011, Yinghai Lu wrote:
> >> On 02/05/2011 11:30 PM, H. Peter Anvin wrote:
> >>> On 02/05/2011 11:02 PM, Yinghai Lu wrote:
> >>>> why not just move calling cleanup_highmap down?
> >>>>
> >>>> something like attached patch.
> >>>
> >>> This patch looks very clean and looks on the surface of it like it is
> >>> removing some ugly ad hoc code, but (as always) it needs a description
> >>> about the problem it solves and why it is correct.
> >>
> >> Sure.
> >>
> >>
> >> Jeremy and xen guys, can you please check if it works well with xen ?
> >>
> >
> > Actually this patch makes things worse on xen, because before
> > cleanup_highmap() wasn't called at all on xen (on purpose) and now it
> > is, fully destroying all the mappings we have at _end.
> >
> > Can we add a check on memblock reserved regions in cleanup_highmap()?
> > Otherwise could we avoid calling cleanup_highmap() at all on xen?
>
> why DO xen need over-mapped kernel initial mapping?
>
> what is in that range after _end to 512M?
The mfn list, that is, the list of machine frame numbers assigned to this
domain; it is used across the kernel to convert pfns into mfns.
It is passed to us at that address by the domain builder.
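To illustrate why that area must stay mapped: conceptually the mfn list is just
a flat array of machine frame numbers, indexed by pfn, living in the initial
mapping right above _end. A rough sketch, with illustrative names only (the
real kernel goes through its p2m accessors rather than raw array indexing):

/* illustrative only -- the array is handed to us above _end by the domain builder */
static unsigned long *mfn_list;

static inline unsigned long example_pfn_to_mfn(unsigned long pfn)
{
	/* every pfn->mfn conversion is an index into the list */
	return mfn_list[pfn];
}

So if the mappings above _end are torn down, every one of those conversions
starts faulting.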
On Mon, 7 Feb 2011, Stefano Stabellini wrote:
> On Sun, 6 Feb 2011, Yinghai Lu wrote:
> > On 02/05/2011 11:30 PM, H. Peter Anvin wrote:
> > > On 02/05/2011 11:02 PM, Yinghai Lu wrote:
> > >> why not just move calling cleanup_highmap down?
> > >>
> > >> something like attached patch.
> > >
> > > This patch looks very clean and looks on the surface of it like it is
> > > removing some ugly ad hoc code, but (as always) it needs a description
> > > about the problem it solves and why it is correct.
> >
> > Sure.
> >
> >
> > Jeremy and xen guys, can you please check if it works well with xen ?
> >
>
> Actually this patch makes things worse on xen, because before
> cleanup_highmap() wasn't called at all on xen (on purpose) and now it
> is, fully destroying all the mappings we have at _end.
>
> Can we add a check on memblock reserved regions in cleanup_highmap()?
In case you are wondering what Yinghai Lu's patch would look like with
the added check, here it is:
diff --git a/arch/x86/include/asm/memblock.h b/arch/x86/include/asm/memblock.h
index 19ae14b..184f778 100644
--- a/arch/x86/include/asm/memblock.h
+++ b/arch/x86/include/asm/memblock.h
@@ -3,6 +3,7 @@
#define ARCH_DISCARD_MEMBLOCK
+bool memblock_check_reserved_size(u64 *addrp, u64 *sizep, u64 align);
u64 memblock_x86_find_in_range_size(u64 start, u64 *sizep, u64 align);
void memblock_x86_to_bootmem(u64 start, u64 end);
diff --git a/arch/x86/include/asm/pgtable_64.h b/arch/x86/include/asm/pgtable_64.h
index 975f709..28686b6 100644
--- a/arch/x86/include/asm/pgtable_64.h
+++ b/arch/x86/include/asm/pgtable_64.h
@@ -165,7 +165,7 @@ static inline int pgd_large(pgd_t pgd) { return 0; }
#define __swp_entry_to_pte(x) ((pte_t) { .pte = (x).val })
extern int kern_addr_valid(unsigned long addr);
-extern void cleanup_highmap(void);
+extern void cleanup_highmap(unsigned long end);
#define HAVE_ARCH_UNMAPPED_AREA
#define HAVE_ARCH_UNMAPPED_AREA_TOPDOWN
diff --git a/arch/x86/kernel/head64.c b/arch/x86/kernel/head64.c
index 2d2673c..5655c22 100644
--- a/arch/x86/kernel/head64.c
+++ b/arch/x86/kernel/head64.c
@@ -77,9 +77,6 @@ void __init x86_64_start_kernel(char * real_mode_data)
/* Make NULL pointers segfault */
zap_identity_mappings();
- /* Cleanup the over mapped high alias */
- cleanup_highmap();
-
max_pfn_mapped = KERNEL_IMAGE_SIZE >> PAGE_SHIFT;
for (i = 0; i < NUM_EXCEPTION_VECTORS; i++) {
diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index d3cfe26..91afde6 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -297,6 +297,9 @@ static void __init init_gbpages(void)
static inline void init_gbpages(void)
{
}
+static void __init cleanup_highmap(unsigned long end)
+{
+}
#endif
static void __init reserve_brk(void)
@@ -922,6 +925,9 @@ void __init setup_arch(char **cmdline_p)
*/
reserve_brk();
+ /* Cleanup the over mapped high alias after _brk_end*/
+ cleanup_highmap(_brk_end);
+
memblock.current_limit = get_max_mapped();
memblock_x86_fill();
diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
index 947f42a..f13ff3a 100644
--- a/arch/x86/mm/init.c
+++ b/arch/x86/mm/init.c
@@ -279,25 +279,6 @@ unsigned long __init_refok init_memory_mapping(unsigned long start,
load_cr3(swapper_pg_dir);
#endif
-#ifdef CONFIG_X86_64
- if (!after_bootmem && !start) {
- pud_t *pud;
- pmd_t *pmd;
-
- mmu_cr4_features = read_cr4();
-
- /*
- * _brk_end cannot change anymore, but it and _end may be
- * located on different 2M pages. cleanup_highmap(), however,
- * can only consider _end when it runs, so destroy any
- * mappings beyond _brk_end here.
- */
- pud = pud_offset(pgd_offset_k(_brk_end), _brk_end);
- pmd = pmd_offset(pud, _brk_end - 1);
- while (++pmd <= pmd_offset(pud, (unsigned long)_end - 1))
- pmd_clear(pmd);
- }
-#endif
__flush_tlb_all();
if (!after_bootmem && e820_table_end > e820_table_start)
diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index 71a5929..028c49e 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -297,18 +297,26 @@ void __init init_extra_mapping_uc(unsigned long phys, unsigned long size)
* rounded up to the 2MB boundary. This catches the invalid pmds as
* well, as they are located before _text:
*/
-void __init cleanup_highmap(void)
+void __init cleanup_highmap(unsigned long end)
{
unsigned long vaddr = __START_KERNEL_map;
- unsigned long end = roundup((unsigned long)_end, PMD_SIZE) - 1;
pmd_t *pmd = level2_kernel_pgt;
pmd_t *last_pmd = pmd + PTRS_PER_PMD;
+ u64 size, addrp;
+ bool changed;
+
+ end = roundup(end, PMD_SIZE) - 1;
for (; pmd < last_pmd; pmd++, vaddr += PMD_SIZE) {
if (pmd_none(*pmd))
continue;
- if (vaddr < (unsigned long) _text || vaddr > end)
- set_pmd(pmd, __pmd(0));
+ if (vaddr < (unsigned long) _text || vaddr > end) {
+ addrp = __pa(vaddr);
+ size = PMD_SIZE;
+ changed = memblock_check_reserved_size(&addrp, &size, PMD_SIZE);
+ if (!changed && size)
+ set_pmd(pmd, __pmd(0));
+ }
}
}
diff --git a/arch/x86/mm/memblock.c b/arch/x86/mm/memblock.c
index aa11693..fac21d4 100644
--- a/arch/x86/mm/memblock.c
+++ b/arch/x86/mm/memblock.c
@@ -8,7 +8,7 @@
#include <linux/range.h>
/* Check for already reserved areas */
-static bool __init check_with_memblock_reserved_size(u64 *addrp, u64 *sizep, u64 align)
+bool __init memblock_check_reserved_size(u64 *addrp, u64 *sizep, u64 align)
{
struct memblock_region *r;
u64 addr = *addrp, last;
@@ -59,7 +59,7 @@ u64 __init memblock_x86_find_in_range_size(u64 start, u64 *sizep, u64 align)
if (addr >= ei_last)
continue;
*sizep = ei_last - addr;
- while (check_with_memblock_reserved_size(&addr, sizep, align))
+ while (memblock_check_reserved_size(&addr, sizep, align))
;
if (*sizep)
On 02/07/2011 10:58 AM, Stefano Stabellini wrote:
> On Mon, 7 Feb 2011, Yinghai Lu wrote:
>> On 02/07/2011 08:50 AM, Stefano Stabellini wrote:
>>> On Sun, 6 Feb 2011, Yinghai Lu wrote:
>>>> On 02/05/2011 11:30 PM, H. Peter Anvin wrote:
>>>>> On 02/05/2011 11:02 PM, Yinghai Lu wrote:
>>>>>> why not just move calling cleanup_highmap down?
>>>>>>
>>>>>> something like attached patch.
>>>>>
>>>>> This patch looks very clean and looks on the surface of it like it is
>>>>> removing some ugly ad hoc code, but (as always) it needs a description
>>>>> about the problem it solves and why it is correct.
>>>>
>>>> Sure.
>>>>
>>>>
>>>> Jeremy and xen guys, can you please check if it works well with xen ?
>>>>
>>>
>>> Actually this patch makes things worse on xen, because before
>>> cleanup_highmap() wasn't called at all on xen (on purpose) and now it
>>> is, fully destroying all the mappings we have at _end.
>>>
>>> Can we add a check on memblock reserved regions in cleanup_highmap()?
>>> Otherwise could we avoid calling cleanup_highmap() at all on xen?
>>
>> why DO xen need over-mapped kernel initial mapping?
>>
>> what is in that range after _end to 512M?
>
> The mfn list that is the list of machine frame numbers assigned to this
> domain; it is used across the kernel to convert pfns into mfns.
> It passed to us at that address by the domain builder.
is it possible for you to pass physical address, and then map it in kernel?
On Mon, Feb 7, 2011 at 11:00 AM, Yinghai Lu <[email protected]> wrote:
> On 02/07/2011 10:58 AM, Stefano Stabellini wrote:
>> On Mon, 7 Feb 2011, Yinghai Lu wrote:
>>> On 02/07/2011 08:50 AM, Stefano Stabellini wrote:
>>>> On Sun, 6 Feb 2011, Yinghai Lu wrote:
>>>>> On 02/05/2011 11:30 PM, H. Peter Anvin wrote:
>>>>>> On 02/05/2011 11:02 PM, Yinghai Lu wrote:
>>>>>>> why not just move calling cleanup_highmap down?
>>>>>>>
>>>>>>> something like attached patch.
>>>>>>
>>>>>> This patch looks very clean and looks on the surface of it like it is
>>>>>> removing some ugly ad hoc code, but (as always) it needs a description
>>>>>> about the problem it solves and why it is correct.
>>>>>
>>>>> Sure.
>>>>>
>>>>>
>>>>> Jeremy and xen guys, can you please check if it works well with xen ?
>>>>>
>>>>
>>>> Actually this patch makes things worse on xen, because before
>>>> cleanup_highmap() wasn't called at all on xen (on purpose) and now it
>>>> is, fully destroying all the mappings we have at _end.
>>>>
>>>> Can we add a check on memblock reserved regions in cleanup_highmap()?
>>>> Otherwise could we avoid calling cleanup_highmap() at all on xen?
>>>
>>> why DO xen need over-mapped kernel initial mapping?
>>>
>>> what is in that range after _end to 512M?
>>
>> The mfn list that is the list of machine frame numbers assigned to this
>> domain; it is used across the kernel to convert pfns into mfns.
>> It passed to us at that address by the domain builder.
>
> is it possible for you to pass physical address, and then map it in kernel?
Or,
I wonder if you can use the real kernel mapping for your mfn list
access instead of going through the initial mapping?
Something like:
mfn_list = __va(__pa(initial_mfn_...);
Yinghai
On 02/07/2011 11:00 AM, Yinghai Lu wrote:
> On 02/07/2011 10:58 AM, Stefano Stabellini wrote:
>> On Mon, 7 Feb 2011, Yinghai Lu wrote:
>>> On 02/07/2011 08:50 AM, Stefano Stabellini wrote:
>>>> On Sun, 6 Feb 2011, Yinghai Lu wrote:
>>>>> On 02/05/2011 11:30 PM, H. Peter Anvin wrote:
>>>>>> On 02/05/2011 11:02 PM, Yinghai Lu wrote:
>>>>>>> why not just move calling cleanup_highmap down?
>>>>>>>
>>>>>>> something like attached patch.
>>>>>> This patch looks very clean and looks on the surface of it like it is
>>>>>> removing some ugly ad hoc code, but (as always) it needs a description
>>>>>> about the problem it solves and why it is correct.
>>>>> Sure.
>>>>>
>>>>>
>>>>> Jeremy and xen guys, can you please check if it works well with xen ?
>>>>>
>>>> Actually this patch makes things worse on xen, because before
>>>> cleanup_highmap() wasn't called at all on xen (on purpose) and now it
>>>> is, fully destroying all the mappings we have at _end.
>>>>
>>>> Can we add a check on memblock reserved regions in cleanup_highmap()?
>>>> Otherwise could we avoid calling cleanup_highmap() at all on xen?
>>> why DO xen need over-mapped kernel initial mapping?
>>>
>>> what is in that range after _end to 512M?
>> The mfn list that is the list of machine frame numbers assigned to this
>> domain; it is used across the kernel to convert pfns into mfns.
>> It passed to us at that address by the domain builder.
> is it possible for you to pass physical address, and then map it in kernel?
That is possible in principle, but very difficult in practice.
What's wrong with honouring reserved memory ranges under all circumstances?
J
On 02/07/2011 01:56 PM, Jeremy Fitzhardinge wrote:
> On 02/07/2011 11:00 AM, Yinghai Lu wrote:
>> On 02/07/2011 10:58 AM, Stefano Stabellini wrote:
>>> On Mon, 7 Feb 2011, Yinghai Lu wrote:
>>>> On 02/07/2011 08:50 AM, Stefano Stabellini wrote:
>>>>> On Sun, 6 Feb 2011, Yinghai Lu wrote:
>>>>>> On 02/05/2011 11:30 PM, H. Peter Anvin wrote:
>>>>>>> On 02/05/2011 11:02 PM, Yinghai Lu wrote:
>>>>>>>> why not just move calling cleanup_highmap down?
>>>>>>>>
>>>>>>>> something like attached patch.
>>>>>>> This patch looks very clean and looks on the surface of it like it is
>>>>>>> removing some ugly ad hoc code, but (as always) it needs a description
>>>>>>> about the problem it solves and why it is correct.
>>>>>> Sure.
>>>>>>
>>>>>>
>>>>>> Jeremy and xen guys, can you please check if it works well with xen ?
>>>>>>
>>>>> Actually this patch makes things worse on xen, because before
>>>>> cleanup_highmap() wasn't called at all on xen (on purpose) and now it
>>>>> is, fully destroying all the mappings we have at _end.
>>>>>
>>>>> Can we add a check on memblock reserved regions in cleanup_highmap()?
>>>>> Otherwise could we avoid calling cleanup_highmap() at all on xen?
>>>> why DO xen need over-mapped kernel initial mapping?
>>>>
>>>> what is in that range after _end to 512M?
>>> The mfn list that is the list of machine frame numbers assigned to this
>>> domain; it is used across the kernel to convert pfns into mfns.
>>> It passed to us at that address by the domain builder.
>> is it possible for you to pass physical address, and then map it in kernel?
>
> That is possible in principle, but very difficult in practice.
>
> What's wrong with honouring reserved memory ranges under all circumstances?
why punish the native path with that checking?
please check if
+ * max_pfn_mapped is set to KERNEL_IMAGE_SIZE >> PAGE_SHIFT in
+ * head64.c::x86_start_kernel(), aka native path
+ */
+ if (max_pfn_mapped != (KERNEL_IMAGE_SIZE >> PAGE_SHIFT))
+ return;
could be used to skip clearing the highmap for the xen path?
Thanks
Yinghai
[PATCH -v2] x86: Cleanup highmap after brk is concluded
Now the highmap cleanup is actually done in two steps. One is early in head64.c,
and it only clears above _end; later we try to clean from _brk_end to _end,
which needs to double-check whether those boundaries are PMD_SIZE aligned...
Also, init_memory_mapping() is called several times for NUMA or memory hotplug,
and it deals with the real memory mapping instead of the initial kernel mapping.
We really should not handle the initial kernel mapping there.
We can move the cleanup_highmap() call down to after _brk_end is settled and pass
_brk_end to it.
Signed-off-by: Yinghai Lu <[email protected]>
---
arch/x86/include/asm/pgtable_64.h | 2 +-
arch/x86/kernel/head64.c | 3 ---
arch/x86/kernel/setup.c | 6 ++++++
arch/x86/mm/init.c | 19 -------------------
arch/x86/mm/init_64.c | 20 +++++++++++++++-----
5 files changed, 22 insertions(+), 28 deletions(-)
Index: linux-2.6/arch/x86/include/asm/pgtable_64.h
===================================================================
--- linux-2.6.orig/arch/x86/include/asm/pgtable_64.h
+++ linux-2.6/arch/x86/include/asm/pgtable_64.h
@@ -165,7 +165,7 @@ static inline int pgd_large(pgd_t pgd) {
#define __swp_entry_to_pte(x) ((pte_t) { .pte = (x).val })
extern int kern_addr_valid(unsigned long addr);
-extern void cleanup_highmap(void);
+extern void cleanup_highmap(unsigned long end);
#define HAVE_ARCH_UNMAPPED_AREA
#define HAVE_ARCH_UNMAPPED_AREA_TOPDOWN
Index: linux-2.6/arch/x86/kernel/head64.c
===================================================================
--- linux-2.6.orig/arch/x86/kernel/head64.c
+++ linux-2.6/arch/x86/kernel/head64.c
@@ -77,9 +77,6 @@ void __init x86_64_start_kernel(char * r
/* Make NULL pointers segfault */
zap_identity_mappings();
- /* Cleanup the over mapped high alias */
- cleanup_highmap();
-
max_pfn_mapped = KERNEL_IMAGE_SIZE >> PAGE_SHIFT;
for (i = 0; i < NUM_EXCEPTION_VECTORS; i++) {
Index: linux-2.6/arch/x86/kernel/setup.c
===================================================================
--- linux-2.6.orig/arch/x86/kernel/setup.c
+++ linux-2.6/arch/x86/kernel/setup.c
@@ -297,6 +297,9 @@ static void __init init_gbpages(void)
static inline void init_gbpages(void)
{
}
+static void __init cleanup_highmap(unsigned long end)
+{
+}
#endif
static void __init reserve_brk(void)
@@ -922,6 +925,9 @@ void __init setup_arch(char **cmdline_p)
*/
reserve_brk();
+ /* Cleanup the over mapped high alias after _brk_end*/
+ cleanup_highmap(_brk_end);
+
memblock.current_limit = get_max_mapped();
memblock_x86_fill();
Index: linux-2.6/arch/x86/mm/init.c
===================================================================
--- linux-2.6.orig/arch/x86/mm/init.c
+++ linux-2.6/arch/x86/mm/init.c
@@ -279,25 +279,6 @@ unsigned long __init_refok init_memory_m
load_cr3(swapper_pg_dir);
#endif
-#ifdef CONFIG_X86_64
- if (!after_bootmem && !start) {
- pud_t *pud;
- pmd_t *pmd;
-
- mmu_cr4_features = read_cr4();
-
- /*
- * _brk_end cannot change anymore, but it and _end may be
- * located on different 2M pages. cleanup_highmap(), however,
- * can only consider _end when it runs, so destroy any
- * mappings beyond _brk_end here.
- */
- pud = pud_offset(pgd_offset_k(_brk_end), _brk_end);
- pmd = pmd_offset(pud, _brk_end - 1);
- while (++pmd <= pmd_offset(pud, (unsigned long)_end - 1))
- pmd_clear(pmd);
- }
-#endif
__flush_tlb_all();
if (!after_bootmem && e820_table_end > e820_table_start)
Index: linux-2.6/arch/x86/mm/init_64.c
===================================================================
--- linux-2.6.orig/arch/x86/mm/init_64.c
+++ linux-2.6/arch/x86/mm/init_64.c
@@ -297,12 +297,22 @@ void __init init_extra_mapping_uc(unsign
* rounded up to the 2MB boundary. This catches the invalid pmds as
* well, as they are located before _text:
*/
-void __init cleanup_highmap(void)
+void __init cleanup_highmap(unsigned long end)
{
- unsigned long vaddr = __START_KERNEL_map;
- unsigned long end = roundup((unsigned long)_end, PMD_SIZE) - 1;
- pmd_t *pmd = level2_kernel_pgt;
- pmd_t *last_pmd = pmd + PTRS_PER_PMD;
+ unsigned long vaddr;
+ pmd_t *pmd, *last_pmd;
+
+ /*
+ * max_pfn_mapped is set to KERNEL_IMAGE_SIZE >> PAGE_SHIFT in
+ * head64.c::x86_start_kernel(), aka native path
+ */
+ if (max_pfn_mapped != (KERNEL_IMAGE_SIZE >> PAGE_SHIFT))
+ return;
+
+ vaddr = __START_KERNEL_map;
+ pmd = level2_kernel_pgt;
+ last_pmd = pmd + PTRS_PER_PMD;
+ end = roundup(end, PMD_SIZE) - 1;
for (; pmd < last_pmd; pmd++, vaddr += PMD_SIZE) {
if (pmd_none(*pmd))
On 02/07/2011 07:12 PM, Yinghai Lu wrote:
> On 02/07/2011 01:56 PM, Jeremy Fitzhardinge wrote:
>> On 02/07/2011 11:00 AM, Yinghai Lu wrote:
>>> On 02/07/2011 10:58 AM, Stefano Stabellini wrote:
>>>> On Mon, 7 Feb 2011, Yinghai Lu wrote:
>>>>> On 02/07/2011 08:50 AM, Stefano Stabellini wrote:
>>>>>> On Sun, 6 Feb 2011, Yinghai Lu wrote:
>>>>>>> On 02/05/2011 11:30 PM, H. Peter Anvin wrote:
>>>>>>>> On 02/05/2011 11:02 PM, Yinghai Lu wrote:
>>>>>>>>> why not just move calling cleanup_highmap down?
>>>>>>>>>
>>>>>>>>> something like attached patch.
>>>>>>>> This patch looks very clean and looks on the surface of it like it is
>>>>>>>> removing some ugly ad hoc code, but (as always) it needs a description
>>>>>>>> about the problem it solves and why it is correct.
>>>>>>> Sure.
>>>>>>>
>>>>>>>
>>>>>>> Jeremy and xen guys, can you please check if it works well with xen ?
>>>>>>>
>>>>>> Actually this patch makes things worse on xen, because before
>>>>>> cleanup_highmap() wasn't called at all on xen (on purpose) and now it
>>>>>> is, fully destroying all the mappings we have at _end.
>>>>>>
>>>>>> Can we add a check on memblock reserved regions in cleanup_highmap()?
>>>>>> Otherwise could we avoid calling cleanup_highmap() at all on xen?
>>>>> why DO xen need over-mapped kernel initial mapping?
>>>>>
>>>>> what is in that range after _end to 512M?
>>>> The mfn list that is the list of machine frame numbers assigned to this
>>>> domain; it is used across the kernel to convert pfns into mfns.
>>>> It passed to us at that address by the domain builder.
>>> is it possible for you to pass physical address, and then map it in kernel?
>> That is possible in principle, but very difficult in practice.
>>
>> What's wrong with honouring reserved memory ranges under all circumstances?
> why punishing native path with those checking?
Why is it punishing anyone to expect memory reservations to be observed?
> please check if
>
> + * max_pfn_mapped is set to KERNEL_IMAGE_SIZE >> PAGE_SHIFT in
> + * head64.c::x86_start_kernel(), aka native path
> + */
> + if (max_pfn_mapped != (KERNEL_IMAGE_SIZE >> PAGE_SHIFT))
> + return;
>
> could be used to skip clear highmap for xen path?
Seems pretty ad-hoc.
J
> Thanks
>
> Yinghai
>
>
>
> [PATCH -v2] x86: Cleanup highmap after brk is concluded
>
> Now cleanup_high actually is two steps. one is early in head64.c.
> and it only clear above _end. later will try to clean from _brk_end to _end.
> it will need to double check if those boundary are PMD_SIZE aligned...
>
> Also init_memory_mapping() are called several times for numa or memory hotplug.
> and it deals real memory mapping instead of initial kernel mapping.
> We really should not handle initial kernel mapping there.
>
> We can move cleanup_highmap() calling down after _brk_end is settled and pass
> _brk_end with it.
>
> Signed-off-by: Yinghai Lu <[email protected]>
>
> ---
> arch/x86/include/asm/pgtable_64.h | 2 +-
> arch/x86/kernel/head64.c | 3 ---
> arch/x86/kernel/setup.c | 6 ++++++
> arch/x86/mm/init.c | 19 -------------------
> arch/x86/mm/init_64.c | 20 +++++++++++++++-----
> 5 files changed, 22 insertions(+), 28 deletions(-)
>
> Index: linux-2.6/arch/x86/include/asm/pgtable_64.h
> ===================================================================
> --- linux-2.6.orig/arch/x86/include/asm/pgtable_64.h
> +++ linux-2.6/arch/x86/include/asm/pgtable_64.h
> @@ -165,7 +165,7 @@ static inline int pgd_large(pgd_t pgd) {
> #define __swp_entry_to_pte(x) ((pte_t) { .pte = (x).val })
>
> extern int kern_addr_valid(unsigned long addr);
> -extern void cleanup_highmap(void);
> +extern void cleanup_highmap(unsigned long end);
>
> #define HAVE_ARCH_UNMAPPED_AREA
> #define HAVE_ARCH_UNMAPPED_AREA_TOPDOWN
> Index: linux-2.6/arch/x86/kernel/head64.c
> ===================================================================
> --- linux-2.6.orig/arch/x86/kernel/head64.c
> +++ linux-2.6/arch/x86/kernel/head64.c
> @@ -77,9 +77,6 @@ void __init x86_64_start_kernel(char * r
> /* Make NULL pointers segfault */
> zap_identity_mappings();
>
> - /* Cleanup the over mapped high alias */
> - cleanup_highmap();
> -
> max_pfn_mapped = KERNEL_IMAGE_SIZE >> PAGE_SHIFT;
>
> for (i = 0; i < NUM_EXCEPTION_VECTORS; i++) {
> Index: linux-2.6/arch/x86/kernel/setup.c
> ===================================================================
> --- linux-2.6.orig/arch/x86/kernel/setup.c
> +++ linux-2.6/arch/x86/kernel/setup.c
> @@ -297,6 +297,9 @@ static void __init init_gbpages(void)
> static inline void init_gbpages(void)
> {
> }
> +static void __init cleanup_highmap(unsigned long end)
> +{
> +}
> #endif
>
> static void __init reserve_brk(void)
> @@ -922,6 +925,9 @@ void __init setup_arch(char **cmdline_p)
> */
> reserve_brk();
>
> + /* Cleanup the over mapped high alias after _brk_end*/
> + cleanup_highmap(_brk_end);
> +
> memblock.current_limit = get_max_mapped();
> memblock_x86_fill();
>
> Index: linux-2.6/arch/x86/mm/init.c
> ===================================================================
> --- linux-2.6.orig/arch/x86/mm/init.c
> +++ linux-2.6/arch/x86/mm/init.c
> @@ -279,25 +279,6 @@ unsigned long __init_refok init_memory_m
> load_cr3(swapper_pg_dir);
> #endif
>
> -#ifdef CONFIG_X86_64
> - if (!after_bootmem && !start) {
> - pud_t *pud;
> - pmd_t *pmd;
> -
> - mmu_cr4_features = read_cr4();
> -
> - /*
> - * _brk_end cannot change anymore, but it and _end may be
> - * located on different 2M pages. cleanup_highmap(), however,
> - * can only consider _end when it runs, so destroy any
> - * mappings beyond _brk_end here.
> - */
> - pud = pud_offset(pgd_offset_k(_brk_end), _brk_end);
> - pmd = pmd_offset(pud, _brk_end - 1);
> - while (++pmd <= pmd_offset(pud, (unsigned long)_end - 1))
> - pmd_clear(pmd);
> - }
> -#endif
> __flush_tlb_all();
>
> if (!after_bootmem && e820_table_end > e820_table_start)
> Index: linux-2.6/arch/x86/mm/init_64.c
> ===================================================================
> --- linux-2.6.orig/arch/x86/mm/init_64.c
> +++ linux-2.6/arch/x86/mm/init_64.c
> @@ -297,12 +297,22 @@ void __init init_extra_mapping_uc(unsign
> * rounded up to the 2MB boundary. This catches the invalid pmds as
> * well, as they are located before _text:
> */
> -void __init cleanup_highmap(void)
> +void __init cleanup_highmap(unsigned long end)
> {
> - unsigned long vaddr = __START_KERNEL_map;
> - unsigned long end = roundup((unsigned long)_end, PMD_SIZE) - 1;
> - pmd_t *pmd = level2_kernel_pgt;
> - pmd_t *last_pmd = pmd + PTRS_PER_PMD;
> + unsigned long vaddr;
> + pmd_t *pmd, *last_pmd;
> +
> + /*
> + * max_pfn_mapped is set to KERNEL_IMAGE_SIZE >> PAGE_SHIFT in
> + * head64.c::x86_start_kernel(), aka native path
> + */
> + if (max_pfn_mapped != (KERNEL_IMAGE_SIZE >> PAGE_SHIFT))
> + return;
> +
> + vaddr = __START_KERNEL_map;
> + pmd = level2_kernel_pgt;
> + last_pmd = pmd + PTRS_PER_PMD;
> + end = roundup(end, PMD_SIZE) - 1;
>
> for (; pmd < last_pmd; pmd++, vaddr += PMD_SIZE) {
> if (pmd_none(*pmd))
>
On Mon, Feb 7, 2011 at 8:56 PM, Jeremy Fitzhardinge <[email protected]> wrote:
> On 02/07/2011 07:12 PM, Yinghai Lu wrote:
>> On 02/07/2011 01:56 PM, Jeremy Fitzhardinge wrote:
>>> On 02/07/2011 11:00 AM, Yinghai Lu wrote:
>>>> On 02/07/2011 10:58 AM, Stefano Stabellini wrote:
>>>>> On Mon, 7 Feb 2011, Yinghai Lu wrote:
>>>>>> On 02/07/2011 08:50 AM, Stefano Stabellini wrote:
>>>>>>> On Sun, 6 Feb 2011, Yinghai Lu wrote:
>>>>>>>> On 02/05/2011 11:30 PM, H. Peter Anvin wrote:
>>>>>>>>> On 02/05/2011 11:02 PM, Yinghai Lu wrote:
>>>>>>>>>> why not just move calling cleanup_highmap down?
>>>>>>>>>>
>>>>>>>>>> something like attached patch.
>>>>>>>>> This patch looks very clean and looks on the surface of it like it is
>>>>>>>>> removing some ugly ad hoc code, but (as always) it needs a description
>>>>>>>>> about the problem it solves and why it is correct.
>>>>>>>> Sure.
>>>>>>>>
>>>>>>>>
>>>>>>>> Jeremy and xen guys, can you please check if it works well with xen ?
>>>>>>>>
>>>>>>> Actually this patch makes things worse on xen, because before
>>>>>>> cleanup_highmap() wasn't called at all on xen (on purpose) and now it
>>>>>>> is, fully destroying all the mappings we have at _end.
>>>>>>>
>>>>>>> Can we add a check on memblock reserved regions in cleanup_highmap()?
>>>>>>> Otherwise could we avoid calling cleanup_highmap() at all on xen?
>>>>>> why DO xen need over-mapped kernel initial mapping?
>>>>>>
>>>>>> what is in that range after _end to 512M?
>>>>> The mfn list that is the list of machine frame numbers assigned to this
>>>>> domain; it is used across the kernel to convert pfns into mfns.
>>>>> It passed to us at that address by the domain builder.
>>>> is it possible for you to pass physical address, and then map it in kernel?
>>> That is possible in principle, but very difficult in practice.
>>>
>>> What's wrong with honouring reserved memory ranges under all circumstances?
>> why punishing native path with those checking?
>
> Why is it punishing anyone to expect memory reservations to be observed?
>
>> please check if
>>
>> +      * max_pfn_mapped is set to KERNEL_IMAGE_SIZE >> PAGE_SHIFT in
>> +      *   head64.c::x86_start_kernel(), aka native path
>> +      */
>> +     if (max_pfn_mapped != (KERNEL_IMAGE_SIZE >> PAGE_SHIFT))
>> +             return;
>>
>> could be used to skip clear highmap for xen path?
>
> Seems pretty ad-hoc.
>
then what is the size of the mfn-list after _end...
could it be copied or moved to the BRK?
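As a rough sketch of what copying it to the BRK would mean (purely illustrative;
whether the list even fits is exactly the question, since the brk area is sized
at compile time via RESERVE_BRK and the mfn list is one unsigned long per pfn,
potentially several megabytes for a large domain):

/* illustrative only; assumes enough brk space has been reserved up front */
static unsigned long * __init copy_mfn_list_to_brk(unsigned long *src,
						   unsigned long nr_pfns)
{
	unsigned long *dst = extend_brk(nr_pfns * sizeof(*dst), PAGE_SIZE);

	memcpy(dst, src, nr_pfns * sizeof(*dst));
	return dst;	/* now lives below _brk_end, inside the kernel mapping */
}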
Yinghai
On Mon, Feb 7, 2011 at 11:00 AM, Stefano Stabellini
<[email protected]> wrote:
> On Mon, 7 Feb 2011, Stefano Stabellini wrote:
>> On Sun, 6 Feb 2011, Yinghai Lu wrote:
>> > On 02/05/2011 11:30 PM, H. Peter Anvin wrote:
>> > > On 02/05/2011 11:02 PM, Yinghai Lu wrote:
>> > >> why not just move calling cleanup_highmap down?
>> > >>
>> > >> something like attached patch.
>> > >
>> > > This patch looks very clean and looks on the surface of it like it is
>> > > removing some ugly ad hoc code, but (as always) it needs a description
>> > > about the problem it solves and why it is correct.
>> >
>> > Sure.
>> >
>> >
>> > Jeremy and xen guys, can you please check if it works well with xen ?
>> >
>>
>> Actually this patch makes things worse on xen, because before
>> cleanup_highmap() wasn't called at all on xen (on purpose) and now it
>> is, fully destroying all the mappings we have at _end.
>>
>> Can we add a check on memblock reserved regions in cleanup_highmap()?
>
> In case you are wondering how Yinghai Lu's patch would look like with
> the added check, here it is:
>
>
> diff --git a/arch/x86/include/asm/memblock.h b/arch/x86/include/asm/memblock.h
> index 19ae14b..184f778 100644
> --- a/arch/x86/include/asm/memblock.h
> +++ b/arch/x86/include/asm/memblock.h
> @@ -3,6 +3,7 @@
>
>  #define ARCH_DISCARD_MEMBLOCK
>
> +bool memblock_check_reserved_size(u64 *addrp, u64 *sizep, u64 align);
>  u64 memblock_x86_find_in_range_size(u64 start, u64 *sizep, u64 align);
>  void memblock_x86_to_bootmem(u64 start, u64 end);
>
> diff --git a/arch/x86/include/asm/pgtable_64.h b/arch/x86/include/asm/pgtable_64.h
> index 975f709..28686b6 100644
> --- a/arch/x86/include/asm/pgtable_64.h
> +++ b/arch/x86/include/asm/pgtable_64.h
> @@ -165,7 +165,7 @@ static inline int pgd_large(pgd_t pgd) { return 0; }
>  #define __swp_entry_to_pte(x)          ((pte_t) { .pte = (x).val })
>
>  extern int kern_addr_valid(unsigned long addr);
> -extern void cleanup_highmap(void);
> +extern void cleanup_highmap(unsigned long end);
>
>  #define HAVE_ARCH_UNMAPPED_AREA
>  #define HAVE_ARCH_UNMAPPED_AREA_TOPDOWN
> diff --git a/arch/x86/kernel/head64.c b/arch/x86/kernel/head64.c
> index 2d2673c..5655c22 100644
> --- a/arch/x86/kernel/head64.c
> +++ b/arch/x86/kernel/head64.c
> @@ -77,9 +77,6 @@ void __init x86_64_start_kernel(char * real_mode_data)
>        /* Make NULL pointers segfault */
>        zap_identity_mappings();
>
> -       /* Cleanup the over mapped high alias */
> -       cleanup_highmap();
> -
>        max_pfn_mapped = KERNEL_IMAGE_SIZE >> PAGE_SHIFT;
>
>        for (i = 0; i < NUM_EXCEPTION_VECTORS; i++) {
> diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
> index d3cfe26..91afde6 100644
> --- a/arch/x86/kernel/setup.c
> +++ b/arch/x86/kernel/setup.c
> @@ -297,6 +297,9 @@ static void __init init_gbpages(void)
>  static inline void init_gbpages(void)
>  {
>  }
> +static void __init cleanup_highmap(unsigned long end)
> +{
> +}
>  #endif
>
>  static void __init reserve_brk(void)
> @@ -922,6 +925,9 @@ void __init setup_arch(char **cmdline_p)
>         */
>        reserve_brk();
>
> +       /* Cleanup the over mapped high alias after _brk_end*/
> +       cleanup_highmap(_brk_end);
> +
>        memblock.current_limit = get_max_mapped();
>        memblock_x86_fill();
>
> diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
> index 947f42a..f13ff3a 100644
> --- a/arch/x86/mm/init.c
> +++ b/arch/x86/mm/init.c
> @@ -279,25 +279,6 @@ unsigned long __init_refok init_memory_mapping(unsigned long start,
>        load_cr3(swapper_pg_dir);
>  #endif
>
> -#ifdef CONFIG_X86_64
> -       if (!after_bootmem && !start) {
> -               pud_t *pud;
> -               pmd_t *pmd;
> -
> -               mmu_cr4_features = read_cr4();
> -
> -               /*
> -                * _brk_end cannot change anymore, but it and _end may be
> -                * located on different 2M pages. cleanup_highmap(), however,
> -                * can only consider _end when it runs, so destroy any
> -                * mappings beyond _brk_end here.
> -                */
> -               pud = pud_offset(pgd_offset_k(_brk_end), _brk_end);
> -               pmd = pmd_offset(pud, _brk_end - 1);
> -               while (++pmd <= pmd_offset(pud, (unsigned long)_end - 1))
> -                       pmd_clear(pmd);
> -       }
> -#endif
>        __flush_tlb_all();
>
>        if (!after_bootmem && e820_table_end > e820_table_start)
> diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
> index 71a5929..028c49e 100644
> --- a/arch/x86/mm/init_64.c
> +++ b/arch/x86/mm/init_64.c
> @@ -297,18 +297,26 @@ void __init init_extra_mapping_uc(unsigned long phys, unsigned long size)
>  * rounded up to the 2MB boundary. This catches the invalid pmds as
>  * well, as they are located before _text:
>  */
> -void __init cleanup_highmap(void)
> +void __init cleanup_highmap(unsigned long end)
>  {
>        unsigned long vaddr = __START_KERNEL_map;
> -       unsigned long end = roundup((unsigned long)_end, PMD_SIZE) - 1;
>        pmd_t *pmd = level2_kernel_pgt;
>        pmd_t *last_pmd = pmd + PTRS_PER_PMD;
> +       u64 size, addrp;
> +       bool changed;
> +
> +       end = roundup(end, PMD_SIZE) - 1;
>
>        for (; pmd < last_pmd; pmd++, vaddr += PMD_SIZE) {
>                if (pmd_none(*pmd))
>                        continue;
> -               if (vaddr < (unsigned long) _text || vaddr > end)
> -                       set_pmd(pmd, __pmd(0));
> +               if (vaddr < (unsigned long) _text || vaddr > end) {
> +                       addrp = __pa(vaddr);
> +                       size = PMD_SIZE;
> +                       changed = memblock_check_reserved_size(&addrp, &size, PMD_SIZE);
> +                       if (!changed && size)
> +                               set_pmd(pmd, __pmd(0));
> +               }
For the native path, memblock_check_reserved_size() is called 256 times
for no obvious reason.
Yinghai
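(For context, the 256 figure follows from the size of the high kernel mapping: the boot page tables map KERNEL_IMAGE_SIZE = 512 MB with 2 MB pmds, so 512 MB / 2 MB = 256 pmd entries of level2_kernel_pgt are populated, and in the worst case the reserved-region check runs once per populated entry.)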
On Tue, 8 Feb 2011, Yinghai Lu wrote:
> On Mon, Feb 7, 2011 at 11:00 AM, Stefano Stabellini
> <[email protected]> wrote:
> > On Mon, 7 Feb 2011, Stefano Stabellini wrote:
> >> On Sun, 6 Feb 2011, Yinghai Lu wrote:
> >> > On 02/05/2011 11:30 PM, H. Peter Anvin wrote:
> >> > > On 02/05/2011 11:02 PM, Yinghai Lu wrote:
> >> > >> why not just move calling cleanup_highmap down?
> >> > >>
> >> > >> something like attached patch.
> >> > >
> >> > > This patch looks very clean and looks on the surface of it like it is
> >> > > removing some ugly ad hoc code, but (as always) it needs a description
> >> > > about the problem it solves and why it is correct.
> >> >
> >> > Sure.
> >> >
> >> >
> >> > Jeremy and xen guys, can you please check if it works well with xen ?
> >> >
> >>
> >> Actually this patch makes things worse on xen, because before
> >> cleanup_highmap() wasn't called at all on xen (on purpose) and now it
> >> is, fully destroying all the mappings we have at _end.
> >>
> >> Can we add a check on memblock reserved regions in cleanup_highmap()?
> >
> > In case you are wondering how Yinghai Lu's patch would look like with
> > the added check, here it is:
> >
> >
> > diff --git a/arch/x86/include/asm/memblock.h b/arch/x86/include/asm/memblock.h
> > index 19ae14b..184f778 100644
> > --- a/arch/x86/include/asm/memblock.h
> > +++ b/arch/x86/include/asm/memblock.h
> > @@ -3,6 +3,7 @@
> >
> > #define ARCH_DISCARD_MEMBLOCK
> >
> > +bool memblock_check_reserved_size(u64 *addrp, u64 *sizep, u64 align);
> > u64 memblock_x86_find_in_range_size(u64 start, u64 *sizep, u64 align);
> > void memblock_x86_to_bootmem(u64 start, u64 end);
> >
> > diff --git a/arch/x86/include/asm/pgtable_64.h b/arch/x86/include/asm/pgtable_64.h
> > index 975f709..28686b6 100644
> > --- a/arch/x86/include/asm/pgtable_64.h
> > +++ b/arch/x86/include/asm/pgtable_64.h
> > @@ -165,7 +165,7 @@ static inline int pgd_large(pgd_t pgd) { return 0; }
> > #define __swp_entry_to_pte(x) ((pte_t) { .pte = (x).val })
> >
> > extern int kern_addr_valid(unsigned long addr);
> > -extern void cleanup_highmap(void);
> > +extern void cleanup_highmap(unsigned long end);
> >
> > #define HAVE_ARCH_UNMAPPED_AREA
> > #define HAVE_ARCH_UNMAPPED_AREA_TOPDOWN
> > diff --git a/arch/x86/kernel/head64.c b/arch/x86/kernel/head64.c
> > index 2d2673c..5655c22 100644
> > --- a/arch/x86/kernel/head64.c
> > +++ b/arch/x86/kernel/head64.c
> > @@ -77,9 +77,6 @@ void __init x86_64_start_kernel(char * real_mode_data)
> > /* Make NULL pointers segfault */
> > zap_identity_mappings();
> >
> > - /* Cleanup the over mapped high alias */
> > - cleanup_highmap();
> > -
> > max_pfn_mapped = KERNEL_IMAGE_SIZE >> PAGE_SHIFT;
> >
> > for (i = 0; i < NUM_EXCEPTION_VECTORS; i++) {
> > diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
> > index d3cfe26..91afde6 100644
> > --- a/arch/x86/kernel/setup.c
> > +++ b/arch/x86/kernel/setup.c
> > @@ -297,6 +297,9 @@ static void __init init_gbpages(void)
> > static inline void init_gbpages(void)
> > {
> > }
> > +static void __init cleanup_highmap(unsigned long end)
> > +{
> > +}
> > #endif
> >
> > static void __init reserve_brk(void)
> > @@ -922,6 +925,9 @@ void __init setup_arch(char **cmdline_p)
> > */
> > reserve_brk();
> >
> > + /* Cleanup the over mapped high alias after _brk_end*/
> > + cleanup_highmap(_brk_end);
> > +
> > memblock.current_limit = get_max_mapped();
> > memblock_x86_fill();
> >
> > diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
> > index 947f42a..f13ff3a 100644
> > --- a/arch/x86/mm/init.c
> > +++ b/arch/x86/mm/init.c
> > @@ -279,25 +279,6 @@ unsigned long __init_refok init_memory_mapping(unsigned long start,
> > load_cr3(swapper_pg_dir);
> > #endif
> >
> > -#ifdef CONFIG_X86_64
> > - if (!after_bootmem && !start) {
> > - pud_t *pud;
> > - pmd_t *pmd;
> > -
> > - mmu_cr4_features = read_cr4();
> > -
> > - /*
> > - * _brk_end cannot change anymore, but it and _end may be
> > - * located on different 2M pages. cleanup_highmap(), however,
> > - * can only consider _end when it runs, so destroy any
> > - * mappings beyond _brk_end here.
> > - */
> > - pud = pud_offset(pgd_offset_k(_brk_end), _brk_end);
> > - pmd = pmd_offset(pud, _brk_end - 1);
> > - while (++pmd <= pmd_offset(pud, (unsigned long)_end - 1))
> > - pmd_clear(pmd);
> > - }
> > -#endif
> > __flush_tlb_all();
> >
> > if (!after_bootmem && e820_table_end > e820_table_start)
> > diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
> > index 71a5929..028c49e 100644
> > --- a/arch/x86/mm/init_64.c
> > +++ b/arch/x86/mm/init_64.c
> > @@ -297,18 +297,26 @@ void __init init_extra_mapping_uc(unsigned long phys, unsigned long size)
> > * rounded up to the 2MB boundary. This catches the invalid pmds as
> > * well, as they are located before _text:
> > */
> > -void __init cleanup_highmap(void)
> > +void __init cleanup_highmap(unsigned long end)
> > {
> > unsigned long vaddr = __START_KERNEL_map;
> > - unsigned long end = roundup((unsigned long)_end, PMD_SIZE) - 1;
> > pmd_t *pmd = level2_kernel_pgt;
> > pmd_t *last_pmd = pmd + PTRS_PER_PMD;
> > + u64 size, addrp;
> > + bool changed;
> > +
> > + end = roundup(end, PMD_SIZE) - 1;
> >
> > for (; pmd < last_pmd; pmd++, vaddr += PMD_SIZE) {
> > if (pmd_none(*pmd))
> > continue;
> > - if (vaddr < (unsigned long) _text || vaddr > end)
> > - set_pmd(pmd, __pmd(0));
> > + if (vaddr < (unsigned long) _text || vaddr > end) {
> > + addrp = __pa(vaddr);
> > + size = PMD_SIZE;
> > + changed = memblock_check_reserved_size(&addrp, &size, PMD_SIZE);
> > + if (!changed && size)
> > + set_pmd(pmd, __pmd(0));
> > + }
>
> for native path, memblock_check_reserved_size() are called 256 times
> without obvious reasons.
What about this patch? Does it look like a reasonable solution?
diff --git a/arch/x86/include/asm/memblock.h b/arch/x86/include/asm/memblock.h
index 19ae14b..184f778 100644
--- a/arch/x86/include/asm/memblock.h
+++ b/arch/x86/include/asm/memblock.h
@@ -3,6 +3,7 @@
#define ARCH_DISCARD_MEMBLOCK
+bool memblock_check_reserved_size(u64 *addrp, u64 *sizep, u64 align);
u64 memblock_x86_find_in_range_size(u64 start, u64 *sizep, u64 align);
void memblock_x86_to_bootmem(u64 start, u64 end);
diff --git a/arch/x86/include/asm/pgtable_64.h b/arch/x86/include/asm/pgtable_64.h
index 975f709..28686b6 100644
--- a/arch/x86/include/asm/pgtable_64.h
+++ b/arch/x86/include/asm/pgtable_64.h
@@ -165,7 +165,7 @@ static inline int pgd_large(pgd_t pgd) { return 0; }
#define __swp_entry_to_pte(x) ((pte_t) { .pte = (x).val })
extern int kern_addr_valid(unsigned long addr);
-extern void cleanup_highmap(void);
+extern void cleanup_highmap(unsigned long end);
#define HAVE_ARCH_UNMAPPED_AREA
#define HAVE_ARCH_UNMAPPED_AREA_TOPDOWN
diff --git a/arch/x86/kernel/head64.c b/arch/x86/kernel/head64.c
index 2d2673c..5655c22 100644
--- a/arch/x86/kernel/head64.c
+++ b/arch/x86/kernel/head64.c
@@ -77,9 +77,6 @@ void __init x86_64_start_kernel(char * real_mode_data)
/* Make NULL pointers segfault */
zap_identity_mappings();
- /* Cleanup the over mapped high alias */
- cleanup_highmap();
-
max_pfn_mapped = KERNEL_IMAGE_SIZE >> PAGE_SHIFT;
for (i = 0; i < NUM_EXCEPTION_VECTORS; i++) {
diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index d3cfe26..91afde6 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -297,6 +297,9 @@ static void __init init_gbpages(void)
static inline void init_gbpages(void)
{
}
+static void __init cleanup_highmap(unsigned long end)
+{
+}
#endif
static void __init reserve_brk(void)
@@ -922,6 +925,9 @@ void __init setup_arch(char **cmdline_p)
*/
reserve_brk();
+ /* Cleanup the over mapped high alias after _brk_end*/
+ cleanup_highmap(_brk_end);
+
memblock.current_limit = get_max_mapped();
memblock_x86_fill();
diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
index 947f42a..f13ff3a 100644
--- a/arch/x86/mm/init.c
+++ b/arch/x86/mm/init.c
@@ -279,25 +279,6 @@ unsigned long __init_refok init_memory_mapping(unsigned long start,
load_cr3(swapper_pg_dir);
#endif
-#ifdef CONFIG_X86_64
- if (!after_bootmem && !start) {
- pud_t *pud;
- pmd_t *pmd;
-
- mmu_cr4_features = read_cr4();
-
- /*
- * _brk_end cannot change anymore, but it and _end may be
- * located on different 2M pages. cleanup_highmap(), however,
- * can only consider _end when it runs, so destroy any
- * mappings beyond _brk_end here.
- */
- pud = pud_offset(pgd_offset_k(_brk_end), _brk_end);
- pmd = pmd_offset(pud, _brk_end - 1);
- while (++pmd <= pmd_offset(pud, (unsigned long)_end - 1))
- pmd_clear(pmd);
- }
-#endif
__flush_tlb_all();
if (!after_bootmem && e820_table_end > e820_table_start)
diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index 71a5929..90a64de 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -297,12 +297,25 @@ void __init init_extra_mapping_uc(unsigned long phys, unsigned long size)
* rounded up to the 2MB boundary. This catches the invalid pmds as
* well, as they are located before _text:
*/
-void __init cleanup_highmap(void)
+void __init cleanup_highmap(unsigned long end)
{
unsigned long vaddr = __START_KERNEL_map;
- unsigned long end = roundup((unsigned long)_end, PMD_SIZE) - 1;
pmd_t *pmd = level2_kernel_pgt;
pmd_t *last_pmd = pmd + PTRS_PER_PMD;
+ u64 size, addrp;
+ bool changed;
+
+ end = roundup(end, PMD_SIZE) - 1;
+
+ /* check for reserved regions after end */
+ addrp = __pa(end);
+ size = (PTRS_PER_PMD * PMD_SIZE + vaddr) - end;
+ changed = memblock_check_reserved_size(&addrp, &size, PMD_SIZE);
+ if (changed || !size) {
+ /* reserved regions found, avoid removing mappings after end */
+ pud_t *pud = pud_offset(pgd_offset_k(end), end);
+ last_pmd = pmd_offset(pud, end);
+ }
for (; pmd < last_pmd; pmd++, vaddr += PMD_SIZE) {
if (pmd_none(*pmd))
diff --git a/arch/x86/mm/memblock.c b/arch/x86/mm/memblock.c
index aa11693..fac21d4 100644
--- a/arch/x86/mm/memblock.c
+++ b/arch/x86/mm/memblock.c
@@ -8,7 +8,7 @@
#include <linux/range.h>
/* Check for already reserved areas */
-static bool __init check_with_memblock_reserved_size(u64 *addrp, u64 *sizep, u64 align)
+bool __init memblock_check_reserved_size(u64 *addrp, u64 *sizep, u64 align)
{
struct memblock_region *r;
u64 addr = *addrp, last;
@@ -59,7 +59,7 @@ u64 __init memblock_x86_find_in_range_size(u64 start, u64 *sizep, u64 align)
if (addr >= ei_last)
continue;
*sizep = ei_last - addr;
- while (check_with_memblock_reserved_size(&addr, sizep, align))
+ while (memblock_check_reserved_size(&addr, sizep, align))
;
if (*sizep)
> >> could be used to skip clear highmap for xen path?
> >
> > Seems pretty ad-hoc.
> >
>
> then what is size for mfn-list after _end...
8 bytes * nr_pages. For 4GB, 2048 pages. For 32GB, 8192 pages. For 128GB,
32768 pages, and so on.
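(Worked out for the 4GB case: 4 GB / 4 KB = 1,048,576 pfns; at 8 bytes per entry that is 8 MB of mfn-list, i.e. 8 MB / 4 KB = 2048 pages.)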
>
> could be copied or move to BRK.
The _brk size is determined at build time. We don't know what the
memory size will be at boot time, so we would have to select the
highest value (128MB), which is quite a large amount to reserve
in the _brk area.
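(For reference — a rough sketch of the brk allocator from the setup code of that era, details may differ — space is reserved statically at build time with RESERVE_BRK() and handed out during boot by extend_brk(), which can never go past the link-time __brk_limit:

void * __init extend_brk(size_t size, size_t align)
{
	size_t mask = align - 1;
	void *ret;

	BUG_ON(_brk_start == 0);
	BUG_ON(align & mask);

	/* bump _brk_end up to the requested alignment */
	_brk_end = (_brk_end + mask) & ~mask;
	/* the build-time reservation is the hard limit */
	BUG_ON((char *)(_brk_end + size) > __brk_limit);

	ret = (void *)_brk_end;
	_brk_end += size;

	memset(ret, 0, size);

	return ret;
}

Anything whose size is only known at boot, like the mfn-list, would need a worst-case RESERVE_BRK() reservation.)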
On 02/08/2011 06:03 AM, Stefano Stabellini wrote:
> On Tue, 8 Feb 2011, Yinghai Lu wrote:
>> On Mon, Feb 7, 2011 at 11:00 AM, Stefano Stabellini
>> <[email protected]> wrote:
>>> On Mon, 7 Feb 2011, Stefano Stabellini wrote:
>>>> On Sun, 6 Feb 2011, Yinghai Lu wrote:
>>>>> On 02/05/2011 11:30 PM, H. Peter Anvin wrote:
>>>>>> On 02/05/2011 11:02 PM, Yinghai Lu wrote:
>>>>>>> why not just move calling cleanup_highmap down?
>>>>>>>
>>>>>>> something like attached patch.
>>>>>>
>>>>>> This patch looks very clean and looks on the surface of it like it is
>>>>>> removing some ugly ad hoc code, but (as always) it needs a description
>>>>>> about the problem it solves and why it is correct.
>>>>>
>>>>> Sure.
>>>>>
>>>>>
>>>>> Jeremy and xen guys, can you please check if it works well with xen ?
>>>>>
>>>>
>>>> Actually this patch makes things worse on xen, because before
>>>> cleanup_highmap() wasn't called at all on xen (on purpose) and now it
>>>> is, fully destroying all the mappings we have at _end.
>>>>
>>>> Can we add a check on memblock reserved regions in cleanup_highmap()?
>>>
>>> In case you are wondering how Yinghai Lu's patch would look like with
>>> the added check, here it is:
>>>
>>>
>>> diff --git a/arch/x86/include/asm/memblock.h b/arch/x86/include/asm/memblock.h
>>> index 19ae14b..184f778 100644
>>> --- a/arch/x86/include/asm/memblock.h
>>> +++ b/arch/x86/include/asm/memblock.h
>>> @@ -3,6 +3,7 @@
>>>
>>> #define ARCH_DISCARD_MEMBLOCK
>>>
>>> +bool memblock_check_reserved_size(u64 *addrp, u64 *sizep, u64 align);
>>> u64 memblock_x86_find_in_range_size(u64 start, u64 *sizep, u64 align);
>>> void memblock_x86_to_bootmem(u64 start, u64 end);
>>>
>>> diff --git a/arch/x86/include/asm/pgtable_64.h b/arch/x86/include/asm/pgtable_64.h
>>> index 975f709..28686b6 100644
>>> --- a/arch/x86/include/asm/pgtable_64.h
>>> +++ b/arch/x86/include/asm/pgtable_64.h
>>> @@ -165,7 +165,7 @@ static inline int pgd_large(pgd_t pgd) { return 0; }
>>> #define __swp_entry_to_pte(x) ((pte_t) { .pte = (x).val })
>>>
>>> extern int kern_addr_valid(unsigned long addr);
>>> -extern void cleanup_highmap(void);
>>> +extern void cleanup_highmap(unsigned long end);
>>>
>>> #define HAVE_ARCH_UNMAPPED_AREA
>>> #define HAVE_ARCH_UNMAPPED_AREA_TOPDOWN
>>> diff --git a/arch/x86/kernel/head64.c b/arch/x86/kernel/head64.c
>>> index 2d2673c..5655c22 100644
>>> --- a/arch/x86/kernel/head64.c
>>> +++ b/arch/x86/kernel/head64.c
>>> @@ -77,9 +77,6 @@ void __init x86_64_start_kernel(char * real_mode_data)
>>> /* Make NULL pointers segfault */
>>> zap_identity_mappings();
>>>
>>> - /* Cleanup the over mapped high alias */
>>> - cleanup_highmap();
>>> -
>>> max_pfn_mapped = KERNEL_IMAGE_SIZE >> PAGE_SHIFT;
>>>
>>> for (i = 0; i < NUM_EXCEPTION_VECTORS; i++) {
>>> diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
>>> index d3cfe26..91afde6 100644
>>> --- a/arch/x86/kernel/setup.c
>>> +++ b/arch/x86/kernel/setup.c
>>> @@ -297,6 +297,9 @@ static void __init init_gbpages(void)
>>> static inline void init_gbpages(void)
>>> {
>>> }
>>> +static void __init cleanup_highmap(unsigned long end)
>>> +{
>>> +}
>>> #endif
>>>
>>> static void __init reserve_brk(void)
>>> @@ -922,6 +925,9 @@ void __init setup_arch(char **cmdline_p)
>>> */
>>> reserve_brk();
>>>
>>> + /* Cleanup the over mapped high alias after _brk_end*/
>>> + cleanup_highmap(_brk_end);
>>> +
>>> memblock.current_limit = get_max_mapped();
>>> memblock_x86_fill();
>>>
>>> diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
>>> index 947f42a..f13ff3a 100644
>>> --- a/arch/x86/mm/init.c
>>> +++ b/arch/x86/mm/init.c
>>> @@ -279,25 +279,6 @@ unsigned long __init_refok init_memory_mapping(unsigned long start,
>>> load_cr3(swapper_pg_dir);
>>> #endif
>>>
>>> -#ifdef CONFIG_X86_64
>>> - if (!after_bootmem && !start) {
>>> - pud_t *pud;
>>> - pmd_t *pmd;
>>> -
>>> - mmu_cr4_features = read_cr4();
>>> -
>>> - /*
>>> - * _brk_end cannot change anymore, but it and _end may be
>>> - * located on different 2M pages. cleanup_highmap(), however,
>>> - * can only consider _end when it runs, so destroy any
>>> - * mappings beyond _brk_end here.
>>> - */
>>> - pud = pud_offset(pgd_offset_k(_brk_end), _brk_end);
>>> - pmd = pmd_offset(pud, _brk_end - 1);
>>> - while (++pmd <= pmd_offset(pud, (unsigned long)_end - 1))
>>> - pmd_clear(pmd);
>>> - }
>>> -#endif
>>> __flush_tlb_all();
>>>
>>> if (!after_bootmem && e820_table_end > e820_table_start)
>>> diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
>>> index 71a5929..028c49e 100644
>>> --- a/arch/x86/mm/init_64.c
>>> +++ b/arch/x86/mm/init_64.c
>>> @@ -297,18 +297,26 @@ void __init init_extra_mapping_uc(unsigned long phys, unsigned long size)
>>> * rounded up to the 2MB boundary. This catches the invalid pmds as
>>> * well, as they are located before _text:
>>> */
>>> -void __init cleanup_highmap(void)
>>> +void __init cleanup_highmap(unsigned long end)
>>> {
>>> unsigned long vaddr = __START_KERNEL_map;
>>> - unsigned long end = roundup((unsigned long)_end, PMD_SIZE) - 1;
>>> pmd_t *pmd = level2_kernel_pgt;
>>> pmd_t *last_pmd = pmd + PTRS_PER_PMD;
>>> + u64 size, addrp;
>>> + bool changed;
>>> +
>>> + end = roundup(end, PMD_SIZE) - 1;
>>>
>>> for (; pmd < last_pmd; pmd++, vaddr += PMD_SIZE) {
>>> if (pmd_none(*pmd))
>>> continue;
>>> - if (vaddr < (unsigned long) _text || vaddr > end)
>>> - set_pmd(pmd, __pmd(0));
>>> + if (vaddr < (unsigned long) _text || vaddr > end) {
>>> + addrp = __pa(vaddr);
>>> + size = PMD_SIZE;
>>> + changed = memblock_check_reserved_size(&addrp, &size, PMD_SIZE);
>>> + if (!changed && size)
>>> + set_pmd(pmd, __pmd(0));
>>> + }
>>
>> for native path, memblock_check_reserved_size() are called 256 times
>> without obvious reasons.
>
>
> what about this patch, does it look like a reasonable solution?
>
>
>
> diff --git a/arch/x86/include/asm/memblock.h b/arch/x86/include/asm/memblock.h
> index 19ae14b..184f778 100644
> --- a/arch/x86/include/asm/memblock.h
> +++ b/arch/x86/include/asm/memblock.h
> @@ -3,6 +3,7 @@
>
> #define ARCH_DISCARD_MEMBLOCK
>
> +bool memblock_check_reserved_size(u64 *addrp, u64 *sizep, u64 align);
> u64 memblock_x86_find_in_range_size(u64 start, u64 *sizep, u64 align);
> void memblock_x86_to_bootmem(u64 start, u64 end);
>
> diff --git a/arch/x86/include/asm/pgtable_64.h b/arch/x86/include/asm/pgtable_64.h
> index 975f709..28686b6 100644
> --- a/arch/x86/include/asm/pgtable_64.h
> +++ b/arch/x86/include/asm/pgtable_64.h
> @@ -165,7 +165,7 @@ static inline int pgd_large(pgd_t pgd) { return 0; }
> #define __swp_entry_to_pte(x) ((pte_t) { .pte = (x).val })
>
> extern int kern_addr_valid(unsigned long addr);
> -extern void cleanup_highmap(void);
> +extern void cleanup_highmap(unsigned long end);
>
> #define HAVE_ARCH_UNMAPPED_AREA
> #define HAVE_ARCH_UNMAPPED_AREA_TOPDOWN
> diff --git a/arch/x86/kernel/head64.c b/arch/x86/kernel/head64.c
> index 2d2673c..5655c22 100644
> --- a/arch/x86/kernel/head64.c
> +++ b/arch/x86/kernel/head64.c
> @@ -77,9 +77,6 @@ void __init x86_64_start_kernel(char * real_mode_data)
> /* Make NULL pointers segfault */
> zap_identity_mappings();
>
> - /* Cleanup the over mapped high alias */
> - cleanup_highmap();
> -
> max_pfn_mapped = KERNEL_IMAGE_SIZE >> PAGE_SHIFT;
>
> for (i = 0; i < NUM_EXCEPTION_VECTORS; i++) {
> diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
> index d3cfe26..91afde6 100644
> --- a/arch/x86/kernel/setup.c
> +++ b/arch/x86/kernel/setup.c
> @@ -297,6 +297,9 @@ static void __init init_gbpages(void)
> static inline void init_gbpages(void)
> {
> }
> +static void __init cleanup_highmap(unsigned long end)
> +{
> +}
> #endif
>
> static void __init reserve_brk(void)
> @@ -922,6 +925,9 @@ void __init setup_arch(char **cmdline_p)
> */
> reserve_brk();
>
> + /* Cleanup the over mapped high alias after _brk_end*/
> + cleanup_highmap(_brk_end);
> +
> memblock.current_limit = get_max_mapped();
> memblock_x86_fill();
>
> diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
> index 947f42a..f13ff3a 100644
> --- a/arch/x86/mm/init.c
> +++ b/arch/x86/mm/init.c
> @@ -279,25 +279,6 @@ unsigned long __init_refok init_memory_mapping(unsigned long start,
> load_cr3(swapper_pg_dir);
> #endif
>
> -#ifdef CONFIG_X86_64
> - if (!after_bootmem && !start) {
> - pud_t *pud;
> - pmd_t *pmd;
> -
> - mmu_cr4_features = read_cr4();
> -
> - /*
> - * _brk_end cannot change anymore, but it and _end may be
> - * located on different 2M pages. cleanup_highmap(), however,
> - * can only consider _end when it runs, so destroy any
> - * mappings beyond _brk_end here.
> - */
> - pud = pud_offset(pgd_offset_k(_brk_end), _brk_end);
> - pmd = pmd_offset(pud, _brk_end - 1);
> - while (++pmd <= pmd_offset(pud, (unsigned long)_end - 1))
> - pmd_clear(pmd);
> - }
> -#endif
> __flush_tlb_all();
>
> if (!after_bootmem && e820_table_end > e820_table_start)
> diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
> index 71a5929..90a64de 100644
> --- a/arch/x86/mm/init_64.c
> +++ b/arch/x86/mm/init_64.c
> @@ -297,12 +297,25 @@ void __init init_extra_mapping_uc(unsigned long phys, unsigned long size)
> * rounded up to the 2MB boundary. This catches the invalid pmds as
> * well, as they are located before _text:
> */
> -void __init cleanup_highmap(void)
> +void __init cleanup_highmap(unsigned long end)
> {
> unsigned long vaddr = __START_KERNEL_map;
> - unsigned long end = roundup((unsigned long)_end, PMD_SIZE) - 1;
> pmd_t *pmd = level2_kernel_pgt;
> pmd_t *last_pmd = pmd + PTRS_PER_PMD;
> + u64 size, addrp;
> + bool changed;
> +
> + end = roundup(end, PMD_SIZE) - 1;
> +
> + /* check for reserved regions after end */
> + addrp = __pa(end);
> + size = (PTRS_PER_PMD * PMD_SIZE + vaddr) - end;
> + changed = memblock_check_reserved_size(&addrp, &size, PMD_SIZE);
> + if (changed || !size) {
> + /* reserved regions found, avoid removing mappings after end */
> + pud_t *pud = pud_offset(pgd_offset_k(end), end);
> + last_pmd = pmd_offset(pud, end);
> + }
>
> for (; pmd < last_pmd; pmd++, vaddr += PMD_SIZE) {
> if (pmd_none(*pmd))
Test case:
native path, where the bootloader has placed the initrd overlapping [0, 512M)...
We will not get the highmap cleared.
Maybe we have to keep the two-step method.
Yinghai
On 02/08/2011 06:55 AM, Konrad Rzeszutek Wilk wrote:
>>>> could be used to skip clear highmap for xen path?
>>> Seems pretty ad-hoc.
>>>
>> then what is size for mfn-list after _end...
> 8 bytes * nr_pages. For 4GB, 2048 pages. For 32GB, 8192 pages. For 128GB,
> 32768 pages, and so on.
>> could be copied or move to BRK.
> The _brk size is determined during build-time. We don't know what the
> memory size will be during bootup time and would have to select the
> highest values (128MB) which is quite a large amount to be reserved
> in _brk area.
If the brk is guaranteed to be the last thing in the kernel, we could
remove the static allocation of brk space, and just make it dynamic, and
then use the dynamic end-of-brk instead of _end.
That would require mapping the brk space at runtime, which would require
a (conservative) runtime estimate of how much space we would end up
needing, I guess by adding together the static allocations and then
adding any dynamic ones we need.
For Xen, specifically, we could just extend brk to include all the stuff
the domain builder sticks after the kernel, so it would both be brk
allocated and left in-situ.
But I'm not sure if this would work in the native case - would there be
a guarantee that there's enough space after the kernel to fit any brk usage?
J
On 02/07/2011 07:12 PM, Yinghai Lu wrote:
>
> why punishing native path with those checking?
>
What happens if you end up with a reserved range in an unfortunate place
on real hardware?
-hpa
On Tue, 8 Feb 2011, Jeremy Fitzhardinge wrote:
> On 02/08/2011 06:55 AM, Konrad Rzeszutek Wilk wrote:
> >>>> could be used to skip clear highmap for xen path?
> >>> Seems pretty ad-hoc.
> >>>
> >> then what is size for mfn-list after _end...
> > 8 bytes * nr_pages. For 4GB, 2048 pages. For 32GB, 8192 pages. For 128GB,
> > 32768 pages, and so on.
> >> could be copied or move to BRK.
> > The _brk size is determined during build-time. We don't know what the
> > memory size will be during bootup time and would have to select the
> > highest values (128MB) which is quite a large amount to be reserved
> > in _brk area.
>
> If the brk is guaranteed to be the last thing in the kernel, we could
> remove the static allocation of brk space, and just make it dynamic, and
> then use the dynamic end-of-brk instead of _end.
>
> That would require mapping the brk space at runtime, which would require
> a (conservative) runtime estimate of how much space we would end up
> needing, I guess by adding together the static allocations and then
> adding any dynamic ones we need.
>
> For Xen, specifically, we could just extend brk to include all the stuff
> the domain builder sticks after the kernel, so it would both be brk
> allocated and left in-situ.
A simpler alternative would be to set max_pfn_mapped = __pa(mfn_list) on
xen; after all, the mappings after _end are special mappings without a
corresponding pfn. It shouldn't have any undesired side effects, because
max_pfn_mapped is updated soon after cleanup_highmap() anyway in
arch/x86/kernel/setup.c:setup_arch.
Then we use vaddr + (max_pfn_mapped << PAGE_SHIFT) as the memory limit
in cleanup_highmap.
The following patch is a proof of concept but it boots successfully on
xen and on native:
diff --git a/arch/x86/kernel/head64.c b/arch/x86/kernel/head64.c
index 2d2673c..5655c22 100644
--- a/arch/x86/kernel/head64.c
+++ b/arch/x86/kernel/head64.c
@@ -77,9 +77,6 @@ void __init x86_64_start_kernel(char * real_mode_data)
/* Make NULL pointers segfault */
zap_identity_mappings();
- /* Cleanup the over mapped high alias */
- cleanup_highmap();
-
max_pfn_mapped = KERNEL_IMAGE_SIZE >> PAGE_SHIFT;
for (i = 0; i < NUM_EXCEPTION_VECTORS; i++) {
diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index d3cfe26..f03e6e0 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -297,6 +297,9 @@ static void __init init_gbpages(void)
static inline void init_gbpages(void)
{
}
+static void __init cleanup_highmap(void)
+{
+}
#endif
static void __init reserve_brk(void)
@@ -922,6 +925,9 @@ void __init setup_arch(char **cmdline_p)
*/
reserve_brk();
+ /* Cleanup the over mapped high alias after _brk_end*/
+ cleanup_highmap();
+
memblock.current_limit = get_max_mapped();
memblock_x86_fill();
diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
index 947f42a..f13ff3a 100644
--- a/arch/x86/mm/init.c
+++ b/arch/x86/mm/init.c
@@ -279,25 +279,6 @@ unsigned long __init_refok init_memory_mapping(unsigned long start,
load_cr3(swapper_pg_dir);
#endif
-#ifdef CONFIG_X86_64
- if (!after_bootmem && !start) {
- pud_t *pud;
- pmd_t *pmd;
-
- mmu_cr4_features = read_cr4();
-
- /*
- * _brk_end cannot change anymore, but it and _end may be
- * located on different 2M pages. cleanup_highmap(), however,
- * can only consider _end when it runs, so destroy any
- * mappings beyond _brk_end here.
- */
- pud = pud_offset(pgd_offset_k(_brk_end), _brk_end);
- pmd = pmd_offset(pud, _brk_end - 1);
- while (++pmd <= pmd_offset(pud, (unsigned long)_end - 1))
- pmd_clear(pmd);
- }
-#endif
__flush_tlb_all();
if (!after_bootmem && e820_table_end > e820_table_start)
diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index 71a5929..a8d08c2 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -51,6 +51,7 @@
#include <asm/numa.h>
#include <asm/cacheflush.h>
#include <asm/init.h>
+#include <asm/setup.h>
static int __init parse_direct_gbpages_off(char *arg)
{
@@ -293,18 +294,18 @@ void __init init_extra_mapping_uc(unsigned long phys, unsigned long size)
* to the compile time generated pmds. This results in invalid pmds up
* to the point where we hit the physaddr 0 mapping.
*
- * We limit the mappings to the region from _text to _end. _end is
- * rounded up to the 2MB boundary. This catches the invalid pmds as
+ * We limit the mappings to the region from _text to _brk_end. _brk_end
+ * is rounded up to the 2MB boundary. This catches the invalid pmds as
* well, as they are located before _text:
*/
void __init cleanup_highmap(void)
{
unsigned long vaddr = __START_KERNEL_map;
- unsigned long end = roundup((unsigned long)_end, PMD_SIZE) - 1;
+ unsigned long vaddr_end = __START_KERNEL_map + (max_pfn_mapped << PAGE_SHIFT);
+ unsigned long end = roundup((unsigned long)_brk_end, PMD_SIZE) - 1;
pmd_t *pmd = level2_kernel_pgt;
- pmd_t *last_pmd = pmd + PTRS_PER_PMD;
- for (; pmd < last_pmd; pmd++, vaddr += PMD_SIZE) {
+ for (; vaddr + PMD_SIZE - 1 < vaddr_end; pmd++, vaddr += PMD_SIZE) {
if (pmd_none(*pmd))
continue;
if (vaddr < (unsigned long) _text || vaddr > end)
diff --git a/arch/x86/xen/mmu.c b/arch/x86/xen/mmu.c
index 5e92b61..73a21db 100644
--- a/arch/x86/xen/mmu.c
+++ b/arch/x86/xen/mmu.c
@@ -1653,9 +1653,6 @@ static __init void xen_map_identity_early(pmd_t *pmd, unsigned long max_pfn)
for (pteidx = 0; pteidx < PTRS_PER_PTE; pteidx++, pfn++) {
pte_t pte;
- if (pfn > max_pfn_mapped)
- max_pfn_mapped = pfn;
-
if (!pte_none(pte_page[pteidx]))
continue;
@@ -1713,6 +1710,8 @@ __init pgd_t *xen_setup_kernel_pagetable(pgd_t *pgd,
pud_t *l3;
pmd_t *l2;
+ max_pfn_mapped = PFN_DOWN(__pa(xen_start_info->mfn_list));
+
/* Zap identity mapping */
init_level4_pgt[0] = __pgd(0);
On 02/08/2011 11:34 AM, H. Peter Anvin wrote:
> On 02/07/2011 07:12 PM, Yinghai Lu wrote:
>> why punishing native path with those checking?
>>
> What happens if you end up with a reserved range in an unfortunate place
> on real hardware?
Yes, exactly. The reserved region code isn't very useful if you can't
rely on it to reserve stuff.
J
On 02/10/2011 03:48 PM, Jeremy Fitzhardinge wrote:
> On 02/08/2011 11:34 AM, H. Peter Anvin wrote:
>> On 02/07/2011 07:12 PM, Yinghai Lu wrote:
>>> why punishing native path with those checking?
>>>
>> What happens if you end up with a reserved range in an unfortunate place
>> on real hardware?
>
> Yes, exactly. The reserved region code isn't very useful if you can't
> rely on it to reserve stuff.
assume context is under:
moving cleanup_highmap() down after brk is concluded, and check memblock_reserved there.
one case for that: native path, bootloader could put initrd under 512M. and it is with memblock reserved.
if we check those range with memblock_reserved, initial kernel mapping will not be cleaned up.
or worse if we are checking if there is any range from __pa(_brk_end) to 512M is with memblock reserved to decide
if we need to clean-up highmap. it will skip for whole range.
Yinghai
On 02/10/2011 03:57 PM, Yinghai Lu wrote:
> On 02/10/2011 03:48 PM, Jeremy Fitzhardinge wrote:
>> On 02/08/2011 11:34 AM, H. Peter Anvin wrote:
>>> On 02/07/2011 07:12 PM, Yinghai Lu wrote:
>>>> why punishing native path with those checking?
>>>>
>>> What happens if you end up with a reserved range in an unfortunate place
>>> on real hardware?
>>
>> Yes, exactly. The reserved region code isn't very useful if you can't
>> rely on it to reserve stuff.
>
> assume context is under:
> moving cleanup_highmap() down after brk is concluded, and check memblock_reserved there.
>
> one case for that: native path, bootloader could put initrd under 512M. and it is with memblock reserved.
> if we check those range with memblock_reserved, initial kernel mapping will not be cleaned up.
>
> or worse if we are checking if there is any range from __pa(_brk_end) to 512M is with memblock reserved to decide
> if we need to clean-up highmap. it will skip for whole range.
>
I'm afraid I simply can't parse the above.
-hpa
On 02/10/2011 04:35 PM, H. Peter Anvin wrote:
> On 02/10/2011 03:57 PM, Yinghai Lu wrote:
>> On 02/10/2011 03:48 PM, Jeremy Fitzhardinge wrote:
>>> On 02/08/2011 11:34 AM, H. Peter Anvin wrote:
>>>> On 02/07/2011 07:12 PM, Yinghai Lu wrote:
>>>>> why punishing native path with those checking?
>>>>>
>>>> What happens if you end up with a reserved range in an unfortunate place
>>>> on real hardware?
>>>
>>> Yes, exactly. The reserved region code isn't very useful if you can't
>>> rely on it to reserve stuff.
>>
>> assume context is under:
>> moving cleanup_highmap() down after brk is concluded, and check memblock_reserved there.
>>
>> one case for that: native path, bootloader could put initrd under 512M. and it is with memblock reserved.
>> if we check those range with memblock_reserved, initial kernel mapping will not be cleaned up.
>>
>> or worse if we are checking if there is any range from __pa(_brk_end) to 512M is with memblock reserved to decide
>> if we need to clean-up highmap. it will skip for whole range.
>>
>
> I'm afraid I simply can't parse the above.
1. We have a patch that moves cleanup_highmap() down, and it will clear the initial mapping from _brk_end to 512M (before, we had two steps: clear from _end to 512M, and then from _brk_end to _end).
2. So checking memblock_reserved over _brk_end to 512M will cause problems:
a. the check will be performed 256 times.
b. if the bootloader put the initrd ramdisk overlapping [_brk_end, 512M), the overlapping range will make cleanup_highmap bail out early, because that range is memblock_reserved.
BTW: Do we really need to clean up the initial mapping between _brk_end and _end?
Original patch from Jan:
commit 498343967613183611ac37dccb2846496d954c06
Author: Jan Beulich <[email protected]>
Date: Wed May 6 13:06:47 2009 +0100
x86-64: finish cleanup_highmaps()'s job wrt. _brk_end
With the introduction of the .brk section, special care must be taken
that no unused page table entries remain if _brk_end and _end are
separated by a 2M page boundary. cleanup_highmap() runs very early and
hence cannot take care of that, hence potential entries needing to be
removed past _brk_end must be cleared once the brk allocator has done
its job.
[ Impact: avoids undesirable TLB aliases ]
Signed-off-by: Jan Beulich <[email protected]>
Signed-off-by: H. Peter Anvin <[email protected]>
diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
index fd3da1d..ae4f7b5 100644
--- a/arch/x86/mm/init.c
+++ b/arch/x86/mm/init.c
@@ -7,6 +7,7 @@
#include <asm/page.h>
#include <asm/page_types.h>
#include <asm/sections.h>
+#include <asm/setup.h>
#include <asm/system.h>
#include <asm/tlbflush.h>
@@ -304,8 +305,23 @@ unsigned long __init_refok init_memory_mapping(unsigned long start,
#endif
#ifdef CONFIG_X86_64
- if (!after_bootmem)
+ if (!after_bootmem && !start) {
+ pud_t *pud;
+ pmd_t *pmd;
+
mmu_cr4_features = read_cr4();
+
+ /*
+ * _brk_end cannot change anymore, but it and _end may be
+ * located on different 2M pages. cleanup_highmap(), however,
+ * can only consider _end when it runs, so destroy any
+ * mappings beyond _brk_end here.
+ */
+ pud = pud_offset(pgd_offset_k(_brk_end), _brk_end);
+ pmd = pmd_offset(pud, _brk_end - 1);
+ while (++pmd <= pmd_offset(pud, (unsigned long)_end - 1))
+ pmd_clear(pmd);
+ }
#endif
__flush_tlb_all();
> BTW: Do we really need to cleanup initial mapping between _brk_end to _end?
Did you try to revert the patch and run it under your setup?
>
> origin patch from jan:
>
> commit 498343967613183611ac37dccb2846496d954c06
> Author: Jan Beulich <[email protected]>
> Date: Wed May 6 13:06:47 2009 +0100
>
> x86-64: finish cleanup_highmaps()'s job wrt. _brk_end
>
> With the introduction of the .brk section, special care must be taken
> that no unused page table entries remain if _brk_end and _end are
> separated by a 2M page boundary. cleanup_highmap() runs very early and
> hence cannot take care of that, hence potential entries needing to be
> removed past _brk_end must be cleared once the brk allocator has done
> its job.
>
> [ Impact: avoids undesirable TLB aliases ]
>
> Signed-off-by: Jan Beulich <[email protected]>
> Signed-off-by: H. Peter Anvin <[email protected]>
>
> diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
> index fd3da1d..ae4f7b5 100644
> --- a/arch/x86/mm/init.c
> +++ b/arch/x86/mm/init.c
> @@ -7,6 +7,7 @@
> #include <asm/page.h>
> #include <asm/page_types.h>
> #include <asm/sections.h>
> +#include <asm/setup.h>
> #include <asm/system.h>
> #include <asm/tlbflush.h>
>
> @@ -304,8 +305,23 @@ unsigned long __init_refok init_memory_mapping(unsigned long start,
> #endif
>
> #ifdef CONFIG_X86_64
> - if (!after_bootmem)
> + if (!after_bootmem && !start) {
> + pud_t *pud;
> + pmd_t *pmd;
> +
> mmu_cr4_features = read_cr4();
> +
> + /*
> + * _brk_end cannot change anymore, but it and _end may be
> + * located on different 2M pages. cleanup_highmap(), however,
> + * can only consider _end when it runs, so destroy any
> + * mappings beyond _brk_end here.
> + */
> + pud = pud_offset(pgd_offset_k(_brk_end), _brk_end);
> + pmd = pmd_offset(pud, _brk_end - 1);
> + while (++pmd <= pmd_offset(pud, (unsigned long)_end - 1))
> + pmd_clear(pmd);
> + }
> #endif
> __flush_tlb_all();
>
On Mon, Feb 14, 2011 at 8:26 AM, Konrad Rzeszutek Wilk
<[email protected]> wrote:
>> BTW: Do we really need to cleanup initial mapping between _brk_end to _end?
>
> Did you try to revert the patch and run it under your setup?
No.
But it should be OK.
Assume we can update vmlinux.lds.S to make _end 2M aligned, as
Stefano suggested?
Yinghai
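(For illustration only — an untested sketch of what such an alignment could look like in arch/x86/kernel/vmlinux.lds.S, assuming a literal 2 MB constant is acceptable there; the exact placement relative to .brk/.bss would need checking:

	. = ALIGN(0x200000);	/* 2 MB, so nothing else shares _end's pmd */
	_end = .;
)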
On Mon, 14 Feb 2011, Yinghai Lu wrote:
> On Mon, Feb 14, 2011 at 8:26 AM, Konrad Rzeszutek Wilk
> <[email protected]> wrote:
> >> BTW: Do we really need to cleanup initial mapping between _brk_end to _end?
> >
> > Did you try to revert the patch and run it under your setup?
>
> No.
>
> but it should be ok.
>
> Assume, we can update vmlinux.lds.S to make _end to be 2M aligned as
> Stefano suggested?
That would solve the problem as long as cleanup_highmap() is not moved
to setup_arch.
On Mon, Feb 14, 2011 at 9:58 AM, Stefano Stabellini
<[email protected]> wrote:
> On Mon, 14 Feb 2011, Yinghai Lu wrote:
>> On Mon, Feb 14, 2011 at 8:26 AM, Konrad Rzeszutek Wilk
>> <[email protected]> wrote:
>> >> BTW: Do we really need to cleanup initial mapping between _brk_end to _end?
>> >
>> > Did you try to revert the patch and run it under your setup?
>>
>> No.
>>
>> but it should be ok.
>>
>> Assume, we can update vmlinux.lds.S to make _end to be 2M aligned as
>> Stefano suggested?
>
> That would solve the problem as long as cleanup_highmap() is not moved
> to setup_arch.
yes.
On 02/14/2011 09:55 AM, Yinghai Lu wrote:
>
> Assume, we can update vmlinux.lds.S to make _end to be 2M aligned as
> Stefano suggested?
>
I am concerned about this. This feels like papering over a bug.
-hpa
--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel. I don't speak on their behalf.
On Mon, 14 Feb 2011, H. Peter Anvin wrote:
> On 02/14/2011 09:55 AM, Yinghai Lu wrote:
> >
> > Assume, we can update vmlinux.lds.S to make _end to be 2M aligned as
> > Stefano suggested?
> >
>
> I am concerned about this. This feels like papering over a bug.
>
I think the current memblock implementation assumes that the memory that
has been marked as reserved can still be safely removed from the initial
pagetable mappings.
This assumption is clear considering that we are marking the ramdisk as
reserved in x86_64_start_reservations but we are clearing the mappings
for it in cleanup_highmap.
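(For reference, the ramdisk reservation in question looks roughly like this in x86_64_start_reservations() — a sketch of the code of that period, details may differ:

#ifdef CONFIG_BLK_DEV_INITRD
	/* Reserve the initrd so later boot-time allocations don't step on it */
	if (boot_params.hdr.type_of_loader && boot_params.hdr.ramdisk_image) {
		/* Assume only the end is not page aligned */
		unsigned long ramdisk_image = boot_params.hdr.ramdisk_image;
		unsigned long ramdisk_size  = boot_params.hdr.ramdisk_size;
		unsigned long ramdisk_end   = PAGE_ALIGN(ramdisk_image + ramdisk_size);
		memblock_x86_reserve_range(ramdisk_image, ramdisk_end, "RAMDISK");
	}
#endif
)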
I am not sure if this assumption is correct in the general case, but it
is not correct in our particular scenario.
If the assumption is not correct, then fixing it will also fix our
problem.
If the assumption is correct then we'll have to find another way, and I
think that the appended patch is probably the best way to do it.
The basic ideas behind the patch are:
- we want to move cleanup_highmap() to setup_arch, like Yinghai Lu
suggested;
- we want to honor max_pfn_mapped in cleanup_highmap;
- on xen we set max_pfn_mapped to the last pfn that is safe to remove
from the initial mappings.
diff --git a/arch/x86/kernel/head64.c b/arch/x86/kernel/head64.c
index 2d2673c..5655c22 100644
--- a/arch/x86/kernel/head64.c
+++ b/arch/x86/kernel/head64.c
@@ -77,9 +77,6 @@ void __init x86_64_start_kernel(char * real_mode_data)
/* Make NULL pointers segfault */
zap_identity_mappings();
- /* Cleanup the over mapped high alias */
- cleanup_highmap();
-
max_pfn_mapped = KERNEL_IMAGE_SIZE >> PAGE_SHIFT;
for (i = 0; i < NUM_EXCEPTION_VECTORS; i++) {
diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index d3cfe26..f03e6e0 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -297,6 +297,9 @@ static void __init init_gbpages(void)
static inline void init_gbpages(void)
{
}
+static void __init cleanup_highmap(void)
+{
+}
#endif
static void __init reserve_brk(void)
@@ -922,6 +925,9 @@ void __init setup_arch(char **cmdline_p)
*/
reserve_brk();
+ /* Cleanup the over mapped high alias after _brk_end*/
+ cleanup_highmap();
+
memblock.current_limit = get_max_mapped();
memblock_x86_fill();
diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
index 947f42a..f13ff3a 100644
--- a/arch/x86/mm/init.c
+++ b/arch/x86/mm/init.c
@@ -279,25 +279,6 @@ unsigned long __init_refok init_memory_mapping(unsigned long start,
load_cr3(swapper_pg_dir);
#endif
-#ifdef CONFIG_X86_64
- if (!after_bootmem && !start) {
- pud_t *pud;
- pmd_t *pmd;
-
- mmu_cr4_features = read_cr4();
-
- /*
- * _brk_end cannot change anymore, but it and _end may be
- * located on different 2M pages. cleanup_highmap(), however,
- * can only consider _end when it runs, so destroy any
- * mappings beyond _brk_end here.
- */
- pud = pud_offset(pgd_offset_k(_brk_end), _brk_end);
- pmd = pmd_offset(pud, _brk_end - 1);
- while (++pmd <= pmd_offset(pud, (unsigned long)_end - 1))
- pmd_clear(pmd);
- }
-#endif
__flush_tlb_all();
if (!after_bootmem && e820_table_end > e820_table_start)
diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index 71a5929..a8d08c2 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -51,6 +51,7 @@
#include <asm/numa.h>
#include <asm/cacheflush.h>
#include <asm/init.h>
+#include <asm/setup.h>
static int __init parse_direct_gbpages_off(char *arg)
{
@@ -293,18 +294,18 @@ void __init init_extra_mapping_uc(unsigned long phys, unsigned long size)
* to the compile time generated pmds. This results in invalid pmds up
* to the point where we hit the physaddr 0 mapping.
*
- * We limit the mappings to the region from _text to _end. _end is
- * rounded up to the 2MB boundary. This catches the invalid pmds as
+ * We limit the mappings to the region from _text to _brk_end. _brk_end
+ * is rounded up to the 2MB boundary. This catches the invalid pmds as
* well, as they are located before _text:
*/
void __init cleanup_highmap(void)
{
unsigned long vaddr = __START_KERNEL_map;
- unsigned long end = roundup((unsigned long)_end, PMD_SIZE) - 1;
+ unsigned long vaddr_end = __START_KERNEL_map + (max_pfn_mapped << PAGE_SHIFT);
+ unsigned long end = roundup((unsigned long)_brk_end, PMD_SIZE) - 1;
pmd_t *pmd = level2_kernel_pgt;
- pmd_t *last_pmd = pmd + PTRS_PER_PMD;
- for (; pmd < last_pmd; pmd++, vaddr += PMD_SIZE) {
+ for (; vaddr + PMD_SIZE - 1 < vaddr_end; pmd++, vaddr += PMD_SIZE) {
if (pmd_none(*pmd))
continue;
if (vaddr < (unsigned long) _text || vaddr > end)
diff --git a/arch/x86/xen/mmu.c b/arch/x86/xen/mmu.c
index 5e92b61..73a21db 100644
--- a/arch/x86/xen/mmu.c
+++ b/arch/x86/xen/mmu.c
@@ -1653,9 +1653,6 @@ static __init void xen_map_identity_early(pmd_t *pmd, unsigned long max_pfn)
for (pteidx = 0; pteidx < PTRS_PER_PTE; pteidx++, pfn++) {
pte_t pte;
- if (pfn > max_pfn_mapped)
- max_pfn_mapped = pfn;
-
if (!pte_none(pte_page[pteidx]))
continue;
@@ -1713,6 +1710,8 @@ __init pgd_t *xen_setup_kernel_pagetable(pgd_t *pgd,
pud_t *l3;
pmd_t *l2;
+ max_pfn_mapped = PFN_DOWN(__pa(xen_start_info->mfn_list));
+
/* Zap identity mapping */
init_level4_pgt[0] = __pgd(0);