On Sun, 30 Sep 2012, Yinghai Lu wrote:
> After
>
> | commit 8548c84da2f47e71bbbe300f55edb768492575f7
> | Author: Takashi Iwai <[email protected]>
> | Date: Sun Oct 23 23:19:12 2011 +0200
> |
> | x86: Fix S4 regression
> |
> | Commit 4b239f458 ("x86-64, mm: Put early page table high") causes a S4
> | regression since 2.6.39, namely the machine reboots occasionally at S4
> | resume. It doesn't happen always, overall rate is about 1/20. But,
> | like other bugs, once when this happens, it continues to happen.
> |
> | This patch fixes the problem by essentially reverting the memory
> | assignment in the older way.
>
> Have some page table around 512M again, that will prevent kdump to find 512M
> under 768M.
>
> We need revert that reverting, so we could put page table high again for 64bit.
>
> Takashi agreed that S4 regression could be something else.
>
> https://lkml.org/lkml/2012/6/15/182
>
> Signed-off-by: Yinghai Lu <[email protected]>
> ---
> arch/x86/mm/init.c | 2 +-
> 1 files changed, 1 insertions(+), 1 deletions(-)
>
> diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
> index 9f69180..aadb154 100644
> --- a/arch/x86/mm/init.c
> +++ b/arch/x86/mm/init.c
> @@ -76,8 +76,8 @@ static void __init find_early_table_space(struct map_range *mr,
> #ifdef CONFIG_X86_32
> /* for fixmap */
> tables += roundup(__end_of_fixed_addresses * sizeof(pte_t), PAGE_SIZE);
> -#endif
> good_end = max_pfn_mapped << PAGE_SHIFT;
> +#endif
>
> base = memblock_find_in_range(start, good_end, tables, PAGE_SIZE);
> if (!base)
Isn't this going to cause init_memory_mapping to allocate pagetable
pages from memory not yet mapped?
Last time I spoke with HPA and Thomas about this, they seemed to agree
that it isn't a very good idea.
Also, it has proven to cause a certain amount of headaches on Xen,
see commit d8aa5ec3382e6a545b8f25178d1e0992d4927f19.
On Mon, Oct 01, 2012 at 12:00:26PM +0100, Stefano Stabellini wrote:
> On Sun, 30 Sep 2012, Yinghai Lu wrote:
> > After
> >
> > | commit 8548c84da2f47e71bbbe300f55edb768492575f7
> > | Author: Takashi Iwai <[email protected]>
> > | Date: Sun Oct 23 23:19:12 2011 +0200
> > |
> > | x86: Fix S4 regression
> > |
> > | Commit 4b239f458 ("x86-64, mm: Put early page table high") causes a S4
> > | regression since 2.6.39, namely the machine reboots occasionally at S4
> > | resume. It doesn't happen always, overall rate is about 1/20. But,
> > | like other bugs, once when this happens, it continues to happen.
> > |
> > | This patch fixes the problem by essentially reverting the memory
> > | assignment in the older way.
> >
> > Have some page table around 512M again, that will prevent kdump to find 512M
> > under 768M.
> >
> > We need revert that reverting, so we could put page table high again for 64bit.
> >
> > Takashi agreed that S4 regression could be something else.
> >
> > https://lkml.org/lkml/2012/6/15/182
> >
> > Signed-off-by: Yinghai Lu <[email protected]>
> > ---
> > arch/x86/mm/init.c | 2 +-
> > 1 files changed, 1 insertions(+), 1 deletions(-)
> >
> > diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
> > index 9f69180..aadb154 100644
> > --- a/arch/x86/mm/init.c
> > +++ b/arch/x86/mm/init.c
> > @@ -76,8 +76,8 @@ static void __init find_early_table_space(struct map_range *mr,
> > #ifdef CONFIG_X86_32
> > /* for fixmap */
> > tables += roundup(__end_of_fixed_addresses * sizeof(pte_t), PAGE_SIZE);
> > -#endif
> > good_end = max_pfn_mapped << PAGE_SHIFT;
> > +#endif
> >
> > base = memblock_find_in_range(start, good_end, tables, PAGE_SIZE);
> > if (!base)
>
> Isn't this going to cause init_memory_mapping to allocate pagetable
> pages from memory not yet mapped?
> Last time I spoke with HPA and Thomas about this, they seem to agree
> that it isn't a very good idea.
> Also, it is proven to cause a certain amount of headaches on Xen,
> see commit d8aa5ec3382e6a545b8f25178d1e0992d4927f19.
>
Any comments, thoughts? hpa? Yinghai?
So it seems that during init_memory_mapping Xen needs to modify page table
bits and the memory where the page tables live needs to be direct mapped at
that time.
Since we now call init_memory_mapping for every E820_RAM range sequentially,
the only way to satisfy Xen is to call find_early_page_table_space (good_end
needs to be within memory already mapped at the time) for every
init_memory_mapping call.
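Something along these lines, roughly (a hypothetical sketch, not a real
patch; the find_early_table_space arguments are approximate since its
signature differs between versions):

	void __init setup_direct_mappings(void)
	{
		int i;

		for (i = 0; i < e820.nr_map; i++) {
			struct e820entry *ei = &e820.map[i];

			if (ei->type != E820_RAM)
				continue;

			/* good_end constrained to already-mapped memory */
			find_early_table_space(ei->addr, ei->addr + ei->size,
					       max_pfn_mapped << PAGE_SHIFT);
			init_memory_mapping(ei->addr, ei->addr + ei->size);
		}
	}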
What do you think Yinghai?
On 10/03/2012 09:51 AM, Jacob Shin wrote:
>
> Any comments, thoughts? hpa? Yinghai?
>
> So it seems that during init_memory_mapping Xen needs to modify page table
> bits and the memory where the page tables live needs to be direct mapped at
> that time.
>
> Since we now call init_memory_mapping for every E820_RAM range sequencially,
> the only way to satisfy Xen is to find_early_page_table_space (good_end needs
> to be within memory already mapped at the time) for every init_memory_mapping
> call.
>
> What do you think Yinghai?
>
I outlined the sane way to do this at Kernel Summit for Yinghai and
several other people. I need to write it up for people who weren't
there, but I don't have time right at the moment.
-hpa
--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel. I don't speak on their behalf.
On Wed, Oct 03, 2012 at 11:51:06AM -0500, Jacob Shin wrote:
> On Mon, Oct 01, 2012 at 12:00:26PM +0100, Stefano Stabellini wrote:
> > On Sun, 30 Sep 2012, Yinghai Lu wrote:
> > > After
> > >
> > > | commit 8548c84da2f47e71bbbe300f55edb768492575f7
> > > | Author: Takashi Iwai <[email protected]>
> > > | Date: Sun Oct 23 23:19:12 2011 +0200
> > > |
> > > | x86: Fix S4 regression
> > > |
> > > | Commit 4b239f458 ("x86-64, mm: Put early page table high") causes a S4
> > > | regression since 2.6.39, namely the machine reboots occasionally at S4
> > > | resume. It doesn't happen always, overall rate is about 1/20. But,
> > > | like other bugs, once when this happens, it continues to happen.
> > > |
> > > | This patch fixes the problem by essentially reverting the memory
> > > | assignment in the older way.
> > >
> > > Have some page table around 512M again, that will prevent kdump to find 512M
> > > under 768M.
> > >
> > > We need revert that reverting, so we could put page table high again for 64bit.
> > >
> > > Takashi agreed that S4 regression could be something else.
> > >
> > > https://lkml.org/lkml/2012/6/15/182
> > >
> > > Signed-off-by: Yinghai Lu <[email protected]>
> > > ---
> > > arch/x86/mm/init.c | 2 +-
> > > 1 files changed, 1 insertions(+), 1 deletions(-)
> > >
> > > diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
> > > index 9f69180..aadb154 100644
> > > --- a/arch/x86/mm/init.c
> > > +++ b/arch/x86/mm/init.c
> > > @@ -76,8 +76,8 @@ static void __init find_early_table_space(struct map_range *mr,
> > > #ifdef CONFIG_X86_32
> > > /* for fixmap */
> > > tables += roundup(__end_of_fixed_addresses * sizeof(pte_t), PAGE_SIZE);
> > > -#endif
> > > good_end = max_pfn_mapped << PAGE_SHIFT;
> > > +#endif
> > >
> > > base = memblock_find_in_range(start, good_end, tables, PAGE_SIZE);
> > > if (!base)
> >
> > Isn't this going to cause init_memory_mapping to allocate pagetable
> > pages from memory not yet mapped?
> > Last time I spoke with HPA and Thomas about this, they seem to agree
> > that it isn't a very good idea.
> > Also, it is proven to cause a certain amount of headaches on Xen,
> > see commit d8aa5ec3382e6a545b8f25178d1e0992d4927f19.
> >
>
> Any comments, thoughts? hpa? Yinghai?
>
> So it seems that during init_memory_mapping Xen needs to modify page table
> bits and the memory where the page tables live needs to be direct mapped at
> that time.
That is not exactly true. I am not sure if we are just using the wrong
words for it - so let me try to write up what the impediment is.
There is also this talk between Stefano and tglx that can help in
getting one's head around it: https://lkml.org/lkml/2012/8/24/335
The restriction that Xen places on Linux page-tables is that they MUST
be read-only while in use. Meaning if you are creating a PTE table (or PMD,
PUD, etc), you can write to it as long as you want - but the moment you
hook it up to a live page-table, it must be marked RO (so the PMD entry
cannot have _PAGE_RW on it). Easy enough.
This means that if we are re-using a pagetable during
init_memory_mapping (so we ioremap it), we need to ioremap it without
_PAGE_RW - and that is where xen_set_pte_init has a check for
is_early_ioremap_ptep. To add to the fun, the pagetables are expanding -
so as one is ioremapping/iounmapping, you have to check pgt_buf_end
to see whether the page table we are mapping is within:
pgt_buf_start -> pgt_buf_end <- pgt_buf_top
(and pgt_buf_end can increment up to pgt_buf_top).
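Roughly, the check that has to happen looks something like this (just a
sketch to illustrate the point, not the actual Xen code; the helper name
is made up):

	static pte_t mask_rw_if_pagetable(pte_t pte)
	{
		unsigned long pfn = pte_pfn(pte);

		/*
		 * Pages in the early pagetable allocator window that are
		 * already hooked into the live pagetable must stay RO.
		 */
		if (pfn >= pgt_buf_start && pfn < pgt_buf_end)
			pte = pte_clear_flags(pte, _PAGE_RW);

		return pte;
	}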
Now the next part of this that is hard to wrap your head around is when
you want to create PTE entries for pgt_buf_start -> pgt_buf_end.
It's double fun, b/c your pgt_buf_end can increment as you are
trying to create those PTE entries - and you _MUST_ mark those
PTE entries as RO. This is b/c those pagetables (pgt_buf_start ->
pgt_buf_end) are live and only Xen can touch them.
This feels like operating on a live patient, while said patient
is running a marathon. Only duct-tape experts need apply for
this position.
What Peter had in mind is a nice system where we get rid of
this linear allocation of page-tables (so that pgt_buf_start ->
pgt_buf_end are linearly allocated). His thinking (and Peter, if I mess
up please correct me) is that we can stick the various pagetables
in different spots in memory. Mainly that as we look at mapping
a region (say 0GB->1GB), we look at it in chunks (2MB?) and allocate
a page-table at the _end_ of the newly mapped chunk if we have
filled all entries in said pagetable.
For simplicity, let's say we are just dealing with PTE tables and
we are mapping the region 0GB->1GB with 4KB pages.
First we stick a page-table (or reuse one if we find one)
at the start of the region (so 0-2MB).
0MB.......................2MB
/-----\
|PTE_A|
\-----/
The PTE entries in it will cover 0->2MB (PTE table #A) and once it is
filled, we stick a new pagetable at the end of the 2MB region:
0MB.......................2MB...........................4MB
/-----\ /-----\
|PTE_A| |PTE_B|
\-----/ \-----/
The PTE_B page table will be used to map 2MB->4MB.
Once that is finished .. we repeat the cycle.
That should remove the utter duct-tape madness and make this a lot
easier.
On Mon, Oct 1, 2012 at 4:00 AM, Stefano Stabellini
<[email protected]> wrote:
> On Sun, 30 Sep 2012, Yinghai Lu wrote:
>> After
>>
>> | commit 8548c84da2f47e71bbbe300f55edb768492575f7
>> | Author: Takashi Iwai <[email protected]>
>> | Date: Sun Oct 23 23:19:12 2011 +0200
>> |
>> | x86: Fix S4 regression
>> |
>> | Commit 4b239f458 ("x86-64, mm: Put early page table high") causes a S4
>> | regression since 2.6.39, namely the machine reboots occasionally at S4
>> | resume. It doesn't happen always, overall rate is about 1/20. But,
>> | like other bugs, once when this happens, it continues to happen.
>> |
>> | This patch fixes the problem by essentially reverting the memory
>> | assignment in the older way.
>>
>> Have some page table around 512M again, that will prevent kdump to find 512M
>> under 768M.
>>
>> We need revert that reverting, so we could put page table high again for 64bit.
>>
>> Takashi agreed that S4 regression could be something else.
>>
>> https://lkml.org/lkml/2012/6/15/182
>>
>> Signed-off-by: Yinghai Lu <[email protected]>
>> ---
>> arch/x86/mm/init.c | 2 +-
>> 1 files changed, 1 insertions(+), 1 deletions(-)
>>
>> diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
>> index 9f69180..aadb154 100644
>> --- a/arch/x86/mm/init.c
>> +++ b/arch/x86/mm/init.c
>> @@ -76,8 +76,8 @@ static void __init find_early_table_space(struct map_range *mr,
>> #ifdef CONFIG_X86_32
>> /* for fixmap */
>> tables += roundup(__end_of_fixed_addresses * sizeof(pte_t), PAGE_SIZE);
>> -#endif
>> good_end = max_pfn_mapped << PAGE_SHIFT;
>> +#endif
>>
>> base = memblock_find_in_range(start, good_end, tables, PAGE_SIZE);
>> if (!base)
>
> Isn't this going to cause init_memory_mapping to allocate pagetable
> pages from memory not yet mapped?
But 64-bit is using ioremap to access that page table buffer.
> Last time I spoke with HPA and Thomas about this, they seem to agree
> that it isn't a very good idea.
> Also, it is proven to cause a certain amount of headaches on Xen,
> see commit d8aa5ec3382e6a545b8f25178d1e0992d4927f19.
This patchset will allocate the page table buffer only once.
So we could use RAM under 1M to map that page table buffer at first.
Would that make Xen happy?
Thanks
Yinghai
On Wed, Oct 3, 2012 at 9:51 AM, Jacob Shin <[email protected]> wrote:
> Any comments, thoughts? hpa? Yinghai?
>
> So it seems that during init_memory_mapping Xen needs to modify page table
> bits and the memory where the page tables live needs to be direct mapped at
> that time.
>
> Since we now call init_memory_mapping for every E820_RAM range sequencially,
> the only way to satisfy Xen is to find_early_page_table_space (good_end needs
> to be within memory already mapped at the time) for every init_memory_mapping
> call.
>
> What do you think Yinghai?
That may put the page table near the end of every RAM range, for the next
memory range.
Then kdump may have a problem getting a big range again.
Yinghai
On Thu, Oct 04, 2012 at 08:57:55AM -0700, Yinghai Lu wrote:
> On Mon, Oct 1, 2012 at 4:00 AM, Stefano Stabellini
> <[email protected]> wrote:
> > On Sun, 30 Sep 2012, Yinghai Lu wrote:
> >> After
> >>
> >> | commit 8548c84da2f47e71bbbe300f55edb768492575f7
> >> | Author: Takashi Iwai <[email protected]>
> >> | Date: Sun Oct 23 23:19:12 2011 +0200
> >> |
> >> | x86: Fix S4 regression
> >> |
> >> | Commit 4b239f458 ("x86-64, mm: Put early page table high") causes a S4
> >> | regression since 2.6.39, namely the machine reboots occasionally at S4
> >> | resume. It doesn't happen always, overall rate is about 1/20. But,
> >> | like other bugs, once when this happens, it continues to happen.
> >> |
> >> | This patch fixes the problem by essentially reverting the memory
> >> | assignment in the older way.
> >>
> >> Have some page table around 512M again, that will prevent kdump to find 512M
> >> under 768M.
> >>
> >> We need revert that reverting, so we could put page table high again for 64bit.
> >>
> >> Takashi agreed that S4 regression could be something else.
> >>
> >> https://lkml.org/lkml/2012/6/15/182
> >>
> >> Signed-off-by: Yinghai Lu <[email protected]>
> >> ---
> >> arch/x86/mm/init.c | 2 +-
> >> 1 files changed, 1 insertions(+), 1 deletions(-)
> >>
> >> diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
> >> index 9f69180..aadb154 100644
> >> --- a/arch/x86/mm/init.c
> >> +++ b/arch/x86/mm/init.c
> >> @@ -76,8 +76,8 @@ static void __init find_early_table_space(struct map_range *mr,
> >> #ifdef CONFIG_X86_32
> >> /* for fixmap */
> >> tables += roundup(__end_of_fixed_addresses * sizeof(pte_t), PAGE_SIZE);
> >> -#endif
> >> good_end = max_pfn_mapped << PAGE_SHIFT;
> >> +#endif
> >>
> >> base = memblock_find_in_range(start, good_end, tables, PAGE_SIZE);
> >> if (!base)
> >
> > Isn't this going to cause init_memory_mapping to allocate pagetable
> > pages from memory not yet mapped?
>
> but 64bit is using ioremap to access those page table buf.
>
> > Last time I spoke with HPA and Thomas about this, they seem to agree
> > that it isn't a very good idea.
> > Also, it is proven to cause a certain amount of headaches on Xen,
> > see commit d8aa5ec3382e6a545b8f25178d1e0992d4927f19.
>
> this patchset will allocate page table buf one time only.
As in, if your machine has 8GB, it will allocate pagetables that
span 0->8GB at once?
> So could use ram under 1M to map that page table at first.
Could or does this patch do it? And why 1MB?
>
> so that will make it xen happy ?
The issues that Xen faces are purely due to the fact that the pagetables
must be RO when they are in use. I believe (and without actually
checking it just to make sure) that it does not matter where
the page-tables are located. But with the current generic code
the location is quite linear: it starts with pgt_buf_start and
goes up to pgt_buf_top. So how would this patch move the location
of the page-table to be under 1MB?
Perhaps we are talking about separate topics?
My recollection of memblock_find_in_range is that it will try
the end of the range first to find a suitable "chunk" that satisfies
the 'size' and 'alignment' parameters?
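(For reference, the call in question is:

	base = memblock_find_in_range(start, good_end, tables, PAGE_SIZE);

and as far as I recall memblock searches top-down, so it returns the
highest suitable chunk below good_end that satisfies the size and
alignment arguments.)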
On Thu, Oct 04, 2012 at 09:19:08AM -0700, Yinghai Lu wrote:
> On Wed, Oct 3, 2012 at 9:51 AM, Jacob Shin <[email protected]> wrote:
> > Any comments, thoughts? hpa? Yinghai?
> >
> > So it seems that during init_memory_mapping Xen needs to modify page table
> > bits and the memory where the page tables live needs to be direct mapped at
> > that time.
> >
> > Since we now call init_memory_mapping for every E820_RAM range sequencially,
> > the only way to satisfy Xen is to find_early_page_table_space (good_end needs
> > to be within memory already mapped at the time) for every init_memory_mapping
> > call.
> >
> > What do you think Yinghai?
>
> that may put the page table on near end of every ram range for next
> memory range.
>
> then kdump may have problem get big range again.
Is there a git commit that explains what the 'big range' problem is?
On Thu, Oct 4, 2012 at 9:45 AM, Konrad Rzeszutek Wilk
<[email protected]> wrote:
>> So could use ram under 1M to map that page table at first.
>
> Could or does this patch do it? And why 1MB?
Can you or Stefano test the attached patch on Xen?
It will map the page table buffer that will be used, using RAM
under 1M (still 4k pages there), so there will be no page table around 512M.
>>
>> so that will make it xen happy ?
>
> The issues that Xen faces are purely due to the fact that they
> must be RO when they are in use. I believe (and without actually
> checking it just to make sure) that it does not matter where
> the page-tables are located. But with the current generic code
> the location is quite linear: it starts with pgt_buf_start and
> goes up to pgt_buf_top. So how would this patch move the location
> of the page-table to be under 1MB?
Just the page tables that map the page table buffer itself.
Thanks
Yinghai
On Thu, Oct 4, 2012 at 9:46 AM, Konrad Rzeszutek Wilk
<[email protected]> wrote:
>> then kdump may have problem get big range again.
>
> Is there a git commit that explains what the 'big range' problem is?
commit 7f8595bfacef279f06c82ec98d420ef54f2537e0
Author: H. Peter Anvin <[email protected]>
Date: Thu Dec 16 19:20:41 2010 -0800
x86, kexec: Limit the crashkernel address appropriately
Keep the crash kernel address below 512 MiB for 32 bits and 896 MiB
for 64 bits. For 32 bits, this retains compatibility with earlier
kernel releases, and makes it work even if the vmalloc= setting is
adjusted.
For 64 bits, we should be able to increase this substantially once a
hard-coded limit in kexec-tools is fixed.
Signed-off-by: H. Peter Anvin <[email protected]>
Cc: Vivek Goyal <[email protected]>
Cc: Stanislaw Gruszka <[email protected]>
Cc: Yinghai Lu <[email protected]>
LKML-Reference: <[email protected]>
diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index 21c6746..c9089a1 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -501,7 +501,18 @@ static inline unsigned long long get_total_mem(void)
return total << PAGE_SHIFT;
}
-#define DEFAULT_BZIMAGE_ADDR_MAX 0x37FFFFFF
+/*
+ * Keep the crash kernel below this limit. On 32 bits earlier kernels
+ * would limit the kernel to the low 512 MiB due to mapping restrictions.
+ * On 64 bits, kexec-tools currently limits us to 896 MiB; increase this
+ * limit once kexec-tools are fixed.
+ */
+#ifdef CONFIG_X86_32
+# define CRASH_KERNEL_ADDR_MAX (512 << 20)
+#else
+# define CRASH_KERNEL_ADDR_MAX (896 << 20)
+#endif
+
static void __init reserve_crashkernel(void)
{
unsigned long long total_mem;
@@ -520,10 +531,10 @@ static void __init reserve_crashkernel(void)
const unsigned long long alignment = 16<<20; /* 16M */
/*
- * kexec want bzImage is below DEFAULT_BZIMAGE_ADDR_MAX
+ * kexec want bzImage is below CRASH_KERNEL_ADDR_MAX
*/
crash_base = memblock_find_in_range(alignment,
- DEFAULT_BZIMAGE_ADDR_MAX, crash_size, alignment);
+ CRASH_KERNEL_ADDR_MAX, crash_size, alignment);
if (crash_base == MEMBLOCK_ERROR) {
pr_info("crashkernel reservation failed - No
suitable area found.\n");
On Thu, Oct 4, 2012 at 2:21 PM, Yinghai Lu <[email protected]> wrote:
> On Thu, Oct 4, 2012 at 9:45 AM, Konrad Rzeszutek Wilk
> <[email protected]> wrote:
>>> So could use ram under 1M to map that page table at first.
>>
>> Could or does this patch do it? And why 1MB?
>
> can you or stefano could test attached patch on xen ?
>
on top of
http://git.kernel.org/?p=linux/kernel/git/tip/tip.git;a=shortlog;h=refs/heads/x86/mm2
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git x86/mm2
On 10/04/2012 02:40 PM, Yinghai Lu wrote:
> On Thu, Oct 4, 2012 at 2:21 PM, Yinghai Lu <[email protected]> wrote:
>> On Thu, Oct 4, 2012 at 9:45 AM, Konrad Rzeszutek Wilk
>> <[email protected]> wrote:
>>>> So could use ram under 1M to map that page table at first.
>>>
>>> Could or does this patch do it? And why 1MB?
>>
>> can you or stefano could test attached patch on xen ?
>>
> on top of
>
> http://git.kernel.org/?p=linux/kernel/git/tip/tip.git;a=shortlog;h=refs/heads/x86/mm2
>
> git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git x86/mm2
>
No, no, not yet another ad hoc hack... please.
-hpa
On Thu, Oct 4, 2012 at 2:41 PM, H. Peter Anvin <[email protected]> wrote:
> On 10/04/2012 02:40 PM, Yinghai Lu wrote:
>> On Thu, Oct 4, 2012 at 2:21 PM, Yinghai Lu <[email protected]> wrote:
>>> On Thu, Oct 4, 2012 at 9:45 AM, Konrad Rzeszutek Wilk
>>> <[email protected]> wrote:
>>>>> So could use ram under 1M to map that page table at first.
>>>>
>>>> Could or does this patch do it? And why 1MB?
>>>
>>> can you or stefano could test attached patch on xen ?
>>>
>> on top of
>>
>> http://git.kernel.org/?p=linux/kernel/git/tip/tip.git;a=shortlog;h=refs/heads/x86/mm2
>>
>> git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git x86/mm2
>>
>
> No, no, not yet another ad hoc hack... please.
>
Or let Xen map that page table by itself at first?
On 10/04/2012 06:56 AM, Konrad Rzeszutek Wilk wrote:
>
> What Peter had in mind is a nice system where we get rid of
> this linear allocation of page-tables (so pgt_buf_start -> pgt_buf
> _end are linearly allocated). His thinking (and Peter if I mess
> up please correct me), is that we can stick the various pagetables
> in different spots in memory. Mainly that as we look at mapping
> a region (say 0GB->1GB), we look at in chunks (2MB?) and allocate
> a page-table at the _end_ of the newly mapped chunk if we have
> filled all entries in said pagetable.
>
> For simplicity, lets say we are just dealing with PTE tables and
> we are mapping the region 0GB->1GB with 4KB pages.
>
> First we stick a page-table (or if there is a found one reuse it)
> at the start of the region (so 0-2MB).
>
> 0MB.......................2MB
> /-----\
> |PTE_A|
> \-----/
>
> The PTE entries in it will cover 0->2MB (PTE table #A) and once it is
> finished, it will stick a new pagetable at the end of the 2MB region:
>
> 0MB.......................2MB...........................4MB
> /-----\ /-----\
> |PTE_A| |PTE_B|
> \-----/ \-----/
>
>
> The PTE_B page table will be used to map 2MB->4MB.
>
> Once that is finished .. we repeat the cycle.
>
> That should remove the utter duct-tape madness and make this a lot
> easier.
>
You got the basic idea right but the details slightly wrong. Let me try
to explain.
When we start up, we know we have a set of page tables which maps the
kernel text, data, bss and brk. This is set up by the startup code on
native and by the domain builder on Xen.
We can reserve an arbitrary chunk of brk that is (a) big enough to map
the kernel text+data+bss+brk itself plus (b) some arbitrary additional
chunk of memory (perhaps we reserve another 256K of brk or so, enough to
map 128 MB in the worst case of 4K PAE pages.)
Step 1:
- Create page table mappings for kernel text+data+bss+brk out of the
brk region.
Step 2:
- Start creating mappings for the topmost memory region downward, until
the brk reserved area is exhausted.
Step 3:
- Call a paravirt hook on the page tables created so far. On native
this does nothing, on Xen it can map it readonly and tell the
hypervisor it is a page table.
Step 4:
- Switch to the newly created page table. The bootup page table is now
obsolete.
Step 5:
- Moving downward from the last address mapped, create new page tables
for any additional unmapped memory region until either we run out of
unmapped memory regions, or we run out of mapped memory for
the memory regions to map.
Step 6:
- Call the paravirt hook for the new page tables, then add them to the
page table tree.
Step 7:
- Repeat from step 5 until there are no more unmapped memory regions.
This:
a) removes any need to guesstimate how much memory the page tables are
going to consume. We simply construct them; they may not be contiguous but
that's okay.
b) very cleanly solves the Xen problem of not wanting to status-flip
pages any more than necessary.
The only reason for moving downward rather than upward is that we want
the page tables as high as possible in memory, since memory at low
addresses is precious (for stupid DMA devices, for things like
kexec/kdump, and so on.)
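In rough pseudocode (the helpers here are invented purely to show the
flow; this is a sketch of the scheme, not an implementation):

	void __init construct_mappings(void)
	{
		u64 lo, hi;

		/*
		 * Steps 1-2: brk-backed tables map kernel text+data+bss+brk,
		 * then the topmost memory region downward until the brk
		 * reservation is exhausted.
		 */
		map_kernel_from_brk();
		map_top_region_from_brk(&lo, &hi);

		/*
		 * Steps 3-4: paravirt hook (no-op on native, RO + pin on
		 * Xen), then switch to the new page tables.
		 */
		pagetable_constructed_hook(lo, hi);
		switch_to_new_pagetables();

		/*
		 * Steps 5-7: keep walking downward; new page tables are
		 * always allocated out of the already-mapped [lo, hi) area.
		 */
		while (unmapped_ram_below(lo)) {
			u64 new_lo = map_next_region_down(lo, hi);

			pagetable_constructed_hook(new_lo, lo);
			lo = new_lo;
		}
	}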
-hpa
--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel. I don't speak on their behalf.
On 10/04/2012 02:46 PM, Yinghai Lu wrote:
>
> or let xen map that page table by itself at first?
>
See my other post. This is bringing up the Kernel Summit algorithm again.
-hpa
--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel. I don't speak on their behalf.
On Thu, Oct 4, 2012 at 2:54 PM, H. Peter Anvin <[email protected]> wrote:
>
> See my other post. This is bringing up the Kernel Summit algorithm again.
>
Sure. Please check if you are OK with the attached one on top of x86/mm2.
Thanks
Yinghai
On Thu, 4 Oct 2012, Yinghai Lu wrote:
> On Mon, Oct 1, 2012 at 4:00 AM, Stefano Stabellini
> <[email protected]> wrote:
> > On Sun, 30 Sep 2012, Yinghai Lu wrote:
> >> After
> >>
> >> | commit 8548c84da2f47e71bbbe300f55edb768492575f7
> >> | Author: Takashi Iwai <[email protected]>
> >> | Date: Sun Oct 23 23:19:12 2011 +0200
> >> |
> >> | x86: Fix S4 regression
> >> |
> >> | Commit 4b239f458 ("x86-64, mm: Put early page table high") causes a S4
> >> | regression since 2.6.39, namely the machine reboots occasionally at S4
> >> | resume. It doesn't happen always, overall rate is about 1/20. But,
> >> | like other bugs, once when this happens, it continues to happen.
> >> |
> >> | This patch fixes the problem by essentially reverting the memory
> >> | assignment in the older way.
> >>
> >> Have some page table around 512M again, that will prevent kdump to find 512M
> >> under 768M.
> >>
> >> We need revert that reverting, so we could put page table high again for 64bit.
> >>
> >> Takashi agreed that S4 regression could be something else.
> >>
> >> https://lkml.org/lkml/2012/6/15/182
> >>
> >> Signed-off-by: Yinghai Lu <[email protected]>
> >> ---
> >> arch/x86/mm/init.c | 2 +-
> >> 1 files changed, 1 insertions(+), 1 deletions(-)
> >>
> >> diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
> >> index 9f69180..aadb154 100644
> >> --- a/arch/x86/mm/init.c
> >> +++ b/arch/x86/mm/init.c
> >> @@ -76,8 +76,8 @@ static void __init find_early_table_space(struct map_range *mr,
> >> #ifdef CONFIG_X86_32
> >> /* for fixmap */
> >> tables += roundup(__end_of_fixed_addresses * sizeof(pte_t), PAGE_SIZE);
> >> -#endif
> >> good_end = max_pfn_mapped << PAGE_SHIFT;
> >> +#endif
> >>
> >> base = memblock_find_in_range(start, good_end, tables, PAGE_SIZE);
> >> if (!base)
> >
> > Isn't this going to cause init_memory_mapping to allocate pagetable
> > pages from memory not yet mapped?
>
> but 64bit is using ioremap to access those page table buf.
Yes, but as Konrad explained, the mapping should be RO or RW on Xen
depending on whether the pagetable pages are already hooked into the
pagetable or not. ioremap could be called both on not-yet-hooked pagetable
pages (or non-pagetable pages) and on already-hooked pagetable pages, and
they need to be marked differently. Finally, when you are going to map
the region in memory that contains the pagetable pages, the entire range
needs to be marked RO.
These issues could be avoided if the pagetable pages were allocated from
a memory region that is already mapped and we had a proper pvop to warn
the Xen subsystem that the pagetable pages are about to be hooked into
the live pagetable.
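Something like this would do, say (the second hook is hypothetical, just
to illustrate the idea; pagetable_reserve is the only hook we have today):

	struct x86_init_mapping {
		/* existing */
		void (*pagetable_reserve)(u64 start, u64 end);
		/*
		 * hypothetical: called right before [start, end) gets
		 * hooked into the live pagetable, so Xen could flip the
		 * whole range read-only (and pin it) in one go; a no-op
		 * on native.
		 */
		void (*pagetable_prepare)(u64 start, u64 end);
	};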
> > Last time I spoke with HPA and Thomas about this, they seem to agree
> > that it isn't a very good idea.
> > Also, it is proven to cause a certain amount of headaches on Xen,
> > see commit d8aa5ec3382e6a545b8f25178d1e0992d4927f19.
>
> this patchset will allocate page table buf one time only.
> So could use ram under 1M to map that page table at first.
I don't have anything against the goal of this series; the problem is
just where these pagetable pages come from.
> so that will make it xen happy ?
It would greatly simplify our life if the pagetable pages were
allocated from memory already mapped.
On Fri, 5 Oct 2012, Yinghai Lu wrote:
> On Thu, Oct 4, 2012 at 2:54 PM, H. Peter Anvin <[email protected]> wrote:
> >
> > See my other post. This is bringing up the Kernel Summit algorithm again.
> >
>
> sure. please check if you are ok with attached one on top of x86/mm2
>
> Subject: [PATCH] x86: get early page table from BRK
>
> set pgt_buf early from BRK, and use it to map page table at first.
>
> also use the left at first, then use new extend one.
>
> Signed-off-by: Yinghai Lu <[email protected]>
If I read the patch correctly, it wouldn't actually change the pagetable
allocation flow or implement Peter's suggestion.
However it would pre-map (pgt_buf_start-pgt_buf_top) using brk memory.
So, if that's correct, we could remove the early_memremap call from
alloc_low_page and map_low_page, right?
Also the patch introduces an additional range of pagetable pages
(early_pgt_buf_start-early_pgt_buf_top) and that would need to be
communicated somehow to Xen (which at the moment assumes that the range
is pgt_buf_start-pgt_buf_top; pretty bad).
Overall I still prefer Peter's suggestion :)
> ---
> arch/x86/include/asm/init.h | 4 ++++
> arch/x86/include/asm/pgtable.h | 1 +
> arch/x86/kernel/setup.c | 2 ++
> arch/x86/mm/init.c | 23 +++++++++++++++++++++++
> arch/x86/mm/init_32.c | 8 ++++++--
> arch/x86/mm/init_64.c | 8 ++++++--
> 6 files changed, 42 insertions(+), 4 deletions(-)
>
> Index: linux-2.6/arch/x86/include/asm/pgtable.h
> ===================================================================
> --- linux-2.6.orig/arch/x86/include/asm/pgtable.h
> +++ linux-2.6/arch/x86/include/asm/pgtable.h
> @@ -599,6 +599,7 @@ static inline int pgd_none(pgd_t pgd)
>
> extern int direct_gbpages;
> void init_mem_mapping(void);
> +void early_alloc_pgt_buf(void);
>
> /* local pte updates need not use xchg for locking */
> static inline pte_t native_local_ptep_get_and_clear(pte_t *ptep)
> Index: linux-2.6/arch/x86/kernel/setup.c
> ===================================================================
> --- linux-2.6.orig/arch/x86/kernel/setup.c
> +++ linux-2.6/arch/x86/kernel/setup.c
> @@ -950,6 +950,8 @@ void __init setup_arch(char **cmdline_p)
>
> reserve_ibft_region();
>
> + early_alloc_pgt_buf();
> +
> /*
> * Need to conclude brk, before memblock_x86_fill()
> * it could use memblock_find_in_range, could overlap with
> Index: linux-2.6/arch/x86/mm/init.c
> ===================================================================
> --- linux-2.6.orig/arch/x86/mm/init.c
> +++ linux-2.6/arch/x86/mm/init.c
> @@ -21,6 +21,10 @@ unsigned long __initdata pgt_buf_start;
> unsigned long __meminitdata pgt_buf_end;
> unsigned long __meminitdata pgt_buf_top;
>
> +unsigned long __initdata early_pgt_buf_start;
> +unsigned long __meminitdata early_pgt_buf_end;
> +unsigned long __meminitdata early_pgt_buf_top;
> +
> int after_bootmem;
>
> int direct_gbpages
> @@ -291,6 +295,11 @@ static void __init find_early_table_spac
> if (!base)
> panic("Cannot find space for the kernel page tables");
>
> + init_memory_mapping(base, base + tables);
> + printk(KERN_DEBUG "kernel direct mapping tables from %#llx to %#llx @ [mem %#010lx-%#010lx]\n",
> + base, base + tables - 1, early_pgt_buf_start << PAGE_SHIFT,
> + (early_pgt_buf_end << PAGE_SHIFT) - 1);
> +
> pgt_buf_start = base >> PAGE_SHIFT;
> pgt_buf_end = pgt_buf_start;
> pgt_buf_top = pgt_buf_start + (tables >> PAGE_SHIFT);
> @@ -437,6 +446,20 @@ void __init init_mem_mapping(void)
> early_memtest(0, max_pfn_mapped << PAGE_SHIFT);
> }
>
> +RESERVE_BRK(early_pgt_alloc, 16384);
> +
> +void __init early_alloc_pgt_buf(void)
> +{
> + unsigned long tables = 13864;
> + phys_addr_t base;
> +
> + base = __pa(extend_brk(tables, PAGE_SIZE));
> +
> + early_pgt_buf_start = base >> PAGE_SHIFT;
> + early_pgt_buf_end = early_pgt_buf_start;
> + early_pgt_buf_top = early_pgt_buf_start + (tables >> PAGE_SHIFT);
> +}
> +
> /*
> * devmem_is_allowed() checks to see if /dev/mem access to a certain address
> * is valid. The argument is a physical page number.
> Index: linux-2.6/arch/x86/mm/init_32.c
> ===================================================================
> --- linux-2.6.orig/arch/x86/mm/init_32.c
> +++ linux-2.6/arch/x86/mm/init_32.c
> @@ -61,10 +61,14 @@ bool __read_mostly __vmalloc_start_set =
>
> static __init void *alloc_low_page(void)
> {
> - unsigned long pfn = pgt_buf_end++;
> + unsigned long pfn;
> void *adr;
>
> - if (pfn >= pgt_buf_top)
> + if (early_pgt_buf_end < early_pgt_buf_top)
> + pfn = early_pgt_buf_end++;
> + else if (pgt_buf_end < pgt_buf_top)
> + pfn = pgt_buf_end++;
> + else
> panic("alloc_low_page: ran out of memory");
>
> adr = __va(pfn * PAGE_SIZE);
> Index: linux-2.6/arch/x86/mm/init_64.c
> ===================================================================
> --- linux-2.6.orig/arch/x86/mm/init_64.c
> +++ linux-2.6/arch/x86/mm/init_64.c
> @@ -318,7 +318,7 @@ void __init cleanup_highmap(void)
>
> static __ref void *alloc_low_page(unsigned long *phys)
> {
> - unsigned long pfn = pgt_buf_end++;
> + unsigned long pfn;
> void *adr;
>
> if (after_bootmem) {
> @@ -328,7 +328,11 @@ static __ref void *alloc_low_page(unsign
> return adr;
> }
>
> - if (pfn >= pgt_buf_top)
> + if (early_pgt_buf_end < early_pgt_buf_top)
> + pfn = early_pgt_buf_end++;
> + else if (pgt_buf_end < pgt_buf_top)
> + pfn = pgt_buf_end++;
> + else
> panic("alloc_low_page: ran out of memory");
>
> adr = early_memremap(pfn * PAGE_SIZE, PAGE_SIZE);
On Fri, Oct 5, 2012 at 4:27 AM, Stefano Stabellini
<[email protected]> wrote:
> On Fri, 5 Oct 2012, Yinghai Lu wrote:
>> On Thu, Oct 4, 2012 at 2:54 PM, H. Peter Anvin <[email protected]> wrote:
>> >
>> > See my other post. This is bringing up the Kernel Summit algorithm again.
>> >
>>
>> sure. please check if you are ok with attached one on top of x86/mm2
>>
>> Subject: [PATCH] x86: get early page table from BRK
>>
>> set pgt_buf early from BRK, and use it to map page table at first.
>>
>> also use the left at first, then use new extend one.
>>
>> Signed-off-by: Yinghai Lu <[email protected]>
>
> If I read the patch correctly, it wouldn't actually change the pagetable
> allocation flow or implement Peter's suggestion.
> However it would pre-map (pgt_buf_start-pgt_buf_top) using brk memory.
>
> So, if that's correct, we could remove the early_memremap call from
> alloc_low_page and map_low_page, right?
>
> Also the patch introduces an additional range of pagetable pages
> (early_pgt_buf_start-early_pgt_buf_top) and that would need to be
> communicated somehow to Xen (that at the moment assumes that the range
> is pgt_buf_start-pgt_buf_top. Pretty bad).
That will need two extra lines for Xen:
@@ -430,6 +439,8 @@ void __init init_mem_mapping(void)
x86_init.mapping.pagetable_reserve(PFN_PHYS(pgt_buf_start),
PFN_PHYS(pgt_buf_end));
}
+ x86_init.mapping.pagetable_reserve(PFN_PHYS(early_pgt_buf_start),
+ PFN_PHYS(early_pgt_buf_end));
/* stop the wrong using */
pgt_buf_top = 0;
>
> Overall I still prefer Peter's suggestion :)
>
>
>> ---
>> arch/x86/include/asm/init.h | 4 ++++
>> arch/x86/include/asm/pgtable.h | 1 +
>> arch/x86/kernel/setup.c | 2 ++
>> arch/x86/mm/init.c | 23 +++++++++++++++++++++++
>> arch/x86/mm/init_32.c | 8 ++++++--
>> arch/x86/mm/init_64.c | 8 ++++++--
>> 6 files changed, 42 insertions(+), 4 deletions(-)
>>
>> Index: linux-2.6/arch/x86/include/asm/pgtable.h
>> ===================================================================
>> --- linux-2.6.orig/arch/x86/include/asm/pgtable.h
>> +++ linux-2.6/arch/x86/include/asm/pgtable.h
>> @@ -599,6 +599,7 @@ static inline int pgd_none(pgd_t pgd)
>>
>> extern int direct_gbpages;
>> void init_mem_mapping(void);
>> +void early_alloc_pgt_buf(void);
>>
>> /* local pte updates need not use xchg for locking */
>> static inline pte_t native_local_ptep_get_and_clear(pte_t *ptep)
>> Index: linux-2.6/arch/x86/kernel/setup.c
>> ===================================================================
>> --- linux-2.6.orig/arch/x86/kernel/setup.c
>> +++ linux-2.6/arch/x86/kernel/setup.c
>> @@ -950,6 +950,8 @@ void __init setup_arch(char **cmdline_p)
>>
>> reserve_ibft_region();
>>
>> + early_alloc_pgt_buf();
>> +
>> /*
>> * Need to conclude brk, before memblock_x86_fill()
>> * it could use memblock_find_in_range, could overlap with
>> Index: linux-2.6/arch/x86/mm/init.c
>> ===================================================================
>> --- linux-2.6.orig/arch/x86/mm/init.c
>> +++ linux-2.6/arch/x86/mm/init.c
>> @@ -21,6 +21,10 @@ unsigned long __initdata pgt_buf_start;
>> unsigned long __meminitdata pgt_buf_end;
>> unsigned long __meminitdata pgt_buf_top;
>>
>> +unsigned long __initdata early_pgt_buf_start;
>> +unsigned long __meminitdata early_pgt_buf_end;
>> +unsigned long __meminitdata early_pgt_buf_top;
>> +
>> int after_bootmem;
>>
>> int direct_gbpages
>> @@ -291,6 +295,11 @@ static void __init find_early_table_spac
>> if (!base)
>> panic("Cannot find space for the kernel page tables");
>>
>> + init_memory_mapping(base, base + tables);
>> + printk(KERN_DEBUG "kernel direct mapping tables from %#llx to %#llx @ [mem %#010lx-%#010lx]\n",
>> + base, base + tables - 1, early_pgt_buf_start << PAGE_SHIFT,
>> + (early_pgt_buf_end << PAGE_SHIFT) - 1);
>> +
>> pgt_buf_start = base >> PAGE_SHIFT;
>> pgt_buf_end = pgt_buf_start;
>> pgt_buf_top = pgt_buf_start + (tables >> PAGE_SHIFT);
>> @@ -437,6 +446,20 @@ void __init init_mem_mapping(void)
>> early_memtest(0, max_pfn_mapped << PAGE_SHIFT);
>> }
>>
>> +RESERVE_BRK(early_pgt_alloc, 16384);
>> +
>> +void __init early_alloc_pgt_buf(void)
>> +{
>> + unsigned long tables = 13864;
>> + phys_addr_t base;
>> +
>> + base = __pa(extend_brk(tables, PAGE_SIZE));
>> +
>> + early_pgt_buf_start = base >> PAGE_SHIFT;
>> + early_pgt_buf_end = early_pgt_buf_start;
>> + early_pgt_buf_top = early_pgt_buf_start + (tables >> PAGE_SHIFT);
>> +}
>> +
>> /*
>> * devmem_is_allowed() checks to see if /dev/mem access to a certain address
>> * is valid. The argument is a physical page number.
>> Index: linux-2.6/arch/x86/mm/init_32.c
>> ===================================================================
>> --- linux-2.6.orig/arch/x86/mm/init_32.c
>> +++ linux-2.6/arch/x86/mm/init_32.c
>> @@ -61,10 +61,14 @@ bool __read_mostly __vmalloc_start_set =
>>
>> static __init void *alloc_low_page(void)
>> {
>> - unsigned long pfn = pgt_buf_end++;
>> + unsigned long pfn;
>> void *adr;
>>
>> - if (pfn >= pgt_buf_top)
>> + if (early_pgt_buf_end < early_pgt_buf_top)
>> + pfn = early_pgt_buf_end++;
>> + else if (pgt_buf_end < pgt_buf_top)
>> + pfn = pgt_buf_end++;
>> + else
>> panic("alloc_low_page: ran out of memory");
>>
>> adr = __va(pfn * PAGE_SIZE);
>> Index: linux-2.6/arch/x86/mm/init_64.c
>> ===================================================================
>> --- linux-2.6.orig/arch/x86/mm/init_64.c
>> +++ linux-2.6/arch/x86/mm/init_64.c
>> @@ -318,7 +318,7 @@ void __init cleanup_highmap(void)
>>
>> static __ref void *alloc_low_page(unsigned long *phys)
>> {
>> - unsigned long pfn = pgt_buf_end++;
>> + unsigned long pfn;
>> void *adr;
>>
>> if (after_bootmem) {
>> @@ -328,7 +328,11 @@ static __ref void *alloc_low_page(unsign
>> return adr;
>> }
>>
>> - if (pfn >= pgt_buf_top)
>> + if (early_pgt_buf_end < early_pgt_buf_top)
>> + pfn = early_pgt_buf_end++;
>> + else if (pgt_buf_end < pgt_buf_top)
>> + pfn = pgt_buf_end++;
>> + else
>> panic("alloc_low_page: ran out of memory");
>>
>> adr = early_memremap(pfn * PAGE_SIZE, PAGE_SIZE);
Yinghai Lu <[email protected]> writes:
> On Thu, Oct 4, 2012 at 9:46 AM, Konrad Rzeszutek Wilk
> <[email protected]> wrote:
>>> then kdump may have problem get big range again.
>>
>> Is there a git commit that explains what the 'big range' problem is?
At least on x86_64 this was recently tested and anywhere below 4G is
good, and there is a patch floating around somewhere to remove this
issue.
I don't know about x86_32.
Eric
On Fri, Oct 5, 2012 at 2:04 PM, Eric W. Biederman <[email protected]> wrote:
>>> Is there a git commit that explains what the 'big range' problem is?
>
> At least on x86_64 this was recently tested and anywhere below 4G is
> good, and there is a patch floating around somewhere to remove this
> issue.
patch for kernel or kexec-tools?
Yinghai
Yinghai Lu <[email protected]> writes:
> On Fri, Oct 5, 2012 at 2:04 PM, Eric W. Biederman <[email protected]> wrote:
>>>> Is there a git commit that explains what the 'big range' problem is?
>>
>> At least on x86_64 this was recently tested and anywhere below 4G is
>> good, and there is a patch floating around somewhere to remove this
>> issue.
>
> patch for kernel or kexec-tools?
kernel.
The SGI guys needed a kdump kernel with 1G of RAM to dump all of
the memory on one of their crazy large machines and so investigated
this.
Basically they found that a kdump kernel loaded anywhere < 4G worked;
the only change that was needed was to relax the 896M hard-coded limit.
In one test they had a kdump kernel loaded above 2G.
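IIRC the kernel side amounts to little more than relaxing the hard-coded
limit quoted earlier in this thread, e.g. (a sketch only, not the actual
floating patch):

 #ifdef CONFIG_X86_32
 # define CRASH_KERNEL_ADDR_MAX	(512 << 20)
 #else
-# define CRASH_KERNEL_ADDR_MAX	(896 << 20)
+/* anything below 4G works once the old kexec-tools limit is gone */
+# define CRASH_KERNEL_ADDR_MAX	(1ULL << 32)
 #endif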
Eric
On Fri, Oct 5, 2012 at 2:32 PM, Eric W. Biederman <[email protected]> wrote:
> Yinghai Lu <[email protected]> writes:
>
>> On Fri, Oct 5, 2012 at 2:04 PM, Eric W. Biederman <[email protected]> wrote:
>>>>> Is there a git commit that explains what the 'big range' problem is?
>>>
>>> At least on x86_64 this was recently tested and anywhere below 4G is
>>> good, and there is a patch floating around somewhere to remove this
>>> issue.
>>
>> patch for kernel or kexec-tools?
>
> kernel.
>
> The sgi guys needed a kdump kernel with 1G of ram to dump their all of
> the memory on one of their crazy large machines and so investigated
> this.
>
> Basically they found that a kdump kernel loaded anywhere < 4G worked,
> the only change that was needed was to relaxy the 896M hard code.
>
> In one test they had a kdump kernel loaded above 2G.
with bzImage or vmlinux?
Yinghai Lu <[email protected]> writes:
> with bzImage or vmlinux?
bzImage I presume. Certainly the bzImage has lost its 896M limit,
which is where ultimately the 896M limit came from.
Eric
On Fri, Oct 5, 2012 at 2:41 PM, Eric W. Biederman <[email protected]> wrote:
> Yinghai Lu <[email protected]> writes:
>
>> with bzImage or vmlinux?
>
> bzImage I presume. Certainly the bzImage has lost it's 896M limit,
> which is where ultimiately the 896M limite came from.
Are they using updated kexec-tools?
Last time I checked the kexec-tools code, I
found the 896M problem was from kexec-tools' bzImage support.
I am going to see about merging these two threads.
Yinghai Lu <[email protected]> writes:
> On Fri, Oct 5, 2012 at 2:41 PM, Eric W. Biederman <[email protected]> wrote:
>> Yinghai Lu <[email protected]> writes:
>>
>>> with bzImage or vmlinux?
>>
>> bzImage I presume. Certainly the bzImage has lost it's 896M limit,
>> which is where ultimiately the 896M limite came from.
>
> they are using updated kexec-tools ?
>
> last time when i checked the code for kexec-tools
> found the 896M problem was from kexec-tools bzimage support.
Cliff Wickman was the guy at sgi running the tests.
To the best of my knowledge he was running an up-to-date kexec-tools and
was loading a bzImage. Of course his initial reaction was where did the
896M limit come from, as he had just updated to a kernel with the limit
a few weeks ago.
YH please talk to Cliff directly.
Eric
On 10/05/2012 02:32 PM, Eric W. Biederman wrote:
> Yinghai Lu <[email protected]> writes:
>
>> On Fri, Oct 5, 2012 at 2:04 PM, Eric W. Biederman <[email protected]> wrote:
>>>>> Is there a git commit that explains what the 'big range' problem is?
>>>
>>> At least on x86_64 this was recently tested and anywhere below 4G is
>>> good, and there is a patch floating around somewhere to remove this
>>> issue.
>>
>> patch for kernel or kexec-tools?
>
> kernel.
>
> The sgi guys needed a kdump kernel with 1G of ram to dump their all of
> the memory on one of their crazy large machines and so investigated
> this.
>
> Basically they found that a kdump kernel loaded anywhere < 4G worked,
> the only change that was needed was to relaxy the 896M hard code.
>
> In one test they had a kdump kernel loaded above 2G.
>
Seriously, any case where we can't load anywhere in physical ram on
x86-64 is a bug. i386 is another matter.
-hpa
--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel. I don't speak on their behalf.
On 10/05/2012 02:41 PM, Eric W. Biederman wrote:
> Yinghai Lu <[email protected]> writes:
>
>> with bzImage or vmlinux?
>
> bzImage I presume. Certainly the bzImage has lost it's 896M limit,
> which is where ultimiately the 896M limite came from.
>
The ~896M limit actually comes from i386, not from bzImage...
-hpa
--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel. I don't speak on their behalf.
"H. Peter Anvin" <[email protected]> writes:
> On 10/05/2012 02:32 PM, Eric W. Biederman wrote:
>> Yinghai Lu <[email protected]> writes:
>>
>>> On Fri, Oct 5, 2012 at 2:04 PM, Eric W. Biederman <[email protected]> wrote:
>>>>>> Is there a git commit that explains what the 'big range' problem is?
>>>>
>>>> At least on x86_64 this was recently tested and anywhere below 4G is
>>>> good, and there is a patch floating around somewhere to remove this
>>>> issue.
>>>
>>> patch for kernel or kexec-tools?
>>
>> kernel.
>>
>> The sgi guys needed a kdump kernel with 1G of ram to dump their all of
>> the memory on one of their crazy large machines and so investigated
>> this.
>>
>> Basically they found that a kdump kernel loaded anywhere < 4G worked,
>> the only change that was needed was to relaxy the 896M hard code.
>>
>> In one test they had a kdump kernel loaded above 2G.
>>
>
> Seriously, any case where we can't load anywhere in physical ram on x86-64 is a
> bug. i386 is another matter.
As I recall there are data structures like the IDT that only have a
32bit base address.
According to the bzImage header we don't support ramdisks above 4G.
I think we also have a 32bit address for the kernel command line
in the bzImage header.
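For reference, the relevant fields in struct setup_header are all 32-bit
(abridged excerpt, only the fields that matter here):

	struct setup_header {
		...
		__u32	ramdisk_image;	/* initrd load address */
		__u32	ramdisk_size;
		...
		__u32	cmd_line_ptr;	/* kernel command line pointer */
		...
	} __attribute__((packed));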
In the case of kdump in particular there is a need for DMAable
memory and in general that means memory below 4G. So as long
as we only support one memory extent for kdump it makes sense
for that segment to be below 4G.
For a normal x86_64 kernel which gets to use most of the memory it
definitely should be loadable anywhere in memory.
Eric
On 10/05/2012 05:28 PM, Eric W. Biederman wrote:
>>
>> Seriously, any case where we can't load anywhere in physical ram on x86-64 is a
>> bug. i386 is another matter.
>
> As I recall there are data structures like the IDT that only have a
> 32bit base address.
>
Not true. The only one I know of is memory for the trampoline which has
to be below 1M. The < 1M space is already handled specially for good
reason.
> According to the bzImage header we don't support ramdisks above 4G.
> I think we also have a 32bit address for the kernel command line
> in the bzImage header.
There are pointers in the bzImage header, that is true. We can fix that
problem, though, at least for entry via the 64-bit entry point.
> In the case of kdump in particular there is a need for DMAable
> memory and in general that means memory below 4G. So as long
> as we only support one memory extent for kdump it makes sense
> for that segment to be below 4G.
"In general" meaning "no iotlb"? In that case you have some unknown
address space restriction which may or may not be 4G...
> For a normal x86_64 kernel which gets to use most of the memory it
> definitely should be loadable anywhere in memory.
Yes. We should fix problems, like the limitations in the boot_params
structure.
-hpa
--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel. I don't speak on their behalf.
"H. Peter Anvin" <[email protected]> writes:
> On 10/05/2012 02:41 PM, Eric W. Biederman wrote:
>> Yinghai Lu <[email protected]> writes:
>>
>>> with bzImage or vmlinux?
>>
>> bzImage I presume. Certainly the bzImage has lost it's 896M limit,
>> which is where ultimiately the 896M limite came from.
>>
>
> ~896M (actually comes from i386, not from bzImage...
Right, it was 1G - VMALLOC_SIZE.
At some point that affected the boot protocol as the maximum address at
which we could load ramdisks.
Eric
That disappeared 10 years ago...
[email protected] wrote:
>"H. Peter Anvin" <[email protected]> writes:
>
>> On 10/05/2012 02:41 PM, Eric W. Biederman wrote:
>>> Yinghai Lu <[email protected]> writes:
>>>
>>>> with bzImage or vmlinux?
>>>
>>> bzImage I presume. Certainly the bzImage has lost it's 896M limit,
>>> which is where ultimiately the 896M limite came from.
>>>
>>
>> ~896M (actually comes from i386, not from bzImage...
>
>Right it was 1G - VMALLLOC_SIZE.
>
>At some point that affected the boot protocol as the maximum address we
>could load ramdisks.
>
>Eric
--
Sent from my mobile phone. Please excuse brevity and lack of formatting.
On top of tip/x86/mm2.
Also remove early_ioremap from page table accessing.
Yinghai Lu (3):
x86: get early page table from BRK
x86, mm: Don't clear page table if next range is ram
x86, mm: Remove early_memremap workaround for page table accessing
arch/x86/include/asm/init.h | 4 ++
arch/x86/include/asm/pgtable.h | 1 +
arch/x86/kernel/setup.c | 2 +
arch/x86/mm/init.c | 25 ++++++++++++
arch/x86/mm/init_32.c | 8 +++-
arch/x86/mm/init_64.c | 85 ++++++++++++++--------------------------
6 files changed, 67 insertions(+), 58 deletions(-)
--
1.7.7
Set pgt_buf early from BRK, and use it to map the page table at first.
Also use what is left of it first, then use the newly extended one.
-v2: extra Xen callback for that new range.
Signed-off-by: Yinghai Lu <[email protected]>
---
arch/x86/include/asm/init.h | 4 ++++
arch/x86/include/asm/pgtable.h | 1 +
arch/x86/kernel/setup.c | 2 ++
arch/x86/mm/init.c | 25 +++++++++++++++++++++++++
arch/x86/mm/init_32.c | 8 ++++++--
arch/x86/mm/init_64.c | 8 ++++++--
6 files changed, 44 insertions(+), 4 deletions(-)
diff --git a/arch/x86/include/asm/init.h b/arch/x86/include/asm/init.h
index 4f13998..2f32eea 100644
--- a/arch/x86/include/asm/init.h
+++ b/arch/x86/include/asm/init.h
@@ -16,4 +16,8 @@ extern unsigned long __initdata pgt_buf_start;
extern unsigned long __meminitdata pgt_buf_end;
extern unsigned long __meminitdata pgt_buf_top;
+extern unsigned long __initdata early_pgt_buf_start;
+extern unsigned long __meminitdata early_pgt_buf_end;
+extern unsigned long __meminitdata early_pgt_buf_top;
+
#endif /* _ASM_X86_INIT_32_H */
diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 52d40a1..25fa5bb 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -599,6 +599,7 @@ static inline int pgd_none(pgd_t pgd)
extern int direct_gbpages;
void init_mem_mapping(void);
+void early_alloc_pgt_buf(void);
/* local pte updates need not use xchg for locking */
static inline pte_t native_local_ptep_get_and_clear(pte_t *ptep)
diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index 4989f80..7eb6855 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -896,6 +896,8 @@ void __init setup_arch(char **cmdline_p)
reserve_ibft_region();
+ early_alloc_pgt_buf();
+
/*
* Need to conclude brk, before memblock_x86_fill()
* it could use memblock_find_in_range, could overlap with
diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
index cf662ba..c32eed1 100644
--- a/arch/x86/mm/init.c
+++ b/arch/x86/mm/init.c
@@ -21,6 +21,10 @@ unsigned long __initdata pgt_buf_start;
unsigned long __meminitdata pgt_buf_end;
unsigned long __meminitdata pgt_buf_top;
+unsigned long __initdata early_pgt_buf_start;
+unsigned long __meminitdata early_pgt_buf_end;
+unsigned long __meminitdata early_pgt_buf_top;
+
int after_bootmem;
int direct_gbpages
@@ -291,6 +295,11 @@ static void __init find_early_table_space(unsigned long start,
if (!base)
panic("Cannot find space for the kernel page tables");
+ init_memory_mapping(base, base + tables);
+ printk(KERN_DEBUG "kernel direct mapping tables from %#llx to %#llx @ [mem %#010lx-%#010lx]\n",
+ base, base + tables - 1, early_pgt_buf_start << PAGE_SHIFT,
+ (early_pgt_buf_end << PAGE_SHIFT) - 1);
+
pgt_buf_start = base >> PAGE_SHIFT;
pgt_buf_end = pgt_buf_start;
pgt_buf_top = pgt_buf_start + (tables >> PAGE_SHIFT);
@@ -430,6 +439,8 @@ void __init init_mem_mapping(void)
x86_init.mapping.pagetable_reserve(PFN_PHYS(pgt_buf_start),
PFN_PHYS(pgt_buf_end));
}
+ x86_init.mapping.pagetable_reserve(PFN_PHYS(early_pgt_buf_start),
+ PFN_PHYS(early_pgt_buf_end));
/* stop the wrong using */
pgt_buf_top = 0;
@@ -437,6 +448,20 @@ void __init init_mem_mapping(void)
early_memtest(0, max_pfn_mapped << PAGE_SHIFT);
}
+RESERVE_BRK(early_pgt_alloc, 16384);
+
+void __init early_alloc_pgt_buf(void)
+{
+ unsigned long tables = 16384;
+ phys_addr_t base;
+
+ base = __pa(extend_brk(tables, PAGE_SIZE));
+
+ early_pgt_buf_start = base >> PAGE_SHIFT;
+ early_pgt_buf_end = early_pgt_buf_start;
+ early_pgt_buf_top = early_pgt_buf_start + (tables >> PAGE_SHIFT);
+}
+
/*
* devmem_is_allowed() checks to see if /dev/mem access to a certain address
* is valid. The argument is a physical page number.
diff --git a/arch/x86/mm/init_32.c b/arch/x86/mm/init_32.c
index 11a5800..92c0f12 100644
--- a/arch/x86/mm/init_32.c
+++ b/arch/x86/mm/init_32.c
@@ -61,10 +61,14 @@ bool __read_mostly __vmalloc_start_set = false;
static __init void *alloc_low_page(void)
{
- unsigned long pfn = pgt_buf_end++;
+ unsigned long pfn;
void *adr;
- if (pfn >= pgt_buf_top)
+ if (early_pgt_buf_end < early_pgt_buf_top)
+ pfn = early_pgt_buf_end++;
+ else if (pgt_buf_end < pgt_buf_top)
+ pfn = pgt_buf_end++;
+ else
panic("alloc_low_page: ran out of memory");
adr = __va(pfn * PAGE_SIZE);
diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index ab558eb..5375cf0 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -316,7 +316,7 @@ void __init cleanup_highmap(void)
static __ref void *alloc_low_page(unsigned long *phys)
{
- unsigned long pfn = pgt_buf_end++;
+ unsigned long pfn;
void *adr;
if (after_bootmem) {
@@ -326,7 +326,11 @@ static __ref void *alloc_low_page(unsigned long *phys)
return adr;
}
- if (pfn >= pgt_buf_top)
+ if (early_pgt_buf_end < early_pgt_buf_top)
+ pfn = early_pgt_buf_end++;
+ else if (pgt_buf_end < pgt_buf_top)
+ pfn = pgt_buf_end++;
+ else
panic("alloc_low_page: ran out of memory");
adr = early_memremap(pfn * PAGE_SIZE, PAGE_SIZE);
--
1.7.7
While adding code to use BRK to map the buffer for the final page table,
it should have been safe to remove early_memremap for page table accessing.
But we got a panic after that.
It turns out we clear the initial page table wrongly for the next range
when ranges are separated by holes.
And it only happens when we are trying to map the ranges one by one.
Fix the problem by checking before clearing the related page table.
Signed-off-by: Yinghai Lu <[email protected]>
---
arch/x86/mm/init_64.c | 39 +++++++++++++++++++--------------------
1 files changed, 19 insertions(+), 20 deletions(-)
diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index 5375cf0..0348a02 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -367,20 +367,21 @@ static unsigned long __meminit
phys_pte_init(pte_t *pte_page, unsigned long addr, unsigned long end,
pgprot_t prot)
{
- unsigned pages = 0;
+ unsigned long pages = 0, next;
unsigned long last_map_addr = end;
int i;
pte_t *pte = pte_page + pte_index(addr);
- for(i = pte_index(addr); i < PTRS_PER_PTE; i++, addr += PAGE_SIZE, pte++) {
+ for (i = pte_index(addr); i < PTRS_PER_PTE; i++, addr = next, pte++) {
+ next = (addr & PAGE_MASK) + PAGE_SIZE;
if (addr >= end) {
- if (!after_bootmem) {
- for(; i < PTRS_PER_PTE; i++, pte++)
- set_pte(pte, __pte(0));
- }
- break;
+ if (!after_bootmem &&
+ addr < (2UL<<30) &&
+ !e820_any_mapped(addr & PAGE_MASK, next, 0))
+ set_pte(pte, __pte(0));
+ continue;
}
/*
@@ -422,16 +423,15 @@ phys_pmd_init(pmd_t *pmd_page, unsigned long address, unsigned long end,
pte_t *pte;
pgprot_t new_prot = prot;
+ next = (address & PMD_MASK) + PMD_SIZE;
if (address >= end) {
- if (!after_bootmem) {
- for (; i < PTRS_PER_PMD; i++, pmd++)
- set_pmd(pmd, __pmd(0));
- }
- break;
+ if (!after_bootmem &&
+ address < (2UL<<30) &&
+ !e820_any_mapped(address & PMD_MASK, next, 0))
+ set_pmd(pmd, __pmd(0));
+ continue;
}
- next = (address & PMD_MASK) + PMD_SIZE;
-
if (pmd_val(*pmd)) {
if (!pmd_large(*pmd)) {
spin_lock(&init_mm.page_table_lock);
@@ -498,13 +498,12 @@ phys_pud_init(pud_t *pud_page, unsigned long addr, unsigned long end,
pmd_t *pmd;
pgprot_t prot = PAGE_KERNEL;
- if (addr >= end)
- break;
-
next = (addr & PUD_MASK) + PUD_SIZE;
-
- if (!after_bootmem && !e820_any_mapped(addr, next, 0)) {
- set_pud(pud, __pud(0));
+ if (addr >= end) {
+ if (!after_bootmem &&
+ addr < (2UL<<30) &&
+ !e820_any_mapped(addr & PUD_MASK, next, 0))
+ set_pud(pud, __pud(0));
continue;
}
--
1.7.7
Not needed anymore, now that the page table buffer is premapped and the initial
page table is no longer cleared wrongly.
Signed-off-by: Yinghai Lu <[email protected]>
---
arch/x86/mm/init_64.c | 38 ++++----------------------------------
1 files changed, 4 insertions(+), 34 deletions(-)
diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index 0348a02..e59c94f 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -333,36 +333,12 @@ static __ref void *alloc_low_page(unsigned long *phys)
else
panic("alloc_low_page: ran out of memory");
- adr = early_memremap(pfn * PAGE_SIZE, PAGE_SIZE);
+ adr = __va(pfn * PAGE_SIZE);
clear_page(adr);
*phys = pfn * PAGE_SIZE;
return adr;
}
-static __ref void *map_low_page(void *virt)
-{
- void *adr;
- unsigned long phys, left;
-
- if (after_bootmem)
- return virt;
-
- phys = __pa(virt);
- left = phys & (PAGE_SIZE - 1);
- adr = early_memremap(phys & PAGE_MASK, PAGE_SIZE);
- adr = (void *)(((unsigned long)adr) | left);
-
- return adr;
-}
-
-static __ref void unmap_low_page(void *adr)
-{
- if (after_bootmem)
- return;
-
- early_iounmap((void *)((unsigned long)adr & PAGE_MASK), PAGE_SIZE);
-}
-
static unsigned long __meminit
phys_pte_init(pte_t *pte_page, unsigned long addr, unsigned long end,
pgprot_t prot)
@@ -435,10 +411,9 @@ phys_pmd_init(pmd_t *pmd_page, unsigned long address, unsigned long end,
if (pmd_val(*pmd)) {
if (!pmd_large(*pmd)) {
spin_lock(&init_mm.page_table_lock);
- pte = map_low_page((pte_t *)pmd_page_vaddr(*pmd));
+ pte = (pte_t *)pmd_page_vaddr(*pmd);
last_map_addr = phys_pte_init(pte, address,
end, prot);
- unmap_low_page(pte);
spin_unlock(&init_mm.page_table_lock);
continue;
}
@@ -474,7 +449,6 @@ phys_pmd_init(pmd_t *pmd_page, unsigned long address, unsigned long end,
pte = alloc_low_page(&pte_phys);
last_map_addr = phys_pte_init(pte, address, end, new_prot);
- unmap_low_page(pte);
spin_lock(&init_mm.page_table_lock);
pmd_populate_kernel(&init_mm, pmd, __va(pte_phys));
@@ -509,10 +483,9 @@ phys_pud_init(pud_t *pud_page, unsigned long addr, unsigned long end,
if (pud_val(*pud)) {
if (!pud_large(*pud)) {
- pmd = map_low_page(pmd_offset(pud, 0));
+ pmd = pmd_offset(pud, 0);
last_map_addr = phys_pmd_init(pmd, addr, end,
page_size_mask, prot);
- unmap_low_page(pmd);
__flush_tlb_all();
continue;
}
@@ -548,7 +521,6 @@ phys_pud_init(pud_t *pud_page, unsigned long addr, unsigned long end,
pmd = alloc_low_page(&pmd_phys);
last_map_addr = phys_pmd_init(pmd, addr, end, page_size_mask,
prot);
- unmap_low_page(pmd);
spin_lock(&init_mm.page_table_lock);
pud_populate(&init_mm, pud, __va(pmd_phys));
@@ -584,17 +556,15 @@ kernel_physical_mapping_init(unsigned long start,
next = end;
if (pgd_val(*pgd)) {
- pud = map_low_page((pud_t *)pgd_page_vaddr(*pgd));
+ pud = (pud_t *)pgd_page_vaddr(*pgd);
last_map_addr = phys_pud_init(pud, __pa(start),
__pa(end), page_size_mask);
- unmap_low_page(pud);
continue;
}
pud = alloc_low_page(&pud_phys);
last_map_addr = phys_pud_init(pud, __pa(start), __pa(next),
page_size_mask);
- unmap_low_page(pud);
spin_lock(&init_mm.page_table_lock);
pgd_populate(&init_mm, pgd, __va(pud_phys));
--
1.7.7
On Fri, Oct 5, 2012 at 7:58 AM, Yinghai Lu <[email protected]> wrote:
> On Fri, Oct 5, 2012 at 4:27 AM, Stefano Stabellini
> <[email protected]> wrote:
>> On Fri, 5 Oct 2012, Yinghai Lu wrote:
>>> On Thu, Oct 4, 2012 at 2:54 PM, H. Peter Anvin <[email protected]> wrote:
>>> >
>>> > See my other post. This is bringing up the Kernel Summit algorithm again.
>>> >
>>>
>>> sure. please check if you are ok with attached one on top of x86/mm2
I updated my for-x86-mm branch; it should be OK now with Xen.
Can you please check it?
git://git.kernel.org/pub/scm/linux/kernel/git/yinghai/linux-yinghai.git
for-x86-mm
In addition to the patches in tip x86/mm2, there are 5 new patches:
688d437: x86, mm: Use big page for small memory range
41e562f: x86, mm: only keep initial mapping for ram
24ac352: x86, mm: Remove early_memremap workaround for page table accessing
d9dd599: x86, mm: Don't clear page table if next range is ram
080acbe: x86, mm: get early page table from BRK
Thanks
Yinghai
On Sat, 6 Oct 2012, Yinghai Lu wrote:
> set pgt_buf early from BRK, and use it to map page table at first.
>
> also use the left at first, then use new extend one.
>
> -v2: extra xen call back for that new range.
>
> Signed-off-by: Yinghai Lu <[email protected]>
> ---
> arch/x86/include/asm/init.h | 4 ++++
> arch/x86/include/asm/pgtable.h | 1 +
> arch/x86/kernel/setup.c | 2 ++
> arch/x86/mm/init.c | 25 +++++++++++++++++++++++++
> arch/x86/mm/init_32.c | 8 ++++++--
> arch/x86/mm/init_64.c | 8 ++++++--
> 6 files changed, 44 insertions(+), 4 deletions(-)
>
> diff --git a/arch/x86/include/asm/init.h b/arch/x86/include/asm/init.h
> index 4f13998..2f32eea 100644
> --- a/arch/x86/include/asm/init.h
> +++ b/arch/x86/include/asm/init.h
> @@ -16,4 +16,8 @@ extern unsigned long __initdata pgt_buf_start;
> extern unsigned long __meminitdata pgt_buf_end;
> extern unsigned long __meminitdata pgt_buf_top;
>
> +extern unsigned long __initdata early_pgt_buf_start;
> +extern unsigned long __meminitdata early_pgt_buf_end;
> +extern unsigned long __meminitdata early_pgt_buf_top;
> +
> #endif /* _ASM_X86_INIT_32_H */
> diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
> index 52d40a1..25fa5bb 100644
> --- a/arch/x86/include/asm/pgtable.h
> +++ b/arch/x86/include/asm/pgtable.h
> @@ -599,6 +599,7 @@ static inline int pgd_none(pgd_t pgd)
>
> extern int direct_gbpages;
> void init_mem_mapping(void);
> +void early_alloc_pgt_buf(void);
>
> /* local pte updates need not use xchg for locking */
> static inline pte_t native_local_ptep_get_and_clear(pte_t *ptep)
> diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
> index 4989f80..7eb6855 100644
> --- a/arch/x86/kernel/setup.c
> +++ b/arch/x86/kernel/setup.c
> @@ -896,6 +896,8 @@ void __init setup_arch(char **cmdline_p)
>
> reserve_ibft_region();
>
> + early_alloc_pgt_buf();
> +
> /*
> * Need to conclude brk, before memblock_x86_fill()
> * it could use memblock_find_in_range, could overlap with
> diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
> index cf662ba..c32eed1 100644
> --- a/arch/x86/mm/init.c
> +++ b/arch/x86/mm/init.c
> @@ -21,6 +21,10 @@ unsigned long __initdata pgt_buf_start;
> unsigned long __meminitdata pgt_buf_end;
> unsigned long __meminitdata pgt_buf_top;
>
> +unsigned long __initdata early_pgt_buf_start;
> +unsigned long __meminitdata early_pgt_buf_end;
> +unsigned long __meminitdata early_pgt_buf_top;
> +
> int after_bootmem;
>
> int direct_gbpages
> @@ -291,6 +295,11 @@ static void __init find_early_table_space(unsigned long start,
> if (!base)
> panic("Cannot find space for the kernel page tables");
>
> + init_memory_mapping(base, base + tables);
> + printk(KERN_DEBUG "kernel direct mapping tables from %#llx to %#llx @ [mem %#010lx-%#010lx]\n",
> + base, base + tables - 1, early_pgt_buf_start << PAGE_SHIFT,
> + (early_pgt_buf_end << PAGE_SHIFT) - 1);
> +
> pgt_buf_start = base >> PAGE_SHIFT;
> pgt_buf_end = pgt_buf_start;
> pgt_buf_top = pgt_buf_start + (tables >> PAGE_SHIFT);
> @@ -430,6 +439,8 @@ void __init init_mem_mapping(void)
> x86_init.mapping.pagetable_reserve(PFN_PHYS(pgt_buf_start),
> PFN_PHYS(pgt_buf_end));
> }
> + x86_init.mapping.pagetable_reserve(PFN_PHYS(early_pgt_buf_start),
> + PFN_PHYS(early_pgt_buf_end));
pagetable_reserve is not the right hook: pagetable_reserve tells the
subsystem that the memory range you are passing is going to be used for
pagetable pages. It is used to reserve that range using
memblock_reserve. On Xen it is also used to mark RW any pages _outside_
that range that have been marked RO: implicitly we assume that the full
range is pgt_buf_start-pgt_buf_top and we mark it RO (see the Xen memory
constraints on pagetable pages, as described by Konrad).
Calling pagetable_reserve(real_start, real_end) reserves
real_start-real_end as pagetable pages and frees
pgt_buf_start-real_start and real_end-pgt_buf_top.
So the problem is that at the moment we don't have a hook to say: "the
range of pagetable pages is pgt_buf_start-pgt_buf_top". In fact if you
have a look at arch/x86/xen/mmu.c you'll find a few references to
pgt_buf_start, pgt_buf_end, pgt_buf_top, that shouldn't really be there.
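To make this concrete, a rough sketch of the two behaviours just described
(make_range_rw() is a made-up placeholder for the Xen-specific RO->RW
conversion, and the Xen variant only sketches the semantics, it is not the
actual arch/x86/xen/mmu.c code; the native hook is essentially a
memblock_reserve() of the range):

/* needs <linux/memblock.h>, <asm/page.h>, <asm/init.h> */

/* native case: simply keep the pagetable range out of free memory */
void __init native_pagetable_reserve(u64 start, u64 end)
{
	memblock_reserve(start, end - start);
}

/* Xen-style semantics as described above (sketch only) */
void __init xen_style_pagetable_reserve(u64 start, u64 end)
{
	memblock_reserve(start, end - start);

	/*
	 * Everything in pgt_buf_start..pgt_buf_top was marked RO up front;
	 * whatever did not end up holding page tables has to be made RW
	 * again before it is handed back as normal memory.
	 */
	if (start > PFN_PHYS(pgt_buf_start))
		make_range_rw(PFN_PHYS(pgt_buf_start), start);	/* placeholder */
	if (end < PFN_PHYS(pgt_buf_top))
		make_range_rw(end, PFN_PHYS(pgt_buf_top));	/* placeholder */
}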
On Sat, Oct 06, 2012 at 12:44:28AM -0700, Yinghai Lu wrote:
> During adding code from BRK to map buffer for final page table,
>
> It should be safe to remove early_memmap for page table accessing.
>
> But get panic after that.
>
> It turns out we clear the initial page table wrongly for next range
> that is separated by holes.
Where are the holes? Are these E820 holes?
> And it only happens when we are trying to map range one by one range.
>
> After checking before clearing the related page table to fix the problem.
Huh?
>
> Signed-off-by: Yinghai Lu <[email protected]>
> ---
> arch/x86/mm/init_64.c | 39 +++++++++++++++++++--------------------
> 1 files changed, 19 insertions(+), 20 deletions(-)
>
> diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
> index 5375cf0..0348a02 100644
> --- a/arch/x86/mm/init_64.c
> +++ b/arch/x86/mm/init_64.c
> @@ -367,20 +367,21 @@ static unsigned long __meminit
> phys_pte_init(pte_t *pte_page, unsigned long addr, unsigned long end,
> pgprot_t prot)
> {
> - unsigned pages = 0;
> + unsigned long pages = 0, next;
> unsigned long last_map_addr = end;
> int i;
>
> pte_t *pte = pte_page + pte_index(addr);
>
> - for(i = pte_index(addr); i < PTRS_PER_PTE; i++, addr += PAGE_SIZE, pte++) {
> + for (i = pte_index(addr); i < PTRS_PER_PTE; i++, addr = next, pte++) {
>
> + next = (addr & PAGE_MASK) + PAGE_SIZE;
> if (addr >= end) {
> - if (!after_bootmem) {
> - for(; i < PTRS_PER_PTE; i++, pte++)
> - set_pte(pte, __pte(0));
> - }
> - break;
> + if (!after_bootmem &&
> + addr < (2UL<<30) &&
Why 2G?
> + !e820_any_mapped(addr & PAGE_MASK, next, 0))
What is the 0 parameter for?
> + set_pte(pte, __pte(0));
> + continue;
> }
>
> /*
> @@ -422,16 +423,15 @@ phys_pmd_init(pmd_t *pmd_page, unsigned long address, unsigned long end,
> pte_t *pte;
> pgprot_t new_prot = prot;
>
> + next = (address & PMD_MASK) + PMD_SIZE;
> if (address >= end) {
> - if (!after_bootmem) {
> - for (; i < PTRS_PER_PMD; i++, pmd++)
> - set_pmd(pmd, __pmd(0));
> - }
> - break;
> + if (!after_bootmem &&
> + address < (2UL<<30) &&
> + !e820_any_mapped(address & PMD_MASK, next, 0))
> + set_pmd(pmd, __pmd(0));
> + continue;
> }
>
> - next = (address & PMD_MASK) + PMD_SIZE;
> -
> if (pmd_val(*pmd)) {
> if (!pmd_large(*pmd)) {
> spin_lock(&init_mm.page_table_lock);
> @@ -498,13 +498,12 @@ phys_pud_init(pud_t *pud_page, unsigned long addr, unsigned long end,
> pmd_t *pmd;
> pgprot_t prot = PAGE_KERNEL;
>
> - if (addr >= end)
> - break;
Why do you get rid of that?
> -
> next = (addr & PUD_MASK) + PUD_SIZE;
> -
> - if (!after_bootmem && !e820_any_mapped(addr, next, 0)) {
> - set_pud(pud, __pud(0));
> + if (addr >= end) {
> + if (!after_bootmem &&
> + addr < (2UL<<30) &&
> + !e820_any_mapped(addr & PUD_MASK, next, 0))
> + set_pud(pud, __pud(0));
> continue;
> }
>
> --
> 1.7.7
>
On Sat, Oct 06, 2012 at 12:44:29AM -0700, Yinghai Lu wrote:
> Not needed anymore after premaping page table buf and not clear initial page
> table wrongly.
Your comment should include which patch made the iomap/iounmap part
unnecessary, and also explain why this work-around is not required anymore.
>
> Signed-off-by: Yinghai Lu <[email protected]>
> ---
> arch/x86/mm/init_64.c | 38 ++++----------------------------------
> 1 files changed, 4 insertions(+), 34 deletions(-)
>
> diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
> index 0348a02..e59c94f 100644
> --- a/arch/x86/mm/init_64.c
> +++ b/arch/x86/mm/init_64.c
> @@ -333,36 +333,12 @@ static __ref void *alloc_low_page(unsigned long *phys)
> else
> panic("alloc_low_page: ran out of memory");
>
> - adr = early_memremap(pfn * PAGE_SIZE, PAGE_SIZE);
> + adr = __va(pfn * PAGE_SIZE);
> clear_page(adr);
> *phys = pfn * PAGE_SIZE;
> return adr;
> }
>
> -static __ref void *map_low_page(void *virt)
> -{
> - void *adr;
> - unsigned long phys, left;
> -
> - if (after_bootmem)
> - return virt;
> -
> - phys = __pa(virt);
> - left = phys & (PAGE_SIZE - 1);
> - adr = early_memremap(phys & PAGE_MASK, PAGE_SIZE);
> - adr = (void *)(((unsigned long)adr) | left);
> -
> - return adr;
> -}
> -
> -static __ref void unmap_low_page(void *adr)
> -{
> - if (after_bootmem)
> - return;
> -
> - early_iounmap((void *)((unsigned long)adr & PAGE_MASK), PAGE_SIZE);
> -}
> -
> static unsigned long __meminit
> phys_pte_init(pte_t *pte_page, unsigned long addr, unsigned long end,
> pgprot_t prot)
> @@ -435,10 +411,9 @@ phys_pmd_init(pmd_t *pmd_page, unsigned long address, unsigned long end,
> if (pmd_val(*pmd)) {
> if (!pmd_large(*pmd)) {
> spin_lock(&init_mm.page_table_lock);
> - pte = map_low_page((pte_t *)pmd_page_vaddr(*pmd));
> + pte = (pte_t *)pmd_page_vaddr(*pmd);
> last_map_addr = phys_pte_init(pte, address,
> end, prot);
> - unmap_low_page(pte);
> spin_unlock(&init_mm.page_table_lock);
> continue;
> }
> @@ -474,7 +449,6 @@ phys_pmd_init(pmd_t *pmd_page, unsigned long address, unsigned long end,
>
> pte = alloc_low_page(&pte_phys);
> last_map_addr = phys_pte_init(pte, address, end, new_prot);
> - unmap_low_page(pte);
>
> spin_lock(&init_mm.page_table_lock);
> pmd_populate_kernel(&init_mm, pmd, __va(pte_phys));
> @@ -509,10 +483,9 @@ phys_pud_init(pud_t *pud_page, unsigned long addr, unsigned long end,
>
> if (pud_val(*pud)) {
> if (!pud_large(*pud)) {
> - pmd = map_low_page(pmd_offset(pud, 0));
> + pmd = pmd_offset(pud, 0);
> last_map_addr = phys_pmd_init(pmd, addr, end,
> page_size_mask, prot);
> - unmap_low_page(pmd);
> __flush_tlb_all();
> continue;
> }
> @@ -548,7 +521,6 @@ phys_pud_init(pud_t *pud_page, unsigned long addr, unsigned long end,
> pmd = alloc_low_page(&pmd_phys);
> last_map_addr = phys_pmd_init(pmd, addr, end, page_size_mask,
> prot);
> - unmap_low_page(pmd);
>
> spin_lock(&init_mm.page_table_lock);
> pud_populate(&init_mm, pud, __va(pmd_phys));
> @@ -584,17 +556,15 @@ kernel_physical_mapping_init(unsigned long start,
> next = end;
>
> if (pgd_val(*pgd)) {
> - pud = map_low_page((pud_t *)pgd_page_vaddr(*pgd));
> + pud = (pud_t *)pgd_page_vaddr(*pgd);
> last_map_addr = phys_pud_init(pud, __pa(start),
> __pa(end), page_size_mask);
> - unmap_low_page(pud);
> continue;
> }
>
> pud = alloc_low_page(&pud_phys);
> last_map_addr = phys_pud_init(pud, __pa(start), __pa(next),
> page_size_mask);
> - unmap_low_page(pud);
>
> spin_lock(&init_mm.page_table_lock);
> pgd_populate(&init_mm, pgd, __va(pud_phys));
> --
> 1.7.7
>
On Tue, Oct 9, 2012 at 8:46 AM, Konrad Rzeszutek Wilk <[email protected]> wrote:
>> + !e820_any_mapped(addr & PAGE_MASK, next, 0))
>
> What is the 0 parameter for?
Any type.
If type != 0, it will only check entries with the same type:
int
e820_any_mapped(u64 start, u64 end, unsigned type)
{
	int i;

	for (i = 0; i < e820.nr_map; i++) {
		struct e820entry *ei = &e820.map[i];

		if (type && ei->type != type)
			continue;
		if (ei->addr >= end || ei->addr + ei->size <= start)
			continue;
		return 1;
	}
	return 0;
}
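For illustration, two hypothetical wrappers (not in the tree) showing the
difference the type argument makes:

/* illustration only; needs <asm/e820.h> for e820_any_mapped() and E820_RAM */
static bool __init range_overlaps_any_e820(u64 start, u64 end)
{
	/* type == 0: match e820 entries of any type (RAM, reserved, ACPI, ...) */
	return e820_any_mapped(start, end, 0);
}

static bool __init range_overlaps_e820_ram(u64 start, u64 end)
{
	/* non-zero type: only entries of exactly that type are considered */
	return e820_any_mapped(start, end, E820_RAM);
}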
On Tue, Oct 09, 2012 at 06:00:12PM -0700, Yinghai Lu wrote:
> On Tue, Oct 9, 2012 at 8:46 AM, Konrad Rzeszutek Wilk <[email protected]> wrote:
> >> + !e820_any_mapped(addr & PAGE_MASK, next, 0))
> >
> > What is the 0 parameter for?
>
> any type
OK, which means that it either should have a #define for it, or at least
a comment, like:
0 /* any type */
as this would make it clear at first glance what it is - without having
to dive in e820_any_mapped function to determine that.
>
> if type != 0, the will only check entries with same type.
>
> int
> e820_any_mapped(u64 start, u64 end, unsigned type)
> {
> 	int i;
>
> 	for (i = 0; i < e820.nr_map; i++) {
> 		struct e820entry *ei = &e820.map[i];
>
> 		if (type && ei->type != type)
> 			continue;
> 		if (ei->addr >= end || ei->addr + ei->size <= start)
> 			continue;
> 		return 1;
> 	}
> 	return 0;
> }
>
On Wed, Oct 10, 2012 at 6:41 AM, Konrad Rzeszutek Wilk
<[email protected]> wrote:
> On Tue, Oct 09, 2012 at 06:00:12PM -0700, Yinghai Lu wrote:
>> On Tue, Oct 9, 2012 at 8:46 AM, Konrad Rzeszutek Wilk <[email protected]> wrote:
>> >> + !e820_any_mapped(addr & PAGE_MASK, next, 0))
>> >
>> > What is the 0 parameter for?
>>
>> any type
>
> OK, which means that it either should have a #define for it, or at least
> a comment, like:
>
> 0 /* any type */
>
> as this would make it clear at first glance what it is - without having
> to dive in e820_any_mapped function to determine that.
Yes, we should add E820_ANY_TYPE and update e820_any_mapped to use it.
I will address that later in another patch.
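For what it's worth, such a follow-up could look roughly like this
(E820_ANY_TYPE does not exist in the tree yet; this is only a sketch of the
idea, not an actual patch):

/* arch/x86/include/asm/e820.h */
#define E820_ANY_TYPE	0	/* e820_any_mapped(): match entries of any type */

/* a caller such as the one in phys_pte_init() would then read: */
	if (!after_bootmem &&
	    addr < (2UL << 30) &&
	    !e820_any_mapped(addr & PAGE_MASK, next, E820_ANY_TYPE))
		set_pte(pte, __pte(0));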