aka, remove the hack added by git commit 609cfda586c7fe3e5d1a02c51edb587506294167
(Merge branch 'stable/bug-fixes-for-rc5' of git://git.kernel.org/../xen)
One idea that is on the table was proposed by Yinghai:
"Xen should set RAM for page-table to RO after init_memory mapping."
In other words, don't do the magic 'mask_rw_pte' when set_pte is called.
(and don't do the calls to make_lowmem_page_readonly when allocating
the PTE table, nor PMD, nor PUD) - I think?
But instead do:
a). when we load the cr3? We could go through the whole pagetable and
set the pages RO as needed?
b). when we are finished with the creation of a page table? So similarly
to the point above - don't set the pages RO until we have completed
the full creation of the pagetable.
c). when post_allocator_start is called?
d). other ideas?
But I vaguely recall that we are using the page table as we are adding in the
entries. And that we are pinning them as well. Perhaps the trigger to scan the
pagetable and set them to RO should be done .. at what point? When the PUD/PMD
allocations are done? And when PUD/PMD are set?
On Fri, 13 May 2011, Konrad Rzeszutek Wilk wrote:
> aka, remove the hack added by git commit 609cfda586c7fe3e5d1a02c51edb587506294167
> (Merge branch 'stable/bug-fixes-for-rc5' of git://git.kernel.org/../xen)
>
> One idea that is on the table was proposed by Yinghai:
> "Xen should set RAM for page-table to RO after init_memory mapping."
>
> In other words, don't do the magic 'mask_rw_pte' when set_pte is called.
> (and don't do the calls to make_lowmem_page_readonly when allocating
> the PTE table, nor PMD, nor PUD) - I think?
>
> But instead do:
> a). when we load the cr3? We could go through the whole pagetable and
> set the pages RO as needed?
> b). when we are finished with the creation of a page table? So similarly
> to the point above - don't set the pages RO until we have completed
> the full creation of the pagetable.
> c). when post_allocator_start is called?
I think c) is too late because there might be allocations or at least
memblock allocations after init_memory_mapping and before
post_allocator_start.
> d). other ideas?
>
> But I vaguely recall that we are using the page table as we are adding in the
> entries. And that we are pinning them as well. Perhaps the trigger to scan the
> pagetable and set them to RO should be done .. at what point? When the PUD/PMD
> allocations are done? And when PUD/PMD are set?
>
Setting the pagetable pages RO after init_memory_mapping is difficult
because pagetable pages have to be set RO before becoming pagetable
pages.
They become pagetable pages when:
- they are explicitly pinned by pin_pagetable_pfn
- they are hooked into the current pagetable
Like you wrote, considering that the x86_64 version of
kernel_physical_mapping_init hooks the pagetable pages into the
currently used pagetable, it wouldn't be possible to mark the pagetable
pages RO after init_memory_mapping.
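For reference, pin_pagetable_pfn is roughly the following hypercall
wrapper (a sketch from memory of arch/x86/xen/mmu.c, so details may
differ between kernel versions):

static void pin_pagetable_pfn(unsigned int cmd, unsigned long pfn)
{
    struct mmuext_op op;

    /* pinning asks the hypervisor to validate and track this frame
     * as a pagetable page; the frame must already be mapped RO */
    op.cmd = cmd;                   /* e.g. MMUEXT_PIN_L1_TABLE */
    op.arg1.mfn = pfn_to_mfn(pfn);  /* hypervisor wants machine frames */
    if (HYPERVISOR_mmuext_op(&op, 1, NULL, DOMID_SELF))
        BUG();
}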
> They become pagetable pages when:
>
> - they are explicitly pinned by pin_pagetable_pfn
>
> - they are hooked into the current pagetable
Ok, so could we use those two calls to trigger the pagetable walk
and mark them RO as appropriate? Which call sites are those? The
xen_set_pgd/xen_set_pud/xen_set_pmd? Presumably we don't have
to do that for the PTEs that are already mapped (as
xen_setup_kernel_pagetable, and xen_map_identity_early do this
already).
> Like you wrote, considering that the x86_64 version of
> kernel_physical_mapping_init hooks the pagetable pages into the
> currently used pagetable, it wouldn't be possible to mark the pagetable
> pages RO after init_memory_mapping.
<nods>
On 05/16/2011 08:41 AM, Konrad Rzeszutek Wilk wrote:
>> They become pagetable pages when:
>>
>> - they are explicitly pinned by pin_pagetable_pfn
>>
>> - they are hooked into the current pagetable
>
> Ok, so could we use those two calls to trigger the pagetable walk
> and mark them RO as appropriate? Which call sites are those? The
> xen_set_pgd/xen_set_pud/xen_set_pmd? Presumably we don't have
> to do that for the PTEs that are already mapped (as
> xen_setup_kernel_pagetable, and xen_map_identity_early do this
> already).
>
>> Like you wrote, considering that the x86_64 version of
>> kernel_physical_mapping_init hooks the pagetable pages into the
>> currently used pagetable, it wouldn't be possible to mark the pagetable
>> pages RO after init_memory_mapping.
>
Doesn't Xen have some kind of compatibility mode which could be used
during setup?
-hpa
On Mon, 16 May 2011, Konrad Rzeszutek Wilk wrote:
> > They become pagetable pages when:
> >
> > - they are explicitly pinned by pin_pagetable_pfn
> >
> > - they are hooked into the current pagetable
>
> Ok, so could we use those two calls to trigger the pagetable walk
> and mark them RO as appropriate? Which call sites are those? The
> xen_set_pgd/xen_set_pud/xen_set_pmd?
xen_alloc_pte_init and xen_alloc_pmd_init are the ones that mark the
pagetable pages RO and pin them, calling make_lowmem_page_readonly and
pin_pagetable_pfn.
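For reference, xen_alloc_pte_init boils down to roughly this (a sketch
from memory, the exact code may differ):

static void __init xen_alloc_pte_init(struct mm_struct *mm, unsigned long pfn)
{
    /* make the page's lowmem mapping RO first... */
    make_lowmem_page_readonly(__va(PFN_PHYS(pfn)));
    /* ...then ask the hypervisor to pin it as an L1 table */
    pin_pagetable_pfn(MMUEXT_PIN_L1_TABLE, pfn);
}

xen_alloc_pmd_init is analogous for the PMD level.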
alloc_pte/pmd are called right before hooking them into the pagetable;
unfortunately that means that they fail at marking the pagetable pages
RO: make_lowmem_page_readonly uses lookup_address to find the pte
corresponding to a page, however at this point the pagetable pages are
not mapped yet (usually they are not hooked but when they are hooked, the
upper level pagetable page is not hooked), so lookup_address fails.
In order to catch these errors Xen has a parachute: xen_set_pte_init,
the function that takes care of writing a pte to memory and that on Xen
converts pfns to mfns; it also marks pagetable pages RO, trying to
work out when that is appropriate.
This is all very ugly and delicate.
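On x86_64 that heuristic amounts to something like the following
(simplified sketch, not the exact code):

static pte_t __init mask_rw_pte(pte_t *ptep, pte_t pte)
{
    unsigned long pfn = pte_pfn(pte);

    /* if the pfn falls inside the buffer the early allocator hands
     * out pagetable pages from, guess that it is a pagetable page
     * and force the mapping read-only */
    if (pfn >= pgt_buf_start && pfn < pgt_buf_end)
        pte = pte_wrprotect(pte);
    return pte;
}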
I think alloc_pte/pmd were always thought to be used to mark and pin
pagetable pages but they currently fail during the initial pagetable
setup. If we could fix alloc_pte/pmd most of the problems and the hacks
would go away.
Ideally we could remove both mask_rw_pte (currently responsible for
marking pagetable pages RO, called from xen_set_pte_init) and
x86_init.mapping.pagetable_reserve.
More thinking (and caffeine) needed...
> Presumably we don't have
> to do that for the PTEs that are already mapped (as
> xen_setup_kernel_pagetable, and xen_map_identity_early do this
> already).
No, we don't.
We do need to make sure they stay RO on x86_32 where we write the
pagetable pages in two steps and we switch pagetable to swapper_pg_dir.
CC'ing Keir in case he knows something I am missing.
On Mon, 16 May 2011, H. Peter Anvin wrote:
> On 05/16/2011 08:41 AM, Konrad Rzeszutek Wilk wrote:
> >> They become pagetable pages when:
> >>
> >> - they are explicitly pinned by pin_pagetable_pfn
> >>
> >> - they are hooked into the current pagetable
> >
> > Ok, so could we use those two calls to trigger the pagetable walk
> > and mark them RO as appropriate? Which call sites are those? The
> > xen_set_pgd/xen_set_pud/xen_set_pmd? Presumably we don't have
> > to do that for the PTEs that are already mapped (as
> > xen_setup_kernel_pagetable, and xen_map_identity_early do this
> > already).
> >
> >> Like you wrote, considering that the x86_64 version of
> >> kernel_physical_mapping_init hooks the pagetable pages into the
> >> currently used pagetable, it wouldn't be possible to mark the pagetable
> >> pages RO after init_memory_mapping.
> >
>
> Doesn't Xen have some kind of compatibility mode which could be used
> during setup?
Unfortunately, not that I am aware of.
On Tue, May 17, 2011 at 06:50:55PM +0100, Stefano Stabellini wrote:
> On Mon, 16 May 2011, Konrad Rzeszutek Wilk wrote:
> > > They become pagetable pages when:
> > >
> > > - they are explicitly pinned by pin_pagetable_pfn
> > >
> > > - they are hooked into the current pagetable
> >
> > Ok, so could we use those two calls to trigger the pagetable walk
> > and mark them RO as appropriate? Which call sites are those? The
> > xen_set_pgd/xen_set_pud/xen_set_pmd?
>
> xen_alloc_pte_init and xen_alloc_pmd_init are the ones that mark the
> pagetable pages RO and pin them, calling make_lowmem_page_readonly and
> pin_pagetable_pfn.
>
> alloc_pte/pmd are called right before hooking them into the pagetable;
> unfortunately that means that they fail at marking the pagetable pages
> RO: make_lowmem_page_readonly uses lookup_address to find the pte
> corresponding to a page, however at this point the pagetable pages are
> not mapped yet (usually they are not hooked but when they are hooked, the
> upper level pagetable page is not hooked), so lookup_address fails.
Right. We don't have to walk the hooked pagetable, I think. We are passed
in the PMD/PGD of the PFN and we could look at the content of that PFN.
Walk each entry in there and for those that are present, determine
if the page table it points to (whatever level it is) is RO. If not, mark
it RO. And naturally do it recursively to cover all levels.
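Something like this sketch (walk_and_make_ro and mark_page_ro are
made-up names here, and I am glossing over the different entry types
per level):

static void __init walk_and_make_ro(unsigned long pfn, int level)
{
    /* read the frame directly via its pfn, not via lookup_address */
    pte_t *table = __va(PFN_PHYS(pfn));
    int i;

    /* depth-first: deal with the lower-level tables first */
    if (level > 1)
        for (i = 0; i < PTRS_PER_PTE; i++)
            if (pte_present(table[i]))
                walk_and_make_ro(pte_pfn(table[i]), level - 1);

    mark_page_ro(pfn);    /* hypothetical helper */
}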
>
> In order to catch these errors Xen has a parachute: xen_set_pte_init,
> the function that takes care of writing a pte to memory and that on Xen
> converts pfns to mfns; it also marks pagetable pages RO, trying to
> work out when that is appropriate.
>
> This is all very ugly and delicate.
And complex.
>
> I think alloc_pte/pmd were always thought to be used to mark and pin
> pagetable pages but they currently fail during the initial pagetable
> setup. If we could fix alloc_pte/pmd most of the problems and the hacks
> would go away.
Yeey!
> Ideally we could remove both mask_rw_pte (currently responsible for
> marking pagetable pages RO, called from xen_set_pte_init) and
> x86_init.mapping.pagetable_reserve.
>
> More thinking (and caffeine) needed...
>
> > Presumably we don't have
> > to do that for the PTEs that are already mapped (as
> > xen_setup_kernel_pagetable, and xen_map_identity_early do this
> > already).
>
> No, we don't.
> We do need to make sure they stay RO on x86_32 where we write the
> pagetable pages in two steps and we switch pagetable to swapper_pg_dir.
And then vice-versa later on during the bootup.
> >
> > Doesn't Xen have some kind of compatibility mode which could be used
> > during setup?
>
> Unfortunately, not that I am aware of.
Peter, were you thinking of XENFEAT_auto_translated_physmap? That is
where the hypervisor does all of the MMU translations?
> > xen_alloc_pte_init and xen_alloc_pmd_init are the ones that mark the
> > pagetable pages RO and pin them, calling make_lowmem_page_readonly and
> > pin_pagetable_pfn.
> >
> > alloc_pte/pmd are called right before hooking them into the pagetable;
> > unfortunately that means that they fail at marking the pagetable pages
> > RO: make_lowmem_page_readonly uses lookup_address to find the pte
> > corresponding to a page, however at this point the pagetable pages are
> > not mapped yet (usually they are not hooked but when they are hooked, the
> > upper level pagetable page is not hooked), so lookup_address fails.
>
> Right. We don't have to walk the hooked pagetable, I think. We are passed
> in the PMD/PGD of the PFN and we could look at the content of that PFN.
err, got that backwards. "passed in the PFN of the PMD/PGD".
On Tue, 17 May 2011, Konrad Rzeszutek Wilk wrote:
> On Tue, May 17, 2011 at 06:50:55PM +0100, Stefano Stabellini wrote:
> > On Mon, 16 May 2011, Konrad Rzeszutek Wilk wrote:
> > > > They become pagetable pages when:
> > > >
> > > > - they are explicitly pinned by pin_pagetable_pfn
> > > >
> > > > - they are hooked into the current pagetable
> > >
> > > Ok, so could we use those two calls to trigger the pagetable walk
> > > and mark them RO as appropriate? Which call sites are those? The
> > > xen_set_pgd/xen_set_pud/xen_set_pmd?
> >
> > xen_alloc_pte_init and xen_alloc_pmd_init are the ones that mark the
> > pagetable pages RO and pin them, calling make_lowmem_page_readonly and
> > pin_pagetable_pfn.
> >
> > alloc_pte/pmd are called right before hooking them into the pagetable;
> > unfortunately that means that they fail at marking the pagetable pages
> > RO: make_lowmem_page_readonly uses lookup_address to find the pte
> > corresponding to a page, however at this point the pagetable pages are
> > not mapped yet (usually they are not hooked but when they are hooked, the
> > upper level pagetable page is not hooked), so lookup_address fails.
>
> Right. We don't have to walk the hooked pagetable, I think. We are passed
> in the PMD/PGD of the PFN and we could look at the content of that PFN.
> Walk each entry in there and for those that are present, determine
> if the page table it points to (whatever level it is) is RO. If not, mark
> it RO. And naturally do it recursively to cover all levels.
I am not sure what you mean.
The interface is the following:
void alloc_pte(struct mm_struct *mm, unsigned long pfn);
pfn is the pagetable page's pfn, that has to be marked RO in all its mappings;
mm is the mm_struct where this pagetable page is mapped.
Except it is not true anymore because the pagetable page is not mapped
yet in mm; so we cannot walk anything here, unfortunately.
We could remember that we failed to mark this page RO so that the next
time we try to write a pte that contains that address we know we have to
mark it RO.
But this approach is basically equivalent to the one we had before
2.6.39: we consider the range pgt_buf_start-pgt_buf_end a "published"
range of pagetable pages that we have to mark RO.
In fact it is affected by the same problem: after writing the ptes that
map the range pgt_buf_start-pgt_buf_end, if we try to allocate a new
pagetable page incrementing pgt_buf_end we fail because the new page is
already marked RW in a pinned page.
At the same time we cannot modify the pte to change the mapping to RO
because lookup_address doesn't find it (the pagetable page containing
the pte in question is not reachable from init_mm yet).
Unfortunately I cannot see an easy way to fix alloc_pte without making
sure that the pfn passed as an argument is already mapped and the pte is
reachable using lookup_address.
Alternatively we could come up with a new interface that properly
publishes the pgt_buf_start-pgt_buf_top range, but it would still need a
"free" function for the pgt_buf_end-pgt_buf_top range to be called after
the initial mapping is complete.
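Roughly this shape, with made-up names:

/* called before kernel_physical_mapping_init starts allocating:
 * declare the whole candidate range as (future) pagetable pages */
void pagetable_publish(unsigned long start_pfn, unsigned long top_pfn);

/* called once the initial mapping is complete: hand the unused
 * pgt_buf_end-pgt_buf_top tail back to normal RW use */
void pagetable_free_unused(unsigned long end_pfn, unsigned long top_pfn);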
On Mon, May 23, 2011 at 04:20:15PM +0100, Stefano Stabellini wrote:
> On Tue, 17 May 2011, Konrad Rzeszutek Wilk wrote:
> > On Tue, May 17, 2011 at 06:50:55PM +0100, Stefano Stabellini wrote:
> > > On Mon, 16 May 2011, Konrad Rzeszutek Wilk wrote:
> > > > > They become pagetable pages when:
> > > > >
> > > > > - they are explicitly pinned by pin_pagetable_pfn
> > > > >
> > > > > - they are hooked into the current pagetable
> > > >
> > > > Ok, so could we use those two calls to trigger the pagetable walk
> > > > and mark them RO as appropriate? Which call sites are those? The
> > > > xen_set_pgd/xen_set_pud/xen_set_pmd?
> > >
> > > xen_alloc_pte_init and xen_alloc_pmd_init are the ones that mark the
> > > pagetable pages RO and pin them, calling make_lowmem_page_readonly and
> > > pin_pagetable_pfn.
> > >
> > > alloc_pte/pmd are called right before hooking them into the pagetable;
> > > unfortunately that means that they fail at marking the pagetable pages
> > > RO: make_lowmem_page_readonly uses lookup_address to find the pte
> > > corresponding to a page, however at this point the pagetable pages are
> > > not mapped yet (usually they are not hooked but when they are hooked, the
> > > upper level pagetable page is not hooked), so lookup_address fails.
> >
> > Right. We don't have to walk the hooked pagetable, I think. We are passed
> > in the PMD/PGD of the PFN and we could look at the content of that PFN.
> > Walk each entry in there and for those that are present, determine
> > if the page table it points to (whatever level it is) is RO. If not, mark
> > it RO. And naturally do it recursively to cover all levels.
>
> I am not sure what you mean.
> The interface is the following:
>
> void alloc_pte(struct mm_struct *mm, unsigned long pfn);
>
> pfn is the pagetable page's pfn, that has to be marked RO in all its mappings;
> mm is the mm_struct where this pagetable page is mapped.
> Except it is not true anymore because the pagetable page is not mapped
> yet in mm; so we cannot walk anything here, unfortunately.
I was thinking to "resolve" the pfn, and directly read the entries
from that pfn. So not walking the mm_struct, but reading the raw data from the
PFN page... but that would not do much as alloc_pte is done _before_ that
pagetable is actually populated - so it has nothing in it.
> We could remember that we failed to mark this page RO so that the next
> time we try to write a pte that contains that address we know we have to
> mark it RO.
> But this approach is basically equivalent to the one we had before
> 2.6.39: we consider the range pgt_buf_start-pgt_buf_end a "published"
> range of pagetable pages that we have to mark RO.
> In fact it is affected by the same problem: after writing the ptes that
> map the range pgt_buf_start-pgt_buf_end, if we try to allocate a new
> pagetable page incrementing pgt_buf_end we fail because the new page is
> already marked RW in a pinned page.
> At the same time we cannot modify the pte to change the mapping to RO
> because lookup_address doesn't find it (the pagetable page containing
> the pte in question is not reachable from init_mm yet).
So.. why not do the raw walk of the PFN (and within this
"raw walk" ioremap the PFNs, and do a depth-first walk on the page-tables
to set them to RO) when it is being hooked up to the page-table?
Meaning - whatever trigger point is when we try to set a PUD in a PGD,
or PTE into a PMD. And naturally we can't walk the 'init_mm' as it
has not been hooked up yet (and it cannot as the page-tables have not
been set to RO).
>
>
> Unfortunately I cannot see an easy way to fix alloc_pte without making
> sure that the pfn passed as an argument is already mapped and the pte is
> reachable using lookup_address.
Let's ignore that for now.
>
> Alternatively we could come up with a new interface that properly
> publishes the pgt_buf_start-pgt_buf_top range, but it would still need a
> "free" function for the pgt_buf_end-pgt_buf_top range to be called after
> the initial mapping is complete.
On Tue, 24 May 2011, Konrad Rzeszutek Wilk wrote:
> > > Right. We don't have to walk the hooked pagetable, I think. We are passed
> > > in the PMD/PGD of the PFN and we could look at the content of that PFN.
> > > Walk each entry in there and for those that are present, determine
> > > if the page table it points to (whatever level it is) is RO. If not, mark
> > > it RO. And naturally do it recursively to cover all levels.
> >
> > I am not sure what you mean.
> > The interface is the following:
> >
> > void alloc_pte(struct mm_struct *mm, unsigned long pfn);
> >
> > pfn is the pagetable page's pfn, that has to be marked RO in all its mappings;
> > mm is the mm_struct where this pagetable page is mapped.
> > Except it is not true anymore because the pagetable page is not mapped
> > yet in mm; so we cannot walk anything here, unfortunately.
>
> I was thinking to "resolve" the pfn, and directly read the entries
> from that pfn. So not walking the mm_struct, but reading the raw data from the
> PFN page... but that would not do much as alloc_pte is done _before_ that
> pagetable is actually populated - so it has nothing in it.
>
>
> > We could remember that we failed to mark this page RO so that the next
> > time we try to write a pte that contains that address we know we have to
> > mark it RO.
> > But this approach is basically equivalent to the one we had before
> > 2.6.39: we consider the range pgt_buf_start-pgt_buf_end a "published"
> > range of pagetable pages that we have to mark RO.
> > In fact it is affected by the same problem: after writing the ptes that
> > map the range pgt_buf_start-pgt_buf_end, if we try to allocate a new
> > pagetable page incrementing pgt_buf_end we fail because the new page is
> > already marked RW in a pinned page.
> > At the same time we cannot modify the pte to change the mapping to RO
> > because lookup_address doesn't find it (the pagetable page containing
> > the pte in question is not reachable from init_mm yet).
>
> So.. why not do the raw walk of the PFN (and within this
> "raw walk" ioremap the PFNs, and do a depth-first walk on the page-tables
> to set them to RO) when it is being hooked up to the page-table?
> Meaning - whatever trigger point is when we try to set a PUD in a PGD,
> or PTE into a PMD. And naturally we can't walk the 'init_mm' as it
> has not been hooked up yet (and it cannot as the page-tables have not
> been set to RO).
We cannot use the pvop call when we hook the pagetable page into the
upper level because it is right after the alloc_pte call and we still
cannot find its mappings through lookup_address. Keep in mind that it is
the pagetable page mapping that we have to set read-only, not the
content of the pagetable page.
The pagetable page mapping might be written before or after the
pagetable page is being hooked into the pagetable.
BTW I have to say that all the alloc_pte calls are currently wrong
because the struct mm that we are passing is meaningless.
On the other hand if you are suggesting to use the pvop call when we
write pte entries, that is set_pte (xen_set_pte_init), get the pfn from
the pte, figure out if it is a pagetable page, and if it is mark it
read-only, that is exactly what we are already doing.
Unfortunately it has a number of issues:
- once the pte entries are written they cannot easily be changed because
they still cannot be found using lookup_address. This means that once
the pte entries for range pgt_buf_start-pgt_buf_end have been written,
we cannot allocate any new pagetable pages, because we don't have a way
to mark them read-only anymore. This is the issue that brought us to the
introduction of x86_init.mapping.pagetable_reserve.
- we need to distinguish between normal mappings and temporary
mappings; among the temporary mappings we need to distinguish between new
pagetable pages that are not hooked into the pagetable yet and
pagetable pages already hooked into the pagetable. It is not easy and it is
error prone.
I think the approach of catching a pte write and trying to understand
what the pfn exactly is has proven to be too fragile.
We need a simple proper interface, like alloc_pte was supposed to be,
otherwise it will keep breaking.
At the moment I can only see two possible ways of solving this:
1) Have a way to find the pte that maps a pagetable page when the
pagetable page is hooked into the pagetable.
Unfortunately I fail to see a way to do it, given the current way of
allocating the pagetable pages. Maybe somebody with a deeper knowledge
of kernel_physical_mapping_init could suggest a way.
2) Have a simple way to know if a pfn is a pagetable page.
If we had a function like pfn_is_pagetable_page(unsigned long pfn) we
could remove the hacks we have in xen_set_pte and mark read-only
the entry whenever needed. This implies that we need a way to know all
the pfns of the pagetable pages in advance.
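Something like this hypothetical helper, assuming all the pagetable
pfns come from a single known range:

static bool pfn_is_pagetable_page(unsigned long pfn)
{
    return pfn >= pgt_buf_start && pfn < pgt_buf_top;
}

Then xen_set_pte could simply do:

    if (pfn_is_pagetable_page(pte_pfn(pte)))
        pte = pte_wrprotect(pte);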