2012-05-28 07:18:53

by Xiao Guangrong

Subject: [PATCH] KVM: MMU: fix huge page adapted on non-PAE host

The huge page size is 4M on a non-PAE host, but a 2M page size is used in
transparent_hugepage_adjust(), so the page we get after adjusting the
mapping level is not the head page, and the BUG_ON() will be triggered.

Signed-off-by: Xiao Guangrong <[email protected]>
---
arch/x86/kvm/mmu.c | 3 +--
1 files changed, 1 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 72102e0..be3cea4 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -2595,8 +2595,7 @@ static void transparent_hugepage_adjust(struct kvm_vcpu *vcpu,
*gfnp = gfn;
kvm_release_pfn_clean(pfn);
pfn &= ~mask;
- if (!get_page_unless_zero(pfn_to_page(pfn)))
- BUG();
+ kvm_get_pfn(pfn);
*pfnp = pfn;
}
}
--
1.7.7.6


2012-05-28 10:57:30

by Avi Kivity

Subject: Re: [PATCH] KVM: MMU: fix huge page adapted on non-PAE host

On 05/28/2012 09:10 AM, Xiao Guangrong wrote:
> The huge page size is 4M on non-PAE host, but 2M page size is used in
> transparent_hugepage_adjust(), so the page we get after adjust the
> mapping level is not the head page, the BUG_ON() will be triggered
>
>
> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> index 72102e0..be3cea4 100644
> --- a/arch/x86/kvm/mmu.c
> +++ b/arch/x86/kvm/mmu.c
> @@ -2595,8 +2595,7 @@ static void transparent_hugepage_adjust(struct kvm_vcpu *vcpu,
> *gfnp = gfn;
> kvm_release_pfn_clean(pfn);
> pfn &= ~mask;
> - if (!get_page_unless_zero(pfn_to_page(pfn)))
> - BUG();
> + kvm_get_pfn(pfn);
> *pfnp = pfn;
> }
> }

Shouldn't we adjust mask instead?

--
error compiling committee.c: too many arguments to function

2012-05-28 12:24:43

by Avi Kivity

Subject: Re: [PATCH] KVM: MMU: fix huge page adapted on non-PAE host

On 05/28/2012 02:39 PM, Xiao Guangrong wrote:
> On 05/28/2012 06:57 PM, Avi Kivity wrote:
>
>> On 05/28/2012 09:10 AM, Xiao Guangrong wrote:
>>> The huge page size is 4M on non-PAE host, but 2M page size is used in
>>> transparent_hugepage_adjust(), so the page we get after adjust the
>>> mapping level is not the head page, the BUG_ON() will be triggered
>>>
>>>
>>> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
>>> index 72102e0..be3cea4 100644
>>> --- a/arch/x86/kvm/mmu.c
>>> +++ b/arch/x86/kvm/mmu.c
>>> @@ -2595,8 +2595,7 @@ static void transparent_hugepage_adjust(struct kvm_vcpu *vcpu,
>>> *gfnp = gfn;
>>> kvm_release_pfn_clean(pfn);
>>> pfn &= ~mask;
>>> - if (!get_page_unless_zero(pfn_to_page(pfn)))
>>> - BUG();
>>> + kvm_get_pfn(pfn);
>>> *pfnp = pfn;
>>> }
>>> }
>>
>> Shouldn't we adjust mask instead?
>>
>
>
> Adjusting mask to map the whole 4M huge page to KVM guest?

The code moves the refcount from the small page to the huge page. i.e.
from pfn 0x1312 to pfn 0x1200. But if the huge page frame contains
0x400 pages, it should move the refcount to pfn 0x1000.

> But it seems 4M page size is not supported on VMX/SVM.

We always use 64-bit PTEs in the lowest level, whether using shadow,
EPT, or NPT. Note NPT supports 32-bit PTEs in the lowest level, but we
don't support that configuration. But that doesn't mean we can't use
host 4M pages to back guest 2M pages (or direct maps).

--
error compiling committee.c: too many arguments to function

2012-05-28 13:00:40

by Xiao Guangrong

Subject: Re: [PATCH] KVM: MMU: fix huge page adapted on non-PAE host

On 05/28/2012 08:24 PM, Avi Kivity wrote:

> On 05/28/2012 02:39 PM, Xiao Guangrong wrote:
>> On 05/28/2012 06:57 PM, Avi Kivity wrote:
>>
>>> On 05/28/2012 09:10 AM, Xiao Guangrong wrote:
>>>> The huge page size is 4M on non-PAE host, but 2M page size is used in
>>>> transparent_hugepage_adjust(), so the page we get after adjust the
>>>> mapping level is not the head page, the BUG_ON() will be triggered
>>>>
>>>>
>>>> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
>>>> index 72102e0..be3cea4 100644
>>>> --- a/arch/x86/kvm/mmu.c
>>>> +++ b/arch/x86/kvm/mmu.c
>>>> @@ -2595,8 +2595,7 @@ static void transparent_hugepage_adjust(struct kvm_vcpu *vcpu,
>>>> *gfnp = gfn;
>>>> kvm_release_pfn_clean(pfn);
>>>> pfn &= ~mask;
>>>> - if (!get_page_unless_zero(pfn_to_page(pfn)))
>>>> - BUG();
>>>> + kvm_get_pfn(pfn);
>>>> *pfnp = pfn;
>>>> }
>>>> }
>>>
>>> Shouldn't we adjust mask instead?
>>>
>>
>>
>> Adjusting mask to map the whole 4M huge page to KVM guest?
>
> The code moves the refcount from the small page to the huge page. i.e.
> from pfn 0x1312 to pfn 0x1200. But if the huge page frame contains
> 0x400 pages, it should move the refcount to pfn 0x1000.
>


We need not move the refcount to the huge page (the head page); moving
the refcount to any middle small page is also ok, since get_page() will
properly handle it:

get_page() -> __get_page_tail():

| struct page *page_head = compound_trans_head(page);
|
| if (likely(page != page_head && get_page_unless_zero(page_head))) {
| /*
| * page_head wasn't a dangling pointer but it
| * may not be a head page anymore by the time
| * we obtain the lock. That is ok as long as it
| * can't be freed from under us.
| */
| flags = compound_lock_irqsave(page_head);
| /* here __split_huge_page_refcount won't run anymore */
| if (likely(PageTail(page))) {
| __get_page_tail_foll(page, false);
| got = true;
| }
| compound_unlock_irqrestore(page_head, flags);
| if (unlikely(!got))
| put_page(page_head);
| }

The refcount of page_head is increased.

>> But it seems 4M page size is not supported on VMX/SVM.
>
> We always use 64-bit PTEs in the lowest level, whether using shadow,
> EPT, or NPT. Note NPT supports 32-bit PTEs in the lowest level, but we
> don't support that configuration. But that doesn't mean we can't use
> host 4M pages to back guest 2M pages (or direct maps).
>


Right.

2012-05-28 13:14:07

by Avi Kivity

Subject: Re: [PATCH] KVM: MMU: fix huge page adapted on non-PAE host

On 05/28/2012 03:56 PM, Xiao Guangrong wrote:
> On 05/28/2012 08:24 PM, Avi Kivity wrote:
>
>> On 05/28/2012 02:39 PM, Xiao Guangrong wrote:
>>> On 05/28/2012 06:57 PM, Avi Kivity wrote:
>>>
>>>> On 05/28/2012 09:10 AM, Xiao Guangrong wrote:
>>>>> The huge page size is 4M on non-PAE host, but 2M page size is used in
>>>>> transparent_hugepage_adjust(), so the page we get after adjust the
>>>>> mapping level is not the head page, the BUG_ON() will be triggered
>>>>>
>>>>>
>>>>> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
>>>>> index 72102e0..be3cea4 100644
>>>>> --- a/arch/x86/kvm/mmu.c
>>>>> +++ b/arch/x86/kvm/mmu.c
>>>>> @@ -2595,8 +2595,7 @@ static void transparent_hugepage_adjust(struct kvm_vcpu *vcpu,
>>>>> *gfnp = gfn;
>>>>> kvm_release_pfn_clean(pfn);
>>>>> pfn &= ~mask;
>>>>> - if (!get_page_unless_zero(pfn_to_page(pfn)))
>>>>> - BUG();
>>>>> + kvm_get_pfn(pfn);
>>>>> *pfnp = pfn;
>>>>> }
>>>>> }
>>>>
>>>> Shouldn't we adjust mask instead?
>>>>
>>>
>>>
>>> Adjusting mask to map the whole 4M huge page to KVM guest?
>>
>> The code moves the refcount from the small page to the huge page. i.e.
>> from pfn 0x1312 to pfn 0x1200. But if the huge page frame contains
>> 0x400 pages, it should move the refcount to pfn 0x1000.
>>
>
>
> We need not move the refcount to the huge page (the head of pages), moving
> the refcount to any middle small page is also ok, get_page() will
> properly handle it:
>
> get_page() -> __get_page_tail():
>
> | struct page *page_head = compound_trans_head(page);
> |
> | if (likely(page != page_head && get_page_unless_zero(page_head))) {
> | /*
> | * page_head wasn't a dangling pointer but it
> | * may not be a head page anymore by the time
> | * we obtain the lock. That is ok as long as it
> | * can't be freed from under us.
> | */
> | flags = compound_lock_irqsave(page_head);
> | /* here __split_huge_page_refcount won't run anymore */
> | if (likely(PageTail(page))) {
> | __get_page_tail_foll(page, false);
> | got = true;
> | }
> | compound_unlock_irqrestore(page_head, flags);
> | if (unlikely(!got))
> | put_page(page_head);
> | }
>
> The refcount of page_head is increased.
>

So, the whole thing is unneeded? Andrea?


--
error compiling committee.c: too many arguments to function

2012-05-28 13:24:12

by Xiao Guangrong

Subject: Re: [PATCH] KVM: MMU: fix huge page adapted on non-PAE host

On 05/28/2012 06:57 PM, Avi Kivity wrote:

> On 05/28/2012 09:10 AM, Xiao Guangrong wrote:
>> The huge page size is 4M on non-PAE host, but 2M page size is used in
>> transparent_hugepage_adjust(), so the page we get after adjust the
>> mapping level is not the head page, the BUG_ON() will be triggered
>>
>>
>> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
>> index 72102e0..be3cea4 100644
>> --- a/arch/x86/kvm/mmu.c
>> +++ b/arch/x86/kvm/mmu.c
>> @@ -2595,8 +2595,7 @@ static void transparent_hugepage_adjust(struct kvm_vcpu *vcpu,
>> *gfnp = gfn;
>> kvm_release_pfn_clean(pfn);
>> pfn &= ~mask;
>> - if (!get_page_unless_zero(pfn_to_page(pfn)))
>> - BUG();
>> + kvm_get_pfn(pfn);
>> *pfnp = pfn;
>> }
>> }
>
> Shouldn't we adjust mask instead?
>


Adjusting mask to map the whole 4M huge page to KVM guest?
But it seems 4M page size is not supported on VMX/SVM.

2012-05-28 13:41:34

by Xiao Guangrong

Subject: Re: [PATCH] KVM: MMU: fix huge page adapted on non-PAE host

On 05/28/2012 09:14 PM, Avi Kivity wrote:

> On 05/28/2012 03:56 PM, Xiao Guangrong wrote:
>> On 05/28/2012 08:24 PM, Avi Kivity wrote:
>>
>>> On 05/28/2012 02:39 PM, Xiao Guangrong wrote:
>>>> On 05/28/2012 06:57 PM, Avi Kivity wrote:
>>>>
>>>>> On 05/28/2012 09:10 AM, Xiao Guangrong wrote:
>>>>>> The huge page size is 4M on non-PAE host, but 2M page size is used in
>>>>>> transparent_hugepage_adjust(), so the page we get after adjust the
>>>>>> mapping level is not the head page, the BUG_ON() will be triggered
>>>>>>
>>>>>>
>>>>>> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
>>>>>> index 72102e0..be3cea4 100644
>>>>>> --- a/arch/x86/kvm/mmu.c
>>>>>> +++ b/arch/x86/kvm/mmu.c
>>>>>> @@ -2595,8 +2595,7 @@ static void transparent_hugepage_adjust(struct kvm_vcpu *vcpu,
>>>>>> *gfnp = gfn;
>>>>>> kvm_release_pfn_clean(pfn);
>>>>>> pfn &= ~mask;
>>>>>> - if (!get_page_unless_zero(pfn_to_page(pfn)))
>>>>>> - BUG();
>>>>>> + kvm_get_pfn(pfn);
>>>>>> *pfnp = pfn;
>>>>>> }
>>>>>> }
>>>>>
>>>>> Shouldn't we adjust mask instead?
>>>>>
>>>>
>>>>
>>>> Adjusting mask to map the whole 4M huge page to KVM guest?
>>>
>>> The code moves the refcount from the small page to the huge page. i.e.
>>> from pfn 0x1312 to pfn 0x1200. But if the huge page frame contains
>>> 0x400 pages, it should move the refcount to pfn 0x1000.
>>>
>>
>>
>> We need not move the refcount to the huge page (the head of pages), moving
>> the refcount to any middle small page is also ok, get_page() will
>> properly handle it:
>>
>> get_page() -> __get_page_tail():
>>
>> | struct page *page_head = compound_trans_head(page);
>> |
>> | if (likely(page != page_head && get_page_unless_zero(page_head))) {
>> | /*
>> | * page_head wasn't a dangling pointer but it
>> | * may not be a head page anymore by the time
>> | * we obtain the lock. That is ok as long as it
>> | * can't be freed from under us.
>> | */
>> | flags = compound_lock_irqsave(page_head);
>> | /* here __split_huge_page_refcount won't run anymore */
>> | if (likely(PageTail(page))) {
>> | __get_page_tail_foll(page, false);
>> | got = true;
>> | }
>> | compound_unlock_irqrestore(page_head, flags);
>> | if (unlikely(!got))
>> | put_page(page_head);
>> | }
>>
>> The refcount of page_head is increased.
>>
>
> So, the whole thing is unneeded? Andrea?
>


I think the reason we move the refcount in the current code is that we should increase the
refcount of the page we will map into the shadow page table, since we always
decrease its refcount after it is mapped. (That is what this patch does.)

2012-05-28 13:53:44

by Avi Kivity

Subject: Re: [PATCH] KVM: MMU: fix huge page adapted on non-PAE host

On 05/28/2012 04:41 PM, Xiao Guangrong wrote:
> On 05/28/2012 09:14 PM, Avi Kivity wrote:
>
>> On 05/28/2012 03:56 PM, Xiao Guangrong wrote:
>>> On 05/28/2012 08:24 PM, Avi Kivity wrote:
>>>
>>>> On 05/28/2012 02:39 PM, Xiao Guangrong wrote:
>>>>> On 05/28/2012 06:57 PM, Avi Kivity wrote:
>>>>>
>>>>>> On 05/28/2012 09:10 AM, Xiao Guangrong wrote:
>>>>>>> The huge page size is 4M on non-PAE host, but 2M page size is used in
>>>>>>> transparent_hugepage_adjust(), so the page we get after adjust the
>>>>>>> mapping level is not the head page, the BUG_ON() will be triggered
>>>>>>>
>>>>>>>
>>>>>>> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
>>>>>>> index 72102e0..be3cea4 100644
>>>>>>> --- a/arch/x86/kvm/mmu.c
>>>>>>> +++ b/arch/x86/kvm/mmu.c
>>>>>>> @@ -2595,8 +2595,7 @@ static void transparent_hugepage_adjust(struct kvm_vcpu *vcpu,
>>>>>>> *gfnp = gfn;
>>>>>>> kvm_release_pfn_clean(pfn);
>>>>>>> pfn &= ~mask;
>>>>>>> - if (!get_page_unless_zero(pfn_to_page(pfn)))
>>>>>>> - BUG();
>>>>>>> + kvm_get_pfn(pfn);
>>>>>>> *pfnp = pfn;
>>>>>>> }
>>>>>>> }
>>>>>>
>>>>>> Shouldn't we adjust mask instead?
>>>>>>
>>>>>
>>>>>
>>>>> Adjusting mask to map the whole 4M huge page to KVM guest?
>>>>
>>>> The code moves the refcount from the small page to the huge page. i.e.
>>>> from pfn 0x1312 to pfn 0x1200. But if the huge page frame contains
>>>> 0x400 pages, it should move the refcount to pfn 0x1000.
>>>>
>>>
>>>
>>> We need not move the refcount to the huge page (the head of pages), moving
>>> the refcount to any middle small page is also ok, get_page() will
>>> properly handle it:
>>>
>>> get_page() -> __get_page_tail():
>>>
>>> | struct page *page_head = compound_trans_head(page);
>>> |
>>> | if (likely(page != page_head && get_page_unless_zero(page_head))) {
>>> | /*
>>> | * page_head wasn't a dangling pointer but it
>>> | * may not be a head page anymore by the time
>>> | * we obtain the lock. That is ok as long as it
>>> | * can't be freed from under us.
>>> | */
>>> | flags = compound_lock_irqsave(page_head);
>>> | /* here __split_huge_page_refcount won't run anymore */
>>> | if (likely(PageTail(page))) {
>>> | __get_page_tail_foll(page, false);
>>> | got = true;
>>> | }
>>> | compound_unlock_irqrestore(page_head, flags);
>>> | if (unlikely(!got))
>>> | put_page(page_head);
>>> | }
>>>
>>> The refcount of page_head is increased.
>>>
>>
>> So, the whole thing is unneeded? Andrea?
>>
>
>
> I think the reason we move refcount in current code is, we should increase the
> refcount of the page we will mapped into shadow page table, since we always
> decrease its refcount after it is mapped. (That is what this patch does.)
>


As far as I can tell __get_user_pages_fast() will take the reference
count in the page head in the first place.

--
error compiling committee.c: too many arguments to function

2012-05-28 14:05:59

by Xiao Guangrong

Subject: Re: [PATCH] KVM: MMU: fix huge page adapted on non-PAE host

On 05/28/2012 09:53 PM, Avi Kivity wrote:

> On 05/28/2012 04:41 PM, Xiao Guangrong wrote:
>> On 05/28/2012 09:14 PM, Avi Kivity wrote:
>>
>>> On 05/28/2012 03:56 PM, Xiao Guangrong wrote:
>>>> On 05/28/2012 08:24 PM, Avi Kivity wrote:
>>>>
>>>>> On 05/28/2012 02:39 PM, Xiao Guangrong wrote:
>>>>>> On 05/28/2012 06:57 PM, Avi Kivity wrote:
>>>>>>
>>>>>>> On 05/28/2012 09:10 AM, Xiao Guangrong wrote:
>>>>>>>> The huge page size is 4M on non-PAE host, but 2M page size is used in
>>>>>>>> transparent_hugepage_adjust(), so the page we get after adjust the
>>>>>>>> mapping level is not the head page, the BUG_ON() will be triggered
>>>>>>>>
>>>>>>>>
>>>>>>>> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
>>>>>>>> index 72102e0..be3cea4 100644
>>>>>>>> --- a/arch/x86/kvm/mmu.c
>>>>>>>> +++ b/arch/x86/kvm/mmu.c
>>>>>>>> @@ -2595,8 +2595,7 @@ static void transparent_hugepage_adjust(struct kvm_vcpu *vcpu,
>>>>>>>> *gfnp = gfn;
>>>>>>>> kvm_release_pfn_clean(pfn);
>>>>>>>> pfn &= ~mask;
>>>>>>>> - if (!get_page_unless_zero(pfn_to_page(pfn)))
>>>>>>>> - BUG();
>>>>>>>> + kvm_get_pfn(pfn);
>>>>>>>> *pfnp = pfn;
>>>>>>>> }
>>>>>>>> }
>>>>>>>
>>>>>>> Shouldn't we adjust mask instead?
>>>>>>>
>>>>>>
>>>>>>
>>>>>> Adjusting mask to map the whole 4M huge page to KVM guest?
>>>>>
>>>>> The code moves the refcount from the small page to the huge page. i.e.
>>>>> from pfn 0x1312 to pfn 0x1200. But if the huge page frame contains
>>>>> 0x400 pages, it should move the refcount to pfn 0x1000.
>>>>>
>>>>
>>>>
>>>> We need not move the refcount to the huge page (the head of pages), moving
>>>> the refcount to any middle small page is also ok, get_page() will
>>>> properly handle it:
>>>>
>>>> get_page() -> __get_page_tail():
>>>>
>>>> | struct page *page_head = compound_trans_head(page);
>>>> |
>>>> | if (likely(page != page_head && get_page_unless_zero(page_head))) {
>>>> | /*
>>>> | * page_head wasn't a dangling pointer but it
>>>> | * may not be a head page anymore by the time
>>>> | * we obtain the lock. That is ok as long as it
>>>> | * can't be freed from under us.
>>>> | */
>>>> | flags = compound_lock_irqsave(page_head);
>>>> | /* here __split_huge_page_refcount won't run anymore */
>>>> | if (likely(PageTail(page))) {
>>>> | __get_page_tail_foll(page, false);
>>>> | got = true;
>>>> | }
>>>> | compound_unlock_irqrestore(page_head, flags);
>>>> | if (unlikely(!got))
>>>> | put_page(page_head);
>>>> | }
>>>>
>>>> The refcount of page_head is increased.
>>>>
>>>
>>> So, the whole thing is unneeded? Andrea?
>>>
>>
>>
>> I think the reason we move refcount in current code is, we should increase the
>> refcount of the page we will mapped into shadow page table, since we always
>> decrease its refcount after it is mapped. (That is what this patch does.)
>>
>
>
> As far as I can tell __get_user_pages_fast() will take the reference
> count in the page head in the first place.


IIUC, the refcounting for a compound page works like this:

get_user_pages / get_page(page):
    head_page = page->first_page;
    if (page is not the head page)
        page->_mapcount++;
    head_page->_count++;

put_page(page):
    head_page = page->first_page;
    if (page is not the head page)
        page->_mapcount--;
    head_page->_count--;

2012-05-28 14:20:33

by Avi Kivity

Subject: Re: [PATCH] KVM: MMU: fix huge page adapted on non-PAE host

On 05/28/2012 05:05 PM, Xiao Guangrong wrote:
>>>
>>>
>>> I think the reason we move refcount in current code is, we should increase the
>>> refcount of the page we will mapped into shadow page table, since we always
>>> decrease its refcount after it is mapped. (That is what this patch does.)
>>>
>>
>>
>> As far as I can tell __get_user_pages_fast() will take the reference
>> count in the page head in the first place.
>
>
> IIUC, the refcount used in the Compound Page is like this:
>
> get_user_pages / get_page(page):
> head_page = page->first_page;
> if (page is not the head page)
> page->__mapcount++
> head_page->_count++
>
>
> put_page(page):
> head_page = page->first_page;
> if (page is not the head page)
> page->__mapcount--
> head_page->_count--
>

Aha.

The "right thing" we should be doing is running get_page() on every
small page within the frame (we asked for a small page but are
opportunistically using the pages around it, without a proper ref).
That's a bit slow though, so we cheat.

Maybe we should do it anyway. Large page maps/unmaps should be rare.

But I guess we can start with your fix. But what about shifting mask by
one bit? Isn't it sufficient?

- mask = KVM_PAGES_PER_HPAGE(level) - 1;
+ mask = KVM_PAGES_PER_HPAGE(level);
+ mask *= KVM_HOST_HPAGES_PER_HPAGE;
+ mask -= 1;

This should move the reference to the right place.

--
error compiling committee.c: too many arguments to function

2012-05-28 14:32:25

by Andrea Arcangeli

Subject: Re: [PATCH] KVM: MMU: fix huge page adapted on non-PAE host

Hi,

On Mon, May 28, 2012 at 04:53:38PM +0300, Avi Kivity wrote:
> As far as I can tell __get_user_pages_fast() will take the reference
> count in the page head in the first place.

mask = KVM_PAGES_PER_HPAGE(level) - 1;

The BUG would trigger if the above KVM mask covers 2M (that is the NPT/EPT
pmd size) but the hugepage size in the host is 4M (non-PAE 32-bit).

The refcount is taken only in the head page for heads, and in both for
tails.

Because we have the mmu notifier, we never keep the pages mapped by sptes
refcounted; we drop them all. So all we need to do is move the
refcount to the same exact pfn that is then freed by mmu_set_spte
(kvm_release_pfn_clean at the end).

The adjustment is not done for the refcounting. The issue here is that we
want to adjust the "pfn" passed to mmu_set_spte, and in turn we have to
move the refcounting too, because kvm_release_pfn_clean will run
on that "pfn" (not on the pfn returned by gup-fast anymore).

So it looks fine to just do get_page, and the patch looks correct (not
sure if the mmio check is needed or if we can just do
get_page), as long as the "pfn" that is returned through the &pfn parameter
and then passed to mmu_set_sptes is the same one where we do get_page.

The reason it was a get_page_unless_zero() is that it wanted to check
that there was no THP split and the head page was still there. The problem
is that with a 4M host page size and a 2M NPT/EPT pmd size, we need to
get_page a tail page half of the time, and get_page_unless_zero()
won't be a correct refcount for tail pages, not equivalent to a full
get_page.

Overall, the most important thing is that the pfn returned is the
correct one that matches the alignment of the NPT/EPT hugepmd size;
the refcounting just closely follows that aligned "pfn".

2012-05-28 14:40:19

by Avi Kivity

Subject: Re: [PATCH] KVM: MMU: fix huge page adapted on non-PAE host

On 05/28/2012 05:32 PM, Andrea Arcangeli wrote:
> Hi,
>
> On Mon, May 28, 2012 at 04:53:38PM +0300, Avi Kivity wrote:
>> As far as I can tell __get_user_pages_fast() will take the reference
>> count in the page head in the first place.
>
> mask = KVM_PAGES_PER_HPAGE(level) - 1;
>
> The BUG would trigger if the above KVM mask is 2M (that is the NPT/EPT
> pmd size), but the hugepage size in the host is 4M (noPAE 32bit).
>
> The refcount is taken only in the head page for heads, and in both for
> tails.
>
> Because we've mmu notifier, we never keep the pages mapped by sptes
> refcounted, we drop them all. So all we need to do is just to move the
> refcount on the same exact pfn that is then freed by mmu_set_spte
> (kvm_release_pfn_clean at the end).
>
> The adjustement is not done for the refcounting, the issue here is, we
> want to adjust the "pfn" passed to mmu_set_spte, and in turn we've to
> move the refcounting too, because the kvm_release_pfn_clean will run
> on that "pfn" (not on the pfn returned by gup-fast anymore).
>
> So it looks fine to just do get_page and the patch looks correct (not
> sure if the mmio the mmio check is needed or if we can just do
> get_page) as long as the "pfn" that is returned through &pfn parameter
> and then passssed to mmu_set_sptes is the same one were we do get_page.
>
> The reason it was a get_page_unless_zero() is that it wanted to check
> that there was no THP split and the head page was still there. Problem
> is that with a 4M host page size and 2M NTP/EPT pmd size, we need to
> get_page a tail page half of the time, and get_page_unless_zero()
> won't be a correct refcount for tail pages, not equivalent to a full
> get_page.
>
> Overall the most important thing is that the pfn returned is the
> correct one that matches the alignment of the NPT/EPT hugepmd size,
> the refcounting just closely follows that aligned "pfn".

Yes, I see it now. Adjusting mask is incorrect since we won't have the
same adjustment on release. I'll apply the patch for 3.5.

--
error compiling committee.c: too many arguments to function

2012-05-28 14:42:10

by Andrea Arcangeli

Subject: Re: [PATCH] KVM: MMU: fix huge page adapted on non-PAE host

On Mon, May 28, 2012 at 05:20:02PM +0300, Avi Kivity wrote:
> The "right thing" we should be doing is running get_page() on every
> small page within the frame (we asked for a small page but are
> opportunistically using the pages around it, without a proper ref).
> That's a bit slow though, so we cheat.

The problem is that we're aligning the pfn. The refcount move just
follows the pfn alignment. The spte setting, I think, needs the
correctly aligned pfn that matches the hugepmd NPT/EPT alignment.

So then we're moving the pfn refcount too; otherwise
kvm_release_pfn_clean would run on a different pfn than the one
that was returned by gup-fast.

If we would drop the refcount before calling __direct_map or use a
gup-fast that doesn't even take a pin, we wouldn't need to move the
refcount and we could only free the page (without having to do a
get_page).

> But I guess we can start with your fix. But what about shifting mask by
> one bit? Isn't it sufficient?
>
> - mask = KVM_PAGES_PER_HPAGE(level) - 1;
> + mask = KVM_PAGES_PER_HPAGE(level);
> + mask *= KVM_HOST_HPAGES_PER_HPAGE;
> + mask -= 1;
>
> This should move the reference to the right place.

The pfn passed to mmu_set_spte would then be aligned at 4M
even though the NPT/EPT size is 2M, so I doubt it would be ok. The real
thing to check here is that the pfn passed is correct. The refcount
just follows.

Just doing s/get_page_unless_zero()/get_page()/ should work, I
think. And the good thing is there's no chance of getting this wrong in
testing; it either boots or it doesn't.

2012-05-28 14:44:45

by Andrea Arcangeli

Subject: Re: [PATCH] KVM: MMU: fix huge page adapted on non-PAE host

On Mon, May 28, 2012 at 05:40:08PM +0300, Avi Kivity wrote:
> Yes, I see it now. Adjusting mask is incorrect since we won't have the
> same adjustment on release. I'll apply the patch for 3.5.

Sounds great to me. One thing I'm not sure about is the real need for
the mmio check vs a straight get_page (we shouldn't ever get a gup-fast
success if this was an mmio region?), but I don't see how it
could hurt either.

Thanks!
Andrea

2012-05-29 14:23:40

by Avi Kivity

Subject: Re: [PATCH] KVM: MMU: fix huge page adapted on non-PAE host

On 05/28/2012 05:44 PM, Andrea Arcangeli wrote:
> On Mon, May 28, 2012 at 05:40:08PM +0300, Avi Kivity wrote:
>> Yes, I see it now. Adjusting mask is incorrect since we won't have the
>> same adjustment on release. I'll apply the patch for 3.5.
>
> Sounds great to me. One thing I'm not sure about is about the real need of
> the mmio check vs a stright get_page (we shouldn't ever get to a gup-fast
> succeeding if this was a mmio region?) but I don't see chances that it
> could hurt either.

This is a device assignment mmio region. gup_fast() can succeed if we
happen to have a struct page for it (PCI hole) or fail if we don't
(64-bit BAR). See hva_to_pfn() (and indeed this is the reason we use
pfns instead of struct pages).

--
error compiling committee.c: too many arguments to function