2013-08-30 12:41:49

by Paolo Bonzini

[permalink] [raw]
Subject: [PATCH] KVM: mmu: allow page tables to be in read-only slots

Page tables in a read-only memory slot will currently cause a triple
fault because the page walker uses gfn_to_hva and it fails on such a slot.

OVMF uses such a page table; however, real hardware seems to be fine with
that as long as the accessed/dirty bits are set. Save whether the slot
is readonly, and later check it when updating the accessed and dirty bits.

Cc: [email protected]
Cc: [email protected]
Cc: Xiao Guangrong <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
---
CCing to stable@ since the regression was introduced with
support for readonly memory slots.

arch/x86/kvm/paging_tmpl.h | 7 ++++++-
include/linux/kvm_host.h | 1 +
virt/kvm/kvm_main.c | 14 +++++++++-----
3 files changed, 16 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
index 0433301..dadc5c0 100644
--- a/arch/x86/kvm/paging_tmpl.h
+++ b/arch/x86/kvm/paging_tmpl.h
@@ -99,6 +99,7 @@ struct guest_walker {
pt_element_t prefetch_ptes[PTE_PREFETCH_NUM];
gpa_t pte_gpa[PT_MAX_FULL_LEVELS];
pt_element_t __user *ptep_user[PT_MAX_FULL_LEVELS];
+ bool pte_writable[PT_MAX_FULL_LEVELS];
unsigned pt_access;
unsigned pte_access;
gfn_t gfn;
@@ -235,6 +236,9 @@ static int FNAME(update_accessed_dirty_bits)(struct kvm_vcpu *vcpu,
if (pte == orig_pte)
continue;

+ if (unlikely(!walker->pte_writable[level - 1]))
+ return -EACCES;
+
ret = FNAME(cmpxchg_gpte)(vcpu, mmu, ptep_user, index, orig_pte, pte);
if (ret)
return ret;
@@ -309,7 +313,8 @@ retry_walk:
goto error;
real_gfn = gpa_to_gfn(real_gfn);

- host_addr = gfn_to_hva(vcpu->kvm, real_gfn);
+ host_addr = gfn_to_hva_read(vcpu->kvm, real_gfn,
+ &walker->pte_writable[walker->level - 1]);
if (unlikely(kvm_is_error_hva(host_addr)))
goto error;

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index ca645a0..22f9cdf 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -533,6 +533,7 @@ int gfn_to_page_many_atomic(struct kvm *kvm, gfn_t gfn, struct page **pages,

struct page *gfn_to_page(struct kvm *kvm, gfn_t gfn);
unsigned long gfn_to_hva(struct kvm *kvm, gfn_t gfn);
+unsigned long gfn_to_hva_read(struct kvm *kvm, gfn_t gfn, bool *writable);
unsigned long gfn_to_hva_memslot(struct kvm_memory_slot *slot, gfn_t gfn);
void kvm_release_page_clean(struct page *page);
void kvm_release_page_dirty(struct page *page);
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index f7e4334..418d037 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1078,11 +1078,15 @@ unsigned long gfn_to_hva(struct kvm *kvm, gfn_t gfn)
EXPORT_SYMBOL_GPL(gfn_to_hva);

/*
- * The hva returned by this function is only allowed to be read.
- * It should pair with kvm_read_hva() or kvm_read_hva_atomic().
+ * If writable is set to false, the hva returned by this function is only
+ * allowed to be read.
*/
-static unsigned long gfn_to_hva_read(struct kvm *kvm, gfn_t gfn)
+unsigned long gfn_to_hva_read(struct kvm *kvm, gfn_t gfn, bool *writable)
{
+ struct kvm_memory_slot *slot = gfn_to_memslot(kvm, gfn);
+ if (writable)
+ *writable = !memslot_is_readonly(slot);
+
return __gfn_to_hva_many(gfn_to_memslot(kvm, gfn), gfn, NULL, false);
}

@@ -1450,7 +1454,7 @@ int kvm_read_guest_page(struct kvm *kvm, gfn_t gfn, void *data, int offset,
int r;
unsigned long addr;

- addr = gfn_to_hva_read(kvm, gfn);
+ addr = gfn_to_hva_read(kvm, gfn, NULL);
if (kvm_is_error_hva(addr))
return -EFAULT;
r = kvm_read_hva(data, (void __user *)addr + offset, len);
@@ -1488,7 +1492,7 @@ int kvm_read_guest_atomic(struct kvm *kvm, gpa_t gpa, void *data,
gfn_t gfn = gpa >> PAGE_SHIFT;
int offset = offset_in_page(gpa);

- addr = gfn_to_hva_read(kvm, gfn);
+ addr = gfn_to_hva_read(kvm, gfn, NULL);
if (kvm_is_error_hva(addr))
return -EFAULT;
pagefault_disable();
--
1.8.3.1


2013-09-01 09:18:00

by Gleb Natapov

[permalink] [raw]
Subject: Re: [PATCH] KVM: mmu: allow page tables to be in read-only slots

On Fri, Aug 30, 2013 at 02:41:37PM +0200, Paolo Bonzini wrote:
> Page tables in a read-only memory slot will currently cause a triple
> fault because the page walker uses gfn_to_hva and it fails on such a slot.
>
> OVMF uses such a page table; however, real hardware seems to be fine with
> that as long as the accessed/dirty bits are set. Save whether the slot
> is readonly, and later check it when updating the accessed and dirty bits.
>
The fix looks OK to me, but some comment below.

> Cc: [email protected]
> Cc: [email protected]
> Cc: Xiao Guangrong <[email protected]>
> Signed-off-by: Paolo Bonzini <[email protected]>
> ---
> CCing to stable@ since the regression was introduced with
> support for readonly memory slots.
>
> arch/x86/kvm/paging_tmpl.h | 7 ++++++-
> include/linux/kvm_host.h | 1 +
> virt/kvm/kvm_main.c | 14 +++++++++-----
> 3 files changed, 16 insertions(+), 6 deletions(-)
>
> diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
> index 0433301..dadc5c0 100644
> --- a/arch/x86/kvm/paging_tmpl.h
> +++ b/arch/x86/kvm/paging_tmpl.h
> @@ -99,6 +99,7 @@ struct guest_walker {
> pt_element_t prefetch_ptes[PTE_PREFETCH_NUM];
> gpa_t pte_gpa[PT_MAX_FULL_LEVELS];
> pt_element_t __user *ptep_user[PT_MAX_FULL_LEVELS];
> + bool pte_writable[PT_MAX_FULL_LEVELS];
> unsigned pt_access;
> unsigned pte_access;
> gfn_t gfn;
> @@ -235,6 +236,9 @@ static int FNAME(update_accessed_dirty_bits)(struct kvm_vcpu *vcpu,
> if (pte == orig_pte)
> continue;
>
> + if (unlikely(!walker->pte_writable[level - 1]))
> + return -EACCES;
> +
> ret = FNAME(cmpxchg_gpte)(vcpu, mmu, ptep_user, index, orig_pte, pte);
> if (ret)
> return ret;
> @@ -309,7 +313,8 @@ retry_walk:
> goto error;
> real_gfn = gpa_to_gfn(real_gfn);
>
> - host_addr = gfn_to_hva(vcpu->kvm, real_gfn);
> + host_addr = gfn_to_hva_read(vcpu->kvm, real_gfn,
> + &walker->pte_writable[walker->level - 1]);
The use of gfn_to_hva_read is misleading. The code can still write into
gfn. Lets rename gfn_to_hva_read to gfn_to_hva_prot() and gfn_to_hva()
to gfn_to_hva_write().

This makes me think are there other places where gfn_to_hva() was
used, but gfn_to_hva_prot() should have been?
- kvm_host_page_size() looks incorrect. We never use huge page to map
read only memory slots currently.
- kvm_handle_bad_page() also looks incorrect and may cause incorrect
address to be reported to userspace.
- kvm_setup_async_pf() also incorrect. Makes all page fault on read
only slot to be sync.
- kvm_vm_fault() one looks OK since function assumes write only slots,
but it is obsolete and should be deleted anyway.

Others in generic and x86 code looks OK, somebody need to check ppc and
arm code.


> if (unlikely(kvm_is_error_hva(host_addr)))
> goto error;
>
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index ca645a0..22f9cdf 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -533,6 +533,7 @@ int gfn_to_page_many_atomic(struct kvm *kvm, gfn_t gfn, struct page **pages,
>
> struct page *gfn_to_page(struct kvm *kvm, gfn_t gfn);
> unsigned long gfn_to_hva(struct kvm *kvm, gfn_t gfn);
> +unsigned long gfn_to_hva_read(struct kvm *kvm, gfn_t gfn, bool *writable);
> unsigned long gfn_to_hva_memslot(struct kvm_memory_slot *slot, gfn_t gfn);
> void kvm_release_page_clean(struct page *page);
> void kvm_release_page_dirty(struct page *page);
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index f7e4334..418d037 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -1078,11 +1078,15 @@ unsigned long gfn_to_hva(struct kvm *kvm, gfn_t gfn)
> EXPORT_SYMBOL_GPL(gfn_to_hva);
>
> /*
> - * The hva returned by this function is only allowed to be read.
> - * It should pair with kvm_read_hva() or kvm_read_hva_atomic().
> + * If writable is set to false, the hva returned by this function is only
> + * allowed to be read.
> */
> -static unsigned long gfn_to_hva_read(struct kvm *kvm, gfn_t gfn)
> +unsigned long gfn_to_hva_read(struct kvm *kvm, gfn_t gfn, bool *writable)
> {
> + struct kvm_memory_slot *slot = gfn_to_memslot(kvm, gfn);
> + if (writable)
> + *writable = !memslot_is_readonly(slot);
> +
> return __gfn_to_hva_many(gfn_to_memslot(kvm, gfn), gfn, NULL, false);
> }
>
> @@ -1450,7 +1454,7 @@ int kvm_read_guest_page(struct kvm *kvm, gfn_t gfn, void *data, int offset,
> int r;
> unsigned long addr;
>
> - addr = gfn_to_hva_read(kvm, gfn);
> + addr = gfn_to_hva_read(kvm, gfn, NULL);
> if (kvm_is_error_hva(addr))
> return -EFAULT;
> r = kvm_read_hva(data, (void __user *)addr + offset, len);
> @@ -1488,7 +1492,7 @@ int kvm_read_guest_atomic(struct kvm *kvm, gpa_t gpa, void *data,
> gfn_t gfn = gpa >> PAGE_SHIFT;
> int offset = offset_in_page(gpa);
>
> - addr = gfn_to_hva_read(kvm, gfn);
> + addr = gfn_to_hva_read(kvm, gfn, NULL);
> if (kvm_is_error_hva(addr))
> return -EFAULT;
> pagefault_disable();
> --
> 1.8.3.1

--
Gleb.

2013-09-02 09:20:28

by Xiao Guangrong

[permalink] [raw]
Subject: Re: [PATCH] KVM: mmu: allow page tables to be in read-only slots

On 08/30/2013 08:41 PM, Paolo Bonzini wrote:
> Page tables in a read-only memory slot will currently cause a triple
> fault because the page walker uses gfn_to_hva and it fails on such a slot.
>
> OVMF uses such a page table; however, real hardware seems to be fine with
> that as long as the accessed/dirty bits are set. Save whether the slot
> is readonly, and later check it when updating the accessed and dirty bits.

Paolo, do you know why OVMF is using readonly memory like this?

AFAIK, The fault trigged by this kind of access can hardly be fixed by
userspace since the fault is trigged by pagetable walking not by the current
instruction. Do you have any idea to let uerspace emulate it properly?

2013-09-02 09:26:34

by Gleb Natapov

[permalink] [raw]
Subject: Re: [PATCH] KVM: mmu: allow page tables to be in read-only slots

On Mon, Sep 02, 2013 at 05:20:15PM +0800, Xiao Guangrong wrote:
> On 08/30/2013 08:41 PM, Paolo Bonzini wrote:
> > Page tables in a read-only memory slot will currently cause a triple
> > fault because the page walker uses gfn_to_hva and it fails on such a slot.
> >
> > OVMF uses such a page table; however, real hardware seems to be fine with
> > that as long as the accessed/dirty bits are set. Save whether the slot
> > is readonly, and later check it when updating the accessed and dirty bits.
>
> Paolo, do you know why OVMF is using readonly memory like this?
>
Just a guess, but perhaps they want to move to paging mode as early as
possible even before memory controller is fully initialized.

> AFAIK, The fault trigged by this kind of access can hardly be fixed by
> userspace since the fault is trigged by pagetable walking not by the current
> instruction. Do you have any idea to let uerspace emulate it properly?
Not sure what userspace you mean here, but there shouldn't be a fault in the
first place if ROM page tables have access/dirty bit set and they do.

--
Gleb.

2013-09-02 09:42:36

by Xiao Guangrong

[permalink] [raw]
Subject: Re: [PATCH] KVM: mmu: allow page tables to be in read-only slots

On 09/01/2013 05:17 PM, Gleb Natapov wrote:
> On Fri, Aug 30, 2013 at 02:41:37PM +0200, Paolo Bonzini wrote:
>> Page tables in a read-only memory slot will currently cause a triple
>> fault because the page walker uses gfn_to_hva and it fails on such a slot.
>>
>> OVMF uses such a page table; however, real hardware seems to be fine with
>> that as long as the accessed/dirty bits are set. Save whether the slot
>> is readonly, and later check it when updating the accessed and dirty bits.
>>
> The fix looks OK to me, but some comment below.
>
>> Cc: [email protected]
>> Cc: [email protected]
>> Cc: Xiao Guangrong <[email protected]>
>> Signed-off-by: Paolo Bonzini <[email protected]>
>> ---
>> CCing to stable@ since the regression was introduced with
>> support for readonly memory slots.
>>
>> arch/x86/kvm/paging_tmpl.h | 7 ++++++-
>> include/linux/kvm_host.h | 1 +
>> virt/kvm/kvm_main.c | 14 +++++++++-----
>> 3 files changed, 16 insertions(+), 6 deletions(-)
>>
>> diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
>> index 0433301..dadc5c0 100644
>> --- a/arch/x86/kvm/paging_tmpl.h
>> +++ b/arch/x86/kvm/paging_tmpl.h
>> @@ -99,6 +99,7 @@ struct guest_walker {
>> pt_element_t prefetch_ptes[PTE_PREFETCH_NUM];
>> gpa_t pte_gpa[PT_MAX_FULL_LEVELS];
>> pt_element_t __user *ptep_user[PT_MAX_FULL_LEVELS];
>> + bool pte_writable[PT_MAX_FULL_LEVELS];
>> unsigned pt_access;
>> unsigned pte_access;
>> gfn_t gfn;
>> @@ -235,6 +236,9 @@ static int FNAME(update_accessed_dirty_bits)(struct kvm_vcpu *vcpu,
>> if (pte == orig_pte)
>> continue;
>>
>> + if (unlikely(!walker->pte_writable[level - 1]))
>> + return -EACCES;
>> +
>> ret = FNAME(cmpxchg_gpte)(vcpu, mmu, ptep_user, index, orig_pte, pte);
>> if (ret)
>> return ret;
>> @@ -309,7 +313,8 @@ retry_walk:
>> goto error;
>> real_gfn = gpa_to_gfn(real_gfn);
>>
>> - host_addr = gfn_to_hva(vcpu->kvm, real_gfn);
>> + host_addr = gfn_to_hva_read(vcpu->kvm, real_gfn,
>> + &walker->pte_writable[walker->level - 1]);
> The use of gfn_to_hva_read is misleading. The code can still write into
> gfn. Lets rename gfn_to_hva_read to gfn_to_hva_prot() and gfn_to_hva()
> to gfn_to_hva_write().

Yes. I agreed.

>
> This makes me think are there other places where gfn_to_hva() was
> used, but gfn_to_hva_prot() should have been?
> - kvm_host_page_size() looks incorrect. We never use huge page to map
> read only memory slots currently.

It only checks whether gfn have been mapped, I think we can use
gfn_to_hva_read() instead, the real permission will be checked when we translate
the gfn to pfn.

> - kvm_handle_bad_page() also looks incorrect and may cause incorrect
> address to be reported to userspace.

I have no idea on this point. kvm_handle_bad_page() is called when it failed to
translate the target gfn to pfn, then the emulator can detect the error on target gfn
properly. no? Or i misunderstood your meaning?

> - kvm_setup_async_pf() also incorrect. Makes all page fault on read
> only slot to be sync.
> - kvm_vm_fault() one looks OK since function assumes write only slots,
> but it is obsolete and should be deleted anyway.

2013-09-02 09:49:22

by Gleb Natapov

[permalink] [raw]
Subject: Re: [PATCH] KVM: mmu: allow page tables to be in read-only slots

On Mon, Sep 02, 2013 at 05:42:25PM +0800, Xiao Guangrong wrote:
> On 09/01/2013 05:17 PM, Gleb Natapov wrote:
> > On Fri, Aug 30, 2013 at 02:41:37PM +0200, Paolo Bonzini wrote:
> >> Page tables in a read-only memory slot will currently cause a triple
> >> fault because the page walker uses gfn_to_hva and it fails on such a slot.
> >>
> >> OVMF uses such a page table; however, real hardware seems to be fine with
> >> that as long as the accessed/dirty bits are set. Save whether the slot
> >> is readonly, and later check it when updating the accessed and dirty bits.
> >>
> > The fix looks OK to me, but some comment below.
> >
> >> Cc: [email protected]
> >> Cc: [email protected]
> >> Cc: Xiao Guangrong <[email protected]>
> >> Signed-off-by: Paolo Bonzini <[email protected]>
> >> ---
> >> CCing to stable@ since the regression was introduced with
> >> support for readonly memory slots.
> >>
> >> arch/x86/kvm/paging_tmpl.h | 7 ++++++-
> >> include/linux/kvm_host.h | 1 +
> >> virt/kvm/kvm_main.c | 14 +++++++++-----
> >> 3 files changed, 16 insertions(+), 6 deletions(-)
> >>
> >> diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
> >> index 0433301..dadc5c0 100644
> >> --- a/arch/x86/kvm/paging_tmpl.h
> >> +++ b/arch/x86/kvm/paging_tmpl.h
> >> @@ -99,6 +99,7 @@ struct guest_walker {
> >> pt_element_t prefetch_ptes[PTE_PREFETCH_NUM];
> >> gpa_t pte_gpa[PT_MAX_FULL_LEVELS];
> >> pt_element_t __user *ptep_user[PT_MAX_FULL_LEVELS];
> >> + bool pte_writable[PT_MAX_FULL_LEVELS];
> >> unsigned pt_access;
> >> unsigned pte_access;
> >> gfn_t gfn;
> >> @@ -235,6 +236,9 @@ static int FNAME(update_accessed_dirty_bits)(struct kvm_vcpu *vcpu,
> >> if (pte == orig_pte)
> >> continue;
> >>
> >> + if (unlikely(!walker->pte_writable[level - 1]))
> >> + return -EACCES;
> >> +
> >> ret = FNAME(cmpxchg_gpte)(vcpu, mmu, ptep_user, index, orig_pte, pte);
> >> if (ret)
> >> return ret;
> >> @@ -309,7 +313,8 @@ retry_walk:
> >> goto error;
> >> real_gfn = gpa_to_gfn(real_gfn);
> >>
> >> - host_addr = gfn_to_hva(vcpu->kvm, real_gfn);
> >> + host_addr = gfn_to_hva_read(vcpu->kvm, real_gfn,
> >> + &walker->pte_writable[walker->level - 1]);
> > The use of gfn_to_hva_read is misleading. The code can still write into
> > gfn. Lets rename gfn_to_hva_read to gfn_to_hva_prot() and gfn_to_hva()
> > to gfn_to_hva_write().
>
> Yes. I agreed.
>
> >
> > This makes me think are there other places where gfn_to_hva() was
> > used, but gfn_to_hva_prot() should have been?
> > - kvm_host_page_size() looks incorrect. We never use huge page to map
> > read only memory slots currently.
>
> It only checks whether gfn have been mapped, I think we can use
> gfn_to_hva_read() instead, the real permission will be checked when we translate
> the gfn to pfn.
>
Yes, all the cases I listed should be changed to use function that looks
at both regular and RO slots.

> > - kvm_handle_bad_page() also looks incorrect and may cause incorrect
> > address to be reported to userspace.
>
> I have no idea on this point. kvm_handle_bad_page() is called when it failed to
> translate the target gfn to pfn, then the emulator can detect the error on target gfn
> properly. no? Or i misunderstood your meaning?
>
I am talking about the following code:

if (pfn == KVM_PFN_ERR_HWPOISON) {
kvm_send_hwpoison_signal(gfn_to_hva(vcpu->kvm, gfn), current);
return 0;
}

pfn will be KVM_PFN_ERR_HWPOISON gfn is backed by faulty memory, we need
to report the liner address of the faulty memory to a userspace here,
but if gfn is in a RO slot gfn_to_hva() will not return correct address
here.

> > - kvm_setup_async_pf() also incorrect. Makes all page fault on read
> > only slot to be sync.
> > - kvm_vm_fault() one looks OK since function assumes write only slots,
> > but it is obsolete and should be deleted anyway.
>

--
Gleb.

2013-09-02 10:00:51

by Xiao Guangrong

[permalink] [raw]
Subject: Re: [PATCH] KVM: mmu: allow page tables to be in read-only slots

On 09/02/2013 05:25 PM, Gleb Natapov wrote:
> On Mon, Sep 02, 2013 at 05:20:15PM +0800, Xiao Guangrong wrote:
>> On 08/30/2013 08:41 PM, Paolo Bonzini wrote:
>>> Page tables in a read-only memory slot will currently cause a triple
>>> fault because the page walker uses gfn_to_hva and it fails on such a slot.
>>>
>>> OVMF uses such a page table; however, real hardware seems to be fine with
>>> that as long as the accessed/dirty bits are set. Save whether the slot
>>> is readonly, and later check it when updating the accessed and dirty bits.
>>
>> Paolo, do you know why OVMF is using readonly memory like this?
>>
> Just a guess, but perhaps they want to move to paging mode as early as
> possible even before memory controller is fully initialized.
>
>> AFAIK, The fault trigged by this kind of access can hardly be fixed by
>> userspace since the fault is trigged by pagetable walking not by the current
>> instruction. Do you have any idea to let uerspace emulate it properly?
> Not sure what userspace you mean here, but there shouldn't be a fault in the

I just wonder how to fix this kind of fault. The current patch returns -EACCES
but that will crash the guest. I think we'd better let userspace to fix this
error (let userspace set the D/A bit.)

> first place if ROM page tables have access/dirty bit set and they do.

Yes, so we can not call x86_emulate_instruction() to fix this fault (that function
emulates the access on the first place). Need directly return a MMIO-exit to
userpsace when met this fault? What happen if this fault on pagetable-walking
is trigged in x86_emulate_instruction().?

2013-09-02 10:05:23

by Xiao Guangrong

[permalink] [raw]
Subject: Re: [PATCH] KVM: mmu: allow page tables to be in read-only slots

On 09/02/2013 05:49 PM, Gleb Natapov wrote:
> On Mon, Sep 02, 2013 at 05:42:25PM +0800, Xiao Guangrong wrote:
>> On 09/01/2013 05:17 PM, Gleb Natapov wrote:
>>> On Fri, Aug 30, 2013 at 02:41:37PM +0200, Paolo Bonzini wrote:
>>>> Page tables in a read-only memory slot will currently cause a triple
>>>> fault because the page walker uses gfn_to_hva and it fails on such a slot.
>>>>
>>>> OVMF uses such a page table; however, real hardware seems to be fine with
>>>> that as long as the accessed/dirty bits are set. Save whether the slot
>>>> is readonly, and later check it when updating the accessed and dirty bits.
>>>>
>>> The fix looks OK to me, but some comment below.
>>>
>>>> Cc: [email protected]
>>>> Cc: [email protected]
>>>> Cc: Xiao Guangrong <[email protected]>
>>>> Signed-off-by: Paolo Bonzini <[email protected]>
>>>> ---
>>>> CCing to stable@ since the regression was introduced with
>>>> support for readonly memory slots.
>>>>
>>>> arch/x86/kvm/paging_tmpl.h | 7 ++++++-
>>>> include/linux/kvm_host.h | 1 +
>>>> virt/kvm/kvm_main.c | 14 +++++++++-----
>>>> 3 files changed, 16 insertions(+), 6 deletions(-)
>>>>
>>>> diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
>>>> index 0433301..dadc5c0 100644
>>>> --- a/arch/x86/kvm/paging_tmpl.h
>>>> +++ b/arch/x86/kvm/paging_tmpl.h
>>>> @@ -99,6 +99,7 @@ struct guest_walker {
>>>> pt_element_t prefetch_ptes[PTE_PREFETCH_NUM];
>>>> gpa_t pte_gpa[PT_MAX_FULL_LEVELS];
>>>> pt_element_t __user *ptep_user[PT_MAX_FULL_LEVELS];
>>>> + bool pte_writable[PT_MAX_FULL_LEVELS];
>>>> unsigned pt_access;
>>>> unsigned pte_access;
>>>> gfn_t gfn;
>>>> @@ -235,6 +236,9 @@ static int FNAME(update_accessed_dirty_bits)(struct kvm_vcpu *vcpu,
>>>> if (pte == orig_pte)
>>>> continue;
>>>>
>>>> + if (unlikely(!walker->pte_writable[level - 1]))
>>>> + return -EACCES;
>>>> +
>>>> ret = FNAME(cmpxchg_gpte)(vcpu, mmu, ptep_user, index, orig_pte, pte);
>>>> if (ret)
>>>> return ret;
>>>> @@ -309,7 +313,8 @@ retry_walk:
>>>> goto error;
>>>> real_gfn = gpa_to_gfn(real_gfn);
>>>>
>>>> - host_addr = gfn_to_hva(vcpu->kvm, real_gfn);
>>>> + host_addr = gfn_to_hva_read(vcpu->kvm, real_gfn,
>>>> + &walker->pte_writable[walker->level - 1]);
>>> The use of gfn_to_hva_read is misleading. The code can still write into
>>> gfn. Lets rename gfn_to_hva_read to gfn_to_hva_prot() and gfn_to_hva()
>>> to gfn_to_hva_write().
>>
>> Yes. I agreed.
>>
>>>
>>> This makes me think are there other places where gfn_to_hva() was
>>> used, but gfn_to_hva_prot() should have been?
>>> - kvm_host_page_size() looks incorrect. We never use huge page to map
>>> read only memory slots currently.
>>
>> It only checks whether gfn have been mapped, I think we can use
>> gfn_to_hva_read() instead, the real permission will be checked when we translate
>> the gfn to pfn.
>>
> Yes, all the cases I listed should be changed to use function that looks
> at both regular and RO slots.
>
>>> - kvm_handle_bad_page() also looks incorrect and may cause incorrect
>>> address to be reported to userspace.
>>
>> I have no idea on this point. kvm_handle_bad_page() is called when it failed to
>> translate the target gfn to pfn, then the emulator can detect the error on target gfn
>> properly. no? Or i misunderstood your meaning?
>>
> I am talking about the following code:
>
> if (pfn == KVM_PFN_ERR_HWPOISON) {
> kvm_send_hwpoison_signal(gfn_to_hva(vcpu->kvm, gfn), current);
> return 0;
> }
>
> pfn will be KVM_PFN_ERR_HWPOISON gfn is backed by faulty memory, we need
> to report the liner address of the faulty memory to a userspace here,
> but if gfn is in a RO slot gfn_to_hva() will not return correct address
> here.

Got it, thanks for your explanation.

BTW, if you and Paolo are busy on other things, i am happy to fix these issues. :)

2013-09-02 10:08:02

by Gleb Natapov

[permalink] [raw]
Subject: Re: [PATCH] KVM: mmu: allow page tables to be in read-only slots

On Mon, Sep 02, 2013 at 06:00:39PM +0800, Xiao Guangrong wrote:
> On 09/02/2013 05:25 PM, Gleb Natapov wrote:
> > On Mon, Sep 02, 2013 at 05:20:15PM +0800, Xiao Guangrong wrote:
> >> On 08/30/2013 08:41 PM, Paolo Bonzini wrote:
> >>> Page tables in a read-only memory slot will currently cause a triple
> >>> fault because the page walker uses gfn_to_hva and it fails on such a slot.
> >>>
> >>> OVMF uses such a page table; however, real hardware seems to be fine with
> >>> that as long as the accessed/dirty bits are set. Save whether the slot
> >>> is readonly, and later check it when updating the accessed and dirty bits.
> >>
> >> Paolo, do you know why OVMF is using readonly memory like this?
> >>
> > Just a guess, but perhaps they want to move to paging mode as early as
> > possible even before memory controller is fully initialized.
> >
> >> AFAIK, The fault trigged by this kind of access can hardly be fixed by
> >> userspace since the fault is trigged by pagetable walking not by the current
> >> instruction. Do you have any idea to let uerspace emulate it properly?
> > Not sure what userspace you mean here, but there shouldn't be a fault in the
>
> I just wonder how to fix this kind of fault. The current patch returns -EACCES
> but that will crash the guest. I think we'd better let userspace to fix this
> error (let userspace set the D/A bit.)
>
Ugh, this is not good. Missed that. Don't know what real HW will do
here, but the easy thing for us to do would be to just return success.

> > first place if ROM page tables have access/dirty bit set and they do.
>
> Yes, so we can not call x86_emulate_instruction() to fix this fault (that function
> emulates the access on the first place). Need directly return a MMIO-exit to
> userpsace when met this fault? What happen if this fault on pagetable-walking
> is trigged in x86_emulate_instruction().?
I think we should not return MMIO-exit to userspace. Either ignore write attempt
or kill a guest.

--
Gleb.

2013-09-02 10:12:28

by Gleb Natapov

[permalink] [raw]
Subject: Re: [PATCH] KVM: mmu: allow page tables to be in read-only slots

On Mon, Sep 02, 2013 at 06:05:10PM +0800, Xiao Guangrong wrote:
> On 09/02/2013 05:49 PM, Gleb Natapov wrote:
> > On Mon, Sep 02, 2013 at 05:42:25PM +0800, Xiao Guangrong wrote:
> >> On 09/01/2013 05:17 PM, Gleb Natapov wrote:
> >>> On Fri, Aug 30, 2013 at 02:41:37PM +0200, Paolo Bonzini wrote:
> >>>> Page tables in a read-only memory slot will currently cause a triple
> >>>> fault because the page walker uses gfn_to_hva and it fails on such a slot.
> >>>>
> >>>> OVMF uses such a page table; however, real hardware seems to be fine with
> >>>> that as long as the accessed/dirty bits are set. Save whether the slot
> >>>> is readonly, and later check it when updating the accessed and dirty bits.
> >>>>
> >>> The fix looks OK to me, but some comment below.
> >>>
> >>>> Cc: [email protected]
> >>>> Cc: [email protected]
> >>>> Cc: Xiao Guangrong <[email protected]>
> >>>> Signed-off-by: Paolo Bonzini <[email protected]>
> >>>> ---
> >>>> CCing to stable@ since the regression was introduced with
> >>>> support for readonly memory slots.
> >>>>
> >>>> arch/x86/kvm/paging_tmpl.h | 7 ++++++-
> >>>> include/linux/kvm_host.h | 1 +
> >>>> virt/kvm/kvm_main.c | 14 +++++++++-----
> >>>> 3 files changed, 16 insertions(+), 6 deletions(-)
> >>>>
> >>>> diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
> >>>> index 0433301..dadc5c0 100644
> >>>> --- a/arch/x86/kvm/paging_tmpl.h
> >>>> +++ b/arch/x86/kvm/paging_tmpl.h
> >>>> @@ -99,6 +99,7 @@ struct guest_walker {
> >>>> pt_element_t prefetch_ptes[PTE_PREFETCH_NUM];
> >>>> gpa_t pte_gpa[PT_MAX_FULL_LEVELS];
> >>>> pt_element_t __user *ptep_user[PT_MAX_FULL_LEVELS];
> >>>> + bool pte_writable[PT_MAX_FULL_LEVELS];
> >>>> unsigned pt_access;
> >>>> unsigned pte_access;
> >>>> gfn_t gfn;
> >>>> @@ -235,6 +236,9 @@ static int FNAME(update_accessed_dirty_bits)(struct kvm_vcpu *vcpu,
> >>>> if (pte == orig_pte)
> >>>> continue;
> >>>>
> >>>> + if (unlikely(!walker->pte_writable[level - 1]))
> >>>> + return -EACCES;
> >>>> +
> >>>> ret = FNAME(cmpxchg_gpte)(vcpu, mmu, ptep_user, index, orig_pte, pte);
> >>>> if (ret)
> >>>> return ret;
> >>>> @@ -309,7 +313,8 @@ retry_walk:
> >>>> goto error;
> >>>> real_gfn = gpa_to_gfn(real_gfn);
> >>>>
> >>>> - host_addr = gfn_to_hva(vcpu->kvm, real_gfn);
> >>>> + host_addr = gfn_to_hva_read(vcpu->kvm, real_gfn,
> >>>> + &walker->pte_writable[walker->level - 1]);
> >>> The use of gfn_to_hva_read is misleading. The code can still write into
> >>> gfn. Lets rename gfn_to_hva_read to gfn_to_hva_prot() and gfn_to_hva()
> >>> to gfn_to_hva_write().
> >>
> >> Yes. I agreed.
> >>
> >>>
> >>> This makes me think are there other places where gfn_to_hva() was
> >>> used, but gfn_to_hva_prot() should have been?
> >>> - kvm_host_page_size() looks incorrect. We never use huge page to map
> >>> read only memory slots currently.
> >>
> >> It only checks whether gfn have been mapped, I think we can use
> >> gfn_to_hva_read() instead, the real permission will be checked when we translate
> >> the gfn to pfn.
> >>
> > Yes, all the cases I listed should be changed to use function that looks
> > at both regular and RO slots.
> >
> >>> - kvm_handle_bad_page() also looks incorrect and may cause incorrect
> >>> address to be reported to userspace.
> >>
> >> I have no idea on this point. kvm_handle_bad_page() is called when it failed to
> >> translate the target gfn to pfn, then the emulator can detect the error on target gfn
> >> properly. no? Or i misunderstood your meaning?
> >>
> > I am talking about the following code:
> >
> > if (pfn == KVM_PFN_ERR_HWPOISON) {
> > kvm_send_hwpoison_signal(gfn_to_hva(vcpu->kvm, gfn), current);
> > return 0;
> > }
> >
> > pfn will be KVM_PFN_ERR_HWPOISON gfn is backed by faulty memory, we need
> > to report the liner address of the faulty memory to a userspace here,
> > but if gfn is in a RO slot gfn_to_hva() will not return correct address
> > here.
>
> Got it, thanks for your explanation.
>
> BTW, if you and Paolo are busy on other things, i am happy to fix these issues. :)
I am busy with reviews mostly :). If you are not to busy with lockless
write protection then fine with me. Lest wait for Paolo's input on
proposed API though.

--
Gleb.

2013-09-02 15:56:24

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [PATCH] KVM: mmu: allow page tables to be in read-only slots

Il 02/09/2013 12:07, Gleb Natapov ha scritto:
> On Mon, Sep 02, 2013 at 06:00:39PM +0800, Xiao Guangrong wrote:
>> On 09/02/2013 05:25 PM, Gleb Natapov wrote:
>>> On Mon, Sep 02, 2013 at 05:20:15PM +0800, Xiao Guangrong wrote:
>>>> On 08/30/2013 08:41 PM, Paolo Bonzini wrote:
>>>>> Page tables in a read-only memory slot will currently cause a triple
>>>>> fault because the page walker uses gfn_to_hva and it fails on such a slot.
>>>>>
>>>>> OVMF uses such a page table; however, real hardware seems to be fine with
>>>>> that as long as the accessed/dirty bits are set. Save whether the slot
>>>>> is readonly, and later check it when updating the accessed and dirty bits.
>>>>
>>>> Paolo, do you know why OVMF is using readonly memory like this?
>>>>
>>> Just a guess, but perhaps they want to move to paging mode as early as
>>> possible even before memory controller is fully initialized.
>>>
>>>> AFAIK, The fault trigged by this kind of access can hardly be fixed by
>>>> userspace since the fault is trigged by pagetable walking not by the current
>>>> instruction. Do you have any idea to let uerspace emulate it properly?
>>> Not sure what userspace you mean here, but there shouldn't be a fault in the
>>
>> I just wonder how to fix this kind of fault. The current patch returns -EACCES
>> but that will crash the guest. I think we'd better let userspace to fix this
>> error (let userspace set the D/A bit.)
>>
> Ugh, this is not good. Missed that. Don't know what real HW will do
> here, but the easy thing for us to do would be to just return success.

Real hardware would just do a memory write. What happens depends on
what is on the bus, i.e. on what the ROM is used for.

QEMU uses read-only slots for two things: actual read-only memory where
writes go to the bitbucket, and "ROMD" memory where writes are treated
as MMIO.

So, in the first case we would ignore the write. In the second we would
do an MMIO exit to userspace. But ignoring the write isn't always
correct, and doing an MMIO exit is complicated, so I would just kill the
guest.

EPT will probably write to the read-only slots without thinking much
about it.

My patch injects a page fault, which is very likely to escalate to a
triple fault. This is probably never what you want---on the other hand,
I wasn't sure what level of accuracy we want in this case, given that
EPT does it wrong too.

Paolo

>>> first place if ROM page tables have access/dirty bit set and they do.
>>
>> Yes, so we can not call x86_emulate_instruction() to fix this fault (that function
>> emulates the access on the first place). Need directly return a MMIO-exit to
>> userpsace when met this fault? What happen if this fault on pagetable-walking
>> is trigged in x86_emulate_instruction().?
> I think we should not return MMIO-exit to userspace. Either ignore write attempt
> or kill a guest.
>
> --
> Gleb.
>

2013-09-02 15:58:56

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [PATCH] KVM: mmu: allow page tables to be in read-only slots

Il 02/09/2013 12:11, Gleb Natapov ha scritto:
>> >
>> > Got it, thanks for your explanation.
>> >
>> > BTW, if you and Paolo are busy on other things, i am happy to fix these issues. :)
> I am busy with reviews mostly :). If you are not to busy with lockless
> write protection then fine with me. Lest wait for Paolo's input on
> proposed API though.

No problem, I'm happy to do this kind of exercise. I like the API you
proposed. I can do this change on top of the API change. However,
since this bug is "in the wild" it may be applicable to stable@ as well.
If you agree, I will also post v2 of this patch for stable kernels as
soon as we agree on the desired faulting behavior.

Paolo

2013-09-02 16:00:37

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [PATCH] KVM: mmu: allow page tables to be in read-only slots

Il 02/09/2013 11:25, Gleb Natapov ha scritto:
> On Mon, Sep 02, 2013 at 05:20:15PM +0800, Xiao Guangrong wrote:
>> On 08/30/2013 08:41 PM, Paolo Bonzini wrote:
>>> Page tables in a read-only memory slot will currently cause a triple
>>> fault because the page walker uses gfn_to_hva and it fails on such a slot.
>>>
>>> OVMF uses such a page table; however, real hardware seems to be fine with
>>> that as long as the accessed/dirty bits are set. Save whether the slot
>>> is readonly, and later check it when updating the accessed and dirty bits.
>>
>> Paolo, do you know why OVMF is using readonly memory like this?
>>
> Just a guess, but perhaps they want to move to paging mode as early as
> possible even before memory controller is fully initialized.

More precisely they want to move to 64-bit mode as early as possible,
and that requires paging.

>> AFAIK, The fault trigged by this kind of access can hardly be fixed by
>> userspace since the fault is trigged by pagetable walking not by the current
>> instruction. Do you have any idea to let uerspace emulate it properly?
>
> Not sure what userspace you mean here, but there shouldn't be a fault in the
> first place if ROM page tables have access/dirty bit set and they do.

Yep. Actually they do not set dirty on the PD/PDP/PML4, only on the
bottom level (which is also fine).

Paolo

2013-09-02 16:16:31

by Gleb Natapov

[permalink] [raw]
Subject: Re: [PATCH] KVM: mmu: allow page tables to be in read-only slots

On Mon, Sep 02, 2013 at 06:00:08PM +0200, Paolo Bonzini wrote:
> Il 02/09/2013 11:25, Gleb Natapov ha scritto:
> > On Mon, Sep 02, 2013 at 05:20:15PM +0800, Xiao Guangrong wrote:
> >> On 08/30/2013 08:41 PM, Paolo Bonzini wrote:
> >>> Page tables in a read-only memory slot will currently cause a triple
> >>> fault because the page walker uses gfn_to_hva and it fails on such a slot.
> >>>
> >>> OVMF uses such a page table; however, real hardware seems to be fine with
> >>> that as long as the accessed/dirty bits are set. Save whether the slot
> >>> is readonly, and later check it when updating the accessed and dirty bits.
> >>
> >> Paolo, do you know why OVMF is using readonly memory like this?
> >>
> > Just a guess, but perhaps they want to move to paging mode as early as
> > possible even before memory controller is fully initialized.
>
> More precisely they want to move to 64-bit mode as early as possible,
> and that requires paging.
>
> >> AFAIK, The fault trigged by this kind of access can hardly be fixed by
> >> userspace since the fault is trigged by pagetable walking not by the current
> >> instruction. Do you have any idea to let uerspace emulate it properly?
> >
> > Not sure what userspace you mean here, but there shouldn't be a fault in the
> > first place if ROM page tables have access/dirty bit set and they do.
>
> Yep. Actually they do not set dirty on the PD/PDP/PML4, only on the
> bottom level (which is also fine).
>
They do it for a good reason, dirty bit exists only on the bottom level :)

--
Gleb.

2013-09-02 16:26:52

by Gleb Natapov

[permalink] [raw]
Subject: Re: [PATCH] KVM: mmu: allow page tables to be in read-only slots

On Mon, Sep 02, 2013 at 05:56:07PM +0200, Paolo Bonzini wrote:
> Il 02/09/2013 12:07, Gleb Natapov ha scritto:
> > On Mon, Sep 02, 2013 at 06:00:39PM +0800, Xiao Guangrong wrote:
> >> On 09/02/2013 05:25 PM, Gleb Natapov wrote:
> >>> On Mon, Sep 02, 2013 at 05:20:15PM +0800, Xiao Guangrong wrote:
> >>>> On 08/30/2013 08:41 PM, Paolo Bonzini wrote:
> >>>>> Page tables in a read-only memory slot will currently cause a triple
> >>>>> fault because the page walker uses gfn_to_hva and it fails on such a slot.
> >>>>>
> >>>>> OVMF uses such a page table; however, real hardware seems to be fine with
> >>>>> that as long as the accessed/dirty bits are set. Save whether the slot
> >>>>> is readonly, and later check it when updating the accessed and dirty bits.
> >>>>
> >>>> Paolo, do you know why OVMF is using readonly memory like this?
> >>>>
> >>> Just a guess, but perhaps they want to move to paging mode as early as
> >>> possible even before memory controller is fully initialized.
> >>>
> >>>> AFAIK, The fault trigged by this kind of access can hardly be fixed by
> >>>> userspace since the fault is trigged by pagetable walking not by the current
> >>>> instruction. Do you have any idea to let uerspace emulate it properly?
> >>> Not sure what userspace you mean here, but there shouldn't be a fault in the
> >>
> >> I just wonder how to fix this kind of fault. The current patch returns -EACCES
> >> but that will crash the guest. I think we'd better let userspace to fix this
> >> error (let userspace set the D/A bit.)
> >>
> > Ugh, this is not good. Missed that. Don't know what real HW will do
> > here, but the easy thing for us to do would be to just return success.
>
> Real hardware would just do a memory write. What happens depends on
> what is on the bus, i.e. on what the ROM is used for.
>
> QEMU uses read-only slots for two things: actual read-only memory where
> writes go to the bitbucket, and "ROMD" memory where writes are treated
> as MMIO.
>
> So, in the first case we would ignore the write. In the second we would
> do an MMIO exit to userspace. But ignoring the write isn't always
> correct, and doing an MMIO exit is complicated, so I would just kill the
> guest.
>
> EPT will probably write to the read-only slots without thinking much
> about it.
>
> My patch injects a page fault, which is very likely to escalate to a
> triple fault. This is probably never what you want---on the other hand,
> I wasn't sure what level of accuracy we want in this case, given that
> EPT does it wrong too.
>
Just ignoring write into RO slot is OK. We do not handle page tables in
MMIO memory, so handling page tables in ROMD memory shouldn't be a
priority either.

--
Gleb.

2013-09-02 16:28:10

by Gleb Natapov

[permalink] [raw]
Subject: Re: [PATCH] KVM: mmu: allow page tables to be in read-only slots

On Mon, Sep 02, 2013 at 05:58:41PM +0200, Paolo Bonzini wrote:
> Il 02/09/2013 12:11, Gleb Natapov ha scritto:
> >> >
> >> > Got it, thanks for your explanation.
> >> >
> >> > BTW, if you and Paolo are busy on other things, i am happy to fix these issues. :)
> > I am busy with reviews mostly :). If you are not to busy with lockless
> > write protection then fine with me. Lest wait for Paolo's input on
> > proposed API though.
>
> No problem, I'm happy to do this kind of exercise. I like the API you
> proposed. I can do this change on top of the API change. However,
> since this bug is "in the wild" it may be applicable to stable@ as well.
> If you agree, I will also post v2 of this patch for stable kernels as
> soon as we agree on the desired faulting behavior.
>
I am OK with having minimal patch for backporting to stable and doing
API changes + fixes for other call sites on top.

--
Gleb.

2013-10-14 15:29:05

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [PATCH] KVM: mmu: allow page tables to be in read-only slots

Il 01/09/2013 11:17, Gleb Natapov ha scritto:
> This makes me think are there other places where gfn_to_hva() was
> used, but gfn_to_hva_prot() should have been?
> - kvm_host_page_size() looks incorrect. We never use huge page to map
> read only memory slots currently.
> - kvm_handle_bad_page() also looks incorrect and may cause incorrect
> address to be reported to userspace.
> - kvm_setup_async_pf() also incorrect. Makes all page fault on read
> only slot to be sync.
> - kvm_vm_fault() one looks OK since function assumes write only slots,
> but it is obsolete and should be deleted anyway.
>
> Others in generic and x86 code looks OK, somebody need to check ppc and
> arm code.

PPC is a mess, for s390 I couldn't find a spec. ARM needs to be fixed.
Patches coming...

Paolo