On Fri, Aug 20, 2021, Brijesh Singh wrote:
> When SEV-SNP is enabled in the guest VM, the guest memory pages can
> either be a private or shared. A write from the hypervisor goes through
> the RMP checks. If hardware sees that hypervisor is attempting to write
> to a guest private page, then it triggers an RMP violation #PF.
>
> To avoid the RMP violation, add post_{map,unmap}_gfn() ops that can be
> used to verify that its safe to map a given guest page. Use the SRCU to
> protect against the page state change for existing mapped pages.
SRCU isn't protecting anything. The synchronize_srcu_expedited() in the PSC code
forces it to wait for existing maps to go away, but it doesn't prevent new maps
from being created while the actual RMP updates are in-flight. Most telling is
that the RMP updates happen _after_ the synchronize_srcu_expedited() call.
This also doesn't handle kvm_{read,write}_guest_cached().
I can't help but note that the private memslots idea[*] would handle this gracefully,
e.g. the memslot lookup would fail, and any change in private memslots would
invalidate the cache due to a generation mismatch.
KSM is another mess that would Just Work.
I'm not saying that SNP should be blocked on support for unmapping guest private
memory, but I do think we should strongly consider focusing on that effort rather
than trying to fix things piecemeal throughout KVM. I don't think it's too absurd
to say that it might actually be faster overall. And I 100% think that having a
cohesive design and uABI for SNP and TDX would be hugely beneficial to KVM.
[*] https://lkml.kernel.org/r/[email protected]
> +int sev_post_map_gfn(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn, int *token)
> +{
> + struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
> + int level;
> +
> + if (!sev_snp_guest(kvm))
> + return 0;
> +
> + *token = srcu_read_lock(&sev->psc_srcu);
> +
> + /* If pfn is not added as private then fail */
This comment and the pr_err() are backwards, and confused the heck out of me.
snp_lookup_rmpentry() returns '1' if the pfn is assigned, a.k.a. private. That
means this code throws an error if the page is private, i.e. requires the page
to be shared. Which makes sense given the use cases, it's just incredibly
confusing.
> + if (snp_lookup_rmpentry(pfn, &level) == 1) {
Any reason not to provide e.g. rmp_is_shared() and rmp_is_private() so that
callers don't have to care as much about the return values? The -errno/0/1
semantics are all but guarantee to bite us in the rear at some point.
Actually, peeking at other patches, I think it already has. This usage in
__unregister_enc_region_locked() is wrong:
/*
* The guest memory pages are assigned in the RMP table. Unassign it
* before releasing the memory.
*/
if (sev_snp_guest(kvm)) {
for (i = 0; i < region->npages; i++) {
pfn = page_to_pfn(region->pages[i]);
if (!snp_lookup_rmpentry(pfn, &level)) <-- attempts make_shared() on non-existent entry
continue;
cond_resched();
if (level > PG_LEVEL_4K)
pfn &= ~(KVM_PAGES_PER_HPAGE(PG_LEVEL_2M) - 1);
host_rmp_make_shared(pfn, level, true);
}
}
> + srcu_read_unlock(&sev->psc_srcu, *token);
> + pr_err_ratelimited("failed to map private gfn 0x%llx pfn 0x%llx\n", gfn, pfn);
> + return -EBUSY;
> + }
> +
> + return 0;
> +}
> static struct kvm_x86_init_ops svm_init_ops __initdata = {
> diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
> index d10f7166b39d..ff91184f9b4a 100644
> --- a/arch/x86/kvm/svm/svm.h
> +++ b/arch/x86/kvm/svm/svm.h
> @@ -76,16 +76,22 @@ struct kvm_sev_info {
> bool active; /* SEV enabled guest */
> bool es_active; /* SEV-ES enabled guest */
> bool snp_active; /* SEV-SNP enabled guest */
> +
> unsigned int asid; /* ASID used for this guest */
> unsigned int handle; /* SEV firmware handle */
> int fd; /* SEV device fd */
> +
> unsigned long pages_locked; /* Number of pages locked */
> struct list_head regions_list; /* List of registered regions */
> +
> u64 ap_jump_table; /* SEV-ES AP Jump Table address */
> +
> struct kvm *enc_context_owner; /* Owner of copied encryption context */
> struct misc_cg *misc_cg; /* For misc cgroup accounting */
> +
Unrelated whitespace changes.
> u64 snp_init_flags;
> void *snp_context; /* SNP guest context page */
> + struct srcu_struct psc_srcu;
> };
On 10/12/21 5:23 PM, Sean Christopherson wrote:
> On Fri, Aug 20, 2021, Brijesh Singh wrote:
>> When SEV-SNP is enabled in the guest VM, the guest memory pages can
>> either be a private or shared. A write from the hypervisor goes through
>> the RMP checks. If hardware sees that hypervisor is attempting to write
>> to a guest private page, then it triggers an RMP violation #PF.
>>
>> To avoid the RMP violation, add post_{map,unmap}_gfn() ops that can be
>> used to verify that its safe to map a given guest page. Use the SRCU to
>> protect against the page state change for existing mapped pages.
> SRCU isn't protecting anything. The synchronize_srcu_expedited() in the PSC code
> forces it to wait for existing maps to go away, but it doesn't prevent new maps
> from being created while the actual RMP updates are in-flight. Most telling is
> that the RMP updates happen _after_ the synchronize_srcu_expedited() call.
>
> This also doesn't handle kvm_{read,write}_guest_cached().
Hmm, I thought the kvm_{read_write}_guest_cached() uses the
copy_{to,from}_user(). Writing to the private will cause a #PF and will
fail the copy_to_user(). Am I missing something?
>
> I can't help but note that the private memslots idea[*] would handle this gracefully,
> e.g. the memslot lookup would fail, and any change in private memslots would
> invalidate the cache due to a generation mismatch.
>
> KSM is another mess that would Just Work.
>
> I'm not saying that SNP should be blocked on support for unmapping guest private
> memory, but I do think we should strongly consider focusing on that effort rather
> than trying to fix things piecemeal throughout KVM. I don't think it's too absurd
> to say that it might actually be faster overall. And I 100% think that having a
> cohesive design and uABI for SNP and TDX would be hugely beneficial to KVM.
>
> [*] https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flkml.kernel.org%2Fr%2F20210824005248.200037-1-seanjc%40google.com&data=04%7C01%7Cbrijesh.singh%40amd.com%7Cd1717d511a1f473cedc408d98ddfb027%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637696814148744591%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=3LF77%2BcqmpUdiP6YAk7LpImisBzjRGUzdI3raqjJxl0%3D&reserved=0
>
>> +int sev_post_map_gfn(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn, int *token)
>> +{
>> + struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
>> + int level;
>> +
>> + if (!sev_snp_guest(kvm))
>> + return 0;
>> +
>> + *token = srcu_read_lock(&sev->psc_srcu);
>> +
>> + /* If pfn is not added as private then fail */
> This comment and the pr_err() are backwards, and confused the heck out of me.
> snp_lookup_rmpentry() returns '1' if the pfn is assigned, a.k.a. private. That
> means this code throws an error if the page is private, i.e. requires the page
> to be shared. Which makes sense given the use cases, it's just incredibly
> confusing.
Actually I followed your recommendation from the previous feedback that
snp_lookup_rmpentry() should return 1 for the assigned page, 0 for the
shared and -negative for invalid. I can clarify it hereĀ again.
>
>> + if (snp_lookup_rmpentry(pfn, &level) == 1) {
> Any reason not to provide e.g. rmp_is_shared() and rmp_is_private() so that
> callers don't have to care as much about the return values? The -errno/0/1
> semantics are all but guarantee to bite us in the rear at some point.
If we look at the previous series, I had a macro rmp_is_assigned() for
exactly the same purpose but the feedback was to drop those macros and
return the state indirectly through the snp_lookup_rmpentry(). I can
certainly add a helper rmp_is_{shared,private}() if it makes code more
readable.
> Actually, peeking at other patches, I think it already has. This usage in
> __unregister_enc_region_locked() is wrong:
>
> /*
> * The guest memory pages are assigned in the RMP table. Unassign it
> * before releasing the memory.
> */
> if (sev_snp_guest(kvm)) {
> for (i = 0; i < region->npages; i++) {
> pfn = page_to_pfn(region->pages[i]);
>
> if (!snp_lookup_rmpentry(pfn, &level)) <-- attempts make_shared() on non-existent entry
> continue;
>
> cond_resched();
>
> if (level > PG_LEVEL_4K)
> pfn &= ~(KVM_PAGES_PER_HPAGE(PG_LEVEL_2M) - 1);
>
> host_rmp_make_shared(pfn, level, true);
> }
> }
>
>
>> + srcu_read_unlock(&sev->psc_srcu, *token);
>> + pr_err_ratelimited("failed to map private gfn 0x%llx pfn 0x%llx\n", gfn, pfn);
>> + return -EBUSY;
>> + }
>> +
>> + return 0;
>> +}
>> static struct kvm_x86_init_ops svm_init_ops __initdata = {
>> diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
>> index d10f7166b39d..ff91184f9b4a 100644
>> --- a/arch/x86/kvm/svm/svm.h
>> +++ b/arch/x86/kvm/svm/svm.h
>> @@ -76,16 +76,22 @@ struct kvm_sev_info {
>> bool active; /* SEV enabled guest */
>> bool es_active; /* SEV-ES enabled guest */
>> bool snp_active; /* SEV-SNP enabled guest */
>> +
>> unsigned int asid; /* ASID used for this guest */
>> unsigned int handle; /* SEV firmware handle */
>> int fd; /* SEV device fd */
>> +
>> unsigned long pages_locked; /* Number of pages locked */
>> struct list_head regions_list; /* List of registered regions */
>> +
>> u64 ap_jump_table; /* SEV-ES AP Jump Table address */
>> +
>> struct kvm *enc_context_owner; /* Owner of copied encryption context */
>> struct misc_cg *misc_cg; /* For misc cgroup accounting */
>> +
> Unrelated whitespace changes.
>
>> u64 snp_init_flags;
>> void *snp_context; /* SNP guest context page */
>> + struct srcu_struct psc_srcu;
>> };
On Wed, Oct 13, 2021, Brijesh Singh wrote:
>
> On 10/12/21 5:23 PM, Sean Christopherson wrote:
> > On Fri, Aug 20, 2021, Brijesh Singh wrote:
> >> When SEV-SNP is enabled in the guest VM, the guest memory pages can
> >> either be a private or shared. A write from the hypervisor goes through
> >> the RMP checks. If hardware sees that hypervisor is attempting to write
> >> to a guest private page, then it triggers an RMP violation #PF.
> >>
> >> To avoid the RMP violation, add post_{map,unmap}_gfn() ops that can be
> >> used to verify that its safe to map a given guest page. Use the SRCU to
> >> protect against the page state change for existing mapped pages.
> > SRCU isn't protecting anything. The synchronize_srcu_expedited() in the PSC code
> > forces it to wait for existing maps to go away, but it doesn't prevent new maps
> > from being created while the actual RMP updates are in-flight. Most telling is
> > that the RMP updates happen _after_ the synchronize_srcu_expedited() call.
> >
> > This also doesn't handle kvm_{read,write}_guest_cached().
>
> Hmm, I thought the kvm_{read_write}_guest_cached() uses the
> copy_{to,from}_user(). Writing to the private will cause a #PF and will
> fail the copy_to_user(). Am I missing something?
Doh, right you are. I was thinking they cached the kmap, but it's just the
gpa->hva that gets cached.
> > I can't help but note that the private memslots idea[*] would handle this gracefully,
> > e.g. the memslot lookup would fail, and any change in private memslots would
> > invalidate the cache due to a generation mismatch.
> >
> > KSM is another mess that would Just Work.
> >
> > I'm not saying that SNP should be blocked on support for unmapping guest private
> > memory, but I do think we should strongly consider focusing on that effort rather
> > than trying to fix things piecemeal throughout KVM. I don't think it's too absurd
> > to say that it might actually be faster overall. And I 100% think that having a
> > cohesive design and uABI for SNP and TDX would be hugely beneficial to KVM.
> >
> > [*] https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flkml.kernel.org%2Fr%2F20210824005248.200037-1-seanjc%40google.com&data=04%7C01%7Cbrijesh.singh%40amd.com%7Cd1717d511a1f473cedc408d98ddfb027%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637696814148744591%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=3LF77%2BcqmpUdiP6YAk7LpImisBzjRGUzdI3raqjJxl0%3D&reserved=0
> >
> >> +int sev_post_map_gfn(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn, int *token)
> >> +{
> >> + struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
> >> + int level;
> >> +
> >> + if (!sev_snp_guest(kvm))
> >> + return 0;
> >> +
> >> + *token = srcu_read_lock(&sev->psc_srcu);
> >> +
> >> + /* If pfn is not added as private then fail */
> > This comment and the pr_err() are backwards, and confused the heck out of me.
> > snp_lookup_rmpentry() returns '1' if the pfn is assigned, a.k.a. private. That
> > means this code throws an error if the page is private, i.e. requires the page
> > to be shared. Which makes sense given the use cases, it's just incredibly
> > confusing.
> Actually I followed your recommendation from the previous feedback that
> snp_lookup_rmpentry() should return 1 for the assigned page, 0 for the
> shared and -negative for invalid. I can clarify it here? again.
>
> >> + if (snp_lookup_rmpentry(pfn, &level) == 1) {
> > Any reason not to provide e.g. rmp_is_shared() and rmp_is_private() so that
> > callers don't have to care as much about the return values? The -errno/0/1
> > semantics are all but guarantee to bite us in the rear at some point.
>
> If we look at the previous series, I had a macro rmp_is_assigned() for
> exactly the same purpose but the feedback was to drop those macros and
> return the state indirectly through the snp_lookup_rmpentry(). I can
> certainly add a helper rmp_is_{shared,private}() if it makes code more
> readable.
Ah rats. Bad communication on my side. I didn't intended to have non-RMP code
directly consume snp_lookup_rmpentry() for simple checks. The goal behind the
helper was to bury "struct rmpentry" so that it wasn't visible to the kernel at
large. I.e. my objection was that rmp_assigned() was looking directly at a
non-architectural struct.
My full thought for snp_lookup_rmpentry() was that it _could_ be consumed directly
without exposing "struct rmpentry", but that simple, common use cases would provide
wrappers, e.g.
static inline snp_is_rmp_private(...)
{
return snp_lookup_rmpentry(...) == 1;
}
static inline snp_is_rmp_shared(...)
{
return snp_lookup_rmpentry(...) < 1;
}
Side topic, what do you think about s/assigned/private for the "public" APIs, as
suggested/implied above? I actually like the terminology when talking specifically
about the RMP, but it doesn't fit the abstractions that tend to be used when talking
about these things in other contexts, e.g. in KVM.
On Wed, Oct 13, 2021, Sean Christopherson wrote:
> On Fri, Aug 20, 2021, Brijesh Singh wrote:
> > When SEV-SNP is enabled in the guest VM, the guest memory pages can
> > either be a private or shared. A write from the hypervisor goes through
> > the RMP checks. If hardware sees that hypervisor is attempting to write
> > to a guest private page, then it triggers an RMP violation #PF.
> >
> > To avoid the RMP violation, add post_{map,unmap}_gfn() ops that can be
> > used to verify that its safe to map a given guest page. Use the SRCU to
> > protect against the page state change for existing mapped pages.
>
> SRCU isn't protecting anything. The synchronize_srcu_expedited() in the PSC code
> forces it to wait for existing maps to go away, but it doesn't prevent new maps
> from being created while the actual RMP updates are in-flight. Most telling is
> that the RMP updates happen _after_ the synchronize_srcu_expedited() call.
Argh, another goof on my part. Rereading prior feedback, I see that I loosely
suggested SRCU as a possible solution. That was a bad, bad suggestion. I think
(hope) I made it offhand without really thinking it through. SRCU can't work in
this case, because the whole premise of Read-Copy-Update is that there can be
multiple copies of the data. That simply can't be true for the RMP as hardware
operates on a single table.
In the future, please don't hesitate to push back on and/or question suggestions,
especially those that are made without concrete examples, i.e. are likely off the
cuff. My goal isn't to set you up for failure :-/
On 10/13/21 1:10 PM, Sean Christopherson wrote:
> On Wed, Oct 13, 2021, Brijesh Singh wrote:
>> On 10/12/21 5:23 PM, Sean Christopherson wrote:
>>> On Fri, Aug 20, 2021, Brijesh Singh wrote:
>>>> When SEV-SNP is enabled in the guest VM, the guest memory pages can
>>>> either be a private or shared. A write from the hypervisor goes through
>>>> the RMP checks. If hardware sees that hypervisor is attempting to write
>>>> to a guest private page, then it triggers an RMP violation #PF.
>>>>
>>>> To avoid the RMP violation, add post_{map,unmap}_gfn() ops that can be
>>>> used to verify that its safe to map a given guest page. Use the SRCU to
>>>> protect against the page state change for existing mapped pages.
>>> SRCU isn't protecting anything. The synchronize_srcu_expedited() in the PSC code
>>> forces it to wait for existing maps to go away, but it doesn't prevent new maps
>>> from being created while the actual RMP updates are in-flight. Most telling is
>>> that the RMP updates happen _after_ the synchronize_srcu_expedited() call.
>>>
>>> This also doesn't handle kvm_{read,write}_guest_cached().
>> Hmm, I thought the kvm_{read_write}_guest_cached() uses the
>> copy_{to,from}_user(). Writing to the private will cause a #PF and will
>> fail the copy_to_user(). Am I missing something?
> Doh, right you are. I was thinking they cached the kmap, but it's just the
> gpa->hva that gets cached.
>
>>> I can't help but note that the private memslots idea[*] would handle this gracefully,
>>> e.g. the memslot lookup would fail, and any change in private memslots would
>>> invalidate the cache due to a generation mismatch.
>>>
>>> KSM is another mess that would Just Work.
>>>
>>> I'm not saying that SNP should be blocked on support for unmapping guest private
>>> memory, but I do think we should strongly consider focusing on that effort rather
>>> than trying to fix things piecemeal throughout KVM. I don't think it's too absurd
>>> to say that it might actually be faster overall. And I 100% think that having a
>>> cohesive design and uABI for SNP and TDX would be hugely beneficial to KVM.
>>>
>>> [*] https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flkml.kernel.org%2Fr%2F20210824005248.200037-1-seanjc%40google.com&data=04%7C01%7Cbrijesh.singh%40amd.com%7C0f1a3f5f63074b60d21b08d98e857daf%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637697526304105177%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&sdata=w6xS3DcG4fcTweC5i4%2BuB4jhn3Xcj2a44BkoATVcSgQ%3D&reserved=0
>>>
>>>> +int sev_post_map_gfn(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn, int *token)
>>>> +{
>>>> + struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
>>>> + int level;
>>>> +
>>>> + if (!sev_snp_guest(kvm))
>>>> + return 0;
>>>> +
>>>> + *token = srcu_read_lock(&sev->psc_srcu);
>>>> +
>>>> + /* If pfn is not added as private then fail */
>>> This comment and the pr_err() are backwards, and confused the heck out of me.
>>> snp_lookup_rmpentry() returns '1' if the pfn is assigned, a.k.a. private. That
>>> means this code throws an error if the page is private, i.e. requires the page
>>> to be shared. Which makes sense given the use cases, it's just incredibly
>>> confusing.
>> Actually I followed your recommendation from the previous feedback that
>> snp_lookup_rmpentry() should return 1 for the assigned page, 0 for the
>> shared and -negative for invalid. I can clarify it hereĀ again.
>>
>>>> + if (snp_lookup_rmpentry(pfn, &level) == 1) {
>>> Any reason not to provide e.g. rmp_is_shared() and rmp_is_private() so that
>>> callers don't have to care as much about the return values? The -errno/0/1
>>> semantics are all but guarantee to bite us in the rear at some point.
>> If we look at the previous series, I had a macro rmp_is_assigned() for
>> exactly the same purpose but the feedback was to drop those macros and
>> return the state indirectly through the snp_lookup_rmpentry(). I can
>> certainly add a helper rmp_is_{shared,private}() if it makes code more
>> readable.
> Ah rats. Bad communication on my side. I didn't intended to have non-RMP code
> directly consume snp_lookup_rmpentry() for simple checks. The goal behind the
> helper was to bury "struct rmpentry" so that it wasn't visible to the kernel at
> large. I.e. my objection was that rmp_assigned() was looking directly at a
> non-architectural struct.
>
> My full thought for snp_lookup_rmpentry() was that it _could_ be consumed directly
> without exposing "struct rmpentry", but that simple, common use cases would provide
> wrappers, e.g.
>
> static inline snp_is_rmp_private(...)
> {
> return snp_lookup_rmpentry(...) == 1;
> }
>
> static inline snp_is_rmp_shared(...)
> {
> return snp_lookup_rmpentry(...) < 1;
> }
Yep, that what I was going to do for the helper.
>
> Side topic, what do you think about s/assigned/private for the "public" APIs, as
> suggested/implied above? I actually like the terminology when talking specifically
> about the RMP, but it doesn't fit the abstractions that tend to be used when talking
> about these things in other contexts, e.g. in KVM.
I can float the idea to see if docs folks is okay with the changes but
generally speaking we all have been referring the assigned == private in
the Linux SNP support patch.
thanks
On 10/13/21 1:16 PM, Sean Christopherson wrote:
> On Wed, Oct 13, 2021, Sean Christopherson wrote:
>> On Fri, Aug 20, 2021, Brijesh Singh wrote:
>>> When SEV-SNP is enabled in the guest VM, the guest memory pages can
>>> either be a private or shared. A write from the hypervisor goes through
>>> the RMP checks. If hardware sees that hypervisor is attempting to write
>>> to a guest private page, then it triggers an RMP violation #PF.
>>>
>>> To avoid the RMP violation, add post_{map,unmap}_gfn() ops that can be
>>> used to verify that its safe to map a given guest page. Use the SRCU to
>>> protect against the page state change for existing mapped pages.
>> SRCU isn't protecting anything. The synchronize_srcu_expedited() in the PSC code
>> forces it to wait for existing maps to go away, but it doesn't prevent new maps
>> from being created while the actual RMP updates are in-flight. Most telling is
>> that the RMP updates happen _after_ the synchronize_srcu_expedited() call.
> Argh, another goof on my part. Rereading prior feedback, I see that I loosely
> suggested SRCU as a possible solution. That was a bad, bad suggestion. I think
> (hope) I made it offhand without really thinking it through. SRCU can't work in
> this case, because the whole premise of Read-Copy-Update is that there can be
> multiple copies of the data. That simply can't be true for the RMP as hardware
> operates on a single table.
>
> In the future, please don't hesitate to push back on and/or question suggestions,
> especially those that are made without concrete examples, i.e. are likely off the
> cuff. My goal isn't to set you up for failure :-/
What do you think about going back to my initial proposal of per-gfn
tracking [1] ? We can limit the changes to just for the kvm_vcpu_map()
and let the copy_to_user() take a fault and return an error (if it
attempt to write to guest private). If PSC happen while lock is held
then simplify return and let the guest retry PSC.
[1]
https://lore.kernel.org/lkml/[email protected]/
On Fri, Oct 15, 2021, Brijesh Singh wrote:
>
> On 10/13/21 1:16 PM, Sean Christopherson wrote:
> > On Wed, Oct 13, 2021, Sean Christopherson wrote:
> >> On Fri, Aug 20, 2021, Brijesh Singh wrote:
> >>> When SEV-SNP is enabled in the guest VM, the guest memory pages can
> >>> either be a private or shared. A write from the hypervisor goes through
> >>> the RMP checks. If hardware sees that hypervisor is attempting to write
> >>> to a guest private page, then it triggers an RMP violation #PF.
> >>>
> >>> To avoid the RMP violation, add post_{map,unmap}_gfn() ops that can be
> >>> used to verify that its safe to map a given guest page. Use the SRCU to
> >>> protect against the page state change for existing mapped pages.
> >> SRCU isn't protecting anything. The synchronize_srcu_expedited() in the PSC code
> >> forces it to wait for existing maps to go away, but it doesn't prevent new maps
> >> from being created while the actual RMP updates are in-flight. Most telling is
> >> that the RMP updates happen _after_ the synchronize_srcu_expedited() call.
> > Argh, another goof on my part. Rereading prior feedback, I see that I loosely
> > suggested SRCU as a possible solution. That was a bad, bad suggestion. I think
> > (hope) I made it offhand without really thinking it through. SRCU can't work in
> > this case, because the whole premise of Read-Copy-Update is that there can be
> > multiple copies of the data. That simply can't be true for the RMP as hardware
> > operates on a single table.
> >
> > In the future, please don't hesitate to push back on and/or question suggestions,
> > especially those that are made without concrete examples, i.e. are likely off the
> > cuff. My goal isn't to set you up for failure :-/
>
> What do you think about going back to my initial proposal of per-gfn
> tracking [1] ? We can limit the changes to just for the kvm_vcpu_map()
> and let the copy_to_user() take a fault and return an error (if it
> attempt to write to guest private). If PSC happen while lock is held
> then simplify return and let the guest retry PSC.
That approach is also broken as it doesn't hold a lock when updating host_write_track,
e.g. the count can be corrupted if writers collide, and nothing blocks writers on
in-progress readers.
I'm not opposed to a scheme that blocks PSC while KVM is reading, but I don't want
to spend time iterating on the KVM case until consensus has been reached on how
exactly RMP updates will be handled, and in general how the kernel will manage
guest private memory.
On Fri, Oct 15, 2021 at 05:16:28PM +0000, Sean Christopherson wrote:
> On Fri, Oct 15, 2021, Brijesh Singh wrote:
> >
> > On 10/13/21 1:16 PM, Sean Christopherson wrote:
> > > On Wed, Oct 13, 2021, Sean Christopherson wrote:
> > >> On Fri, Aug 20, 2021, Brijesh Singh wrote:
> > >>> When SEV-SNP is enabled in the guest VM, the guest memory pages can
> > >>> either be a private or shared. A write from the hypervisor goes through
> > >>> the RMP checks. If hardware sees that hypervisor is attempting to write
> > >>> to a guest private page, then it triggers an RMP violation #PF.
> > >>>
> > >>> To avoid the RMP violation, add post_{map,unmap}_gfn() ops that can be
> > >>> used to verify that its safe to map a given guest page. Use the SRCU to
> > >>> protect against the page state change for existing mapped pages.
> > >> SRCU isn't protecting anything. The synchronize_srcu_expedited() in the PSC code
> > >> forces it to wait for existing maps to go away, but it doesn't prevent new maps
> > >> from being created while the actual RMP updates are in-flight. Most telling is
> > >> that the RMP updates happen _after_ the synchronize_srcu_expedited() call.
> > > Argh, another goof on my part. Rereading prior feedback, I see that I loosely
> > > suggested SRCU as a possible solution. That was a bad, bad suggestion. I think
> > > (hope) I made it offhand without really thinking it through. SRCU can't work in
> > > this case, because the whole premise of Read-Copy-Update is that there can be
> > > multiple copies of the data. That simply can't be true for the RMP as hardware
> > > operates on a single table.
> > >
> > > In the future, please don't hesitate to push back on and/or question suggestions,
> > > especially those that are made without concrete examples, i.e. are likely off the
> > > cuff. My goal isn't to set you up for failure :-/
> >
> > What do you think about going back to my initial proposal of per-gfn
> > tracking [1] ? We can limit the changes to just for the kvm_vcpu_map()
> > and let the copy_to_user() take a fault and return an error (if it
> > attempt to write to guest private). If PSC happen while lock is held
> > then simplify return and let the guest retry PSC.
>
> That approach is also broken as it doesn't hold a lock when updating host_write_track,
> e.g. the count can be corrupted if writers collide, and nothing blocks writers on
> in-progress readers.
>
> I'm not opposed to a scheme that blocks PSC while KVM is reading, but I don't want
> to spend time iterating on the KVM case until consensus has been reached on how
> exactly RMP updates will be handled, and in general how the kernel will manage
> guest private memory.
Hi Sean,
(Sorry in advance for the long read, but it touches on a couple
inter-tangled topics that I think are important for this series.)
While we do still remain committed to working with community toward
implementing a UPM solution, we would still like to have a 'complete'
solution in the meantime, if at the very least to use as a basis for
building out other parts of the stack, or for enabling early adopters
downstream doing the same.
Toward that end, there are a couple areas that need to be addressed,
(and remain unaddressed with v6, since we were planning to leverage UPM
to handle them and so left them as open TODOs at the time):
1) this issue of guarding against shared->private conversions for pages
that are in-use by kernel
2) how to deal with unmapping/splitting the direct map
(These 2 things end up being fairly tightly coupled, which is why I'm
trying to cover them both in this email. Will explain more in a bit)
You mentioned elsewhere how 1) would be nicely addressed with UPM, since
the conversion would only point the guest to an entirely new page, while
leaving the shared page intact (at least until the normal notifiers do
their thing in response to any subsequent discard operations from
userspace). If the guest breaks because it doesn't see the write, that's
okay, because it's not supposed to be switching in-use shared pages to
private while they are being used. (though the pKVM folks did note in v7
of private memslot patches that they actually use the same physical page
for both shared/private, so I'm not sure if that approach will still
stack up in that context, if it ends up being needed there)
So in the context of this interim solution, we're trying to look for a
solution that's simple enough that it can be used reliably, without
introducing too much additional complexity into KVM. There is one
approach that seems to fit that bill, that Brijesh attempted in an
earlier version of this series (I'm not sure what exactly was the
catalyst to changing the approach, as I wasn't really in the loop at
the time, but AIUI there weren't any showstoppers there, but please
correct me if I'm missing anything):
- if the host is writing to a page that it thinks is supposed to be
shared, and the guest switches it to private, we get an RMP fault
(actually, we will get a !PRESENT fault, since as of v5 we now
remove the mapping from the directmap as part of conversion)
- in the host #PF handler, if we see that the page is marked private
in the RMP table, simply switch it back to shared
- if this was a bug on the part of the host, then the guest will see
the validated bit was cleared, and get a #VC where it can
terminate accordingly
- if this was a bug (or maliciousness) on the part of the guest,
then the behavior is unexpected anyway, and it will be made
aware this by the same mechanism
We've done some testing with this approach and it seems to hold up,
but since we also need the handler to deal with the !PRESENT case
(because of the current approach of nuking directmap entry for
private pages), there's a situation that can't be easily resolved
with this approach:
# CASE1: recoverable
THREAD 1 THREAD 2
- #PF(!PRESENT) - #PF(!PRESENT)
- restore directmap
- RMPUPDATE(shared)
- not private? okay to retry since maybe it was
fixed up already
# CASE2: unrecoverable
THREAD 1
- broken kernel breaks directmap/init-mm, not related to SEV
- #PF(!PRESENT)
- not private? we should oops, since retrying might cause a loop
The reason we can't distinguish between recoverable/unrecoverable is
because the kernel #PF handling here needs to happen for both !PRESENT
and for RMP violations. This is due to how we unmap from directmap as
part of shared->private conversion.
So we have to take the pessimistic approach due to CASE2, which
means that if CASE1 pops up, we'll still have a case where guest can
cause a crash/loop.
That where 2) comes into play. If we go back to the original approach
(as of v4) of simply splitting the directmap when we do shared->private
conversions, then we can limit the #PF handling to only have to deal
with cases where there's an RMP violation and X86_PF_RMP is set, which
makes it safe to retry to handle spurious cases like case #1.
AIUI, this is still sort of an open question, but you noted how nuking
the directmap without any formalized interface for letting the kernel
know about it could be problematic down the road, which also sounds
like the sort of thing more suited for having UPM address at a more
general level, since there are similar requirements for TDX as well.
AIUI there are 2 main arguments against splitting the directmap:
a) we can't easily rebuild it atm
b) things like KSM might still tries to access private pages
But unmapping also suffers from a), since we still end up splitting the
directmap unless we remove pages in blocks of 2M. But nothing prevents
a guest from switching a single 4K page to private, in which case we
are forced to split. That would be normal behavior on the part of the
guest for setting up GHCB pages/etc, so we still end up splitting the
directmap over time.
And for b), as you noted, this is already something that SEV/SEV-ES
need to deal with, and at least in the case of SNP guest things are
a bit better since:
- if some kernel thread tried to write to private memory, they
would notice if some errant kernel thread tried to write to guest
memory, since #PF handler with flip the page and unset the
validated bit on that memory.
- for reads, they will get ciphertext, which isn't any worse than
reading unencrypted guest memory for anything outside of KVM that
specifically knows what it is expecting to read there (KSM being
one exception, since it'll waste cycles cmp'ing ciphertext, but
there's no way to make it work anyway, and we would be fine with
requiring it to be disabled if that is a concern)
So that's sort of the context for why we're considering these 2 changes.
Any input on these is welcome, but at the very least wanted to provide
our rationale for past reviewers.
Thanks!
-Mike
On Thu, Sep 08, 2022 at 04:21:14PM -0500, Michael Roth wrote:
> On Fri, Oct 15, 2021 at 05:16:28PM +0000, Sean Christopherson wrote:
> > On Fri, Oct 15, 2021, Brijesh Singh wrote:
> > >
> > > On 10/13/21 1:16 PM, Sean Christopherson wrote:
> > > > On Wed, Oct 13, 2021, Sean Christopherson wrote:
> > > >> On Fri, Aug 20, 2021, Brijesh Singh wrote:
> > > >>> When SEV-SNP is enabled in the guest VM, the guest memory pages can
> > > >>> either be a private or shared. A write from the hypervisor goes through
> > > >>> the RMP checks. If hardware sees that hypervisor is attempting to write
> > > >>> to a guest private page, then it triggers an RMP violation #PF.
> > > >>>
> > > >>> To avoid the RMP violation, add post_{map,unmap}_gfn() ops that can be
> > > >>> used to verify that its safe to map a given guest page. Use the SRCU to
> > > >>> protect against the page state change for existing mapped pages.
> > > >> SRCU isn't protecting anything. The synchronize_srcu_expedited() in the PSC code
> > > >> forces it to wait for existing maps to go away, but it doesn't prevent new maps
> > > >> from being created while the actual RMP updates are in-flight. Most telling is
> > > >> that the RMP updates happen _after_ the synchronize_srcu_expedited() call.
> > > > Argh, another goof on my part. Rereading prior feedback, I see that I loosely
> > > > suggested SRCU as a possible solution. That was a bad, bad suggestion. I think
> > > > (hope) I made it offhand without really thinking it through. SRCU can't work in
> > > > this case, because the whole premise of Read-Copy-Update is that there can be
> > > > multiple copies of the data. That simply can't be true for the RMP as hardware
> > > > operates on a single table.
> > > >
> > > > In the future, please don't hesitate to push back on and/or question suggestions,
> > > > especially those that are made without concrete examples, i.e. are likely off the
> > > > cuff. My goal isn't to set you up for failure :-/
> > >
> > > What do you think about going back to my initial proposal of per-gfn
> > > tracking [1] ? We can limit the changes to just for the kvm_vcpu_map()
> > > and let the copy_to_user() take a fault and return an error (if it
> > > attempt to write to guest private). If PSC happen while lock is held
> > > then simplify return and let the guest retry PSC.
> >
> > That approach is also broken as it doesn't hold a lock when updating host_write_track,
> > e.g. the count can be corrupted if writers collide, and nothing blocks writers on
> > in-progress readers.
> >
> > I'm not opposed to a scheme that blocks PSC while KVM is reading, but I don't want
> > to spend time iterating on the KVM case until consensus has been reached on how
> > exactly RMP updates will be handled, and in general how the kernel will manage
> > guest private memory.
>
> Hi Sean,
>
> (Sorry in advance for the long read, but it touches on a couple
> inter-tangled topics that I think are important for this series.)
>
> While we do still remain committed to working with community toward
> implementing a UPM solution, we would still like to have a 'complete'
> solution in the meantime, if at the very least to use as a basis for
> building out other parts of the stack, or for enabling early adopters
> downstream doing the same.
>
> Toward that end, there are a couple areas that need to be addressed,
> (and remain unaddressed with v6, since we were planning to leverage UPM
> to handle them and so left them as open TODOs at the time):
>
> 1) this issue of guarding against shared->private conversions for pages
> that are in-use by kernel
> 2) how to deal with unmapping/splitting the direct map
>
> (These 2 things end up being fairly tightly coupled, which is why I'm
> trying to cover them both in this email. Will explain more in a bit)
>
> You mentioned elsewhere how 1) would be nicely addressed with UPM, since
> the conversion would only point the guest to an entirely new page, while
> leaving the shared page intact (at least until the normal notifiers do
> their thing in response to any subsequent discard operations from
> userspace). If the guest breaks because it doesn't see the write, that's
> okay, because it's not supposed to be switching in-use shared pages to
> private while they are being used. (though the pKVM folks did note in v7
> of private memslot patches that they actually use the same physical page
> for both shared/private, so I'm not sure if that approach will still
> stack up in that context, if it ends up being needed there)
>
> So in the context of this interim solution, we're trying to look for a
> solution that's simple enough that it can be used reliably, without
> introducing too much additional complexity into KVM. There is one
> approach that seems to fit that bill, that Brijesh attempted in an
> earlier version of this series (I'm not sure what exactly was the
> catalyst to changing the approach, as I wasn't really in the loop at
> the time, but AIUI there weren't any showstoppers there, but please
> correct me if I'm missing anything):
>
> - if the host is writing to a page that it thinks is supposed to be
> shared, and the guest switches it to private, we get an RMP fault
> (actually, we will get a !PRESENT fault, since as of v5 we now
> remove the mapping from the directmap as part of conversion)
> - in the host #PF handler, if we see that the page is marked private
> in the RMP table, simply switch it back to shared
> - if this was a bug on the part of the host, then the guest will see
> the validated bit was cleared, and get a #VC where it can
> terminate accordingly
> - if this was a bug (or maliciousness) on the part of the guest,
> then the behavior is unexpected anyway, and it will be made
> aware this by the same mechanism
>
> We've done some testing with this approach and it seems to hold up,
> but since we also need the handler to deal with the !PRESENT case
> (because of the current approach of nuking directmap entry for
> private pages), there's a situation that can't be easily resolved
> with this approach:
>
> # CASE1: recoverable
>
> THREAD 1 THREAD 2
> - #PF(!PRESENT) - #PF(!PRESENT)
> - restore directmap
> - RMPUPDATE(shared)
> - not private? okay to retry since maybe it was
> fixed up already
>
> # CASE2: unrecoverable
>
> THREAD 1
> - broken kernel breaks directmap/init-mm, not related to SEV
> - #PF(!PRESENT)
> - not private? we should oops, since retrying might cause a loop
>
> The reason we can't distinguish between recoverable/unrecoverable is
> because the kernel #PF handling here needs to happen for both !PRESENT
> and for RMP violations. This is due to how we unmap from directmap as
> part of shared->private conversion.
>
> So we have to take the pessimistic approach due to CASE2, which
> means that if CASE1 pops up, we'll still have a case where guest can
> cause a crash/loop.
>
> That where 2) comes into play. If we go back to the original approach
> (as of v4) of simply splitting the directmap when we do shared->private
> conversions, then we can limit the #PF handling to only have to deal
> with cases where there's an RMP violation and X86_PF_RMP is set, which
> makes it safe to retry to handle spurious cases like case #1.
>
> AIUI, this is still sort of an open question, but you noted how nuking
> the directmap without any formalized interface for letting the kernel
> know about it could be problematic down the road, which also sounds
> like the sort of thing more suited for having UPM address at a more
> general level, since there are similar requirements for TDX as well.
>
> AIUI there are 2 main arguments against splitting the directmap:
> a) we can't easily rebuild it atm
> b) things like KSM might still tries to access private pages
>
> But unmapping also suffers from a), since we still end up splitting the
> directmap unless we remove pages in blocks of 2M. But nothing prevents
> a guest from switching a single 4K page to private, in which case we
> are forced to split. That would be normal behavior on the part of the
> guest for setting up GHCB pages/etc, so we still end up splitting the
> directmap over time.
One more thing to note here, is with the splitting approach, we also
have the option of doing all the splitting when the region is registered
via KVM_MEMORY_ENCRYPT_REG_REGION. If unsplitting becomes possible in
the future, we could then handle that in KVM_MEMORY_ENCRYPT_UNREG_REGION
all at once which, where we'd have a bit more flexibility for going
about that.
-Mike
>
> And for b), as you noted, this is already something that SEV/SEV-ES
> need to deal with, and at least in the case of SNP guest things are
> a bit better since:
>
> - if some kernel thread tried to write to private memory, they
> would notice if some errant kernel thread tried to write to guest
> memory, since #PF handler with flip the page and unset the
> validated bit on that memory.
>
> - for reads, they will get ciphertext, which isn't any worse than
> reading unencrypted guest memory for anything outside of KVM that
> specifically knows what it is expecting to read there (KSM being
> one exception, since it'll waste cycles cmp'ing ciphertext, but
> there's no way to make it work anyway, and we would be fine with
> requiring it to be disabled if that is a concern)
>
> So that's sort of the context for why we're considering these 2 changes.
> Any input on these is welcome, but at the very least wanted to provide
> our rationale for past reviewers.
>
> Thanks!
>
> -Mike
On Thu, Sep 08, 2022, Michael Roth wrote:
> On Fri, Oct 15, 2021 at 05:16:28PM +0000, Sean Christopherson wrote:
> So in the context of this interim solution, we're trying to look for a
> solution that's simple enough that it can be used reliably, without
> introducing too much additional complexity into KVM. There is one
> approach that seems to fit that bill, that Brijesh attempted in an
> earlier version of this series (I'm not sure what exactly was the
> catalyst to changing the approach, as I wasn't really in the loop at
> the time, but AIUI there weren't any showstoppers there, but please
> correct me if I'm missing anything):
>
> - if the host is writing to a page that it thinks is supposed to be
> shared, and the guest switches it to private, we get an RMP fault
> (actually, we will get a !PRESENT fault, since as of v5 we now
> remove the mapping from the directmap as part of conversion)
> - in the host #PF handler, if we see that the page is marked private
> in the RMP table, simply switch it back to shared
> - if this was a bug on the part of the host, then the guest will see
As discussed off-list, attempting to fix up RMP violations in the host #PF handler
is not a viable approach. There was also extensive discussion on-list a while back:
https://lore.kernel.org/all/[email protected]
> AIUI, this is still sort of an open question, but you noted how nuking
> the directmap without any formalized interface for letting the kernel
> know about it could be problematic down the road, which also sounds
> like the sort of thing more suited for having UPM address at a more
> general level, since there are similar requirements for TDX as well.
>
> AIUI there are 2 main arguments against splitting the directmap:
> a) we can't easily rebuild it atm
> b) things like KSM might still tries to access private pages
>
> But unmapping also suffers from a), since we still end up splitting the
> directmap unless we remove pages in blocks of 2M.
But for UPM, it's easy to rebuild the direct map since there will be an explicit,
kernel controlled point where the "inaccesible" memfd releases the private page.
> But nothing prevents a guest from switching a single 4K page to private, in
> which case we are forced to split. That would be normal behavior on the part
> of the guest for setting up GHCB pages/etc, so we still end up splitting the
> directmap over time.
The host actually isn't _forced_ to split with UPM. One option would be to refuse
to split the direct map and instead force userspace to eat the 2mb allocation even
though it only wants to map a single 4kb chunk into the guest. I don't know that
that's a _good_ option, but it is an option.
On Wed, Sep 14, 2022 at 9:05 AM Sean Christopherson <[email protected]> wrote:
>
> On Thu, Sep 08, 2022, Michael Roth wrote:
> > On Fri, Oct 15, 2021 at 05:16:28PM +0000, Sean Christopherson wrote:
> > So in the context of this interim solution, we're trying to look for a
> > solution that's simple enough that it can be used reliably, without
> > introducing too much additional complexity into KVM. There is one
> > approach that seems to fit that bill, that Brijesh attempted in an
> > earlier version of this series (I'm not sure what exactly was the
> > catalyst to changing the approach, as I wasn't really in the loop at
> > the time, but AIUI there weren't any showstoppers there, but please
> > correct me if I'm missing anything):
> >
> > - if the host is writing to a page that it thinks is supposed to be
> > shared, and the guest switches it to private, we get an RMP fault
> > (actually, we will get a !PRESENT fault, since as of v5 we now
> > remove the mapping from the directmap as part of conversion)
> > - in the host #PF handler, if we see that the page is marked private
> > in the RMP table, simply switch it back to shared
> > - if this was a bug on the part of the host, then the guest will see
>
> As discussed off-list, attempting to fix up RMP violations in the host #PF handler
> is not a viable approach. There was also extensive discussion on-list a while back:
>
> https://lore.kernel.org/all/[email protected]
I mentioned this during Mike's talk at the micro-conference: For pages
mapped in by the kernel can we disallow them to be converted to
private? Note, userspace accesses are already handled by UPM.
In pseudo-code, I'm thinking something like this:
kmap_helper() {
// And all other interfaces where the kernel can map a GPA
// into the kernel page tables
mapped_into_kernel_mem_set[hpa] = true;
}
kunmap_helper() {
// And all other interfaces where the kernel can unmap a GPA
// into the kernel page tables
mapped_into_kernel_mem_set[hpa] = false;
// Except it's not this simple because we probably need ref counting
// for multiple mappings. Sigh. But you get the idea.
}
rmpupdate_helper() {
if (conversion = SHARED_TO_PRIVATE && mapped_into_kernel_mem_set[hpa])
return -EINVAL; // Or whatever the appropriate error code here is.
}
On Wed, Sep 14, 2022, Marc Orr wrote:
> On Wed, Sep 14, 2022 at 9:05 AM Sean Christopherson <[email protected]> wrote:
> >
> > On Thu, Sep 08, 2022, Michael Roth wrote:
> > > On Fri, Oct 15, 2021 at 05:16:28PM +0000, Sean Christopherson wrote:
> > > So in the context of this interim solution, we're trying to look for a
> > > solution that's simple enough that it can be used reliably, without
> > > introducing too much additional complexity into KVM. There is one
> > > approach that seems to fit that bill, that Brijesh attempted in an
> > > earlier version of this series (I'm not sure what exactly was the
> > > catalyst to changing the approach, as I wasn't really in the loop at
> > > the time, but AIUI there weren't any showstoppers there, but please
> > > correct me if I'm missing anything):
> > >
> > > - if the host is writing to a page that it thinks is supposed to be
> > > shared, and the guest switches it to private, we get an RMP fault
> > > (actually, we will get a !PRESENT fault, since as of v5 we now
> > > remove the mapping from the directmap as part of conversion)
> > > - in the host #PF handler, if we see that the page is marked private
> > > in the RMP table, simply switch it back to shared
> > > - if this was a bug on the part of the host, then the guest will see
> >
> > As discussed off-list, attempting to fix up RMP violations in the host #PF handler
> > is not a viable approach. There was also extensive discussion on-list a while back:
> >
> > https://lore.kernel.org/all/[email protected]
>
> I mentioned this during Mike's talk at the micro-conference: For pages
> mapped in by the kernel can we disallow them to be converted to
> private?
In theory, yes. Do we want to do something like this? No. kmap() does something
vaguely similar for 32-bit PAE/PSE kernels, but that's a lot of complexity and
overhead to take on. And this issue goes far beyond a kmap(); when the kernel gup()s
a page, the kernel expects the pfn to be available, no exceptions (pun intended).
> Note, userspace accesses are already handled by UPM.
I'm confused by the UPM comment. Isn't the gist of this thread about the ability
to merge SNP _without_ UPM? Or am I out in left field?
> In pseudo-code, I'm thinking something like this:
>
> kmap_helper() {
> // And all other interfaces where the kernel can map a GPA
> // into the kernel page tables
> mapped_into_kernel_mem_set[hpa] = true;
> }
>
> kunmap_helper() {
> // And all other interfaces where the kernel can unmap a GPA
> // into the kernel page tables
> mapped_into_kernel_mem_set[hpa] = false;
>
> // Except it's not this simple because we probably need ref counting
> // for multiple mappings. Sigh. But you get the idea.
A few issues off the top of my head:
- It's not just refcounting, there would also likely need to be locking to
guarantee sane behavior.
- kmap() isn't allowed to fail and RMPUPDATE isn't strictly guaranteed to succeed,
which is problematic if the kernel attempts to kmap() a page that's already
private, especially for kmap_atomic(), which isn't allowed to sleep.
- Not all kernel code is well behaved and bounces through kmap(); undoubtedly
some of the 1200+ users of page_address() will be problematic.
$ git grep page_address | wc -l
1267
- It's not sufficient for TDX. Merging something this complicated when we know
we still need UPM would be irresponsible from a maintenance perspective.
- KVM would need to support two separate APIs for SNP, which I very much don't
want to do.
On Wed, Sep 14, 2022 at 5:15 PM Sean Christopherson <[email protected]> wrote:
>
> On Wed, Sep 14, 2022, Marc Orr wrote:
> > On Wed, Sep 14, 2022 at 9:05 AM Sean Christopherson <[email protected]> wrote:
> > >
> > > On Thu, Sep 08, 2022, Michael Roth wrote:
> > > > On Fri, Oct 15, 2021 at 05:16:28PM +0000, Sean Christopherson wrote:
> > > > So in the context of this interim solution, we're trying to look for a
> > > > solution that's simple enough that it can be used reliably, without
> > > > introducing too much additional complexity into KVM. There is one
> > > > approach that seems to fit that bill, that Brijesh attempted in an
> > > > earlier version of this series (I'm not sure what exactly was the
> > > > catalyst to changing the approach, as I wasn't really in the loop at
> > > > the time, but AIUI there weren't any showstoppers there, but please
> > > > correct me if I'm missing anything):
> > > >
> > > > - if the host is writing to a page that it thinks is supposed to be
> > > > shared, and the guest switches it to private, we get an RMP fault
> > > > (actually, we will get a !PRESENT fault, since as of v5 we now
> > > > remove the mapping from the directmap as part of conversion)
> > > > - in the host #PF handler, if we see that the page is marked private
> > > > in the RMP table, simply switch it back to shared
> > > > - if this was a bug on the part of the host, then the guest will see
> > >
> > > As discussed off-list, attempting to fix up RMP violations in the host #PF handler
> > > is not a viable approach. There was also extensive discussion on-list a while back:
> > >
> > > https://lore.kernel.org/all/[email protected]
> >
> > I mentioned this during Mike's talk at the micro-conference: For pages
> > mapped in by the kernel can we disallow them to be converted to
> > private?
>
> In theory, yes. Do we want to do something like this? No. kmap() does something
> vaguely similar for 32-bit PAE/PSE kernels, but that's a lot of complexity and
> overhead to take on. And this issue goes far beyond a kmap(); when the kernel gup()s
> a page, the kernel expects the pfn to be available, no exceptions (pun intended).
>
> > Note, userspace accesses are already handled by UPM.
>
> I'm confused by the UPM comment. Isn't the gist of this thread about the ability
> to merge SNP _without_ UPM? Or am I out in left field?
I think that was the overall gist: yes. But it's not what I was trying
to comment on :-).
HOWEVER, thinking about this more: I was confused when I wrote out my
last reply. I had thought that the issue that Michael brought up
applied even with UPM. That is, I was thinking it was still possibly
for a guest to maliciously convert a page to private mapped in by the
kernel and assumed to be shared.
But I now realize that is not what will actually happen. To be
concrete, let's assume the GHCB page. What will happen is:
- KVM has GHCB page mapped in. GHCB is always assumed to be shared. So
far so good.
- Malicious guest converts GHCB page to private (e.g., via Page State
Change request)
- Guest exits to KVM
- KVM exits to userspace VMM
- Userspace VM allocates page in private FD.
Now, what happens here depends on how UPM works. If we allow double
allocation then our host kernel is safe. However, now we have the
"double allocation problem".
If on the other hand, we deallocate the page in the shared FD, the
host kernel can segfault. And now we actually do have essentially the
same problem Michael was describing that we have without UPM. Because
we'll end up in fault.c in the kernel context and likely panic the
host.
I hope I got this right this time. Sorry for the confusion on my last reply.
> > In pseudo-code, I'm thinking something like this:
> >
> > kmap_helper() {
> > // And all other interfaces where the kernel can map a GPA
> > // into the kernel page tables
> > mapped_into_kernel_mem_set[hpa] = true;
> > }
> >
> > kunmap_helper() {
> > // And all other interfaces where the kernel can unmap a GPA
> > // into the kernel page tables
> > mapped_into_kernel_mem_set[hpa] = false;
> >
> > // Except it's not this simple because we probably need ref counting
> > // for multiple mappings. Sigh. But you get the idea.
>
> A few issues off the top of my head:
>
> - It's not just refcounting, there would also likely need to be locking to
> guarantee sane behavior.
> - kmap() isn't allowed to fail and RMPUPDATE isn't strictly guaranteed to succeed,
> which is problematic if the kernel attempts to kmap() a page that's already
> private, especially for kmap_atomic(), which isn't allowed to sleep.
> - Not all kernel code is well behaved and bounces through kmap(); undoubtedly
> some of the 1200+ users of page_address() will be problematic.
>
> $ git grep page_address | wc -l
> 1267
> - It's not sufficient for TDX. Merging something this complicated when we know
> we still need UPM would be irresponsible from a maintenance perspective.
> - KVM would need to support two separate APIs for SNP, which I very much don't
> want to do.
Ack on merging without UPM. I wasn't trying to chime in on merging
before/after UPM. See my other comment above. Sorry for the confusion.
Ack on other concerns about "enlightening kmap" as well. I agree.
On Wed, Sep 14, 2022 at 5:32 PM Marc Orr <[email protected]> wrote:
>
> On Wed, Sep 14, 2022 at 5:15 PM Sean Christopherson <[email protected]> wrote:
> >
> > On Wed, Sep 14, 2022, Marc Orr wrote:
> > > On Wed, Sep 14, 2022 at 9:05 AM Sean Christopherson <[email protected]> wrote:
> > > >
> > > > On Thu, Sep 08, 2022, Michael Roth wrote:
> > > > > On Fri, Oct 15, 2021 at 05:16:28PM +0000, Sean Christopherson wrote:
> > > > > So in the context of this interim solution, we're trying to look for a
> > > > > solution that's simple enough that it can be used reliably, without
> > > > > introducing too much additional complexity into KVM. There is one
> > > > > approach that seems to fit that bill, that Brijesh attempted in an
> > > > > earlier version of this series (I'm not sure what exactly was the
> > > > > catalyst to changing the approach, as I wasn't really in the loop at
> > > > > the time, but AIUI there weren't any showstoppers there, but please
> > > > > correct me if I'm missing anything):
> > > > >
> > > > > - if the host is writing to a page that it thinks is supposed to be
> > > > > shared, and the guest switches it to private, we get an RMP fault
> > > > > (actually, we will get a !PRESENT fault, since as of v5 we now
> > > > > remove the mapping from the directmap as part of conversion)
> > > > > - in the host #PF handler, if we see that the page is marked private
> > > > > in the RMP table, simply switch it back to shared
> > > > > - if this was a bug on the part of the host, then the guest will see
> > > >
> > > > As discussed off-list, attempting to fix up RMP violations in the host #PF handler
> > > > is not a viable approach. There was also extensive discussion on-list a while back:
> > > >
> > > > https://lore.kernel.org/all/[email protected]
> > >
> > > I mentioned this during Mike's talk at the micro-conference: For pages
> > > mapped in by the kernel can we disallow them to be converted to
> > > private?
> >
> > In theory, yes. Do we want to do something like this? No. kmap() does something
> > vaguely similar for 32-bit PAE/PSE kernels, but that's a lot of complexity and
> > overhead to take on. And this issue goes far beyond a kmap(); when the kernel gup()s
> > a page, the kernel expects the pfn to be available, no exceptions (pun intended).
> >
> > > Note, userspace accesses are already handled by UPM.
> >
> > I'm confused by the UPM comment. Isn't the gist of this thread about the ability
> > to merge SNP _without_ UPM? Or am I out in left field?
>
> I think that was the overall gist: yes. But it's not what I was trying
> to comment on :-).
>
> HOWEVER, thinking about this more: I was confused when I wrote out my
> last reply. I had thought that the issue that Michael brought up
> applied even with UPM. That is, I was thinking it was still possibly
> for a guest to maliciously convert a page to private mapped in by the
> kernel and assumed to be shared.
>
> But I now realize that is not what will actually happen. To be
> concrete, let's assume the GHCB page. What will happen is:
> - KVM has GHCB page mapped in. GHCB is always assumed to be shared. So
> far so good.
> - Malicious guest converts GHCB page to private (e.g., via Page State
> Change request)
> - Guest exits to KVM
> - KVM exits to userspace VMM
> - Userspace VM allocates page in private FD.
>
> Now, what happens here depends on how UPM works. If we allow double
> allocation then our host kernel is safe. However, now we have the
> "double allocation problem".
>
> If on the other hand, we deallocate the page in the shared FD, the
> host kernel can segfault. And now we actually do have essentially the
> same problem Michael was describing that we have without UPM. Because
> we'll end up in fault.c in the kernel context and likely panic the
> host.
Thinking about this even more... Even if we deallocate in the
userspace VMM's shared FD, the kernel has its own page tables --
right? So maybe we are actually 100% OK under UPM then regardless of
the userspace VMM's policy around managing the private and shared FDs.