LinuxLists.cc - Re: [RFC Part2 PATCH 07/30] mm: add support to split the large THP based on RMP violation

2021-03-25 14:50:09

Subject: Re: [RFC Part2 PATCH 07/30] mm: add support to split the large THP based on RMP violation

On 3/24/21 10:04 AM, Brijesh Singh wrote:
> When SEV-SNP is enabled globally in the system, a write from the hypervisor
> can raise an RMP violation. We can resolve the RMP violation by splitting
> the virtual address to a lower page level.
>
> e.g
> - guest made a page shared in the RMP entry so that the hypervisor
> can write to it.
> - the hypervisor has mapped the pfn as a large page. A write access
> will cause an RMP violation if one of the pages within the 2MB region
> is a guest private page.
>
> The above RMP violation can be resolved by simply splitting the large
> page.

What if the large page is provided by hugetlbfs?

What if the kernel uses the direct map to access the page instead of the
userspace mapping?

> The architecture specific code will read the RMP entry to determine
> if the fault can be resolved by splitting and propagating the request
> to split the page by setting newly introduced fault flag
> (FAULT_FLAG_PAGE_SPLIT). If the fault cannot be resolved by splitting,
> then a SIGBUS signal is sent to terminate the process.

Are users just supposed to know what memory types are compatible with
SEV-SNP? Basically, don't use anything that might map a guest using
non-4k entries, except THP?

This does seem like a rather nasty aspect of the hardware. For
everything else, if the virtualization page tables and the x86 tables
disagree, the TLB just sees the smallest page size.

> diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
> index 7605e06a6dd9..f6571563f433 100644
> --- a/arch/x86/mm/fault.c
> +++ b/arch/x86/mm/fault.c
> @@ -1305,6 +1305,70 @@ do_kern_addr_fault(struct pt_regs *regs, unsigned long hw_error_code,
> }
> NOKPROBE_SYMBOL(do_kern_addr_fault);
>
> +#define RMP_FAULT_RETRY 0
> +#define RMP_FAULT_KILL 1
> +#define RMP_FAULT_PAGE_SPLIT 2
> +
> +static inline size_t pages_per_hpage(int level)
> +{
> + return page_level_size(level) / PAGE_SIZE;
> +}
> +
> +/*
> + * The RMP fault can happen when a hypervisor attempts to write to:
> + * 1. a guest owned page or
> + * 2. any pages in the large page is a guest owned page.
> + *
> + * #1 will happen only when a process or VMM is attempting to modify the guest page
> + * without the guests cooperation. If a guest wants a VMM to be able to write to its memory
> + * then it should make the page shared. If we detect #1, kill the process because we can not
> + * resolve the fault.
> + *
> + * #2 can happen when the page level does not match between the RMP entry and x86
> + * page table walk, e.g the page is mapped as a large page in the x86 page table but its
> + * added as a 4K shared page in the RMP entry. This can be resolved by splitting the address
> + * into a smaller page level.
> + */

These comments need to get wrapped a bit sooner. Could you try to match
some of the others in the file?

> +static int handle_rmp_page_fault(unsigned long hw_error_code, unsigned long address)
> +{
> + unsigned long pfn, mask;
> + int rmp_level, level;
> + rmpentry_t *e;
> + pte_t *pte;
> +
> + /* Get the native page level */
> + pte = lookup_address_in_mm(current->mm, address, &level);
> + if (unlikely(!pte))
> + return RMP_FAULT_KILL;
> +
> + pfn = pte_pfn(*pte);
> + if (level > PG_LEVEL_4K) {
> + mask = pages_per_hpage(level) - pages_per_hpage(level - 1);
> + pfn |= (address >> PAGE_SHIFT) & mask;
> + }

What is this trying to do, exactly?

> + /* Get the page level from the RMP entry. */
> + e = lookup_page_in_rmptable(pfn_to_page(pfn), &rmp_level);
> + if (!e) {
> + pr_alert("SEV-SNP: failed to lookup RMP entry for address 0x%lx pfn 0x%lx\n",
> + address, pfn);
> + return RMP_FAULT_KILL;
> + }
> +
> + /* Its a guest owned page */
> + if (rmpentry_assigned(e))
> + return RMP_FAULT_KILL;
> +
> + /*
> + * Its a shared page but the page level does not match between the native walk
> + * and RMP entry.
> + */

For these two-line comments, please try to split the text fairly evenly
between the lines.

> + if (level > rmp_level)
> + return RMP_FAULT_PAGE_SPLIT;
> +
> + return RMP_FAULT_RETRY;
> +}
> +
> /* Handle faults in the user portion of the address space */
> static inline
> void do_user_addr_fault(struct pt_regs *regs,
> @@ -1315,6 +1379,7 @@ void do_user_addr_fault(struct pt_regs *regs,
> struct task_struct *tsk;
> struct mm_struct *mm;
> vm_fault_t fault;
> + int ret;
> unsigned int flags = FAULT_FLAG_DEFAULT;
>
> tsk = current;
> @@ -1377,6 +1442,22 @@ void do_user_addr_fault(struct pt_regs *regs,
> if (hw_error_code & X86_PF_INSTR)
> flags |= FAULT_FLAG_INSTRUCTION;
>
> + /*
> + * If its an RMP violation, see if we can resolve it.
> + */
> + if ((hw_error_code & X86_PF_RMP)) {
> + ret = handle_rmp_page_fault(hw_error_code, address);
> + if (ret == RMP_FAULT_PAGE_SPLIT) {
> + flags |= FAULT_FLAG_PAGE_SPLIT;
> + } else if (ret == RMP_FAULT_KILL) {
> + fault |= VM_FAULT_SIGBUS;
> + mm_fault_error(regs, hw_error_code, address, fault);
> + return;
> + } else {
> + return;
> + }
> + }
> +
> #ifdef CONFIG_X86_64
> /*
> * Faults in the vsyscall page might need emulation. The
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index ecdf8a8cd6ae..1be3218f3738 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -434,6 +434,8 @@ extern pgprot_t protection_map[16];
> * @FAULT_FLAG_REMOTE: The fault is not for current task/mm.
> * @FAULT_FLAG_INSTRUCTION: The fault was during an instruction fetch.
> * @FAULT_FLAG_INTERRUPTIBLE: The fault can be interrupted by non-fatal signals.
> + * @FAULT_FLAG_PAGE_SPLIT: The fault was due page size mismatch, split the region to smaller
> + * page size and retry.
> *
> * About @FAULT_FLAG_ALLOW_RETRY and @FAULT_FLAG_TRIED: we can specify
> * whether we would allow page faults to retry by specifying these two
> @@ -464,6 +466,7 @@ extern pgprot_t protection_map[16];
> #define FAULT_FLAG_REMOTE 0x80
> #define FAULT_FLAG_INSTRUCTION 0x100
> #define FAULT_FLAG_INTERRUPTIBLE 0x200
> +#define FAULT_FLAG_PAGE_SPLIT 0x400
>
> /*
> * The default fault flags that should be used by most of the
> @@ -501,7 +504,8 @@ static inline bool fault_flag_allow_retry_first(unsigned int flags)
> { FAULT_FLAG_USER, "USER" }, \
> { FAULT_FLAG_REMOTE, "REMOTE" }, \
> { FAULT_FLAG_INSTRUCTION, "INSTRUCTION" }, \
> - { FAULT_FLAG_INTERRUPTIBLE, "INTERRUPTIBLE" }
> + { FAULT_FLAG_INTERRUPTIBLE, "INTERRUPTIBLE" }, \
> + { FAULT_FLAG_PAGE_SPLIT, "PAGESPLIT" }
>
> /*
> * vm_fault is filled by the pagefault handler and passed to the vma's
> diff --git a/mm/memory.c b/mm/memory.c
> index feff48e1465a..c9dcf9b30719 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -4427,6 +4427,12 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
> return 0;
> }
>
> +static int handle_split_page_fault(struct vm_fault *vmf)
> +{
> + __split_huge_pmd(vmf->vma, vmf->pmd, vmf->address, false, NULL);
> + return 0;
> +}

Wait a sec, I thought this could fail. Where's the "failed to split"
path? Why does this even return an error code if it's always 0?

> /*
> * By the time we get here, we already hold the mm semaphore
> *
> @@ -4448,6 +4454,7 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,
> pgd_t *pgd;
> p4d_t *p4d;
> vm_fault_t ret;
> + int split_page = flags & FAULT_FLAG_PAGE_SPLIT;
>
> pgd = pgd_offset(mm, address);
> p4d = p4d_alloc(mm, pgd, address);
> @@ -4504,6 +4511,10 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,
> pmd_migration_entry_wait(mm, vmf.pmd);
> return 0;
> }
> +
> + if (split_page)
> + return handle_split_page_fault(&vmf);
> +
> if (pmd_trans_huge(orig_pmd) || pmd_devmap(orig_pmd)) {
> if (pmd_protnone(orig_pmd) && vma_is_accessible(vma))
> return do_huge_pmd_numa_page(&vmf, orig_pmd);

Is there a reason for the 'split_page' variable? It seems like a waste
of space.

2021-03-25 15:25:27

by Brijesh Singh

[permalink] [raw]

Subject: Re: [RFC Part2 PATCH 07/30] mm: add support to split the large THP based on RMP violation

On 3/25/21 9:48 AM, Dave Hansen wrote:
> On 3/24/21 10:04 AM, Brijesh Singh wrote:
>> When SEV-SNP is enabled globally in the system, a write from the hypervisor
>> can raise an RMP violation. We can resolve the RMP violation by splitting
>> the virtual address to a lower page level.
>>
>> e.g
>> - guest made a page shared in the RMP entry so that the hypervisor
>> can write to it.
>> - the hypervisor has mapped the pfn as a large page. A write access
>> will cause an RMP violation if one of the pages within the 2MB region
>> is a guest private page.
>>
>> The above RMP violation can be resolved by simply splitting the large
>> page.
> What if the large page is provided by hugetlbfs?

I was not able to find a method to split the large pages in the
hugetlbfs. Unfortunately, at this time a VMM cannot use the backing
memory from the hugetlbfs pool. An SEV-SNP aware VMM can use either
transparent hugepage or small pages.

>
> What if the kernel uses the direct map to access the page instead of the
> userspace mapping?

See the Patch 04/30. Currently, we split the kernel direct maps to 4K
before adding the page in the RMP table to avoid the need to split the
pages due to the RMP fault.

>
>> The architecture specific code will read the RMP entry to determine
>> if the fault can be resolved by splitting and propagating the request
>> to split the page by setting newly introduced fault flag
>> (FAULT_FLAG_PAGE_SPLIT). If the fault cannot be resolved by splitting,
>> then a SIGBUS signal is sent to terminate the process.
> Are users just supposed to know what memory types are compatible with
> SEV-SNP? Basically, don't use anything that might map a guest using
> non-4k entries, except THP?

Currently, VMM will need to know the compatible memory type and use it
for allocating the backing pages.

>
> This does seem like a rather nasty aspect of the hardware. For
> everything else, if the virtualization page tables and the x86 tables
> disagree, the TLB just sees the smallest page size.
>
>> diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
>> index 7605e06a6dd9..f6571563f433 100644
>> --- a/arch/x86/mm/fault.c
>> +++ b/arch/x86/mm/fault.c
>> @@ -1305,6 +1305,70 @@ do_kern_addr_fault(struct pt_regs *regs, unsigned long hw_error_code,
>> }
>> NOKPROBE_SYMBOL(do_kern_addr_fault);
>>
>> +#define RMP_FAULT_RETRY 0
>> +#define RMP_FAULT_KILL 1
>> +#define RMP_FAULT_PAGE_SPLIT 2
>> +
>> +static inline size_t pages_per_hpage(int level)
>> +{
>> + return page_level_size(level) / PAGE_SIZE;
>> +}
>> +
>> +/*
>> + * The RMP fault can happen when a hypervisor attempts to write to:
>> + * 1. a guest owned page or
>> + * 2. any pages in the large page is a guest owned page.
>> + *
>> + * #1 will happen only when a process or VMM is attempting to modify the guest page
>> + * without the guests cooperation. If a guest wants a VMM to be able to write to its memory
>> + * then it should make the page shared. If we detect #1, kill the process because we can not
>> + * resolve the fault.
>> + *
>> + * #2 can happen when the page level does not match between the RMP entry and x86
>> + * page table walk, e.g the page is mapped as a large page in the x86 page table but its
>> + * added as a 4K shared page in the RMP entry. This can be resolved by splitting the address
>> + * into a smaller page level.
>> + */
> These comments need to get wrapped a bit sooner. Could you try to match
> some of the others in the file?

Noted.

>
>> +static int handle_rmp_page_fault(unsigned long hw_error_code, unsigned long address)
>> +{
>> + unsigned long pfn, mask;
>> + int rmp_level, level;
>> + rmpentry_t *e;
>> + pte_t *pte;
>> +
>> + /* Get the native page level */
>> + pte = lookup_address_in_mm(current->mm, address, &level);
>> + if (unlikely(!pte))
>> + return RMP_FAULT_KILL;
>> +
>> + pfn = pte_pfn(*pte);
>> + if (level > PG_LEVEL_4K) {
>> + mask = pages_per_hpage(level) - pages_per_hpage(level - 1);
>> + pfn |= (address >> PAGE_SHIFT) & mask;
>> + }
> What is this trying to do, exactly?

Trying to calculate the pfn within the large entry.

The lookup above will return a base pfn of a large page. Need to find
index within the large page to calculate the PFN.

>
>> + /* Get the page level from the RMP entry. */
>> + e = lookup_page_in_rmptable(pfn_to_page(pfn), &rmp_level);
>> + if (!e) {
>> + pr_alert("SEV-SNP: failed to lookup RMP entry for address 0x%lx pfn 0x%lx\n",
>> + address, pfn);
>> + return RMP_FAULT_KILL;
>> + }
>> +
>> + /* Its a guest owned page */
>> + if (rmpentry_assigned(e))
>> + return RMP_FAULT_KILL;
>> +
>> + /*
>> + * Its a shared page but the page level does not match between the native walk
>> + * and RMP entry.
>> + */
> For these two-line comments, please try to split the text fairly evenly
> between the lines.

Noted.

>
>> + if (level > rmp_level)
>> + return RMP_FAULT_PAGE_SPLIT;
>> +
>> + return RMP_FAULT_RETRY;
>> +}
>> +
>> /* Handle faults in the user portion of the address space */
>> static inline
>> void do_user_addr_fault(struct pt_regs *regs,
>> @@ -1315,6 +1379,7 @@ void do_user_addr_fault(struct pt_regs *regs,
>> struct task_struct *tsk;
>> struct mm_struct *mm;
>> vm_fault_t fault;
>> + int ret;
>> unsigned int flags = FAULT_FLAG_DEFAULT;
>>
>> tsk = current;
>> @@ -1377,6 +1442,22 @@ void do_user_addr_fault(struct pt_regs *regs,
>> if (hw_error_code & X86_PF_INSTR)
>> flags |= FAULT_FLAG_INSTRUCTION;
>>
>> + /*
>> + * If its an RMP violation, see if we can resolve it.
>> + */
>> + if ((hw_error_code & X86_PF_RMP)) {
>> + ret = handle_rmp_page_fault(hw_error_code, address);
>> + if (ret == RMP_FAULT_PAGE_SPLIT) {
>> + flags |= FAULT_FLAG_PAGE_SPLIT;
>> + } else if (ret == RMP_FAULT_KILL) {
>> + fault |= VM_FAULT_SIGBUS;
>> + mm_fault_error(regs, hw_error_code, address, fault);
>> + return;
>> + } else {
>> + return;
>> + }
>> + }
>> +
>> #ifdef CONFIG_X86_64
>> /*
>> * Faults in the vsyscall page might need emulation. The
>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>> index ecdf8a8cd6ae..1be3218f3738 100644
>> --- a/include/linux/mm.h
>> +++ b/include/linux/mm.h
>> @@ -434,6 +434,8 @@ extern pgprot_t protection_map[16];
>> * @FAULT_FLAG_REMOTE: The fault is not for current task/mm.
>> * @FAULT_FLAG_INSTRUCTION: The fault was during an instruction fetch.
>> * @FAULT_FLAG_INTERRUPTIBLE: The fault can be interrupted by non-fatal signals.
>> + * @FAULT_FLAG_PAGE_SPLIT: The fault was due page size mismatch, split the region to smaller
>> + * page size and retry.
>> *
>> * About @FAULT_FLAG_ALLOW_RETRY and @FAULT_FLAG_TRIED: we can specify
>> * whether we would allow page faults to retry by specifying these two
>> @@ -464,6 +466,7 @@ extern pgprot_t protection_map[16];
>> #define FAULT_FLAG_REMOTE 0x80
>> #define FAULT_FLAG_INSTRUCTION 0x100
>> #define FAULT_FLAG_INTERRUPTIBLE 0x200
>> +#define FAULT_FLAG_PAGE_SPLIT 0x400
>>
>> /*
>> * The default fault flags that should be used by most of the
>> @@ -501,7 +504,8 @@ static inline bool fault_flag_allow_retry_first(unsigned int flags)
>> { FAULT_FLAG_USER, "USER" }, \
>> { FAULT_FLAG_REMOTE, "REMOTE" }, \
>> { FAULT_FLAG_INSTRUCTION, "INSTRUCTION" }, \
>> - { FAULT_FLAG_INTERRUPTIBLE, "INTERRUPTIBLE" }
>> + { FAULT_FLAG_INTERRUPTIBLE, "INTERRUPTIBLE" }, \
>> + { FAULT_FLAG_PAGE_SPLIT, "PAGESPLIT" }
>>
>> /*
>> * vm_fault is filled by the pagefault handler and passed to the vma's
>> diff --git a/mm/memory.c b/mm/memory.c
>> index feff48e1465a..c9dcf9b30719 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -4427,6 +4427,12 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
>> return 0;
>> }
>>
>> +static int handle_split_page_fault(struct vm_fault *vmf)
>> +{
>> + __split_huge_pmd(vmf->vma, vmf->pmd, vmf->address, false, NULL);
>> + return 0;
>> +}
> Wait a sec, I thought this could fail. Where's the "failed to split"
> path? Why does this even return an error code if it's always 0?

Ah, I missed it. Let me address this comment by propagating the error
code to the caller so that it can take the corrective action. In this
particular case, if we fail to split then we will not be able to resolve
the fault and will need to send SIGBUG to abort the process.

>
>> /*
>> * By the time we get here, we already hold the mm semaphore
>> *
>> @@ -4448,6 +4454,7 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,
>> pgd_t *pgd;
>> p4d_t *p4d;
>> vm_fault_t ret;
>> + int split_page = flags & FAULT_FLAG_PAGE_SPLIT;
>>
>> pgd = pgd_offset(mm, address);
>> p4d = p4d_alloc(mm, pgd, address);
>> @@ -4504,6 +4511,10 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,
>> pmd_migration_entry_wait(mm, vmf.pmd);
>> return 0;
>> }
>> +
>> + if (split_page)
>> + return handle_split_page_fault(&vmf);
>> +
>> if (pmd_trans_huge(orig_pmd) || pmd_devmap(orig_pmd)) {
>> if (pmd_protnone(orig_pmd) && vma_is_accessible(vma))
>> return do_huge_pmd_numa_page(&vmf, orig_pmd);
> Is there a reason for the 'split_page' variable? It seems like a waste
> of space.

I can remove it to save some space.

2021-03-25 16:01:23

by Dave Hansen

[permalink] [raw]

Subject: Re: [RFC Part2 PATCH 07/30] mm: add support to split the large THP based on RMP violation

On 3/25/21 8:24 AM, Brijesh Singh wrote:
> On 3/25/21 9:48 AM, Dave Hansen wrote:
>> On 3/24/21 10:04 AM, Brijesh Singh wrote:
>>> When SEV-SNP is enabled globally in the system, a write from the hypervisor
>>> can raise an RMP violation. We can resolve the RMP violation by splitting
>>> the virtual address to a lower page level.
>>>
>>> e.g
>>> - guest made a page shared in the RMP entry so that the hypervisor
>>> can write to it.
>>> - the hypervisor has mapped the pfn as a large page. A write access
>>> will cause an RMP violation if one of the pages within the 2MB region
>>> is a guest private page.
>>>
>>> The above RMP violation can be resolved by simply splitting the large
>>> page.
>> What if the large page is provided by hugetlbfs?
> I was not able to find a method to split the large pages in the
> hugetlbfs. Unfortunately, at this time a VMM cannot use the backing
> memory from the hugetlbfs pool. An SEV-SNP aware VMM can use either
> transparent hugepage or small pages.

That's really, really nasty. Especially since it might not be evident
until long after boot and the guest is killed.

It's even nastier because hugetlbfs is actually a great fit for SEV-SNP
memory. It's physically contiguous, so it would keep you from having to
fracture the direct map all the way down to 4k, it also can't be
reclaimed (just like all SEV memory).

I think the minimal thing you can do here is to fail to add memory to
the RMP in the first place if you can't split it. That way, users will
at least fail to _start_ their VM versus dying randomly for no good reason.

Even better would be to come up with a stronger contract between host
and guest. I really don't think the host should be exposed to random
RMP faults on the direct map. If the guest wants to share memory, then
it needs to tell the host and give the host an opportunity to move the
guest physical memory. It might, for instance, sequester all the shared
pages in a single spot to minimize direct map fragmentation.

I'll let the other x86 folks chime in on this, but I really think this
needs a different approach than what's being proposed.

2021-04-21 13:05:35

by Vlastimil Babka

[permalink] [raw]

Subject: Re: [RFC Part2 PATCH 07/30] mm: add support to split the large THP based on RMP violation

On 3/25/21 4:59 PM, Dave Hansen wrote:
> On 3/25/21 8:24 AM, Brijesh Singh wrote:
>> On 3/25/21 9:48 AM, Dave Hansen wrote:
>>> On 3/24/21 10:04 AM, Brijesh Singh wrote:
>>>> When SEV-SNP is enabled globally in the system, a write from the hypervisor
>>>> can raise an RMP violation. We can resolve the RMP violation by splitting
>>>> the virtual address to a lower page level.
>>>>
>>>> e.g
>>>> - guest made a page shared in the RMP entry so that the hypervisor
>>>> can write to it.
>>>> - the hypervisor has mapped the pfn as a large page. A write access
>>>> will cause an RMP violation if one of the pages within the 2MB region
>>>> is a guest private page.
>>>>
>>>> The above RMP violation can be resolved by simply splitting the large
>>>> page.
>>> What if the large page is provided by hugetlbfs?
>> I was not able to find a method to split the large pages in the
>> hugetlbfs. Unfortunately, at this time a VMM cannot use the backing
>> memory from the hugetlbfs pool. An SEV-SNP aware VMM can use either
>> transparent hugepage or small pages.
>
> That's really, really nasty. Especially since it might not be evident
> until long after boot and the guest is killed.

I'd assume a SNP-aware QEMU would be needed in the first place and thus this
QEMU would know not to use hugetlbfs?

> It's even nastier because hugetlbfs is actually a great fit for SEV-SNP
> memory. It's physically contiguous, so it would keep you from having to

Maybe this could be solvable by remapping the hugetlbfs page with pte's when
needed (a guest wants to share 4k out of 2MB with the host temporarily). But
certainly never as flexibly as pte-mapped THP's as the complexity of that
(refcounting tail pages etc) is significant.

> fracture the direct map all the way down to 4k, it also can't be
> reclaimed (just like all SEV memory).

About that... the whitepaper I've seen [1] mentions support for swapping guest
pages. I'd expect the same mechanism could be used for their migration -
scattering 4kB unmovable SEV pages around would be terrible for fragmentation. I
assume neither swap or migration support is part of the patchset(s) yet?

> I think the minimal thing you can do here is to fail to add memory to
> the RMP in the first place if you can't split it. That way, users will
> at least fail to _start_ their VM versus dying randomly for no good reason.
>
> Even better would be to come up with a stronger contract between host
> and guest. I really don't think the host should be exposed to random
> RMP faults on the direct map. If the guest wants to share memory, then
> it needs to tell the host and give the host an opportunity to move the
> guest physical memory. It might, for instance, sequester all the shared
> pages in a single spot to minimize direct map fragmentation.

Agreed, and the contract should be elaborated before going to implementation
details (patches). Could a malicious guest violate such contract unilaterally? I
guess not, because psmash is a hypervisor instruction? And if yes, the
RMP-specific page fault handlers would be used just to kill such guest, not to
fix things up during page fault.

> I'll let the other x86 folks chime in on this, but I really think this
> needs a different approach than what's being proposed.

Not an x86 folk, but agreed :)

[1]
https://www.amd.com/system/files/TechDocs/SEV-SNP-strengthening-vm-isolation-with-integrity-protection-and-more.pdf

2021-04-21 17:28:09

by Brijesh Singh

[permalink] [raw]

Subject: Re: [RFC Part2 PATCH 07/30] mm: add support to split the large THP based on RMP violation

On 4/21/21 7:59 AM, Vlastimil Babka wrote:
> On 3/25/21 4:59 PM, Dave Hansen wrote:
>> On 3/25/21 8:24 AM, Brijesh Singh wrote:
>>> On 3/25/21 9:48 AM, Dave Hansen wrote:
>>>> On 3/24/21 10:04 AM, Brijesh Singh wrote:
>>>>> When SEV-SNP is enabled globally in the system, a write from the hypervisor
>>>>> can raise an RMP violation. We can resolve the RMP violation by splitting
>>>>> the virtual address to a lower page level.
>>>>>
>>>>> e.g
>>>>> - guest made a page shared in the RMP entry so that the hypervisor
>>>>> can write to it.
>>>>> - the hypervisor has mapped the pfn as a large page. A write access
>>>>> will cause an RMP violation if one of the pages within the 2MB region
>>>>> is a guest private page.
>>>>>
>>>>> The above RMP violation can be resolved by simply splitting the large
>>>>> page.
>>>> What if the large page is provided by hugetlbfs?
>>> I was not able to find a method to split the large pages in the
>>> hugetlbfs. Unfortunately, at this time a VMM cannot use the backing
>>> memory from the hugetlbfs pool. An SEV-SNP aware VMM can use either
>>> transparent hugepage or small pages.
>> That's really, really nasty. Especially since it might not be evident
>> until long after boot and the guest is killed.
> I'd assume a SNP-aware QEMU would be needed in the first place and thus this
> QEMU would know not to use hugetlbfs?

Yes, that is correct. Qemu patches will not launch SEV-SNP guest when
hugetlbfs is used. I can also look to add the check in kernel to ensure
that backing pages does not come from the hugetlbfs so that non-QEMU VMM
will also fail to create the SNP guest.

>
>> It's even nastier because hugetlbfs is actually a great fit for SEV-SNP
>> memory. It's physically contiguous, so it would keep you from having to
> Maybe this could be solvable by remapping the hugetlbfs page with pte's when
> needed (a guest wants to share 4k out of 2MB with the host temporarily). But
> certainly never as flexibly as pte-mapped THP's as the complexity of that
> (refcounting tail pages etc) is significant.
>
>> fracture the direct map all the way down to 4k, it also can't be
>> reclaimed (just like all SEV memory).
> About that... the whitepaper I've seen [1] mentions support for swapping guest
> pages. I'd expect the same mechanism could be used for their migration -
> scattering 4kB unmovable SEV pages around would be terrible for fragmentation. I
> assume neither swap or migration support is part of the patchset(s) yet?

Yes, the patches does not support swapping guest pages yet. We want to
add the support incrementally. The swap/move can be implemented after we
have the base enabled in the kernel. Both the SEV and SNP firmware
provides PSP commands that can be used to swap the guest pages. I
believe KVM mmu notifier can use it during the page move.

>
>> I think the minimal thing you can do here is to fail to add memory to
>> the RMP in the first place if you can't split it. That way, users will
>> at least fail to _start_ their VM versus dying randomly for no good reason.
>>
>> Even better would be to come up with a stronger contract between host
>> and guest. I really don't think the host should be exposed to random
>> RMP faults on the direct map. If the guest wants to share memory, then
>> it needs to tell the host and give the host an opportunity to move the
>> guest physical memory. It might, for instance, sequester all the shared
>> pages in a single spot to minimize direct map fragmentation.
> Agreed, and the contract should be elaborated before going to implementation
> details (patches). Could a malicious guest violate such contract unilaterally? I
> guess not, because psmash is a hypervisor instruction? And if yes, the
> RMP-specific page fault handlers would be used just to kill such guest, not to
> fix things up during page fault.

The version 2 of GHCB specification defines a contract between the guest
and the host. When guest is ready to share a page with the host it
issues the page state change request to the hypervisor. Hypervisor is
responsible to add the page in the RMP table using the RMPUPDATE
instruction. The page state change request include an operation field.
The operation can be one of the following

1. Add page in RMP table (make guest page private)

2. Remove page from RMP table (make guest page shared)

3. Psmash - split the large RMP entry

4. Unmash - merge small RMP entry into large. The unmash operation
require the PSP assist.

The current RMP-specific fault handler checks if host is attempting to
write to a guest private page. If so, kill the guest. I guess it covers
the case where a malicious guest violates the contract to issue the
page-state-change.

>> I'll let the other x86 folks chime in on this, but I really think this
>> needs a different approach than what's being proposed.
> Not an x86 folk, but agreed :)
>
> [1]
> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.amd.com%2Fsystem%2Ffiles%2FTechDocs%2FSEV-SNP-strengthening-vm-isolation-with-integrity-protection-and-more.pdf&data=04%7C01%7Cbrijesh.singh%40amd.com%7C3a8c99a1738940b550af08d904c55938%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637546068243853651%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=x%2Bmtud8IxrykFCPAPgBu2CCAFO9Q26PA3OhryvlX%2BbM%3D&reserved=0