2024-04-29 10:52:09

by Dmitrii Kuvaiskii

[permalink] [raw]
Subject: [PATCH 0/2] x86/sgx: Fix two data races in EAUG/EREMOVE flows

SGX runtimes such as Gramine may implement EDMM-based lazy allocation of
enclave pages and may support MADV_DONTNEED semantics [1]. The former
implies #PF-based page allocation, and the latter implies the usage of
SGX_IOC_ENCLAVE_REMOVE_PAGES ioctl.

A trivial program like below (run under Gramine and with EDMM enabled)
stresses these two flows in the SGX driver and hangs:

/* repeatedly touch different enclave pages at random and mix with
* `madvise(MADV_DONTNEED)` to stress EAUG/EREMOVE flows */
static void* thread_func(void* arg) {
size_t num_pages = 0xA000 / page_size;
for (int i = 0; i < 5000; i++) {
size_t page = get_random_ulong() % num_pages;
char data = READ_ONCE(((char*)arg)[page * page_size]);

page = get_random_ulong() % num_pages;
madvise(arg + page * page_size, page_size, MADV_DONTNEED);
}
}

addr = mmap(NULL, 0xA000, PROT_READ | PROT_WRITE, MAP_ANONYMOUS, -1, 0);
pthread_t threads[16];
for (int i = 0; i < 16; i++)
pthread_create(&threads[i], NULL, thread_func, addr);

This program uncovers two data races in the SGX driver. The remaining
patches describe and fix these races.

I performed several stress tests to verify that there are no other data
races (at least with the test program above):

- On Icelake server with 128GB of PRMRR (EPC), without madvise(). This
stresses the first data race. A Gramine SGX test suite running in the
background for additional stressing. Result: 1,000 runs without hangs
(result without the first bug fix: hangs every time).
- On Icelake server with 128GB of PRMRR (EPC), with madvise(). This
stresses the second data race. A Gramine SGX test suite running in the
background for additional stressing. Result: 1,000 runs without hangs
(result with the first bug fix but without the second bug fix: hangs
approx. once in 50 runs).
- On Icelake server with 4GB of PRMRR (EPC), with madvise(). This
additionally stresses the enclave page swapping flows. Two Gramine SGX
test suites running in the background for additional stressing of
swapping (I observe 100% CPU utilization from ksgxd which confirms that
swapping happens). Result: 1,000 runs without hangs.

(Sorry for the previous copy of this email, accidentally sent to
[email protected]. Failed to use `--suppress-cc` during a test send.)

Dmitrii Kuvaiskii (2):
x86/sgx: Resolve EAUG race where losing thread returns SIGBUS
x86/sgx: Resolve EREMOVE page vs EAUG page data race

arch/x86/kernel/cpu/sgx/encl.c | 10 +++++++---
arch/x86/kernel/cpu/sgx/encl.h | 3 +++
arch/x86/kernel/cpu/sgx/ioctl.c | 1 +
3 files changed, 11 insertions(+), 3 deletions(-)

--
2.34.1



2024-04-29 10:52:23

by Dmitrii Kuvaiskii

[permalink] [raw]
Subject: [PATCH 1/2] x86/sgx: Resolve EAUG race where losing thread returns SIGBUS

Two enclave threads may try to access the same non-present enclave page
simultaneously (e.g., if the SGX runtime supports lazy allocation). The
threads will end up in sgx_encl_eaug_page(), racing to acquire the
enclave lock. The winning thread will perform EAUG, set up the page
table entry, and insert the page into encl->page_array. The losing
thread will then get -EBUSY on xa_insert(&encl->page_array) and proceed
to error handling path.

This error handling path contains two bugs: (1) SIGBUS is sent to
userspace even though the enclave page is correctly installed by another
thread, and (2) sgx_encl_free_epc_page() is called that performs EREMOVE
even though the enclave page was never intended to be removed. The first
bug is less severe because it impacts only the user space; the second
bug is more severe because it also impacts the OS state by ripping the
page (added by the winning thread) from the enclave.

Fix these two bugs (1) by returning VM_FAULT_NOPAGE to the generic Linux
fault handler so that no signal is sent to userspace, and (2) by
replacing sgx_encl_free_epc_page() with sgx_free_epc_page() so that no
EREMOVE is performed.

Fixes: 5a90d2c3f5ef ("x86/sgx: Support adding of pages to an initialized enclave")
Cc: [email protected]
Reported-by: Marcelina Kościelnicka <[email protected]>
Suggested-by: Reinette Chatre <[email protected]>
Signed-off-by: Dmitrii Kuvaiskii <[email protected]>
---
arch/x86/kernel/cpu/sgx/encl.c | 7 +++++--
1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/cpu/sgx/encl.c b/arch/x86/kernel/cpu/sgx/encl.c
index 279148e72459..41f14b1a3025 100644
--- a/arch/x86/kernel/cpu/sgx/encl.c
+++ b/arch/x86/kernel/cpu/sgx/encl.c
@@ -382,8 +382,11 @@ static vm_fault_t sgx_encl_eaug_page(struct vm_area_struct *vma,
* If ret == -EBUSY then page was created in another flow while
* running without encl->lock
*/
- if (ret)
+ if (ret) {
+ if (ret == -EBUSY)
+ vmret = VM_FAULT_NOPAGE;
goto err_out_shrink;
+ }

pginfo.secs = (unsigned long)sgx_get_epc_virt_addr(encl->secs.epc_page);
pginfo.addr = encl_page->desc & PAGE_MASK;
@@ -419,7 +422,7 @@ static vm_fault_t sgx_encl_eaug_page(struct vm_area_struct *vma,
err_out_shrink:
sgx_encl_shrink(encl, va_page);
err_out_epc:
- sgx_encl_free_epc_page(epc_page);
+ sgx_free_epc_page(epc_page);
err_out_unlock:
mutex_unlock(&encl->lock);
kfree(encl_page);
--
2.34.1


2024-04-29 10:52:29

by Dmitrii Kuvaiskii

[permalink] [raw]
Subject: [PATCH 2/2] x86/sgx: Resolve EREMOVE page vs EAUG page data race

Two enclave threads may try to add and remove the same enclave page
simultaneously (e.g., if the SGX runtime supports both lazy allocation
and `MADV_DONTNEED` semantics). Consider this race:

1. T1 performs page removal in sgx_encl_remove_pages() and stops right
after removing the page table entry and right before re-acquiring the
enclave lock to EREMOVE and xa_erase(&encl->page_array) the page.
2. T2 tries to access the page, and #PF[not_present] is raised. The
condition to EAUG in sgx_vma_fault() is not satisfied because the
page is still present in encl->page_array, thus the SGX driver
assumes that the fault happened because the page was swapped out. The
driver continues on a code path that installs a page table entry
*without* performing EAUG.
3. The enclave page metadata is in inconsistent state: the PTE is
installed but there was no EAUG. Thus, T2 in userspace infinitely
receives SIGSEGV on this page (and EACCEPT always fails).

Fix this by making sure that T1 (the page-removing thread) always wins
this data race. In particular, the page-being-removed is marked as such,
and T2 retries until the page is fully removed.

Fixes: 9849bb27152c ("x86/sgx: Support complete page removal")
Cc: [email protected]
Signed-off-by: Dmitrii Kuvaiskii <[email protected]>
---
arch/x86/kernel/cpu/sgx/encl.c | 3 ++-
arch/x86/kernel/cpu/sgx/encl.h | 3 +++
arch/x86/kernel/cpu/sgx/ioctl.c | 1 +
3 files changed, 6 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kernel/cpu/sgx/encl.c b/arch/x86/kernel/cpu/sgx/encl.c
index 41f14b1a3025..7ccd8b2fce5f 100644
--- a/arch/x86/kernel/cpu/sgx/encl.c
+++ b/arch/x86/kernel/cpu/sgx/encl.c
@@ -257,7 +257,8 @@ static struct sgx_encl_page *__sgx_encl_load_page(struct sgx_encl *encl,

/* Entry successfully located. */
if (entry->epc_page) {
- if (entry->desc & SGX_ENCL_PAGE_BEING_RECLAIMED)
+ if (entry->desc & (SGX_ENCL_PAGE_BEING_RECLAIMED |
+ SGX_ENCL_PAGE_BEING_REMOVED))
return ERR_PTR(-EBUSY);

return entry;
diff --git a/arch/x86/kernel/cpu/sgx/encl.h b/arch/x86/kernel/cpu/sgx/encl.h
index f94ff14c9486..fff5f2293ae7 100644
--- a/arch/x86/kernel/cpu/sgx/encl.h
+++ b/arch/x86/kernel/cpu/sgx/encl.h
@@ -25,6 +25,9 @@
/* 'desc' bit marking that the page is being reclaimed. */
#define SGX_ENCL_PAGE_BEING_RECLAIMED BIT(3)

+/* 'desc' bit marking that the page is being removed. */
+#define SGX_ENCL_PAGE_BEING_REMOVED BIT(2)
+
struct sgx_encl_page {
unsigned long desc;
unsigned long vm_max_prot_bits:8;
diff --git a/arch/x86/kernel/cpu/sgx/ioctl.c b/arch/x86/kernel/cpu/sgx/ioctl.c
index b65ab214bdf5..c542d4dd3e64 100644
--- a/arch/x86/kernel/cpu/sgx/ioctl.c
+++ b/arch/x86/kernel/cpu/sgx/ioctl.c
@@ -1142,6 +1142,7 @@ static long sgx_encl_remove_pages(struct sgx_encl *encl,
* Do not keep encl->lock because of dependency on
* mmap_lock acquired in sgx_zap_enclave_ptes().
*/
+ entry->desc |= SGX_ENCL_PAGE_BEING_REMOVED;
mutex_unlock(&encl->lock);

sgx_zap_enclave_ptes(encl, addr);
--
2.34.1


2024-04-29 13:08:06

by Jarkko Sakkinen

[permalink] [raw]
Subject: Re: [PATCH 0/2] x86/sgx: Fix two data races in EAUG/EREMOVE flows

On Mon Apr 29, 2024 at 1:43 PM EEST, Dmitrii Kuvaiskii wrote:
> SGX runtimes such as Gramine may implement EDMM-based lazy allocation of
> enclave pages and may support MADV_DONTNEED semantics [1]. The former
> implies #PF-based page allocation, and the latter implies the usage of
> SGX_IOC_ENCLAVE_REMOVE_PAGES ioctl.
>
> A trivial program like below (run under Gramine and with EDMM enabled)
> stresses these two flows in the SGX driver and hangs:
>
> /* repeatedly touch different enclave pages at random and mix with
> * `madvise(MADV_DONTNEED)` to stress EAUG/EREMOVE flows */
> static void* thread_func(void* arg) {
> size_t num_pages = 0xA000 / page_size;
> for (int i = 0; i < 5000; i++) {
> size_t page = get_random_ulong() % num_pages;
> char data = READ_ONCE(((char*)arg)[page * page_size]);
>
> page = get_random_ulong() % num_pages;
> madvise(arg + page * page_size, page_size, MADV_DONTNEED);
> }
> }
>
> addr = mmap(NULL, 0xA000, PROT_READ | PROT_WRITE, MAP_ANONYMOUS, -1, 0);
> pthread_t threads[16];
> for (int i = 0; i < 16; i++)
> pthread_create(&threads[i], NULL, thread_func, addr);

I'm not convinced that kernel is the problem here but it could be also
how Gramine is implemented.

So maybe you could make a better case of that. The example looks a bit
artificial to me.

>
> This program uncovers two data races in the SGX driver. The remaining
> patches describe and fix these races.
>
> I performed several stress tests to verify that there are no other data
> races (at least with the test program above):
>
> - On Icelake server with 128GB of PRMRR (EPC), without madvise(). This
> stresses the first data race. A Gramine SGX test suite running in the
> background for additional stressing. Result: 1,000 runs without hangs
> (result without the first bug fix: hangs every time).
> - On Icelake server with 128GB of PRMRR (EPC), with madvise(). This
> stresses the second data race. A Gramine SGX test suite running in the
> background for additional stressing. Result: 1,000 runs without hangs
> (result with the first bug fix but without the second bug fix: hangs
> approx. once in 50 runs).
> - On Icelake server with 4GB of PRMRR (EPC), with madvise(). This
> additionally stresses the enclave page swapping flows. Two Gramine SGX
> test suites running in the background for additional stressing of
> swapping (I observe 100% CPU utilization from ksgxd which confirms that
> swapping happens). Result: 1,000 runs without hangs.
>
> (Sorry for the previous copy of this email, accidentally sent to
> [email protected]. Failed to use `--suppress-cc` during a test send.)
>
> Dmitrii Kuvaiskii (2):
> x86/sgx: Resolve EAUG race where losing thread returns SIGBUS
> x86/sgx: Resolve EREMOVE page vs EAUG page data race
>
> arch/x86/kernel/cpu/sgx/encl.c | 10 +++++++---
> arch/x86/kernel/cpu/sgx/encl.h | 3 +++
> arch/x86/kernel/cpu/sgx/ioctl.c | 1 +
> 3 files changed, 11 insertions(+), 3 deletions(-)

BR, Jarkko

2024-04-29 13:19:26

by Jarkko Sakkinen

[permalink] [raw]
Subject: Re: [PATCH 2/2] x86/sgx: Resolve EREMOVE page vs EAUG page data race

On Mon Apr 29, 2024 at 1:43 PM EEST, Dmitrii Kuvaiskii wrote:
> Two enclave threads may try to add and remove the same enclave page
> simultaneously (e.g., if the SGX runtime supports both lazy allocation
> and `MADV_DONTNEED` semantics). Consider this race:
>
> 1. T1 performs page removal in sgx_encl_remove_pages() and stops right
> after removing the page table entry and right before re-acquiring the
> enclave lock to EREMOVE and xa_erase(&encl->page_array) the page.
> 2. T2 tries to access the page, and #PF[not_present] is raised. The
> condition to EAUG in sgx_vma_fault() is not satisfied because the
> page is still present in encl->page_array, thus the SGX driver
> assumes that the fault happened because the page was swapped out. The
> driver continues on a code path that installs a page table entry
> *without* performing EAUG.
> 3. The enclave page metadata is in inconsistent state: the PTE is
> installed but there was no EAUG. Thus, T2 in userspace infinitely
> receives SIGSEGV on this page (and EACCEPT always fails).
>
> Fix this by making sure that T1 (the page-removing thread) always wins
> this data race. In particular, the page-being-removed is marked as such,
> and T2 retries until the page is fully removed.
>
> Fixes: 9849bb27152c ("x86/sgx: Support complete page removal")
> Cc: [email protected]
> Signed-off-by: Dmitrii Kuvaiskii <[email protected]>
> ---
> arch/x86/kernel/cpu/sgx/encl.c | 3 ++-
> arch/x86/kernel/cpu/sgx/encl.h | 3 +++
> arch/x86/kernel/cpu/sgx/ioctl.c | 1 +
> 3 files changed, 6 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/kernel/cpu/sgx/encl.c b/arch/x86/kernel/cpu/sgx/encl.c
> index 41f14b1a3025..7ccd8b2fce5f 100644
> --- a/arch/x86/kernel/cpu/sgx/encl.c
> +++ b/arch/x86/kernel/cpu/sgx/encl.c
> @@ -257,7 +257,8 @@ static struct sgx_encl_page *__sgx_encl_load_page(struct sgx_encl *encl,
>
> /* Entry successfully located. */
> if (entry->epc_page) {
> - if (entry->desc & SGX_ENCL_PAGE_BEING_RECLAIMED)
> + if (entry->desc & (SGX_ENCL_PAGE_BEING_RECLAIMED |
> + SGX_ENCL_PAGE_BEING_REMOVED))
> return ERR_PTR(-EBUSY);
>
> return entry;
> diff --git a/arch/x86/kernel/cpu/sgx/encl.h b/arch/x86/kernel/cpu/sgx/encl.h
> index f94ff14c9486..fff5f2293ae7 100644
> --- a/arch/x86/kernel/cpu/sgx/encl.h
> +++ b/arch/x86/kernel/cpu/sgx/encl.h
> @@ -25,6 +25,9 @@
> /* 'desc' bit marking that the page is being reclaimed. */
> #define SGX_ENCL_PAGE_BEING_RECLAIMED BIT(3)
>
> +/* 'desc' bit marking that the page is being removed. */
> +#define SGX_ENCL_PAGE_BEING_REMOVED BIT(2)
> +
> struct sgx_encl_page {
> unsigned long desc;
> unsigned long vm_max_prot_bits:8;
> diff --git a/arch/x86/kernel/cpu/sgx/ioctl.c b/arch/x86/kernel/cpu/sgx/ioctl.c
> index b65ab214bdf5..c542d4dd3e64 100644
> --- a/arch/x86/kernel/cpu/sgx/ioctl.c
> +++ b/arch/x86/kernel/cpu/sgx/ioctl.c
> @@ -1142,6 +1142,7 @@ static long sgx_encl_remove_pages(struct sgx_encl *encl,
> * Do not keep encl->lock because of dependency on
> * mmap_lock acquired in sgx_zap_enclave_ptes().
> */
> + entry->desc |= SGX_ENCL_PAGE_BEING_REMOVED;
> mutex_unlock(&encl->lock);
>
> sgx_zap_enclave_ptes(encl, addr);

It is somewhat trivial to NAK this as the commit message does
not do any effort describing the new flag. By default at least
I have strong opposition against any new flags related to
reclaiming even if it needs a bit of extra synchronization
work in the user space.

One way to describe concurrency scenarios would be to take
example from https://www.kernel.org/doc/Documentation/memory-barriers.txt

I.e. see the examples with CPU 1 and CPU 2.

BR, Jarkko

2024-04-29 13:31:58

by Jarkko Sakkinen

[permalink] [raw]
Subject: Re: [PATCH 1/2] x86/sgx: Resolve EAUG race where losing thread returns SIGBUS

On Mon Apr 29, 2024 at 4:04 PM EEST, Jarkko Sakkinen wrote:
> > Fix these two bugs (1) by returning VM_FAULT_NOPAGE to the generic Linux
> > fault handler so that no signal is sent to userspace, and (2) by
> > replacing sgx_encl_free_epc_page() with sgx_free_epc_page() so that no
> > EREMOVE is performed.
>
> What is the collateral damage caused by ENCLS[EREMOVE]?

Have you measured cost of eremove on an empty page?

I tried to lookup for a thread from lore because I have a faint memory
that it was concluded that its cost irrelevant. Please correct if I'm
wrong.

BR, Jarkko

2024-04-29 13:33:29

by Jarkko Sakkinen

[permalink] [raw]
Subject: Re: [PATCH 1/2] x86/sgx: Resolve EAUG race where losing thread returns SIGBUS

On Mon Apr 29, 2024 at 4:22 PM EEST, Jarkko Sakkinen wrote:
> On Mon Apr 29, 2024 at 4:04 PM EEST, Jarkko Sakkinen wrote:
> > > Fix these two bugs (1) by returning VM_FAULT_NOPAGE to the generic Linux
> > > fault handler so that no signal is sent to userspace, and (2) by
> > > replacing sgx_encl_free_epc_page() with sgx_free_epc_page() so that no
> > > EREMOVE is performed.
> >
> > What is the collateral damage caused by ENCLS[EREMOVE]?
>
> Have you measured cost of eremove on an empty page?
>
> I tried to lookup for a thread from lore because I have a faint memory
> that it was concluded that its cost irrelevant. Please correct if I'm
> wrong.

Also pseudocode for EREMOVE supports this as it just returns without
actually doing anything.

BR, Jarkko

2024-04-29 13:38:25

by Jarkko Sakkinen

[permalink] [raw]
Subject: Re: [PATCH 1/2] x86/sgx: Resolve EAUG race where losing thread returns SIGBUS

On Mon Apr 29, 2024 at 1:43 PM EEST, Dmitrii Kuvaiskii wrote:
> Two enclave threads may try to access the same non-present enclave page
> simultaneously (e.g., if the SGX runtime supports lazy allocation). The
> threads will end up in sgx_encl_eaug_page(), racing to acquire the
> enclave lock. The winning thread will perform EAUG, set up the page
> table entry, and insert the page into encl->page_array. The losing
> thread will then get -EBUSY on xa_insert(&encl->page_array) and proceed
> to error handling path.

And that path removes page. Not sure I got gist of this tbh.

> This error handling path contains two bugs: (1) SIGBUS is sent to
> userspace even though the enclave page is correctly installed by another
> thread, and (2) sgx_encl_free_epc_page() is called that performs EREMOVE
> even though the enclave page was never intended to be removed. The first
> bug is less severe because it impacts only the user space; the second
> bug is more severe because it also impacts the OS state by ripping the
> page (added by the winning thread) from the enclave.
>
> Fix these two bugs (1) by returning VM_FAULT_NOPAGE to the generic Linux
> fault handler so that no signal is sent to userspace, and (2) by
> replacing sgx_encl_free_epc_page() with sgx_free_epc_page() so that no
> EREMOVE is performed.

What is the collateral damage caused by ENCLS[EREMOVE]?

>
> Fixes: 5a90d2c3f5ef ("x86/sgx: Support adding of pages to an initialized enclave")
> Cc: [email protected]
> Reported-by: Marcelina Kościelnicka <[email protected]>
> Suggested-by: Reinette Chatre <[email protected]>
> Signed-off-by: Dmitrii Kuvaiskii <[email protected]>
> ---
> arch/x86/kernel/cpu/sgx/encl.c | 7 +++++--
> 1 file changed, 5 insertions(+), 2 deletions(-)
>
> diff --git a/arch/x86/kernel/cpu/sgx/encl.c b/arch/x86/kernel/cpu/sgx/encl.c
> index 279148e72459..41f14b1a3025 100644
> --- a/arch/x86/kernel/cpu/sgx/encl.c
> +++ b/arch/x86/kernel/cpu/sgx/encl.c
> @@ -382,8 +382,11 @@ static vm_fault_t sgx_encl_eaug_page(struct vm_area_struct *vma,
> * If ret == -EBUSY then page was created in another flow while
> * running without encl->lock
> */
> - if (ret)
> + if (ret) {
> + if (ret == -EBUSY)
> + vmret = VM_FAULT_NOPAGE;
> goto err_out_shrink;
> + }
>
> pginfo.secs = (unsigned long)sgx_get_epc_virt_addr(encl->secs.epc_page);
> pginfo.addr = encl_page->desc & PAGE_MASK;
> @@ -419,7 +422,7 @@ static vm_fault_t sgx_encl_eaug_page(struct vm_area_struct *vma,
> err_out_shrink:
> sgx_encl_shrink(encl, va_page);
> err_out_epc:
> - sgx_encl_free_epc_page(epc_page);
> + sgx_free_epc_page(epc_page);

This ignores check for the page being reclaimer tracked, i.e. it does
changes that have been ignored in the commit message.

> err_out_unlock:
> mutex_unlock(&encl->lock);
> kfree(encl_page);


BR, Jarkko

2024-04-30 14:46:22

by Dmitrii Kuvaiskii

[permalink] [raw]
Subject: Re: [PATCH 0/2] x86/sgx: Fix two data races in EAUG/EREMOVE flows

On Mon, Apr 29, 2024 at 04:06:39PM +0300, Jarkko Sakkinen wrote:
> On Mon Apr 29, 2024 at 1:43 PM EEST, Dmitrii Kuvaiskii wrote:
> > SGX runtimes such as Gramine may implement EDMM-based lazy allocation of
> > enclave pages and may support MADV_DONTNEED semantics [1]. The former
> > implies #PF-based page allocation, and the latter implies the usage of
> > SGX_IOC_ENCLAVE_REMOVE_PAGES ioctl.
> >
> > A trivial program like below (run under Gramine and with EDMM enabled)
> > stresses these two flows in the SGX driver and hangs:
> >
> > /* repeatedly touch different enclave pages at random and mix with
> > * `madvise(MADV_DONTNEED)` to stress EAUG/EREMOVE flows */
> > static void* thread_func(void* arg) {
> > size_t num_pages = 0xA000 / page_size;
> > for (int i = 0; i < 5000; i++) {
> > size_t page = get_random_ulong() % num_pages;
> > char data = READ_ONCE(((char*)arg)[page * page_size]);
> >
> > page = get_random_ulong() % num_pages;
> > madvise(arg + page * page_size, page_size, MADV_DONTNEED);
> > }
> > }
> >
> > addr = mmap(NULL, 0xA000, PROT_READ | PROT_WRITE, MAP_ANONYMOUS, -1, 0);
> > pthread_t threads[16];
> > for (int i = 0; i < 16; i++)
> > pthread_create(&threads[i], NULL, thread_func, addr);
>
> I'm not convinced that kernel is the problem here but it could be also
> how Gramine is implemented.
>
> So maybe you could make a better case of that. The example looks a bit
> artificial to me.

I believe that these are the bugs in the kernel (in the SGX driver). I
provided more detailed descriptions of the races and ensuing bugs in the
other two replies, please check them.

The example is a stress test written to debug very infrequent hangs of
real-world applications that are run with Gramine, EDMM, and two
optimizations (lazy allocation and MADV_DONTNEED semantics). We observed
hangs of Node.js, PyTorch, R, iperf, Blender, Nginx. To root cause these
hangs, we wrote this artificial stress test. This test succeeds on vanilla
Linux, so ideally it should also pass on Gramine.

Please also note that the optimizations of lazy allocation and
MADV_DONTNEED provide significant performance improvement for some
workloads that run on Gramine. For example, a Java workload with a 16GB
enclave size has approx. 57x improvement in total runtime. Thus, we
consider it important to permit these optimizations in Gramine, which IIUC
requires bug fixes in the SGX driver.

You can find more info at
https://github.com/gramineproject/gramine/pull/1513.

Which parts do you consider artificial, and how could I modify the stress
test?

--
Dmitrii Kuvaiskii

2024-04-30 14:47:55

by Dmitrii Kuvaiskii

[permalink] [raw]
Subject: Re: [PATCH 2/2] x86/sgx: Resolve EREMOVE page vs EAUG page data race

On Mon, Apr 29, 2024 at 04:11:03PM +0300, Jarkko Sakkinen wrote:
> On Mon Apr 29, 2024 at 1:43 PM EEST, Dmitrii Kuvaiskii wrote:
> > Two enclave threads may try to add and remove the same enclave page
> > simultaneously (e.g., if the SGX runtime supports both lazy allocation
> > and `MADV_DONTNEED` semantics). Consider this race:
> >
> > 1. T1 performs page removal in sgx_encl_remove_pages() and stops right
> > after removing the page table entry and right before re-acquiring the
> > enclave lock to EREMOVE and xa_erase(&encl->page_array) the page.
> > 2. T2 tries to access the page, and #PF[not_present] is raised. The
> > condition to EAUG in sgx_vma_fault() is not satisfied because the
> > page is still present in encl->page_array, thus the SGX driver
> > assumes that the fault happened because the page was swapped out. The
> > driver continues on a code path that installs a page table entry
> > *without* performing EAUG.
> > 3. The enclave page metadata is in inconsistent state: the PTE is
> > installed but there was no EAUG. Thus, T2 in userspace infinitely
> > receives SIGSEGV on this page (and EACCEPT always fails).
> >
> > Fix this by making sure that T1 (the page-removing thread) always wins
> > this data race. In particular, the page-being-removed is marked as such,
> > and T2 retries until the page is fully removed.
> >
> > Fixes: 9849bb27152c ("x86/sgx: Support complete page removal")
> > Cc: [email protected]
> > Signed-off-by: Dmitrii Kuvaiskii <[email protected]>
> > ---
> > arch/x86/kernel/cpu/sgx/encl.c | 3 ++-
> > arch/x86/kernel/cpu/sgx/encl.h | 3 +++
> > arch/x86/kernel/cpu/sgx/ioctl.c | 1 +
> > 3 files changed, 6 insertions(+), 1 deletion(-)
> >
> > diff --git a/arch/x86/kernel/cpu/sgx/encl.c b/arch/x86/kernel/cpu/sgx/encl.c
> > index 41f14b1a3025..7ccd8b2fce5f 100644
> > --- a/arch/x86/kernel/cpu/sgx/encl.c
> > +++ b/arch/x86/kernel/cpu/sgx/encl.c
> > @@ -257,7 +257,8 @@ static struct sgx_encl_page *__sgx_encl_load_page(struct sgx_encl *encl,
> >
> > /* Entry successfully located. */
> > if (entry->epc_page) {
> > - if (entry->desc & SGX_ENCL_PAGE_BEING_RECLAIMED)
> > + if (entry->desc & (SGX_ENCL_PAGE_BEING_RECLAIMED |
> > + SGX_ENCL_PAGE_BEING_REMOVED))
> > return ERR_PTR(-EBUSY);
> >
> > return entry;
> > diff --git a/arch/x86/kernel/cpu/sgx/encl.h b/arch/x86/kernel/cpu/sgx/encl.h
> > index f94ff14c9486..fff5f2293ae7 100644
> > --- a/arch/x86/kernel/cpu/sgx/encl.h
> > +++ b/arch/x86/kernel/cpu/sgx/encl.h
> > @@ -25,6 +25,9 @@
> > /* 'desc' bit marking that the page is being reclaimed. */
> > #define SGX_ENCL_PAGE_BEING_RECLAIMED BIT(3)
> >
> > +/* 'desc' bit marking that the page is being removed. */
> > +#define SGX_ENCL_PAGE_BEING_REMOVED BIT(2)
> > +
> > struct sgx_encl_page {
> > unsigned long desc;
> > unsigned long vm_max_prot_bits:8;
> > diff --git a/arch/x86/kernel/cpu/sgx/ioctl.c b/arch/x86/kernel/cpu/sgx/ioctl.c
> > index b65ab214bdf5..c542d4dd3e64 100644
> > --- a/arch/x86/kernel/cpu/sgx/ioctl.c
> > +++ b/arch/x86/kernel/cpu/sgx/ioctl.c
> > @@ -1142,6 +1142,7 @@ static long sgx_encl_remove_pages(struct sgx_encl *encl,
> > * Do not keep encl->lock because of dependency on
> > * mmap_lock acquired in sgx_zap_enclave_ptes().
> > */
> > + entry->desc |= SGX_ENCL_PAGE_BEING_REMOVED;
> > mutex_unlock(&encl->lock);
> >
> > sgx_zap_enclave_ptes(encl, addr);
>
> It is somewhat trivial to NAK this as the commit message does
> not do any effort describing the new flag. By default at least
> I have strong opposition against any new flags related to
> reclaiming even if it needs a bit of extra synchronization
> work in the user space.
>
> One way to describe concurrency scenarios would be to take
> example from https://www.kernel.org/doc/Documentation/memory-barriers.txt
>
> I.e. see the examples with CPU 1 and CPU 2.

Thank you for the suggestion. Here is my new attempt at describing the racy
scenario:

Consider some enclave page added to the enclave. User space decides to
temporarily remove this page (e.g., emulating the MADV_DONTNEED semantics)
on CPU1. At the same time, user space performs a memory access on the same
page on CPU2, which results in a #PF and ultimately in sgx_vma_fault().
Scenario proceeds as follows:

/*
* CPU1: User space performs
* ioctl(SGX_IOC_ENCLAVE_REMOVE_PAGES)
* on a single enclave page
*/
sgx_encl_remove_pages() {

mutex_lock(&encl->lock);

entry = sgx_encl_load_page(encl);
/*
* verify that page is
* trimmed and accepted
*/

mutex_unlock(&encl->lock);

/*
* remove PTE entry; cannot
* be performed under lock
*/
sgx_zap_enclave_ptes(encl);
/*
* Fault on CPU2
*/
sgx_vma_fault() {
/*
* PTE entry was removed, but the
* page is still in enclave's xarray
*/
xa_load(&encl->page_array) != NULL ->
/*
* SGX driver thinks that this page
* was swapped out and loads it
*/
mutex_lock(&encl->lock);
/*
* this is effectively a no-op
*/
entry = sgx_encl_load_page_in_vma();
/*
* add PTE entry
*/
vmf_insert_pfn(...);

mutex_unlock(&encl->lock);
return VM_FAULT_NOPAGE;
}
/*
* continue with page removal
*/
mutex_lock(&encl->lock);

sgx_encl_free_epc_page(epc_page) {
/*
* remove page via EREMOVE
*/
/*
* free EPC page
*/
sgx_free_epc_page(epc_page);
}

xa_erase(&encl->page_array);

mutex_unlock(&encl->lock);
}

CPU1 removed the page. However CPU2 installed the PTE entry on the
same page. This enclave page becomes perpetually inaccessible (until
another SGX_IOC_ENCLAVE_REMOVE_PAGES ioctl). This is because the page is
marked accessible in the PTE entry but is not EAUGed. Because of this
combination, any subsequent access to this page raises a fault, and the #PF
handler sees the SGX bit set in the #PF error code and does not call
sgx_vma_fault() but instead raises a SIGSEGV. The userspace SIGSEGV handler
cannot perform EACCEPT because the page was not EAUGed. Thus, the user
space is stuck with the inaccessible page.

This race can be fixed by forcing the fault handler on CPU2 to back off if
the page is currently being removed (on CPU1). Thus a simple change is to
introduce a new flag SGX_ENCL_PAGE_BEING_REMOVED, which is unset by default
and set only right-before the first mutex_unlock() in
sgx_encl_remove_pages(). Upon loading the page, CPU2 checks whether this
page is being removed, and if yes then CPU2 backs off and waits until the
page is completely removed. After that, any memory access to this page
results in a normal "allocate and EAUG a page on #PF" flow.

--
Dmitrii Kuvaiskii

2024-04-30 14:48:03

by Dmitrii Kuvaiskii

[permalink] [raw]
Subject: Re: [PATCH 1/2] x86/sgx: Resolve EAUG race where losing thread returns SIGBUS

On Mon, Apr 29, 2024 at 04:04:24PM +0300, Jarkko Sakkinen wrote:
> On Mon Apr 29, 2024 at 1:43 PM EEST, Dmitrii Kuvaiskii wrote:
> > Two enclave threads may try to access the same non-present enclave page
> > simultaneously (e.g., if the SGX runtime supports lazy allocation). The
> > threads will end up in sgx_encl_eaug_page(), racing to acquire the
> > enclave lock. The winning thread will perform EAUG, set up the page
> > table entry, and insert the page into encl->page_array. The losing
> > thread will then get -EBUSY on xa_insert(&encl->page_array) and proceed
> > to error handling path.
>
> And that path removes page. Not sure I got gist of this tbh.

Well, this is not about a redundant EREMOVE performed. This is about the
enclave page becoming inaccessible due to a bug triggered with a data race.

Consider some enclave page not yet added to the enclave. The enclave
performs a memory access to it at the same time on CPU1 and CPU2. Since the
page does not yet have a corresponding PTE, the #PF handler on both CPUs
calls sgx_vma_fault(). Scenario proceeds as follows:

/*
* Fault on CPU1
*/
sgx_vma_fault() {

xa_load(&encl->page_array) == NULL ->

sgx_encl_eaug_page() {

... /*
* Fault on CPU2
*/
sgx_vma_fault() {

xa_load(&encl->page_array) == NULL ->

sgx_encl_eaug_page() {

...

mutex_lock(&encl->lock);
/*
* alloc encl_page
*/
/*
* alloc EPC page
*/
epc_page = sgx_alloc_epc_page(...);
/*
* add page_to enclave's xarray
*/
xa_insert(&encl->page_array, ...);
/*
* add page to enclave via EAUG
* (page is in pending state)
*/
/*
* add PTE entry
*/
vmf_insert_pfn(...);

mutex_unlock(&encl->lock);
return VM_FAULT_NOPAGE;
}
}
mutex_lock(&encl->lock);
/*
* alloc encl_page
*/
/*
* alloc EPC page
*/
epc_page = sgx_alloc_epc_page(...);
/*
* add page_to enclave's xarray,
* this fails with -EBUSY
*/
xa_insert(&encl->page_array, ...);

err_out_shrink:
sgx_encl_free_epc_page(epc_page) {
/*
* remove page via EREMOVE
*/
/*
* free EPC page
*/
sgx_free_epc_page(epc_page);
}

mutex_unlock(&encl->lock);
return VM_FAULT_SIGBUS;
}
}

CPU2 added the enclave page (in pending state) to the enclave and installed
the PTE. The kernel gives control back to the user space, without raising a
signal. The user space on CPU2 retries the memory access and induces a page
fault, but now with the SGX bit set in the #PF error code. The #PF handler
calls do_user_addr_fault(), which calls access_error() and ultimately
raises a SIGSEGV. The userspace SIGSEGV handler is supposed to perform
EACCEPT, after which point the enclave page becomes accessible.

CPU1 however jumps to the error handling path because the page was already
inserted into the enclave's xarray. This error handling path EREMOVEs the
page and also raises a SIGBUS signal to user space. The PTE entry is not
removed.

After CPU1 performs EREMOVE, this enclave page becomes perpetually
inaccessible (until an SGX_IOC_ENCLAVE_REMOVE_PAGES ioctl). This is because
the page is marked accessible in the PTE entry but is not EAUGed. Because
of this combination, the #PF handler sees the SGX bit set in the #PF error
code and does not call sgx_vma_fault() but instead raises a SIGSEGV. The
userspace SIGSEGV handler cannot perform EACCEPT because the page was not
EAUGed. Thus, the user space is stuck with the inaccessible page.

Also note that in the scenario, CPU1 raises a SIGBUS signal to user space
unnecessarily. This signal is spurious because a page-access retry on CPU2
will also raise the SIGBUS signal. That said, this side effect is less
severe because it affects only user space. Therefore, it could be
circumvented in user space alone, but it seems reasonable to fix it in this
patch.

> > This error handling path contains two bugs: (1) SIGBUS is sent to
> > userspace even though the enclave page is correctly installed by another
> > thread, and (2) sgx_encl_free_epc_page() is called that performs EREMOVE
> > even though the enclave page was never intended to be removed. The first
> > bug is less severe because it impacts only the user space; the second
> > bug is more severe because it also impacts the OS state by ripping the
> > page (added by the winning thread) from the enclave.
> >
> > Fix these two bugs (1) by returning VM_FAULT_NOPAGE to the generic Linux
> > fault handler so that no signal is sent to userspace, and (2) by
> > replacing sgx_encl_free_epc_page() with sgx_free_epc_page() so that no
> > EREMOVE is performed.
>
> What is the collateral damage caused by ENCLS[EREMOVE]?

As explained above, the damage is that the SGX driver leaves the enclave
page metadata in an inconsistent state: on the one hand, the PTE entry is
installed which forces the generic Linux fault handler to raise SIGSEGV,
and on the other hand, the page is not in a correct state to be EACCEPTed
(i.e., EAUG was not performed on this page).

> > Fixes: 5a90d2c3f5ef ("x86/sgx: Support adding of pages to an initialized enclave")
> > Cc: [email protected]
> > Reported-by: Marcelina Kościelnicka <[email protected]>
> > Suggested-by: Reinette Chatre <[email protected]>
> > Signed-off-by: Dmitrii Kuvaiskii <[email protected]>
> > ---
> > arch/x86/kernel/cpu/sgx/encl.c | 7 +++++--
> > 1 file changed, 5 insertions(+), 2 deletions(-)
> >
> > diff --git a/arch/x86/kernel/cpu/sgx/encl.c b/arch/x86/kernel/cpu/sgx/encl.c
> > index 279148e72459..41f14b1a3025 100644
> > --- a/arch/x86/kernel/cpu/sgx/encl.c
> > +++ b/arch/x86/kernel/cpu/sgx/encl.c
> > @@ -382,8 +382,11 @@ static vm_fault_t sgx_encl_eaug_page(struct vm_area_struct *vma,
> > * If ret == -EBUSY then page was created in another flow while
> > * running without encl->lock
> > */
> > - if (ret)
> > + if (ret) {
> > + if (ret == -EBUSY)
> > + vmret = VM_FAULT_NOPAGE;
> > goto err_out_shrink;
> > + }
> >
> > pginfo.secs = (unsigned long)sgx_get_epc_virt_addr(encl->secs.epc_page);
> > pginfo.addr = encl_page->desc & PAGE_MASK;
> > @@ -419,7 +422,7 @@ static vm_fault_t sgx_encl_eaug_page(struct vm_area_struct *vma,
> > err_out_shrink:
> > sgx_encl_shrink(encl, va_page);
> > err_out_epc:
> > - sgx_encl_free_epc_page(epc_page);
> > + sgx_free_epc_page(epc_page);
>
> This ignores check for the page being reclaimer tracked, i.e. it does
> changes that have been ignored in the commit message.

Indeed, sgx_encl_free_epc_page() performs the following check:

WARN_ON_ONCE(page->flags & SGX_EPC_PAGE_RECLAIMER_TRACKED);

However, the EPC page is allocated in sgx_encl_eaug_page() and has
zeroed-out flags in all error-handling paths. In other words, the page is
marked as reclaimable only in the happy path of sgx_encl_eaug_page().
Therefore, in the particular code path that I changed this "page reclaimer
tracked" condition is always false, and the warning is never printed.

Do you want me to explain this in the commit message?

---

(Below questions are from follow-up emails, I add them to this reply email
to have all discussions in one place.)

> > > > What is the collateral damage caused by ENCLS[EREMOVE]?
> >
> > Have you measured cost of eremove on an empty page?
> >
> > I tried to lookup for a thread from lore because I have a faint memory
> > that it was concluded that its cost irrelevant. Please correct if I'm
> > wrong.
>
> Also pseudocode for EREMOVE supports this as it just returns without
> actually doing anything.

I have not measured the cost of EREMOVE on an empty page. This cost may be
negligible. But as stated above, my patch does not get rid of EREMOVE
simply for performance reasons. My patch removes a data race that leads to
a forever-inaccessible enclave page.

--
Dmitrii Kuvaiskii

2024-05-10 23:47:44

by Reinette Chatre

[permalink] [raw]
Subject: Re: [PATCH 1/2] x86/sgx: Resolve EAUG race where losing thread returns SIGBUS

Hi Dmitrii,

Thank you so much for finding as well as fixing this issue.

On 4/30/2024 7:37 AM, Dmitrii Kuvaiskii wrote:
> On Mon, Apr 29, 2024 at 04:04:24PM +0300, Jarkko Sakkinen wrote:
>> On Mon Apr 29, 2024 at 1:43 PM EEST, Dmitrii Kuvaiskii wrote:
>>> Two enclave threads may try to access the same non-present enclave page
>>> simultaneously (e.g., if the SGX runtime supports lazy allocation). The
>>> threads will end up in sgx_encl_eaug_page(), racing to acquire the
>>> enclave lock. The winning thread will perform EAUG, set up the page
>>> table entry, and insert the page into encl->page_array. The losing
>>> thread will then get -EBUSY on xa_insert(&encl->page_array) and proceed
>>> to error handling path.
>>
>> And that path removes page. Not sure I got gist of this tbh.
>
> Well, this is not about a redundant EREMOVE performed. This is about the
> enclave page becoming inaccessible due to a bug triggered with a data race.
>
> Consider some enclave page not yet added to the enclave. The enclave
> performs a memory access to it at the same time on CPU1 and CPU2. Since the
> page does not yet have a corresponding PTE, the #PF handler on both CPUs
> calls sgx_vma_fault(). Scenario proceeds as follows:
>
> /*
> * Fault on CPU1
> */
> sgx_vma_fault() {
>
> xa_load(&encl->page_array) == NULL ->
>
> sgx_encl_eaug_page() {
>
> ... /*
> * Fault on CPU2
> */
> sgx_vma_fault() {
>
> xa_load(&encl->page_array) == NULL ->
>
> sgx_encl_eaug_page() {
>
> ...
>

Up to here it may be helpful to have the CPU1 and CPU2 code run concurrently
to highlight the race. First one to get the mutex "wins".

> mutex_lock(&encl->lock);
> /*
> * alloc encl_page
> */

Please note that encl_page is allocated before mutex is obtained.

> /*
> * alloc EPC page
> */
> epc_page = sgx_alloc_epc_page(...);
> /*
> * add page_to enclave's xarray

"page_to" -> "page to" ?

> */
> xa_insert(&encl->page_array, ...);
> /*
> * add page to enclave via EAUG
> * (page is in pending state)
> */
> /*
> * add PTE entry
> */
> vmf_insert_pfn(...);
>
> mutex_unlock(&encl->lock);
> return VM_FAULT_NOPAGE;
> }
> }

A brief comment under CPU2 essentially stating that this is a "good"
flow may help. Something like: "All good up to here. Enclave page successfully
added to enclave, ready for EACCEPT from user space". (please feel free to
improve)

> mutex_lock(&encl->lock);
> /*
> * alloc encl_page
> */

This should be outside mutex_lock(). It can even be shown earlier how
CPU1 and CPU2 can allocate encl_page concurrently (which is fine to do).

> /*
> * alloc EPC page
> */
> epc_page = sgx_alloc_epc_page(...);
> /*
> * add page_to enclave's xarray,

hmmm ... is page_to actually intended?

> * this fails with -EBUSY

It may help to highlight that this failure is because CPU1 and CPU2 are both
attempting to access the same page thus the page was already added in CPU2 flow.

> */
> xa_insert(&encl->page_array, ...);
>
> err_out_shrink:
> sgx_encl_free_epc_page(epc_page) {
> /*
> * remove page via EREMOVE
> */

This needs emphasis that this is *BAD*. Something like:
"BUG: Enclave page added from CPU2 is yanked (via EREMOVE)
from enclave while it remains "accessible" from OS perspective
PTE installed with entry in OS's page_array)."

(please feel free to improve)

> /*
> * free EPC page
> */
> sgx_free_epc_page(epc_page);
> }
>
> mutex_unlock(&encl->lock);
> return VM_FAULT_SIGBUS;

This needs emphasis that this is *BAD*. "BUG: SIGBUS is
returned for a valid enclave page." (please feel free to
improve)

> }
> }
>
> CPU2 added the enclave page (in pending state) to the enclave and installed
> the PTE. The kernel gives control back to the user space, without raising a
> signal. The user space on CPU2 retries the memory access and induces a page
> fault, but now with the SGX bit set in the #PF error code. The #PF handler
> calls do_user_addr_fault(), which calls access_error() and ultimately
> raises a SIGSEGV. The userspace SIGSEGV handler is supposed to perform
> EACCEPT, after which point the enclave page becomes accessible.
>
> CPU1 however jumps to the error handling path because the page was already
> inserted into the enclave's xarray. This error handling path EREMOVEs the
> page and also raises a SIGBUS signal to user space. The PTE entry is not
> removed.
>
> After CPU1 performs EREMOVE, this enclave page becomes perpetually
> inaccessible (until an SGX_IOC_ENCLAVE_REMOVE_PAGES ioctl). This is because
> the page is marked accessible in the PTE entry but is not EAUGed. Because
> of this combination, the #PF handler sees the SGX bit set in the #PF error

Which #PF handler are you referring to here?

> code and does not call sgx_vma_fault() but instead raises a SIGSEGV. The
> userspace SIGSEGV handler cannot perform EACCEPT because the page was not
> EAUGed. Thus, the user space is stuck with the inaccessible page.
>
> Also note that in the scenario, CPU1 raises a SIGBUS signal to user space
> unnecessarily. This signal is spurious because a page-access retry on CPU2
> will also raise the SIGBUS signal. That said, this side effect is less
> severe because it affects only user space. Therefore, it could be
> circumvented in user space alone, but it seems reasonable to fix it in this
> patch.

The variety of the signals and how they could/should be handled by userspace
are not completely clear to me but the bugs are clear to me and needs to be
fixed.

>>> This error handling path contains two bugs: (1) SIGBUS is sent to
>>> userspace even though the enclave page is correctly installed by another
>>> thread, and (2) sgx_encl_free_epc_page() is called that performs EREMOVE
>>> even though the enclave page was never intended to be removed. The first
>>> bug is less severe because it impacts only the user space; the second
>>> bug is more severe because it also impacts the OS state by ripping the
>>> page (added by the winning thread) from the enclave.
>>>
>>> Fix these two bugs (1) by returning VM_FAULT_NOPAGE to the generic Linux
>>> fault handler so that no signal is sent to userspace, and (2) by
>>> replacing sgx_encl_free_epc_page() with sgx_free_epc_page() so that no
>>> EREMOVE is performed.
>>
>> What is the collateral damage caused by ENCLS[EREMOVE]?
>
> As explained above, the damage is that the SGX driver leaves the enclave
> page metadata in an inconsistent state: on the one hand, the PTE entry is
> installed which forces the generic Linux fault handler to raise SIGSEGV,
> and on the other hand, the page is not in a correct state to be EACCEPTed
> (i.e., EAUG was not performed on this page).
>
>>> Fixes: 5a90d2c3f5ef ("x86/sgx: Support adding of pages to an initialized enclave")
>>> Cc: [email protected]
>>> Reported-by: Marcelina Kościelnicka <[email protected]>
>>> Suggested-by: Reinette Chatre <[email protected]>
>>> Signed-off-by: Dmitrii Kuvaiskii <[email protected]>
>>> ---
>>> arch/x86/kernel/cpu/sgx/encl.c | 7 +++++--
>>> 1 file changed, 5 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/arch/x86/kernel/cpu/sgx/encl.c b/arch/x86/kernel/cpu/sgx/encl.c
>>> index 279148e72459..41f14b1a3025 100644
>>> --- a/arch/x86/kernel/cpu/sgx/encl.c
>>> +++ b/arch/x86/kernel/cpu/sgx/encl.c
>>> @@ -382,8 +382,11 @@ static vm_fault_t sgx_encl_eaug_page(struct vm_area_struct *vma,
>>> * If ret == -EBUSY then page was created in another flow while
>>> * running without encl->lock
>>> */
>>> - if (ret)
>>> + if (ret) {
>>> + if (ret == -EBUSY)
>>> + vmret = VM_FAULT_NOPAGE;
>>> goto err_out_shrink;
>>> + }
>>>
>>> pginfo.secs = (unsigned long)sgx_get_epc_virt_addr(encl->secs.epc_page);
>>> pginfo.addr = encl_page->desc & PAGE_MASK;
>>> @@ -419,7 +422,7 @@ static vm_fault_t sgx_encl_eaug_page(struct vm_area_struct *vma,
>>> err_out_shrink:
>>> sgx_encl_shrink(encl, va_page);
>>> err_out_epc:
>>> - sgx_encl_free_epc_page(epc_page);
>>> + sgx_free_epc_page(epc_page);
>>
>> This ignores check for the page being reclaimer tracked, i.e. it does
>> changes that have been ignored in the commit message.
>
> Indeed, sgx_encl_free_epc_page() performs the following check:
>
> WARN_ON_ONCE(page->flags & SGX_EPC_PAGE_RECLAIMER_TRACKED);
>
> However, the EPC page is allocated in sgx_encl_eaug_page() and has
> zeroed-out flags in all error-handling paths. In other words, the page is
> marked as reclaimable only in the happy path of sgx_encl_eaug_page().
> Therefore, in the particular code path that I changed this "page reclaimer
> tracked" condition is always false, and the warning is never printed.
>
> Do you want me to explain this in the commit message?

Since original commit did prompt this question I do think it would
be helpful to add a snippet about this, yes.

The fix looks good to me. I assume that you will add the "CPU1 vs CPU2"
race description in the next version, that will help a lot to make the
bugs easier to spot.

Thanks again for this. Great catch.

Reinette


2024-05-10 23:49:22

by Reinette Chatre

[permalink] [raw]
Subject: Re: [PATCH 2/2] x86/sgx: Resolve EREMOVE page vs EAUG page data race

Hi Dmitrii,

Thank you very much for uncovering and fixing this issue.

On 4/30/2024 7:38 AM, Dmitrii Kuvaiskii wrote:
> On Mon, Apr 29, 2024 at 04:11:03PM +0300, Jarkko Sakkinen wrote:
>> On Mon Apr 29, 2024 at 1:43 PM EEST, Dmitrii Kuvaiskii wrote:
>>> Two enclave threads may try to add and remove the same enclave page
>>> simultaneously (e.g., if the SGX runtime supports both lazy allocation
>>> and `MADV_DONTNEED` semantics). Consider this race:
>>>
>>> 1. T1 performs page removal in sgx_encl_remove_pages() and stops right
>>> after removing the page table entry and right before re-acquiring the
>>> enclave lock to EREMOVE and xa_erase(&encl->page_array) the page.
>>> 2. T2 tries to access the page, and #PF[not_present] is raised. The
>>> condition to EAUG in sgx_vma_fault() is not satisfied because the
>>> page is still present in encl->page_array, thus the SGX driver
>>> assumes that the fault happened because the page was swapped out. The
>>> driver continues on a code path that installs a page table entry
>>> *without* performing EAUG.
>>> 3. The enclave page metadata is in inconsistent state: the PTE is
>>> installed but there was no EAUG. Thus, T2 in userspace infinitely
>>> receives SIGSEGV on this page (and EACCEPT always fails).
>>>
>>> Fix this by making sure that T1 (the page-removing thread) always wins
>>> this data race. In particular, the page-being-removed is marked as such,
>>> and T2 retries until the page is fully removed.
>>>
>>> Fixes: 9849bb27152c ("x86/sgx: Support complete page removal")
>>> Cc: [email protected]
>>> Signed-off-by: Dmitrii Kuvaiskii <[email protected]>
>>> ---
>>> arch/x86/kernel/cpu/sgx/encl.c | 3 ++-
>>> arch/x86/kernel/cpu/sgx/encl.h | 3 +++
>>> arch/x86/kernel/cpu/sgx/ioctl.c | 1 +
>>> 3 files changed, 6 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/arch/x86/kernel/cpu/sgx/encl.c b/arch/x86/kernel/cpu/sgx/encl.c
>>> index 41f14b1a3025..7ccd8b2fce5f 100644
>>> --- a/arch/x86/kernel/cpu/sgx/encl.c
>>> +++ b/arch/x86/kernel/cpu/sgx/encl.c
>>> @@ -257,7 +257,8 @@ static struct sgx_encl_page *__sgx_encl_load_page(struct sgx_encl *encl,
>>>
>>> /* Entry successfully located. */
>>> if (entry->epc_page) {
>>> - if (entry->desc & SGX_ENCL_PAGE_BEING_RECLAIMED)
>>> + if (entry->desc & (SGX_ENCL_PAGE_BEING_RECLAIMED |
>>> + SGX_ENCL_PAGE_BEING_REMOVED))
>>> return ERR_PTR(-EBUSY);
>>>
>>> return entry;
>>> diff --git a/arch/x86/kernel/cpu/sgx/encl.h b/arch/x86/kernel/cpu/sgx/encl.h
>>> index f94ff14c9486..fff5f2293ae7 100644
>>> --- a/arch/x86/kernel/cpu/sgx/encl.h
>>> +++ b/arch/x86/kernel/cpu/sgx/encl.h
>>> @@ -25,6 +25,9 @@
>>> /* 'desc' bit marking that the page is being reclaimed. */
>>> #define SGX_ENCL_PAGE_BEING_RECLAIMED BIT(3)
>>>
>>> +/* 'desc' bit marking that the page is being removed. */
>>> +#define SGX_ENCL_PAGE_BEING_REMOVED BIT(2)
>>> +
>>> struct sgx_encl_page {
>>> unsigned long desc;
>>> unsigned long vm_max_prot_bits:8;
>>> diff --git a/arch/x86/kernel/cpu/sgx/ioctl.c b/arch/x86/kernel/cpu/sgx/ioctl.c
>>> index b65ab214bdf5..c542d4dd3e64 100644
>>> --- a/arch/x86/kernel/cpu/sgx/ioctl.c
>>> +++ b/arch/x86/kernel/cpu/sgx/ioctl.c
>>> @@ -1142,6 +1142,7 @@ static long sgx_encl_remove_pages(struct sgx_encl *encl,
>>> * Do not keep encl->lock because of dependency on
>>> * mmap_lock acquired in sgx_zap_enclave_ptes().
>>> */
>>> + entry->desc |= SGX_ENCL_PAGE_BEING_REMOVED;
>>> mutex_unlock(&encl->lock);
>>>
>>> sgx_zap_enclave_ptes(encl, addr);
>>
>> It is somewhat trivial to NAK this as the commit message does
>> not do any effort describing the new flag. By default at least
>> I have strong opposition against any new flags related to
>> reclaiming even if it needs a bit of extra synchronization
>> work in the user space.
>>
>> One way to describe concurrency scenarios would be to take
>> example from https://www.kernel.org/doc/Documentation/memory-barriers.txt
>>
>> I.e. see the examples with CPU 1 and CPU 2.
>
> Thank you for the suggestion. Here is my new attempt at describing the racy
> scenario:
>
> Consider some enclave page added to the enclave. User space decides to
> temporarily remove this page (e.g., emulating the MADV_DONTNEED semantics)
> on CPU1. At the same time, user space performs a memory access on the same
> page on CPU2, which results in a #PF and ultimately in sgx_vma_fault().
> Scenario proceeds as follows:
>
> /*
> * CPU1: User space performs
> * ioctl(SGX_IOC_ENCLAVE_REMOVE_PAGES)
> * on a single enclave page
> */
> sgx_encl_remove_pages() {
>
> mutex_lock(&encl->lock);
>
> entry = sgx_encl_load_page(encl);
> /*
> * verify that page is
> * trimmed and accepted
> */
>
> mutex_unlock(&encl->lock);
>
> /*
> * remove PTE entry; cannot
> * be performed under lock
> */
> sgx_zap_enclave_ptes(encl);
> /*
> * Fault on CPU2
> */

Please highlight that this fault is related to the page that
is in process of being removed on CPU1.

> sgx_vma_fault() {
> /*
> * PTE entry was removed, but the
> * page is still in enclave's xarray
> */
> xa_load(&encl->page_array) != NULL ->
> /*
> * SGX driver thinks that this page
> * was swapped out and loads it
> */
> mutex_lock(&encl->lock);
> /*
> * this is effectively a no-op
> */
> entry = sgx_encl_load_page_in_vma();
> /*
> * add PTE entry
> */

It may be helpful to highlight that this is a problem: "BUG: A PTE
is installed for a page in process of being removed." (please feel free
to expand)

> vmf_insert_pfn(...);
>
> mutex_unlock(&encl->lock);
> return VM_FAULT_NOPAGE;
> }
> /*
> * continue with page removal
> */
> mutex_lock(&encl->lock);
>
> sgx_encl_free_epc_page(epc_page) {
> /*
> * remove page via EREMOVE
> */
> /*
> * free EPC page
> */
> sgx_free_epc_page(epc_page);
> }
>
> xa_erase(&encl->page_array);
>
> mutex_unlock(&encl->lock);
> }
>
> CPU1 removed the page. However CPU2 installed the PTE entry on the
> same page. This enclave page becomes perpetually inaccessible (until
> another SGX_IOC_ENCLAVE_REMOVE_PAGES ioctl). This is because the page is
> marked accessible in the PTE entry but is not EAUGed. Because of this
> combination, any subsequent access to this page raises a fault, and the #PF
> handler sees the SGX bit set in the #PF error code and does not call

Which #PF handler?

> sgx_vma_fault() but instead raises a SIGSEGV. The userspace SIGSEGV handler
> cannot perform EACCEPT because the page was not EAUGed. Thus, the user
> space is stuck with the inaccessible page.
>
> This race can be fixed by forcing the fault handler on CPU2 to back off if
> the page is currently being removed (on CPU1). Thus a simple change is to
> introduce a new flag SGX_ENCL_PAGE_BEING_REMOVED, which is unset by default
> and set only right-before the first mutex_unlock() in
> sgx_encl_remove_pages(). Upon loading the page, CPU2 checks whether this
> page is being removed, and if yes then CPU2 backs off and waits until the
> page is completely removed. After that, any memory access to this page
> results in a normal "allocate and EAUG a page on #PF" flow.

I have been tripped by these page flags before so would appreciate
another opinion. From my side this looks like an appropriate fix.

Reinette