2024-02-12 10:03:01

by Dmitry Vyukov

Subject: Spurious SIGSEGV with rseq/membarrier

Hi rseq/membarrier maintainers,

I've spent a bit of time debugging some spurious SIGSEGVs and it turned
out to be an interesting interaction between page faults, rseq and
membarrier. The manifestation is that membarrier(EXPEDITED_RSEQ) is
effectively not working for a thread (it doesn't restart its rseq
critical section).

The real code is inside tcmalloc and relates to the "slabs resizing" procedure:

https://github.com/google/tcmalloc/blob/39775a2d57969eda9497f3673421766bc1e886a0/tcmalloc/internal/percpu_tcmalloc.cc#L176

The essence is:
Threads use a data structure inside of an rseq critical section.
The resize procedure replaces the old data structure with a new one,
uses a membarrier to ensure that threads don't use the old one any
more, and unmaps/mprotects the pages that back the old data structure.
At this point no threads use the old data structure anymore and no
threads should get SIGSEGV.
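
For illustration, the resize side boils down to roughly the following
(a simplified sketch, not the actual tcmalloc code; the names and sizes
are made up). Readers load and dereference the pointer from inside an
rseq critical section, so the expedited rseq membarrier restarts any
critical section that may still be using the old pointer:

/*
 * Illustrative sketch only (not the actual tcmalloc code): "slabs",
 * "resize_slabs" and the sizes are made-up names.
 */
#define _GNU_SOURCE
#include <linux/membarrier.h>
#include <stdatomic.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

static _Atomic(void *) slabs;   /* read only inside rseq critical sections */

static int membarrier(int cmd, unsigned int flags, int cpu_id)
{
        return syscall(__NR_membarrier, cmd, flags, cpu_id);
}

static void resize_slabs(void *new_slabs, void *old_slabs, size_t old_size)
{
        /* 1. Publish the new data structure. */
        atomic_store_explicit(&slabs, new_slabs, memory_order_release);

        /*
         * 2. Restart any rseq critical section that might still hold the
         *    old pointer.  Requires a prior one-time
         *    MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED_RSEQ registration.
         */
        membarrier(MEMBARRIER_CMD_PRIVATE_EXPEDITED_RSEQ, 0, 0);

        /*
         * 3. No thread should touch the old memory anymore: release it,
         *    or mprotect(PROT_NONE) it to catch stray accesses.
         */
        munmap(old_slabs, old_size);
}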

However, what happens is as follows:
A thread gets a minor page fault on the old data structure inside of
an rseq critical section.
The page fault handler re-enables preemption and allows other threads
to be scheduled (I am not sure this is actually important, but that's
what I observed in all traces, and it makes the failure scenario much
more likely).
Now the resize procedure is executed, replaces all pointers to the
old data structure with pointers to the new one, executes the
membarrier and unmaps the old data structure.
Now the page fault handler resumes, checks the VMA protection and
finds that the VMA is indeed inaccessible, so the fault is not a
minor one but should instead result in SIGSEGV, and it sends SIGSEGV.
Note: at this point the thread has an rseq restart pending (from both
the preemption and the membarrier), and the restart indeed happens as
part of SIGSEGV delivery, but it's already too late.

I think the page fault handling should give the rseq restart
preference in this case, and realize that the thread shouldn't be
executing the faulting instruction in the first place. In that case
the thread would be restarted and would access the new data structure
after the restart.

Unmapping/mprotecting the old data in this case is useful for 2 reasons:
1. It allows releasing the memory (not possible to do reliably now).
2. It allows ensuring there are no logical bugs in the user-space
code and that threads don't access the old data when they shouldn't.
I was actually tracking a potential bug in the user-space code, but
after mprotecting the old data, I started seeing more and even more
confusing crashes (these spurious SIGSEGVs).


2024-02-12 17:14:28

by Mathieu Desnoyers

Subject: Re: Spurious SIGSEGV with rseq/membarrier

On 2024-02-12 05:02, Dmitry Vyukov wrote:
> Hi rseq/membarrier maintainers,
>
> I've spent a bit of time debugging some spurious SIGSEGVs and it turned
> out to be an interesting interaction between page faults, rseq and
> membarrier. The manifestation is that membarrier(EXPEDITED_RSEQ) is
> effectively not working for a thread (it doesn't restart its rseq
> critical section).
>
> The real code is inside tcmalloc and relates to the "slabs resizing" procedure:
>
> https://github.com/google/tcmalloc/blob/39775a2d57969eda9497f3673421766bc1e886a0/tcmalloc/internal/percpu_tcmalloc.cc#L176
>
> The essence is:
> Threads use a data structure inside of an rseq critical section.
> The resize procedure replaces the old data structure with a new one,
> uses a membarrier to ensure that threads don't use the old one any
> more, and unmaps/mprotects the pages that back the old data structure.
> At this point no threads use the old data structure anymore and no
> threads should get SIGSEGV.
>
> However, what happens is as follows:
> A thread gets a minor page fault on the old data structure inside of
> an rseq critical section.
> The page fault handler re-enables preemption and allows other threads
> to be scheduled (I am not sure this is actually important, but that's
> what I observed in all traces, and it makes the failure scenario much
> more likely).
> Now the resize procedure is executed, replaces all pointers to the
> old data structure with pointers to the new one, executes the
> membarrier and unmaps the old data structure.
> Now the page fault handler resumes, checks the VMA protection and
> finds that the VMA is indeed inaccessible, so the fault is not a
> minor one but should instead result in SIGSEGV, and it sends SIGSEGV.
> Note: at this point the thread has an rseq restart pending (from both
> the preemption and the membarrier), and the restart indeed happens as
> part of SIGSEGV delivery, but it's already too late.

Hi Dmitry,

Thanks for spending the time to analyze this issue and identify the
scenario causing it.

>
> I think the page fault handling should give the rseq restart
> preference in this case, and realize that the thread shouldn't be
> executing the faulting instruction in the first place. In that case
> the thread would be restarted and would access the new data structure
> after the restart.

The desired behavior you describe here does make sense.

> Unmapping/mprotecting the old data in this case is useful for 2 reasons:
> 1. It allows releasing the memory (not possible to do reliably now).
> 2. It allows ensuring there are no logical bugs in the user-space
> code and that threads don't access the old data when they shouldn't.
> I was actually tracking a potential bug in the user-space code, but
> after mprotecting the old data, I started seeing more and even more
> confusing crashes (these spurious SIGSEGVs).

So I think we are in a situation where, in the original rseq design,
we've tried to eliminate "all" kernel-internal rseq restart corner
cases by preventing system calls from being issued within rseq
critical sections (to keep things nice and simple), but missed the
fact that page faults are another means of entering the kernel which
can indeed trigger preemption, and that the restart should happen
either before the side-effect of the page fault is decided (e.g.
segmentation fault or SIGBUS), or before the actual signal is
delivered to user-space.

In order to solve this, here is what I think we need:

1) Add a proper stress-test reproducer to the rseq selftests:

- In a loop, conditionally issue the first memory access to a newly
  mmap'd area from a rseq critical section using a pointer
  dereference.
- In another thread, in a loop:
  - mmap a new memory area,
  - set pointer to point to that memory area,
  - wait a tiny while (cpu_relax busy-loop),
  - set pointer to NULL,
  - membarrier EXPEDITED_RSEQ,
  - munmap the area (could also be mprotect).

We should probably provide hints to mmap so the same address is not
re-used loop after loop, to increase the chances of hitting the race
(see the sketch below).
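
A rough skeleton of such a stress test could look like this. It is
only a sketch: the rseq critical section itself is hidden behind the
hypothetical rseq_deref_if_set() helper (a real selftest would build it
from the inline-asm helpers in tools/testing/selftests/rseq), and rseq
registration for each thread is assumed to be done by glibc or by the
test itself:

#define _GNU_SOURCE
#include <linux/membarrier.h>
#include <pthread.h>
#include <stdatomic.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

#define PAGE_SZ 4096

static _Atomic(char *) area;    /* NULL, or a freshly mmap'd page */

/*
 * Hypothetical helper: within ONE rseq critical section, load *pp and,
 * if it is non-NULL, dereference it.  On abort (preemption, signal or
 * expedited rseq membarrier) the whole sequence restarts and re-reads
 * the pointer.
 */
extern void rseq_deref_if_set(_Atomic(char *) *pp);

static void *reader(void *arg)
{
        (void)arg;
        for (;;)
                rseq_deref_if_set(&area);
        return NULL;
}

static void *remapper(void *arg)
{
        (void)arg;
        for (;;) {
                /* A varying hint address would make address reuse less likely. */
                char *p = mmap(NULL, PAGE_SZ, PROT_READ | PROT_WRITE,
                               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

                atomic_store_explicit(&area, p, memory_order_release);
                for (volatile int i = 0; i < 1000; i++)
                        ;                               /* cpu_relax-ish wait */
                atomic_store_explicit(&area, (char *)NULL, memory_order_release);

                /* Restart any reader still in its rseq critical section ... */
                syscall(__NR_membarrier, MEMBARRIER_CMD_PRIVATE_EXPEDITED_RSEQ, 0, 0);
                /* ... then make the old page inaccessible. */
                munmap(p, PAGE_SZ);
        }
        return NULL;
}

int main(void)
{
        pthread_t r, m;

        /* EXPEDITED_RSEQ membarriers must be registered before first use. */
        syscall(__NR_membarrier,
                MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED_RSEQ, 0, 0);

        pthread_create(&r, NULL, reader, NULL);
        pthread_create(&m, NULL, remapper, NULL);
        pthread_join(r, NULL);
        pthread_join(m, NULL);
        return 0;
}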

2) We need to figure out where to implement this behavior change. Either
at the page fault handler level, or at signal delivery. It would need
to check whether:
- t->rseq_event_mask is nonzero _and_
- in_rseq_cs() would return true (it's tricky because this needs to
read userspace memory, which can trigger a page fault).
We should be careful not to prevent signals emitted by other sources
from being delivered.

3) This would be a new behavior. Even though it is something that would
have been preferable from the beginning, none of the pre-existing
kernels implementing membarrier EXPEDITED_RSEQ have that behavior
today. We would want to implement something that allows userspace to
detect this "feature". Extending getauxval() to add a new "rseq
feature" entry that would return feature flags would be an option
(a hypothetical sketch of such a check is below).

Thoughts ?

Thanks,

Mathieu
--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com