On Wed, Oct 21, 2020 at 10:44:46PM -0500, Jeremy Linton via Libc-alpha wrote:
> Hi,
>
> There is a problem with glibc+systemd on BTI enabled systems. Systemd
> has a service flag "MemoryDenyWriteExecute" which uses seccomp to deny
> PROT_EXEC changes. Glibc enables BTI only on segments which are marked as
> being BTI compatible by calling mprotect PROT_EXEC|PROT_BTI. That call is
> caught by the seccomp filter, resulting in service failures.
>
> So, at the moment one has to pick either denying PROT_EXEC changes, or BTI.
> This is obviously not desirable.
>
> Various changes have been suggested, replacing the mprotect with mmap calls
> having PROT_BTI set on the original mapping, re-mmapping the segments,
> implying PROT_EXEC on mprotect PROT_BTI calls when VM_EXEC is already set,
> and various modification to seccomp to allow particular mprotect cases to
> bypass the filters. In each case there seems to be an undesirable attribute
> to the solution.
>
> So, whats the best solution?
Unrolling this discussion a bit, this problem comes from a few sources:
1) systemd is trying to implement a policy that doesn't fit SECCOMP
syscall filtering very well.
2) The program is trying to do something not expressible through the
syscall interface: really the intent is to set PROT_BTI on the page,
with no intent to set PROT_EXEC on any page that didn't already have it
set.
This limitation of mprotect() was known when I originally added PROT_BTI,
but at that time we weren't aware of a clear use case that would fail.
Would it now help to add something like:
int mchangeprot(void *addr, size_t len, int old_flags, int new_flags)
{
int ret = -EINVAL;
mmap_write_lock(current->mm);
if (all vmas in [addr .. addr + len) have
their mprotect flags set to old_flags) {
ret = mprotect(addr, len, new_flags);
}
mmap_write_unlock(current->mm);
return ret;
}
libc would now be able to do
mchangeprot(addr, len, PROT_EXEC | PROT_READ,
PROT_EXEC | PROT_READ | PROT_BTI);
while systemd's MDWX filter would reject the call if
(new_flags & PROT_EXEC) &&
(!(old_flags & PROT_EXEC) || (new_flags & PROT_WRITE)
This won't magically fix current code, but something along these lines
might be better going forward.
Thoughts?
---Dave
On 26.10.2020 18.24, Dave Martin wrote:
> On Wed, Oct 21, 2020 at 10:44:46PM -0500, Jeremy Linton via Libc-alpha wrote:
>> Hi,
>>
>> There is a problem with glibc+systemd on BTI enabled systems. Systemd
>> has a service flag "MemoryDenyWriteExecute" which uses seccomp to deny
>> PROT_EXEC changes. Glibc enables BTI only on segments which are marked as
>> being BTI compatible by calling mprotect PROT_EXEC|PROT_BTI. That call is
>> caught by the seccomp filter, resulting in service failures.
>>
>> So, at the moment one has to pick either denying PROT_EXEC changes, or BTI.
>> This is obviously not desirable.
>>
>> Various changes have been suggested, replacing the mprotect with mmap calls
>> having PROT_BTI set on the original mapping, re-mmapping the segments,
>> implying PROT_EXEC on mprotect PROT_BTI calls when VM_EXEC is already set,
>> and various modification to seccomp to allow particular mprotect cases to
>> bypass the filters. In each case there seems to be an undesirable attribute
>> to the solution.
>>
>> So, whats the best solution?
>
> Unrolling this discussion a bit, this problem comes from a few sources:
>
> 1) systemd is trying to implement a policy that doesn't fit SECCOMP
> syscall filtering very well.
>
> 2) The program is trying to do something not expressible through the
> syscall interface: really the intent is to set PROT_BTI on the page,
> with no intent to set PROT_EXEC on any page that didn't already have it
> set.
>
>
> This limitation of mprotect() was known when I originally added PROT_BTI,
> but at that time we weren't aware of a clear use case that would fail.
>
>
> Would it now help to add something like:
>
> int mchangeprot(void *addr, size_t len, int old_flags, int new_flags)
> {
> int ret = -EINVAL;
> mmap_write_lock(current->mm);
> if (all vmas in [addr .. addr + len) have
> their mprotect flags set to old_flags) {
>
> ret = mprotect(addr, len, new_flags);
> }
>
> mmap_write_unlock(current->mm);
> return ret;
> }
>
>
> libc would now be able to do
>
> mchangeprot(addr, len, PROT_EXEC | PROT_READ,
> PROT_EXEC | PROT_READ | PROT_BTI);
>
> while systemd's MDWX filter would reject the call if
>
> (new_flags & PROT_EXEC) &&
> (!(old_flags & PROT_EXEC) || (new_flags & PROT_WRITE)
>
>
>
> This won't magically fix current code, but something along these lines
> might be better going forward.
>
>
> Thoughts?
Looks good to me.
-Topi
The 10/26/2020 16:24, Dave Martin via Libc-alpha wrote:
> Unrolling this discussion a bit, this problem comes from a few sources:
>
> 1) systemd is trying to implement a policy that doesn't fit SECCOMP
> syscall filtering very well.
>
> 2) The program is trying to do something not expressible through the
> syscall interface: really the intent is to set PROT_BTI on the page,
> with no intent to set PROT_EXEC on any page that didn't already have it
> set.
>
>
> This limitation of mprotect() was known when I originally added PROT_BTI,
> but at that time we weren't aware of a clear use case that would fail.
>
>
> Would it now help to add something like:
>
> int mchangeprot(void *addr, size_t len, int old_flags, int new_flags)
> {
> int ret = -EINVAL;
> mmap_write_lock(current->mm);
> if (all vmas in [addr .. addr + len) have
> their mprotect flags set to old_flags) {
>
> ret = mprotect(addr, len, new_flags);
> }
>
> mmap_write_unlock(current->mm);
> return ret;
> }
if more prot flags are introduced then the exact
match for old_flags may be restrictive and currently
there is no way to query these flags to figure out
how to toggle one prot flag in a future proof way,
so i don't think this solves the issue completely.
i think we might need a new api, given that aarch64
now has PROT_BTI and PROT_MTE while existing code
expects RWX only, but i don't know what api is best.
> libc would now be able to do
>
> mchangeprot(addr, len, PROT_EXEC | PROT_READ,
> PROT_EXEC | PROT_READ | PROT_BTI);
>
> while systemd's MDWX filter would reject the call if
>
> (new_flags & PROT_EXEC) &&
> (!(old_flags & PROT_EXEC) || (new_flags & PROT_WRITE)
>
>
>
> This won't magically fix current code, but something along these lines
> might be better going forward.
>
>
> Thoughts?
>
> ---Dave
* Dave Martin via Libc-alpha:
> Would it now help to add something like:
>
> int mchangeprot(void *addr, size_t len, int old_flags, int new_flags)
> {
> int ret = -EINVAL;
> mmap_write_lock(current->mm);
> if (all vmas in [addr .. addr + len) have
> their mprotect flags set to old_flags) {
>
> ret = mprotect(addr, len, new_flags);
> }
>
> mmap_write_unlock(current->mm);
> return ret;
> }
I suggested something similar as well. Ideally, the interface would
subsume pkey_mprotect, though, and have a separate flags argument from
the protection flags. But then we run into argument list length limits.
Thanks,
Florian
--
Red Hat GmbH, https://de.redhat.com/ , Registered seat: Grasbrunn,
Commercial register: Amtsgericht Muenchen, HRB 153243,
Managing Directors: Charles Cachera, Brian Klemm, Laurie Krebs, Michael O'Neill
On Mon, Oct 26, 2020 at 04:57:55PM +0000, Szabolcs Nagy via Libc-alpha wrote:
> The 10/26/2020 16:24, Dave Martin via Libc-alpha wrote:
> > Unrolling this discussion a bit, this problem comes from a few sources:
> >
> > 1) systemd is trying to implement a policy that doesn't fit SECCOMP
> > syscall filtering very well.
> >
> > 2) The program is trying to do something not expressible through the
> > syscall interface: really the intent is to set PROT_BTI on the page,
> > with no intent to set PROT_EXEC on any page that didn't already have it
> > set.
> >
> >
> > This limitation of mprotect() was known when I originally added PROT_BTI,
> > but at that time we weren't aware of a clear use case that would fail.
> >
> >
> > Would it now help to add something like:
> >
> > int mchangeprot(void *addr, size_t len, int old_flags, int new_flags)
> > {
> > int ret = -EINVAL;
> > mmap_write_lock(current->mm);
> > if (all vmas in [addr .. addr + len) have
> > their mprotect flags set to old_flags) {
> >
> > ret = mprotect(addr, len, new_flags);
> > }
> >
> > mmap_write_unlock(current->mm);
> > return ret;
> > }
>
> if more prot flags are introduced then the exact
> match for old_flags may be restrictive and currently
> there is no way to query these flags to figure out
> how to toggle one prot flag in a future proof way,
> so i don't think this solves the issue completely.
Ack -- I illustrated this model because it makes the seccomp filter's
job easy, but it does have limitations.
> i think we might need a new api, given that aarch64
> now has PROT_BTI and PROT_MTE while existing code
> expects RWX only, but i don't know what api is best.
An alternative option would be a call that sets / clears chosen
flags and leaves others unchanged.
The trouble with that is that the MDWX policy then becomes hard to
implement again.
But policies might be best set via another route, such as a prctl,
rather than being implemented completely in a seccomp filter.
Cheers
---Dave
Hi,
On 10/26/20 12:52 PM, Dave Martin wrote:
> On Mon, Oct 26, 2020 at 04:57:55PM +0000, Szabolcs Nagy via Libc-alpha wrote:
>> The 10/26/2020 16:24, Dave Martin via Libc-alpha wrote:
>>> Unrolling this discussion a bit, this problem comes from a few sources:
>>>
>>> 1) systemd is trying to implement a policy that doesn't fit SECCOMP
>>> syscall filtering very well.
>>>
>>> 2) The program is trying to do something not expressible through the
>>> syscall interface: really the intent is to set PROT_BTI on the page,
>>> with no intent to set PROT_EXEC on any page that didn't already have it
>>> set.
>>>
>>>
>>> This limitation of mprotect() was known when I originally added PROT_BTI,
>>> but at that time we weren't aware of a clear use case that would fail.
>>>
>>>
>>> Would it now help to add something like:
>>>
>>> int mchangeprot(void *addr, size_t len, int old_flags, int new_flags)
>>> {
>>> int ret = -EINVAL;
>>> mmap_write_lock(current->mm);
>>> if (all vmas in [addr .. addr + len) have
>>> their mprotect flags set to old_flags) {
>>>
>>> ret = mprotect(addr, len, new_flags);
>>> }
>>>
>>> mmap_write_unlock(current->mm);
>>> return ret;
>>> }
>>
>> if more prot flags are introduced then the exact
>> match for old_flags may be restrictive and currently
>> there is no way to query these flags to figure out
>> how to toggle one prot flag in a future proof way,
>> so i don't think this solves the issue completely.
>
> Ack -- I illustrated this model because it makes the seccomp filter's
> job easy, but it does have limitations.
>
>> i think we might need a new api, given that aarch64
>> now has PROT_BTI and PROT_MTE while existing code
>> expects RWX only, but i don't know what api is best.
>
> An alternative option would be a call that sets / clears chosen
> flags and leaves others unchanged.
I tend to favor a set/clear API, but that could also just be done by
creating a new PROT_BTI_IF_X which enables BTI for areas already set to
_EXEC. That goes right by the seccomp filters too, and actually is
closer to what glibc wants to do anyway.
>
> The trouble with that is that the MDWX policy then becomes hard to
> implement again.
>
>
> But policies might be best set via another route, such as a prctl,
> rather than being implemented completely in a seccomp filter.
>
> Cheers
> ---Dave
>
On Mon, Oct 26, 2020 at 05:45:42PM +0100, Florian Weimer via Libc-alpha wrote:
> * Dave Martin via Libc-alpha:
>
> > Would it now help to add something like:
> >
> > int mchangeprot(void *addr, size_t len, int old_flags, int new_flags)
> > {
> > int ret = -EINVAL;
> > mmap_write_lock(current->mm);
> > if (all vmas in [addr .. addr + len) have
> > their mprotect flags set to old_flags) {
> >
> > ret = mprotect(addr, len, new_flags);
> > }
> >
> > mmap_write_unlock(current->mm);
> > return ret;
> > }
>
> I suggested something similar as well. Ideally, the interface would
> subsume pkey_mprotect, though, and have a separate flags argument from
> the protection flags. But then we run into argument list length limits.
>
> Thanks,
> Florian
I suppose. Assuming that a syscall filter can inspect memory, we might
be able to bundle arguments into a struct if necessary.
[...]
Cheers
---Dave
* Dave Martin via Libc-alpha:
> On Mon, Oct 26, 2020 at 05:45:42PM +0100, Florian Weimer via Libc-alpha wrote:
>> * Dave Martin via Libc-alpha:
>>
>> > Would it now help to add something like:
>> >
>> > int mchangeprot(void *addr, size_t len, int old_flags, int new_flags)
>> > {
>> > int ret = -EINVAL;
>> > mmap_write_lock(current->mm);
>> > if (all vmas in [addr .. addr + len) have
>> > their mprotect flags set to old_flags) {
>> >
>> > ret = mprotect(addr, len, new_flags);
>> > }
>> >
>> > mmap_write_unlock(current->mm);
>> > return ret;
>> > }
>>
>> I suggested something similar as well. Ideally, the interface would
>> subsume pkey_mprotect, though, and have a separate flags argument from
>> the protection flags. But then we run into argument list length limits.
>>
>> Thanks,
>> Florian
>
> I suppose. Assuming that a syscall filter can inspect memory, we might
> be able to bundle arguments into a struct if necessary.
But that leads to a discussion about batch mmap/mprotect/munmap, and
that's again incompatible with seccomp (it would need a checking loop).
Thanks,
Florian
--
Red Hat GmbH, https://de.redhat.com/ , Registered seat: Grasbrunn,
Commercial register: Amtsgericht Muenchen, HRB 153243,
Managing Directors: Charles Cachera, Brian Klemm, Laurie Krebs, Michael O'Neill
On Mon, Oct 26, 2020 at 05:39:42PM -0500, Jeremy Linton via Libc-alpha wrote:
> Hi,
>
> On 10/26/20 12:52 PM, Dave Martin wrote:
> >On Mon, Oct 26, 2020 at 04:57:55PM +0000, Szabolcs Nagy via Libc-alpha wrote:
> >>The 10/26/2020 16:24, Dave Martin via Libc-alpha wrote:
> >>>Unrolling this discussion a bit, this problem comes from a few sources:
> >>>
> >>>1) systemd is trying to implement a policy that doesn't fit SECCOMP
> >>>syscall filtering very well.
> >>>
> >>>2) The program is trying to do something not expressible through the
> >>>syscall interface: really the intent is to set PROT_BTI on the page,
> >>>with no intent to set PROT_EXEC on any page that didn't already have it
> >>>set.
> >>>
> >>>
> >>>This limitation of mprotect() was known when I originally added PROT_BTI,
> >>>but at that time we weren't aware of a clear use case that would fail.
> >>>
> >>>
> >>>Would it now help to add something like:
> >>>
> >>>int mchangeprot(void *addr, size_t len, int old_flags, int new_flags)
> >>>{
> >>> int ret = -EINVAL;
> >>> mmap_write_lock(current->mm);
> >>> if (all vmas in [addr .. addr + len) have
> >>> their mprotect flags set to old_flags) {
> >>>
> >>> ret = mprotect(addr, len, new_flags);
> >>> }
> >>>
> >>> mmap_write_unlock(current->mm);
> >>> return ret;
> >>>}
> >>
> >>if more prot flags are introduced then the exact
> >>match for old_flags may be restrictive and currently
> >>there is no way to query these flags to figure out
> >>how to toggle one prot flag in a future proof way,
> >>so i don't think this solves the issue completely.
> >
> >Ack -- I illustrated this model because it makes the seccomp filter's
> >job easy, but it does have limitations.
> >
> >>i think we might need a new api, given that aarch64
> >>now has PROT_BTI and PROT_MTE while existing code
> >>expects RWX only, but i don't know what api is best.
> >
> >An alternative option would be a call that sets / clears chosen
> >flags and leaves others unchanged.
>
> I tend to favor a set/clear API, but that could also just be done by
> creating a new PROT_BTI_IF_X which enables BTI for areas already set to
> _EXEC. That goes right by the seccomp filters too, and actually is closer to
> what glibc wants to do anyway.
That works, though I'm not so keen on teating PROT_BTI as a special case,
since the problem is likely to recur when other weird per-arch flags get
added...
I also wonder whether we actually care whether the pages are marked
executable or not here; probably the flags can just be independent. This
rather depends on whether the how the architecture treats the BTI (a.k.a
GP) pagetable bit for non-executable pages. I have a feeling we already
allow PROT_BTI && !PROT_EXEC through anyway.
What about a generic-ish set/clear interface that still works by just
adding a couple of PROT_ flags:
switch (flags & (PROT_SET | PROT_CLEAR)) {
case PROT_SET: prot |= flags; break;
case PROT_CLEAR: prot &= ~flags; break;
case 0: prot = flags; break;
default:
return -EINVAL;
}
This can't atomically set some flags while clearing some others, but for
simple stuff it seems sufficient and shouldn't be too invasive on the
kernel side.
We will still have to take the mm lock when doing a SET or CLEAR, but
not for the non-set/clear case.
Anyway, libc could now do:
mprotect(addr, len, PROT_SET | PROT_BTI);
with much the same effect as your PROT_BTI_IF_X.
JITting or breakpoint setting code that wants to change the permissions
temporarily, without needing to know whether PROT_BTI is set, say:
mprotect(addr, len, PROT_SET | PROT_WRITE);
*addr = BKPT_INSN;
mprotect(addr, len, PROT_CLEAR | PROT_WRITE);
Thoughts?
I won't claim this doesn't still have some limitations...
Cheers
---Dave
On Tue, Oct 27, 2020 at 02:15:22PM +0000, Dave P Martin wrote:
> I also wonder whether we actually care whether the pages are marked
> executable or not here; probably the flags can just be independent. This
> rather depends on whether the how the architecture treats the BTI (a.k.a
> GP) pagetable bit for non-executable pages. I have a feeling we already
> allow PROT_BTI && !PROT_EXEC through anyway.
>
>
> What about a generic-ish set/clear interface that still works by just
> adding a couple of PROT_ flags:
>
> switch (flags & (PROT_SET | PROT_CLEAR)) {
> case PROT_SET: prot |= flags; break;
> case PROT_CLEAR: prot &= ~flags; break;
> case 0: prot = flags; break;
>
> default:
> return -EINVAL;
> }
>
> This can't atomically set some flags while clearing some others, but for
> simple stuff it seems sufficient and shouldn't be too invasive on the
> kernel side.
>
> We will still have to take the mm lock when doing a SET or CLEAR, but
> not for the non-set/clear case.
>
>
> Anyway, libc could now do:
>
> mprotect(addr, len, PROT_SET | PROT_BTI);
>
> with much the same effect as your PROT_BTI_IF_X.
>
>
> JITting or breakpoint setting code that wants to change the permissions
> temporarily, without needing to know whether PROT_BTI is set, say:
>
> mprotect(addr, len, PROT_SET | PROT_WRITE);
> *addr = BKPT_INSN;
> mprotect(addr, len, PROT_CLEAR | PROT_WRITE);
The problem with this approach is that you can't catch
PROT_EXEC|PROT_WRITE mappings via seccomp. So you'd have to limit it to
some harmless PROT_ flags only. I don't like this limitation, nor the
PROT_BTI_IF_X approach.
The only generic solutions I see are to either use a stateful filter in
systemd or pass the old state to the kernel in a cmpxchg style so that
seccomp can check it (I think you suggest this at some point).
The latter requires a new syscall which is not something we can address
as a quick, back-portable fix here. If systemd cannot be changed to use
a stateful filter for w^x detection, my suggestion is to go for the
kernel setting PROT_BTI on the main executable with glibc changed to
tolerate EPERM on mprotect(). I don't mind adding an AT_FLAGS bit if
needed but I don't think it buys us much.
Once the current problem is fixed, we can look at a better solution
longer term as a new syscall.
--
Catalin
On Thu, Oct 29, 2020 at 11:02:22AM +0000, Catalin Marinas via Libc-alpha wrote:
> On Tue, Oct 27, 2020 at 02:15:22PM +0000, Dave P Martin wrote:
> > I also wonder whether we actually care whether the pages are marked
> > executable or not here; probably the flags can just be independent. This
> > rather depends on whether the how the architecture treats the BTI (a.k.a
> > GP) pagetable bit for non-executable pages. I have a feeling we already
> > allow PROT_BTI && !PROT_EXEC through anyway.
> >
> >
> > What about a generic-ish set/clear interface that still works by just
> > adding a couple of PROT_ flags:
> >
> > switch (flags & (PROT_SET | PROT_CLEAR)) {
> > case PROT_SET: prot |= flags; break;
> > case PROT_CLEAR: prot &= ~flags; break;
> > case 0: prot = flags; break;
> >
> > default:
> > return -EINVAL;
> > }
> >
> > This can't atomically set some flags while clearing some others, but for
> > simple stuff it seems sufficient and shouldn't be too invasive on the
> > kernel side.
> >
> > We will still have to take the mm lock when doing a SET or CLEAR, but
> > not for the non-set/clear case.
> >
> >
> > Anyway, libc could now do:
> >
> > mprotect(addr, len, PROT_SET | PROT_BTI);
> >
> > with much the same effect as your PROT_BTI_IF_X.
> >
> >
> > JITting or breakpoint setting code that wants to change the permissions
> > temporarily, without needing to know whether PROT_BTI is set, say:
> >
> > mprotect(addr, len, PROT_SET | PROT_WRITE);
> > *addr = BKPT_INSN;
> > mprotect(addr, len, PROT_CLEAR | PROT_WRITE);
>
> The problem with this approach is that you can't catch
> PROT_EXEC|PROT_WRITE mappings via seccomp. So you'd have to limit it to
> some harmless PROT_ flags only. I don't like this limitation, nor the
> PROT_BTI_IF_X approach.
Ack; this is just one flavour of interface, and every approach seems to
have some shortcomings.
> The only generic solutions I see are to either use a stateful filter in
> systemd or pass the old state to the kernel in a cmpxchg style so that
> seccomp can check it (I think you suggest this at some point).
The "cmpxchg" option has the disadvantage that the caller needs to know
the original permissions. It seems that glibc is prepared to work
around this, but it won't always be feasible in ancillary /
instrumentation code or libraries.
IMHO it would be preferable to apply a policy to mmap/mprotect in the
kernel proper rather then BPF being the only way to do it -- in any
case, the required checks seem to be out of the scope of what can be
done efficiently (or perhaps at all) in a syscall filter.
> The latter requires a new syscall which is not something we can address
> as a quick, back-portable fix here. If systemd cannot be changed to use
> a stateful filter for w^x detection, my suggestion is to go for the
> kernel setting PROT_BTI on the main executable with glibc changed to
> tolerate EPERM on mprotect(). I don't mind adding an AT_FLAGS bit if
> needed but I don't think it buys us much.
I agree, this seems the best short-term approach.
> Once the current problem is fixed, we can look at a better solution
> longer term as a new syscall.
Agreed, I think if we try to rush the addition of new syscalls, the
chance of coming up with a bad design is high...
Cheers
---Dave