On Mon, Sep 21, 2020 at 7:35 AM YiFei Zhu <[email protected]> wrote:
> SECCOMP_CACHE_NR_ONLY will only operate on syscalls that do not
> access any syscall arguments or instruction pointer. To facilitate
> this we need a static analyser to know whether a filter will
> access. This is implemented here with a pseudo-emulator, and
> stored in a per-filter bitmap. Each seccomp cBPF instruction,
> aside from ALU (which should rarely be used in seccomp), gets a
> naive best-effort emulation for each syscall number.
>
> The emulator works by following all possible (without SAT solving)
> paths the filter can take. Every cBPF register / memory position
> records whether that is a constant, and of so, the value of the
> constant. Loading from struct seccomp_data is considered constant
> if it is a syscall number, else it is an unknown. For each
> conditional jump, if the both arguments can be resolved to a
> constant, the jump is followed after computing the result of the
> condition; else both directions are followed, by pushing one of
> the next states to a linked list of next states to process. We
> keep a finite number of pending states to process.
Is this actually necessary, or can we just bail out on any branch that
we can't statically resolve?
struct seccomp_data only contains the syscall number (constant for a
given filter evaluation), the architecture number (also constant), the
instruction pointer (basically never used in seccomp filters), and the
syscall arguments. Any normal seccomp filter first branches on the
architecture, then branches on the syscall number, and then branches
on arguments if necessary.
This optimization could only be improved by the "follow both branches"
logic if a seccomp program branches on either the instruction pointer
or an argument *before* looking at the syscall number, and later comes
to the same conclusion on *both* sides of the check. It would have to
be something like:
if (instruction_pointer == 0xasdf1234) {
if (nr == mmap) return ACCEPT;
[...]
return KILL;
} else {
if (nr == mmap) return ACCEPT;
[...]
return KILL;
}
I've never seen anyone do something like this. And the proposed patch
would still bail out on such a filter because of the load from the
instruction_pointer field; I don't think it would even be possible to
reach a branch with an unknown condition with this patch. So I think
we should probably get rid of this extra logic for keeping track of
multiple execution states for now. That would make the code a lot
simpler.
Also: If it turns out that the time spent in seccomp_cache_prepare()
is measurable for large filters, a possible improvement would be to
keep track of the last syscall number for which the result would be
the same as for the current one, such that instead of evaluating the
filter for one instruction at a time, it would effectively be
evaluated for a range at a time. That should be pretty straightforward
to implement, I think.
> The emulation is halted if it reaches a return, or if it reaches a
> read from struct seccomp_data that reads an offset that is neither
> syscall number or architecture number. In the latter case, we mark
> the syscall number as not okay for seccomp to cache. If a filter
> depends on more filters, then if its dependee cannot process the
> syscall then the depender is also marked not to process the syscall.
>
> We also do a single pass on the entire filter instructions before
> performing emulation. If none of the filter instructions load from
> the troublesome offsets, then the filter is considered "trivial",
> and all syscalls are marked okay for seccomp to cache.
>
> Signed-off-by: YiFei Zhu <[email protected]>
> ---
> arch/x86/Kconfig | 27 ++++
> kernel/seccomp.c | 323 ++++++++++++++++++++++++++++++++++++++++++++++-
> 2 files changed, 349 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
[...]
> +choice
> + prompt "Seccomp filter cache"
> + default SECCOMP_CACHE_NONE
I think this should be on by default.
> + depends on SECCOMP
> + depends on SECCOMP_FILTER
SECCOMP_FILTER already depends on SECCOMP, so the "depends on SECCOMP"
line is unnecessary.
> + help
> + Seccomp filters can potentially incur large overhead for each
> + system call. This can alleviate some of the overhead.
> +
> + If in doubt, select 'none'.
This should not be in arch/x86. Other architectures, such as arm64,
should also be able to use this without extra work.
> +config SECCOMP_CACHE_NONE
> + bool "None"
> + help
> + No caching is done. Seccomp filters will be called each time
> + a system call occurs in a seccomp-guarded task.
> +
> +config SECCOMP_CACHE_NR_ONLY
> + bool "Syscall number only"
> + help
> + This is enables a bitmap to cache the results of seccomp
> + filters, if the filter allows the syscall and is independent
> + of the syscall arguments.
Maybe reword this as something like: "For each syscall number, if the
seccomp filter has a fixed result, store that result in a bitmap to
speed up system calls."
> This requires around 60 bytes per
> + filter and 70 bytes per task.
> +
> +endchoice
> +
> source "kernel/Kconfig.hz"
>
> config KEXEC
> diff --git a/kernel/seccomp.c b/kernel/seccomp.c
> index 3ee59ce0a323..d8c30901face 100644
> --- a/kernel/seccomp.c
> +++ b/kernel/seccomp.c
> @@ -143,6 +143,27 @@ struct notification {
> struct list_head notifications;
> };
>
> +#ifdef CONFIG_SECCOMP_CACHE_NR_ONLY
> +/**
> + * struct seccomp_cache_filter_data - container for cache's per-filter data
> + *
> + * @syscall_ok: A bitmap where each bit represent whether seccomp is allowed to
nit: represents
> + * cache the results of this syscall.
> + */
> +struct seccomp_cache_filter_data {
> + DECLARE_BITMAP(syscall_ok, NR_syscalls);
> +};
> +
> +#define SECCOMP_EMU_MAX_PENDING_STATES 64
> +#else
> +struct seccomp_cache_filter_data { };
> +
> +static inline int seccomp_cache_prepare(struct seccomp_filter *sfilter)
> +{
> + return 0;
> +}
> +#endif /* CONFIG_SECCOMP_CACHE_NR_ONLY */
[...]
> +/**
> + * seccomp_emu_step - step one instruction in the emulator
> + * @env: The emulator environment
> + * @state: The emulator state
> + *
> + * Returns 1 to halt emulation, 0 to continue, or -errno if error occurred.
> + */
> +static int seccomp_emu_step(struct seccomp_emu_env *env,
> + struct seccomp_emu_state *state)
> +{
> + struct sock_filter *ftest = &env->filter[state->pc++];
> + struct seccomp_emu_state *new_state;
> + u16 code = ftest->code;
> + u32 k = ftest->k;
> + u32 operand;
> + bool compare;
> + int reg_idx;
> +
> + switch (BPF_CLASS(code)) {
> + case BPF_LD:
> + case BPF_LDX:
> + reg_idx = BPF_CLASS(code) == BPF_LDX;
> +
> + switch (BPF_MODE(code)) {
> + case BPF_IMM:
> + state->reg_known[reg_idx] = true;
> + state->reg_const[reg_idx] = k;
> + break;
> + case BPF_ABS:
> + if (k == offsetof(struct seccomp_data, nr)) {
> + state->reg_known[reg_idx] = true;
> + state->reg_const[reg_idx] = env->nr;
> + } else {
> + state->reg_known[reg_idx] = false;
This is completely broken. This emulation logic *needs* to run with
the proper architecture identifier. (And for platforms like x86-64
that have compatibility support for a second ABI, the emulation should
probably also be done for that ABI, and there should be separate
bitmasks for that ABI.)
With the current logic, you will (almost) never actually have
permitted syscalls in the bitmask, because filters fundamentally have
to return different results for different ABIs - the syscall numbers
mean completely different things under different ABIs.
> + if (k != offsetof(struct seccomp_data, arch)) {
> + env->syscall_ok = false;
> + return 1;
> + }
> + }
This would read nicer as:
if (k == offsetof(struct seccomp_data, nr)) {
} else if (k == offsetof(struct seccomp_data, arch)) {
} else {
env->syscall_ok = false;
return 1;
}
> +
> + break;
> + case BPF_MEM:
> + state->reg_known[reg_idx] = state->reg_known[2 + k];
> + state->reg_const[reg_idx] = state->reg_const[2 + k];
> + break;
> + default:
> + state->reg_known[reg_idx] = false;
> + }
> +
> + return 0;
> + case BPF_ST:
> + case BPF_STX:
> + reg_idx = BPF_CLASS(code) == BPF_STX;
> +
> + state->reg_known[2 + k] = state->reg_known[reg_idx];
> + state->reg_const[2 + k] = state->reg_const[reg_idx];
I think we should probably just bail out if we see anything that's
BPF_ST/BPF_STX. I've never seen seccomp filters that actually use that
part of cBPF.
But in case we do need this, maybe instead of using "2 +" for all
these things, the cBPF memory slots should be in a separate array.
> + return 0;
> + case BPF_ALU:
> + state->reg_known[0] = false;
> + return 0;
> + case BPF_JMP:
> + if (BPF_OP(code) == BPF_JA) {
> + state->pc += k;
> + return 0;
> + }
> +
> + if (ftest->jt == ftest->jf) {
> + state->pc += ftest->jt;
> + return 0;
> + }
Why is this check here? Is anyone actually creating filters with such
obviously nonsensical branches? I know that there are highly ludicrous
filters out there, but I don't think I've ever seen this specific kind
of useless code.
> + if (!state->reg_known[0])
> + goto both_cases;
[...]
> +both_cases:
> + if (env->next_state_len >= SECCOMP_EMU_MAX_PENDING_STATES)
> + return -E2BIG;
Even if we cap the maximum number of pending states, this could still
run for an almost unbounded amount of time, I think. Which is bad. If
this code was actually necessary, we'd probably want to track
separately the total number of branches we've seen and so on.
But as I said, I think this code should just be removed instead.
[...]
> + }
> +}
[...]
On Mon, Sep 21, 2020 at 7:47 PM Jann Horn <[email protected]> wrote:
> On Mon, Sep 21, 2020 at 7:35 AM YiFei Zhu <[email protected]> wrote:
> > SECCOMP_CACHE_NR_ONLY will only operate on syscalls that do not
> > access any syscall arguments or instruction pointer. To facilitate
> > this we need a static analyser to know whether a filter will
> > access. This is implemented here with a pseudo-emulator, and
> > stored in a per-filter bitmap. Each seccomp cBPF instruction,
> > aside from ALU (which should rarely be used in seccomp), gets a
> > naive best-effort emulation for each syscall number.
> >
> > The emulator works by following all possible (without SAT solving)
> > paths the filter can take. Every cBPF register / memory position
> > records whether that is a constant, and of so, the value of the
> > constant. Loading from struct seccomp_data is considered constant
> > if it is a syscall number, else it is an unknown. For each
> > conditional jump, if the both arguments can be resolved to a
> > constant, the jump is followed after computing the result of the
> > condition; else both directions are followed, by pushing one of
> > the next states to a linked list of next states to process. We
> > keep a finite number of pending states to process.
>
> Is this actually necessary, or can we just bail out on any branch that
> we can't statically resolve?
Aaaah, now I get what's going on. You statically compute a bitmask
that says whether a given syscall number always has a fixed result
*per architecture number*, and then use that later to decide whether
results can be cached for the combination of a specific seccomp filter
and a specific architecture number. Which mostly works, except that it
means you end up with weird per-thread caches and you get interference
between ABIs (so if a process e.g. filters the argument numbers for
syscall 123 in ABI 1, the results for syscall 123 in ABI 2 also can't
be cached).
Anyway, even though this works, I think it's the wrong way to go about it.
On Mon, Sep 21, 2020 at 12:47 PM Jann Horn <[email protected]> wrote:
> Is this actually necessary, or can we just bail out on any branch that
> we can't statically resolve?
I think after we do enumerate the arch numbers it would make much more
sense. Since if there is a branch after arch number and syscall
numbers are fixed we can assume that the return values will be
different if one or the other case is followed.
> Also: If it turns out that the time spent in seccomp_cache_prepare()
> is measurable for large filters, a possible improvement would be to
> keep track of the last syscall number for which the result would be
> the same as for the current one, such that instead of evaluating the
> filter for one instruction at a time, it would effectively be
> evaluated for a range at a time. That should be pretty straightforward
> to implement, I think.
My concern was more of the possibly-exponential amount of time &
memory needed to evaluate an adversarial filter containing full of
unresolveable branches, hence the max pending states. If we never
follow both branches then evaluation should not be much of a concern.
> > + depends on SECCOMP
> > + depends on SECCOMP_FILTER
>
> SECCOMP_FILTER already depends on SECCOMP, so the "depends on SECCOMP"
> line is unnecessary.
The reason that this is here is because of the looks in menuconfig.
SECCOMP is the direct previous entry, so if this depends on SECCOMP
then the config would be indented. Is this looks not worth keeping or
is there some better way to do this?
> > + help
> > + Seccomp filters can potentially incur large overhead for each
> > + system call. This can alleviate some of the overhead.
> > +
> > + If in doubt, select 'none'.
>
> This should not be in arch/x86. Other architectures, such as arm64,
> should also be able to use this without extra work.
In the initial RFC patch I only added to x86. I could add it to any
arch that has seccomp filters. Though, I'm wondering, why is SECCOMP
in the arch-specific Kconfigs?
> I think we should probably just bail out if we see anything that's
> BPF_ST/BPF_STX. I've never seen seccomp filters that actually use that
> part of cBPF.
>
> But in case we do need this, maybe instead of using "2 +" for all
> these things, the cBPF memory slots should be in a separate array.
Ok I'll just bail.
YiFei Zhu
On Tue, Sep 22, 2020 at 1:44 AM YiFei Zhu <[email protected]> wrote:
> On Mon, Sep 21, 2020 at 12:47 PM Jann Horn <[email protected]> wrote:
> > > + depends on SECCOMP
> > > + depends on SECCOMP_FILTER
> >
> > SECCOMP_FILTER already depends on SECCOMP, so the "depends on SECCOMP"
> > line is unnecessary.
>
> The reason that this is here is because of the looks in menuconfig.
> SECCOMP is the direct previous entry, so if this depends on SECCOMP
> then the config would be indented. Is this looks not worth keeping or
> is there some better way to do this?
Ah, I didn't realize this.
> > > + help
> > > + Seccomp filters can potentially incur large overhead for each
> > > + system call. This can alleviate some of the overhead.
> > > +
> > > + If in doubt, select 'none'.
> >
> > This should not be in arch/x86. Other architectures, such as arm64,
> > should also be able to use this without extra work.
>
> In the initial RFC patch I only added to x86. I could add it to any
> arch that has seccomp filters. Though, I'm wondering, why is SECCOMP
> in the arch-specific Kconfigs?
Ugh, yeah, the existing code is already bad... as far as I can tell,
SECCOMP shouldn't be there, and instead the arch-specific Kconfig
should define something like HAVE_ARCH_SECCOMP and then arch/Kconfig
would define SECCOMP and let it depend on HAVE_ARCH_SECCOMP. It's
really gross how the SECCOMP config description has been copypasted
into a dozen different Kconfig files; and looking around a bit, you
can actually see that e.g. s390 has an utterly outdated help text
which still claims that seccomp is controlled via the ancient
"/proc/<pid>/seccomp". I guess this very nicely illustrates why
putting such options into arch-specific Kconfig is a bad idea. :P
On Mon, Sep 21, 2020 at 7:26 PM Jann Horn <[email protected]> wrote:
> > In the initial RFC patch I only added to x86. I could add it to any
> > arch that has seccomp filters. Though, I'm wondering, why is SECCOMP
> > in the arch-specific Kconfigs?
>
> Ugh, yeah, the existing code is already bad... as far as I can tell,
> SECCOMP shouldn't be there, and instead the arch-specific Kconfig
> should define something like HAVE_ARCH_SECCOMP and then arch/Kconfig
> would define SECCOMP and let it depend on HAVE_ARCH_SECCOMP. It's
> really gross how the SECCOMP config description has been copypasted
> into a dozen different Kconfig files; and looking around a bit, you
> can actually see that e.g. s390 has an utterly outdated help text
> which still claims that seccomp is controlled via the ancient
> "/proc/<pid>/seccomp". I guess this very nicely illustrates why
> putting such options into arch-specific Kconfig is a bad idea. :P
Ah, time to fix this then.
YiFei Zhu