LinuxLists.cc - Re: Using ftrace/perf as a basis for generic seccomp

2011-02-01 14:58:18

Subject: Re: Using ftrace/perf as a basis for generic seccomp

On Wed, Jan 12, 2011 at 4:28 PM, Eric Paris <[email protected]> wrote:
> Some time ago Adam posted a patch to allow for a generic seccomp
> implementation (unlike the current seccomp where your choice is all
> syscalls or only read, write, sigreturn, and exit) which got little
> traction and it was suggested he instead do the same thing somehow using
> the tracing code:
> http://thread.gmane.org/gmane.linux.kernel/833556
> The actual method that this could be achieved was apparently left as an
> exercise for the reader. ?Since I'd like to do something similar (and
> actually basically reimplemented Adam's code before I found this thread)
> I guess that makes me the reader. ?I've never touched
> perf/ftrace/whatever so I'm not even knowledgeably enough to ask good
> questions so please, try to talk to me like a 2 year old.
>
> I started playing a bit having no idea where to start decided to see
> where something like:
> perf stat -e syscalls:sys_enter_read -e syscalls:sys_enter_write -- ./seccomp_test
> Ended up in the kernel. ?It ended up I saw in perf_syscall_enter(). ?So
> I decided to do a little hacking and added this little patch segment:
>
> diff --git a/kernel/trace/trace_syscalls.c b/kernel/trace/trace_syscalls.c
> index bac752f..6653995 100644
> --- a/kernel/trace/trace_syscalls.c
> +++ b/kernel/trace/trace_syscalls.c
> @@ -495,8 +495,12 @@ static void perf_syscall_enter(void *ignore, struct pt_regs *regs, long id)
> ? ? ? ?int size;
>
> ? ? ? ?syscall_nr = syscall_get_nr(current, regs);
> - ? ? ? if (!test_bit(syscall_nr, enabled_perf_enter_syscalls))
> +
> + ? ? ? if (!test_bit(syscall_nr, enabled_perf_enter_syscalls)) {
> + ? ? ? ? ? ? ? if (current->seccomp.mode == 2)
> + ? ? ? ? ? ? ? ? ? ? ? do_exit(SIGKILL);
> ? ? ? ? ? ? ? ?return;
> + ? ? ? }
>
> ? ? ? ?sys_data = syscall_nr_to_meta(syscall_nr);
> ? ? ? ?if (!sys_data)
>
> Which appears to be a necessary, but not sufficient, requirement, since
> another unrelated task could also have a 'watch?' on other syscalls. ?So
> I hacked in this little PoS into the filter code.
>
> diff --git a/kernel/perf_event.c b/kernel/perf_event.c
> index eac7e33..d8c1c8f 100644
> --- a/kernel/perf_event.c
> +++ b/kernel/perf_event.c
> @@ -4780,15 +4780,19 @@ void perf_tp_event(u64 addr, u64 count, void *record, int entry_size,
> ? ? ? ? ? ? ? ?.size = entry_size,
> ? ? ? ? ? ? ? ?.data = record,
> ? ? ? ?};
> + ? ? ? int found = 0;
>
> ? ? ? ?perf_sample_data_init(&data, addr);
> ? ? ? ?data.raw = &raw;
>
> ? ? ? ?hlist_for_each_entry_rcu(event, node, head, hlist_entry) {
> - ? ? ? ? ? ? ? if (perf_tp_event_match(event, &data, regs))
> + ? ? ? ? ? ? ? if (perf_tp_event_match(event, &data, regs)) {
> + ? ? ? ? ? ? ? ? ? ? ? found = 1;
> ? ? ? ? ? ? ? ? ? ? ? ?perf_swevent_event(event, count, 1, &data, regs);
> + ? ? ? ? ? ? ? }
> ? ? ? ?}
> -
> + ? ? ? if (current->seccomp.mode == 2 && !found)
> + ? ? ? ? ? ? ? do_exit(SIGKILL);
> ? ? ? ?perf_swevent_put_recursion_context(rctx);
> ?}
> ?EXPORT_SYMBOL_GPL(perf_tp_event);
>
> Which seems to get me a 'working' version of generic seccomp on top of
> ftrace. ?Problem is it makes me feel dirty, I'm logging a bunch of trace
> stuff I don't care about, and I'm sure its being done wrong 1001 ways.
> I know that do_exit(SIGKILL) is actually really wrong since it ends up
> giving me this crap (but i don't know how to do it better from there)
>
> note: seccomp-test[2485] exited with preempt_count 1
> BUG: sleeping function called from invalid context at kernel/rwsem.c:21
>
> So, finally, onto the question. ?How would you guys do it? ?The tracing
> code seems to me to be built on the idea of recording information on a
> very small limited set of events, not blocking access on the complement
> of a small limited set of events.
>
> I'm not seeing how the tracing code is better than the generic seccomp
> code that Adam wrote, but hopefully someone can enlighten me as to how
> this can be done reasonably. ?I need all the guidance you can offer
> because I don't really see what next steps should be!

Ping tracing people. I'm not seeing steps forward. At this point I'm
going to start looking at Adam's code again. I can think of a couple
of cleanups and simplifications to his code. I just don't see how
using tracing is supposed to work better.....

-Eric

2011-02-02 12:14:49

by 平松雅巳 / HIRAMATU，MASAMI

[permalink] [raw]

Subject: Re: Using ftrace/perf as a basis for generic seccomp

Hi Eric,

(2011/02/01 23:58), Eric Paris wrote:
> On Wed, Jan 12, 2011 at 4:28 PM, Eric Paris <[email protected]> wrote:
>> Some time ago Adam posted a patch to allow for a generic seccomp
>> implementation (unlike the current seccomp where your choice is all
>> syscalls or only read, write, sigreturn, and exit) which got little
>> traction and it was suggested he instead do the same thing somehow using
>> the tracing code:
>> http://thread.gmane.org/gmane.linux.kernel/833556

Hm, interesting idea :)
But why would you like to use tracing code? just for hooking?

>> The actual method that this could be achieved was apparently left as an
>> exercise for the reader. Since I'd like to do something similar (and
>> actually basically reimplemented Adam's code before I found this thread)
>> I guess that makes me the reader. I've never touched
>> perf/ftrace/whatever so I'm not even knowledgeably enough to ask good
>> questions so please, try to talk to me like a 2 year old.

OK, I'll try to explain;

Ftrace/perf syscall event tracing is based on syscall tracepoints
(sys_enter and sys_exit) which are implemented as a special hook
requiring TIF_SYSCALL_TRACEPOINT.

--<arch/x86/kernel/ptrace.c>--
asmregparm long syscall_trace_enter(struct pt_regs *regs)
{
long ret = 0;
[...]
/* do the secure computing check first */
secure_computing(regs->orig_ax); <-- secomp!

if (unlikely(test_thread_flag(TIF_SYSCALL_EMU)))
ret = -1L;

if ((ret || test_thread_flag(TIF_SYSCALL_TRACE)) &&
tracehook_report_syscall_entry(regs))
ret = -1L;

if (unlikely(test_thread_flag(TIF_SYSCALL_TRACEPOINT)))
trace_sys_enter(regs, regs->orig_ax); <--here!
----

All syscalls issued by threads which has TIF_SYSCALL_TRACEPOINT
kick trace_sys_enter() tracepoint, and then the tracepoint calls
ftrace handler or perf handler.

And this tracepoint is not only for ftrace/perf, but also you
can use it directly via register_trace_sys_enter() (the tracepoint
can be shared among several handlers). If you just want to hook
the syscall entry, I recommend that instead of modifying ftrace/perf.
See kernel/trace/trace_syscalls.c, Documentation/trace/tracepoints.txt
and samples/tracepoints/ for details.

However, I think here is an ordering problem. As you can see, secomp
hook is done before these hooks, that might cause a problem because
tracehook_report_syscall_entry(), which is also done before tracepoint,
is used by ptrace().
This means that someone can hook into an unsafe syscall via debugger.

So, finally, I think you'd better expand secure_computing() hook, or
introduce more generic hook-point.

Thank you,

--
Masami HIRAMATSU
2nd Dept. Linux Technology Center
Hitachi, Ltd., Systems Development Laboratory
E-mail: [email protected]

2011-02-02 12:26:34

by Ingo Molnar

[permalink] [raw]

Subject: Re: Using ftrace/perf as a basis for generic seccomp

* Masami Hiramatsu <[email protected]> wrote:

> Hi Eric,
>
> (2011/02/01 23:58), Eric Paris wrote:
> > On Wed, Jan 12, 2011 at 4:28 PM, Eric Paris <[email protected]> wrote:
> >> Some time ago Adam posted a patch to allow for a generic seccomp
> >> implementation (unlike the current seccomp where your choice is all
> >> syscalls or only read, write, sigreturn, and exit) which got little
> >> traction and it was suggested he instead do the same thing somehow using
> >> the tracing code:
> >> http://thread.gmane.org/gmane.linux.kernel/833556
>
> Hm, interesting idea :)
> But why would you like to use tracing code? just for hooking?

What I suggested before was to reuse the scripting engine and the tracepoints.

I.e. the "seccomp restrictions" can be implemented via a filter expression - and the
scripting engine could be generalized so that such 'sandboxing' code can make use of
it.

For example, if you want to restrict a process to only allow open() syscalls to fd 4
(a very restrictive sandbox), it could be done via this filter expression:

'fd == 4'

etc. Note that obviously the scripting engine needs to be abstracted out somewhat -
but this is the basic idea, to reuse the callbacks and reuse the scripting engine
for runtime filtering of syscall parameters.

Thanks,

Ingo

2011-02-02 16:46:07

by Eric Paris

[permalink] [raw]

Subject: Re: Using ftrace/perf as a basis for generic seccomp

On Wed, 2011-02-02 at 13:26 +0100, Ingo Molnar wrote:
> * Masami Hiramatsu <[email protected]> wrote:
>
> > Hi Eric,
> >
> > (2011/02/01 23:58), Eric Paris wrote:
> > > On Wed, Jan 12, 2011 at 4:28 PM, Eric Paris <[email protected]> wrote:
> > >> Some time ago Adam posted a patch to allow for a generic seccomp
> > >> implementation (unlike the current seccomp where your choice is all
> > >> syscalls or only read, write, sigreturn, and exit) which got little
> > >> traction and it was suggested he instead do the same thing somehow using
> > >> the tracing code:
> > >> http://thread.gmane.org/gmane.linux.kernel/833556
> >
> > Hm, interesting idea :)
> > But why would you like to use tracing code? just for hooking?
>
> What I suggested before was to reuse the scripting engine and the tracepoints.
>
> I.e. the "seccomp restrictions" can be implemented via a filter expression - and the
> scripting engine could be generalized so that such 'sandboxing' code can make use of
> it.
>
> For example, if you want to restrict a process to only allow open() syscalls to fd 4
> (a very restrictive sandbox), it could be done via this filter expression:
>
> 'fd == 4'
>
> etc. Note that obviously the scripting engine needs to be abstracted out somewhat -
> but this is the basic idea, to reuse the callbacks and reuse the scripting engine
> for runtime filtering of syscall parameters.

Any pointers on what is involved in this abstraction? I can work out
the details, but I don't know the big picture well enough to even start
to move forwards.....

-Eric

2011-02-02 17:56:16

by Ingo Molnar

[permalink] [raw]

Subject: Re: Using ftrace/perf as a basis for generic seccomp

* Eric Paris <[email protected]> wrote:

> On Wed, 2011-02-02 at 13:26 +0100, Ingo Molnar wrote:
> > * Masami Hiramatsu <[email protected]> wrote:
> >
> > > Hi Eric,
> > >
> > > (2011/02/01 23:58), Eric Paris wrote:
> > > > On Wed, Jan 12, 2011 at 4:28 PM, Eric Paris <[email protected]> wrote:
> > > >> Some time ago Adam posted a patch to allow for a generic seccomp
> > > >> implementation (unlike the current seccomp where your choice is all
> > > >> syscalls or only read, write, sigreturn, and exit) which got little
> > > >> traction and it was suggested he instead do the same thing somehow using
> > > >> the tracing code:
> > > >> http://thread.gmane.org/gmane.linux.kernel/833556
> > >
> > > Hm, interesting idea :)
> > > But why would you like to use tracing code? just for hooking?
> >
> > What I suggested before was to reuse the scripting engine and the tracepoints.
> >
> > I.e. the "seccomp restrictions" can be implemented via a filter expression - and the
> > scripting engine could be generalized so that such 'sandboxing' code can make use of
> > it.
> >
> > For example, if you want to restrict a process to only allow open() syscalls to fd 4
> > (a very restrictive sandbox), it could be done via this filter expression:
> >
> > 'fd == 4'
> >
> > etc. Note that obviously the scripting engine needs to be abstracted out somewhat -
> > but this is the basic idea, to reuse the callbacks and reuse the scripting engine
> > for runtime filtering of syscall parameters.
>
> Any pointers on what is involved in this abstraction? I can work out
> the details, but I don't know the big picture well enough to even start
> to move forwards.....

perf has support for these filters, so would it work with you if I gave you some
example usage?

First you identify an interesting tracepoint - look at the list of:

perf list | grep Tracepoint

Say we want to filter sys_close() events, so we pick:

syscalls:sys_enter_close [Tracepoint event]

And record all sys_open (enter) events in the system, for one second:

perf record -e syscalls:sys_enter_close -a sleep 1

All the recorded data will be in perf.data in cwd.

'perf report' will show a profile, and 'perf script' will show the trace output:

perf-30558 [002] 117691.065243: sys_enter_close: fd: 0x00000016
perf-30558 [002] 117691.065406: sys_enter_close: fd: 0x00000016
perf-30558 [002] 117691.065443: sys_enter_close: fd: 0x00000017
perf-30558 [002] 117691.065444: sys_enter_close: fd: 0x00000016
[...]

Now, to record a 'filtered' event, use the --filter parameter when recording:

Available field names can be found in the 'format' file:

cat /debug/tracing/events/syscalls/sys_close_enter/format

name: sys_enter_close
ID: 402
format:
field:unsigned short common_type; offset:0; size:2; signed:0;
field:unsigned char common_flags; offset:2; size:1; signed:0;
field:unsigned char common_preempt_count; offset:3; size:1; signed:0;
field:int common_pid; offset:4; size:4; signed:1;
field:int common_lock_depth; offset:8; size:4; signed:1;

field:int nr; offset:12; size:4; signed:1;
field:unsigned int fd; offset:16; size:8; signed:0;

print fmt: "fd: 0x%08lx", ((unsigned long)(REC->fd))

The interesting ones is:

field:unsigned int fd; offset:16; size:8; signed:0;

This is the field that represents the fd of the close(fd) call. To filter it, simply
use it symbolically:

perf record -e syscalls:sys_enter_close --filter 'fd==3' ./hackbench 5

As you can see it in 'perf script' output:

hackbench-30576 [008] 117802.180002: sys_enter_close: fd: 0x00000003
hackbench-30576 [008] 117802.222056: sys_enter_close: fd: 0x00000003
hackbench-30576 [008] 117802.222064: sys_enter_close: fd: 0x00000003
hackbench-30576 [008] 117802.222065: sys_enter_close: fd: 0x00000003
hackbench-30576 [008] 117802.222067: sys_enter_close: fd: 0x00000003
hackbench-30576 [008] 117802.222069: sys_enter_close: fd: 0x00000003
hackbench-30576 [008] 117802.222070: sys_enter_close: fd: 0x00000003
hackbench-30576 [008] 117802.222071: sys_enter_close: fd: 0x00000003
hackbench-30576 [008] 117802.222073: sys_enter_close: fd: 0x00000003

Only fd==3 events were recorded.

The filter expression engine executes in the kernel, when the event happens. The
user-space perf tool parses the --filter parameter and passes it to the kernel as a
string in essence. The kerner parses this into atomic predicaments which are linked
to the event structure. When the event happens the predicaments are executed by the
filter engine.

The expressions are simple, but rather flexible, so you can do 'fd==0||fd==1' and
more complex expressions, etc. The engine could also be extended.

The kernel code is mostly in kernel/trace/trace_events_filter.c.

I've Cc:-ed Tom, Frederic, Steve, Li Zefan and Arnaldo who have worked on the filter
engine, in case something is broken with this functionality or if there are other
questions :)

Thanks,

Ingo

2011-02-02 18:17:59

by Steven Rostedt

[permalink] [raw]

Subject: Re: Using ftrace/perf as a basis for generic seccomp

On Wed, 2011-02-02 at 18:55 +0100, Ingo Molnar wrote:

> The filter expression engine executes in the kernel, when the event happens. The
> user-space perf tool parses the --filter parameter and passes it to the kernel as a
> string in essence. The kerner parses this into atomic predicaments which are linked
> to the event structure. When the event happens the predicaments are executed by the
> filter engine.
>
> The expressions are simple, but rather flexible, so you can do 'fd==0||fd==1' and
> more complex expressions, etc. The engine could also be extended.
>
> The kernel code is mostly in kernel/trace/trace_events_filter.c.
>
> I've Cc:-ed Tom, Frederic, Steve, Li Zefan and Arnaldo who have worked on the filter
> engine, in case something is broken with this functionality or if there are other
> questions :)

Yep, and I'm currently working on them as well. As they currently have a
32 pred limit (may seem like a lot, but I actually hit it). And I've
also added short circuits (0 && .... no need to process more).

I posted an RFC:

https://lkml.org/lkml/2011/1/27/438

and I'm again working on finishing it. Just a few more things to do. I
got side tracked because my employer actually asked me to do something
for them ;)

-- Steve

2011-02-03 19:06:53

by Frederic Weisbecker

[permalink] [raw]

Subject: Re: Using ftrace/perf as a basis for generic seccomp

On Wed, Feb 02, 2011 at 11:45:22AM -0500, Eric Paris wrote:
> On Wed, 2011-02-02 at 13:26 +0100, Ingo Molnar wrote:
> > * Masami Hiramatsu <[email protected]> wrote:
> >
> > > Hi Eric,
> > >
> > > (2011/02/01 23:58), Eric Paris wrote:
> > > > On Wed, Jan 12, 2011 at 4:28 PM, Eric Paris <[email protected]> wrote:
> > > >> Some time ago Adam posted a patch to allow for a generic seccomp
> > > >> implementation (unlike the current seccomp where your choice is all
> > > >> syscalls or only read, write, sigreturn, and exit) which got little
> > > >> traction and it was suggested he instead do the same thing somehow using
> > > >> the tracing code:
> > > >> http://thread.gmane.org/gmane.linux.kernel/833556
> > >
> > > Hm, interesting idea :)
> > > But why would you like to use tracing code? just for hooking?
> >
> > What I suggested before was to reuse the scripting engine and the tracepoints.
> >
> > I.e. the "seccomp restrictions" can be implemented via a filter expression - and the
> > scripting engine could be generalized so that such 'sandboxing' code can make use of
> > it.
> >
> > For example, if you want to restrict a process to only allow open() syscalls to fd 4
> > (a very restrictive sandbox), it could be done via this filter expression:
> >
> > 'fd == 4'
> >
> > etc. Note that obviously the scripting engine needs to be abstracted out somewhat -
> > but this is the basic idea, to reuse the callbacks and reuse the scripting engine
> > for runtime filtering of syscall parameters.
>
> Any pointers on what is involved in this abstraction? I can work out
> the details, but I don't know the big picture well enough to even start
> to move forwards.....

In the big picture, the filtering code is very tight to the tracing code.
Creation, initialization, removal of filters is all made on top of the
trace events structures (struct ftrace_event_call) because we apply and
interpret filters on the fields of trace events, which are what we save
in a trace.

Example:

If you look at the sched switch trace events, we have several fields
like prev_comm and next_comm. These are defined in the TRACE_EVENT()
macros calls. So when we apply a filter like "prev_comm == firefox-bin",
we enter the filtering code with the trace_event structure for sched
switch events and iterate through its fields to find one called
prev_comm and then we work on top of that.
I think you won't work with trace events, so you need to make the
filtering code more tracing-agnostic.

But I think it's quite workable and shouldn't be too hard to split that
into a filtering backend. Many parts are already pretty standalone.

Also I suspect the tracepoints are not what you need. Or may be
they are. But as Masami said, the syscall tracepoint is called late.
It's workable though. The other problem is that preemption is disabled
when tracepoints are called, which is probably not what you want.
One day I think we'll need to unify the tracepoints and notifier
code but until then, better keep tracepoints for tracing.

Now once you have the filtering code more generic, you still
need an arch backend to map register contents and layout into syscall
arguments name and type. On top of which you can finally use the filtering
code. For that you can use, again, some code we use for tracing, which
are syscalls metadata: informations generated on build time
that have syscalls fields and type.
And that also needs to be split up, but it's more trivial
than the filtering part.

Note for now, filtering + syscalls metadata only works on top
of raw arguments value. Syscalls metadata don't know much
about type semantics and won't help you to dereference
syscall argument pointers. Only raw syscall parameter values.
Similarly, the filtering code can't evaluate pointer dereferencing
expression evaluation, only direct values comprehension.

But please note this is all features we want in the long term
anyway, using the kprobe expression code to intepret dereferencing,
and have more type introspection into kernel structures for
smarter syscalls metadata. And we can do that all gradually
without breaking backward.

Now with the current features you'll already have access to
a much more powerful seccomp implementation.

And if you have questions about anything, please don't hesitate.

2011-02-03 19:18:53

by Frederic Weisbecker

[permalink] [raw]

Subject: Re: Using ftrace/perf as a basis for generic seccomp

On Thu, Feb 03, 2011 at 08:06:45PM +0100, Frederic Weisbecker wrote:
> On Wed, Feb 02, 2011 at 11:45:22AM -0500, Eric Paris wrote:
> > On Wed, 2011-02-02 at 13:26 +0100, Ingo Molnar wrote:
> > > * Masami Hiramatsu <[email protected]> wrote:
> > >
> > > > Hi Eric,
> > > >
> > > > (2011/02/01 23:58), Eric Paris wrote:
> > > > > On Wed, Jan 12, 2011 at 4:28 PM, Eric Paris <[email protected]> wrote:
> > > > >> Some time ago Adam posted a patch to allow for a generic seccomp
> > > > >> implementation (unlike the current seccomp where your choice is all
> > > > >> syscalls or only read, write, sigreturn, and exit) which got little
> > > > >> traction and it was suggested he instead do the same thing somehow using
> > > > >> the tracing code:
> > > > >> http://thread.gmane.org/gmane.linux.kernel/833556
> > > >
> > > > Hm, interesting idea :)
> > > > But why would you like to use tracing code? just for hooking?
> > >
> > > What I suggested before was to reuse the scripting engine and the tracepoints.
> > >
> > > I.e. the "seccomp restrictions" can be implemented via a filter expression - and the
> > > scripting engine could be generalized so that such 'sandboxing' code can make use of
> > > it.
> > >
> > > For example, if you want to restrict a process to only allow open() syscalls to fd 4
> > > (a very restrictive sandbox), it could be done via this filter expression:
> > >
> > > 'fd == 4'
> > >
> > > etc. Note that obviously the scripting engine needs to be abstracted out somewhat -
> > > but this is the basic idea, to reuse the callbacks and reuse the scripting engine
> > > for runtime filtering of syscall parameters.
> >
> > Any pointers on what is involved in this abstraction? I can work out
> > the details, but I don't know the big picture well enough to even start
> > to move forwards.....
>
> In the big picture, the filtering code is very tight to the tracing code.
> Creation, initialization, removal of filters is all made on top of the
> trace events structures (struct ftrace_event_call) because we apply and
> interpret filters on the fields of trace events, which are what we save
> in a trace.
>
> Example:
>
> If you look at the sched switch trace events, we have several fields
> like prev_comm and next_comm. These are defined in the TRACE_EVENT()
> macros calls. So when we apply a filter like "prev_comm == firefox-bin",
> we enter the filtering code with the trace_event structure for sched
> switch events and iterate through its fields to find one called
> prev_comm and then we work on top of that.
> I think you won't work with trace events, so you need to make the
> filtering code more tracing-agnostic.
>
> But I think it's quite workable and shouldn't be too hard to split that
> into a filtering backend. Many parts are already pretty standalone.
>
> Also I suspect the tracepoints are not what you need. Or may be
> they are. But as Masami said, the syscall tracepoint is called late.
> It's workable though. The other problem is that preemption is disabled
> when tracepoints are called, which is probably not what you want.
> One day I think we'll need to unify the tracepoints and notifier
> code but until then, better keep tracepoints for tracing.
>
> Now once you have the filtering code more generic, you still
> need an arch backend to map register contents and layout into syscall
> arguments name and type. On top of which you can finally use the filtering
> code. For that you can use, again, some code we use for tracing, which
> are syscalls metadata: informations generated on build time
> that have syscalls fields and type.
> And that also needs to be split up, but it's more trivial
> than the filtering part.
>
> Note for now, filtering + syscalls metadata only works on top
> of raw arguments value. Syscalls metadata don't know much
> about type semantics and won't help you to dereference
> syscall argument pointers. Only raw syscall parameter values.
> Similarly, the filtering code can't evaluate pointer dereferencing
> expression evaluation, only direct values comprehension.

Actually we have string comparison supported by the filtering code.
Still we need safe accessors (copy_from_user()) from filtering code
to use that safely on syscall parameters.

2011-02-03 22:26:48

by Stefan Fritsch

[permalink] [raw]

Subject: Re: Using ftrace/perf as a basis for generic seccomp

Hi,

On Thursday 03 February 2011, Frederic Weisbecker wrote:
> I think you won't work with trace events, so you need to make the
> filtering code more tracing-agnostic.
>
> But I think it's quite workable and shouldn't be too hard to split
> that into a filtering backend. Many parts are already pretty
> standalone.
>
> Also I suspect the tracepoints are not what you need. Or may be
> they are. But as Masami said, the syscall tracepoint is called
> late. It's workable though. The other problem is that preemption
> is disabled when tracepoints are called, which is probably not
> what you want. One day I think we'll need to unify the tracepoints
> and notifier code but until then, better keep tracepoints for
> tracing.
>
> Now once you have the filtering code more generic, you still
> need an arch backend to map register contents and layout into
> syscall arguments name and type. On top of which you can finally
> use the filtering code. For that you can use, again, some code we
> use for tracing, which are syscalls metadata: informations
> generated on build time that have syscalls fields and type.
> And that also needs to be split up, but it's more trivial
> than the filtering part.

AFAICS the infrastructure for tracing and metadata of compat syscalls
is also still missing. That would need to be added, too. Jason Baron
and Ian Munsie have worked on this in mid 2010, but I don't know about
the current status.

Considering that all this is still quite a bit of work and that the
initial suggestion by Adam Langley happened nearly two years ago,
maybe a two step approach would be better:

Integrate a seccomp mode 2 now, which only supports a bitmask of
bitmaps and no filtering.

Then, when the infrastructure for the filtering is finished, add a
seccomp mode 3 with support for filtering.

This would give something in the very near future that is way more
usable than seccomp mode 1. I think only the following adjustments
would need to be made to Adam Langley's patch:

- only allow syscalls in the mode (non-compat/compat) that the prctl
call was made in
- deny exec of setuid/setgid binaries
- deny exec of binaries with filesystem capabilities

What do you think of this proposal? I have a patch lying around
somewhere that already does the first two of these.

BTW, given that the compat syscall layer tends to have security bugs,
being able to disable compat syscalls per process would already have
some value on its own. Is there a way to do this by disabling the int
80 and compat sysenter()/syscall() vectors per process? In my patch I
did a check in secure_computing(), which is of course less elegant.

> Note for now, filtering + syscalls metadata only works on top
> of raw arguments value. Syscalls metadata don't know much
> about type semantics and won't help you to dereference
> syscall argument pointers. Only raw syscall parameter values.
> Similarly, the filtering code can't evaluate pointer dereferencing
> expression evaluation, only direct values comprehension.

Pointer dereferencing at syscall entry must be avoided for seccomp
anyway, or there would be race conditions. Of course if the filtering
points could be put after the final copy_form_user, it would be ok.

Cheers,
Stefan

2011-02-03 23:11:01

by Frederic Weisbecker

[permalink] [raw]

Subject: Re: Using ftrace/perf as a basis for generic seccomp

On Thu, Feb 03, 2011 at 11:06:33PM +0100, Stefan Fritsch wrote:
> Hi,
>
> On Thursday 03 February 2011, Frederic Weisbecker wrote:
> > I think you won't work with trace events, so you need to make the
> > filtering code more tracing-agnostic.
> >
> > But I think it's quite workable and shouldn't be too hard to split
> > that into a filtering backend. Many parts are already pretty
> > standalone.
> >
> > Also I suspect the tracepoints are not what you need. Or may be
> > they are. But as Masami said, the syscall tracepoint is called
> > late. It's workable though. The other problem is that preemption
> > is disabled when tracepoints are called, which is probably not
> > what you want. One day I think we'll need to unify the tracepoints
> > and notifier code but until then, better keep tracepoints for
> > tracing.
> >
> > Now once you have the filtering code more generic, you still
> > need an arch backend to map register contents and layout into
> > syscall arguments name and type. On top of which you can finally
> > use the filtering code. For that you can use, again, some code we
> > use for tracing, which are syscalls metadata: informations
> > generated on build time that have syscalls fields and type.
> > And that also needs to be split up, but it's more trivial
> > than the filtering part.
>
> AFAICS the infrastructure for tracing and metadata of compat syscalls
> is also still missing. That would need to be added, too. Jason Baron
> and Ian Munsie have worked on this in mid 2010, but I don't know about
> the current status.

Oh we have these features for a while now. Jason and others have
made a great work on it.
For example you can accept only read() calls when fd = 7 with perf:

./perf record -a -e syscalls:sys_enter_read --filter "fd==7"
./perf script

perf-2956 [000] 6427.757791: sys_enter_read: fd: 0x00000007, buf: 0x7fdd2c095000, count: 0x00000400
perf-2956 [000] 6427.757858: sys_enter_read: fd: 0x00000007, buf: 0x7fdd2c095000, count: 0x00000400
perf-2956 [000] 6427.757877: sys_enter_read: fd: 0x00000007, buf: 0x7fdd2c095000, count: 0x00000400
perf-2956 [000] 6427.757949: sys_enter_read: fd: 0x00000007, buf: 0x7fdd2c095000, count: 0x00000400
perf-2956 [000] 6427.757968: sys_enter_read: fd: 0x00000007, buf: 0x7fdd2c095000, count: 0x00000400

Or you can do it with ftrace.

Now what Jason and Ian have worked on lately was rather about compat syscalls support and power pc
support amongst other things.

> Considering that all this is still quite a bit of work and that the
> initial suggestion by Adam Langley happened nearly two years ago,
> maybe a two step approach would be better:
>
> Integrate a seccomp mode 2 now, which only supports a bitmask of
> bitmaps and no filtering.
>
> Then, when the infrastructure for the filtering is finished, add a
> seccomp mode 3 with support for filtering.

Infrastructure for filtering will never be finished. IMO there will
always be many ways to make it better, simply because using such
conditional expressions offer tons of possibilities and is extendable
by nature.

Actually having seccomp as a user of filter expressions is a opportunity
to improve it and bring new ideas.

> This would give something in the very near future that is way more
> usable than seccomp mode 1. I think only the following adjustments
> would need to be made to Adam Langley's patch:
>
> - only allow syscalls in the mode (non-compat/compat) that the prctl
> call was made in
> - deny exec of setuid/setgid binaries
> - deny exec of binaries with filesystem capabilities
>
> What do you think of this proposal? I have a patch lying around
> somewhere that already does the first two of these.

IMHO this is an unnecessary intermediate state. It will actually make
things worse by bringing a new non-flexible ABI that we'll need to
maintain forever.

I'm no expert in security but I think it's not flexible.

> > of raw arguments value. Syscalls metadata don't know much
> > about type semantics and won't help you to dereference
> > syscall argument pointers. Only raw syscall parameter values.
> > Similarly, the filtering code can't evaluate pointer dereferencing
> > expression evaluation, only direct values comprehension.
>
> Pointer dereferencing at syscall entry must be avoided for seccomp
> anyway, or there would be race conditions. Of course if the filtering
> points could be put after the final copy_form_user, it would be ok.

Makes sense yeah.

2011-02-04 01:51:35

by Eric Paris

[permalink] [raw]

Subject: Re: Using ftrace/perf as a basis for generic seccomp

On Fri, 2011-02-04 at 00:10 +0100, Frederic Weisbecker wrote:
> On Thu, Feb 03, 2011 at 11:06:33PM +0100, Stefan Fritsch wrote:
> > On Thursday 03 February 2011, Frederic Weisbecker wrote:

> Actually having seccomp as a user of filter expressions is a opportunity
> to improve it and bring new ideas.
>
> > This would give something in the very near future that is way more
> > usable than seccomp mode 1. I think only the following adjustments
> > would need to be made to Adam Langley's patch:
> >
> > - only allow syscalls in the mode (non-compat/compat) that the prctl
> > call was made in
> > - deny exec of setuid/setgid binaries
> > - deny exec of binaries with filesystem capabilities
> >
> > What do you think of this proposal? I have a patch lying around
> > somewhere that already does the first two of these.
>
> IMHO this is an unnecessary intermediate state. It will actually make
> things worse by bringing a new non-flexible ABI that we'll need to
> maintain forever.
>
> I'm no expert in security but I think it's not flexible.

I'm not quite sure how I feel. The only person who ask/semi-required
this flexibility is Ingo. I agree it's a really great idea and maybe
one that people will someday use. I'm going to try to work on it over
the next week or two. I'm not certain if my use case really wants/needs
it or will even use the filter flexibility. Now, there's no question
that what we have today is so inflexible it's basically useless. It's
also obvious that we have at least 3 people who would use something like
the google solution. (sounds like at least 3 of us have written a
google like solution by now)

So I'm going to take a stab at understanding everything you told me
today. THANK YOU SO MUCH for the overview. It was AMAZING and exactly
what I was hoping to hear. But if I don't think I can come up with
something in a reasonable time frame I'm probably going to be back
pushing what we all wanted to start with :)

-Eric

2011-02-04 14:31:18

by Peter Zijlstra

[permalink] [raw]

Subject: Re: Using ftrace/perf as a basis for generic seccomp

On Thu, 2011-02-03 at 20:50 -0500, Eric Paris wrote:
> I'm going to try to work on it over
> the next week or two.

What is your use-case? Going by: http://lwn.net/Articles/332990/ syscall
based stuff (seccomp) is broken by design.

2011-02-04 16:30:22

by Eric Paris

[permalink] [raw]

Subject: Re: Using ftrace/perf as a basis for generic seccomp

On Fri, 2011-02-04 at 15:31 +0100, Peter Zijlstra wrote:
> On Thu, 2011-02-03 at 20:50 -0500, Eric Paris wrote:
> > I'm going to try to work on it over
> > the next week or two.
>
> What is your use-case? Going by: http://lwn.net/Articles/332990/ syscall
> based stuff (seccomp) is broken by design.

My personal goal is very different than an LSM. My goal is to reduce
attack surface. I'm not trying to implement an LSM. LSM hooks are
(intentionally) placed in the kernel after object resolution is
complete. In an LSM we don't check 'open' type operation until after
the pathname has been converted to an inode. We don't check some
'sendto' operations until after the data has been placed into an skb and
is about to be queued to a socket. There is a LOT of code between
syscall_entry() and any given LSM hook.

An obvious vulnerability that I'm sure all the people involved here know
would be the original perf syscall bounds checking vulnerability. If
I'm dealing with an application that I know will never use perf I'd like
a way to be able to completely disable the perf syscall and greatly
reduce the kernel attack surface. It would be almost impossible for an
LSM to hook between the syscall_enter() and the location of that
vulnerability in the perf syscall. In my particular case I'm thinking
about qemu, which never needs to call perf. I want a way to disable all
of the code after syscall_enter() for huge swaths of the kernel.

What we have today, called "seccomp", is a one way toggle,
prctl(PR_SET_SECCOMP, 1), which reduces the available syscalls to
read,write,exit, and sigreturn. Any other syscall results in a process
being immediately killed. It's a great idea to reduce the attack
surface of the kernel but it is too inflexible to be useful. I wonder
if anyone is using it.

Qemu on my box in just a couple of seconds of strace was found to use
futex, ioctl, read, rt_sigaction, select, timer_gettime, timer_settime,
and write. I'm sure that other well defined processes have other such
sort lists of required syscalls. I think a more flexible seccomp which
lets one remove syscalls from the allowed set (but never add them back)
can GREATLY reduce the kernel attack surface from malicious processes.

This is not a sandbox. This is not an LSM replacement. This is a per
syscall cutoff. It can be used to help build a stronger sandbox. I'll
likely see if this can't be used by the SELinux sandbox which already
uses the LSM hooks to control information flow and mediate access. But
SELinux does not control the sheer amount of the kernel code that can be
executed. I believe we can build a stronger sandbox using a flexible
seccomp as one of the tools. All we have to do is find one
vulnerability in the code between the syscall entry and a LSM hook which
would deny to operation to see the value in a per syscall control
mechanism.

As to doing it in seccomp code where it's all of a syscall or none vs
making use of the filter infrastructure to allow even more fine grained
control over the syscall is a question. I'm leaning more towards just
doing it in seccomp. We can't ever build a full and complete strong
sandbox using the filter code. James' assertions about copy_from_user()
are obviously correct. A chat with PeterZ privately on IRC indicated
that he also was not interested in seeing this creep into the tracing
code. Do we have a user that can articulate a need for greater
flexibility in their use of such a hardening tool?

I think given all these things I'm going to go back to looking at the
flexible seccomp for now. And maybe we should work towards using the
tracing filter code in the future if someone can articulate a real use
case.....

-Eric

2011-02-04 16:37:23

by Eric Paris

[permalink] [raw]

Subject: Re: Using ftrace/perf as a basis for generic seccomp

On Thu, 2011-02-03 at 23:06 +0100, Stefan Fritsch wrote:

> - only allow syscalls in the mode (non-compat/compat) that the prctl
> call was made in

This is what I was thinking. If it was a compat task when it dropped
things from the set of syscalls we should implicitly deny all non-compat
syscalls, and vice versa.

> - deny exec of setuid/setgid binaries
> - deny exec of binaries with filesystem capabilities

I think both of these are wrong to try to address here. The right way
to handle these is to

1) set prctl(SECBIT_NOROOT)
2) drop all caps from the bset, pP, pE, and pI
3) make sure the setuid(2) syscall (not to be confused with SETUID
filesystem bit) is not in the set of allowed syscalls. Thus rendering
suid and file with fcaps irrelevant.

-Eric

2011-02-04 17:04:58

by Frederic Weisbecker

[permalink] [raw]

Subject: Re: Using ftrace/perf as a basis for generic seccomp

On Fri, Feb 04, 2011 at 11:29:19AM -0500, Eric Paris wrote:
> On Fri, 2011-02-04 at 15:31 +0100, Peter Zijlstra wrote:
> > On Thu, 2011-02-03 at 20:50 -0500, Eric Paris wrote:
> > > I'm going to try to work on it over
> > > the next week or two.
> >
> > What is your use-case? Going by: http://lwn.net/Articles/332990/ syscall
> > based stuff (seccomp) is broken by design.
>
> My personal goal is very different than an LSM. My goal is to reduce
> attack surface. I'm not trying to implement an LSM. LSM hooks are
> (intentionally) placed in the kernel after object resolution is
> complete. In an LSM we don't check 'open' type operation until after
> the pathname has been converted to an inode. We don't check some
> 'sendto' operations until after the data has been placed into an skb and
> is about to be queued to a socket. There is a LOT of code between
> syscall_entry() and any given LSM hook.
>
> An obvious vulnerability that I'm sure all the people involved here know
> would be the original perf syscall bounds checking vulnerability. If
> I'm dealing with an application that I know will never use perf I'd like
> a way to be able to completely disable the perf syscall and greatly
> reduce the kernel attack surface. It would be almost impossible for an
> LSM to hook between the syscall_enter() and the location of that
> vulnerability in the perf syscall. In my particular case I'm thinking
> about qemu, which never needs to call perf. I want a way to disable all
> of the code after syscall_enter() for huge swaths of the kernel.
>
> What we have today, called "seccomp", is a one way toggle,
> prctl(PR_SET_SECCOMP, 1), which reduces the available syscalls to
> read,write,exit, and sigreturn. Any other syscall results in a process
> being immediately killed. It's a great idea to reduce the attack
> surface of the kernel but it is too inflexible to be useful. I wonder
> if anyone is using it.
>
> Qemu on my box in just a couple of seconds of strace was found to use
> futex, ioctl, read, rt_sigaction, select, timer_gettime, timer_settime,
> and write. I'm sure that other well defined processes have other such
> sort lists of required syscalls. I think a more flexible seccomp which
> lets one remove syscalls from the allowed set (but never add them back)
> can GREATLY reduce the kernel attack surface from malicious processes.
>
> This is not a sandbox. This is not an LSM replacement. This is a per
> syscall cutoff. It can be used to help build a stronger sandbox. I'll
> likely see if this can't be used by the SELinux sandbox which already
> uses the LSM hooks to control information flow and mediate access. But
> SELinux does not control the sheer amount of the kernel code that can be
> executed. I believe we can build a stronger sandbox using a flexible
> seccomp as one of the tools. All we have to do is find one
> vulnerability in the code between the syscall entry and a LSM hook which
> would deny to operation to see the value in a per syscall control
> mechanism.
>
> As to doing it in seccomp code where it's all of a syscall or none vs
> making use of the filter infrastructure to allow even more fine grained
> control over the syscall is a question. I'm leaning more towards just
> doing it in seccomp. We can't ever build a full and complete strong
> sandbox using the filter code. James' assertions about copy_from_user()
> are obviously correct. A chat with PeterZ privately on IRC indicated
> that he also was not interested in seeing this creep into the tracing
> code.

Note it's not about tracing here. It's about abstracting some tracing
features to make them standalone and usable outside tracing.

But yeah, now that I consider the fact that checks on pointers are
racy until objects are resolved (got my first security lesson), such
deep filtering up to dereferencing pointers is then pointless.

Now there are still immediate values for which there is still a point
(filtering fd, filtering opening mode, etc...).

> Do we have a user that can articulate a need for greater
> flexibility in their use of such a hardening tool?

So yeah, indeed we probably need to get more usecases to consider it.

2011-02-05 11:42:57

by Stefan Fritsch

[permalink] [raw]

Subject: Re: Using ftrace/perf as a basis for generic seccomp

On Fri, 4 Feb 2011, Eric Paris wrote:
> On Thu, 2011-02-03 at 23:06 +0100, Stefan Fritsch wrote:
>> - only allow syscalls in the mode (non-compat/compat) that the prctl
>> call was made in
>
> This is what I was thinking. If it was a compat task when it dropped
> things from the set of syscalls we should implicitly deny all non-compat
> syscalls, and vice versa.

I have attached my modified version of Adam's patch which alread does that
below. It has a few other modification because I first tried to have two
separate bitmasks for native and compat. But maybe it is still useful for
you.

>> - deny exec of setuid/setgid binaries
>> - deny exec of binaries with filesystem capabilities
>
> I think both of these are wrong to try to address here. The right way
> to handle these is to
>
> 1) set prctl(SECBIT_NOROOT)
> 2) drop all caps from the bset, pP, pE, and pI
> 3) make sure the setuid(2) syscall (not to be confused with SETUID
> filesystem bit) is not in the set of allowed syscalls. Thus rendering
> suid and file with fcaps irrelevant.

I think that's a bad idea. Some programs may get confused if run as root
but without capabilities (think sendmail). If a setuid root binary gets
confused enough to write arbitrary files as root, you get all capabilities
again by writing to /etc/crontab or /root/.ssh/authorized_keys. I admit
that this is very unlikely if setuid(2)/setgid(2) lead to the process
being killed, but better to be save and disallow the user change for
SETUID binaries completely. And the simplest way to do that seemed to me
to disallow exec'ing of SETUID binaries.

Cheers,
Stefan

diff --git a/fs/exec.c b/fs/exec.c
index c62efcb..be6e3c5 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1219,6 +1219,10 @@ int prepare_binprm(struct linux_binprm *bprm)
if (!(bprm->file->f_path.mnt->mnt_flags & MNT_NOSUID)) {
/* Set-uid? */
if (mode & S_ISUID) {
+#ifdef CONFIG_SECCOMP
+ if (unlikely(test_thread_flag(TIF_SECCOMP)))
+ return -EPERM; /* XXX: KILL instead? */
+#endif
bprm->per_clear |= PER_CLEAR_ON_SETID;
bprm->cred->euid = inode->i_uid;
}
@@ -1230,6 +1234,10 @@ int prepare_binprm(struct linux_binprm *bprm)
* executable.
*/
if ((mode & (S_ISGID | S_IXGRP)) == (S_ISGID | S_IXGRP)) {
+#ifdef CONFIG_SECCOMP
+ if (unlikely(test_thread_flag(TIF_SECCOMP)))
+ return -EPERM; /* XXX: KILL instead? */
+#endif
bprm->per_clear |= PER_CLEAR_ON_SETID;
bprm->cred->egid = inode->i_gid;
}
diff --git a/fs/proc/array.c b/fs/proc/array.c
index fff6572..6d4f296 100644
--- a/fs/proc/array.c
+++ b/fs/proc/array.c
@@ -337,6 +337,30 @@ static void task_cpus_allowed(struct seq_file *m, struct task_struct *task)
seq_printf(m, "\n");
}

+extern const char hex_asc[];
+
+static void task_show_seccomp(struct seq_file *m, struct task_struct *p) {
+#if defined(CONFIG_SECCOMP)
+ const int mode = p->seccomp.state ? p->seccomp.state->mode : 0;
+ unsigned i;
+
+ seq_printf(m, "Seccomp:\t%d\n", mode);
+ if (mode != 2)
+ return;
+ seq_printf(m, "Seccomp_mask:\t%d,", p->seccomp.state->bitlen);
+ for (i = 0; i < (p->seccomp.state->bitlen + 7) / 8; ++i) {
+ const uint8_t *mask = p->seccomp.state->bitmask;
+ seq_printf(m, "%c%c", hex_asc[mask[i] >> 4],
+ hex_asc[mask[i] & 15]);
+ }
+#ifdef CONFIG_COMPAT
+ if (p->seccomp.state->is_compat)
+ seq_printf(m, ", compat");
+#endif
+ seq_printf(m, "\n");
+#endif
+}
+
int proc_pid_status(struct seq_file *m, struct pid_namespace *ns,
struct pid *pid, struct task_struct *task)
{
@@ -357,6 +381,7 @@ int proc_pid_status(struct seq_file *m, struct pid_namespace *ns,
task_show_regs(m, task);
#endif
task_context_switch_counts(m, task);
+ task_show_seccomp(m, task);
return 0;
}

diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h
index 167c333..10ffe22 100644
--- a/include/linux/seccomp.h
+++ b/include/linux/seccomp.h
@@ -7,7 +7,32 @@
#include <linux/thread_info.h>
#include <asm/seccomp.h>

-typedef struct { int mode; } seccomp_t;
+/**
+ * struct seccomp_state - the state of a seccomp'ed process
+ *
+ * @mode:
+ * if this is 1, the process is under standard seccomp rules
+ * is 2, the process is only allowed to make system calls where
+ * the corresponding bit is set in bitmask and the syscall
+ * is made in the compat/no-compat mode corresponding to
+ * the is_compat member.
+ * @bitlen: the length of bitmask, in bits.
+ * @bitmask: a mask of allowed system calls.
+ * @refs: reference count
+ * @flags: unused
+ */
+struct seccomp_state {
+ uint8_t *bitmask;
+ int bitlen;
+ int flags;
+#ifdef CONFIG_COMPAT
+ int is_compat;
+#endif
+ int mode;
+ atomic_t refs;
+};
+
+typedef struct { struct seccomp_state *state; } seccomp_t;

extern void __secure_computing(int);
static inline void secure_computing(int this_syscall)
@@ -16,8 +41,10 @@ static inline void secure_computing(int this_syscall)
__secure_computing(this_syscall);
}

+extern struct seccomp_state* seccomp_state_get(struct seccomp_state *s);
+extern void seccomp_state_put(struct seccomp_state *s);
extern long prctl_get_seccomp(void);
-extern long prctl_set_seccomp(unsigned long);
+extern long prctl_set_seccomp(unsigned long, unsigned long, unsigned long, unsigned long);

#else /* CONFIG_SECCOMP */

@@ -32,7 +59,8 @@ static inline long prctl_get_seccomp(void)
return -EINVAL;
}

-static inline long prctl_set_seccomp(unsigned long arg2)
+static inline long prctl_set_seccomp(unsigned long arg2, unsigned long arg3,
+ unsigned long arg4, unsigned long arg5)
{
return -EINVAL;
}
diff --git a/kernel/fork.c b/kernel/fork.c
index 5447dc7..01f7210 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -34,6 +34,7 @@
#include <linux/cgroup.h>
#include <linux/security.h>
#include <linux/hugetlb.h>
+#include <linux/seccomp.h>
#include <linux/swap.h>
#include <linux/syscalls.h>
#include <linux/jiffies.h>
@@ -162,6 +163,10 @@ void free_task(struct task_struct *tsk)
free_thread_info(tsk->stack);
rt_mutex_debug_task_free(tsk);
ftrace_graph_exit_task(tsk);
+#ifdef CONFIG_SECCOMP
+ if (tsk->seccomp.state)
+ seccomp_state_put(tsk->seccomp.state);
+#endif
free_task_struct(tsk);
}
EXPORT_SYMBOL(free_task);
@@ -271,6 +276,11 @@ static struct task_struct *dup_task_struct(struct task_struct *orig)
if (err)
goto out;

+#ifdef CONFIG_SECCOMP
+ if (orig->seccomp.state)
+ tsk->seccomp.state = seccomp_state_get(orig->seccomp.state);
+#endif
+
setup_thread_stack(tsk, orig);
clear_user_return_notifier(tsk);
clear_tsk_need_resched(tsk);
diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index 57d4b13..d5920d0 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -9,9 +9,10 @@
#include <linux/seccomp.h>
#include <linux/sched.h>
#include <linux/compat.h>
+#include <linux/slab.h>
+#include <linux/uaccess.h>

/* #define SECCOMP_DEBUG 1 */
-#define NR_SECCOMP_MODES 1

/*
* Secure computing mode 1 allows only read/write/exit/sigreturn.
@@ -32,8 +33,8 @@ static int mode1_syscalls_32[] = {

void __secure_computing(int this_syscall)
{
- int mode = current->seccomp.mode;
- int * syscall;
+ int mode = current->seccomp.state->mode;
+ int *syscall;

switch (mode) {
case 1:
@@ -47,40 +48,175 @@ void __secure_computing(int this_syscall)
return;
} while (*++syscall);
break;
+ case 2:
+#ifdef CONFIG_COMPAT
+ if (unlikely(is_compat_task() !=
+ current->seccomp.state->is_compat))
+ break;
+#endif
+ if (unlikely(this_syscall >= current->seccomp.state->bitlen))
+ break;
+ if (likely(current->seccomp.state->bitmask[this_syscall / 8]
+ & (1 << (7 - (this_syscall % 8)))))
+ return;
+ break;
default:
BUG();
}

+#ifdef CONFIG_COMPAT
+ printk(KERN_WARNING "Killed by seccomp: syscall %d compat: %d/%d",
+ this_syscall, is_compat_task(),
+ current->seccomp.state->is_compat);
+#else
+ printk(KERN_WARNING "Killed by seccomp in syscall %d", this_syscall);
+#endif
+
#ifdef SECCOMP_DEBUG
dump_stack();
#endif
do_exit(SIGKILL);
}

+void seccomp_state_put(struct seccomp_state *state)
+{
+ if (atomic_dec_and_test(&state->refs)) {
+ kfree(state->bitmask);
+ kfree(state);
+ }
+}
+
+struct seccomp_state *seccomp_state_get(struct seccomp_state *state)
+{
+ atomic_inc(&state->refs);
+ return state;
+}
+
+/* alloc mem for new bitmask and copy from user.
+ * Zero the unused bits int the last byte.
+ */
+static int new_bitmask_from_user(uint8_t **dst, unsigned long src,
+ unsigned long bitlen)
+{
+ const int ubytes = (bitlen + 7) / 8;
+ const int ibits = bitlen % 8;
+ if (!bitlen)
+ *dst = NULL;
+ *dst = kmalloc(ubytes, GFP_KERNEL);
+ if (!*dst)
+ return -ENOMEM;
+ if (copy_from_user(*dst, (void __user *) src, ubytes)) {
+ kfree(*dst);
+ *dst = NULL;
+ return -EFAULT;
+ }
+ if (ibits)
+ (*dst)[ubytes-1] &= 0xff << (8 - (ibits));
+ return 0;
+}
+
+static int new_state_from_user(struct seccomp_state **state, unsigned long mask,
+ unsigned long bitlen)
+{
+ int ret;
+ struct seccomp_state *new;
+ if (bitlen >= 1024)
+ return -EINVAL;
+ new = kmalloc(sizeof(struct seccomp_state), GFP_KERNEL);
+ if (!new)
+ return -ENOMEM;
+ atomic_set(&new->refs, 1);
+ new->mode = 2;
+ new->bitlen = bitlen;
+#ifdef CONFIG_COMPAT
+ new->is_compat = is_compat_task();
+#endif
+ ret = new_bitmask_from_user(&new->bitmask, mask, bitlen);
+ if (ret < 0)
+ goto error;
+ *state = new;
+ return 0;
+error:
+ kfree(new->bitmask);
+ kfree(new);
+ return ret;
+}
+
+
+static int mask_install(unsigned long mask, unsigned long bitlen)
+{
+ int ret = new_state_from_user(&current->seccomp.state, mask, bitlen);
+ if (ret < 0)
+ return ret;
+ set_thread_flag(TIF_SECCOMP);
+ return 0;
+}
+
+static int mask_replace(unsigned long mask, unsigned long bitlen)
+{
+ int ret, i;
+ struct seccomp_state *new;
+
+ if (bitlen > current->seccomp.state->bitlen)
+ bitlen = current->seccomp.state->bitlen;
+ ret = new_state_from_user(&new, mask, bitlen);
+ if (ret < 0)
+ return ret;
+
+ for (i = 0; i < (bitlen + 7)/8; ++i)
+ new->bitmask[i] &= current->seccomp.state->bitmask[i];
+
+ seccomp_state_put(current->seccomp.state);
+ current->seccomp.state = new;
+
+ return 0;
+}
+
long prctl_get_seccomp(void)
{
- return current->seccomp.mode;
+ if (!current->seccomp.state)
+ return 0;
+ return current->seccomp.state->mode;
}

-long prctl_set_seccomp(unsigned long seccomp_mode)
+long prctl_set_seccomp(unsigned long seccomp_mode,
+ unsigned long seccomp_mask,
+ unsigned long seccomp_bitlen,
+ unsigned long seccomp_flags)
{
- long ret;
+ struct seccomp_state *new;

- /* can set it only once to be even more secure */
- ret = -EPERM;
- if (unlikely(current->seccomp.mode))
- goto out;
+ switch (seccomp_mode) {
+ case 0:
+ /* One cannot switch to mode 0 from any other mode */
+ if (current->seccomp.state)
+ return -EPERM;
+ return 0;

- ret = -EINVAL;
- if (seccomp_mode && seccomp_mode <= NR_SECCOMP_MODES) {
- current->seccomp.mode = seccomp_mode;
+ case 1:
+ new = kmalloc(sizeof(struct seccomp_state), GFP_KERNEL);
+ if (!new)
+ return -ENOMEM;
+ memset(new, 0, sizeof(*new));
+ atomic_set(&new->refs, 1);
+ new->mode = 1;
+ if (current->seccomp.state)
+ seccomp_state_put(current->seccomp.state);
+ current->seccomp.state = new;
set_thread_flag(TIF_SECCOMP);
#ifdef TIF_NOTSC
disable_TSC();
#endif
- ret = 0;
- }
+ return 0;

- out:
- return ret;
+ case 2:
+ /* System call bitmask */
+
+ if (current->seccomp.state)
+ return mask_replace(seccomp_mask, seccomp_bitlen);
+ return mask_install(seccomp_mask, seccomp_bitlen);
+
+ default:
+ return -EINVAL;
+ }
}
diff --git a/kernel/sys.c b/kernel/sys.c
index 7f5a0cd..6cc1f91 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -1667,7 +1667,7 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
error = prctl_get_seccomp();
break;
case PR_SET_SECCOMP:
- error = prctl_set_seccomp(arg2);
+ error = prctl_set_seccomp(arg2, arg3, arg4, arg5);
break;
case PR_GET_TSC:
error = GET_TSC_CTL(arg2);

2011-02-05 11:52:00

by Stefan Fritsch

[permalink] [raw]

Subject: Re: Using ftrace/perf as a basis for generic seccomp

On Fri, 4 Feb 2011, Frederic Weisbecker wrote:
> Note it's not about tracing here. It's about abstracting some tracing
> features to make them standalone and usable outside tracing.
>
> But yeah, now that I consider the fact that checks on pointers are
> racy until objects are resolved (got my first security lesson), such
> deep filtering up to dereferencing pointers is then pointless.
>
> Now there are still immediate values for which there is still a point
> (filtering fd, filtering opening mode, etc...).
>
>> Do we have a user that can articulate a need for greater
>> flexibility in their use of such a hardening tool?
>
> So yeah, indeed we probably need to get more usecases to consider it.

A really major use case is socketcall(2). All socket related syscalls
(accept, bind, connect, receivemsg, ...) are implemented as socketcall
with an appropriate argument. There will be many cases where you want a
sandboxed process to be able to do recvmsg(2) to receive new file
descriptors over an already open unix-domain socket from a broker process.
But you may want to disallow other socket operations, especially listen,
accept, and connect.

Of course one could also add some special case handling for socketcall
in seccomp instead of using the full filtering.

Cheers,
Stefan

2011-02-06 16:51:35

by Eric Paris

[permalink] [raw]

Subject: Re: Using ftrace/perf as a basis for generic seccomp

Dropped a lot of people and added 2 more. I'm going to try to shift the
direction of this thread to discuss how to handle suid apps in a
potential sandbox solution (and remember, I don't consider an extended
seccomp to be a sandbox, it's just a tool to help create a sandbox)

Stefan would like to be able to prevent SETUID programs and programs
with fcaps from executing. I suggested (below) that he play with prctl,
drop things from bset, pP, pE, pI, and then remove the suid(2) syscall
from the set of allowed syscalls. He had some concerns:

On Sat, 2011-02-05 at 12:42 +0100, Stefan Fritsch wrote:
> On Fri, 4 Feb 2011, Eric Paris wrote:
> > On Thu, 2011-02-03 at 23:06 +0100, Stefan Fritsch wrote:

> >> - deny exec of setuid/setgid binaries
> >> - deny exec of binaries with filesystem capabilities
> >
> > I think both of these are wrong to try to address here. The right way
> > to handle these is to
> >
> > 1) set prctl(SECBIT_NOROOT)
> > 2) drop all caps from the bset, pP, pE, and pI
> > 3) make sure the setuid(2) syscall (not to be confused with SETUID
> > filesystem bit) is not in the set of allowed syscalls. Thus rendering
> > suid and file with fcaps irrelevant.
>
> I think that's a bad idea. Some programs may get confused if run as root
> but without capabilities (think sendmail). If a setuid root binary gets
> confused enough to write arbitrary files as root, you get all capabilities
> again by writing to /etc/crontab or /root/.ssh/authorized_keys. I admit
> that this is very unlikely if setuid(2)/setgid(2) lead to the process
> being killed, but better to be save and disallow the user change for
> SETUID binaries completely. And the simplest way to do that seemed to me
> to disallow exec'ing of SETUID binaries.

I believe that my method should be safe for fcaps. I believe the fcaps
code will kill any process that is unable to get the caps it claims to
need. But I believe he's right about SUID apps with SECBIT_NOROOT.
sendmail (unless it was smart) wouldn't know it didn't have permissions
to do what it needed to do and would thus, likely break. Anyone have
thoughts on that? I've thought a couple of times about adding a new LSM
hook security_exec_suid() which would just be a big hammer blocking the
execution of suid root files. How can we safely and sanely prevent the
execution of suid root files?

-Eric

2011-02-07 12:25:09

by Peter Zijlstra

[permalink] [raw]

Subject: Re: Using ftrace/perf as a basis for generic seccomp

On Sat, 2011-02-05 at 12:51 +0100, Stefan Fritsch wrote:

> A really major use case is socketcall(2). All socket related syscalls
> (accept, bind, connect, receivemsg, ...) are implemented as socketcall
> with an appropriate argument. There will be many cases where you want a
> sandboxed process to be able to do recvmsg(2) to receive new file
> descriptors over an already open unix-domain socket from a broker process.
> But you may want to disallow other socket operations, especially listen,
> accept, and connect.
>
> Of course one could also add some special case handling for socketcall
> in seccomp instead of using the full filtering.

That looks like a perfect use-case for the LSM bits, attach some state
to both the fd object and the task object and if they don't match, don't
allow the action.