2011-04-28 03:14:39

by Will Drewry

Subject: [PATCH 0/7] Revisiting expanded seccomp functionality

I'd like to revisit the past discussions around extending seccomp
functionality:
1. http://lwn.net/Articles/332438/
2. http://thread.gmane.org/gmane.linux.kernel/1086816/focus=1096626

First, some background and motivation, feel free to skip straight to the
patches!

kernel/seccomp.c provides early system call interception hooks which
have been used for reducing the kernel attack surface for a given
user-level task. Normally, seccomp limits the kernel interfaces to read,
write, sigreturn, and exit. These restrictions have proved effective,
but for many common uses, the model is too draconian. That reality
doesn't mean that a less aggressive reduction of the attack surface
wouldn't still have beneficial effects.

To accommodate this lack of flexibility, there are several out-of-tree
patches for system call interception (with and without further-reaching
"policy" enforcement) and even a complex pure-assembly trusted
supervisor-thread to broker the requests of seccomp-guarded threads
(http://code.google.com/p/seccompsandbox). The latter requires severe
contortions with a high chance of accidental attack surface exposure
while out-of-tree patches are just that. (This ignores the handful of
userspace solutions, like plash and systrace, which face their own
hurdles and suffer not only from complexity but also from a heavy
performance penalty. Of course, those approaches often include policy
enforcement work in addition to pure attack surface reduction, but
that's tangential.)

In general, attack surface reduction is applicable in most
circumstances, but it is especially valuable when handling untrusted
data (which seccomp was originally meant to help with!).

Some simple motivating examples are as follows:
- disallowing perf system calls inside a selinux sandbox (before parsing
occurs, such that true policy logic can be applied when appropriate)
- minimizing kernel attack surface during untrusted JIT execution
(ActionScript, JavaScript, etc.)
- ...

This patchset provides a flexible means to perform kernel attack surface
reduction using the early seccomp system call hooks and the ftrace
filter engine for system call name to number translation along with
limited argument-based filtering decision making.
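Conceptually, the seccomp entry hook would consult a per-task set of parsed event_filter objects keyed by syscall number and evaluate the matching one against the syscall-entry record. A rough, hypothetical sketch of that check (the structure and function names on the seccomp side are illustrative, not the patch's actual ones; filter_match_preds() is the existing ftrace predicate evaluator):

```c
/* Hypothetical sketch of the check made from the seccomp
 * syscall-entry hook.  All seccomp-side names are illustrative. */
static int seccomp_filter_check(struct seccomp_state *state, int syscall_nr,
				struct syscall_trace_enter *rec)
{
	struct event_filter *filter;

	filter = state->filters[syscall_nr];
	if (!filter)
		return -EACCES;	/* no filter installed: deny the call */

	/* Non-zero return from filter_match_preds() means the entry
	 * record satisfied the parsed predicate tree. */
	if (filter_match_preds(filter, rec))
		return 0;

	return -EACCES;
}
```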

Patches 1 through 5 cover the meat of this change. Patch 3 contains the
more controversial pieces, I suspect. Patches 6 and 7 show some of the
work that is needed to make this system even more effective. (Even
without those patches, it is still quite useful.)

Core changes as part of this proposal:
[PATCH 1/7] tracing: split out filter init, access, tear down.
[PATCH 2/7] tracing: split out syscall_trace_enter construction
[PATCH 3/7] seccomp_filter: Enable ftrace-based system call filtering
[PATCH 4/7] seccomp_filter: add process state reporting
[PATCH 5/7] seccomp_filter: Document what seccomp_filter is and how it works.

Nice-to-haves, imo, for ftrace and this proposal:
[PATCH 6/7] include/linux/syscalls.h: add __ layer of macros with return types.
[PATCH 7/7] arch/x86: hook int returning system calls

Any and all commentary will be appreciated!

I feel that the approach of this patch series addresses both the
continued need for attack surface reduction when handling untrusted
content and the need to reuse the developing ftrace infrastructure. I'm
certain there are bugs, style issues, etc., but I hope the general
design leaves everyone feeling that this approach addresses those needs
as well. I will happily fix any issues if it means we might make
progress on this iteration of the exposed-kernel-surface discussion!

Thanks!
will


2011-04-28 03:09:06

by Will Drewry

Subject: [PATCH 1/7] tracing: split out filter init, access, tear down.

Moves the perf-specific profile event filter allocation and freeing
code into kernel/perf_event.c, where it is called from. Two symbols are
exported via ftrace_event.h so that struct event_filters can be
instantiated without requiring a change to the core tracing code. It
also adds a filter_string accessor to allow non-tracing filter engine
consumers to access the field.

The change allows globally registered ftrace events to be used in
event_filter structs. perf is the current consumer, but a possible
future consumer is system call filtering using the secure computing
hooks (and the existing syscalls subsystem events).
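To illustrate the intended consumption, a hypothetical non-perf consumer (e.g. a future seccomp user) might use the newly exported symbols roughly like this; the consumer-side function names and the choice of filter string are illustrative only:

```c
/* Hypothetical consumer of the exported symbols.  event_id would be
 * the id of a registered event such as syscalls:sys_enter_read. */
static struct event_filter *attach_syscall_filter(int event_id)
{
	struct event_filter *filter = NULL;
	int err;

	/* The filter string is duplicated internally by
	 * ftrace_parse_filter(), so a stack string is fine here. */
	err = ftrace_parse_filter(&filter, event_id, "fd == 0");
	if (err)
		return ERR_PTR(err);

	pr_debug("filter installed: %s\n", ftrace_get_filter_string(filter));
	return filter;
}

/* ...and on teardown: */
static void detach_syscall_filter(struct event_filter *filter)
{
	ftrace_free_filter(filter);	/* NULL-safe */
}
```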

Signed-off-by: Will Drewry <[email protected]>
---
include/linux/ftrace_event.h | 9 +++--
kernel/perf_event.c | 7 +++-
kernel/trace/trace_events_filter.c | 60 ++++++++++++++++++++++--------------
3 files changed, 48 insertions(+), 28 deletions(-)

diff --git a/include/linux/ftrace_event.h b/include/linux/ftrace_event.h
index 22b32af..fea9d98 100644
--- a/include/linux/ftrace_event.h
+++ b/include/linux/ftrace_event.h
@@ -216,6 +216,12 @@ extern int filter_current_check_discard(struct ring_buffer *buffer,
void *rec,
struct ring_buffer_event *event);

+extern void ftrace_free_filter(struct event_filter *filter);
+extern int ftrace_parse_filter(struct event_filter **filter,
+ int event_id,
+ const char *filter_str);
+extern const char *ftrace_get_filter_string(const struct event_filter *filter);
+
enum {
FILTER_OTHER = 0,
FILTER_STATIC_STRING,
@@ -266,9 +272,6 @@ extern int perf_trace_init(struct perf_event *event);
extern void perf_trace_destroy(struct perf_event *event);
extern int perf_trace_add(struct perf_event *event, int flags);
extern void perf_trace_del(struct perf_event *event, int flags);
-extern int ftrace_profile_set_filter(struct perf_event *event, int event_id,
- char *filter_str);
-extern void ftrace_profile_free_filter(struct perf_event *event);
extern void *perf_trace_buf_prepare(int size, unsigned short type,
struct pt_regs *regs, int *rctxp);

diff --git a/kernel/perf_event.c b/kernel/perf_event.c
index 8e81a98..1da45e7 100644
--- a/kernel/perf_event.c
+++ b/kernel/perf_event.c
@@ -5588,7 +5588,8 @@ static int perf_event_set_filter(struct perf_event *event, void __user *arg)
if (IS_ERR(filter_str))
return PTR_ERR(filter_str);

- ret = ftrace_profile_set_filter(event, event->attr.config, filter_str);
+ ret = ftrace_parse_filter(&event->filter, event->attr.config,
+ filter_str);

kfree(filter_str);
return ret;
@@ -5596,7 +5597,9 @@ static int perf_event_set_filter(struct perf_event *event, void __user *arg)

static void perf_event_free_filter(struct perf_event *event)
{
- ftrace_profile_free_filter(event);
+ struct event_filter *filter = event->filter;
+ event->filter = NULL;
+ ftrace_free_filter(filter);
}

#else
diff --git a/kernel/trace/trace_events_filter.c b/kernel/trace/trace_events_filter.c
index 8008ddc..787b174 100644
--- a/kernel/trace/trace_events_filter.c
+++ b/kernel/trace/trace_events_filter.c
@@ -522,7 +522,7 @@ static void remove_filter_string(struct event_filter *filter)
}

static int replace_filter_string(struct event_filter *filter,
- char *filter_string)
+ const char *filter_string)
{
kfree(filter->filter_string);
filter->filter_string = kstrdup(filter_string, GFP_KERNEL);
@@ -1936,21 +1936,27 @@ out_unlock:
return err;
}

-#ifdef CONFIG_PERF_EVENTS
-
-void ftrace_profile_free_filter(struct perf_event *event)
+/* ftrace_free_filter - frees a parsed filter and its internal structures.
+ *
+ * @filter: pointer to the event_filter to free.
+ */
+void ftrace_free_filter(struct event_filter *filter)
{
- struct event_filter *filter = event->filter;
-
- event->filter = NULL;
- __free_filter(filter);
+ if (filter)
+ __free_filter(filter);
}
+EXPORT_SYMBOL_GPL(ftrace_free_filter);

-int ftrace_profile_set_filter(struct perf_event *event, int event_id,
- char *filter_str)
+/* ftrace_parse_filter - allocates and populates a new event_filter
+ *
+ * @event_id: may be something like syscalls::sys_event_tkill's id.
+ * @filter_str: filter string; it is duplicated, ownership is NOT taken.
+ */
+int ftrace_parse_filter(struct event_filter **filter,
+ int event_id,
+ const char *filter_str)
{
int err;
- struct event_filter *filter;
struct filter_parse_state *ps;
struct ftrace_event_call *call = NULL;

@@ -1966,12 +1972,12 @@ int ftrace_profile_set_filter(struct perf_event *event, int event_id,
goto out_unlock;

err = -EEXIST;
- if (event->filter)
+ if (*filter)
goto out_unlock;

- filter = __alloc_filter();
- if (!filter) {
- err = PTR_ERR(filter);
+ *filter = __alloc_filter();
+ if (IS_ERR_OR_NULL(*filter)) {
+ err = PTR_ERR(*filter);
goto out_unlock;
}

@@ -1980,14 +1986,14 @@ int ftrace_profile_set_filter(struct perf_event *event, int event_id,
if (!ps)
goto free_filter;

- parse_init(ps, filter_ops, filter_str);
+ replace_filter_string(*filter, filter_str);
+
+ parse_init(ps, filter_ops, (*filter)->filter_string);
err = filter_parse(ps);
if (err)
goto free_ps;

- err = replace_preds(call, filter, ps, filter_str, false);
- if (!err)
- event->filter = filter;
+ err = replace_preds(call, *filter, ps, (*filter)->filter_string, false);

free_ps:
filter_opstack_clear(ps);
@@ -1995,14 +2001,22 @@ free_ps:
kfree(ps);

free_filter:
- if (err)
- __free_filter(filter);
+ if (err) {
+ __free_filter(*filter);
+ *filter = NULL;
+ }

out_unlock:
mutex_unlock(&event_mutex);

return err;
}
+EXPORT_SYMBOL_GPL(ftrace_parse_filter);

-#endif /* CONFIG_PERF_EVENTS */
-
+const char *ftrace_get_filter_string(const struct event_filter *filter)
+{
+ if (!filter)
+ return NULL;
+ return filter->filter_string;
+}
+EXPORT_SYMBOL_GPL(ftrace_get_filter_string);
--
1.7.0.4