This change allows CONFIG_SECCOMP to make use of BPF programs for
user-controlled system call filtering (as shown in this patch series).
To minimize the impact on existing BPF evaluation, function pointer
use must be declared at sk_chk_filter-time. This allows ancillary
load instructions to be generated that use the function pointer rather
than adding _any_ code to the existing LD_* instruction paths.
Crude performance numbers using udpflood -l 10000000 against dummy0.
3 trials just udpflooding, 3 with tcpdump. Averaged then differenced.
[v11 #s]
* x86 32-bit (Atom N570 @ 1.66 GHz 2 core HT) [stackprot]:
(tcpdump) (no bpf)
- Without: 88.19s - 70.98s = 17.21s
- With: 90.97s - 74.96s = 15.98s
- Slowdown per packet: -123 nanoseconds
* x86 64-bit (Atom N570 @ 1.66 GHz 2 core HT) [stackprot]:
(tcpdump) (no bpf)
- Without: 117.80s - 103.99s = 13.80s
- With: 116.81s - 101.85s = 14.96s
- Slowdown per packet: 116 nanoseconds
In this round of benchmarking, the gap seems narrower and I've attempted
to take extra steps to ensure a quiescent system (especially given the
wild variability I was seeing on 64-bit last time). It's not clear to
me if this overhead (on an Atom-class system) is acceptable or not.
Regardless, I had planned to follow on to this patch set by updating
every in-kernel call site of sk_run/chk_filter to use bpf_*_filter
directly and eliminate the extra bounce. I was hoping that could be
done independently, however. If it would make sense to do that first,
or to just duplicate the sk_run_filter code in seccomp.c initially,
please let me know.
v11: - un-inline sk_*_filter to adhere to the existing export symbol use
and not break bpf_jit_free ([email protected])
- updated benchmarks
v10: - converted length to a u32 on the struct because it is passed in
at bpf_run_filter time anyway. ([email protected])
- added a comment about the LD_*_ falling through ([email protected])
- more clean up (pointer->load, drop SKB macro, comments) ([email protected])
v9: - n/a
v8: - fixed variable positioning and bad cast ([email protected])
- no longer passes A as a pointer (inspection of x86 asm shows A is
%ebx again; thanks [email protected])
- cleaned up switch macros and expanded use
([email protected], [email protected])
- added length fn pointer and handled LD_W_LEN/LDX_W_LEN
- moved from a wrapping struct to a typedef for the function
pointer. (matches existing function pointer style)
- added comprehensive comment above the typedef.
- benchmarks
v7: - first cut
Signed-off-by: Will Drewry <[email protected]>
---
include/linux/filter.h | 46 +++++++++++++++++++
net/core/filter.c | 117 +++++++++++++++++++++++++++++++++++++++++++++---
2 files changed, 157 insertions(+), 6 deletions(-)
diff --git a/include/linux/filter.h b/include/linux/filter.h
index 8eeb205..bdd02f9 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -110,6 +110,9 @@ struct sock_fprog { /* Required for SO_ATTACH_FILTER. */
*/
#define BPF_MEMWORDS 16
+/* BPF program (checking) flags */
+#define BPF_CHK_FLAGS_NO_SKB 1
+
/* RATIONALE. Negative offsets are invalid in BPF.
We use them to reference ancillary data.
Unlike introduction new instructions, it does not break
@@ -145,6 +148,33 @@ struct sk_filter
struct sock_filter insns[0];
};
+/**
+ * struct bpf_load_fn - callback and data for bpf_run_filter
+ * This structure is used by bpf_run_filter if bpf_chk_filter
+ * was invoked with BPF_CHK_FLAGS_NO_SKB.
+ *
+ * @load:
+ * @data: const pointer to the data passed into bpf_run_filter
+ * @k: offset into @skb's data
+ * @size: the size of the requested data in bytes: 1, 2, or 4.
+ * @buffer: If non-NULL, a 32-bit buffer for staging data.
+ *
+ * Returns a pointer to the requested data.
+ *
+ * This function operates similarly to load_pointer in net/core/filter.c
+ * except that the pointer to the returned data must already be
+ * byteswapped as appropriate to the source data and endianness.
+ * @buffer may be used if the data needs to be staged.
+ *
+ * @length: the length of the supplied data for use by the LD*_LEN
+ * instructions.
+ */
+struct bpf_load_fn {
+ void *(*load)(const void *data, int k, unsigned int size,
+ void *buffer);
+ u32 length;
+};
+
static inline unsigned int sk_filter_len(const struct sk_filter *fp)
{
return fp->len * sizeof(struct sock_filter) + sizeof(*fp);
@@ -153,9 +183,15 @@ static inline unsigned int sk_filter_len(const struct sk_filter *fp)
extern int sk_filter(struct sock *sk, struct sk_buff *skb);
extern unsigned int sk_run_filter(const struct sk_buff *skb,
const struct sock_filter *filter);
+extern unsigned int bpf_run_filter(const void *data,
+ const struct sock_filter *filter,
+ const struct bpf_load_fn *load_fn);
extern int sk_attach_filter(struct sock_fprog *fprog, struct sock *sk);
extern int sk_detach_filter(struct sock *sk);
+
extern int sk_chk_filter(struct sock_filter *filter, unsigned int flen);
+extern int bpf_chk_filter(struct sock_filter *filter, unsigned int flen, u32 flags);
+
#ifdef CONFIG_BPF_JIT
extern void bpf_jit_compile(struct sk_filter *fp);
@@ -228,6 +264,16 @@ enum {
BPF_S_ANC_HATYPE,
BPF_S_ANC_RXHASH,
BPF_S_ANC_CPU,
+ /* Used to differentiate SKB data and generic data */
+ BPF_S_ANC_LD_W_ABS,
+ BPF_S_ANC_LD_H_ABS,
+ BPF_S_ANC_LD_B_ABS,
+ BPF_S_ANC_LD_W_LEN,
+ BPF_S_ANC_LD_W_IND,
+ BPF_S_ANC_LD_H_IND,
+ BPF_S_ANC_LD_B_IND,
+ BPF_S_ANC_LDX_W_LEN,
+ BPF_S_ANC_LDX_B_MSH,
};
#endif /* __KERNEL__ */
diff --git a/net/core/filter.c b/net/core/filter.c
index 5dea452..fe8d396 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -98,9 +98,10 @@ int sk_filter(struct sock *sk, struct sk_buff *skb)
EXPORT_SYMBOL(sk_filter);
/**
- * sk_run_filter - run a filter on a socket
- * @skb: buffer to run the filter on
+ * bpf_run_filter - run a BPF filter program on @data
+ * @data: buffer to run the filter on
* @fentry: filter to apply
+ * @load_fn: custom data accessor
*
* Decode and apply filter instructions to the skb->data.
* Return length to keep, 0 for none. @skb is the data we are
@@ -109,8 +110,9 @@ EXPORT_SYMBOL(sk_filter);
* and last instruction guaranteed to be a RET, we dont need to check
* flen. (We used to pass to this function the length of filter)
*/
-unsigned int sk_run_filter(const struct sk_buff *skb,
- const struct sock_filter *fentry)
+unsigned int bpf_run_filter(const void *data,
+ const struct sock_filter *fentry,
+ const struct bpf_load_fn *load_fn)
{
void *ptr;
u32 A = 0; /* Accumulator */
@@ -118,6 +120,7 @@ unsigned int sk_run_filter(const struct sk_buff *skb,
u32 mem[BPF_MEMWORDS]; /* Scratch Memory Store */
u32 tmp;
int k;
+ const struct sk_buff *skb = data;
/*
* Process array of filter instructions.
@@ -350,6 +353,55 @@ load_b:
A = 0;
continue;
}
+ case BPF_S_ANC_LD_W_ABS:
+ k = K;
+load_fn_w:
+ ptr = load_fn->load(data, k, 4, &tmp);
+ if (ptr) {
+ A = *(u32 *)ptr;
+ continue;
+ }
+ return 0;
+ case BPF_S_ANC_LD_H_ABS:
+ k = K;
+load_fn_h:
+ ptr = load_fn->load(data, k, 2, &tmp);
+ if (ptr) {
+ A = *(u16 *)ptr;
+ continue;
+ }
+ return 0;
+ case BPF_S_ANC_LD_B_ABS:
+ k = K;
+load_fn_b:
+ ptr = load_fn->load(data, k, 1, &tmp);
+ if (ptr) {
+ A = *(u8 *)ptr;
+ continue;
+ }
+ return 0;
+ case BPF_S_ANC_LDX_B_MSH:
+ ptr = load_fn->load(data, K, 1, &tmp);
+ if (ptr) {
+ X = (*(u8 *)ptr & 0xf) << 2;
+ continue;
+ }
+ return 0;
+ case BPF_S_ANC_LD_W_IND:
+ k = X + K;
+ goto load_fn_w;
+ case BPF_S_ANC_LD_H_IND:
+ k = X + K;
+ goto load_fn_h;
+ case BPF_S_ANC_LD_B_IND:
+ k = X + K;
+ goto load_fn_b;
+ case BPF_S_ANC_LD_W_LEN:
+ A = load_fn->length;
+ continue;
+ case BPF_S_ANC_LDX_W_LEN:
+ X = load_fn->length;
+ continue;
default:
WARN_RATELIMIT(1, "Unknown code:%u jt:%u tf:%u k:%u\n",
fentry->code, fentry->jt,
@@ -360,6 +412,21 @@ load_b:
return 0;
}
+EXPORT_SYMBOL(bpf_run_filter);
+
+/**
+ * sk_run_filter - run a filter on a socket
+ * @skb: buffer to run the filter on
+ * @fentry: filter to apply
+ *
+ * Runs bpf_run_filter with the struct sk_buff-specific data
+ * accessor behavior.
+ */
+unsigned int sk_run_filter(const struct sk_buff *skb,
+ const struct sock_filter *filter)
+{
+ return bpf_run_filter(skb, filter, NULL);
+}
EXPORT_SYMBOL(sk_run_filter);
/*
@@ -423,9 +490,10 @@ error:
}
/**
- * sk_chk_filter - verify socket filter code
+ * bpf_chk_filter - verify socket filter BPF code
* @filter: filter to verify
* @flen: length of filter
+ * @flags: May be BPF_CHK_FLAGS_NO_SKB or 0
*
* Check the user's filter code. If we let some ugly
* filter code slip through kaboom! The filter must contain
@@ -434,9 +502,13 @@ error:
*
* All jumps are forward as they are not signed.
*
+ * If BPF_CHK_FLAGS_NO_SKB is set in flags, any SKB-specific
+ * rules become illegal and a bpf_load_fn will be expected by
+ * bpf_run_filter.
+ *
* Returns 0 if the rule set is legal or -EINVAL if not.
*/
-int sk_chk_filter(struct sock_filter *filter, unsigned int flen)
+int bpf_chk_filter(struct sock_filter *filter, unsigned int flen, u32 flags)
{
/*
* Valid instructions are initialized to non-0.
@@ -542,9 +614,36 @@ int sk_chk_filter(struct sock_filter *filter, unsigned int flen)
pc + ftest->jf + 1 >= flen)
return -EINVAL;
break;
+#define MAYBE_USE_LOAD_FN(CODE) \
+ if (flags & BPF_CHK_FLAGS_NO_SKB) { \
+ code = BPF_S_ANC_##CODE; \
+ break; \
+ }
+ case BPF_S_LD_W_LEN:
+ MAYBE_USE_LOAD_FN(LD_W_LEN);
+ break;
+ case BPF_S_LDX_W_LEN:
+ MAYBE_USE_LOAD_FN(LDX_W_LEN);
+ break;
+ case BPF_S_LD_W_IND:
+ MAYBE_USE_LOAD_FN(LD_W_IND);
+ break;
+ case BPF_S_LD_H_IND:
+ MAYBE_USE_LOAD_FN(LD_H_IND);
+ break;
+ case BPF_S_LD_B_IND:
+ MAYBE_USE_LOAD_FN(LD_B_IND);
+ break;
+ case BPF_S_LDX_B_MSH:
+ MAYBE_USE_LOAD_FN(LDX_B_MSH);
+ break;
case BPF_S_LD_W_ABS:
+ MAYBE_USE_LOAD_FN(LD_W_ABS);
+ /* Falls through to BPF_S_LD_B_ABS. */
case BPF_S_LD_H_ABS:
+ MAYBE_USE_LOAD_FN(LD_H_ABS);
case BPF_S_LD_B_ABS:
+ MAYBE_USE_LOAD_FN(LD_B_ABS);
#define ANCILLARY(CODE) case SKF_AD_OFF + SKF_AD_##CODE: \
code = BPF_S_ANC_##CODE; \
break
@@ -572,6 +671,12 @@ int sk_chk_filter(struct sock_filter *filter, unsigned int flen)
}
return -EINVAL;
}
+EXPORT_SYMBOL(bpf_chk_filter);
+
+int sk_chk_filter(struct sock_filter *filter, unsigned int flen)
+{
+ return bpf_chk_filter(filter, flen, 0);
+}
EXPORT_SYMBOL(sk_chk_filter);
/**
--
1.7.5.4
Replaces the seccomp_t typedef with struct seccomp to match modern
kernel style.
v8-v11: no changes
v7: struct seccomp_struct -> struct seccomp
v6: original inclusion in this series.
Signed-off-by: Will Drewry <[email protected]>
Reviewed-by: James Morris <[email protected]>
---
include/linux/sched.h | 2 +-
include/linux/seccomp.h | 10 ++++++----
2 files changed, 7 insertions(+), 5 deletions(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 7d379a6..c30526f 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1418,7 +1418,7 @@ struct task_struct {
uid_t loginuid;
unsigned int sessionid;
#endif
- seccomp_t seccomp;
+ struct seccomp seccomp;
/* Thread group tracking */
u32 parent_exec_id;
diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h
index cc7a4e9..d61f27f 100644
--- a/include/linux/seccomp.h
+++ b/include/linux/seccomp.h
@@ -7,7 +7,9 @@
#include <linux/thread_info.h>
#include <asm/seccomp.h>
-typedef struct { int mode; } seccomp_t;
+struct seccomp {
+ int mode;
+};
extern void __secure_computing(int);
static inline void secure_computing(int this_syscall)
@@ -19,7 +21,7 @@ static inline void secure_computing(int this_syscall)
extern long prctl_get_seccomp(void);
extern long prctl_set_seccomp(unsigned long);
-static inline int seccomp_mode(seccomp_t *s)
+static inline int seccomp_mode(struct seccomp *s)
{
return s->mode;
}
@@ -28,7 +30,7 @@ static inline int seccomp_mode(seccomp_t *s)
#include <linux/errno.h>
-typedef struct { } seccomp_t;
+struct seccomp { };
#define secure_computing(x) do { } while (0)
@@ -42,7 +44,7 @@ static inline long prctl_set_seccomp(unsigned long arg2)
return -EINVAL;
}
-static inline int seccomp_mode(seccomp_t *s)
+static inline int seccomp_mode(struct seccomp *s)
{
return 0;
}
--
1.7.5.4
This change adds SECCOMP_RET_ERRNO as a valid return value from a
seccomp filter. Additionally, it makes the first use of the lower
16 bits for storing a filter-supplied errno. 16 bits is more than
enough for the errno-base.h calls.
Returning errors instead of immediately terminating processes that
violate seccomp policy allows for broader use of this functionality
for kernel attack surface reduction. For example, a Linux container
could maintain a whitelist of pre-existing system calls but drop
all new ones with errnos. This would keep a logically static attack
surface while providing errnos that may allow for graceful failure
without the downside of do_exit() on a bad call.
v11: - check for NULL filter ([email protected])
v10: - change loaders to fn
v9: - n/a
v8: - update Kconfig to note new need for syscall_set_return_value.
- reordered such that TRAP behavior follows on later.
- made the for loop a little less indent-y
v7: - introduced
Signed-off-by: Will Drewry <[email protected]>
---
arch/Kconfig | 6 ++++--
include/linux/seccomp.h | 15 +++++++++++----
kernel/seccomp.c | 42 ++++++++++++++++++++++++++++++++----------
3 files changed, 47 insertions(+), 16 deletions(-)
diff --git a/arch/Kconfig b/arch/Kconfig
index 7c6bd48..dd4e067 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -203,8 +203,10 @@ config HAVE_ARCH_SECCOMP_FILTER
bool
help
This symbol should be selected by an architecture if it provides
- asm/syscall.h, specifically syscall_get_arguments() and
- syscall_get_arch().
+ asm/syscall.h, specifically syscall_get_arguments(),
+ syscall_get_arch(), and syscall_set_return_value(). Additionally,
+ its system call entry path must respect a return value of -1 from
+ __secure_computing_int() and/or secure_computing().
config SECCOMP_FILTER
def_bool y
diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h
index 6e69274..93c2d98 100644
--- a/include/linux/seccomp.h
+++ b/include/linux/seccomp.h
@@ -12,13 +12,14 @@
/*
* All BPF programs must return a 32-bit value.
- * The bottom 16-bits are reserved for future use.
+ * The bottom 16-bits are for optional return data.
* The upper 16-bits are ordered from least permissive values to most.
*
* The ordering ensures that a min_t() over composed return values always
* selects the least permissive choice.
*/
#define SECCOMP_RET_KILL 0x00000000U /* kill the task immediately */
+#define SECCOMP_RET_ERRNO 0x00030000U /* returns an errno */
#define SECCOMP_RET_ALLOW 0x7fff0000U /* allow */
/* Masks for the return value sections. */
@@ -64,11 +65,17 @@ struct seccomp {
struct seccomp_filter *filter;
};
-extern void __secure_computing(int);
-static inline void secure_computing(int this_syscall)
+/*
+ * Direct callers to __secure_computing should be updated as
+ * CONFIG_HAVE_ARCH_SECCOMP_FILTER propagates.
+ */
+extern void __secure_computing(int) __deprecated;
+extern int __secure_computing_int(int);
+static inline int secure_computing(int this_syscall)
{
if (unlikely(test_thread_flag(TIF_SECCOMP)))
- __secure_computing(this_syscall);
+ return __secure_computing_int(this_syscall);
+ return 0;
}
extern long prctl_get_seccomp(void);
diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index 25e8296..3b3a16e 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -134,22 +134,23 @@ static void *bpf_load(const void *nr, int off, unsigned int size, void *buf)
static u32 seccomp_run_filters(int syscall)
{
struct seccomp_filter *f;
- u32 ret = SECCOMP_RET_KILL;
static const struct bpf_load_fn fns = {
bpf_load,
sizeof(struct seccomp_data),
};
+ u32 ret = SECCOMP_RET_ALLOW;
const void *sc_ptr = (const void *)(uintptr_t)syscall;
+ /* Ensure unexpected behavior doesn't result in failing open. */
+ if (unlikely(current->seccomp.filter == NULL))
+ ret = SECCOMP_RET_KILL;
+
/*
* All filters are evaluated in order of youngest to oldest. The lowest
* BPF return value always takes priority.
*/
- for (f = current->seccomp.filter; f; f = f->prev) {
- ret = bpf_run_filter(sc_ptr, f->insns, &fns);
- if (ret != SECCOMP_RET_ALLOW)
- break;
- }
+ for (f = current->seccomp.filter; f; f = f->prev)
+ ret = min_t(u32, ret, bpf_run_filter(sc_ptr, f->insns, &fns));
return ret;
}
@@ -292,6 +293,13 @@ static int mode1_syscalls_32[] = {
void __secure_computing(int this_syscall)
{
+ /* Filter calls should never use this function. */
+ BUG_ON(current->seccomp.mode == SECCOMP_MODE_FILTER);
+ __secure_computing_int(this_syscall);
+}
+
+int __secure_computing_int(int this_syscall)
+{
int mode = current->seccomp.mode;
int exit_code = SIGKILL;
int *syscall;
@@ -305,16 +313,29 @@ void __secure_computing(int this_syscall)
#endif
do {
if (*syscall == this_syscall)
- return;
+ return 0;
} while (*++syscall);
break;
#ifdef CONFIG_SECCOMP_FILTER
- case SECCOMP_MODE_FILTER:
- if (seccomp_run_filters(this_syscall) == SECCOMP_RET_ALLOW)
- return;
+ case SECCOMP_MODE_FILTER: {
+ u32 action = seccomp_run_filters(this_syscall);
+ switch (action & SECCOMP_RET_ACTION) {
+ case SECCOMP_RET_ERRNO:
+ /* Set the low-order 16-bits as an errno. */
+ syscall_set_return_value(current, task_pt_regs(current),
+ -(action & SECCOMP_RET_DATA),
+ 0);
+ return -1;
+ case SECCOMP_RET_ALLOW:
+ return 0;
+ case SECCOMP_RET_KILL:
+ default:
+ break;
+ }
seccomp_filter_log_failure(this_syscall);
exit_code = SIGSYS;
break;
+ }
#endif
default:
BUG();
@@ -325,6 +346,7 @@ void __secure_computing(int this_syscall)
#endif
audit_seccomp(this_syscall);
do_exit(exit_code);
+ return -1; /* never reached */
}
long prctl_get_seccomp(void)
--
1.7.5.4
This change adds support for a new ptrace option, PTRACE_O_TRACESECCOMP,
and a new return value for seccomp BPF programs, SECCOMP_RET_TRACE.
When a tracer specifies PTRACE_O_TRACESECCOMP while using
PTRACE_SYSCALL, system call notification will _only_ occur when
a seccomp BPF program returns SECCOMP_RET_TRACE. No other system
calls will notify the tracer.
If the subordinate process is not using a seccomp filter, then no
system call notifications will occur.
If there is no attached tracer when SECCOMP_RET_TRACE is returned,
the system call will not be executed and an -ENOSYS errno will be
returned to userspace.
Interestingly, this change does not add a dependency on the system
call slow path. Instead, seccomp will only interact with ptrace
if TIF_SYSCALL_TRACE is enabled, which also means the task is
already in the system call slow path and the requisite registers
are populated.
I realize that there are pending patches for cleaning up ptrace events.
I can either reintegrate with those when they are available or vice
versa. That's assuming these changes make sense and are viable. It's
also possible to use ptrace_event(PTRACE_EVENT_SECCOMP) instead, but it
seemed sane to share the syscall path.
v11: - invert the logic to just make it a PTRACE_SYSCALL accelerator
([email protected])
v10: - moved to PTRACE_O_SECCOMP / PT_TRACE_SECCOMP
v9: - n/a
v8: - guarded PTRACE_SECCOMP use with an ifdef
v7: - introduced
Signed-off-by: Will Drewry <[email protected]>
---
arch/Kconfig | 1 +
include/linux/ptrace.h | 7 +++++--
include/linux/seccomp.h | 4 +++-
include/linux/tracehook.h | 6 ++++++
kernel/ptrace.c | 4 ++++
kernel/seccomp.c | 18 ++++++++++++++++++
6 files changed, 37 insertions(+), 3 deletions(-)
diff --git a/arch/Kconfig b/arch/Kconfig
index d92a78e..bceced5 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -203,6 +203,7 @@ config HAVE_ARCH_SECCOMP_FILTER
bool
help
This symbol should be selected by an architecture if it provides:
+ linux/tracehook.h, for TIF_SYSCALL_TRACE and ptrace_report_syscall
asm/syscall.h:
- syscall_get_arch()
- syscall_get_arguments()
diff --git a/include/linux/ptrace.h b/include/linux/ptrace.h
index c2f1f6a..2fccdbc 100644
--- a/include/linux/ptrace.h
+++ b/include/linux/ptrace.h
@@ -62,8 +62,9 @@
#define PTRACE_O_TRACEEXEC 0x00000010
#define PTRACE_O_TRACEVFORKDONE 0x00000020
#define PTRACE_O_TRACEEXIT 0x00000040
+#define PTRACE_O_TRACESECCOMP 0x00000080
-#define PTRACE_O_MASK 0x0000007f
+#define PTRACE_O_MASK 0x000000ff
/* Wait extended result codes for the above trace options. */
#define PTRACE_EVENT_FORK 1
@@ -73,6 +74,7 @@
#define PTRACE_EVENT_VFORK_DONE 5
#define PTRACE_EVENT_EXIT 6
#define PTRACE_EVENT_STOP 7
+#define PTRACE_EVENT_SECCOMP 8 /* never directly delivered */
#include <asm/ptrace.h>
@@ -101,8 +103,9 @@
#define PT_TRACE_EXEC PT_EVENT_FLAG(PTRACE_EVENT_EXEC)
#define PT_TRACE_VFORK_DONE PT_EVENT_FLAG(PTRACE_EVENT_VFORK_DONE)
#define PT_TRACE_EXIT PT_EVENT_FLAG(PTRACE_EVENT_EXIT)
+#define PT_TRACE_SECCOMP PT_EVENT_FLAG(PTRACE_EVENT_SECCOMP)
-#define PT_TRACE_MASK 0x000003f4
+#define PT_TRACE_MASK 0x00000ff4
/* single stepping state bits (used on ARM and PA-RISC) */
#define PT_SINGLESTEP_BIT 31
diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h
index b44d038..b53104b 100644
--- a/include/linux/seccomp.h
+++ b/include/linux/seccomp.h
@@ -21,6 +21,7 @@
#define SECCOMP_RET_KILL 0x00000000U /* kill the task immediately */
#define SECCOMP_RET_TRAP 0x00020000U /* disallow and force a SIGSYS */
#define SECCOMP_RET_ERRNO 0x00030000U /* returns an errno */
+#define SECCOMP_RET_TRACE 0x7ffe0000U /* pass to a tracer or disallow */
#define SECCOMP_RET_ALLOW 0x7fff0000U /* allow */
/* Masks for the return value sections. */
@@ -55,6 +56,7 @@ struct seccomp_filter;
*
* @mode: indicates one of the valid values above for controlled
* system calls available to a process.
+ * @trace: tells tracehook to notify for the current syscall.
* @filter: The metadata and ruleset for determining what system calls
* are allowed for a task.
*
@@ -63,6 +65,7 @@ struct seccomp_filter;
*/
struct seccomp {
int mode;
+ int trace;
struct seccomp_filter *filter;
};
@@ -118,7 +121,6 @@ extern void copy_seccomp(struct seccomp *child, const struct seccomp *parent);
#else /* CONFIG_SECCOMP_FILTER */
/* The macro consumes the ->filter reference. */
#define put_seccomp_filter(_s) do { } while (0)
-
static inline void copy_seccomp(struct seccomp *c, const struct seccomp *p)
{
return;
diff --git a/include/linux/tracehook.h b/include/linux/tracehook.h
index a71a292..68e9478 100644
--- a/include/linux/tracehook.h
+++ b/include/linux/tracehook.h
@@ -48,6 +48,7 @@
#include <linux/sched.h>
#include <linux/ptrace.h>
+#include <linux/seccomp.h>
#include <linux/security.h>
struct linux_binprm;
@@ -61,6 +62,11 @@ static inline void ptrace_report_syscall(struct pt_regs *regs)
if (!(ptrace & PT_PTRACED))
return;
+#ifdef CONFIG_SECCOMP_FILTER
+ if ((ptrace & PT_TRACE_SECCOMP) && !current->seccomp.trace)
+ return;
+#endif
+
ptrace_notify(SIGTRAP | ((ptrace & PT_TRACESYSGOOD) ? 0x80 : 0));
/*
diff --git a/kernel/ptrace.c b/kernel/ptrace.c
index 00ab2ca..61e5ac4 100644
--- a/kernel/ptrace.c
+++ b/kernel/ptrace.c
@@ -19,6 +19,7 @@
#include <linux/signal.h>
#include <linux/audit.h>
#include <linux/pid_namespace.h>
+#include <linux/seccomp.h>
#include <linux/syscalls.h>
#include <linux/uaccess.h>
#include <linux/regset.h>
@@ -551,6 +552,9 @@ static int ptrace_setoptions(struct task_struct *child, unsigned long data)
if (data & PTRACE_O_TRACEEXIT)
child->ptrace |= PT_TRACE_EXIT;
+ if (data & PTRACE_O_TRACESECCOMP)
+ child->ptrace |= PT_TRACE_SECCOMP;
+
return (data & ~PTRACE_O_MASK) ? -EINVAL : 0;
}
diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index d2e173e..5aabc3c 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -354,6 +354,24 @@ int __secure_computing_int(int this_syscall)
seccomp_send_sigsys(this_syscall, reason_code);
return -1;
}
+ case SECCOMP_RET_TRACE: {
+ int ret;
+ struct pt_regs *regs = task_pt_regs(current);
+ if (!(test_tsk_thread_flag(current, TIF_SYSCALL_TRACE)) ||
+ !(current->ptrace & PT_TRACE_SECCOMP))
+ return -1;
+ /*
+ * PT_TRACE_SECCOMP and seccomp.trace indicate whether
+ * tracehook_report_syscall_entry needs to signal the
+ * tracer. This avoids race conditions in hand off and
+ * the requirement for TIF_SYSCALL_TRACE ensures that
+ * we are in the syscall slow path.
+ */
+ current->seccomp.trace = 1;
+ ret = tracehook_report_syscall_entry(regs);
+ current->seccomp.trace = 0;
+ return ret;
+ }
case SECCOMP_RET_ALLOW:
return 0;
case SECCOMP_RET_KILL:
--
1.7.5.4
[This patch depends on [email protected]'s no_new_privs patch:
https://lkml.org/lkml/2012/1/30/264
]
This patch adds support for seccomp mode 2. Mode 2 introduces the
ability for unprivileged processes to install system call filtering
policy expressed in terms of a Berkeley Packet Filter (BPF) program.
This program is evaluated in the kernel for each system call the
task makes and computes a result based on data in the format of
struct seccomp_data.
A filter program may be installed by calling:
struct sock_fprog fprog = { ... };
...
prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &fprog);
The return value of the filter program determines whether the system
call is allowed to proceed or is denied. If the first filter program
installed allows prctl(2) calls, then the above call may be made
repeatedly by a task to further reduce its access to the kernel. All
attached programs must be evaluated before a system call will be
allowed to proceed.
Filter programs will be inherited across fork/clone and execve.
However, if the task attaching the filter is unprivileged
(!CAP_SYS_ADMIN), the no_new_privs bit will be set on the task. This
ensures that unprivileged tasks cannot attach filters that affect
privileged tasks (e.g., a setuid binary).
There are a number of benefits to this approach, a few of which are
as follows:
- BPF has been exposed to userland for a long time
- BPF optimization (and JIT'ing) are well understood
- Userland already knows its ABI: system call numbers and desired
arguments
- No time-of-check-time-of-use vulnerable data accesses are possible.
- system call arguments are loaded on access only to minimize copying
required for system call policy decisions.
Mode 2 support is restricted to architectures that enable
HAVE_ARCH_SECCOMP_FILTER. In this patch, the primary dependency is on
syscall_get_arguments(). The full desired scope of this feature will
add a few minor additional requirements expressed later in this series.
Based on discussion, SECCOMP_RET_ERRNO and SECCOMP_RET_TRACE seem to be
the desired additional functionality.
No architectures are enabled in this patch.
v11: - reorder struct seccomp_data to allow future args expansion ([email protected])
- style clean up, @compat dropped, compat_sock_fprog32 ([email protected])
- do_exit(SIGSYS) ([email protected], [email protected])
- pare down Kconfig doc reference.
- extra comment clean up
v10: - seccomp_data has changed again to be more aesthetically pleasing
([email protected])
- calling convention is noted in a new u32 field using syscall_get_arch.
This allows for cross-calling convention tasks to use seccomp filters.
([email protected])
- lots of clean up (thanks, Indan!)
v9: - n/a
v8: - use bpf_chk_filter, bpf_run_filter. update load_fns
- Lots of fixes courtesy of [email protected]:
-- fix up load behavior, compat fixups, and merge alloc code,
-- renamed pc and dropped __packed, use bool compat.
-- Added a hidden CONFIG_SECCOMP_FILTER to synthesize non-arch
dependencies
v7: (massive overhaul thanks to Indan, others)
- added CONFIG_HAVE_ARCH_SECCOMP_FILTER
- merged into seccomp.c
- minimal seccomp_filter.h
- no config option (part of seccomp)
- no new prctl
- doesn't break seccomp on systems without asm/syscall.h
(works but arg access always fails)
- dropped seccomp_init_task, extra free functions, ...
- dropped the no-asm/syscall.h code paths
- merges with network sk_run_filter and sk_chk_filter
v6: - fix memory leak on attach compat check failure
- require no_new_privs || CAP_SYS_ADMIN prior to filter
installation. ([email protected])
- s/seccomp_struct_/seccomp_/ for macros/functions ([email protected])
- cleaned up Kconfig ([email protected])
- on block, note if the call was compat (so the # means something)
v5: - uses syscall_get_arguments
([email protected],[email protected], [email protected])
- uses union-based arg storage with hi/lo struct to
handle endianness. Compromises between the two alternate
proposals to minimize extra arg shuffling and account for
endianness assuming userspace uses offsetof().
([email protected], [email protected])
- update Kconfig description
- add include/seccomp_filter.h and add its installation
- (naive) on-demand syscall argument loading
- drop seccomp_t ([email protected])
v4: - adjusted prctl to make room for PR_[SG]ET_NO_NEW_PRIVS
- now uses current->no_new_privs
([email protected],[email protected])
- assign names to seccomp modes ([email protected])
- fix style issues ([email protected])
- reworded Kconfig entry ([email protected])
v3: - macros to inline ([email protected])
- init_task behavior fixed ([email protected])
- drop creator entry and extra NULL check ([email protected])
- alloc returns -EINVAL on bad sizing ([email protected])
- adds tentative use of "always_unprivileged" as per
[email protected] and [email protected]
v2: - (patch 2 only)
Signed-off-by: Will Drewry <[email protected]>
---
arch/Kconfig | 17 +++
include/linux/Kbuild | 1 +
include/linux/seccomp.h | 74 ++++++++++-
kernel/fork.c | 3 +
kernel/seccomp.c | 328 ++++++++++++++++++++++++++++++++++++++++++++---
kernel/sys.c | 2 +-
6 files changed, 402 insertions(+), 23 deletions(-)
diff --git a/arch/Kconfig b/arch/Kconfig
index 4f55c73..7c6bd48 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -199,4 +199,21 @@ config HAVE_CMPXCHG_LOCAL
config HAVE_CMPXCHG_DOUBLE
bool
+config HAVE_ARCH_SECCOMP_FILTER
+ bool
+ help
+ This symbol should be selected by an architecture if it provides
+ asm/syscall.h, specifically syscall_get_arguments() and
+ syscall_get_arch().
+
+config SECCOMP_FILTER
+ def_bool y
+ depends on HAVE_ARCH_SECCOMP_FILTER && SECCOMP && NET
+ help
+ Enable tasks to build secure computing environments defined
+ in terms of Berkeley Packet Filter programs which implement
+ task-defined system call filtering polices.
+
+ See Documentation/prctl/seccomp_filter.txt for details.
+
source "kernel/gcov/Kconfig"
diff --git a/include/linux/Kbuild b/include/linux/Kbuild
index c94e717..d41ba12 100644
--- a/include/linux/Kbuild
+++ b/include/linux/Kbuild
@@ -330,6 +330,7 @@ header-y += scc.h
header-y += sched.h
header-y += screen_info.h
header-y += sdla.h
+header-y += seccomp.h
header-y += securebits.h
header-y += selinux_netlink.h
header-y += sem.h
diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h
index d61f27f..6e69274 100644
--- a/include/linux/seccomp.h
+++ b/include/linux/seccomp.h
@@ -1,14 +1,67 @@
#ifndef _LINUX_SECCOMP_H
#define _LINUX_SECCOMP_H
+#include <linux/compiler.h>
+#include <linux/types.h>
+
+
+/* Valid values for seccomp.mode and prctl(PR_SET_SECCOMP, <mode>) */
+#define SECCOMP_MODE_DISABLED 0 /* seccomp is not in use. */
+#define SECCOMP_MODE_STRICT 1 /* uses hard-coded filter. */
+#define SECCOMP_MODE_FILTER 2 /* uses user-supplied filter. */
+
+/*
+ * All BPF programs must return a 32-bit value.
+ * The bottom 16-bits are reserved for future use.
+ * The upper 16-bits are ordered from least permissive values to most.
+ *
+ * The ordering ensures that a min_t() over composed return values always
+ * selects the least permissive choice.
+ */
+#define SECCOMP_RET_KILL 0x00000000U /* kill the task immediately */
+#define SECCOMP_RET_ALLOW 0x7fff0000U /* allow */
+
+/* Masks for the return value sections. */
+#define SECCOMP_RET_ACTION 0xffff0000U
+#define SECCOMP_RET_DATA 0x0000ffffU
+
+/**
+ * struct seccomp_data - the format the BPF program executes over.
+ * @nr: the system call number
+ * @arch: indicates system call convention as an AUDIT_ARCH_* value
+ * as defined in <linux/audit.h>.
+ * @instruction_pointer: at the time of the system call.
+ * @args: up to 6 system call arguments always stored as 64-bit values
+ * regardless of the architecture.
+ */
+struct seccomp_data {
+ int nr;
+ __u32 arch;
+ __u64 instruction_pointer;
+ __u64 args[6];
+};
+#ifdef __KERNEL__
#ifdef CONFIG_SECCOMP
#include <linux/thread_info.h>
#include <asm/seccomp.h>
+struct seccomp_filter;
+/**
+ * struct seccomp - the state of a seccomp'ed process
+ *
+ * @mode: indicates one of the valid values above for controlled
+ * system calls available to a process.
+ * @filter: The metadata and ruleset for determining what system calls
+ * are allowed for a task.
+ *
+ * @filter must only be accessed from the context of current as there
+ * is no locking.
+ */
struct seccomp {
int mode;
+ struct seccomp_filter *filter;
};
extern void __secure_computing(int);
@@ -19,7 +72,7 @@ static inline void secure_computing(int this_syscall)
}
extern long prctl_get_seccomp(void);
-extern long prctl_set_seccomp(unsigned long);
+extern long prctl_set_seccomp(unsigned long, char __user *);
static inline int seccomp_mode(struct seccomp *s)
{
@@ -31,15 +84,16 @@ static inline int seccomp_mode(struct seccomp *s)
#include <linux/errno.h>
struct seccomp { };
+struct seccomp_filter { };
-#define secure_computing(x) do { } while (0)
+#define secure_computing(x) 0
static inline long prctl_get_seccomp(void)
{
return -EINVAL;
}
-static inline long prctl_set_seccomp(unsigned long arg2)
+static inline long prctl_set_seccomp(unsigned long arg2, char __user *arg3)
{
return -EINVAL;
}
@@ -48,7 +102,19 @@ static inline int seccomp_mode(struct seccomp *s)
{
return 0;
}
-
#endif /* CONFIG_SECCOMP */
+#ifdef CONFIG_SECCOMP_FILTER
+extern void put_seccomp_filter(struct seccomp_filter *);
+extern void copy_seccomp(struct seccomp *child, const struct seccomp *parent);
+#else /* CONFIG_SECCOMP_FILTER */
+/* The macro consumes the ->filter reference. */
+#define put_seccomp_filter(_s) do { } while (0)
+
+static inline void copy_seccomp(struct seccomp *c, const struct seccomp *p)
+{
+ return;
+}
+#endif /* CONFIG_SECCOMP_FILTER */
+#endif /* __KERNEL__ */
#endif /* _LINUX_SECCOMP_H */
diff --git a/kernel/fork.c b/kernel/fork.c
index b77fd55..a5187b7 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -34,6 +34,7 @@
#include <linux/cgroup.h>
#include <linux/security.h>
#include <linux/hugetlb.h>
+#include <linux/seccomp.h>
#include <linux/swap.h>
#include <linux/syscalls.h>
#include <linux/jiffies.h>
@@ -169,6 +170,7 @@ void free_task(struct task_struct *tsk)
free_thread_info(tsk->stack);
rt_mutex_debug_task_free(tsk);
ftrace_graph_exit_task(tsk);
+ put_seccomp_filter(tsk->seccomp.filter);
free_task_struct(tsk);
}
EXPORT_SYMBOL(free_task);
@@ -1113,6 +1115,7 @@ static struct task_struct *copy_process(unsigned long clone_flags,
goto fork_out;
ftrace_graph_init_task(p);
+ copy_seccomp(&p->seccomp, &current->seccomp);
rt_mutex_init_task(p);
diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index e8d76c5..25e8296 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -3,16 +3,275 @@
*
* Copyright 2004-2005 Andrea Arcangeli <[email protected]>
*
- * This defines a simple but solid secure-computing mode.
+ * Copyright (C) 2012 Google, Inc.
+ * Will Drewry <[email protected]>
+ *
+ * This defines a simple but solid secure-computing facility.
+ *
+ * Mode 1 uses a fixed list of allowed system calls.
+ * Mode 2 allows user-defined system call filters in the form
+ * of Berkeley Packet Filters/Linux Socket Filters.
*/
+#include <linux/atomic.h>
#include <linux/audit.h>
-#include <linux/seccomp.h>
-#include <linux/sched.h>
#include <linux/compat.h>
+#include <linux/filter.h>
+#include <linux/sched.h>
+#include <linux/seccomp.h>
+#include <linux/security.h>
+#include <linux/slab.h>
+#include <linux/uaccess.h>
+
+#include <linux/tracehook.h>
+#include <asm/syscall.h>
/* #define SECCOMP_DEBUG 1 */
-#define NR_SECCOMP_MODES 1
+
+#ifdef CONFIG_SECCOMP_FILTER
+/**
+ * struct seccomp_filter - container for seccomp BPF programs
+ *
+ * @usage: reference count to manage the object lifetime.
+ * get/put helpers should be used when accessing an instance
+ * outside of a lifetime-guarded section. In general, this
+ * is only needed for handling filters shared across tasks.
+ * @prev: points to a previously installed, or inherited, filter
+ * @insns: the BPF program instructions to evaluate
+ * @len: the number of instructions in the program
+ *
+ * seccomp_filter objects are organized in a tree linked via the @prev
+ * pointer. For any task, it appears to be a singly-linked list starting
+ * with current->seccomp.filter, the most recently attached or inherited filter.
+ * However, multiple filters may share a @prev node, by way of fork(), which
+ * results in a unidirectional tree existing in memory. This is similar to
+ * how namespaces work.
+ *
+ * seccomp_filter objects should never be modified after being attached
+ * to a task_struct (other than @usage).
+ */
+struct seccomp_filter {
+ atomic_t usage;
+ struct seccomp_filter *prev;
+ unsigned short len; /* Instruction count */
+ struct sock_filter insns[];
+};
+
+static void seccomp_filter_log_failure(int syscall)
+{
+ int compat = 0;
+#ifdef CONFIG_COMPAT
+ compat = is_compat_task();
+#endif
+ pr_info("%s[%d]: %ssystem call %d blocked at 0x%lx\n",
+ current->comm, task_pid_nr(current),
+ (compat ? "compat " : ""),
+ syscall, KSTK_EIP(current));
+}
+
+/**
+ * get_u32 - returns a u32 offset into data
+ * @data: an unsigned 64-bit value
+ * @index: 0 or 1 to return the first or second 32 bits
+ *
+ * This inline exists to hide the length of unsigned long.
+ * If a 32-bit unsigned long is passed in, it will be extended
+ * and the top 32-bits will be 0. If it is a 64-bit unsigned
+ * long, then whatever data is resident will be properly returned.
+ */
+static inline u32 get_u32(u64 data, int index)
+{
+ return ((u32 *)&data)[index];
+}
+
+/* Helper for bpf_load below. */
+#define BPF_DATA(_name) offsetof(struct seccomp_data, _name)
+/**
+ * bpf_load: checks and returns a pointer to the requested offset
+ * @nr: int syscall passed as a void * to bpf_run_filter
+ * @off: offset into struct seccomp_data to load from
+ * @size: number of bytes to load (must be 4)
+ * @buf: temporary storage supplied by bpf_run_filter (4 bytes)
+ *
+ * Returns a pointer to the loaded data (usually @buf).
+ * On failure, returns NULL.
+ */
+static void *bpf_load(const void *nr, int off, unsigned int size, void *buf)
+{
+ unsigned long value;
+ u32 *A = buf;
+
+ if (size != sizeof(u32) || off % sizeof(u32))
+ return NULL;
+
+ if (off >= BPF_DATA(args[0]) && off < BPF_DATA(args[6])) {
+ struct pt_regs *regs = task_pt_regs(current);
+ int arg = (off - BPF_DATA(args[0])) / sizeof(u64);
+ int index = (off % sizeof(u64)) ? 1 : 0;
+ syscall_get_arguments(current, regs, arg, 1, &value);
+ *A = get_u32(value, index);
+ } else if (off == BPF_DATA(nr)) {
+ *A = (u32)(uintptr_t)nr;
+ } else if (off == BPF_DATA(arch)) {
+ struct pt_regs *regs = task_pt_regs(current);
+ *A = syscall_get_arch(current, regs);
+ } else if (off == BPF_DATA(instruction_pointer)) {
+ *A = get_u32(KSTK_EIP(current), 0);
+ } else if (off == BPF_DATA(instruction_pointer) + sizeof(u32)) {
+ *A = get_u32(KSTK_EIP(current), 1);
+ } else {
+ return NULL;
+ }
+ return buf;
+}
+
+/**
+ * seccomp_run_filters - evaluates all seccomp filters against @syscall
+ * @syscall: number of the current system call
+ *
+ * Returns valid seccomp BPF response codes.
+ */
+static u32 seccomp_run_filters(int syscall)
+{
+ struct seccomp_filter *f;
+ u32 ret = SECCOMP_RET_KILL;
+ static const struct bpf_load_fn fns = {
+ bpf_load,
+ sizeof(struct seccomp_data),
+ };
+ const void *sc_ptr = (const void *)(uintptr_t)syscall;
+
+ /*
+ * All filters are evaluated in order of youngest to oldest. The lowest
+ * BPF return value always takes priority.
+ */
+ for (f = current->seccomp.filter; f; f = f->prev) {
+ ret = bpf_run_filter(sc_ptr, f->insns, &fns);
+ if (ret != SECCOMP_RET_ALLOW)
+ break;
+ }
+ return ret;
+}
+
+/**
+ * seccomp_attach_filter: Attaches a seccomp filter to current.
+ * @fprog: BPF program to install
+ *
+ * Returns 0 on success or an errno on failure.
+ */
+static long seccomp_attach_filter(struct sock_fprog *fprog)
+{
+ struct seccomp_filter *filter;
+ unsigned long fp_size = fprog->len * sizeof(struct sock_filter);
+ long ret;
+
+ if (fprog->len == 0 || fprog->len > BPF_MAXINSNS)
+ return -EINVAL;
+
+ /* Allocate a new seccomp_filter */
+ filter = kzalloc(sizeof(struct seccomp_filter) + fp_size, GFP_KERNEL);
+ if (!filter)
+ return -ENOMEM;
+ atomic_set(&filter->usage, 1);
+ filter->len = fprog->len;
+
+ /* Copy the instructions from fprog. */
+ ret = -EFAULT;
+ if (copy_from_user(filter->insns, fprog->filter, fp_size))
+ goto out;
+
+ /* Check the fprog */
+ ret = bpf_chk_filter(filter->insns, filter->len, BPF_CHK_FLAGS_NO_SKB);
+ if (ret)
+ goto out;
+
+ /*
+ * Installing a seccomp filter requires that the task have
+ * CAP_SYS_ADMIN in its namespace or be running with no_new_privs.
+ * This avoids scenarios where unprivileged tasks can affect the
+ * behavior of privileged children.
+ */
+ ret = -EACCES;
+ if (!current->no_new_privs &&
+ security_capable_noaudit(current_cred(), current_user_ns(),
+ CAP_SYS_ADMIN) != 0)
+ goto out;
+
+ /*
+ * If there is an existing filter, make it the prev and don't drop its
+ * task reference.
+ */
+ filter->prev = current->seccomp.filter;
+ current->seccomp.filter = filter;
+ return 0;
+out:
+ put_seccomp_filter(filter); /* for get or task, on err */
+ return ret;
+}
+
+/**
+ * seccomp_attach_user_filter - attaches a user-supplied sock_fprog
+ * @user_filter: pointer to the user data containing a sock_fprog.
+ *
+ * Returns 0 on success and non-zero otherwise.
+ */
+long seccomp_attach_user_filter(char __user *user_filter)
+{
+ struct sock_fprog fprog;
+ long ret = -EFAULT;
+
+ if (!user_filter)
+ goto out;
+#ifdef CONFIG_COMPAT
+ if (is_compat_task()) {
+ struct compat_sock_fprog fprog32;
+ if (copy_from_user(&fprog32, user_filter, sizeof(fprog32)))
+ goto out;
+ fprog.len = fprog32.len;
+ fprog.filter = compat_ptr(fprog32.filter);
+ } else /* falls through to the if below. */
+#endif
+ if (copy_from_user(&fprog, user_filter, sizeof(fprog)))
+ goto out;
+ ret = seccomp_attach_filter(&fprog);
+out:
+ return ret;
+}
+
+/* get_seccomp_filter - increments the reference count of @orig. */
+static struct seccomp_filter *get_seccomp_filter(struct seccomp_filter *orig)
+{
+ if (!orig)
+ return NULL;
+ /* Reference count is bounded by the number of total processes. */
+ atomic_inc(&orig->usage);
+ return orig;
+}
+
+/* put_seccomp_filter - decrements the ref count of @orig and may free. */
+void put_seccomp_filter(struct seccomp_filter *orig)
+{
+ /* Clean up single-reference branches iteratively. */
+ while (orig && atomic_dec_and_test(&orig->usage)) {
+ struct seccomp_filter *freeme = orig;
+ orig = orig->prev;
+ kfree(freeme);
+ }
+}
+
+/**
+ * copy_seccomp: manages inheritance on fork
+ * @child: forkee's seccomp
+ * @parent: forker's seccomp
+ *
+ * Ensures that @child inherits filters if in use.
+ */
+void copy_seccomp(struct seccomp *child, const struct seccomp *parent)
+{
+ /* Other fields are handled by dup_task_struct. */
+ child->filter = get_seccomp_filter(parent->filter);
+}
+#endif /* CONFIG_SECCOMP_FILTER */
/*
* Secure computing mode 1 allows only read/write/exit/sigreturn.
@@ -34,10 +293,11 @@ static int mode1_syscalls_32[] = {
void __secure_computing(int this_syscall)
{
int mode = current->seccomp.mode;
- int * syscall;
+ int exit_code = SIGKILL;
+ int *syscall;
switch (mode) {
- case 1:
+ case SECCOMP_MODE_STRICT:
syscall = mode1_syscalls;
#ifdef CONFIG_COMPAT
if (is_compat_task())
@@ -48,6 +308,14 @@ void __secure_computing(int this_syscall)
return;
} while (*++syscall);
break;
+#ifdef CONFIG_SECCOMP_FILTER
+ case SECCOMP_MODE_FILTER:
+ if (seccomp_run_filters(this_syscall) == SECCOMP_RET_ALLOW)
+ return;
+ seccomp_filter_log_failure(this_syscall);
+ exit_code = SIGSYS;
+ break;
+#endif
default:
BUG();
}
@@ -56,7 +324,7 @@ void __secure_computing(int this_syscall)
dump_stack();
#endif
audit_seccomp(this_syscall);
- do_exit(SIGKILL);
+ do_exit(exit_code);
}
long prctl_get_seccomp(void)
@@ -64,25 +332,49 @@ long prctl_get_seccomp(void)
return current->seccomp.mode;
}
-long prctl_set_seccomp(unsigned long seccomp_mode)
+/**
+ * prctl_set_seccomp: configures current->seccomp.mode
+ * @seccomp_mode: requested mode to use
+ * @filter: optional struct sock_fprog for use with SECCOMP_MODE_FILTER
+ *
+ * This function may be called repeatedly with a @seccomp_mode of
+ * SECCOMP_MODE_FILTER to install additional filters. Every filter
+ * successfully installed will be evaluated (in reverse order) for each system
+ * call the task makes.
+ *
+ * It is not possible to change current->seccomp.mode to a different
+ * mode once one has been set on a task.
+ *
+ * Returns 0 on success or -EINVAL on failure.
+ */
+long prctl_set_seccomp(unsigned long seccomp_mode, char __user *filter)
{
- long ret;
+ long ret = -EINVAL;
- /* can set it only once to be even more secure */
- ret = -EPERM;
- if (unlikely(current->seccomp.mode))
+ if (current->seccomp.mode &&
+ current->seccomp.mode != seccomp_mode)
goto out;
- ret = -EINVAL;
- if (seccomp_mode && seccomp_mode <= NR_SECCOMP_MODES) {
- current->seccomp.mode = seccomp_mode;
- set_thread_flag(TIF_SECCOMP);
+ switch (seccomp_mode) {
+ case SECCOMP_MODE_STRICT:
+ ret = 0;
#ifdef TIF_NOTSC
disable_TSC();
#endif
- ret = 0;
+ break;
+#ifdef CONFIG_SECCOMP_FILTER
+ case SECCOMP_MODE_FILTER:
+ ret = seccomp_attach_user_filter(filter);
+ if (ret)
+ goto out;
+ break;
+#endif
+ default:
+ goto out;
}
- out:
+ current->seccomp.mode = seccomp_mode;
+ set_thread_flag(TIF_SECCOMP);
+out:
return ret;
}
diff --git a/kernel/sys.c b/kernel/sys.c
index 4070153..905031e 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -1899,7 +1899,7 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
error = prctl_get_seccomp();
break;
case PR_SET_SECCOMP:
- error = prctl_set_seccomp(arg2);
+ error = prctl_set_seccomp(arg2, (char __user *)arg3);
break;
case PR_GET_TSC:
error = GET_TSC_CTL(arg2);
--
1.7.5.4
Documents how system call filtering using Berkeley Packet
Filter programs works and how it may be used.
Includes an example for x86 (32-bit) and a semi-generic
example using a macro-based code generator.
v11: - overhaul return value language, updates ([email protected])
- comment on do_exit(SIGSYS)
v10: - update for SIGSYS
- update for new seccomp_data layout
- update for ptrace option use
v9: - updated bpf-direct.c for SIGILL
v8: - add PR_SET_NO_NEW_PRIVS to the samples.
v7: - updated for all the new stuff in v7: TRAP, TRACE
- only talk about PR_SET_SECCOMP now
- fixed bad JLE32 check ([email protected])
- adds dropper.c: a simple system call disabler
v6: - tweak the language to note the requirement of
PR_SET_NO_NEW_PRIVS being called prior to use. ([email protected])
v5: - update sample to use system call arguments
- adds a "fancy" example using a macro-based generator
- cleaned up bpf in the sample
- update docs to mention arguments
- fix prctl value ([email protected])
- language cleanup ([email protected])
v4: - update for no_new_privs use
- minor tweaks
v3: - call out BPF <-> Berkeley Packet Filter ([email protected])
- document use of tentative always-unprivileged
- guard sample compilation for i386 and x86_64
v2: - move code to samples ([email protected])
Signed-off-by: Will Drewry <[email protected]>
---
Documentation/prctl/seccomp_filter.txt | 150 ++++++++++++++++++++
samples/Makefile | 2 +-
samples/seccomp/Makefile | 31 ++++
samples/seccomp/bpf-direct.c | 150 ++++++++++++++++++++
samples/seccomp/bpf-fancy.c | 102 ++++++++++++++
samples/seccomp/bpf-helper.c | 89 ++++++++++++
samples/seccomp/bpf-helper.h | 236 ++++++++++++++++++++++++++++++++
samples/seccomp/dropper.c | 68 +++++++++
8 files changed, 827 insertions(+), 1 deletions(-)
create mode 100644 Documentation/prctl/seccomp_filter.txt
create mode 100644 samples/seccomp/Makefile
create mode 100644 samples/seccomp/bpf-direct.c
create mode 100644 samples/seccomp/bpf-fancy.c
create mode 100644 samples/seccomp/bpf-helper.c
create mode 100644 samples/seccomp/bpf-helper.h
create mode 100644 samples/seccomp/dropper.c
diff --git a/Documentation/prctl/seccomp_filter.txt b/Documentation/prctl/seccomp_filter.txt
new file mode 100644
index 0000000..4e78773
--- /dev/null
+++ b/Documentation/prctl/seccomp_filter.txt
@@ -0,0 +1,150 @@
+ SECure COMPuting with filters
+ =============================
+
+Introduction
+------------
+
+A large number of system calls are exposed to every userland process
+with many of them going unused for the entire lifetime of the process.
+As system calls change and mature, bugs are found and eradicated. A
+certain subset of userland applications benefit by having a reduced set
+of available system calls. The resulting set reduces the total kernel
+surface exposed to the application. System call filtering is meant for
+use with those applications.
+
+Seccomp filtering provides a means for a process to specify a filter for
+incoming system calls. The filter is expressed as a Berkeley Packet
+Filter (BPF) program, as with socket filters, except that the data
+operated on is related to the system call being made: system call
+number and the system call arguments. This allows for expressive
+filtering of system calls using a filter program language with a long
+history of being exposed to userland and a straightforward data set.
+
+Additionally, BPF makes it impossible for users of seccomp to fall prey
+to time-of-check-time-of-use (TOCTOU) attacks that are common in system
+call interposition frameworks. BPF programs may not dereference
+pointers, which constrains all filters to solely evaluating the system
+call arguments directly.
+
+What it isn't
+-------------
+
+System call filtering isn't a sandbox. It provides a clearly defined
+mechanism for minimizing the exposed kernel surface. It is meant to be
+a tool for sandbox developers to use. Beyond that, policy for logical
+behavior and information flow should be managed with a combination of
+other system hardening techniques and, potentially, an LSM of your
+choosing. Expressive, dynamic filters provide further options down this
+path (avoiding pathological sizes or selecting which of the multiplexed
+system calls in socketcall() is allowed, for instance) which could be
+construed, incorrectly, as a more complete sandboxing solution.
+
+Usage
+-----
+
+An additional seccomp mode is added and is enabled using the same
+prctl(2) call as strict seccomp. If the architecture has
+CONFIG_HAVE_ARCH_SECCOMP_FILTER, then filters may be added as below:
+
+PR_SET_SECCOMP:
+ Now takes an additional argument which specifies a new filter
+ using a BPF program.
+ The BPF program will be executed over struct seccomp_data
+ reflecting the system call number, arguments, and other
+ metadata. The BPF program must then return one of the
+ acceptable values to inform the kernel which action should be
+ taken.
+
+ Usage:
+ prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, prog);
+
+ The 'prog' argument is a pointer to a struct sock_fprog which
+ will contain the filter program. If the program is invalid, the
+ call will return -1 and set errno to EINVAL.
+
+ Note, is_compat_task() is also tracked for the @prog. This means
+ that, once the filter is set, the calling task will have all of its
+ system calls blocked if it switches its system call ABI.
+
+ If fork/clone and execve are allowed by @prog, any child
+ processes will be constrained to the same filters and system
+ call ABI as the parent.
+
+ Prior to use, the task must call prctl(PR_SET_NO_NEW_PRIVS, 1) or
+ run with CAP_SYS_ADMIN privileges in its namespace. If these are not
+ true, -EACCES will be returned. This requirement ensures that filter
+ programs cannot be applied to child processes with greater privileges
+ than the task that installed them.
+
+ Additionally, if prctl(2) is allowed by the attached filter,
+ additional filters may be layered on which will increase evaluation
+ time, but allow for further decreasing the attack surface during
+ execution of a process.
+
+The above call returns 0 on success and non-zero on error.
+
+Return values
+-------------
+A seccomp filter may return any of the following values. If multiple
+filters exist, the return value for the evaluation of a given system
+call will always use the highest precedence value. (For example,
+SECCOMP_RET_KILL will always take precedence.)
+
+In precedence order, they are:
+
+SECCOMP_RET_KILL:
+ Results in the task exiting immediately without executing the
+ system call. The exit status of the task (status & 0x7f) will
+ be SIGSYS, not SIGKILL.
+
+SECCOMP_RET_TRAP:
+ Results in the kernel sending a SIGSYS signal to the triggering
+ task without executing the system call. The kernel will
+ rollback the register state to just before the system call
+ entry such that a signal handler in the task will be able to
+ inspect the ucontext_t->uc_mcontext registers and emulate
+ system call success or failure upon return from the signal
+ handler.
+
+ SIGSYS triggered by seccomp will have a si_code of SYS_SECCOMP.
+
+SECCOMP_RET_ERRNO:
+ Results in the lower 16 bits of the return value being passed
+ to userland as the errno without executing the system call.
+
+SECCOMP_RET_TRACE:
+ When returned, this value will cause the kernel to attempt to
+ notify a ptrace()-based tracer prior to executing the system
+ call. This return value is only valid if the task is currently
+ being traced (TIF_SYSCALL_TRACE). If it is not being traced or
+ the ptrace options are invalid, -ENOSYS is returned to userland
+ and the system call is not executed.
+
+ A tracer will be notified if it requests PTRACE_O_TRACESECCOMP
+ using ptrace(PTRACE_SETOPTIONS) and traces the process using
+ ptrace(PTRACE_SYSCALL). This feature allows seccomp filter
+ programs to act as in-kernel accelerators for ptrace-based
+ system call filtering frameworks.
+
+SECCOMP_RET_ALLOW:
+ Results in the system call being executed.
+
+
+Example
+-------
+
+The samples/seccomp/ directory contains both a 32-bit specific example
+and a more generic example of a higher level macro interface for BPF
+program generation.
+
+Adding architecture support
+---------------------------
+
+See arch/Kconfig for the required functionality. In general, if an
+architecture supports both tracehook and seccomp, it will be able to
+support seccomp filtering with minor alteration. It then only needs to
+add CONFIG_HAVE_ARCH_SECCOMP_FILTER to its arch-specific Kconfig.
diff --git a/samples/Makefile b/samples/Makefile
index 6280817..f29b19c 100644
--- a/samples/Makefile
+++ b/samples/Makefile
@@ -1,4 +1,4 @@
# Makefile for Linux samples code
obj-$(CONFIG_SAMPLES) += kobject/ kprobes/ tracepoints/ trace_events/ \
- hw_breakpoint/ kfifo/ kdb/ hidraw/
+ hw_breakpoint/ kfifo/ kdb/ hidraw/ seccomp/
diff --git a/samples/seccomp/Makefile b/samples/seccomp/Makefile
new file mode 100644
index 0000000..38922f7
--- /dev/null
+++ b/samples/seccomp/Makefile
@@ -0,0 +1,31 @@
+# kbuild trick to avoid linker error. Can be omitted if a module is built.
+obj- := dummy.o
+
+hostprogs-$(CONFIG_SECCOMP) := bpf-fancy dropper
+bpf-fancy-objs := bpf-fancy.o bpf-helper.o
+
+HOSTCFLAGS_bpf-fancy.o += -I$(objtree)/usr/include
+HOSTCFLAGS_bpf-fancy.o += -idirafter $(objtree)/include
+HOSTCFLAGS_bpf-helper.o += -I$(objtree)/usr/include
+HOSTCFLAGS_bpf-helper.o += -idirafter $(objtree)/include
+
+HOSTCFLAGS_dropper.o += -I$(objtree)/usr/include
+HOSTCFLAGS_dropper.o += -idirafter $(objtree)/include
+dropper-objs := dropper.o
+
+# bpf-direct.c is x86-only.
+ifeq ($(filter-out x86_64 i386,$(KBUILD_BUILDHOST)),)
+# List of programs to build
+hostprogs-$(CONFIG_SECCOMP) += bpf-direct
+bpf-direct-objs := bpf-direct.o
+endif
+
+# Tell kbuild to always build the programs
+always := $(hostprogs-y)
+
+HOSTCFLAGS_bpf-direct.o += -I$(objtree)/usr/include
+HOSTCFLAGS_bpf-direct.o += -idirafter $(objtree)/include
+ifeq ($(KBUILD_BUILDHOST),x86_64)
+HOSTCFLAGS_bpf-direct.o += -m32
+HOSTLOADLIBES_bpf-direct += -m32
+endif
diff --git a/samples/seccomp/bpf-direct.c b/samples/seccomp/bpf-direct.c
new file mode 100644
index 0000000..56e5443
--- /dev/null
+++ b/samples/seccomp/bpf-direct.c
@@ -0,0 +1,150 @@
+/*
+ * 32-bit seccomp filter example with BPF macros
+ *
+ * Copyright (c) 2012 The Chromium OS Authors <[email protected]>
+ * Author: Will Drewry <[email protected]>
+ *
+ * The code may be used by anyone for any purpose,
+ * and can serve as a starting point for developing
+ * applications using prctl(PR_SET_SECCOMP, 2, ...).
+ */
+#define __USE_GNU 1
+#define _GNU_SOURCE 1
+
+#include <linux/types.h>
+#include <linux/filter.h>
+#include <linux/seccomp.h>
+#include <linux/unistd.h>
+#include <signal.h>
+#include <stdio.h>
+#include <stddef.h>
+#include <string.h>
+#include <sys/prctl.h>
+#include <unistd.h>
+
+#define syscall_arg(_n) (offsetof(struct seccomp_data, args[_n]))
+#define syscall_nr (offsetof(struct seccomp_data, nr))
+
+#ifndef PR_SET_NO_NEW_PRIVS
+#define PR_SET_NO_NEW_PRIVS 36
+#endif
+
+#ifndef SYS_SECCOMP
+#define SYS_SECCOMP 1
+#endif
+
+static void emulator(int nr, siginfo_t *info, void *void_context)
+{
+ ucontext_t *ctx = (ucontext_t *)(void_context);
+ int syscall;
+ char *buf;
+ ssize_t bytes;
+ size_t len;
+ if (info->si_code != SYS_SECCOMP)
+ return;
+ if (!ctx)
+ return;
+ syscall = ctx->uc_mcontext.gregs[REG_EAX];
+ buf = (char *) ctx->uc_mcontext.gregs[REG_ECX];
+ len = (size_t) ctx->uc_mcontext.gregs[REG_EDX];
+
+ if (syscall != __NR_write)
+ return;
+ if (ctx->uc_mcontext.gregs[REG_EBX] != STDERR_FILENO)
+ return;
+ /* Redirect stderr messages to stdout. Doesn't handle EINTR, etc */
+ write(STDOUT_FILENO, "[ERR] ", 6);
+ bytes = write(STDOUT_FILENO, buf, len);
+ ctx->uc_mcontext.gregs[REG_EAX] = bytes;
+ return;
+}
+
+static int install_emulator(void)
+{
+ struct sigaction act;
+ sigset_t mask;
+ memset(&act, 0, sizeof(act));
+ sigemptyset(&mask);
+ sigaddset(&mask, SIGSYS);
+
+ act.sa_sigaction = &emulator;
+ act.sa_flags = SA_SIGINFO;
+ if (sigaction(SIGSYS, &act, NULL) < 0) {
+ perror("sigaction");
+ return -1;
+ }
+ if (sigprocmask(SIG_UNBLOCK, &mask, NULL)) {
+ perror("sigprocmask");
+ return -1;
+ }
+ return 0;
+}
+
+static int install_filter(void)
+{
+ struct sock_filter filter[] = {
+ /* Grab the system call number */
+ BPF_STMT(BPF_LD+BPF_W+BPF_ABS, syscall_nr),
+ /* Jump table for the allowed syscalls */
+ BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, __NR_rt_sigreturn, 0, 1),
+ BPF_STMT(BPF_RET+BPF_K, SECCOMP_RET_ALLOW),
+ BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, __NR_sigreturn, 0, 1),
+ BPF_STMT(BPF_RET+BPF_K, SECCOMP_RET_ALLOW),
+ BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, __NR_exit_group, 0, 1),
+ BPF_STMT(BPF_RET+BPF_K, SECCOMP_RET_ALLOW),
+ BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, __NR_exit, 0, 1),
+ BPF_STMT(BPF_RET+BPF_K, SECCOMP_RET_ALLOW),
+ BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, __NR_read, 1, 0),
+ BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, __NR_write, 3, 2),
+
+ /* Check that read is only using stdin. */
+ BPF_STMT(BPF_LD+BPF_W+BPF_ABS, syscall_arg(0)),
+ BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, STDIN_FILENO, 4, 0),
+ BPF_STMT(BPF_RET+BPF_K, SECCOMP_RET_KILL),
+
+ /* Check that write is only using stdout */
+ BPF_STMT(BPF_LD+BPF_W+BPF_ABS, syscall_arg(0)),
+ BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, STDOUT_FILENO, 1, 0),
+ /* Trap attempts to write to stderr */
+ BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, STDERR_FILENO, 1, 2),
+
+ BPF_STMT(BPF_RET+BPF_K, SECCOMP_RET_ALLOW),
+ BPF_STMT(BPF_RET+BPF_K, SECCOMP_RET_TRAP),
+ BPF_STMT(BPF_RET+BPF_K, SECCOMP_RET_KILL),
+ };
+ struct sock_fprog prog = {
+ .len = (unsigned short)(sizeof(filter)/sizeof(filter[0])),
+ .filter = filter,
+ };
+
+ if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0)) {
+ perror("prctl(NO_NEW_PRIVS)");
+ return 1;
+ }
+
+
+ if (prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog)) {
+ perror("prctl");
+ return 1;
+ }
+ return 0;
+}
+
+#define payload(_c) (_c), sizeof((_c))
+int main(int argc, char **argv)
+{
+ char buf[4096];
+ ssize_t bytes = 0;
+ if (install_emulator())
+ return 1;
+ if (install_filter())
+ return 1;
+ syscall(__NR_write, STDOUT_FILENO,
+ payload("OHAI! WHAT IS YOUR NAME? "));
+ bytes = syscall(__NR_read, STDIN_FILENO, buf, sizeof(buf));
+ syscall(__NR_write, STDOUT_FILENO, payload("HELLO, "));
+ syscall(__NR_write, STDOUT_FILENO, buf, bytes);
+ syscall(__NR_write, STDERR_FILENO,
+ payload("Error message going to STDERR\n"));
+ return 0;
+}
diff --git a/samples/seccomp/bpf-fancy.c b/samples/seccomp/bpf-fancy.c
new file mode 100644
index 0000000..bf1f6b5
--- /dev/null
+++ b/samples/seccomp/bpf-fancy.c
@@ -0,0 +1,102 @@
+/*
+ * Seccomp BPF example using a macro-based generator.
+ *
+ * Copyright (c) 2012 The Chromium OS Authors <[email protected]>
+ * Author: Will Drewry <[email protected]>
+ *
+ * The code may be used by anyone for any purpose,
+ * and can serve as a starting point for developing
+ * applications using prctl(PR_ATTACH_SECCOMP_FILTER).
+ */
+
+#include <linux/filter.h>
+#include <linux/seccomp.h>
+#include <linux/unistd.h>
+#include <stdio.h>
+#include <string.h>
+#include <sys/prctl.h>
+#include <unistd.h>
+
+#include "bpf-helper.h"
+
+#ifndef PR_SET_NO_NEW_PRIVS
+#define PR_SET_NO_NEW_PRIVS 36
+#endif
+
+int main(int argc, char **argv)
+{
+ struct bpf_labels l;
+ static const char msg1[] = "Please type something: ";
+ static const char msg2[] = "You typed: ";
+ char buf[256];
+ struct sock_filter filter[] = {
+ /* TODO: LOAD_SYSCALL_NR(arch) and enforce an arch */
+ LOAD_SYSCALL_NR,
+ SYSCALL(__NR_exit, ALLOW),
+ SYSCALL(__NR_exit_group, ALLOW),
+ SYSCALL(__NR_write, JUMP(&l, write_fd)),
+ SYSCALL(__NR_read, JUMP(&l, read)),
+ DENY, /* Don't passthrough into a label */
+
+ LABEL(&l, read),
+ ARG(0),
+ JNE(STDIN_FILENO, DENY),
+ ARG(1),
+ JNE((unsigned long)buf, DENY),
+ ARG(2),
+ JGE(sizeof(buf), DENY),
+ ALLOW,
+
+ LABEL(&l, write_fd),
+ ARG(0),
+ JEQ(STDOUT_FILENO, JUMP(&l, write_buf)),
+ JEQ(STDERR_FILENO, JUMP(&l, write_buf)),
+ DENY,
+
+ LABEL(&l, write_buf),
+ ARG(1),
+ JEQ((unsigned long)msg1, JUMP(&l, msg1_len)),
+ JEQ((unsigned long)msg2, JUMP(&l, msg2_len)),
+ JEQ((unsigned long)buf, JUMP(&l, buf_len)),
+ DENY,
+
+ LABEL(&l, msg1_len),
+ ARG(2),
+ JLT(sizeof(msg1), ALLOW),
+ DENY,
+
+ LABEL(&l, msg2_len),
+ ARG(2),
+ JLT(sizeof(msg2), ALLOW),
+ DENY,
+
+ LABEL(&l, buf_len),
+ ARG(2),
+ JLT(sizeof(buf), ALLOW),
+ DENY,
+ };
+ struct sock_fprog prog = {
+ .filter = filter,
+ .len = (unsigned short)(sizeof(filter)/sizeof(filter[0])),
+ };
+ ssize_t bytes;
+ bpf_resolve_jumps(&l, filter, sizeof(filter)/sizeof(*filter));
+
+ if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0)) {
+ perror("prctl(NO_NEW_PRIVS)");
+ return 1;
+ }
+
+ if (prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog)) {
+ perror("prctl(SECCOMP)");
+ return 1;
+ }
+ syscall(__NR_write, STDOUT_FILENO, msg1, strlen(msg1));
+ bytes = syscall(__NR_read, STDIN_FILENO, buf, sizeof(buf)-1);
+ bytes = (bytes > 0 ? bytes : 0);
+ syscall(__NR_write, STDERR_FILENO, msg2, strlen(msg2));
+ syscall(__NR_write, STDERR_FILENO, buf, bytes);
+ /* Now get killed */
+ syscall(__NR_write, STDERR_FILENO, msg2, strlen(msg2)+2);
+ return 0;
+}
diff --git a/samples/seccomp/bpf-helper.c b/samples/seccomp/bpf-helper.c
new file mode 100644
index 0000000..579cfe3
--- /dev/null
+++ b/samples/seccomp/bpf-helper.c
@@ -0,0 +1,89 @@
+/*
+ * Seccomp BPF helper functions
+ *
+ * Copyright (c) 2012 The Chromium OS Authors <[email protected]>
+ * Author: Will Drewry <[email protected]>
+ *
+ * The code may be used by anyone for any purpose,
+ * and can serve as a starting point for developing
+ * applications using prctl(PR_ATTACH_SECCOMP_FILTER).
+ */
+
+#include <stdio.h>
+#include <string.h>
+
+#include "bpf-helper.h"
+
+int bpf_resolve_jumps(struct bpf_labels *labels,
+ struct sock_filter *filter, size_t count)
+{
+ struct sock_filter *begin = filter;
+ int insn = count - 1; /* __u8 would truncate for >255 instructions */
+
+ if (count < 1)
+ return -1;
+ /*
+ * Walk it once, backwards, to build the label table and do fixups.
+ * Since backward jumps are disallowed by BPF, this is easy.
+ */
+ filter += insn;
+ for (; filter >= begin; --insn, --filter) {
+ if (filter->code != (BPF_JMP+BPF_JA))
+ continue;
+ switch ((filter->jt<<8)|filter->jf) {
+ case (JUMP_JT<<8)|JUMP_JF:
+ if (labels->labels[filter->k].location == 0xffffffff) {
+ fprintf(stderr, "Unresolved label: '%s'\n",
+ labels->labels[filter->k].label);
+ return 1;
+ }
+ filter->k = labels->labels[filter->k].location -
+ (insn + 1);
+ filter->jt = 0;
+ filter->jf = 0;
+ continue;
+ case (LABEL_JT<<8)|LABEL_JF:
+ if (labels->labels[filter->k].location != 0xffffffff) {
+ fprintf(stderr, "Duplicate label use: '%s'\n",
+ labels->labels[filter->k].label);
+ return 1;
+ }
+ labels->labels[filter->k].location = insn;
+ filter->k = 0; /* fall through */
+ filter->jt = 0;
+ filter->jf = 0;
+ continue;
+ }
+ }
+ return 0;
+}
+
+/* Simple lookup table for labels. */
+__u32 seccomp_bpf_label(struct bpf_labels *labels, const char *label)
+{
+ struct __bpf_label *begin = labels->labels, *end;
+ int id;
+ if (labels->count == 0) {
+ begin->label = label;
+ begin->location = 0xffffffff;
+ labels->count++;
+ return 0;
+ }
+ end = begin + labels->count;
+ for (id = 0; begin < end; ++begin, ++id) {
+ if (!strcmp(label, begin->label))
+ return id;
+ }
+ begin->label = label;
+ begin->location = 0xffffffff;
+ labels->count++;
+ return id;
+}
+
+void seccomp_bpf_print(struct sock_filter *filter, size_t count)
+{
+ struct sock_filter *end = filter + count;
+ for ( ; filter < end; ++filter)
+ printf("{ code=%u,jt=%u,jf=%u,k=%u },\n",
+ filter->code, filter->jt, filter->jf, filter->k);
+}
diff --git a/samples/seccomp/bpf-helper.h b/samples/seccomp/bpf-helper.h
new file mode 100644
index 0000000..273fcd7
--- /dev/null
+++ b/samples/seccomp/bpf-helper.h
@@ -0,0 +1,236 @@
+/*
+ * Example wrapper around BPF macros.
+ *
+ * Copyright (c) 2012 The Chromium OS Authors <[email protected]>
+ * Author: Will Drewry <[email protected]>
+ *
+ * The code may be used by anyone for any purpose,
+ * and can serve as a starting point for developing
+ * applications using prctl(PR_SET_SECCOMP, 2, ...).
+ *
+ * No guarantees are provided with respect to the correctness
+ * or functionality of this code.
+ */
+#ifndef __BPF_HELPER_H__
+#define __BPF_HELPER_H__
+
+#include <asm/bitsperlong.h> /* for __BITS_PER_LONG */
+#include <linux/filter.h>
+#include <linux/seccomp.h> /* for seccomp_data */
+#include <linux/types.h>
+#include <linux/unistd.h>
+#include <stddef.h>
+
+#define BPF_LABELS_MAX 256
+struct bpf_labels {
+ int count;
+ struct __bpf_label {
+ const char *label;
+ __u32 location;
+ } labels[BPF_LABELS_MAX];
+};
+
+int bpf_resolve_jumps(struct bpf_labels *labels,
+ struct sock_filter *filter, size_t count);
+__u32 seccomp_bpf_label(struct bpf_labels *labels, const char *label);
+void seccomp_bpf_print(struct sock_filter *filter, size_t count);
+
+#define JUMP_JT 0xff
+#define JUMP_JF 0xff
+#define LABEL_JT 0xfe
+#define LABEL_JF 0xfe
+
+#define ALLOW \
+ BPF_STMT(BPF_RET+BPF_K, SECCOMP_RET_ALLOW)
+#define DENY \
+ BPF_STMT(BPF_RET+BPF_K, SECCOMP_RET_KILL)
+#define JUMP(labels, label) \
+ BPF_JUMP(BPF_JMP+BPF_JA, FIND_LABEL((labels), (label)), \
+ JUMP_JT, JUMP_JF)
+#define LABEL(labels, label) \
+ BPF_JUMP(BPF_JMP+BPF_JA, FIND_LABEL((labels), (label)), \
+ LABEL_JT, LABEL_JF)
+#define SYSCALL(nr, jt) \
+ BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, (nr), 0, 1), \
+ jt
+
+/* Lame, but just an example */
+#define FIND_LABEL(labels, label) seccomp_bpf_label((labels), #label)
+
+#define EXPAND(...) __VA_ARGS__
+/* Map all width-sensitive operations */
+#if __BITS_PER_LONG == 32
+
+#define JEQ(x, jt) JEQ32(x, EXPAND(jt))
+#define JNE(x, jt) JNE32(x, EXPAND(jt))
+#define JGT(x, jt) JGT32(x, EXPAND(jt))
+#define JLT(x, jt) JLT32(x, EXPAND(jt))
+#define JGE(x, jt) JGE32(x, EXPAND(jt))
+#define JLE(x, jt) JLE32(x, EXPAND(jt))
+#define JA(x, jt) JA32(x, EXPAND(jt))
+#define ARG(i) ARG_32(i)
+
+#elif __BITS_PER_LONG == 64
+
+/* Ensure that we load the logically correct offset. */
+#if defined(__LITTLE_ENDIAN)
+#define LO_ARG(idx) offsetof(struct seccomp_data, args[(idx)])
+#define HI_ARG(idx) offsetof(struct seccomp_data, args[(idx)]) + sizeof(__u32)
+#define ENDIAN(_lo, _hi) _lo, _hi
+#elif defined(__BIG_ENDIAN)
+#define ENDIAN(_lo, _hi) _hi, _lo
+#define LO_ARG(idx) offsetof(struct seccomp_data, args[(idx)]) + sizeof(__u32)
+#define HI_ARG(idx) offsetof(struct seccomp_data, args[(idx)])
+#else
+#error "Unknown endianness"
+#endif
+
+union arg64 {
+ struct {
+ __u32 ENDIAN(lo32, hi32);
+ };
+ __u64 u64;
+};
+
+#define JEQ(x, jt) \
+ JEQ64(((union arg64){.u64 = (x)}).lo32, \
+ ((union arg64){.u64 = (x)}).hi32, \
+ EXPAND(jt))
+#define JGT(x, jt) \
+ JGT64(((union arg64){.u64 = (x)}).lo32, \
+ ((union arg64){.u64 = (x)}).hi32, \
+ EXPAND(jt))
+#define JGE(x, jt) \
+ JGE64(((union arg64){.u64 = (x)}).lo32, \
+ ((union arg64){.u64 = (x)}).hi32, \
+ EXPAND(jt))
+#define JNE(x, jt) \
+ JNE64(((union arg64){.u64 = (x)}).lo32, \
+ ((union arg64){.u64 = (x)}).hi32, \
+ EXPAND(jt))
+#define JLT(x, jt) \
+ JLT64(((union arg64){.u64 = (x)}).lo32, \
+ ((union arg64){.u64 = (x)}).hi32, \
+ EXPAND(jt))
+#define JLE(x, jt) \
+ JLE64(((union arg64){.u64 = (x)}).lo32, \
+ ((union arg64){.u64 = (x)}).hi32, \
+ EXPAND(jt))
+
+#define JA(x, jt) \
+ JA64(((union arg64){.u64 = (x)}).lo32, \
+ ((union arg64){.u64 = (x)}).hi32, \
+ EXPAND(jt))
+#define ARG(i) ARG_64(i)
+
+#else
+#error __BITS_PER_LONG value unusable.
+#endif
+
+/* Loads the arg into A */
+#define ARG_32(idx) \
+ BPF_STMT(BPF_LD+BPF_W+BPF_ABS, LO_ARG(idx))
+
+/* Loads hi into A and lo in X */
+#define ARG_64(idx) \
+ BPF_STMT(BPF_LD+BPF_W+BPF_ABS, LO_ARG(idx)), \
+ BPF_STMT(BPF_ST, 0), /* lo -> M[0] */ \
+ BPF_STMT(BPF_LD+BPF_W+BPF_ABS, HI_ARG(idx)), \
+ BPF_STMT(BPF_ST, 1) /* hi -> M[1] */
+
+#define JEQ32(value, jt) \
+ BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, (value), 0, 1), \
+ jt
+
+#define JNE32(value, jt) \
+ BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, (value), 1, 0), \
+ jt
+
+/* Checks the lo, then swaps to check the hi. A=lo,X=hi */
+#define JEQ64(lo, hi, jt) \
+ BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, (hi), 0, 5), \
+ BPF_STMT(BPF_LD+BPF_MEM, 0), /* swap in lo */ \
+ BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, (lo), 0, 2), \
+ BPF_STMT(BPF_LD+BPF_MEM, 1), /* passed: swap hi back in */ \
+ jt, \
+ BPF_STMT(BPF_LD+BPF_MEM, 1) /* failed: swap hi back in */
+
+#define JNE64(lo, hi, jt) \
+ BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, (hi), 5, 0), \
+ BPF_STMT(BPF_LD+BPF_MEM, 0), /* swap in lo */ \
+ BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, (lo), 2, 0), \
+ BPF_STMT(BPF_LD+BPF_MEM, 1), /* passed: swap hi back in */ \
+ jt, \
+ BPF_STMT(BPF_LD+BPF_MEM, 1) /* failed: swap hi back in */
+
+#define JA32(value, jt) \
+ BPF_JUMP(BPF_JMP+BPF_JSET+BPF_K, (value), 0, 1), \
+ jt
+
+#define JA64(lo, hi, jt) \
+ BPF_JUMP(BPF_JMP+BPF_JSET+BPF_K, (hi), 3, 0), \
+ BPF_STMT(BPF_LD+BPF_MEM, 0), /* swap in lo */ \
+ BPF_JUMP(BPF_JMP+BPF_JSET+BPF_K, (lo), 0, 2), \
+ BPF_STMT(BPF_LD+BPF_MEM, 1), /* passed: swap hi back in */ \
+ jt, \
+ BPF_STMT(BPF_LD+BPF_MEM, 1) /* failed: swap hi back in */
+
+#define JGE32(value, jt) \
+ BPF_JUMP(BPF_JMP+BPF_JGE+BPF_K, (value), 0, 1), \
+ jt
+
+#define JLT32(value, jt) \
+ BPF_JUMP(BPF_JMP+BPF_JGE+BPF_K, (value), 1, 0), \
+ jt
+
+/* Shortcut checking if hi > arg.hi. */
+#define JGE64(lo, hi, jt) \
+ BPF_JUMP(BPF_JMP+BPF_JGT+BPF_K, (hi), 4, 0), \
+ BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, (hi), 0, 5), \
+ BPF_STMT(BPF_LD+BPF_MEM, 0), /* swap in lo */ \
+ BPF_JUMP(BPF_JMP+BPF_JGE+BPF_K, (lo), 0, 2), \
+ BPF_STMT(BPF_LD+BPF_MEM, 1), /* passed: swap hi back in */ \
+ jt, \
+ BPF_STMT(BPF_LD+BPF_MEM, 1) /* failed: swap hi back in */
+
+#define JLT64(lo, hi, jt) \
+ BPF_JUMP(BPF_JMP+BPF_JGE+BPF_K, (hi), 0, 4), \
+ BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, (hi), 0, 5), \
+ BPF_STMT(BPF_LD+BPF_MEM, 0), /* swap in lo */ \
+ BPF_JUMP(BPF_JMP+BPF_JGT+BPF_K, (lo), 2, 0), \
+ BPF_STMT(BPF_LD+BPF_MEM, 1), /* passed: swap hi back in */ \
+ jt, \
+ BPF_STMT(BPF_LD+BPF_MEM, 1) /* failed: swap hi back in */
+
+#define JGT32(value, jt) \
+ BPF_JUMP(BPF_JMP+BPF_JGT+BPF_K, (value), 0, 1), \
+ jt
+
+#define JLE32(value, jt) \
+ BPF_JUMP(BPF_JMP+BPF_JGT+BPF_K, (value), 1, 0), \
+ jt
+
+/* Check hi > args.hi first, then do the GE checking */
+#define JGT64(lo, hi, jt) \
+ BPF_JUMP(BPF_JMP+BPF_JGT+BPF_K, (hi), 4, 0), \
+ BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, (hi), 0, 5), \
+ BPF_STMT(BPF_LD+BPF_MEM, 0), /* swap in lo */ \
+ BPF_JUMP(BPF_JMP+BPF_JGT+BPF_K, (lo), 0, 2), \
+ BPF_STMT(BPF_LD+BPF_MEM, 1), /* passed: swap hi back in */ \
+ jt, \
+ BPF_STMT(BPF_LD+BPF_MEM, 1) /* failed: swap hi back in */
+
+#define JLE64(lo, hi, jt) \
+ BPF_JUMP(BPF_JMP+BPF_JGT+BPF_K, (hi), 6, 0), \
+ BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, (hi), 0, 3), \
+ BPF_STMT(BPF_LD+BPF_MEM, 0), /* swap in lo */ \
+ BPF_JUMP(BPF_JMP+BPF_JGT+BPF_K, (lo), 2, 0), \
+ BPF_STMT(BPF_LD+BPF_MEM, 1), /* passed: swap hi back in */ \
+ jt, \
+ BPF_STMT(BPF_LD+BPF_MEM, 1) /* failed: swap hi back in */
+
+#define LOAD_SYSCALL_NR \
+ BPF_STMT(BPF_LD+BPF_W+BPF_ABS, \
+ offsetof(struct seccomp_data, nr))
+
+#endif /* __BPF_HELPER_H__ */
diff --git a/samples/seccomp/dropper.c b/samples/seccomp/dropper.c
new file mode 100644
index 0000000..74e035d
--- /dev/null
+++ b/samples/seccomp/dropper.c
@@ -0,0 +1,68 @@
+/*
+ * Naive system call dropper built on seccomp_filter.
+ *
+ * Copyright (c) 2012 The Chromium OS Authors <[email protected]>
+ * Author: Will Drewry <[email protected]>
+ *
+ * The code may be used by anyone for any purpose,
+ * and can serve as a starting point for developing
+ * applications using prctl(PR_SET_SECCOMP, 2, ...).
+ *
+ * When run, returns the specified errno for the specified
+ * system call number against the given architecture.
+ *
+ * Run this as root, since PR_SET_NO_NEW_PRIVS is not called.
+ */
+
+#include <errno.h>
+#include <linux/audit.h>
+#include <linux/filter.h>
+#include <linux/seccomp.h>
+#include <linux/unistd.h>
+#include <stdio.h>
+#include <stddef.h>
+#include <stdlib.h>
+#include <sys/prctl.h>
+#include <unistd.h>
+
+static int install_filter(int nr, int arch, int error)
+{
+ struct sock_filter filter[] = {
+ BPF_STMT(BPF_LD+BPF_W+BPF_ABS,
+ (offsetof(struct seccomp_data, arch))),
+ BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, arch, 0, 3),
+ BPF_STMT(BPF_LD+BPF_W+BPF_ABS,
+ (offsetof(struct seccomp_data, nr))),
+ BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, nr, 0, 1),
+ BPF_STMT(BPF_RET+BPF_K,
+ SECCOMP_RET_ERRNO|(error & SECCOMP_RET_DATA)),
+ BPF_STMT(BPF_RET+BPF_K, SECCOMP_RET_ALLOW),
+ };
+ struct sock_fprog prog = {
+ .len = (unsigned short)(sizeof(filter)/sizeof(filter[0])),
+ .filter = filter,
+ };
+ if (prctl(PR_SET_SECCOMP, 2, &prog)) {
+ perror("prctl");
+ return 1;
+ }
+ return 0;
+}
+
+int main(int argc, char **argv)
+{
+ if (argc < 5) {
+ fprintf(stderr, "Usage:\n"
+ "dropper <syscall_nr> <arch> <errno> <prog> [<args>]\n"
+ "Hint: AUDIT_ARCH_I386: %x\n"
+ " AUDIT_ARCH_X86_64: %x\n"
+ "\n", AUDIT_ARCH_I386, AUDIT_ARCH_X86_64);
+ return 1;
+ }
+ if (install_filter(strtol(argv[1], NULL, 0), strtol(argv[2], NULL, 0),
+ strtol(argv[3], NULL, 0)))
+ return 1;
+ execv(argv[4], &argv[4]);
+ printf("Failed to execv\n");
+ return 255;
+}
--
1.7.5.4
Adds a new return value to seccomp filters that triggers a SIGSYS to be
delivered with the new SYS_SECCOMP si_code.
This allows in-process system call emulation, including simply specifying
an errno or cleanly dumping core, rather than just dying.
v11: - clarify the comment ([email protected])
- s/sigtrap/sigsys
v10: - use SIGSYS, syscall_get_arch, updates arch/Kconfig
note suggested-by (though original suggestion had other behaviors)
v9: - changes to SIGILL
v8: - clean up based on changes to dependent patches
v7: - introduction
Suggested-by: Markus Gutschke <[email protected]>
Suggested-by: Julien Tinnes <[email protected]>
Signed-off-by: Will Drewry <[email protected]>
---
arch/Kconfig | 14 +++++++++-----
include/asm-generic/siginfo.h | 2 +-
include/linux/seccomp.h | 1 +
kernel/seccomp.c | 28 ++++++++++++++++++++++++++++
4 files changed, 39 insertions(+), 6 deletions(-)
diff --git a/arch/Kconfig b/arch/Kconfig
index dd4e067..d92a78e 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -202,11 +202,15 @@ config HAVE_CMPXCHG_DOUBLE
config HAVE_ARCH_SECCOMP_FILTER
bool
help
- This symbol should be selected by an architecure if it provides
- asm/syscall.h, specifically syscall_get_arguments(),
- syscall_get_arch(), and syscall_set_return_value(). Additionally,
- its system call entry path must respect a return value of -1 from
- __secure_computing_int() and/or secure_computing().
+ This symbol should be selected by an architecture if it provides:
+ asm/syscall.h:
+ - syscall_get_arch()
+ - syscall_get_arguments()
+ - syscall_rollback()
+ - syscall_set_return_value()
+ SIGSYS siginfo_t support must be implemented.
+ __secure_computing_int()/secure_computing()'s return value must be
+ checked, with -1 resulting in the syscall being skipped.
config SECCOMP_FILTER
def_bool y
diff --git a/include/asm-generic/siginfo.h b/include/asm-generic/siginfo.h
index 31306f5..af5d035 100644
--- a/include/asm-generic/siginfo.h
+++ b/include/asm-generic/siginfo.h
@@ -93,7 +93,7 @@ typedef struct siginfo {
/* SIGSYS */
struct {
- void __user *_call_addr; /* calling insn */
+ void __user *_call_addr; /* calling user insn */
int _syscall; /* triggering system call number */
unsigned int _arch; /* AUDIT_ARCH_* of syscall */
} _sigsys;
diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h
index 93c2d98..b44d038 100644
--- a/include/linux/seccomp.h
+++ b/include/linux/seccomp.h
@@ -19,6 +19,7 @@
* selects the least permissive choice.
*/
#define SECCOMP_RET_KILL 0x00000000U /* kill the task immediately */
+#define SECCOMP_RET_TRAP 0x00020000U /* disallow and force a SIGSYS */
#define SECCOMP_RET_ERRNO 0x00030000U /* returns an errno */
#define SECCOMP_RET_ALLOW 0x7fff0000U /* allow */
diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index 3b3a16e..d2e173e 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -272,6 +272,26 @@ void copy_seccomp(struct seccomp *child, const struct seccomp *parent)
/* Other fields are handled by dup_task_struct. */
child->filter = get_seccomp_filter(parent->filter);
}
+
+/**
+ * seccomp_send_sigsys - signals the task to allow in-process syscall emulation
+ * @syscall: syscall number to send to userland
+ * @reason: filter-supplied reason code to send to userland (via si_errno)
+ *
+ * Forces a SIGSYS with a code of SYS_SECCOMP and related sigsys info.
+ */
+static void seccomp_send_sigsys(int syscall, int reason)
+{
+ struct siginfo info;
+ memset(&info, 0, sizeof(info));
+ info.si_signo = SIGSYS;
+ info.si_code = SYS_SECCOMP;
+ info.si_call_addr = (void __user *)KSTK_EIP(current);
+ info.si_errno = reason;
+ info.si_arch = syscall_get_arch(current, task_pt_regs(current));
+ info.si_syscall = syscall;
+ force_sig_info(SIGSYS, &info, current);
+}
#endif /* CONFIG_SECCOMP_FILTER */
/*
@@ -326,6 +346,14 @@ int __secure_computing_int(int this_syscall)
-(action & SECCOMP_RET_DATA),
0);
return -1;
+ case SECCOMP_RET_TRAP: {
+ int reason_code = action & SECCOMP_RET_DATA;
+ /* Show the handler the original registers. */
+ syscall_rollback(current, task_pt_regs(current));
+ /* Let the filter pass back 16 bits of data. */
+ seccomp_send_sigsys(this_syscall, reason_code);
+ return -1;
+ }
case SECCOMP_RET_ALLOW:
return 0;
case SECCOMP_RET_KILL:
--
1.7.5.4
Add syscall_get_arch() to export the current AUDIT_ARCH_* based on the
system call entry path.
Signed-off-by: Will Drewry <[email protected]>
---
arch/x86/include/asm/syscall.h | 23 +++++++++++++++++++++++
1 files changed, 23 insertions(+), 0 deletions(-)
diff --git a/arch/x86/include/asm/syscall.h b/arch/x86/include/asm/syscall.h
index d962e56..1d713e4 100644
--- a/arch/x86/include/asm/syscall.h
+++ b/arch/x86/include/asm/syscall.h
@@ -13,6 +13,7 @@
#ifndef _ASM_X86_SYSCALL_H
#define _ASM_X86_SYSCALL_H
+#include <linux/audit.h>
#include <linux/sched.h>
#include <linux/err.h>
#include <asm/asm-offsets.h> /* For NR_syscalls */
@@ -87,6 +88,12 @@ static inline void syscall_set_arguments(struct task_struct *task,
memcpy(&regs->bx + i, args, n * sizeof(args[0]));
}
+static inline int syscall_get_arch(struct task_struct *task,
+ struct pt_regs *regs)
+{
+ return AUDIT_ARCH_I386;
+}
+
#else /* CONFIG_X86_64 */
static inline void syscall_get_arguments(struct task_struct *task,
@@ -211,6 +218,22 @@ static inline void syscall_set_arguments(struct task_struct *task,
}
}
+static inline int syscall_get_arch(struct task_struct *task,
+ struct pt_regs *regs)
+{
+#ifdef CONFIG_IA32_EMULATION
+ /*
+ * TS_COMPAT is set for 32-bit syscall entries and then
+ * remains set until we return to user mode.
+ *
+ * TIF_IA32 tasks should always have TS_COMPAT set at
+ * system call time.
+ */
+ if (task_thread_info(task)->status & TS_COMPAT)
+ return AUDIT_ARCH_I386;
+#endif
+ return AUDIT_ARCH_X86_64;
+}
#endif /* CONFIG_X86_32 */
#endif /* _ASM_X86_SYSCALL_H */
--
1.7.5.4
Adds a stub for a function that will return the AUDIT_ARCH_*
value appropriate to the supplied task based on the system
call convention.
For audit's use, the value can generally be hard-coded at the
audit-site. However, for other functionality not inlined into
syscall entry/exit, this makes that information available.
seccomp_filter is the first planned consumer and, as such,
the comment indicates a tie to HAVE_ARCH_SECCOMP_FILTER. That
is probably an unneeded detail.
Suggested-by: Roland McGrath <[email protected]>
Signed-off-by: Will Drewry <[email protected]>
v11: fixed improper return type ([email protected])
v10: introduced
---
include/asm-generic/syscall.h | 14 ++++++++++++++
1 files changed, 14 insertions(+), 0 deletions(-)
diff --git a/include/asm-generic/syscall.h b/include/asm-generic/syscall.h
index 5c122ae..a2c13dc 100644
--- a/include/asm-generic/syscall.h
+++ b/include/asm-generic/syscall.h
@@ -142,4 +142,18 @@ void syscall_set_arguments(struct task_struct *task, struct pt_regs *regs,
unsigned int i, unsigned int n,
const unsigned long *args);
+/**
+ * syscall_get_arch - return the AUDIT_ARCH for the current system call
+ * @task: task of interest, must be in system call entry tracing
+ * @regs: task_pt_regs() of @task
+ *
+ * Returns the AUDIT_ARCH_* based on the system call convention in use.
+ *
+ * It's only valid to call this when @task is stopped on entry to a system
+ * call, due to %TIF_SYSCALL_TRACE, %TIF_SYSCALL_AUDIT, or %TIF_SECCOMP.
+ *
+ * Note, at present this function is only required with
+ * CONFIG_HAVE_ARCH_SECCOMP_FILTER.
+ */
+int syscall_get_arch(struct task_struct *task, struct pt_regs *regs);
#endif /* _ASM_SYSCALL_H */
--
1.7.5.4
Any other users of bpf_*_filter that take a struct sock_fprog from
userspace will need to be able to also accept a compat_sock_fprog
if the arch supports compat calls. This change lets the existing
compat_sock_fprog be shared.
Signed-off-by: Will Drewry <[email protected]>
v11: introduction
---
include/linux/filter.h | 11 +++++++++++
net/compat.c | 8 --------
2 files changed, 11 insertions(+), 8 deletions(-)
diff --git a/include/linux/filter.h b/include/linux/filter.h
index bdd02f9..f052f5c 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -10,6 +10,7 @@
#ifdef __KERNEL__
#include <linux/atomic.h>
+#include <linux/compat.h>
#endif
/*
@@ -135,6 +136,16 @@ struct sock_fprog { /* Required for SO_ATTACH_FILTER. */
#ifdef __KERNEL__
+#ifdef CONFIG_COMPAT
+/*
+ * A struct sock_filter is architecture independent.
+ */
+struct compat_sock_fprog {
+ u16 len;
+ compat_uptr_t filter; /* struct sock_filter * */
+};
+#endif
+
struct sk_buff;
struct sock;
diff --git a/net/compat.c b/net/compat.c
index 6def90e..c5c61c8 100644
--- a/net/compat.c
+++ b/net/compat.c
@@ -326,14 +326,6 @@ void scm_detach_fds_compat(struct msghdr *kmsg, struct scm_cookie *scm)
__scm_destroy(scm);
}
-/*
- * A struct sock_filter is architecture independent.
- */
-struct compat_sock_fprog {
- u16 len;
- compat_uptr_t filter; /* struct sock_filter * */
-};
-
static int do_set_attach_filter(struct socket *sock, int level, int optname,
char __user *optval, unsigned int optlen)
{
--
1.7.5.4
This change enables SIGSYS, defines _sigfields._sigsys, and adds
x86 (compat) arch support. _sigsys defines fields which allow
a signal handler to receive the triggering system call number,
the relevant AUDIT_ARCH_* value for that number, and the address
of the callsite.
To ensure that SIGSYS delivery occurs on return from the triggering
system call, SIGSYS is added to the SYNCHRONOUS_MASK macro.
The first consumer of SIGSYS would be seccomp filter. In particular,
a filter program could specify a new return value, SECCOMP_RET_TRAP,
which would result in the system call being denied and the calling
thread signaled. This also means that implementing arch-specific
support can be dependent upon HAVE_ARCH_SECCOMP_FILTER.
v11: - fix dropped words in the change description
- added fallback copy_siginfo support.
- added __ARCH_SIGSYS define to allow stepped arch support.
v10: - first version based on suggestion
Suggested-by: H. Peter Anvin <[email protected]>
Signed-off-by: Will Drewry <[email protected]>
---
arch/x86/ia32/ia32_signal.c | 4 ++++
arch/x86/include/asm/ia32.h | 6 ++++++
include/asm-generic/siginfo.h | 22 ++++++++++++++++++++++
kernel/signal.c | 9 ++++++++-
4 files changed, 40 insertions(+), 1 deletions(-)
diff --git a/arch/x86/ia32/ia32_signal.c b/arch/x86/ia32/ia32_signal.c
index 6557769..c81d2c7 100644
--- a/arch/x86/ia32/ia32_signal.c
+++ b/arch/x86/ia32/ia32_signal.c
@@ -73,6 +73,10 @@ int copy_siginfo_to_user32(compat_siginfo_t __user *to, siginfo_t *from)
switch (from->si_code >> 16) {
case __SI_FAULT >> 16:
break;
+ case __SI_SYS >> 16:
+ put_user_ex(from->si_syscall, &to->si_syscall);
+ put_user_ex(from->si_arch, &to->si_arch);
+ break;
case __SI_CHLD >> 16:
put_user_ex(from->si_utime, &to->si_utime);
put_user_ex(from->si_stime, &to->si_stime);
diff --git a/arch/x86/include/asm/ia32.h b/arch/x86/include/asm/ia32.h
index 1f7e625..541485f 100644
--- a/arch/x86/include/asm/ia32.h
+++ b/arch/x86/include/asm/ia32.h
@@ -126,6 +126,12 @@ typedef struct compat_siginfo {
int _band; /* POLL_IN, POLL_OUT, POLL_MSG */
int _fd;
} _sigpoll;
+
+ struct {
+ unsigned int _call_addr; /* calling insn */
+ int _syscall; /* triggering system call number */
+ unsigned int _arch; /* AUDIT_ARCH_* of syscall */
+ } _sigsys;
} _sifields;
} compat_siginfo_t;
diff --git a/include/asm-generic/siginfo.h b/include/asm-generic/siginfo.h
index 0dd4e87..31306f5 100644
--- a/include/asm-generic/siginfo.h
+++ b/include/asm-generic/siginfo.h
@@ -90,9 +90,18 @@ typedef struct siginfo {
__ARCH_SI_BAND_T _band; /* POLL_IN, POLL_OUT, POLL_MSG */
int _fd;
} _sigpoll;
+
+ /* SIGSYS */
+ struct {
+ void __user *_call_addr; /* calling insn */
+ int _syscall; /* triggering system call number */
+ unsigned int _arch; /* AUDIT_ARCH_* of syscall */
+ } _sigsys;
} _sifields;
} siginfo_t;
+/* If the arch shares siginfo, then it has SIGSYS. */
+#define __ARCH_SIGSYS
#endif
/*
@@ -116,6 +125,11 @@ typedef struct siginfo {
#define si_addr_lsb _sifields._sigfault._addr_lsb
#define si_band _sifields._sigpoll._band
#define si_fd _sifields._sigpoll._fd
+#ifdef __ARCH_SIGSYS
+#define si_call_addr _sifields._sigsys._call_addr
+#define si_syscall _sifields._sigsys._syscall
+#define si_arch _sifields._sigsys._arch
+#endif
#ifdef __KERNEL__
#define __SI_MASK 0xffff0000u
@@ -126,6 +140,7 @@ typedef struct siginfo {
#define __SI_CHLD (4 << 16)
#define __SI_RT (5 << 16)
#define __SI_MESGQ (6 << 16)
+#define __SI_SYS (7 << 16)
#define __SI_CODE(T,N) ((T) | ((N) & 0xffff))
#else
#define __SI_KILL 0
@@ -135,6 +150,7 @@ typedef struct siginfo {
#define __SI_CHLD 0
#define __SI_RT 0
#define __SI_MESGQ 0
+#define __SI_SYS 0
#define __SI_CODE(T,N) (N)
#endif
@@ -232,6 +248,12 @@ typedef struct siginfo {
#define NSIGPOLL 6
/*
+ * SIGSYS si_codes
+ */
+#define SYS_SECCOMP (__SI_SYS|1) /* seccomp triggered */
+#define NSIGSYS 1
+
+/*
* sigevent definitions
*
* It seems likely that SIGEV_THREAD will have to be handled from
diff --git a/kernel/signal.c b/kernel/signal.c
index c73c428..15a9e9b 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -160,7 +160,7 @@ void recalc_sigpending(void)
#define SYNCHRONOUS_MASK \
(sigmask(SIGSEGV) | sigmask(SIGBUS) | sigmask(SIGILL) | \
- sigmask(SIGTRAP) | sigmask(SIGFPE))
+ sigmask(SIGTRAP) | sigmask(SIGFPE) | sigmask(SIGSYS))
int next_signal(struct sigpending *pending, sigset_t *mask)
{
@@ -2686,6 +2686,13 @@ int copy_siginfo_to_user(siginfo_t __user *to, siginfo_t *from)
err |= __put_user(from->si_uid, &to->si_uid);
err |= __put_user(from->si_ptr, &to->si_ptr);
break;
+#ifdef __ARCH_SIGSYS
+ case __SI_SYS:
+ err |= __put_user(from->si_call_addr, &to->si_call_addr);
+ err |= __put_user(from->si_syscall, &to->si_syscall);
+ err |= __put_user(from->si_arch, &to->si_arch);
+ break;
+#endif
default: /* this is just in case for now ... */
err |= __put_user(from->si_pid, &to->si_pid);
err |= __put_user(from->si_uid, &to->si_uid);
--
1.7.5.4
Enable support for seccomp filter on x86:
- asm/tracehook.h exists
- syscall_get_arguments() works
- syscall_rollback() works
- ptrace_report_syscall() works
- secure_computing() return value is honored (see below)
This also adds support for honoring the return
value from secure_computing().
SECCOMP_RET_TRACE and SECCOMP_RET_TRAP may result in seccomp needing to
skip a system call without killing the process. This is done by
returning a non-zero (-1) value from secure_computing. This change
makes x86 respect that return value.
To ensure that minimal kernel code is exposed, a non-zero return value
results in an immediate return to user space (with an invalid syscall
number).
Signed-off-by: Will Drewry <[email protected]>
---
arch/x86/Kconfig | 1 +
arch/x86/kernel/ptrace.c | 7 ++++++-
2 files changed, 7 insertions(+), 1 deletions(-)
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 5bed94e..4c9012b 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -82,6 +82,7 @@ config X86
select CLKEVT_I8253
select ARCH_HAVE_NMI_SAFE_CMPXCHG
select GENERIC_IOMAP
+ select HAVE_ARCH_SECCOMP_FILTER
config INSTRUCTION_DECODER
def_bool (KPROBES || PERF_EVENTS)
diff --git a/arch/x86/kernel/ptrace.c b/arch/x86/kernel/ptrace.c
index 5026738..90d465a 100644
--- a/arch/x86/kernel/ptrace.c
+++ b/arch/x86/kernel/ptrace.c
@@ -1380,7 +1380,11 @@ long syscall_trace_enter(struct pt_regs *regs)
regs->flags |= X86_EFLAGS_TF;
/* do the secure computing check first */
- secure_computing(regs->orig_ax);
+ if (secure_computing(regs->orig_ax)) {
+ /* seccomp failures shouldn't expose any additional code. */
+ ret = -1L;
+ goto out;
+ }
if (unlikely(test_thread_flag(TIF_SYSCALL_EMU)))
ret = -1L;
@@ -1405,6 +1409,7 @@ long syscall_trace_enter(struct pt_regs *regs)
regs->dx, regs->r10);
#endif
+out:
return ret ?: regs->orig_ax;
}
--
1.7.5.4
On Fri, Feb 24, 2012 at 7:21 PM, Will Drewry <[email protected]> wrote:
> This change adds the SECCOMP_RET_ERRNO as a valid return value from a
> seccomp filter.  Additionally, it makes the first use of the lower
> 16-bits for storing a filter-supplied errno.  16-bits is more than
> enough for the errno-base.h calls.
>
> Returning errors instead of immediately terminating processes that
> violate seccomp policy allow for broader use of this functionality
> for kernel attack surface reduction.  For example, a linux container
> could maintain a whitelist of pre-existing system calls but drop
> all new ones with errnos.  This would keep a logically static attack
> surface while providing errnos that may allow for graceful failure
> without the downside of do_exit() on a bad call.
>
> v11: - check for NULL filter ([email protected])
> v10: - change loaders to fn
>  v9: - n/a
>  v8: - update Kconfig to note new need for syscall_set_return_value.
>      - reordered such that TRAP behavior follows on later.
>      - made the for loop a little less indent-y
> ?v7: - introduced
>
> Signed-off-by: Will Drewry <[email protected]>
Reviewed-by: Kees Cook <[email protected]>
> +	/* Ensure unexpected behavior doesn't result in failing open. */
> +	if (unlikely(current->seccomp.filter == NULL))
> +		ret = SECCOMP_RET_KILL;
Any reason to not just immediately return in this case?
-Kees
--
Kees Cook
ChromeOS Security
On Fri, Feb 24, 2012 at 09:21:45PM -0600, Will Drewry wrote:
> diff --git a/kernel/seccomp.c b/kernel/seccomp.c
> index e8d76c5..25e8296 100644
> --- a/kernel/seccomp.c
> +++ b/kernel/seccomp.c
> [...]
> +static void seccomp_filter_log_failure(int syscall)
> +{
> + int compat = 0;
> +#ifdef CONFIG_COMPAT
> + compat = is_compat_task();
> +#endif
> + pr_info("%s[%d]: %ssystem call %d blocked at 0x%lx\n",
> + current->comm, task_pid_nr(current),
> + (compat ? "compat " : ""),
> + syscall, KSTK_EIP(current));
> +}
> [...]
> +#ifdef CONFIG_SECCOMP_FILTER
> + case SECCOMP_MODE_FILTER:
> + if (seccomp_run_filters(this_syscall) == SECCOMP_RET_ALLOW)
> + return;
> + seccomp_filter_log_failure(this_syscall);
> + exit_code = SIGSYS;
> + break;
> +#endif
> default:
> BUG();
> }
> @@ -56,7 +324,7 @@ void __secure_computing(int this_syscall)
> dump_stack();
> #endif
> audit_seccomp(this_syscall);
> - do_exit(SIGKILL);
> + do_exit(exit_code);
> }
I think the seccomp_filter_log_failure() use is redundant with the
audit_seccomp call. Here's a possible reorganization of the logging...
From: Kees Cook <[email protected]>
Date: Sun, 26 Feb 2012 11:56:12 -0800
Subject: [PATCH] seccomp: improve audit logging details
This consolidates the seccomp filter error logging path and adds more
details to the audit log.
Signed-off-by: Kees Cook <[email protected]>
---
include/linux/audit.h | 8 ++++----
kernel/auditsc.c | 9 +++++++--
kernel/seccomp.c | 15 +--------------
3 files changed, 12 insertions(+), 20 deletions(-)
diff --git a/include/linux/audit.h b/include/linux/audit.h
index 9ff7a2c..5aa6cfc 100644
--- a/include/linux/audit.h
+++ b/include/linux/audit.h
@@ -463,7 +463,7 @@ extern void audit_putname(const char *name);
extern void __audit_inode(const char *name, const struct dentry *dentry);
extern void __audit_inode_child(const struct dentry *dentry,
const struct inode *parent);
-extern void __audit_seccomp(unsigned long syscall);
+extern void __audit_seccomp(unsigned long syscall, long signr);
extern void __audit_ptrace(struct task_struct *t);
static inline int audit_dummy_context(void)
@@ -508,10 +508,10 @@ static inline void audit_inode_child(const struct dentry *dentry,
}
void audit_core_dumps(long signr);
-static inline void audit_seccomp(unsigned long syscall)
+static inline void audit_seccomp(unsigned long syscall, long signr)
{
if (unlikely(!audit_dummy_context()))
- __audit_seccomp(syscall);
+ __audit_seccomp(syscall, signr);
}
static inline void audit_ptrace(struct task_struct *t)
@@ -634,7 +634,7 @@ extern int audit_signals;
#define audit_inode(n,d) do { (void)(d); } while (0)
#define audit_inode_child(i,p) do { ; } while (0)
#define audit_core_dumps(i) do { ; } while (0)
-#define audit_seccomp(i) do { ; } while (0)
+#define audit_seccomp(i,s) do { ; } while (0)
#define auditsc_get_stamp(c,t,s) (0)
#define audit_get_loginuid(t) (-1)
#define audit_get_sessionid(t) (-1)
diff --git a/kernel/auditsc.c b/kernel/auditsc.c
index af1de0f..74652fe 100644
--- a/kernel/auditsc.c
+++ b/kernel/auditsc.c
@@ -67,6 +67,7 @@
#include <linux/syscalls.h>
#include <linux/capability.h>
#include <linux/fs_struct.h>
+#include <linux/compat.h>
#include "audit.h"
@@ -2710,13 +2711,17 @@ void audit_core_dumps(long signr)
audit_log_end(ab);
}
-void __audit_seccomp(unsigned long syscall)
+void __audit_seccomp(unsigned long syscall, long signr)
{
struct audit_buffer *ab;
ab = audit_log_start(NULL, GFP_KERNEL, AUDIT_ANOM_ABEND);
- audit_log_abend(ab, "seccomp", SIGKILL);
+ audit_log_abend(ab, "seccomp", signr);
audit_log_format(ab, " syscall=%ld", syscall);
+#ifdef CONFIG_COMPAT
+ audit_log_format(ab, " compat=%d", is_compat_task());
+#endif
+ audit_log_format(ab, " ip=0x%lx", KSTK_EIP(current));
audit_log_end(ab);
}
diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index 5aabc3c..40af83f 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -57,18 +57,6 @@ struct seccomp_filter {
struct sock_filter insns[];
};
-static void seccomp_filter_log_failure(int syscall)
-{
- int compat = 0;
-#ifdef CONFIG_COMPAT
- compat = is_compat_task();
-#endif
- pr_info("%s[%d]: %ssystem call %d blocked at 0x%lx\n",
- current->comm, task_pid_nr(current),
- (compat ? "compat " : ""),
- syscall, KSTK_EIP(current));
-}
-
/**
* get_u32 - returns a u32 offset into data
* @data: a unsigned 64 bit value
@@ -378,7 +366,6 @@ int __secure_computing_int(int this_syscall)
default:
break;
}
- seccomp_filter_log_failure(this_syscall);
exit_code = SIGSYS;
break;
}
@@ -390,7 +377,7 @@ int __secure_computing_int(int this_syscall)
#ifdef SECCOMP_DEBUG
dump_stack();
#endif
- audit_seccomp(this_syscall);
+ audit_seccomp(this_syscall, exit_code);
do_exit(exit_code);
return -1; /* never reached */
}
--
1.7.0.4
--
Kees Cook
On Sat, Feb 25, 2012 at 2:20 PM, Kees Cook <[email protected]> wrote:
> On Fri, Feb 24, 2012 at 7:21 PM, Will Drewry <[email protected]> wrote:
>> This change adds the SECCOMP_RET_ERRNO as a valid return value from a
>> seccomp filter. Additionally, it makes the first use of the lower
>> 16-bits for storing a filter-supplied errno. 16-bits is more than
>> enough for the errno-base.h calls.
>>
>> Returning errors instead of immediately terminating processes that
>> violate seccomp policy allows for broader use of this functionality
>> for kernel attack surface reduction. For example, a linux container
>> could maintain a whitelist of pre-existing system calls but drop
>> all new ones with errnos. This would keep a logically static attack
>> surface while providing errnos that may allow for graceful failure
>> without the downside of do_exit() on a bad call.
>>
>> v11: - check for NULL filter ([email protected])
>> v10: - change loaders to fn
>>  v9: - n/a
>>  v8: - update Kconfig to note new need for syscall_set_return_value.
>>      - reordered such that TRAP behavior follows on later.
>>      - made the for loop a little less indent-y
>>  v7: - introduced
>>
>> Signed-off-by: Will Drewry <[email protected]>
>
> Reviewed-by: Kees Cook <[email protected]>
Thanks!
>> +       /* Ensure unexpected behavior doesn't result in failing open. */
>> +       if (unlikely(current->seccomp.filter == NULL))
>> +               ret = SECCOMP_RET_KILL;
>
> Any reason to not just immediately return in this case?
Not that I can think of. I can just have it bail here.
On Sun, Feb 26, 2012 at 2:28 PM, Kees Cook <[email protected]> wrote:
> On Fri, Feb 24, 2012 at 09:21:45PM -0600, Will Drewry wrote:
>> diff --git a/kernel/seccomp.c b/kernel/seccomp.c
>> index e8d76c5..25e8296 100644
>> --- a/kernel/seccomp.c
>> +++ b/kernel/seccomp.c
>> [...]
>> +static void seccomp_filter_log_failure(int syscall)
>> +{
>> +     int compat = 0;
>> +#ifdef CONFIG_COMPAT
>> +     compat = is_compat_task();
>> +#endif
>> +     pr_info("%s[%d]: %ssystem call %d blocked at 0x%lx\n",
>> +             current->comm, task_pid_nr(current),
>> +             (compat ? "compat " : ""),
>> +             syscall, KSTK_EIP(current));
>> +}
>> [...]
>> +#ifdef CONFIG_SECCOMP_FILTER
>> +     case SECCOMP_MODE_FILTER:
>> +             if (seccomp_run_filters(this_syscall) == SECCOMP_RET_ALLOW)
>> +                     return;
>> +             seccomp_filter_log_failure(this_syscall);
>> +             exit_code = SIGSYS;
>> +             break;
>> +#endif
>>      default:
>>              BUG();
>>      }
>> @@ -56,7 +324,7 @@ void __secure_computing(int this_syscall)
>>      dump_stack();
>>  #endif
>>      audit_seccomp(this_syscall);
>> -    do_exit(SIGKILL);
>> +    do_exit(exit_code);
>>  }
>
> I think the seccomp_filter_log_failure() use is redundant with the
> audit_seccomp call. Here's a possible reorganization of the logging...
Cool - a comment below:
> From: Kees Cook <[email protected]>
> Date: Sun, 26 Feb 2012 11:56:12 -0800
> Subject: [PATCH] seccomp: improve audit logging details
>
> This consolidates the seccomp filter error logging path and adds more
> details to the audit log.
>
> Signed-off-by: Kees Cook <[email protected]>
> ---
>  include/linux/audit.h |    8 ++++----
>  kernel/auditsc.c      |    9 +++++++--
>  kernel/seccomp.c      |   15 +--------------
>  3 files changed, 12 insertions(+), 20 deletions(-)
>
> diff --git a/include/linux/audit.h b/include/linux/audit.h
> index 9ff7a2c..5aa6cfc 100644
> --- a/include/linux/audit.h
> +++ b/include/linux/audit.h
> [...]
> diff --git a/kernel/auditsc.c b/kernel/auditsc.c
> [...]
> -void __audit_seccomp(unsigned long syscall)
> +void __audit_seccomp(unsigned long syscall, long signr)
>  {
>        struct audit_buffer *ab;
>
>        ab = audit_log_start(NULL, GFP_KERNEL, AUDIT_ANOM_ABEND);
> -       audit_log_abend(ab, "seccomp", SIGKILL);
> +       audit_log_abend(ab, "seccomp", signr);
>        audit_log_format(ab, " syscall=%ld", syscall);
> +#ifdef CONFIG_COMPAT
> +       audit_log_format(ab, " compat=%d", is_compat_task());
> +#endif
Should this just use syscall_get_arch to get the AUDIT_ARCH now? :)
> +       audit_log_format(ab, " ip=0x%lx", KSTK_EIP(current));
>        audit_log_end(ab);
>  }
> [...]
I'll pull this into the series if that's okay with you?
Thanks!
On Mon, 2012-02-27 at 10:23 -0600, Will Drewry wrote:
> On Sun, Feb 26, 2012 at 2:28 PM, Kees Cook <[email protected]> wrote:
> > On Fri, Feb 24, 2012 at 09:21:45PM -0600, Will Drewry wrote:
> >> [...]
> > [...]
> > -void __audit_seccomp(unsigned long syscall)
> > +void __audit_seccomp(unsigned long syscall, long signr)
> > {
> >        struct audit_buffer *ab;
> >
> >        ab = audit_log_start(NULL, GFP_KERNEL, AUDIT_ANOM_ABEND);
> > -      audit_log_abend(ab, "seccomp", SIGKILL);
> > +      audit_log_abend(ab, "seccomp", signr);
> >        audit_log_format(ab, " syscall=%ld", syscall);
> > +#ifdef CONFIG_COMPAT
> > +      audit_log_format(ab, " compat=%d", is_compat_task());
> > +#endif
> Should this just use syscall_get_arch to get the AUDIT_ARCH now? :)
I'm waffling on this one, but I'm leaning towards not including compat
at all. If you include it, yes, you should use the generic function.
If you have CONFIG_AUDITSC and started audit you are going to get this,
along with a0-a4, in a separate but associated audit record. Thus you
get all the interesting/relevant info. Without CONFIG_AUDITSC and
running auditd you get compat, but nothing else. Why is compat so
interesting?
This patch would duplicate the arch= field from that record (calling it
compat). So if we are going to duplicate it in another record, we
should at least call it the same thing (arch=%x)
My current thinking, and I'm not settled would be to include syscall,
a0-a4 and arch in the record only if !CONFIG_AUDITSC. (ip doesn't show
up elsewhere, so that makes sense here)
It might be annoying to have to find the info in the right record, but
if you use the auparse audit library tools, it should 'Just Work'...
> > [...]
Hello Will.
I missed the previous discussions, and I don't think I can read
all these emails now. So I apologize in advance if this was already
discussed.
On 02/24, Will Drewry wrote:
>
> struct seccomp {
> int mode;
> + struct seccomp_filter *filter;
> };
Minor nit, it seems that the new member can be "ifdef CONFIG_SECCOMP_FILTER"
> +static long seccomp_attach_filter(struct sock_fprog *fprog)
> +{
> + struct seccomp_filter *filter;
> + unsigned long fp_size = fprog->len * sizeof(struct sock_filter);
> + long ret;
> +
> + if (fprog->len == 0 || fprog->len > BPF_MAXINSNS)
> + return -EINVAL;
OK, this limits the memory PR_SET_SECCOMP can use.
But,
> + /*
> + * If there is an existing filter, make it the prev and don't drop its
> + * task reference.
> + */
> + filter->prev = current->seccomp.filter;
> + current->seccomp.filter = filter;
> + return 0;
this doesn't limit the number of filters, looks like a DoS.
What if the application simply does prctl(PR_SET_SECCOMP, dummy_filter)
in an endless loop?
> +static struct seccomp_filter *get_seccomp_filter(struct seccomp_filter *orig)
> +{
> + if (!orig)
> + return NULL;
> + /* Reference count is bounded by the number of total processes. */
> + atomic_inc(&orig->usage);
> + return orig;
> +}
> ...
> +void copy_seccomp(struct seccomp *child, const struct seccomp *parent)
> +{
> + /* Other fields are handled by dup_task_struct. */
> + child->filter = get_seccomp_filter(parent->filter);
> +}
This is purely cosmetic, but imho looks a bit confusing.
We do not copy seccomp->mode and this is correct, it was already copied
implicitly. So why do we copy ->filter? This is not "symmetrical", afaics
you can simply do
void copy_seccomp(struct seccomp *child)
{
if (child->filter)
atomic_inc(child->filter->usage);
But once again, this is cosmetic, feel free to ignore.
Oleg.
On 02/24, Will Drewry wrote:
>
> static u32 seccomp_run_filters(int syscall)
> {
> struct seccomp_filter *f;
> - u32 ret = SECCOMP_RET_KILL;
> static const struct bpf_load_fn fns = {
> bpf_load,
> sizeof(struct seccomp_data),
> };
> + u32 ret = SECCOMP_RET_ALLOW;
> const void *sc_ptr = (const void *)(uintptr_t)syscall;
>
> + /* Ensure unexpected behavior doesn't result in failing open. */
> + if (unlikely(current->seccomp.filter == NULL))
> + ret = SECCOMP_RET_KILL;
Is "seccomp.filter == NULL" really possible?
Oleg.
On 02/24, Will Drewry wrote:
>
> To ensure that SIGSYS delivery occurs on return from the triggering
> system call, SIGSYS is added to the SYNCHRONOUS_MASK macro.
Hmm. Can't understand... please help.
> #define SYNCHRONOUS_MASK \
> (sigmask(SIGSEGV) | sigmask(SIGBUS) | sigmask(SIGILL) | \
> - sigmask(SIGTRAP) | sigmask(SIGFPE))
> + sigmask(SIGTRAP) | sigmask(SIGFPE) | sigmask(SIGSYS))
Why?
SYNCHRONOUS_MASK just tells dequeue_signal() "pick them first".
This is needed to make sure that the handler for, say SIGSEGV,
can use ucontext->ip as a faulting addr.
But seccomp adds info->si_call_addr, this looks unneeded.
Oleg.
On Mon, Feb 27, 2012 at 9:22 AM, Oleg Nesterov <[email protected]> wrote:
> SYNCHRONOUS_MASK just tells dequeue_signal() "pick them first".
> This is needed to make sure that the handler for, say SIGSEGV,
> can use ucontext->ip as a faulting addr.
It's desirable to have these signals handled first just so that the thread
state that provoked the signal is not obscured by an unrelated asynchronous
signal having its handler setup done beforehand.
On 02/24, Will Drewry wrote:
>
> arch/Kconfig | 1 +
> include/linux/ptrace.h | 7 +++++--
> include/linux/seccomp.h | 4 +++-
> include/linux/tracehook.h | 6 ++++++
> kernel/ptrace.c | 4 ++++
> kernel/seccomp.c | 18 ++++++++++++++++++
FYI, this conflicts with the changes -mm tree.
The changes in ptrace.* confict with Denys's
"ptrace: simplify PTRACE_foo constants and PTRACE_SETOPTIONS code"
The change in tracehook.h conflicts with
"ptrace: the killed tracee should not enter the syscall"
> --- a/kernel/seccomp.c
> +++ b/kernel/seccomp.c
> @@ -354,6 +354,24 @@ int __secure_computing_int(int this_syscall)
> seccomp_send_sigsys(this_syscall, reason_code);
> return -1;
> }
> + case SECCOMP_RET_TRACE: {
> + int ret;
> + struct pt_regs *regs = task_pt_regs(current);
> + if (!(test_tsk_thread_flag(current, TIF_SYSCALL_TRACE)) ||
> + !(current->ptrace & PT_TRACE_SECCOMP))
> + return -1;
> + /*
> + * PT_TRACE_SECCOMP and seccomp.trace indicate whether
> + * tracehook_report_syscall_entry needs to signal the
> + * tracer. This avoids race conditions in hand off and
> + * the requirement for TIF_SYSCALL_TRACE ensures that
> + * we are in the syscall slow path.
> + */
> + current->seccomp.trace = 1;
> + ret = tracehook_report_syscall_entry(regs);
> + current->seccomp.trace = 0;
> + return ret;
To be honest, this interface looks a bit strange to me...
Once again, sorry if this was already discussed. But perhaps it would
be better to introduce PTRACE_EVENT_SECCOMP/PTRACE_O_SECCOMP instead?
SECCOMP_RET_TRACE: could simply do ptrace_event(PTRACE_EVENT_SECCOMP)
unconditionally. The tracer can set the option and do PTRACE_CONT if it
doesn't want the system call notifications.
This is also much simpler, no need to change ptrace/tracehook files.
Oleg.
On Mon, Feb 27, 2012 at 9:11 AM, Oleg Nesterov <[email protected]> wrote:
> On 02/24, Will Drewry wrote:
>>
>>  static u32 seccomp_run_filters(int syscall)
>>  {
>>      struct seccomp_filter *f;
>> -    u32 ret = SECCOMP_RET_KILL;
>>      static const struct bpf_load_fn fns = {
>>              bpf_load,
>>              sizeof(struct seccomp_data),
>>      };
>> +    u32 ret = SECCOMP_RET_ALLOW;
>>      const void *sc_ptr = (const void *)(uintptr_t)syscall;
>>
>> +    /* Ensure unexpected behavior doesn't result in failing open. */
>> +    if (unlikely(current->seccomp.filter == NULL))
>> +            ret = SECCOMP_RET_KILL;
>
> Is "seccomp.filter == NULL" really possible?
It should not be, but I'm much more comfortable with this failing
closed. I think it's important to be as defensive as possible with
this code given its intended use.
-Kees
--
Kees Cook
ChromeOS Security
On 02/27, Roland McGrath wrote:
>
> On Mon, Feb 27, 2012 at 9:22 AM, Oleg Nesterov <[email protected]> wrote:
> > SYNCHRONOUS_MASK just tells dequeue_signal() "pick them first".
> > This is needed to make sure that the handler for, say SIGSEGV,
> > can use ucontext->ip as a faulting addr.
>
> It's desireable to have these signals handled first just so that the thread
> state that provoked the signal is not obscured by an unrelated asynchronous
> signal having its handler setup done beforehand.
OK, then probably it makes sense to update the changelog, "To ensure that
SIGSYS delivery occurs on return from the triggering system call" looks
confusing imho.
Not that I really understand why "setup_rt_frame() first" really matters
in this case, siginfo should carry all necessary info. IOW, may be "run
this handler first" makes more sense but this change makes the opposite.
OK, I won't argue, just I was confused by the changelog.
Oleg.
On 02/27, Kees Cook wrote:
>
> On Mon, Feb 27, 2012 at 9:11 AM, Oleg Nesterov <[email protected]> wrote:
> > On 02/24, Will Drewry wrote:
> >>
> >> [...]
> >> +    /* Ensure unexpected behavior doesn't result in failing open. */
> >> +    if (unlikely(current->seccomp.filter == NULL))
> >> +            ret = SECCOMP_RET_KILL;
> >
> > Is "seccomp.filter == NULL" really possible?
>
> It should not be, but I'm much more comfortable with this failing
> closed. I think it's important to be as defensive as possible with
> this code given its intended use.
Can't resist... Sorry, I know I am a troll but personally I think
in this case the most defensive code is BUG_ON(->filter == NULL)
or at least WARN_ON().
Nevermind, I won't pretend I really understand the intended use,
please ignore.
Oleg.
On Mon, Feb 27, 2012 at 10:14 AM, Oleg Nesterov <[email protected]> wrote:
> On 02/27, Kees Cook wrote:
>>
>> On Mon, Feb 27, 2012 at 9:11 AM, Oleg Nesterov <[email protected]> wrote:
>> > On 02/24, Will Drewry wrote:
>> >>
>> >> [...]
>> >> +    /* Ensure unexpected behavior doesn't result in failing open. */
>> >> +    if (unlikely(current->seccomp.filter == NULL))
>> >> +            ret = SECCOMP_RET_KILL;
>> >
>> > Is "seccomp.filter == NULL" really possible?
>>
>> It should not be, but I'm much more comfortable with this failing
>> closed. I think it's important to be as defensive as possible with
>> this code given its intended use.
>
> Can't resists... Sorry, I know I am troll but personally I think
> in this case the most defensive code is BUG_ON(->filter == NULL)
> or at least WARN_ON().
Linus will probably object because he objected (correctly) to a very
similar problem in my old vsyscall emulation series. A userspace
security feature shouldn't have a failure mode in which it confuses
the kernel and results in an oops, unless the situation is really
unrecoverable. So WARN_ON plus do_exit would be okay but BUG_ON would
not.
--Andy
On Mon, Feb 27, 2012 at 8:49 AM, Eric Paris <[email protected]> wrote:
> On Mon, 2012-02-27 at 10:23 -0600, Will Drewry wrote:
>> On Sun, Feb 26, 2012 at 2:28 PM, Kees Cook <[email protected]> wrote:
>> > On Fri, Feb 24, 2012 at 09:21:45PM -0600, Will Drewry wrote:
>> >> [...]
>> > [...]
>> > +#ifdef CONFIG_COMPAT
>> > +       audit_log_format(ab, " compat=%d", is_compat_task());
>> > +#endif
>>
>> Should this just use syscall_get_arch to get the AUDIT_ARCH now? :)
>
> I'm waffling on this one, but I'm leaning towards not including compat
> at all. If you include it, yes, you should use the generic function.
>
> If you have CONFIG_AUDITSC and started audit you are going to get this,
> along with a0-a4, in a separate but associated audit record. Thus you
> get all the interesting/relevant info. Without CONFIG_AUDITSC and
> running auditd you get compat, but nothing else. Why is compat so
> interesting?
You mean as used in audit_log_exit() ? It looks like that depends on a
lot of state cached in __audit_syscall_entry() and finally triggered
in __audit_syscall_exit() (and ..._free()). I don't think this is
really what seccomp wants to be involved in.
By CONFIG_AUDITSC, you mean CONFIG_AUDITSYSCALL? Without that set,
audit_seccomp is a no-op.
The reason compat needs to be reported (or rather, arch) is because
just reporting syscall is ambiguous. It either needs arch or compat to
distinguish it.
> This patch would duplicate the arch= field from that record (calling it
> compat). So if we are going to duplicate it in another record, we
> should at least call it the same thing (arch=%x)
Right, I agree with Will, this should be arch=%x via
syscall_get_arch() if it's going to happen here.
> My current thinking, and I'm not settled would be to include syscall,
> a0-a4 and arch in the record only if !CONFIG_AUDITSC. (ip doesn't show
> up elsewhere, so that makes sense here)
>
> It might be annoying to have to find the info in the right record, but
> if you use the auparse audit library tools, it should 'Just Work'...
Given that this is more about logging an abend-like condition, I don't
think it should need to depend on having all syscall auditing enabled
for the process just to get the arch. It really feels like a distinct
condition. But maybe I'm misunderstanding something about how
auditsc.c does its work.
>> > [...]
>>
>> I'll pull this into the series if that's okay with you?
Let me send a modified version that doesn't include arch, just to
avoid that can of worms for the moment. A separate patch can add that
later, along with all the get_audit_arch() routines for the other
architectures. My original intent was to avoid the duplication between
pr_info() and audit_seccomp(). :)
-Kees
--
Kees Cook
ChromeOS Security
On Mon, Feb 27, 2012 at 10:35 AM, Andrew Lutomirski <[email protected]> wrote:
> On Mon, Feb 27, 2012 at 10:14 AM, Oleg Nesterov <[email protected]> wrote:
>> On 02/27, Kees Cook wrote:
>>>
>>> On Mon, Feb 27, 2012 at 9:11 AM, Oleg Nesterov <[email protected]> wrote:
>>> > On 02/24, Will Drewry wrote:
>>> >>
>>> >>  static u32 seccomp_run_filters(int syscall)
>>> >>  {
>>> >>         struct seccomp_filter *f;
>>> >> -       u32 ret = SECCOMP_RET_KILL;
>>> >>         static const struct bpf_load_fn fns = {
>>> >>                 bpf_load,
>>> >>                 sizeof(struct seccomp_data),
>>> >>         };
>>> >> +       u32 ret = SECCOMP_RET_ALLOW;
>>> >>         const void *sc_ptr = (const void *)(uintptr_t)syscall;
>>> >>
>>> >> +       /* Ensure unexpected behavior doesn't result in failing open. */
>>> >> +       if (unlikely(current->seccomp.filter == NULL))
>>> >> +               ret = SECCOMP_RET_KILL;
>>> >
>>> > Is "seccomp.filter == NULL" really possible?
>>>
>>> It should not be, but I'm much more comfortable with this failing
>>> closed. I think it's important to be as defensive as possible with
>>> this code given its intended use.
>>
>> Can't resists... Sorry, I know I am troll but personally I think
>> in this case the most defensive code is BUG_ON(->filter == NULL)
>> or at least WARN_ON().
>
> Linus will probably object because he objected (correctly) to a very
> similar problem in my old vsyscall emulation series. A userspace
> security feature shouldn't have a failure mode in which it confuses
> the kernel and results in an oops, unless the situation is really
> unrecoverable. So WARN_ON plus do_exit would be okay but BUG_ON would
> not.
Yeah, actually, add WARN_ON would be preferred here because it should
be an impossible situation. It should still fail closed, though:
        /* Ensure unexpected behavior doesn't result in failing open. */
        if (WARN_ON(current->seccomp.filter == NULL))
                return SECCOMP_RET_KILL;
-Kees
--
Kees Cook
ChromeOS Security
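The fail-closed check settled on above can be modeled in plain C. This is only a sketch: `run_filters` and `allow_all` are made-up names, the return values mirror the series' SECCOMP_RET_* constants, and the plain NULL test stands in for the kernel's WARN_ON():

```c
#include <assert.h>
#include <stddef.h>

#define SECCOMP_RET_KILL  0x00000000u
#define SECCOMP_RET_ALLOW 0x7fff0000u

struct seccomp_filter { int unused; };  /* placeholder for the real struct */

/* Trivial evaluator standing in for actual BPF execution. */
static unsigned int allow_all(const struct seccomp_filter *f)
{
        (void)f;
        return SECCOMP_RET_ALLOW;
}

/* Model of seccomp_run_filters(): a task in filter mode with no filter
 * attached should be impossible, so fail closed (kill), never open. */
static unsigned int run_filters(const struct seccomp_filter *filter,
                                unsigned int (*eval)(const struct seccomp_filter *))
{
        if (filter == NULL)     /* the kernel version wraps this in WARN_ON() */
                return SECCOMP_RET_KILL;
        return eval(filter);
}
```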
On Mon, 2012-02-27 at 10:55 -0800, Kees Cook wrote:
> On Mon, Feb 27, 2012 at 8:49 AM, Eric Paris <[email protected]> wrote:
> You mean as used in audit_log_exit() ? It looks like that depends on a
> lot of state cached in __audit_syscall_entry() and finally triggered
> in __audit_syscall_exit() (and ..._free()). I don't think this is
> really want seccomp wants to be involved in.
>
> By CONFIG_AUDITSC, you mean CONFIG_AUDITSYSCALL? Without that set,
> audit_seccomp is a no-op.
>
> The reason compat needs to be reported (or rather, arch) is because
> just reporting syscall is ambiguous. It either needs arch or compat to
> distinguish it.
Yes, that is what I mean and you are right. You shouldn't push the
syscall in this record either. If !audit_dummy_context() you are
already going to get arch, syscall, and a0-a4 in the associated audit
record. Please do not duplicate that info.
It might make sense to have a separate audit_seccomp() path when
audit_dummy_context() which includes arch, syscall, and a0-a4.
It is my fault (85e7bac3) that we have syscall at all, but I'm on a new
crusade to remove audit record duplication. So I'd happily see a patch
in this series that removes that instead of adds to it.
-Eric
On Mon, Feb 27, 2012 at 11:54 AM, Oleg Nesterov <[email protected]> wrote:
> On 02/24, Will Drewry wrote:
>>
>>  arch/Kconfig              |    1 +
>>  include/linux/ptrace.h    |    7 +++++--
>>  include/linux/seccomp.h   |    4 +++-
>>  include/linux/tracehook.h |    6 ++++++
>>  kernel/ptrace.c           |    4 ++++
>>  kernel/seccomp.c          |   18 ++++++++++++++++++
>
> FYI, this conflicts with the changes -mm tree.
>
> The changes in ptrace.* confict with Denys's
> "ptrace: simplify PTRACE_foo constants and PTRACE_SETOPTIONS code"
>
> The change in tracehook.h conflicts with
> "ptrace: the killed tracee should not enter the syscall"
What's the best way to reconcile this in this day and age? I don't see
these in linux-next yet and I can't tell if there is a public -mm
anywhere anymore. I can use the patches from the mailing list with
Denys's changes if that'd be good enough. His cleanup will make this
code even smaller!
>> --- a/kernel/seccomp.c
>> +++ b/kernel/seccomp.c
>> @@ -354,6 +354,24 @@ int __secure_computing_int(int this_syscall)
>>                         seccomp_send_sigsys(this_syscall, reason_code);
>>                         return -1;
>>                 }
>> +             case SECCOMP_RET_TRACE: {
>> +                     int ret;
>> +                     struct pt_regs *regs = task_pt_regs(current);
>> +                     if (!(test_tsk_thread_flag(current, TIF_SYSCALL_TRACE)) ||
>> +                         !(current->ptrace & PT_TRACE_SECCOMP))
>> +                             return -1;
>> +                     /*
>> +                      * PT_TRACE_SECCOMP and seccomp.trace indicate whether
>> +                      * tracehook_report_syscall_entry needs to signal the
>> +                      * tracer.  This avoids race conditions in hand off and
>> +                      * the requirement for TIF_SYSCALL_TRACE ensures that
>> +                      * we are in the syscall slow path.
>> +                      */
>> +                     current->seccomp.trace = 1;
>> +                     ret = tracehook_report_syscall_entry(regs);
>> +                     current->seccomp.trace = 0;
>> +                     return ret;
>
> To be honest, this interface looks a bit strange to me...
>
> Once again, sorry if this was already discussed. But perhaps it would
> be better to introduce PTRACE_EVENT_SECCOMP/PTRACE_O_SECCOMP instead?
>
> SECCOMP_RET_TRACE: could simply do ptrace_event(PTRACE_EVENT_SECCOMP)
> unconditionally. The tracer can set the option and do PTRACE_CONT if it
> doesn't want the system call notifications.
Works for me - this also gets rid of the extra int for brief state
tracking. I'll switch over to that in the next rev. (More follow-ups
to your reviews incoming too :).
Thanks!
will
On Mon, Feb 27, 2012 at 11:09 AM, Oleg Nesterov <[email protected]> wrote:
> Hello Will.
>
> I missed the previous discussions, and I don't think I can read
> all these emails now. So I apologize in advance if this was already
> discussed.
No worries - any review is appreciated :)
> On 02/24, Will Drewry wrote:
>>
>>  struct seccomp {
>>         int mode;
>> +       struct seccomp_filter *filter;
>>  };
>
> Minor nit, it seems that the new member can be "ifdef CONFIG_SECCOMP_FILTER"
Good call - I'll add that.
>> +static long seccomp_attach_filter(struct sock_fprog *fprog)
>> +{
>> +       struct seccomp_filter *filter;
>> +       unsigned long fp_size = fprog->len * sizeof(struct sock_filter);
>> +       long ret;
>> +
>> +       if (fprog->len == 0 || fprog->len > BPF_MAXINSNS)
>> +               return -EINVAL;
>
> OK, this limits the memory PR_SET_SECCOMP can use.
>
> But,
>
>> +       /*
>> +        * If there is an existing filter, make it the prev and don't drop its
>> +        * task reference.
>> +        */
>> +       filter->prev = current->seccomp.filter;
>> +       current->seccomp.filter = filter;
>> +       return 0;
>
> this doesn't limit the number of filters, looks like a DoS.
>
> What if the application simply does prctl(PR_SET_SECCOMP, dummy_filter)
> in an endless loop?
It consumes a massive amount of kernel memory and, maybe, the OOM
killer gives it a boot :)
I wasn't sure what the normal convention was for avoiding memory
consumption by user processes. Should I just add a sysctl and a
per-task counter for the max number of filters?
I'm fine doing whatever makes sense here.
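One way to bound the memory is to walk the chain at attach time and cap the total instruction count rather than the number of filters. A user-space sketch of that idea (all names and the cap value are assumptions, not the patch's code):

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Simplified stand-in for the kernel structure under discussion. */
struct seccomp_filter {
        unsigned int len;               /* number of BPF instructions */
        struct seccomp_filter *prev;    /* previously attached filter */
};

#define MAX_INSNS_PER_PATH 32768        /* arbitrary cap for the sketch */

/* Walk the chain and total the instruction count; attach is not a fast
 * path, so iterating instead of keeping a counter is fine. */
static unsigned int total_insns(const struct seccomp_filter *f)
{
        unsigned int total = 0;
        for (; f; f = f->prev)
                total += f->len;
        return total;
}

/* Returns 0 on success, -1 (standing in for -EINVAL/-ENOMEM) on failure. */
static int attach_filter(struct seccomp_filter **head, unsigned int len)
{
        struct seccomp_filter *f;

        if (len == 0 || total_insns(*head) + len > MAX_INSNS_PER_PATH)
                return -1;
        f = malloc(sizeof(*f));
        if (!f)
                return -1;
        f->len = len;
        f->prev = *head;
        *head = f;
        return 0;
}
```

An endless prctl(PR_SET_SECCOMP, dummy_filter) loop would then fail once the path's instruction budget is spent, instead of consuming unbounded kernel memory.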
>
>
>> +static struct seccomp_filter *get_seccomp_filter(struct seccomp_filter *orig)
>> +{
>> +       if (!orig)
>> +               return NULL;
>> +       /* Reference count is bounded by the number of total processes. */
>> +       atomic_inc(&orig->usage);
>> +       return orig;
>> +}
>> ...
>> +void copy_seccomp(struct seccomp *child, const struct seccomp *parent)
>> +{
>> +       /* Other fields are handled by dup_task_struct. */
>> +       child->filter = get_seccomp_filter(parent->filter);
>> +}
>
> This is purely cosmetic, but imho looks a bit confusing.
>
> We do not copy seccomp->mode and this is correct, it was already copied
> implicitly. So why do we copy ->filter? This is not "symmetrical", afaics
> you can simply do
>
>        void copy_seccomp(struct seccomp *child)
>        {
>                if (child->filter)
>                        atomic_inc(&child->filter->usage);
>
> But once again, this is cosmetic, feel free to ignore.
Right now get_seccomp_filter does the NULL check, so really this could
be reduced to adding an external get_seccomp_filter(p->seccomp.filter)
in place of copy_seccomp().
As to removing the extra arg, that should be fine since the parent
can't drop its refcount when copy_seccomp is called. At the very
least, I can make that change so it reads more cleanly.
thanks!
will
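For reference, the get/put pattern being debated looks roughly like this in user-space C (stdatomic stands in for the kernel's atomic_t; every name here is a stand-in, not the patch's code):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Simplified model of the filter refcounting under discussion. */
struct seccomp_filter {
        atomic_int usage;
        struct seccomp_filter *prev;
};

static struct seccomp_filter *filter_alloc(struct seccomp_filter *prev)
{
        struct seccomp_filter *f = malloc(sizeof(*f));
        if (!f)
                return NULL;
        atomic_init(&f->usage, 1);      /* creator holds the first reference */
        f->prev = prev;                 /* takes over the caller's ref to prev */
        return f;
}

/* get: NULL-safe, so fork can call it unconditionally. */
static void get_seccomp_filter(struct seccomp_filter *f)
{
        if (f)
                atomic_fetch_add(&f->usage, 1);
}

/* put: drop one reference; freeing a link also releases its ref on prev,
 * so the loop walks down the chain freeing anything that hits zero. */
static void put_seccomp_filter(struct seccomp_filter *f)
{
        while (f && atomic_fetch_sub(&f->usage, 1) == 1) {
                struct seccomp_filter *prev = f->prev;
                free(f);
                f = prev;
        }
}
```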
On Mon, Feb 27, 2012 at 1:14 PM, Kees Cook <[email protected]> wrote:
> On Mon, Feb 27, 2012 at 10:35 AM, Andrew Lutomirski <[email protected]> wrote:
>> On Mon, Feb 27, 2012 at 10:14 AM, Oleg Nesterov <[email protected]> wrote:
>>> On 02/27, Kees Cook wrote:
>>>>
>>>> On Mon, Feb 27, 2012 at 9:11 AM, Oleg Nesterov <[email protected]> wrote:
>>>> > On 02/24, Will Drewry wrote:
>>>> >>
>>>> >>  static u32 seccomp_run_filters(int syscall)
>>>> >>  {
>>>> >>         struct seccomp_filter *f;
>>>> >> -       u32 ret = SECCOMP_RET_KILL;
>>>> >>         static const struct bpf_load_fn fns = {
>>>> >>                 bpf_load,
>>>> >>                 sizeof(struct seccomp_data),
>>>> >>         };
>>>> >> +       u32 ret = SECCOMP_RET_ALLOW;
>>>> >>         const void *sc_ptr = (const void *)(uintptr_t)syscall;
>>>> >>
>>>> >> +       /* Ensure unexpected behavior doesn't result in failing open. */
>>>> >> +       if (unlikely(current->seccomp.filter == NULL))
>>>> >> +               ret = SECCOMP_RET_KILL;
>>>> >
>>>> > Is "seccomp.filter == NULL" really possible?
>>>>
>>>> It should not be, but I'm much more comfortable with this failing
>>>> closed. I think it's important to be as defensive as possible with
>>>> this code given its intended use.
>>>
>>> Can't resists... Sorry, I know I am troll but personally I think
>>> in this case the most defensive code is BUG_ON(->filter == NULL)
>>> or at least WARN_ON().
>>
>> Linus will probably object because he objected (correctly) to a very
>> similar problem in my old vsyscall emulation series. A userspace
>> security feature shouldn't have a failure mode in which it confuses
>> the kernel and results in an oops, unless the situation is really
>> unrecoverable. So WARN_ON plus do_exit would be okay but BUG_ON would
>> not.
>
> Yeah, actually, add WARN_ON would be preferred here because it should
> be an impossible situation. It should still fail closed, though:
>
>        /* Ensure unexpected behavior doesn't result in failing open. */
>        if (WARN_ON(current->seccomp.filter == NULL))
>                return SECCOMP_RET_KILL;
I'll do that - thanks!
On Mon, Feb 27, 2012 at 11:25 AM, Eric Paris <[email protected]> wrote:
> On Mon, 2012-02-27 at 10:55 -0800, Kees Cook wrote:
>> On Mon, Feb 27, 2012 at 8:49 AM, Eric Paris <[email protected]> wrote:
>
>> You mean as used in audit_log_exit() ? It looks like that depends on a
>> lot of state cached in __audit_syscall_entry() and finally triggered
>> in __audit_syscall_exit() (and ..._free()). I don't think this is
>> really want seccomp wants to be involved in.
>>
>> By CONFIG_AUDITSC, you mean CONFIG_AUDITSYSCALL? Without that set,
>> audit_seccomp is a no-op.
>>
>> The reason compat needs to be reported (or rather, arch) is because
>> just reporting syscall is ambiguous. It either needs arch or compat to
>> distinguish it.
>
> Yes, that is what I mean and you are right.  You shouldn't push the
> syscall in this record either.  If !audit_dummy_context() you are
> already going to get arch, syscall, and a0-a4 in the associated audit
> record.  Please do not duplicate that info.
Ah, in that case, please ignore the patch I just sent. Heh.
> It might make sense to have a separate audit_seccomp() path when
> audit_dummy_context() which includes arch, syscall, and a0-a4.
Ah! I think I understand what you mean now. If audit_dummy_context(),
then the syscall, arch, and a0-a4 were not already collected. Gotcha.
How do you envision it looking? I still see it as two distinct events
(the syscall itself, and the rejection). Would you want those details
added to the context structure to be reported at ..._exit() time? It
seems like context->type couldn't be used to see if those fields were
valid.
Something like:
void __audit_seccomp(unsigned long syscall, long signr)
{
struct audit_buffer *ab;
if (!audit_dummy_context()) {
struct audit_context *context = current->audit_context;
context->syscall_signr = signr;
context->syscall_ip = KSTK_EIP(current);
return;
}
ab = audit_log_start(NULL, GFP_KERNEL, AUDIT_ANOM_ABEND);
audit_log_abend(ab, "seccomp", signr);
audit_log_format(ab, " syscall=%ld", syscall);
audit_log_format(ab, " ip=0x%lx", KSTK_EIP(current));
audit_log_end(ab);
}
And then report syscall_ip and syscall_signr if syscall_signr != 0 in
the _exit()? I think everything else from audit_log_abend() will end
up in the _exit() report.
> It is my fault (85e7bac3) that we have syscall at all, but I'm on a new
> crusade to remove audit record duplication. ?So I'd happily see a patch
> in this series that removes that instead of adds to it.
Well, I think the abend reporting is nice; I'd hate to see that
totally removed. The seccomp case is a bit different, I agree. I could
see it either way.
-Kees
--
Kees Cook
ChromeOS Security
On Mon, Feb 27, 2012 at 11:54 AM, Will Drewry <[email protected]> wrote:
> On Mon, Feb 27, 2012 at 11:09 AM, Oleg Nesterov <[email protected]> wrote:
>> On 02/24, Will Drewry wrote:
>>> +static struct seccomp_filter *get_seccomp_filter(struct seccomp_filter *orig)
>>> +{
>>> +       if (!orig)
>>> +               return NULL;
>>> +       /* Reference count is bounded by the number of total processes. */
>>> +       atomic_inc(&orig->usage);
>>> +       return orig;
>>> +}
>>> ...
>>> +void copy_seccomp(struct seccomp *child, const struct seccomp *parent)
>>> +{
>>> +       /* Other fields are handled by dup_task_struct. */
>>> +       child->filter = get_seccomp_filter(parent->filter);
>>> +}
>>
>> This is purely cosmetic, but imho looks a bit confusing.
>>
>> We do not copy seccomp->mode and this is correct, it was already copied
>> implicitly. So why do we copy ->filter? This is not "symmetrical", afaics
>> you can simply do
>>
>>        void copy_seccomp(struct seccomp *child)
>>        {
>>                if (child->filter)
>>                        atomic_inc(&child->filter->usage);
>>
>> But once again, this is cosmetic, feel free to ignore.
>
> Right now get_seccomp_filter does the NULL check, so really this could
> be reduced to adding an external get_seccomp_filter(p->seccomp.filter)
> in place of copy_seccomp().
>
> As to removing the extra arg, that should be fine since the parent
> can't drop its refcount when copy_seccomp is called.  At the very
> least, I can make that change so it reads more cleanly.
I had various conflicting thoughts while looking over the refcounting:
- get_seccomp_filter is defined static, and has a single caller: copy_seccomp()
- put isn't static, and has a single caller: kernel/fork.c:free_task()
- having only get_/put_ touch ->usage seems cleaner to me
- seccomp_attach_filter touches ->usage without get_seccomp_filter
- having the initializing routine use atomic_set(..., 1) is a common pattern
In a fit of extreme bike-shedding, I can't decide which is more sensible:
- rename put_seccomp_filter to free_seccomp_filter and inline the
get_seccomp_filter logic into copy_seccomp().
or
- create a wrapper for put_seccomp_filter named free_seccomp_filter so
that get_/put_ can both be static.
-Kees
--
Kees Cook
ChromeOS Security
On Mon, Feb 27, 2012 at 11:22 AM, Oleg Nesterov <[email protected]> wrote:
> On 02/24, Will Drewry wrote:
>>
>> To ensure that SIGSYS delivery occurs on return from the triggering
>> system call, SIGSYS is added to the SYNCHRONOUS_MASK macro.
>
> Hmm. Can't understand... please help.
>
>>  #define SYNCHRONOUS_MASK \
>>       (sigmask(SIGSEGV) | sigmask(SIGBUS) | sigmask(SIGILL) | \
>> -      sigmask(SIGTRAP) | sigmask(SIGFPE))
>> +      sigmask(SIGTRAP) | sigmask(SIGFPE) | sigmask(SIGSYS))
>
> Why?
>
> SYNCHRONOUS_MASK just tells dequeue_signal() "pick them first".
> This is needed to make sure that the handler for, say SIGSEGV,
> can use ucontext->ip as a faulting addr.
I think that Roland covered this. (Since the syscall_rollback was
called it's nice to let our handler get first go.)
> But seccomp adds info->si_call_addr, this looks unneeded.
True enough. I can drop it. It'd only be useful if the SIGSYS wasn't
being forced and the signal was being handled without ucontext_t
access.
thanks!
On Mon, 2012-02-27 at 12:00 -0800, Kees Cook wrote:
> On Mon, Feb 27, 2012 at 11:25 AM, Eric Paris <[email protected]> wrote:
> > On Mon, 2012-02-27 at 10:55 -0800, Kees Cook wrote:
> >> On Mon, Feb 27, 2012 at 8:49 AM, Eric Paris <[email protected]> wrote:
> >
> >> You mean as used in audit_log_exit() ? It looks like that depends on a
> >> lot of state cached in __audit_syscall_entry() and finally triggered
> >> in __audit_syscall_exit() (and ..._free()). I don't think this is
> >> really want seccomp wants to be involved in.
> >>
> >> By CONFIG_AUDITSC, you mean CONFIG_AUDITSYSCALL? Without that set,
> >> audit_seccomp is a no-op.
> >>
> >> The reason compat needs to be reported (or rather, arch) is because
> >> just reporting syscall is ambiguous. It either needs arch or compat to
> >> distinguish it.
> >
> > Yes, that is what I mean and you are right. You shouldn't push the
> > syscall in this record either. If !audit_dummy_context() you are
> > already going to get arch, syscall, and a0-a4 in the associated audit
> > record. Please do not duplicate that info.
>
> Ah, in that case, please ignore the patch I just sent. Heh.
>
> > It might make sense to have a separate audit_seccomp() path when
> > audit_dummy_context() which includes arch, syscall, and a0-a4.
>
> Ah! I think I understand what you mean now. If audit_dummy_context(),
> then the syscall, arch, and a0-a4 were not already collected. Gotcha.
>
> How do you envision it looking? I still see it as two distinct events
> (the syscall itself, and the rejection). Would you want those details
> added to the context structure to be reported at ..._exit() time? It
> seems like context->type couldn't be used to see if those fields were
> valid.
>
> Something like:
>
> void __audit_seccomp(unsigned long syscall, long signr)
> {
> struct audit_buffer *ab;
>
> if (!audit_dummy_context()) {
> struct audit_context *context = current->audit_context;
> context->syscall_signr = signr;
> context->syscall_ip = KSTK_EIP(current);
> return;
> }
>
> ab = audit_log_start(NULL, GFP_KERNEL, AUDIT_ANOM_ABEND);
> audit_log_abend(ab, "seccomp", signr);
> audit_log_format(ab, " syscall=%ld", syscall);
> audit_log_format(ab, " ip=0x%lx", KSTK_EIP(current));
> audit_log_end(ab);
> }
>
> And then report syscall_ip and syscall_signr if syscall_signr != 0 in
> the _exit()? I think everything else from audit_log_abend() will end
> up in the _exit() report.
>
> > It is my fault (85e7bac3) that we have syscall at all, but I'm on a new
> > crusade to remove audit record duplication. So I'd happily see a patch
> > in this series that removes that instead of adds to it.
>
> Well, I think the abend reporting is nice; I'd hate to see that
> totally removed. The seccomp case is a bit different, I agree. I could
> see it either way.
Once again I send you down a bad path. Your original patch was the
best. We should consider including a0-aX in a future version. I was
mistaken in foolishly believing that audit_syscall_entry() was done
before secure_computing(). But if you look, that isn't the case.
Please pretend I never said anything as you had it right the first time.
-Eric
On Mon, Feb 27, 2012 at 12:34 PM, Eric Paris <[email protected]> wrote:
> On Mon, 2012-02-27 at 12:00 -0800, Kees Cook wrote:
>> On Mon, Feb 27, 2012 at 11:25 AM, Eric Paris <[email protected]> wrote:
>> > On Mon, 2012-02-27 at 10:55 -0800, Kees Cook wrote:
>> >> On Mon, Feb 27, 2012 at 8:49 AM, Eric Paris <[email protected]> wrote:
>> >
>> >> You mean as used in audit_log_exit() ? It looks like that depends on a
>> >> lot of state cached in __audit_syscall_entry() and finally triggered
>> >> in __audit_syscall_exit() (and ..._free()). I don't think this is
>> >> really want seccomp wants to be involved in.
>> >>
>> >> By CONFIG_AUDITSC, you mean CONFIG_AUDITSYSCALL? Without that set,
>> >> audit_seccomp is a no-op.
>> >>
>> >> The reason compat needs to be reported (or rather, arch) is because
>> >> just reporting syscall is ambiguous. It either needs arch or compat to
>> >> distinguish it.
>> >
>> > Yes, that is what I mean and you are right.  You shouldn't push the
>> > syscall in this record either.  If !audit_dummy_context() you are
>> > already going to get arch, syscall, and a0-a4 in the associated audit
>> > record.  Please do not duplicate that info.
>>
>> Ah, in that case, please ignore the patch I just sent. Heh.
>>
>> > It might make sense to have a separate audit_seccomp() path when
>> > audit_dummy_context() which includes arch, syscall, and a0-a4.
>>
>> Ah! I think I understand what you mean now. If audit_dummy_context(),
>> then the syscall, arch, and a0-a4 were not already collected. Gotcha.
>>
>> How do you envision it looking? I still see it as two distinct events
>> (the syscall itself, and the rejection). Would you want those details
>> added to the context structure to be reported at ..._exit() time? It
>> seems like context->type couldn't be used to see if those fields were
>> valid.
>>
>> Something like:
>>
>> void __audit_seccomp(unsigned long syscall, long signr)
>> {
>>         struct audit_buffer *ab;
>>
>>         if (!audit_dummy_context()) {
>>                 struct audit_context *context = current->audit_context;
>>                 context->syscall_signr = signr;
>>                 context->syscall_ip = KSTK_EIP(current);
>>                 return;
>>         }
>>
>>         ab = audit_log_start(NULL, GFP_KERNEL, AUDIT_ANOM_ABEND);
>>         audit_log_abend(ab, "seccomp", signr);
>>         audit_log_format(ab, " syscall=%ld", syscall);
>>         audit_log_format(ab, " ip=0x%lx", KSTK_EIP(current));
>>         audit_log_end(ab);
>> }
>>
>> And then report syscall_ip and syscall_signr if syscall_signr != 0 in
>> the _exit()? I think everything else from audit_log_abend() will end
>> up in the _exit() report.
>>
>> > It is my fault (85e7bac3) that we have syscall at all, but I'm on a new
>> > crusade to remove audit record duplication.  So I'd happily see a patch
>> > in this series that removes that instead of adds to it.
>>
>> Well, I think the abend reporting is nice; I'd hate to see that
>> totally removed. The seccomp case is a bit different, I agree. I could
>> see it either way.
>
> Once again I send you down a bad path.  Your original patch was the
> best.  We should consider including a0-aX in a future version.  I was
> mistaken in foolishly believing that audit_syscall_entry() was done
> before secure_computing().  But if you look, that isn't the case.
> Please pretend I never said anything as you had it right the first time.
Heh, okay. But now I know more about audit, so that's good. :)
-Kees
--
Kees Cook
ChromeOS Security
Hello,
On Sat, February 25, 2012 04:21, Will Drewry wrote:
> @@ -169,6 +170,7 @@ void free_task(struct task_struct *tsk)
> free_thread_info(tsk->stack);
> rt_mutex_debug_task_free(tsk);
> ftrace_graph_exit_task(tsk);
> + put_seccomp_filter(tsk->seccomp.filter);
> free_task_struct(tsk);
> }
> EXPORT_SYMBOL(free_task);
> @@ -1113,6 +1115,7 @@ static struct task_struct *copy_process(unsigned long clone_flags,
> goto fork_out;
>
> ftrace_graph_init_task(p);
> + copy_seccomp(&p->seccomp, &current->seccomp);
I agree it's more symmetrical when get_seccomp_filter() is used here
directly instead of copy_seccomp(). That should put Kees at ease.
> +static void seccomp_filter_log_failure(int syscall)
> +{
> + int compat = 0;
> +#ifdef CONFIG_COMPAT
> + compat = is_compat_task();
> +#endif
> + pr_info("%s[%d]: %ssystem call %d blocked at 0x%lx\n",
> + current->comm, task_pid_nr(current),
> + (compat ? "compat " : ""),
> + syscall, KSTK_EIP(current));
> +}
This should be at least rate limited, but could be dropped altogether,
as it's mostly useful for debugging filters. There is no kernel message
when a process is killed because it exceeds a ulimit either. The death
by SIGSYS is hopefully clear enough for users, and filter writers can
return different errno values when debugging where it goes wrong.
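The kernel provides printk_ratelimited()/pr_info_ratelimited() for exactly this; the underlying idea is a small window-plus-burst limiter, sketched here in portable C with the clock passed in so the logic is testable (all names are made up):

```c
#include <assert.h>

/* Minimal window/burst limiter in the spirit of printk_ratelimited().
 * now_ms is a parameter so the logic can be exercised without a real clock. */
struct ratelimit {
        long interval_ms;       /* window length */
        int burst;              /* messages allowed per window */
        long window_start;      /* start of the current window */
        int printed;            /* messages emitted in the current window */
};

/* Returns 1 if the caller may emit a message now, 0 if suppressed. */
static int ratelimit_allow(struct ratelimit *rl, long now_ms)
{
        if (now_ms - rl->window_start >= rl->interval_ms) {
                rl->window_start = now_ms;      /* open a fresh window */
                rl->printed = 0;
        }
        if (rl->printed >= rl->burst)
                return 0;                       /* burst exhausted: drop */
        rl->printed++;
        return 1;
}
```

A filter-failure log wrapped in such a check can still help people debug filters without letting a tight loop of blocked syscalls flood the kernel log.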
> +/**
> + * bpf_load: checks and returns a pointer to the requested offset
> + * @nr: int syscall passed as a void * to bpf_run_filter
> + * @off: offset into struct seccomp_data to load from
Must be aligned, that's worth mentioning.
> + * @size: number of bytes to load (must be 4)
> + * @buffer: temporary storage supplied by bpf_run_filter (4 bytes)
'@buf'.
> +/**
> + * copy_seccomp: manages inheritance on fork
> + * @child: forkee's seccomp
> + * @parent: forker's seccomp
> + *
> + * Ensures that @child inherits filters if in use.
> + */
> +void copy_seccomp(struct seccomp *child, const struct seccomp *parent)
> +{
> + /* Other fields are handled by dup_task_struct. */
> + child->filter = get_seccomp_filter(parent->filter);
> +}
> +#endif /* CONFIG_SECCOMP_FILTER */
Yeah, just get rid of this and use get_seccomp_filter directly, and make
it return void instead of a pointer.
>
> /*
> * Secure computing mode 1 allows only read/write/exit/sigreturn.
> @@ -34,10 +293,11 @@ static int mode1_syscalls_32[] = {
> void __secure_computing(int this_syscall)
> {
> int mode = current->seccomp.mode;
> - int * syscall;
> + int exit_code = SIGKILL;
> + int *syscall;
>
> switch (mode) {
> - case 1:
> + case SECCOMP_MODE_STRICT:
> syscall = mode1_syscalls;
> #ifdef CONFIG_COMPAT
> if (is_compat_task())
> @@ -48,6 +308,14 @@ void __secure_computing(int this_syscall)
> return;
> } while (*++syscall);
> break;
> +#ifdef CONFIG_SECCOMP_FILTER
> + case SECCOMP_MODE_FILTER:
> + if (seccomp_run_filters(this_syscall) == SECCOMP_RET_ALLOW)
> + return;
> + seccomp_filter_log_failure(this_syscall);
> + exit_code = SIGSYS;
Wouldn't it make more sense to always kill with SIGSYS, also for mode 1?
I suppose it's too late for that now.
> +/**
> + * prctl_set_seccomp: configures current->seccomp.mode
> + * @seccomp_mode: requested mode to use
> + * @filter: optional struct sock_fprog for use with SECCOMP_MODE_FILTER
> + *
> + * This function may be called repeatedly with a @seccomp_mode of
> + * SECCOMP_MODE_FILTER to install additional filters. Every filter
> + * successfully installed will be evaluated (in reverse order) for each system
> + * call the task makes.
> + *
> + * It is not possible to transition change the current->seccomp.mode after
> + * one has been set on a task.
That reads awkwardly, do you mean the mode can't be changed once it's set?
---
Reviewed-by: Indan Zupancic <[email protected]>
All in all I'm quite happy with how the code is now, at least this bit.
I'm still unsure about the networking patch and the 32-bit BPF on 64-bit
archs mismatch.
As for the unlimited filter count, I'm not sure how to fix that.
The problem is that filters are inherited and shared. Limiting the
list length (tree depth) helps a bit, then the total memory usage
is limited by max number of processes. It may make sense to limit
the total instruction count instead of the number of filters.
Greetings,
Indan
On Mon, Feb 27, 2012 at 10:51 PM, Indan Zupancic <[email protected]> wrote:
> On Sat, February 25, 2012 04:21, Will Drewry wrote:
>> @@ -169,6 +170,7 @@ void free_task(struct task_struct *tsk)
>>       free_thread_info(tsk->stack);
>>       rt_mutex_debug_task_free(tsk);
>>       ftrace_graph_exit_task(tsk);
>> +     put_seccomp_filter(tsk->seccomp.filter);
>>       free_task_struct(tsk);
>> }
>> EXPORT_SYMBOL(free_task);
>> @@ -1113,6 +1115,7 @@ static struct task_struct *copy_process(unsigned long clone_flags,
>>               goto fork_out;
>>
>>       ftrace_graph_init_task(p);
>> +     copy_seccomp(&p->seccomp, &current->seccomp);
>
> I agree it's more symmetrical when get_seccomp_filter() is used here
> directly instead of copy_seccomp(). That should put Kees at ease.
Yeah, that does feel more symmetrical.
>> +static void seccomp_filter_log_failure(int syscall)
>> +{
>> +       int compat = 0;
>> +#ifdef CONFIG_COMPAT
>> +       compat = is_compat_task();
>> +#endif
>> +       pr_info("%s[%d]: %ssystem call %d blocked at 0x%lx\n",
>> +               current->comm, task_pid_nr(current),
>> +               (compat ? "compat " : ""),
>> +               syscall, KSTK_EIP(current));
>> +}
>
> This should be at least rate limited, but could be dropped altogether,
> as it's mostly useful for debugging filters. There is no kernel message
> when a process is killed because it exceeds a ulimit either. The death
> by SIGSYS is hopefully clear enough for users, and filter writers can
> return different errno values when debugging where it goes wrong.
I've already sent a patch to take care of this. It was redundant with
the later call to audit_seccomp() on the exit path.
https://lkml.org/lkml/2012/2/26/70
https://lkml.org/lkml/2012/2/27/369
>> @@ -48,6 +308,14 @@ void __secure_computing(int this_syscall)
>>                               return;
>>               } while (*++syscall);
>>               break;
>> +#ifdef CONFIG_SECCOMP_FILTER
>> +     case SECCOMP_MODE_FILTER:
>> +             if (seccomp_run_filters(this_syscall) == SECCOMP_RET_ALLOW)
>> +                     return;
>> +             seccomp_filter_log_failure(this_syscall);
>> +             exit_code = SIGSYS;
>
> Wouldn't it make more sense to always kill with SIGSYS, also for mode 1?
> I suppose it's too late for that now.
Right, this should (somewhat unfortunately) stay SIGKILL for mode 1.
-Kees
--
Kees Cook
ChromeOS Security
On 02/27, Will Drewry wrote:
>
> On Mon, Feb 27, 2012 at 11:09 AM, Oleg Nesterov <[email protected]> wrote:
>
> >> +static long seccomp_attach_filter(struct sock_fprog *fprog)
> >> +{
> >> +     struct seccomp_filter *filter;
> >> +     unsigned long fp_size = fprog->len * sizeof(struct sock_filter);
> >> +     long ret;
> >> +
> >> +     if (fprog->len == 0 || fprog->len > BPF_MAXINSNS)
> >> +             return -EINVAL;
> >
> > OK, this limits the memory PR_SET_SECCOMP can use.
> >
> > But,
> >
> >> +     /*
> >> +      * If there is an existing filter, make it the prev and don't drop its
> >> +      * task reference.
> >> +      */
> >> +     filter->prev = current->seccomp.filter;
> >> +     current->seccomp.filter = filter;
> >> +     return 0;
> >
> > this doesn't limit the number of filters, looks like a DoS.
> >
> > What if the application simply does prctl(PR_SET_SECCOMP, dummy_filter)
> > in an endless loop?
>
> It consumes a massive amount of kernel memory and, maybe, the OOM
> killer gives it a boot :)
maybe ;) but most probably the oom-killer kills another innocent task;
this memory is not accounted.
> I wasn't sure what the normal convention was for avoiding memory
> consumption by user processes. Should I just add a sysctl
Perhaps we can add a sysctl later, but personally I think that we
can start with some "arbitrary" #define BPF_MAXFILTERS.
> and a
> per-task counter for the max number of filters?
Do we really need the counter? attach_filter is not the fast path,
perhaps seccomp_attach_filter() could simply iterate the chain and
count the number?
In any case, if this hurts performance-wise then seccomp_run_filters()
has even more problems.
> I'm fine doing whatever makes sense here.
I am fine either way too.
Oleg.
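A userspace sketch of the chain walk Oleg describes: no per-task counter, just walk ->prev at attach time and refuse once a cap is hit. Only the BPF_MAXFILTERS name comes from the mail above; the struct layout and error value are illustrative:

```c
#define BPF_MAXFILTERS 64

struct seccomp_filter {
	struct seccomp_filter *prev;	/* previously attached filter, or NULL */
};

static int filter_chain_len(const struct seccomp_filter *f)
{
	int n = 0;

	for (; f; f = f->prev)
		n++;
	return n;
}

/* 0 if one more filter may be attached on top of @head, -1 otherwise */
static int may_attach(const struct seccomp_filter *head)
{
	return filter_chain_len(head) < BPF_MAXFILTERS ? 0 : -1;
}
```

Since attach is not a fast path, the O(n) walk costs nothing that matters.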
On 02/27, Will Drewry wrote:
>
> On Mon, Feb 27, 2012 at 11:22 AM, Oleg Nesterov <[email protected]> wrote:
> > On 02/24, Will Drewry wrote:
> >>
> >> To ensure that SIGSYS delivery occurs on return from the triggering
> >> system call, SIGSYS is added to the SYNCHRONOUS_MASK macro.
> >
> > Hmm. Can't understand... please help.
> >
> >>  #define SYNCHRONOUS_MASK \
> >>       (sigmask(SIGSEGV) | sigmask(SIGBUS) | sigmask(SIGILL) | \
> >> -      sigmask(SIGTRAP) | sigmask(SIGFPE))
> >> +      sigmask(SIGTRAP) | sigmask(SIGFPE) | sigmask(SIGSYS))
> >
> > Why?
> >
> > SYNCHRONOUS_MASK just tells dequeue_signal() "pick them first".
> > This is needed to make sure that the handler for, say SIGSEGV,
> > can use ucontext->ip as a faulting addr.
>
> I think that Roland covered this. (Since the syscall_rollback was
> called it's nice to let our handler get first go.)
OK, except I do not really understand the "our handler get first go".
Suppose SIGSYS "races" with the pending SIGHUP. With this change
the caller for SIGHUP will be called first. But yes, setup_frame()
will be called for SIGSYS first. Hopefully this is what you want.
> > But seccomp adds info->si_call_addr, this looks unneeded.
>
> True enough. I can drop it.
Hmm. I meant, the change in SYNCHRONOUS_MASK looks unneeded. Please
keep ->si_call_addr, it is much more convenient than ucontext_t in
userspace.
> It'd only be useful if the SIGSYS wasn't
> being forced and the signal was being handled without ucontext_t
> access.
No, force_ doesn't make any difference in this sense...
In short, the patch looks fine to me, but if you resend it, maybe
you can update the changelog.
Oleg.
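For clarity on what the hunk actually changes: SYNCHRONOUS_MASK is plain bit arithmetic, and adding SIGSYS only affects dequeue order, not delivery semantics. A minimal userspace model, where SIGMASK mirrors the kernel's sigmask() macro (bit (sig - 1)):

```c
#include <signal.h>

/* Bit (sig - 1) of the mask, as in kernel/signal.c's sigmask(). */
#define SIGMASK(sig) (1UL << ((sig) - 1))

/* The mask after the patch: SIGSYS joins the synchronous set. */
#define SYNCHRONOUS_MASK \
	(SIGMASK(SIGSEGV) | SIGMASK(SIGBUS) | SIGMASK(SIGILL) | \
	 SIGMASK(SIGTRAP) | SIGMASK(SIGFPE) | SIGMASK(SIGSYS))

static int is_synchronous(int sig)
{
	return (SYNCHRONOUS_MASK & SIGMASK(sig)) != 0;
}
```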
On 02/27, Will Drewry wrote:
> On Mon, Feb 27, 2012 at 11:54 AM, Oleg Nesterov <[email protected]> wrote:
> > On 02/24, Will Drewry wrote:
> >>
> >>  arch/Kconfig              |    1 +
> >>  include/linux/ptrace.h    |    7 +++++--
> >>  include/linux/seccomp.h   |    4 +++-
> >>  include/linux/tracehook.h |    6 ++++++
> >>  kernel/ptrace.c           |    4 ++++
> >>  kernel/seccomp.c          |   18 ++++++++++++++++++
> >
> > FYI, this conflicts with the changes -mm tree.
> >
> > The changes in ptrace.* conflict with Denys's
> > "ptrace: simplify PTRACE_foo constants and PTRACE_SETOPTIONS code"
> >
> > The change in tracehook.h conflicts with
> > "ptrace: the killed tracee should not enter the syscall"
>
> What's the best way to reconcile this in this day and age?
Of course I'd prefer if you make this change on top of Denys's patch ;)
Besides, if you agree with PTRACE_EVENT_SECCOMP/PTRACE_O_SECCOMP you
need only one trivial change in ptrace.h.
> I don't see
> these in kernel-next yet and I can't tell if there is a public -mm
> anywhere anymore.
Strange... I didn't check, but every patch in
http://marc.info/?l=linux-mm-commits has this note:
The -mm tree is included into linux-next and is updated
there every 3-4 working days
> >> --- a/kernel/seccomp.c
> >> +++ b/kernel/seccomp.c
> >> @@ -354,6 +354,24 @@ int __secure_computing_int(int this_syscall)
> >>                       seccomp_send_sigsys(this_syscall, reason_code);
> >>                       return -1;
> >>               }
> >> +             case SECCOMP_RET_TRACE: {
> >> +                     int ret;
> >> +                     struct pt_regs *regs = task_pt_regs(current);
> >> +                     if (!(test_tsk_thread_flag(current, TIF_SYSCALL_TRACE)) ||
> >> +                         !(current->ptrace & PT_TRACE_SECCOMP))
> >> +                             return -1;
> >> +                     /*
> >> +                      * PT_TRACE_SECCOMP and seccomp.trace indicate whether
> >> +                      * tracehook_report_syscall_entry needs to signal the
> >> +                      * tracer.  This avoids race conditions in hand off and
> >> +                      * the requirement for TIF_SYSCALL_TRACE ensures that
> >> +                      * we are in the syscall slow path.
> >> +                      */
> >> +                     current->seccomp.trace = 1;
> >> +                     ret = tracehook_report_syscall_entry(regs);
> >> +                     current->seccomp.trace = 0;
> >> +                     return ret;
> >
> > To be honest, this interface looks a bit strange to me...
> >
> > Once again, sorry if this was already discussed. But perhaps it would
> > be better to introduce PTRACE_EVENT_SECCOMP/PTRACE_O_SECCOMP instead?
> >
> > SECCOMP_RET_TRACE: could simply do ptrace_event(PTRACE_EVENT_SECCOMP)
> > unconditionally. The tracer can set the option and do PTRACE_CONT if it
> > doesn't want the system call notifications.
>
> Works for me - this also gets rid of the extra int for brief state
> tracking. I'll switch over to that in the next rev.
Great. In this case this patch becomes really trivial. Just 2 defines
in ptrace.h and the unconditional ptrace_event() under SECCOMP_RET_TRACE.
But probably you should check fatal_signal_pending(current) after
ptrace_event() returns, since ptrace_event() returns void.
Oleg.
On Tue, Feb 28, 2012 at 10:43 AM, Oleg Nesterov <[email protected]> wrote:
> On 02/27, Will Drewry wrote:
>> On Mon, Feb 27, 2012 at 11:54 AM, Oleg Nesterov <[email protected]> wrote:
>> > On 02/24, Will Drewry wrote:
>> >>
>> >>  arch/Kconfig              |    1 +
>> >>  include/linux/ptrace.h    |    7 +++++--
>> >>  include/linux/seccomp.h   |    4 +++-
>> >>  include/linux/tracehook.h |    6 ++++++
>> >>  kernel/ptrace.c           |    4 ++++
>> >>  kernel/seccomp.c          |   18 ++++++++++++++++++
>> >
>> > FYI, this conflicts with the changes -mm tree.
>> >
>> > The changes in ptrace.* conflict with Denys's
>> > "ptrace: simplify PTRACE_foo constants and PTRACE_SETOPTIONS code"
>> >
>> > The change in tracehook.h conflicts with
>> > "ptrace: the killed tracee should not enter the syscall"
>>
>> What's the best way to reconcile this in this day and age?
>
> Of course I'd prefer if you make this change on top of Denys's patch ;)
>
> Besides, if you agree with PTRACE_EVENT_SECCOMP/PTRACE_O_SECCOMP you
> need only one trivial change in ptrace.h.
I think that works quite well :)
>> I don't see
>> these in kernel-next yet and I can't tell if there is a public -mm
>> anywhere anymore.
>
> Strange... I didn't check, but every patch in
> http://marc.info/?l=linux-mm-commits has this note:
>
>        The -mm tree is included into linux-next and is updated
>        there every 3-4 working days
It appears to have been pulled in ~8 hours ago. I'm rebasing to next now.
>> >> --- a/kernel/seccomp.c
>> >> +++ b/kernel/seccomp.c
>> >> @@ -354,6 +354,24 @@ int __secure_computing_int(int this_syscall)
>> >>                       seccomp_send_sigsys(this_syscall, reason_code);
>> >>                       return -1;
>> >>               }
>> >> +             case SECCOMP_RET_TRACE: {
>> >> +                     int ret;
>> >> +                     struct pt_regs *regs = task_pt_regs(current);
>> >> +                     if (!(test_tsk_thread_flag(current, TIF_SYSCALL_TRACE)) ||
>> >> +                         !(current->ptrace & PT_TRACE_SECCOMP))
>> >> +                             return -1;
>> >> +                     /*
>> >> +                      * PT_TRACE_SECCOMP and seccomp.trace indicate whether
>> >> +                      * tracehook_report_syscall_entry needs to signal the
>> >> +                      * tracer.  This avoids race conditions in hand off and
>> >> +                      * the requirement for TIF_SYSCALL_TRACE ensures that
>> >> +                      * we are in the syscall slow path.
>> >> +                      */
>> >> +                     current->seccomp.trace = 1;
>> >> +                     ret = tracehook_report_syscall_entry(regs);
>> >> +                     current->seccomp.trace = 0;
>> >> +                     return ret;
>> >
>> > To be honest, this interface looks a bit strange to me...
>> >
>> > Once again, sorry if this was already discussed. But perhaps it would
>> > be better to introduce PTRACE_EVENT_SECCOMP/PTRACE_O_SECCOMP instead?
>> >
>> > SECCOMP_RET_TRACE: could simply do ptrace_event(PTRACE_EVENT_SECCOMP)
>> > unconditionally. The tracer can set the option and do PTRACE_CONT if it
>> > doesn't want the system call notifications.
>>
>> Works for me - this also gets rid of the extra int for brief state
>> tracking. I'll switch over to that in the next rev.
>
> Great. In this case this patch becomes really trivial. Just 2 defines
> in ptrace.h and the unconditional ptrace_event() under SECCOMP_RET_TRACE.
>
> But probably you should check fatal_signal_pending(current) after
> ptrace_event() returns, since ptrace_event() returns void.
Ah yeah - I don't want to reintroduce that issue :)
Thanks!
On Tue, Feb 28, 2012 at 10:02 AM, Oleg Nesterov <[email protected]> wrote:
> On 02/27, Will Drewry wrote:
>>
>> On Mon, Feb 27, 2012 at 11:22 AM, Oleg Nesterov <[email protected]> wrote:
>> > On 02/24, Will Drewry wrote:
>> >>
>> >> To ensure that SIGSYS delivery occurs on return from the triggering
>> >> system call, SIGSYS is added to the SYNCHRONOUS_MASK macro.
>> >
>> > Hmm. Can't understand... please help.
>> >
>> >>  #define SYNCHRONOUS_MASK \
>> >>       (sigmask(SIGSEGV) | sigmask(SIGBUS) | sigmask(SIGILL) | \
>> >> -      sigmask(SIGTRAP) | sigmask(SIGFPE))
>> >> +      sigmask(SIGTRAP) | sigmask(SIGFPE) | sigmask(SIGSYS))
>> >
>> > Why?
>> >
>> > SYNCHRONOUS_MASK just tells dequeue_signal() "pick them first".
>> > This is needed to make sure that the handler for, say SIGSEGV,
>> > can use ucontext->ip as a faulting addr.
>>
>> I think that Roland covered this.  (Since the syscall_rollback was
>> called it's nice to let our handler get first go.)
>
> OK, except I do not really understand the "our handler get first go".
Err I meant "gets to go first".
> Suppose SIGSYS "races" with the pending SIGHUP. With this change
> the caller for SIGHUP will be called first. But yes, setup_frame()
> will be called for SIGSYS first. Hopefully this is what you want.
I believe it is. I just want ucontext_t to be properly populated since
the registers were just rolled back for it.
>> > But seccomp adds info->si_call_addr, this looks unneeded.
>>
>> True enough.  I can drop it.
>
> Hmm. I meant, the change in SYNCHRONOUS_MASK looks unneeded. Please
> keep ->si_call_addr, it is much more convenient than ucontext_t in
> userspace.
Sorry for the confusion.
>> It'd only be useful if the SIGSYS wasn't
>> being forced and the signal was being handled without ucontext_t
>> access.
>
> No, force_ doesn't make any difference in this sense...
I guess I was thinking about users of signalfd wanting the call site,
but force_ avoids it being blockable, so that seems largely irrelevant
at present.
> In short, the patch looks fine to me, but if you resend it may be
> you can update the changelog.
Thanks! I will try to clarify the changelog.
will
On Tue, Feb 28, 2012 at 12:51 AM, Indan Zupancic <[email protected]> wrote:
> Hello,
>
> On Sat, February 25, 2012 04:21, Will Drewry wrote:
>> @@ -169,6 +170,7 @@ void free_task(struct task_struct *tsk)
>>       free_thread_info(tsk->stack);
>>       rt_mutex_debug_task_free(tsk);
>>       ftrace_graph_exit_task(tsk);
>> +     put_seccomp_filter(tsk->seccomp.filter);
>>       free_task_struct(tsk);
>> }
>> EXPORT_SYMBOL(free_task);
>> @@ -1113,6 +1115,7 @@ static struct task_struct *copy_process(unsigned long clone_flags,
>>               goto fork_out;
>>
>>       ftrace_graph_init_task(p);
>> +     copy_seccomp(&p->seccomp, &current->seccomp);
>
> I agree it's more symmetrical when get_seccomp_filter() is used here
> directly instead of copy_seccomp(). That should put Kees at ease.
Makes sense to me too. Once I'd dropped the other bits, it seemed
silly to keep copy_seccomp.
>> +static void seccomp_filter_log_failure(int syscall)
>> +{
>> +     int compat = 0;
>> +#ifdef CONFIG_COMPAT
>> +     compat = is_compat_task();
>> +#endif
>> +     pr_info("%s[%d]: %ssystem call %d blocked at 0x%lx\n",
>> +             current->comm, task_pid_nr(current),
>> +             (compat ? "compat " : ""),
>> +             syscall, KSTK_EIP(current));
>> +}
>
> This should be at least rate limited, but could be dropped altogether,
> as it's mostly useful for debugging filters. There is no kernel message
> when a process is killed because it exceeds a ulimit either. The death
> by SIGSYS is hopefully clear enough for users, and filter writers can
> return different errno values when debugging where it goes wrong.
I'll pull Kees's patch into the series this next go-round.
>> +/**
>> + * bpf_load: checks and returns a pointer to the requested offset
>> + * @nr: int syscall passed as a void * to bpf_run_filter
>> + * @off: offset into struct seccomp_data to load from
>
> Must be aligned, that's worth mentioning.
True - thanks!
>> + * @size: number of bytes to load (must be 4)
>> + * @buffer: temporary storage supplied by bpf_run_filter (4 bytes)
>
> '@buf'.
Oops - fixed.
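The @off/@size contract under discussion can be modeled in a few lines. The struct below is a simplified stand-in for the patch's seccomp_data (the real layout differs), and load_ok is a hypothetical helper, not the patch's bpf_load:

```c
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in layout: a syscall number plus 32-bit argument
 * slots, all naturally 4-byte aligned. */
struct seccomp_data {
	int nr;			/* system call number */
	uint32_t args[6];	/* argument slots (stand-in layout) */
};

/* Loads must be 4 bytes wide, 4-byte aligned, and inside the struct. */
static int load_ok(size_t off, size_t size)
{
	return size == 4 &&
	       off % 4 == 0 &&
	       off + size <= sizeof(struct seccomp_data);
}
```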
>> +/**
>> + * copy_seccomp: manages inheritance on fork
>> + * @child: forkee's seccomp
>> + * @parent: forker's seccomp
>> + *
>> + * Ensures that @child inherits filters if in use.
>> + */
>> +void copy_seccomp(struct seccomp *child, const struct seccomp *parent)
>> +{
>> + ? ? /* Other fields are handled by dup_task_struct. */
>> + ? ? child->filter = get_seccomp_filter(parent->filter);
>> +}
>> +#endif ? ? ? /* CONFIG_SECCOMP_FILTER */
>
> Yeah, just get rid of this and use get_seccomp_filter directly, and make
> it return void instead of a pointer.
It'll be updated.
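A userspace sketch of the fork-time sharing being agreed on here: the child takes a reference on the parent's head filter rather than copying the chain, and dropping the head may cascade down ->prev. The function names mirror the patch, but the plain-int refcount and free() are stand-ins for the kernel's atomic_t and kfree():

```c
#include <stdlib.h>

struct seccomp_filter {
	int usage;			/* reference count */
	struct seccomp_filter *prev;	/* older filter in the chain */
};

static void get_seccomp_filter(struct seccomp_filter *f)
{
	if (f)
		f->usage++;	/* one more task shares this chain */
}

static void put_seccomp_filter(struct seccomp_filter *f)
{
	/* Freeing a filter drops the reference it holds on ->prev. */
	while (f && --f->usage == 0) {
		struct seccomp_filter *prev = f->prev;

		free(f);
		f = prev;
	}
}
```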
>>
>> /*
>>  * Secure computing mode 1 allows only read/write/exit/sigreturn.
>> @@ -34,10 +293,11 @@ static int mode1_syscalls_32[] = {
>> void __secure_computing(int this_syscall)
>> {
>>       int mode = current->seccomp.mode;
>> -     int * syscall;
>> +     int exit_code = SIGKILL;
>> +     int *syscall;
>>
>>       switch (mode) {
>> -     case 1:
>> +     case SECCOMP_MODE_STRICT:
>>               syscall = mode1_syscalls;
>> #ifdef CONFIG_COMPAT
>>               if (is_compat_task())
>> @@ -48,6 +308,14 @@ void __secure_computing(int this_syscall)
>>                       return;
>>               } while (*++syscall);
>>               break;
>> +#ifdef CONFIG_SECCOMP_FILTER
>> +     case SECCOMP_MODE_FILTER:
>> +             if (seccomp_run_filters(this_syscall) == SECCOMP_RET_ALLOW)
>> +                     return;
>> +             seccomp_filter_log_failure(this_syscall);
>> +             exit_code = SIGSYS;
>
> Wouldn't it make more sense to always kill with SIGSYS, also for mode 1?
> I suppose it's too late for that now.
It would but I don't want to go and change existing ABI.
>> +/**
>> + * prctl_set_seccomp: configures current->seccomp.mode
>> + * @seccomp_mode: requested mode to use
>> + * @filter: optional struct sock_fprog for use with SECCOMP_MODE_FILTER
>> + *
>> + * This function may be called repeatedly with a @seccomp_mode of
>> + * SECCOMP_MODE_FILTER to install additional filters.  Every filter
>> + * successfully installed will be evaluated (in reverse order) for each system
>> + * call the task makes.
>> + *
>> + * It is not possible to transition change the current->seccomp.mode after
>> + * one has been set on a task.
>
> That reads awkwardly, do you mean the mode can't be changed once it's set?
Yup - I will fix that. It doesn't even make sense :)
> ---
>
> Reviewed-by: Indan Zupancic <[email protected]>
Thanks!
> All in all I'm quite happy with how the code is now, at least this bit.
> I'm still unsure about the networking patch and the 32-bit BPF on 64-bit
> archs mismatch.
Yeah - that is really the most challenging set of compromises, but I
think they are the right ones.
> As for the unlimited filter count, I'm not sure how to fix that.
> The problem is that filters are inherited and shared. Limiting the
> list length (tree depth) helps a bit, then the total memory usage
> is limited by max number of processes. It may make sense to limit
> the total instruction count instead of the number of filters.
I'll take a stab at tree path size for starters and hopefully we can
encourage consumers of the API to check for errors on return.
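One way to read "limit the total instruction count": sum len over the inherited chain at attach time instead of counting filters. MAX_INSNS_TOTAL and the struct layout below are illustrative, not from the patch:

```c
/* Cap on total BPF instructions across the whole inherited chain. */
#define MAX_INSNS_TOTAL 32768

struct filter {
	unsigned int len;	/* instruction count of this program */
	struct filter *prev;	/* older filter in the chain */
};

static long total_insns(const struct filter *f)
{
	long total = 0;

	for (; f; f = f->prev)
		total += f->len;
	return total;
}

/* nonzero if a new @new_len-instruction filter still fits under the cap */
static int may_attach_insns(const struct filter *head, unsigned int new_len)
{
	return total_insns(head) + new_len <= MAX_INSNS_TOTAL;
}
```

This bounds memory per chain rather than per filter, which maps better onto the shared-inheritance concern above.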
Thanks for the continued review and feedback!
will
On Tue, Feb 28, 2012 at 9:13 AM, Oleg Nesterov <[email protected]> wrote:
> On 02/27, Will Drewry wrote:
>>
>> On Mon, Feb 27, 2012 at 11:09 AM, Oleg Nesterov <[email protected]> wrote:
>>
>> >> +static long seccomp_attach_filter(struct sock_fprog *fprog)
>> >> +{
>> >> +     struct seccomp_filter *filter;
>> >> +     unsigned long fp_size = fprog->len * sizeof(struct sock_filter);
>> >> +     long ret;
>> >> +
>> >> +     if (fprog->len == 0 || fprog->len > BPF_MAXINSNS)
>> >> +             return -EINVAL;
>> >
>> > OK, this limits the memory PR_SET_SECCOMP can use.
>> >
>> > But,
>> >
>> >> +     /*
>> >> +      * If there is an existing filter, make it the prev and don't drop its
>> >> +      * task reference.
>> >> +      */
>> >> +     filter->prev = current->seccomp.filter;
>> >> +     current->seccomp.filter = filter;
>> >> +     return 0;
>> >
>> > this doesn't limit the number of filters, looks like a DoS.
>> >
>> > What if the application simply does prctl(PR_SET_SECCOMP, dummy_filter)
>> > in an endless loop?
>>
>> It consumes a massive amount of kernel memory and, maybe, the OOM
>> killer gives it a boot :)
>
> maybe ;) but most probably the oom-killer kills another innocent task;
> this memory is not accounted.
>
>> I wasn't sure what the normal convention was for avoiding memory
>> consumption by user processes. Should I just add a sysctl
>
> Perhaps we can add a sysctl later, but personally I think that we
> can start with some "arbitrary" #define BPF_MAXFILTERS.
Sounds good - I'll wire something like this up in the next round.
>> and a
>> per-task counter for the max number of filters?
>
> Do we really need the counter? attach_filter is not the fast path,
> perhaps seccomp_attach_filter() could simply iterate the chain and
> count the number?
>
> In any case, if this hurts performance-wise then seccomp_run_filters()
> has even more problems.
>
>> I'm fine doing whatever makes sense here.
>
> I am fine either way too.
>
> Oleg.
>
On Tue, Feb 28, 2012 at 09:17, Will Drewry <[email protected]> wrote:
> On Tue, Feb 28, 2012 at 12:51 AM, Indan Zupancic <[email protected]> wrote:
>> Wouldn't it make more sense to always kill with SIGSYS, also for mode 1?
>> I suppose it's too late for that now.
>
> It would but I don't want to go and change existing ABI.
Just for the record, when this discussion came up about two years ago,
it was established that other than Chrome, there are currently no
users of Seccomp, as mode 1 is just too difficult to use for most
potential applications.
Chrome doesn't rely on the exit status of a process that violated the
seccomp restrictions, so it is highly unlikely you would break
existing userspace, if you changed the ABI. On the other hand, as mode
2 is so much more powerful than mode 1, it is unlikely that we will
see future userspace applications make use of mode 1.
In other words, changing the ABI is unlikely to cause harm, but also
unlikely to do any good. I don't have any strong opinion either way;
just providing an additional data point.
Markus
On Tue, Feb 28, 2012 at 11:04 AM, Will Drewry <[email protected]> wrote:
> On Tue, Feb 28, 2012 at 10:43 AM, Oleg Nesterov <[email protected]> wrote:
>> On 02/27, Will Drewry wrote:
>>> On Mon, Feb 27, 2012 at 11:54 AM, Oleg Nesterov <[email protected]> wrote:
>>> > On 02/24, Will Drewry wrote:
>>> >>
>>> >>  arch/Kconfig              |    1 +
>>> >>  include/linux/ptrace.h    |    7 +++++--
>>> >>  include/linux/seccomp.h   |    4 +++-
>>> >>  include/linux/tracehook.h |    6 ++++++
>>> >>  kernel/ptrace.c           |    4 ++++
>>> >>  kernel/seccomp.c          |   18 ++++++++++++++++++
>>> >
>>> > FYI, this conflicts with the changes -mm tree.
>>> >
>>> > The changes in ptrace.* conflict with Denys's
>>> > "ptrace: simplify PTRACE_foo constants and PTRACE_SETOPTIONS code"
>>> >
>>> > The change in tracehook.h conflicts with
>>> > "ptrace: the killed tracee should not enter the syscall"
>>>
>>> What's the best way to reconcile this in this day and age?
>>
>> Of course I'd prefer if you make this change on top of Denys's patch ;)
>>
>> Besides, if you agree with PTRACE_EVENT_SECCOMP/PTRACE_O_SECCOMP you
>> need only one trivial change in ptrace.h.
>
> I think that works quite well :)
>
>>> I don't see
>>> these in kernel-next yet and I can't tell if there is a public -mm
>>> anywhere anymore.
>>
>> Strange... I didn't check, but every patch in
>> http://marc.info/?l=linux-mm-commits has this note:
>>
>>        The -mm tree is included into linux-next and is updated
>>        there every 3-4 working days
>
> It appears to have been pulled in ~8 hours ago.  I'm rebasing to next now.
>
>>> >> --- a/kernel/seccomp.c
>>> >> +++ b/kernel/seccomp.c
>>> >> @@ -354,6 +354,24 @@ int __secure_computing_int(int this_syscall)
>>> >>                       seccomp_send_sigsys(this_syscall, reason_code);
>>> >>                       return -1;
>>> >>               }
>>> >> +             case SECCOMP_RET_TRACE: {
>>> >> +                     int ret;
>>> >> +                     struct pt_regs *regs = task_pt_regs(current);
>>> >> +                     if (!(test_tsk_thread_flag(current, TIF_SYSCALL_TRACE)) ||
>>> >> +                         !(current->ptrace & PT_TRACE_SECCOMP))
>>> >> +                             return -1;
>>> >> +                     /*
>>> >> +                      * PT_TRACE_SECCOMP and seccomp.trace indicate whether
>>> >> +                      * tracehook_report_syscall_entry needs to signal the
>>> >> +                      * tracer.  This avoids race conditions in hand off and
>>> >> +                      * the requirement for TIF_SYSCALL_TRACE ensures that
>>> >> +                      * we are in the syscall slow path.
>>> >> +                      */
>>> >> +                     current->seccomp.trace = 1;
>>> >> +                     ret = tracehook_report_syscall_entry(regs);
>>> >> +                     current->seccomp.trace = 0;
>>> >> +                     return ret;
>>> >
>>> > To be honest, this interface looks a bit strange to me...
>>> >
>>> > Once again, sorry if this was already discussed. But perhaps it would
>>> > be better to introduce PTRACE_EVENT_SECCOMP/PTRACE_O_SECCOMP instead?
>>> >
>>> > SECCOMP_RET_TRACE: could simply do ptrace_event(PTRACE_EVENT_SECCOMP)
>>> > unconditionally. The tracer can set the option and do PTRACE_CONT if it
>>> > doesn't want the system call notifications.
>>>
>>> Works for me - this also gets rid of the extra int for brief state
>>> tracking. I'll switch over to that in the next rev.
>>
>> Great. In this case this patch becomes really trivial. Just 2 defines
>> in ptrace.h and the unconditional ptrace_event() under SECCOMP_RET_TRACE.
hrm the only snag is that I can't then rely on TIF_SYSCALL_TRACE to
ensure seccomp is in the slow-path. Right now, on x86, seccomp is
slow-path, but it doesn't have to be to have the syscall and args.
However, for ptrace to behave properly, I believed it did need to be
in the slow path. If SECCOMP_RET_TRACE doesn't rely on
PTRACE_SYSCALL, then it introduces a need for seccomp to always be in
the slow path or to flag (somehow) when it needs slow path.
Any suggestions there?
Thanks!
will
On 02/28, Will Drewry wrote:
>
> On Tue, Feb 28, 2012 at 11:04 AM, Will Drewry <[email protected]> wrote:
> > On Tue, Feb 28, 2012 at 10:43 AM, Oleg Nesterov <[email protected]> wrote:
> >>
> >> Great. In this case this patch becomes really trivial. Just 2 defines
> >> in ptrace.h and the unconditional ptrace_event() under SECCOMP_RET_TRACE.
>
> hrm the only snag is that I can't then rely on TIF_SYSCALL_TRACE to
> ensure seccomp is in the slow-path. Right now, on x86, seccomp is
> slow-path, but it doesn't have to be to have the syscall and args.
> However, for ptrace to behave properly, I believed it did need to be
> in the slow path. If SECCOMP_RET_TRACE doesn't rely on
> PTRACE_SYSCALL, then it introduces a need for seccomp to always be in
> the slow path or to flag (somehow) when it needs slow path.
My understanding of this magic is very limited, and I'm afraid
I misunderstood... So please correct me.
But what is the problem? system_call checks _TIF_WORK_SYSCALL_ENTRY
which includes _TIF_SECCOMP | _TIF_SYSCALL_TRACE, and jumps to
tracesys which does SAVE_REST.
Anyway. secure_computing() is called by syscall_trace_enter() which
also calls tracehook_report_syscall_entry(). If SECCOMP_RET_TRACE
can't do ptrace_event() then why tracehook_report_syscall_entry() is
fine?
Oleg.
On Wed, Feb 29, 2012 at 10:14 AM, Oleg Nesterov <[email protected]> wrote:
> On 02/28, Will Drewry wrote:
>>
>> On Tue, Feb 28, 2012 at 11:04 AM, Will Drewry <[email protected]> wrote:
>> > On Tue, Feb 28, 2012 at 10:43 AM, Oleg Nesterov <[email protected]> wrote:
>> >>
>> >> Great. In this case this patch becomes really trivial. Just 2 defines
>> >> in ptrace.h and the unconditional ptrace_event() under SECCOMP_RET_TRACE.
>>
>> hrm the only snag is that I can't then rely on TIF_SYSCALL_TRACE to
>> ensure seccomp is in the slow-path.  Right now, on x86, seccomp is
>> slow-path, but it doesn't have to be to have the syscall and args.
>> However, for ptrace to behave properly, I believed it did need to be
>> in the slow path.  If SECCOMP_RET_TRACE doesn't rely on
>> PTRACE_SYSCALL, then it introduces a need for seccomp to always be in
>> the slow path or to flag (somehow) when it needs slow path.
>
> My understanding of this magic is very limited, and I'm afraid
> I misunderstood... So please correct me.
>
> But what is the problem? system_call checks _TIF_WORK_SYSCALL_ENTRY
> which includes _TIF_SECCOMP | _TIF_SYSCALL_TRACE, and jumps to
> tracesys which does SAVE_REST.
>
> Anyway. secure_computing() is called by syscall_trace_enter() which
> also calls tracehook_report_syscall_entry(). If SECCOMP_RET_TRACE
> can't do ptrace_event() then why tracehook_report_syscall_entry() is
> fine?
Early on in this patch series, I was urged away from regviews (for
many reasons), one of them was so that seccomp could, at some point,
be fast-path'd like audit is for x86. (It may be on arm already, I'd
need to check.) So I was hoping that I could avoid adding a slow-path
dependency to the seccomp code.
Right now, on x86, you are exactly right: Both seccomp and ptrace take
the slow path as part of _TIF_WORK_SYSCALL_ENTRY, and seccomp is only
called in syscall_trace_enter. By adding a requirement for the
slow-path in the form of ptrace_event(), the difficulty for making
seccomp fast-path friendly is increased. (It could be possible to add
a return code, e.g., return NEEDS_SLOW_PATH, which tells the fast path
code to restart the handling at syscall_trace_enter, so maybe I am
making a big deal out of nothing.)
I was hoping to avoid having TIF_SECCOMP imply the slow-path, but if
that is the only sane way to integrate, then I can leave making it
fast-path friendly as a future exercise.
If I'm over-optimizing, just say so, and I'll post the v12 with the
docs updated to indicate that, at present, seccomp filters requires
the slow path. However, if you see a nice way to avoid the
dependency, I'd love to know!
Thanks!
will
On 02/29, Will Drewry wrote:
> On Wed, Feb 29, 2012 at 10:14 AM, Oleg Nesterov <[email protected]> wrote:
> > On 02/28, Will Drewry wrote:
> >>
> >> On Tue, Feb 28, 2012 at 11:04 AM, Will Drewry <[email protected]> wrote:
> >> > On Tue, Feb 28, 2012 at 10:43 AM, Oleg Nesterov <[email protected]> wrote:
> >> >>
> >> >> Great. In this case this patch becomes really trivial. Just 2 defines
> >> >> in ptrace.h and the unconditional ptrace_event() under SECCOMP_RET_TRACE.
> >>
> >> hrm the only snag is that I can't then rely on TIF_SYSCALL_TRACE to
> >> ensure seccomp is in the slow-path.  Right now, on x86, seccomp is
> >> slow-path, but it doesn't have to be to have the syscall and args.
> >> However, for ptrace to behave properly, I believed it did need to be
> >> in the slow path.  If SECCOMP_RET_TRACE doesn't rely on
> >> PTRACE_SYSCALL, then it introduces a need for seccomp to always be in
> >> the slow path or to flag (somehow) when it needs slow path.
> >
> > My understanding of this magic is very limited, and I'm afraid
> > I misunderstood... So please correct me.
> >
> > But what is the problem? system_call checks _TIF_WORK_SYSCALL_ENTRY
> > which includes _TIF_SECCOMP | _TIF_SYSCALL_TRACE, and jumps to
> > tracesys which does SAVE_REST.
> >
> > Anyway. secure_computing() is called by syscall_trace_enter() which
> > also calls tracehook_report_syscall_entry(). If SECCOMP_RET_TRACE
> > can't do ptrace_event() then why tracehook_report_syscall_entry() is
> > fine?
>
> Early on in this patch series, I was urged away from regviews (for
> many reasons), one of them was so that seccomp could, at some point,
> be fast-path'd like audit is for x86. (It may be on arm already, I'd
> need to check.) So I was hoping that I could avoid adding a slow-path
> dependency to the seccomp code.
Thanks, now I see what you meant.
> By adding a requirement for the
> slow-path in the form of ptrace_event(), the difficulty for making
> seccomp fast-path friendly is increased.
Yes. I do not know if this is really bad. I mean, I do not know how
much we want the fast-path'd seccomp.
> (It could be possible to add
> a return code, e.g., return NEEDS_SLOW_PATH, which tells the fast path
> code to restart the handling at syscall_trace_enter, so maybe I am
> making a big deal out of nothing.)
Probably. Or SECCOMP_RET_TRACE can set TIF_NOTIFY_RESUME.
(Btw there is another alternative although imho PTRACE_EVENT_SECCOMP
looks better. SECCOMP_RET_TRACE can simply set TIF_SYSCALL_TRACE and
return, syscall_trace_enter() will report the syscall. If we make it
fast-path'ed, then TIF_NOTIFY_RESUME should trigger the slow path)
> I was hoping to avoid having TIF_SECCOMP imply the slow-path, but if
> that is the only sane way to integrate, then I can leave making it
> fast-path friendly as a future exercise.
>
> If I'm over-optimizing,
maybe ;) but it is very possible I underestimate the problem.
I'd like to know what Roland thinks.
Oleg.
I don't think TIF_NOTIFY_RESUME is apropos here. That only triggers on
returning to user mode, i.e. after syscall exit. But regardless of the
exact implementation details, I don't think it will be prohibitive to add
some means by which the fast-path can back off before actual syscall entry
and go to the slow path for ptrace reporting.
Since there is no strong reason to think it can't be reorganized that way
later, I don't see any good rationale for constraining the seccomp-filter
feature definition based on a plan to optimize the implementation in the
future.
Thanks,
Roland
On Wed, Feb 29, 2012 at 11:41 AM, Roland McGrath <[email protected]> wrote:
I don't think TIF_NOTIFY_RESUME is apropos here.  That only triggers on
returning to user mode, i.e. after syscall exit.  But regardless of the
> exact implementation details, I don't think it will be prohibitive to add
> some means by which the fast-path can back off before actual syscall entry
> and go to the slow path for ptrace reporting.
>
> Since there is no strong reason to think it can't be reorganized that way
> later, I don't see any good rationale for constraining the seccomp-filter
> feature definition based on a plan to optimize the implementation in the
> future.
Sounds good to me. I'll move to ptrace_event and save the problem of
code organization for the future.
Thanks!