2012-02-21 17:31:28

by Will Drewry

Subject: [PATCH v10 01/11] sk_run_filter: add support for custom load_pointer

This change allows CONFIG_SECCOMP to make use of BPF programs for
user-controlled system call filtering (as shown in this patch series).
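
For illustration only, a minimal userspace sketch of what a custom accessor
might look like for flat, non-skb data. The callback signature is copied from
struct bpf_load_fn in the diff below; struct flat_data and flat_load are
hypothetical names, not part of this patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flat record standing in for non-skb data (not in the patch). */
struct flat_data {
	const uint8_t *bytes;
	uint32_t len;
};

/*
 * Same shape as the load callback in struct bpf_load_fn: return a pointer
 * to `size` bytes at offset `k`, or NULL if the request is out of range.
 * `buffer` is unused here because the data is already contiguous.
 */
static void *flat_load(const void *data, int k, unsigned int size, void *buffer)
{
	const struct flat_data *d = data;

	(void)buffer;
	assert(size == 1 || size == 2 || size == 4);
	if (k < 0 || (uint32_t)k > d->len || size > d->len - (uint32_t)k)
		return NULL;
	return (void *)(d->bytes + k);
}
```

A NULL return corresponds to the "return 0" path in the new BPF_S_ANC_LD_*
cases, i.e. the filter rejects when a load falls outside the data.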

To minimize the impact on existing BPF evaluation, function pointer
use must be declared at sk_chk_filter-time. This allows ancillary
load instructions to be generated that use the function pointer rather
than adding _any_ code to the existing LD_* instruction paths.
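
A toy sketch of the check-time rewrite described above, assuming a made-up
three-opcode instruction set. The toy_* names and values are hypothetical;
only the idea (swap the opcode once at check time so the run loop never tests
a flag per instruction) is taken from this patch.

```c
#include <assert.h>
#include <stddef.h>

/* Toy opcode set; the names echo the patch, the values are made up. */
enum toy_op {
	TOY_LD_W_ABS,		/* load word via the skb path */
	TOY_ANC_LD_W_ABS,	/* load word via the custom callback */
	TOY_RET,
};

#define TOY_CHK_FLAGS_NO_SKB 1u

struct toy_insn {
	enum toy_op code;
	int k;
};

/*
 * Check-time pass: when NO_SKB is requested, rewrite each skb load into
 * its callback-based twin. Returns 0 on success, -1 if the program is
 * empty or does not end in a return.
 */
static int toy_chk_filter(struct toy_insn *prog, size_t len, unsigned int flags)
{
	size_t pc;

	assert(prog != NULL);
	if (len == 0 || prog[len - 1].code != TOY_RET)
		return -1;
	for (pc = 0; pc < len; pc++) {
		if (prog[pc].code == TOY_LD_W_ABS &&
		    (flags & TOY_CHK_FLAGS_NO_SKB))
			prog[pc].code = TOY_ANC_LD_W_ABS;
	}
	return 0;
}
```

With flags == 0 the program is left untouched, which is why the existing
LD_* paths pay nothing for the new feature.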

(v8) Crude performance numbers using udpflood -l 10000000 against
dummy0. 3 trials for baseline, 3 with tcpdump. Averaged, then
differenced. Trials with hard-to-believe results were repeated at
least a couple more times.

* x86 32-bit (Atom N570 @ 1.66 GHz 2 core HT) [stackprot]:
- Without: 94.05s - 76.36s = 17.68s
- With: 86.22s - 73.30s = 12.92s
- Slowdown per call: -476 nanoseconds

* x86 32-bit (Atom N570 @ 1.66 GHz 2 core HT) [no stackprot]:
- Without: 92.06s - 77.81s = 14.25s
- With: 91.77s - 76.91s = 14.86s
- Slowdown per call: +61 nanoseconds

* x86 64-bit (Atom N570 @ 1.66 GHz 2 core HT) [stackprot]:
- Without: 122.58s - 99.54s = 23.04s
- With: 115.52s - 98.99s = 16.53s
- Slowdown per call: -651 nanoseconds

* x86 64-bit (Atom N570 @ 1.66 GHz 2 core HT) [no stackprot]:
- Without: 114.95s - 91.92s = 23.03s
- With: 110.47s - 90.79s = 19.68s
- Slowdown per call: -335 nanoseconds
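
The per-call figures above appear to be the difference between the "With" and
"Without" deltas, divided by the 10000000 packets sent. A small sketch of that
arithmetic follows; slowdown_ns_per_call is a hypothetical helper and the
rounding rule is my assumption.

```c
#include <assert.h>

/*
 * Per-call slowdown in nanoseconds, rounded to the nearest integer.
 * Each *_s argument is a "tcpdump minus baseline" delta in seconds,
 * measured over `calls` packets. Negative means a speed-up.
 */
static long slowdown_ns_per_call(double without_s, double with_s, long calls)
{
	double ns;

	assert(calls > 0);
	ns = (with_s - without_s) * 1e9 / (double)calls;
	return (long)(ns < 0 ? ns - 0.5 : ns + 0.5);
}
```

For example, slowdown_ns_per_call(17.68, 12.92, 10000000) gives -476, matching
the first x86 32-bit entry.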

The x86-32-nossp result makes sense: added register pressure always
makes x86-32 sad. If this is a concern, I could change the call
approach to bpf_run_filter to see if I can alleviate it a bit.

That said, the x86-*-ssp numbers show a marked increase in performance.
I've tested and retested and I keep getting these results. I'm also
surprised by the nossp speed up on 64-bit, but I dunno. I haven't looked
at the full disassembly of the call path. If that is required for the
performance differences I'm seeing, please let me know. Or if there is
a preferred cpu to run this against - Atoms can be a little weird.

v10: - converted length to a u32 on the struct because it is passed in
at bpf_run_filter time anyway. ([email protected])
- added a comment about the LD_*_ falling through ([email protected])
- more clean up (pointer->load, drop SKB macro, comments) ([email protected])
v9: - n/a
v8: - fixed variable positioning and bad cast ([email protected])
- no longer passes A as a pointer (inspection of x86 asm shows A is
%ebx again; thanks [email protected])
- cleaned up switch macros and expanded use
([email protected], [email protected])
- added length fn pointer and handled LD_W_LEN/LDX_W_LEN
- moved from a wrapping struct to a typedef for the function
pointer. (matches existing function pointer style)
- added comprehensive comment above the typedef.
- benchmarks
v7: - first cut

Signed-off-by: Will Drewry <[email protected]>
---
include/linux/filter.h | 67 ++++++++++++++++++++++++++++-
net/core/filter.c | 108 ++++++++++++++++++++++++++++++++++++++++++-----
2 files changed, 160 insertions(+), 15 deletions(-)

diff --git a/include/linux/filter.h b/include/linux/filter.h
index 8eeb205..184ef99 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -110,6 +110,9 @@ struct sock_fprog { /* Required for SO_ATTACH_FILTER. */
*/
#define BPF_MEMWORDS 16

+/* BPF program (checking) flags */
+#define BPF_CHK_FLAGS_NO_SKB 1
+
/* RATIONALE. Negative offsets are invalid in BPF.
We use them to reference ancillary data.
Unlike introduction new instructions, it does not break
@@ -145,17 +148,65 @@ struct sk_filter
struct sock_filter insns[0];
};

+/**
+ * struct bpf_load_fn - callback and data for bpf_run_filter
+ * This structure is used by bpf_run_filter if bpf_chk_filter
+ * was invoked with BPF_CHK_FLAGS_NO_SKB.
+ *
+ * @load: callback used to fetch data, called with:
+ * @data: const pointer to the data passed into bpf_run_filter
+ * @k: offset into @data
+ * @size: the size of the requested data in bytes: 1, 2, or 4.
+ * @buffer: If non-NULL, a 32-bit buffer for staging data.
+ *
+ * Returns a pointer to the requested data.
+ *
+ * This function operates similarly to load_pointer in net/core/filter.c
+ * except that the data behind the returned pointer must already be
+ * byteswapped as appropriate to the source data and endianness.
+ * @buffer may be used if the data needs to be staged.
+ *
+ * @length: the length of the supplied data for use by the LD*_LEN
+ * instructions.
+ */
+struct bpf_load_fn {
+ void *(*load)(const void *data, int k, unsigned int size,
+ void *buffer);
+ u32 length;
+};
+
static inline unsigned int sk_filter_len(const struct sk_filter *fp)
{
return fp->len * sizeof(struct sock_filter) + sizeof(*fp);
}

+extern unsigned int bpf_run_filter(const void *data,
+ const struct sock_filter *filter,
+ const struct bpf_load_fn *load_fn);
+
+/**
+ * sk_run_filter - run a filter on a socket
+ * @skb: buffer to run the filter on
+ * @filter: filter to apply
+ *
+ * Runs bpf_run_filter with the struct sk_buff-specific data
+ * accessor behavior.
+ */
+static inline unsigned int sk_run_filter(const struct sk_buff *skb,
+ const struct sock_filter *filter)
+{
+ return bpf_run_filter(skb, filter, NULL);
+}
+
extern int sk_filter(struct sock *sk, struct sk_buff *skb);
-extern unsigned int sk_run_filter(const struct sk_buff *skb,
- const struct sock_filter *filter);
extern int sk_attach_filter(struct sock_fprog *fprog, struct sock *sk);
extern int sk_detach_filter(struct sock *sk);
-extern int sk_chk_filter(struct sock_filter *filter, unsigned int flen);
+extern int bpf_chk_filter(struct sock_filter *filter, unsigned int flen, u32 flags);
+
+static inline int sk_chk_filter(struct sock_filter *filter, unsigned int flen)
+{
+ return bpf_chk_filter(filter, flen, 0);
+}

#ifdef CONFIG_BPF_JIT
extern void bpf_jit_compile(struct sk_filter *fp);
@@ -228,6 +279,16 @@ enum {
BPF_S_ANC_HATYPE,
BPF_S_ANC_RXHASH,
BPF_S_ANC_CPU,
+ /* Used to differentiate SKB data and generic data */
+ BPF_S_ANC_LD_W_ABS,
+ BPF_S_ANC_LD_H_ABS,
+ BPF_S_ANC_LD_B_ABS,
+ BPF_S_ANC_LD_W_LEN,
+ BPF_S_ANC_LD_W_IND,
+ BPF_S_ANC_LD_H_IND,
+ BPF_S_ANC_LD_B_IND,
+ BPF_S_ANC_LDX_W_LEN,
+ BPF_S_ANC_LDX_B_MSH,
};

#endif /* __KERNEL__ */
diff --git a/net/core/filter.c b/net/core/filter.c
index 5dea452..6b995a1 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -98,9 +98,10 @@ int sk_filter(struct sock *sk, struct sk_buff *skb)
EXPORT_SYMBOL(sk_filter);

/**
- * sk_run_filter - run a filter on a socket
- * @skb: buffer to run the filter on
+ * bpf_run_filter - run a BPF filter program on @data
+ * @data: buffer to run the filter on
* @fentry: filter to apply
+ * @load_fn: custom data accessor
*
* Decode and apply filter instructions to the skb->data.
* Return length to keep, 0 for none. @skb is the data we are
@@ -109,8 +110,9 @@ EXPORT_SYMBOL(sk_filter);
* and last instruction guaranteed to be a RET, we dont need to check
* flen. (We used to pass to this function the length of filter)
*/
-unsigned int sk_run_filter(const struct sk_buff *skb,
- const struct sock_filter *fentry)
+unsigned int bpf_run_filter(const void *data,
+ const struct sock_filter *fentry,
+ const struct bpf_load_fn *load_fn)
{
void *ptr;
u32 A = 0; /* Accumulator */
@@ -118,6 +120,7 @@ unsigned int sk_run_filter(const struct sk_buff *skb,
u32 mem[BPF_MEMWORDS]; /* Scratch Memory Store */
u32 tmp;
int k;
+ const struct sk_buff *skb = data;

/*
* Process array of filter instructions.
@@ -213,7 +216,7 @@ unsigned int sk_run_filter(const struct sk_buff *skb,
case BPF_S_LD_W_ABS:
k = K;
load_w:
- ptr = load_pointer(skb, k, 4, &tmp);
+ ptr = load_pointer(data, k, 4, &tmp);
if (ptr != NULL) {
A = get_unaligned_be32(ptr);
continue;
@@ -222,7 +225,7 @@ load_w:
case BPF_S_LD_H_ABS:
k = K;
load_h:
- ptr = load_pointer(skb, k, 2, &tmp);
+ ptr = load_pointer(data, k, 2, &tmp);
if (ptr != NULL) {
A = get_unaligned_be16(ptr);
continue;
@@ -231,7 +234,7 @@ load_h:
case BPF_S_LD_B_ABS:
k = K;
load_b:
- ptr = load_pointer(skb, k, 1, &tmp);
+ ptr = load_pointer(data, k, 1, &tmp);
if (ptr != NULL) {
A = *(u8 *)ptr;
continue;
@@ -253,7 +256,7 @@ load_b:
k = X + K;
goto load_b;
case BPF_S_LDX_B_MSH:
- ptr = load_pointer(skb, K, 1, &tmp);
+ ptr = load_pointer(data, K, 1, &tmp);
if (ptr != NULL) {
X = (*(u8 *)ptr & 0xf) << 2;
continue;
@@ -350,6 +353,55 @@ load_b:
A = 0;
continue;
}
+ case BPF_S_ANC_LD_W_ABS:
+ k = K;
+load_fn_w:
+ ptr = load_fn->load(data, k, 4, &tmp);
+ if (ptr) {
+ A = *(u32 *)ptr;
+ continue;
+ }
+ return 0;
+ case BPF_S_ANC_LD_H_ABS:
+ k = K;
+load_fn_h:
+ ptr = load_fn->load(data, k, 2, &tmp);
+ if (ptr) {
+ A = *(u16 *)ptr;
+ continue;
+ }
+ return 0;
+ case BPF_S_ANC_LD_B_ABS:
+ k = K;
+load_fn_b:
+ ptr = load_fn->load(data, k, 1, &tmp);
+ if (ptr) {
+ A = *(u8 *)ptr;
+ continue;
+ }
+ return 0;
+ case BPF_S_ANC_LDX_B_MSH:
+ ptr = load_fn->load(data, K, 1, &tmp);
+ if (ptr) {
+ X = (*(u8 *)ptr & 0xf) << 2;
+ continue;
+ }
+ return 0;
+ case BPF_S_ANC_LD_W_IND:
+ k = X + K;
+ goto load_fn_w;
+ case BPF_S_ANC_LD_H_IND:
+ k = X + K;
+ goto load_fn_h;
+ case BPF_S_ANC_LD_B_IND:
+ k = X + K;
+ goto load_fn_b;
+ case BPF_S_ANC_LD_W_LEN:
+ A = load_fn->length;
+ continue;
+ case BPF_S_ANC_LDX_W_LEN:
+ X = load_fn->length;
+ continue;
default:
WARN_RATELIMIT(1, "Unknown code:%u jt:%u tf:%u k:%u\n",
fentry->code, fentry->jt,
@@ -360,7 +412,7 @@ load_b:

return 0;
}
-EXPORT_SYMBOL(sk_run_filter);
+EXPORT_SYMBOL(bpf_run_filter);

/*
* Security :
@@ -423,9 +475,10 @@ error:
}

/**
- * sk_chk_filter - verify socket filter code
+ * bpf_chk_filter - verify socket filter BPF code
* @filter: filter to verify
* @flen: length of filter
+ * @flags: May be BPF_CHK_FLAGS_NO_SKB or 0
*
* Check the user's filter code. If we let some ugly
* filter code slip through kaboom! The filter must contain
@@ -434,9 +487,13 @@ error:
*
* All jumps are forward as they are not signed.
*
+ * If BPF_CHK_FLAGS_NO_SKB is set in flags, any SKB-specific
+ * rules become illegal and a bpf_load_fn will be expected by
+ * bpf_run_filter.
+ *
* Returns 0 if the rule set is legal or -EINVAL if not.
*/
-int sk_chk_filter(struct sock_filter *filter, unsigned int flen)
+int bpf_chk_filter(struct sock_filter *filter, unsigned int flen, u32 flags)
{
/*
* Valid instructions are initialized to non-0.
@@ -542,9 +599,36 @@ int sk_chk_filter(struct sock_filter *filter, unsigned int flen)
pc + ftest->jf + 1 >= flen)
return -EINVAL;
break;
+#define MAYBE_USE_LOAD_FN(CODE) \
+ if (flags & BPF_CHK_FLAGS_NO_SKB) { \
+ code = BPF_S_ANC_##CODE; \
+ break; \
+ }
+ case BPF_S_LD_W_LEN:
+ MAYBE_USE_LOAD_FN(LD_W_LEN);
+ break;
+ case BPF_S_LDX_W_LEN:
+ MAYBE_USE_LOAD_FN(LDX_W_LEN);
+ break;
+ case BPF_S_LD_W_IND:
+ MAYBE_USE_LOAD_FN(LD_W_IND);
+ break;
+ case BPF_S_LD_H_IND:
+ MAYBE_USE_LOAD_FN(LD_H_IND);
+ break;
+ case BPF_S_LD_B_IND:
+ MAYBE_USE_LOAD_FN(LD_B_IND);
+ break;
+ case BPF_S_LDX_B_MSH:
+ MAYBE_USE_LOAD_FN(LDX_B_MSH);
+ break;
case BPF_S_LD_W_ABS:
+ MAYBE_USE_LOAD_FN(LD_W_ABS);
+ /* Falls through to BPF_S_LD_B_ABS. */
case BPF_S_LD_H_ABS:
+ MAYBE_USE_LOAD_FN(LD_H_ABS);
case BPF_S_LD_B_ABS:
+ MAYBE_USE_LOAD_FN(LD_B_ABS);
#define ANCILLARY(CODE) case SKF_AD_OFF + SKF_AD_##CODE: \
code = BPF_S_ANC_##CODE; \
break
@@ -572,7 +656,7 @@ int sk_chk_filter(struct sock_filter *filter, unsigned int flen)
}
return -EINVAL;
}
-EXPORT_SYMBOL(sk_chk_filter);
+EXPORT_SYMBOL(bpf_chk_filter);

/**
* sk_filter_release_rcu - Release a socket filter by rcu_head
--
1.7.5.4


2012-02-21 17:31:30

by Will Drewry

Subject: [PATCH v10 02/11] seccomp: kill the seccomp_t typedef

Replaces the seccomp_t typedef with struct seccomp to match modern
kernel style.

v8-v10: no changes
v7: struct seccomp_struct -> struct seccomp
v6: original inclusion in this series.

Signed-off-by: Will Drewry <[email protected]>
Reviewed-by: James Morris <[email protected]>
---
include/linux/sched.h | 2 +-
include/linux/seccomp.h | 10 ++++++----
2 files changed, 7 insertions(+), 5 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 7d379a6..c30526f 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1418,7 +1418,7 @@ struct task_struct {
uid_t loginuid;
unsigned int sessionid;
#endif
- seccomp_t seccomp;
+ struct seccomp seccomp;

/* Thread group tracking */
u32 parent_exec_id;
diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h
index cc7a4e9..d61f27f 100644
--- a/include/linux/seccomp.h
+++ b/include/linux/seccomp.h
@@ -7,7 +7,9 @@
#include <linux/thread_info.h>
#include <asm/seccomp.h>

-typedef struct { int mode; } seccomp_t;
+struct seccomp {
+ int mode;
+};

extern void __secure_computing(int);
static inline void secure_computing(int this_syscall)
@@ -19,7 +21,7 @@ static inline void secure_computing(int this_syscall)
extern long prctl_get_seccomp(void);
extern long prctl_set_seccomp(unsigned long);

-static inline int seccomp_mode(seccomp_t *s)
+static inline int seccomp_mode(struct seccomp *s)
{
return s->mode;
}
@@ -28,7 +30,7 @@ static inline int seccomp_mode(seccomp_t *s)

#include <linux/errno.h>

-typedef struct { } seccomp_t;
+struct seccomp { };

#define secure_computing(x) do { } while (0)

@@ -42,7 +44,7 @@ static inline long prctl_set_seccomp(unsigned long arg2)
return -EINVAL;
}

-static inline int seccomp_mode(seccomp_t *s)
+static inline int seccomp_mode(struct seccomp *s)
{
return 0;
}
--
1.7.5.4

2012-02-21 17:31:38

by Will Drewry

Subject: [PATCH v10 07/11] signal, x86: add SIGSYS info and make it synchronous.

This change enables SIGSYS, defines _sifields._sigsys, and adds
x86 (compat) arch support. _sigsys defines fields which allow
a signal handler to receive the triggering system call number,
the relevant AUDIT_ARCH_* value for that number, and the address
of the callsite.
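
As a rough userspace sketch of how a handler could consume these fields,
assuming a libc whose siginfo_t exposes si_syscall (as the macros added to
asm-generic/siginfo.h below do). The names on_sigsys, install_sigsys_handler,
sigsys_seen and last_sigsys_syscall are hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <string.h>

#ifndef SYS_SECCOMP
#define SYS_SECCOMP 1	/* same fallback as samples/seccomp/bpf-direct.c */
#endif

static volatile sig_atomic_t sigsys_seen;
static volatile sig_atomic_t last_sigsys_syscall = -1;

/* Record the triggering syscall number, but only for seccomp-sent SIGSYS. */
static void on_sigsys(int signo, siginfo_t *info, void *context)
{
	(void)signo;
	(void)context;
	sigsys_seen = 1;
	if (info->si_code == SYS_SECCOMP)
		last_sigsys_syscall = info->si_syscall;
}

/* Install `handler` for SIGSYS with SA_SIGINFO; returns sigaction()'s result. */
static int install_sigsys_handler(void (*handler)(int, siginfo_t *, void *))
{
	struct sigaction act;

	assert(handler != NULL);
	memset(&act, 0, sizeof(act));
	act.sa_sigaction = handler;
	act.sa_flags = SA_SIGINFO;
	return sigaction(SIGSYS, &act, NULL);
}
```

A SIGSYS sent by other means (raise(), kill()) carries a different si_code, so
last_sigsys_syscall stays at -1 in that case.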

To ensure that SIGSYS delivery occurs on return from the triggering
system call, SIGSYS is added to the SYNCHRONOUS_MASK macro. I'm not sure
whether this is enough to ensure it will be synchronous or whether it is
explicitly required to ensure an immediate delivery of the signal upon
return from the blocked system call.

The first consumer of SIGSYS would be seccomp filter. In particular,
a filter program could specify a new return value, SECCOMP_RET_TRAP,
which would result in the system call being denied and the calling
thread signaled. This also means that implementing arch-specific
support can be dependent upon HAVE_ARCH_SECCOMP_FILTER.

v10: - first version based on suggestion

Suggested-by: H. Peter Anvin <[email protected]>
Signed-off-by: Will Drewry <[email protected]>
---
arch/x86/ia32/ia32_signal.c | 4 ++++
arch/x86/include/asm/ia32.h | 6 ++++++
include/asm-generic/siginfo.h | 18 ++++++++++++++++++
kernel/signal.c | 2 +-
4 files changed, 29 insertions(+), 1 deletions(-)

diff --git a/arch/x86/ia32/ia32_signal.c b/arch/x86/ia32/ia32_signal.c
index 6557769..c81d2c7 100644
--- a/arch/x86/ia32/ia32_signal.c
+++ b/arch/x86/ia32/ia32_signal.c
@@ -73,6 +73,10 @@ int copy_siginfo_to_user32(compat_siginfo_t __user *to, siginfo_t *from)
switch (from->si_code >> 16) {
case __SI_FAULT >> 16:
break;
+ case __SI_SYS >> 16:
+ put_user_ex(from->si_syscall, &to->si_syscall);
+ put_user_ex(from->si_arch, &to->si_arch);
+ break;
case __SI_CHLD >> 16:
put_user_ex(from->si_utime, &to->si_utime);
put_user_ex(from->si_stime, &to->si_stime);
diff --git a/arch/x86/include/asm/ia32.h b/arch/x86/include/asm/ia32.h
index 1f7e625..541485f 100644
--- a/arch/x86/include/asm/ia32.h
+++ b/arch/x86/include/asm/ia32.h
@@ -126,6 +126,12 @@ typedef struct compat_siginfo {
int _band; /* POLL_IN, POLL_OUT, POLL_MSG */
int _fd;
} _sigpoll;
+
+ struct {
+ unsigned int _call_addr; /* calling insn */
+ int _syscall; /* triggering system call number */
+ unsigned int _arch; /* AUDIT_ARCH_* of syscall */
+ } _sigsys;
} _sifields;
} compat_siginfo_t;

diff --git a/include/asm-generic/siginfo.h b/include/asm-generic/siginfo.h
index 0dd4e87..a83b478 100644
--- a/include/asm-generic/siginfo.h
+++ b/include/asm-generic/siginfo.h
@@ -90,6 +90,13 @@ typedef struct siginfo {
__ARCH_SI_BAND_T _band; /* POLL_IN, POLL_OUT, POLL_MSG */
int _fd;
} _sigpoll;
+
+ /* SIGSYS */
+ struct {
+ void __user *_call_addr; /* calling insn */
+ int _syscall; /* triggering system call number */
+ unsigned int _arch; /* AUDIT_ARCH_* of syscall */
+ } _sigsys;
} _sifields;
} siginfo_t;

@@ -116,6 +123,9 @@ typedef struct siginfo {
#define si_addr_lsb _sifields._sigfault._addr_lsb
#define si_band _sifields._sigpoll._band
#define si_fd _sifields._sigpoll._fd
+#define si_call_addr _sifields._sigsys._call_addr
+#define si_syscall _sifields._sigsys._syscall
+#define si_arch _sifields._sigsys._arch

#ifdef __KERNEL__
#define __SI_MASK 0xffff0000u
@@ -126,6 +136,7 @@ typedef struct siginfo {
#define __SI_CHLD (4 << 16)
#define __SI_RT (5 << 16)
#define __SI_MESGQ (6 << 16)
+#define __SI_SYS (7 << 16)
#define __SI_CODE(T,N) ((T) | ((N) & 0xffff))
#else
#define __SI_KILL 0
@@ -135,6 +146,7 @@ typedef struct siginfo {
#define __SI_CHLD 0
#define __SI_RT 0
#define __SI_MESGQ 0
+#define __SI_SYS 0
#define __SI_CODE(T,N) (N)
#endif

@@ -232,6 +244,12 @@ typedef struct siginfo {
#define NSIGPOLL 6

/*
+ * SIGSYS si_codes
+ */
+#define SYS_SECCOMP (__SI_SYS|1) /* seccomp triggered */
+#define NSIGSYS 1
+
+/*
* sigevent definitions
*
* It seems likely that SIGEV_THREAD will have to be handled from
diff --git a/kernel/signal.c b/kernel/signal.c
index c73c428..7573819 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -160,7 +160,7 @@ void recalc_sigpending(void)

#define SYNCHRONOUS_MASK \
(sigmask(SIGSEGV) | sigmask(SIGBUS) | sigmask(SIGILL) | \
- sigmask(SIGTRAP) | sigmask(SIGFPE))
+ sigmask(SIGTRAP) | sigmask(SIGFPE) | sigmask(SIGSYS))

int next_signal(struct sigpending *pending, sigset_t *mask)
{
--
1.7.5.4

2012-02-21 17:31:57

by Will Drewry

Subject: [PATCH v10 11/11] Documentation: prctl/seccomp_filter

Documents how system call filtering using Berkeley Packet
Filter programs works and how it may be used.
Includes an example for x86 (32-bit) and a semi-generic
example using a macro-based code generator.
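
The documentation added below lists an order of precedence for filter return
values (KILL, ERRNO, TRAP, TRACE, ALLOW) when several filters are attached. A
toy sketch of that selection rule follows; the toy_action tags are
hypothetical stand-ins and not the real SECCOMP_RET_* values, and treating an
empty filter list as ALLOW is my assumption.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical action tags, declared in the documented precedence order. */
enum toy_action {
	TOY_KILL = 0,	/* highest precedence */
	TOY_ERRNO,
	TOY_TRAP,
	TOY_TRACE,
	TOY_ALLOW,	/* lowest precedence */
};

/* Pick the highest-precedence result out of all per-filter results. */
static enum toy_action combine_actions(const enum toy_action *results, size_t n)
{
	enum toy_action winner = TOY_ALLOW;
	size_t i;

	assert(results != NULL || n == 0);
	for (i = 0; i < n; i++) {
		if (results[i] < winner)
			winner = results[i];
	}
	return winner;
}
```

So a task with three filters returning ALLOW, TRACE and ERRNO for a given
system call would see the ERRNO behavior under this rule.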

v10: - update for SIGSYS
- update for new seccomp_data layout
- update for ptrace option use
v9: - updated bpf-direct.c for SIGILL
v8: - add PR_SET_NO_NEW_PRIVS to the samples.
v7: - updated for all the new stuff in v7: TRAP, TRACE
- only talk about PR_SET_SECCOMP now
- fixed bad JLE32 check ([email protected])
- adds dropper.c: a simple system call disabler
v6: - tweak the language to note the requirement of
PR_SET_NO_NEW_PRIVS being called prior to use. ([email protected])
v5: - update sample to use system call arguments
- adds a "fancy" example using a macro-based generator
- cleaned up bpf in the sample
- update docs to mention arguments
- fix prctl value ([email protected])
- language cleanup ([email protected])
v4: - update for no_new_privs use
- minor tweaks
v3: - call out BPF <-> Berkeley Packet Filter ([email protected])
- document use of tentative always-unprivileged
- guard sample compilation for i386 and x86_64
v2: - move code to samples ([email protected])

Signed-off-by: Will Drewry <[email protected]>
---
Documentation/prctl/seccomp_filter.txt | 157 +++++++++++++++++++++
samples/Makefile | 2 +-
samples/seccomp/Makefile | 31 ++++
samples/seccomp/bpf-direct.c | 150 ++++++++++++++++++++
samples/seccomp/bpf-fancy.c | 102 ++++++++++++++
samples/seccomp/bpf-helper.c | 89 ++++++++++++
samples/seccomp/bpf-helper.h | 236 ++++++++++++++++++++++++++++++++
samples/seccomp/dropper.c | 68 +++++++++
8 files changed, 834 insertions(+), 1 deletions(-)
create mode 100644 Documentation/prctl/seccomp_filter.txt
create mode 100644 samples/seccomp/Makefile
create mode 100644 samples/seccomp/bpf-direct.c
create mode 100644 samples/seccomp/bpf-fancy.c
create mode 100644 samples/seccomp/bpf-helper.c
create mode 100644 samples/seccomp/bpf-helper.h
create mode 100644 samples/seccomp/dropper.c

diff --git a/Documentation/prctl/seccomp_filter.txt b/Documentation/prctl/seccomp_filter.txt
new file mode 100644
index 0000000..7de865b
--- /dev/null
+++ b/Documentation/prctl/seccomp_filter.txt
@@ -0,0 +1,157 @@
+ SECure COMPuting with filters
+ =============================
+
+Introduction
+------------
+
+A large number of system calls are exposed to every userland process
+with many of them going unused for the entire lifetime of the process.
+As system calls change and mature, bugs are found and eradicated. A
+certain subset of userland applications benefit by having a reduced set
+of available system calls. The resulting set reduces the total kernel
+surface exposed to the application. System call filtering is meant for
+use with those applications.
+
+Seccomp filtering provides a means for a process to specify a filter for
+incoming system calls. The filter is expressed as a Berkeley Packet
+Filter (BPF) program, as with socket filters, except that the data
+operated on is related to the system call being made: system call
+number and the system call arguments. This allows for expressive
+filtering of system calls using a filter program language with a long
+history of being exposed to userland and a straightforward data set.
+
+Additionally, BPF makes it impossible for users of seccomp to fall prey
+to time-of-check-time-of-use (TOCTOU) attacks that are common in system
+call interposition frameworks. BPF programs may not dereference
+pointers, which constrains all filters to solely evaluating the system
+call arguments directly.
+
+What it isn't
+-------------
+
+System call filtering isn't a sandbox. It provides a clearly defined
+mechanism for minimizing the exposed kernel surface. It is meant to be
+a tool for sandbox developers to use. Beyond that, policy for logical
+behavior and information flow should be managed with a combination of
+other system hardening techniques and, potentially, an LSM of your
+choosing. Expressive, dynamic filters provide further options down this
+path (avoiding pathological sizes or selecting which of the multiplexed
+system calls in socketcall() is allowed, for instance) which could be
+construed, incorrectly, as a more complete sandboxing solution.
+
+Usage
+-----
+
+An additional seccomp mode is added and is enabled using the same
+prctl(2) call as the strict seccomp. If the architecture has
+CONFIG_HAVE_ARCH_SECCOMP_FILTER, then filters may be added as below:
+
+PR_SET_SECCOMP:
+ Now takes an additional argument which specifies a new filter
+ using a BPF program.
+ The BPF program will be executed over struct seccomp_data
+ reflecting the system call number, arguments, and other
+ metadata. The BPF program must then return one of the
+ acceptable values to inform the kernel which action should be
+ taken.
+
+ Usage:
+ prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, prog);
+
+ The 'prog' argument is a pointer to a struct sock_fprog which
+ will contain the filter program. If the program is invalid, the
+ call will return -1 and set errno to EINVAL.
+
+ Note, is_compat_task is also tracked for the @prog. This means
+ that once set the calling task will have all of its system calls
+ blocked if it switches its system call ABI.
+
+ If fork/clone and execve are allowed by @prog, any child
+ processes will be constrained to the same filters and system
+ call ABI as the parent.
+
+ Prior to use, the task must call prctl(PR_SET_NO_NEW_PRIVS, 1) or
+ run with CAP_SYS_ADMIN privileges in its namespace. If these are not
+ true, -EACCES will be returned. This requirement ensures that filter
+ programs cannot be applied to child processes with greater privileges
+ than the task that installed them.
+
+ Additionally, if prctl(2) is allowed by the attached filter,
+ additional filters may be layered on which will increase evaluation
+ time, but allow for further decreasing the attack surface during
+ execution of a process.
+
+The above call returns 0 on success and non-zero on error.
+
+Return values
+-------------
+
+A seccomp filter may return any of the following values:
+ SECCOMP_RET_ALLOW, SECCOMP_RET_KILL, SECCOMP_RET_TRAP,
+ SECCOMP_RET_ERRNO, or SECCOMP_RET_TRACE.
+
+SECCOMP_RET_ALLOW:
+ If all filters for a given task return this value then
+ the system call will proceed normally.
+
+SECCOMP_RET_KILL:
+ If any filters for a given task return this value then
+ the task will exit immediately without executing the system
+ call.
+
+SECCOMP_RET_TRAP:
+ If any filters specify SECCOMP_RET_TRAP and none of them
+ specify SECCOMP_RET_KILL, then the kernel will send a SIGSYS
+ signal to the task and not execute the system call. The kernel
+ will roll back the register state to just before system call
+ entry such that a signal handler in the process will be able
+ to inspect the ucontext_t->uc_mcontext registers and emulate
+ system call success or failure upon return from the signal
+ handler.
+
+ The SIGSYS is differentiated from other SIGSYS signals by a
+ si_code of SYS_SECCOMP.
+
+SECCOMP_RET_ERRNO:
+ If returned, the value provided in the lower 16 bits is
+ returned to userland as the errno and the system call is
+ not executed.
+
+SECCOMP_RET_TRACE:
+ If any filters return this value and the others return
+ SECCOMP_RET_ALLOW, then the kernel will attempt to notify
+ a ptrace()-based tracer prior to executing the system call.
+
+ A tracer will be notified if it requests PTRACE_O_TRACESECCOMP
+ via PTRACE_SETOPTIONS. Otherwise, the system call will
+ not execute and -ENOSYS will be returned to userspace.
+
+ If the tracer ignores notification, then the system call will
+ proceed normally. Changes to the registers will function
+ similarly to PTRACE_SYSCALL. Additionally, if the tracer
+ detaches during notification or just after, the task may be
+ terminated as a precautionary measure.
+
+Please note that the order of precedence is as follows:
+SECCOMP_RET_KILL, SECCOMP_RET_ERRNO, SECCOMP_RET_TRAP,
+SECCOMP_RET_TRACE, SECCOMP_RET_ALLOW.
+
+If multiple filters exist, the return value for the evaluation of a given
+system call will always use the highest-precedence value.
+SECCOMP_RET_KILL will always take precedence.
+
+
+Example
+-------
+
+The samples/seccomp/ directory contains both a 32-bit specific example
+and a more generic example of a higher level macro interface for BPF
+program generation.
+
+Adding architecture support
+---------------------------
+
+See arch/Kconfig for the required functionality. In general, if an
+architecture supports both tracehook and seccomp, it will be able to
+support seccomp filter with minor alteration. Then it must just add
+CONFIG_HAVE_ARCH_SECCOMP_FILTER to its arch-specific Kconfig.
diff --git a/samples/Makefile b/samples/Makefile
index 6280817..f29b19c 100644
--- a/samples/Makefile
+++ b/samples/Makefile
@@ -1,4 +1,4 @@
# Makefile for Linux samples code

obj-$(CONFIG_SAMPLES) += kobject/ kprobes/ tracepoints/ trace_events/ \
- hw_breakpoint/ kfifo/ kdb/ hidraw/
+ hw_breakpoint/ kfifo/ kdb/ hidraw/ seccomp/
diff --git a/samples/seccomp/Makefile b/samples/seccomp/Makefile
new file mode 100644
index 0000000..38922f7
--- /dev/null
+++ b/samples/seccomp/Makefile
@@ -0,0 +1,31 @@
+# kbuild trick to avoid linker error. Can be omitted if a module is built.
+obj- := dummy.o
+
+hostprogs-$(CONFIG_SECCOMP) := bpf-fancy dropper
+bpf-fancy-objs := bpf-fancy.o bpf-helper.o
+
+HOSTCFLAGS_bpf-fancy.o += -I$(objtree)/usr/include
+HOSTCFLAGS_bpf-fancy.o += -idirafter $(objtree)/include
+HOSTCFLAGS_bpf-helper.o += -I$(objtree)/usr/include
+HOSTCFLAGS_bpf-helper.o += -idirafter $(objtree)/include
+
+HOSTCFLAGS_dropper.o += -I$(objtree)/usr/include
+HOSTCFLAGS_dropper.o += -idirafter $(objtree)/include
+dropper-objs := dropper.o
+
+# bpf-direct.c is x86-only.
+ifeq ($(filter-out x86_64 i386,$(KBUILD_BUILDHOST)),)
+# List of programs to build
+hostprogs-$(CONFIG_SECCOMP) += bpf-direct
+bpf-direct-objs := bpf-direct.o
+endif
+
+# Tell kbuild to always build the programs
+always := $(hostprogs-y)
+
+HOSTCFLAGS_bpf-direct.o += -I$(objtree)/usr/include
+HOSTCFLAGS_bpf-direct.o += -idirafter $(objtree)/include
+ifeq ($(KBUILD_BUILDHOST),x86_64)
+HOSTCFLAGS_bpf-direct.o += -m32
+HOSTLOADLIBES_bpf-direct += -m32
+endif
diff --git a/samples/seccomp/bpf-direct.c b/samples/seccomp/bpf-direct.c
new file mode 100644
index 0000000..56e5443
--- /dev/null
+++ b/samples/seccomp/bpf-direct.c
@@ -0,0 +1,150 @@
+/*
+ * 32-bit seccomp filter example with BPF macros
+ *
+ * Copyright (c) 2012 The Chromium OS Authors <[email protected]>
+ * Author: Will Drewry <[email protected]>
+ *
+ * The code may be used by anyone for any purpose,
+ * and can serve as a starting point for developing
+ * applications using prctl(PR_SET_SECCOMP, 2, ...).
+ */
+#define __USE_GNU 1
+#define _GNU_SOURCE 1
+
+#include <linux/types.h>
+#include <linux/filter.h>
+#include <linux/seccomp.h>
+#include <linux/unistd.h>
+#include <signal.h>
+#include <stdio.h>
+#include <stddef.h>
+#include <string.h>
+#include <sys/prctl.h>
+#include <unistd.h>
+
+#define syscall_arg(_n) (offsetof(struct seccomp_data, args[_n]))
+#define syscall_nr (offsetof(struct seccomp_data, nr))
+
+#ifndef PR_SET_NO_NEW_PRIVS
+#define PR_SET_NO_NEW_PRIVS 36
+#endif
+
+#ifndef SYS_SECCOMP
+#define SYS_SECCOMP 1
+#endif
+
+static void emulator(int nr, siginfo_t *info, void *void_context)
+{
+ ucontext_t *ctx = (ucontext_t *)(void_context);
+ int syscall;
+ char *buf;
+ ssize_t bytes;
+ size_t len;
+ if (info->si_code != SYS_SECCOMP)
+ return;
+ if (!ctx)
+ return;
+ syscall = ctx->uc_mcontext.gregs[REG_EAX];
+ buf = (char *) ctx->uc_mcontext.gregs[REG_ECX];
+ len = (size_t) ctx->uc_mcontext.gregs[REG_EDX];
+
+ if (syscall != __NR_write)
+ return;
+ if (ctx->uc_mcontext.gregs[REG_EBX] != STDERR_FILENO)
+ return;
+ /* Redirect stderr messages to stdout. Doesn't handle EINTR, etc */
+ write(STDOUT_FILENO, "[ERR] ", 6);
+ bytes = write(STDOUT_FILENO, buf, len);
+ ctx->uc_mcontext.gregs[REG_EAX] = bytes;
+ return;
+}
+
+static int install_emulator(void)
+{
+ struct sigaction act;
+ sigset_t mask;
+ memset(&act, 0, sizeof(act));
+ sigemptyset(&mask);
+ sigaddset(&mask, SIGSYS);
+
+ act.sa_sigaction = &emulator;
+ act.sa_flags = SA_SIGINFO;
+ if (sigaction(SIGSYS, &act, NULL) < 0) {
+ perror("sigaction");
+ return -1;
+ }
+ if (sigprocmask(SIG_UNBLOCK, &mask, NULL)) {
+ perror("sigprocmask");
+ return -1;
+ }
+ return 0;
+}
+
+static int install_filter(void)
+{
+ struct sock_filter filter[] = {
+ /* Grab the system call number */
+ BPF_STMT(BPF_LD+BPF_W+BPF_ABS, syscall_nr),
+ /* Jump table for the allowed syscalls */
+ BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, __NR_rt_sigreturn, 0, 1),
+ BPF_STMT(BPF_RET+BPF_K, SECCOMP_RET_ALLOW),
+ BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, __NR_sigreturn, 0, 1),
+ BPF_STMT(BPF_RET+BPF_K, SECCOMP_RET_ALLOW),
+ BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, __NR_exit_group, 0, 1),
+ BPF_STMT(BPF_RET+BPF_K, SECCOMP_RET_ALLOW),
+ BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, __NR_exit, 0, 1),
+ BPF_STMT(BPF_RET+BPF_K, SECCOMP_RET_ALLOW),
+ BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, __NR_read, 1, 0),
+ BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, __NR_write, 3, 2),
+
+ /* Check that read is only using stdin. */
+ BPF_STMT(BPF_LD+BPF_W+BPF_ABS, syscall_arg(0)),
+ BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, STDIN_FILENO, 4, 0),
+ BPF_STMT(BPF_RET+BPF_K, SECCOMP_RET_KILL),
+
+ /* Check that write is only using stdout */
+ BPF_STMT(BPF_LD+BPF_W+BPF_ABS, syscall_arg(0)),
+ BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, STDOUT_FILENO, 1, 0),
+ /* Trap attempts to write to stderr */
+ BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, STDERR_FILENO, 1, 2),
+
+ BPF_STMT(BPF_RET+BPF_K, SECCOMP_RET_ALLOW),
+ BPF_STMT(BPF_RET+BPF_K, SECCOMP_RET_TRAP),
+ BPF_STMT(BPF_RET+BPF_K, SECCOMP_RET_KILL),
+ };
+ struct sock_fprog prog = {
+ .len = (unsigned short)(sizeof(filter)/sizeof(filter[0])),
+ .filter = filter,
+ };
+
+ if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0)) {
+ perror("prctl(NO_NEW_PRIVS)");
+ return 1;
+ }
+
+
+ if (prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog)) {
+ perror("prctl");
+ return 1;
+ }
+ return 0;
+}
+
+#define payload(_c) (_c), sizeof((_c))
+int main(int argc, char **argv)
+{
+ char buf[4096];
+ ssize_t bytes = 0;
+ if (install_emulator())
+ return 1;
+ if (install_filter())
+ return 1;
+ syscall(__NR_write, STDOUT_FILENO,
+ payload("OHAI! WHAT IS YOUR NAME? "));
+ bytes = syscall(__NR_read, STDIN_FILENO, buf, sizeof(buf));
+ syscall(__NR_write, STDOUT_FILENO, payload("HELLO, "));
+ syscall(__NR_write, STDOUT_FILENO, buf, bytes);
+ syscall(__NR_write, STDERR_FILENO,
+ payload("Error message going to STDERR\n"));
+ return 0;
+}
diff --git a/samples/seccomp/bpf-fancy.c b/samples/seccomp/bpf-fancy.c
new file mode 100644
index 0000000..bf1f6b5
--- /dev/null
+++ b/samples/seccomp/bpf-fancy.c
@@ -0,0 +1,102 @@
+/*
+ * Seccomp BPF example using a macro-based generator.
+ *
+ * Copyright (c) 2012 The Chromium OS Authors <[email protected]>
+ * Author: Will Drewry <[email protected]>
+ *
+ * The code may be used by anyone for any purpose,
+ * and can serve as a starting point for developing
+ * applications using prctl(PR_SET_SECCOMP, 2, ...).
+ */
+
+#include <linux/filter.h>
+#include <linux/seccomp.h>
+#include <linux/unistd.h>
+#include <stdio.h>
+#include <string.h>
+#include <sys/prctl.h>
+#include <unistd.h>
+
+#include "bpf-helper.h"
+
+#ifndef PR_SET_NO_NEW_PRIVS
+#define PR_SET_NO_NEW_PRIVS 36
+#endif
+
+int main(int argc, char **argv)
+{
+ struct bpf_labels l = { .count = 0 }; /* label table must start empty */
+ static const char msg1[] = "Please type something: ";
+ static const char msg2[] = "You typed: ";
+ char buf[256];
+ struct sock_filter filter[] = {
+ /* TODO: LOAD_SYSCALL_NR(arch) and enforce an arch */
+ LOAD_SYSCALL_NR,
+ SYSCALL(__NR_exit, ALLOW),
+ SYSCALL(__NR_exit_group, ALLOW),
+ SYSCALL(__NR_write, JUMP(&l, write_fd)),
+ SYSCALL(__NR_read, JUMP(&l, read)),
+ DENY, /* Don't passthrough into a label */
+
+ LABEL(&l, read),
+ ARG(0),
+ JNE(STDIN_FILENO, DENY),
+ ARG(1),
+ JNE((unsigned long)buf, DENY),
+ ARG(2),
+ JGE(sizeof(buf), DENY),
+ ALLOW,
+
+ LABEL(&l, write_fd),
+ ARG(0),
+ JEQ(STDOUT_FILENO, JUMP(&l, write_buf)),
+ JEQ(STDERR_FILENO, JUMP(&l, write_buf)),
+ DENY,
+
+ LABEL(&l, write_buf),
+ ARG(1),
+ JEQ((unsigned long)msg1, JUMP(&l, msg1_len)),
+ JEQ((unsigned long)msg2, JUMP(&l, msg2_len)),
+ JEQ((unsigned long)buf, JUMP(&l, buf_len)),
+ DENY,
+
+ LABEL(&l, msg1_len),
+ ARG(2),
+ JLT(sizeof(msg1), ALLOW),
+ DENY,
+
+ LABEL(&l, msg2_len),
+ ARG(2),
+ JLT(sizeof(msg2), ALLOW),
+ DENY,
+
+ LABEL(&l, buf_len),
+ ARG(2),
+ JLT(sizeof(buf), ALLOW),
+ DENY,
+ };
+ struct sock_fprog prog = {
+ .filter = filter,
+ .len = (unsigned short)(sizeof(filter)/sizeof(filter[0])),
+ };
+ ssize_t bytes;
+ bpf_resolve_jumps(&l, filter, sizeof(filter)/sizeof(*filter));
+
+ if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0)) {
+ perror("prctl(NO_NEW_PRIVS)");
+ return 1;
+ }
+
+ if (prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog)) {
+ perror("prctl(SECCOMP)");
+ return 1;
+ }
+ syscall(__NR_write, STDOUT_FILENO, msg1, strlen(msg1));
+ bytes = syscall(__NR_read, STDIN_FILENO, buf, sizeof(buf)-1);
+ bytes = (bytes > 0 ? bytes : 0);
+ syscall(__NR_write, STDERR_FILENO, msg2, strlen(msg2));
+ syscall(__NR_write, STDERR_FILENO, buf, bytes);
+ /* Now get killed */
+ syscall(__NR_write, STDERR_FILENO, msg2, strlen(msg2)+2);
+ return 0;
+}
diff --git a/samples/seccomp/bpf-helper.c b/samples/seccomp/bpf-helper.c
new file mode 100644
index 0000000..579cfe3
--- /dev/null
+++ b/samples/seccomp/bpf-helper.c
@@ -0,0 +1,89 @@
+/*
+ * Seccomp BPF helper functions
+ *
+ * Copyright (c) 2012 The Chromium OS Authors <[email protected]>
+ * Author: Will Drewry <[email protected]>
+ *
+ * The code may be used by anyone for any purpose,
+ * and can serve as a starting point for developing
+ * applications using prctl(PR_SET_SECCOMP, 2, ...).
+ */
+
+#include <stdio.h>
+#include <string.h>
+
+#include "bpf-helper.h"
+
+int bpf_resolve_jumps(struct bpf_labels *labels,
+ struct sock_filter *filter, size_t count)
+{
+ struct sock_filter *begin = filter;
+ size_t insn = count - 1; /* __u8 would truncate for filters over 256 instructions */
+
+ if (count < 1)
+ return -1;
+ /*
+ * Walk it once, backwards, to build the label table and do fixups.
+ * Since backward jumps are disallowed by BPF, this is easy.
+ */
+ filter += insn;
+ for (; filter >= begin; --insn, --filter) {
+ if (filter->code != (BPF_JMP+BPF_JA))
+ continue;
+ switch ((filter->jt<<8)|filter->jf) {
+ case (JUMP_JT<<8)|JUMP_JF:
+ if (labels->labels[filter->k].location == 0xffffffff) {
+ fprintf(stderr, "Unresolved label: '%s'\n",
+ labels->labels[filter->k].label);
+ return 1;
+ }
+ filter->k = labels->labels[filter->k].location -
+ (insn + 1);
+ filter->jt = 0;
+ filter->jf = 0;
+ continue;
+ case (LABEL_JT<<8)|LABEL_JF:
+ if (labels->labels[filter->k].location != 0xffffffff) {
+ fprintf(stderr, "Duplicate label use: '%s'\n",
+ labels->labels[filter->k].label);
+ return 1;
+ }
+ labels->labels[filter->k].location = insn;
+ filter->k = 0; /* fall through */
+ filter->jt = 0;
+ filter->jf = 0;
+ continue;
+ }
+ }
+ return 0;
+}
+
+/* Simple lookup table for labels. */
+__u32 seccomp_bpf_label(struct bpf_labels *labels, const char *label)
+{
+ struct __bpf_label *begin = labels->labels, *end;
+ int id;
+ if (labels->count == 0) {
+ begin->label = label;
+ begin->location = 0xffffffff;
+ labels->count++;
+ return 0;
+ }
+ end = begin + labels->count;
+ for (id = 0; begin < end; ++begin, ++id) {
+ if (!strcmp(label, begin->label))
+ return id;
+ }
+ begin->label = label;
+ begin->location = 0xffffffff;
+ labels->count++;
+ return id;
+}
+
+void seccomp_bpf_print(struct sock_filter *filter, size_t count)
+{
+ struct sock_filter *end = filter + count;
+ for ( ; filter < end; ++filter)
+ printf("{ code=%u,jt=%u,jf=%u,k=%u },\n",
+ filter->code, filter->jt, filter->jf, filter->k);
+}
diff --git a/samples/seccomp/bpf-helper.h b/samples/seccomp/bpf-helper.h
new file mode 100644
index 0000000..273fcd7
--- /dev/null
+++ b/samples/seccomp/bpf-helper.h
@@ -0,0 +1,236 @@
+/*
+ * Example wrapper around BPF macros.
+ *
+ * Copyright (c) 2012 The Chromium OS Authors <[email protected]>
+ * Author: Will Drewry <[email protected]>
+ *
+ * The code may be used by anyone for any purpose,
+ * and can serve as a starting point for developing
+ * applications using prctl(PR_SET_SECCOMP, 2, ...).
+ *
+ * No guarantees are provided with respect to the correctness
+ * or functionality of this code.
+ */
+#ifndef __BPF_HELPER_H__
+#define __BPF_HELPER_H__
+
+#include <asm/bitsperlong.h> /* for __BITS_PER_LONG */
+#include <linux/filter.h>
+#include <linux/seccomp.h> /* for seccomp_data */
+#include <linux/types.h>
+#include <linux/unistd.h>
+#include <stddef.h>
+
+#define BPF_LABELS_MAX 256
+struct bpf_labels {
+ int count;
+ struct __bpf_label {
+ const char *label;
+ __u32 location;
+ } labels[BPF_LABELS_MAX];
+};
+
+int bpf_resolve_jumps(struct bpf_labels *labels,
+ struct sock_filter *filter, size_t count);
+__u32 seccomp_bpf_label(struct bpf_labels *labels, const char *label);
+void seccomp_bpf_print(struct sock_filter *filter, size_t count);
+
+#define JUMP_JT 0xff
+#define JUMP_JF 0xff
+#define LABEL_JT 0xfe
+#define LABEL_JF 0xfe
+
+#define ALLOW \
+ BPF_STMT(BPF_RET+BPF_K, SECCOMP_RET_ALLOW)
+#define DENY \
+ BPF_STMT(BPF_RET+BPF_K, SECCOMP_RET_KILL)
+#define JUMP(labels, label) \
+ BPF_JUMP(BPF_JMP+BPF_JA, FIND_LABEL((labels), (label)), \
+ JUMP_JT, JUMP_JF)
+#define LABEL(labels, label) \
+ BPF_JUMP(BPF_JMP+BPF_JA, FIND_LABEL((labels), (label)), \
+ LABEL_JT, LABEL_JF)
+#define SYSCALL(nr, jt) \
+ BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, (nr), 0, 1), \
+ jt
+
+/* Lame, but just an example */
+#define FIND_LABEL(labels, label) seccomp_bpf_label((labels), #label)
+
+#define EXPAND(...) __VA_ARGS__
+/* Map all width-sensitive operations */
+#if __BITS_PER_LONG == 32
+
+#define JEQ(x, jt) JEQ32(x, EXPAND(jt))
+#define JNE(x, jt) JNE32(x, EXPAND(jt))
+#define JGT(x, jt) JGT32(x, EXPAND(jt))
+#define JLT(x, jt) JLT32(x, EXPAND(jt))
+#define JGE(x, jt) JGE32(x, EXPAND(jt))
+#define JLE(x, jt) JLE32(x, EXPAND(jt))
+#define JA(x, jt) JA32(x, EXPAND(jt))
+#define ARG(i) ARG_32(i)
+
+#elif __BITS_PER_LONG == 64
+
+/* Ensure that we load the logically correct offset. */
+#if defined(__LITTLE_ENDIAN)
+#define LO_ARG(idx) offsetof(struct seccomp_data, args[(idx)])
+#define HI_ARG(idx) offsetof(struct seccomp_data, args[(idx)]) + sizeof(__u32)
+#define ENDIAN(_lo, _hi) _lo, _hi
+#elif defined(__BIG_ENDIAN)
+#define ENDIAN(_lo, _hi) _hi, _lo
+#define LO_ARG(idx) offsetof(struct seccomp_data, args[(idx)]) + sizeof(__u32)
+#define HI_ARG(idx) offsetof(struct seccomp_data, args[(idx)])
+#else
+#error "Unknown endianness"
+#endif
+
+union arg64 {
+ struct {
+ __u32 ENDIAN(lo32, hi32);
+ };
+ __u64 u64;
+};
+
+#define JEQ(x, jt) \
+ JEQ64(((union arg64){.u64 = (x)}).lo32, \
+ ((union arg64){.u64 = (x)}).hi32, \
+ EXPAND(jt))
+#define JGT(x, jt) \
+ JGT64(((union arg64){.u64 = (x)}).lo32, \
+ ((union arg64){.u64 = (x)}).hi32, \
+ EXPAND(jt))
+#define JGE(x, jt) \
+ JGE64(((union arg64){.u64 = (x)}).lo32, \
+ ((union arg64){.u64 = (x)}).hi32, \
+ EXPAND(jt))
+#define JNE(x, jt) \
+ JNE64(((union arg64){.u64 = (x)}).lo32, \
+ ((union arg64){.u64 = (x)}).hi32, \
+ EXPAND(jt))
+#define JLT(x, jt) \
+ JLT64(((union arg64){.u64 = (x)}).lo32, \
+ ((union arg64){.u64 = (x)}).hi32, \
+ EXPAND(jt))
+#define JLE(x, jt) \
+ JLE64(((union arg64){.u64 = (x)}).lo32, \
+ ((union arg64){.u64 = (x)}).hi32, \
+ EXPAND(jt))
+
+#define JA(x, jt) \
+ JA64(((union arg64){.u64 = (x)}).lo32, \
+ ((union arg64){.u64 = (x)}).hi32, \
+ EXPAND(jt))
+#define ARG(i) ARG_64(i)
+
+#else
+#error __BITS_PER_LONG value unusable.
+#endif
+
+/* Loads the arg into A */
+#define ARG_32(idx) \
+ BPF_STMT(BPF_LD+BPF_W+BPF_ABS, LO_ARG(idx))
+
+/* Loads lo into M[0] and hi into M[1] and A */
+#define ARG_64(idx) \
+ BPF_STMT(BPF_LD+BPF_W+BPF_ABS, LO_ARG(idx)), \
+ BPF_STMT(BPF_ST, 0), /* lo -> M[0] */ \
+ BPF_STMT(BPF_LD+BPF_W+BPF_ABS, HI_ARG(idx)), \
+ BPF_STMT(BPF_ST, 1) /* hi -> M[1] */
+
+#define JEQ32(value, jt) \
+ BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, (value), 0, 1), \
+ jt
+
+#define JNE32(value, jt) \
+ BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, (value), 1, 0), \
+ jt
+
+/* Checks the hi (already in A), then swaps in lo from M[0] to check it. */
+#define JEQ64(lo, hi, jt) \
+ BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, (hi), 0, 5), \
+ BPF_STMT(BPF_LD+BPF_MEM, 0), /* swap in lo */ \
+ BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, (lo), 0, 2), \
+ BPF_STMT(BPF_LD+BPF_MEM, 1), /* passed: swap hi back in */ \
+ jt, \
+ BPF_STMT(BPF_LD+BPF_MEM, 1) /* failed: swap hi back in */
+
+#define JNE64(lo, hi, jt) \
+ BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, (hi), 0, 3), /* hi differs: go to jt */ \
+ BPF_STMT(BPF_LD+BPF_MEM, 0), /* swap in lo */ \
+ BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, (lo), 2, 0), \
+ BPF_STMT(BPF_LD+BPF_MEM, 1), /* passed: swap hi back in */ \
+ jt, \
+ BPF_STMT(BPF_LD+BPF_MEM, 1) /* failed: swap hi back in */
+
+#define JA32(value, jt) \
+ BPF_JUMP(BPF_JMP+BPF_JSET+BPF_K, (value), 0, 1), \
+ jt
+
+#define JA64(lo, hi, jt) \
+ BPF_JUMP(BPF_JMP+BPF_JSET+BPF_K, (hi), 3, 0), \
+ BPF_STMT(BPF_LD+BPF_MEM, 0), /* swap in lo */ \
+ BPF_JUMP(BPF_JMP+BPF_JSET+BPF_K, (lo), 0, 2), \
+ BPF_STMT(BPF_LD+BPF_MEM, 1), /* passed: swap hi back in */ \
+ jt, \
+ BPF_STMT(BPF_LD+BPF_MEM, 1) /* failed: swap hi back in */
+
+#define JGE32(value, jt) \
+ BPF_JUMP(BPF_JMP+BPF_JGE+BPF_K, (value), 0, 1), \
+ jt
+
+#define JLT32(value, jt) \
+ BPF_JUMP(BPF_JMP+BPF_JGE+BPF_K, (value), 1, 0), \
+ jt
+
+/* Shortcut checking if hi > arg.hi. */
+#define JGE64(lo, hi, jt) \
+ BPF_JUMP(BPF_JMP+BPF_JGT+BPF_K, (hi), 4, 0), \
+ BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, (hi), 0, 5), \
+ BPF_STMT(BPF_LD+BPF_MEM, 0), /* swap in lo */ \
+ BPF_JUMP(BPF_JMP+BPF_JGE+BPF_K, (lo), 0, 2), \
+ BPF_STMT(BPF_LD+BPF_MEM, 1), /* passed: swap hi back in */ \
+ jt, \
+ BPF_STMT(BPF_LD+BPF_MEM, 1) /* failed: swap hi back in */
+
+#define JLT64(lo, hi, jt) \
+ BPF_JUMP(BPF_JMP+BPF_JGE+BPF_K, (hi), 0, 4), \
+ BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, (hi), 0, 5), \
+ BPF_STMT(BPF_LD+BPF_MEM, 0), /* swap in lo */ \
+ BPF_JUMP(BPF_JMP+BPF_JGE+BPF_K, (lo), 2, 0), /* arg.lo >= lo: not less */ \
+ BPF_STMT(BPF_LD+BPF_MEM, 1), /* passed: swap hi back in */ \
+ jt, \
+ BPF_STMT(BPF_LD+BPF_MEM, 1) /* failed: swap hi back in */
+
+#define JGT32(value, jt) \
+ BPF_JUMP(BPF_JMP+BPF_JGT+BPF_K, (value), 0, 1), \
+ jt
+
+#define JLE32(value, jt) \
+ BPF_JUMP(BPF_JMP+BPF_JGT+BPF_K, (value), 1, 0), \
+ jt
+
+/* Check hi > args.hi first, then do the GE checking */
+#define JGT64(lo, hi, jt) \
+ BPF_JUMP(BPF_JMP+BPF_JGT+BPF_K, (hi), 4, 0), \
+ BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, (hi), 0, 5), \
+ BPF_STMT(BPF_LD+BPF_MEM, 0), /* swap in lo */ \
+ BPF_JUMP(BPF_JMP+BPF_JGT+BPF_K, (lo), 0, 2), \
+ BPF_STMT(BPF_LD+BPF_MEM, 1), /* passed: swap hi back in */ \
+ jt, \
+ BPF_STMT(BPF_LD+BPF_MEM, 1) /* failed: swap hi back in */
+
+#define JLE64(lo, hi, jt) \
+ BPF_JUMP(BPF_JMP+BPF_JGT+BPF_K, (hi), 6, 0), \
+ BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, (hi), 0, 3), \
+ BPF_STMT(BPF_LD+BPF_MEM, 0), /* swap in lo */ \
+ BPF_JUMP(BPF_JMP+BPF_JGT+BPF_K, (lo), 2, 0), \
+ BPF_STMT(BPF_LD+BPF_MEM, 1), /* passed: swap hi back in */ \
+ jt, \
+ BPF_STMT(BPF_LD+BPF_MEM, 1) /* failed: swap hi back in */
+
+#define LOAD_SYSCALL_NR \
+ BPF_STMT(BPF_LD+BPF_W+BPF_ABS, \
+ offsetof(struct seccomp_data, nr))
+
+#endif /* __BPF_HELPER_H__ */
diff --git a/samples/seccomp/dropper.c b/samples/seccomp/dropper.c
new file mode 100644
index 0000000..74e035d
--- /dev/null
+++ b/samples/seccomp/dropper.c
@@ -0,0 +1,68 @@
+/*
+ * Naive system call dropper built on seccomp_filter.
+ *
+ * Copyright (c) 2012 The Chromium OS Authors <[email protected]>
+ * Author: Will Drewry <[email protected]>
+ *
+ * The code may be used by anyone for any purpose,
+ * and can serve as a starting point for developing
+ * applications using prctl(PR_SET_SECCOMP, 2, ...).
+ *
+ * When run, returns the specified errno for the specified
+ * system call number against the given architecture.
+ *
+ * Run this one as root as PR_SET_NO_NEW_PRIVS is not called.
+ */
+
+#include <errno.h>
+#include <linux/audit.h>
+#include <linux/filter.h>
+#include <linux/seccomp.h>
+#include <linux/unistd.h>
+#include <stdio.h>
+#include <stddef.h>
+#include <stdlib.h>
+#include <sys/prctl.h>
+#include <unistd.h>
+
+static int install_filter(int nr, int arch, int error)
+{
+ struct sock_filter filter[] = {
+ BPF_STMT(BPF_LD+BPF_W+BPF_ABS,
+ (offsetof(struct seccomp_data, arch))),
+ BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, arch, 0, 3),
+ BPF_STMT(BPF_LD+BPF_W+BPF_ABS,
+ (offsetof(struct seccomp_data, nr))),
+ BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, nr, 0, 1),
+ BPF_STMT(BPF_RET+BPF_K,
+ SECCOMP_RET_ERRNO|(error & SECCOMP_RET_DATA)),
+ BPF_STMT(BPF_RET+BPF_K, SECCOMP_RET_ALLOW),
+ };
+ struct sock_fprog prog = {
+ .len = (unsigned short)(sizeof(filter)/sizeof(filter[0])),
+ .filter = filter,
+ };
+ if (prctl(PR_SET_SECCOMP, 2, &prog)) {
+ perror("prctl");
+ return 1;
+ }
+ return 0;
+}
+
+int main(int argc, char **argv)
+{
+ if (argc < 5) {
+ fprintf(stderr, "Usage:\n"
+ "dropper <syscall_nr> <arch> <errno> <prog> [<args>]\n"
+ "Hint: AUDIT_ARCH_I386: %x\n"
+ " AUDIT_ARCH_X86_64: %x\n"
+ "\n", AUDIT_ARCH_I386, AUDIT_ARCH_X86_64);
+ return 1;
+ }
+ if (install_filter(strtol(argv[1], NULL, 0), strtol(argv[2], NULL, 0),
+ strtol(argv[3], NULL, 0)))
+ return 1;
+ execv(argv[4], &argv[4]);
+ printf("Failed to execv\n");
+ return 255;
+}
--
1.7.5.4

2012-02-21 17:31:46

by Will Drewry

Subject: [PATCH v10 10/11] x86: Enable HAVE_ARCH_SECCOMP_FILTER

Enable support for seccomp filter on x86:
- asm/tracehook.h exists and seccomp_tracer_done called.
- asm/syscall.h functions work
- secure_computing() return value is honored (see below)

This change adds support for honoring the return
value from secure_computing().

SECCOMP_RET_TRACE and SECCOMP_RET_TRAP may result in seccomp needing to
skip a system call without killing the process. This is done by
returning a non-zero (-1) value from secure_computing. This change
makes x86 respect that return value.

To ensure that minimal kernel code is exposed, a non-zero return value
results in an immediate return to user space (with an invalid syscall
number).

v10: no change

Signed-off-by: Will Drewry <[email protected]>
---
arch/x86/Kconfig | 1 +
arch/x86/kernel/ptrace.c | 7 ++++++-
2 files changed, 7 insertions(+), 1 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 5bed94e..4c9012b 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -82,6 +82,7 @@ config X86
select CLKEVT_I8253
select ARCH_HAVE_NMI_SAFE_CMPXCHG
select GENERIC_IOMAP
+ select HAVE_ARCH_SECCOMP_FILTER

config INSTRUCTION_DECODER
def_bool (KPROBES || PERF_EVENTS)
diff --git a/arch/x86/kernel/ptrace.c b/arch/x86/kernel/ptrace.c
index 5026738..90d465a 100644
--- a/arch/x86/kernel/ptrace.c
+++ b/arch/x86/kernel/ptrace.c
@@ -1380,7 +1380,11 @@ long syscall_trace_enter(struct pt_regs *regs)
regs->flags |= X86_EFLAGS_TF;

/* do the secure computing check first */
- secure_computing(regs->orig_ax);
+ if (secure_computing(regs->orig_ax)) {
+ /* seccomp failures shouldn't expose any additional code. */
+ ret = -1L;
+ goto out;
+ }

if (unlikely(test_thread_flag(TIF_SYSCALL_EMU)))
ret = -1L;
@@ -1405,6 +1409,7 @@ long syscall_trace_enter(struct pt_regs *regs)
regs->dx, regs->r10);
#endif

+out:
return ret ?: regs->orig_ax;
}

--
1.7.5.4

2012-02-21 17:31:43

by Will Drewry

Subject: [PATCH v10 09/11] ptrace,seccomp: Add PTRACE_SECCOMP support

A new return value is added to seccomp filters that allows
the system call policy for the affected system calls to be
implemented by a ptrace(2)ing process.

If a tracer attaches to a task, sets the PTRACE_O_TRACESECCOMP
option, and then issues PTRACE_CONT, the tracer will be notified
whenever a seccomp filter program returns SECCOMP_RET_TRACE.
If there is no seccomp event tracer, SECCOMP_RET_TRACE system calls will
return a -ENOSYS errno to user space. If the tracer detaches during a
hand-off, the process will be killed.
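
The three outcomes described above can be modelled as a small decision
function. This is only a sketch of the intended semantics, not kernel
code: the enum, the function name and its boolean parameters are
hypothetical stand-ins for the PT_TRACE_SECCOMP check in
__secure_computing_int() and the later checks in seccomp_tracer_done().

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of what happens on SECCOMP_RET_TRACE. */
enum trace_outcome {
	TRACE_SKIP_ENOSYS,	/* no interested tracer: call skipped, -ENOSYS */
	TRACE_NOTIFIED,		/* tracer notified, call may proceed */
	TRACE_KILLED		/* tracer vanished or SIGKILL pending */
};

static inline enum trace_outcome
seccomp_ret_trace_outcome(bool tracer_set_option,
			  bool tracer_attached_after,
			  bool sigkill_pending)
{
	/* Mirrors the PT_TRACE_SECCOMP test made before the hand-off. */
	if (!tracer_set_option)
		return TRACE_SKIP_ENOSYS;
	/* Mirrors the checks made once the hand-off has completed. */
	if (!tracer_attached_after || sigkill_pending)
		return TRACE_KILLED;
	return TRACE_NOTIFIED;
}

/* Usage example covering the three cases from the text. */
static inline void trace_outcome_examples(void)
{
	assert(seccomp_ret_trace_outcome(false, false, false) ==
	       TRACE_SKIP_ENOSYS);
	assert(seccomp_ret_trace_outcome(true, true, false) ==
	       TRACE_NOTIFIED);
	assert(seccomp_ret_trace_outcome(true, false, false) ==
	       TRACE_KILLED);
}
```

The model deliberately ignores the TIF_SYSCALL_TRACE mechanics; it only
captures which of the three results user space would observe.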

To ensure that seccomp is syscall fast-path friendly in the future,
ptrace is delegated to by setting TIF_SYSCALL_TRACE. Since seccomp
events are equivalent to system call entry events, this allows for
seccomp to be evaluated as a fork off the fast-path and only,
optionally, jump to the slow path. When the tracer is notified, all
will function as with ptrace(PTRACE_SYSCALL), but when the tracer calls
ptrace(PTRACE_CONT), TIF_SYSCALL_TRACE will be unset and the task
will proceed just receiving PTRACE_O_TRACESECCOMP events.

I realize there are pending patches for cleaning up ptrace events.
I can either reintegrate with those when they are available or
vice versa. That's assuming these changes make sense and are viable.

v10: - moved to PTRACE_O_TRACESECCOMP / PT_TRACE_SECCOMP
v9: - n/a
v8: - guarded PTRACE_SECCOMP use with an ifdef
v7: - introduced

Signed-off-by: Will Drewry <[email protected]>
---
arch/Kconfig | 4 +++
include/linux/ptrace.h | 7 ++++-
include/linux/seccomp.h | 14 +++++++++--
include/linux/tracehook.h | 7 +++++-
kernel/ptrace.c | 4 +++
kernel/seccomp.c | 52 ++++++++++++++++++++++++++++++++++++++++++--
6 files changed, 79 insertions(+), 9 deletions(-)

diff --git a/arch/Kconfig b/arch/Kconfig
index 6d6d9dc..02c18ca 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -203,6 +203,7 @@ config HAVE_ARCH_SECCOMP_FILTER
bool
help
This symbol should be selected by an architecure if it provides:
+ linux/tracehook.h, for TIF_SYSCALL_TRACE and ptrace_report_syscall
asm/syscall.h:
- syscall_get_arch()
- syscall_get_arguments()
@@ -211,6 +212,9 @@ config HAVE_ARCH_SECCOMP_FILTER
SIGSYS siginfo_t support must be implemented.
__secure_computing_int()/secure_computing()'s return value must be
checked, with -1 resulting in the syscall being skipped.
+ If secure_computing is not in the system call slow path, the thread
+ info flags will need to be checked upon exit to ensure delegation to
+ ptrace(2) did not occur, or if it did, jump to the slow-path.

config SECCOMP_FILTER
def_bool y
diff --git a/include/linux/ptrace.h b/include/linux/ptrace.h
index c2f1f6a..2fccdbc 100644
--- a/include/linux/ptrace.h
+++ b/include/linux/ptrace.h
@@ -62,8 +62,9 @@
#define PTRACE_O_TRACEEXEC 0x00000010
#define PTRACE_O_TRACEVFORKDONE 0x00000020
#define PTRACE_O_TRACEEXIT 0x00000040
+#define PTRACE_O_TRACESECCOMP 0x00000080

-#define PTRACE_O_MASK 0x0000007f
+#define PTRACE_O_MASK 0x000000ff

/* Wait extended result codes for the above trace options. */
#define PTRACE_EVENT_FORK 1
@@ -73,6 +74,7 @@
#define PTRACE_EVENT_VFORK_DONE 5
#define PTRACE_EVENT_EXIT 6
#define PTRACE_EVENT_STOP 7
+#define PTRACE_EVENT_SECCOMP 8 /* never directly delivered */

#include <asm/ptrace.h>

@@ -101,8 +103,9 @@
#define PT_TRACE_EXEC PT_EVENT_FLAG(PTRACE_EVENT_EXEC)
#define PT_TRACE_VFORK_DONE PT_EVENT_FLAG(PTRACE_EVENT_VFORK_DONE)
#define PT_TRACE_EXIT PT_EVENT_FLAG(PTRACE_EVENT_EXIT)
+#define PT_TRACE_SECCOMP PT_EVENT_FLAG(PTRACE_EVENT_SECCOMP)

-#define PT_TRACE_MASK 0x000003f4
+#define PT_TRACE_MASK 0x00000ff4

/* single stepping state bits (used on ARM and PA-RISC) */
#define PT_SINGLESTEP_BIT 31
diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h
index d039b7b..16887c1 100644
--- a/include/linux/seccomp.h
+++ b/include/linux/seccomp.h
@@ -19,8 +19,9 @@
* selects the least permissive choice.
*/
#define SECCOMP_RET_KILL 0x00000000U /* kill the task immediately */
-#define SECCOMP_RET_TRAP 0x00020000U /* disallow and send sigtrap */
-#define SECCOMP_RET_ERRNO 0x00030000U /* returns an errno */
+#define SECCOMP_RET_TRAP 0x00020000U /* only send sigtrap */
+#define SECCOMP_RET_ERRNO 0x00030000U /* only return an errno */
+#define SECCOMP_RET_TRACE 0x7ffe0000U /* allow, but notify the tracer */
#define SECCOMP_RET_ALLOW 0x7fff0000U /* allow */

/* Masks for the return value sections. */
@@ -55,6 +56,7 @@ struct seccomp_filter;
*
* @mode: indicates one of the valid values above for controlled
* system calls available to a process.
+ * @in_trace: indicates a seccomp filter hand off to ptrace has occurred
* @filter: The metadata and ruleset for determining what system calls
* are allowed for a task.
*
@@ -63,6 +65,7 @@ struct seccomp_filter;
*/
struct seccomp {
int mode;
+ int in_trace;
struct seccomp_filter *filter;
};

@@ -116,15 +119,20 @@ static inline int seccomp_mode(struct seccomp *s)
extern void put_seccomp_filter(struct seccomp_filter *);
extern void copy_seccomp(struct seccomp *child,
const struct seccomp *parent);
+extern void seccomp_tracer_done(void);
#else /* CONFIG_SECCOMP_FILTER */
/* The macro consumes the ->filter reference. */
#define put_seccomp_filter(_s) do { } while (0)
-
static inline void copy_seccomp(struct seccomp *child,
const struct seccomp *prev)
{
return;
}
+
+static inline void seccomp_tracer_done(void)
+{
+ return;
+}
#endif /* CONFIG_SECCOMP_FILTER */
#endif /* __KERNEL__ */
#endif /* _LINUX_SECCOMP_H */
diff --git a/include/linux/tracehook.h b/include/linux/tracehook.h
index a71a292..5000169 100644
--- a/include/linux/tracehook.h
+++ b/include/linux/tracehook.h
@@ -48,6 +48,7 @@

#include <linux/sched.h>
#include <linux/ptrace.h>
+#include <linux/seccomp.h>
#include <linux/security.h>
struct linux_binprm;

@@ -59,7 +60,7 @@ static inline void ptrace_report_syscall(struct pt_regs *regs)
int ptrace = current->ptrace;

if (!(ptrace & PT_PTRACED))
- return;
+ goto out;

ptrace_notify(SIGTRAP | ((ptrace & PT_TRACESYSGOOD) ? 0x80 : 0));

@@ -72,6 +73,10 @@ static inline void ptrace_report_syscall(struct pt_regs *regs)
send_sig(current->exit_code, current, 1);
current->exit_code = 0;
}
+
+out:
+ if (ptrace & PT_TRACE_SECCOMP)
+ seccomp_tracer_done();
}

/**
diff --git a/kernel/ptrace.c b/kernel/ptrace.c
index 00ab2ca..61e5ac4 100644
--- a/kernel/ptrace.c
+++ b/kernel/ptrace.c
@@ -19,6 +19,7 @@
#include <linux/signal.h>
#include <linux/audit.h>
#include <linux/pid_namespace.h>
+#include <linux/seccomp.h>
#include <linux/syscalls.h>
#include <linux/uaccess.h>
#include <linux/regset.h>
@@ -551,6 +552,9 @@ static int ptrace_setoptions(struct task_struct *child, unsigned long data)
if (data & PTRACE_O_TRACEEXIT)
child->ptrace |= PT_TRACE_EXIT;

+ if (data & PTRACE_O_TRACESECCOMP)
+ child->ptrace |= PT_TRACE_SECCOMP;
+
return (data & ~PTRACE_O_MASK) ? -EINVAL : 0;
}

diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index fc25d3a..120ceec 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -270,13 +270,12 @@ void put_seccomp_filter(struct seccomp_filter *orig)
* @child: forkee's seccomp
* @prev: forker's seccomp
*
- * Ensures that @child inherits seccomp mode and state if
- * seccomp filtering is in use.
+ * Ensures that @child inherits seccomp filtering if in use.
*/
void copy_seccomp(struct seccomp *child,
const struct seccomp *prev)
{
- child->mode = prev->mode;
+ /* Other fields are handled by dup_task_struct. */
child->filter = get_seccomp_filter(prev->filter);
}

@@ -299,6 +298,31 @@ static void seccomp_send_sigsys(int syscall, int reason)
info.si_syscall = syscall;
force_sig_info(SIGSYS, &info, current);
}
+
+/**
+ * seccomp_tracer_done: handles clean up after handing off to ptrace.
+ *
+ * Checks that the hand off from SECCOMP_RET_TRACE to ptrace was not
+ * subject to a race condition where the tracer disappeared or was
+ * never notified because of a pending SIGKILL.
+ * N.b., if ptrace_syscall_entry returned an int, this call could just
+ * disable the system call rather than using do_exit on tracer death.
+ */
+void seccomp_tracer_done(void)
+{
+ struct seccomp *s = &current->seccomp;
+ /* Some other slow-path call occurred */
+ if (!s->in_trace)
+ return;
+ s->in_trace = 0;
+ /* Tracer detached/died at some point after handing off to ptrace. */
+ if (!(current->ptrace & PT_PTRACED))
+ do_exit(SIGKILL);
+ /* If there is a SIGKILL pending, just do_exit. */
+ if (sigismember(&current->pending.signal, SIGKILL) ||
+ sigismember(&current->signal->shared_pending.signal, SIGKILL))
+ do_exit(SIGKILL);
+}
#endif /* CONFIG_SECCOMP_FILTER */

/*
@@ -360,6 +384,28 @@ int __secure_computing_int(int this_syscall)
seccomp_send_sigsys(this_syscall, reason_code);
return -1;
}
+ case SECCOMP_RET_TRACE:
+ /* If there is no interested tracer, return ENOSYS. */
+ if (!(current->ptrace & PT_TRACE_SECCOMP))
+ return -1;
+ /*
+ * Delegate to TIF_SYSCALL_TRACE. This allows fast-path
+ * seccomp calls to delegate to slow-path if needed.
+ * Since TIF_SYSCALL_TRACE will be unset on ptrace(2)
+ * continuation, there should be no direct side
+ * effects. If TIF_SYSCALL_TRACE is already set, this
+ * has no effect. Upon completion of handling, ptrace
+ * will call seccomp_tracer_done() which helps handle
+ * races.
+ */
+ set_tsk_thread_flag(current, TIF_SYSCALL_TRACE);
+ current->seccomp.in_trace = 1;
+ /*
+ * Allow the call, but upon completion, ptrace will
+ * call seccomp_tracer_done to handle tracer
+ * disappearance/death to ensure notification occurred.
+ */
+ return 0;
case SECCOMP_RET_ALLOW:
return 0;
case SECCOMP_RET_KILL:
--
1.7.5.4

2012-02-21 17:33:04

by Will Drewry

Subject: [PATCH v10 08/11] seccomp: Add SECCOMP_RET_TRAP

Adds a new return value to seccomp filters that triggers a SIGSYS to be
delivered with the new SYS_SECCOMP si_code.

This allows in-process system call emulation, including just specifying
an errno or cleanly dumping core, rather than just dying.
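
As a rough illustration of how a filter can pass a reason code to the
SIGSYS handler, the sketch below packs 16 bits of data into a
SECCOMP_RET_TRAP return value and unpacks them the way the kernel side
does before storing them in si_errno. The SECCOMP_RET_TRAP value is
taken from this patch; the two mask values and the helper names are
assumptions made for the example.

```c
#include <assert.h>
#include <stdint.h>

#define SECCOMP_RET_TRAP	0x00020000U	/* value from this patch */
#define SECCOMP_RET_ACTION	0xffff0000U	/* assumed action mask */
#define SECCOMP_RET_DATA	0x0000ffffU	/* assumed data mask (16 bits) */

/* Build the value a filter would return to trap with a reason code. */
static inline uint32_t trap_with_reason(uint32_t reason)
{
	return SECCOMP_RET_TRAP | (reason & SECCOMP_RET_DATA);
}

/* Nonzero if the filter result asks for a trap. */
static inline int is_trap(uint32_t action)
{
	return (action & SECCOMP_RET_ACTION) == SECCOMP_RET_TRAP;
}

/* Recover the reason code that would be reported via si_errno. */
static inline int trap_reason(uint32_t action)
{
	assert(is_trap(action));
	return (int)(action & SECCOMP_RET_DATA);
}
```

A filter could, for example, return trap_with_reason(1) for one policy
violation and trap_with_reason(2) for another, letting a single handler
tell them apart.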

v10: - use SIGSYS, syscall_get_arch, updates arch/Kconfig
note suggested-by (though original suggestion had other behaviors)
v9: - changes to SIGILL
v8: - clean up based on changes to dependent patches
v7: - introduction

Suggested-by: Markus Gutschke <[email protected]>
Suggested-by: Julien Tinnes <[email protected]>
Signed-off-by: Will Drewry <[email protected]>
---
arch/Kconfig | 14 +++++++++-----
include/linux/seccomp.h | 1 +
kernel/seccomp.c | 28 ++++++++++++++++++++++++++++
3 files changed, 38 insertions(+), 5 deletions(-)

diff --git a/arch/Kconfig b/arch/Kconfig
index aa00571..6d6d9dc 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -202,11 +202,15 @@ config HAVE_CMPXCHG_DOUBLE
config HAVE_ARCH_SECCOMP_FILTER
bool
help
- This symbol should be selected by an architecure if it provides
- asm/syscall.h, specifically syscall_get_arguments(),
- syscall_get_arch(), and syscall_set_return_value(). Additionally,
- its system call entry path must respect a return value of -1 from
- __secure_computing_int() and/or secure_computing().
+ This symbol should be selected by an architecure if it provides:
+ asm/syscall.h:
+ - syscall_get_arch()
+ - syscall_get_arguments()
+ - syscall_rollback()
+ - syscall_set_return_value()
+ SIGSYS siginfo_t support must be implemented.
+ __secure_computing_int()/secure_computing()'s return value must be
+ checked, with -1 resulting in the syscall being skipped.

config SECCOMP_FILTER
def_bool y
diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h
index 54ecb61..d039b7b 100644
--- a/include/linux/seccomp.h
+++ b/include/linux/seccomp.h
@@ -19,6 +19,7 @@
* selects the least permissive choice.
*/
#define SECCOMP_RET_KILL 0x00000000U /* kill the task immediately */
+#define SECCOMP_RET_TRAP 0x00020000U /* disallow and send sigtrap */
#define SECCOMP_RET_ERRNO 0x00030000U /* returns an errno */
#define SECCOMP_RET_ALLOW 0x7fff0000U /* allow */

diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index 23f1844..fc25d3a 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -279,6 +279,26 @@ void copy_seccomp(struct seccomp *child,
child->mode = prev->mode;
child->filter = get_seccomp_filter(prev->filter);
}
+
+/**
+ * seccomp_send_sigsys - signals the task to allow in-process syscall emulation
+ * @syscall: syscall number to send to userland
+ * @reason: filter-supplied reason code to send to userland (via si_errno)
+ *
+ * Forces a SIGSYS with a code of SYS_SECCOMP and related sigsys info.
+ */
+static void seccomp_send_sigsys(int syscall, int reason)
+{
+ struct siginfo info;
+ memset(&info, 0, sizeof(info));
+ info.si_signo = SIGSYS;
+ info.si_code = SYS_SECCOMP;
+ info.si_call_addr = (void __user *)KSTK_EIP(current);
+ info.si_errno = reason;
+ info.si_arch = syscall_get_arch(current, task_pt_regs(current));
+ info.si_syscall = syscall;
+ force_sig_info(SIGSYS, &info, current);
+}
#endif /* CONFIG_SECCOMP_FILTER */

/*
@@ -332,6 +352,14 @@ int __secure_computing_int(int this_syscall)
-(action & SECCOMP_RET_DATA),
0);
return -1;
+ case SECCOMP_RET_TRAP: {
+ int reason_code = action & SECCOMP_RET_DATA;
+ /* Show the handler the original registers. */
+ syscall_rollback(current, task_pt_regs(current));
+ /* Let the filter pass back 16 bits of data. */
+ seccomp_send_sigsys(this_syscall, reason_code);
+ return -1;
+ }
case SECCOMP_RET_ALLOW:
return 0;
case SECCOMP_RET_KILL:
--
1.7.5.4

2012-02-21 17:31:33

by Will Drewry

Subject: [PATCH v10 04/11] arch/x86: add syscall_get_arch to syscall.h

Add syscall_get_arch() to export the current AUDIT_ARCH_* based on the
system call entry path.
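
To show why a filter needs the architecture token, the sketch below
maps an AUDIT_ARCH_* value to the write(2) system call number, which
differs between the i386 and x86-64 tables. The SKETCH_* constants
carry the values defined in <linux/audit.h>; the helper names are
hypothetical, and arch_for_entry() only imitates the TS_COMPAT test
added by this patch.

```c
#include <assert.h>
#include <stdint.h>

/* Same values as AUDIT_ARCH_I386 / AUDIT_ARCH_X86_64 in <linux/audit.h>. */
#define SKETCH_AUDIT_ARCH_I386		0x40000003U
#define SKETCH_AUDIT_ARCH_X86_64	0xC000003EU

/* Imitates syscall_get_arch() on a 64-bit kernel with IA32 emulation. */
static inline uint32_t arch_for_entry(int compat_entry)
{
	return compat_entry ? SKETCH_AUDIT_ARCH_I386
			    : SKETCH_AUDIT_ARCH_X86_64;
}

/* Number of write(2) for the given arch, or -1 if the arch is unknown. */
static inline long write_nr_for_arch(uint32_t arch)
{
	switch (arch) {
	case SKETCH_AUDIT_ARCH_I386:
		return 4;	/* __NR_write on i386 */
	case SKETCH_AUDIT_ARCH_X86_64:
		return 1;	/* __NR_write on x86-64 */
	default:
		return -1;
	}
}

/* Usage example: the same syscall has different numbers per entry path. */
static inline void arch_examples(void)
{
	assert(write_nr_for_arch(arch_for_entry(1)) == 4);
	assert(write_nr_for_arch(arch_for_entry(0)) == 1);
}
```

A filter that compares only the system call number, without first
checking the architecture, could therefore allow or deny the wrong call
for a 32-bit entry on a 64-bit kernel.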

Signed-off-by: Will Drewry <[email protected]>
---
arch/x86/include/asm/syscall.h | 23 +++++++++++++++++++++++
1 files changed, 23 insertions(+), 0 deletions(-)

diff --git a/arch/x86/include/asm/syscall.h b/arch/x86/include/asm/syscall.h
index d962e56..1d713e4 100644
--- a/arch/x86/include/asm/syscall.h
+++ b/arch/x86/include/asm/syscall.h
@@ -13,6 +13,7 @@
#ifndef _ASM_X86_SYSCALL_H
#define _ASM_X86_SYSCALL_H

+#include <linux/audit.h>
#include <linux/sched.h>
#include <linux/err.h>
#include <asm/asm-offsets.h> /* For NR_syscalls */
@@ -87,6 +88,12 @@ static inline void syscall_set_arguments(struct task_struct *task,
memcpy(&regs->bx + i, args, n * sizeof(args[0]));
}

+static inline int syscall_get_arch(struct task_struct *task,
+ struct pt_regs *regs)
+{
+ return AUDIT_ARCH_I386;
+}
+
#else /* CONFIG_X86_64 */

static inline void syscall_get_arguments(struct task_struct *task,
@@ -211,6 +218,22 @@ static inline void syscall_set_arguments(struct task_struct *task,
}
}

+static inline int syscall_get_arch(struct task_struct *task,
+ struct pt_regs *regs)
+{
+#ifdef CONFIG_IA32_EMULATION
+ /*
+ * TS_COMPAT is set for 32-bit syscall entries and then
+ * remains set until we return to user mode.
+ *
+ * TIF_IA32 tasks should always have TS_COMPAT set at
+ * system call time.
+ */
+ if (task_thread_info(task)->status & TS_COMPAT)
+ return AUDIT_ARCH_I386;
+#endif
+ return AUDIT_ARCH_X86_64;
+}
#endif /* CONFIG_X86_32 */

#endif /* _ASM_X86_SYSCALL_H */
--
1.7.5.4

2012-02-21 17:33:29

by Will Drewry

Subject: [PATCH v10 05/11] seccomp: add system call filtering using BPF

[This patch depends on [email protected]'s no_new_privs patch:
https://lkml.org/lkml/2012/1/30/264
]

This patch adds support for seccomp mode 2. Mode 2 introduces the
ability for unprivileged processes to install system call filtering
policy expressed in terms of a Berkeley Packet Filter (BPF) program.
The kernel evaluates this program for each system call the task
makes, and the program computes a result based on data in the format
of struct seccomp_data.

A filter program may be installed by calling:
struct sock_fprog fprog = { ... };
...
prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &fprog);

The return value of the filter program determines if the system call is
allowed to proceed or denied. If the first filter program installed
allows prctl(2) calls, then the above call may be made repeatedly
by a task to further reduce its access to the kernel. All attached
programs must be evaluated before a system call will be allowed to
proceed.
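
To make the shape of such a program concrete, here is a minimal sketch of a filter that allows read, write and exit and kills the task on anything else. It is an illustration under stated assumptions, not code from this series: struct sketch_seccomp_data copies the layout proposed in this revision, the SKETCH_RET_* values are copied from this patch, and all sketch_* names are hypothetical.

```c
#include <linux/filter.h>	/* struct sock_filter, struct sock_fprog, BPF_* */
#include <stddef.h>		/* offsetof */
#include <stdint.h>
#include <sys/syscall.h>	/* SYS_read, SYS_write, SYS_exit */

/* Assumed layout: a copy of struct seccomp_data as proposed in this patch. */
struct sketch_seccomp_data {
	uint64_t args[6];
	uint64_t instruction_pointer;
	uint32_t arch;
	int nr;
};

/* Return values copied from this patch. */
#define SKETCH_RET_KILL		0x00000000U
#define SKETCH_RET_ALLOW	0x7fff0000U

struct sock_filter sketch_insns[] = {
	/* 0: A = seccomp_data.nr */
	BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
		 offsetof(struct sketch_seccomp_data, nr)),
	/* 1-3: on a match, jump forward to the ALLOW return at index 5. */
	BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, SYS_read, 3, 0),
	BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, SYS_write, 2, 0),
	BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, SYS_exit, 1, 0),
	/* 4: nothing matched */
	BPF_STMT(BPF_RET | BPF_K, SKETCH_RET_KILL),
	/* 5: matched one of the allowed calls */
	BPF_STMT(BPF_RET | BPF_K, SKETCH_RET_ALLOW),
};

struct sock_fprog sketch_prog = {
	.len = sizeof(sketch_insns) / sizeof(sketch_insns[0]),
	.filter = sketch_insns,
};
```

The program would be handed to the prctl() call shown above as &sketch_prog. The load offset follows the struct as laid out in this revision; a kernel with a different struct seccomp_data layout would need different offsets.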

Filter programs will be inherited across fork/clone and execve.
However, if the task attaching the filter is unprivileged
(!CAP_SYS_ADMIN), the no_new_privs bit must already be set on the task.
This ensures that unprivileged tasks cannot attach filters that affect
privileged tasks (e.g., setuid binary).
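
The inheritance model can be sketched in user space as a reference-counted chain in which a forked child shares its parent's newest node. This is a simplified model under assumptions: struct sketch_filter and the sketch_* helpers are hypothetical, a plain int replaces the atomic_t used in the patch, and sketch_freed exists only so the freeing order can be observed.

```c
#include <stdlib.h>

/* Hypothetical user-space model of a filter node; not the kernel struct. */
struct sketch_filter {
	int usage;			/* plain int here; the patch uses atomic_t */
	struct sketch_filter *prev;	/* older filter, possibly shared */
};

int sketch_freed;	/* number of nodes freed so far, for observation only */

/* Attach a new filter on top of head; the new node inherits head's reference. */
struct sketch_filter *sketch_attach(struct sketch_filter *head)
{
	struct sketch_filter *f = calloc(1, sizeof(*f));

	if (!f)
		return NULL;
	f->usage = 1;
	f->prev = head;
	return f;
}

/* fork(): the child takes its own reference on the parent's newest filter. */
struct sketch_filter *sketch_get(struct sketch_filter *f)
{
	if (f)
		f->usage++;
	return f;
}

/* Task exit: drop one reference and free every node whose count hits zero. */
void sketch_put(struct sketch_filter *f)
{
	while (f && --f->usage == 0) {
		struct sketch_filter *freeme = f;

		f = f->prev;
		free(freeme);
		sketch_freed++;
	}
}
```

Because a node is only freed once its count reaches zero, a filter shared with a child outlives the parent, and the walk in sketch_put() stops at the first node that is still referenced by another task.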

This approach has a number of benefits, a few of which are as follows:
- BPF has been exposed to userland for a long time.
- BPF optimization (and JIT'ing) is well understood.
- Userland already knows its ABI: system call numbers and desired
arguments.
- No time-of-check-time-of-use vulnerable data accesses are possible.
- System call arguments are loaded on access only to minimize copying
required for system call policy decisions.
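
The on-access loading can be illustrated with a small user-space sketch of how a 32-bit load offset into args[] maps to an argument and one of its two 32-bit words. The sketch_* names are hypothetical. The word is picked by its position in memory, so which word holds the low bits follows host byte order; memcpy is used here to avoid a pointer cast.

```c
#include <stdint.h>
#include <string.h>

/* Which of the six arguments a byte offset into args[] refers to. */
int sketch_arg_index(unsigned int off)
{
	return off >> 3;	/* args[0] is at offset 0, 8 bytes per argument */
}

/* Which 32-bit word of that argument: 0 for the first in memory, 1 for the second. */
int sketch_word_index(unsigned int off)
{
	return (off % sizeof(uint64_t)) ? 1 : 0;
}

/* Return the 32-bit word at in-memory position word (0 or 1) of value. */
uint32_t sketch_get_u32(uint64_t value, int word)
{
	uint32_t out;

	memcpy(&out, (const unsigned char *)&value + word * sizeof(out),
	       sizeof(out));
	return out;
}

/* Load the 32-bit word at byte offset off from a six-argument array. */
uint32_t sketch_load_arg(const uint64_t args[6], unsigned int off)
{
	return sketch_get_u32(args[sketch_arg_index(off)],
			      sketch_word_index(off));
}
```

A filter that needs a full 64-bit argument would issue two such loads, one per word, using offsetof() to pick the offsets.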

Mode 2 support is restricted to architectures that enable
HAVE_ARCH_SECCOMP_FILTER. In this patch, the primary dependency is on
syscall_get_arguments(). The full desired scope of this feature will
add a few minor additional requirements expressed later in this series.
Based on discussion, SECCOMP_RET_ERRNO and SECCOMP_RET_TRACE seem to be
the desired additional functionality.

No architectures are enabled in this patch.

v10: - seccomp_data has changed again to be more aesthetically pleasing
([email protected])
- calling convention is noted in a new u32 field using syscall_get_arch.
This allows for cross-calling convention tasks to use seccomp filters.
([email protected])
- lots of clean up (thanks, Indan!)
v9: - n/a
v8: - use bpf_chk_filter, bpf_run_filter. update load_fns
- Lots of fixes courtesy of [email protected]:
-- fix up load behavior, compat fixups, and merge alloc code,
-- renamed pc and dropped __packed, use bool compat.
-- Added a hidden CONFIG_SECCOMP_FILTER to synthesize non-arch
dependencies
v7: (massive overhaul thanks to Indan, others)
- added CONFIG_HAVE_ARCH_SECCOMP_FILTER
- merged into seccomp.c
- minimal seccomp_filter.h
- no config option (part of seccomp)
- no new prctl
- doesn't break seccomp on systems without asm/syscall.h
(works but arg access always fails)
- dropped seccomp_init_task, extra free functions, ...
- dropped the no-asm/syscall.h code paths
- merges with network sk_run_filter and sk_chk_filter
v6: - fix memory leak on attach compat check failure
- require no_new_privs || CAP_SYS_ADMIN prior to filter
installation. ([email protected])
- s/seccomp_struct_/seccomp_/ for macros/functions ([email protected])
- cleaned up Kconfig ([email protected])
- on block, note if the call was compat (so the # means something)
v5: - uses syscall_get_arguments
([email protected],[email protected], [email protected])
- uses union-based arg storage with hi/lo struct to
handle endianness. Compromises between the two alternate
proposals to minimize extra arg shuffling and account for
endianness assuming userspace uses offsetof().
([email protected], [email protected])
- update Kconfig description
- add include/seccomp_filter.h and add its installation
- (naive) on-demand syscall argument loading
- drop seccomp_t ([email protected])
v4: - adjusted prctl to make room for PR_[SG]ET_NO_NEW_PRIVS
- now uses current->no_new_privs
([email protected],[email protected])
- assign names to seccomp modes ([email protected])
- fix style issues ([email protected])
- reworded Kconfig entry ([email protected])
v3: - macros to inline ([email protected])
- init_task behavior fixed ([email protected])
- drop creator entry and extra NULL check ([email protected])
- alloc returns -EINVAL on bad sizing ([email protected])
- adds tentative use of "always_unprivileged" as per
[email protected] and [email protected]
v2: - (patch 2 only)

Signed-off-by: Will Drewry <[email protected]>
---
arch/Kconfig | 18 +++
include/linux/Kbuild | 1 +
include/linux/seccomp.h | 76 +++++++++++-
kernel/fork.c | 3 +
kernel/seccomp.c | 321 ++++++++++++++++++++++++++++++++++++++++++++---
kernel/sys.c | 2 +-
6 files changed, 399 insertions(+), 22 deletions(-)

diff --git a/arch/Kconfig b/arch/Kconfig
index 4f55c73..8150fa2 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -199,4 +199,22 @@ config HAVE_CMPXCHG_LOCAL
config HAVE_CMPXCHG_DOUBLE
bool

+config HAVE_ARCH_SECCOMP_FILTER
+ bool
+ help
+ This symbol should be selected by an architecture if it provides
+ asm/syscall.h, specifically syscall_get_arguments() and
+ syscall_get_arch().
+
+config SECCOMP_FILTER
+ def_bool y
+ depends on HAVE_ARCH_SECCOMP_FILTER && SECCOMP && NET
+ help
+ Enable tasks to build secure computing environments defined
+ in terms of Berkeley Packet Filter programs which implement
+ task-defined system call filtering policies.
+
+ See Documentation/prctl/seccomp_filter.txt for more
+ information on the topic of seccomp filtering.
+
source "kernel/gcov/Kconfig"
diff --git a/include/linux/Kbuild b/include/linux/Kbuild
index c94e717..d41ba12 100644
--- a/include/linux/Kbuild
+++ b/include/linux/Kbuild
@@ -330,6 +330,7 @@ header-y += scc.h
header-y += sched.h
header-y += screen_info.h
header-y += sdla.h
+header-y += seccomp.h
header-y += securebits.h
header-y += selinux_netlink.h
header-y += sem.h
diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h
index d61f27f..001f883 100644
--- a/include/linux/seccomp.h
+++ b/include/linux/seccomp.h
@@ -1,14 +1,67 @@
#ifndef _LINUX_SECCOMP_H
#define _LINUX_SECCOMP_H

+#include <linux/compiler.h>
+#include <linux/types.h>
+
+
+/* Valid values for seccomp.mode and prctl(PR_SET_SECCOMP, <mode>) */
+#define SECCOMP_MODE_DISABLED 0 /* seccomp is not in use. */
+#define SECCOMP_MODE_STRICT 1 /* uses hard-coded filter. */
+#define SECCOMP_MODE_FILTER 2 /* uses user-supplied filter. */
+
+/*
+ * BPF programs may return a 32-bit value.
+ * The bottom 16-bits are reserved for future use.
+ * The upper 16-bits are ordered from least permissive values to most.
+ *
+ * The ordering ensures that a min_t() over composed return values always
+ * selects the least permissive choice.
+ */
+#define SECCOMP_RET_KILL 0x00000000U /* kill the task immediately */
+#define SECCOMP_RET_ALLOW 0x7fff0000U /* allow */
+
+/* Masks for the return value sections. */
+#define SECCOMP_RET_ACTION 0xffff0000U
+#define SECCOMP_RET_DATA 0x0000ffffU
+
+/**
+ * struct seccomp_data - the format the BPF program executes over.
+ * @args: up to 6 system call arguments. When the calling convention is
+ * 32-bit, the arguments will still be at each args[X] offset.
+ * @instruction_pointer: at the time of the system call.
+ * @arch: indicates system call convention as an AUDIT_ARCH_* value
+ * as defined in <linux/audit.h>.
+ * @nr: the system call number
+ */
+struct seccomp_data {
+ __u64 args[6];
+ __u64 instruction_pointer;
+ __u32 arch;
+ int nr;
+};

+#ifdef __KERNEL__
#ifdef CONFIG_SECCOMP

#include <linux/thread_info.h>
#include <asm/seccomp.h>

+struct seccomp_filter;
+/**
+ * struct seccomp - the state of a seccomp'ed process
+ *
+ * @mode: indicates one of the valid values above for controlled
+ * system calls available to a process.
+ * @filter: The metadata and ruleset for determining what system calls
+ * are allowed for a task.
+ *
+ * @filter must only be accessed from the context of current as there
+ * is no locking.
+ */
struct seccomp {
int mode;
+ struct seccomp_filter *filter;
};

extern void __secure_computing(int);
@@ -19,7 +72,7 @@ static inline void secure_computing(int this_syscall)
}

extern long prctl_get_seccomp(void);
-extern long prctl_set_seccomp(unsigned long);
+extern long prctl_set_seccomp(unsigned long, char __user *);

static inline int seccomp_mode(struct seccomp *s)
{
@@ -31,15 +84,16 @@ static inline int seccomp_mode(struct seccomp *s)
#include <linux/errno.h>

struct seccomp { };
+struct seccomp_filter { };

-#define secure_computing(x) do { } while (0)
+#define secure_computing(x) 0

static inline long prctl_get_seccomp(void)
{
return -EINVAL;
}

-static inline long prctl_set_seccomp(unsigned long arg2)
+static inline long prctl_set_seccomp(unsigned long arg2, char __user *arg3)
{
return -EINVAL;
}
@@ -48,7 +102,21 @@ static inline int seccomp_mode(struct seccomp *s)
{
return 0;
}
-
#endif /* CONFIG_SECCOMP */

+#ifdef CONFIG_SECCOMP_FILTER
+extern void put_seccomp_filter(struct seccomp_filter *);
+extern void copy_seccomp(struct seccomp *child,
+ const struct seccomp *parent);
+#else /* CONFIG_SECCOMP_FILTER */
+/* The macro consumes the ->filter reference. */
+#define put_seccomp_filter(_s) do { } while (0)
+
+static inline void copy_seccomp(struct seccomp *child,
+ const struct seccomp *prev)
+{
+ return;
+}
+#endif /* CONFIG_SECCOMP_FILTER */
+#endif /* __KERNEL__ */
#endif /* _LINUX_SECCOMP_H */
diff --git a/kernel/fork.c b/kernel/fork.c
index b77fd55..a5187b7 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -34,6 +34,7 @@
#include <linux/cgroup.h>
#include <linux/security.h>
#include <linux/hugetlb.h>
+#include <linux/seccomp.h>
#include <linux/swap.h>
#include <linux/syscalls.h>
#include <linux/jiffies.h>
@@ -169,6 +170,7 @@ void free_task(struct task_struct *tsk)
free_thread_info(tsk->stack);
rt_mutex_debug_task_free(tsk);
ftrace_graph_exit_task(tsk);
+ put_seccomp_filter(tsk->seccomp.filter);
free_task_struct(tsk);
}
EXPORT_SYMBOL(free_task);
@@ -1113,6 +1115,7 @@ static struct task_struct *copy_process(unsigned long clone_flags,
goto fork_out;

ftrace_graph_init_task(p);
+ copy_seccomp(&p->seccomp, &current->seccomp);

rt_mutex_init_task(p);

diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index e8d76c5..0043b7e 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -3,16 +3,287 @@
*
* Copyright 2004-2005 Andrea Arcangeli <[email protected]>
*
- * This defines a simple but solid secure-computing mode.
+ * Copyright (C) 2012 Google, Inc.
+ * Will Drewry <[email protected]>
+ *
+ * This defines a simple but solid secure-computing facility.
+ *
+ * Mode 1 uses a fixed list of allowed system calls.
+ * Mode 2 allows user-defined system call filters in the form
+ * of Berkeley Packet Filters/Linux Socket Filters.
*/

+#include <linux/atomic.h>
#include <linux/audit.h>
-#include <linux/seccomp.h>
-#include <linux/sched.h>
#include <linux/compat.h>
+#include <linux/filter.h>
+#include <linux/sched.h>
+#include <linux/seccomp.h>
+#include <linux/security.h>
+#include <linux/slab.h>
+#include <linux/uaccess.h>
+
+#include <linux/tracehook.h>
+#include <asm/syscall.h>

/* #define SECCOMP_DEBUG 1 */
-#define NR_SECCOMP_MODES 1
+
+#ifdef CONFIG_SECCOMP_FILTER
+/**
+ * struct seccomp_filter - container for seccomp BPF programs
+ *
+ * @usage: reference count to manage the object lifetime.
+ * get/put helpers should be used when accessing an instance
+ * outside of a lifetime-guarded section. In general, this
+ * is only needed for handling filters shared across tasks.
+ * @prev: points to a previously installed, or inherited, filter
+ * @compat: indicates the value of is_compat_task() at creation time
+ * @insns: the BPF program instructions to evaluate
+ * @len: the number of instructions in the program
+ *
+ * seccomp_filter objects are organized in a tree linked via the @prev
+ * pointer. For any task, it appears to be a singly-linked list starting
+ * with current->seccomp.filter, the most recently attached or inherited filter.
+ * However, multiple filters may share a @prev node, by way of fork(), which
+ * results in a unidirectional tree existing in memory. This is similar to
+ * how namespaces work.
+ *
+ * seccomp_filter objects should never be modified after being attached
+ * to a task_struct (other than @usage).
+ */
+struct seccomp_filter {
+ atomic_t usage;
+ struct seccomp_filter *prev;
+ bool compat;
+ unsigned short len; /* Instruction count */
+ struct sock_filter insns[];
+};
+
+static void seccomp_filter_log_failure(int syscall)
+{
+ int compat = 0;
+#ifdef CONFIG_COMPAT
+ compat = is_compat_task();
+#endif
+ pr_info("%s[%d]: %ssystem call %d blocked at 0x%lx\n",
+ current->comm, task_pid_nr(current),
+ (compat ? "compat " : ""),
+ syscall, KSTK_EIP(current));
+}
+
+/**
+ * get_u32 - returns a u32 offset into data
+ * @data: an unsigned 64-bit value
+ * @index: 0 or 1 to return the first or second 32-bits
+ *
+ * This inline exists to hide the length of unsigned long.
+ * If a 32-bit unsigned long is passed in, it will be extended
+ * and the top 32-bits will be 0. If it is a 64-bit unsigned
+ * long, then whatever data is resident will be properly returned.
+ */
+static inline u32 get_u32(u64 data, int index)
+{
+ return ((u32 *)&data)[index];
+}
+
+/* Helper for bpf_load below. */
+#define BPF_DATA(_name) offsetof(struct seccomp_data, _name)
+/**
+ * bpf_load: checks and returns a pointer to the requested offset
+ * @nr: int syscall passed as a void * to bpf_run_filter
+ * @off: index into struct seccomp_data to load from
+ * @size: load width requested
+ * @buf: temporary storage supplied by bpf_run_filter
+ *
+ * Returns a pointer to @buf where the value was stored.
+ * On failure, returns NULL.
+ */
+static void *bpf_load(const void *nr, int off, unsigned int size, void *buf)
+{
+ unsigned long value;
+ u32 *A = buf;
+
+ if (size != sizeof(u32))
+ return NULL;
+
+ if (off >= BPF_DATA(args[0]) && off < BPF_DATA(args[6])) {
+ struct pt_regs *regs = task_pt_regs(current);
+ int arg = off >> 3; /* args[0] is at offset 0. */
+ int index = (off % sizeof(u64)) ? 1 : 0;
+ syscall_get_arguments(current, regs, arg, 1, &value);
+ *A = get_u32(value, index);
+ } else if (off == BPF_DATA(nr)) {
+ *A = (u32)(uintptr_t)nr;
+ } else if (off == BPF_DATA(arch)) {
+ struct pt_regs *regs = task_pt_regs(current);
+ *A = syscall_get_arch(current, regs);
+ } else if (off == BPF_DATA(instruction_pointer)) {
+ *A = get_u32(KSTK_EIP(current), 0);
+ } else if (off == BPF_DATA(instruction_pointer) + sizeof(u32)) {
+ *A = get_u32(KSTK_EIP(current), 1);
+ } else {
+ return NULL;
+ }
+ return buf;
+}
+
+/**
+ * seccomp_run_filters - evaluates all seccomp filters against @syscall
+ * @syscall: number of the current system call
+ *
+ * Returns valid seccomp BPF response codes.
+ */
+static u32 seccomp_run_filters(int syscall)
+{
+ struct seccomp_filter *f;
+ u32 ret = SECCOMP_RET_KILL;
+ static const struct bpf_load_fn fns = {
+ bpf_load,
+ sizeof(struct seccomp_data),
+ };
+ const void *sc_ptr = (const void *)(uintptr_t)syscall;
+
+ /*
+ * All filters are evaluated in order of youngest to oldest. The lowest
+ * BPF return value always takes priority.
+ */
+ for (f = current->seccomp.filter; f; f = f->prev) {
+ ret = bpf_run_filter(sc_ptr, f->insns, &fns);
+ if (ret != SECCOMP_RET_ALLOW)
+ break;
+ }
+ return ret;
+}
+
+/**
+ * seccomp_attach_filter: Attaches a seccomp filter to current.
+ * @fprog: BPF program to install
+ *
+ * Returns 0 on success or an errno on failure.
+ */
+static long seccomp_attach_filter(struct sock_fprog *fprog)
+{
+ struct seccomp_filter *filter;
+ unsigned long fp_size = fprog->len * sizeof(struct sock_filter);
+ long ret;
+
+ if (fprog->len == 0 || fprog->len > BPF_MAXINSNS)
+ return -EINVAL;
+
+ /* Allocate a new seccomp_filter */
+ filter = kzalloc(sizeof(struct seccomp_filter) + fp_size, GFP_KERNEL);
+ if (!filter)
+ return -ENOMEM;
+ atomic_set(&filter->usage, 1);
+ filter->len = fprog->len;
+
+ /* Copy the instructions from fprog. */
+ ret = -EFAULT;
+ if (copy_from_user(filter->insns, fprog->filter, fp_size))
+ goto out;
+
+ /* Check the fprog */
+ ret = bpf_chk_filter(filter->insns, filter->len, BPF_CHK_FLAGS_NO_SKB);
+ if (ret)
+ goto out;
+
+ /*
+ * Installing a seccomp filter requires that the task
+ * have CAP_SYS_ADMIN in its namespace or be running with
+ * no_new_privs. This avoids scenarios where unprivileged
+ * tasks can affect the behavior of privileged children.
+ */
+ ret = -EACCES;
+ if (!current->no_new_privs &&
+ security_capable_noaudit(current_cred(), current_user_ns(),
+ CAP_SYS_ADMIN) != 0)
+ goto out;
+
+ /*
+ * If there is an existing filter, make it the prev
+ * and don't drop its task reference.
+ */
+ filter->prev = current->seccomp.filter;
+ current->seccomp.filter = filter;
+ return 0;
+out:
+ put_seccomp_filter(filter); /* for get or task, on err */
+ return ret;
+}
+
+/**
+ * seccomp_attach_user_filter - attaches a user-supplied sock_fprog
+ * @user_filter: pointer to the user data containing a sock_fprog.
+ *
+ * This function may be called repeatedly to install additional filters.
+ * Every filter successfully installed will be evaluated (in reverse order)
+ * for each system call the task makes.
+ *
+ * Returns 0 on success and non-zero otherwise.
+ */
+long seccomp_attach_user_filter(char __user *user_filter)
+{
+ struct sock_fprog fprog;
+ long ret = -EFAULT;
+
+ if (!user_filter)
+ goto out;
+#ifdef CONFIG_COMPAT
+ if (is_compat_task()) {
+ /* XXX: Share with net/compat.c (compat_sock_fprog) */
+ struct {
+ u16 len;
+ compat_uptr_t filter; /* struct sock_filter */
+ } fprog32;
+ if (copy_from_user(&fprog32, user_filter, sizeof(fprog32)))
+ goto out;
+ fprog.len = fprog32.len;
+ fprog.filter = compat_ptr(fprog32.filter);
+ } else /* falls through to the if below. */
+#endif
+ if (copy_from_user(&fprog, user_filter, sizeof(fprog)))
+ goto out;
+ ret = seccomp_attach_filter(&fprog);
+out:
+ return ret;
+}
+
+/* get_seccomp_filter - increments the reference count of @orig. */
+static struct seccomp_filter *get_seccomp_filter(struct seccomp_filter *orig)
+{
+ if (!orig)
+ return NULL;
+ /* Reference count is bounded by the number of total processes. */
+ atomic_inc(&orig->usage);
+ return orig;
+}
+
+/* put_seccomp_filter - decrements the ref count of @orig and may free. */
+void put_seccomp_filter(struct seccomp_filter *orig)
+{
+ /* Clean up single-reference branches iteratively. */
+ while (orig && atomic_dec_and_test(&orig->usage)) {
+ struct seccomp_filter *freeme = orig;
+ orig = orig->prev;
+ kfree(freeme);
+ }
+}
+
+/**
+ * copy_seccomp: manages inheritance on fork
+ * @child: forkee's seccomp
+ * @prev: forker's seccomp
+ *
+ * Ensures that @child inherits seccomp mode and state if
+ * seccomp filtering is in use.
+ */
+void copy_seccomp(struct seccomp *child,
+ const struct seccomp *prev)
+{
+ child->mode = prev->mode;
+ child->filter = get_seccomp_filter(prev->filter);
+}
+#endif /* CONFIG_SECCOMP_FILTER */

/*
* Secure computing mode 1 allows only read/write/exit/sigreturn.
@@ -34,10 +305,10 @@ static int mode1_syscalls_32[] = {
void __secure_computing(int this_syscall)
{
int mode = current->seccomp.mode;
- int * syscall;
+ int *syscall;

switch (mode) {
- case 1:
+ case SECCOMP_MODE_STRICT:
syscall = mode1_syscalls;
#ifdef CONFIG_COMPAT
if (is_compat_task())
@@ -48,6 +319,13 @@ void __secure_computing(int this_syscall)
return;
} while (*++syscall);
break;
+#ifdef CONFIG_SECCOMP_FILTER
+ case SECCOMP_MODE_FILTER:
+ if (seccomp_run_filters(this_syscall) == SECCOMP_RET_ALLOW)
+ return;
+ seccomp_filter_log_failure(this_syscall);
+ break;
+#endif
default:
BUG();
}
@@ -64,25 +342,34 @@ long prctl_get_seccomp(void)
return current->seccomp.mode;
}

-long prctl_set_seccomp(unsigned long seccomp_mode)
+long prctl_set_seccomp(unsigned long seccomp_mode, char __user *filter)
{
- long ret;
+ long ret = -EINVAL;

- /* can set it only once to be even more secure */
- ret = -EPERM;
- if (unlikely(current->seccomp.mode))
+ if (current->seccomp.mode &&
+ current->seccomp.mode != seccomp_mode)
goto out;

- ret = -EINVAL;
- if (seccomp_mode && seccomp_mode <= NR_SECCOMP_MODES) {
- current->seccomp.mode = seccomp_mode;
- set_thread_flag(TIF_SECCOMP);
+ switch (seccomp_mode) {
+ case SECCOMP_MODE_STRICT:
+ ret = 0;
#ifdef TIF_NOTSC
disable_TSC();
#endif
- ret = 0;
+ break;
+#ifdef CONFIG_SECCOMP_FILTER
+ case SECCOMP_MODE_FILTER:
+ ret = seccomp_attach_user_filter(filter);
+ if (ret)
+ goto out;
+ break;
+#endif
+ default:
+ goto out;
}

- out:
+ current->seccomp.mode = seccomp_mode;
+ set_thread_flag(TIF_SECCOMP);
+out:
return ret;
}
diff --git a/kernel/sys.c b/kernel/sys.c
index 4070153..905031e 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -1899,7 +1899,7 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
error = prctl_get_seccomp();
break;
case PR_SET_SECCOMP:
- error = prctl_set_seccomp(arg2);
+ error = prctl_set_seccomp(arg2, (char __user *)arg3);
break;
case PR_GET_TSC:
error = GET_TSC_CTL(arg2);
--
1.7.5.4

2012-02-21 17:33:34

by Will Drewry

Subject: [PATCH v10 06/11] seccomp: add SECCOMP_RET_ERRNO

This change adds SECCOMP_RET_ERRNO as a valid return value from a
seccomp filter. Additionally, it makes the first use of the lower
16 bits for storing a filter-supplied errno. 16 bits is more than
enough for the errno-base.h values.

Returning errors instead of immediately terminating processes that
violate seccomp policy allows for broader use of this functionality
for kernel attack surface reduction. For example, a Linux container
could maintain a whitelist of pre-existing system calls but drop
all new ones with errnos. This would keep a logically static attack
surface while providing errnos that may allow for graceful failure
without the downside of do_exit() on a bad call.
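
The way several filter results combine, and how the winning value is split into an action and its data, can be sketched in user space as follows. The SKETCH_RET_* values are copied from this series; sketch_compose and sketch_apply are hypothetical names, and the precomputed results array stands in for running each BPF program.

```c
#include <stddef.h>
#include <stdint.h>

/* Values copied from this patch series. */
#define SKETCH_RET_KILL		0x00000000U
#define SKETCH_RET_ERRNO	0x00030000U
#define SKETCH_RET_ALLOW	0x7fff0000U
#define SKETCH_RET_ACTION	0xffff0000U
#define SKETCH_RET_DATA		0x0000ffffU

/* Fold per-filter results: the lowest value, i.e. the least permissive, wins. */
uint32_t sketch_compose(const uint32_t *results, size_t n)
{
	uint32_t ret = SKETCH_RET_ALLOW;
	size_t i;

	for (i = 0; i < n; i++)
		if (results[i] < ret)
			ret = results[i];
	return ret;
}

/*
 * Decode a composed result. Returns 0 when the call may proceed and a
 * negative errno for SKETCH_RET_ERRNO. Any other action sets *killed
 * and returns -1.
 */
long sketch_apply(uint32_t result, int *killed)
{
	*killed = 0;
	switch (result & SKETCH_RET_ACTION) {
	case SKETCH_RET_ALLOW:
		return 0;
	case SKETCH_RET_ERRNO:
		return -(long)(result & SKETCH_RET_DATA);
	default:	/* SKETCH_RET_KILL and unknown actions */
		*killed = 1;
		return -1;
	}
}
```

With one filter returning ALLOW and another returning ERRNO with data 13, the composed result is the ERRNO value and the call would fail with -13 instead of killing the task.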

v10: - change loaders to fn
v9: - n/a
v8: - update Kconfig to note new need for syscall_set_return_value.
- reordered such that TRAP behavior follows on later.
- made the for loop a little less indent-y
v7: - introduced

Signed-off-by: Will Drewry <[email protected]>
---
arch/Kconfig | 6 ++++--
include/linux/seccomp.h | 15 +++++++++++----
kernel/seccomp.c | 39 ++++++++++++++++++++++++++++-----------
3 files changed, 43 insertions(+), 17 deletions(-)

diff --git a/arch/Kconfig b/arch/Kconfig
index 8150fa2..aa00571 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -203,8 +203,10 @@ config HAVE_ARCH_SECCOMP_FILTER
bool
help
This symbol should be selected by an architecture if it provides
- asm/syscall.h, specifically syscall_get_arguments() and
- syscall_get_arch().
+ asm/syscall.h, specifically syscall_get_arguments(),
+ syscall_get_arch(), and syscall_set_return_value(). Additionally,
+ its system call entry path must respect a return value of -1 from
+ __secure_computing_int() and/or secure_computing().

config SECCOMP_FILTER
def_bool y
diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h
index 001f883..54ecb61 100644
--- a/include/linux/seccomp.h
+++ b/include/linux/seccomp.h
@@ -12,13 +12,14 @@

/*
* BPF programs may return a 32-bit value.
- * The bottom 16-bits are reserved for future use.
+ * The bottom 16-bits are for optional related return data.
* The upper 16-bits are ordered from least permissive values to most.
*
* The ordering ensures that a min_t() over composed return values always
* selects the least permissive choice.
*/
#define SECCOMP_RET_KILL 0x00000000U /* kill the task immediately */
+#define SECCOMP_RET_ERRNO 0x00030000U /* returns an errno */
#define SECCOMP_RET_ALLOW 0x7fff0000U /* allow */

/* Masks for the return value sections. */
@@ -64,11 +65,17 @@ struct seccomp {
struct seccomp_filter *filter;
};

-extern void __secure_computing(int);
-static inline void secure_computing(int this_syscall)
+/*
+ * Direct callers to __secure_computing should be updated as
+ * CONFIG_HAVE_ARCH_SECCOMP_FILTER propagates.
+ */
+extern void __secure_computing(int) __deprecated;
+extern int __secure_computing_int(int);
+static inline int secure_computing(int this_syscall)
{
if (unlikely(test_thread_flag(TIF_SECCOMP)))
- __secure_computing(this_syscall);
+ return __secure_computing_int(this_syscall);
+ return 0;
}

extern long prctl_get_seccomp(void);
diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index 0043b7e..23f1844 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -136,22 +136,18 @@ static void *bpf_load(const void *nr, int off, unsigned int size, void *buf)
static u32 seccomp_run_filters(int syscall)
{
struct seccomp_filter *f;
- u32 ret = SECCOMP_RET_KILL;
static const struct bpf_load_fn fns = {
bpf_load,
sizeof(struct seccomp_data),
};
+ u32 ret = SECCOMP_RET_ALLOW;
const void *sc_ptr = (const void *)(uintptr_t)syscall;
-
/*
* All filters are evaluated in order of youngest to oldest. The lowest
* BPF return value always takes priority.
*/
- for (f = current->seccomp.filter; f; f = f->prev) {
- ret = bpf_run_filter(sc_ptr, f->insns, &fns);
- if (ret != SECCOMP_RET_ALLOW)
- break;
- }
+ for (f = current->seccomp.filter; f; f = f->prev)
+ ret = min_t(u32, ret, bpf_run_filter(sc_ptr, f->insns, &fns));
return ret;
}

@@ -304,6 +300,13 @@ static int mode1_syscalls_32[] = {

void __secure_computing(int this_syscall)
{
+ /* Filter calls should never use this function. */
+ BUG_ON(current->seccomp.mode == SECCOMP_MODE_FILTER);
+ __secure_computing_int(this_syscall);
+}
+
+int __secure_computing_int(int this_syscall)
+{
int mode = current->seccomp.mode;
int *syscall;

@@ -316,15 +319,28 @@ void __secure_computing(int this_syscall)
#endif
do {
if (*syscall == this_syscall)
- return;
+ return 0;
} while (*++syscall);
break;
#ifdef CONFIG_SECCOMP_FILTER
- case SECCOMP_MODE_FILTER:
- if (seccomp_run_filters(this_syscall) == SECCOMP_RET_ALLOW)
- return;
+ case SECCOMP_MODE_FILTER: {
+ u32 action = seccomp_run_filters(this_syscall);
+ switch (action & SECCOMP_RET_ACTION) {
+ case SECCOMP_RET_ERRNO:
+ /* Set the low-order 16-bits as an errno. */
+ syscall_set_return_value(current, task_pt_regs(current),
+ -(action & SECCOMP_RET_DATA),
+ 0);
+ return -1;
+ case SECCOMP_RET_ALLOW:
+ return 0;
+ case SECCOMP_RET_KILL:
+ default:
+ break;
+ }
seccomp_filter_log_failure(this_syscall);
break;
+ }
#endif
default:
BUG();
@@ -335,6 +351,7 @@ void __secure_computing(int this_syscall)
#endif
audit_seccomp(this_syscall);
do_exit(SIGKILL);
+ return -1; /* never reached */
}

long prctl_get_seccomp(void)
--
1.7.5.4

2012-02-21 17:34:43

by Will Drewry

Subject: [PATCH v10 03/11] asm/syscall.h: add syscall_get_arch

Adds a stub for a function that will return the AUDIT_ARCH_*
value appropriate to the supplied task based on the system
call convention.

For audit's use, the value can generally be hard-coded at the
audit-site. However, for other functionality not inlined into
syscall entry/exit, this makes that information available.
seccomp_filter is the first planned consumer and, as such,
the comment indicates a tie to HAVE_ARCH_SECCOMP_FILTER. That
is probably an unneeded detail.

Suggested-by: Roland McGrath <[email protected]>
Signed-off-by: Will Drewry <[email protected]>
---
include/asm-generic/syscall.h | 14 ++++++++++++++
1 files changed, 14 insertions(+), 0 deletions(-)

diff --git a/include/asm-generic/syscall.h b/include/asm-generic/syscall.h
index 5c122ae..13edfa2 100644
--- a/include/asm-generic/syscall.h
+++ b/include/asm-generic/syscall.h
@@ -142,4 +142,18 @@ void syscall_set_arguments(struct task_struct *task, struct pt_regs *regs,
unsigned int i, unsigned int n,
const unsigned long *args);

+/**
+ * syscall_get_arch - return the AUDIT_ARCH for the current system call
+ * @task: task of interest, must be in system call entry tracing
+ * @regs: task_pt_regs() of @task
+ *
+ * Returns the AUDIT_ARCH_* based on the system call convention in use.
+ *
+ * It's only valid to call this when @task is stopped on entry to a system
+ * call, due to %TIF_SYSCALL_TRACE, %TIF_SYSCALL_AUDIT, or %TIF_SECCOMP.
+ *
+ * Note, at present this function is only required with
+ * CONFIG_HAVE_ARCH_SECCOMP_FILTER.
+ */
+void syscall_get_arch(struct task_struct *task, struct pt_regs *regs);
#endif /* _ASM_SYSCALL_H */
--
1.7.5.4

2012-02-21 18:47:20

by Roland McGrath

Subject: Re: [PATCH v10 03/11] asm/syscall.h: add syscall_get_arch

> +void syscall_get_arch(struct task_struct *task, struct pt_regs *regs);

Bad return type.

2012-02-21 18:57:06

by Will Drewry

Subject: Re: [PATCH v10 03/11] asm/syscall.h: add syscall_get_arch

On Tue, Feb 21, 2012 at 12:46 PM, Roland McGrath <[email protected]> wrote:
>> +void syscall_get_arch(struct task_struct *task, struct pt_regs *regs);
>
> Bad return type.

Bah! Thanks!

2012-02-21 19:02:45

by Will Drewry

Subject: [PATCH v11 03/11] asm/syscall.h: add syscall_get_arch

Adds a stub for a function that will return the AUDIT_ARCH_*
value appropriate to the supplied task based on the system
call convention.

For audit's use, the value can generally be hard-coded at the
audit-site. However, for other functionality not inlined into
syscall entry/exit, this makes that information available.
seccomp_filter is the first planned consumer and, as such,
the comment indicates a tie to HAVE_ARCH_SECCOMP_FILTER. That
is probably an unneeded detail.

Suggested-by: Roland McGrath <[email protected]>
Signed-off-by: Will Drewry <[email protected]>

v11: fixed improper return type
v10: introduced
---
include/asm-generic/syscall.h | 14 ++++++++++++++
1 files changed, 14 insertions(+), 0 deletions(-)

diff --git a/include/asm-generic/syscall.h b/include/asm-generic/syscall.h
index 5c122ae..a2c13dc 100644
--- a/include/asm-generic/syscall.h
+++ b/include/asm-generic/syscall.h
@@ -142,4 +142,18 @@ void syscall_set_arguments(struct task_struct *task, struct pt_regs *regs,
unsigned int i, unsigned int n,
const unsigned long *args);

+/**
+ * syscall_get_arch - return the AUDIT_ARCH for the current system call
+ * @task: task of interest, must be in system call entry tracing
+ * @regs: task_pt_regs() of @task
+ *
+ * Returns the AUDIT_ARCH_* based on the system call convention in use.
+ *
+ * It's only valid to call this when @task is stopped on entry to a system
+ * call, due to %TIF_SYSCALL_TRACE, %TIF_SYSCALL_AUDIT, or %TIF_SECCOMP.
+ *
+ * Note, at present this function is only required with
+ * CONFIG_HAVE_ARCH_SECCOMP_FILTER.
+ */
+int syscall_get_arch(struct task_struct *task, struct pt_regs *regs);
#endif /* _ASM_SYSCALL_H */
--
1.7.5.4

2012-02-21 22:44:33

by Kees Cook

Subject: Re: [PATCH v10 06/11] seccomp: add SECCOMP_RET_ERRNO

On Tue, Feb 21, 2012 at 11:30:30AM -0600, Will Drewry wrote:
> diff --git a/kernel/seccomp.c b/kernel/seccomp.c
> index 0043b7e..23f1844 100644
> --- a/kernel/seccomp.c
> +++ b/kernel/seccomp.c
> @@ -136,22 +136,18 @@ static void *bpf_load(const void *nr, int off, unsigned int size, void *buf)
> static u32 seccomp_run_filters(int syscall)
> {
> struct seccomp_filter *f;
> - u32 ret = SECCOMP_RET_KILL;
> static const struct bpf_load_fn fns = {
> bpf_load,
> sizeof(struct seccomp_data),
> };
> + u32 ret = SECCOMP_RET_ALLOW;
> const void *sc_ptr = (const void *)(uintptr_t)syscall;
> -
> /*
> * All filters are evaluated in order of youngest to oldest. The lowest
> * BPF return value always takes priority.
> */
> - for (f = current->seccomp.filter; f; f = f->prev) {
> - ret = bpf_run_filter(sc_ptr, f->insns, &fns);
> - if (ret != SECCOMP_RET_ALLOW)
> - break;
> - }
> + for (f = current->seccomp.filter; f; f = f->prev)
> + ret = min_t(u32, ret, bpf_run_filter(sc_ptr, f->insns, &fns));
> return ret;
> }

I'd like to see this fail closed in the (theoretically impossible, but
why risk it) case of there being no filters at all. Could do something
like this:

u32 ret = current->seccomp.filter ? SECCOMP_RET_ALLOW : SECCOMP_RET_KILL;

Or, just this, to catch the misbehavior:

if (unlikely(current->seccomp.filter == NULL))
return SECCOMP_RET_KILL;
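
For illustration, here is a userspace sketch of how the second variant combines with the min_t() loop from the patch. Everything prefixed sketch_/SKETCH_ is a hypothetical stand-in (the callback replaces bpf_run_filter(), and the two return values are illustrative numbers picked only so that "lower is stricter"), so treat it as a model of the idea rather than the kernel code:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative values only: a lower number means a stricter action. */
#define SKETCH_RET_KILL  0x00000000u
#define SKETCH_RET_ALLOW 0x7fff0000u

/* Hypothetical stand-in for struct seccomp_filter. */
struct sketch_filter {
	struct sketch_filter *prev;       /* next-older filter, or NULL */
	uint32_t (*run)(int syscall_nr);  /* stand-in for bpf_run_filter() */
};

static uint32_t sketch_run_filters(const struct sketch_filter *newest,
				   int syscall_nr)
{
	uint32_t ret = SKETCH_RET_ALLOW;
	const struct sketch_filter *f;

	/* Fail closed: an empty filter list should be impossible, so deny. */
	if (newest == NULL)
		return SKETCH_RET_KILL;

	/* Youngest to oldest; the lowest return value wins. */
	for (f = newest; f; f = f->prev) {
		uint32_t cur = f->run(syscall_nr);

		if (cur < ret)
			ret = cur;
	}
	return ret;
}

/* Two toy filters used to exercise the loop. */
static uint32_t sketch_allow_all(int nr)
{
	(void)nr;
	return SKETCH_RET_ALLOW;
}

static uint32_t sketch_kill_nr_59(int nr)
{
	return nr == 59 ? SKETCH_RET_KILL : SKETCH_RET_ALLOW;
}
```

With an older sketch_kill_nr_59 filter chained behind a newer sketch_allow_all one, syscall 59 is killed and everything else is allowed; passing NULL for the list is denied outright.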

-Kees

--
Kees Cook
ChromeOS Security

2012-02-21 22:48:10

by Will Drewry

Subject: Re: [PATCH v10 06/11] seccomp: add SECCOMP_RET_ERRNO

On Tue, Feb 21, 2012 at 4:41 PM, Kees Cook <[email protected]> wrote:
> On Tue, Feb 21, 2012 at 11:30:30AM -0600, Will Drewry wrote:
>> diff --git a/kernel/seccomp.c b/kernel/seccomp.c
>> index 0043b7e..23f1844 100644
>> --- a/kernel/seccomp.c
>> +++ b/kernel/seccomp.c
>> @@ -136,22 +136,18 @@ static void *bpf_load(const void *nr, int off, unsigned int size, void *buf)
>>  static u32 seccomp_run_filters(int syscall)
>>  {
>>       struct seccomp_filter *f;
>> -     u32 ret = SECCOMP_RET_KILL;
>>       static const struct bpf_load_fn fns = {
>>               bpf_load,
>>               sizeof(struct seccomp_data),
>>       };
>> +     u32 ret = SECCOMP_RET_ALLOW;
>>       const void *sc_ptr = (const void *)(uintptr_t)syscall;
>> -
>>       /*
>>        * All filters are evaluated in order of youngest to oldest. The lowest
>>        * BPF return value always takes priority.
>>        */
>> -     for (f = current->seccomp.filter; f; f = f->prev) {
>> -             ret = bpf_run_filter(sc_ptr, f->insns, &fns);
>> -             if (ret != SECCOMP_RET_ALLOW)
>> -                     break;
>> -     }
>> +     for (f = current->seccomp.filter; f; f = f->prev)
>> +             ret = min_t(u32, ret, bpf_run_filter(sc_ptr, f->insns, &fns));
>>       return ret;
>>  }
>
> I'd like to see this fail closed in the (theoretically impossible, but
> why risk it) case of there being no filters at all. Could do something
> like this:
>
>        u32 ret = current->seccomp.filter ? SECCOMP_RET_ALLOW : SECCOMP_RET_KILL;
>
> Or, just this, to catch the misbehavior:
>
>        if (unlikely(current->seccomp.filter == NULL))
>                return SECCOMP_RET_KILL;

The last one makes the most sense to me. I'll add it and rev the patch.

thanks!

2012-02-21 23:13:41

by Kees Cook

Subject: Re: [PATCH v10 11/11] Documentation: prctl/seccomp_filter

Hi,

I've collected the initial no-new-privs patches, and this whole series
and pushed it here so I could more easily review it:
http://git.kernel.org/?p=linux/kernel/git/kees/linux.git;a=shortlog;h=refs/heads/seccomp

Some minor tweaks below...

On Tue, Feb 21, 2012 at 11:30:35AM -0600, Will Drewry wrote:
> Documents how system call filtering using Berkeley Packet
> Filter programs works and how it may be used.
> Includes an example for x86 (32-bit) and a semi-generic
> example using a macro-based code generator.
>
> v10: - update for SIGSYS
> - update for new seccomp_data layout
> - update for ptrace option use
> v9: - updated bpf-direct.c for SIGILL
> v8: - add PR_SET_NO_NEW_PRIVS to the samples.
> v7: - updated for all the new stuff in v7: TRAP, TRACE
> - only talk about PR_SET_SECCOMP now
> - fixed bad JLE32 check ([email protected])
> - adds dropper.c: a simple system call disabler
> v6: - tweak the language to note the requirement of
> PR_SET_NO_NEW_PRIVS being called prior to use. ([email protected])
> v5: - update sample to use system call arguments
> - adds a "fancy" example using a macro-based generator
> - cleaned up bpf in the sample
> - update docs to mention arguments
> - fix prctl value ([email protected])
> - language cleanup ([email protected])
> v4: - update for no_new_privs use
> - minor tweaks
> v3: - call out BPF <-> Berkeley Packet Filter ([email protected])
> - document use of tentative always-unprivileged
> - guard sample compilation for i386 and x86_64
> v2: - move code to samples ([email protected])
>
> Signed-off-by: Will Drewry <[email protected]>
> ---
> Documentation/prctl/seccomp_filter.txt | 157 +++++++++++++++++++++
> samples/Makefile | 2 +-
> samples/seccomp/Makefile | 31 ++++
> samples/seccomp/bpf-direct.c | 150 ++++++++++++++++++++
> samples/seccomp/bpf-fancy.c | 102 ++++++++++++++
> samples/seccomp/bpf-helper.c | 89 ++++++++++++
> samples/seccomp/bpf-helper.h | 236 ++++++++++++++++++++++++++++++++
> samples/seccomp/dropper.c | 68 +++++++++
> 8 files changed, 834 insertions(+), 1 deletions(-)
> create mode 100644 Documentation/prctl/seccomp_filter.txt
> create mode 100644 samples/seccomp/Makefile
> create mode 100644 samples/seccomp/bpf-direct.c
> create mode 100644 samples/seccomp/bpf-fancy.c
> create mode 100644 samples/seccomp/bpf-helper.c
> create mode 100644 samples/seccomp/bpf-helper.h
> create mode 100644 samples/seccomp/dropper.c
>
> diff --git a/Documentation/prctl/seccomp_filter.txt b/Documentation/prctl/seccomp_filter.txt
> new file mode 100644
> index 0000000..7de865b
> --- /dev/null
> +++ b/Documentation/prctl/seccomp_filter.txt
> @@ -0,0 +1,157 @@
> + SECure COMPuting with filters
> + =============================
> +
> +Introduction
> +------------
> +
> +A large number of system calls are exposed to every userland process
> +with many of them going unused for the entire lifetime of the process.
> +As system calls change and mature, bugs are found and eradicated. A
> +certain subset of userland applications benefit by having a reduced set
> +of available system calls. The resulting set reduces the total kernel
> +surface exposed to the application. System call filtering is meant for
> +use with those applications.
> +
> +Seccomp filtering provides a means for a process to specify a filter for
> +incoming system calls. The filter is expressed as a Berkeley Packet
> +Filter (BPF) program, as with socket filters, except that the data
> +operated on is related to the system call being made: system call
> +number and the system call arguments. This allows for expressive
> +filtering of system calls using a filter program language with a long
> +history of being exposed to userland and a straightforward data set.
> +
> +Additionally, BPF makes it impossible for users of seccomp to fall prey
> +to time-of-check-time-of-use (TOCTOU) attacks that are common in system
> +call interposition frameworks. BPF programs may not dereference
> +pointers which constrains all filters to solely evaluating the system
> +call arguments directly.
> +
> +What it isn't
> +-------------
> +
> +System call filtering isn't a sandbox. It provides a clearly defined
> +mechanism for minimizing the exposed kernel surface. It is meant to be
> +a tool for sandbox developers to use. Beyond that, policy for logical
> +behavior and information flow should be managed with a combination of
> +other system hardening techniques and, potentially, an LSM of your
> +choosing. Expressive, dynamic filters provide further options down this
> +path (avoiding pathological sizes or selecting which of the multiplexed
> +system calls in socketcall() is allowed, for instance) which could be
> +construed, incorrectly, as a more complete sandboxing solution.
> +
> +Usage
> +-----
> +
> +An additional seccomp mode is added and is enabled using the same
> +prctl(2) call as the strict seccomp. If the architecture has
> +CONFIG_HAVE_ARCH_SECCOMP_FILTER, then filters may be added as below:
> +
> +PR_SET_SECCOMP:
> + Now takes an additional argument which specifies a new filter
> + using a BPF program.
> + The BPF program will be executed over struct seccomp_data
> + reflecting the system call number, arguments, and other
> + metadata. The BPF program must then return one of the
> + acceptable values to inform the kernel which action should be
> + taken.
> +
> + Usage:
> + prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, prog);
> +
> + The 'prog' argument is a pointer to a struct sock_fprog which
> + will contain the filter program. If the program is invalid, the
> + call will return -1 and set errno to EINVAL.
> +
> + Note, is_compat_task is also tracked for the @prog. This means
> + that once set the calling task will have all of its system calls
> + blocked if it switches its system call ABI.
> +
> + If fork/clone and execve are allowed by @prog, any child
> + processes will be constrained to the same filters and system
> + call ABI as the parent.
> +
> + Prior to use, the task must call prctl(PR_SET_NO_NEW_PRIVS, 1) or
> + run with CAP_SYS_ADMIN privileges in its namespace. If these are not
> + true, -EACCES will be returned. This requirement ensures that filter
> + programs cannot be applied to child processes with greater privileges
> + than the task that installed them.
> +
> + Additionally, if prctl(2) is allowed by the attached filter,
> + additional filters may be layered on which will increase evaluation
> + time, but allow for further decreasing the attack surface during
> + execution of a process.
> +
> +The above call returns 0 on success and non-zero on error.
> +
> +Return values
> +-------------
> +
> +A seccomp filter may return any of the following values:
> + SECCOMP_RET_ALLOW, SECCOMP_RET_KILL, SECCOMP_RET_TRAP,
> + SECCOMP_RET_ERRNO, or SECCOMP_RET_TRACE.
> +
> +SECCOMP_RET_ALLOW:
> + If all filters for a given task return this value then
> + the system call will proceed normally.
> +
> +SECCOMP_RET_KILL:
> + If any filters for a given take return this value then
> + the task will exit immediately without executing the system
> + call.
> +
> +SECCOMP_RET_TRAP:
> + If any filters specify SECCOMP_RET_TRAP and none of them
> + specify SECCOMP_RET_KILL, then the kernel will send a SIGTRAP
> + signal to the task and not execute the system call. The kernel
> + will rollback the register state to just before system call
> + entry such that a signal handler in the process will be able
> + to inspect the ucontext_t->uc_mcontext registers and emulate
> + system call success or failure upon return from the signal
> + handler.
> +
> + The SIGTRAP is differentiated by other SIGTRAPS by a si_code
> + of TRAP_SECCOMP.

This should reflect the SIGTRAP->SIGSYS change (and SYS_SECCOMP si_code
change).

> +
> +SECCOMP_RET_ERRNO:
> + If returned, the value provided in the lower 16-bits is
> + returned to userland as the errno and the system call is
> + not executed.

The other sections each say "If any" or "If all" to clarify their
behavior with multiple filters. The same should be done here, but more
comments below. Additionally, it should clarify that on multiple
uses of RET_ERRNO, the lower of the errnos will be returned.

> +
> +SECCOMP_RET_TRACE:
> + If any filters return this value and the others return
> + SECCOMP_RET_ALLOW, then the kernel will attempt to notify
> + a ptrace()-based tracer prior to executing the system call.
> +
> + A tracer will be notified if it requests PTRACE_O_TRACESECCOMP
> + via PTRACE_SETOPTIONS. Otherwise, the system call will
> + not execute and -ENOSYS will be returned to userspace.
> +
> + If the tracer ignores notification, then the system call will
> + proceed normally. Changes to the registers will function
> + similarly to PTRACE_SYSCALL. Additionally, if the tracer
> + detaches during notification or just after, the task may be
> + terminated as precautionary measure.
> +
> +Please note that the order of precedence is as follows:
> +SECCOMP_RET_KILL, SECCOMP_RET_ERRNO, SECCOMP_RET_TRAP,
> +SECCOMP_RET_TRACE, SECCOMP_RET_ALLOW.
> +
> +If multiple filters exist, the return value for the evaluation of a given
> +system call will always use the highest precedent value.
> +SECCOMP_RET_KILL will always take precedence.

I think this clarification about precedence is good but should be at the
head of the "Return values" section, and the sections ordered from that
perspective, so that the "highest precedent value" aspect is a little
bit easier to follow:


Return values
-------------
A seccomp filter may return any of the following values. If multiple
filters exist, the return value for the evaluation of a given system
call will always use the highest precedent value. (For example,
SECCOMP_RET_KILL will always take precedence.)

In precedence order, they are:

SECCOMP_RET_KILL:
If any filters for a given task return this value then
the task will exit immediately without executing the system
call.

SECCOMP_RET_TRAP:
If any filters specify SECCOMP_RET_TRAP and none of them
specify SECCOMP_RET_KILL, then the kernel will send a SIGSYS
signal to the task and not execute the system call. The kernel
will roll back the register state to just before system call
entry such that a signal handler in the process will be able
to inspect the ucontext_t->uc_mcontext registers and emulate
system call success or failure upon return from the signal
handler.

The SIGSYS is differentiated from other SIGSYS signals by a si_code
of SYS_SECCOMP.

SECCOMP_RET_ERRNO:
If any filters return this value and none of them specify a
higher precedence value, then the lowest of the values provided
in the lower 16-bits is returned to userland as the errno and
the system call is not executed.

SECCOMP_RET_TRACE:
If any filters return this value and none of them specify a
higher precedence value, then the kernel will attempt to notify
a ptrace()-based tracer prior to executing the system call.

A tracer will be notified if it requests PTRACE_O_TRACESECCOMP
via PTRACE_SETOPTIONS. Otherwise, the system call will
not execute and -ENOSYS will be returned to userspace.
If the tracer ignores notification, then the system call will
proceed normally. Changes to the registers will function
similarly to PTRACE_SYSCALL. Additionally, if the tracer
detaches during notification or just after, the task may be
terminated as a precautionary measure.

SECCOMP_RET_ALLOW:
If all filters for a given task return this value then
the system call will proceed normally.
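
To make the "highest precedence wins" rule concrete, here is a small standalone sketch. It assumes the actions are encoded so that numeric order matches the precedence order listed above, with the action in the upper 16 bits and the errno in the lower 16. The SKETCH_* constants are illustrative numbers, not the values from this series:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative encoding: smaller value == higher precedence. */
#define SKETCH_RET_KILL   0x00000000u
#define SKETCH_RET_TRAP   0x00020000u
#define SKETCH_RET_ERRNO  0x00030000u
#define SKETCH_RET_TRACE  0x7ffe0000u
#define SKETCH_RET_ALLOW  0x7fff0000u

#define SKETCH_RET_ACTION 0xffff0000u  /* upper 16 bits: the action */
#define SKETCH_RET_DATA   0x0000ffffu  /* lower 16 bits: e.g. the errno */

/*
 * Combine the per-filter results for one system call by taking the
 * minimum. Because the action sits in the high bits, a stricter action
 * always wins, and between two ERRNO results the lower errno wins.
 * With n == 0 this returns ALLOW; a fail-closed check for an empty
 * filter list is deliberately left out of this sketch.
 */
static uint32_t sketch_combine(const uint32_t *results, size_t n)
{
	uint32_t ret = SKETCH_RET_ALLOW;
	size_t i;

	for (i = 0; i < n; i++)
		if (results[i] < ret)
			ret = results[i];
	return ret;
}
```

For example, combining ALLOW, ERRNO|13, ERRNO|1 and TRACE yields ERRNO with errno 1, and adding a single KILL to any set yields KILL.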




-Kees

--
Kees Cook
ChromeOS Security

2012-02-22 03:41:55

by Will Drewry

Subject: Re: [PATCH v10 11/11] Documentation: prctl/seccomp_filter

On Tue, Feb 21, 2012 at 3:12 PM, Kees Cook <[email protected]> wrote:
> Hi,
>
> I've collected the initial no-new-privs patches, and this whole series
> and pushed it here so I could more easily review it:
> http://git.kernel.org/?p=linux/kernel/git/kees/linux.git;a=shortlog;h=refs/heads/seccomp
>
> Some minor tweaks below...
>
> On Tue, Feb 21, 2012 at 11:30:35AM -0600, Will Drewry wrote:
>> Documents how system call filtering using Berkeley Packet
>> Filter programs works and how it may be used.
>> Includes an example for x86 (32-bit) and a semi-generic
>> example using a macro-based code generator.
>>
>> v10: - update for SIGSYS
>>      - update for new seccomp_data layout
>>      - update for ptrace option use
>> v9: - updated bpf-direct.c for SIGILL
>> v8: - add PR_SET_NO_NEW_PRIVS to the samples.
>> v7: - updated for all the new stuff in v7: TRAP, TRACE
>>     - only talk about PR_SET_SECCOMP now
>>     - fixed bad JLE32 check ([email protected])
>>     - adds dropper.c: a simple system call disabler
>> v6: - tweak the language to note the requirement of
>>       PR_SET_NO_NEW_PRIVS being called prior to use. ([email protected])
>> v5: - update sample to use system call arguments
>>     - adds a "fancy" example using a macro-based generator
>>     - cleaned up bpf in the sample
>>     - update docs to mention arguments
>>     - fix prctl value ([email protected])
>>     - language cleanup ([email protected])
>> v4: - update for no_new_privs use
>>     - minor tweaks
>> v3: - call out BPF <-> Berkeley Packet Filter ([email protected])
>>     - document use of tentative always-unprivileged
>>     - guard sample compilation for i386 and x86_64
>> v2: - move code to samples ([email protected])
>>
>> Signed-off-by: Will Drewry <[email protected]>
>> ---
>>  Documentation/prctl/seccomp_filter.txt |  157 +++++++++++++++++++++
>>  samples/Makefile                       |    2 +-
>>  samples/seccomp/Makefile               |   31 ++++
>>  samples/seccomp/bpf-direct.c           |  150 ++++++++++++++++++++
>>  samples/seccomp/bpf-fancy.c            |  102 ++++++++++++++
>>  samples/seccomp/bpf-helper.c           |   89 ++++++++++++
>>  samples/seccomp/bpf-helper.h           |  236 ++++++++++++++++++++++++++++++++
>>  samples/seccomp/dropper.c              |   68 +++++++++
>>  8 files changed, 834 insertions(+), 1 deletions(-)
>>  create mode 100644 Documentation/prctl/seccomp_filter.txt
>>  create mode 100644 samples/seccomp/Makefile
>>  create mode 100644 samples/seccomp/bpf-direct.c
>>  create mode 100644 samples/seccomp/bpf-fancy.c
>>  create mode 100644 samples/seccomp/bpf-helper.c
>>  create mode 100644 samples/seccomp/bpf-helper.h
>>  create mode 100644 samples/seccomp/dropper.c
>>
>> diff --git a/Documentation/prctl/seccomp_filter.txt b/Documentation/prctl/seccomp_filter.txt
>> new file mode 100644
>> index 0000000..7de865b
>> --- /dev/null
>> +++ b/Documentation/prctl/seccomp_filter.txt
>> @@ -0,0 +1,157 @@
>> +             SECure COMPuting with filters
>> +             =============================
>> +
>> +Introduction
>> +------------
>> +
>> +A large number of system calls are exposed to every userland process
>> +with many of them going unused for the entire lifetime of the process.
>> +As system calls change and mature, bugs are found and eradicated.  A
>> +certain subset of userland applications benefit by having a reduced set
>> +of available system calls.  The resulting set reduces the total kernel
>> +surface exposed to the application.  System call filtering is meant for
>> +use with those applications.
>> +
>> +Seccomp filtering provides a means for a process to specify a filter for
>> +incoming system calls.  The filter is expressed as a Berkeley Packet
>> +Filter (BPF) program, as with socket filters, except that the data
>> +operated on is related to the system call being made: system call
>> +number and the system call arguments.  This allows for expressive
>> +filtering of system calls using a filter program language with a long
>> +history of being exposed to userland and a straightforward data set.
>> +
>> +Additionally, BPF makes it impossible for users of seccomp to fall prey
>> +to time-of-check-time-of-use (TOCTOU) attacks that are common in system
>> +call interposition frameworks.  BPF programs may not dereference
>> +pointers which constrains all filters to solely evaluating the system
>> +call arguments directly.
>> +
>> +What it isn't
>> +-------------
>> +
>> +System call filtering isn't a sandbox.  It provides a clearly defined
>> +mechanism for minimizing the exposed kernel surface.  It is meant to be
>> +a tool for sandbox developers to use.  Beyond that, policy for logical
>> +behavior and information flow should be managed with a combination of
>> +other system hardening techniques and, potentially, an LSM of your
>> +choosing.  Expressive, dynamic filters provide further options down this
>> +path (avoiding pathological sizes or selecting which of the multiplexed
>> +system calls in socketcall() is allowed, for instance) which could be
>> +construed, incorrectly, as a more complete sandboxing solution.
>> +
>> +Usage
>> +-----
>> +
>> +An additional seccomp mode is added and is enabled using the same
>> +prctl(2) call as the strict seccomp.  If the architecture has
>> +CONFIG_HAVE_ARCH_SECCOMP_FILTER, then filters may be added as below:
>> +
>> +PR_SET_SECCOMP:
>> +     Now takes an additional argument which specifies a new filter
>> +     using a BPF program.
>> +     The BPF program will be executed over struct seccomp_data
>> +     reflecting the system call number, arguments, and other
>> +     metadata.  The BPF program must then return one of the
>> +     acceptable values to inform the kernel which action should be
>> +     taken.
>> +
>> +     Usage:
>> +             prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, prog);
>> +
>> +     The 'prog' argument is a pointer to a struct sock_fprog which
>> +     will contain the filter program.  If the program is invalid, the
>> +     call will return -1 and set errno to EINVAL.
>> +
>> +     Note, is_compat_task is also tracked for the @prog.  This means
>> +     that once set the calling task will have all of its system calls
>> +     blocked if it switches its system call ABI.
>> +
>> +     If fork/clone and execve are allowed by @prog, any child
>> +     processes will be constrained to the same filters and system
>> +     call ABI as the parent.
>> +
>> +     Prior to use, the task must call prctl(PR_SET_NO_NEW_PRIVS, 1) or
>> +     run with CAP_SYS_ADMIN privileges in its namespace.  If these are not
>> +     true, -EACCES will be returned.  This requirement ensures that filter
>> +     programs cannot be applied to child processes with greater privileges
>> +     than the task that installed them.
>> +
>> +     Additionally, if prctl(2) is allowed by the attached filter,
>> +     additional filters may be layered on which will increase evaluation
>> +     time, but allow for further decreasing the attack surface during
>> +     execution of a process.
>> +
>> +The above call returns 0 on success and non-zero on error.
>> +
>> +Return values
>> +-------------
>> +
>> +A seccomp filter may return any of the following values:
>> +     SECCOMP_RET_ALLOW, SECCOMP_RET_KILL, SECCOMP_RET_TRAP,
>> +     SECCOMP_RET_ERRNO, or SECCOMP_RET_TRACE.
>> +
>> +SECCOMP_RET_ALLOW:
>> +     If all filters for a given task return this value then
>> +     the system call will proceed normally.
>> +
>> +SECCOMP_RET_KILL:
>> +     If any filters for a given take return this value then
>> +     the task will exit immediately without executing the system
>> +     call.
>> +
>> +SECCOMP_RET_TRAP:
>> +     If any filters specify SECCOMP_RET_TRAP and none of them
>> +     specify SECCOMP_RET_KILL, then the kernel will send a SIGTRAP
>> +     signal to the task and not execute the system call.  The kernel
>> +     will rollback the register state to just before system call
>> +     entry such that a signal handler in the process will be able
>> +     to inspect the ucontext_t->uc_mcontext registers and emulate
>> +     system call success or failure upon return from the signal
>> +     handler.
>> +
>> +     The SIGTRAP is differentiated by other SIGTRAPS by a si_code
>> +     of TRAP_SECCOMP.
>
> This should reflect the SIGTRAP->SIGSYS change (and SYS_SECCOMP si_code
> change).

Oops - yup.

>> +
>> +SECCOMP_RET_ERRNO:
>> +     If returned, the value provided in the lower 16-bits is
>> +     returned to userland as the errno and the system call is
>> +     not executed.
>
> The other sections each say "If any" or "If all" to clarify their
> behavior with multiple filters. The same should be done here, but more
> comments below. Additionally, it should clarify that on multiple
> uses of RET_ERRNO, the lower of the errnos will be returned.

I might drop all of the written-out precedence verbiage, since I think
your layout is more intuitive without it.

>> +
>> +SECCOMP_RET_TRACE:
>> +     If any filters return this value and the others return
>> +     SECCOMP_RET_ALLOW, then the kernel will attempt to notify
>> +     a ptrace()-based tracer prior to executing the system call.
>> +
>> +     A tracer will be notified if it requests PTRACE_O_TRACESECCOMP
>> +     via PTRACE_SETOPTIONS.  Otherwise, the system call will
>> +     not execute and -ENOSYS will be returned to userspace.
>> +
>> +     If the tracer ignores notification, then the system call will
>> +     proceed normally.  Changes to the registers will function
>> +     similarly to PTRACE_SYSCALL.  Additionally, if the tracer
>> +     detaches during notification or just after, the task may be
>> +     terminated as precautionary measure.
>> +
>> +Please note that the order of precedence is as follows:
>> +SECCOMP_RET_KILL, SECCOMP_RET_ERRNO, SECCOMP_RET_TRAP,
>> +SECCOMP_RET_TRACE, SECCOMP_RET_ALLOW.
>> +
>> +If multiple filters exist, the return value for the evaluation of a given
>> +system call will always use the highest precedent value.
>> +SECCOMP_RET_KILL will always take precedence.
>
> I think this clarification about precedence is good but should be at the
> head of the "Return values" section, and the sections ordered from that
> perspective, so that the "highest precedent value" aspect is a little
> bit easier to follow:
>
>
> Return values
> -------------
> A seccomp filter may return any of the following values. If multiple
> filters exist, the return value for the evaluation of a given system
> call will always use the highest precedent value. (For example,
> SECCOMP_RET_KILL will always take precedence.)
>
> In precedence order, they are:
>
> SECCOMP_RET_KILL:
>        If any filters for a given task return this value then
>        the task will exit immediately without executing the system
>        call.
>
> SECCOMP_RET_TRAP:
>        If any filters specify SECCOMP_RET_TRAP and none of them
>        specify SECCOMP_RET_KILL, then the kernel will send a SIGSYS
>        signal to the task and not execute the system call. The kernel
>        will roll back the register state to just before system call
>        entry such that a signal handler in the process will be able
>        to inspect the ucontext_t->uc_mcontext registers and emulate
>        system call success or failure upon return from the signal
>        handler.
>
>        The SIGSYS is differentiated from other SIGSYS signals by a si_code
>        of SYS_SECCOMP.
>
> SECCOMP_RET_ERRNO:
>        If any filters return this value and none of them specify a
>        higher precedence value, then the lowest of the values provided
>        in the lower 16-bits is returned to userland as the errno and
>        the system call is not executed.
>
> SECCOMP_RET_TRACE:
>        If any filters return this value and none of them specify a
>        higher precedence value, then the kernel will attempt to notify
>        a ptrace()-based tracer prior to executing the system call.
>
>        A tracer will be notified if it requests PTRACE_O_TRACESECCOMP
>        via PTRACE_SETOPTIONS. Otherwise, the system call will
>        not execute and -ENOSYS will be returned to userspace.
>        If the tracer ignores notification, then the system call will
>        proceed normally. Changes to the registers will function
>        similarly to PTRACE_SYSCALL. Additionally, if the tracer
>        detaches during notification or just after, the task may be
>        terminated as a precautionary measure.
>
> SECCOMP_RET_ALLOW:
>        If all filters for a given task return this value then
>        the system call will proceed normally.
>

Thanks! I'll integrate all of this and post a full v11 series in the
morning (depending on any feedback trickling in later :).

cheers,
will

2012-02-22 06:34:11

by H. Peter Anvin

Subject: Re: [PATCH v10 05/11] seccomp: add system call filtering using BPF

On 02/21/2012 09:30 AM, Will Drewry wrote:
> +
> +/**
> + * struct seccomp_data - the format the BPF program executes over.
> + * @args: up to 6 system call arguments. When the calling convention is
> + * 32-bit, the arguments will still be at each args[X] offset.
> + * @instruction_pointer: at the time of the system call.
> + * @arch: indicates system call convention as an AUDIT_ARCH_* value
> + * as defined in <linux/audit.h>.
> + * @nr: the system call number
> + */
> +struct seccomp_data {
> + __u64 args[6];
> + __u64 instruction_pointer;
> + __u32 arch;
> + int nr;
> +};
>

This got flipped around for some reason... that is a problem if we ever
need to extend this to more than 6 arguments (I thought we had at least
one architecture which supported 7 arguments already, but I could just
be delusional.)
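
A quick way to see the concern is to compare field offsets. In the sketch below the first struct follows the layout quoted above (with fixed-width types substituted for __u64/__u32); the "args last" ordering and both 7-argument variants are hypothetical, written only to show what happens to the offsets if the array grows:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Layout as quoted above: the argument array comes first. */
struct sketch_data_args_first {
	uint64_t args[6];
	uint64_t instruction_pointer;
	uint32_t arch;
	int nr;
};

/* Hypothetical alternative: fixed fields first, argument array last. */
struct sketch_data_args_last {
	int nr;
	uint32_t arch;
	uint64_t instruction_pointer;
	uint64_t args[6];
};

/* Hypothetical 7-argument versions of both layouts. */
struct sketch_data_args_first7 {
	uint64_t args[7];
	uint64_t instruction_pointer;
	uint32_t arch;
	int nr;
};

struct sketch_data_args_last7 {
	int nr;
	uint32_t arch;
	uint64_t instruction_pointer;
	uint64_t args[7];
};

/* Returns 1 if growing args moves 'nr' in the args-first layout. */
static int sketch_args_first_moves_nr(void)
{
	return offsetof(struct sketch_data_args_first, nr) !=
	       offsetof(struct sketch_data_args_first7, nr);
}

/* Returns 1 if every existing field keeps its offset with args last. */
static int sketch_args_last_is_stable(void)
{
	return offsetof(struct sketch_data_args_last, nr) ==
	       offsetof(struct sketch_data_args_last7, nr) &&
	       offsetof(struct sketch_data_args_last, arch) ==
	       offsetof(struct sketch_data_args_last7, arch) &&
	       offsetof(struct sketch_data_args_last, args) ==
	       offsetof(struct sketch_data_args_last7, args);
}
```

With the array at the front, a seventh argument shifts instruction_pointer, arch and nr, so filters compiled against the old offsets would read the wrong fields; with the array at the end, existing offsets stay put and the struct simply gets longer.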

-hpa

--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel. I don't speak on their behalf.

2012-02-22 08:19:49

by Indan Zupancic

Subject: Re: [PATCH v10 05/11] seccomp: add system call filtering using BPF

Hello,

On Tue, February 21, 2012 18:30, Will Drewry wrote:
> [This patch depends on [email protected]'s no_new_privs patch:
> https://lkml.org/lkml/2012/1/30/264
> ]
>
> This patch adds support for seccomp mode 2. Mode 2 introduces the
> ability for unprivileged processes to install system call filtering
> policy expressed in terms of a Berkeley Packet Filter (BPF) program.
> This program will be evaluated in the kernel for each system call
> the task makes and computes a result based on data in the format
> of struct seccomp_data.
>
> A filter program may be installed by calling:
> struct sock_fprog fprog = { ... };
> ...
> prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &fprog);
>
> The return value of the filter program determines if the system call is
> allowed to proceed or denied. If the first filter program installed
> allows prctl(2) calls, then the above call may be made repeatedly
> by a task to further reduce its access to the kernel. All attached
> programs must be evaluated before a system call will be allowed to
> proceed.
>
> Filter programs will be inherited across fork/clone and execve.
> However, if the task attaching the filter is unprivileged
> (!CAP_SYS_ADMIN) the no_new_privs bit will be set on the task. This
> ensures that unprivileged tasks cannot attach filters that affect
> privileged tasks (e.g., setuid binary).
>
> There are a number of benefits to this approach. A few of which are
> as follows:
> - BPF has been exposed to userland for a long time
> - BPF optimization (and JIT'ing) are well understood
> - Userland already knows its ABI: system call numbers and desired
> arguments
> - No time-of-check-time-of-use vulnerable data accesses are possible.
> - system call arguments are loaded on access only to minimize copying
> required for system call policy decisions.
>
> Mode 2 support is restricted to architectures that enable
> HAVE_ARCH_SECCOMP_FILTER. In this patch, the primary dependency is on
> syscall_get_arguments(). The full desired scope of this feature will
> add a few minor additional requirements expressed later in this series.
> Based on discussion, SECCOMP_RET_ERRNO and SECCOMP_RET_TRACE seem to be
> the desired additional functionality.
>
> No architectures are enabled in this patch.
>
> v10: - seccomp_data has changed again to be more aesthetically pleasing
> ([email protected])
> - calling convention is noted in a new u32 field using syscall_get_arch.
> This allows for cross-calling convention tasks to use seccomp filters.
> ([email protected])

I strongly disagree with requiring every filter to check the mode. Filters
that don't check the arch on e.g. x86 are buggy, so they have to check it.
Even on a 32-bit-only or 64-bit-only system the filter can't know that, so
it needs to check the arch at every syscall entry. All the other info in the
data depends on the arch, so there isn't much code to share between the two
archs, and you may as well have one filter for each arch.
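
To illustrate why a filter that skips the arch check is buggy on x86: syscall number 1 is write on x86-64 but exit on i386. The sketch below models a filter as a plain C function rather than a BPF program. The struct and the sketch_/SKETCH_ names are made up for the example; the AUDIT_ARCH_* values are the ones from <linux/audit.h>, copied so the sketch stands alone:

```c
#include <assert.h>
#include <stdint.h>

/* AUDIT_ARCH_* values from <linux/audit.h>, copied to stay standalone. */
#define SKETCH_AUDIT_ARCH_I386   0x40000003u
#define SKETCH_AUDIT_ARCH_X86_64 0xC000003Eu

/* The same number means different system calls on the two ABIs. */
#define SKETCH_NR_X86_64_WRITE 1
#define SKETCH_NR_I386_EXIT    1
#define SKETCH_NR_I386_WRITE   4

enum sketch_verdict { SKETCH_DENY = 0, SKETCH_ALLOW = 1 };

/* Hypothetical, cut-down view of what a filter sees. */
struct sketch_syscall {
	uint32_t arch;
	int nr;
};

/* Buggy: meant to allow only x86-64 write, but never looks at arch. */
static enum sketch_verdict
sketch_filter_no_arch_check(const struct sketch_syscall *sc)
{
	return sc->nr == SKETCH_NR_X86_64_WRITE ? SKETCH_ALLOW : SKETCH_DENY;
}

/* Correct under this scheme: reject foreign ABIs before reading nr. */
static enum sketch_verdict
sketch_filter_with_arch_check(const struct sketch_syscall *sc)
{
	if (sc->arch != SKETCH_AUDIT_ARCH_X86_64)
		return SKETCH_DENY;
	return sc->nr == SKETCH_NR_X86_64_WRITE ? SKETCH_ALLOW : SKETCH_DENY;
}
```

Given an i386 call with nr 1 (exit), the first function allows it and the second denies it, which is the per-syscall check every filter would have to carry.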

Alternative approach: Tell the arch at filter install time and only run the
filters with the same arch as the current system call. If no filters are run,
deny the system call.

Advantages:

- Filters don't have to check the arch every syscall entry.

- Secure by default. Filters don't have to do anything arch specific to
be secure, no surprises possible.

- If a new arch comes into existence, there is no chance of old filters
becoming buggy and insecure. This is especially true for archs that
had only one mode, but added another one later on: Old filters had no
need to check the mode at all.

- For kernels supporting only one arch, the check can be optimised away,
by not installing unsupported arch filters at all.

It's more secure, faster and simpler for the filters.

If something like this is implemented it's fine to expose the arch info
in the syscall data too, and have a way to install filters for all archs,
for the few cases where that might be useful, although I can't think of
any reason why people would like to do unnecessary work in the filters.

All that's needed is an extra argument to the prctl() call. I propose
0 for the current arch, -1 for all archs and anything else to specify
the arch. Installing a filter for an unsupported arch could return
ENOEXEC.

As far as the implementation goes, either have a list per supported arch
or store the arch per filter and check that before running the filter.
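
To make the second option concrete, here is a rough userspace sketch of
"store the arch per filter and check that before running the filter",
together with the proposed install-time argument (0 for the current arch,
-1 for all archs, ENOEXEC for an unsupported one). Every name in it
(arch_filter, resolve_arch, run_filters, the VERDICT_* values, the ARCH_A
and ARCH_B tags) is made up for illustration and is not part of the patch.
A function pointer stands in for the BPF program.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical values for the proposed extra prctl() argument. */
#define FILTER_ARCH_CURRENT 0u          /* 0: use the caller's current arch */
#define FILTER_ARCH_ALL     0xffffffffu /* -1: run for every arch */

/* Hypothetical arch tags; a real version would use AUDIT_ARCH_* values. */
#define ARCH_A 0x10u
#define ARCH_B 0x20u

/* Illustrative verdicts, not the SECCOMP_RET_* values from the patch. */
enum verdict { VERDICT_KILL = 0, VERDICT_ALLOW = 1 };

/* Hypothetical filter node with the arch stored per filter. */
struct arch_filter {
	uint32_t arch;                  /* arch tag or FILTER_ARCH_ALL */
	enum verdict (*run)(int nr);    /* stands in for the BPF program */
	const struct arch_filter *prev; /* older filter, NULL at the end */
};

/* Example program: allow only syscall number 0. */
static enum verdict allow_nr_zero(int nr)
{
	return nr == 0 ? VERDICT_ALLOW : VERDICT_KILL;
}

/* Resolve the install-time argument. Returns 0 or -ENOEXEC. */
static int resolve_arch(uint32_t requested, uint32_t current_arch,
			const uint32_t *supported, size_t n, uint32_t *out)
{
	size_t i;

	assert(out != NULL);
	if (requested == FILTER_ARCH_CURRENT) {
		*out = current_arch;
		return 0;
	}
	if (requested == FILTER_ARCH_ALL) {
		*out = FILTER_ARCH_ALL;
		return 0;
	}
	for (i = 0; i < n; i++) {
		if (supported[i] == requested) {
			*out = requested;
			return 0;
		}
	}
	return -ENOEXEC;
}

/* Run only the filters matching the syscall's arch; deny if none ran. */
static enum verdict run_filters(const struct arch_filter *f,
				uint32_t syscall_arch, int nr)
{
	int ran = 0;

	for (; f; f = f->prev) {
		if (f->arch != FILTER_ARCH_ALL && f->arch != syscall_arch)
			continue;
		ran = 1;
		if (f->run(nr) != VERDICT_ALLOW)
			return VERDICT_KILL;
	}
	return ran ? VERDICT_ALLOW : VERDICT_KILL;
}
```

With a single ARCH_A filter installed, a syscall made with the ARCH_B
convention matches no filter and is denied, which is the "secure by
default" behaviour described above.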

> - lots of clean up (thanks, Indan!)
> v9: - n/a
> v8: - use bpf_chk_filter, bpf_run_filter. update load_fns
> - Lots of fixes courtesy of [email protected]:
> -- fix up load behavior, compat fixups, and merge alloc code,
> -- renamed pc and dropped __packed, use bool compat.
> -- Added a hidden CONFIG_SECCOMP_FILTER to synthesize non-arch
> dependencies
> v7: (massive overhaul thanks to Indan, others)
> - added CONFIG_HAVE_ARCH_SECCOMP_FILTER
> - merged into seccomp.c
> - minimal seccomp_filter.h
> - no config option (part of seccomp)
> - no new prctl
> - doesn't break seccomp on systems without asm/syscall.h
> (works but arg access always fails)
> - dropped seccomp_init_task, extra free functions, ...
> - dropped the no-asm/syscall.h code paths
> - merges with network sk_run_filter and sk_chk_filter
> v6: - fix memory leak on attach compat check failure
> - require no_new_privs || CAP_SYS_ADMIN prior to filter
> installation. ([email protected])
> - s/seccomp_struct_/seccomp_/ for macros/functions ([email protected])
> - cleaned up Kconfig ([email protected])
> - on block, note if the call was compat (so the # means something)
> v5: - uses syscall_get_arguments
> ([email protected],[email protected], [email protected])
> - uses union-based arg storage with hi/lo struct to
> handle endianness. Compromises between the two alternate
> proposals to minimize extra arg shuffling and account for
> endianness assuming userspace uses offsetof().
> ([email protected], [email protected])
> - update Kconfig description
> - add include/seccomp_filter.h and add its installation
> - (naive) on-demand syscall argument loading
> - drop seccomp_t ([email protected])
> v4: - adjusted prctl to make room for PR_[SG]ET_NO_NEW_PRIVS
> - now uses current->no_new_privs
> ([email protected],[email protected])
> - assign names to seccomp modes ([email protected])
> - fix style issues ([email protected])
> - reworded Kconfig entry ([email protected])
> v3: - macros to inline ([email protected])
> - init_task behavior fixed ([email protected])
> - drop creator entry and extra NULL check ([email protected])
> - alloc returns -EINVAL on bad sizing ([email protected])
> - adds tentative use of "always_unprivileged" as per
> [email protected] and [email protected]
> v2: - (patch 2 only)
>
> Signed-off-by: Will Drewry <[email protected]>
> ---
> arch/Kconfig | 18 +++
> include/linux/Kbuild | 1 +
> include/linux/seccomp.h | 76 +++++++++++-
> kernel/fork.c | 3 +
> kernel/seccomp.c | 321 ++++++++++++++++++++++++++++++++++++++++++++---
> kernel/sys.c | 2 +-
> 6 files changed, 399 insertions(+), 22 deletions(-)
>
> diff --git a/arch/Kconfig b/arch/Kconfig
> index 4f55c73..8150fa2 100644
> --- a/arch/Kconfig
> +++ b/arch/Kconfig
> @@ -199,4 +199,22 @@ config HAVE_CMPXCHG_LOCAL
> config HAVE_CMPXCHG_DOUBLE
> bool
>
> +config HAVE_ARCH_SECCOMP_FILTER
> + bool
> + help
> + This symbol should be selected by an architecure if it provides
> + asm/syscall.h, specifically syscall_get_arguments() and
> + syscall_get_arch().
> +
> +config SECCOMP_FILTER
> + def_bool y
> + depends on HAVE_ARCH_SECCOMP_FILTER && SECCOMP && NET
> + help
> + Enable tasks to build secure computing environments defined
> + in terms of Berkeley Packet Filter programs which implement
> + task-defined system call filtering polices.
> +
> + See Documentation/prctl/seccomp_filter.txt for more
> + information on the topic of seccomp filtering.

The last part is redundant, the topic is clear.

> +
> source "kernel/gcov/Kconfig"
> diff --git a/include/linux/Kbuild b/include/linux/Kbuild
> index c94e717..d41ba12 100644
> --- a/include/linux/Kbuild
> +++ b/include/linux/Kbuild
> @@ -330,6 +330,7 @@ header-y += scc.h
> header-y += sched.h
> header-y += screen_info.h
> header-y += sdla.h
> +header-y += seccomp.h
> header-y += securebits.h
> header-y += selinux_netlink.h
> header-y += sem.h
> diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h
> index d61f27f..001f883 100644
> --- a/include/linux/seccomp.h
> +++ b/include/linux/seccomp.h
> @@ -1,14 +1,67 @@
> #ifndef _LINUX_SECCOMP_H
> #define _LINUX_SECCOMP_H
>
> +#include <linux/compiler.h>
> +#include <linux/types.h>
> +
> +
> +/* Valid values for seccomp.mode and prctl(PR_SET_SECCOMP, <mode>) */
> +#define SECCOMP_MODE_DISABLED 0 /* seccomp is not in use. */
> +#define SECCOMP_MODE_STRICT 1 /* uses hard-coded filter. */
> +#define SECCOMP_MODE_FILTER 2 /* uses user-supplied filter. */
> +
> +/*
> + * BPF programs may return a 32-bit value.

They have to return a 32-bit value, no "may" about it.

> + * The bottom 16-bits are reserved for future use.
> + * The upper 16-bits are ordered from least permissive values to most.
> + *
> + * The ordering ensures that a min_t() over composed return values always
> + * selects the least permissive choice.
> + */
> +#define SECCOMP_RET_KILL 0x00000000U /* kill the task immediately */
> +#define SECCOMP_RET_ALLOW 0x7fff0000U /* allow */
> +
> +/* Masks for the return value sections. */
> +#define SECCOMP_RET_ACTION 0xffff0000U
> +#define SECCOMP_RET_DATA 0x0000ffffU
> +
> +/**
> + * struct seccomp_data - the format the BPF program executes over.
> + * @args: up to 6 system call arguments. When the calling convention is
> + * 32-bit, the arguments will still be at each args[X] offset.

What does this mean? Do you mean the data layout will always be "LE" for
32-bit archs? I hope not, because that would make it incompatible with
the 64-bit code for BE archs, so it will be confusing. Except if the data
layout is always LE, but then you should document that. If neither is the
case, then the comment is just confusing. Just say that the data layout
depends on the arch's endianness.
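
To show what "depends on the arch's endianness" means for a filter author,
here is a small userspace sketch. It assumes the struct layout quoted below
(copied into a local type with fixed-width fields) and computes where the
low and high 32-bit halves of args[n] sit. The helper names are
hypothetical and nothing here is kernel API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Local copy of the quoted layout, using fixed-width userspace types. */
struct seccomp_data_sketch {
	uint64_t args[6];
	uint64_t instruction_pointer;
	uint32_t arch;
	int32_t nr;
};

/* Returns 1 on a little-endian host, 0 on a big-endian one. */
static int host_is_little_endian(void)
{
	const uint16_t probe = 1;

	return *(const unsigned char *)&probe == 1;
}

/* Byte offset of the low 32 bits of args[n]. */
static size_t arg_lo_offset(unsigned int n)
{
	size_t base = offsetof(struct seccomp_data_sketch, args) +
		      n * sizeof(uint64_t);

	assert(n < 6);
	return host_is_little_endian() ? base : base + sizeof(uint32_t);
}

/* Byte offset of the high 32 bits of args[n]. */
static size_t arg_hi_offset(unsigned int n)
{
	size_t base = offsetof(struct seccomp_data_sketch, args) +
		      n * sizeof(uint64_t);

	assert(n < 6);
	return host_is_little_endian() ? base + sizeof(uint32_t) : base;
}

/* Emulates a 32-bit load at a byte offset into the data. */
static uint32_t load_u32(const void *data, size_t off)
{
	uint32_t v;

	memcpy(&v, (const unsigned char *)data + off, sizeof(v));
	return v;
}
```

A filter that hard-codes offset 8 for "the low word of args[1]" is only
right on little-endian; on big-endian the same value lives at offset 12.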

> + * @instruction_pointer: at the time of the system call.
> + * @arch: indicates system call convention as an AUDIT_ARCH_* value
> + * as defined in <linux/audit.h>.
> + * @nr: the system call number
> + */
> +struct seccomp_data {
> + __u64 args[6];
> + __u64 instruction_pointer;
> + __u32 arch;
> + int nr;
> +};

I agree this looks a hell of a lot nicer. I just hope it's worth it.
Oh well, a bit more ugliness in userspace to make the kernel code a
bit nicer isn't too bad. Just document the endianness issue properly.

What use is the instruction pointer considering it tells nothing about
the call path?

>
> +#ifdef __KERNEL__
> #ifdef CONFIG_SECCOMP
>
> #include <linux/thread_info.h>
> #include <asm/seccomp.h>
>
> +struct seccomp_filter;
> +/**
> + * struct seccomp - the state of a seccomp'ed process
> + *
> + * @mode: indicates one of the valid values above for controlled
> + * system calls available to a process.
> + * @filter: The metadata and ruleset for determining what system calls
> + * are allowed for a task.
> + *
> + * @filter must only be accessed from the context of current as there
> + * is no locking.
> + */
> struct seccomp {
> int mode;
> + struct seccomp_filter *filter;
> };
>
> extern void __secure_computing(int);
> @@ -19,7 +72,7 @@ static inline void secure_computing(int this_syscall)
> }
>
> extern long prctl_get_seccomp(void);
> -extern long prctl_set_seccomp(unsigned long);
> +extern long prctl_set_seccomp(unsigned long, char __user *);
>
> static inline int seccomp_mode(struct seccomp *s)
> {
> @@ -31,15 +84,16 @@ static inline int seccomp_mode(struct seccomp *s)
> #include <linux/errno.h>
>
> struct seccomp { };
> +struct seccomp_filter { };
>
> -#define secure_computing(x) do { } while (0)
> +#define secure_computing(x) 0
>
> static inline long prctl_get_seccomp(void)
> {
> return -EINVAL;
> }
>
> -static inline long prctl_set_seccomp(unsigned long arg2)
> +static inline long prctl_set_seccomp(unsigned long arg2, char __user *arg3)
> {
> return -EINVAL;
> }
> @@ -48,7 +102,21 @@ static inline int seccomp_mode(struct seccomp *s)
> {
> return 0;
> }
> -
> #endif /* CONFIG_SECCOMP */
>
> +#ifdef CONFIG_SECCOMP_FILTER
> +extern void put_seccomp_filter(struct seccomp_filter *);
> +extern void copy_seccomp(struct seccomp *child,
> + const struct seccomp *parent);

This is 80 chars long, why break it up? Please, stop your bad habit of
breaking up (slightly too) long lines.

> +#else /* CONFIG_SECCOMP_FILTER */
> +/* The macro consumes the ->filter reference. */
> +#define put_seccomp_filter(_s) do { } while (0)
> +
> +static inline void copy_seccomp(struct seccomp *child,
> + const struct seccomp *prev)
> +{
> + return;
> +}
> +#endif /* CONFIG_SECCOMP_FILTER */
> +#endif /* __KERNEL__ */
> #endif /* _LINUX_SECCOMP_H */
> diff --git a/kernel/fork.c b/kernel/fork.c
> index b77fd55..a5187b7 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -34,6 +34,7 @@
> #include <linux/cgroup.h>
> #include <linux/security.h>
> #include <linux/hugetlb.h>
> +#include <linux/seccomp.h>
> #include <linux/swap.h>
> #include <linux/syscalls.h>
> #include <linux/jiffies.h>
> @@ -169,6 +170,7 @@ void free_task(struct task_struct *tsk)
> free_thread_info(tsk->stack);
> rt_mutex_debug_task_free(tsk);
> ftrace_graph_exit_task(tsk);
> + put_seccomp_filter(tsk->seccomp.filter);

So that's why you use macros sometimes, to make it compile with
CONFIG_SECCOMP disabled where there is no seccomp.filter.

> free_task_struct(tsk);
> }
> EXPORT_SYMBOL(free_task);
> @@ -1113,6 +1115,7 @@ static struct task_struct *copy_process(unsigned long clone_flags,
> goto fork_out;
>
> ftrace_graph_init_task(p);
> + copy_seccomp(&p->seccomp, &current->seccomp);
>
> rt_mutex_init_task(p);
>
> diff --git a/kernel/seccomp.c b/kernel/seccomp.c
> index e8d76c5..0043b7e 100644
> --- a/kernel/seccomp.c
> +++ b/kernel/seccomp.c
> @@ -3,16 +3,287 @@
> *
> * Copyright 2004-2005 Andrea Arcangeli <[email protected]>
> *
> - * This defines a simple but solid secure-computing mode.
> + * Copyright (C) 2012 Google, Inc.
> + * Will Drewry <[email protected]>
> + *
> + * This defines a simple but solid secure-computing facility.
> + *
> + * Mode 1 uses a fixed list of allowed system calls.
> + * Mode 2 allows user-defined system call filters in the form
> + * of Berkeley Packet Filters/Linux Socket Filters.
> */
>
> +#include <linux/atomic.h>
> #include <linux/audit.h>
> -#include <linux/seccomp.h>
> -#include <linux/sched.h>
> #include <linux/compat.h>
> +#include <linux/filter.h>
> +#include <linux/sched.h>
> +#include <linux/seccomp.h>
> +#include <linux/security.h>
> +#include <linux/slab.h>
> +#include <linux/uaccess.h>
> +
> +#include <linux/tracehook.h>
> +#include <asm/syscall.h>
>
> /* #define SECCOMP_DEBUG 1 */
> -#define NR_SECCOMP_MODES 1
> +
> +#ifdef CONFIG_SECCOMP_FILTER
> +/**
> + * struct seccomp_filter - container for seccomp BPF programs
> + *
> + * @usage: reference count to manage the object liftime.
> + * get/put helpers should be used when accessing an instance
> + * outside of a lifetime-guarded section. In general, this
> + * is only needed for handling filters shared across tasks.
> + * @prev: points to a previously installed, or inherited, filter
> + * @compat: indicates the value of is_compat_task() at creation time

You're not really using 'compat', except for logging.

But you could use it to run only the filters with the right arch.

> + * @insns: the BPF program instructions to evaluate
> + * @len: the number of instructions in the program
> + *
> + * seccomp_filter objects are organized in a tree linked via the @prev
> + * pointer. For any task, it appears to be a singly-linked list starting
> + * with current->seccomp.filter, the most recently attached or inherited filter.
> + * However, multiple filters may share a @prev node, by way of fork(), which
> + * results in a unidirectional tree existing in memory. This is similar to
> + * how namespaces work.
> + *
> + * seccomp_filter objects should never be modified after being attached
> + * to a task_struct (other than @usage).
> + */
> +struct seccomp_filter {
> + atomic_t usage;
> + struct seccomp_filter *prev;
> + bool compat;
> + unsigned short len; /* Instruction count */
> + struct sock_filter insns[];
> +};
> +
> +static void seccomp_filter_log_failure(int syscall)
> +{
> + int compat = 0;
> +#ifdef CONFIG_COMPAT
> + compat = is_compat_task();
> +#endif
> + pr_info("%s[%d]: %ssystem call %d blocked at 0x%lx\n",
> + current->comm, task_pid_nr(current),
> + (compat ? "compat " : ""),
> + syscall, KSTK_EIP(current));
> +}
> +
> +/**
> + * get_u32 - returns a u32 offset into data
> + * @data: a unsigned 64 bit value
> + * @index: 0 or 1 to return the first or second 32-bits
> + *
> + * This inline exists to hide the length of unsigned long.
> + * If a 32-bit unsigned long is passed in, it will be extended
> + * and the top 32-bits will be 0. If it is a 64-bit unsigned
> + * long, then whatever data is resident will be properly returned.
> + */
> +static inline u32 get_u32(u64 data, int index)
> +{
> + return ((u32 *)&data)[index];
> +}
> +
> +/* Helper for bpf_load below. */
> +#define BPF_DATA(_name) offsetof(struct seccomp_data, _name)
> +/**
> + * bpf_load: checks and returns a pointer to the requested offset
> + * @nr: int syscall passed as a void * to bpf_run_filter
> + * @off: index into struct seccomp_data to load from
> + * @size: load width requested
> + * @buffer: temporary storage supplied by bpf_run_filter
> + *
> + * Returns a pointer to @buffer where the value was stored.
> + * On failure, returns NULL.
> + */
> +static void *bpf_load(const void *nr, int off, unsigned int size, void *buf)
> +{
> + unsigned long value;
> + u32 *A = buf;
> +
> + if (size != sizeof(u32))
> + return NULL;
> +
> + if (off >= BPF_DATA(args[0]) && off < BPF_DATA(args[6])) {
> + struct pt_regs *regs = task_pt_regs(current);
> + int arg = off >> 3; /* args[0] is at offset 0. */

Probably clearer if you just do off / 8, you can count on compilers to
get that right and turn it into a shift.

> + int index = (off % sizeof(u64)) ? 1 : 0;

Considering the previous line I expected to see (off & 4).

Anyway, this code mostly ignores the lowest three bits and instead of
either returning an error or the requested data, it returns the aligned
value instead. Not good.

> + syscall_get_arguments(current, regs, arg, 1, &value);
> + *A = get_u32(value, index);
> + } else if (off == BPF_DATA(nr)) {
> + *A = (u32)(uintptr_t)nr;
> + } else if (off == BPF_DATA(arch)) {
> + struct pt_regs *regs = task_pt_regs(current);
> + *A = syscall_get_arch(current, regs);
> + } else if (off == BPF_DATA(instruction_pointer)) {
> + *A = get_u32(KSTK_EIP(current), 0);
> + } else if (off == BPF_DATA(instruction_pointer) + sizeof(u32)) {
> + *A = get_u32(KSTK_EIP(current), 1);
> + } else {
> + return NULL;
> + }
> + return buf;
> +}
> +
> +/**
> + * seccomp_run_filters - evaluates all seccomp filters against @syscall
> + * @syscall: number of the current system call
> + *
> + * Returns valid seccomp BPF response codes.
> + */
> +static u32 seccomp_run_filters(int syscall)
> +{
> + struct seccomp_filter *f;
> + u32 ret = SECCOMP_RET_KILL;
> + static const struct bpf_load_fn fns = {
> + bpf_load,
> + sizeof(struct seccomp_data),

I suppose this could be used to check for new fields if struct seccomp_data
ever gets extended in the future.

> + };
> + const void *sc_ptr = (const void *)(uintptr_t)syscall;
> +
> + /*
> + * All filters are evaluated in order of youngest to oldest. The lowest
> + * BPF return value always takes priority.
> + */
> + for (f = current->seccomp.filter; f; f = f->prev) {
> + ret = bpf_run_filter(sc_ptr, f->insns, &fns);
> + if (ret != SECCOMP_RET_ALLOW)
> + break;
> + }
> + return ret;
> +}
> +
> +/**
> + * seccomp_attach_filter: Attaches a seccomp filter to current.
> + * @fprog: BPF program to install
> + *
> + * Returns 0 on success or an errno on failure.
> + */
> +static long seccomp_attach_filter(struct sock_fprog *fprog)
> +{
> + struct seccomp_filter *filter;
> + unsigned long fp_size = fprog->len * sizeof(struct sock_filter);
> + long ret;
> +
> + if (fprog->len == 0 || fprog->len > BPF_MAXINSNS)
> + return -EINVAL;
> +
> + /* Allocate a new seccomp_filter */
> + filter = kzalloc(sizeof(struct seccomp_filter) + fp_size, GFP_KERNEL);
> + if (!filter)
> + return -ENOMEM;
> + atomic_set(&filter->usage, 1);
> + filter->len = fprog->len;
> +
> + /* Copy the instructions from fprog. */
> + ret = -EFAULT;
> + if (copy_from_user(filter->insns, fprog->filter, fp_size))
> + goto out;
> +
> + /* Check the fprog */
> + ret = bpf_chk_filter(filter->insns, filter->len, BPF_CHK_FLAGS_NO_SKB);
> + if (ret)
> + goto out;
> +
> + /*
> + * Installing a seccomp filter requires that the task
> + * have CAP_SYS_ADMIN in its namespace or be running with
> + * no_new_privs. This avoids scenarios where unprivileged
> + * tasks can affect the behavior of privileged children.
> + */
> + ret = -EACCES;
> + if (!current->no_new_privs &&
> + security_capable_noaudit(current_cred(), current_user_ns(),
> + CAP_SYS_ADMIN) != 0)
> + goto out;
> +
> + /*
> + * If there is an existing filter, make it the prev
> + * and don't drop its task reference.
> + */
> + filter->prev = current->seccomp.filter;
> + current->seccomp.filter = filter;
> + return 0;
> +out:
> + put_seccomp_filter(filter); /* for get or task, on err */
> + return ret;
> +}
> +
> +/**
> + * seccomp_attach_user_filter - attaches a user-supplied sock_fprog
> + * @user_filter: pointer to the user data containing a sock_fprog.
> + *
> + * This function may be called repeatedly to install additional filters.
> + * Every filter successfully installed will be evaluated (in reverse order)
> + * for each system call the task makes.
> + *
> + * Returns 0 on success and non-zero otherwise.
> + */
> +long seccomp_attach_user_filter(char __user *user_filter)
> +{
> + struct sock_fprog fprog;
> + long ret = -EFAULT;
> +
> + if (!user_filter)
> + goto out;
> +#ifdef CONFIG_COMPAT
> + if (is_compat_task()) {
> + /* XXX: Share with net/compat.c (compat_sock_fprog) */

Then do so as part of your BPF sharing patch.

> + struct {
> + u16 len;
> + compat_uptr_t filter; /* struct sock_filter */
> + } fprog32;
> + if (copy_from_user(&fprog32, user_filter, sizeof(fprog32)))
> + goto out;
> + fprog.len = fprog32.len;
> + fprog.filter = compat_ptr(fprog32.filter);
> + } else /* falls through to the if below. */
> +#endif
> + if (copy_from_user(&fprog, user_filter, sizeof(fprog)))
> + goto out;
> + ret = seccomp_attach_filter(&fprog);
> +out:
> + return ret;
> +}
> +
> +/* get_seccomp_filter - increments the reference count of @orig. */
> +static struct seccomp_filter *get_seccomp_filter(struct seccomp_filter *orig)
> +{
> + if (!orig)
> + return NULL;
> + /* Reference count is bounded by the number of total processes. */
> + atomic_inc(&orig->usage);
> + return orig;
> +}
> +
> +/* put_seccomp_filter - decrements the ref count of @orig and may free. */
> +void put_seccomp_filter(struct seccomp_filter *orig)
> +{
> + /* Clean up single-reference branches iteratively. */
> + while (orig && atomic_dec_and_test(&orig->usage)) {
> + struct seccomp_filter *freeme = orig;
> + orig = orig->prev;
> + kfree(freeme);
> + }
> +}
> +
> +/**
> + * copy_seccomp: manages inheritance on fork
> + * @child: forkee's seccomp
> + * @prev: forker's seccomp
> + *
> + * Ensures that @child inherits seccomp mode and state if
> + * seccomp filtering is in use.
> + */
> +void copy_seccomp(struct seccomp *child,
> + const struct seccomp *prev)

One line please.

> +{
> + child->mode = prev->mode;
> + child->filter = get_seccomp_filter(prev->filter);
> +}
> +#endif /* CONFIG_SECCOMP_FILTER */
>
> /*
> * Secure computing mode 1 allows only read/write/exit/sigreturn.
> @@ -34,10 +305,10 @@ static int mode1_syscalls_32[] = {
> void __secure_computing(int this_syscall)
> {
> int mode = current->seccomp.mode;
> - int * syscall;
> + int *syscall;
>
> switch (mode) {
> - case 1:
> + case SECCOMP_MODE_STRICT:
> syscall = mode1_syscalls;
> #ifdef CONFIG_COMPAT
> if (is_compat_task())
> @@ -48,6 +319,13 @@ void __secure_computing(int this_syscall)
> return;
> } while (*++syscall);
> break;
> +#ifdef CONFIG_SECCOMP_FILTER
> + case SECCOMP_MODE_FILTER:
> + if (seccomp_run_filters(this_syscall) == SECCOMP_RET_ALLOW)
> + return;
> + seccomp_filter_log_failure(this_syscall);
> + break;
> +#endif
> default:
> BUG();
> }
> @@ -64,25 +342,34 @@ long prctl_get_seccomp(void)
> return current->seccomp.mode;
> }
>
> -long prctl_set_seccomp(unsigned long seccomp_mode)
> +long prctl_set_seccomp(unsigned long seccomp_mode, char __user *filter)
> {
> - long ret;
> + long ret = -EINVAL;
>
> - /* can set it only once to be even more secure */
> - ret = -EPERM;
> - if (unlikely(current->seccomp.mode))
> + if (current->seccomp.mode &&
> + current->seccomp.mode != seccomp_mode)

Wouldn't it make sense to allow going from mode 2 to 1?
After all, the filter could have blocked it if it didn't
want to permit it, and mode 1 is more restrictive than
mode 2.
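
As a sketch of the rule I have in mind (mode_change_allowed is a
hypothetical helper and the MODE_* names are local stand-ins for the
SECCOMP_MODE_* values in the patch): same-mode requests stay allowed as in
the patch, and filter to strict is added as the only cross-mode transition.

```c
#include <assert.h>
#include <stdbool.h>

/* Local stand-ins for the SECCOMP_MODE_* values quoted in the patch. */
#define MODE_DISABLED 0
#define MODE_STRICT   1
#define MODE_FILTER   2

/* Returns true if a task in mode @cur may request mode @next. */
static bool mode_change_allowed(int cur, int next)
{
	assert(cur >= MODE_DISABLED && cur <= MODE_FILTER);

	if (next != MODE_STRICT && next != MODE_FILTER)
		return false;		/* unknown or disabled: reject */
	if (cur == MODE_DISABLED || cur == next)
		return true;		/* what the patch already permits */
	/* Only tightening is allowed: filter -> strict, never the reverse. */
	return cur == MODE_FILTER && next == MODE_STRICT;
}
```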

> goto out;
>
> - ret = -EINVAL;
> - if (seccomp_mode && seccomp_mode <= NR_SECCOMP_MODES) {
> - current->seccomp.mode = seccomp_mode;
> - set_thread_flag(TIF_SECCOMP);
> + switch (seccomp_mode) {
> + case SECCOMP_MODE_STRICT:
> + ret = 0;
> #ifdef TIF_NOTSC
> disable_TSC();
> #endif
> - ret = 0;
> + break;
> +#ifdef CONFIG_SECCOMP_FILTER
> + case SECCOMP_MODE_FILTER:
> + ret = seccomp_attach_user_filter(filter);
> + if (ret)
> + goto out;
> + break;
> +#endif
> + default:
> + goto out;
> }
>
> - out:
> + current->seccomp.mode = seccomp_mode;
> + set_thread_flag(TIF_SECCOMP);
> +out:
> return ret;
> }
> diff --git a/kernel/sys.c b/kernel/sys.c
> index 4070153..905031e 100644
> --- a/kernel/sys.c
> +++ b/kernel/sys.c
> @@ -1899,7 +1899,7 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
> error = prctl_get_seccomp();
> break;
> case PR_SET_SECCOMP:
> - error = prctl_set_seccomp(arg2);
> + error = prctl_set_seccomp(arg2, (char __user *)arg3);
> break;
> case PR_GET_TSC:
> error = GET_TSC_CTL(arg2);

Out of curiosity, did you measure the kernel size differences before and
after these patches? Would be sad if sharing it with the networking code
didn't reduce the actual kernel size.

Greetings,

Indan

2012-02-22 08:34:50

by Indan Zupancic

[permalink] [raw]
Subject: Re: [PATCH v10 07/11] signal, x86: add SIGSYS info and make it synchronous.

On Tue, February 21, 2012 18:30, Will Drewry wrote:
> This change enables SIGSYS, defines _sigfields._sigsys, and adds
> x86 (compat) arch support. _sigsys defines fields which allow
> a signal handler to receive the triggering system call number,
> the relevant AUDIT_ARCH_* value for that number, and the address
> of the callsite.
>
> To ensure that SIGSYS delivery occurs on return from the triggering
> system call, SIGSYS is added to the SYNCHRONOUS_MASK macro. I'm not sure if
> this is enough to ensure it will be synchronous or if it is explicitly
> required to ensure an immediate delivery of the signal upon return from
> the blocked system call.
>
> The first consumer of SIGSYS would be seccomp filter. In particular,
> a filter program could specify a new return value, SECCOMP_RET_TRAP,
> which would result in the system call being denied and the calling
> thread signaled. This also means that implementing arch-specific
> support can be dependent upon HAVE_ARCH_SECCOMP_FILTER.

I think others said this is useful, but I don't see how. Easier
debugging compared to checking return values?

I suppose SIGSYS can be blocked, so there is no guarantee the process
will be killed.

> v10: - first version based on suggestion
>
> Suggested-by: H. Peter Anvin <[email protected]>
> Signed-off-by: Will Drewry <[email protected]>
> ---
> arch/x86/ia32/ia32_signal.c | 4 ++++
> arch/x86/include/asm/ia32.h | 6 ++++++
> include/asm-generic/siginfo.h | 18 ++++++++++++++++++
> kernel/signal.c | 2 +-
> 4 files changed, 29 insertions(+), 1 deletions(-)
>
> diff --git a/arch/x86/ia32/ia32_signal.c b/arch/x86/ia32/ia32_signal.c
> index 6557769..c81d2c7 100644
> --- a/arch/x86/ia32/ia32_signal.c
> +++ b/arch/x86/ia32/ia32_signal.c
> @@ -73,6 +73,10 @@ int copy_siginfo_to_user32(compat_siginfo_t __user *to, siginfo_t *from)
> switch (from->si_code >> 16) {
> case __SI_FAULT >> 16:
> break;
> + case __SI_SYS >> 16:
> + put_user_ex(from->si_syscall, &to->si_syscall);
> + put_user_ex(from->si_arch, &to->si_arch);
> + break;
> case __SI_CHLD >> 16:
> put_user_ex(from->si_utime, &to->si_utime);
> put_user_ex(from->si_stime, &to->si_stime);
> diff --git a/arch/x86/include/asm/ia32.h b/arch/x86/include/asm/ia32.h
> index 1f7e625..541485f 100644
> --- a/arch/x86/include/asm/ia32.h
> +++ b/arch/x86/include/asm/ia32.h
> @@ -126,6 +126,12 @@ typedef struct compat_siginfo {
> int _band; /* POLL_IN, POLL_OUT, POLL_MSG */
> int _fd;
> } _sigpoll;
> +
> + struct {
> + unsigned int _call_addr; /* calling insn */

Why an int here, but a pointer below?

> + int _syscall; /* triggering system call number */
> + unsigned int _arch; /* AUDIT_ARCH_* of syscall */
> + } _sigsys;
> } _sifields;
> } compat_siginfo_t;
>
> diff --git a/include/asm-generic/siginfo.h b/include/asm-generic/siginfo.h
> index 0dd4e87..a83b478 100644
> --- a/include/asm-generic/siginfo.h
> +++ b/include/asm-generic/siginfo.h
> @@ -90,6 +90,13 @@ typedef struct siginfo {
> __ARCH_SI_BAND_T _band; /* POLL_IN, POLL_OUT, POLL_MSG */
> int _fd;
> } _sigpoll;
> +
> + /* SIGSYS */
> + struct {
> + void __user *_call_addr; /* calling insn */

Is this a user instruction pointer or a filter instruction?

> + int _syscall; /* triggering system call number */
> + unsigned int _arch; /* AUDIT_ARCH_* of syscall */
> + } _sigsys;
> } _sifields;
> } siginfo_t;
>
> @@ -116,6 +123,9 @@ typedef struct siginfo {
> #define si_addr_lsb _sifields._sigfault._addr_lsb
> #define si_band _sifields._sigpoll._band
> #define si_fd _sifields._sigpoll._fd
> +#define si_call_addr _sifields._sigsys._call_addr
> +#define si_syscall _sifields._sigsys._syscall
> +#define si_arch _sifields._sigsys._arch
>
> #ifdef __KERNEL__
> #define __SI_MASK 0xffff0000u
> @@ -126,6 +136,7 @@ typedef struct siginfo {
> #define __SI_CHLD (4 << 16)
> #define __SI_RT (5 << 16)
> #define __SI_MESGQ (6 << 16)
> +#define __SI_SYS (7 << 16)
> #define __SI_CODE(T,N) ((T) | ((N) & 0xffff))
> #else
> #define __SI_KILL 0
> @@ -135,6 +146,7 @@ typedef struct siginfo {
> #define __SI_CHLD 0
> #define __SI_RT 0
> #define __SI_MESGQ 0
> +#define __SI_SYS 0
> #define __SI_CODE(T,N) (N)
> #endif
>
> @@ -232,6 +244,12 @@ typedef struct siginfo {
> #define NSIGPOLL 6
>
> /*
> + * SIGSYS si_codes
> + */
> +#define SYS_SECCOMP (__SI_SYS|1) /* seccomp triggered */
> +#define NSIGSYS 1
> +
> +/*
> * sigevent definitions
> *
> * It seems likely that SIGEV_THREAD will have to be handled from
> diff --git a/kernel/signal.c b/kernel/signal.c
> index c73c428..7573819 100644
> --- a/kernel/signal.c
> +++ b/kernel/signal.c
> @@ -160,7 +160,7 @@ void recalc_sigpending(void)
>
> #define SYNCHRONOUS_MASK \
> (sigmask(SIGSEGV) | sigmask(SIGBUS) | sigmask(SIGILL) | \
> - sigmask(SIGTRAP) | sigmask(SIGFPE))
> + sigmask(SIGTRAP) | sigmask(SIGFPE) | sigmask(SIGSYS))
>
> int next_signal(struct sigpending *pending, sigset_t *mask)
> {
> --

Greetings,

Indan

2012-02-22 12:23:09

by Indan Zupancic

[permalink] [raw]
Subject: Re: [PATCH v10 09/11] ptrace,seccomp: Add PTRACE_SECCOMP support

On Tue, February 21, 2012 18:30, Will Drewry wrote:
> A new return value is added to seccomp filters that allows
> the system call policy for the affected system calls to be
> implemented by a ptrace(2)ing process.
>
> If a tracer attaches to a task, specifies the PTRACE_O_TRACESECCOMP
> option, then PTRACE_CONT.

Awkward formulation here. I'd start with "If a tracer sets the
PTRACE_O_TRACESECCOMP option, then ..."

> After doing so, the tracer will
> be notified if a seccomp filter program returns SECCOMP_RET_TRACE.

This means that strace and gdb won't see seccomp filtered syscalls.
I think you have to reverse the logic and have an option that asks
to hide normal SECCOMP_RET_ERRNO, but not SECCOMP_RET_TRACE ones.

That gives the expected behaviour in all cases: Programs not setting
it behave as they do now, and co-operating tracers can ignore syscall
events they're not interested in.

> If there is no seccomp event tracer, SECCOMP_RET_TRACE system calls will
> return a -ENOSYS errno to user space. If the tracer detaches during a
> hand-off, the process will be killed.
>
> To ensure that seccomp is syscall fast-path friendly in the future,
> ptrace is delegated to by setting TIF_SYSCALL_TRACE. Since seccomp
> events are equivalent to system call entry events, this allows for
> seccomp to be evaluated as a fork off the fast-path and only,
> optionally, jump to the slow path. When the tracer is notified, all
> will function as with ptrace(PTRACE_SYSCALLS), but when the tracer calls
> ptrace(PTRACE_CONT), TIF_SYSCALL_TRACE will be unset and the task
> will proceed just receiving PTRACE_O_TRACESECCOMP events.

Please, no. That's making it more complicated than necessary.

I propose to keep the ptrace rules exactly the same as they are, with
the only change being that if PTRACE_O_SECCOMP is set, no syscall events
will be generated for SECCOMP_RET_ERRNO. This way ptrace behaviour stays
the same, except that fewer syscall events are received. With your way
ptracers see syscall events when they normally wouldn't.
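
Roughly, the decision I am proposing looks like the sketch below. The
function name and its two boolean inputs are hypothetical; the
SECCOMP_RET_* values are the ones quoted from this series. It is a model of
the rule, not kernel code.

```c
#include <stdbool.h>
#include <stdint.h>

/* Values as quoted from the patch series. */
#define SECCOMP_RET_ERRNO  0x00030000U
#define SECCOMP_RET_TRACE  0x7ffe0000U
#define SECCOMP_RET_ALLOW  0x7fff0000U
#define SECCOMP_RET_ACTION 0xffff0000U

/*
 * Should the tracer get a syscall event for this call?
 * @syscall_tracing: the tracer asked for syscall events in the usual way
 * @hide_errno:      the tracer set the proposed option
 * @filter_ret:      value returned by the seccomp filters
 */
static bool report_syscall_event(bool syscall_tracing, bool hide_errno,
				 uint32_t filter_ret)
{
	uint32_t action = filter_ret & SECCOMP_RET_ACTION;

	if (!syscall_tracing)
		return false;	/* unchanged: no tracing, no events */
	if (hide_errno && action == SECCOMP_RET_ERRNO)
		return false;	/* the only new rule */
	return true;		/* everything else behaves as today */
}
```

A tracer that never sets the option sees exactly the events it sees now,
including calls a filter answered with an errno.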

>
> I realize there are pending patches for cleaning up ptrace events.
> I can either reintegrate with those when they are available or
> vice versa. That's assuming these changes make sense and are viable.
>
> v10: - moved to PTRACE_O_SECCOMP / PT_TRACE_SECCOMP
> v9: - n/a
> v8: - guarded PTRACE_SECCOMP use with an ifdef
> v7: - introduced
>
> Signed-off-by: Will Drewry <[email protected]>
> ---
> arch/Kconfig | 4 +++
> include/linux/ptrace.h | 7 ++++-
> include/linux/seccomp.h | 14 +++++++++--
> include/linux/tracehook.h | 7 +++++-
> kernel/ptrace.c | 4 +++
> kernel/seccomp.c | 52 ++++++++++++++++++++++++++++++++++++++++++--
> 6 files changed, 79 insertions(+), 9 deletions(-)
>
> diff --git a/arch/Kconfig b/arch/Kconfig
> index 6d6d9dc..02c18ca 100644
> --- a/arch/Kconfig
> +++ b/arch/Kconfig
> @@ -203,6 +203,7 @@ config HAVE_ARCH_SECCOMP_FILTER
> bool
> help
> This symbol should be selected by an architecure if it provides:
> + linux/tracehook.h, for TIF_SYSCALL_TRACE and ptrace_report_syscall
> asm/syscall.h:
> - syscall_get_arch()
> - syscall_get_arguments()
> @@ -211,6 +212,9 @@ config HAVE_ARCH_SECCOMP_FILTER
> SIGSYS siginfo_t support must be implemented.
> __secure_computing_int()/secure_computing()'s return value must be
> checked, with -1 resulting in the syscall being skipped.
> + If secure_computing is not in the system call slow path, the thread
> + info flags will need to be checked upon exit to ensure delegation to
> + ptrace(2) did not occur, or if it did, jump to the slow-path.
>
> config SECCOMP_FILTER
> def_bool y
> diff --git a/include/linux/ptrace.h b/include/linux/ptrace.h
> index c2f1f6a..2fccdbc 100644
> --- a/include/linux/ptrace.h
> +++ b/include/linux/ptrace.h
> @@ -62,8 +62,9 @@
> #define PTRACE_O_TRACEEXEC 0x00000010
> #define PTRACE_O_TRACEVFORKDONE 0x00000020
> #define PTRACE_O_TRACEEXIT 0x00000040
> +#define PTRACE_O_TRACESECCOMP 0x00000080
>
> -#define PTRACE_O_MASK 0x0000007f
> +#define PTRACE_O_MASK 0x000000ff
>
> /* Wait extended result codes for the above trace options. */
> #define PTRACE_EVENT_FORK 1
> @@ -73,6 +74,7 @@
> #define PTRACE_EVENT_VFORK_DONE 5
> #define PTRACE_EVENT_EXIT 6
> #define PTRACE_EVENT_STOP 7
> +#define PTRACE_EVENT_SECCOMP 8 /* never directly delivered */
>
> #include <asm/ptrace.h>
>
> @@ -101,8 +103,9 @@
> #define PT_TRACE_EXEC PT_EVENT_FLAG(PTRACE_EVENT_EXEC)
> #define PT_TRACE_VFORK_DONE PT_EVENT_FLAG(PTRACE_EVENT_VFORK_DONE)
> #define PT_TRACE_EXIT PT_EVENT_FLAG(PTRACE_EVENT_EXIT)
> +#define PT_TRACE_SECCOMP PT_EVENT_FLAG(PTRACE_EVENT_SECCOMP)
>
> -#define PT_TRACE_MASK 0x000003f4
> +#define PT_TRACE_MASK 0x00000ff4
>
> /* single stepping state bits (used on ARM and PA-RISC) */
> #define PT_SINGLESTEP_BIT 31
> diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h
> index d039b7b..16887c1 100644
> --- a/include/linux/seccomp.h
> +++ b/include/linux/seccomp.h
> @@ -19,8 +19,9 @@
> * selects the least permissive choice.
> */
> #define SECCOMP_RET_KILL 0x00000000U /* kill the task immediately */
> -#define SECCOMP_RET_TRAP 0x00020000U /* disallow and send sigtrap */
> -#define SECCOMP_RET_ERRNO 0x00030000U /* returns an errno */
> +#define SECCOMP_RET_TRAP 0x00020000U /* only send sigtrap */
> +#define SECCOMP_RET_ERRNO 0x00030000U /* only return an errno */
> +#define SECCOMP_RET_TRACE 0x7ffe0000U /* allow, but notify the tracer */
> #define SECCOMP_RET_ALLOW 0x7fff0000U /* allow */
>
> /* Masks for the return value sections. */
> @@ -55,6 +56,7 @@ struct seccomp_filter;
> *
> * @mode: indicates one of the valid values above for controlled
> * system calls available to a process.
> + * @in_trace: indicates a seccomp filter hand off to ptrace has occurred
> * @filter: The metadata and ruleset for determining what system calls
> * are allowed for a task.
> *
> @@ -63,6 +65,7 @@ struct seccomp_filter;
> */
> struct seccomp {
> int mode;
> + int in_trace;
> struct seccomp_filter *filter;
> };
>
> @@ -116,15 +119,20 @@ static inline int seccomp_mode(struct seccomp *s)
> extern void put_seccomp_filter(struct seccomp_filter *);
> extern void copy_seccomp(struct seccomp *child,
> const struct seccomp *parent);
> +extern void seccomp_tracer_done(void);
> #else /* CONFIG_SECCOMP_FILTER */
> /* The macro consumes the ->filter reference. */
> #define put_seccomp_filter(_s) do { } while (0)
> -
> static inline void copy_seccomp(struct seccomp *child,
> const struct seccomp *prev)
> {
> return;
> }
> +
> +static inline void seccomp_tracer_done(void)
> +{
> + return;
> +}
> #endif /* CONFIG_SECCOMP_FILTER */
> #endif /* __KERNEL__ */
> #endif /* _LINUX_SECCOMP_H */
> diff --git a/include/linux/tracehook.h b/include/linux/tracehook.h
> index a71a292..5000169 100644
> --- a/include/linux/tracehook.h
> +++ b/include/linux/tracehook.h
> @@ -48,6 +48,7 @@
>
> #include <linux/sched.h>
> #include <linux/ptrace.h>
> +#include <linux/seccomp.h>
> #include <linux/security.h>
> struct linux_binprm;
>
> @@ -59,7 +60,7 @@ static inline void ptrace_report_syscall(struct pt_regs *regs)
> int ptrace = current->ptrace;
>
> if (!(ptrace & PT_PTRACED))
> - return;
> + goto out;
>
> ptrace_notify(SIGTRAP | ((ptrace & PT_TRACESYSGOOD) ? 0x80 : 0));
>
> @@ -72,6 +73,10 @@ static inline void ptrace_report_syscall(struct pt_regs *regs)
> send_sig(current->exit_code, current, 1);
> current->exit_code = 0;
> }
> +
> +out:
> + if (ptrace & PT_TRACE_SECCOMP)
> + seccomp_tracer_done();
> }
>
> /**
> diff --git a/kernel/ptrace.c b/kernel/ptrace.c
> index 00ab2ca..61e5ac4 100644
> --- a/kernel/ptrace.c
> +++ b/kernel/ptrace.c
> @@ -19,6 +19,7 @@
> #include <linux/signal.h>
> #include <linux/audit.h>
> #include <linux/pid_namespace.h>
> +#include <linux/seccomp.h>
> #include <linux/syscalls.h>
> #include <linux/uaccess.h>
> #include <linux/regset.h>
> @@ -551,6 +552,9 @@ static int ptrace_setoptions(struct task_struct *child, unsigned long data)
> if (data & PTRACE_O_TRACEEXIT)
> child->ptrace |= PT_TRACE_EXIT;
>
> + if (data & PTRACE_O_TRACESECCOMP)
> + child->ptrace |= PT_TRACE_SECCOMP;
> +
> return (data & ~PTRACE_O_MASK) ? -EINVAL : 0;
> }
>
> diff --git a/kernel/seccomp.c b/kernel/seccomp.c
> index fc25d3a..120ceec 100644
> --- a/kernel/seccomp.c
> +++ b/kernel/seccomp.c
> @@ -270,13 +270,12 @@ void put_seccomp_filter(struct seccomp_filter *orig)
> * @child: forkee's seccomp
> * @prev: forker's seccomp
> *
> - * Ensures that @child inherits seccomp mode and state if
> - * seccomp filtering is in use.
> + * Ensures that @child inherits seccomp filtering if in use.
> */
> void copy_seccomp(struct seccomp *child,
> const struct seccomp *prev)
> {
> - child->mode = prev->mode;
> + /* Other fields are handled by dup_task_struct. */
> child->filter = get_seccomp_filter(prev->filter);
> }
>
> @@ -299,6 +298,31 @@ static void seccomp_send_sigsys(int syscall, int reason)
> info.si_syscall = syscall;
> force_sig_info(SIGSYS, &info, current);
> }
> +
> +/**
> + * seccomp_tracer_done: handles clean up after handing off to ptrace.
> + *
> + * Checks that the hand off from SECCOMP_RET_TRACE to ptrace was not
> + * subject to a race condition where the tracer disappeared or was
> + * never notified because of a pending SIGKILL.
> + * N.b., if ptrace_syscall_entry returned an int, this call could just
> + * disable the system call rather than using do_exit on tracer death.
> + */
> +void seccomp_tracer_done(void)
> +{
> + struct seccomp *s = &current->seccomp;
> + /* Some other slow-path call occurred */
> + if (!s->in_trace)

So I guess it's more like 'check_trace' or something.

> + return;
> + s->in_trace = 0;
> + /* Tracer detached/died at some point after handing off to ptrace. */
> + if (!(current->ptrace & PT_PTRACED))
> + do_exit(SIGKILL);

This isn't possible, because seccomp_tracer_done() is only called when
PT_TRACE_SECCOMP is set, which gets cleared when the tracer goes away.

> + /* If there is a SIGKILL pending, just do_exit. */
> + if (sigismember(&current->pending.signal, SIGKILL) ||
> + sigismember(&current->signal->shared_pending.signal, SIGKILL))
> + do_exit(SIGKILL);

This bit shouldn't be necessary, as it should be in ptrace core. Oleg's
fix should be upstream before your seccomp patches. Unless I missed
something, and this is not meant to fix the current buggy behaviour
where the task is only killed after the current syscall?

But you got the logic reversed: the task should be killed unless
seccomp_tracer_done() was called. You can't kill the task from within
seccomp_tracer_done(); that is unreliable.

> +}
> #endif /* CONFIG_SECCOMP_FILTER */
>
> /*
> @@ -360,6 +384,28 @@ int __secure_computing_int(int this_syscall)
> seccomp_send_sigsys(this_syscall, reason_code);
> return -1;
> }
> + case SECCOMP_RET_TRACE:
> + /* If there is no interested tracer, return ENOSYS. */
> + if (!(current->ptrace & PT_TRACE_SECCOMP))
> + return -1;
> + /*
> + * Delegate to TIF_SYSCALL_TRACE. This allows fast-path
> + * seccomp calls to delegate to slow-path if needed.
> + * Since TIF_SYSCALL_TRACE will be unset on ptrace(2)
> + * continuation, there should be no direct side
> + * effects. If TIF_SYSCALL_TRACE is already set, this
> + * has no effect. Upon completion of handling, ptrace
> + * will call seccomp_tracer_done() which helps handle
> + * races.
> + */
> + set_tsk_thread_flag(current, TIF_SYSCALL_TRACE);
> + current->seccomp.in_trace = 1;
> + /*
> + * Allow the call, but upon completion, ptrace will
> + * call seccomp_tracer_done to handle tracer
> + * disappearance/death to ensure notification occurred.
> + */
> + return 0;
> case SECCOMP_RET_ALLOW:
> return 0;
> case SECCOMP_RET_KILL:
> --

Greetings,

Indan

2012-02-22 14:23:50

by Ben Hutchings

Subject: Re: [PATCH v10 05/11] seccomp: add system call filtering using BPF

On Wed, 2012-02-22 at 09:19 +0100, Indan Zupancic wrote:
[...]
> Alternative approach: Tell the arch at filter install time and only run the
> filters with the same arch as the current system call. If no filters are run,
> deny the system call.
>
> Advantages:
>
> - Filters don't have to check the arch every syscall entry.
>
> - Secure by default. Filters don't have to do anything arch specific to
> be secure, no surprises possible.
>
> - If a new arch comes into existence, there is no chance of old filters
> becoming buggy and insecure. This is especially true for archs that
> had only one mode, but added another one later on: Old filters had no
> need to check the mode at all.
[...]

What about when there are multiple layers of restrictions? So long as
any one layer covers the new architecture, there is no default-deny even
though the other layers might not cover it.
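To make the concern concrete, here is a minimal sketch of the
install-time-arch scheme with two stacked layers. Everything in it
(struct layer, run_matching_layers(), the fixed per-layer verdict) is
hypothetical and only models the proposal as described above; it is not
code from the series.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define RET_KILL  0x00000000U	/* same value as SECCOMP_RET_KILL */
#define RET_ALLOW 0x7fff0000U	/* same value as SECCOMP_RET_ALLOW */

/* Hypothetical: a layer is tagged with an arch when it is installed. */
struct layer {
	uint32_t arch;		/* AUDIT_ARCH_* value given at install time */
	uint32_t verdict;	/* stand-in for running the layer's BPF program */
};

/*
 * Run only the layers whose arch matches the current call and keep the
 * least permissive verdict. If no layer ran at all, deny by default.
 */
static uint32_t run_matching_layers(const struct layer *layers, size_t n,
				    uint32_t call_arch)
{
	uint32_t ret = RET_ALLOW;
	int ran = 0;
	size_t i;

	assert(layers != NULL || n == 0);
	for (i = 0; i < n; i++) {
		if (layers[i].arch != call_arch)
			continue;	/* this layer is silently skipped */
		ran = 1;
		if (layers[i].verdict < ret)
			ret = layers[i].verdict;
	}
	return ran ? ret : RET_KILL;
}
```

With a restrictive layer tagged 0xC000003E (AUDIT_ARCH_X86_64) and a
permissive one tagged 0x40000003 (AUDIT_ARCH_I386), an i386 call is
allowed: the restrictive layer never runs, and default-deny does not
kick in because the other layer did.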

I would have thought the way to make sure the architecture is always
checked is to pack it together with the syscall number.

Ben.

--
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

2012-02-22 19:47:28

by Will Drewry

Subject: Re: [PATCH v10 05/11] seccomp: add system call filtering using BPF

On Wed, Feb 22, 2012 at 2:19 AM, Indan Zupancic <[email protected]> wrote:
> Hello,
>
> On Tue, February 21, 2012 18:30, Will Drewry wrote:
>> [This patch depends on [email protected]'s no_new_privs patch:
>> https://lkml.org/lkml/2012/1/30/264
>> ]
>>
>> This patch adds support for seccomp mode 2.  Mode 2 introduces the
>> ability for unprivileged processes to install system call filtering
>> policy expressed in terms of a Berkeley Packet Filter (BPF) program.
>> This program will be evaluated in the kernel for each system call
>> the task makes and computes a result based on data in the format
>> of struct seccomp_data.
>>
>> A filter program may be installed by calling:
>>  struct sock_fprog fprog = { ... };
>>  ...
>>  prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &fprog);
>>
>> The return value of the filter program determines if the system call is
>> allowed to proceed or denied.  If the first filter program installed
>> allows prctl(2) calls, then the above call may be made repeatedly
>> by a task to further reduce its access to the kernel.  All attached
>> programs must be evaluated before a system call will be allowed to
>> proceed.
>>
>> Filter programs will be inherited across fork/clone and execve.
>> However, if the task attaching the filter is unprivileged
>> (!CAP_SYS_ADMIN) the no_new_privs bit will be set on the task.  This
>> ensures that unprivileged tasks cannot attach filters that affect
>> privileged tasks (e.g., setuid binary).
>>
>> There are a number of benefits to this approach. A few of which are
>> as follows:
>> - BPF has been exposed to userland for a long time
>> - BPF optimization (and JIT'ing) are well understood
>> - Userland already knows its ABI: system call numbers and desired
>>   arguments
>> - No time-of-check-time-of-use vulnerable data accesses are possible.
>> - system call arguments are loaded on access only to minimize copying
>>   required for system call policy decisions.
>>
>> Mode 2 support is restricted to architectures that enable
>> HAVE_ARCH_SECCOMP_FILTER.  In this patch, the primary dependency is on
>> syscall_get_arguments().  The full desired scope of this feature will
>> add a few minor additional requirements expressed later in this series.
>> Based on discussion, SECCOMP_RET_ERRNO and SECCOMP_RET_TRACE seem to be
>> the desired additional functionality.
>>
>> No architectures are enabled in this patch.
>>
>> v10: - seccomp_data has changed again to be more aesthetically pleasing
>>        ([email protected])
>>      - calling convention is noted in a new u32 field using syscall_get_arch.
>>        This allows for cross-calling convention tasks to use seccomp filters.
>>        ([email protected])
>
> I highly disagree with every filter having to check the mode: Filters that
> don't check the arch on e.g. x86 are buggy, so they have to check it, even
> if it's a 32-bit or 64-bit only system, the filters can't know that and
> needs to check the arch at every syscall entry. All other info in the data
> depends on the arch, because of this there isn't much code to share between
> the two archs, so you can as well have one filter for each arch.
>
> Alternative approach: Tell the arch at filter install time and only run the
> filters with the same arch as the current system call. If no filters are run,
> deny the system call.

This was roughly how I first implemented compat and non-compat
support. It causes some implicit behavior across inheritance that is
not nice though.

> Advantages:
>
> - Filters don't have to check the arch every syscall entry.

This I like.

> - Secure by default. Filters don't have to do anything arch specific to
>   be secure, no surprises possible.

This is partially true, but it is exactly why I hid compat before.

> - If a new arch comes into existence, there is no chance of old filters
>   becoming buggy and insecure. This is especially true for archs that
>   had only one mode, but added another one later on: Old filters had no
>   need to check the mode at all.

Perhaps. A buggy filter that works on x86-64 might be exposed on a
new x32 ABI. It's hard to predict how audit_arch and the syscall ABI
will develop with new platforms.

> - For kernels supporting only one arch, the check can be optimised away,
>   by not installing unsupported arch filters at all.

Somewhat. Without having a dedicated arch helper, you'd have to guess
that a kernel only supports one or two arches (two if compat is
supported), but I don't know if that is a safe assumption to make.

> It's more secure, faster and simpler for the filters.
>
> If something like this is implemented it's fine to expose the arch info
> in the syscall data too, and have a way to install filters for all archs,
> for the few cases where that might be useful, although I can't think of
> any reason why people would like to do unnecessary work in the filters.

It seems to just add complexity to support both. I think we'll
probably end up with it in the filters for better or worse. Possibly
JITing will be useful since at least a 32-bit load and je is pretty
cheap in native instructions.
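For reference, the in-filter check being discussed is roughly a load
and a compare at the top of the program. The sketch below is only an
illustration: struct seccomp_data_v10 mirrors the layout posted in this
patch, the opcode values are the usual classic BPF encodings, and
struct insn, toy_run() and eval_example() are made-up names for a toy
evaluator that understands just the three instructions used. It is not
the kernel's bpf_run_filter.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Mirrors struct seccomp_data as posted in this v10 patch. */
struct seccomp_data_v10 {
	uint64_t args[6];
	uint64_t instruction_pointer;
	uint32_t arch;
	int32_t nr;
};

/* Classic BPF opcode encodings for the three instructions used here. */
#define OP_LD_W_ABS 0x20	/* BPF_LD | BPF_W | BPF_ABS */
#define OP_JEQ_K    0x15	/* BPF_JMP | BPF_JEQ | BPF_K */
#define OP_RET_K    0x06	/* BPF_RET | BPF_K */

#define RET_KILL    0x00000000U
#define RET_ALLOW   0x7fff0000U
#define ARCH_X86_64 0xC000003EU	/* value of AUDIT_ARCH_X86_64 */

/* Same field layout as struct sock_filter; the name is hypothetical. */
struct insn { uint16_t code; uint8_t jt, jf; uint32_t k; };

/* Check the arch first, then allow only syscall 0 (read on x86-64). */
static const struct insn example_filter[] = {
	{ OP_LD_W_ABS, 0, 0, offsetof(struct seccomp_data_v10, arch) },
	{ OP_JEQ_K,    0, 2, ARCH_X86_64 },	/* wrong arch: skip to KILL */
	{ OP_LD_W_ABS, 0, 0, offsetof(struct seccomp_data_v10, nr) },
	{ OP_JEQ_K,    1, 0, 0 },		/* nr == 0: skip to ALLOW */
	{ OP_RET_K,    0, 0, RET_KILL },
	{ OP_RET_K,    0, 0, RET_ALLOW },
};

/* Toy evaluator: jump offsets are relative to the next instruction. */
static uint32_t toy_run(const struct insn *prog, size_t len,
			const struct seccomp_data_v10 *d)
{
	uint32_t A = 0;
	size_t pc;

	assert(prog != NULL && d != NULL);
	for (pc = 0; pc < len; pc++) {
		const struct insn *i = &prog[pc];

		switch (i->code) {
		case OP_LD_W_ABS:
			if (i->k > sizeof(*d) - sizeof(A))
				return RET_KILL;
			memcpy(&A, (const unsigned char *)d + i->k, sizeof(A));
			break;
		case OP_JEQ_K:
			pc += (A == i->k) ? i->jt : i->jf;
			break;
		case OP_RET_K:
			return i->k;
		default:
			return RET_KILL;
		}
	}
	return RET_KILL;	/* fell off the end of the program */
}

static uint32_t eval_example(uint32_t arch, int32_t nr)
{
	struct seccomp_data_v10 d;

	memset(&d, 0, sizeof(d));
	d.arch = arch;
	d.nr = nr;
	return toy_run(example_filter,
		       sizeof(example_filter) / sizeof(example_filter[0]), &d);
}
```

The arch test costs two instructions per evaluation, which is the
overhead being weighed against a per-filter arch tag.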

> All that's needed is an extra argument to the prctl() call. I propose
> 0 for the current arch, -1 for all archs and anything else to specify
> the arch. Installing a filter for an unsupported arch could return
> ENOEXEC.

Without adding a per-arch call, there is no way to know all the
supported arches at install time. Current arch, at least, can be
determined with a call to syscall_get_arch().

As is, I'm not sure it makes sense to try to reserve two extra input
types: 0 and -1. 0 would be sane to treat as either a wildcard or
current because it is unlikely to be used by AUDIT_ARCH_* ever since
EM_NONE is assigned to 0. However, I have no such insight into
whether it will ever be possible to compose 0xffffffff as an
AUDIT_ARCH_.

> As far as the implementation goes, either have a list per supported arch
> or store the arch per filter and check that before running the filter.

You can't do it per arch without adding even more per-arch
dependencies. Keeping them annotated in the same list is the clearest
way I've seen so far, but it comes with its own burdens.

>>      - lots of clean up (thanks, Indan!)
>> v9: - n/a
>> v8: - use bpf_chk_filter, bpf_run_filter. update load_fns
>>     - Lots of fixes courtesy of [email protected]:
>>     -- fix up load behavior, compat fixups, and merge alloc code,
>>     -- renamed pc and dropped __packed, use bool compat.
>>     -- Added a hidden CONFIG_SECCOMP_FILTER to synthesize non-arch
>>        dependencies
>> v7:  (massive overhaul thanks to Indan, others)
>>     - added CONFIG_HAVE_ARCH_SECCOMP_FILTER
>>     - merged into seccomp.c
>>     - minimal seccomp_filter.h
>>     - no config option (part of seccomp)
>>     - no new prctl
>>     - doesn't break seccomp on systems without asm/syscall.h
>>       (works but arg access always fails)
>>     - dropped seccomp_init_task, extra free functions, ...
>>     - dropped the no-asm/syscall.h code paths
>>     - merges with network sk_run_filter and sk_chk_filter
>> v6: - fix memory leak on attach compat check failure
>>     - require no_new_privs || CAP_SYS_ADMIN prior to filter
>>       installation. ([email protected])
>>     - s/seccomp_struct_/seccomp_/ for macros/functions ([email protected])
>>     - cleaned up Kconfig ([email protected])
>>     - on block, note if the call was compat (so the # means something)
>> v5: - uses syscall_get_arguments
>>       ([email protected],[email protected], [email protected])
>>      - uses union-based arg storage with hi/lo struct to
>>        handle endianness.  Compromises between the two alternate
>>        proposals to minimize extra arg shuffling and account for
>>        endianness assuming userspace uses offsetof().
>>        ([email protected], [email protected])
>>      - update Kconfig description
>>      - add include/seccomp_filter.h and add its installation
>>      - (naive) on-demand syscall argument loading
>>      - drop seccomp_t ([email protected])
>> v4:  - adjusted prctl to make room for PR_[SG]ET_NO_NEW_PRIVS
>>      - now uses current->no_new_privs
>>        ([email protected],[email protected])
>>      - assign names to seccomp modes ([email protected])
>>      - fix style issues ([email protected])
>>      - reworded Kconfig entry ([email protected])
>> v3:  - macros to inline ([email protected])
>>      - init_task behavior fixed ([email protected])
>>      - drop creator entry and extra NULL check ([email protected])
>>      - alloc returns -EINVAL on bad sizing ([email protected])
>>      - adds tentative use of "always_unprivileged" as per
>>        [email protected] and [email protected]
>> v2:  - (patch 2 only)
>>
>> Signed-off-by: Will Drewry <[email protected]>
>> ---
>> arch/Kconfig            |   18 +++
>> include/linux/Kbuild    |    1 +
>> include/linux/seccomp.h |   76 +++++++++++-
>> kernel/fork.c           |    3 +
>> kernel/seccomp.c        |  321 ++++++++++++++++++++++++++++++++++++++++++++---
>> kernel/sys.c            |    2 +-
>> 6 files changed, 399 insertions(+), 22 deletions(-)
>>
>> diff --git a/arch/Kconfig b/arch/Kconfig
>> index 4f55c73..8150fa2 100644
>> --- a/arch/Kconfig
>> +++ b/arch/Kconfig
>> @@ -199,4 +199,22 @@ config HAVE_CMPXCHG_LOCAL
>> config HAVE_CMPXCHG_DOUBLE
>>       bool
>>
>> +config HAVE_ARCH_SECCOMP_FILTER
>> +     bool
>> +     help
>> +       This symbol should be selected by an architecure if it provides
>> +       asm/syscall.h, specifically syscall_get_arguments() and
>> +       syscall_get_arch().
>> +
>> +config SECCOMP_FILTER
>> +     def_bool y
>> +     depends on HAVE_ARCH_SECCOMP_FILTER && SECCOMP && NET
>> +     help
>> +       Enable tasks to build secure computing environments defined
>> +       in terms of Berkeley Packet Filter programs which implement
>> +       task-defined system call filtering polices.
>> +
>> +       See Documentation/prctl/seccomp_filter.txt for more
>> +       information on the topic of seccomp filtering.
>
> The last part is redundant, the topic is clear.

I'll drop it.

>> +
>> source "kernel/gcov/Kconfig"
>> diff --git a/include/linux/Kbuild b/include/linux/Kbuild
>> index c94e717..d41ba12 100644
>> --- a/include/linux/Kbuild
>> +++ b/include/linux/Kbuild
>> @@ -330,6 +330,7 @@ header-y += scc.h
>> header-y += sched.h
>> header-y += screen_info.h
>> header-y += sdla.h
>> +header-y += seccomp.h
>> header-y += securebits.h
>> header-y += selinux_netlink.h
>> header-y += sem.h
>> diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h
>> index d61f27f..001f883 100644
>> --- a/include/linux/seccomp.h
>> +++ b/include/linux/seccomp.h
>> @@ -1,14 +1,67 @@
>> #ifndef _LINUX_SECCOMP_H
>> #define _LINUX_SECCOMP_H
>>
>> +#include <linux/compiler.h>
>> +#include <linux/types.h>
>> +
>> +
>> +/* Valid values for seccomp.mode and prctl(PR_SET_SECCOMP, <mode>) */
>> +#define SECCOMP_MODE_DISABLED  0 /* seccomp is not in use. */
>> +#define SECCOMP_MODE_STRICT    1 /* uses hard-coded filter. */
>> +#define SECCOMP_MODE_FILTER    2 /* uses user-supplied filter. */
>> +
>> +/*
>> + * BPF programs may return a 32-bit value.
>
> They have to return a 32-bit value, no "may" about it.

True.

>> + * The bottom 16-bits are reserved for future use.
>> + * The upper 16-bits are ordered from least permissive values to most.
>> + *
>> + * The ordering ensures that a min_t() over composed return values always
>> + * selects the least permissive choice.
>> + */
>> +#define SECCOMP_RET_KILL       0x00000000U /* kill the task immediately */
>> +#define SECCOMP_RET_ALLOW      0x7fff0000U /* allow */
>> +
>> +/* Masks for the return value sections. */
>> +#define SECCOMP_RET_ACTION     0xffff0000U
>> +#define SECCOMP_RET_DATA       0x0000ffffU
>> +
>> +/**
>> + * struct seccomp_data - the format the BPF program executes over.
>> + * @args: up to 6 system call arguments.  When the calling convention is
>> + *        32-bit, the arguments will still be at each args[X] offset.
>
> What does this mean? Do you mean the data layout will always be "LE" for
> 32-bit archs? I hope not, because that would make it incompatible with
> the 64-bit code for BE archs, so it will be confusing. Except if the data
> layout is always LE, but then you should document that. If neither is the
> case, then the comment is just confusing. Just say that the data layout
> depends on the arch's endianness.

I'll rephrase. I just wanted to call out that the argument values
will always be treated as 64-bit values even if the calling
convention is 32-bit. This doesn't matter for LE systems, except to
acknowledge the padding, but on BE systems a filter might load the
wrong half.
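A rough sketch of what that means for userspace, assuming the v10
layout quoted above: the low 32 bits of args[n] sit at a different
offset depending on byte order. The helper names (arg_lo_off(),
arg_hi_off(), load_u32()) are invented for illustration, and the
__BYTE_ORDER__ test relies on a GCC/Clang predefined macro; if it is
absent the code assumes little endian.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Mirrors struct seccomp_data as posted in this v10 patch. */
struct seccomp_data_v10 {
	uint64_t args[6];
	uint64_t instruction_pointer;
	uint32_t arch;
	int32_t nr;
};

/* Offset a filter would load from to get the low 32 bits of args[n]. */
static size_t arg_lo_off(unsigned int n)
{
	size_t base;

	assert(n < 6);
	base = offsetof(struct seccomp_data_v10, args) + n * sizeof(uint64_t);
#if defined(__BYTE_ORDER__) && __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__
	base += sizeof(uint32_t);	/* BE: low half is the second word */
#endif
	return base;
}

/* Offset of the high 32 bits of args[n]. */
static size_t arg_hi_off(unsigned int n)
{
	size_t base;

	assert(n < 6);
	base = offsetof(struct seccomp_data_v10, args) + n * sizeof(uint64_t);
#if !(defined(__BYTE_ORDER__) && __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__)
	base += sizeof(uint32_t);	/* LE: high half is the second word */
#endif
	return base;
}

/* Emulates a 32-bit absolute load at a 4-byte aligned offset. */
static uint32_t load_u32(const struct seccomp_data_v10 *d, size_t off)
{
	uint32_t v;

	assert(off % sizeof(v) == 0 && off + sizeof(v) <= sizeof(*d));
	memcpy(&v, (const unsigned char *)d + off, sizeof(v));
	return v;
}
```

A filter for a 32-bit calling convention would compare the word at
arg_lo_off(n) and could additionally check that the word at
arg_hi_off(n) is zero.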

>> + * @instruction_pointer: at the time of the system call.
>> + * @arch: indicates system call convention as an AUDIT_ARCH_* value
>> + *        as defined in <linux/audit.h>.
>> + * @nr: the system call number
>> + */
>> +struct seccomp_data {
>> +     __u64 args[6];
>> +     __u64 instruction_pointer;
>> +     __u32 arch;
>> +     int nr;
>> +};
>
> I agree this looks a hell of a lot nicer. I just hope it's worth it.
> Oh well, a bit more ugliness in userspace to make the kernel code a
> bit nicer isn't too bad. Just document the endianness issue properly.
>
> What use is the instruction pointer considering it tells nothing about
> the call path?
>
>>
>> +#ifdef __KERNEL__
>> #ifdef CONFIG_SECCOMP
>>
>> #include <linux/thread_info.h>
>> #include <asm/seccomp.h>
>>
>> +struct seccomp_filter;
>> +/**
>> + * struct seccomp - the state of a seccomp'ed process
>> + *
>> + * @mode:  indicates one of the valid values above for controlled
>> + *         system calls available to a process.
>> + * @filter: The metadata and ruleset for determining what system calls
>> + *          are allowed for a task.
>> + *
>> + *          @filter must only be accessed from the context of current as there
>> + *          is no locking.
>> + */
>> struct seccomp {
>>       int mode;
>> +     struct seccomp_filter *filter;
>> };
>>
>> extern void __secure_computing(int);
>> @@ -19,7 +72,7 @@ static inline void secure_computing(int this_syscall)
>> }
>>
>> extern long prctl_get_seccomp(void);
>> -extern long prctl_set_seccomp(unsigned long);
>> +extern long prctl_set_seccomp(unsigned long, char __user *);
>>
>> static inline int seccomp_mode(struct seccomp *s)
>> {
>> @@ -31,15 +84,16 @@ static inline int seccomp_mode(struct seccomp *s)
>> #include <linux/errno.h>
>>
>> struct seccomp { };
>> +struct seccomp_filter { };
>>
>> -#define secure_computing(x) do { } while (0)
>> +#define secure_computing(x) 0
>>
>> static inline long prctl_get_seccomp(void)
>> {
>>       return -EINVAL;
>> }
>>
>> -static inline long prctl_set_seccomp(unsigned long arg2)
>> +static inline long prctl_set_seccomp(unsigned long arg2, char __user *arg3)
>> {
>>       return -EINVAL;
>> }
>> @@ -48,7 +102,21 @@ static inline int seccomp_mode(struct seccomp *s)
>> {
>>       return 0;
>> }
>> -
>> #endif /* CONFIG_SECCOMP */
>>
>> +#ifdef CONFIG_SECCOMP_FILTER
>> +extern void put_seccomp_filter(struct seccomp_filter *);
>> +extern void copy_seccomp(struct seccomp *child,
>> +                      const struct seccomp *parent);
>
> This is 80 chars long, why break it up? Please, stop your bad habit of
> breaking up (slightly too) long lines.
>
>> +#else  /* CONFIG_SECCOMP_FILTER */
>> +/* The macro consumes the ->filter reference. */
>> +#define put_seccomp_filter(_s) do { } while (0)
>> +
>> +static inline void copy_seccomp(struct seccomp *child,
>> +                             const struct seccomp *prev)
>> +{
>> +     return;
>> +}
>> +#endif /* CONFIG_SECCOMP_FILTER */
>> +#endif /* __KERNEL__ */
>> #endif /* _LINUX_SECCOMP_H */
>> diff --git a/kernel/fork.c b/kernel/fork.c
>> index b77fd55..a5187b7 100644
>> --- a/kernel/fork.c
>> +++ b/kernel/fork.c
>> @@ -34,6 +34,7 @@
>> #include <linux/cgroup.h>
>> #include <linux/security.h>
>> #include <linux/hugetlb.h>
>> +#include <linux/seccomp.h>
>> #include <linux/swap.h>
>> #include <linux/syscalls.h>
>> #include <linux/jiffies.h>
>> @@ -169,6 +170,7 @@ void free_task(struct task_struct *tsk)
>>       free_thread_info(tsk->stack);
>>       rt_mutex_debug_task_free(tsk);
>>       ftrace_graph_exit_task(tsk);
>> +     put_seccomp_filter(tsk->seccomp.filter);
>
> So that's why you use macro's sometimes, to make it compile with
> CONFIG_SECCOMP disabled where there is no seccomp.filter.

Exactly!

>>       free_task_struct(tsk);
>> }
>> EXPORT_SYMBOL(free_task);
>> @@ -1113,6 +1115,7 @@ static struct task_struct *copy_process(unsigned long clone_flags,
>>               goto fork_out;
>>
>>       ftrace_graph_init_task(p);
>> +     copy_seccomp(&p->seccomp, &current->seccomp);
>>
>>       rt_mutex_init_task(p);
>>
>> diff --git a/kernel/seccomp.c b/kernel/seccomp.c
>> index e8d76c5..0043b7e 100644
>> --- a/kernel/seccomp.c
>> +++ b/kernel/seccomp.c
>> @@ -3,16 +3,287 @@
>>  *
>>  * Copyright 2004-2005  Andrea Arcangeli <[email protected]>
>>  *
>> - * This defines a simple but solid secure-computing mode.
>> + * Copyright (C) 2012 Google, Inc.
>> + * Will Drewry <[email protected]>
>> + *
>> + * This defines a simple but solid secure-computing facility.
>> + *
>> + * Mode 1 uses a fixed list of allowed system calls.
>> + * Mode 2 allows user-defined system call filters in the form
>> + *        of Berkeley Packet Filters/Linux Socket Filters.
>>  */
>>
>> +#include <linux/atomic.h>
>> #include <linux/audit.h>
>> -#include <linux/seccomp.h>
>> -#include <linux/sched.h>
>> #include <linux/compat.h>
>> +#include <linux/filter.h>
>> +#include <linux/sched.h>
>> +#include <linux/seccomp.h>
>> +#include <linux/security.h>
>> +#include <linux/slab.h>
>> +#include <linux/uaccess.h>
>> +
>> +#include <linux/tracehook.h>
>> +#include <asm/syscall.h>
>>
>> /* #define SECCOMP_DEBUG 1 */
>> -#define NR_SECCOMP_MODES 1
>> +
>> +#ifdef CONFIG_SECCOMP_FILTER
>> +/**
>> + * struct seccomp_filter - container for seccomp BPF programs
>> + *
>> + * @usage: reference count to manage the object liftime.
>> + *         get/put helpers should be used when accessing an instance
>> + *         outside of a lifetime-guarded section.  In general, this
>> + *         is only needed for handling filters shared across tasks.
>> + * @prev: points to a previously installed, or inherited, filter
>> + * @compat: indicates the value of is_compat_task() at creation time
>
> You're not really using 'compat', except for logging.

Oops - I should've dropped it altogether!

> But you could use it to run only the filters with the right arch.

Well an int arch would do the trick.

>> + * @insns: the BPF program instructions to evaluate
>> + * @len: the number of instructions in the program
>> + *
>> + * seccomp_filter objects are organized in a tree linked via the @prev
>> + * pointer.  For any task, it appears to be a singly-linked list starting
>> + * with current->seccomp.filter, the most recently attached or inherited filter.
>> + * However, multiple filters may share a @prev node, by way of fork(), which
>> + * results in a unidirectional tree existing in memory.  This is similar to
>> + * how namespaces work.
>> + *
>> + * seccomp_filter objects should never be modified after being attached
>> + * to a task_struct (other than @usage).
>> + */
>> +struct seccomp_filter {
>> +     atomic_t usage;
>> +     struct seccomp_filter *prev;
>> +     bool compat;
>> +     unsigned short len;  /* Instruction count */
>> +     struct sock_filter insns[];
>> +};
>> +
>> +static void seccomp_filter_log_failure(int syscall)
>> +{
>> +     int compat = 0;
>> +#ifdef CONFIG_COMPAT
>> +     compat = is_compat_task();
>> +#endif
>> +     pr_info("%s[%d]: %ssystem call %d blocked at 0x%lx\n",
>> +             current->comm, task_pid_nr(current),
>> +             (compat ? "compat " : ""),
>> +             syscall, KSTK_EIP(current));
>> +}
>> +
>> +/**
>> + * get_u32 - returns a u32 offset into data
>> + * @data: a unsigned 64 bit value
>> + * @index: 0 or 1 to return the first or second 32-bits
>> + *
>> + * This inline exists to hide the length of unsigned long.
>> + * If a 32-bit unsigned long is passed in, it will be extended
>> + * and the top 32-bits will be 0. If it is a 64-bit unsigned
>> + * long, then whatever data is resident will be properly returned.
>> + */
>> +static inline u32 get_u32(u64 data, int index)
>> +{
>> +     return ((u32 *)&data)[index];
>> +}
>> +
>> +/* Helper for bpf_load below. */
>> +#define BPF_DATA(_name) offsetof(struct seccomp_data, _name)
>> +/**
>> + * bpf_load: checks and returns a pointer to the requested offset
>> + * @nr: int syscall passed as a void * to bpf_run_filter
>> + * @off: index into struct seccomp_data to load from
>> + * @size: load width requested
>> + * @buffer: temporary storage supplied by bpf_run_filter
>> + *
>> + * Returns a pointer to @buffer where the value was stored.
>> + * On failure, returns NULL.
>> + */
>> +static void *bpf_load(const void *nr, int off, unsigned int size, void *buf)
>> +{
>> +     unsigned long value;
>> +     u32 *A = buf;
>> +
>> +     if (size != sizeof(u32))
>> +             return NULL;
>> +
>> +     if (off >= BPF_DATA(args[0]) && off < BPF_DATA(args[6])) {
>> +             struct pt_regs *regs = task_pt_regs(current);
>> +             int arg = off >> 3;  /* args[0] is at offset 0. */
>
> Probably clearer if you just do off / 8, you can count on compilers to
> get that right and turn it into a shift.
>
>> +             int index = (off % sizeof(u64)) ? 1 : 0;
>
> Considering the previous line I expected to see (off & 4).
>
> Anyway, this code mostly ignores the lowest three bits and instead of
> either returning an error or the requested data, it returns the aligned
> value instead. Not good.

Hrm true - I got sloppy. I'll fix that up!
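One way the stricter decoding could look, as a sketch only:
decode_arg_offset() and struct arg_slot are hypothetical names, and the
6 x 8-byte args window is taken from the struct seccomp_data layout in
this patch. Misaligned offsets are rejected instead of being rounded
down to the containing argument.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Which argument, and which 32-bit word of it (in memory order). */
struct arg_slot {
	unsigned int arg;	/* 0..5 */
	unsigned int half;	/* 0 = first 4 bytes, 1 = second 4 bytes */
};

/*
 * Decode a byte offset into the args[] area. Returns false for offsets
 * that are not 4-byte aligned or that fall outside args[0..5].
 */
static bool decode_arg_offset(size_t off, struct arg_slot *out)
{
	assert(out != NULL);
	if (off % sizeof(uint32_t) != 0)
		return false;	/* e.g. off == 13: reject, don't round */
	if (off >= 6 * sizeof(uint64_t))
		return false;	/* past the end of args[] */
	out->arg = (unsigned int)(off / sizeof(uint64_t));
	out->half = (off & 4) ? 1 : 0;
	return true;
}
```

The half value has the same meaning as the index passed to get_u32() in
the quoted patch: it selects a word by memory position, not by
significance.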

>> +             syscall_get_arguments(current, regs, arg, 1, &value);
>> +             *A = get_u32(value, index);
>> +     } else if (off == BPF_DATA(nr)) {
>> +             *A = (u32)(uintptr_t)nr;
>> +     } else if (off == BPF_DATA(arch)) {
>> +             struct pt_regs *regs = task_pt_regs(current);
>> +             *A = syscall_get_arch(current, regs);
>> +     } else if (off == BPF_DATA(instruction_pointer)) {
>> +             *A = get_u32(KSTK_EIP(current), 0);
>> +     } else if (off == BPF_DATA(instruction_pointer) + sizeof(u32)) {
>> +             *A = get_u32(KSTK_EIP(current), 1);
>> +     } else {
>> +             return NULL;
>> +     }
>> +     return buf;
>> +}
>> +
>> +/**
>> + * seccomp_run_filters - evaluates all seccomp filters against @syscall
>> + * @syscall: number of the current system call
>> + *
>> + * Returns valid seccomp BPF response codes.
>> + */
>> +static u32 seccomp_run_filters(int syscall)
>> +{
>> +     struct seccomp_filter *f;
>> +     u32 ret = SECCOMP_RET_KILL;
>> +     static const struct bpf_load_fn fns = {
>> +             bpf_load,
>> +             sizeof(struct seccomp_data),
>
> I suppose this could be used to check for new fields if struct seccomp_data
> ever gets extended in the future.

Yeah since the only other indicator might be @arch.

>> +     };
>> +     const void *sc_ptr = (const void *)(uintptr_t)syscall;
>> +
>> +     /*
>> +      * All filters are evaluated in order of youngest to oldest. The lowest
>> +      * BPF return value always takes priority.
>> +      */
>> +     for (f = current->seccomp.filter; f; f = f->prev) {
>> +             ret = bpf_run_filter(sc_ptr, f->insns, &fns);
>> +             if (ret != SECCOMP_RET_ALLOW)
>> +                     break;
>> +     }
>> +     return ret;
>> +}
>> +
>> +/**
>> + * seccomp_attach_filter: Attaches a seccomp filter to current.
>> + * @fprog: BPF program to install
>> + *
>> + * Returns 0 on success or an errno on failure.
>> + */
>> +static long seccomp_attach_filter(struct sock_fprog *fprog)
>> +{
>> +     struct seccomp_filter *filter;
>> +     unsigned long fp_size = fprog->len * sizeof(struct sock_filter);
>> +     long ret;
>> +
>> +     if (fprog->len == 0 || fprog->len > BPF_MAXINSNS)
>> +             return -EINVAL;
>> +
>> +     /* Allocate a new seccomp_filter */
>> +     filter = kzalloc(sizeof(struct seccomp_filter) + fp_size, GFP_KERNEL);
>> +     if (!filter)
>> +             return -ENOMEM;
>> +     atomic_set(&filter->usage, 1);
>> +     filter->len = fprog->len;
>> +
>> +     /* Copy the instructions from fprog. */
>> +     ret = -EFAULT;
>> +     if (copy_from_user(filter->insns, fprog->filter, fp_size))
>> +             goto out;
>> +
>> +     /* Check the fprog */
>> +     ret = bpf_chk_filter(filter->insns, filter->len, BPF_CHK_FLAGS_NO_SKB);
>> +     if (ret)
>> +             goto out;
>> +
>> +     /*
>> +      * Installing a seccomp filter requires that the task
>> +      * have CAP_SYS_ADMIN in its namespace or be running with
>> +      * no_new_privs.  This avoids scenarios where unprivileged
>> +      * tasks can affect the behavior of privileged children.
>> +      */
>> +     ret = -EACCES;
>> +     if (!current->no_new_privs &&
>> +         security_capable_noaudit(current_cred(), current_user_ns(),
>> +                                  CAP_SYS_ADMIN) != 0)
>> +             goto out;
>> +
>> +     /*
>> +      * If there is an existing filter, make it the prev
>> +      * and don't drop its task reference.
>> +      */
>> +     filter->prev = current->seccomp.filter;
>> +     current->seccomp.filter = filter;
>> +     return 0;
>> +out:
>> +     put_seccomp_filter(filter);  /* for get or task, on err */
>> +     return ret;
>> +}
>> +
>> +/**
>> + * seccomp_attach_user_filter - attaches a user-supplied sock_fprog
>> + * @user_filter: pointer to the user data containing a sock_fprog.
>> + *
>> + * This function may be called repeatedly to install additional filters.
>> + * Every filter successfully installed will be evaluated (in reverse order)
>> + * for each system call the task makes.
>> + *
>> + * Returns 0 on success and non-zero otherwise.
>> + */
>> +long seccomp_attach_user_filter(char __user *user_filter)
>> +{
>> +     struct sock_fprog fprog;
>> +     long ret = -EFAULT;
>> +
>> +     if (!user_filter)
>> +             goto out;
>> +#ifdef CONFIG_COMPAT
>> +     if (is_compat_task()) {
>> +             /* XXX: Share with net/compat.c (compat_sock_fprog) */
>
> Then do so as part of your BPF sharing patch.

Makes sense. Queuing it up.

>> +             struct {
>> +                     u16 len;
>> +                     compat_uptr_t filter;   /* struct sock_filter */
>> +             } fprog32;
>> +             if (copy_from_user(&fprog32, user_filter, sizeof(fprog32)))
>> +                     goto out;
>> +             fprog.len = fprog32.len;
>> +             fprog.filter = compat_ptr(fprog32.filter);
>> +     } else /* falls through to the if below. */
>> +#endif
>> +     if (copy_from_user(&fprog, user_filter, sizeof(fprog)))
>> +             goto out;
>> +     ret = seccomp_attach_filter(&fprog);
>> +out:
>> +     return ret;
>> +}
>> +
>> +/* get_seccomp_filter - increments the reference count of @orig. */
>> +static struct seccomp_filter *get_seccomp_filter(struct seccomp_filter *orig)
>> +{
>> +     if (!orig)
>> +             return NULL;
>> +     /* Reference count is bounded by the number of total processes. */
>> +     atomic_inc(&orig->usage);
>> +     return orig;
>> +}
>> +
>> +/* put_seccomp_filter - decrements the ref count of @orig and may free. */
>> +void put_seccomp_filter(struct seccomp_filter *orig)
>> +{
>> +     /* Clean up single-reference branches iteratively. */
>> +     while (orig && atomic_dec_and_test(&orig->usage)) {
>> +             struct seccomp_filter *freeme = orig;
>> +             orig = orig->prev;
>> +             kfree(freeme);
>> +     }
>> +}
>> +
>> +/**
>> + * copy_seccomp: manages inheritance on fork
>> + * @child: forkee's seccomp
>> + * @prev: forker's seccomp
>> + *
>> + * Ensures that @child inherits seccomp mode and state if
>> + * seccomp filtering is in use.
>> + */
>> +void copy_seccomp(struct seccomp *child,
>> +               const struct seccomp *prev)
>
> One line please.

Alright :)

>> +{
>> +     child->mode = prev->mode;
>> +     child->filter = get_seccomp_filter(prev->filter);
>> +}
>> +#endif       /* CONFIG_SECCOMP_FILTER */
>>
>> /*
>>  * Secure computing mode 1 allows only read/write/exit/sigreturn.
>> @@ -34,10 +305,10 @@ static int mode1_syscalls_32[] = {
>> void __secure_computing(int this_syscall)
>> {
>>       int mode = current->seccomp.mode;
>> -     int * syscall;
>> +     int *syscall;
>>
>>       switch (mode) {
>> -     case 1:
>> +     case SECCOMP_MODE_STRICT:
>>               syscall = mode1_syscalls;
>> #ifdef CONFIG_COMPAT
>>               if (is_compat_task())
>> @@ -48,6 +319,13 @@ void __secure_computing(int this_syscall)
>>                               return;
>>               } while (*++syscall);
>>               break;
>> +#ifdef CONFIG_SECCOMP_FILTER
>> +     case SECCOMP_MODE_FILTER:
>> +             if (seccomp_run_filters(this_syscall) == SECCOMP_RET_ALLOW)
>> +                     return;
>> +             seccomp_filter_log_failure(this_syscall);
>> +             break;
>> +#endif
>>       default:
>>               BUG();
>>       }
>> @@ -64,25 +342,34 @@ long prctl_get_seccomp(void)
>>       return current->seccomp.mode;
>> }
>>
>> -long prctl_set_seccomp(unsigned long seccomp_mode)
>> +long prctl_set_seccomp(unsigned long seccomp_mode, char __user *filter)
>> {
>> -     long ret;
>> +     long ret = -EINVAL;
>>
>> -     /* can set it only once to be even more secure */
>> -     ret = -EPERM;
>> -     if (unlikely(current->seccomp.mode))
>> +     if (current->seccomp.mode &&
>> +         current->seccomp.mode != seccomp_mode)
>
> Wouldn't it make sense to allow going from mode 2 to 1?
> After all, the filter could have blocked it if it didn't
> want to permit it, and mode 1 is more restrictive than
> mode 2.

Nope - that might allow a downgrade that bypasses write/read
restrictions. E.g., a filter could only allow a read to a certain buf
or of a certain size. Allowing a downgrade would allow bypassing
those filters, whether they are the most sane things or not :)
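
The resulting rule (a mode can be set from "none", or re-requested, but never switched) can be written as a tiny check. This is only a user-space sketch of the logic in the quoted hunk; check_mode_change() and MODE_DISABLED are hypothetical names, not part of the patch.

```c
#include <errno.h>

#define MODE_DISABLED 0	/* hypothetical name for "no seccomp mode set" */
#define MODE_STRICT   1	/* SECCOMP_MODE_STRICT in the patch */
#define MODE_FILTER   2	/* SECCOMP_MODE_FILTER in the patch */

/*
 * Returns 0 when the requested mode may be applied, -EINVAL otherwise.
 * A task already in one mode may only re-request that same mode, so a
 * filtered task cannot move to strict mode and shed its filters.
 */
static int check_mode_change(int current_mode, int requested_mode)
{
	if (requested_mode != MODE_STRICT && requested_mode != MODE_FILTER)
		return -EINVAL;
	if (current_mode != MODE_DISABLED && current_mode != requested_mode)
		return -EINVAL;
	return 0;
}
```

Re-requesting MODE_FILTER while already filtered stays legal, which is what lets additional filters be stacked.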

>>               goto out;
>>
>> -     ret = -EINVAL;
>> -     if (seccomp_mode && seccomp_mode <= NR_SECCOMP_MODES) {
>> -             current->seccomp.mode = seccomp_mode;
>> -             set_thread_flag(TIF_SECCOMP);
>> +     switch (seccomp_mode) {
>> +     case SECCOMP_MODE_STRICT:
>> +             ret = 0;
>> #ifdef TIF_NOTSC
>>               disable_TSC();
>> #endif
>> -             ret = 0;
>> +             break;
>> +#ifdef CONFIG_SECCOMP_FILTER
>> +     case SECCOMP_MODE_FILTER:
>> +             ret = seccomp_attach_user_filter(filter);
>> +             if (ret)
>> +                     goto out;
>> +             break;
>> +#endif
>> +     default:
>> +             goto out;
>>       }
>>
>> - out:
>> +     current->seccomp.mode = seccomp_mode;
>> +     set_thread_flag(TIF_SECCOMP);
>> +out:
>>       return ret;
>> }
>> diff --git a/kernel/sys.c b/kernel/sys.c
>> index 4070153..905031e 100644
>> --- a/kernel/sys.c
>> +++ b/kernel/sys.c
>> @@ -1899,7 +1899,7 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
>>                       error = prctl_get_seccomp();
>>                       break;
>>               case PR_SET_SECCOMP:
>> -                     error = prctl_set_seccomp(arg2);
>> +                     error = prctl_set_seccomp(arg2, (char __user *)arg3);
>>                       break;
>>               case PR_GET_TSC:
>>                       error = GET_TSC_CTL(arg2);
>
> Out of curiosity, did you measure the kernel size differences before and
> after these patches? Would be sad if sharing it with the networking code
> didn't reduce the actual kernel size.

Oh yeah - it was a serious reduction. Initially, seccomp_filter.o
added 8 kB by itself. With the merged seccomp.o, continued code
trimming (as suggested), and all the SECCOMP_RET_* variations, the
total kernel growth is 2972 bytes for the same kernel config. This is
split across ~2000 bytes in seccomp.o and ~800 bytes in filter.o.

Thanks!
will

2012-02-22 19:47:50

by Will Drewry

Subject: Re: [PATCH v10 05/11] seccomp: add system call filtering using BPF

On Wed, Feb 22, 2012 at 8:23 AM, Ben Hutchings
<[email protected]> wrote:
> On Wed, 2012-02-22 at 09:19 +0100, Indan Zupancic wrote:
> [...]
>> Alternative approach: Tell the arch at filter install time and only run the
>> filters with the same arch as the current system call. If no filters are run,
>> deny the system call.
>>
>> Advantages:
>>
>> - Filters don't have to check the arch every syscall entry.
>>
>> - Secure by default. Filters don't have to do anything arch specific to
>>   be secure, no surprises possible.
>>
>> - If a new arch comes into existence, there is no chance of old filters
>>   becoming buggy and insecure. This is especially true for archs that
>>   had only one mode, but added another one later on: Old filters had no
>>   need to check the mode at all.
> [...]
>
> What about when there are multiple layers of restrictions?  So long as
> any one layer covers the new architecture, there is no default-deny even
> though the other layers might not cover it.

This is the biggest challenge with using the split-labeled approach. I
started with the first patches supporting compat and non-compat
side-by-side. It makes things complicated with inheritance. If you
have a parent that installed filters for arch=i386 and arch=x86_64,
then a child process installs a filter for arch=x86_64, its behavior
when spawned by that parent is that any i386 calls the parent allows
are allowed, but when it is spawned without any inherited filters, no
i386 calls would be allowed. This was part of the reason why I
abandoned that approach and went with locking the compat bit. I don't
think there is a clean way to support inheritance and
implicit-disallow without it being hideous. (I had tried it before
with annotations saying if a filter was inherited or self-created, but
that made the code much more complex for very little gain, imo.)
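
To make the inheritance surprise concrete, here is a toy user-space model of the split-labeled scheme. Everything in it is illustrative: struct labeled_filter, evaluate() and the single allow flag are stand-ins, not the code from the earlier patch versions.

```c
#include <stddef.h>

enum arch { ARCH_I386, ARCH_X86_64 };	/* stand-ins for AUDIT_ARCH_* values */

struct labeled_filter {
	enum arch arch;				/* arch declared at install time */
	int allow;				/* toy verdict: 1 = allow, 0 = deny */
	const struct labeled_filter *prev;	/* older (possibly inherited) filter */
};

/*
 * Run only the filters whose label matches the call's arch. Deny when any
 * matching filter denies, and also when no filter matched at all.
 */
static int evaluate(const struct labeled_filter *chain, enum arch call_arch)
{
	int ran = 0;

	for (; chain; chain = chain->prev) {
		if (chain->arch != call_arch)
			continue;
		ran = 1;
		if (!chain->allow)
			return 0;
	}
	return ran;
}
```

A child that installs only an x86_64 filter gets evaluate(chain, ARCH_I386) == 1 when the chain continues into a parent's i386 filter, and 0 when the same filter is the whole chain. The child's own policy is identical in both cases, which is the implicit behavior described above.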

> I would have thought the way to make sure the architecture is always
> checked is to pack it together with the syscall number.

If the current patchset used the elf machine only and not the
AUDIT_ARCH_* that might be possible since e_machine is only 16 bits.
However, that would still assume that an arch wouldn't introduce a
syscall number above 65535 which is most likely not a safe assumption.
Am I wrong there?
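
A quick sketch of the packing idea and where it runs out of room. pack_arch_nr() and nr_fits() are hypothetical helpers; the only real constant is EM_386, the ELF e_machine value for i386.

```c
#include <stdint.h>

#define EM_386 3	/* e_machine value for i386 from the ELF specification */

/* Hypothetical packing: e_machine in the high 16 bits, syscall nr in the low 16. */
static uint32_t pack_arch_nr(uint16_t e_machine, uint32_t nr)
{
	return ((uint32_t)e_machine << 16) | (nr & 0xffffu);
}

/* Returns 1 when nr survives the packing unchanged, 0 when it is truncated. */
static int nr_fits(uint32_t nr)
{
	return nr <= 0xffffu;
}
```

Any syscall number above 65535 aliases a smaller one once packed, so a filter matching on the packed word could no longer tell the two calls apart.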

thanks!
will

2012-02-22 19:48:00

by Will Drewry

Subject: Re: [kernel-hardening] Re: [PATCH v10 09/11] ptrace,seccomp: Add PTRACE_SECCOMP support

On Wed, Feb 22, 2012 at 6:22 AM, Indan Zupancic <[email protected]> wrote:
> On Tue, February 21, 2012 18:30, Will Drewry wrote:
>> A new return value is added to seccomp filters that allows
>> the system call policy for the affected system calls to be
>> implemented by a ptrace(2)ing process.
>>
>> If a tracer attaches to a task, specifies the PTRACE_O_TRACESECCOMP
>> option, then PTRACE_CONT.
>
> Awkward formulation here. I'd start with "If a tracer sets the
> PTRACE_O_TRACESECCOMP option, then ..."
>
>> After doing so, the tracer will
>> be notified if a seccomp filter program returns SECCOMP_RET_TRACE.
>
> This means that strace and gdb won't see seccomp filtered syscalls.
> I think you have to reverse the logic and have an option that asks
> to hide normal SECCOMP_RET_ERRNO, but not SECCOMP_RET_TRACE ones.
>
> That gives the expected behaviour in all cases: Programs not setting
> it behave as they do now, and co-operating tracers can ignore syscall
> events they're not interested in.

Reversing the logic resolves the slow-path/fast-path problem too. I'll
repost. This will make the code much saner I think!

>> If there is no seccomp event tracer, SECCOMP_RET_TRACE system calls will
>> return a -ENOSYS errno to user space.  If the tracer detaches during a
>> hand-off, the process will be killed.
>>
>> To ensure that seccomp is syscall fast-path friendly in the future,
>> ptrace is delegated to by setting TIF_SYSCALL_TRACE.  Since seccomp
>> events are equivalent to system call entry events, this allows for
>> seccomp to be evaluated as a fork off the fast-path and only,
>> optionally, jump to the slow path. When the tracer is notified, all
>> will function as with ptrace(PTRACE_SYSCALL), but when the tracer calls
>> ptrace(PTRACE_CONT), TIF_SYSCALL_TRACE will be unset and the task
>> will proceed just receiving PTRACE_O_TRACESECCOMP events.
>
> Please, no. That's making it more complicated than necessary.
>
> I propose to keep the ptrace rules exactly the same as they are, with
> the only change being that if PTRACE_O_SECCOMP is set, no syscall events
> will be generated for SECCOMP_RET_ERRNO. This way ptrace behaviour is
> the same, but only less syscall events are received. With your way
> ptracers see syscall events when they normally wouldn't.
>
>>
>> I realize there are pending patches for cleaning up ptrace events.
>> I can either reintegrate with those when they are available or
>> vice versa. That's assuming these changes make sense and are viable.
>>
>> v10: - moved to PTRACE_O_SECCOMP / PT_TRACE_SECCOMP
>> v9:  - n/a
>> v8:  - guarded PTRACE_SECCOMP use with an ifdef
>> v7:  - introduced
>>
>> Signed-off-by: Will Drewry <[email protected]>
>> ---
>> arch/Kconfig              |    4 +++
>> include/linux/ptrace.h    |    7 ++++-
>> include/linux/seccomp.h   |   14 +++++++++--
>> include/linux/tracehook.h |    7 +++++-
>> kernel/ptrace.c           |    4 +++
>> kernel/seccomp.c          |   52 ++++++++++++++++++++++++++++++++++++++++++--
>> 6 files changed, 79 insertions(+), 9 deletions(-)
>>
>> diff --git a/arch/Kconfig b/arch/Kconfig
>> index 6d6d9dc..02c18ca 100644
>> --- a/arch/Kconfig
>> +++ b/arch/Kconfig
>> @@ -203,6 +203,7 @@ config HAVE_ARCH_SECCOMP_FILTER
>>       bool
>>       help
>>         This symbol should be selected by an architecure if it provides:
>> +       linux/tracehook.h, for TIF_SYSCALL_TRACE and ptrace_report_syscall
>>         asm/syscall.h:
>>         - syscall_get_arch()
>>         - syscall_get_arguments()
>> @@ -211,6 +212,9 @@ config HAVE_ARCH_SECCOMP_FILTER
>>         SIGSYS siginfo_t support must be implemented.
>>         __secure_computing_int()/secure_computing()'s return value must be
>>         checked, with -1 resulting in the syscall being skipped.
>> +       If secure_computing is not in the system call slow path, the thread
>> +       info flags will need to be checked upon exit to ensure delegation to
>> +       ptrace(2) did not occur, or if it did, jump to the slow-path.
>>
>> config SECCOMP_FILTER
>>       def_bool y
>> diff --git a/include/linux/ptrace.h b/include/linux/ptrace.h
>> index c2f1f6a..2fccdbc 100644
>> --- a/include/linux/ptrace.h
>> +++ b/include/linux/ptrace.h
>> @@ -62,8 +62,9 @@
>> #define PTRACE_O_TRACEEXEC    0x00000010
>> #define PTRACE_O_TRACEVFORKDONE       0x00000020
>> #define PTRACE_O_TRACEEXIT    0x00000040
>> +#define PTRACE_O_TRACESECCOMP        0x00000080
>>
>> -#define PTRACE_O_MASK                0x0000007f
>> +#define PTRACE_O_MASK                0x000000ff
>>
>> /* Wait extended result codes for the above trace options.  */
>> #define PTRACE_EVENT_FORK     1
>> @@ -73,6 +74,7 @@
>> #define PTRACE_EVENT_VFORK_DONE       5
>> #define PTRACE_EVENT_EXIT     6
>> #define PTRACE_EVENT_STOP     7
>> +#define PTRACE_EVENT_SECCOMP 8       /* never directly delivered */
>>
>> #include <asm/ptrace.h>
>>
>> @@ -101,8 +103,9 @@
>> #define PT_TRACE_EXEC         PT_EVENT_FLAG(PTRACE_EVENT_EXEC)
>> #define PT_TRACE_VFORK_DONE   PT_EVENT_FLAG(PTRACE_EVENT_VFORK_DONE)
>> #define PT_TRACE_EXIT         PT_EVENT_FLAG(PTRACE_EVENT_EXIT)
>> +#define PT_TRACE_SECCOMP     PT_EVENT_FLAG(PTRACE_EVENT_SECCOMP)
>>
>> -#define PT_TRACE_MASK        0x000003f4
>> +#define PT_TRACE_MASK        0x00000ff4
>>
>> /* single stepping state bits (used on ARM and PA-RISC) */
>> #define PT_SINGLESTEP_BIT     31
>> diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h
>> index d039b7b..16887c1 100644
>> --- a/include/linux/seccomp.h
>> +++ b/include/linux/seccomp.h
>> @@ -19,8 +19,9 @@
>>  * selects the least permissive choice.
>>  */
>> #define SECCOMP_RET_KILL      0x00000000U /* kill the task immediately */
>> -#define SECCOMP_RET_TRAP     0x00020000U /* disallow and send sigtrap */
>> -#define SECCOMP_RET_ERRNO    0x00030000U /* returns an errno */
>> +#define SECCOMP_RET_TRAP     0x00020000U /* only send sigtrap */
>> +#define SECCOMP_RET_ERRNO    0x00030000U /* only return an errno */
>> +#define SECCOMP_RET_TRACE    0x7ffe0000U /* allow, but notify the tracer */
>> #define SECCOMP_RET_ALLOW     0x7fff0000U /* allow */
>>
>> /* Masks for the return value sections. */
>> @@ -55,6 +56,7 @@ struct seccomp_filter;
>>  *
>>  * @mode:  indicates one of the valid values above for controlled
>>  *         system calls available to a process.
>> + * @in_trace: indicates a seccomp filter hand off to ptrace has occurred
>>  * @filter: The metadata and ruleset for determining what system calls
>>  *          are allowed for a task.
>>  *
>> @@ -63,6 +65,7 @@ struct seccomp_filter;
>>  */
>> struct seccomp {
>>       int mode;
>> +     int in_trace;
>>       struct seccomp_filter *filter;
>> };
>>
>> @@ -116,15 +119,20 @@ static inline int seccomp_mode(struct seccomp *s)
>> extern void put_seccomp_filter(struct seccomp_filter *);
>> extern void copy_seccomp(struct seccomp *child,
>>                         const struct seccomp *parent);
>> +extern void seccomp_tracer_done(void);
>> #else  /* CONFIG_SECCOMP_FILTER */
>> /* The macro consumes the ->filter reference. */
>> #define put_seccomp_filter(_s) do { } while (0)
>> -
>> static inline void copy_seccomp(struct seccomp *child,
>>                               const struct seccomp *prev)
>> {
>>       return;
>> }
>> +
>> +static inline void seccomp_tracer_done(void)
>> +{
>> +     return;
>> +}
>> #endif /* CONFIG_SECCOMP_FILTER */
>> #endif /* __KERNEL__ */
>> #endif /* _LINUX_SECCOMP_H */
>> diff --git a/include/linux/tracehook.h b/include/linux/tracehook.h
>> index a71a292..5000169 100644
>> --- a/include/linux/tracehook.h
>> +++ b/include/linux/tracehook.h
>> @@ -48,6 +48,7 @@
>>
>> #include <linux/sched.h>
>> #include <linux/ptrace.h>
>> +#include <linux/seccomp.h>
>> #include <linux/security.h>
>> struct linux_binprm;
>>
>> @@ -59,7 +60,7 @@ static inline void ptrace_report_syscall(struct pt_regs *regs)
>>       int ptrace = current->ptrace;
>>
>>       if (!(ptrace & PT_PTRACED))
>> -             return;
>> +             goto out;
>>
>>       ptrace_notify(SIGTRAP | ((ptrace & PT_TRACESYSGOOD) ? 0x80 : 0));
>>
>> @@ -72,6 +73,10 @@ static inline void ptrace_report_syscall(struct pt_regs *regs)
>>               send_sig(current->exit_code, current, 1);
>>               current->exit_code = 0;
>>       }
>> +
>> +out:
>> +     if (ptrace & PT_TRACE_SECCOMP)
>> +             seccomp_tracer_done();
>> }
>>
>> /**
>> diff --git a/kernel/ptrace.c b/kernel/ptrace.c
>> index 00ab2ca..61e5ac4 100644
>> --- a/kernel/ptrace.c
>> +++ b/kernel/ptrace.c
>> @@ -19,6 +19,7 @@
>> #include <linux/signal.h>
>> #include <linux/audit.h>
>> #include <linux/pid_namespace.h>
>> +#include <linux/seccomp.h>
>> #include <linux/syscalls.h>
>> #include <linux/uaccess.h>
>> #include <linux/regset.h>
>> @@ -551,6 +552,9 @@ static int ptrace_setoptions(struct task_struct *child, unsigned long data)
>>       if (data & PTRACE_O_TRACEEXIT)
>>               child->ptrace |= PT_TRACE_EXIT;
>>
>> +     if (data & PTRACE_O_TRACESECCOMP)
>> +             child->ptrace |= PT_TRACE_SECCOMP;
>> +
>>       return (data & ~PTRACE_O_MASK) ? -EINVAL : 0;
>> }
>>
>> diff --git a/kernel/seccomp.c b/kernel/seccomp.c
>> index fc25d3a..120ceec 100644
>> --- a/kernel/seccomp.c
>> +++ b/kernel/seccomp.c
>> @@ -270,13 +270,12 @@ void put_seccomp_filter(struct seccomp_filter *orig)
>>  * @child: forkee's seccomp
>>  * @prev: forker's seccomp
>>  *
>> - * Ensures that @child inherits seccomp mode and state if
>> - * seccomp filtering is in use.
>> + * Ensures that @child inherits seccomp filtering if in use.
>>  */
>> void copy_seccomp(struct seccomp *child,
>>                 const struct seccomp *prev)
>> {
>> -     child->mode = prev->mode;
>> +     /* Other fields are handled by dup_task_struct. */
>>       child->filter = get_seccomp_filter(prev->filter);
>> }
>>
>> @@ -299,6 +298,31 @@ static void seccomp_send_sigsys(int syscall, int reason)
>>       info.si_syscall = syscall;
>>       force_sig_info(SIGSYS, &info, current);
>> }
>> +
>> +/**
>> + * seccomp_tracer_done: handles clean up after handing off to ptrace.
>> + *
>> + * Checks that the hand off from SECCOMP_RET_TRACE to ptrace was not
>> + * subject to a race condition where the tracer disappeared or was
>> + * never notified because of a pending SIGKILL.
>> + * N.b., if ptrace_syscall_entry returned an int, this call could just
>> + *       disable the system call rather than using do_exit on tracer death.
>> + */
>> +void seccomp_tracer_done(void)
>> +{
>> +     struct seccomp *s = &current->seccomp;
>> +     /* Some other slow-path call occurred */
>> +     if (!s->in_trace)
>
> So I guess it's more like 'check_trace' or something.

Yup - but I think this whole thing can go now.

>> +             return;
>> +     s->in_trace = 0;
>> +     /* Tracer detached/died at some point after handing off to ptrace. */
>> +     if (!(current->ptrace & PT_PTRACED))
>> +             do_exit(SIGKILL);
>
> This isn't possible, because seccomp_tracer_done() is only called when
> PT_TRACE_SECCOMP is set, which gets cleared when the tracer goes away.

Well it's still a race. What I should be checking is
current->seccomp.mode == 2 then, once called, check if in_trace is 1.

I'll rework this whole thing with the inverted logic. It is much more appealing.

>> +     /* If there is a SIGKILL pending, just do_exit. */
>> +     if (sigismember(&current->pending.signal, SIGKILL) ||
>> +         sigismember(&current->signal->shared_pending.signal, SIGKILL))
>> +             do_exit(SIGKILL);
>
> This bit shouldn't be necessary, as it should be in ptrace core. Oleg's
> fix should be upstream before your seccomp patches. Except if I missed
> something and this is not to fix current buggy behaviour that the task
> is only killed after the current syscall?

Cool. The old behavior is the task being killed after the current
syscall. Any new behavior should fix this I hope :)

> But you got the logic reversed, the task should be killed except if
> seccomp_tracer_done() was called. You can't kill the task from within
> seccomp_tracer_done(), that is unreliable.

Yeah - this is pretty ugly no matter which way you slice it.

>> +}
>> #endif        /* CONFIG_SECCOMP_FILTER */
>>
>> /*
>> @@ -360,6 +384,28 @@ int __secure_computing_int(int this_syscall)
>>                       seccomp_send_sigsys(this_syscall, reason_code);
>>                       return -1;
>>               }
>> +             case SECCOMP_RET_TRACE:
>> +                     /* If there is no interested tracer, return ENOSYS. */
>> +                     if (!(current->ptrace & PT_TRACE_SECCOMP))
>> +                             return -1;
>> +                     /*
>> +                      * Delegate to TIF_SYSCALL_TRACE. This allows fast-path
>> +                      * seccomp calls to delegate to slow-path if needed.
>> +                      * Since TIF_SYSCALL_TRACE will be unset on ptrace(2)
>> +                      * continuation, there should be no direct side
>> +                      * effects.  If TIF_SYSCALL_TRACE is already set, this
>> +                      * has no effect.  Upon completion of handling, ptrace
>> +                      * will call seccomp_tracer_done() which helps handle
>> +                      * races.
>> +                      */
>> +                     set_tsk_thread_flag(current, TIF_SYSCALL_TRACE);
>> +                     current->seccomp.in_trace = 1;
>> +                     /*
>> +                      * Allow the call, but upon completion, ptrace will
>> +                      * call seccomp_tracer_done to handle tracer
>> +                      * disappearance/death to ensure notification occurred.
>> +                      */
>> +                     return 0;
>>               case SECCOMP_RET_ALLOW:
>>                       return 0;
>>               case SECCOMP_RET_KILL:
>> --
>


Thanks!
will

2012-02-22 19:48:15

by Will Drewry

Subject: Re: [PATCH v10 07/11] signal, x86: add SIGSYS info and make it synchronous.

On Wed, Feb 22, 2012 at 2:34 AM, Indan Zupancic <[email protected]> wrote:
> On Tue, February 21, 2012 18:30, Will Drewry wrote:
>> This change enables SIGSYS, defines _sigfields._sigsys, and adds
>> x86 (compat) arch support.  _sigsys defines fields which allow
>> a signal handler to receive the triggering system call number,
>> the relevant AUDIT_ARCH_* value for that number, and the address
>> of the callsite.
>>
>> To ensure that SIGSYS delivery occurs on return from the triggering
>> system call, SIGSYS is added to the SYNCHRONOUS_MASK macro.  I'm not
>> sure whether this is enough to ensure it will be synchronous or if it
>> is explicitly required to ensure an immediate delivery of the signal
>> upon return from the blocked system call.
>>
>> The first consumer of SIGSYS would be seccomp filter.  In particular,
>> a filter program could specify a new return value, SECCOMP_RET_TRAP,
>> which would result in the system call being denied and the calling
>> thread signaled.  This also means that implementing arch-specific
>> support can be dependent upon HAVE_ARCH_SECCOMP_FILTER.
>
> I think others said this is useful, but I don't see how. Easier
> debugging compared to checking return values?
>
> I suppose SIGSYS can be blocked, so there is no guarantee the process
> will be killed.

Yeah, this allows for in-process system call emulation, if desired, or
for the process to dump core/etc. With RET_ERRNO or RET_KILL, there
isn't any feedback to the system about the state of the process. Kill
populates audit_seccomp and dmesg, but if the application
user/developer isn't the system admin, installing audit bits or
checking system logs seems onerous.

>> v10: - first version based on suggestion
>>
>> Suggested-by: H. Peter Anvin <[email protected]>
>> Signed-off-by: Will Drewry <[email protected]>
>> ---
>>  arch/x86/ia32/ia32_signal.c   |    4 ++++
>>  arch/x86/include/asm/ia32.h   |    6 ++++++
>>  include/asm-generic/siginfo.h |   18 ++++++++++++++++++
>>  kernel/signal.c               |    2 +-
>>  4 files changed, 29 insertions(+), 1 deletions(-)
>>
>> diff --git a/arch/x86/ia32/ia32_signal.c b/arch/x86/ia32/ia32_signal.c
>> index 6557769..c81d2c7 100644
>> --- a/arch/x86/ia32/ia32_signal.c
>> +++ b/arch/x86/ia32/ia32_signal.c
>> @@ -73,6 +73,10 @@ int copy_siginfo_to_user32(compat_siginfo_t __user *to, siginfo_t *from)
>>                       switch (from->si_code >> 16) {
>>                       case __SI_FAULT >> 16:
>>                               break;
>> +                     case __SI_SYS >> 16:
>> +                             put_user_ex(from->si_syscall, &to->si_syscall);
>> +                             put_user_ex(from->si_arch, &to->si_arch);
>> +                             break;
>>                       case __SI_CHLD >> 16:
>>                               put_user_ex(from->si_utime, &to->si_utime);
>>                               put_user_ex(from->si_stime, &to->si_stime);
>> diff --git a/arch/x86/include/asm/ia32.h b/arch/x86/include/asm/ia32.h
>> index 1f7e625..541485f 100644
>> --- a/arch/x86/include/asm/ia32.h
>> +++ b/arch/x86/include/asm/ia32.h
>> @@ -126,6 +126,12 @@ typedef struct compat_siginfo {
>>                       int _band;      /* POLL_IN, POLL_OUT, POLL_MSG */
>>                       int _fd;
>>               } _sigpoll;
>> +
>> +             struct {
>> +                     unsigned int _call_addr; /* calling insn */
>
> Why an int here, but a pointer below?

This is the compat version and it expects to just use an unsigned int
(see the _addr entry in _sigfault earlier in the same file).

>> +                     int _syscall;   /* triggering system call number */
>> +                     unsigned int _arch;     /* AUDIT_ARCH_* of syscall */
>> +             } _sigsys;
>>       } _sifields;
>>  } compat_siginfo_t;
>>
>> diff --git a/include/asm-generic/siginfo.h b/include/asm-generic/siginfo.h
>> index 0dd4e87..a83b478 100644
>> --- a/include/asm-generic/siginfo.h
>> +++ b/include/asm-generic/siginfo.h
>> @@ -90,6 +90,13 @@ typedef struct siginfo {
>>                       __ARCH_SI_BAND_T _band; /* POLL_IN, POLL_OUT, POLL_MSG */
>>                       int _fd;
>>               } _sigpoll;
>> +
>> +             /* SIGSYS */
>> +             struct {
>> +                     void __user *_call_addr; /* calling insn */
>
> Is this a user instruction pointer or a filter instruction?

User instruction pointer, I'll clarify.

>> +                     int _syscall;   /* triggering system call number */
>> +                     unsigned int _arch;     /* AUDIT_ARCH_* of syscall */
>> +             } _sigsys;
>>       } _sifields;
>>  } siginfo_t;
>>
>> @@ -116,6 +123,9 @@ typedef struct siginfo {
>>  #define si_addr_lsb  _sifields._sigfault._addr_lsb
>>  #define si_band              _sifields._sigpoll._band
>>  #define si_fd                _sifields._sigpoll._fd
>> +#define si_call_addr _sifields._sigsys._call_addr
>> +#define si_syscall   _sifields._sigsys._syscall
>> +#define si_arch              _sifields._sigsys._arch
>>
>>  #ifdef __KERNEL__
>>  #define __SI_MASK    0xffff0000u
>> @@ -126,6 +136,7 @@ typedef struct siginfo {
>>  #define __SI_CHLD    (4 << 16)
>>  #define __SI_RT              (5 << 16)
>>  #define __SI_MESGQ   (6 << 16)
>> +#define __SI_SYS     (7 << 16)
>>  #define __SI_CODE(T,N)       ((T) | ((N) & 0xffff))
>>  #else
>>  #define __SI_KILL    0
>> @@ -135,6 +146,7 @@ typedef struct siginfo {
>>  #define __SI_CHLD    0
>>  #define __SI_RT              0
>>  #define __SI_MESGQ   0
>> +#define __SI_SYS     0
>>  #define __SI_CODE(T,N)       (N)
>>  #endif
>>
>> @@ -232,6 +244,12 @@ typedef struct siginfo {
>>  #define NSIGPOLL     6
>>
>>  /*
>> + * SIGSYS si_codes
>> + */
>> +#define SYS_SECCOMP          (__SI_SYS|1)    /* seccomp triggered */
>> +#define NSIGSYS      1
>> +
>> +/*
>>   * sigevent definitions
>>   *
>>   * It seems likely that SIGEV_THREAD will have to be handled from
>> diff --git a/kernel/signal.c b/kernel/signal.c
>> index c73c428..7573819 100644
>> --- a/kernel/signal.c
>> +++ b/kernel/signal.c
>> @@ -160,7 +160,7 @@ void recalc_sigpending(void)
>>
>>  #define SYNCHRONOUS_MASK \
>>       (sigmask(SIGSEGV) | sigmask(SIGBUS) | sigmask(SIGILL) | \
>> -      sigmask(SIGTRAP) | sigmask(SIGFPE))
>> +      sigmask(SIGTRAP) | sigmask(SIGFPE) | sigmask(SIGSYS))
>>
>>  int next_signal(struct sigpending *pending, sigset_t *mask)
>>  {
>> --

thanks!
will

2012-02-22 19:48:23

by Will Drewry

Subject: Re: [PATCH v10 05/11] seccomp: add system call filtering using BPF

On Wed, Feb 22, 2012 at 12:32 AM, H. Peter Anvin <[email protected]> wrote:
> On 02/21/2012 09:30 AM, Will Drewry wrote:
>> +
>> +/**
>> + * struct seccomp_data - the format the BPF program executes over.
>> + * @args: up to 6 system call arguments. When the calling convention is
>> + *        32-bit, the arguments will still be at each args[X] offset.
>> + * @instruction_pointer: at the time of the system call.
>> + * @arch: indicates system call convention as an AUDIT_ARCH_* value
>> + *        as defined in <linux/audit.h>.
>> + * @nr: the system call number
>> + */
>> +struct seccomp_data {
>> +        __u64 args[6];
>> +        __u64 instruction_pointer;
>> +        __u32 arch;
>> +        int nr;
>> +};
>>
>
> This got flipped around for some reason... that is a problem if we ever
> need to extend this to more than 6 arguments (I thought we had at least
> one architecture which supported 7 arguments already, but I could just
> be delusional.)

Makes sense - I'll put it back in the proper order.
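
A minimal sketch of why the order matters, assuming "the proper order" means fixed-size fields first and the argument array last. The struct name `seccomp_data_sketch` and the helper `arg_offset()` are hypothetical and exist only for illustration; they are not part of the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical reordering of struct seccomp_data (name and exact order are
 * assumptions): fixed-size fields first, the argument array last.
 */
struct seccomp_data_sketch {
	int32_t  nr;                  /* system call number */
	uint32_t arch;                /* AUDIT_ARCH_* value */
	uint64_t instruction_pointer; /* at the time of the system call */
	uint64_t args[6];             /* could grow without moving the rest */
};

/* The offsets of the fixed fields do not depend on the length of args[]. */
static_assert(offsetof(struct seccomp_data_sketch, nr) == 0, "nr first");
static_assert(offsetof(struct seccomp_data_sketch, arch) == 4, "arch second");
static_assert(offsetof(struct seccomp_data_sketch, args) == 16, "args last");

/* Byte offset a BPF absolute load would use for argument i. */
static size_t arg_offset(unsigned int i)
{
	return offsetof(struct seccomp_data_sketch, args) + i * sizeof(uint64_t);
}
```

With this layout, a seventh argument would only append to the end of the structure, so existing filters that load `nr` or `arch` by offset would keep working.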

thanks!

2012-02-22 19:55:37

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [PATCH v10 05/11] seccomp: add system call filtering using BPF

On 02/22/2012 11:47 AM, Will Drewry wrote:
>>
>> I highly disagree with every filter having to check the mode: Filters that
>> don't check the arch on e.g. x86 are buggy, so they have to check it, even
>> if it's a 32-bit or 64-bit only system, the filters can't know that and
>> needs to check the arch at every syscall entry. All other info in the data
>> depends on the arch, because of this there isn't much code to share between
>> the two archs, so you can as well have one filter for each arch.
>>
>> Alternative approach: Tell the arch at filter install time and only run the
>> filters with the same arch as the current system call. If no filters are run,
>> deny the systemcall.
>
> This was roughly how I first implemented compat and non-compat
> support. It causes some implicit behavior across inheritance that is
> not nice though.
>

This is trivially doable at the BPF level, right? Just make this the
first instruction in the program (either deny or jump to a separate
program branch)... and then there is still "one program" without any
weird inheritance issues?

-hpa

--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel. I don't speak on their behalf.

2012-02-22 20:01:14

by Will Drewry

[permalink] [raw]
Subject: Re: [PATCH v10 05/11] seccomp: add system call filtering using BPF

On Wed, Feb 22, 2012 at 1:53 PM, H. Peter Anvin <[email protected]> wrote:
> On 02/22/2012 11:47 AM, Will Drewry wrote:
>>>
>>> I highly disagree with every filter having to check the mode: Filters that
>>> don't check the arch on e.g. x86 are buggy, so they have to check it, even
>>> if it's a 32-bit or 64-bit only system, the filters can't know that and
>>> needs to check the arch at every syscall entry. All other info in the data
>>> depends on the arch, because of this there isn't much code to share between
>>> the two archs, so you can as well have one filter for each arch.
>>>
>>> Alternative approach: Tell the arch at filter install time and only run the
>>> filters with the same arch as the current system call. If no filters are run,
>>> deny the systemcall.
>>
>> This was roughly how I first implemented compat and non-compat
>> support. It causes some implicit behavior across inheritance that is
>> not nice though.
>>
>
> This is trivially doable at the BPF level, right? Just make this the
> first instruction in the program (either deny or jump to a separate
> program branch)... and then there is still "one program" without any
> weird inheritance issues?

Exactly, and that's what the patch does now (after your feedback :)

ld arch
je arch, 1, 0
ret SECCOMP_RET_KILL
<rest of bpf program>

At this point, I don't think it makes sense to do it a different way
than just in the BPF program even if it does mean leaving out the
check could leave the program open to compat-style bugs. At least a
shared library and/or good practices should be able to catch that
error.
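
For readers who want to see the prologue above as actual filter instructions, here is a rough sketch. It assumes the uapi headers `<linux/filter.h>`, `<linux/seccomp.h>` and `<linux/audit.h>` as found on current systems, which may differ from this patch revision. The tiny interpreter `run_filter()` and the helper `verdict_for_arch()` are hypothetical user-space stand-ins that handle only the three opcodes used here; they are not the kernel's evaluator.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <linux/audit.h>
#include <linux/filter.h>
#include <linux/seccomp.h>

/* Arch-check prologue followed by a placeholder "allow everything" body. */
static const struct sock_filter arch_checked_filter[] = {
	/* ld arch */
	BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, arch)),
	/* je arch, 1, 0: skip the kill on a match, fall into it otherwise */
	BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, AUDIT_ARCH_X86_64, 1, 0),
	/* ret SECCOMP_RET_KILL */
	BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL),
	/* <rest of bpf program>; this sketch simply allows */
	BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
};

/* Hypothetical interpreter for the three opcodes above; fails closed. */
static uint32_t run_filter(const struct sock_filter *prog, size_t len,
			   const struct seccomp_data *data)
{
	uint32_t acc = 0;
	size_t pc = 0;

	assert(prog != NULL && data != NULL);
	while (pc < len) {
		const struct sock_filter *insn = &prog[pc++];

		switch (insn->code) {
		case BPF_LD | BPF_W | BPF_ABS:
			if (insn->k > sizeof(*data) - sizeof(acc))
				return SECCOMP_RET_KILL; /* out-of-bounds load */
			memcpy(&acc, (const unsigned char *)data + insn->k,
			       sizeof(acc));
			break;
		case BPF_JMP | BPF_JEQ | BPF_K:
			pc += (acc == insn->k) ? insn->jt : insn->jf;
			break;
		case BPF_RET | BPF_K:
			return insn->k;
		default:
			return SECCOMP_RET_KILL; /* unsupported opcode */
		}
	}
	return SECCOMP_RET_KILL; /* ran off the end of the program */
}

/* Evaluate the filter for a system call made under the given arch. */
static uint32_t verdict_for_arch(uint32_t arch)
{
	struct seccomp_data data;

	memset(&data, 0, sizeof(data));
	data.arch = arch;
	return run_filter(arch_checked_filter,
			  sizeof(arch_checked_filter) /
			  sizeof(arch_checked_filter[0]), &data);
}
```

A call made with the expected arch reaches the rest of the program, while any other arch hits the kill instruction before a single syscall number is examined.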

thanks!
will

2012-02-22 23:03:47

by Indan Zupancic

[permalink] [raw]
Subject: Re: [PATCH v10 05/11] seccomp: add system call filtering using BPF

On Wed, February 22, 2012 15:23, Ben Hutchings wrote:
> On Wed, 2012-02-22 at 09:19 +0100, Indan Zupancic wrote:
> [...]
>> Alternative approach: Tell the arch at filter install time and only run the
>> filters with the same arch as the current system call. If no filters are run,
>> deny the systemcall.
>>
>> Advantages:
>>
>> - Filters don't have to check the arch every syscall entry.
>>
>> - Secure by default. Filters don't have to do anything arch specific to
>> be secure, no surprises possible.
>>
>> - If a new arch comes into existence, there is no chance of old filters
>> becoming buggy and insecure. This is especially true for archs that
>> had only one mode, but added another one later on: Old filters had no
>> need to check the mode at all.
> [...]
>
> What about when there are multiple layers of restrictions? So long as
> any one layer covers the new architecture, there is no default-deny even
> though the other layers might not cover it.

When I wrote the above I assumed this wouldn't be a big problem because
if filters allow prctl, they can check the arg flag for supported archs.
Or they can install a filter for all archs and do the arch check in there.
All under the assumption that allowing prctl is rare and if it's allowed,
it needs special checks anyway.

But having thought more about it, I fear that sometimes needing such a
check may be worse than checking the arch in each filter.

> I would have thought the way to make sure the architecture is always
> checked is to pack it together with the syscall number.

There is no default deny when passing the arch to the filter either,
nothing forces filters to check the arch.

But documenting that filters should always check the arch is simpler and
easier than telling them to check for unknown archs in prctl, or to do
something else obscure.

Greetings,

Indan

2012-02-22 23:39:21

by Andrew Lutomirski

[permalink] [raw]
Subject: Re: [PATCH v10 07/11] signal, x86: add SIGSYS info and make it synchronous.

On Wed, Feb 22, 2012 at 11:48 AM, Will Drewry <[email protected]> wrote:
> On Wed, Feb 22, 2012 at 2:34 AM, Indan Zupancic <[email protected]> wrote:
>> On Tue, February 21, 2012 18:30, Will Drewry wrote:
>>> This change enables SIGSYS, defines _sifields._sigsys, and adds
>>> x86 (compat) arch support. _sigsys defines fields which allow
>>> a signal handler to receive the triggering system call number,
>>> the relevant AUDIT_ARCH_* value for that number, and the address
>>> of the callsite.
>>>
>>> To ensure that SIGSYS delivery occurs on return from the triggering
>>> system call, SIGSYS is added to the SYNCHRONOUS_MASK macro. I'm not
>>> sure if this is enough to ensure it will be synchronous or if it is
>>> explicitly required to ensure an immediate delivery of the signal upon
>>> return from the blocked system call.
>>>
>>> The first consumer of SIGSYS would be seccomp filter. In particular,
>>> a filter program could specify a new return value, SECCOMP_RET_TRAP,
>>> which would result in the system call being denied and the calling
>>> thread signaled. This also means that implementing arch-specific
>>> support can be dependent upon HAVE_ARCH_SECCOMP_FILTER.
>>
>> I think others said this is useful, but I don't see how. Easier
>> debugging compared to checking return values?
>>
>> I suppose SIGSYS can be blocked, so there is no guarantee the process
>> will be killed.
>
> Yeah, this allows for in-process system call emulation, if desired, or
> for the process to dump core/etc. With RET_ERRNO or RET_KILL, there
> isn't any feedback to the system about the state of the process. Kill
> populates audit_seccomp and dmesg, but if the application
> user/developer isn't the system admin, installing audit bits or
> checking system logs seems onerous.

[Warning: this suggestion may be bad for any number of reasons]

I wonder if it would be helpful to change the semantics of RET_KILL
slightly. Rather than killing via do_exit, what if it killed via a
forcibly-fatal SIGSYS? That way, the parent's waitid() / SIGCHLD
would indicate CLD_KILLED with si_status == SIGSYS. The parent could
check that and report that the child was probably compromised.
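
A rough sketch of the parent-side check described above, using only standard POSIX calls (`fork`, `raise`, `waitid`). The helper names `child_died_from_sigsys()` and `spawn_and_check()` are made up for illustration. The child here raises the signal itself; it does not exercise seccomp. `CLD_DUMPED` is accepted as well, since the default action for SIGSYS normally includes a core dump.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <stdbool.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* True if the waitid() result says the child was terminated by SIGSYS. */
static bool child_died_from_sigsys(const siginfo_t *info)
{
	assert(info != NULL);
	return (info->si_code == CLD_KILLED || info->si_code == CLD_DUMPED) &&
	       info->si_status == SIGSYS;
}

/* Fork a child that raises `sig` with its default action, then classify. */
static bool spawn_and_check(int sig)
{
	siginfo_t info;
	pid_t pid = fork();

	if (pid < 0)
		return false;
	if (pid == 0) {
		sigset_t set;

		/* Make sure the signal is neither blocked nor handled. */
		sigemptyset(&set);
		sigaddset(&set, sig);
		sigprocmask(SIG_UNBLOCK, &set, NULL);
		signal(sig, SIG_DFL);
		raise(sig);
		_exit(0); /* only reached if the signal did not terminate us */
	}
	memset(&info, 0, sizeof(info));
	if (waitid(P_PID, (id_t)pid, &info, WEXITED) != 0)
		return false;
	return child_died_from_sigsys(&info);
}
```

A parent using this check can tell a SIGSYS death apart from a plain SIGKILL, which is the distinction the suggestion is after.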

--Andy

2012-02-22 23:46:43

by Indan Zupancic

[permalink] [raw]
Subject: Re: [PATCH v10 05/11] seccomp: add system call filtering using BPF

On Wed, February 22, 2012 20:47, Will Drewry wrote:
> On Wed, Feb 22, 2012 at 8:23 AM, Ben Hutchings
>> I would have thought the way to make sure the architecture is always
>> checked is to pack it together with the syscall number.

I missed that suggestion, putting the syscall number and arch in one
data field would indeed make it harder to not check the arch.

> If the current patchset used the elf machine only and not the
> AUDIT_ARCH_* that might be possible since e_machine is only 16 bits.

Using AUDIT_ARCH_ has the advantage that it contains the endianness and
width of the arch, which is crucial info for archs that support multiple
modes with the same arch. E.g. MIPS got:

#define AUDIT_ARCH_MIPS (EM_MIPS)
#define AUDIT_ARCH_MIPSEL (EM_MIPS|__AUDIT_ARCH_LE)
#define AUDIT_ARCH_MIPS64 (EM_MIPS|__AUDIT_ARCH_64BIT)
#define AUDIT_ARCH_MIPSEL64 (EM_MIPS|__AUDIT_ARCH_64BIT|__AUDIT_ARCH_LE)

So just EM_MIPS isn't enough info.
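
To make the point concrete, here is a small sketch that splits an AUDIT_ARCH_* value into its parts. It relies on the `__AUDIT_ARCH_64BIT` and `__AUDIT_ARCH_LE` flags and the `EM_*` constants from `<linux/audit.h>`; the struct `arch_info` and the function `decode_audit_arch()` are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <linux/audit.h>

/* The flag bits must not overlap the 16-bit ELF e_machine value. */
static_assert(((__AUDIT_ARCH_64BIT | __AUDIT_ARCH_LE) & 0xffffu) == 0,
	      "flag bits overlap e_machine");

/* Hypothetical decoded view of an AUDIT_ARCH_* value. */
struct arch_info {
	uint16_t machine;          /* ELF e_machine, e.g. EM_MIPS */
	bool     is_64bit;         /* __AUDIT_ARCH_64BIT set */
	bool     is_little_endian; /* __AUDIT_ARCH_LE set */
};

static struct arch_info decode_audit_arch(uint32_t arch)
{
	struct arch_info info;

	info.machine = (uint16_t)(arch & 0xffffu);
	info.is_64bit = (arch & __AUDIT_ARCH_64BIT) != 0;
	info.is_little_endian = (arch & __AUDIT_ARCH_LE) != 0;
	return info;
}
```

All four MIPS values decode to the same `machine`, so only the flag bits distinguish them.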

> However, that would still assume that an arch wouldn't introduce a
> syscall number above 65535 which is most likely not a safe assumption.
> Am I wrong there?

No, it's not a safe assumption. E.g. look at arm_syscall() in
arch/arm/kernel/traps.c:

"0x9f0000 - 0x9fffff are some more esoteric system calls"

You could, though, check whether the filter read the 'arch' field and
deny the syscall when the filter returns if it didn't. Or do the check
in the filter check function. It wouldn't be the nicest code ever, but
it would give the same assurance as packing it with the syscall number.
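
One way the "filter check function" variant could look, as a sketch only: scan the program once at install time for a load of the arch field. The function `filter_loads_arch()` and the two sample programs are hypothetical, and the field offsets come from the current `<linux/seccomp.h>`, which is an assumption relative to this patch revision. Note that such a scan only proves that a load exists somewhere in the program, not that every path performs it or acts on the result.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <linux/filter.h>
#include <linux/seccomp.h>

/* True if any instruction is a 32-bit absolute load of seccomp_data.arch. */
static bool filter_loads_arch(const struct sock_filter *prog, size_t len)
{
	assert(prog != NULL || len == 0);
	for (size_t i = 0; i < len; i++) {
		if (prog[i].code == (BPF_LD | BPF_W | BPF_ABS) &&
		    prog[i].k == offsetof(struct seccomp_data, arch))
			return true;
	}
	return false;
}

/* Sample program that looks at the arch field. */
static const struct sock_filter prog_with_arch[] = {
	BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, arch)),
	BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
};

/* Sample program that only looks at the syscall number. */
static const struct sock_filter prog_without_arch[] = {
	BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
	BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
};
```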

Greetings,

Indan

2012-02-22 23:52:04

by Andrew Lutomirski

[permalink] [raw]
Subject: Re: [PATCH v10 05/11] seccomp: add system call filtering using BPF

On Wed, Feb 22, 2012 at 3:46 PM, Indan Zupancic <[email protected]> wrote:
> On Wed, February 22, 2012 20:47, Will Drewry wrote:
>> On Wed, Feb 22, 2012 at 8:23 AM, Ben Hutchings
>>> I would have thought the way to make sure the architecture is always
>>> checked is to pack it together with the syscall number.
>
> I missed that suggestion, putting the syscall number and arch in one
> data field would indeed make it harder to not check the arch.

Is there enough room? On x86-64 at least, rax could conceivably be
extended to 64 bits some day. Bit 30 is already spoken for by x32.

--Andy

2012-02-22 23:53:14

by Kees Cook

[permalink] [raw]
Subject: Re: [PATCH v10 07/11] signal, x86: add SIGSYS info and make it synchronous.

On Wed, Feb 22, 2012 at 3:38 PM, Andrew Lutomirski <[email protected]> wrote:
> On Wed, Feb 22, 2012 at 11:48 AM, Will Drewry <[email protected]> wrote:
>> On Wed, Feb 22, 2012 at 2:34 AM, Indan Zupancic <[email protected]> wrote:
>>> On Tue, February 21, 2012 18:30, Will Drewry wrote:
>>>> This change enables SIGSYS, defines _sifields._sigsys, and adds
>>>> x86 (compat) arch support. _sigsys defines fields which allow
>>>> a signal handler to receive the triggering system call number,
>>>> the relevant AUDIT_ARCH_* value for that number, and the address
>>>> of the callsite.
>>>>
>>>> To ensure that SIGSYS delivery occurs on return from the triggering
>>>> system call, SIGSYS is added to the SYNCHRONOUS_MASK macro. I'm not
>>>> sure if this is enough to ensure it will be synchronous or if it is
>>>> explicitly required to ensure an immediate delivery of the signal upon
>>>> return from the blocked system call.
>>>>
>>>> The first consumer of SIGSYS would be seccomp filter. In particular,
>>>> a filter program could specify a new return value, SECCOMP_RET_TRAP,
>>>> which would result in the system call being denied and the calling
>>>> thread signaled. This also means that implementing arch-specific
>>>> support can be dependent upon HAVE_ARCH_SECCOMP_FILTER.
>>>
>>> I think others said this is useful, but I don't see how. Easier
>>> debugging compared to checking return values?
>>>
>>> I suppose SIGSYS can be blocked, so there is no guarantee the process
>>> will be killed.
>>
>> Yeah, this allows for in-process system call emulation, if desired, or
>> for the process to dump core/etc. With RET_ERRNO or RET_KILL, there
>> isn't any feedback to the system about the state of the process. Kill
>> populates audit_seccomp and dmesg, but if the application
>> user/developer isn't the system admin, installing audit bits or
>> checking system logs seems onerous.
>
> [Warning: this suggestion may be bad for any number of reasons]
>
> I wonder if it would be helpful to change the semantics of RET_KILL
> slightly. Rather than killing via do_exit, what if it killed via a
> forcibly-fatal SIGSYS? That way, the parent's waitid() / SIGCHLD
> would indicate CLD_KILLED with si_status == SIGSYS. The parent could
> check that and report that the child was probably compromised.
>
> --Andy

I'd prefer sticking with do_exit. This provides much less chance of
things going wrong. A parent seeing a child killed with SIGKILL is
already pretty distinct, IMO.

-Kees

--
Kees Cook
ChromeOS Security

2012-02-23 00:05:11

by Will Drewry

[permalink] [raw]
Subject: Re: [PATCH v10 07/11] signal, x86: add SIGSYS info and make it synchronous.

On Wed, Feb 22, 2012 at 5:53 PM, Kees Cook <[email protected]> wrote:
> On Wed, Feb 22, 2012 at 3:38 PM, Andrew Lutomirski <[email protected]> wrote:
>> On Wed, Feb 22, 2012 at 11:48 AM, Will Drewry <[email protected]> wrote:
>>> On Wed, Feb 22, 2012 at 2:34 AM, Indan Zupancic <[email protected]> wrote:
>>>> On Tue, February 21, 2012 18:30, Will Drewry wrote:
>>>>> This change enables SIGSYS, defines _sifields._sigsys, and adds
>>>>> x86 (compat) arch support. _sigsys defines fields which allow
>>>>> a signal handler to receive the triggering system call number,
>>>>> the relevant AUDIT_ARCH_* value for that number, and the address
>>>>> of the callsite.
>>>>>
>>>>> To ensure that SIGSYS delivery occurs on return from the triggering
>>>>> system call, SIGSYS is added to the SYNCHRONOUS_MASK macro. I'm not
>>>>> sure if this is enough to ensure it will be synchronous or if it is
>>>>> explicitly required to ensure an immediate delivery of the signal upon
>>>>> return from the blocked system call.
>>>>>
>>>>> The first consumer of SIGSYS would be seccomp filter. In particular,
>>>>> a filter program could specify a new return value, SECCOMP_RET_TRAP,
>>>>> which would result in the system call being denied and the calling
>>>>> thread signaled. This also means that implementing arch-specific
>>>>> support can be dependent upon HAVE_ARCH_SECCOMP_FILTER.
>>>>
>>>> I think others said this is useful, but I don't see how. Easier
>>>> debugging compared to checking return values?
>>>>
>>>> I suppose SIGSYS can be blocked, so there is no guarantee the process
>>>> will be killed.
>>>
>>> Yeah, this allows for in-process system call emulation, if desired, or
>>> for the process to dump core/etc. With RET_ERRNO or RET_KILL, there
>>> isn't any feedback to the system about the state of the process. Kill
>>> populates audit_seccomp and dmesg, but if the application
>>> user/developer isn't the system admin, installing audit bits or
>>> checking system logs seems onerous.
>>
>> [Warning: this suggestion may be bad for any number of reasons]
>>
>> I wonder if it would be helpful to change the semantics of RET_KILL
>> slightly. Rather than killing via do_exit, what if it killed via a
>> forcibly-fatal SIGSYS? That way, the parent's waitid() / SIGCHLD
>> would indicate CLD_KILLED with si_status == SIGSYS. The parent could
>> check that and report that the child was probably compromised.
>>
>> --Andy
>
> I'd prefer sticking with do_exit. This provides much less chance of
> things going wrong. A parent seeing a child killed with SIGKILL is
> already pretty distinct, IMO.

Hrm, it might be possible to do_exit(SIGSYS) which would be both. It
looks like tsk->exit_code would be SIGSYS then, but I'll look a little
more closely to see what that'll actually do.

2012-02-23 00:08:31

by Indan Zupancic

[permalink] [raw]
Subject: Re: [PATCH v10 05/11] seccomp: add system call filtering using BPF

On Thu, February 23, 2012 00:51, Andrew Lutomirski wrote:
> On Wed, Feb 22, 2012 at 3:46 PM, Indan Zupancic <[email protected]> wrote:
>> On Wed, February 22, 2012 20:47, Will Drewry wrote:
>>> On Wed, Feb 22, 2012 at 8:23 AM, Ben Hutchings
>>>> I would have thought the way to make sure the architecture is always
>>>> checked is to pack it together with the syscall number.
>>
>> I missed that suggestion, putting the syscall number and arch in one
>> data field would indeed make it harder to not check the arch.
>
> Is there enough room? On x86-64 at least, rax could conceivably be
> extended to 64 bits some day. Bit 30 is already spoken for by x32.

No, there isn't.

2012-02-23 00:08:38

by Kees Cook

[permalink] [raw]
Subject: Re: [PATCH v10 07/11] signal, x86: add SIGSYS info and make it synchronous.

On Wed, Feb 22, 2012 at 4:05 PM, Will Drewry <[email protected]> wrote:
> On Wed, Feb 22, 2012 at 5:53 PM, Kees Cook <[email protected]> wrote:
>> On Wed, Feb 22, 2012 at 3:38 PM, Andrew Lutomirski <[email protected]> wrote:
>>> On Wed, Feb 22, 2012 at 11:48 AM, Will Drewry <[email protected]> wrote:
>>>> On Wed, Feb 22, 2012 at 2:34 AM, Indan Zupancic <[email protected]> wrote:
>>>>> On Tue, February 21, 2012 18:30, Will Drewry wrote:
>>>>>> This change enables SIGSYS, defines _sifields._sigsys, and adds
>>>>>> x86 (compat) arch support. _sigsys defines fields which allow
>>>>>> a signal handler to receive the triggering system call number,
>>>>>> the relevant AUDIT_ARCH_* value for that number, and the address
>>>>>> of the callsite.
>>>>>>
>>>>>> To ensure that SIGSYS delivery occurs on return from the triggering
>>>>>> system call, SIGSYS is added to the SYNCHRONOUS_MASK macro. I'm not
>>>>>> sure if this is enough to ensure it will be synchronous or if it is
>>>>>> explicitly required to ensure an immediate delivery of the signal upon
>>>>>> return from the blocked system call.
>>>>>>
>>>>>> The first consumer of SIGSYS would be seccomp filter. In particular,
>>>>>> a filter program could specify a new return value, SECCOMP_RET_TRAP,
>>>>>> which would result in the system call being denied and the calling
>>>>>> thread signaled. This also means that implementing arch-specific
>>>>>> support can be dependent upon HAVE_ARCH_SECCOMP_FILTER.
>>>>>
>>>>> I think others said this is useful, but I don't see how. Easier
>>>>> debugging compared to checking return values?
>>>>>
>>>>> I suppose SIGSYS can be blocked, so there is no guarantee the process
>>>>> will be killed.
>>>>
>>>> Yeah, this allows for in-process system call emulation, if desired, or
>>>> for the process to dump core/etc. With RET_ERRNO or RET_KILL, there
>>>> isn't any feedback to the system about the state of the process. Kill
>>>> populates audit_seccomp and dmesg, but if the application
>>>> user/developer isn't the system admin, installing audit bits or
>>>> checking system logs seems onerous.
>>>
>>> [Warning: this suggestion may be bad for any number of reasons]
>>>
>>> I wonder if it would be helpful to change the semantics of RET_KILL
>>> slightly. Rather than killing via do_exit, what if it killed via a
>>> forcibly-fatal SIGSYS? That way, the parent's waitid() / SIGCHLD
>>> would indicate CLD_KILLED with si_status == SIGSYS. The parent could
>>> check that and report that the child was probably compromised.
>>>
>>> --Andy
>>
>> I'd prefer sticking with do_exit. This provides much less chance of
>> things going wrong. A parent seeing a child killed with SIGKILL is
>> already pretty distinct, IMO.
>
> Hrm, it might be possible to do_exit(SIGSYS) which would be both. It
> looks like tsk->exit_code would be SIGSYS then, but I'll look a little
> more closely to see what that'll actually do.

As long as there's no way it can get blocked, I'd be fine with that.
It would, actually, be better than SIGKILL because, as Andy said, it's
more distinguishable from other situations. I've long wanted a signal
to be used for "violated policy" that wasn't just a straight SIGKILL.

-Kees

--
Kees Cook
ChromeOS Security

2012-02-23 00:11:26

by Roland McGrath

[permalink] [raw]
Subject: Re: [PATCH v10 07/11] signal, x86: add SIGSYS info and make it synchronous.

On Wed, Feb 22, 2012 at 3:38 PM, Andrew Lutomirski <[email protected]> wrote:
> I wonder if it would be helpful to change the semantics of RET_KILL
> slightly. Rather than killing via do_exit, what if it killed via a
> forcibly-fatal SIGSYS? That way, the parent's waitid() / SIGCHLD
> would indicate CLD_KILLED with si_status == SIGSYS. The parent could
> check that and report that the child was probably compromised.

That would be better. But it is certainly a more complex code path, which
makes the security weenies twitch. As to concrete issues, any "normal"
path needs the changes that are maybe pending from Oleg to make it actually
abort the syscall instead of completing it before getting to the signal path.

2012-02-23 00:25:14

by Indan Zupancic

[permalink] [raw]
Subject: Re: [PATCH v10 05/11] seccomp: add system call filtering using BPF

On Wed, February 22, 2012 20:47, Will Drewry wrote:
> On Wed, Feb 22, 2012 at 2:19 AM, Indan Zupancic <[email protected]> wrote:
>> I highly disagree with every filter having to check the mode: Filters that
>> don't check the arch on e.g. x86 are buggy, so they have to check it, even
>> if it's a 32-bit or 64-bit only system, the filters can't know that and
>> needs to check the arch at every syscall entry. All other info in the data
>> depends on the arch, because of this there isn't much code to share between
>> the two archs, so you can as well have one filter for each arch.
>>
>> Alternative approach: Tell the arch at filter install time and only run the
>> filters with the same arch as the current system call. If no filters are run,
>> deny the systemcall.
>
> This was roughly how I first implemented compat and non-compat
> support. It causes some implicit behavior across inheritance that is
> not nice though.

Same implicit behaviour Ben mentioned or something else?

Yeah, that's a bit of a problem. It can be solved within filters, but
it's starting to get more obscure than just checking the arch for every
syscall.

>
>> Advantages:
>>
>> - Filters don't have to check the arch every syscall entry.
>
> This I like.
>
>> - Secure by default. Filters don't have to do anything arch specific to
>>  be secure, no surprises possible.
>
> This is partially true, but it is exactly why I hid compat before.
>
>> - If a new arch comes into existence, there is no chance of old filters
>>  becoming buggy and insecure. This is especially true for archs that
>>  had only one mode, but added another one later on: Old filters had no
>>  need to check the mode at all.
>
> Perhaps. A buggy filter that works on x86-64 might be exposed on a
> new x32 ABI. It's hard to predict how audit_arch and the syscall abi
> will develop with new platforms.

It doesn't matter: if the filter assumes that only two archs are possible
and those two archs need different treatment, then the new arch will at
best match only one of them, and which one depends on which arch the
filter checks for. E.g. whether it does:

	if (arch == AUDIT_ARCH_I386)
		...
	else /* assume x86_64 */
		...

versus

	if (arch == AUDIT_ARCH_X86_64)
		...
	else /* assume i386 */
		...
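
For contrast, a sketch of a third shape that avoids the "else, assume" branch altogether: match each known arch explicitly and deny everything else. This is an illustration rather than anything proposed in the thread; the enum `arch_branch` and the function `select_branch()` are hypothetical, and the AUDIT_ARCH_* constants come from `<linux/audit.h>`.

```c
#include <assert.h>
#include <stdint.h>
#include <linux/audit.h>

static_assert(AUDIT_ARCH_I386 != AUDIT_ARCH_X86_64,
	      "the two arch values must be distinguishable");

/* Which part of a (hypothetical) filter handles the system call. */
enum arch_branch {
	BRANCH_DENY,   /* unknown arch: fail closed */
	BRANCH_I386,   /* 32-bit syscall table */
	BRANCH_X86_64, /* 64-bit syscall table */
};

/* Match every supported arch explicitly; never fall back to a guess. */
static enum arch_branch select_branch(uint32_t arch)
{
	switch (arch) {
	case AUDIT_ARCH_I386:
		return BRANCH_I386;
	case AUDIT_ARCH_X86_64:
		return BRANCH_X86_64;
	default:
		return BRANCH_DENY;
	}
}
```

With this shape a newly introduced arch value lands in the deny branch instead of being interpreted with the wrong syscall table.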

>> - For kernels supporting only one arch, the check can be optimised away,
>>  by not installing unsupported arch filters at all.
>
> Somewhat. Without having a dedicated arch helper, you'd have to guess
> that arches only support one or two arches (if compat is supported),
> but I don't know if that is a safe assumption to make.

Well, if you want to optimise all checks away, then you obviously need
arch helpers. Without it, you have to install all filters, even the ones
you'll never run.

>> It's more secure, faster and simpler for the filters.
>>
>> If something like this is implemented it's fine to expose the arch info
>> in the syscall data too, and have a way to install filters for all archs,
>> for the few cases where that might be useful, although I can't think of
>> any reason why people would like to do unnecessary work in the filters.
>
> It seems to just add complexity to support both. I think we'll
> probably end up with it in the filters for better or worse. Possibly
> JITing will be useful since at least a 32-bit load and je is pretty
> cheap in native instructions.

Yeah, except that you can't easily do that because you don't have direct
access to the arch.

>> All that's needed is an extra argument to the prctl() call. I propose
>> 0 for the current arch, -1 for all archs and anything else to specify
>> the arch. Installing a filter for an unsupported arch could return
>> ENOEXEC.
>
> Without adding a per-arch call, there is no way to know all the
> supported arches at install time. Current arch, at least, can be
> determined with a call to syscall_get_arch().

True.

> As is, I'm not sure it makes sense to try to reserve two extra input
> types: 0 and -1. 0 would be sane to treat as either a wildcard or
> current because it is unlikely to be used by AUDIT_ARCH_* ever since
> EM_NONE is assigned to 0. However, I have no such insight into
> whether it will ever be possible to compose 0xffffffff as an
> AUDIT_ARCH_.

That seems impossible.

>> As far as the implementation goes, either have a list per supported arch
>> or store the arch per filter and check that before running the filter.
>
> You can't do it per arch without adding even more per-arch
> dependencies. Keeping them annotated in the same list is the clearest
> way I've seen so far, but it comes with its own burdens.

You could have a list per installed arch, so there is no need to know all
supported archs, if you don't have the per arch helpers.

I don't see how keeping the arch in the filter itself comes with a burden,
that's what you were basically doing with the compat flag anyway.

But keeping the check within the filter seems to be the simplest solution,
so ignore my objections and let's just live with the extra hassle on the
user space side.

>> What use is the instruction pointer considering it tells nothing about
>> the call path?

My fear of exposing the IP is that people will erroneously assume that it
says anything about the call path, and hence write insecure security code
that's easily bypassed by just jumping to the right instruction address.
And if the vDSO is used, the IP will always be the same.

So what use is knowing the IP?

>> Wouldn't it make sense to allow going from mode 2 to 1?
>> After all, the filter could have blocked it if it didn't
>> want to permit it, and mode 1 is more restrictive than
>> mode 2.
>
> Nope - that might allow a downgrade that bypasses write/read
> restrictions. E.g., a filter could only allow a read to a certain buf
> or of a certain size. Allowing a downgrade would allow bypassing
> those filters, whether they are the most sane things or not :)

But now you enforce that decision while the filter could make that
choice itself instead. If the filter doesn't allow read() and write(),
I really doubt it would allow prctl().

>> Out of curiosity, did you measure the kernel size differences before and
>> after these patches? Would be sad if sharing it with the networking code
>> didn't reduce the actual kernel size.
>
> Oh yeah - it was a serious reduction. Initially, seccomp_filter.o
> added 8kb by itself. With the merged seccomp.o, continued code
> trimming (as suggested), and all the SECCOMP_RET_* variations, the
> total kernel growth is 2972 bytes for the same kernel config. This is
> shared across ~2000 bytes in seccomp.o and ~800 bytes in filter.o.

Looks good, though the 800 extra bytes for filter.o seems high, it
used to be 292 bytes according to your email from January the 30th.
You said the run filter function added 861 bytes, so sharing doesn't
seem to reduce the kernel size any more?

Greetings,

Indan

2012-02-23 00:30:51

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [PATCH v10 07/11] signal, x86: add SIGSYS info and make it synchronous.

On 02/22/2012 04:08 PM, Kees Cook wrote:
>>
>> Hrm, it might be possible to do_exit(SIGSYS) which would be both. It
>> looks like tsk->exit_code would be SIGSYS then, but I'll look a little
>> more closely to see what that'll actually do.
>
> As long as there's no way it can get blocked, I'd be fine with that.
> It would, actually, be better than SIGKILL because, as Andy said, it's
> more distinguishable from other situations. I've long wanted a signal
> to be used for "violated policy" that wasn't just a straight SIGKILL.
>

Can we really introduce force-kill semantics for a POSIX-defined signal?
Other user space programs might use it for other purposes.

I'm wondering if the right thing may be to introduce some variant of
exit() which can return more information about a signal, including some
kind of cause code for SIGKILL?

-hpa

--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel. I don't speak on their behalf.

2012-02-23 00:51:05

by Roland McGrath

[permalink] [raw]
Subject: Re: [PATCH v10 07/11] signal, x86: add SIGSYS info and make it synchronous.

On Wed, Feb 22, 2012 at 4:29 PM, H. Peter Anvin <[email protected]> wrote:
> Can we really introduce force-kill semantics for a POSIX-defined signal?
> Other user space programs might use it for other purposes.

The semantics are based on how the signal was generated, not what signal
number it was. The only thing that depends on the signal number is
SYNCHRONOUS_MASK, which just determines in which order pending signals are
dequeued (POSIX says it may be any order). We only have that so your state
doesn't get unhelpfully warped to another signal handler entry point
(including fiddling the stack) before you dump core.

No use of SIGSYS is specified by POSIX at all, of course, since "system
call" is an implementation concept below the level POSIX specifies.
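
The dequeue-order effect described above can be sketched in user space with plain bit masks. This is a simplified model, not the kernel's `next_signal()`: the names `SIG_BIT`, `SYNC_MASK_SKETCH` and `next_signal_sketch()` are hypothetical, and a single 64-bit word stands in for `sigset_t`.

```c
#include <assert.h>
#include <signal.h>
#include <stdint.h>

/* Bit for signal `sig` in a 64-bit mask (signal 1 is bit 0). */
#define SIG_BIT(sig) (UINT64_C(1) << ((sig) - 1))

static_assert(SIGSYS >= 1 && SIGSYS <= 64, "SIGSYS must fit in the mask");

/* Synchronous signals, with SIGSYS added as in the patch under review. */
#define SYNC_MASK_SKETCH \
	(SIG_BIT(SIGSEGV) | SIG_BIT(SIGBUS) | SIG_BIT(SIGILL) | \
	 SIG_BIT(SIGTRAP) | SIG_BIT(SIGFPE) | SIG_BIT(SIGSYS))

/*
 * Pick the next signal to deliver: the lowest-numbered unblocked pending
 * signal, except that synchronous signals go first when any is pending.
 * Returns 0 if nothing is deliverable.
 */
static int next_signal_sketch(uint64_t pending, uint64_t blocked)
{
	uint64_t ready = pending & ~blocked;

	if (ready & SYNC_MASK_SKETCH)
		ready &= SYNC_MASK_SKETCH;
	for (int sig = 1; sig <= 64; sig++) {
		if (ready & SIG_BIT(sig))
			return sig;
	}
	return 0;
}
```

In this model the mask affects only the order; whether a signal can be blocked or is fatal is decided elsewhere, which matches the point made above.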

2012-02-23 01:07:09

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [PATCH v10 07/11] signal, x86: add SIGSYS info and make it synchronous.

On 02/22/2012 04:50 PM, Roland McGrath wrote:
> On Wed, Feb 22, 2012 at 4:29 PM, H. Peter Anvin <[email protected]> wrote:
>> Can we really introduce force-kill semantics for a POSIX-defined signal?
>> Other user space programs might use it for other purposes.
>
> The semantics are based on how the signal was generated, not what signal
> number it was. The only thing that depends on the signal number is
> SYNCHRONOUS_MASK, which just determines in which order pending signals are
> dequeued (POSIX says it may be any order). We only have that so your state
> doesn't get unhelpfully warped to another signal handler entry point
> (including fiddling the stack) before you dump core.
>
> No use of SIGSYS is specified by POSIX at all, of course, since "system
> call" is an implementation concept below the level POSIX specifies.

I meant whether or not a signal can be blocked/caught and the fact that
the signal exists at all.

Now I guess we could have "blockable" and "unblockable" SIGSYS, but that
would seem to have its own set of issues...

-hpa

--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel. I don't speak on their behalf.

2012-02-23 01:07:55

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [PATCH v10 05/11] seccomp: add system call filtering using BPF

On 02/22/2012 03:51 PM, Andrew Lutomirski wrote:
> On Wed, Feb 22, 2012 at 3:46 PM, Indan Zupancic <[email protected]> wrote:
>> On Wed, February 22, 2012 20:47, Will Drewry wrote:
>>> On Wed, Feb 22, 2012 at 8:23 AM, Ben Hutchings
>>>> I would have thought the way to make sure the architecture is always
>>>> checked is to pack it together with the syscall number.
>>
>> I missed that suggestion, putting the syscall number and arch in one
>> data field would indeed make it harder to not check the arch.
>
> Is there enough room? On x86-64 at least, rax could conceivably be
> extended to 64 bits some day. Bit 30 is already spoken for by x32.
>

No it couldn't, because we mask off the high 32 bits and thus it could
(theoretically) break user space.
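
As a rough illustration of the masking described above: a minimal sketch, assuming the syscall number is taken from the low 32 bits of rax and that x32 is flagged by bit 30, as stated in this thread. The helper names and the DEMO_X32_BIT constant are hypothetical and are not kernel code.

```c
#include <stdint.h>

/* Bit 30 marks an x32 syscall, per the discussion above (name is hypothetical). */
#define DEMO_X32_BIT 0x40000000u

/* Model of the truncation: only the low 32 bits of rax are considered. */
uint32_t effective_syscall_nr(uint64_t rax)
{
    return (uint32_t)rax;
}

/* Returns 1 if the truncated syscall number carries the x32 marker bit. */
int is_x32_syscall(uint64_t rax)
{
    return (effective_syscall_nr(rax) & DEMO_X32_BIT) != 0;
}
```

Because the upper half is discarded before the number is interpreted, a value such as 0x4000000000000027 is seen as plain syscall 0x27 and not as an x32 call.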

-hpa


--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel. I don't speak on their behalf.

2012-02-23 16:44:34

by Will Drewry

[permalink] [raw]
Subject: Re: [PATCH v10 07/11] signal, x86: add SIGSYS info and make it synchronous.

On Wed, Feb 22, 2012 at 6:29 PM, H. Peter Anvin <[email protected]> wrote:
> On 02/22/2012 04:08 PM, Kees Cook wrote:
>>>
>>> Hrm, it might be possible to do_exit(SIGSYS) which would be both. It
>>> looks like tsk->exit_code would be SIGSYS then, but I'll look a little
>>> more closely to see what that'll actually do.
>>
>> As long as there's no way it can get blocked, I'd be fine with that.
>> It would, actually, be better than SIGKILL because, as Andy said, it's
>> more distinguishable from other situations. I've long wanted a signal
>> to be used for "violated policy" that wasn't just a straight SIGKILL.
>>
>
> Can we really introduce force-kill semantics for a POSIX-defined signal?
> Other user space programs might use it for other purposes.
>
> I'm wondering if the right thing may be to introduce some variant of
> exit() which can return more information about a signal, including some
> kind of cause code for SIGKILL?

While it'd be harder to send back extra info, passing SIGSYS to
do_exit() should result in the si_status for the emitted SIGCHLD being
SIGSYS (si_status = (tsk->exit_code & 0x7f)). I think it'll still
have a si_code of CLD_KILLED, but it'd be enough for a parent to
differentiate the task-death path. I'll try it out before I post
another patch rev.
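
A minimal sketch of how a parent could tell this death path apart, assuming Linux with glibc. The kernel-side do_exit(SIGSYS) is only imitated here by a child that raises SIGSYS with the default action, and all function names are hypothetical. Depending on whether a core was written, si_code may be CLD_DUMPED instead of CLD_KILLED, so the check accepts both.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <signal.h>
#include <string.h>
#include <sys/resource.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Stand-in for the kernel calling do_exit(SIGSYS): the child simply dies
 * from SIGSYS with the default action. Returns the child's pid. */
pid_t spawn_child_killed_by_sigsys(void)
{
    pid_t pid = fork();
    if (pid == 0) {
        struct rlimit no_core = { 0, 0 };
        sigset_t set;

        setrlimit(RLIMIT_CORE, &no_core);   /* try not to leave a core file */
        signal(SIGSYS, SIG_DFL);
        sigemptyset(&set);
        sigaddset(&set, SIGSYS);
        sigprocmask(SIG_UNBLOCK, &set, NULL);
        raise(SIGSYS);
        _exit(1);                           /* only reached if SIGSYS did not kill us */
    }
    return pid;
}

/* Collect the same information SIGCHLD would carry (si_code, si_status). */
int reap_child(pid_t pid, siginfo_t *info)
{
    memset(info, 0, sizeof(*info));
    return waitid(P_PID, (id_t)pid, info, WEXITED);
}

/* Returns 1 if the child was terminated by a signal and that signal was SIGSYS. */
int died_of_sigsys(const siginfo_t *info)
{
    int by_signal = info->si_code == CLD_KILLED || info->si_code == CLD_DUMPED;
    return by_signal && info->si_status == SIGSYS;
}
```

A supervisor would call reap_child() from its wait loop and treat died_of_sigsys() as "killed for violating policy" rather than as an ordinary SIGKILL.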

A variant that allowed extended exit information would be useful
(especially for this patch series), but I'm not sure I'd know where to
start.

cheers!
will

2012-02-23 17:39:23

by Roland McGrath

[permalink] [raw]
Subject: Re: [PATCH v10 07/11] signal, x86: add SIGSYS info and make it synchronous.

On Wed, Feb 22, 2012 at 5:06 PM, H. Peter Anvin <[email protected]> wrote:
> I meant whether or not a signal can be blocked/caught and the fact that
> the signal exists at all.
>
> Now I guess we could have "blockable" and "unblockable" SIGSYS, but that
> would seem to have its own set of issues...

Oh. I certainly don't think we should ever add any new signals to the set
that cannot be caught, blocked, or ignored. That has been just SIGKILL and
SIGSTOP since 4.2BSD, which first introduced the modern concept of blocking
signals. There are lots of reasons not to change that, which I won't go
into unless someone really wants me to.

However, I don't think there is anything really wrong with having certain
cases that generate a signal and at the same time unblock it and reset it
to SIG_DFL. That's just an implementation detail of a policy of "dump core
right now, no other option". (Conversely, directly calling do_exit won't
ever dump core, though it can be made to look signalesque to the parent and
tracers.)

For seccomp-filter, I personally don't see any problem with simply
generating SIGSYS in the normal way (and aborting the syscall, of course).
If someone wants to ensure that SIGSYS is never caught or blocked, they can
just do that by having a filter that doesn't allow it to be caught or
blocked (and of course make sure to reset its inherited state). It is a
bit tricky to cover all the ways, since it's not just sigaction and
sigprocmask but also sigreturn, where the blocked signal set to be restored
is in a slightly arcane location--but it ain't rocket science.

But I don't really have any strong opinion about what seccomp-filter should
do. (Though it does seem worthwhile not to rule out the possibility of
dumping core on a policy violation, since that will be useful for people to
debug their code.)


Thanks,
Roland

2012-02-23 19:27:04

by Will Drewry

[permalink] [raw]
Subject: Re: [PATCH v10 07/11] signal, x86: add SIGSYS info and make it synchronous.

On Thu, Feb 23, 2012 at 11:38 AM, Roland McGrath <[email protected]> wrote:
> On Wed, Feb 22, 2012 at 5:06 PM, H. Peter Anvin <[email protected]> wrote:
>> I meant whether or not a signal can be blocked/caught and the fact that
>> the signal exists at all.
>>
>> Now I guess we could have "blockable" and "unblockable" SIGSYS, but that
>> would seem to have its own set of issues...
>
> Oh. I certainly don't think we should ever add any new signals to the set
> that cannot be caught, blocked, or ignored. That has been just SIGKILL and
> SIGSTOP since 4.2BSD, which first introduced the modern concept of blocking
> signals. There are lots of reasons not to change that, which I won't go
> into unless someone really wants me to.
>
> However, I don't think there is anything really wrong with having certain
> cases that generate a signal and at the same time unblock it and reset it
> to SIG_DFL. That's just an implementation detail of a policy of "dump core
> right now, no other option". (Conversely, directly calling do_exit won't
> ever dump core, though it can be made to look signalesque to the parent and
> tracers.)
>
> For seccomp-filter, I personally don't see any problem with simply
> generating SIGSYS in the normal way (and aborting the syscall, of course).
> If someone wants to ensure that SIGSYS is never caught or blocked, they can
> just do that by having a filter that doesn't allow it to be caught or
> blocked (and of course make sure to reset its inherited state). It is a
> bit tricky to cover all the ways, since it's not just sigaction and
> sigprocmask but also sigreturn, where the blocked signal set to be restored
> is in a slightly arcane location--but it ain't rocket science.
>
> But I don't really have any strong opinion about what seccomp-filter should
> do. (Though it does seem worthwhile not to rule out the possibility of
> dumping core on a policy violation, since that will be useful for people to
> debug their code.)

Seems like there's an argument for another return code,
SECCOMP_RET_CORE, that resets/unblocks the SIGSYS handler since the
existing TRAP and KILL options seem to cover the other paths (signal
handler and do_exit).

It's a very small tweak if that'd be useful to include explicitly.

Thanks!
will

2012-02-23 22:16:11

by Indan Zupancic

[permalink] [raw]
Subject: Re: [PATCH v10 07/11] signal, x86: add SIGSYS info and make it synchronous.

On Thu, February 23, 2012 20:26, Will Drewry wrote:
> Seems like there's an argument for another return code,
> SECCOMP_RET_CORE, that resets/unblocks the SIGSYS handler since the
> existing TRAP and KILL options seem to cover the other paths (signal
> handler and do_exit).

What about making SECCOMP_RET_TRAP dump core/send SIGSYS if there is
no tracer with PTRACE_O_SECCOMP set? And perhaps go for a blockable
SIGSYS? That way you only have KILL, ERRNO and TRAP, with the last
one meaning deny, but giving someone else a chance to do something.
Or is that just confusing?

I don't think there should be too many return values, or else you
put too much runtime policy into the filters.

Sending SIGSYS is useful, but it's quite a bit less useful if user
space can't handle it in a signal handler, so I don't think it's
worth it to make an unblockable version.

Greetings,

Indan

2012-02-23 22:33:27

by Markus Gutschke

[permalink] [raw]
Subject: Re: [PATCH v10 07/11] signal, x86: add SIGSYS info and make it synchronous.

On Thu, Feb 23, 2012 at 14:15, Indan Zupancic <[email protected]> wrote:
> What about making SECCOMP_RET_TRAP dump core/send SIGSYS if there is
> no tracer with PTRACE_O_SECCOMP set?

Please don't make things dependent on having a tracer. There are
applications that don't really need a tracer; in fact, these are
typically the exact same applications that can benefit from receiving
SIGSYS and then handling it internally.

If a tracer was required to set this up, it would make it difficult to
use gdb, strace, or any other common debugging tools.

> Sending SIGSYS is useful, but it's quite a bit less useful if user
> space can't handle it in a signal handler, so I don't think it's
> worth it to make an unblockable version.

Maybe I am not parsing your e-mail correctly. But don't we already
get the desired behavior, if SIGSYS is treated the same as any other
synchronous signal? If it is unblocked and has a handler, the
application can decide to handle it. If neither one of these
conditions is true, it terminates the program. Ulimits and
PR_SET_DUMPABLE determine whether a core file is generated.


Markus

2012-02-23 22:34:50

by Will Drewry

[permalink] [raw]
Subject: Re: [kernel-hardening] Re: [PATCH v10 07/11] signal, x86: add SIGSYS info and make it synchronous.

On Thu, Feb 23, 2012 at 4:15 PM, Indan Zupancic <[email protected]> wrote:
> On Thu, February 23, 2012 20:26, Will Drewry wrote:
>> Seems like there's an argument for another return code,
>> SECCOMP_RET_CORE, that resets/unblocks the SIGSYS handler since the
>> existing TRAP and KILL options seem to cover the other paths (signal
>> handler and do_exit).
>
> What about making SECCOMP_RET_TRAP dump core/send SIGSYS if there is
> no tracer with PTRACE_O_SECCOMP set? And perhaps go for a blockable
> SIGSYS? That way you only have KILL, ERRNO and TRAP, with the last
> one meaning deny, but giving someone else a chance to do something.
> Or is that just confusing?

I don't think it makes sense to mix up signal delivery for in-process
handling and ptrace. In particular, TRACE calls must assume that the
ptracer actually enacted a policy, but with TRAP as is, it always
rejects it.

> I don't think there should be too many return values, or else you
> put too much runtime policy into the filters.

I'd rather make it explicit than not. This will be a quagmire if any
behavior is implicit.

> Sending SIGSYS is useful, but it's quite a bit less useful if user
> space can't handle it in a signal handler, so I don't think it's
> worth it to make an unblockable version.

I believe the point here would be that you'd get a useful coredump
without needing to enforce that the process can't handle normal SIGSYS
or other syscalls by blocking signal masking.

cheers!
will

2012-02-23 22:36:27

by Will Drewry

[permalink] [raw]
Subject: Re: [PATCH v10 07/11] signal, x86: add SIGSYS info and make it synchronous.

On Thu, Feb 23, 2012 at 4:33 PM, Markus Gutschke <[email protected]> wrote:
> On Thu, Feb 23, 2012 at 14:15, Indan Zupancic <[email protected]> wrote:
>> What about making SECCOMP_RET_TRAP dump core/send SIGSYS if there is
>> no tracer with PTRACE_O_SECCOMP set?
>
> Please don't make things dependent on having a tracer. There are
> applications that don't really need a tracer; in fact, these are
> typically the exact same applications that can benefit from receiving
> SIGSYS and then handling it internally.
>
> If a tracer was required to set this up, it would make it difficult to
> use gdb, strace, or any other common debugging tools.
>
>> Sending SIGSYS is useful, but it's quite a bit less useful if user
>> space can't handle it in a signal handler, so I don't think it's
>> worth it to make an unblockable version.
>
> Maybe, I am not parsing your e-mail correctly. But don't we already
> get the desired behavior, if SIGSYS is treated the same as any other
> synchronous signal? If it is unblocked and has a handler, the
> application can decide to handle it. If neither one of these
> conditions is true, it terminates the program. Ulimits and
> PR_SET_DUMPABLE determine whether a core file is generated.

Yeah - the current patchset does that just fine. The tweak I was
proposing was making it possible to deliver a SIGSYS that always uses
SIG_DFL so that you don't have to play with signal call enforcement in
the filters.

This is a pretty minor tweak either way.
cheers!
will

2012-02-27 12:32:18

by Indan Zupancic

[permalink] [raw]
Subject: Re: [PATCH v10 07/11] signal, x86: add SIGSYS info and make it synchronous.

On Thu, February 23, 2012 23:33, Markus Gutschke wrote:
> On Thu, Feb 23, 2012 at 14:15, Indan Zupancic <[email protected]> wrote:
>> What about making SECCOMP_RET_TRAP dump core/send SIGSYS if there is
>> no tracer with PTRACE_O_SECCOMP set?
>
> Please don't make things dependent on having a tracer. There are
> applications that don't really need a tracer; in fact, these are
> typically the exact same applications that can benefit from receiving
> SIGSYS and then handling it internally.

My proposal was to send the SIGSYS only when there is no seccomp aware
tracer. If there is no such tracer, the process will receive a SIGSYS
that it can handle internally. So having a tracer isn't required.

I'm curious how you would like to handle SIGSYSs internally, because I
don't see how you could gracefully recover from such a failed system call,
so I don't really see the added value compared to failing the syscall with
some ERRNO or just killing the task. Is it just for notification purposes?

>
> If a tracer was required to set this up, it would make it difficult to
> use gdb, strace, or any other common debugging tools.

gdb and strace and such won't set the PTRACE_O_SECCOMP option, so it
will behave the same whether it's being debugged or not.

The main objective was to reduce the amount of policy in filters,
I thought it could be done by having only one return value which
delegates to user space. But that may be too confusing, and the
interaction between a seccomp aware tracer and SIGSYS aware code
is fuzzy, so I'm not sure if it's a good idea.

>
>> Sending SIGSYS is useful, but it's quite a bit less useful if user
>> space can't handle it in a signal handler, so I don't think it's
>> worth it to make an unblockable version.
>
> Maybe, I am not parsing your e-mail correctly. But don't we already
> get the desired behavior, if SIGSYS is treated the same as any other
> synchronous signal? If it is unblocked and has a handler, the
> application can decide to handle it. If neither one of these
> conditions is true, it terminates the program. Ulimits and
> PR_SET_DUMPABLE determine whether a core file is generated.

The proposal I was replying to wanted to make SIGSYS always kill the
process (with a core dump), so you wouldn't be able to set a handler
any more. I think that is a bad idea. Or did I misunderstand?

Enforcing task termination when there is no handler doesn't make
conceptual sense, because an empty signal handler is effectively
the same as blocking a signal. Though I guess it's simpler to check
for just sigaction in the BPF filters, so perhaps that was the idea.

Greetings,

Indan

2012-02-27 16:21:39

by Will Drewry

[permalink] [raw]
Subject: Re: [PATCH v10 07/11] signal, x86: add SIGSYS info and make it synchronous.

On Mon, Feb 27, 2012 at 6:32 AM, Indan Zupancic <[email protected]> wrote:
> On Thu, February 23, 2012 23:33, Markus Gutschke wrote:
>> On Thu, Feb 23, 2012 at 14:15, Indan Zupancic <[email protected]> wrote:
>>> What about making SECCOMP_RET_TRAP dump core/send SIGSYS if there is
>>> no tracer with PTRACE_O_SECCOMP set?
>>
>> Please don't make things dependent on having a tracer. There are
>> applications that don't really need a tracer; in fact, these are
>> typically the exact same applications that can benefit from receiving
>> SIGSYS and then handling it internally.
>
> My proposal was to send the SIGSYS only when there is no seccomp aware
> tracer. If there is no such tracer, the process will receive a SIGSYS
> that it can handle internally. So having a tracer isn't required.
>
> I'm curious how you would like to handle SIGSYSs internally, because I
> don't see how you could gracefully recover from such a failed system call,
> so I don't really see the added value compared to failing the syscall with
> some ERRNO or just killing the task. Is it just for notification purposes?

Take a look at samples/seccomp/bpf-direct.c. You can emulate the call
which can be useful for patching up external code. This is
especially useful if you don't want to hand patch every library that
is doing something you don't like without a full supervisor framework
(e.g., glibc checking for an nscd.socket file).

Another use is implementing a signal-handler based system call
delegation system. E.g., set up your fds with a broker, then have the
handler pass the requested syscall number and desired arguments over to
the broker, which can pass back an fd or whatever is appropriate. Then
the return from the syscall can be fixed up.
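
A minimal sketch of the "emulate the call in the handler" idea, assuming Linux on x86-64 or arm64 with glibc and a kernel that supports seccomp filters. This is not the code from samples/seccomp/bpf-direct.c. The filter layout, the choice of getppid as the trapped call, FAKE_PPID, and all function names are hypothetical, and PR_SET_NO_NEW_PRIVS is used so the sketch can run unprivileged.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <signal.h>
#include <stddef.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <sys/wait.h>
#include <ucontext.h>
#include <unistd.h>
#include <linux/audit.h>
#include <linux/filter.h>
#include <linux/seccomp.h>

#ifndef SYS_SECCOMP
#define SYS_SECCOMP 1            /* si_code of a seccomp-generated SIGSYS */
#endif
#if defined(__x86_64__)
#define DEMO_ARCH AUDIT_ARCH_X86_64
#ifndef REG_RAX
#define REG_RAX 13               /* fallback if glibc did not expose the name */
#endif
#elif defined(__aarch64__)
#define DEMO_ARCH AUDIT_ARCH_AARCH64
#else
#error "this sketch only covers x86-64 and arm64"
#endif
#define FAKE_PPID 4242           /* hypothetical emulated result */

static volatile sig_atomic_t trapped_nr = -1;

/* SIGSYS handler: record the trapped syscall and patch its return value. */
static void emulate(int sig, siginfo_t *info, void *vctx)
{
    ucontext_t *ctx = vctx;

    (void)sig;
    if (info->si_code != SYS_SECCOMP)
        return;
    trapped_nr = info->si_syscall;
#if defined(__x86_64__)
    ctx->uc_mcontext.gregs[REG_RAX] = FAKE_PPID;   /* rax holds the return value */
#else
    ctx->uc_mcontext.regs[0] = FAKE_PPID;          /* x0 holds the return value */
#endif
}

/* Runs in the child: trap getppid, then call it and check the fixed-up result. */
static int child_body(void)
{
    struct sock_filter insns[] = {
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, arch)),
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, DEMO_ARCH, 1, 0),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL),      /* unexpected arch */
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, SYS_getppid, 0, 1),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_TRAP),      /* deliver SIGSYS */
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
    };
    struct sock_fprog prog = {
        .len = sizeof(insns) / sizeof(insns[0]),
        .filter = insns,
    };
    struct sigaction sa = { 0 };

    sa.sa_sigaction = emulate;
    sa.sa_flags = SA_SIGINFO;
    if (sigaction(SIGSYS, &sa, NULL) != 0)
        return 2;
    if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0)
        return 3;
    if (prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog) != 0)
        return 4;

    long ret = syscall(SYS_getppid);
    return (trapped_nr == SYS_getppid && ret == FAKE_PPID) ? 0 : 1;
}

/* Returns 0 if the child saw the trap and the emulated return value. */
int run_trap_demo(void)
{
    int status;
    pid_t pid = fork();

    if (pid < 0)
        return -1;
    if (pid == 0)
        _exit(child_body());
    if (waitpid(pid, &status, 0) != pid)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

The filter is installed in a forked child so the calling process stays unfiltered. A broker-based design would replace the constant in emulate() with a request sent over a pre-opened descriptor.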


>>
>> If a tracer was required to set this up, it would make it difficult to
>> use gdb, strace, or any other common debugging tools.
>
> gdb and strace and such won't set the PTRACE_O_SECCOMP option, so it
> will behave the same whether it's being debugged or not.
>
> The main objective was to reduce the amount of policy in filters,
> I thought it could be done by having only one return value which
> delegates to user space. But that may be too confusing, and the
> interaction between a seccomp aware tracer and SIGSYS aware code
> is fuzzy, so I'm not sure if it's a good idea.

Yeah - I want to avoid as much implicit behavior as possible. It's a
trap I regularly fall into, and both you and luto@ have kept me honest
this time. I don't want to regress :)

>>
>>> Sending SIGSYS is useful, but it's quite a bit less useful if user
>>> space can't handle it in a signal handler, so I don't think it's
>>> worth it to make an unblockable version.
>>
>> Maybe, I am not parsing your e-mail correctly. But don't we already
>> get the desired behavior, if SIGSYS is treated the same as any other
>> synchronous signal? If it is unblocked and has a handler, the
>> application can decide to handle it. If neither one of these
>> conditions is true, it terminates the program. Ulimits and
>> PR_SET_DUMPABLE determine whether a core file is generated.
>
> The proposal I was replying to wanted to make SIGSYS always kill the
> process (with a core dump), so you wouldn't be able to set a handler
> any more. I think that is a bad idea. Or did I misunderstand?

Yeah - I suspect some crossed wires. The idea of a forced core dump
seems useful in some scenarios, but it is not hard to synthesize with
the RET_TRAP already. If something like RET_CORE is more obviously
useful after we have all the proposed consumers using this interface,
then it would make sense to entertain it then.

> Enforcing task termination when there is no handler doesn't make
> conceptual sense, because an empty signal handler is effectively
> the same as blocking a signal. Though I guess it's simpler to check
> for just sigaction in the BPF filters, so perhaps that was the idea.

Pretty much, I guess. Right now RET_TRAP calls force_siginfo which
will either use an installed handler or unblock the signal and use
SIG_DFL, which just dumps core. So the tweak would be to just set
it back to SIG_DFL prior to delivery (like force_sigsegv does when it
detects it is double-faulting).

Anyway, I think that forcing a coredump with a return code is overkill
at this point. do_exit(SIGSYS) is nice and so is using RET_TRAP to
trigger a core. Whether RET_TRACE should emit a SIGSYS when a tracer
isn't present is less clear to me, but I think keeping behavior
explicit will end up leading to the fewest mistakes and
breaks.

Thanks!
will