This change allows CONFIG_SECCOMP to make use of BPF programs for
user-controlled system call filtering (as shown in this patch series).
To minimize the impact on existing BPF evaluation, function pointer
use must be declared at sk_chk_filter-time. This allows ancillary
load instructions to be generated that use the function pointer rather
than adding _any_ code to the existing LD_* instruction paths.
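As a rough sketch (not code in this patch), a non-skb consumer would be
expected to drive the new entry points like this; struct my_blob, my_pointer
and my_length are hypothetical stand-ins for whatever the caller evaluates
the program against:

  struct my_blob {
          u32 len;
          u8 bytes[];
  };

  static void *my_pointer(const void *data, int k, unsigned int size,
                          void *buffer)
  {
          const struct my_blob *blob = data;

          if (k < 0 || k + size > blob->len)
                  return NULL;
          /* the callback is responsible for any byte order handling */
          memcpy(buffer, blob->bytes + k, size);
          return buffer;
  }

  static u32 my_length(const void *data)
  {
          return ((const struct my_blob *)data)->len;
  }

  static const struct bpf_load_fns my_fns = { my_pointer, my_length };

  /* attach time: skb-only instructions get rejected up front */
  if (bpf_chk_filter(filter, flen, BPF_CHK_FLAGS_NO_SKB))
          return -EINVAL;

  /* evaluation time: run against arbitrary data instead of an skb */
  ret = bpf_run_filter(blob, filter, &my_fns);

The seccomp patches later in the series do exactly this with their
bpf_pointer/bpf_length helpers.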
Crude performance numbers using udpflood -l 10000000 against dummy0.
3 trials for the baseline, 3 with tcpdump attached. Averaged, then differenced
(the per-call figures are derived as shown after the numbers). Trials with
hard-to-believe results were repeated at least a couple more times.
* x86 32-bit (Atom N570 @ 1.66 GHz 2 core HT) [stackprot]:
- Without: 94.05s - 76.36s = 17.68s
- With: 86.22s - 73.30s = 12.92s
- Slowdown per call: -476 nanoseconds
* x86 32-bit (Atom N570 @ 1.66 GHz 2 core HT) [no stackprot]:
- Without: 92.06s - 77.81s = 14.25s
- With: 91.77s - 76.91s = 14.86s
- Slowdown per call: +61 nanoseconds
* x86 64-bit (Atom N570 @ 1.66 GHz 2 core HT) [stackprot]:
- Without: 122.58s - 99.54s = 23.04s
- With: 115.52s - 98.99s = 16.53s
- Slowdown per call: -651 nanoseconds
* x86 64-bit (Atom N570 @ 1.66 GHz 2 core HT) [no stackprot]:
- Without: 114.95s - 91.92s = 23.03s
- With: 110.47s - 90.79s = 19.68s
- Slowdown per call: -335 nanoseconds
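(For reference, each per-call figure is just the difference between the two
deltas spread over the 10,000,000 calls, e.g. for x86-32 [stackprot]:
  12.92s - 17.68s = -4.76s over 10,000,000 calls ~= -476 ns/call.)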
The x86-32-nossp number makes sense: added register pressure always
makes x86-32 sad. If this is a concern, I could change the call
approach to bpf_run_filter to see if I can alleviate it a bit.
That said, the x86-*-ssp numbers show a marked increase in performance.
I've tested and retested and I keep getting these results. I'm also
surprised by the nossp speed-up on 64-bit, but I don't have an explanation;
I haven't looked at the full disassembly of the call path. If that is needed
to explain the performance differences I'm seeing, please let me know. Or if
there is a preferred CPU to run this against - Atoms can be a little weird.
v8: - fixed variable positioning and bad cast ([email protected])
- no longer passes A as a pointer (inspection of x86 asm shows A is
%ebx again; thanks [email protected])
- cleaned up switch macros and expanded use
([email protected], [email protected])
- added length fn pointer and handled LD_W_LEN/LDX_W_LEN
- moved from a wrapping struct to a typedef for the function
pointer. (matches existing function pointer style)
- added comprehensive comment above the typedef.
- benchmarks
v7: - first cut
Signed-off-by: Will Drewry <[email protected]>
---
include/linux/filter.h | 69 +++++++++++++++++++++-
net/core/filter.c | 152 +++++++++++++++++++++++++++++++++++++----------
2 files changed, 185 insertions(+), 36 deletions(-)
diff --git a/include/linux/filter.h b/include/linux/filter.h
index 8eeb205..d22ad46 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -110,6 +110,9 @@ struct sock_fprog { /* Required for SO_ATTACH_FILTER. */
*/
#define BPF_MEMWORDS 16
+/* BPF program (checking) flags */
+#define BPF_CHK_FLAGS_NO_SKB 1
+
/* RATIONALE. Negative offsets are invalid in BPF.
We use them to reference ancillary data.
Unlike introduction new instructions, it does not break
@@ -145,17 +148,67 @@ struct sk_filter
struct sock_filter insns[0];
};
+/**
+ * struct bpf_load_fns - callbacks for bpf_run_filter
+ * These functions are called by bpf_run_filter if bpf_chk_filter
+ * was invoked with BPF_CHK_FLAGS_NO_SKB.
+ *
+ * pointer:
+ * @data: const pointer to the data passed into bpf_run_filter
+ * @k: offset into @data
+ * @size: the size of the requested data in bytes: 1, 2, or 4.
+ * @buffer: If non-NULL, a 32-bit buffer for staging data.
+ *
+ * Returns a pointer to the requested data.
+ *
+ * This function operates similarly to load_pointer in net/core/filter.c
+ * except that the pointer to the returned data must already be
+ * byteswapped as appropriate to the source data and endianness.
+ * @buffer may be used if the data needs to be staged.
+ *
+ * length:
+ * @data: const pointer to the data passed into bpf_run_filter
+ *
+ * Returns the length of the data.
+ */
+struct bpf_load_fns {
+ void *(*pointer)(const void *data, int k, unsigned int size,
+ void *buffer);
+ u32 (*length)(const void *data);
+};
+
static inline unsigned int sk_filter_len(const struct sk_filter *fp)
{
return fp->len * sizeof(struct sock_filter) + sizeof(*fp);
}
+extern unsigned int bpf_run_filter(const void *data,
+ const struct sock_filter *filter,
+ const struct bpf_load_fns *load_fn);
+
+/**
+ * sk_run_filter - run a filter on a socket
+ * @skb: buffer to run the filter on
+ * @filter: filter to apply
+ *
+ * Runs bpf_run_filter with the struct sk_buff-specific data
+ * accessor behavior.
+ */
+static inline unsigned int sk_run_filter(const struct sk_buff *skb,
+ const struct sock_filter *filter)
+{
+ return bpf_run_filter(skb, filter, NULL);
+}
+
extern int sk_filter(struct sock *sk, struct sk_buff *skb);
-extern unsigned int sk_run_filter(const struct sk_buff *skb,
- const struct sock_filter *filter);
extern int sk_attach_filter(struct sock_fprog *fprog, struct sock *sk);
extern int sk_detach_filter(struct sock *sk);
-extern int sk_chk_filter(struct sock_filter *filter, unsigned int flen);
+extern int bpf_chk_filter(struct sock_filter *filter, unsigned int flen, u32 flags);
+
+static inline int sk_chk_filter(struct sock_filter *filter, unsigned int flen)
+{
+ return bpf_chk_filter(filter, flen, 0);
+}
#ifdef CONFIG_BPF_JIT
extern void bpf_jit_compile(struct sk_filter *fp);
@@ -228,6 +281,16 @@ enum {
BPF_S_ANC_HATYPE,
BPF_S_ANC_RXHASH,
BPF_S_ANC_CPU,
+ /* Used to differentiate SKB data from generic data */
+ BPF_S_ANC_LD_W_ABS,
+ BPF_S_ANC_LD_H_ABS,
+ BPF_S_ANC_LD_B_ABS,
+ BPF_S_ANC_LD_W_LEN,
+ BPF_S_ANC_LD_W_IND,
+ BPF_S_ANC_LD_H_IND,
+ BPF_S_ANC_LD_B_IND,
+ BPF_S_ANC_LDX_W_LEN,
+ BPF_S_ANC_LDX_B_MSH,
};
#endif /* __KERNEL__ */
diff --git a/net/core/filter.c b/net/core/filter.c
index 5dea452..a5c98a9 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -98,9 +98,10 @@ int sk_filter(struct sock *sk, struct sk_buff *skb)
EXPORT_SYMBOL(sk_filter);
/**
- * sk_run_filter - run a filter on a socket
- * @skb: buffer to run the filter on
+ * bpf_run_filter - run a BPF filter program on a data buffer
+ * @data: buffer to run the filter on
* @fentry: filter to apply
+ * @load_fns: custom data accessor functions
*
* Decode and apply filter instructions to the skb->data.
* Return length to keep, 0 for none. @skb is the data we are
@@ -108,9 +109,13 @@ EXPORT_SYMBOL(sk_filter);
* Because all jumps are guaranteed to be before last instruction,
* and last instruction guaranteed to be a RET, we dont need to check
* flen. (We used to pass to this function the length of filter)
+ *
+ * load_fns is only used if BPF_CHK_FLAGS_NO_SKB was specified
+ * to bpf_chk_filter.
*/
-unsigned int sk_run_filter(const struct sk_buff *skb,
- const struct sock_filter *fentry)
+unsigned int bpf_run_filter(const void *data,
+ const struct sock_filter *fentry,
+ const struct bpf_load_fns *load_fns)
{
void *ptr;
u32 A = 0; /* Accumulator */
@@ -128,6 +133,7 @@ unsigned int sk_run_filter(const struct sk_buff *skb,
#else
const u32 K = fentry->k;
#endif
+#define SKB(_data) ((const struct sk_buff *)(_data))
switch (fentry->code) {
case BPF_S_ALU_ADD_X:
@@ -213,7 +219,7 @@ unsigned int sk_run_filter(const struct sk_buff *skb,
case BPF_S_LD_W_ABS:
k = K;
load_w:
- ptr = load_pointer(skb, k, 4, &tmp);
+ ptr = load_pointer(data, k, 4, &tmp);
if (ptr != NULL) {
A = get_unaligned_be32(ptr);
continue;
@@ -222,7 +228,7 @@ load_w:
case BPF_S_LD_H_ABS:
k = K;
load_h:
- ptr = load_pointer(skb, k, 2, &tmp);
+ ptr = load_pointer(data, k, 2, &tmp);
if (ptr != NULL) {
A = get_unaligned_be16(ptr);
continue;
@@ -231,17 +237,17 @@ load_h:
case BPF_S_LD_B_ABS:
k = K;
load_b:
- ptr = load_pointer(skb, k, 1, &tmp);
+ ptr = load_pointer(data, k, 1, &tmp);
if (ptr != NULL) {
A = *(u8 *)ptr;
continue;
}
return 0;
case BPF_S_LD_W_LEN:
- A = skb->len;
+ A = SKB(data)->len;
continue;
case BPF_S_LDX_W_LEN:
- X = skb->len;
+ X = SKB(data)->len;
continue;
case BPF_S_LD_W_IND:
k = X + K;
@@ -253,7 +259,7 @@ load_b:
k = X + K;
goto load_b;
case BPF_S_LDX_B_MSH:
- ptr = load_pointer(skb, K, 1, &tmp);
+ ptr = load_pointer(data, K, 1, &tmp);
if (ptr != NULL) {
X = (*(u8 *)ptr & 0xf) << 2;
continue;
@@ -288,29 +294,29 @@ load_b:
mem[K] = X;
continue;
case BPF_S_ANC_PROTOCOL:
- A = ntohs(skb->protocol);
+ A = ntohs(SKB(data)->protocol);
continue;
case BPF_S_ANC_PKTTYPE:
- A = skb->pkt_type;
+ A = SKB(data)->pkt_type;
continue;
case BPF_S_ANC_IFINDEX:
- if (!skb->dev)
+ if (!SKB(data)->dev)
return 0;
- A = skb->dev->ifindex;
+ A = SKB(data)->dev->ifindex;
continue;
case BPF_S_ANC_MARK:
- A = skb->mark;
+ A = SKB(data)->mark;
continue;
case BPF_S_ANC_QUEUE:
- A = skb->queue_mapping;
+ A = SKB(data)->queue_mapping;
continue;
case BPF_S_ANC_HATYPE:
- if (!skb->dev)
+ if (!SKB(data)->dev)
return 0;
- A = skb->dev->type;
+ A = SKB(data)->dev->type;
continue;
case BPF_S_ANC_RXHASH:
- A = skb->rxhash;
+ A = SKB(data)->rxhash;
continue;
case BPF_S_ANC_CPU:
A = raw_smp_processor_id();
@@ -318,15 +324,15 @@ load_b:
case BPF_S_ANC_NLATTR: {
struct nlattr *nla;
- if (skb_is_nonlinear(skb))
+ if (skb_is_nonlinear(SKB(data)))
return 0;
- if (A > skb->len - sizeof(struct nlattr))
+ if (A > SKB(data)->len - sizeof(struct nlattr))
return 0;
- nla = nla_find((struct nlattr *)&skb->data[A],
- skb->len - A, X);
+ nla = nla_find((struct nlattr *)&SKB(data)->data[A],
+ SKB(data)->len - A, X);
if (nla)
- A = (void *)nla - (void *)skb->data;
+ A = (void *)nla - (void *)SKB(data)->data;
else
A = 0;
continue;
@@ -334,22 +340,71 @@ load_b:
case BPF_S_ANC_NLATTR_NEST: {
struct nlattr *nla;
- if (skb_is_nonlinear(skb))
+ if (skb_is_nonlinear(SKB(data)))
return 0;
- if (A > skb->len - sizeof(struct nlattr))
+ if (A > SKB(data)->len - sizeof(struct nlattr))
return 0;
- nla = (struct nlattr *)&skb->data[A];
- if (nla->nla_len > A - skb->len)
+ nla = (struct nlattr *)&SKB(data)->data[A];
+ if (nla->nla_len > A - SKB(data)->len)
return 0;
nla = nla_find_nested(nla, X);
if (nla)
- A = (void *)nla - (void *)skb->data;
+ A = (void *)nla - (void *)SKB(data)->data;
else
A = 0;
continue;
}
+ case BPF_S_ANC_LD_W_ABS:
+ k = K;
+load_fn_w:
+ ptr = load_fns->pointer(data, k, 4, &tmp);
+ if (ptr) {
+ A = *(u32 *)ptr;
+ continue;
+ }
+ return 0;
+ case BPF_S_ANC_LD_H_ABS:
+ k = K;
+load_fn_h:
+ ptr = load_fns->pointer(data, k, 2, &tmp);
+ if (ptr) {
+ A = *(u16 *)ptr;
+ continue;
+ }
+ return 0;
+ case BPF_S_ANC_LD_B_ABS:
+ k = K;
+load_fn_b:
+ ptr = load_fns->pointer(data, k, 1, &tmp);
+ if (ptr) {
+ A = *(u8 *)ptr;
+ continue;
+ }
+ return 0;
+ case BPF_S_ANC_LDX_B_MSH:
+ ptr = load_fns->pointer(data, K, 1, &tmp);
+ if (ptr) {
+ X = (*(u8 *)ptr & 0xf) << 2;
+ continue;
+ }
+ return 0;
+ case BPF_S_ANC_LD_W_IND:
+ k = X + K;
+ goto load_fn_w;
+ case BPF_S_ANC_LD_H_IND:
+ k = X + K;
+ goto load_fn_h;
+ case BPF_S_ANC_LD_B_IND:
+ k = X + K;
+ goto load_fn_b;
+ case BPF_S_ANC_LD_W_LEN:
+ A = load_fns->length(data);
+ continue;
+ case BPF_S_ANC_LDX_W_LEN:
+ X = load_fns->length(data);
+ continue;
default:
WARN_RATELIMIT(1, "Unknown code:%u jt:%u tf:%u k:%u\n",
fentry->code, fentry->jt,
@@ -360,7 +415,7 @@ load_b:
return 0;
}
-EXPORT_SYMBOL(sk_run_filter);
+EXPORT_SYMBOL(bpf_run_filter);
/*
* Security :
@@ -423,9 +478,10 @@ error:
}
/**
- * sk_chk_filter - verify socket filter code
+ * bpf_chk_filter - verify socket filter BPF code
* @filter: filter to verify
* @flen: length of filter
+ * @flags: May be BPF_CHK_FLAGS_NO_SKB or 0
*
* Check the user's filter code. If we let some ugly
* filter code slip through kaboom! The filter must contain
@@ -434,9 +490,13 @@ error:
*
* All jumps are forward as they are not signed.
*
+ * If BPF_CHK_FLAGS_NO_SKB is set in flags, any SKB-specific
+ * rules become illegal and a custom set of bpf_load_fns will
+ * be expected by bpf_run_filter.
+ *
* Returns 0 if the rule set is legal or -EINVAL if not.
*/
-int sk_chk_filter(struct sock_filter *filter, unsigned int flen)
+int bpf_chk_filter(struct sock_filter *filter, unsigned int flen, u32 flags)
{
/*
* Valid instructions are initialized to non-0.
@@ -542,9 +602,35 @@ int sk_chk_filter(struct sock_filter *filter, unsigned int flen)
pc + ftest->jf + 1 >= flen)
return -EINVAL;
break;
+#define MAYBE_USE_LOAD_FN(CODE) \
+ if (flags & BPF_CHK_FLAGS_NO_SKB) { \
+ code = BPF_S_ANC_##CODE; \
+ break; \
+ }
+ case BPF_S_LD_W_LEN:
+ MAYBE_USE_LOAD_FN(LD_W_LEN);
+ break;
+ case BPF_S_LDX_W_LEN:
+ MAYBE_USE_LOAD_FN(LDX_W_LEN);
+ break;
+ case BPF_S_LD_W_IND:
+ MAYBE_USE_LOAD_FN(LD_W_IND);
+ break;
+ case BPF_S_LD_H_IND:
+ MAYBE_USE_LOAD_FN(LD_H_IND);
+ break;
+ case BPF_S_LD_B_IND:
+ MAYBE_USE_LOAD_FN(LD_B_IND);
+ break;
+ case BPF_S_LDX_B_MSH:
+ MAYBE_USE_LOAD_FN(LDX_B_MSH);
+ break;
case BPF_S_LD_W_ABS:
+ MAYBE_USE_LOAD_FN(LD_W_ABS);
case BPF_S_LD_H_ABS:
+ MAYBE_USE_LOAD_FN(LD_H_ABS);
case BPF_S_LD_B_ABS:
+ MAYBE_USE_LOAD_FN(LD_B_ABS);
#define ANCILLARY(CODE) case SKF_AD_OFF + SKF_AD_##CODE: \
code = BPF_S_ANC_##CODE; \
break
@@ -572,7 +658,7 @@ int sk_chk_filter(struct sock_filter *filter, unsigned int flen)
}
return -EINVAL;
}
-EXPORT_SYMBOL(sk_chk_filter);
+EXPORT_SYMBOL(bpf_chk_filter);
/**
* sk_filter_release_rcu - Release a socket filter by rcu_head
--
1.7.5.4
Replaces the seccomp_t typedef with struct seccomp to match modern
kernel style.
v7: struct seccomp_struct -> struct seccomp
v6: original inclusion in this series.
Signed-off-by: Will Drewry <[email protected]>
---
include/linux/sched.h | 2 +-
include/linux/seccomp.h | 10 ++++++----
2 files changed, 7 insertions(+), 5 deletions(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 7d379a6..c30526f 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1418,7 +1418,7 @@ struct task_struct {
uid_t loginuid;
unsigned int sessionid;
#endif
- seccomp_t seccomp;
+ struct seccomp seccomp;
/* Thread group tracking */
u32 parent_exec_id;
diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h
index cc7a4e9..d61f27f 100644
--- a/include/linux/seccomp.h
+++ b/include/linux/seccomp.h
@@ -7,7 +7,9 @@
#include <linux/thread_info.h>
#include <asm/seccomp.h>
-typedef struct { int mode; } seccomp_t;
+struct seccomp {
+ int mode;
+};
extern void __secure_computing(int);
static inline void secure_computing(int this_syscall)
@@ -19,7 +21,7 @@ static inline void secure_computing(int this_syscall)
extern long prctl_get_seccomp(void);
extern long prctl_set_seccomp(unsigned long);
-static inline int seccomp_mode(seccomp_t *s)
+static inline int seccomp_mode(struct seccomp *s)
{
return s->mode;
}
@@ -28,7 +30,7 @@ static inline int seccomp_mode(seccomp_t *s)
#include <linux/errno.h>
-typedef struct { } seccomp_t;
+struct seccomp { };
#define secure_computing(x) do { } while (0)
@@ -42,7 +44,7 @@ static inline long prctl_set_seccomp(unsigned long arg2)
return -EINVAL;
}
-static inline int seccomp_mode(seccomp_t *s)
+static inline int seccomp_mode(struct seccomp *s)
{
return 0;
}
--
1.7.5.4
Enable support for seccomp filter on x86:
- asm/tracehook.h exists
- syscall_get_arguments() works
- syscall_rollback() works
- ptrace_report_syscall() works
- secure_computing() return value is honored (see below)
This also adds support for honoring the return
value from secure_computing().
SECCOMP_RET_TRACE and SECCOMP_RET_TRAP may result in seccomp needing to
skip a system call without killing the process. This is done by
returning a non-zero (-1) value from secure_computing. This change
makes x86 respect that return value.
To ensure that minimal kernel code is exposed, a non-zero return value
results in an immediate return to user space (with an invalid syscall
number).
Signed-off-by: Will Drewry <[email protected]>
---
arch/x86/Kconfig | 1 +
arch/x86/kernel/ptrace.c | 7 ++++++-
2 files changed, 7 insertions(+), 1 deletions(-)
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 5bed94e..4c9012b 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -82,6 +82,7 @@ config X86
select CLKEVT_I8253
select ARCH_HAVE_NMI_SAFE_CMPXCHG
select GENERIC_IOMAP
+ select HAVE_ARCH_SECCOMP_FILTER
config INSTRUCTION_DECODER
def_bool (KPROBES || PERF_EVENTS)
diff --git a/arch/x86/kernel/ptrace.c b/arch/x86/kernel/ptrace.c
index 5026738..90d465a 100644
--- a/arch/x86/kernel/ptrace.c
+++ b/arch/x86/kernel/ptrace.c
@@ -1380,7 +1380,11 @@ long syscall_trace_enter(struct pt_regs *regs)
regs->flags |= X86_EFLAGS_TF;
/* do the secure computing check first */
- secure_computing(regs->orig_ax);
+ if (secure_computing(regs->orig_ax)) {
+ /* seccomp failures shouldn't expose any additional code. */
+ ret = -1L;
+ goto out;
+ }
if (unlikely(test_thread_flag(TIF_SYSCALL_EMU)))
ret = -1L;
@@ -1405,6 +1409,7 @@ long syscall_trace_enter(struct pt_regs *regs)
regs->dx, regs->r10);
#endif
+out:
return ret ?: regs->orig_ax;
}
--
1.7.5.4
Documents how system call filtering using Berkeley Packet
Filter programs works and how it may be used.
Includes an example for x86 (32-bit) and a semi-generic
example using a macro-based code generator.
v8: - add PR_SET_NO_NEW_PRIVS to the samples.
v7: - updated for all the new stuff in v7: TRAP, TRACE
- only talk about PR_SET_SECCOMP now
- fixed bad JLE32 check ([email protected])
- adds dropper.c: a simple system call disabler
v6: - tweak the language to note the requirement of
PR_SET_NO_NEW_PRIVS being called prior to use. ([email protected])
v5: - update sample to use system call arguments
- adds a "fancy" example using a macro-based generator
- cleaned up bpf in the sample
- update docs to mention arguments
- fix prctl value ([email protected])
- language cleanup ([email protected])
v4: - update for no_new_privs use
- minor tweaks
v3: - call out BPF <-> Berkeley Packet Filter ([email protected])
- document use of tentative always-unprivileged
- guard sample compilation for i386 and x86_64
v2: - move code to samples ([email protected])
Signed-off-by: Will Drewry <[email protected]>
---
Documentation/prctl/seccomp_filter.txt | 155 +++++++++++++++++++++
samples/Makefile | 2 +-
samples/seccomp/Makefile | 31 ++++
samples/seccomp/bpf-direct.c | 150 ++++++++++++++++++++
samples/seccomp/bpf-fancy.c | 101 ++++++++++++++
samples/seccomp/bpf-helper.c | 89 ++++++++++++
samples/seccomp/bpf-helper.h | 234 ++++++++++++++++++++++++++++++++
samples/seccomp/dropper.c | 52 +++++++
8 files changed, 813 insertions(+), 1 deletions(-)
create mode 100644 Documentation/prctl/seccomp_filter.txt
create mode 100644 samples/seccomp/Makefile
create mode 100644 samples/seccomp/bpf-direct.c
create mode 100644 samples/seccomp/bpf-fancy.c
create mode 100644 samples/seccomp/bpf-helper.c
create mode 100644 samples/seccomp/bpf-helper.h
create mode 100644 samples/seccomp/dropper.c
diff --git a/Documentation/prctl/seccomp_filter.txt b/Documentation/prctl/seccomp_filter.txt
new file mode 100644
index 0000000..2c6bd12
--- /dev/null
+++ b/Documentation/prctl/seccomp_filter.txt
@@ -0,0 +1,155 @@
+ SECure COMPuting with filters
+ =============================
+
+Introduction
+------------
+
+A large number of system calls are exposed to every userland process
+with many of them going unused for the entire lifetime of the process.
+As system calls change and mature, bugs are found and eradicated. A
+certain subset of userland applications benefit by having a reduced set
+of available system calls. The resulting set reduces the total kernel
+surface exposed to the application. System call filtering is meant for
+use with those applications.
+
+Seccomp filtering provides a means for a process to specify a filter for
+incoming system calls. The filter is expressed as a Berkeley Packet
+Filter (BPF) program, as with socket filters, except that the data
+operated on is related to the system call being made: system call
+number and the system call arguments. This allows for expressive
+filtering of system calls using a filter program language with a long
+history of being exposed to userland and a straightforward data set.
+
+Additionally, BPF makes it impossible for users of seccomp to fall prey
+to time-of-check-time-of-use (TOCTOU) attacks that are common in system
+call interposition frameworks. BPF programs may not dereference
+pointers, which constrains all filters to solely evaluating the system
+call arguments directly.
+
+What it isn't
+-------------
+
+System call filtering isn't a sandbox. It provides a clearly defined
+mechanism for minimizing the exposed kernel surface. It is meant to be
+a tool for sandbox developers to use. Beyond that, policy for logical
+behavior and information flow should be managed with a combination of
+other system hardening techniques and, potentially, an LSM of your
+choosing. Expressive, dynamic filters provide further options down this
+path (avoiding pathological sizes or selecting which of the multiplexed
+system calls in socketcall() is allowed, for instance), which could be
+construed, incorrectly, as a more complete sandboxing solution.
+
+Usage
+-----
+
+An additional seccomp mode is added and is enabled using the same
+prctl(2) call as the strict seccomp. If the architecture has
+CONFIG_HAVE_ARCH_SECCOMP_FILTER, then filters may be added as below:
+
+PR_SET_SECCOMP:
+ Now takes an additional argument which specifies a new filter
+ using a BPF program.
+ The BPF program will be executed over struct seccomp_data
+ reflecting the system call number, arguments, and other
+ metadata. The BPF program must then return one of the
+ acceptable values to inform the kernel which action should be
+ taken.
+
+ Usage:
+ prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, prog);
+
+ The 'prog' argument is a pointer to a struct sock_fprog which
+ will contain the filter program. If the program is invalid, the
+ call will return -1 and set errno to EINVAL.
+
+ Note, is_compat_task is also tracked for the @prog. This means
+ that once set the calling task will have all of its system calls
+ blocked if it switches its system call ABI.
+
+ If fork/clone and execve are allowed by @prog, any child
+ processes will be constrained to the same filters and system
+ call ABI as the parent.
+
+ Prior to use, the task must call prctl(PR_SET_NO_NEW_PRIVS, 1) or
+ run with CAP_SYS_ADMIN privileges in its namespace. If these are not
+ true, -EACCES will be returned. This requirement ensures that filter
+ programs cannot be applied to child processes with greater privileges
+ than the task that installed them.
+
+ Additionally, if prctl(2) is allowed by the attached filter,
+ additional filters may be layered on which will increase evaluation
+ time, but allow for further decreasing the attack surface during
+ execution of a process.
+
+The above call returns 0 on success and non-zero on error.
+
+Return values
+-------------
+
+A seccomp filter may return any of the following values:
+ SECCOMP_RET_ALLOW, SECCOMP_RET_KILL, SECCOMP_RET_TRAP,
+ SECCOMP_RET_ERRNO, or SECCOMP_RET_TRACE.
+
+SECCOMP_RET_ALLOW:
+ If all filters for a given task return this value then
+ the system call will proceed normally.
+
+SECCOMP_RET_KILL:
+ If any filters for a given task return this value, then
+ the task will exit immediately without executing the system
+ call.
+
+SECCOMP_RET_TRAP:
+ If any filters specify SECCOMP_RET_TRAP and none of them
+ specify SECCOMP_RET_KILL, then the kernel will send a SIGTRAP
+ signal to the task and not execute the system call. The kernel
+ will roll back the register state to just before system call
+ entry such that a signal handler in the process will be able
+ to inspect the ucontext_t->uc_mcontext registers and emulate
+ system call success or failure upon return from the signal
+ handler.
+
+ The SIGTRAP is differentiated from other SIGTRAPs by a si_code
+ of TRAP_SECCOMP.
+
+SECCOMP_RET_ERRNO:
+ If returned, the value provided in the lower 16-bits is
+ returned to userland as the errno and the system call is
+ not executed.
+
+SECCOMP_RET_TRACE:
+ If any filters return this value and the others return
+ SECCOMP_RET_ALLOW, then the kernel will attempt to notify
+ a ptrace()-based tracer prior to executing the system call.
+
+ A tracer will be notified if it is attached with
+ ptrace(PTRACE_SECCOMP, ...). Otherwise, the system call will
+ not execute and -ENOSYS will be returned to userspace.
+
+ If the tracer ignores notification, then the system call will
+ proceed normally. Changes to the registers will function
+ similarly to PTRACE_SYSCALL.
+
+Please note that the order of precedence is as follows:
+SECCOMP_RET_KILL, SECCOMP_RET_TRAP, SECCOMP_RET_ERRNO,
+SECCOMP_RET_TRACE, SECCOMP_RET_ALLOW.
+
+If multiple filters exist, the return value for the evaluation of a given
+system call will always use the highest precedence value.
+SECCOMP_RET_KILL will always take precedence.
+
+
+Example
+-------
+
+The samples/seccomp/ directory contains both a 32-bit specific example
+and a more generic example of a higher level macro interface for BPF
+program generation.
+
+Adding architecture support
+---------------------------
+
+See arch/Kconfig for the required functionality. In general, if an
+architecture supports both tracehook and seccomp, it will be able to
+support seccomp filter. Then it must just select
+HAVE_ARCH_SECCOMP_FILTER in its arch-specific Kconfig.
diff --git a/samples/Makefile b/samples/Makefile
index 6280817..f29b19c 100644
--- a/samples/Makefile
+++ b/samples/Makefile
@@ -1,4 +1,4 @@
# Makefile for Linux samples code
obj-$(CONFIG_SAMPLES) += kobject/ kprobes/ tracepoints/ trace_events/ \
- hw_breakpoint/ kfifo/ kdb/ hidraw/
+ hw_breakpoint/ kfifo/ kdb/ hidraw/ seccomp/
diff --git a/samples/seccomp/Makefile b/samples/seccomp/Makefile
new file mode 100644
index 0000000..38922f7
--- /dev/null
+++ b/samples/seccomp/Makefile
@@ -0,0 +1,31 @@
+# kbuild trick to avoid linker error. Can be omitted if a module is built.
+obj- := dummy.o
+
+hostprogs-$(CONFIG_SECCOMP) := bpf-fancy dropper
+bpf-fancy-objs := bpf-fancy.o bpf-helper.o
+
+HOSTCFLAGS_bpf-fancy.o += -I$(objtree)/usr/include
+HOSTCFLAGS_bpf-fancy.o += -idirafter $(objtree)/include
+HOSTCFLAGS_bpf-helper.o += -I$(objtree)/usr/include
+HOSTCFLAGS_bpf-helper.o += -idirafter $(objtree)/include
+
+HOSTCFLAGS_dropper.o += -I$(objtree)/usr/include
+HOSTCFLAGS_dropper.o += -idirafter $(objtree)/include
+dropper-objs := dropper.o
+
+# bpf-direct.c is x86-only.
+ifeq ($(filter-out x86_64 i386,$(KBUILD_BUILDHOST)),)
+# List of programs to build
+hostprogs-$(CONFIG_SECCOMP) += bpf-direct
+bpf-direct-objs := bpf-direct.o
+endif
+
+# Tell kbuild to always build the programs
+always := $(hostprogs-y)
+
+HOSTCFLAGS_bpf-direct.o += -I$(objtree)/usr/include
+HOSTCFLAGS_bpf-direct.o += -idirafter $(objtree)/include
+ifeq ($(KBUILD_BUILDHOST),x86_64)
+HOSTCFLAGS_bpf-direct.o += -m32
+HOSTLOADLIBES_bpf-direct += -m32
+endif
diff --git a/samples/seccomp/bpf-direct.c b/samples/seccomp/bpf-direct.c
new file mode 100644
index 0000000..ffd8adc
--- /dev/null
+++ b/samples/seccomp/bpf-direct.c
@@ -0,0 +1,150 @@
+/*
+ * 32-bit seccomp filter example with BPF macros
+ *
+ * Copyright (c) 2012 The Chromium OS Authors <[email protected]>
+ * Author: Will Drewry <[email protected]>
+ *
+ * The code may be used by anyone for any purpose,
+ * and can serve as a starting point for developing
+ * applications using prctl(PR_SET_SECCOMP, 2, ...).
+ */
+#define __USE_GNU 1
+#define _GNU_SOURCE 1
+
+#include <linux/types.h>
+#include <linux/filter.h>
+#include <linux/seccomp.h>
+#include <linux/unistd.h>
+#include <signal.h>
+#include <stdio.h>
+#include <stddef.h>
+#include <string.h>
+#include <sys/prctl.h>
+#include <unistd.h>
+
+#define syscall_arg(_n) (offsetof(struct seccomp_data, lo32[_n]))
+#define syscall_nr (offsetof(struct seccomp_data, nr))
+
+#ifndef TRAP_SECCOMP
+#define TRAP_SECCOMP (TRAP_TRACE + 3)
+#endif
+
+#ifndef PR_SET_NO_NEW_PRIVS
+#define PR_SET_NO_NEW_PRIVS 36
+#endif
+
+static void emulator(int nr, siginfo_t *info, void *void_context)
+{
+ ucontext_t *ctx = (ucontext_t *)(void_context);
+ int syscall;
+ char *buf;
+ ssize_t bytes;
+ size_t len;
+ if (info->si_code != TRAP_SECCOMP)
+ return;
+ if (!ctx)
+ return;
+ syscall = ctx->uc_mcontext.gregs[REG_EAX];
+ buf = (char *) ctx->uc_mcontext.gregs[REG_ECX];
+ len = (size_t) ctx->uc_mcontext.gregs[REG_EDX];
+
+ if (syscall != __NR_write)
+ return;
+ if (ctx->uc_mcontext.gregs[REG_EBX] != STDERR_FILENO)
+ return;
+ /* Redirect stderr messages to stdout. Doesn't handle EINTR, etc */
+ write(STDOUT_FILENO, "[ERR] ", 6);
+ bytes = write(STDOUT_FILENO, buf, len);
+ ctx->uc_mcontext.gregs[REG_EAX] = bytes;
+ return;
+}
+
+static int install_emulator(void)
+{
+ struct sigaction act;
+ sigset_t mask;
+ memset(&act, 0, sizeof(act));
+ sigemptyset(&mask);
+ sigaddset(&mask, SIGTRAP);
+
+ act.sa_sigaction = &emulator;
+ act.sa_flags = SA_SIGINFO;
+ if (sigaction(SIGTRAP, &act, NULL) < 0) {
+ perror("sigaction");
+ return -1;
+ }
+ if (sigprocmask(SIG_UNBLOCK, &mask, NULL)) {
+ perror("sigprocmask");
+ return -1;
+ }
+ return 0;
+}
+
+static int install_filter(void)
+{
+ struct sock_filter filter[] = {
+ /* Grab the system call number */
+ BPF_STMT(BPF_LD+BPF_W+BPF_ABS, syscall_nr),
+ /* Jump table for the allowed syscalls */
+ BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, __NR_rt_sigreturn, 0, 1),
+ BPF_STMT(BPF_RET+BPF_K, SECCOMP_RET_ALLOW),
+ BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, __NR_sigreturn, 0, 1),
+ BPF_STMT(BPF_RET+BPF_K, SECCOMP_RET_ALLOW),
+ BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, __NR_exit_group, 0, 1),
+ BPF_STMT(BPF_RET+BPF_K, SECCOMP_RET_ALLOW),
+ BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, __NR_exit, 0, 1),
+ BPF_STMT(BPF_RET+BPF_K, SECCOMP_RET_ALLOW),
+ BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, __NR_read, 1, 0),
+ BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, __NR_write, 3, 2),
+
+ /* Check that read is only using stdin. */
+ BPF_STMT(BPF_LD+BPF_W+BPF_ABS, syscall_arg(0)),
+ BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, STDIN_FILENO, 4, 0),
+ BPF_STMT(BPF_RET+BPF_K, SECCOMP_RET_KILL),
+
+ /* Check that write is only using stdout */
+ BPF_STMT(BPF_LD+BPF_W+BPF_ABS, syscall_arg(0)),
+ BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, STDOUT_FILENO, 1, 0),
+ /* Trap attempts to write to stderr */
+ BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, STDERR_FILENO, 1, 2),
+
+ BPF_STMT(BPF_RET+BPF_K, SECCOMP_RET_ALLOW),
+ BPF_STMT(BPF_RET+BPF_K, SECCOMP_RET_TRAP),
+ BPF_STMT(BPF_RET+BPF_K, SECCOMP_RET_KILL),
+ };
+ struct sock_fprog prog = {
+ .len = (unsigned short)(sizeof(filter)/sizeof(filter[0])),
+ .filter = filter,
+ };
+
+ if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0)) {
+ perror("prctl(NO_NEW_PRIVS)");
+ return 1;
+ }
+
+
+ if (prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog)) {
+ perror("prctl");
+ return 1;
+ }
+ return 0;
+}
+
+#define payload(_c) (_c), sizeof((_c))
+int main(int argc, char **argv)
+{
+ char buf[4096];
+ ssize_t bytes = 0;
+ if (install_emulator())
+ return 1;
+ if (install_filter())
+ return 1;
+ syscall(__NR_write, STDOUT_FILENO,
+ payload("OHAI! WHAT IS YOUR NAME? "));
+ bytes = syscall(__NR_read, STDIN_FILENO, buf, sizeof(buf));
+ syscall(__NR_write, STDOUT_FILENO, payload("HELLO, "));
+ syscall(__NR_write, STDOUT_FILENO, buf, bytes);
+ syscall(__NR_write, STDERR_FILENO,
+ payload("Error message going to STDERR\n"));
+ return 0;
+}
diff --git a/samples/seccomp/bpf-fancy.c b/samples/seccomp/bpf-fancy.c
new file mode 100644
index 0000000..bcfe3a0
--- /dev/null
+++ b/samples/seccomp/bpf-fancy.c
@@ -0,0 +1,101 @@
+/*
+ * Seccomp BPF example using a macro-based generator.
+ *
+ * Copyright (c) 2012 The Chromium OS Authors <[email protected]>
+ * Author: Will Drewry <[email protected]>
+ *
+ * The code may be used by anyone for any purpose,
+ * and can serve as a starting point for developing
+ * applications using prctl(PR_ATTACH_SECCOMP_FILTER).
+ */
+
+#include <linux/filter.h>
+#include <linux/seccomp.h>
+#include <linux/unistd.h>
+#include <stdio.h>
+#include <string.h>
+#include <sys/prctl.h>
+#include <unistd.h>
+
+#include "bpf-helper.h"
+
+#ifndef PR_SET_NO_NEW_PRIVS
+#define PR_SET_NO_NEW_PRIVS 36
+#endif
+
+int main(int argc, char **argv)
+{
+ struct bpf_labels l;
+ static const char msg1[] = "Please type something: ";
+ static const char msg2[] = "You typed: ";
+ char buf[256];
+ struct sock_filter filter[] = {
+ LOAD_SYSCALL_NR,
+ SYSCALL(__NR_exit, ALLOW),
+ SYSCALL(__NR_exit_group, ALLOW),
+ SYSCALL(__NR_write, JUMP(&l, write_fd)),
+ SYSCALL(__NR_read, JUMP(&l, read)),
+ DENY, /* Don't passthrough into a label */
+
+ LABEL(&l, read),
+ ARG(0),
+ JNE(STDIN_FILENO, DENY),
+ ARG(1),
+ JNE((unsigned long)buf, DENY),
+ ARG(2),
+ JGE(sizeof(buf), DENY),
+ ALLOW,
+
+ LABEL(&l, write_fd),
+ ARG(0),
+ JEQ(STDOUT_FILENO, JUMP(&l, write_buf)),
+ JEQ(STDERR_FILENO, JUMP(&l, write_buf)),
+ DENY,
+
+ LABEL(&l, write_buf),
+ ARG(1),
+ JEQ((unsigned long)msg1, JUMP(&l, msg1_len)),
+ JEQ((unsigned long)msg2, JUMP(&l, msg2_len)),
+ JEQ((unsigned long)buf, JUMP(&l, buf_len)),
+ DENY,
+
+ LABEL(&l, msg1_len),
+ ARG(2),
+ JLT(sizeof(msg1), ALLOW),
+ DENY,
+
+ LABEL(&l, msg2_len),
+ ARG(2),
+ JLT(sizeof(msg2), ALLOW),
+ DENY,
+
+ LABEL(&l, buf_len),
+ ARG(2),
+ JLT(sizeof(buf), ALLOW),
+ DENY,
+ };
+ struct sock_fprog prog = {
+ .filter = filter,
+ .len = (unsigned short)(sizeof(filter)/sizeof(filter[0])),
+ };
+ ssize_t bytes;
+ bpf_resolve_jumps(&l, filter, sizeof(filter)/sizeof(*filter));
+
+ if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0)) {
+ perror("prctl(NO_NEW_PRIVS)");
+ return 1;
+ }
+
+ if (prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog)) {
+ perror("prctl(SECCOMP)");
+ return 1;
+ }
+ syscall(__NR_write, STDOUT_FILENO, msg1, strlen(msg1));
+ bytes = syscall(__NR_read, STDIN_FILENO, buf, sizeof(buf)-1);
+ bytes = (bytes > 0 ? bytes : 0);
+ syscall(__NR_write, STDERR_FILENO, msg2, strlen(msg2));
+ syscall(__NR_write, STDERR_FILENO, buf, bytes);
+ /* Now get killed */
+ syscall(__NR_write, STDERR_FILENO, msg2, strlen(msg2)+2);
+ return 0;
+}
diff --git a/samples/seccomp/bpf-helper.c b/samples/seccomp/bpf-helper.c
new file mode 100644
index 0000000..579cfe3
--- /dev/null
+++ b/samples/seccomp/bpf-helper.c
@@ -0,0 +1,89 @@
+/*
+ * Seccomp BPF helper functions
+ *
+ * Copyright (c) 2012 The Chromium OS Authors <[email protected]>
+ * Author: Will Drewry <[email protected]>
+ *
+ * The code may be used by anyone for any purpose,
+ * and can serve as a starting point for developing
+ * applications using prctl(PR_ATTACH_SECCOMP_FILTER).
+ */
+
+#include <stdio.h>
+#include <string.h>
+
+#include "bpf-helper.h"
+
+int bpf_resolve_jumps(struct bpf_labels *labels,
+ struct sock_filter *filter, size_t count)
+{
+ struct sock_filter *begin = filter;
+ __u8 insn = count - 1;
+
+ if (count < 1)
+ return -1;
+ /*
+ * Walk it once, backwards, to build the label table and do fixups.
+ * Since backward jumps are disallowed by BPF, this is easy.
+ */
+ filter += insn;
+ for (; filter >= begin; --insn, --filter) {
+ if (filter->code != (BPF_JMP+BPF_JA))
+ continue;
+ switch ((filter->jt<<8)|filter->jf) {
+ case (JUMP_JT<<8)|JUMP_JF:
+ if (labels->labels[filter->k].location == 0xffffffff) {
+ fprintf(stderr, "Unresolved label: '%s'\n",
+ labels->labels[filter->k].label);
+ return 1;
+ }
+ filter->k = labels->labels[filter->k].location -
+ (insn + 1);
+ filter->jt = 0;
+ filter->jf = 0;
+ continue;
+ case (LABEL_JT<<8)|LABEL_JF:
+ if (labels->labels[filter->k].location != 0xffffffff) {
+ fprintf(stderr, "Duplicate label use: '%s'\n",
+ labels->labels[filter->k].label);
+ return 1;
+ }
+ labels->labels[filter->k].location = insn;
+ filter->k = 0; /* fall through */
+ filter->jt = 0;
+ filter->jf = 0;
+ continue;
+ }
+ }
+ return 0;
+}
+
+/* Simple lookup table for labels. */
+__u32 seccomp_bpf_label(struct bpf_labels *labels, const char *label)
+{
+ struct __bpf_label *begin = labels->labels, *end;
+ int id;
+ if (labels->count == 0) {
+ begin->label = label;
+ begin->location = 0xffffffff;
+ labels->count++;
+ return 0;
+ }
+ end = begin + labels->count;
+ for (id = 0; begin < end; ++begin, ++id) {
+ if (!strcmp(label, begin->label))
+ return id;
+ }
+ begin->label = label;
+ begin->location = 0xffffffff;
+ labels->count++;
+ return id;
+}
+
+void seccomp_bpf_print(struct sock_filter *filter, size_t count)
+{
+ struct sock_filter *end = filter + count;
+ for ( ; filter < end; ++filter)
+ printf("{ code=%u,jt=%u,jf=%u,k=%u },\n",
+ filter->code, filter->jt, filter->jf, filter->k);
+}
diff --git a/samples/seccomp/bpf-helper.h b/samples/seccomp/bpf-helper.h
new file mode 100644
index 0000000..9c64801
--- /dev/null
+++ b/samples/seccomp/bpf-helper.h
@@ -0,0 +1,234 @@
+/*
+ * Example wrapper around BPF macros.
+ *
+ * Copyright (c) 2012 The Chromium OS Authors <[email protected]>
+ * Author: Will Drewry <[email protected]>
+ *
+ * The code may be used by anyone for any purpose,
+ * and can serve as a starting point for developing
+ * applications using prctl(PR_SET_SECCOMP, 2, ...).
+ *
+ * No guarantees are provided with respect to the correctness
+ * or functionality of this code.
+ */
+#ifndef __BPF_HELPER_H__
+#define __BPF_HELPER_H__
+
+#include <asm/bitsperlong.h> /* for __BITS_PER_LONG */
+#include <linux/filter.h>
+#include <linux/seccomp.h> /* for seccomp_data */
+#include <linux/types.h>
+#include <linux/unistd.h>
+#include <stddef.h>
+
+#define BPF_LABELS_MAX 256
+struct bpf_labels {
+ int count;
+ struct __bpf_label {
+ const char *label;
+ __u32 location;
+ } labels[BPF_LABELS_MAX];
+};
+
+int bpf_resolve_jumps(struct bpf_labels *labels,
+ struct sock_filter *filter, size_t count);
+__u32 seccomp_bpf_label(struct bpf_labels *labels, const char *label);
+void seccomp_bpf_print(struct sock_filter *filter, size_t count);
+
+#define JUMP_JT 0xff
+#define JUMP_JF 0xff
+#define LABEL_JT 0xfe
+#define LABEL_JF 0xfe
+
+#define ALLOW \
+ BPF_STMT(BPF_RET+BPF_K, SECCOMP_RET_ALLOW)
+#define DENY \
+ BPF_STMT(BPF_RET+BPF_K, SECCOMP_RET_KILL)
+#define JUMP(labels, label) \
+ BPF_JUMP(BPF_JMP+BPF_JA, FIND_LABEL((labels), (label)), \
+ JUMP_JT, JUMP_JF)
+#define LABEL(labels, label) \
+ BPF_JUMP(BPF_JMP+BPF_JA, FIND_LABEL((labels), (label)), \
+ LABEL_JT, LABEL_JF)
+#define SYSCALL(nr, jt) \
+ BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, (nr), 0, 1), \
+ jt
+
+/* Lame, but just an example */
+#define FIND_LABEL(labels, label) seccomp_bpf_label((labels), #label)
+
+#define EXPAND(...) __VA_ARGS__
+/* Map all width-sensitive operations */
+#if __BITS_PER_LONG == 32
+
+#define JEQ(x, jt) JEQ32(x, EXPAND(jt))
+#define JNE(x, jt) JNE32(x, EXPAND(jt))
+#define JGT(x, jt) JGT32(x, EXPAND(jt))
+#define JLT(x, jt) JLT32(x, EXPAND(jt))
+#define JGE(x, jt) JGE32(x, EXPAND(jt))
+#define JLE(x, jt) JLE32(x, EXPAND(jt))
+#define JA(x, jt) JA32(x, EXPAND(jt))
+#define ARG(i) ARG_32(i)
+
+#elif __BITS_PER_LONG == 64
+
+#if defined(__LITTLE_ENDIAN)
+#define ENDIAN(_lo, _hi) _lo, _hi
+#elif defined(__BIG_ENDIAN)
+#define ENDIAN(_lo, _hi) _hi, _lo
+#else
+#error "Unknown endianness"
+#endif
+
+union arg64 {
+ struct {
+ __u32 ENDIAN(lo32, hi32);
+ };
+ __u64 u64;
+};
+
+#define JEQ(x, jt) \
+ JEQ64(((union arg64){.u64 = (x)}).lo32, \
+ ((union arg64){.u64 = (x)}).hi32, \
+ EXPAND(jt))
+#define JGT(x, jt) \
+ JGT64(((union arg64){.u64 = (x)}).lo32, \
+ ((union arg64){.u64 = (x)}).hi32, \
+ EXPAND(jt))
+#define JGE(x, jt) \
+ JGE64(((union arg64){.u64 = (x)}).lo32, \
+ ((union arg64){.u64 = (x)}).hi32, \
+ EXPAND(jt))
+#define JNE(x, jt) \
+ JNE64(((union arg64){.u64 = (x)}).lo32, \
+ ((union arg64){.u64 = (x)}).hi32, \
+ EXPAND(jt))
+#define JLT(x, jt) \
+ JLT64(((union arg64){.u64 = (x)}).lo32, \
+ ((union arg64){.u64 = (x)}).hi32, \
+ EXPAND(jt))
+#define JLE(x, jt) \
+ JLE64(((union arg64){.u64 = (x)}).lo32, \
+ ((union arg64){.u64 = (x)}).hi32, \
+ EXPAND(jt))
+
+#define JA(x, jt) \
+ JA64(((union arg64){.u64 = (x)}).lo32, \
+ ((union arg64){.u64 = (x)}).hi32, \
+ EXPAND(jt))
+#define ARG(i) ARG_64(i)
+
+#else
+#error __BITS_PER_LONG value unusable.
+#endif
+
+/* Loads the arg into A */
+#define ARG_32(idx) \
+ BPF_STMT(BPF_LD+BPF_W+BPF_ABS, \
+ offsetof(struct seccomp_data, lo32[(idx)]))
+
+/* Loads hi into A and lo in X */
+#define ARG_64(idx) \
+ BPF_STMT(BPF_LD+BPF_W+BPF_ABS, \
+ offsetof(struct seccomp_data, lo32[(idx)])), \
+ BPF_STMT(BPF_ST, 0), /* lo -> M[0] */ \
+ BPF_STMT(BPF_LD+BPF_W+BPF_ABS, \
+ offsetof(struct seccomp_data, hi32[(idx)])), \
+ BPF_STMT(BPF_ST, 1) /* hi -> M[1] */
+
+#define JEQ32(value, jt) \
+ BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, (value), 0, 1), \
+ jt
+
+#define JNE32(value, jt) \
+ BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, (value), 1, 0), \
+ jt
+
+/* Checks the lo, then swaps to check the hi. A=lo,X=hi */
+#define JEQ64(lo, hi, jt) \
+ BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, (hi), 0, 5), \
+ BPF_STMT(BPF_LD+BPF_MEM, 0), /* swap in lo */ \
+ BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, (lo), 0, 2), \
+ BPF_STMT(BPF_LD+BPF_MEM, 1), /* passed: swap hi back in */ \
+ jt, \
+ BPF_STMT(BPF_LD+BPF_MEM, 1) /* failed: swap hi back in */
+
+#define JNE64(lo, hi, jt) \
+ BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, (hi), 5, 0), \
+ BPF_STMT(BPF_LD+BPF_MEM, 0), /* swap in lo */ \
+ BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, (lo), 2, 0), \
+ BPF_STMT(BPF_LD+BPF_MEM, 1), /* passed: swap hi back in */ \
+ jt, \
+ BPF_STMT(BPF_LD+BPF_MEM, 1) /* failed: swap hi back in */
+
+#define JA32(value, jt) \
+ BPF_JUMP(BPF_JMP+BPF_JSET+BPF_K, (value), 0, 1), \
+ jt
+
+#define JA64(lo, hi, jt) \
+ BPF_JUMP(BPF_JMP+BPF_JSET+BPF_K, (hi), 3, 0), \
+ BPF_STMT(BPF_LD+BPF_MEM, 0), /* swap in lo */ \
+ BPF_JUMP(BPF_JMP+BPF_JSET+BPF_K, (lo), 0, 2), \
+ BPF_STMT(BPF_LD+BPF_MEM, 1), /* passed: swap hi back in */ \
+ jt, \
+ BPF_STMT(BPF_LD+BPF_MEM, 1) /* failed: swap hi back in */
+
+#define JGE32(value, jt) \
+ BPF_JUMP(BPF_JMP+BPF_JGE+BPF_K, (value), 0, 1), \
+ jt
+
+#define JLT32(value, jt) \
+ BPF_JUMP(BPF_JMP+BPF_JGE+BPF_K, (value), 1, 0), \
+ jt
+
+/* Shortcut checking if hi > arg.hi. */
+#define JGE64(lo, hi, jt) \
+ BPF_JUMP(BPF_JMP+BPF_JGT+BPF_K, (hi), 4, 0), \
+ BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, (hi), 0, 5), \
+ BPF_STMT(BPF_LD+BPF_MEM, 0), /* swap in lo */ \
+ BPF_JUMP(BPF_JMP+BPF_JGE+BPF_K, (lo), 0, 2), \
+ BPF_STMT(BPF_LD+BPF_MEM, 1), /* passed: swap hi back in */ \
+ jt, \
+ BPF_STMT(BPF_LD+BPF_MEM, 1) /* failed: swap hi back in */
+
+#define JLT64(lo, hi, jt) \
+ BPF_JUMP(BPF_JMP+BPF_JGE+BPF_K, (hi), 0, 4), \
+ BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, (hi), 0, 5), \
+ BPF_STMT(BPF_LD+BPF_MEM, 0), /* swap in lo */ \
+ BPF_JUMP(BPF_JMP+BPF_JGT+BPF_K, (lo), 2, 0), \
+ BPF_STMT(BPF_LD+BPF_MEM, 1), /* passed: swap hi back in */ \
+ jt, \
+ BPF_STMT(BPF_LD+BPF_MEM, 1) /* failed: swap hi back in */
+
+#define JGT32(value, jt) \
+ BPF_JUMP(BPF_JMP+BPF_JGT+BPF_K, (value), 0, 1), \
+ jt
+
+#define JLE32(value, jt) \
+ BPF_JUMP(BPF_JMP+BPF_JGT+BPF_K, (value), 1, 0), \
+ jt
+
+/* Check hi > args.hi first, then do the GE checking */
+#define JGT64(lo, hi, jt) \
+ BPF_JUMP(BPF_JMP+BPF_JGT+BPF_K, (hi), 4, 0), \
+ BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, (hi), 0, 5), \
+ BPF_STMT(BPF_LD+BPF_MEM, 0), /* swap in lo */ \
+ BPF_JUMP(BPF_JMP+BPF_JGT+BPF_K, (lo), 0, 2), \
+ BPF_STMT(BPF_LD+BPF_MEM, 1), /* passed: swap hi back in */ \
+ jt, \
+ BPF_STMT(BPF_LD+BPF_MEM, 1) /* failed: swap hi back in */
+
+#define JLE64(lo, hi, jt) \
+ BPF_JUMP(BPF_JMP+BPF_JGT+BPF_K, (hi), 6, 0), \
+ BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, (hi), 0, 3), \
+ BPF_STMT(BPF_LD+BPF_MEM, 0), /* swap in lo */ \
+ BPF_JUMP(BPF_JMP+BPF_JGT+BPF_K, (lo), 2, 0), \
+ BPF_STMT(BPF_LD+BPF_MEM, 1), /* passed: swap hi back in */ \
+ jt, \
+ BPF_STMT(BPF_LD+BPF_MEM, 1) /* failed: swap hi back in */
+
+#define LOAD_SYSCALL_NR \
+ BPF_STMT(BPF_LD+BPF_W+BPF_ABS, \
+ offsetof(struct seccomp_data, nr))
+
+#endif /* __BPF_HELPER_H__ */
diff --git a/samples/seccomp/dropper.c b/samples/seccomp/dropper.c
new file mode 100644
index 0000000..535db8a
--- /dev/null
+++ b/samples/seccomp/dropper.c
@@ -0,0 +1,52 @@
+/*
+ * Naive system call dropper built on seccomp_filter.
+ *
+ * Copyright (c) 2012 The Chromium OS Authors <[email protected]>
+ * Author: Will Drewry <[email protected]>
+ *
+ * The code may be used by anyone for any purpose,
+ * and can serve as a starting point for developing
+ * applications using prctl(PR_SET_SECCOMP, 2, ...).
+ *
+ * When run, returns the specified errno for the specified
+ * system call number.
+ *
+ * Run this one as root as PR_SET_NO_NEW_PRIVS is not called.
+ */
+
+#include <errno.h>
+#include <linux/filter.h>
+#include <linux/seccomp.h>
+#include <linux/unistd.h>
+#include <stdio.h>
+#include <stddef.h>
+#include <stdlib.h>
+#include <sys/prctl.h>
+#include <unistd.h>
+
+int main(int argc, char **argv)
+{
+ if (argc < 4) {
+ fprintf(stderr, "Usage:\n"
+ "dropper <syscall_nr> <errno> <prog> [<args>]\n\n");
+ return 1;
+ }
+ struct sock_filter filter[] = {
+ BPF_STMT(BPF_LD+BPF_W+BPF_ABS,
+ (offsetof(struct seccomp_data, nr))),
+ BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, atoi(argv[1]), 0, 1),
+ BPF_STMT(BPF_RET+BPF_K,
+ SECCOMP_RET_ERRNO|(atoi(argv[2]) & SECCOMP_RET_DATA)),
+ BPF_STMT(BPF_RET+BPF_K, SECCOMP_RET_ALLOW),
+ };
+ struct sock_fprog prog = {
+ .len = (unsigned short)(sizeof(filter)/sizeof(filter[0])),
+ .filter = filter,
+ };
+ if (prctl(PR_SET_SECCOMP, 2, &prog)) {
+ perror("prctl");
+ return 1;
+ }
+ execv(argv[3], &argv[3]);
+ return 1;
+}
--
1.7.5.4
Adds a new return value to seccomp filters that triggers a SIGTRAP to be
delivered with the new TRAP_SECCOMP si_code.
This allows in-process system call emulation -- including just
specifying an errno or cleanly dumping core -- rather than just dying.
Supporting this change requires that secure_computing returns a value.
This change adds an int return value and creates a new
__secure_computing_int and deprecates the old __secure_computing call.
This allows for piecemeal arch updating using HAVE_ARCH_SECCOMP_FILTER.
(If -1 is returned, the system call must be skipped.)
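In-process emulation then only needs a SIGTRAP handler keyed off the new
si_code. A minimal sketch (samples/seccomp/bpf-direct.c in this series is
the full example; the x86-32 register name and the TRAP_SECCOMP fallback
follow that sample) that just fails the skipped call with ENOSYS:

  #define _GNU_SOURCE 1
  #include <errno.h>
  #include <signal.h>
  #include <string.h>
  #include <ucontext.h>

  #ifndef TRAP_SECCOMP
  #define TRAP_SECCOMP (TRAP_TRACE + 3)
  #endif

  /* Make the skipped system call appear to have returned -ENOSYS. */
  static void trap_handler(int nr, siginfo_t *info, void *void_ctx)
  {
          ucontext_t *ctx = void_ctx;

          if (!ctx || info->si_code != TRAP_SECCOMP)
                  return;
          ctx->uc_mcontext.gregs[REG_EAX] = -ENOSYS;
  }

  static int install_trap_handler(void)
  {
          struct sigaction act;

          memset(&act, 0, sizeof(act));
          act.sa_sigaction = trap_handler;
          act.sa_flags = SA_SIGINFO;
          return sigaction(SIGTRAP, &act, NULL);
  }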
Note, the addition of TRAP_SECCOMP may not be appropriate. There are
GNU specific extensions in place (e.g., TRAP_HWBKPT), but I'm not sure
how sacred the definitions are. If it would be preferable to add
a brand new si_code independent of the TRAP_* or use the unused si_errno
(with ENOSYS), or do something totally different, please let me know!
v8: - clean up based on changes to dependent patches
v7: - introduction
Signed-off-by: Will Drewry <[email protected]>
---
arch/Kconfig | 8 ++++----
include/asm-generic/siginfo.h | 3 ++-
include/linux/seccomp.h | 1 +
kernel/seccomp.c | 20 ++++++++++++++++++++
4 files changed, 27 insertions(+), 5 deletions(-)
diff --git a/arch/Kconfig b/arch/Kconfig
index 3f3052b..a01c151 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -203,10 +203,10 @@ config HAVE_ARCH_SECCOMP_FILTER
bool
help
This symbol should be selected by an architecure if it provides
- asm/syscall.h, specifically syscall_get_arguments() and
- syscall_set_return_value(). Additionally, its system call
- entry path must respect a return value of -1 from
- __secure_computing_int() and/or secure_computing().
+ asm/syscall.h, specifically syscall_get_arguments(),
+ syscall_set_return_value(), and syscall_rollback().
+ Additionally, its system call entry path must respect a return
+ value of -1 from __secure_computing_int() and/or secure_computing().
config SECCOMP_FILTER
def_bool y
diff --git a/include/asm-generic/siginfo.h b/include/asm-generic/siginfo.h
index 0dd4e87..a6c51a6 100644
--- a/include/asm-generic/siginfo.h
+++ b/include/asm-generic/siginfo.h
@@ -207,7 +207,8 @@ typedef struct siginfo {
#define TRAP_TRACE (__SI_FAULT|2) /* process trace trap */
#define TRAP_BRANCH (__SI_FAULT|3) /* process taken branch trap */
#define TRAP_HWBKPT (__SI_FAULT|4) /* hardware breakpoint/watchpoint */
-#define NSIGTRAP 4
+#define TRAP_SECCOMP (__SI_FAULT|5) /* secure computing trap */
+#define NSIGTRAP 5
/*
* SIGCHLD si_codes
diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h
index 879ece2..1be562f 100644
--- a/include/linux/seccomp.h
+++ b/include/linux/seccomp.h
@@ -19,6 +19,7 @@
* selects the least permissive choice.
*/
#define SECCOMP_RET_KILL 0x00000000U /* kill the task immediately */
+#define SECCOMP_RET_TRAP 0x00020000U /* disallow and send sigtrap */
#define SECCOMP_RET_ERRNO 0x00030000U /* returns an errno */
#define SECCOMP_RET_ALLOW 0x7fff0000U /* allow */
diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index 55d000d..c75485c 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -290,6 +290,21 @@ void copy_seccomp(struct seccomp *child,
child->mode = prev->mode;
child->filter = get_seccomp_filter(prev->filter);
}
+
+/**
+ * seccomp_send_sigtrap - signals the task to allow in-process syscall emulation
+ *
+ * Forces a SIGTRAP with si_code of TRAP_SECCOMP.
+ */
+static void seccomp_send_sigtrap(void)
+{
+ struct siginfo info;
+ memset(&info, 0, sizeof(info));
+ info.si_signo = SIGTRAP;
+ info.si_code = TRAP_SECCOMP;
+ info.si_addr = (void __user *)KSTK_EIP(current);
+ force_sig_info(SIGTRAP, &info, current);
+}
#endif /* CONFIG_SECCOMP_FILTER */
/*
@@ -343,6 +358,11 @@ int __secure_computing_int(int this_syscall)
-(action & SECCOMP_RET_DATA),
0);
return -1;
+ case SECCOMP_RET_TRAP:
+ /* Show the handler the original registers. */
+ syscall_rollback(current, task_pt_regs(current));
+ seccomp_send_sigtrap();
+ return -1;
case SECCOMP_RET_ALLOW:
return 0;
case SECCOMP_RET_KILL:
--
1.7.5.4
A new return value is added to seccomp filters that allows
the system call policy for the affected system calls to be
implemented by a ptrace(2)ing process.
If a tracer attaches to a task using PTRACE_SECCOMP, then the
traced process will notify the tracer if a seccomp filter
returns SECCOMP_RET_TRACE. If the tracer detaches, then
system calls made by the task will fail.
To ensure that seccomp is syscall fast-path friendly in the future,
ptrace is delegated to by setting TIF_SYSCALL_TRACE. Since seccomp
events are equivalent to system call entry events, this allows
seccomp to be evaluated as a fork off the fast path that only,
optionally, jumps to the slow path. When the tracer is notified,
everything functions as with ptrace(PTRACE_SYSCALL), but when the tracer
resumes with ptrace(PTRACE_SECCOMP), TIF_SYSCALL_TRACE will be unset and
the task will proceed.
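Sketched from the description above (hypothetical tracer-side usage, not
code from this patch), the resume loop looks roughly like:

  /* 'pid' is an attached, stopped child that has installed a filter
   * returning SECCOMP_RET_TRACE for the system calls of interest. */
  int status;

  for (;;) {
          if (ptrace(PTRACE_SECCOMP, pid, NULL, NULL) < 0)
                  break;
          if (waitpid(pid, &status, 0) < 0 || WIFEXITED(status))
                  break;
          /* syscall-entry style stop: inspect/modify registers as with
           * PTRACE_SYSCALL, then loop to resume with PTRACE_SECCOMP. */
  }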
Note, this patch takes the path of least resistance for integration. It
is not necessarily the best path and any guidance will be appreciated!
The key challenges are ensuring that register state is correct at
ptrace handoff and ensuring that only seccomp-based notification
occurs.
v8: - guarded PTRACE_SECCOMP use with an ifdef
v7: - introduced
Signed-off-by: Will Drewry <[email protected]>
---
arch/Kconfig | 12 ++++++++----
include/linux/ptrace.h | 1 +
include/linux/seccomp.h | 39 +++++++++++++++++++++++++++++++++++++--
kernel/ptrace.c | 12 ++++++++++++
kernel/seccomp.c | 15 +++++++++++++++
5 files changed, 73 insertions(+), 6 deletions(-)
diff --git a/arch/Kconfig b/arch/Kconfig
index a01c151..ae40aec 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -203,10 +203,14 @@ config HAVE_ARCH_SECCOMP_FILTER
bool
help
This symbol should be selected by an architecure if it provides
- asm/syscall.h, specifically syscall_get_arguments(),
- syscall_set_return_value(), and syscall_rollback().
- Additionally, its system call entry path must respect a return
- value of -1 from __secure_computing_int() and/or secure_computing().
+ linux/tracehook.h, for TIF_SYSCALL_TRACE, and asm/syscall.h,
+ specifically syscall_get_arguments(), syscall_set_return_value(), and
+ syscall_rollback(). Additionally, its system call entry path must
+ respect a return value of -1 from __secure_computing_int() and/or
+ secure_computing(). If secure_computing is not in the system call
+ slow path, the thread info flags will need to be checked upon exit to
+ ensure delegation to ptrace(2) did not occur, or if it did, jump to
+ the slow-path.
config SECCOMP_FILTER
def_bool y
diff --git a/include/linux/ptrace.h b/include/linux/ptrace.h
index c2f1f6a..00220de 100644
--- a/include/linux/ptrace.h
+++ b/include/linux/ptrace.h
@@ -50,6 +50,7 @@
#define PTRACE_SEIZE 0x4206
#define PTRACE_INTERRUPT 0x4207
#define PTRACE_LISTEN 0x4208
+#define PTRACE_SECCOMP 0x4209
/* flags in @data for PTRACE_SEIZE */
#define PTRACE_SEIZE_DEVEL 0x80000000 /* temp flag for development */
diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h
index 1be562f..1cb7d5c 100644
--- a/include/linux/seccomp.h
+++ b/include/linux/seccomp.h
@@ -19,8 +19,9 @@
* selects the least permissive choice.
*/
#define SECCOMP_RET_KILL 0x00000000U /* kill the task immediately */
-#define SECCOMP_RET_TRAP 0x00020000U /* disallow and send sigtrap */
-#define SECCOMP_RET_ERRNO 0x00030000U /* returns an errno */
+#define SECCOMP_RET_TRAP 0x00020000U /* only send sigtrap */
+#define SECCOMP_RET_ERRNO 0x00030000U /* only return an errno */
+#define SECCOMP_RET_TRACE 0x7ffe0000U /* allow, but notify the tracer */
#define SECCOMP_RET_ALLOW 0x7fff0000U /* allow */
/* Masks for accessing the above values. */
@@ -51,6 +52,7 @@ struct seccomp_filter;
*
* @mode: indicates one of the valid values above for controlled
* system calls available to a process.
+ * @flags: per-process flags. Currently only used for SECCOMP_FLAGS_TRACED.
* @filter: The metadata and ruleset for determining what system calls
* are allowed for a task.
*
@@ -59,9 +61,13 @@ struct seccomp_filter;
*/
struct seccomp {
int mode;
+ unsigned long flags;
struct seccomp_filter *filter;
};
+/* Indicates if a tracer is attached. */
+#define SECCOMP_FLAGS_TRACED 0
+
/*
* Direct callers to __secure_computing should be updated as
* CONFIG_HAVE_ARCH_SECCOMP_FILTER propagates.
@@ -83,6 +89,20 @@ static inline int seccomp_mode(struct seccomp *s)
return s->mode;
}
+static inline void seccomp_set_traced(struct seccomp *s)
+{
+ set_bit(SECCOMP_FLAGS_TRACED, &s->flags);
+}
+
+static inline void seccomp_clear_traced(struct seccomp *s)
+{
+ clear_bit(SECCOMP_FLAGS_TRACED, &s->flags);
+}
+
+static inline int seccomp_traced(struct seccomp *s)
+{
+ return test_bit(SECCOMP_FLAGS_TRACED, &s->flags);
+}
#else /* CONFIG_SECCOMP */
#include <linux/errno.h>
@@ -106,6 +126,21 @@ static inline int seccomp_mode(struct seccomp *s)
{
return 0;
}
+
+static inline void seccomp_set_traced(struct seccomp *s)
+{
+ return;
+}
+
+static inline void seccomp_clear_traced(struct seccomp *s)
+{
+ return;
+}
+
+static inline int seccomp_traced(struct seccomp *s)
+{
+ return 0;
+}
#endif /* CONFIG_SECCOMP */
#ifdef CONFIG_SECCOMP_FILTER
diff --git a/kernel/ptrace.c b/kernel/ptrace.c
index 00ab2ca..199a6da 100644
--- a/kernel/ptrace.c
+++ b/kernel/ptrace.c
@@ -19,6 +19,7 @@
#include <linux/signal.h>
#include <linux/audit.h>
#include <linux/pid_namespace.h>
+#include <linux/seccomp.h>
#include <linux/syscalls.h>
#include <linux/uaccess.h>
#include <linux/regset.h>
@@ -426,6 +427,7 @@ static int ptrace_detach(struct task_struct *child, unsigned int data)
/* Architecture-specific hardware disable .. */
ptrace_disable(child);
clear_tsk_thread_flag(child, TIF_SYSCALL_TRACE);
+ seccomp_clear_traced(&child->seccomp);
write_lock_irq(&tasklist_lock);
/*
@@ -616,6 +618,13 @@ static int ptrace_resume(struct task_struct *child, long request,
else
clear_tsk_thread_flag(child, TIF_SYSCALL_TRACE);
+#ifdef CONFIG_SECCOMP_FILTER
+ if (request == PTRACE_SECCOMP)
+ seccomp_set_traced(&child->seccomp);
+ else
+ seccomp_clear_traced(&child->seccomp);
+#endif
+
#ifdef TIF_SYSCALL_EMU
if (request == PTRACE_SYSEMU || request == PTRACE_SYSEMU_SINGLESTEP)
set_tsk_thread_flag(child, TIF_SYSCALL_EMU);
@@ -816,6 +825,9 @@ int ptrace_request(struct task_struct *child, long request,
case PTRACE_SYSEMU:
case PTRACE_SYSEMU_SINGLESTEP:
#endif
+#ifdef CONFIG_SECCOMP_FILTER
+ case PTRACE_SECCOMP:
+#endif
case PTRACE_SYSCALL:
case PTRACE_CONT:
return ptrace_resume(child, request, data);
diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index c75485c..f9d419f 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -289,6 +289,8 @@ void copy_seccomp(struct seccomp *child,
{
child->mode = prev->mode;
child->filter = get_seccomp_filter(prev->filter);
+ /* Note, this leaves seccomp tracing enabled across fork. */
+ child->flags = prev->flags;
}
/**
@@ -363,6 +365,19 @@ int __secure_computing_int(int this_syscall)
syscall_rollback(current, task_pt_regs(current));
seccomp_send_sigtrap();
return -1;
+ case SECCOMP_RET_TRACE:
+ if (!seccomp_traced(&current->seccomp))
+ return -1;
+ /*
+ * Delegate to TIF_SYSCALL_TRACE. This allows fast-path
+ * seccomp calls to delegate to slow-path if needed.
+ * Since TIF_SYSCALL_TRACE will be unset on ptrace(2)
+ * continuation, there should be no direct side
+ * effects. If TIF_SYSCALL_TRACE is already set, this
+ * has no effect.
+ */
+ set_tsk_thread_flag(current, TIF_SYSCALL_TRACE);
+ /* Falls through to allow. */
case SECCOMP_RET_ALLOW:
return 0;
case SECCOMP_RET_KILL:
--
1.7.5.4
This change adds the SECCOMP_RET_ERRNO as a valid return value from a
seccomp filter. Additionally, it makes the first use of the lower
16 bits for storing a filter-supplied errno. 16 bits is more than
enough for the errno values in errno-base.h.
Returning errors instead of immediately terminating processes that
violate seccomp policy allows for broader use of this functionality
for kernel attack surface reduction. For example, a Linux container
could maintain a whitelist of pre-existing system calls but drop
all new ones with errnos. This would keep a logically static attack
surface while providing errnos that may allow for graceful failure
without the downside of do_exit() on a bad call.
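As a purely illustrative sketch (not part of this patch), a filter using
this return value could hand back EPERM for one syscall and allow the
rest. The BPF_STMT/BPF_JUMP macros come from <linux/filter.h>, and
__NR_uname is only a stand-in example syscall:

/*
 * Sketch: fail uname(2) with EPERM instead of killing the task,
 * allow everything else.  Uses the SECCOMP_RET_* values from this
 * patch; __NR_uname is an arbitrary example syscall.
 */
#include <stddef.h>
#include <errno.h>
#include <sys/syscall.h>
#include <linux/filter.h>
#include <linux/seccomp.h>

static struct sock_filter errno_example[] = {
	/* A = syscall number */
	BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
	/* if (A == __NR_uname) return SECCOMP_RET_ERRNO | EPERM; */
	BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_uname, 0, 1),
	BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ERRNO | (EPERM & SECCOMP_RET_DATA)),
	/* everything else proceeds normally */
	BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
};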
v8: - update Kconfig to note new need for syscall_set_return_value.
- reordered such that TRAP behavior follows on later.
- made the for loop a little less indent-y
v7: - introduced
Signed-off-by: Will Drewry <[email protected]>
(cherry picked from commit e90e1a5389d0ce3a667640121b0a90538014a16c)
---
arch/Kconfig | 5 ++++-
include/linux/seccomp.h | 20 +++++++++++++++-----
kernel/seccomp.c | 42 ++++++++++++++++++++++++++++++------------
3 files changed, 49 insertions(+), 18 deletions(-)
diff --git a/arch/Kconfig b/arch/Kconfig
index c6ba1db..3f3052b 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -203,7 +203,10 @@ config HAVE_ARCH_SECCOMP_FILTER
bool
help
This symbol should be selected by an architecture if it provides
- asm/syscall.h, specifically syscall_get_arguments().
+ asm/syscall.h, specifically syscall_get_arguments() and
+ syscall_set_return_value(). Additionally, its system call
+ entry path must respect a return value of -1 from
+ __secure_computing_int() and/or secure_computing().
config SECCOMP_FILTER
def_bool y
diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h
index 2bee1f7..879ece2 100644
--- a/include/linux/seccomp.h
+++ b/include/linux/seccomp.h
@@ -12,16 +12,20 @@
/*
* BPF programs may return a 32-bit value.
- * The bottom 16-bits are reserved for future use.
+ * The bottom 16-bits are for optional related return data.
* The upper 16-bits are ordered from least permissive values to most.
*
* The ordering ensures that a min_t() over composed return values always
* selects the least permissive choice.
*/
-#define SECCOMP_RET_MASK 0xffff0000U
#define SECCOMP_RET_KILL 0x00000000U /* kill the task immediately */
+#define SECCOMP_RET_ERRNO 0x00030000U /* returns an errno */
#define SECCOMP_RET_ALLOW 0x7fff0000U /* allow */
+/* Masks for accessing the above values. */
+#define SECCOMP_RET_ACTION 0xffff0000U
+#define SECCOMP_RET_DATA 0x0000ffffU
+
/* Format of the data the BPF program executes over. */
struct seccomp_data {
int nr;
@@ -57,11 +61,17 @@ struct seccomp {
struct seccomp_filter *filter;
};
-extern void __secure_computing(int);
-static inline void secure_computing(int this_syscall)
+/*
+ * Direct callers to __secure_computing should be updated as
+ * CONFIG_HAVE_ARCH_SECCOMP_FILTER propagates.
+ */
+extern void __secure_computing(int) __deprecated;
+extern int __secure_computing_int(int);
+static inline int secure_computing(int this_syscall)
{
if (unlikely(test_thread_flag(TIF_SECCOMP)))
- __secure_computing(this_syscall);
+ return __secure_computing_int(this_syscall);
+ return 0;
}
extern long prctl_get_seccomp(void);
diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index 14d1869..55d000d 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -137,25 +137,22 @@ static void *bpf_pointer(const void *nr, int off, unsigned int size, void *buf)
static u32 seccomp_run_filters(int syscall)
{
struct seccomp_filter *f;
- const struct bpf_load_fns loaders = { bpf_pointer, bpf_length };
- u32 ret = SECCOMP_RET_KILL;
+ const struct bpf_load_fns fns = { bpf_pointer, bpf_length };
+ u32 ret = SECCOMP_RET_ALLOW;
const void *sc_ptr = (const void *)(uintptr_t)syscall;
/* It's not possible for the filter to be NULL here. */
#ifdef CONFIG_COMPAT
if (current->seccomp.filter->compat != !!(is_compat_task()))
- return ret;
+ return SECCOMP_RET_KILL;
#endif
/*
* All filters are evaluated in order of youngest to oldest. The lowest
* BPF return value always takes priority.
*/
- for (f = current->seccomp.filter; f; f = f->prev) {
- ret = bpf_run_filter(sc_ptr, f->insns, &loaders);
- if (ret != SECCOMP_RET_ALLOW)
- break;
- }
+ for (f = current->seccomp.filter; f; f = f->prev)
+ ret = min_t(u32, ret, bpf_run_filter(sc_ptr, f->insns, &fns));
return ret;
}
@@ -314,6 +311,13 @@ static int mode1_syscalls_32[] = {
void __secure_computing(int this_syscall)
{
+ /* Filter calls should never use this function. */
+ BUG_ON(current->seccomp.mode == SECCOMP_MODE_FILTER);
+ __secure_computing_int(this_syscall);
+}
+
+int __secure_computing_int(int this_syscall)
+{
int mode = current->seccomp.mode;
int *syscall;
@@ -326,15 +330,28 @@ void __secure_computing(int this_syscall)
#endif
do {
if (*syscall == this_syscall)
- return;
+ return 0;
} while (*++syscall);
break;
#ifdef CONFIG_SECCOMP_FILTER
- case SECCOMP_MODE_FILTER:
- if (seccomp_run_filters(this_syscall) == SECCOMP_RET_ALLOW)
- return;
+ case SECCOMP_MODE_FILTER: {
+ u32 action = seccomp_run_filters(this_syscall);
+ switch (action & SECCOMP_RET_ACTION) {
+ case SECCOMP_RET_ERRNO:
+ /* Set the low-order 16-bits as an errno. */
+ syscall_set_return_value(current, task_pt_regs(current),
+ -(action & SECCOMP_RET_DATA),
+ 0);
+ return -1;
+ case SECCOMP_RET_ALLOW:
+ return 0;
+ case SECCOMP_RET_KILL:
+ default:
+ break;
+ }
seccomp_filter_log_failure(this_syscall);
break;
+ }
#endif
default:
BUG();
@@ -345,6 +362,7 @@ void __secure_computing(int this_syscall)
#endif
audit_seccomp(this_syscall);
do_exit(SIGKILL);
+ return -1; /* never reached */
}
long prctl_get_seccomp(void)
--
1.7.5.4
[This patch depends on [email protected]'s no_new_privs patch:
https://lkml.org/lkml/2012/1/30/264
]
This patch adds support for seccomp mode 2. Mode 2 introduces the
ability for unprivileged processes to install system call filtering
policy expressed in terms of a Berkeley Packet Filter (BPF) program.
This program is evaluated in the kernel for each system call
the task makes and computes a result based on data in the format
of struct seccomp_data.
A filter program may be installed by calling:
struct sock_fprog fprog = { ... };
...
prctl(PR_SET_SECCOMP, 2, &fprog);
The return value of the filter program determines whether the system call
is allowed to proceed or is denied. If the first filter program installed
allows prctl(2) calls, then the above call may be made repeatedly
by a task to further reduce its access to the kernel. All attached
programs must be evaluated before a system call will be allowed to
proceed.
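As a minimal sketch of that attach sequence (illustrative only; the
PR_SET_NO_NEW_PRIVS value comes from the prerequisite no_new_privs
patch and is redefined here only in case installed headers lack it):

#include <stdio.h>
#include <sys/prctl.h>
#include <linux/filter.h>
#include <linux/seccomp.h>

#ifndef PR_SET_NO_NEW_PRIVS
#define PR_SET_NO_NEW_PRIVS 36
#endif

/* Attach @insns as a mode 2 (filter) seccomp policy for the caller. */
static int install_filter(struct sock_filter *insns, unsigned short len)
{
	struct sock_fprog fprog = {
		.len = len,
		.filter = insns,
	};

	/* Unprivileged callers must set no_new_privs before attaching. */
	if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0)) {
		perror("prctl(PR_SET_NO_NEW_PRIVS)");
		return -1;
	}
	if (prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &fprog)) {
		perror("prctl(PR_SET_SECCOMP)");
		return -1;
	}
	return 0;
}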
To avoid CONFIG_COMPAT-related landmines, once a filter program is
installed with a specific is_compat_task() value, the task is not allowed
to make system calls using the alternate entry point.
Filter programs will be inherited across fork/clone and execve.
However, if the task attaching the filter is unprivileged
(!CAP_SYS_ADMIN), the no_new_privs bit must already be set on the task.
This ensures that unprivileged tasks cannot attach filters that affect
privileged tasks (e.g., a setuid binary).
There are a number of benefits to this approach, a few of which are
as follows:
- BPF has been exposed to userland for a long time
- BPF optimization (and JIT'ing) are well understood
- Userland already knows its ABI: system call numbers and desired
arguments
- No time-of-check-time-of-use vulnerable data accesses are possible.
- System call arguments are loaded on access only to minimize copying
required for system call policy decisions.
Mode 2 support is restricted to architectures that enable
HAVE_ARCH_SECCOMP_FILTER. In this patch, the primary dependency is on
syscall_get_arguments(). The full desired scope of this feature will
add a few minor additional requirements expressed later in this series.
Based on discussion, SECCOMP_RET_ERRNO and SECCOMP_RET_TRACE seem to be
the desired additional functionality.
No architectures are enabled in this patch.
v8: - use bpf_chk_filter, bpf_run_filter. update load_fns
- Lots of fixes courtesy of [email protected]:
-- fix up load behavior, compat fixups, and merge alloc code,
-- renamed pc and dropped __packed, use bool compat.
-- Added a hidden CONFIG_SECCOMP_FILTER to synthesize non-arch
dependencies
v7: (massive overhaul thanks to Indan, others)
- added CONFIG_HAVE_ARCH_SECCOMP_FILTER
- merged into seccomp.c
- minimal seccomp_filter.h
- no config option (part of seccomp)
- no new prctl
- doesn't break seccomp on systems without asm/syscall.h
(works but arg access always fails)
- dropped seccomp_init_task, extra free functions, ...
- dropped the no-asm/syscall.h code paths
- merges with network sk_run_filter and sk_chk_filter
v6: - fix memory leak on attach compat check failure
- require no_new_privs || CAP_SYS_ADMIN prior to filter
installation. ([email protected])
- s/seccomp_struct_/seccomp_/ for macros/functions ([email protected])
- cleaned up Kconfig ([email protected])
- on block, note if the call was compat (so the # means something)
v5: - uses syscall_get_arguments
([email protected],[email protected], [email protected])
- uses union-based arg storage with hi/lo struct to
handle endianness. Compromises between the two alternate
proposals to minimize extra arg shuffling and account for
endianness assuming userspace uses offsetof().
([email protected], [email protected])
- update Kconfig description
- add include/seccomp_filter.h and add its installation
- (naive) on-demand syscall argument loading
- drop seccomp_t ([email protected])
v4: - adjusted prctl to make room for PR_[SG]ET_NO_NEW_PRIVS
- now uses current->no_new_privs
([email protected],[email protected])
- assign names to seccomp modes ([email protected])
- fix style issues ([email protected])
- reworded Kconfig entry ([email protected])
v3: - macros to inline ([email protected])
- init_task behavior fixed ([email protected])
- drop creator entry and extra NULL check ([email protected])
- alloc returns -EINVAL on bad sizing ([email protected])
- adds tentative use of "always_unprivileged" as per
[email protected] and [email protected]
v2: - (patch 2 only)
Signed-off-by: Will Drewry <[email protected]>
---
arch/Kconfig | 17 +++
include/linux/Kbuild | 1 +
include/linux/seccomp.h | 69 ++++++++++-
kernel/fork.c | 3 +
kernel/seccomp.c | 327 ++++++++++++++++++++++++++++++++++++++++++++--
kernel/sys.c | 2 +-
6 files changed, 399 insertions(+), 20 deletions(-)
diff --git a/arch/Kconfig b/arch/Kconfig
index 4f55c73..c6ba1db 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -199,4 +199,21 @@ config HAVE_CMPXCHG_LOCAL
config HAVE_CMPXCHG_DOUBLE
bool
+config HAVE_ARCH_SECCOMP_FILTER
+ bool
+ help
+ This symbol should be selected by an architecture if it provides
+ asm/syscall.h, specifically syscall_get_arguments().
+
+config SECCOMP_FILTER
+ def_bool y
+ depends on HAVE_ARCH_SECCOMP_FILTER && SECCOMP && NET
+ help
+ Enable tasks to build secure computing environments defined
+ in terms of Berkeley Packet Filter programs which implement
+ task-defined system call filtering policies.
+
+ See Documentation/prctl/seccomp_filter.txt for more
+ information on the topic of seccomp filtering.
+
source "kernel/gcov/Kconfig"
diff --git a/include/linux/Kbuild b/include/linux/Kbuild
index c94e717..d41ba12 100644
--- a/include/linux/Kbuild
+++ b/include/linux/Kbuild
@@ -330,6 +330,7 @@ header-y += scc.h
header-y += sched.h
header-y += screen_info.h
header-y += sdla.h
+header-y += seccomp.h
header-y += securebits.h
header-y += selinux_netlink.h
header-y += sem.h
diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h
index d61f27f..2bee1f7 100644
--- a/include/linux/seccomp.h
+++ b/include/linux/seccomp.h
@@ -1,14 +1,60 @@
#ifndef _LINUX_SECCOMP_H
#define _LINUX_SECCOMP_H
+#include <linux/compiler.h>
+#include <linux/types.h>
+
+
+/* Valid values for seccomp.mode and prctl(PR_SET_SECCOMP, <mode>) */
+#define SECCOMP_MODE_DISABLED 0 /* seccomp is not in use. */
+#define SECCOMP_MODE_STRICT 1 /* uses hard-coded filter. */
+#define SECCOMP_MODE_FILTER 2 /* uses user-supplied filter. */
+
+/*
+ * BPF programs may return a 32-bit value.
+ * The bottom 16-bits are reserved for future use.
+ * The upper 16-bits are ordered from least permissive values to most.
+ *
+ * The ordering ensures that a min_t() over composed return values always
+ * selects the least permissive choice.
+ */
+#define SECCOMP_RET_MASK 0xffff0000U
+#define SECCOMP_RET_KILL 0x00000000U /* kill the task immediately */
+#define SECCOMP_RET_ALLOW 0x7fff0000U /* allow */
+
+/* Format of the data the BPF program executes over. */
+struct seccomp_data {
+ int nr;
+ __u32 __reserved[3];
+ struct {
+ __u32 lo;
+ __u32 hi;
+ } instruction_pointer;
+ __u32 lo32[6];
+ __u32 hi32[6];
+};
+#ifdef __KERNEL__
#ifdef CONFIG_SECCOMP
#include <linux/thread_info.h>
#include <asm/seccomp.h>
+struct seccomp_filter;
+/**
+ * struct seccomp - the state of a seccomp'ed process
+ *
+ * @mode: indicates one of the valid values above for controlled
+ * system calls available to a process.
+ * @filter: The metadata and ruleset for determining what system calls
+ * are allowed for a task.
+ *
+ * @filter must only be accessed from the context of current as there
+ * is no locking.
+ */
struct seccomp {
int mode;
+ struct seccomp_filter *filter;
};
extern void __secure_computing(int);
@@ -19,7 +65,7 @@ static inline void secure_computing(int this_syscall)
}
extern long prctl_get_seccomp(void);
-extern long prctl_set_seccomp(unsigned long);
+extern long prctl_set_seccomp(unsigned long, char __user *);
static inline int seccomp_mode(struct seccomp *s)
{
@@ -31,15 +77,16 @@ static inline int seccomp_mode(struct seccomp *s)
#include <linux/errno.h>
struct seccomp { };
+struct seccomp_filter { };
-#define secure_computing(x) do { } while (0)
+#define secure_computing(x) 0
static inline long prctl_get_seccomp(void)
{
return -EINVAL;
}
-static inline long prctl_set_seccomp(unsigned long arg2)
+static inline long prctl_set_seccomp(unsigned long arg2, char __user *arg3)
{
return -EINVAL;
}
@@ -48,7 +95,21 @@ static inline int seccomp_mode(struct seccomp *s)
{
return 0;
}
-
#endif /* CONFIG_SECCOMP */
+#ifdef CONFIG_SECCOMP_FILTER
+extern void put_seccomp_filter(struct seccomp_filter *);
+extern void copy_seccomp(struct seccomp *child,
+ const struct seccomp *parent);
+#else /* CONFIG_SECCOMP_FILTER */
+/* The macro consumes the ->filter reference. */
+#define put_seccomp_filter(_s) do { } while (0)
+
+static inline void copy_seccomp(struct seccomp *child,
+ const struct seccomp *prev)
+{
+ return;
+}
+#endif /* CONFIG_SECCOMP_FILTER */
+#endif /* __KERNEL__ */
#endif /* _LINUX_SECCOMP_H */
diff --git a/kernel/fork.c b/kernel/fork.c
index b77fd55..a5187b7 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -34,6 +34,7 @@
#include <linux/cgroup.h>
#include <linux/security.h>
#include <linux/hugetlb.h>
+#include <linux/seccomp.h>
#include <linux/swap.h>
#include <linux/syscalls.h>
#include <linux/jiffies.h>
@@ -169,6 +170,7 @@ void free_task(struct task_struct *tsk)
free_thread_info(tsk->stack);
rt_mutex_debug_task_free(tsk);
ftrace_graph_exit_task(tsk);
+ put_seccomp_filter(tsk->seccomp.filter);
free_task_struct(tsk);
}
EXPORT_SYMBOL(free_task);
@@ -1113,6 +1115,7 @@ static struct task_struct *copy_process(unsigned long clone_flags,
goto fork_out;
ftrace_graph_init_task(p);
+ copy_seccomp(&p->seccomp, &current->seccomp);
rt_mutex_init_task(p);
diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index e8d76c5..14d1869 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -3,16 +3,297 @@
*
* Copyright 2004-2005 Andrea Arcangeli <[email protected]>
*
- * This defines a simple but solid secure-computing mode.
+ * Copyright (C) 2012 Google, Inc.
+ * Will Drewry <[email protected]>
+ *
+ * This defines a simple but solid secure-computing facility.
+ *
+ * Mode 1 uses a fixed list of allowed system calls.
+ * Mode 2 allows user-defined system call filters in the form
+ * of Berkeley Packet Filters/Linux Socket Filters.
*/
#include <linux/audit.h>
+#include <linux/filter.h>
#include <linux/seccomp.h>
#include <linux/sched.h>
#include <linux/compat.h>
+#include <linux/atomic.h>
+#include <linux/security.h>
+
+#include <linux/slab.h>
+#include <linux/uaccess.h>
+#include <linux/user.h>
+
+#include <linux/tracehook.h>
+#include <asm/syscall.h>
+
/* #define SECCOMP_DEBUG 1 */
-#define NR_SECCOMP_MODES 1
+
+#ifdef CONFIG_SECCOMP_FILTER
+/**
+ * struct seccomp_filter - container for seccomp BPF programs
+ *
+ * @usage: reference count to manage the object lifetime.
+ * get/put helpers should be used when accessing an instance
+ * outside of a lifetime-guarded section. In general, this
+ * is only needed for handling filters shared across tasks.
+ * @prev: points to a previously installed, or inherited, filter
+ * @compat: indicates the value of is_compat_task() at creation time
+ * @insns: the BPF program instructions to evaluate
+ * @count: the number of instructions in the program
+ *
+ * seccomp_filter objects are organized in a tree linked via the @prev
+ * pointer. For any task, it appears to be a singly-linked list starting
+ * with current->seccomp.filter, the most recently attached or inherited filter.
+ * However, multiple filters may share a @prev node, by way of fork(), which
+ * results in a unidirectional tree existing in memory. This is similar to
+ * how namespaces work.
+ *
+ * seccomp_filter objects should never be modified after being attached
+ * to a task_struct (other than @usage).
+ */
+struct seccomp_filter {
+ atomic_t usage;
+ struct seccomp_filter *prev;
+ bool compat;
+ unsigned short count; /* Instruction count */
+ struct sock_filter insns[];
+};
+
+static void seccomp_filter_log_failure(int syscall)
+{
+ int compat = 0;
+#ifdef CONFIG_COMPAT
+ compat = is_compat_task();
+#endif
+ pr_info("%s[%d]: %ssystem call %d blocked at 0x%lx\n",
+ current->comm, task_pid_nr(current),
+ (compat ? "compat " : ""),
+ syscall, KSTK_EIP(current));
+}
+
+static inline u32 get_high_bits(unsigned long value)
+{
+ int bits = 32;
+ return value >> bits;
+}
+
+static inline u32 bpf_length(const void *data)
+{
+ return sizeof(struct seccomp_data);
+}
+
+/**
+ * bpf_pointer: checks and returns a pointer to the requested offset
+ * @nr: int syscall passed as a void * to bpf_run_filter
+ * @off: offset into struct seccomp_data to load from
+ * @size: load width requested
+ * @buf: temporary storage supplied by bpf_run_filter
+ *
+ * Returns a pointer to @buffer where the value was stored.
+ * On failure, returns NULL.
+ */
+static void *bpf_pointer(const void *nr, int off, unsigned int size, void *buf)
+{
+ unsigned long value;
+ u32 *A = (u32 *)buf;
+
+ if (size != sizeof(u32))
+ return NULL;
+
+#define BPF_DATA(_name) offsetof(struct seccomp_data, _name)
+ /* Index by entry instead of by byte. */
+ if (off == BPF_DATA(nr)) {
+ *A = (u32)(uintptr_t)nr;
+ } else if (off == BPF_DATA(instruction_pointer.lo)) {
+ *A = KSTK_EIP(current);
+ } else if (off == BPF_DATA(instruction_pointer.hi)) {
+ *A = get_high_bits(KSTK_EIP(current));
+ } else if (off >= BPF_DATA(lo32[0]) && off <= BPF_DATA(lo32[5])) {
+ struct pt_regs *regs = task_pt_regs(current);
+ int arg = (off - BPF_DATA(lo32[0])) >> 2;
+ syscall_get_arguments(current, regs, arg, 1, &value);
+ *A = value;
+ } else if (off >= BPF_DATA(hi32[0]) && off <= BPF_DATA(hi32[5])) {
+ struct pt_regs *regs = task_pt_regs(current);
+ int arg = (off - BPF_DATA(hi32[0])) >> 2;
+ syscall_get_arguments(current, regs, arg, 1, &value);
+ *A = get_high_bits(value);
+ } else {
+ return NULL;
+ }
+#undef BPF_DATA
+ return buf;
+}
+
+/**
+ * seccomp_run_filters - run 'current' against the given syscall
+ * @syscall: number of the current system call
+ *
+ * Returns valid seccomp BPF response codes.
+ */
+static u32 seccomp_run_filters(int syscall)
+{
+ struct seccomp_filter *f;
+ const struct bpf_load_fns loaders = { bpf_pointer, bpf_length };
+ u32 ret = SECCOMP_RET_KILL;
+ const void *sc_ptr = (const void *)(uintptr_t)syscall;
+
+ /* It's not possible for the filter to be NULL here. */
+#ifdef CONFIG_COMPAT
+ if (current->seccomp.filter->compat != !!(is_compat_task()))
+ return ret;
+#endif
+
+ /*
+ * All filters are evaluated in order of youngest to oldest. The lowest
+ * BPF return value always takes priority.
+ */
+ for (f = current->seccomp.filter; f; f = f->prev) {
+ ret = bpf_run_filter(sc_ptr, f->insns, &loaders);
+ if (ret != SECCOMP_RET_ALLOW)
+ break;
+ }
+ return ret;
+}
+
+/**
+ * seccomp_attach_filter: Attaches a seccomp filter to current.
+ * @fprog: BPF program to install
+ *
+ * Returns 0 on success or an errno on failure.
+ */
+static long seccomp_attach_filter(struct sock_fprog *fprog)
+{
+ struct seccomp_filter *filter = NULL;
+ unsigned long fp_size = fprog->len * sizeof(struct sock_filter);
+ long ret = -EINVAL;
+
+ if (fprog->len == 0 || fprog->len > BPF_MAXINSNS)
+ goto out;
+
+ /* Allocate a new seccomp_filter */
+ ret = -ENOMEM;
+ filter = kzalloc(sizeof(struct seccomp_filter) + fp_size, GFP_KERNEL);
+ if (!filter)
+ goto out;
+ atomic_set(&filter->usage, 1);
+ filter->count = fprog->len;
+
+ /* Copy the instructions from fprog. */
+ ret = -EFAULT;
+ if (copy_from_user(filter->insns, fprog->filter, fp_size))
+ goto out;
+
+ /* Check the fprog */
+ ret = bpf_chk_filter(filter->insns, filter->count, BPF_CHK_FLAGS_NO_SKB);
+ if (ret)
+ goto out;
+
+ /*
+ * Installing a seccomp filter requires that the task
+ * have CAP_SYS_ADMIN in its namespace or be running with
+ * no_new_privs. This avoids scenarios where unprivileged
+ * tasks can affect the behavior of privileged children.
+ */
+ ret = -EACCES;
+ if (!current->no_new_privs &&
+ security_capable_noaudit(current_cred(), current_user_ns(),
+ CAP_SYS_ADMIN) != 0)
+ goto out;
+
+ /* Lock the filter to the current calling convention. */
+#ifdef CONFIG_COMPAT
+ filter->compat = !!(is_compat_task());
+#endif
+
+ /*
+ * If there is an existing filter, make it the prev
+ * and don't drop its task reference.
+ */
+ filter->prev = current->seccomp.filter;
+ current->seccomp.filter = filter;
+ return 0;
+out:
+ put_seccomp_filter(filter); /* for get or task, on err */
+ return ret;
+}
+
+/**
+ * seccomp_attach_user_filter - attaches a user-supplied sock_fprog
+ * @user_filter: pointer to the user data containing a sock_fprog.
+ *
+ * This function may be called repeatedly to install additional filters.
+ * Every filter successfully installed will be evaluated (in reverse order)
+ * for each system call the task makes.
+ *
+ * Returns 0 on success and non-zero otherwise.
+ */
+long seccomp_attach_user_filter(char __user *user_filter)
+{
+ struct sock_fprog fprog;
+ long ret = -EFAULT;
+
+ if (!user_filter)
+ goto out;
+#ifdef CONFIG_COMPAT
+ if (is_compat_task()) {
+ /* XXX: Share with net/compat.c */
+ struct {
+ u16 len;
+ compat_uptr_t filter; /* struct sock_filter */
+ } fprog32;
+ if (copy_from_user(&fprog32, user_filter, sizeof(fprog32)))
+ goto out;
+ fprog.len = fprog32.len;
+ fprog.filter = compat_ptr(fprog32.filter);
+ } else
+#endif
+ if (copy_from_user(&fprog, user_filter, sizeof(fprog)))
+ goto out;
+ ret = seccomp_attach_filter(&fprog);
+out:
+ return ret;
+}
+
+/* get_seccomp_filter - increments the reference count of @orig. */
+static struct seccomp_filter *get_seccomp_filter(struct seccomp_filter *orig)
+{
+ if (!orig)
+ return NULL;
+ /* Reference count is bounded by the number of total processes. */
+ atomic_inc(&orig->usage);
+ return orig;
+}
+
+/* put_seccomp_filter - decrements the ref count of @orig and may free. */
+void put_seccomp_filter(struct seccomp_filter *orig)
+{
+ /* Clean up single-reference branches iteratively. */
+ while (orig && atomic_dec_and_test(&orig->usage)) {
+ struct seccomp_filter *freeme = orig;
+ orig = orig->prev;
+ kfree(freeme);
+ }
+}
+
+/**
+ * copy_seccomp: manages inheritance on fork
+ * @child: forkee's seccomp
+ * @prev: forker's seccomp
+ *
+ * Ensures that @child inherits seccomp mode and state if
+ * seccomp filtering is in use.
+ */
+void copy_seccomp(struct seccomp *child,
+ const struct seccomp *prev)
+{
+ child->mode = prev->mode;
+ child->filter = get_seccomp_filter(prev->filter);
+}
+#endif /* CONFIG_SECCOMP_FILTER */
/*
* Secure computing mode 1 allows only read/write/exit/sigreturn.
@@ -34,10 +315,10 @@ static int mode1_syscalls_32[] = {
void __secure_computing(int this_syscall)
{
int mode = current->seccomp.mode;
- int * syscall;
+ int *syscall;
switch (mode) {
- case 1:
+ case SECCOMP_MODE_STRICT:
syscall = mode1_syscalls;
#ifdef CONFIG_COMPAT
if (is_compat_task())
@@ -48,6 +329,13 @@ void __secure_computing(int this_syscall)
return;
} while (*++syscall);
break;
+#ifdef CONFIG_SECCOMP_FILTER
+ case SECCOMP_MODE_FILTER:
+ if (seccomp_run_filters(this_syscall) == SECCOMP_RET_ALLOW)
+ return;
+ seccomp_filter_log_failure(this_syscall);
+ break;
+#endif
default:
BUG();
}
@@ -64,25 +352,34 @@ long prctl_get_seccomp(void)
return current->seccomp.mode;
}
-long prctl_set_seccomp(unsigned long seccomp_mode)
+long prctl_set_seccomp(unsigned long seccomp_mode, char __user *filter)
{
- long ret;
+ long ret = -EINVAL;
- /* can set it only once to be even more secure */
- ret = -EPERM;
- if (unlikely(current->seccomp.mode))
+ if (current->seccomp.mode &&
+ current->seccomp.mode != seccomp_mode)
goto out;
- ret = -EINVAL;
- if (seccomp_mode && seccomp_mode <= NR_SECCOMP_MODES) {
- current->seccomp.mode = seccomp_mode;
- set_thread_flag(TIF_SECCOMP);
+ switch (seccomp_mode) {
+ case SECCOMP_MODE_STRICT:
+ ret = 0;
#ifdef TIF_NOTSC
disable_TSC();
#endif
- ret = 0;
+ break;
+#ifdef CONFIG_SECCOMP_FILTER
+ case SECCOMP_MODE_FILTER:
+ ret = seccomp_attach_user_filter(filter);
+ if (ret)
+ goto out;
+ break;
+#endif
+ default:
+ goto out;
}
- out:
+ current->seccomp.mode = seccomp_mode;
+ set_thread_flag(TIF_SECCOMP);
+out:
return ret;
}
diff --git a/kernel/sys.c b/kernel/sys.c
index 4070153..905031e 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -1899,7 +1899,7 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
error = prctl_get_seccomp();
break;
case PR_SET_SECCOMP:
- error = prctl_set_seccomp(arg2);
+ error = prctl_set_seccomp(arg2, (char __user *)arg3);
break;
case PR_GET_TSC:
error = GET_TSC_CTL(arg2);
--
1.7.5.4
On 02/16/2012 12:02 PM, Will Drewry wrote:
> +
> +/* Format of the data the BPF program executes over. */
> +struct seccomp_data {
> + int nr;
> + __u32 __reserved[3];
> + struct {
> + __u32 lo;
> + __u32 hi;
> + } instruction_pointer;
> + __u32 lo32[6];
> + __u32 hi32[6];
> +};
>
This seems more than a bit odd, no?
-hpa
--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel. I don't speak on their behalf.
Ouch. I cross-posted this whole series to
[email protected] instead of .com. If you reply,
sorry for the extra bounce.
On Thu, Feb 16, 2012 at 2:06 PM, H. Peter Anvin <[email protected]> wrote:
> On 02/16/2012 12:02 PM, Will Drewry wrote:
>> +
>> +/* Format of the data the BPF program executes over. */
>> +struct seccomp_data {
>> +	int nr;
>> +	__u32 __reserved[3];
>> +	struct {
>> +		__u32	lo;
>> +		__u32	hi;
>> +	} instruction_pointer;
>> +	__u32 lo32[6];
>> +	__u32 hi32[6];
>> +};
>>
>
> This seems more than a bit odd, no?
>
>	-hpa
I agree :) BPF being a 32-bit creature introduced some edge cases. I
had started with a
union { u32 args32[6]; u64 args64[6]; }
This was somewhat derailed by CONFIG_COMPAT behavior where
syscall_get_arguments always writes arguments of register width --
not bad, just irritating (since a copy isn't strictly necessary nor
actually done in the patch). Also, Indan pointed out that while BPF
programs expect constants in the machine-local endian layout, any
consumers would need to change how they accessed the arguments across
big/little endian machines since a load of the low-order bits would
vary.
In a second pass, I attempted to resolve this like aio_abi.h:
union {
struct {
u32 ENDIAN_SWAP(lo32, hi32);
};
u64 arg64;
} args[6];
It wasn't clear that this actually made matters better (though it did
mean syscall_get_arguments() could write directly to arg64). Using
offsetof() in the user program would be fine, but any offsets set
another way would be invalid. At that point, I moved to Indan's
proposal to stabilize low order and high order offsets -- what is in
the patch series. Now a BPF program can reliably index into the low
bits of an argument and into the high bits without endianness changing
the filter program structure.
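For illustration only, a filter fragment indexing argument 0 with this
layout looks like the following; the same offsets work on little- and
big-endian hosts (ARG_LO/ARG_HI are just local helper macros):

#include <stddef.h>
#include <linux/filter.h>
#include <linux/seccomp.h>

#define ARG_LO(n) offsetof(struct seccomp_data, lo32[(n)])
#define ARG_HI(n) offsetof(struct seccomp_data, hi32[(n)])

static const struct sock_filter arg0_loads[] = {
	/* A = low 32 bits of the first syscall argument */
	BPF_STMT(BPF_LD | BPF_W | BPF_ABS, ARG_LO(0)),
	/* ... test A here ... */
	/* A = high 32 bits of the same argument */
	BPF_STMT(BPF_LD | BPF_W | BPF_ABS, ARG_HI(0)),
	/* ... test A, then return a SECCOMP_RET_* value ... */
	BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
};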
I don't feel strongly about any given data layout, and this one seems
to balance the 32-bit-ness of BPF and the impact that has on
endianness. I'm happy to hear alternatives that might be more
aesthetically pleasing :)
cheers!
will
On Thu, Feb 16, 2012 at 12:02, Will Drewry <[email protected]> wrote:
> Adds a new return value to seccomp filters that triggers a SIGTRAP to be delivered with the new TRAP_SECCOMP si_code.
>
> This allows in-process system call emulation -- including just specifying an errno or cleanly dumping core -- rather than just dying.
SIGTRAP might not be the ideal choice of signal number, as it can make
it very difficult to debug the program in gdb. Other than that, I love
this feature. It'll significantly simplify the code that we have in
Chrome.
Markus
On Thu, Feb 16, 2012 at 2:24 PM, Markus Gutschke <[email protected]> wrote:
> SIGTRAP might not be the ideal choice of signal number, as it can make it
> very difficult to debug the program in gdb.
True enough. In theory, we could use the lower 16-bits of the return
value to let the bpf program set a signal, but not all signals are
masked as synchronous, and those that are probably get gdb's attention,
just not as severely :) (ILL, SEGV, BUS, TRAP, FPE). Perhaps SIGILL is
a logically appropriate option -- or letting the api user decide from
the SYNCHRONOUS_MASK set. I'm open to whatever makes sense, though.
(I wasn't even sure if it was kosher to add a new TRAP_SECCOMP value.)
cheers!
will
Documents how system call filtering using Berkeley Packet
Filter programs works and how it may be used.
Includes an example for x86 (32-bit) and a semi-generic
example using a macro-based code generator.
v9: - updated bpf-direct.c for SIGILL
v8: - add PR_SET_NO_NEW_PRIVS to the samples.
v7: - updated for all the new stuff in v7: TRAP, TRACE
- only talk about PR_SET_SECCOMP now
- fixed bad JLE32 check ([email protected])
- adds dropper.c: a simple system call disabler
v6: - tweak the language to note the requirement of
PR_SET_NO_NEW_PRIVS being called prior to use. ([email protected])
v5: - update sample to use system call arguments
- adds a "fancy" example using a macro-based generator
- cleaned up bpf in the sample
- update docs to mention arguments
- fix prctl value ([email protected])
- language cleanup ([email protected])
v4: - update for no_new_privs use
- minor tweaks
v3: - call out BPF <-> Berkeley Packet Filter ([email protected])
- document use of tentative always-unprivileged
- guard sample compilation for i386 and x86_64
v2: - move code to samples ([email protected])
Signed-off-by: Will Drewry <[email protected]>
---
Documentation/prctl/seccomp_filter.txt | 155 +++++++++++++++++++++
samples/Makefile | 2 +-
samples/seccomp/Makefile | 31 ++++
samples/seccomp/bpf-direct.c | 150 ++++++++++++++++++++
samples/seccomp/bpf-fancy.c | 101 ++++++++++++++
samples/seccomp/bpf-helper.c | 89 ++++++++++++
samples/seccomp/bpf-helper.h | 234 ++++++++++++++++++++++++++++++++
samples/seccomp/dropper.c | 52 +++++++
8 files changed, 813 insertions(+), 1 deletions(-)
create mode 100644 Documentation/prctl/seccomp_filter.txt
create mode 100644 samples/seccomp/Makefile
create mode 100644 samples/seccomp/bpf-direct.c
create mode 100644 samples/seccomp/bpf-fancy.c
create mode 100644 samples/seccomp/bpf-helper.c
create mode 100644 samples/seccomp/bpf-helper.h
create mode 100644 samples/seccomp/dropper.c
diff --git a/Documentation/prctl/seccomp_filter.txt b/Documentation/prctl/seccomp_filter.txt
new file mode 100644
index 0000000..2c6bd12
--- /dev/null
+++ b/Documentation/prctl/seccomp_filter.txt
@@ -0,0 +1,155 @@
+ SECure COMPuting with filters
+ =============================
+
+Introduction
+------------
+
+A large number of system calls are exposed to every userland process
+with many of them going unused for the entire lifetime of the process.
+As system calls change and mature, bugs are found and eradicated. A
+certain subset of userland applications benefit by having a reduced set
+of available system calls. The resulting set reduces the total kernel
+surface exposed to the application. System call filtering is meant for
+use with those applications.
+
+Seccomp filtering provides a means for a process to specify a filter for
+incoming system calls. The filter is expressed as a Berkeley Packet
+Filter (BPF) program, as with socket filters, except that the data
+operated on is related to the system call being made: system call
+number and the system call arguments. This allows for expressive
+filtering of system calls using a filter program language with a long
+history of being exposed to userland and a straightforward data set.
+
+Additionally, BPF makes it impossible for users of seccomp to fall prey
+to time-of-check-time-of-use (TOCTOU) attacks that are common in system
+call interposition frameworks. BPF programs may not dereference
+pointers, which constrains all filters to solely evaluating the system
+call arguments directly.
+
+What it isn't
+-------------
+
+System call filtering isn't a sandbox. It provides a clearly defined
+mechanism for minimizing the exposed kernel surface. It is meant to be
+a tool for sandbox developers to use. Beyond that, policy for logical
+behavior and information flow should be managed with a combination of
+other system hardening techniques and, potentially, an LSM of your
+choosing. Expressive, dynamic filters provide further options down this
+path (avoiding pathological sizes or selecting which of the multiplexed
+system calls in socketcall() is allowed, for instance) which could be
+construed, incorrectly, as a more complete sandboxing solution.
+
+Usage
+-----
+
+An additional seccomp mode is added and is enabled using the same
+prctl(2) call as the strict seccomp. If the architecture has
+CONFIG_HAVE_ARCH_SECCOMP_FILTER, then filters may be added as below:
+
+PR_SET_SECCOMP:
+ Now takes an additional argument which specifies a new filter
+ using a BPF program.
+ The BPF program will be executed over struct seccomp_data
+ reflecting the system call number, arguments, and other
+ metadata. The BPF program must then return one of the
+ acceptable values to inform the kernel which action should be
+ taken.
+
+ Usage:
+ prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, prog);
+
+ The 'prog' argument is a pointer to a struct sock_fprog which
+ will contain the filter program. If the program is invalid, the
+ call will return -1 and set errno to EINVAL.
+
+ Note, is_compat_task is also tracked for the @prog. This means
+ that once set the calling task will have all of its system calls
+ blocked if it switches its system call ABI.
+
+ If fork/clone and execve are allowed by @prog, any child
+ processes will be constrained to the same filters and system
+ call ABI as the parent.
+
+ Prior to use, the task must call prctl(PR_SET_NO_NEW_PRIVS, 1) or
+ run with CAP_SYS_ADMIN privileges in its namespace. If these are not
+ true, -EACCES will be returned. This requirement ensures that filter
+ programs cannot be applied to child processes with greater privileges
+ than the task that installed them.
+
+ Additionally, if prctl(2) is allowed by the attached filter,
+ additional filters may be layered on which will increase evaluation
+ time, but allow for further decreasing the attack surface during
+ execution of a process.
+
+The above call returns 0 on success and non-zero on error.
+
+Return values
+-------------
+
+A seccomp filter may return any of the following values:
+ SECCOMP_RET_ALLOW, SECCOMP_RET_KILL, SECCOMP_RET_TRAP,
+ SECCOMP_RET_ERRNO, or SECCOMP_RET_TRACE.
+
+SECCOMP_RET_ALLOW:
+ If all filters for a given task return this value then
+ the system call will proceed normally.
+
+SECCOMP_RET_KILL:
+ If any filters for a given task return this value then
+ the task will exit immediately without executing the system
+ call.
+
+SECCOMP_RET_TRAP:
+ If any filters specify SECCOMP_RET_TRAP and none of them
+ specify SECCOMP_RET_KILL, then the kernel will send a SIGTRAP
+ signal to the task and not execute the system call. The kernel
+ will rollback the register state to just before system call
+ entry such that a signal handler in the process will be able
+ to inspect the ucontext_t->uc_mcontext registers and emulate
+ system call success or failure upon return from the signal
+ handler.
+
+ The SIGTRAP is differentiated from other SIGTRAPs by a si_code
+ of TRAP_SECCOMP.
+
+SECCOMP_RET_ERRNO:
+ If returned, the value provided in the lower 16-bits is
+ returned to userland as the errno and the system call is
+ not executed.
+
+SECCOMP_RET_TRACE:
+ If any filters return this value and the others return
+ SECCOMP_RET_ALLOW, then the kernel will attempt to notify
+ a ptrace()-based tracer prior to executing the system call.
+
+ A tracer will be notified if it is attached with
+ ptrace(PTRACE_SECCOMP, ...). Otherwise, the system call will
+ not execute and -ENOSYS will be returned to userspace.
+
+ If the tracer ignores notification, then the system call will
+ proceed normally. Changes to the registers will function
+ similarly to PTRACE_SYSCALL.
+
+Please note that the order of precedence is as follows:
+SECCOMP_RET_KILL, SECCOMP_RET_ERRNO, SECCOMP_RET_TRAP,
+SECCOMP_RET_TRACE, SECCOMP_RET_ALLOW.
+
+If multiple filters exist, the return value for the evaluation of a given
+system call will always use the highest-precedence value.
+SECCOMP_RET_KILL will always take precedence.
+
+
+Example
+-------
+
+The samples/seccomp/ directory contains both a 32-bit specific example
+and a more generic example of a higher level macro interface for BPF
+program generation.
+
+Adding architecture support
+---------------------------
+
+See arch/Kconfig for the required functionality. In general, if an
+architecture supports both tracehook and seccomp, it will be able to
+support seccomp filter. It then only needs to add
+CONFIG_HAVE_ARCH_SECCOMP_FILTER to its arch-specific Kconfig.
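+
+For example, an architecture's Kconfig entry would gain something like the
+following (a sketch only; the config name is illustrative):
+
+	config MYARCH
+		...
+		select HAVE_ARCH_SECCOMP_FILTER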
diff --git a/samples/Makefile b/samples/Makefile
index 6280817..f29b19c 100644
--- a/samples/Makefile
+++ b/samples/Makefile
@@ -1,4 +1,4 @@
# Makefile for Linux samples code
obj-$(CONFIG_SAMPLES) += kobject/ kprobes/ tracepoints/ trace_events/ \
- hw_breakpoint/ kfifo/ kdb/ hidraw/
+ hw_breakpoint/ kfifo/ kdb/ hidraw/ seccomp/
diff --git a/samples/seccomp/Makefile b/samples/seccomp/Makefile
new file mode 100644
index 0000000..38922f7
--- /dev/null
+++ b/samples/seccomp/Makefile
@@ -0,0 +1,31 @@
+# kbuild trick to avoid linker error. Can be omitted if a module is built.
+obj- := dummy.o
+
+hostprogs-$(CONFIG_SECCOMP) := bpf-fancy dropper
+bpf-fancy-objs := bpf-fancy.o bpf-helper.o
+
+HOSTCFLAGS_bpf-fancy.o += -I$(objtree)/usr/include
+HOSTCFLAGS_bpf-fancy.o += -idirafter $(objtree)/include
+HOSTCFLAGS_bpf-helper.o += -I$(objtree)/usr/include
+HOSTCFLAGS_bpf-helper.o += -idirafter $(objtree)/include
+
+HOSTCFLAGS_dropper.o += -I$(objtree)/usr/include
+HOSTCFLAGS_dropper.o += -idirafter $(objtree)/include
+dropper-objs := dropper.o
+
+# bpf-direct.c is x86-only.
+ifeq ($(filter-out x86_64 i386,$(KBUILD_BUILDHOST)),)
+# List of programs to build
+hostprogs-$(CONFIG_SECCOMP) += bpf-direct
+bpf-direct-objs := bpf-direct.o
+endif
+
+# Tell kbuild to always build the programs
+always := $(hostprogs-y)
+
+HOSTCFLAGS_bpf-direct.o += -I$(objtree)/usr/include
+HOSTCFLAGS_bpf-direct.o += -idirafter $(objtree)/include
+ifeq ($(KBUILD_BUILDHOST),x86_64)
+HOSTCFLAGS_bpf-direct.o += -m32
+HOSTLOADLIBES_bpf-direct += -m32
+endif
diff --git a/samples/seccomp/bpf-direct.c b/samples/seccomp/bpf-direct.c
new file mode 100644
index 0000000..856b93b
--- /dev/null
+++ b/samples/seccomp/bpf-direct.c
@@ -0,0 +1,150 @@
+/*
+ * 32-bit seccomp filter example with BPF macros
+ *
+ * Copyright (c) 2012 The Chromium OS Authors <[email protected]>
+ * Author: Will Drewry <[email protected]>
+ *
+ * The code may be used by anyone for any purpose,
+ * and can serve as a starting point for developing
+ * applications using prctl(PR_SET_SECCOMP, 2, ...).
+ */
+#define __USE_GNU 1
+#define _GNU_SOURCE 1
+
+#include <linux/types.h>
+#include <linux/filter.h>
+#include <linux/seccomp.h>
+#include <linux/unistd.h>
+#include <signal.h>
+#include <stdio.h>
+#include <stddef.h>
+#include <string.h>
+#include <sys/prctl.h>
+#include <unistd.h>
+
+#define syscall_arg(_n) (offsetof(struct seccomp_data, lo32[_n]))
+#define syscall_nr (offsetof(struct seccomp_data, nr))
+
+#ifndef ILL_SECCOMP
+#define ILL_SECCOMP (ILL_BADSTK + 1)
+#endif
+
+#ifndef PR_SET_NO_NEW_PRIVS
+#define PR_SET_NO_NEW_PRIVS 36
+#endif
+
+static void emulator(int nr, siginfo_t *info, void *void_context)
+{
+ ucontext_t *ctx = (ucontext_t *)(void_context);
+ int syscall;
+ char *buf;
+ ssize_t bytes;
+ size_t len;
+ if (info->si_code != ILL_SECCOMP)
+ return;
+ if (!ctx)
+ return;
+ syscall = ctx->uc_mcontext.gregs[REG_EAX];
+ buf = (char *) ctx->uc_mcontext.gregs[REG_ECX];
+ len = (size_t) ctx->uc_mcontext.gregs[REG_EDX];
+
+ if (syscall != __NR_write)
+ return;
+ if (ctx->uc_mcontext.gregs[REG_EBX] != STDERR_FILENO)
+ return;
+ /* Redirect stderr messages to stdout. Doesn't handle EINTR, etc */
+ write(STDOUT_FILENO, "[ERR] ", 6);
+ bytes = write(STDOUT_FILENO, buf, len);
+ ctx->uc_mcontext.gregs[REG_EAX] = bytes;
+ return;
+}
+
+static int install_emulator(void)
+{
+ struct sigaction act;
+ sigset_t mask;
+ memset(&act, 0, sizeof(act));
+ sigemptyset(&mask);
+ sigaddset(&mask, SIGILL);
+
+ act.sa_sigaction = &emulator;
+ act.sa_flags = SA_SIGINFO;
+ if (sigaction(SIGILL, &act, NULL) < 0) {
+ perror("sigaction");
+ return -1;
+ }
+ if (sigprocmask(SIG_UNBLOCK, &mask, NULL)) {
+ perror("sigprocmask");
+ return -1;
+ }
+ return 0;
+}
+
+static int install_filter(void)
+{
+ struct sock_filter filter[] = {
+ /* Grab the system call number */
+ BPF_STMT(BPF_LD+BPF_W+BPF_ABS, syscall_nr),
+ /* Jump table for the allowed syscalls */
+ BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, __NR_rt_sigreturn, 0, 1),
+ BPF_STMT(BPF_RET+BPF_K, SECCOMP_RET_ALLOW),
+ BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, __NR_sigreturn, 0, 1),
+ BPF_STMT(BPF_RET+BPF_K, SECCOMP_RET_ALLOW),
+ BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, __NR_exit_group, 0, 1),
+ BPF_STMT(BPF_RET+BPF_K, SECCOMP_RET_ALLOW),
+ BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, __NR_exit, 0, 1),
+ BPF_STMT(BPF_RET+BPF_K, SECCOMP_RET_ALLOW),
+ BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, __NR_read, 1, 0),
+ BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, __NR_write, 3, 2),
+
+ /* Check that read is only using stdin. */
+ BPF_STMT(BPF_LD+BPF_W+BPF_ABS, syscall_arg(0)),
+ BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, STDIN_FILENO, 4, 0),
+ BPF_STMT(BPF_RET+BPF_K, SECCOMP_RET_KILL),
+
+ /* Check that write is only using stdout */
+ BPF_STMT(BPF_LD+BPF_W+BPF_ABS, syscall_arg(0)),
+ BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, STDOUT_FILENO, 1, 0),
+ /* Trap attempts to write to stderr */
+ BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, STDERR_FILENO, 1, 2),
+
+ BPF_STMT(BPF_RET+BPF_K, SECCOMP_RET_ALLOW),
+ BPF_STMT(BPF_RET+BPF_K, SECCOMP_RET_TRAP),
+ BPF_STMT(BPF_RET+BPF_K, SECCOMP_RET_KILL),
+ };
+ struct sock_fprog prog = {
+ .len = (unsigned short)(sizeof(filter)/sizeof(filter[0])),
+ .filter = filter,
+ };
+
+ if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0)) {
+ perror("prctl(NO_NEW_PRIVS)");
+ return 1;
+ }
+
+
+ if (prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog)) {
+ perror("prctl");
+ return 1;
+ }
+ return 0;
+}
+
+#define payload(_c) (_c), sizeof((_c))
+int main(int argc, char **argv)
+{
+ char buf[4096];
+ ssize_t bytes = 0;
+ if (install_emulator())
+ return 1;
+ if (install_filter())
+ return 1;
+ syscall(__NR_write, STDOUT_FILENO,
+ payload("OHAI! WHAT IS YOUR NAME? "));
+ bytes = syscall(__NR_read, STDIN_FILENO, buf, sizeof(buf));
+ syscall(__NR_write, STDOUT_FILENO, payload("HELLO, "));
+ syscall(__NR_write, STDOUT_FILENO, buf, bytes);
+ syscall(__NR_write, STDERR_FILENO,
+ payload("Error message going to STDERR\n"));
+ return 0;
+}
diff --git a/samples/seccomp/bpf-fancy.c b/samples/seccomp/bpf-fancy.c
new file mode 100644
index 0000000..bcfe3a0
--- /dev/null
+++ b/samples/seccomp/bpf-fancy.c
@@ -0,0 +1,101 @@
+/*
+ * Seccomp BPF example using a macro-based generator.
+ *
+ * Copyright (c) 2012 The Chromium OS Authors <[email protected]>
+ * Author: Will Drewry <[email protected]>
+ *
+ * The code may be used by anyone for any purpose,
+ * and can serve as a starting point for developing
+ * applications using prctl(PR_ATTACH_SECCOMP_FILTER).
+ */
+
+#include <linux/filter.h>
+#include <linux/seccomp.h>
+#include <linux/unistd.h>
+#include <stdio.h>
+#include <string.h>
+#include <sys/prctl.h>
+#include <unistd.h>
+
+#include "bpf-helper.h"
+
+#ifndef PR_SET_NO_NEW_PRIVS
+#define PR_SET_NO_NEW_PRIVS 36
+#endif
+
+int main(int argc, char **argv)
+{
+ struct bpf_labels l;
+ static const char msg1[] = "Please type something: ";
+ static const char msg2[] = "You typed: ";
+ char buf[256];
+ struct sock_filter filter[] = {
+ LOAD_SYSCALL_NR,
+ SYSCALL(__NR_exit, ALLOW),
+ SYSCALL(__NR_exit_group, ALLOW),
+ SYSCALL(__NR_write, JUMP(&l, write_fd)),
+ SYSCALL(__NR_read, JUMP(&l, read)),
+ DENY, /* Don't passthrough into a label */
+
+ LABEL(&l, read),
+ ARG(0),
+ JNE(STDIN_FILENO, DENY),
+ ARG(1),
+ JNE((unsigned long)buf, DENY),
+ ARG(2),
+ JGE(sizeof(buf), DENY),
+ ALLOW,
+
+ LABEL(&l, write_fd),
+ ARG(0),
+ JEQ(STDOUT_FILENO, JUMP(&l, write_buf)),
+ JEQ(STDERR_FILENO, JUMP(&l, write_buf)),
+ DENY,
+
+ LABEL(&l, write_buf),
+ ARG(1),
+ JEQ((unsigned long)msg1, JUMP(&l, msg1_len)),
+ JEQ((unsigned long)msg2, JUMP(&l, msg2_len)),
+ JEQ((unsigned long)buf, JUMP(&l, buf_len)),
+ DENY,
+
+ LABEL(&l, msg1_len),
+ ARG(2),
+ JLT(sizeof(msg1), ALLOW),
+ DENY,
+
+ LABEL(&l, msg2_len),
+ ARG(2),
+ JLT(sizeof(msg2), ALLOW),
+ DENY,
+
+ LABEL(&l, buf_len),
+ ARG(2),
+ JLT(sizeof(buf), ALLOW),
+ DENY,
+ };
+ struct sock_fprog prog = {
+ .filter = filter,
+ .len = (unsigned short)(sizeof(filter)/sizeof(filter[0])),
+ };
+ ssize_t bytes;
+ bpf_resolve_jumps(&l, filter, sizeof(filter)/sizeof(*filter));
+
+ if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0)) {
+ perror("prctl(NO_NEW_PRIVS)");
+ return 1;
+ }
+
+ if (prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog)) {
+ perror("prctl(SECCOMP)");
+ return 1;
+ }
+ syscall(__NR_write, STDOUT_FILENO, msg1, strlen(msg1));
+ bytes = syscall(__NR_read, STDIN_FILENO, buf, sizeof(buf)-1);
+ bytes = (bytes > 0 ? bytes : 0);
+ syscall(__NR_write, STDERR_FILENO, msg2, strlen(msg2));
+ syscall(__NR_write, STDERR_FILENO, buf, bytes);
+ /* Now get killed */
+ syscall(__NR_write, STDERR_FILENO, msg2, strlen(msg2)+2);
+ return 0;
+}
diff --git a/samples/seccomp/bpf-helper.c b/samples/seccomp/bpf-helper.c
new file mode 100644
index 0000000..579cfe3
--- /dev/null
+++ b/samples/seccomp/bpf-helper.c
@@ -0,0 +1,89 @@
+/*
+ * Seccomp BPF helper functions
+ *
+ * Copyright (c) 2012 The Chromium OS Authors <[email protected]>
+ * Author: Will Drewry <[email protected]>
+ *
+ * The code may be used by anyone for any purpose,
+ * and can serve as a starting point for developing
+ * applications using prctl(PR_ATTACH_SECCOMP_FILTER).
+ */
+
+#include <stdio.h>
+#include <string.h>
+
+#include "bpf-helper.h"
+
+int bpf_resolve_jumps(struct bpf_labels *labels,
+ struct sock_filter *filter, size_t count)
+{
+ struct sock_filter *begin = filter;
+ __u8 insn = count - 1;
+
+ if (count < 1)
+ return -1;
+ /*
+ * Walk it once, backwards, to build the label table and do fixups.
+ * Since backward jumps are disallowed by BPF, this is easy.
+ */
+ filter += insn;
+ for (; filter >= begin; --insn, --filter) {
+ if (filter->code != (BPF_JMP+BPF_JA))
+ continue;
+ switch ((filter->jt<<8)|filter->jf) {
+ case (JUMP_JT<<8)|JUMP_JF:
+ if (labels->labels[filter->k].location == 0xffffffff) {
+ fprintf(stderr, "Unresolved label: '%s'\n",
+ labels->labels[filter->k].label);
+ return 1;
+ }
+ filter->k = labels->labels[filter->k].location -
+ (insn + 1);
+ filter->jt = 0;
+ filter->jf = 0;
+ continue;
+ case (LABEL_JT<<8)|LABEL_JF:
+ if (labels->labels[filter->k].location != 0xffffffff) {
+ fprintf(stderr, "Duplicate label use: '%s'\n",
+ labels->labels[filter->k].label);
+ return 1;
+ }
+ labels->labels[filter->k].location = insn;
+ filter->k = 0; /* fall through */
+ filter->jt = 0;
+ filter->jf = 0;
+ continue;
+ }
+ }
+ return 0;
+}
+
+/* Simple lookup table for labels. */
+__u32 seccomp_bpf_label(struct bpf_labels *labels, const char *label)
+{
+ struct __bpf_label *begin = labels->labels, *end;
+ int id;
+ if (labels->count == 0) {
+ begin->label = label;
+ begin->location = 0xffffffff;
+ labels->count++;
+ return 0;
+ }
+ end = begin + labels->count;
+ for (id = 0; begin < end; ++begin, ++id) {
+ if (!strcmp(label, begin->label))
+ return id;
+ }
+ begin->label = label;
+ begin->location = 0xffffffff;
+ labels->count++;
+ return id;
+}
+
+void seccomp_bpf_print(struct sock_filter *filter, size_t count)
+{
+ struct sock_filter *end = filter + count;
+ for ( ; filter < end; ++filter)
+ printf("{ code=%u,jt=%u,jf=%u,k=%u },\n",
+ filter->code, filter->jt, filter->jf, filter->k);
+}
diff --git a/samples/seccomp/bpf-helper.h b/samples/seccomp/bpf-helper.h
new file mode 100644
index 0000000..9c64801
--- /dev/null
+++ b/samples/seccomp/bpf-helper.h
@@ -0,0 +1,234 @@
+/*
+ * Example wrapper around BPF macros.
+ *
+ * Copyright (c) 2012 The Chromium OS Authors <[email protected]>
+ * Author: Will Drewry <[email protected]>
+ *
+ * The code may be used by anyone for any purpose,
+ * and can serve as a starting point for developing
+ * applications using prctl(PR_SET_SECCOMP, 2, ...).
+ *
+ * No guarantees are provided with respect to the correctness
+ * or functionality of this code.
+ */
+#ifndef __BPF_HELPER_H__
+#define __BPF_HELPER_H__
+
+#include <asm/bitsperlong.h> /* for __BITS_PER_LONG */
+#include <linux/filter.h>
+#include <linux/seccomp.h> /* for seccomp_data */
+#include <linux/types.h>
+#include <linux/unistd.h>
+#include <stddef.h>
+
+#define BPF_LABELS_MAX 256
+struct bpf_labels {
+ int count;
+ struct __bpf_label {
+ const char *label;
+ __u32 location;
+ } labels[BPF_LABELS_MAX];
+};
+
+int bpf_resolve_jumps(struct bpf_labels *labels,
+ struct sock_filter *filter, size_t count);
+__u32 seccomp_bpf_label(struct bpf_labels *labels, const char *label);
+void seccomp_bpf_print(struct sock_filter *filter, size_t count);
+
+#define JUMP_JT 0xff
+#define JUMP_JF 0xff
+#define LABEL_JT 0xfe
+#define LABEL_JF 0xfe
+
+#define ALLOW \
+ BPF_STMT(BPF_RET+BPF_K, SECCOMP_RET_ALLOW)
+#define DENY \
+ BPF_STMT(BPF_RET+BPF_K, SECCOMP_RET_KILL)
+#define JUMP(labels, label) \
+ BPF_JUMP(BPF_JMP+BPF_JA, FIND_LABEL((labels), (label)), \
+ JUMP_JT, JUMP_JF)
+#define LABEL(labels, label) \
+ BPF_JUMP(BPF_JMP+BPF_JA, FIND_LABEL((labels), (label)), \
+ LABEL_JT, LABEL_JF)
+#define SYSCALL(nr, jt) \
+ BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, (nr), 0, 1), \
+ jt
+
+/* Lame, but just an example */
+#define FIND_LABEL(labels, label) seccomp_bpf_label((labels), #label)
+
+#define EXPAND(...) __VA_ARGS__
+/* Map all width-sensitive operations */
+#if __BITS_PER_LONG == 32
+
+#define JEQ(x, jt) JEQ32(x, EXPAND(jt))
+#define JNE(x, jt) JNE32(x, EXPAND(jt))
+#define JGT(x, jt) JGT32(x, EXPAND(jt))
+#define JLT(x, jt) JLT32(x, EXPAND(jt))
+#define JGE(x, jt) JGE32(x, EXPAND(jt))
+#define JLE(x, jt) JLE32(x, EXPAND(jt))
+#define JA(x, jt) JA32(x, EXPAND(jt))
+#define ARG(i) ARG_32(i)
+
+#elif __BITS_PER_LONG == 64
+
+#if defined(__LITTLE_ENDIAN)
+#define ENDIAN(_lo, _hi) _lo, _hi
+#elif defined(__BIG_ENDIAN)
+#define ENDIAN(_lo, _hi) _hi, _lo
+#else
+#error "Unknown endianness"
+#endif
+
+union arg64 {
+ struct {
+ __u32 ENDIAN(lo32, hi32);
+ };
+ __u64 u64;
+};
+
+#define JEQ(x, jt) \
+ JEQ64(((union arg64){.u64 = (x)}).lo32, \
+ ((union arg64){.u64 = (x)}).hi32, \
+ EXPAND(jt))
+#define JGT(x, jt) \
+ JGT64(((union arg64){.u64 = (x)}).lo32, \
+ ((union arg64){.u64 = (x)}).hi32, \
+ EXPAND(jt))
+#define JGE(x, jt) \
+ JGE64(((union arg64){.u64 = (x)}).lo32, \
+ ((union arg64){.u64 = (x)}).hi32, \
+ EXPAND(jt))
+#define JNE(x, jt) \
+ JNE64(((union arg64){.u64 = (x)}).lo32, \
+ ((union arg64){.u64 = (x)}).hi32, \
+ EXPAND(jt))
+#define JLT(x, jt) \
+ JLT64(((union arg64){.u64 = (x)}).lo32, \
+ ((union arg64){.u64 = (x)}).hi32, \
+ EXPAND(jt))
+#define JLE(x, jt) \
+ JLE64(((union arg64){.u64 = (x)}).lo32, \
+ ((union arg64){.u64 = (x)}).hi32, \
+ EXPAND(jt))
+
+#define JA(x, jt) \
+ JA64(((union arg64){.u64 = (x)}).lo32, \
+ ((union arg64){.u64 = (x)}).hi32, \
+ EXPAND(jt))
+#define ARG(i) ARG_64(i)
+
+#else
+#error __BITS_PER_LONG value unusable.
+#endif
+
+/* Loads the arg into A */
+#define ARG_32(idx) \
+ BPF_STMT(BPF_LD+BPF_W+BPF_ABS, \
+ offsetof(struct seccomp_data, lo32[(idx)]))
+
+/* Loads hi into A and lo in X */
+#define ARG_64(idx) \
+ BPF_STMT(BPF_LD+BPF_W+BPF_ABS, \
+ offsetof(struct seccomp_data, lo32[(idx)])), \
+ BPF_STMT(BPF_ST, 0), /* lo -> M[0] */ \
+ BPF_STMT(BPF_LD+BPF_W+BPF_ABS, \
+ offsetof(struct seccomp_data, hi32[(idx)])), \
+ BPF_STMT(BPF_ST, 1) /* hi -> M[1] */
+
+#define JEQ32(value, jt) \
+ BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, (value), 0, 1), \
+ jt
+
+#define JNE32(value, jt) \
+ BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, (value), 1, 0), \
+ jt
+
+/* Checks the lo, then swaps to check the hi. A=lo,X=hi */
+#define JEQ64(lo, hi, jt) \
+ BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, (hi), 0, 5), \
+ BPF_STMT(BPF_LD+BPF_MEM, 0), /* swap in lo */ \
+ BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, (lo), 0, 2), \
+ BPF_STMT(BPF_LD+BPF_MEM, 1), /* passed: swap hi back in */ \
+ jt, \
+ BPF_STMT(BPF_LD+BPF_MEM, 1) /* failed: swap hi back in */
+
+#define JNE64(lo, hi, jt) \
+ BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, (hi), 5, 0), \
+ BPF_STMT(BPF_LD+BPF_MEM, 0), /* swap in lo */ \
+ BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, (lo), 2, 0), \
+ BPF_STMT(BPF_LD+BPF_MEM, 1), /* passed: swap hi back in */ \
+ jt, \
+ BPF_STMT(BPF_LD+BPF_MEM, 1) /* failed: swap hi back in */
+
+#define JA32(value, jt) \
+ BPF_JUMP(BPF_JMP+BPF_JSET+BPF_K, (value), 0, 1), \
+ jt
+
+#define JA64(lo, hi, jt) \
+ BPF_JUMP(BPF_JMP+BPF_JSET+BPF_K, (hi), 3, 0), \
+ BPF_STMT(BPF_LD+BPF_MEM, 0), /* swap in lo */ \
+ BPF_JUMP(BPF_JMP+BPF_JSET+BPF_K, (lo), 0, 2), \
+ BPF_STMT(BPF_LD+BPF_MEM, 1), /* passed: swap hi back in */ \
+ jt, \
+ BPF_STMT(BPF_LD+BPF_MEM, 1) /* failed: swap hi back in */
+
+#define JGE32(value, jt) \
+ BPF_JUMP(BPF_JMP+BPF_JGE+BPF_K, (value), 0, 1), \
+ jt
+
+#define JLT32(value, jt) \
+ BPF_JUMP(BPF_JMP+BPF_JGE+BPF_K, (value), 1, 0), \
+ jt
+
+/* Shortcut checking if hi > arg.hi. */
+#define JGE64(lo, hi, jt) \
+ BPF_JUMP(BPF_JMP+BPF_JGT+BPF_K, (hi), 4, 0), \
+ BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, (hi), 0, 5), \
+ BPF_STMT(BPF_LD+BPF_MEM, 0), /* swap in lo */ \
+ BPF_JUMP(BPF_JMP+BPF_JGE+BPF_K, (lo), 0, 2), \
+ BPF_STMT(BPF_LD+BPF_MEM, 1), /* passed: swap hi back in */ \
+ jt, \
+ BPF_STMT(BPF_LD+BPF_MEM, 1) /* failed: swap hi back in */
+
+#define JLT64(lo, hi, jt) \
+ BPF_JUMP(BPF_JMP+BPF_JGE+BPF_K, (hi), 0, 4), \
+ BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, (hi), 0, 5), \
+ BPF_STMT(BPF_LD+BPF_MEM, 0), /* swap in lo */ \
+ BPF_JUMP(BPF_JMP+BPF_JGT+BPF_K, (lo), 2, 0), \
+ BPF_STMT(BPF_LD+BPF_MEM, 1), /* passed: swap hi back in */ \
+ jt, \
+ BPF_STMT(BPF_LD+BPF_MEM, 1) /* failed: swap hi back in */
+
+#define JGT32(value, jt) \
+ BPF_JUMP(BPF_JMP+BPF_JGT+BPF_K, (value), 0, 1), \
+ jt
+
+#define JLE32(value, jt) \
+ BPF_JUMP(BPF_JMP+BPF_JGT+BPF_K, (value), 1, 0), \
+ jt
+
+/* Check hi > args.hi first, then do the GE checking */
+#define JGT64(lo, hi, jt) \
+ BPF_JUMP(BPF_JMP+BPF_JGT+BPF_K, (hi), 4, 0), \
+ BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, (hi), 0, 5), \
+ BPF_STMT(BPF_LD+BPF_MEM, 0), /* swap in lo */ \
+ BPF_JUMP(BPF_JMP+BPF_JGT+BPF_K, (lo), 0, 2), \
+ BPF_STMT(BPF_LD+BPF_MEM, 1), /* passed: swap hi back in */ \
+ jt, \
+ BPF_STMT(BPF_LD+BPF_MEM, 1) /* failed: swap hi back in */
+
+#define JLE64(lo, hi, jt) \
+ BPF_JUMP(BPF_JMP+BPF_JGT+BPF_K, (hi), 6, 0), \
+ BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, (hi), 0, 3), \
+ BPF_STMT(BPF_LD+BPF_MEM, 0), /* swap in lo */ \
+ BPF_JUMP(BPF_JMP+BPF_JGT+BPF_K, (lo), 2, 0), \
+ BPF_STMT(BPF_LD+BPF_MEM, 1), /* passed: swap hi back in */ \
+ jt, \
+ BPF_STMT(BPF_LD+BPF_MEM, 1) /* failed: swap hi back in */
+
+#define LOAD_SYSCALL_NR \
+ BPF_STMT(BPF_LD+BPF_W+BPF_ABS, \
+ offsetof(struct seccomp_data, nr))
+
+#endif /* __BPF_HELPER_H__ */
diff --git a/samples/seccomp/dropper.c b/samples/seccomp/dropper.c
new file mode 100644
index 0000000..535db8a
--- /dev/null
+++ b/samples/seccomp/dropper.c
@@ -0,0 +1,52 @@
+/*
+ * Naive system call dropper built on seccomp_filter.
+ *
+ * Copyright (c) 2012 The Chromium OS Authors <[email protected]>
+ * Author: Will Drewry <[email protected]>
+ *
+ * The code may be used by anyone for any purpose,
+ * and can serve as a starting point for developing
+ * applications using prctl(PR_SET_SECCOMP, 2, ...).
+ *
+ * When run, returns the specified errno for the specified
+ * system call number.
+ *
+ * Run this one as root as PR_SET_NO_NEW_PRIVS is not called.
+ */
+
+#include <errno.h>
+#include <linux/filter.h>
+#include <linux/seccomp.h>
+#include <linux/unistd.h>
+#include <stdio.h>
+#include <stddef.h>
+#include <stdlib.h>
+#include <sys/prctl.h>
+#include <unistd.h>
+
+int main(int argc, char **argv)
+{
+ if (argc < 4) {
+ fprintf(stderr, "Usage:\n"
+ "dropper <syscall_nr> <errno> <prog> [<args>]\n\n");
+ return 1;
+ }
+ struct sock_filter filter[] = {
+ BPF_STMT(BPF_LD+BPF_W+BPF_ABS,
+ (offsetof(struct seccomp_data, nr))),
+ BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, atoi(argv[1]), 0, 1),
+ BPF_STMT(BPF_RET+BPF_K,
+ SECCOMP_RET_ERRNO|(atoi(argv[2]) & SECCOMP_RET_DATA)),
+ BPF_STMT(BPF_RET+BPF_K, SECCOMP_RET_ALLOW),
+ };
+ struct sock_fprog prog = {
+ .len = (unsigned short)(sizeof(filter)/sizeof(filter[0])),
+ .filter = filter,
+ };
+ if (prctl(PR_SET_SECCOMP, 2, &prog)) {
+ perror("prctl");
+ return 1;
+ }
+ execv(argv[3], &argv[3]);
+ return 1;
+}
--
1.7.5.4
Adds a new return value to seccomp filters that triggers a SIGILL to be
delivered with the new ILL_SECCOMP si_code.
This allows in-process system call emulation, including just specifying
an errno or cleanly dumping core, rather than just dying. It also
avoids interfering with normal debugger operation (injecting SIGTRAPs).
v9: - changes to SIGILL ([email protected])
v8: - clean up based on changes to dependent patches
v7: - introduction
Signed-off-by: Will Drewry <[email protected]>
---
arch/Kconfig | 8 ++++----
include/asm-generic/siginfo.h | 3 ++-
include/linux/seccomp.h | 1 +
kernel/seccomp.c | 20 ++++++++++++++++++++
4 files changed, 27 insertions(+), 5 deletions(-)
diff --git a/arch/Kconfig b/arch/Kconfig
index 3f3052b..a01c151 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -203,10 +203,10 @@ config HAVE_ARCH_SECCOMP_FILTER
bool
help
This symbol should be selected by an architecure if it provides
- asm/syscall.h, specifically syscall_get_arguments() and
- syscall_set_return_value(). Additionally, its system call
- entry path must respect a return value of -1 from
- __secure_computing_int() and/or secure_computing().
+ asm/syscall.h, specifically syscall_get_arguments(),
+ syscall_set_return_value(), and syscall_rollback().
+ Additionally, its system call entry path must respect a return
+ value of -1 from __secure_computing_int() and/or secure_computing().
config SECCOMP_FILTER
def_bool y
diff --git a/include/asm-generic/siginfo.h b/include/asm-generic/siginfo.h
index 0dd4e87..e565662 100644
--- a/include/asm-generic/siginfo.h
+++ b/include/asm-generic/siginfo.h
@@ -166,7 +166,8 @@ typedef struct siginfo {
#define ILL_PRVREG (__SI_FAULT|6) /* privileged register */
#define ILL_COPROC (__SI_FAULT|7) /* coprocessor error */
#define ILL_BADSTK (__SI_FAULT|8) /* internal stack error */
-#define NSIGILL 8
+#define ILL_SECCOMP (__SI_FAULT|9) /* illegal syscall via seccomp */
+#define NSIGILL 9
/*
* SIGFPE si_codes
diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h
index 879ece2..1be562f 100644
--- a/include/linux/seccomp.h
+++ b/include/linux/seccomp.h
@@ -19,6 +19,7 @@
* selects the least permissive choice.
*/
#define SECCOMP_RET_KILL 0x00000000U /* kill the task immediately */
+#define SECCOMP_RET_TRAP 0x00020000U /* disallow and send sigtrap */
#define SECCOMP_RET_ERRNO 0x00030000U /* returns an errno */
#define SECCOMP_RET_ALLOW 0x7fff0000U /* allow */
diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index 55d000d..a7b6510 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -290,6 +290,21 @@ void copy_seccomp(struct seccomp *child,
child->mode = prev->mode;
child->filter = get_seccomp_filter(prev->filter);
}
+
+/**
+ * seccomp_send_sigill - signals the task to allow in-process syscall emulation
+ *
+ * Forces a SIGILL with si_code of ILL_SECCOMP.
+ */
+static void seccomp_send_sigill(void)
+{
+ struct siginfo info;
+ memset(&info, 0, sizeof(info));
+ info.si_signo = SIGILL;
+ info.si_code = ILL_SECCOMP;
+ info.si_addr = (void __user *)KSTK_EIP(current);
+ force_sig_info(SIGILL, &info, current);
+}
#endif /* CONFIG_SECCOMP_FILTER */
/*
@@ -343,6 +358,11 @@ int __secure_computing_int(int this_syscall)
-(action & SECCOMP_RET_DATA),
0);
return -1;
+ case SECCOMP_RET_TRAP:
+ /* Show the handler the original registers. */
+ syscall_rollback(current, task_pt_regs(current));
+ seccomp_send_sigill();
+ return -1;
case SECCOMP_RET_ALLOW:
return 0;
case SECCOMP_RET_KILL:
--
1.7.5.4
On 02/16/2012 12:25 PM, Will Drewry wrote:
>
> I agree :) BPF being a 32-bit creature introduced some edge cases. I
> has started with a
> union { u32 args32[6]; u64 args64[6]; }
>
> This was somewhat derailed by CONFIG_COMPAT behavior where
> syscall_get_arguments always writes to argument of register width --
> not bad, just irritating (since a copy isn't strictly necessary nor
> actually done in the patch). Also, Indan pointed out that while BPF
> programs expect constants in the machine-local endian layout, any
> consumers would need to change how they accessed the arguments across
> big/little endian machines since a load of the low-order bits would
> vary.
>
> In a second pass, I attempted to resolve this like aio_abi.h:
> union {
> struct {
> u32 ENDIAN_SWAP(lo32, hi32);
> };
> u64 arg64;
> } args[6];
> It wasn't clear that this actually made matters better (though it did
> mean syscall_get_arguments() could write directly to arg64). Usings
> offsetof() in the user program would be fine, but any offsets set
> another way would be invalid. At that point, I moved to Indan's
> proposal to stabilize low order and high order offsets -- what is in
> the patch series. Now a BPF program can reliably index into the low
> bits of an argument and into the high bits without endianness changing
> the filter program structure.
>
> I don't feel strongly about any given data layout, and this one seems
> to balance the 32-bit-ness of BPF and the impact that has on
> endianness. I'm happy to hear alternatives that might be more
> aesthetically pleasing :)
>
I would have to say I think native endian is probably the sane thing
still, out of several bad alternatives. Certainly splitting the high
and low halves of arguments is insane.
The other thing that you really need in addition to system call number
is ABI identifier, since a syscall number may mean different things for
different entry points. For example, on x86-64 system call number 4 is
write() if called via int $0x80 but stat() if called via syscall64.
This is a local property of the system call, not a global per process.
-hpa
On 02/16/2012 12:28 PM, Markus Gutschke wrote:
> On Thu, Feb 16, 2012 at 12:02, Will Drewry <[email protected]> wrote:
>> Adds a new return value to seccomp filters that triggers a SIGTRAP to be delivered with the new TRAP_SECCOMP si_code.
>>
>> This allows in-process system call emulation -- including just specifying an errno or cleanly dumping core -- rather than just dying.
>
> SIGTRAP might not be the ideal choice of signal number, as it can make
> it very difficult to debug the program in gdb. Other than that, I love
> this feature. It'll significantly simplify the code that we have in
> Chrome.
>
Sounds like SIGSYS to me.
-hpa
--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel. I don't speak on their behalf.
On Thu, Feb 16, 2012 at 13:17, H. Peter Anvin <[email protected]> wrote:
> The other thing that you really need in addition to system call number is
> ABI identifier, since a syscall number may mean different things for
> different entry points. For example, on x86-64 system call number 4 is
> write() if called via int $0x80 but stat() if called via syscall64. This is
> a local property of the system call, not a global per process.
I think the documentation said that as soon as prctl() is used to set
a bpf filter for system calls, it automatically disallows system calls
using an entry point other than the one used by this particular
prctl().
I was trying to come up with scenarios where this particular approach
causes problems, but I can't think of any off the top of my head. So,
it might actually turn out to be a very elegant way to reduce the
attack surface of the kernel. If we are really worried about userspace
compatibility, we could make the kernel send a signal instead of
terminating the program, if the wrong entry point was used; not sure
if that is needed, though.
Markus
On 02/16/2012 12:42 PM, Will Drewry wrote:
> On Thu, Feb 16, 2012 at 2:24 PM, Markus Gutschke <[email protected]> wrote:
>> SIGTRAP might not be the ideal choice of signal number, as it can make it
>> very difficult to debug the program in gdb.
>
> True enough. In theory, we could use the lower 16-bits of the return
> value to let the bpf program set a signal, but not all signals are
> masked synchronous and those that are probably get gdb's attention,
> just not a severely :) (ILL, SEGV, BUS, TRAP, FPE). Perhaps SIGILL is
> a logically appropriate option -- or letting the api user decide from
> the SYNCHRONOUS_MASK set. I'm open to whatever makes sense, though.
> (I wasn't even sure if it was kosher to add a new TRAP_SECCOMP value.)
>
There is a standard signal for this -- SIGSYS -- which happens to be
currently unused in Linux.
-hpa
--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel. I don't speak on their behalf.
On Thu, Feb 16, 2012 at 3:17 PM, H. Peter Anvin <[email protected]> wrote:
> On 02/16/2012 12:25 PM, Will Drewry wrote:
>>
>>
>> I agree :) ?BPF being a 32-bit creature introduced some edge cases. ?I
>> has started with a
>>     union { u32 args32[6]; u64 args64[6]; }
>>
>> This was somewhat derailed by CONFIG_COMPAT behavior where
>> syscall_get_arguments always writes to argument of register width --
>> not bad, just irritating (since a copy isn't strictly necessary nor
>> actually done in the patch). ?Also, Indan pointed out that while BPF
>> programs expect constants in the machine-local endian layout, any
>> consumers would need to change how they accessed the arguments across
>> big/little endian machines since a load of the low-order bits would
>> vary.
>>
>> In a second pass, I attempted to resolve this like aio_abi.h:
>>    union {
>>      struct {
>>         u32 ENDIAN_SWAP(lo32, hi32);
>>       };
>>       u64 arg64;
>>     } args[6];
>> It wasn't clear that this actually made matters better (though it did
>> mean syscall_get_arguments() could write directly to arg64). ?Usings
>>
>> offsetof() in the user program would be fine, but any offsets set
>> another way would be invalid. ?At that point, I moved to Indan's
>> proposal to stabilize low order and high order offsets -- what is in
>> the patch series. ?Now a BPF program can reliably index into the low
>> bits of an argument and into the high bits without endianness changing
>> the filter program structure.
>>
>> I don't feel strongly about any given data layout, and this one seems
>> to balance the 32-bit-ness of BPF and the impact that has on
>> endianness. ?I'm happy to hear alternatives that might be more
>> aesthetically pleasing :)
>>
>
> I would have to say I think native endian is probably the sane thing still,
> out of several bad alternatives. ?Certainly splitting the high and low
> halves of arguments is insane.
I'll push the bits around and see how well it plays out in sample/test
code. Right now, the patch never even populates the data itself - it
just returns four bytes at the requested offset on-demand, so
kernel-side it's pretty simple to do it whatever way seems the least
hideous for the ABI.
> The other thing that you really need in addition to system call number is
> ABI identifier, since a syscall number may mean different things for
> different entry points. ?For example, on x86-64 system call number 4 is
> write() if called via int $0x80 but stat() if called via syscall64. This is
> a local property of the system call, not a global per process.
Looks like Markus just replied to this part. I can certainly populate
a compat bit if the current approach is overconstrained, but I much
prefer to avoid making every user of seccomp need to know about the
subtleties of the calling conventions.
thanks!
will
On Thu, Feb 16, 2012 at 3:28 PM, H. Peter Anvin <[email protected]> wrote:
> On 02/16/2012 12:42 PM, Will Drewry wrote:
>> On Thu, Feb 16, 2012 at 2:24 PM, Markus Gutschke <[email protected]> wrote:
>>> SIGTRAP might not be the ideal choice of signal number, as it can make it
>>> very difficult to debug the program in gdb.
>>
>> True enough. ?In theory, we could use the lower 16-bits of the return
>> value to let the bpf program set a signal, but not all signals are
>> masked synchronous and those that are probably get gdb's attention,
>> just not a severely :) (ILL, SEGV, BUS, TRAP, FPE). Perhaps SIGILL is
>> a logically appropriate option -- or letting the api user decide from
>> the SYNCHRONOUS_MASK set. ?I'm open to whatever makes sense, though.
>> (I wasn't even sure if it was kosher to add a new TRAP_SECCOMP value.)
>>
>
> There is a standard signal for this -- SIGSYS -- which happens to be
> currently unused in Linux.
Awesome. I'll respin using that.
On 02/16/2012 01:28 PM, Markus Gutschke wrote:
>
> I think, the documentation said that as soon as prctl() is used to set
> a bpf filter for system calls, it automatically disallows system calls
> using an entry point other than the one used by this particular
> prctl().
>
> I was trying to come up with scenarios where this particular approach
> causes problem, but I can't think of any off the top of my head. So,
> it might actually turn out to be a very elegant way to reduce the
> attack surface of the kernel. If we are really worried about userspace
> compatibility, we could make the kernel send a signal instead of
> terminating the program, if the wrong entry point was used; not sure
> if that is needed, though.
>
Let's see... we're building an entire pattern-matching engine and then
randomly disallowing its use because we didn't build in the right bits?
Sorry, that's asinine.
Put the bloody bit in there and let the pattern program make that decision.
-hpa
--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel. I don't speak on their behalf.
On Thu, Feb 16, 2012 at 3:34 PM, H. Peter Anvin <[email protected]> wrote:
> On 02/16/2012 01:28 PM, Markus Gutschke wrote:
>>
>> I think, the documentation said that as soon as prctl() is used to set
>> a bpf filter for system calls, it automatically disallows system calls
>> using an entry point other than the one used by this particular
>> prctl().
>>
>> I was trying to come up with scenarios where this particular approach
>> causes problem, but I can't think of any off the top of my head. So,
>> it might actually turn out to be a very elegant way to reduce the
>> attack surface of the kernel. If we are really worried about userspace
>> compatibility, we could make the kernel send a signal instead of
>> terminating the program, if the wrong entry point was used; not sure
>> if that is needed, though.
>>
>
> Let's see... we're building an entire pattern-matching engine and then
> randomly disallowing its use because we didn't build in the right bits?
>
> Sorry, that's asinine.
>
> Put the bloody bit in there and let the pattern program make that decision.
Easy enough to add a bit for the mode: 32-bit or 64-bit. It seemed
like a waste of cycles for every 32-bit program or every 64-bit
program to check to see that its calling convention hadn't changed,
but it does take away a valid decision the pattern program should be
making.
I'll add a flag for 32bit/64bit while cleaning up seccomp_data. I
think that will properly encapsulate the is_compat_task() behavior in
a way that is stable for compat and non-compat tasks to use. If
there's a more obvious way, I'm all ears.
thanks!
will
On 02/16/2012 01:51 PM, Will Drewry wrote:
>>
>> Put the bloody bit in there and let the pattern program make that decision.
>
> Easy enough to add a bit for the mode: 32-bit or 64-bit. It seemed
> like a waste of cycles for every 32-bit program or every 64-bit
> program to check to see that its calling convention hadn't changed,
> but it does take away a valid decision the pattern program should be
> making.
>
> I'll add a flag for 32bit/64bit while cleaning up seccomp_data. I
> think that will properly encapsulate the is_compat_task() behavior in
> a way that is stable for compat and non-compat tasks to use. If
> there's a more obvious way, I'm all ears.
>
is_compat_task() is not going to be the right thing for x86 going
forward, as we're introducing the x32 ABI (which uses the normal x86-64
entry point, but with different eax numbers, and bit 30 set.)
The actual state is the TS_COMPAT flag in the thread_info structure,
which currently matches is_compat_task(), but perhaps we should add a
new helper function syscall_namespace() or something like that...
Either that or we can just use another bit in the syscall number field...
-hpa
--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel. I don't speak on their behalf.
On Thu, Feb 16, 2012 at 4:06 PM, H. Peter Anvin <[email protected]> wrote:
> On 02/16/2012 01:51 PM, Will Drewry wrote:
>>>
>>> Put the bloody bit in there and let the pattern program make that decision.
>>
>> Easy enough to add a bit for the mode: 32-bit or 64-bit. ?It seemed
>> like a waste of cycles for every 32-bit program or every 64-bit
>> program to check to see that its calling convention hadn't changed,
>> but it does take away a valid decision the pattern program should be
>> making.
>>
>> I'll add a flag for 32bit/64bit while cleaning up seccomp_data. I
>> think that will properly encapsulate the is_compat_task() behavior in
>> a way that is stable for compat and non-compat tasks to use. ?If
>> there's a more obvious way, I'm all ears.
>>
>
> is_compat_task() is not going to be the right thing for x86 going
> forward, as we're introducing the x32 ABI (which uses the normal x86-64
> entry point, but with different eax numbers, and bit 30 set.)
>
> The actual state is the TS_COMPAT flag in the thread_info structure,
> which currently matches is_compat_task(), but perhaps we should add a
> new helper function syscall_namespace() or something like that...
Without the addition of x32, it is still the intersection of
is_compat_task()/TS_COMPAT and CONFIG_64BIT for all arches to
determine if the call is 32-bit or 64-bit, but this will add another
wrinkle. Would it make sense to assume that system call namespaces
may be ever expanding and offer up an unsigned integer value?
struct seccomp_data {
int nr;
u32 namespace;
u64 instruction_pointer;
u64 args[6];
}
Then syscall_namespace(current, regs) returns
* 0 - SYSCALL_NS_32 (for existing 32 and config_compat)
* 1 - SYSCALL_NS_64 (for existing 64 bit)
* 2 - SYSCALL_NS_X32 (everything after 2 is arch specific)
* ..
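A filter would then gate on it with a short prologue, something like the
sketch below (the 'namespace' field and SYSCALL_NS_* values are only the
proposal above, not part of the posted patches):
    BPF_STMT(BPF_LD+BPF_W+BPF_ABS, offsetof(struct seccomp_data, namespace)),
    BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, SYSCALL_NS_64, 1, 0),
    BPF_STMT(BPF_RET+BPF_K, SECCOMP_RET_KILL),
    /* ... continue with the usual syscall-number checks ... */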
This patch series is pegged to x86 right now, so it's not a big deal
to add a simple syscall_namespace to asm/syscall.h. Of course, the
code is always the easy part. Even easier would be to only assign 0
and 1 in the seccomp_data for 32-bit or 64-bit, then leave the rest of
the u32 untouched until x32 stabilizes and the TS_COMPAT interactions
are sorted.
The other option, of course, is to hide it from the users and peg to
is_compat_task and later to however x32 is exposed, but that might
just be me trying to avoid adding more dependencies to this patch
series :)
> Either that or we can just use another bit in the syscall number field...
That would simplify the case here. The seccomp_data bit would say the
call is 64-bit and then the syscall number with the extra bit would
say that it is x32 and wouldn't collide with the existing 64-bit
numbering, and the filter program author wouldn't make a filter
program that allows a call that it shouldn't.
Another option could be to expose the task_user_regset_view() of
e_machine and e_osabi for the current/active calling convention
assuming x32 gets a new marker there. (There is a small amount
pain[1] there since x86 uses TIF_IA32 and not TS_COMPAT for regset's
e_machine value and not the current calling convention. It's not
clear to me if TS_COMPAT would be set during a core
dump/fill_note_info, and not many people use ptrace's GET/SETREGSET,
but I'm not super confident unraveling that mystery myself. Perhaps,
current_user_regset_view() )
For now, I'll just drop in a u32 for the calling convention.
Thanks!
will
1 - http://lxr.linux.no/linux+v3.2.6/arch/x86/kernel/ptrace.c#L1310
On Thu, Feb 16, 2012 at 3:00 PM, Will Drewry <[email protected]> wrote:
> On Thu, Feb 16, 2012 at 4:06 PM, H. Peter Anvin <[email protected]> wrote:
>> On 02/16/2012 01:51 PM, Will Drewry wrote:
>>>>
>>>> Put the bloody bit in there and let the pattern program make that decision.
>>>
>>> Easy enough to add a bit for the mode: 32-bit or 64-bit. ?It seemed
>>> like a waste of cycles for every 32-bit program or every 64-bit
>>> program to check to see that its calling convention hadn't changed,
>>> but it does take away a valid decision the pattern program should be
>>> making.
>>>
>>> I'll add a flag for 32bit/64bit while cleaning up seccomp_data. I
>>> think that will properly encapsulate the is_compat_task() behavior in
>>> a way that is stable for compat and non-compat tasks to use. ?If
>>> there's a more obvious way, I'm all ears.
>>>
>>
>> is_compat_task() is not going to be the right thing for x86 going
>> forward, as we're introducing the x32 ABI (which uses the normal x86-64
>> entry point, but with different eax numbers, and bit 30 set.)
>>
>> The actual state is the TS_COMPAT flag in the thread_info structure,
>> which currently matches is_compat_task(), but perhaps we should add a
>> new helper function syscall_namespace() or something like that...
>
> Without the addition of x32, it is still the intersection of
> is_compat_task()/TS_COMPAT and CONFIG_64BIT for all arches to
> determine if the call is 32-bit or 64-bit, but this will add another
> wrinkle. ?Would it make sense to assume that system call namespaces
> may be ever expanding and offer up an unsigned integer value?
>
> struct seccomp_data {
>  int nr;
>  u32 namespace;
>  u64 instruction_pointer;
>  u64 args[6];
> }
>
> Then syscall_namespace(current, regs) returns
> * 0 - SYSCALL_NS_32 (for existing 32 and config_compat)
> * 1 - SYSCALL_NS_64 (for existing 64 bit)
> * 2 - SYSCALL_NS_X32 (everything after 2 is arch specific)
> * ..
>
> This patch series is pegged to x86 right now, so it's not a big deal
> to add a simple syscall_namespace to asm/syscall.h. ?Of course, the
> code is always the easy part. ?Even easier would be to only assign 0
> and 1 in the seccomp_data for 32-bit or 64-bit, then leave the rest of
> the u32 untouched until x32 stabilizes and the TS_COMPAT interactions
> are sorted.
>
> The other option, of course, is to hide it from the users and peg to
> is_compat_task and later to however x32 is exposed, but that might
> just be me trying to avoid adding more dependencies to this patch
> series :)
>
>> Either that or we can just use another bit in the syscall number field...
>
> That would simplify the case here. The seccomp_data bit would say the
> call is 64-bit and then the syscall number with the extra bit would
> say that it is x32 and wouldn't collide with the existing 64-bit
> numbering, and the filter program author wouldn't make a filter
> program that allows a call that it shouldn't.
Presumably this works for x32 (since bit 30 might as well be part of
the syscall number), but the namespace or whatever it's called would
be nice to distinguish between the three 32-bit entry points.
For 32-bit code, I can easily see two different entry points getting
used in the same program -- some library could issue int80 directly,
but other code (in the vdso, for example, if it ever starts being
useful) could hit the other entry. And if 64-bit code ever gets a new
entry point, the same problem would happen.
Of course, if the args are magically fixed up in the 32-bit case, then
maybe the multiple entries are a nonissue. (Sorry, I haven't kept
track of that part of this patch set.)
--Andy
On 02/16/2012 03:00 PM, Will Drewry wrote:
>
> Without the addition of x32, it is still the intersection of
> is_compat_task()/TS_COMPAT and CONFIG_64BIT for all arches to
> determine if the call is 32-bit or 64-bit, but this will add another
> wrinkle. Would it make sense to assume that system call namespaces
> may be ever expanding and offer up an unsigned integer value?
>
This is definitely the most general solution.
By the way, although most processes only use one set of system calls,
there are legitimate reasons for cross-mode tasks, and those probably
have a high overlap with the ones that would benefit from this kind of
filtering facility, e.g. pin.
-hpa
On Thu, February 16, 2012 22:17, H. Peter Anvin wrote:
> On 02/16/2012 12:25 PM, Will Drewry wrote:
>>
>> I agree :) BPF being a 32-bit creature introduced some edge cases. I
>> has started with a
>> union { u32 args32[6]; u64 args64[6]; }
>>
>> This was somewhat derailed by CONFIG_COMPAT behavior where
>> syscall_get_arguments always writes to argument of register width --
>> not bad, just irritating (since a copy isn't strictly necessary nor
>> actually done in the patch). Also, Indan pointed out that while BPF
>> programs expect constants in the machine-local endian layout, any
>> consumers would need to change how they accessed the arguments across
>> big/little endian machines since a load of the low-order bits would
>> vary.
>>
>> In a second pass, I attempted to resolve this like aio_abi.h:
>> union {
>> struct {
>> u32 ENDIAN_SWAP(lo32, hi32);
>> };
>> u64 arg64;
>> } args[6];
>> It wasn't clear that this actually made matters better (though it did
>> mean syscall_get_arguments() could write directly to arg64). Usings
>> offsetof() in the user program would be fine, but any offsets set
>> another way would be invalid. At that point, I moved to Indan's
>> proposal to stabilize low order and high order offsets -- what is in
>> the patch series. Now a BPF program can reliably index into the low
>> bits of an argument and into the high bits without endianness changing
>> the filter program structure.
>>
>> I don't feel strongly about any given data layout, and this one seems
>> to balance the 32-bit-ness of BPF and the impact that has on
>> endianness. I'm happy to hear alternatives that might be more
>> aesthetically pleasing :)
>>
>
> I would have to say I think native endian is probably the sane thing
> still, out of several bad alternatives. Certainly splitting the high
> and low halves of arguments is insane.
Yes it is. But it can't be avoided because BPF programs are always 32-bit.
So they have to access the high and low halves separately, one way or the
other, even on 64-bit machines. With that in mind splitting up the halves
explicitly seems the best way.
I would go for something like:
struct seccomp_data {
int nr;
__u32 arg_low[6];
__u32 arg_high[6];
__u32 instruction_pointer_low;
__u32 instruction_pointer_high;
__u32 __reserved[3];
};
(Not sure what use the IP is because that doesn't tell anything about how
the system call instruction was reached.)
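To make that concrete: with 32-bit BPF, matching a single 64-bit argument
against a 64-bit constant already takes two loads and two compares, roughly
as below (a sketch using the lo32[]/hi32[] layout from the posted series;
VAL stands for the 64-bit constant being matched):
    BPF_STMT(BPF_LD+BPF_W+BPF_ABS, offsetof(struct seccomp_data, hi32[1])),
    BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, (__u32)(VAL >> 32), 0, 3),
    BPF_STMT(BPF_LD+BPF_W+BPF_ABS, offsetof(struct seccomp_data, lo32[1])),
    BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, (__u32)VAL, 0, 1),
    BPF_STMT(BPF_RET+BPF_K, SECCOMP_RET_ALLOW),
    BPF_STMT(BPF_RET+BPF_K, SECCOMP_RET_KILL),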
The only way to avoid splitting args is to add 64-bit support to BPF.
That is probably the best way forward, but would require breaking the
BPF ABI by either adding a 64-bit version directly or adding extra
instructions.
This mismatch between 32-bit BPF programs and 64-bit machines is the main
reason why I'm not perfectly happy with BPF for syscall filtering. It gets
the job done, but it's not great.
> The other thing that you really need in addition to system call number
> is ABI identifier, since a syscall number may mean different things for
> different entry points. For example, on x86-64 system call number 4 is
> write() if called via int $0x80 but stat() if called via syscall64.
> This is a local property of the system call, not a global per process.
The problem of doing this is that you then force every filter to check for
the path and do something path specific. The filters that don't do this
check will be buggy. So the best way is really to install filters per mode
and call the right filter. If filters are installed but not for the current
path, the task should be killed.
prctl() should take one more argument which says for which mode the filter
will be installed for, with 0 for the current mode.
But pushing that info into the filters themselves is not a good idea.
Greetings,
Indan
On Thu, 2012-02-16 at 17:00 -0600, Will Drewry wrote:
> On Thu, Feb 16, 2012 at 4:06 PM, H. Peter Anvin <[email protected]> wrote:
> > On 02/16/2012 01:51 PM, Will Drewry wrote:
> Then syscall_namespace(current, regs) returns
> * 0 - SYSCALL_NS_32 (for existing 32 and config_compat)
> * 1 - SYSCALL_NS_64 (for existing 64 bit)
> * 2 - SYSCALL_NS_X32 (everything after 2 is arch specific)
> * ..
>
> This patch series is pegged to x86 right now, so it's not a big deal
> to add a simple syscall_namespace to asm/syscall.h. Of course, the
> code is always the easy part. Even easier would be to only assign 0
> and 1 in the seccomp_data for 32-bit or 64-bit, then leave the rest of
> the u32 untouched until x32 stabilizes and the TS_COMPAT interactions
> are sorted.
I don't know if anyone cares, but include/linux/audit.h tries to expose
this type of information so audit userspace can later piece things back
together. (we get this info from the syscall entry exit code so we know
which arch it is).
Not sure how x32 is hoping to expose its syscall info, but others are
going to have the same/similar problem.
-Eric
On Thu, Feb 16, 2012 at 4:48 PM, Indan Zupancic <[email protected]> wrote:
> On Thu, February 16, 2012 22:17, H. Peter Anvin wrote:
>> On 02/16/2012 12:25 PM, Will Drewry wrote:
>>>
>>> I agree :) ?BPF being a 32-bit creature introduced some edge cases. ?I
>>> has started with a
>>>      union { u32 args32[6]; u64 args64[6]; }
>>>
>>> This was somewhat derailed by CONFIG_COMPAT behavior where
>>> syscall_get_arguments always writes to argument of register width --
>>> not bad, just irritating (since a copy isn't strictly necessary nor
>>> actually done in the patch). ?Also, Indan pointed out that while BPF
>>> programs expect constants in the machine-local endian layout, any
>>> consumers would need to change how they accessed the arguments across
>>> big/little endian machines since a load of the low-order bits would
>>> vary.
>>>
>>> In a second pass, I attempted to resolve this like aio_abi.h:
>>>     union {
>>>       struct {
>>>          u32 ENDIAN_SWAP(lo32, hi32);
>>>        };
>>>        u64 arg64;
>>>      } args[6];
>>> It wasn't clear that this actually made matters better (though it did
>>> mean syscall_get_arguments() could write directly to arg64). ?Usings
>>> offsetof() in the user program would be fine, but any offsets set
>>> another way would be invalid. ?At that point, I moved to Indan's
>>> proposal to stabilize low order and high order offsets -- what is in
>>> the patch series. ?Now a BPF program can reliably index into the low
>>> bits of an argument and into the high bits without endianness changing
>>> the filter program structure.
>>>
>>> I don't feel strongly about any given data layout, and this one seems
>>> to balance the 32-bit-ness of BPF and the impact that has on
>>> endianness. ?I'm happy to hear alternatives that might be more
>>> aesthetically pleasing :)
>>>
>>
>> I would have to say I think native endian is probably the sane thing
>> still, out of several bad alternatives. ?Certainly splitting the high
>> and low halves of arguments is insane.
>
> Yes it is. But it can't be avoided because BPF programs are always 32-bit.
> So they have to access the high and low halves separately, one way or the
> other, even on 64-bit machines. With that in mind splitting up the halves
> explicitly seems the best way.
>
> I would go for something like:
>
> struct seccomp_data {
>         int nr;
>         __u32 arg_low[6];
>         __u32 arg_high[6];
>         __u32 instruction_pointer_low;
>         __u32 instruction_pointer_high;
>         __u32 __reserved[3];
> };
>
> (Not sure what use the IP is because that doesn't tell anything about how
> the system call instruction was reached.)
>
> The only way to avoid splitting args is to add 64-bit support to BPF.
> That is probably the best way forwards, but would require breaking the
> BPF ABI by either adding a 64-bit version directly or adding extra
> instructions.
>
> This mismatch between 32-bit BPF programs and 64-bit machines is the main
> reason why I'm not perfectly happy with BPF for syscall filtering. It gets
> the job done, but it's not great.
>
>> The other thing that you really need in addition to system call number
>> is ABI identifier, since a syscall number may mean different things for
>> different entry points. ?For example, on x86-64 system call number 4 is
>> write() if called via int $0x80 but stat() if called via syscall64.
>> This is a local property of the system call, not a global per process.
>
> The problem of doing this is that you then force every filter to check for
> the path and do something path specific. The filters that don't do this
> check will be buggy. So the best way is really to install filters per mode
> and call the right filter. If filters are installed but not for the current
> path, the task should be killed.
>
> prctl() should take one more argument which says for which mode the filter
> will be installed for, with 0 for the current mode.
>
> But pushing that info into the filters themselves is not a good idea.
IMO the best solution is to have the One True Seccomp Filter Compiler
(tm). It would handle multiple namespaces, cross-arch differences,
and such, and it would do it correctly. It could live in the kernel
tree.
Without something like that or an incredible amount of special care,
actual portability is probably a pipe dream.
--Andy
On 02/16/2012 04:51 PM, Andrew Lutomirski wrote:
>
> IMO the best solution is to have the One True Seccomp Filter Compiler
> (tm). It would handle multiple namespaces, cross-arch differences,
> and such, and it would do it correctly. It could live in the kernel
> tree.
>
> Without something like that or an incredible amount of special care,
> actual portability is probably a pipe dream.
>
> --Andy
>
Seconded!
-hpa
--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel. I don't speak on their behalf.
On Fri, February 17, 2012 01:51, Andrew Lutomirski wrote:
> IMO the best solution is to have the One True Seccomp Filter Compiler
> (tm). It would handle multiple namespaces, cross-arch differences,
> and such, and it would do it correctly. It could live in the kernel
> tree.
I'm not interested in any such compiler; if I use this BPF thing I'll use
it directly by scanning my syscall table info and converting it to a BPF
filter for the cases where it's possible. This code will be cross-platform,
all the platform dependent info comes from the syscall table.
It seems I'll just build a bitmask telling what to do for each syscall,
with special cases for the few syscalls that can be handled totally within
BPF by checking the arguments.
My total lines of code is 5k now; I'm not going to use a complex,
thousands-of-lines, badly tested, probably buggy compiler just for BPF support.
> Without something like that or an incredible amount of special care,
> actual portability is probably a pipe dream.
The filter programs are already platform dependent because of the syscall
numbers and sometimes argument differences. But that is no reason to make it
even less cross-platform.
Your OTSF compiler won't be able to handle different modes other than
adding a check at the start and having totally orthogonal codes for the
different cases. You can as well have separate filters then. Any other
approach dies because of the added complexity or will be a lot slower.
Greetings,
Indan
On 02/16/2012 04:48 PM, Indan Zupancic wrote:
> On Thu, February 16, 2012 22:17, H. Peter Anvin wrote:
>
> I would go for something like:
>
> struct seccomp_data {
> int nr;
> __u32 arg_low[6];
> __u32 arg_high[6];
> __u32 instruction_pointer_low;
> __u32 instruction_pointer_high;
> __u32 __reserved[3];
> };
>
Uh, that is the absolutely WORST way to do it - not only are you
creating two fields, they're not even adjacent.
> (Not sure what use the IP is because that doesn't tell anything about how
> the system call instruction was reached.)
>
> The only way to avoid splitting args is to add 64-bit support to BPF.
> That is probably the best way forwards, but would require breaking the
> BPF ABI by either adding a 64-bit version directly or adding extra
> instructions.
Or the compiler or whatever generates the BPF code is just going to have
to generate two instructions -- just like we always have to handle
[u]int64_t on 32-bit platforms. There is no difference here.
-hpa
--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel. I don't speak on their behalf.
On Thu, 2012-02-16 at 14:02 -0600, Will Drewry wrote:
> This change allows CONFIG_SECCOMP to make use of BPF programs for
> user-controlled system call filtering (as shown in this patch series).
[]
> diff --git a/net/core/filter.c b/net/core/filter.c
[]
> @@ -542,9 +602,35 @@ int sk_chk_filter(struct sock_filter *filter, unsigned int flen)
[]
> case BPF_S_LD_W_ABS:
> + MAYBE_USE_LOAD_FN(LD_W_ABS);
> case BPF_S_LD_H_ABS:
> + MAYBE_USE_LOAD_FN(LD_H_ABS);
> case BPF_S_LD_B_ABS:
> + MAYBE_USE_LOAD_FN(LD_B_ABS);
Would be nice to note fallthrough or add break if necessary.
On Fri, February 17, 2012 02:33, H. Peter Anvin wrote:
> On 02/16/2012 04:48 PM, Indan Zupancic wrote:
>> On Thu, February 16, 2012 22:17, H. Peter Anvin wrote:
>>
>> I would go for something like:
>>
>> struct seccomp_data {
>> int nr;
>> __u32 arg_low[6];
>> __u32 arg_high[6];
>> __u32 instruction_pointer_low;
>> __u32 instruction_pointer_high;
>> __u32 __reserved[3];
>> };
>>
>
> Uh, that is the absolutely WORST way to do it - not only are you
> creating two fields, they're not even adjacent.
You want:
struct seccomp_data {
int nr;
__u32 __reserved[3];
__u64 arg[6];
__u64 instruction_pointer;
};
And I agree it looks a lot nicer.
You can pretend a 64-bit arg will be one field, but it won't be. It will
always be two fields no matter what. Making them adjacent is only good
because seccomp_data won't have to change if 64-bit support is ever added
to BPF.
It looks nicer, but it only makes it harder to know the right offset for
the fields in 32-bit-only BPF programs. You can try to hide reality,
but that won't change it.
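[To make the offset point concrete: with the adjacent-__u64 layout quoted
above, a 32-bit-only classic BPF filter still has to pick a 4-byte half
explicitly, and which half is the "low" one depends on host endianness.
A rough userspace sketch, assuming that proposed layout (which is not what
the patch currently defines):]

#include <endian.h>
#include <stddef.h>
#include <linux/types.h>
#include <linux/filter.h>

/* The proposed layout from above, repeated here for the example only. */
struct seccomp_data {
	int nr;
	__u32 __reserved[3];
	__u64 arg[6];
	__u64 instruction_pointer;
};

#if __BYTE_ORDER == __LITTLE_ENDIAN
# define ARG1_LO_OFF	(offsetof(struct seccomp_data, arg[1]))
#else
# define ARG1_LO_OFF	(offsetof(struct seccomp_data, arg[1]) + 4)
#endif

/* A = low 32 bits of syscall argument 1 */
struct sock_filter load_arg1_lo =
	BPF_STMT(BPF_LD + BPF_W + BPF_ABS, ARG1_LO_OFF);

[With explicit lo32[]/hi32[] fields, the same load is a single fixed
offsetof() on every architecture, which is the point of the compromise
layout in the patch.]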
>> (Not sure what use the IP is because that doesn't tell anything about how
>> the system call instruction was reached.)
>>
>> The only way to avoid splitting args is to add 64-bit support to BPF.
>> That is probably the best way forwards, but would require breaking the
>> BPF ABI by either adding a 64-bit version directly or adding extra
>> instructions.
>
> Or the compiler or whatever generates the BPF code just is going to have
> to generate two instructions -- just like we always have to handle
> [u]int64_t on 32-bit platforms. There is no difference here.
Except that if you don't hide the platform differences, your compiler
or whatever needs to generate different instructions depending on the
endianness, while it could always generate the same instructions instead.
My impression is that you want to push all extra complexity into the
compiler or whatever instead of making the ABI cross-platform, because
it looks nicer. I don't care that much, but I think you're just pushing
the ugliness around instead of getting rid of it.
Greetings,
Indan
On Thu, Feb 16, 2012 at 6:00 PM, Indan Zupancic <[email protected]> wrote:
> On Fri, February 17, 2012 02:33, H. Peter Anvin wrote:
>> On 02/16/2012 04:48 PM, Indan Zupancic wrote:
>>> On Thu, February 16, 2012 22:17, H. Peter Anvin wrote:
>>>
>>> I would go for something like:
>>>
>>> struct seccomp_data {
>>>     int nr;
>>>     __u32 arg_low[6];
>>>     __u32 arg_high[6];
>>>     __u32 instruction_pointer_low;
>>>     __u32 instruction_pointer_high;
>>>     __u32 __reserved[3];
>>> };
>>>
>>
>> Uh, that is the absolutely WORST way to do it - not only are you
>> creating two fields, they're not even adjacent.
>
> You want:
>
> struct seccomp_data {
>        int nr;
>        __u32 __reserved[3];
>        __u64 arg[6];
>        __u64 instruction_pointer;
> };
>
> And I agree it looks a lot nicer.
>
> You can pretend a 64-bit arg will be one field, but it won't be. It will
> be always two fields no matter what. Making them adjacent is only good
> because seccomp_data won't have to change if 64-bit support is ever added
> to BPF.
>
> It looks nicer, but it only makes it harder to know the right offset for
> the fields for the 32-bit only BPF programs. You can try to hide reality,
> but that won't change it.
>
>>> (Not sure what use the IP is because that doesn't tell anything about how
>>> the system call instruction was reached.)
>>>
>>> The only way to avoid splitting args is to add 64-bit support to BPF.
>>> That is probably the best way forwards, but would require breaking the
>>> BPF ABI by either adding a 64-bit version directly or adding extra
>>> instructions.
>>
>> Or the compiler or whatever generates the BPF code just is going to have
>> to generate two instructions -- just like we always have to handle
>> [u]int64_t on 32-bit platforms.  There is no difference here.
>
> Except that if you don't hide the platform differences your compiler
> or whatever needs to generate different instructions depending on the
> endianness, while it could always generate the same instructions instead.
>
> My impression is that you want to push all extra complexity into the
> compiler or whatever instead of making the ABI cross-platform, because
> it looks nicer. I don't care that much, but I think you're just pushing
> the ugliness around instead of getting rid of it.
Is there really no syscall that cares about endianness?
Even if it ends up working, forcing syscall arguments to have a
particular endianness seems like a bad decision, especially if anyone
ever wants to make a 64-bit BPF implementation. (Or if any
architecture adds 128-bit syscall arguments to a future syscall
namespace or whatever it's called. x86-64 has 128-bit xmm
registers...)
--Andy
On Thu, Feb 16, 2012 at 7:54 PM, Joe Perches <[email protected]> wrote:
> On Thu, 2012-02-16 at 14:02 -0600, Will Drewry wrote:
>> This change allows CONFIG_SECCOMP to make use of BPF programs for
>> user-controlled system call filtering (as shown in this patch series).
> []
>> diff --git a/net/core/filter.c b/net/core/filter.c
> []
>> @@ -542,9 +602,35 @@ int sk_chk_filter(struct sock_filter *filter, unsigned int flen)
> []
>>               case BPF_S_LD_W_ABS:
>> +                     MAYBE_USE_LOAD_FN(LD_W_ABS);
>>               case BPF_S_LD_H_ABS:
>> +                     MAYBE_USE_LOAD_FN(LD_H_ABS);
>>               case BPF_S_LD_B_ABS:
>> +                     MAYBE_USE_LOAD_FN(LD_B_ABS);
>
> Would be nice to note fallthrough or add break if necessary.
Cool - I'll note it. They are meant to fall through, but I can just
go back to using goto handle_skb_load or some such so it is more
readable.
Thanks!
On 02/16/2012 06:16 PM, Andrew Lutomirski wrote:
>
> Is there really no syscall that cares about endianness?
>
> Even if it ends up working, forcing syscall arguments to have a
> particular endianness seems like a bad decision, especially if anyone
> ever wants to make a 64-bit BPF implementation. (Or if any
> architecture adds 128-bit syscall arguments to a future syscall
> namespace or whatever it's called. x86-64 has 128-bit xmm
> registers...)
>
Not to mention that the reshuffling code will add totally unnecessary
cost to the normal operation. Either way, Indan has it backwards ... it
*is* one field, the fact that two operations is needed to access it is a
function of the underlying byte code, and even if the byte code can't
support it, a JIT could merge adjacent operations if 64-bit operations
are possible -- or we could (and arguably should) add 64-bit opcodes in
the future for efficiency.
-hpa
--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel. I don't speak on their behalf.
On 02/16/2012 04:50 PM, Eric Paris wrote:
>
> I don't know if anyone cares, but include/linux/audit.h tries to expose
> this type of information so audit userspace can later piece things back
> together. (we get this info from the syscall entry exit code so we know
> which arch it is).
>
> Not sure how x32 is hoping to expose its syscall info, but others are
> going to have the same/similar problem.
>
It would be nice if audit could (eventually?) use the same facility, too.
-hpa
--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel. I don't speak on their behalf.
On Thu, February 16, 2012 21:02, Will Drewry wrote:
> [This patch depends on [email protected]'s no_new_privs patch:
> https://lkml.org/lkml/2012/1/30/264
> ]
>
> This patch adds support for seccomp mode 2. Mode 2 introduces the
> ability for unprivileged processes to install system call filtering
> policy expressed in terms of a Berkeley Packet Filter (BPF) program.
> This program will be evaluated in the kernel for each system call
> the task makes and computes a result based on data in the format
> of struct seccomp_data.
>
> A filter program may be installed by calling:
> struct sock_fprog fprog = { ... };
> ...
> prctl(PR_SET_SECCOMP, 2, &fprog);
Please add an arg to tell the filter mode.
>
> The return value of the filter program determines if the system call is
> allowed to proceed or denied. If the first filter program installed
> allows prctl(2) calls, then the above call may be made repeatedly
> by a task to further reduce its access to the kernel. All attached
> programs must be evaluated before a system call will be allowed to
> proceed.
>
> To avoid CONFIG_COMPAT related landmines, once a filter program is
> installed using specific is_compat_task() value, it is not allowed to
> make system calls using the alternate entry point.
Just allow paths with a filter and deny paths without a filter installed.
> Filter programs will be inherited across fork/clone and execve.
> However, if the task attaching the filter is unprivileged
> (!CAP_SYS_ADMIN) the no_new_privs bit will be set on the task. This
> ensures that unprivileged tasks cannot attach filters that affect
> privileged tasks (e.g., setuid binary).
>
> There are a number of benefits to this approach. A few of which are
> as follows:
> - BPF has been exposed to userland for a long time
> - BPF optimization (and JIT'ing) are well understood
> - Userland already knows its ABI: system call numbers and desired
> arguments
> - No time-of-check-time-of-use vulnerable data accesses are possible.
> - system call arguments are loaded on access only to minimize copying
> required for system call policy decisions.
>
> Mode 2 support is restricted to architectures that enable
> HAVE_ARCH_SECCOMP_FILTER. In this patch, the primary dependency is on
> syscall_get_arguments(). The full desired scope of this feature will
> add a few minor additional requirements expressed later in this series.
> Based on discussion, SECCOMP_RET_ERRNO and SECCOMP_RET_TRACE seem to be
> the desired additional functionality.
>
> No architectures are enabled in this patch.
>
> v8: - use bpf_chk_filter, bpf_run_filter. update load_fns
> - Lots of fixes courtesy of [email protected]:
> -- fix up load behavior, compat fixups, and merge alloc code,
> -- renamed pc and dropped __packed, use bool compat.
> -- Added a hidden CONFIG_SECCOMP_FILTER to synthesize non-arch
> dependencies
> v7: (massive overhaul thanks to Indan, others)
> - added CONFIG_HAVE_ARCH_SECCOMP_FILTER
> - merged into seccomp.c
> - minimal seccomp_filter.h
> - no config option (part of seccomp)
> - no new prctl
> - doesn't break seccomp on systems without asm/syscall.h
> (works but arg access always fails)
> - dropped seccomp_init_task, extra free functions, ...
> - dropped the no-asm/syscall.h code paths
> - merges with network sk_run_filter and sk_chk_filter
> v6: - fix memory leak on attach compat check failure
> - require no_new_privs || CAP_SYS_ADMIN prior to filter
> installation. ([email protected])
> - s/seccomp_struct_/seccomp_/ for macros/functions ([email protected])
> - cleaned up Kconfig ([email protected])
> - on block, note if the call was compat (so the # means something)
> v5: - uses syscall_get_arguments
> ([email protected],[email protected], [email protected])
> - uses union-based arg storage with hi/lo struct to
> handle endianness. Compromises between the two alternate
> proposals to minimize extra arg shuffling and account for
> endianness assuming userspace uses offsetof().
> ([email protected], [email protected])
> - update Kconfig description
> - add include/seccomp_filter.h and add its installation
> - (naive) on-demand syscall argument loading
> - drop seccomp_t ([email protected])
> v4: - adjusted prctl to make room for PR_[SG]ET_NO_NEW_PRIVS
> - now uses current->no_new_privs
> ([email protected],[email protected])
> - assign names to seccomp modes ([email protected])
> - fix style issues ([email protected])
> - reworded Kconfig entry ([email protected])
> v3: - macros to inline ([email protected])
> - init_task behavior fixed ([email protected])
> - drop creator entry and extra NULL check ([email protected])
> - alloc returns -EINVAL on bad sizing ([email protected])
> - adds tentative use of "always_unprivileged" as per
> [email protected] and [email protected]
> v2: - (patch 2 only)
>
> Signed-off-by: Will Drewry <[email protected]>
> ---
> arch/Kconfig | 17 +++
> include/linux/Kbuild | 1 +
> include/linux/seccomp.h | 69 ++++++++++-
> kernel/fork.c | 3 +
> kernel/seccomp.c | 327 ++++++++++++++++++++++++++++++++++++++++++++--
> kernel/sys.c | 2 +-
> 6 files changed, 399 insertions(+), 20 deletions(-)
>
> diff --git a/arch/Kconfig b/arch/Kconfig
> index 4f55c73..c6ba1db 100644
> --- a/arch/Kconfig
> +++ b/arch/Kconfig
> @@ -199,4 +199,21 @@ config HAVE_CMPXCHG_LOCAL
> config HAVE_CMPXCHG_DOUBLE
> bool
>
> +config HAVE_ARCH_SECCOMP_FILTER
> + bool
> + help
> + This symbol should be selected by an architecure if it provides
> + asm/syscall.h, specifically syscall_get_arguments().
> +
> +config SECCOMP_FILTER
> + def_bool y
> + depends on HAVE_ARCH_SECCOMP_FILTER && SECCOMP && NET
> + help
> + Enable tasks to build secure computing environments defined
> + in terms of Berkeley Packet Filter programs which implement
> + task-defined system call filtering polices.
> +
> + See Documentation/prctl/seccomp_filter.txt for more
> + information on the topic of seccomp filtering.
> +
> source "kernel/gcov/Kconfig"
> diff --git a/include/linux/Kbuild b/include/linux/Kbuild
> index c94e717..d41ba12 100644
> --- a/include/linux/Kbuild
> +++ b/include/linux/Kbuild
> @@ -330,6 +330,7 @@ header-y += scc.h
> header-y += sched.h
> header-y += screen_info.h
> header-y += sdla.h
> +header-y += seccomp.h
> header-y += securebits.h
> header-y += selinux_netlink.h
> header-y += sem.h
> diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h
> index d61f27f..2bee1f7 100644
> --- a/include/linux/seccomp.h
> +++ b/include/linux/seccomp.h
> @@ -1,14 +1,60 @@
> #ifndef _LINUX_SECCOMP_H
> #define _LINUX_SECCOMP_H
>
> +#include <linux/compiler.h>
> +#include <linux/types.h>
> +
> +
> +/* Valid values for seccomp.mode and prctl(PR_SET_SECCOMP, <mode>) */
> +#define SECCOMP_MODE_DISABLED 0 /* seccomp is not in use. */
> +#define SECCOMP_MODE_STRICT 1 /* uses hard-coded filter. */
> +#define SECCOMP_MODE_FILTER 2 /* uses user-supplied filter. */
> +
> +/*
> + * BPF programs may return a 32-bit value.
> + * The bottom 16-bits are reserved for future use.
> + * The upper 16-bits are ordered from least permissive values to most.
> + *
> + * The ordering ensures that a min_t() over composed return values always
> + * selects the least permissive choice.
> + */
> +#define SECCOMP_RET_MASK 0xffff0000U
> +#define SECCOMP_RET_KILL 0x00000000U /* kill the task immediately */
> +#define SECCOMP_RET_ALLOW 0x7fff0000U /* allow */
> +
> +/* Format of the data the BPF program executes over. */
> +struct seccomp_data {
> + int nr;
> + __u32 __reserved[3];
> + struct {
> + __u32 lo;
> + __u32 hi;
> + } instruction_pointer;
> + __u32 lo32[6];
> + __u32 hi32[6];
> +};
I wouldn't use a struct for the IP. And I'd move the args to the front.
Why not call it something with "arg" in the names?
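[One possible reading of that suggestion, purely illustrative and not a
layout anyone posted:]

struct seccomp_data {
	int nr;
	__u32 arg_lo32[6];
	__u32 arg_hi32[6];
	__u32 instruction_pointer_lo;
	__u32 instruction_pointer_hi;
	__u32 __reserved[3];
};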
>
> +#ifdef __KERNEL__
> #ifdef CONFIG_SECCOMP
>
> #include <linux/thread_info.h>
> #include <asm/seccomp.h>
>
> +struct seccomp_filter;
> +/**
> + * struct seccomp - the state of a seccomp'ed process
> + *
> + * @mode: indicates one of the valid values above for controlled
> + * system calls available to a process.
> + * @filter: The metadata and ruleset for determining what system calls
> + * are allowed for a task.
> + *
> + * @filter must only be accessed from the context of current as there
> + * is no locking.
> + */
> struct seccomp {
> int mode;
> + struct seccomp_filter *filter;
> };
>
> extern void __secure_computing(int);
> @@ -19,7 +65,7 @@ static inline void secure_computing(int this_syscall)
> }
>
> extern long prctl_get_seccomp(void);
> -extern long prctl_set_seccomp(unsigned long);
> +extern long prctl_set_seccomp(unsigned long, char __user *);
>
> static inline int seccomp_mode(struct seccomp *s)
> {
> @@ -31,15 +77,16 @@ static inline int seccomp_mode(struct seccomp *s)
> #include <linux/errno.h>
>
> struct seccomp { };
> +struct seccomp_filter { };
>
> -#define secure_computing(x) do { } while (0)
> +#define secure_computing(x) 0
>
> static inline long prctl_get_seccomp(void)
> {
> return -EINVAL;
> }
>
> -static inline long prctl_set_seccomp(unsigned long arg2)
> +static inline long prctl_set_seccomp(unsigned long arg2, char __user *arg3)
> {
> return -EINVAL;
> }
> @@ -48,7 +95,21 @@ static inline int seccomp_mode(struct seccomp *s)
> {
> return 0;
> }
> -
> #endif /* CONFIG_SECCOMP */
>
> +#ifdef CONFIG_SECCOMP_FILTER
> +extern void put_seccomp_filter(struct seccomp_filter *);
> +extern void copy_seccomp(struct seccomp *child,
> + const struct seccomp *parent);
> +#else /* CONFIG_SECCOMP_FILTER */
> +/* The macro consumes the ->filter reference. */
> +#define put_seccomp_filter(_s) do { } while (0)
> +
> +static inline void copy_seccomp(struct seccomp *child,
> + const struct seccomp *prev)
> +{
> + return;
> +}
Why a macro for one but an empty inline for the other?
> +#endif /* CONFIG_SECCOMP_FILTER */
> +#endif /* __KERNEL__ */
> #endif /* _LINUX_SECCOMP_H */
> diff --git a/kernel/fork.c b/kernel/fork.c
> index b77fd55..a5187b7 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -34,6 +34,7 @@
> #include <linux/cgroup.h>
> #include <linux/security.h>
> #include <linux/hugetlb.h>
> +#include <linux/seccomp.h>
> #include <linux/swap.h>
> #include <linux/syscalls.h>
> #include <linux/jiffies.h>
> @@ -169,6 +170,7 @@ void free_task(struct task_struct *tsk)
> free_thread_info(tsk->stack);
> rt_mutex_debug_task_free(tsk);
> ftrace_graph_exit_task(tsk);
> + put_seccomp_filter(tsk->seccomp.filter);
> free_task_struct(tsk);
> }
> EXPORT_SYMBOL(free_task);
> @@ -1113,6 +1115,7 @@ static struct task_struct *copy_process(unsigned long clone_flags,
> goto fork_out;
>
> ftrace_graph_init_task(p);
> + copy_seccomp(&p->seccomp, &current->seccomp);
>
> rt_mutex_init_task(p);
>
> diff --git a/kernel/seccomp.c b/kernel/seccomp.c
> index e8d76c5..14d1869 100644
> --- a/kernel/seccomp.c
> +++ b/kernel/seccomp.c
> @@ -3,16 +3,297 @@
> *
> * Copyright 2004-2005 Andrea Arcangeli <[email protected]>
> *
> - * This defines a simple but solid secure-computing mode.
> + * Copyright (C) 2012 Google, Inc.
> + * Will Drewry <[email protected]>
> + *
> + * This defines a simple but solid secure-computing facility.
> + *
> + * Mode 1 uses a fixed list of allowed system calls.
> + * Mode 2 allows user-defined system call filters in the form
> + * of Berkeley Packet Filters/Linux Socket Filters.
> */
>
> #include <linux/audit.h>
> +#include <linux/filter.h>
> #include <linux/seccomp.h>
> #include <linux/sched.h>
> #include <linux/compat.h>
>
> +#include <linux/atomic.h>
> +#include <linux/security.h>
> +
> +#include <linux/slab.h>
> +#include <linux/uaccess.h>
> +#include <linux/user.h>
Are those still needed since you got rid of that manual user-copying stuff?
> +
> +#include <linux/tracehook.h>
> +#include <asm/syscall.h>
> +
> /* #define SECCOMP_DEBUG 1 */
> -#define NR_SECCOMP_MODES 1
> +
> +#ifdef CONFIG_SECCOMP_FILTER
> +/**
> + * struct seccomp_filter - container for seccomp BPF programs
> + *
> + * @usage: reference count to manage the object liftime.
> + * get/put helpers should be used when accessing an instance
> + * outside of a lifetime-guarded section. In general, this
> + * is only needed for handling filters shared across tasks.
> + * @prev: points to a previously installed, or inherited, filter
> + * @compat: indicates the value of is_compat_task() at creation time
> + * @insns: the BPF program instructions to evaluate
> + * @count: the number of instructions in the program
> + *
> + * seccomp_filter objects are organized in a tree linked via the @prev
> + * pointer. For any task, it appears to be a singly-linked list starting
> + * with current->seccomp.filter, the most recently attached or inherited filter.
> + * However, multiple filters may share a @prev node, by way of fork(), which
> + * results in a unidirectional tree existing in memory. This is similar to
> + * how namespaces work.
> + *
> + * seccomp_filter objects should never be modified after being attached
> + * to a task_struct (other than @usage).
> + */
> +struct seccomp_filter {
> + atomic_t usage;
> + struct seccomp_filter *prev;
> + bool compat;
> + unsigned short count; /* Instruction count */
> + struct sock_filter insns[];
> +};
> +
> +static void seccomp_filter_log_failure(int syscall)
> +{
> + int compat = 0;
> +#ifdef CONFIG_COMPAT
> + compat = is_compat_task();
> +#endif
> + pr_info("%s[%d]: %ssystem call %d blocked at 0x%lx\n",
> + current->comm, task_pid_nr(current),
> + (compat ? "compat " : ""),
> + syscall, KSTK_EIP(current));
> +}
> +
> +static inline u32 get_high_bits(unsigned long value)
> +{
> + int bits = 32;
> + return value >> bits;
> +}
> +
> +static inline u32 bpf_length(const void *data)
> +{
> + return sizeof(struct seccomp_data);
> +}
This doesn't change, so why not pass in the length directly instead of
getting it via a function? And stop adding inline to functions that are
used for function pointers, it's misleading.
> +
> +/**
> + * bpf_pointer: checks and returns a pointer to the requested offset
> + * @nr: int syscall passed as a void * to bpf_run_filter
> + * @off: index to load a from in @data
?
> + * @size: load width requested
> + * @buffer: temporary storage supplied by bpf_run_filter
> + *
> + * Returns a pointer to @buffer where the value was stored.
> + * On failure, returns NULL.
> + */
> +static void *bpf_pointer(const void *nr, int off, unsigned int size, void *buf)
> +{
> + unsigned long value;
> + u32 *A = (u32 *)buf;
No need to cast a void pointer. That's the whole point of void pointers.
> +
> + if (size != sizeof(u32))
> + return NULL;
> +
> +#define BPF_DATA(_name) offsetof(struct seccomp_data, _name)
I'd move this outside of the function and don't bother with the undef.
Undeffing is important in header files. But here, if it's needed, it's
just plain confusing.
> + /* Index by entry instead of by byte. */
> + if (off == BPF_DATA(nr)) {
> + *A = (u32)(uintptr_t)nr;
Why the double cast? Once should be enough. Or is it a special Sparse thing?
> + } else if (off == BPF_DATA(instruction_pointer.lo)) {
> + *A = KSTK_EIP(current);
> + } else if (off == BPF_DATA(instruction_pointer.hi)) {
> + *A = get_high_bits(KSTK_EIP(current));
> + } else if (off >= BPF_DATA(lo32[0]) && off <= BPF_DATA(lo32[5])) {
> + struct pt_regs *regs = task_pt_regs(current);
> + int arg = (off - BPF_DATA(lo32[0])) >> 2;
> + syscall_get_arguments(current, regs, arg, 1, &value);
> + *A = value;
> + } else if (off >= BPF_DATA(hi32[0]) && off <= BPF_DATA(hi32[5])) {
> + struct pt_regs *regs = task_pt_regs(current);
> + int arg = (off - BPF_DATA(hi32[0])) >> 2;
> + syscall_get_arguments(current, regs, arg, 1, &value);
> + *A = get_high_bits(value);
> + } else {
> + return NULL;
> + }
> +#undef BPF_DATA
> + return buf;
> +}
> +
> +/**
> + * seccomp_run_filters - run 'current' against the given syscall
> + * @syscall: number of the current system call
Strange comments.
> + *
> + * Returns valid seccomp BPF response codes.
> + */
> +static u32 seccomp_run_filters(int syscall)
> +{
> + struct seccomp_filter *f;
> + const struct bpf_load_fns loaders = { bpf_pointer, bpf_length };
I don't see the point of this.
The return values for seccomp filters are different than the networking
ones, so there is never a need to get bpf_length from the filter code
as it's known at compile time. So just declare BPF_S_LD_W_LEN and
S_LDX_W_LEN networking-only instructions and don't bother with all this.
> + u32 ret = SECCOMP_RET_KILL;
> + const void *sc_ptr = (const void *)(uintptr_t)syscall;
> +
> + /* It's not possible for the filter to be NULL here. */
> +#ifdef CONFIG_COMPAT
> + if (current->seccomp.filter->compat != !!(is_compat_task()))
> + return ret;
> +#endif
> +
> + /*
> + * All filters are evaluated in order of youngest to oldest. The lowest
> + * BPF return value always takes priority.
> + */
> + for (f = current->seccomp.filter; f; f = f->prev) {
> + ret = bpf_run_filter(sc_ptr, f->insns, &loaders);
> + if (ret != SECCOMP_RET_ALLOW)
> + break;
> + }
> + return ret;
> +}
> +
> +/**
> + * seccomp_attach_filter: Attaches a seccomp filter to current.
> + * @fprog: BPF program to install
> + *
> + * Returns 0 on success or an errno on failure.
> + */
> +static long seccomp_attach_filter(struct sock_fprog *fprog)
> +{
> + struct seccomp_filter *filter = NULL;
Don't initialize it to NULL; the next time 'filter' is used it's set
to kzalloc's return value.
> + unsigned long fp_size = fprog->len * sizeof(struct sock_filter);
> + long ret = -EINVAL;
> +
> + if (fprog->len == 0 || fprog->len > BPF_MAXINSNS)
> + goto out;
Oh wait, you need the NULL because you can call put_filter() via out.
Well, just return -EINVAL directly here instead, I'd say.
> +
> + /* Allocate a new seccomp_filter */
> + ret = -ENOMEM;
> + filter = kzalloc(sizeof(struct seccomp_filter) + fp_size, GFP_KERNEL);
> + if (!filter)
> + goto out;
Same here, just return ENOMEM.
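[Roughly what the early-return shape of seccomp_attach_filter() would look
like; a sketch against the code quoted above, not a tested patch:]

static long seccomp_attach_filter(struct sock_fprog *fprog)
{
	struct seccomp_filter *filter;
	unsigned long fp_size = fprog->len * sizeof(struct sock_filter);
	long ret;

	if (fprog->len == 0 || fprog->len > BPF_MAXINSNS)
		return -EINVAL;

	filter = kzalloc(sizeof(*filter) + fp_size, GFP_KERNEL);
	if (!filter)
		return -ENOMEM;
	atomic_set(&filter->usage, 1);
	filter->count = fprog->len;

	ret = -EFAULT;
	if (copy_from_user(filter->insns, fprog->filter, fp_size))
		goto out;

	ret = bpf_chk_filter(filter->insns, filter->count, BPF_CHK_FLAGS_NO_SKB);
	if (ret)
		goto out;

	/* ... CAP_SYS_ADMIN/no_new_privs check, compat lock and list insert
	 * as in the patch, still using "goto out" on failure ... */
	return 0;
out:
	put_seccomp_filter(filter);
	return ret;
}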
> + atomic_set(&filter->usage, 1);
> + filter->count = fprog->len;
Why is it called count in one place and len in the other? Isn't it clearer
when always using len?
> +
> + /* Copy the instructions from fprog. */
> + ret = -EFAULT;
> + if (copy_from_user(filter->insns, fprog->filter, fp_size))
> + goto out;
> +
> + /* Check the fprog */
> + ret = bpf_chk_filter(filter->insns, filter->count, BPF_CHK_FLAGS_NO_SKB);
> + if (ret)
> + goto out;
> +
> + /*
> + * Installing a seccomp filter requires that the task
> + * have CAP_SYS_ADMIN in its namespace or be running with
> + * no_new_privs. This avoids scenarios where unprivileged
> + * tasks can affect the behavior of privileged children.
> + */
> + ret = -EACCES;
> + if (!current->no_new_privs &&
> + security_capable_noaudit(current_cred(), current_user_ns(),
> + CAP_SYS_ADMIN) != 0)
> + goto out;
> +
> + /* Lock the filter to the current calling convention. */
> +#ifdef CONFIG_COMPAT
> + filter->compat = !!(is_compat_task());
> +#endif
> +
> + /*
> + * If there is an existing filter, make it the prev
> + * and don't drop its task reference.
> + */
> + filter->prev = current->seccomp.filter;
> + current->seccomp.filter = filter;
> + return 0;
> +out:
> + put_seccomp_filter(filter); /* for get or task, on err */
> + return ret;
> +}
> +
> +/**
> + * seccomp_attach_user_filter - attaches a user-supplied sock_fprog
> + * @user_filter: pointer to the user data containing a sock_fprog.
> + *
> + * This function may be called repeatedly to install additional filters.
> + * Every filter successfully installed will be evaluated (in reverse order)
> + * for each system call the task makes.
> + *
> + * Returns 0 on success and non-zero otherwise.
> + */
> +long seccomp_attach_user_filter(char __user *user_filter)
> +{
> + struct sock_fprog fprog;
> + long ret = -EFAULT;
> +
> + if (!user_filter)
> + goto out;
> +#ifdef CONFIG_COMPAT
> + if (is_compat_task()) {
> + /* XXX: Share with net/compat.c */
You can't share this with net/compat.c because they have to pass a __user
pointer to a generic sock_setsockopt(). You could refactor their code to
push the compat check later, but I think they prefer to keep all the compat
stuff in one place.
> + struct {
> + u16 len;
> + compat_uptr_t filter; /* struct sock_filter */
> + } fprog32;
> + if (copy_from_user(&fprog32, user_filter, sizeof(fprog32)))
> + goto out;
> + fprog.len = fprog32.len;
> + fprog.filter = compat_ptr(fprog32.filter);
> + } else
> +#endif
> + if (copy_from_user(&fprog, user_filter, sizeof(fprog)))
> + goto out;
Probably a good idea to indent the else-if one more level to make it more
obvious. Or add a comment after the else.
> + ret = seccomp_attach_filter(&fprog);
> +out:
> + return ret;
> +}
> +
> +/* get_seccomp_filter - increments the reference count of @orig. */
> +static struct seccomp_filter *get_seccomp_filter(struct seccomp_filter *orig)
> +{
> + if (!orig)
> + return NULL;
> + /* Reference count is bounded by the number of total processes. */
> + atomic_inc(&orig->usage);
> + return orig;
> +}
> +
> +/* put_seccomp_filter - decrements the ref count of @orig and may free. */
> +void put_seccomp_filter(struct seccomp_filter *orig)
> +{
> + /* Clean up single-reference branches iteratively. */
> + while (orig && atomic_dec_and_test(&orig->usage)) {
> + struct seccomp_filter *freeme = orig;
> + orig = orig->prev;
> + kfree(freeme);
> + }
> +}
> +
> +/**
> + * copy_seccomp: manages inheritance on fork
> + * @child: forkee's seccomp
> + * @prev: forker's seccomp
> + *
> + * Ensures that @child inherits seccomp mode and state if
> + * seccomp filtering is in use.
> + */
> +void copy_seccomp(struct seccomp *child,
> + const struct seccomp *prev)
> +{
> + child->mode = prev->mode;
> + child->filter = get_seccomp_filter(prev->filter);
> +}
> +#endif /* CONFIG_SECCOMP_FILTER */
>
> /*
> * Secure computing mode 1 allows only read/write/exit/sigreturn.
> @@ -34,10 +315,10 @@ static int mode1_syscalls_32[] = {
> void __secure_computing(int this_syscall)
> {
> int mode = current->seccomp.mode;
> - int * syscall;
> + int *syscall;
>
> switch (mode) {
> - case 1:
> + case SECCOMP_MODE_STRICT:
> syscall = mode1_syscalls;
> #ifdef CONFIG_COMPAT
> if (is_compat_task())
> @@ -48,6 +329,13 @@ void __secure_computing(int this_syscall)
> return;
> } while (*++syscall);
> break;
> +#ifdef CONFIG_SECCOMP_FILTER
> + case SECCOMP_MODE_FILTER:
> + if (seccomp_run_filters(this_syscall) == SECCOMP_RET_ALLOW)
> + return;
> + seccomp_filter_log_failure(this_syscall);
> + break;
> +#endif
> default:
> BUG();
> }
> @@ -64,25 +352,34 @@ long prctl_get_seccomp(void)
> return current->seccomp.mode;
> }
>
> -long prctl_set_seccomp(unsigned long seccomp_mode)
> +long prctl_set_seccomp(unsigned long seccomp_mode, char __user *filter)
> {
> - long ret;
> + long ret = -EINVAL;
>
> - /* can set it only once to be even more secure */
> - ret = -EPERM;
> - if (unlikely(current->seccomp.mode))
> + if (current->seccomp.mode &&
> + current->seccomp.mode != seccomp_mode)
> goto out;
>
> - ret = -EINVAL;
> - if (seccomp_mode && seccomp_mode <= NR_SECCOMP_MODES) {
> - current->seccomp.mode = seccomp_mode;
> - set_thread_flag(TIF_SECCOMP);
> + switch (seccomp_mode) {
> + case SECCOMP_MODE_STRICT:
> + ret = 0;
> #ifdef TIF_NOTSC
> disable_TSC();
> #endif
> - ret = 0;
> + break;
> +#ifdef CONFIG_SECCOMP_FILTER
> + case SECCOMP_MODE_FILTER:
> + ret = seccomp_attach_user_filter(filter);
> + if (ret)
> + goto out;
> + break;
> +#endif
> + default:
> + goto out;
> }
>
> - out:
> + current->seccomp.mode = seccomp_mode;
> + set_thread_flag(TIF_SECCOMP);
> +out:
> return ret;
> }
> diff --git a/kernel/sys.c b/kernel/sys.c
> index 4070153..905031e 100644
> --- a/kernel/sys.c
> +++ b/kernel/sys.c
> @@ -1899,7 +1899,7 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2,
unsigned long, arg3,
> error = prctl_get_seccomp();
> break;
> case PR_SET_SECCOMP:
> - error = prctl_set_seccomp(arg2);
> + error = prctl_set_seccomp(arg2, (char __user *)arg3);
> break;
> case PR_GET_TSC:
> error = GET_TSC_CTL(arg2);
> --
> 1.7.5.4
>
>
Greetings,
Indan
Hello,
On Thu, February 16, 2012 21:02, Will Drewry wrote:
> This change allows CONFIG_SECCOMP to make use of BPF programs for
> user-controlled system call filtering (as shown in this patch series).
>
> To minimize the impact on existing BPF evaluation, function pointer
> use must be declared at sk_chk_filter-time. This allows ancillary
> load instructions to be generated that use the function pointer rather
> than adding _any_ code to the existing LD_* instruction paths.
>
> Crude performance numbers using udpflood -l 10000000 against dummy0.
> 3 trials for baseline, 3 for with tcpdump. Averaged then differenced.
> Hard to believe trials were repeated at least a couple more times.
>
> * x86 32-bit (Atom N570 @ 1.66 GHz 2 core HT) [stackprot]:
> - Without: 94.05s - 76.36s = 17.68s
> - With: 86.22s - 73.30s = 12.92s
> - Slowdown per call: -476 nanoseconds
>
> * x86 32-bit (Atom N570 @ 1.66 GHz 2 core HT) [no stackprot]:
> - Without: 92.06s - 77.81s = 14.25s
> - With: 91.77s - 76.91s = 14.86s
> - Slowdown per call: +61 nanoseconds
>
> * x86 64-bit (Atom N570 @ 1.66 GHz 2 core HT) [stackprot]:
> - Without: 122.58s - 99.54s = 23.04s
> - With: 115.52s - 98.99s = 16.53s
> - Slowdown per call: -651 nanoseconds
>
> * x86 64-bit (Atom N570 @ 1.66 GHz 2 core HT) [no stackprot]:
> - Without: 114.95s - 91.92s = 23.03s
> - With: 110.47s - 90.79s = 19.68s
> - Slowdown per call: -335 nanoseconds
>
> This makes the x86-32-nossp make sense. Added register pressure always
> makes x86-32 sad.
Your 32-bit numbers are better than your 64-bit numbers, so I don't get
this comment.
> If this is a concern, I could change the call
> approach to bpf_run_filter to see if I can alleviate it a bit.
>
> That said, the x86-*-ssp numbers show a marked increase in performance.
> I've tested and retested and I keep getting these results. I'm also
> suprised by the nossp speed up on 64-bit, but I dunno. I haven't looked
> at the full disassembly of the call path. If that is required for the
> performance differences I'm seeing, please let me know. Or if I there is
> a preferred cpu to run this against - atoms can be a little weird.
Yeah, testing on Atom is a bit silly.
> v8: - fixed variable positioning and bad cast ([email protected])
> - no longer passes A as a pointer (inspection of x86 asm shows A is
> %ebx again; thanks [email protected])
> - cleaned up switch macros and expanded use
> ([email protected], [email protected])
> - added length fn pointer and handled LD_W_LEN/LDX_W_LEN
> - moved from a wrapping struct to a typedef for the function
> pointer. (matches existing function pointer style)
> - added comprehensive comment above the typedef.
> - benchmarks
> v7: - first cut
>
> Signed-off-by: Will Drewry <[email protected]>
> ---
> include/linux/filter.h | 69 +++++++++++++++++++++-
> net/core/filter.c | 152 +++++++++++++++++++++++++++++++++++++----------
> 2 files changed, 185 insertions(+), 36 deletions(-)
>
> diff --git a/include/linux/filter.h b/include/linux/filter.h
> index 8eeb205..d22ad46 100644
> --- a/include/linux/filter.h
> +++ b/include/linux/filter.h
> @@ -110,6 +110,9 @@ struct sock_fprog { /* Required for SO_ATTACH_FILTER. */
> */
> #define BPF_MEMWORDS 16
>
> +/* BPF program (checking) flags */
> +#define BPF_CHK_FLAGS_NO_SKB 1
> +
> /* RATIONALE. Negative offsets are invalid in BPF.
> We use them to reference ancillary data.
> Unlike introduction new instructions, it does not break
> @@ -145,17 +148,67 @@ struct sk_filter
> struct sock_filter insns[0];
> };
>
> +/**
> + * struct bpf_load_fns - callbacks for bpf_run_filter
> + * These functions are called by bpf_run_filter if bpf_chk_filter
> + * was invoked with BPF_CHK_FLAGS_NO_SKB.
> + *
> + * pointer:
> + * @data: const pointer to the data passed into bpf_run_filter
> + * @k: offset into @skb's data
> + * @size: the size of the requested data in bytes: 1, 2, or 4.
> + * @buffer: If non-NULL, a 32-bit buffer for staging data.
> + *
> + * Returns a pointer to the requested data.
> + *
> + * This function operates similarly to load_pointer in net/core/filter.c
> + * except that the pointer to the returned data must already be
> + * byteswapped as appropriate to the source data and endianness.
> + * @buffer may be used if the data needs to be staged.
> + *
> + * length:
> + * @data: const pointer to the data passed into bpf_fun_filter
> + *
> + * Returns the length of the data.
> + */
> +struct bpf_load_fns {
> + void *(*pointer)(const void *data, int k, unsigned int size,
> + void *buffer);
> + u32 (*length)(const void *data);
> +};
Like I said in the other email, length is useless for the non-skb case.
If you really want to add it, just make it a constant. And 'pointer' isn't
the best name.
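[In other words, something like this; a sketch only, and the name "load"
is just a suggestion:]

struct bpf_load_fns {
	/*
	 * Fetch @size bytes at offset @k of @data, staging into @buffer
	 * if needed; returns NULL on a bad offset.
	 */
	void *(*load)(const void *data, int k, unsigned int size, void *buffer);
	/* If a length really is wanted, a constant would be enough: */
	u32 length;
};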
> +
> static inline unsigned int sk_filter_len(const struct sk_filter *fp)
> {
> return fp->len * sizeof(struct sock_filter) + sizeof(*fp);
> }
>
> +extern unsigned int bpf_run_filter(const void *data,
> + const struct sock_filter *filter,
> + const struct bpf_load_fns *load_fn);
> +
> +/**
> + * sk_run_filter - run a filter on a socket
> + * @skb: buffer to run the filter on
> + * @fentry: filter to apply
> + *
> + * Runs bpf_run_filter with the struct sk_buff-specific data
> + * accessor behavior.
> + */
> +static inline unsigned int sk_run_filter(const struct sk_buff *skb,
> + const struct sock_filter *filter)
> +{
> + return bpf_run_filter(skb, filter, NULL);
> +}
> +
> extern int sk_filter(struct sock *sk, struct sk_buff *skb);
> -extern unsigned int sk_run_filter(const struct sk_buff *skb,
> - const struct sock_filter *filter);
> extern int sk_attach_filter(struct sock_fprog *fprog, struct sock *sk);
> extern int sk_detach_filter(struct sock *sk);
> -extern int sk_chk_filter(struct sock_filter *filter, unsigned int flen);
> +extern int bpf_chk_filter(struct sock_filter *filter, unsigned int flen, u32 flags);
> +
> +static inline int sk_chk_filter(struct sock_filter *filter, unsigned int flen)
> +{
> + return bpf_chk_filter(filter, flen, 0);
> +}
>
> #ifdef CONFIG_BPF_JIT
> extern void bpf_jit_compile(struct sk_filter *fp);
> @@ -228,6 +281,16 @@ enum {
> BPF_S_ANC_HATYPE,
> BPF_S_ANC_RXHASH,
> BPF_S_ANC_CPU,
> + /* Used to differentiate SKB data and generic data */
> + BPF_S_ANC_LD_W_ABS,
> + BPF_S_ANC_LD_H_ABS,
> + BPF_S_ANC_LD_B_ABS,
> + BPF_S_ANC_LD_W_LEN,
> + BPF_S_ANC_LD_W_IND,
> + BPF_S_ANC_LD_H_IND,
> + BPF_S_ANC_LD_B_IND,
> + BPF_S_ANC_LDX_W_LEN,
> + BPF_S_ANC_LDX_B_MSH,
> };
>
> #endif /* __KERNEL__ */
> diff --git a/net/core/filter.c b/net/core/filter.c
> index 5dea452..a5c98a9 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -98,9 +98,10 @@ int sk_filter(struct sock *sk, struct sk_buff *skb)
> EXPORT_SYMBOL(sk_filter);
>
> /**
> - * sk_run_filter - run a filter on a socket
> - * @skb: buffer to run the filter on
> + * bpf_run_filter - run a filter on a BPF program
The filter is the BPF program, so this comment is weird.
> + * @data: buffer to run the filter on
> * @fentry: filter to apply
> + * @load_fns: custom data accessor functions
> *
> * Decode and apply filter instructions to the skb->data.
> * Return length to keep, 0 for none. @skb is the data we are
> @@ -108,9 +109,13 @@ EXPORT_SYMBOL(sk_filter);
> * Because all jumps are guaranteed to be before last instruction,
> * and last instruction guaranteed to be a RET, we dont need to check
> * flen. (We used to pass to this function the length of filter)
> + *
> + * load_fn is only used if SKF_FLAGS_USE_LOAD_FNS was specified
> + * to sk_chk_generic_filter.
Stale comment.
> */
> -unsigned int sk_run_filter(const struct sk_buff *skb,
> - const struct sock_filter *fentry)
> +unsigned int bpf_run_filter(const void *data,
> + const struct sock_filter *fentry,
> + const struct bpf_load_fns *load_fns)
> {
> void *ptr;
> u32 A = 0; /* Accumulator */
> @@ -128,6 +133,7 @@ unsigned int sk_run_filter(const struct sk_buff *skb,
> #else
> const u32 K = fentry->k;
> #endif
> +#define SKB(_data) ((const struct sk_buff *)(_data))
Urgh!
If you had done:
const struct sk_buff *skb = data;
at the top, all those changes wouldn't be needed and it would look better too.
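[i.e. roughly, as a sketch of the relevant part only:]

unsigned int bpf_run_filter(const void *data,
			    const struct sock_filter *fentry,
			    const struct bpf_load_fns *load_fns)
{
	const struct sk_buff *skb = data;	/* only dereferenced by skb opcodes */
	void *ptr;
	u32 A = 0;			/* Accumulator */
	u32 X = 0;			/* Index Register */
	u32 mem[BPF_MEMWORDS];		/* Scratch Memory Store */
	u32 tmp;
	int k;

	for (;; fentry++) {
		const u32 K = fentry->k;

		switch (fentry->code) {
		/* ... existing cases stay exactly as they were ... */
		case BPF_S_ANC_PROTOCOL:
			A = ntohs(skb->protocol);	/* no SKB(data) casts needed */
			continue;
		/* ... */
		}
	}
}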
>
> switch (fentry->code) {
> case BPF_S_ALU_ADD_X:
> @@ -213,7 +219,7 @@ unsigned int sk_run_filter(const struct sk_buff *skb,
> case BPF_S_LD_W_ABS:
> k = K;
> load_w:
> - ptr = load_pointer(skb, k, 4, &tmp);
> + ptr = load_pointer(data, k, 4, &tmp);
> if (ptr != NULL) {
> A = get_unaligned_be32(ptr);
> continue;
> @@ -222,7 +228,7 @@ load_w:
> case BPF_S_LD_H_ABS:
> k = K;
> load_h:
> - ptr = load_pointer(skb, k, 2, &tmp);
> + ptr = load_pointer(data, k, 2, &tmp);
> if (ptr != NULL) {
> A = get_unaligned_be16(ptr);
> continue;
> @@ -231,17 +237,17 @@ load_h:
> case BPF_S_LD_B_ABS:
> k = K;
> load_b:
> - ptr = load_pointer(skb, k, 1, &tmp);
> + ptr = load_pointer(data, k, 1, &tmp);
> if (ptr != NULL) {
> A = *(u8 *)ptr;
> continue;
> }
> return 0;
> case BPF_S_LD_W_LEN:
> - A = skb->len;
> + A = SKB(data)->len;
> continue;
> case BPF_S_LDX_W_LEN:
> - X = skb->len;
> + X = SKB(data)->len;
> continue;
> case BPF_S_LD_W_IND:
> k = X + K;
> @@ -253,7 +259,7 @@ load_b:
> k = X + K;
> goto load_b;
> case BPF_S_LDX_B_MSH:
> - ptr = load_pointer(skb, K, 1, &tmp);
> + ptr = load_pointer(data, K, 1, &tmp);
> if (ptr != NULL) {
> X = (*(u8 *)ptr & 0xf) << 2;
> continue;
> @@ -288,29 +294,29 @@ load_b:
> mem[K] = X;
> continue;
> case BPF_S_ANC_PROTOCOL:
> - A = ntohs(skb->protocol);
> + A = ntohs(SKB(data)->protocol);
> continue;
> case BPF_S_ANC_PKTTYPE:
> - A = skb->pkt_type;
> + A = SKB(data)->pkt_type;
> continue;
> case BPF_S_ANC_IFINDEX:
> - if (!skb->dev)
> + if (!SKB(data)->dev)
> return 0;
> - A = skb->dev->ifindex;
> + A = SKB(data)->dev->ifindex;
> continue;
> case BPF_S_ANC_MARK:
> - A = skb->mark;
> + A = SKB(data)->mark;
> continue;
> case BPF_S_ANC_QUEUE:
> - A = skb->queue_mapping;
> + A = SKB(data)->queue_mapping;
> continue;
> case BPF_S_ANC_HATYPE:
> - if (!skb->dev)
> + if (!SKB(data)->dev)
> return 0;
> - A = skb->dev->type;
> + A = SKB(data)->dev->type;
> continue;
> case BPF_S_ANC_RXHASH:
> - A = skb->rxhash;
> + A = SKB(data)->rxhash;
> continue;
> case BPF_S_ANC_CPU:
> A = raw_smp_processor_id();
> @@ -318,15 +324,15 @@ load_b:
> case BPF_S_ANC_NLATTR: {
> struct nlattr *nla;
>
> - if (skb_is_nonlinear(skb))
> + if (skb_is_nonlinear(SKB(data)))
> return 0;
> - if (A > skb->len - sizeof(struct nlattr))
> + if (A > SKB(data)->len - sizeof(struct nlattr))
> return 0;
>
> - nla = nla_find((struct nlattr *)&skb->data[A],
> - skb->len - A, X);
> + nla = nla_find((struct nlattr *)&SKB(data)->data[A],
> + SKB(data)->len - A, X);
> if (nla)
> - A = (void *)nla - (void *)skb->data;
> + A = (void *)nla - (void *)SKB(data)->data;
> else
> A = 0;
> continue;
> @@ -334,22 +340,71 @@ load_b:
> case BPF_S_ANC_NLATTR_NEST: {
> struct nlattr *nla;
>
> - if (skb_is_nonlinear(skb))
> + if (skb_is_nonlinear(SKB(data)))
> return 0;
> - if (A > skb->len - sizeof(struct nlattr))
> + if (A > SKB(data)->len - sizeof(struct nlattr))
> return 0;
>
> - nla = (struct nlattr *)&skb->data[A];
> - if (nla->nla_len > A - skb->len)
> + nla = (struct nlattr *)&SKB(data)->data[A];
> + if (nla->nla_len > A - SKB(data)->len)
> return 0;
>
> nla = nla_find_nested(nla, X);
> if (nla)
> - A = (void *)nla - (void *)skb->data;
> + A = (void *)nla - (void *)SKB(data)->data;
> else
> A = 0;
> continue;
> }
All changes up to here are unnecessary.
> + case BPF_S_ANC_LD_W_ABS:
> + k = K;
> +load_fn_w:
> + ptr = load_fns->pointer(data, k, 4, &tmp);
> + if (ptr) {
> + A = *(u32 *)ptr;
> + continue;
> + }
> + return 0;
> + case BPF_S_ANC_LD_H_ABS:
> + k = K;
> +load_fn_h:
> + ptr = load_fns->pointer(data, k, 2, &tmp);
> + if (ptr) {
> + A = *(u16 *)ptr;
> + continue;
> + }
> + return 0;
> + case BPF_S_ANC_LD_B_ABS:
> + k = K;
> +load_fn_b:
> + ptr = load_fns->pointer(data, k, 1, &tmp);
> + if (ptr) {
> + A = *(u8 *)ptr;
> + continue;
> + }
> + return 0;
> + case BPF_S_ANC_LDX_B_MSH:
> + ptr = load_fns->pointer(data, K, 1, &tmp);
> + if (ptr) {
> + X = (*(u8 *)ptr & 0xf) << 2;
> + continue;
> + }
> + return 0;
> + case BPF_S_ANC_LD_W_IND:
> + k = X + K;
> + goto load_fn_w;
> + case BPF_S_ANC_LD_H_IND:
> + k = X + K;
> + goto load_fn_h;
> + case BPF_S_ANC_LD_B_IND:
> + k = X + K;
> + goto load_fn_b;
> + case BPF_S_ANC_LD_W_LEN:
> + A = load_fns->length(data);
> + continue;
> + case BPF_S_ANC_LDX_W_LEN:
> + X = load_fns->length(data);
These two should either return 0, be made networking-only, just return 0/-1,
or use a constant length.
> + continue;
> default:
> WARN_RATELIMIT(1, "Unknown code:%u jt:%u tf:%u k:%u\n",
> fentry->code, fentry->jt,
> @@ -360,7 +415,7 @@ load_b:
>
> return 0;
> }
> -EXPORT_SYMBOL(sk_run_filter);
> +EXPORT_SYMBOL(bpf_run_filter);
>
> /*
> * Security :
> @@ -423,9 +478,10 @@ error:
> }
>
> /**
> - * sk_chk_filter - verify socket filter code
> + * bpf_chk_filter - verify socket filter BPF code
> * @filter: filter to verify
> * @flen: length of filter
> + * @flags: May be BPF_CHK_FLAGS_NO_SKB or 0
> *
> * Check the user's filter code. If we let some ugly
> * filter code slip through kaboom! The filter must contain
> @@ -434,9 +490,13 @@ error:
> *
> * All jumps are forward as they are not signed.
> *
> + * If BPF_CHK_FLAGS_NO_SKB is set in flags, any SKB-specific
> + * rules become illegal and a custom set of bpf_load_fns will
> + * be expected by bpf_run_filter.
> + *
> * Returns 0 if the rule set is legal or -EINVAL if not.
> */
> -int sk_chk_filter(struct sock_filter *filter, unsigned int flen)
> +int bpf_chk_filter(struct sock_filter *filter, unsigned int flen, u32 flags)
> {
> /*
> * Valid instructions are initialized to non-0.
> @@ -542,9 +602,35 @@ int sk_chk_filter(struct sock_filter *filter, unsigned int flen)
> pc + ftest->jf + 1 >= flen)
> return -EINVAL;
> break;
> +#define MAYBE_USE_LOAD_FN(CODE) \
> + if (flags & BPF_CHK_FLAGS_NO_SKB) { \
> + code = BPF_S_ANC_##CODE; \
> + break; \
> + }
You might as well hide everything in the macro then, including the case
label, like the ANCILLARY() macro does.
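[Something like this, for instance; an untested sketch built from the names
in the patch:]

#define MAYBE_USE_LOAD_FN(CODE)				\
	case BPF_S_##CODE:				\
		if (flags & BPF_CHK_FLAGS_NO_SKB) {	\
			code = BPF_S_ANC_##CODE;	\
			break;				\
		}

		/* stand-alone instructions keep their own break: */
		MAYBE_USE_LOAD_FN(LD_W_LEN);
			break;
		/* the ABS loads still fall through into the ancillary checks: */
		MAYBE_USE_LOAD_FN(LD_W_ABS);
		MAYBE_USE_LOAD_FN(LD_H_ABS);
		MAYBE_USE_LOAD_FN(LD_B_ABS);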
> + case BPF_S_LD_W_LEN:
> + MAYBE_USE_LOAD_FN(LD_W_LEN);
> + break;
> + case BPF_S_LDX_W_LEN:
> + MAYBE_USE_LOAD_FN(LDX_W_LEN);
> + break;
> + case BPF_S_LD_W_IND:
> + MAYBE_USE_LOAD_FN(LD_W_IND);
> + break;
> + case BPF_S_LD_H_IND:
> + MAYBE_USE_LOAD_FN(LD_H_IND);
> + break;
> + case BPF_S_LD_B_IND:
> + MAYBE_USE_LOAD_FN(LD_B_IND);
> + break;
> + case BPF_S_LDX_B_MSH:
> + MAYBE_USE_LOAD_FN(LDX_B_MSH);
> + break;
> case BPF_S_LD_W_ABS:
> + MAYBE_USE_LOAD_FN(LD_W_ABS);
> case BPF_S_LD_H_ABS:
> + MAYBE_USE_LOAD_FN(LD_H_ABS);
> case BPF_S_LD_B_ABS:
> + MAYBE_USE_LOAD_FN(LD_B_ABS);
> #define ANCILLARY(CODE) case SKF_AD_OFF + SKF_AD_##CODE: \
> code = BPF_S_ANC_##CODE; \
> break
> @@ -572,7 +658,7 @@ int sk_chk_filter(struct sock_filter *filter, unsigned int flen)
> }
> return -EINVAL;
> }
> -EXPORT_SYMBOL(sk_chk_filter);
> +EXPORT_SYMBOL(bpf_chk_filter);
>
> /**
> * sk_filter_release_rcu - Release a socket filter by rcu_head
> --
> 1.7.5.4
>
Greetings,
Indan
On Fri, February 17, 2012 03:22, H. Peter Anvin wrote:
> On 02/16/2012 06:16 PM, Andrew Lutomirski wrote:
>>
>> Is there really no syscall that cares about endianness?
Perhaps when 64-bit syscall args are passed via two 32-bit registers.
But that is no argument to make all argument accesses endianness aware.
>> Even if it ends up working, forcing syscall arguments to have a
>> particular endianness seems like a bad decision, especially if anyone
>> ever wants to make a 64-bit BPF implementation. (Or if any
>> architecture adds 128-bit syscall arguments to a future syscall
>> namespace or whatever it's called. x86-64 has 128-bit xmm
>> registers...)
>>
>
> Not to mention that the reshuffling code will add totally unnecessary
> cost to the normal operation.
There is no such extra cost.
> Either way, Indan has it backwards ... it
> *is* one field, the fact that two operations is needed to access it is a
> function of the underlying byte code, and even if the byte code can't
> support it, a JIT could merge adjacent operations if 64-bit operations
> are possible -- or we could (and arguably should) add 64-bit opcodes in
> the future for efficiency.
It is a virtual data structure whose sole purpose is to provide syscall
info to the byte code. The actual data structure as such never exists
in memory. So handing the byte code something that is hard to digest is silly.
A JIT won't be able to merge accesses because it also has to merge other
instructions and recognize when 64-bit operations are done with 32-bit
instructions. I think that will be too hard for a JIT.
The only good reason to use 64 bit fields is if 64-bit support will be
added to BPF in the future. If not, then it's just unnecessary pain for
no good reason.
An alternative to struct seccomp_data would be to add special instructions
that load the desired info to 'A'. E.g. BPF_S_ANC_SYSCALL_ARG with 'k'
selecting which arg. But that's probably harder to fit into the current
filter code.
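[For illustration, the interpreter side of such an instruction might look
roughly like this; the opcode is hypothetical and not part of any posted
patch, with K selecting the argument:]

		case BPF_S_ANC_SYSCALL_ARG: {
			unsigned long value;
			struct pt_regs *regs = task_pt_regs(current);

			if (K >= 6)
				return 0;
			syscall_get_arguments(current, regs, K, 1, &value);
			A = (u32)value;	/* low 32 bits; K could encode lo/hi too */
			continue;
		}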
Greetings,
Indan
On Thu, Feb 16, 2012 at 8:44 PM, Indan Zupancic <[email protected]> wrote:
> On Thu, February 16, 2012 21:02, Will Drewry wrote:
>> [This patch depends on [email protected]'s no_new_privs patch:
>>  https://lkml.org/lkml/2012/1/30/264
>> ]
>>
>> This patch adds support for seccomp mode 2.  Mode 2 introduces the
>> ability for unprivileged processes to install system call filtering
>> policy expressed in terms of a Berkeley Packet Filter (BPF) program.
>> This program will be evaluated in the kernel for each system call
>> the task makes and computes a result based on data in the format
>> of struct seccomp_data.
>>
>> A filter program may be installed by calling:
>>   struct sock_fprog fprog = { ... };
>>   ...
>>   prctl(PR_SET_SECCOMP, 2, &fprog);
>
> Please add an arg to tell the filter mode.
Hrm? Do you mean SECCOMP_MODE_FILTER? I'll update the changelog to
include that.
>>
>> The return value of the filter program determines if the system call is
>> allowed to proceed or denied.  If the first filter program installed
>> allows prctl(2) calls, then the above call may be made repeatedly
>> by a task to further reduce its access to the kernel.  All attached
>> programs must be evaluated before a system call will be allowed to
>> proceed.
>>
>> To avoid CONFIG_COMPAT related landmines, once a filter program is
>> installed using specific is_compat_task() value, it is not allowed to
>> make system calls using the alternate entry point.
>
> Just allow paths with a filter and deny paths without a filter installed.
Not sure what that means, but given the feedback today, I'm just
adding a calling convention u32 so this code all disappears.
>> Filter programs will be inherited across fork/clone and execve.
>> However, if the task attaching the filter is unprivileged
>> (!CAP_SYS_ADMIN) the no_new_privs bit will be set on the task.  This
>> ensures that unprivileged tasks cannot attach filters that affect
>> privileged tasks (e.g., setuid binary).
>>
>> There are a number of benefits to this approach. A few of which are
>> as follows:
>> - BPF has been exposed to userland for a long time
>> - BPF optimization (and JIT'ing) are well understood
>> - Userland already knows its ABI: system call numbers and desired
>>   arguments
>> - No time-of-check-time-of-use vulnerable data accesses are possible.
>> - system call arguments are loaded on access only to minimize copying
>>   required for system call policy decisions.
>>
>> Mode 2 support is restricted to architectures that enable
>> HAVE_ARCH_SECCOMP_FILTER.  In this patch, the primary dependency is on
>> syscall_get_arguments().  The full desired scope of this feature will
>> add a few minor additional requirements expressed later in this series.
>> Based on discussion, SECCOMP_RET_ERRNO and SECCOMP_RET_TRACE seem to be
>> the desired additional functionality.
>>
>> No architectures are enabled in this patch.
>>
>> ?v8: - use bpf_chk_filter, bpf_run_filter. update load_fns
>> ? ? ?- Lots of fixes courtesy of [email protected]:
>> ? ? ?-- fix up load behavior, compat fixups, and merge alloc code,
>> ? ? ?-- renamed pc and dropped __packed, use bool compat.
>> ? ? ?-- Added a hidden CONFIG_SECCOMP_FILTER to synthesize non-arch
>> ? ? ? ? dependencies
>> ?v7: ?(massive overhaul thanks to Indan, others)
>> ? ? ?- added CONFIG_HAVE_ARCH_SECCOMP_FILTER
>> ? ? ?- merged into seccomp.c
>> ? ? ?- minimal seccomp_filter.h
>> ? ? ?- no config option (part of seccomp)
>> ? ? ?- no new prctl
>> ? ? ?- doesn't break seccomp on systems without asm/syscall.h
>> ? ? ? ?(works but arg access always fails)
>> ? ? ?- dropped seccomp_init_task, extra free functions, ...
>> ? ? ?- dropped the no-asm/syscall.h code paths
>> ? ? ?- merges with network sk_run_filter and sk_chk_filter
>> ?v6: - fix memory leak on attach compat check failure
>> ? ? ?- require no_new_privs || CAP_SYS_ADMIN prior to filter
>> ? ? ? ?installation. ([email protected])
>> ? ? ?- s/seccomp_struct_/seccomp_/ for macros/functions ([email protected])
>> ? ? ?- cleaned up Kconfig ([email protected])
>> ? ? ?- on block, note if the call was compat (so the # means something)
>> ?v5: - uses syscall_get_arguments
>> ? ? ? ?([email protected],[email protected], [email protected])
>> ? ? ? - uses union-based arg storage with hi/lo struct to
>> ? ? ? ? handle endianness. ?Compromises between the two alternate
>> ? ? ? ? proposals to minimize extra arg shuffling and account for
>> ? ? ? ? endianness assuming userspace uses offsetof().
>> ? ? ? ? ([email protected], [email protected])
>> ? ? ? - update Kconfig description
>> ? ? ? - add include/seccomp_filter.h and add its installation
>> ? ? ? - (naive) on-demand syscall argument loading
>> ? ? ? - drop seccomp_t ([email protected])
>> ?v4: ?- adjusted prctl to make room for PR_[SG]ET_NO_NEW_PRIVS
>> ? ? ? - now uses current->no_new_privs
>> ? ? ? ? ([email protected],[email protected])
>> ? ? ? - assign names to seccomp modes ([email protected])
>> ? ? ? - fix style issues ([email protected])
>> ? ? ? - reworded Kconfig entry ([email protected])
>> ?v3: ?- macros to inline ([email protected])
>> ? ? ? - init_task behavior fixed ([email protected])
>> ? ? ? - drop creator entry and extra NULL check ([email protected])
>> ? ? ? - alloc returns -EINVAL on bad sizing ([email protected])
>> ? ? ? - adds tentative use of "always_unprivileged" as per
>> ? ? ? ? [email protected] and [email protected]
>> ?v2: ?- (patch 2 only)
>>
>> Signed-off-by: Will Drewry <[email protected]>
>> ---
>> ?arch/Kconfig ? ? ? ? ? ?| ? 17 +++
>> ?include/linux/Kbuild ? ?| ? ?1 +
>> ?include/linux/seccomp.h | ? 69 ++++++++++-
>> ?kernel/fork.c ? ? ? ? ? | ? ?3 +
>> ?kernel/seccomp.c ? ? ? ?| ?327 ++++++++++++++++++++++++++++++++++++++++++++--
>> ?kernel/sys.c ? ? ? ? ? ?| ? ?2 +-
>> ?6 files changed, 399 insertions(+), 20 deletions(-)
>>
>> diff --git a/arch/Kconfig b/arch/Kconfig
>> index 4f55c73..c6ba1db 100644
>> --- a/arch/Kconfig
>> +++ b/arch/Kconfig
>> @@ -199,4 +199,21 @@ config HAVE_CMPXCHG_LOCAL
>> ?config HAVE_CMPXCHG_DOUBLE
>> ? ? ? bool
>>
>> +config HAVE_ARCH_SECCOMP_FILTER
>> + ? ? bool
>> + ? ? help
>> + ? ? ? This symbol should be selected by an architecure if it provides
>> + ? ? ? asm/syscall.h, specifically syscall_get_arguments().
>> +
>> +config SECCOMP_FILTER
>> + ? ? def_bool y
>> + ? ? depends on HAVE_ARCH_SECCOMP_FILTER && SECCOMP && NET
>> + ? ? help
>> + ? ? ? Enable tasks to build secure computing environments defined
>> + ? ? ? in terms of Berkeley Packet Filter programs which implement
>> + ? ? ? task-defined system call filtering polices.
>> +
>> + ? ? ? See Documentation/prctl/seccomp_filter.txt for more
>> + ? ? ? information on the topic of seccomp filtering.
>> +
>> ?source "kernel/gcov/Kconfig"
>> diff --git a/include/linux/Kbuild b/include/linux/Kbuild
>> index c94e717..d41ba12 100644
>> --- a/include/linux/Kbuild
>> +++ b/include/linux/Kbuild
>> @@ -330,6 +330,7 @@ header-y += scc.h
>> ?header-y += sched.h
>> ?header-y += screen_info.h
>> ?header-y += sdla.h
>> +header-y += seccomp.h
>> ?header-y += securebits.h
>> ?header-y += selinux_netlink.h
>> ?header-y += sem.h
>> diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h
>> index d61f27f..2bee1f7 100644
>> --- a/include/linux/seccomp.h
>> +++ b/include/linux/seccomp.h
>> @@ -1,14 +1,60 @@
>> ?#ifndef _LINUX_SECCOMP_H
>> ?#define _LINUX_SECCOMP_H
>>
>> +#include <linux/compiler.h>
>> +#include <linux/types.h>
>> +
>> +
>> +/* Valid values for seccomp.mode and prctl(PR_SET_SECCOMP, <mode>) */
>> +#define SECCOMP_MODE_DISABLED	0 /* seccomp is not in use. */
>> +#define SECCOMP_MODE_STRICT	1 /* uses hard-coded filter. */
>> +#define SECCOMP_MODE_FILTER	2 /* uses user-supplied filter. */
>> +
>> +/*
>> + * BPF programs may return a 32-bit value.
>> + * The bottom 16-bits are reserved for future use.
>> + * The upper 16-bits are ordered from least permissive values to most.
>> + *
>> + * The ordering ensures that a min_t() over composed return values always
>> + * selects the least permissive choice.
>> + */
>> +#define SECCOMP_RET_MASK	0xffff0000U
>> +#define SECCOMP_RET_KILL	0x00000000U /* kill the task immediately */
>> +#define SECCOMP_RET_ALLOW	0x7fff0000U /* allow */
>> +
>> +/* Format of the data the BPF program executes over. */
>> +struct seccomp_data {
>> +	int nr;
>> +	__u32 __reserved[3];
>> +	struct {
>> +		__u32	lo;
>> +		__u32	hi;
>> +	} instruction_pointer;
>> +	__u32 lo32[6];
>> +	__u32 hi32[6];
>> +};
>
> I wouldn't use a struct for the IP. And I'd move the args to the front.
I'd left it at the end for future expansion, but I think that will
have to be dealt with differently when the time comes, so I'll reorder.
> Why not call it something with "arg" in the names?
I'm changing this based on the comments in the thread, but however it
ends up, I'll add some mention of args!
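For concreteness, a minimal sketch of a layout along the lines being
discussed (arguments first, "arg" in the names, no struct for the
instruction pointer); every field name below is illustrative only and
not what this patch proposes:

/* Illustrative only -- not the layout from this patch. */
struct seccomp_data {
	int nr;				/* system call number */
	__u32 __reserved;		/* e.g. room for a calling-convention value */
	__u32 arg_lo[6];		/* low 32 bits of each syscall argument */
	__u32 arg_hi[6];		/* high 32 bits of each syscall argument */
	__u32 instruction_pointer_lo;
	__u32 instruction_pointer_hi;
};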
>>
>> +#ifdef __KERNEL__
>> ?#ifdef CONFIG_SECCOMP
>>
>> ?#include <linux/thread_info.h>
>> ?#include <asm/seccomp.h>
>>
>> +struct seccomp_filter;
>> +/**
>> + * struct seccomp - the state of a seccomp'ed process
>> + *
>> + * @mode: ?indicates one of the valid values above for controlled
>> + * ? ? ? ? system calls available to a process.
>> + * @filter: The metadata and ruleset for determining what system calls
>> + * ? ? ? ? ?are allowed for a task.
>> + *
>> + * ? ? ? ? ?@filter must only be accessed from the context of current as there
>> + * ? ? ? ? ?is no locking.
>> + */
>> ?struct seccomp {
>> ? ? ? int mode;
>> + ? ? struct seccomp_filter *filter;
>> ?};
>>
>> ?extern void __secure_computing(int);
>> @@ -19,7 +65,7 @@ static inline void secure_computing(int this_syscall)
>> ?}
>>
>> ?extern long prctl_get_seccomp(void);
>> -extern long prctl_set_seccomp(unsigned long);
>> +extern long prctl_set_seccomp(unsigned long, char __user *);
>>
>> ?static inline int seccomp_mode(struct seccomp *s)
>> ?{
>> @@ -31,15 +77,16 @@ static inline int seccomp_mode(struct seccomp *s)
>> ?#include <linux/errno.h>
>>
>> ?struct seccomp { };
>> +struct seccomp_filter { };
>>
>> -#define secure_computing(x) do { } while (0)
>> +#define secure_computing(x) 0
>>
>> ?static inline long prctl_get_seccomp(void)
>> ?{
>> ? ? ? return -EINVAL;
>> ?}
>>
>> -static inline long prctl_set_seccomp(unsigned long arg2)
>> +static inline long prctl_set_seccomp(unsigned long arg2, char __user *arg3)
>> ?{
>> ? ? ? return -EINVAL;
>> ?}
>> @@ -48,7 +95,21 @@ static inline int seccomp_mode(struct seccomp *s)
>> ?{
>> ? ? ? return 0;
>> ?}
>> -
>> ?#endif /* CONFIG_SECCOMP */
>>
>> +#ifdef CONFIG_SECCOMP_FILTER
>> +extern void put_seccomp_filter(struct seccomp_filter *);
>> +extern void copy_seccomp(struct seccomp *child,
>> + ? ? ? ? ? ? ? ? ? ? ?const struct seccomp *parent);
>> +#else ?/* CONFIG_SECCOMP_FILTER */
>> +/* The macro consumes the ->filter reference. */
>> +#define put_seccomp_filter(_s) do { } while (0)
>> +
>> +static inline void copy_seccomp(struct seccomp *child,
>> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? const struct seccomp *prev)
>> +{
>> + ? ? return;
>> +}
>
> Why a macro for one but an empty inline for the other?
As the comment mentions, it consumes the reference.
put_seccomp_filter operates on current->seccomp.filter, but that member
doesn't exist if !CONFIG_SECCOMP_FILTER, so the macro swallows the
argument without ever evaluating it. However, inlines are the preferred
style, so I used one for copy_seccomp.
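To make that concrete, a minimal sketch of why the stub has to be a
macro, assuming the free_task() call site from this patch:

/*
 * With !CONFIG_SECCOMP_FILTER, struct seccomp has no ->filter member,
 * so even an empty static inline stub would still force the caller's
 * argument expression to compile:
 *
 *	put_seccomp_filter(tsk->seccomp.filter);   <-- error: no member 'filter'
 *
 * A macro that never expands its argument makes that expression vanish
 * at preprocessing time, so the call site in free_task() stays valid:
 */
#define put_seccomp_filter(_s) do { } while (0)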
>> +#endif /* CONFIG_SECCOMP_FILTER */
>> +#endif /* __KERNEL__ */
>> ?#endif /* _LINUX_SECCOMP_H */
>> diff --git a/kernel/fork.c b/kernel/fork.c
>> index b77fd55..a5187b7 100644
>> --- a/kernel/fork.c
>> +++ b/kernel/fork.c
>> @@ -34,6 +34,7 @@
>> ?#include <linux/cgroup.h>
>> ?#include <linux/security.h>
>> ?#include <linux/hugetlb.h>
>> +#include <linux/seccomp.h>
>> ?#include <linux/swap.h>
>> ?#include <linux/syscalls.h>
>> ?#include <linux/jiffies.h>
>> @@ -169,6 +170,7 @@ void free_task(struct task_struct *tsk)
>> ? ? ? free_thread_info(tsk->stack);
>> ? ? ? rt_mutex_debug_task_free(tsk);
>> ? ? ? ftrace_graph_exit_task(tsk);
>> + ? ? put_seccomp_filter(tsk->seccomp.filter);
>> ? ? ? free_task_struct(tsk);
>> ?}
>> ?EXPORT_SYMBOL(free_task);
>> @@ -1113,6 +1115,7 @@ static struct task_struct *copy_process(unsigned long clone_flags,
>> ? ? ? ? ? ? ? goto fork_out;
>>
>> ? ? ? ftrace_graph_init_task(p);
>> + ? ? copy_seccomp(&p->seccomp, ¤t->seccomp);
>>
>> ? ? ? rt_mutex_init_task(p);
>>
>> diff --git a/kernel/seccomp.c b/kernel/seccomp.c
>> index e8d76c5..14d1869 100644
>> --- a/kernel/seccomp.c
>> +++ b/kernel/seccomp.c
>> @@ -3,16 +3,297 @@
>> ? *
>> ? * Copyright 2004-2005 ?Andrea Arcangeli <[email protected]>
>> ? *
>> - * This defines a simple but solid secure-computing mode.
>> + * Copyright (C) 2012 Google, Inc.
>> + * Will Drewry <[email protected]>
>> + *
>> + * This defines a simple but solid secure-computing facility.
>> + *
>> + * Mode 1 uses a fixed list of allowed system calls.
>> + * Mode 2 allows user-defined system call filters in the form
>> + * ? ? ? ?of Berkeley Packet Filters/Linux Socket Filters.
>> ? */
>>
>> ?#include <linux/audit.h>
>> +#include <linux/filter.h>
>> ?#include <linux/seccomp.h>
>> ?#include <linux/sched.h>
>> ?#include <linux/compat.h>
>>
>> +#include <linux/atomic.h>
>> +#include <linux/security.h>
>> +
>> +#include <linux/slab.h>
>> +#include <linux/uaccess.h>
>> +#include <linux/user.h>
>
> Are those still needed since you got rid of that manual user-copying stuff?
Nice catch - probably not. I'll remove what I can.
>> +
>> +#include <linux/tracehook.h>
>> +#include <asm/syscall.h>
>> +
>> ?/* #define SECCOMP_DEBUG 1 */
>> -#define NR_SECCOMP_MODES 1
>> +
>> +#ifdef CONFIG_SECCOMP_FILTER
>> +/**
>> + * struct seccomp_filter - container for seccomp BPF programs
>> + *
>> + * @usage: reference count to manage the object lifetime.
>> + * ? ? ? ? get/put helpers should be used when accessing an instance
>> + * ? ? ? ? outside of a lifetime-guarded section. ?In general, this
>> + * ? ? ? ? is only needed for handling filters shared across tasks.
>> + * @prev: points to a previously installed, or inherited, filter
>> + * @compat: indicates the value of is_compat_task() at creation time
>> + * @insns: the BPF program instructions to evaluate
>> + * @count: the number of instructions in the program
>> + *
>> + * seccomp_filter objects are organized in a tree linked via the @prev
>> + * pointer. ?For any task, it appears to be a singly-linked list starting
>> + * with current->seccomp.filter, the most recently attached or inherited filter.
>> + * However, multiple filters may share a @prev node, by way of fork(), which
>> + * results in a unidirectional tree existing in memory. ?This is similar to
>> + * how namespaces work.
>> + *
>> + * seccomp_filter objects should never be modified after being attached
>> + * to a task_struct (other than @usage).
>> + */
>> +struct seccomp_filter {
>> + ? ? atomic_t usage;
>> + ? ? struct seccomp_filter *prev;
>> + ? ? bool compat;
>> + ? ? unsigned short count; ?/* Instruction count */
>> + ? ? struct sock_filter insns[];
>> +};
>> +
>> +static void seccomp_filter_log_failure(int syscall)
>> +{
>> + ? ? int compat = 0;
>> +#ifdef CONFIG_COMPAT
>> + ? ? compat = is_compat_task();
>> +#endif
>> + ? ? pr_info("%s[%d]: %ssystem call %d blocked at 0x%lx\n",
>> + ? ? ? ? ? ? current->comm, task_pid_nr(current),
>> + ? ? ? ? ? ? (compat ? "compat " : ""),
>> + ? ? ? ? ? ? syscall, KSTK_EIP(current));
>> +}
>> +
>> +static inline u32 get_high_bits(unsigned long value)
>> +{
>> + ? ? int bits = 32;
>> + ? ? return value >> bits;
>> +}
>> +
>> +static inline u32 bpf_length(const void *data)
>> +{
>> + ? ? return sizeof(struct seccomp_data);
>> +}
>
> This doesn't change, so why not pass in the length directly instead of
> getting it via a function?
True - since it's passed in at bpf_run_filter time, there's no reason
to make it a function!
> And stop adding inline to functions that are
> used for function pointers, it's misleading.
Only a little misleading :)
>> +
>> +/**
>> + * bpf_pointer: checks and returns a pointer to the requested offset
>> + * @nr: int syscall passed as a void * to bpf_run_filter
>> + * @off: index to load a from in @data
>
Oops - I'll fix that.
>
>> + * @size: load width requested
>> + * @buffer: temporary storage supplied by bpf_run_filter
>> + *
>> + * Returns a pointer to @buffer where the value was stored.
>> + * On failure, returns NULL.
>> + */
>> +static void *bpf_pointer(const void *nr, int off, unsigned int size, void *buf)
>> +{
>> + ? ? unsigned long value;
>> + ? ? u32 *A = (u32 *)buf;
>
> No need to cast a void pointer. That's the whole point of void pointers.
True.
>> +
>> + ? ? if (size != sizeof(u32))
>> + ? ? ? ? ? ? return NULL;
>> +
>> +#define BPF_DATA(_name) offsetof(struct seccomp_data, _name)
>
> I'd move this outside of the function and don't bother with the undef.
> Undeffing is important in header files. But here, if it's needed, it's
> just plain confusing.
Will do.
>> + ? ? /* Index by entry instead of by byte. */
>> + ? ? if (off == BPF_DATA(nr)) {
>> + ? ? ? ? ? ? *A = (u32)(uintptr_t)nr;
>
> Why the double cast? Once should be enough. Or is it a special Sparse thing?
It's a gcc thing when you cast an int to a pointer of a different
size. I just chose uintptr_t as my intermediary type.
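A minimal standalone illustration of the warning and the idiom (this
assumes a build where pointers are wider than int, e.g. x86-64):

#include <stdint.h>

void load_nr(int nr, uint32_t *A)
{
	const void *p;

	/*
	 * p = (const void *)nr; would warn:
	 *   "cast to pointer from integer of different size"
	 */
	p = (const void *)(uintptr_t)nr;	/* widen to uintptr_t first */

	/* ...and the consumer narrows it back the same way: */
	*A = (uint32_t)(uintptr_t)p;
}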
>> + ? ? } else if (off == BPF_DATA(instruction_pointer.lo)) {
>> + ? ? ? ? ? ? *A = KSTK_EIP(current);
>> + ? ? } else if (off == BPF_DATA(instruction_pointer.hi)) {
>> + ? ? ? ? ? ? *A = get_high_bits(KSTK_EIP(current));
>> + ? ? } else if (off >= BPF_DATA(lo32[0]) && off <= BPF_DATA(lo32[5])) {
>> + ? ? ? ? ? ? struct pt_regs *regs = task_pt_regs(current);
>> + ? ? ? ? ? ? int arg = (off - BPF_DATA(lo32[0])) >> 2;
>> + ? ? ? ? ? ? syscall_get_arguments(current, regs, arg, 1, &value);
>> + ? ? ? ? ? ? *A = value;
>> + ? ? } else if (off >= BPF_DATA(hi32[0]) && off <= BPF_DATA(hi32[5])) {
>> + ? ? ? ? ? ? struct pt_regs *regs = task_pt_regs(current);
>> + ? ? ? ? ? ? int arg = (off - BPF_DATA(hi32[0])) >> 2;
>> + ? ? ? ? ? ? syscall_get_arguments(current, regs, arg, 1, &value);
>> + ? ? ? ? ? ? *A = get_high_bits(value);
>> + ? ? } else {
>> + ? ? ? ? ? ? return NULL;
>> + ? ? }
>> +#undef BPF_DATA
>> + ? ? return buf;
>> +}
>> +
>> +/**
>> + * seccomp_run_filters - run 'current' against the given syscall
>> + * @syscall: number of the current system call
>
> Strange comments.
Weird. I'll fix it.
>> + *
>> + * Returns valid seccomp BPF response codes.
>> + */
>> +static u32 seccomp_run_filters(int syscall)
>> +{
>> + ? ? struct seccomp_filter *f;
>> + ? ? const struct bpf_load_fns loaders = { bpf_pointer, bpf_length };
>
> I don't see the point of this.
>
> The return values for seccomp filters are different than the networking
> ones, so there is never a need to get bpf_length from the filter code
> as it's known at compile time. So just declare BPF_S_LD_W_LEN and
> S_LDX_W_LEN networking-only instructions and don't bother with all this.
Since I was proposing bpf_* as generic bpf interfaces, it seemed weird
to make data length socket specific. I can see cases where length may
matter for other (non-existent) users of BPF or if the length of the
seccomp_data changes in the future. Of course, the calling convention
index may be the way to handle that without the filter program
checking the length. I'm fine dropping it, but for this one, I'll do
whatever makes sense to the networking people.
>> + ? ? u32 ret = SECCOMP_RET_KILL;
>> + ? ? const void *sc_ptr = (const void *)(uintptr_t)syscall;
>> +
>> + ? ? /* It's not possible for the filter to be NULL here. */
>> +#ifdef CONFIG_COMPAT
>> + ? ? if (current->seccomp.filter->compat != !!(is_compat_task()))
>> + ? ? ? ? ? ? return ret;
>> +#endif
>> +
>> + ? ? /*
>> + ? ? ?* All filters are evaluated in order of youngest to oldest. The lowest
>> + ? ? ?* BPF return value always takes priority.
>> + ? ? ?*/
>> + ? ? for (f = current->seccomp.filter; f; f = f->prev) {
>> + ? ? ? ? ? ? ret = bpf_run_filter(sc_ptr, f->insns, &loaders);
>> + ? ? ? ? ? ? if (ret != SECCOMP_RET_ALLOW)
>> + ? ? ? ? ? ? ? ? ? ? break;
>> + ? ? }
>> + ? ? return ret;
>> +}
>> +
>> +/**
>> + * seccomp_attach_filter: Attaches a seccomp filter to current.
>> + * @fprog: BPF program to install
>> + *
>> + * Returns 0 on success or an errno on failure.
>> + */
>> +static long seccomp_attach_filter(struct sock_fprog *fprog)
>> +{
>> + ? ? struct seccomp_filter *filter = NULL;
>
> Don't initialize it to NULL, next time 'filter' is used it's set
> by kzalloc's return value.
>
>> + ? ? unsigned long fp_size = fprog->len * sizeof(struct sock_filter);
>> + ? ? long ret = -EINVAL;
>> +
>> + ? ? if (fprog->len == 0 || fprog->len > BPF_MAXINSNS)
>> + ? ? ? ? ? ? goto out;
>
> Oh wait, you need the NULL because you can call put_filter() via out.
> Well, just return EINVAL directly instead here I'd say.
>
>> +
>> + ? ? /* Allocate a new seccomp_filter */
>> + ? ? ret = -ENOMEM;
>> + ? ? filter = kzalloc(sizeof(struct seccomp_filter) + fp_size, GFP_KERNEL);
>> + ? ? if (!filter)
>> + ? ? ? ? ? ? goto out;
>
> Same here, just return ENOMEM.
Done.
>> + ? ? atomic_set(&filter->usage, 1);
>> + ? ? filter->count = fprog->len;
>
> Why is it called count in one place and len in the other? Isn't it clearer
> when always using len?
Sure - changed!
>> +
>> + ? ? /* Copy the instructions from fprog. */
>> + ? ? ret = -EFAULT;
>> + ? ? if (copy_from_user(filter->insns, fprog->filter, fp_size))
>> + ? ? ? ? ? ? goto out;
>> +
>> + ? ? /* Check the fprog */
>> + ? ? ret = bpf_chk_filter(filter->insns, filter->count, BPF_CHK_FLAGS_NO_SKB);
>> + ? ? if (ret)
>> + ? ? ? ? ? ? goto out;
>> +
>> + ? ? /*
>> + ? ? ?* Installing a seccomp filter requires that the task
>> + ? ? ?* have CAP_SYS_ADMIN in its namespace or be running with
>> + ? ? ?* no_new_privs. ?This avoids scenarios where unprivileged
>> + ? ? ?* tasks can affect the behavior of privileged children.
>> + ? ? ?*/
>> + ? ? ret = -EACCES;
>> + ? ? if (!current->no_new_privs &&
>> + ? ? ? ? security_capable_noaudit(current_cred(), current_user_ns(),
>> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?CAP_SYS_ADMIN) != 0)
>> + ? ? ? ? ? ? goto out;
>> +
>> + ? ? /* Lock the filter to the current calling convention. */
>> +#ifdef CONFIG_COMPAT
>> + ? ? filter->compat = !!(is_compat_task());
>> +#endif
>> +
>> + ? ? /*
>> + ? ? ?* If there is an existing filter, make it the prev
>> + ? ? ?* and don't drop its task reference.
>> + ? ? ?*/
>> + ? ? filter->prev = current->seccomp.filter;
>> + ? ? current->seccomp.filter = filter;
>> + ? ? return 0;
>> +out:
>> + ? ? put_seccomp_filter(filter); ?/* for get or task, on err */
>> + ? ? return ret;
>> +}
>> +
>> +/**
>> + * seccomp_attach_user_filter - attaches a user-supplied sock_fprog
>> + * @user_filter: pointer to the user data containing a sock_fprog.
>> + *
>> + * This function may be called repeatedly to install additional filters.
>> + * Every filter successfully installed will be evaluated (in reverse order)
>> + * for each system call the task makes.
>> + *
>> + * Returns 0 on success and non-zero otherwise.
>> + */
>> +long seccomp_attach_user_filter(char __user *user_filter)
>> +{
>> + ? ? struct sock_fprog fprog;
>> + ? ? long ret = -EFAULT;
>> +
>> + ? ? if (!user_filter)
>> + ? ? ? ? ? ? goto out;
>> +#ifdef CONFIG_COMPAT
>> + ? ? if (is_compat_task()) {
>> + ? ? ? ? ? ? /* XXX: Share with net/compat.c */
>
> You can't share this with net/compat.c because they have to pass a __user
> pointer to a generic sock_setsockopt(). You could refactor their code to
> push the compat check later, but I think they prefer to keep all the compat
> stuff in one place.
struct compat_sock_fprog is identical to what I have below - I just
can't share the rest of the code.
>> + ? ? ? ? ? ? struct {
>> + ? ? ? ? ? ? ? ? ? ? u16 len;
>> + ? ? ? ? ? ? ? ? ? ? compat_uptr_t filter; ? /* struct sock_filter */
>> + ? ? ? ? ? ? } fprog32;
>> + ? ? ? ? ? ? if (copy_from_user(&fprog32, user_filter, sizeof(fprog32)))
>> + ? ? ? ? ? ? ? ? ? ? goto out;
>> + ? ? ? ? ? ? fprog.len = fprog32.len;
>> + ? ? ? ? ? ? fprog.filter = compat_ptr(fprog32.filter);
>> + ? ? } else
>> +#endif
>> + ? ? if (copy_from_user(&fprog, user_filter, sizeof(fprog)))
>> + ? ? ? ? ? ? goto out;
>
> Probably a good idea to intend the else if one more time to make it more
> obvious. Or add a comment after the else.
Added a comment after the else.
>> + ? ? ret = seccomp_attach_filter(&fprog);
>> +out:
>> + ? ? return ret;
>> +}
>> +
>> +/* get_seccomp_filter - increments the reference count of @orig. */
>> +static struct seccomp_filter *get_seccomp_filter(struct seccomp_filter *orig)
>> +{
>> + ? ? if (!orig)
>> + ? ? ? ? ? ? return NULL;
>> + ? ? /* Reference count is bounded by the number of total processes. */
>> + ? ? atomic_inc(&orig->usage);
>> + ? ? return orig;
>> +}
>> +
>> +/* put_seccomp_filter - decrements the ref count of @orig and may free. */
>> +void put_seccomp_filter(struct seccomp_filter *orig)
>> +{
>> + ? ? /* Clean up single-reference branches iteratively. */
>> + ? ? while (orig && atomic_dec_and_test(&orig->usage)) {
>> + ? ? ? ? ? ? struct seccomp_filter *freeme = orig;
>> + ? ? ? ? ? ? orig = orig->prev;
>> + ? ? ? ? ? ? kfree(freeme);
>> + ? ? }
>> +}
>> +
>> +/**
>> + * copy_seccomp: manages inheritance on fork
>> + * @child: forkee's seccomp
>> + * @prev: forker's seccomp
>> + *
>> + * Ensures that @child inherits seccomp mode and state if
>> + * seccomp filtering is in use.
>> + */
>> +void copy_seccomp(struct seccomp *child,
>> + ? ? ? ? ? ? ? const struct seccomp *prev)
>> +{
>> + ? ? child->mode = prev->mode;
>> + ? ? child->filter = get_seccomp_filter(prev->filter);
>> +}
>> +#endif ? ? ? /* CONFIG_SECCOMP_FILTER */
>>
>> ?/*
>> ? * Secure computing mode 1 allows only read/write/exit/sigreturn.
>> @@ -34,10 +315,10 @@ static int mode1_syscalls_32[] = {
>> ?void __secure_computing(int this_syscall)
>> ?{
>> ? ? ? int mode = current->seccomp.mode;
>> - ? ? int * syscall;
>> + ? ? int *syscall;
>>
>> ? ? ? switch (mode) {
>> - ? ? case 1:
>> + ? ? case SECCOMP_MODE_STRICT:
>> ? ? ? ? ? ? ? syscall = mode1_syscalls;
>> ?#ifdef CONFIG_COMPAT
>> ? ? ? ? ? ? ? if (is_compat_task())
>> @@ -48,6 +329,13 @@ void __secure_computing(int this_syscall)
>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? return;
>> ? ? ? ? ? ? ? } while (*++syscall);
>> ? ? ? ? ? ? ? break;
>> +#ifdef CONFIG_SECCOMP_FILTER
>> + ? ? case SECCOMP_MODE_FILTER:
>> + ? ? ? ? ? ? if (seccomp_run_filters(this_syscall) == SECCOMP_RET_ALLOW)
>> + ? ? ? ? ? ? ? ? ? ? return;
>> + ? ? ? ? ? ? seccomp_filter_log_failure(this_syscall);
>> + ? ? ? ? ? ? break;
>> +#endif
>> ? ? ? default:
>> ? ? ? ? ? ? ? BUG();
>> ? ? ? }
>> @@ -64,25 +352,34 @@ long prctl_get_seccomp(void)
>> ? ? ? return current->seccomp.mode;
>> ?}
>>
>> -long prctl_set_seccomp(unsigned long seccomp_mode)
>> +long prctl_set_seccomp(unsigned long seccomp_mode, char __user *filter)
>> ?{
>> - ? ? long ret;
>> + ? ? long ret = -EINVAL;
>>
>> - ? ? /* can set it only once to be even more secure */
>> - ? ? ret = -EPERM;
>> - ? ? if (unlikely(current->seccomp.mode))
>> + ? ? if (current->seccomp.mode &&
>> + ? ? ? ? current->seccomp.mode != seccomp_mode)
>> ? ? ? ? ? ? ? goto out;
>>
>> - ? ? ret = -EINVAL;
>> - ? ? if (seccomp_mode && seccomp_mode <= NR_SECCOMP_MODES) {
>> - ? ? ? ? ? ? current->seccomp.mode = seccomp_mode;
>> - ? ? ? ? ? ? set_thread_flag(TIF_SECCOMP);
>> + ? ? switch (seccomp_mode) {
>> + ? ? case SECCOMP_MODE_STRICT:
>> + ? ? ? ? ? ? ret = 0;
>> ?#ifdef TIF_NOTSC
>> ? ? ? ? ? ? ? disable_TSC();
>> ?#endif
>> - ? ? ? ? ? ? ret = 0;
>> + ? ? ? ? ? ? break;
>> +#ifdef CONFIG_SECCOMP_FILTER
>> + ? ? case SECCOMP_MODE_FILTER:
>> + ? ? ? ? ? ? ret = seccomp_attach_user_filter(filter);
>> + ? ? ? ? ? ? if (ret)
>> + ? ? ? ? ? ? ? ? ? ? goto out;
>> + ? ? ? ? ? ? break;
>> +#endif
>> + ? ? default:
>> + ? ? ? ? ? ? goto out;
>> ? ? ? }
>>
>> - out:
>> + ? ? current->seccomp.mode = seccomp_mode;
>> + ? ? set_thread_flag(TIF_SECCOMP);
>> +out:
>> ? ? ? return ret;
>> ?}
>> diff --git a/kernel/sys.c b/kernel/sys.c
>> index 4070153..905031e 100644
>> --- a/kernel/sys.c
>> +++ b/kernel/sys.c
>> @@ -1899,7 +1899,7 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2,
> unsigned long, arg3,
>> ? ? ? ? ? ? ? ? ? ? ? error = prctl_get_seccomp();
>> ? ? ? ? ? ? ? ? ? ? ? break;
>> ? ? ? ? ? ? ? case PR_SET_SECCOMP:
>> - ? ? ? ? ? ? ? ? ? ? error = prctl_set_seccomp(arg2);
>> + ? ? ? ? ? ? ? ? ? ? error = prctl_set_seccomp(arg2, (char __user *)arg3);
>> ? ? ? ? ? ? ? ? ? ? ? break;
>> ? ? ? ? ? ? ? case PR_GET_TSC:
>> ? ? ? ? ? ? ? ? ? ? ? error = GET_TSC_CTL(arg2);
>> --
>> 1.7.5.4
>>
>>
>
> Greetings,
Thanks!
will
On Thu, Feb 16, 2012 at 6:50 PM, Eric Paris <[email protected]> wrote:
> On Thu, 2012-02-16 at 17:00 -0600, Will Drewry wrote:
>> On Thu, Feb 16, 2012 at 4:06 PM, H. Peter Anvin <[email protected]> wrote:
>> > On 02/16/2012 01:51 PM, Will Drewry wrote:
>
>> Then syscall_namespace(current, regs) returns
>> * 0 - SYSCALL_NS_32 (for existing 32 and config_compat)
>> * 1 - SYSCALL_NS_64 (for existing 64 bit)
>> * 2 - SYSCALL_NS_X32 (everything after 2 is arch specific)
>> * ..
>>
>> This patch series is pegged to x86 right now, so it's not a big deal
>> to add a simple syscall_namespace to asm/syscall.h. ?Of course, the
>> code is always the easy part. ?Even easier would be to only assign 0
>> and 1 in the seccomp_data for 32-bit or 64-bit, then leave the rest of
>> the u32 untouched until x32 stabilizes and the TS_COMPAT interactions
>> are sorted.
>
> I don't know if anyone cares, but include/linux/audit.h tries to expose
> this type of information so audit userspace can later piece things back
> together. ?(we get this info from the syscall entry exit code so we know
> which arch it is).
>
> Not sure how x32 is hoping to expose its syscall info, but others are
> going to have the same/similar problem.
An earlier change Roland had prodded me toward was adding a
syscall_get_arch() call to asm/syscall.h which returned the
appropriate audit arch value for the current calling convention. I
hate to suggest this, but should I go ahead and wire that up for x86
now, make it a dependency for HAVE_ARCH_SECCOMP_FILTER (and officially
part of asm/syscall.h) then let it trickle into existence? Maybe
something like:
static inline int syscall_get_arch(struct task_struct *task,
				   struct pt_regs *regs)
{
#ifdef CONFIG_IA32_EMULATION
	if (task_thread_info(task)->status & TS_COMPAT)
		return AUDIT_ARCH_I386;
#endif
#ifdef CONFIG_64BIT
	return AUDIT_ARCH_X86_64;
#else
	return AUDIT_ARCH_I386;
#endif
}
There would be no other callers, though, because everywhere AUDIT_ARCH
is used it is hardcoded as appropriate. Then when x32 comes along, it
can figure out where it belongs using tif status and/or regs.
I'm not sure what the appropriate way to add things to asm/syscall.h
is, but I can certainly do a first cut in the x86 version.
thanks!
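For what it's worth, once an audit-arch style value is visible to
filters (however it ends up being exposed), the natural pattern is to
pin a filter to one calling convention before looking at anything else.
A hedged sketch, assuming a hypothetical arch field at offset 4 of the
filter data; that offset and field are not part of this series:

#include <linux/filter.h>	/* BPF_STMT, BPF_JUMP, struct sock_filter */
#include <linux/audit.h>	/* AUDIT_ARCH_X86_64 */
#include <linux/seccomp.h>	/* SECCOMP_RET_* (as proposed in this series) */

static struct sock_filter pin_arch[] = {
	/* Load the hypothetical arch field of the seccomp data. */
	BPF_STMT(BPF_LD  | BPF_W   | BPF_ABS, 4),
	/* If it matches the convention we built for, skip the kill. */
	BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, AUDIT_ARCH_X86_64, 1, 0),
	BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL),
	/* Stand-in for the rest of the filter. */
	BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
};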
On 02/16/2012 07:27 PM, Indan Zupancic wrote:
>
> A JIT won't be able to merge accesses because it also has to merge other
> instructions and recognize when 64-bit operations are done with 32-bit
> instructions. I think that will be too hard for a JIT.
>
Please Google "peephole optimizer".
-hpa
--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel. I don't speak on their behalf.
On 02/16/2012 07:53 PM, Will Drewry wrote:
>
> An earlier change Roland had prodded me toward was adding a
> syscall_get_arch() call to asm/syscall.h which returned the
> appropriate audit arch value for the current calling convention. I
> hate to suggest this, but should I go ahead and wire that up for x86
> now, make it a dependency for HAVE_ARCH_SECCOMP_FILTER (and officially
> part of asm/syscall.h) then let it trickle into existence? Maybe
> something like:
>
... and we have been talking about making a regset and export it to
ptrace and core dumps, too.
> static inline int syscall_get_arch(struct task_struct *task, struct
> pt_regs *regs)
> {
> #ifdef CONFIG_IA32_EMULATION
> if (task_thread_info(task)->status & TS_COMPAT)
> return AUDIT_ARCH_I386;
> #endif
> #ifdef CONFIG_64BIT
> return AUDIT_ARCH_X86_64;
> #else
> return AUDIT_ARCH_I386;
> #endif
> }
>
In this case it could be is_compat_task().
> There would be no other callers, though, because everywhere AUDIT_ARCH
> is used it is hardcoded as appropriate. Then when x32 comes along, it
> can figure out where it belongs using tif status and/or regs.
For x32 you have the option of introducing a new value or relying on bit
30 in eax (and AUDIT_ARCH_X86_64). The latter is more natural, probably.
-hpa
--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel. I don't speak on their behalf.
On Thu, Feb 16, 2012 at 9:04 PM, Indan Zupancic <[email protected]> wrote:
> Hello,
>
> On Thu, February 16, 2012 21:02, Will Drewry wrote:
>> This change allows CONFIG_SECCOMP to make use of BPF programs for
>> user-controlled system call filtering (as shown in this patch series).
>>
>> To minimize the impact on existing BPF evaluation, function pointer
>> use must be declared at sk_chk_filter-time. ?This allows ancillary
>> load instructions to be generated that use the function pointer rather
>> than adding _any_ code to the existing LD_* instruction paths.
>>
>> Crude performance numbers using udpflood -l 10000000 against dummy0.
>> 3 trials for baseline, 3 for with tcpdump. Averaged then differenced.
>> Hard to believe trials were repeated at least a couple more times.
>>
>> * x86 32-bit (Atom N570 @ 1.66 GHz 2 core HT) [stackprot]:
>> - Without: ?94.05s - 76.36s = 17.68s
>> - With: ? ? 86.22s - 73.30s = 12.92s
>> - Slowdown per call: -476 nanoseconds
>>
>> * x86 32-bit (Atom N570 @ 1.66 GHz 2 core HT) [no stackprot]:
>> - Without: ?92.06s - 77.81s = 14.25s
>> - With: ? ? 91.77s - 76.91s = 14.86s
>> - Slowdown per call: +61 nanoseconds
>>
>> * x86 64-bit (Atom N570 @ 1.66 GHz 2 core HT) [stackprot]:
>> - Without: 122.58s - 99.54s = 23.04s
>> - With: ? ?115.52s - 98.99s = 16.53s
>> - Slowdown per call: ?-651 nanoseconds
>>
>> * x86 64-bit (Atom N570 @ 1.66 GHz 2 core HT) [no stackprot]:
>> - Without: 114.95s - 91.92s = 23.03s
>> - With: ? ?110.47s - 90.79s = 19.68s
>> - Slowdown per call: -335 nanoseconds
>>
>> This makes the x86-32-nossp make sense. ?Added register pressure always
>> makes x86-32 sad.
>
> Your 32-bit numbers are better than your 64-bit numbers, so I don't get
> this comment.
They are, in absolute terms. Relatively, all performance improved with
my patch except for x86-32-nossp.
>> If this is a concern, I could change the call
>> approach to bpf_run_filter to see if I can alleviate it a bit.
>>
>> That said, the x86-*-ssp numbers show a marked increase in performance.
>> I've tested and retested and I keep getting these results. I'm also
>> suprised by the nossp speed up on 64-bit, but I dunno. I haven't looked
>> at the full disassembly of the call path. If that is required for the
>> performance differences I'm seeing, please let me know. Or if I there is
>> a preferred cpu to run this against - atoms can be a little weird.
>
> Yeah, testing on Atom is a bit silly.
Making things run well on Atom is important for my daily work. And it
usually means (barring Atom-specific weirdness) that it then runs even
better on bigger processors :)
>> v8: - fixed variable positioning and bad cast ([email protected])
>> ? ? - no longer passes A as a pointer (inspection of x86 asm shows A is
>> ? ? ? %ebx again; thanks [email protected])
>> ? ? - cleaned up switch macros and expanded use
>> ? ? ? ([email protected], [email protected])
>> ? ? - added length fn pointer and handled LD_W_LEN/LDX_W_LEN
>> ? ? - moved from a wrapping struct to a typedef for the function
>> ? ? ? pointer. (matches existing function pointer style)
>> ? ? - added comprehensive comment above the typedef.
>> ? ? - benchmarks
>> v7: - first cut
>>
>> Signed-off-by: Will Drewry <[email protected]>
>> ---
>> ?include/linux/filter.h | ? 69 +++++++++++++++++++++-
>> ?net/core/filter.c ? ? ?| ?152 +++++++++++++++++++++++++++++++++++++----------
>> ?2 files changed, 185 insertions(+), 36 deletions(-)
>>
>> diff --git a/include/linux/filter.h b/include/linux/filter.h
>> index 8eeb205..d22ad46 100644
>> --- a/include/linux/filter.h
>> +++ b/include/linux/filter.h
>> @@ -110,6 +110,9 @@ struct sock_fprog { ? ? ? /* Required for SO_ATTACH_FILTER. */
>> ? */
>> ?#define BPF_MEMWORDS 16
>>
>> +/* BPF program (checking) flags */
>> +#define BPF_CHK_FLAGS_NO_SKB 1
>> +
>> ?/* RATIONALE. Negative offsets are invalid in BPF.
>> ? ? We use them to reference ancillary data.
>> ? ? Unlike introduction new instructions, it does not break
>> @@ -145,17 +148,67 @@ struct sk_filter
>> ? ? ? struct sock_filter ? ? ?insns[0];
>> ?};
>>
>> +/**
>> + * struct bpf_load_fns - callbacks for bpf_run_filter
>> + * These functions are called by bpf_run_filter if bpf_chk_filter
>> + * was invoked with BPF_CHK_FLAGS_NO_SKB.
>> + *
>> + * pointer:
>> + * @data: const pointer to the data passed into bpf_run_filter
>> + * @k: offset into @skb's data
>> + * @size: the size of the requested data in bytes: 1, 2, or 4.
>> + * @buffer: If non-NULL, a 32-bit buffer for staging data.
>> + *
>> + * Returns a pointer to the requested data.
>> + *
>> + * This function operates similarly to load_pointer in net/core/filter.c
>> + * except that the pointer to the returned data must already be
>> + * byteswapped as appropriate to the source data and endianness.
>> + * @buffer may be used if the data needs to be staged.
>> + *
>> + * length:
>> + * @data: const pointer to the data passed into bpf_fun_filter
>> + *
>> + * Returns the length of the data.
>> + */
>> +struct bpf_load_fns {
>> + ? ? void *(*pointer)(const void *data, int k, unsigned int size,
>> + ? ? ? ? ? ? ? ? ? ? ?void *buffer);
>> + ? ? u32 (*length)(const void *data);
>> +};
>
> Like I said in the other email, length is useless for the non-skb case.
> If you really want to add it, just make it a constant. And 'pointer' isn't
> the best name.
>
>> +
>> ?static inline unsigned int sk_filter_len(const struct sk_filter *fp)
>> ?{
>> ? ? ? return fp->len * sizeof(struct sock_filter) + sizeof(*fp);
>> ?}
>>
>> +extern unsigned int bpf_run_filter(const void *data,
>> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?const struct sock_filter *filter,
>> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?const struct bpf_load_fns *load_fn);
>> +
>> +/**
>> + * ? sk_run_filter - run a filter on a socket
>> + * ? @skb: buffer to run the filter on
>> + * ? @fentry: filter to apply
>> + *
>> + * Runs bpf_run_filter with the struct sk_buff-specific data
>> + * accessor behavior.
>> + */
>> +static inline unsigned int sk_run_filter(const struct sk_buff *skb,
>> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?const struct sock_filter *filter)
>> +{
>> + ? ? return bpf_run_filter(skb, filter, NULL);
>> +}
>> +
>> ?extern int sk_filter(struct sock *sk, struct sk_buff *skb);
>> -extern unsigned int sk_run_filter(const struct sk_buff *skb,
>> - ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? const struct sock_filter *filter);
>> ?extern int sk_attach_filter(struct sock_fprog *fprog, struct sock *sk);
>> ?extern int sk_detach_filter(struct sock *sk);
>> -extern int sk_chk_filter(struct sock_filter *filter, unsigned int flen);
>> +extern int bpf_chk_filter(struct sock_filter *filter, unsigned int flen, u32 flags);
>> +
>> +static inline int sk_chk_filter(struct sock_filter *filter, unsigned int flen)
>> +{
>> + ? ? return bpf_chk_filter(filter, flen, 0);
>> +}
>>
>> ?#ifdef CONFIG_BPF_JIT
>> ?extern void bpf_jit_compile(struct sk_filter *fp);
>> @@ -228,6 +281,16 @@ enum {
>> ? ? ? BPF_S_ANC_HATYPE,
>> ? ? ? BPF_S_ANC_RXHASH,
>> ? ? ? BPF_S_ANC_CPU,
>> + ? ? /* Used to differentiate SKB data and generic data */
>> + ? ? BPF_S_ANC_LD_W_ABS,
>> + ? ? BPF_S_ANC_LD_H_ABS,
>> + ? ? BPF_S_ANC_LD_B_ABS,
>> + ? ? BPF_S_ANC_LD_W_LEN,
>> + ? ? BPF_S_ANC_LD_W_IND,
>> + ? ? BPF_S_ANC_LD_H_IND,
>> + ? ? BPF_S_ANC_LD_B_IND,
>> + ? ? BPF_S_ANC_LDX_W_LEN,
>> + ? ? BPF_S_ANC_LDX_B_MSH,
>> ?};
>>
>> ?#endif /* __KERNEL__ */
>> diff --git a/net/core/filter.c b/net/core/filter.c
>> index 5dea452..a5c98a9 100644
>> --- a/net/core/filter.c
>> +++ b/net/core/filter.c
>> @@ -98,9 +98,10 @@ int sk_filter(struct sock *sk, struct sk_buff *skb)
>> ?EXPORT_SYMBOL(sk_filter);
>>
>> ?/**
>> - * ? sk_run_filter - run a filter on a socket
>> - * ? @skb: buffer to run the filter on
>> + * ? bpf_run_filter - run a filter on a BPF program
>
> The filter is the BPF program, so this comment is weird.
True, I'll rephrase.
>> + * ? @data: buffer to run the filter on
>> ? * ? @fentry: filter to apply
>> + * ? @load_fns: custom data accessor functions
>> ? *
>> ? * Decode and apply filter instructions to the skb->data.
>> ? * Return length to keep, 0 for none. @skb is the data we are
>> @@ -108,9 +109,13 @@ EXPORT_SYMBOL(sk_filter);
>> ? * Because all jumps are guaranteed to be before last instruction,
>> ? * and last instruction guaranteed to be a RET, we dont need to check
>> ? * flen. (We used to pass to this function the length of filter)
>> + *
>> + * load_fn is only used if SKF_FLAGS_USE_LOAD_FNS was specified
>> + * to sk_chk_generic_filter.
>
> Stale comment.
Fixed!
>> ? */
>> -unsigned int sk_run_filter(const struct sk_buff *skb,
>> - ? ? ? ? ? ? ? ? ? ? ? ?const struct sock_filter *fentry)
>> +unsigned int bpf_run_filter(const void *data,
>> + ? ? ? ? ? ? ? ? ? ? ? ? const struct sock_filter *fentry,
>> + ? ? ? ? ? ? ? ? ? ? ? ? const struct bpf_load_fns *load_fns)
>> ?{
>> ? ? ? void *ptr;
>> ? ? ? u32 A = 0; ? ? ? ? ? ? ? ? ? ? ?/* Accumulator */
>> @@ -128,6 +133,7 @@ unsigned int sk_run_filter(const struct sk_buff *skb,
>> ?#else
>> ? ? ? ? ? ? ? const u32 K = fentry->k;
>> ?#endif
>> +#define SKB(_data) ((const struct sk_buff *)(_data))
>
> Urgh!
>
> If you had done:
> ? ? ? ? ? ? ? ?const struct sk_buff *skb = data;
>
> at the top, all those changed wouldn't be needed and it would look better too.
That just means I need to disassemble after to make sure the compiler
does the right thing. I'll do that and change it if gcc is doing the
right thing.
>>
>> ? ? ? ? ? ? ? switch (fentry->code) {
>> ? ? ? ? ? ? ? case BPF_S_ALU_ADD_X:
>> @@ -213,7 +219,7 @@ unsigned int sk_run_filter(const struct sk_buff *skb,
>> ? ? ? ? ? ? ? case BPF_S_LD_W_ABS:
>> ? ? ? ? ? ? ? ? ? ? ? k = K;
>> ?load_w:
>> - ? ? ? ? ? ? ? ? ? ? ptr = load_pointer(skb, k, 4, &tmp);
>> + ? ? ? ? ? ? ? ? ? ? ptr = load_pointer(data, k, 4, &tmp);
>> ? ? ? ? ? ? ? ? ? ? ? if (ptr != NULL) {
>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? A = get_unaligned_be32(ptr);
>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? continue;
>> @@ -222,7 +228,7 @@ load_w:
>> ? ? ? ? ? ? ? case BPF_S_LD_H_ABS:
>> ? ? ? ? ? ? ? ? ? ? ? k = K;
>> ?load_h:
>> - ? ? ? ? ? ? ? ? ? ? ptr = load_pointer(skb, k, 2, &tmp);
>> + ? ? ? ? ? ? ? ? ? ? ptr = load_pointer(data, k, 2, &tmp);
>> ? ? ? ? ? ? ? ? ? ? ? if (ptr != NULL) {
>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? A = get_unaligned_be16(ptr);
>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? continue;
>> @@ -231,17 +237,17 @@ load_h:
>> ? ? ? ? ? ? ? case BPF_S_LD_B_ABS:
>> ? ? ? ? ? ? ? ? ? ? ? k = K;
>> ?load_b:
>> - ? ? ? ? ? ? ? ? ? ? ptr = load_pointer(skb, k, 1, &tmp);
>> + ? ? ? ? ? ? ? ? ? ? ptr = load_pointer(data, k, 1, &tmp);
>> ? ? ? ? ? ? ? ? ? ? ? if (ptr != NULL) {
>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? A = *(u8 *)ptr;
>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? continue;
>> ? ? ? ? ? ? ? ? ? ? ? }
>> ? ? ? ? ? ? ? ? ? ? ? return 0;
>> ? ? ? ? ? ? ? case BPF_S_LD_W_LEN:
>> - ? ? ? ? ? ? ? ? ? ? A = skb->len;
>> + ? ? ? ? ? ? ? ? ? ? A = SKB(data)->len;
>> ? ? ? ? ? ? ? ? ? ? ? continue;
>> ? ? ? ? ? ? ? case BPF_S_LDX_W_LEN:
>> - ? ? ? ? ? ? ? ? ? ? X = skb->len;
>> + ? ? ? ? ? ? ? ? ? ? X = SKB(data)->len;
>> ? ? ? ? ? ? ? ? ? ? ? continue;
>> ? ? ? ? ? ? ? case BPF_S_LD_W_IND:
>> ? ? ? ? ? ? ? ? ? ? ? k = X + K;
>> @@ -253,7 +259,7 @@ load_b:
>> ? ? ? ? ? ? ? ? ? ? ? k = X + K;
>> ? ? ? ? ? ? ? ? ? ? ? goto load_b;
>> ? ? ? ? ? ? ? case BPF_S_LDX_B_MSH:
>> - ? ? ? ? ? ? ? ? ? ? ptr = load_pointer(skb, K, 1, &tmp);
>> + ? ? ? ? ? ? ? ? ? ? ptr = load_pointer(data, K, 1, &tmp);
>> ? ? ? ? ? ? ? ? ? ? ? if (ptr != NULL) {
>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? X = (*(u8 *)ptr & 0xf) << 2;
>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? continue;
>> @@ -288,29 +294,29 @@ load_b:
>> ? ? ? ? ? ? ? ? ? ? ? mem[K] = X;
>> ? ? ? ? ? ? ? ? ? ? ? continue;
>> ? ? ? ? ? ? ? case BPF_S_ANC_PROTOCOL:
>> - ? ? ? ? ? ? ? ? ? ? A = ntohs(skb->protocol);
>> + ? ? ? ? ? ? ? ? ? ? A = ntohs(SKB(data)->protocol);
>> ? ? ? ? ? ? ? ? ? ? ? continue;
>> ? ? ? ? ? ? ? case BPF_S_ANC_PKTTYPE:
>> - ? ? ? ? ? ? ? ? ? ? A = skb->pkt_type;
>> + ? ? ? ? ? ? ? ? ? ? A = SKB(data)->pkt_type;
>> ? ? ? ? ? ? ? ? ? ? ? continue;
>> ? ? ? ? ? ? ? case BPF_S_ANC_IFINDEX:
>> - ? ? ? ? ? ? ? ? ? ? if (!skb->dev)
>> + ? ? ? ? ? ? ? ? ? ? if (!SKB(data)->dev)
>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? return 0;
>> - ? ? ? ? ? ? ? ? ? ? A = skb->dev->ifindex;
>> + ? ? ? ? ? ? ? ? ? ? A = SKB(data)->dev->ifindex;
>> ? ? ? ? ? ? ? ? ? ? ? continue;
>> ? ? ? ? ? ? ? case BPF_S_ANC_MARK:
>> - ? ? ? ? ? ? ? ? ? ? A = skb->mark;
>> + ? ? ? ? ? ? ? ? ? ? A = SKB(data)->mark;
>> ? ? ? ? ? ? ? ? ? ? ? continue;
>> ? ? ? ? ? ? ? case BPF_S_ANC_QUEUE:
>> - ? ? ? ? ? ? ? ? ? ? A = skb->queue_mapping;
>> + ? ? ? ? ? ? ? ? ? ? A = SKB(data)->queue_mapping;
>> ? ? ? ? ? ? ? ? ? ? ? continue;
>> ? ? ? ? ? ? ? case BPF_S_ANC_HATYPE:
>> - ? ? ? ? ? ? ? ? ? ? if (!skb->dev)
>> + ? ? ? ? ? ? ? ? ? ? if (!SKB(data)->dev)
>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? return 0;
>> - ? ? ? ? ? ? ? ? ? ? A = skb->dev->type;
>> + ? ? ? ? ? ? ? ? ? ? A = SKB(data)->dev->type;
>> ? ? ? ? ? ? ? ? ? ? ? continue;
>> ? ? ? ? ? ? ? case BPF_S_ANC_RXHASH:
>> - ? ? ? ? ? ? ? ? ? ? A = skb->rxhash;
>> + ? ? ? ? ? ? ? ? ? ? A = SKB(data)->rxhash;
>> ? ? ? ? ? ? ? ? ? ? ? continue;
>> ? ? ? ? ? ? ? case BPF_S_ANC_CPU:
>> ? ? ? ? ? ? ? ? ? ? ? A = raw_smp_processor_id();
>> @@ -318,15 +324,15 @@ load_b:
>> ? ? ? ? ? ? ? case BPF_S_ANC_NLATTR: {
>> ? ? ? ? ? ? ? ? ? ? ? struct nlattr *nla;
>>
>> - ? ? ? ? ? ? ? ? ? ? if (skb_is_nonlinear(skb))
>> + ? ? ? ? ? ? ? ? ? ? if (skb_is_nonlinear(SKB(data)))
>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? return 0;
>> - ? ? ? ? ? ? ? ? ? ? if (A > skb->len - sizeof(struct nlattr))
>> + ? ? ? ? ? ? ? ? ? ? if (A > SKB(data)->len - sizeof(struct nlattr))
>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? return 0;
>>
>> - ? ? ? ? ? ? ? ? ? ? nla = nla_find((struct nlattr *)&skb->data[A],
>> - ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?skb->len - A, X);
>> + ? ? ? ? ? ? ? ? ? ? nla = nla_find((struct nlattr *)&SKB(data)->data[A],
>> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?SKB(data)->len - A, X);
>> ? ? ? ? ? ? ? ? ? ? ? if (nla)
>> - ? ? ? ? ? ? ? ? ? ? ? ? ? ? A = (void *)nla - (void *)skb->data;
>> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? A = (void *)nla - (void *)SKB(data)->data;
>> ? ? ? ? ? ? ? ? ? ? ? else
>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? A = 0;
>> ? ? ? ? ? ? ? ? ? ? ? continue;
>> @@ -334,22 +340,71 @@ load_b:
>> ? ? ? ? ? ? ? case BPF_S_ANC_NLATTR_NEST: {
>> ? ? ? ? ? ? ? ? ? ? ? struct nlattr *nla;
>>
>> - ? ? ? ? ? ? ? ? ? ? if (skb_is_nonlinear(skb))
>> + ? ? ? ? ? ? ? ? ? ? if (skb_is_nonlinear(SKB(data)))
>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? return 0;
>> - ? ? ? ? ? ? ? ? ? ? if (A > skb->len - sizeof(struct nlattr))
>> + ? ? ? ? ? ? ? ? ? ? if (A > SKB(data)->len - sizeof(struct nlattr))
>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? return 0;
>>
>> - ? ? ? ? ? ? ? ? ? ? nla = (struct nlattr *)&skb->data[A];
>> - ? ? ? ? ? ? ? ? ? ? if (nla->nla_len > A - skb->len)
>> + ? ? ? ? ? ? ? ? ? ? nla = (struct nlattr *)&SKB(data)->data[A];
>> + ? ? ? ? ? ? ? ? ? ? if (nla->nla_len > A - SKB(data)->len)
>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? return 0;
>>
>> ? ? ? ? ? ? ? ? ? ? ? nla = nla_find_nested(nla, X);
>> ? ? ? ? ? ? ? ? ? ? ? if (nla)
>> - ? ? ? ? ? ? ? ? ? ? ? ? ? ? A = (void *)nla - (void *)skb->data;
>> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? A = (void *)nla - (void *)SKB(data)->data;
>> ? ? ? ? ? ? ? ? ? ? ? else
>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? A = 0;
>> ? ? ? ? ? ? ? ? ? ? ? continue;
>> ? ? ? ? ? ? ? }
>
> All changes up to here are unnecessary.
I hope so.
>> + ? ? ? ? ? ? case BPF_S_ANC_LD_W_ABS:
>> + ? ? ? ? ? ? ? ? ? ? k = K;
>> +load_fn_w:
>> + ? ? ? ? ? ? ? ? ? ? ptr = load_fns->pointer(data, k, 4, &tmp);
>> + ? ? ? ? ? ? ? ? ? ? if (ptr) {
>> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? A = *(u32 *)ptr;
>> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? continue;
>> + ? ? ? ? ? ? ? ? ? ? }
>> + ? ? ? ? ? ? ? ? ? ? return 0;
>> + ? ? ? ? ? ? case BPF_S_ANC_LD_H_ABS:
>> + ? ? ? ? ? ? ? ? ? ? k = K;
>> +load_fn_h:
>> + ? ? ? ? ? ? ? ? ? ? ptr = load_fns->pointer(data, k, 2, &tmp);
>> + ? ? ? ? ? ? ? ? ? ? if (ptr) {
>> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? A = *(u16 *)ptr;
>> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? continue;
>> + ? ? ? ? ? ? ? ? ? ? }
>> + ? ? ? ? ? ? ? ? ? ? return 0;
>> + ? ? ? ? ? ? case BPF_S_ANC_LD_B_ABS:
>> + ? ? ? ? ? ? ? ? ? ? k = K;
>> +load_fn_b:
>> + ? ? ? ? ? ? ? ? ? ? ptr = load_fns->pointer(data, k, 1, &tmp);
>> + ? ? ? ? ? ? ? ? ? ? if (ptr) {
>> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? A = *(u8 *)ptr;
>> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? continue;
>> + ? ? ? ? ? ? ? ? ? ? }
>> + ? ? ? ? ? ? ? ? ? ? return 0;
>> + ? ? ? ? ? ? case BPF_S_ANC_LDX_B_MSH:
>> + ? ? ? ? ? ? ? ? ? ? ptr = load_fns->pointer(data, K, 1, &tmp);
>> + ? ? ? ? ? ? ? ? ? ? if (ptr) {
>> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? X = (*(u8 *)ptr & 0xf) << 2;
>> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? continue;
>> + ? ? ? ? ? ? ? ? ? ? }
>> + ? ? ? ? ? ? ? ? ? ? return 0;
>> + ? ? ? ? ? ? case BPF_S_ANC_LD_W_IND:
>> + ? ? ? ? ? ? ? ? ? ? k = X + K;
>> + ? ? ? ? ? ? ? ? ? ? goto load_fn_w;
>> + ? ? ? ? ? ? case BPF_S_ANC_LD_H_IND:
>> + ? ? ? ? ? ? ? ? ? ? k = X + K;
>> + ? ? ? ? ? ? ? ? ? ? goto load_fn_h;
>> + ? ? ? ? ? ? case BPF_S_ANC_LD_B_IND:
>> + ? ? ? ? ? ? ? ? ? ? k = X + K;
>> + ? ? ? ? ? ? ? ? ? ? goto load_fn_b;
>> + ? ? ? ? ? ? case BPF_S_ANC_LD_W_LEN:
>> + ? ? ? ? ? ? ? ? ? ? A = load_fns->length(data);
>> + ? ? ? ? ? ? ? ? ? ? continue;
>> + ? ? ? ? ? ? case BPF_S_ANC_LDX_W_LEN:
>> + ? ? ? ? ? ? ? ? ? ? X = load_fns->length(data);
>
> These two should either return 0, be networking-only, just return 0/-1 or
> use a constant length.
I'm changing it to constant length, but I can get rid of it
altogether. I don't care either way, it just depends on if there is
anyone else who will want this support.
>> + ? ? ? ? ? ? ? ? ? ? continue;
>> ? ? ? ? ? ? ? default:
>> ? ? ? ? ? ? ? ? ? ? ? WARN_RATELIMIT(1, "Unknown code:%u jt:%u tf:%u k:%u\n",
>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?fentry->code, fentry->jt,
>> @@ -360,7 +415,7 @@ load_b:
>>
>> ? ? ? return 0;
>> ?}
>> -EXPORT_SYMBOL(sk_run_filter);
>> +EXPORT_SYMBOL(bpf_run_filter);
>>
>> ?/*
>> ? * Security :
>> @@ -423,9 +478,10 @@ error:
>> ?}
>>
>> ?/**
>> - * ? sk_chk_filter - verify socket filter code
>> + * ? bpf_chk_filter - verify socket filter BPF code
>> ? * ? @filter: filter to verify
>> ? * ? @flen: length of filter
>> + * ? @flags: May be BPF_CHK_FLAGS_NO_SKB or 0
>> ? *
>> ? * Check the user's filter code. If we let some ugly
>> ? * filter code slip through kaboom! The filter must contain
>> @@ -434,9 +490,13 @@ error:
>> ? *
>> ? * All jumps are forward as they are not signed.
>> ? *
>> + * If BPF_CHK_FLAGS_NO_SKB is set in flags, any SKB-specific
>> + * rules become illegal and a custom set of bpf_load_fns will
>> + * be expected by bpf_run_filter.
>> + *
>> ? * Returns 0 if the rule set is legal or -EINVAL if not.
>> ? */
>> -int sk_chk_filter(struct sock_filter *filter, unsigned int flen)
>> +int bpf_chk_filter(struct sock_filter *filter, unsigned int flen, u32 flags)
>> ?{
>> ? ? ? /*
>> ? ? ? ?* Valid instructions are initialized to non-0.
>> @@ -542,9 +602,35 @@ int sk_chk_filter(struct sock_filter *filter, unsigned int flen)
>> ? ? ? ? ? ? ? ? ? ? ? ? ? pc + ftest->jf + 1 >= flen)
>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? return -EINVAL;
>> ? ? ? ? ? ? ? ? ? ? ? break;
>> +#define MAYBE_USE_LOAD_FN(CODE) \
>> + ? ? ? ? ? ? ? ? ? ? if (flags & BPF_CHK_FLAGS_NO_SKB) { \
>> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? code = BPF_S_ANC_##CODE; \
>> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? break; \
>> + ? ? ? ? ? ? ? ? ? ? }
>
> You can as well hide everything in the macro then, including the case,
> like the ANCILLARY() macro does.
I'm not sure that would make it any more readable though, especially
since I don't always break; after.
>> + ? ? ? ? ? ? case BPF_S_LD_W_LEN:
>> + ? ? ? ? ? ? ? ? ? ? MAYBE_USE_LOAD_FN(LD_W_LEN);
>> + ? ? ? ? ? ? ? ? ? ? break;
>> + ? ? ? ? ? ? case BPF_S_LDX_W_LEN:
>> + ? ? ? ? ? ? ? ? ? ? MAYBE_USE_LOAD_FN(LDX_W_LEN);
>> + ? ? ? ? ? ? ? ? ? ? break;
>> + ? ? ? ? ? ? case BPF_S_LD_W_IND:
>> + ? ? ? ? ? ? ? ? ? ? MAYBE_USE_LOAD_FN(LD_W_IND);
>> + ? ? ? ? ? ? ? ? ? ? break;
>> + ? ? ? ? ? ? case BPF_S_LD_H_IND:
>> + ? ? ? ? ? ? ? ? ? ? MAYBE_USE_LOAD_FN(LD_H_IND);
>> + ? ? ? ? ? ? ? ? ? ? break;
>> + ? ? ? ? ? ? case BPF_S_LD_B_IND:
>> + ? ? ? ? ? ? ? ? ? ? MAYBE_USE_LOAD_FN(LD_B_IND);
>> + ? ? ? ? ? ? ? ? ? ? break;
>> + ? ? ? ? ? ? case BPF_S_LDX_B_MSH:
>> + ? ? ? ? ? ? ? ? ? ? MAYBE_USE_LOAD_FN(LDX_B_MSH);
>> + ? ? ? ? ? ? ? ? ? ? break;
>> ? ? ? ? ? ? ? case BPF_S_LD_W_ABS:
>> + ? ? ? ? ? ? ? ? ? ? MAYBE_USE_LOAD_FN(LD_W_ABS);
>> ? ? ? ? ? ? ? case BPF_S_LD_H_ABS:
>> + ? ? ? ? ? ? ? ? ? ? MAYBE_USE_LOAD_FN(LD_H_ABS);
>> ? ? ? ? ? ? ? case BPF_S_LD_B_ABS:
>> + ? ? ? ? ? ? ? ? ? ? MAYBE_USE_LOAD_FN(LD_B_ABS);
>> ?#define ANCILLARY(CODE) case SKF_AD_OFF + SKF_AD_##CODE: ? ? \
>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? code = BPF_S_ANC_##CODE; ? ? ? ?\
>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? break
>> @@ -572,7 +658,7 @@ int sk_chk_filter(struct sock_filter *filter, unsigned int flen)
>> ? ? ? }
>> ? ? ? return -EINVAL;
>> ?}
>> -EXPORT_SYMBOL(sk_chk_filter);
>> +EXPORT_SYMBOL(bpf_chk_filter);
>>
>> ?/**
>> ? * ? sk_filter_release_rcu - Release a socket filter by rcu_head
>> --
>> 1.7.5.4
>>
>
> Greetings,
Thanks!
will
On Thu, Feb 16, 2012 at 10:12 PM, H. Peter Anvin <[email protected]> wrote:
> On 02/16/2012 07:53 PM, Will Drewry wrote:
>>
>> An earlier change Roland had prodded me toward was adding a
>> syscall_get_arch() call to asm/syscall.h which returned the
>> appropriate audit arch value for the current calling convention. ?I
>> hate to suggest this, but should I go ahead and wire that up for x86
>> now, make it a dependency for HAVE_ARCH_SECCOMP_FILTER (and officially
>> part of asm/syscall.h) then let it trickle into existence? ?Maybe
>> something like:
>>
>
> ... and we have been talking about making a regset and export it to
> ptrace and core dumps, too.
Would having an audit_arch returning function be useful for building
those cases too? Or would this just be nearly-duplicated code
everywhere? (As is, ptrace usually takes shortcuts since it has the
arch-specific knowledge, so maybe it just wouldn't matter.)
>> static inline int syscall_get_arch(struct task_struct *task, struct
>> pt_regs *regs)
>> {
>> #ifdef CONFIG_IA32_EMULATION
>> ? if (task_thread_info(task)->status & TS_COMPAT)
>> ? ? return AUDIT_ARCH_I386;
>> #endif
>> #ifdef CONFIG_64BIT
>> ? return AUDIT_ARCH_X86_64;
>> #else
>> ? return AUDIT_ARCH_I386;
>> #endif
>> }
>>
>
> In this case it could be is_compat_task().
I wasn't sure if it was fine to add any syscall_* functions that
depended on the caller being current.
>> There would be no other callers, though, because everywhere AUDIT_ARCH
>> is used it is hardcoded as appropriate. ?Then when x32 comes along, it
>> can figure out where it belongs using tif status and/or regs.
>
> For x32 you have the option of introducing a new value or relying on bit
> 30 in eax (and AUDIT_ARCH_X86_64). ?The latter is more natural, probably.
Will that bit be visible as the syscall number or will it be stripped
out before passing the number around? If it's visible, then it
doesn't seem like there'd need to be a new AUDIT_ARCH, but I suspect
someone like Eric will have an actually useful opinion.
thanks!
will
On 02/16/2012 08:26 PM, Will Drewry wrote:
>>
>> For x32 you have the option of introducing a new value or relying on bit
>> 30 in eax (and AUDIT_ARCH_X86_64). The latter is more natural, probably.
>
> Will that bit be visible as the syscall number or will it be stripped
> out before passing the number around? If it's visible, then it
> doesn't seem like there'd need to be a new AUDIT_ARCH, but I suspect
> someone like Eric will have an actually useful opinion.
>
Bit 30 is visible in orig_eax; whether you export it as part of "the
syscall number" is presumably TBD, but I think it's more natural to do so.
-hpa
--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel. I don't speak on their behalf.
On Thu, Feb 16, 2012 at 10:32 PM, H. Peter Anvin <[email protected]> wrote:
> On 02/16/2012 08:26 PM, Will Drewry wrote:
>>>
>>> For x32 you have the option of introducing a new value or relying on bit
>>> 30 in eax (and AUDIT_ARCH_X86_64). ?The latter is more natural, probably.
>>
>> Will that bit be visible as the syscall number or will it be stripped
>> out before passing the number around? ?If it's visible, then it
>> doesn't seem like there'd need to be a new AUDIT_ARCH, but I suspect
>> someone like Eric will have an actually useful opinion.
>>
>
> Bit 30 is visible in orig_eax; whether you export it as part of "the
> syscall number" is presumably TBD, but I think it's more natural to do so.
That's what I meant - thanks!
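If the bit does stay visible in the number seccomp/audit reports,
handling it on the consumer side stays simple. A sketch, with the
constant name made up here purely for illustration:

/* Bit 30 marker for x32 system calls, per the convention above
 * (the name is illustrative; nothing defines it yet). */
#define X32_SYSCALL_BIT	0x40000000U

static inline int is_x32_syscall(unsigned int nr)
{
	return (nr & X32_SYSCALL_BIT) != 0;
}

static inline unsigned int native_syscall_nr(unsigned int nr)
{
	return nr & ~X32_SYSCALL_BIT;	/* strip the marker before table lookups */
}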
On Fri, February 17, 2012 05:09, H. Peter Anvin wrote:
> On 02/16/2012 07:27 PM, Indan Zupancic wrote:
>>
>> A JIT won't be able to merge accesses because it also has to merge other
>> instructions and recognize when 64-bit operations are done with 32-bit
>> instructions. I think that will be too hard for a JIT.
>>
>
> Please Google "peephole optimizer".
I have written one for uni. Like I said, I think it will be too hard
for a BPF JIT because the pattern is too complex. Keep in mind that
there is no 64-bit register to load the data into; everything is done
on 32-bit values. So you have to recognize 32-bit code emulating
64-bit ops. I don't think anyone will add all the different patterns
of doing that to the JIT; there are too many.
The current JIT is networking-only and is very simplistic. It is a very
long way to a sophisticated enough JIT that does such complex peephole
optimisations. I'm not saying it's impossible in general, just that the
kernel BPF JIT won't be able to do it. It's a lot easier to just add
64-bit support to BPF instead.
Greetings,
Indan
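As a concrete example of what a JIT would have to pattern-match: even a
single "allow only if arg0 == WANTED" test against a 64-bit argument
already spans four instructions on the 32-bit BPF machine. A sketch
against the lo32[]/hi32[] layout from this series (the offsets assume
natural packing of that struct; the constant is arbitrary):

#include <linux/filter.h>
#include <linux/seccomp.h>

#define ARG0_LO_OFF	24	/* offsetof(struct seccomp_data, lo32[0]), roughly */
#define ARG0_HI_OFF	48	/* offsetof(struct seccomp_data, hi32[0]), roughly */
#define WANTED		0x100000002ULL

static struct sock_filter cmp_arg0[] = {
	BPF_STMT(BPF_LD  | BPF_W   | BPF_ABS, ARG0_HI_OFF),
	BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, (__u32)(WANTED >> 32), 0, 2),
	BPF_STMT(BPF_LD  | BPF_W   | BPF_ABS, ARG0_LO_OFF),
	BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, (__u32)(WANTED & 0xffffffff), 1, 0),
	BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL),
	BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
};

A JIT would have to recognize the whole load/compare pair per half as
one 64-bit comparison before it could merge the two 32-bit loads, which
is the pattern-matching problem described above.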
On Fri, February 17, 2012 05:13, Will Drewry wrote:
> On Thu, Feb 16, 2012 at 9:04 PM, Indan Zupancic <[email protected]> wrote:
>>> Hello,
>>>
>>> On Thu, February 16, 2012 21:02, Will Drewry wrote:
>>>> This change allows CONFIG_SECCOMP to make use of BPF programs for
>>>> user-controlled system call filtering (as shown in this patch series).
>>>>
>>>> To minimize the impact on existing BPF evaluation, function pointer
>>>> use must be declared at sk_chk_filter-time. This allows ancillary
>>>> load instructions to be generated that use the function pointer rather
>>>> than adding _any_ code to the existing LD_* instruction paths.
>>>>
>>>> Crude performance numbers using udpflood -l 10000000 against dummy0.
>>>> 3 trials for baseline, 3 for with tcpdump. Averaged then differenced.
>>>> Hard to believe trials were repeated at least a couple more times.
>>>>
>>>> * x86 32-bit (Atom N570 @ 1.66 GHz 2 core HT) [stackprot]:
>>>> - Without: 94.05s - 76.36s = 17.68s
>>>> - With: 86.22s - 73.30s = 12.92s
>>>> - Slowdown per call: -476 nanoseconds
>>>>
>>>> * x86 32-bit (Atom N570 @ 1.66 GHz 2 core HT) [no stackprot]:
>>>> - Without: 92.06s - 77.81s = 14.25s
>>>> - With: 91.77s - 76.91s = 14.86s
>>>> - Slowdown per call: +61 nanoseconds
>>>>
>>>> * x86 64-bit (Atom N570 @ 1.66 GHz 2 core HT) [stackprot]:
>>>> - Without: 122.58s - 99.54s = 23.04s
>>>> - With: 115.52s - 98.99s = 16.53s
>>>> - Slowdown per call: -651 nanoseconds
>>>>
>>>> * x86 64-bit (Atom N570 @ 1.66 GHz 2 core HT) [no stackprot]:
>>>> - Without: 114.95s - 91.92s = 23.03s
>>>> - With: 110.47s - 90.79s = 19.68s
>>>> - Slowdown per call: -335 nanoseconds
>>>>
>>>> This makes the x86-32-nossp make sense. Added register pressure always
>>>> makes x86-32 sad.
>>>
>>> Your 32-bit numbers are better than your 64-bit numbers, so I don't get
>>> this comment.
>
> They are in the absolute. Relatively, all performance improved with
> my patch except for x86-nossp.
Why is 32-bit sad if it's faster than the 64-bit version?
I'd say the 64-bit numbers are sad considering the extra registers.
>>> Yeah, testing on Atom is a bit silly.
>
> Making things run well on Atom is important for my daily work. And it
> usually means (barring Atom-specific weirdness) that it then runs even
> better on bigger processors :)
Fair enough!
>>>> +#define SKB(_data) ((const struct sk_buff *)(_data))
>>>
>>> Urgh!
>>>
>>> If you had done:
>>> const struct sk_buff *skb = data;
>>>
>>> at the top, all those changed wouldn't be needed and it would look better too.
>
> That just means I need to disassemble after to make sure the compiler
> does the right thing. I'll do that and change it if gcc is doing the
> right thing.
You're telling the compiler the same thing, so it better do the right thing!
It just looks better.
>>> These two should either return 0, be networking-only, just return 0/-1 or
>>> use a constant length.
>
> I'm changing it to constant length, but I can get rid of it
> altogether. I don't care either way, it just depends on if there is
> anyone else who will want this support.
Right now there is no one else.
>>>> +#define MAYBE_USE_LOAD_FN(CODE) \
>>>> + if (flags & BPF_CHK_FLAGS_NO_SKB) { \
>>>> + code = BPF_S_ANC_##CODE; \
>>>> + break; \
>>>> + }
>>>
>>> You can as well hide everything in the macro then, including the case,
>>> like the ANCILLARY() macro does.
>
> I'm not sure that would make it any more readable though, especially
> since I don't always break; after.
Ah, true. Because there was a break; in the macro, I assumed it would
always break, for some reason. I wish there was a way to make it look
nice though, it's so ugly.
>
>>>> + case BPF_S_LD_W_LEN:
>>>> + MAYBE_USE_LOAD_FN(LD_W_LEN);
>>>> + break;
>>>> + case BPF_S_LDX_W_LEN:
>>>> + MAYBE_USE_LOAD_FN(LDX_W_LEN);
>>>> + break;
>>>> + case BPF_S_LD_W_IND:
>>>> + MAYBE_USE_LOAD_FN(LD_W_IND);
>>>> + break;
>>>> + case BPF_S_LD_H_IND:
>>>> + MAYBE_USE_LOAD_FN(LD_H_IND);
>>>> + break;
>>>> + case BPF_S_LD_B_IND:
>>>> + MAYBE_USE_LOAD_FN(LD_B_IND);
>>>> + break;
>>>> + case BPF_S_LDX_B_MSH:
>>>> + MAYBE_USE_LOAD_FN(LDX_B_MSH);
>>>> + break;
>>>> case BPF_S_LD_W_ABS:
>>>> + MAYBE_USE_LOAD_FN(LD_W_ABS);
>>>> case BPF_S_LD_H_ABS:
>>>> + MAYBE_USE_LOAD_FN(LD_H_ABS);
>>>> case BPF_S_LD_B_ABS:
>>>> + MAYBE_USE_LOAD_FN(LD_B_ABS);
>>>> #define ANCILLARY(CODE) case SKF_AD_OFF + SKF_AD_##CODE: \
>>>> code = BPF_S_ANC_##CODE; \
>>>> break
>>>> @@ -572,7 +658,7 @@ int sk_chk_filter(struct sock_filter *filter, unsigned int flen)
>>>> }
>>>> return -EINVAL;
>>>> }
>>>> -EXPORT_SYMBOL(sk_chk_filter);
>>>> +EXPORT_SYMBOL(bpf_chk_filter);
>>>>
>>>> /**
>>>> * sk_filter_release_rcu - Release a socket filter by rcu_head
>>>> --
>>>> 1.7.5.4
>>>>
>>>
>>> Greetings,
>
> Thanks!
You're welcome!
Indan
Hello,
On Thu, February 16, 2012 21:02, Will Drewry wrote:
> A new return value is added to seccomp filters that allows
> the system call policy for the affected system calls to be
> implemented by a ptrace(2)ing process.
>
> If a tracer attaches to a task using PTRACE_SECCOMP, then the
> traced process will notify the tracer if a seccomp filter
> returns SECCOMP_RET_TRACE. If the tracer detaches, then
> system calls made by the task will fail.
This is what I need to make BPF useful to me.
To have the least impact on current and future (user space) ptrace code,
I suspect it's best if PTRACE_SECCOMP becomes a ptrace option. That
may be a little bit more work, but the end result should be more
robust against any future ptrace changes.
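Roughly what the tracer side could look like if this became a ptrace
option rather than a new request; every name below (option bit, event
number) is hypothetical, nothing in this series defines them:

#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/wait.h>

#define PTRACE_O_TRACESECCOMP	0x00000080	/* hypothetical option bit */
#define PTRACE_EVENT_SECCOMP	7		/* hypothetical event number */

static void trace_filtered_child(pid_t pid)
{
	int status;

	ptrace(PTRACE_SETOPTIONS, pid, NULL, (void *)PTRACE_O_TRACESECCOMP);
	for (;;) {
		ptrace(PTRACE_CONT, pid, NULL, NULL);
		if (waitpid(pid, &status, 0) < 0 || WIFEXITED(status))
			break;
		/* A SECCOMP_RET_TRACE hit would surface as an event stop. */
		if (WIFSTOPPED(status) &&
		    (status >> 16) == PTRACE_EVENT_SECCOMP) {
			/* inspect registers, rewrite or deny the call, ... */
		}
	}
}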
> To ensure that seccomp is syscall fast-path friendly in the future,
> ptrace is delegated to by setting TIF_SYSCALL_TRACE. Since seccomp
> events are equivalent to system call entry events, this allows for
> seccomp to be evaluated as a fork off the fast-path and only,
> optionally, jump to the slow path.
I think you have to go through the slow path anyway to get access to
the syscall arguments, at least for some archs.
> When the tracer is notified, all
> will function as with ptrace(PTRACE_SYSCALLS), but when the tracer calls
> ptrace(PTRACE_SECCOMP), TIF_SYSCALL_TRACE will be unset and the task
> will proceed.
> Note, this patch takes the path of least resistance for integration. It
> is not necessarily the best path and any guidance will be appreciated!
> The key challenges are ensuring that register state is correct at
> ptrace handoff and ensuring that all only seccomp-based notification
> occurs.
>
> v8: - guarded PTRACE_SECCOMP use with an ifdef
> v7: - introduced
>
> Signed-off-by: Will Drewry <[email protected]>
> ---
>  arch/Kconfig            |   12 ++++++++----
>  include/linux/ptrace.h  |    1 +
>  include/linux/seccomp.h |   39 +++++++++++++++++++++++++++++++++++++--
>  kernel/ptrace.c         |   12 ++++++++++++
>  kernel/seccomp.c        |   15 +++++++++++++++
>  5 files changed, 73 insertions(+), 6 deletions(-)
>
> diff --git a/arch/Kconfig b/arch/Kconfig
> index a01c151..ae40aec 100644
> --- a/arch/Kconfig
> +++ b/arch/Kconfig
> @@ -203,10 +203,14 @@ config HAVE_ARCH_SECCOMP_FILTER
> bool
> help
> This symbol should be selected by an architecure if it provides
> - asm/syscall.h, specifically syscall_get_arguments(),
> - syscall_set_return_value(), and syscall_rollback().
> - Additionally, its system call entry path must respect a return
> - value of -1 from __secure_computing_int() and/or secure_computing().
> + linux/tracehook.h, for TIF_SYSCALL_TRACE, and asm/syscall.h,
> + specifically syscall_get_arguments(), syscall_set_return_value(), and
> + syscall_rollback(). Additionally, its system call entry path must
> + respect a return value of -1 from __secure_computing_int() and/or
> + secure_computing(). If secure_computing is not in the system call
> + slow path, the thread info flags will need to be checked upon exit to
> + ensure delegation to ptrace(2) did not occur, or if it did, jump to
> + the slow-path.
>
> config SECCOMP_FILTER
> def_bool y
> diff --git a/include/linux/ptrace.h b/include/linux/ptrace.h
> index c2f1f6a..00220de 100644
> --- a/include/linux/ptrace.h
> +++ b/include/linux/ptrace.h
> @@ -50,6 +50,7 @@
> #define PTRACE_SEIZE 0x4206
> #define PTRACE_INTERRUPT 0x4207
> #define PTRACE_LISTEN 0x4208
> +#define PTRACE_SECCOMP 0x4209
>
> /* flags in @data for PTRACE_SEIZE */
> #define PTRACE_SEIZE_DEVEL 0x80000000 /* temp flag for development */
> diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h
> index 1be562f..1cb7d5c 100644
> --- a/include/linux/seccomp.h
> +++ b/include/linux/seccomp.h
> @@ -19,8 +19,9 @@
> * selects the least permissive choice.
> */
> #define SECCOMP_RET_KILL 0x00000000U /* kill the task immediately */
> -#define SECCOMP_RET_TRAP 0x00020000U /* disallow and send sigtrap */
> -#define SECCOMP_RET_ERRNO 0x00030000U /* returns an errno */
> +#define SECCOMP_RET_TRAP 0x00020000U /* only send sigtrap */
> +#define SECCOMP_RET_ERRNO 0x00030000U /* only return an errno */
> +#define SECCOMP_RET_TRACE 0x7ffe0000U /* allow, but notify the tracer */
> #define SECCOMP_RET_ALLOW 0x7fff0000U /* allow */
>
> /* Masks for accessing the above values. */
> @@ -51,6 +52,7 @@ struct seccomp_filter;
> *
> * @mode: indicates one of the valid values above for controlled
> * system calls available to a process.
> + * @flags: per-process flags. Currently only used for SECCOMP_FLAGS_TRACED.
If it has only one use, don't make it flags then. You can always change
it to flags later. Until then, it only makes things less clear.
I actually think it's better to get rid of this altogether, and only
check task->ptrace for PTRACE_O_SECCOMP. That would avoid a lot of code.
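(Illustrative only: neither PTRACE_O_SECCOMP nor a PT_SECCOMP bit exists in this series; the sketch just mirrors how existing PTRACE_O_* options are reflected as PT_* bits in task->ptrace.)

	/* Hypothetical: no separate seccomp flag, just consult ptrace state. */
	static inline int seccomp_traced(struct task_struct *task)
	{
		return !!(task->ptrace & PT_SECCOMP);
	}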
> * @filter: The metadata and ruleset for determining what system calls
> * are allowed for a task.
> *
> @@ -59,9 +61,13 @@ struct seccomp_filter;
> */
> struct seccomp {
> int mode;
> + unsigned long flags;
Why a long? That wastes 4 bytes of padding and you still can't use the upper
32-bits because you have to support 32-bit systems too.
> struct seccomp_filter *filter;
> };
>
> +/* Indicates if a tracer is attached. */
> +#define SECCOMP_FLAGS_TRACED 0
That's not the best way to check if a tracer is attached, and if you did use
it for that, you don't need to toggle it all the time.
> +
> /*
> * Direct callers to __secure_computing should be updated as
> * CONFIG_HAVE_ARCH_SECCOMP_FILTER propagates.
> @@ -83,6 +89,20 @@ static inline int seccomp_mode(struct seccomp *s)
> return s->mode;
> }
>
> +static inline void seccomp_set_traced(struct seccomp *s)
> +{
> + set_bit(SECCOMP_FLAGS_TRACED, &s->flags);
> +}
> +
> +static inline void seccomp_clear_traced(struct seccomp *s)
> +{
> + clear_bit(SECCOMP_FLAGS_TRACED, &s->flags);
> +}
> +
> +static inline int seccomp_traced(struct seccomp *s)
> +{
> + return test_bit(SECCOMP_FLAGS_TRACED, &s->flags);
> +}
> #else /* CONFIG_SECCOMP */
>
> #include <linux/errno.h>
> @@ -106,6 +126,21 @@ static inline int seccomp_mode(struct seccomp *s)
> {
> return 0;
> }
> +
> +static inline void seccomp_set_traced(struct seccomp *s)
> +{
> + return;
> +}
> +
> +static inline void seccomp_clear_traced(struct seccomp *s)
> +{
> + return;
> +}
> +
> +static inline int seccomp_traced(struct seccomp *s)
> +{
> + return 0;
> +}
> #endif /* CONFIG_SECCOMP */
>
> #ifdef CONFIG_SECCOMP_FILTER
> diff --git a/kernel/ptrace.c b/kernel/ptrace.c
> index 00ab2ca..199a6da 100644
> --- a/kernel/ptrace.c
> +++ b/kernel/ptrace.c
> @@ -19,6 +19,7 @@
> #include <linux/signal.h>
> #include <linux/audit.h>
> #include <linux/pid_namespace.h>
> +#include <linux/seccomp.h>
> #include <linux/syscalls.h>
> #include <linux/uaccess.h>
> #include <linux/regset.h>
> @@ -426,6 +427,7 @@ static int ptrace_detach(struct task_struct *child, unsigned int
data)
> /* Architecture-specific hardware disable .. */
> ptrace_disable(child);
> clear_tsk_thread_flag(child, TIF_SYSCALL_TRACE);
> + seccomp_clear_traced(&child->seccomp);
>
> write_lock_irq(&tasklist_lock);
> /*
> @@ -616,6 +618,13 @@ static int ptrace_resume(struct task_struct *child, long request,
> else
> clear_tsk_thread_flag(child, TIF_SYSCALL_TRACE);
>
> +#ifdef CONFIG_SECCOMP_FILTER
> + if (request == PTRACE_SECCOMP)
> + seccomp_set_traced(&child->seccomp);
> + else
> + seccomp_clear_traced(&child->seccomp);
> +#endif
> +
> #ifdef TIF_SYSCALL_EMU
> if (request == PTRACE_SYSEMU || request == PTRACE_SYSEMU_SINGLESTEP)
> set_tsk_thread_flag(child, TIF_SYSCALL_EMU);
> @@ -816,6 +825,9 @@ int ptrace_request(struct task_struct *child, long request,
> case PTRACE_SYSEMU:
> case PTRACE_SYSEMU_SINGLESTEP:
> #endif
> +#ifdef CONFIG_SECCOMP_FILTER
> + case PTRACE_SECCOMP:
> +#endif
> case PTRACE_SYSCALL:
> case PTRACE_CONT:
> return ptrace_resume(child, request, data);
> diff --git a/kernel/seccomp.c b/kernel/seccomp.c
> index c75485c..f9d419f 100644
> --- a/kernel/seccomp.c
> +++ b/kernel/seccomp.c
> @@ -289,6 +289,8 @@ void copy_seccomp(struct seccomp *child,
> {
> child->mode = prev->mode;
> child->filter = get_seccomp_filter(prev->filter);
> + /* Note, this leaves seccomp tracing enabled across fork. */
> + child->flags = prev->flags;
What if the child isn't ptraced?
> }
>
> /**
> @@ -363,6 +365,19 @@ int __secure_computing_int(int this_syscall)
> syscall_rollback(current, task_pt_regs(current));
> seccomp_send_sigtrap();
> return -1;
> + case SECCOMP_RET_TRACE:
> + if (!seccomp_traced(&current->seccomp))
> + return -1;
> + /*
> + * Delegate to TIF_SYSCALL_TRACE. This allows fast-path
> + * seccomp calls to delegate to slow-path if needed.
> + * Since TIF_SYSCALL_TRACE will be unset on ptrace(2)
> + * continuation, there should be no direct side
> + * effects. If TIF_SYSCALL_TRACE is already set, this
> + * has no effect.
> + */
> + set_tsk_thread_flag(current, TIF_SYSCALL_TRACE);
> + /* Falls through to allow. */
This is nice and simple, but not race-free. You want to check if the ptracer
handled the event or not. If the ptracer died before handling this then the
syscall should be denied and the task should be killed.
Many people would like a PTRACE_O_KILL_TRACEE_IF_DEBUGGER_DIES option,
Oleg was working on that, among other things. Perhaps re-use that to
handle this case too?
> case SECCOMP_RET_ALLOW:
For this and the ERRNO case you could check that PTRACE_O_SECCOMP option and
decide to do something or not in ptrace.
> return 0;
> case SECCOMP_RET_KILL:
> --
> 1.7.5.4
>
>
Greetings,
Indan
On Thu, Feb 16, 2012 at 11:08 PM, Indan Zupancic <[email protected]> wrote:
> Hello,
>
> On Thu, February 16, 2012 21:02, Will Drewry wrote:
>> A new return value is added to seccomp filters that allows
>> the system call policy for the affected system calls to be
>> implemented by a ptrace(2)ing process.
>>
>> If a tracer attaches to a task using PTRACE_SECCOMP, then the
>> traced process will notify the tracer if a seccomp filter
>> returns SECCOMP_RET_TRACE. If the tracer detaches, then
>> system calls made by the task will fail.
>
> This is what I need to make BPF useful to me.
>
> To have least impact on current and future (user space) ptrace code,
> I suspect it's best if PTRACE_SECCOMP becomes a ptrace option. That
> may be a little bit more work, but the end result should be more
> robust against any future ptrace changes.
>
>> To ensure that seccomp is syscall fast-path friendly in the future,
>> ptrace is delegated to by setting TIF_SYSCALL_TRACE. Since seccomp
>> events are equivalent to system call entry events, this allows for
>> seccomp to be evaluated as a fork off the fast-path and only,
>> optionally, jump to the slow path.
>
> I think you have to go through the slow path anyway to get access to
> the syscall arguments, at least for some archs.
Just depends on the arch. On x86, it only populates the arguments and
the syscall number by default, but the slow path takes the time to
copy the rest.
>> When the tracer is notified, all
>> will function as with ptrace(PTRACE_SYSCALLS), but when the tracer calls
>> ptrace(PTRACE_SECCOMP), TIF_SYSCALL_TRACE will be unset and the task
>> will proceed.
>> Note, this patch takes the path of least resistance for integration. It
>> is not necessarily the best path and any guidance will be appreciated!
>> The key challenges are ensuring that register state is correct at
>> ptrace handoff and ensuring that all only seccomp-based notification
>> occurs.
>>
>> v8: - guarded PTRACE_SECCOMP use with an ifdef
>> v7: - introduced
>>
>> Signed-off-by: Will Drewry <[email protected]>
>> ---
>> ?arch/Kconfig ? ? ? ? ? ?| ? 12 ++++++++----
>> ?include/linux/ptrace.h ?| ? ?1 +
>> ?include/linux/seccomp.h | ? 39 +++++++++++++++++++++++++++++++++++++--
>> ?kernel/ptrace.c ? ? ? ? | ? 12 ++++++++++++
>> ?kernel/seccomp.c ? ? ? ?| ? 15 +++++++++++++++
>> ?5 files changed, 73 insertions(+), 6 deletions(-)
>>
>> diff --git a/arch/Kconfig b/arch/Kconfig
>> index a01c151..ae40aec 100644
>> --- a/arch/Kconfig
>> +++ b/arch/Kconfig
>> @@ -203,10 +203,14 @@ config HAVE_ARCH_SECCOMP_FILTER
>> ? ? ? bool
>> ? ? ? help
>> ? ? ? ? This symbol should be selected by an architecure if it provides
>> - ? ? ? asm/syscall.h, specifically syscall_get_arguments(),
>> - ? ? ? syscall_set_return_value(), and syscall_rollback().
>> - ? ? ? Additionally, its system call entry path must respect a return
>> - ? ? ? value of -1 from __secure_computing_int() and/or secure_computing().
>> + ? ? ? linux/tracehook.h, for TIF_SYSCALL_TRACE, and asm/syscall.h,
>> + ? ? ? specifically syscall_get_arguments(), syscall_set_return_value(), and
>> + ? ? ? syscall_rollback(). ?Additionally, its system call entry path must
>> + ? ? ? respect a return value of -1 from __secure_computing_int() and/or
>> + ? ? ? secure_computing(). ?If secure_computing is not in the system call
>> + ? ? ? slow path, the thread info flags will need to be checked upon exit to
>> + ? ? ? ensure delegation to ptrace(2) did not occur, or if it did, jump to
>> + ? ? ? the slow-path.
>>
>> ?config SECCOMP_FILTER
>> ? ? ? def_bool y
>> diff --git a/include/linux/ptrace.h b/include/linux/ptrace.h
>> index c2f1f6a..00220de 100644
>> --- a/include/linux/ptrace.h
>> +++ b/include/linux/ptrace.h
>> @@ -50,6 +50,7 @@
>> ?#define PTRACE_SEIZE ? ? ? ? 0x4206
>> ?#define PTRACE_INTERRUPT ? ? 0x4207
>> ?#define PTRACE_LISTEN ? ? ? ? ? ? ? ?0x4208
>> +#define PTRACE_SECCOMP ? ? ? ? ? ? ? 0x4209
>>
>> ?/* flags in @data for PTRACE_SEIZE */
>> ?#define PTRACE_SEIZE_DEVEL ? 0x80000000 /* temp flag for development */
>> diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h
>> index 1be562f..1cb7d5c 100644
>> --- a/include/linux/seccomp.h
>> +++ b/include/linux/seccomp.h
>> @@ -19,8 +19,9 @@
>> ? * selects the least permissive choice.
>> ? */
>> ?#define SECCOMP_RET_KILL ? ? 0x00000000U /* kill the task immediately */
>> -#define SECCOMP_RET_TRAP ? ? 0x00020000U /* disallow and send sigtrap */
>> -#define SECCOMP_RET_ERRNO ? ?0x00030000U /* returns an errno */
>> +#define SECCOMP_RET_TRAP ? ? 0x00020000U /* only send sigtrap */
>> +#define SECCOMP_RET_ERRNO ? ?0x00030000U /* only return an errno */
>> +#define SECCOMP_RET_TRACE ? ?0x7ffe0000U /* allow, but notify the tracer */
>> ?#define SECCOMP_RET_ALLOW ? ?0x7fff0000U /* allow */
>>
>> ?/* Masks for accessing the above values. */
>> @@ -51,6 +52,7 @@ struct seccomp_filter;
>> *
>> * @mode: indicates one of the valid values above for controlled
>> * system calls available to a process.
>> + * @flags: per-process flags. Currently only used for SECCOMP_FLAGS_TRACED.
>
> If it has only one use, don't make it flags then. You can always change
> it to flags later. Until then, it only makes things less clear.
I'll see if the updated one can ditch it.
> I actually think it's better to get rid of this altogether, and only
> check task->ptrace for PTRACE_O_SECCOMP. That would avoid a lot of code.
It wasn't clear to me how to best add a PTRACE_O_* since the most
recent refactor. Maybe if I see one of Oleg's new patches, I can
model it on that.
>> * @filter: The metadata and ruleset for determining what system calls
>> * are allowed for a task.
>> *
>> @@ -59,9 +61,13 @@ struct seccomp_filter;
>> */
>> struct seccomp {
>> int mode;
>> + unsigned long flags;
>
> Why a long? That wastes 4 bytes of padding and you still can't use the upper
> 32-bits because you have to support 32-bit systems too.
I was just using the bitset helper functions. The unsigned long
assures the arch can work its magic, etc.
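(For reference, the bitops helpers are declared roughly along these lines, which is why the field ends up an unsigned long; the exact prototypes vary slightly per arch:)

	/* From asm/bitops.h, approximately: */
	void set_bit(int nr, volatile unsigned long *addr);
	void clear_bit(int nr, volatile unsigned long *addr);
	int  test_bit(int nr, const volatile unsigned long *addr);

	/* Casting a smaller field to unsigned long * would touch memory
	 * past it, and on big-endian 64-bit the bit would land in the
	 * wrong half, so the field itself is declared unsigned long. */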
>> struct seccomp_filter *filter;
>> };
>>
>> +/* Indicates if a tracer is attached. */
>> +#define SECCOMP_FLAGS_TRACED 0
>
> That's not the best way to check if a tracer is attached, and if you did use
> it for that, you don't need to toggle it all the time.
It's logically no different than task->ptrace. If it is less
desirable, that's fine, but it is functionally equivalent.
>> +
>> ?/*
>> ? * Direct callers to __secure_computing should be updated as
>> ? * CONFIG_HAVE_ARCH_SECCOMP_FILTER propagates.
>> @@ -83,6 +89,20 @@ static inline int seccomp_mode(struct seccomp *s)
>> ? ? ? return s->mode;
>> ?}
>>
>> +static inline void seccomp_set_traced(struct seccomp *s)
>> +{
>> + ? ? set_bit(SECCOMP_FLAGS_TRACED, &s->flags);
>> +}
>> +
>> +static inline void seccomp_clear_traced(struct seccomp *s)
>> +{
>> + ? ? clear_bit(SECCOMP_FLAGS_TRACED, &s->flags);
>> +}
>> +
>> +static inline int seccomp_traced(struct seccomp *s)
>> +{
>> + ? ? return test_bit(SECCOMP_FLAGS_TRACED, &s->flags);
>> +}
>> ?#else /* CONFIG_SECCOMP */
>>
>> ?#include <linux/errno.h>
>> @@ -106,6 +126,21 @@ static inline int seccomp_mode(struct seccomp *s)
>> ?{
>> ? ? ? return 0;
>> ?}
>> +
>> +static inline void seccomp_set_traced(struct seccomp *s)
>> +{
>> + ? ? return;
>> +}
>> +
>> +static inline void seccomp_clear_traced(struct seccomp *s)
>> +{
>> + ? ? return;
>> +}
>> +
>> +static inline int seccomp_traced(struct seccomp *s)
>> +{
>> + ? ? return 0;
>> +}
>> ?#endif /* CONFIG_SECCOMP */
>>
>> ?#ifdef CONFIG_SECCOMP_FILTER
>> diff --git a/kernel/ptrace.c b/kernel/ptrace.c
>> index 00ab2ca..199a6da 100644
>> --- a/kernel/ptrace.c
>> +++ b/kernel/ptrace.c
>> @@ -19,6 +19,7 @@
>> ?#include <linux/signal.h>
>> ?#include <linux/audit.h>
>> ?#include <linux/pid_namespace.h>
>> +#include <linux/seccomp.h>
>> ?#include <linux/syscalls.h>
>> ?#include <linux/uaccess.h>
>> ?#include <linux/regset.h>
>> @@ -426,6 +427,7 @@ static int ptrace_detach(struct task_struct *child, unsigned int
> data)
>> ? ? ? /* Architecture-specific hardware disable .. */
>> ? ? ? ptrace_disable(child);
>> ? ? ? clear_tsk_thread_flag(child, TIF_SYSCALL_TRACE);
>> + ? ? seccomp_clear_traced(&child->seccomp);
>>
>> ? ? ? write_lock_irq(&tasklist_lock);
>> ? ? ? /*
>> @@ -616,6 +618,13 @@ static int ptrace_resume(struct task_struct *child, long request,
>> ? ? ? else
>> ? ? ? ? ? ? ? clear_tsk_thread_flag(child, TIF_SYSCALL_TRACE);
>>
>> +#ifdef CONFIG_SECCOMP_FILTER
>> + ? ? if (request == PTRACE_SECCOMP)
>> + ? ? ? ? ? ? seccomp_set_traced(&child->seccomp);
>> + ? ? else
>> + ? ? ? ? ? ? seccomp_clear_traced(&child->seccomp);
>> +#endif
>> +
>> ?#ifdef TIF_SYSCALL_EMU
>> ? ? ? if (request == PTRACE_SYSEMU || request == PTRACE_SYSEMU_SINGLESTEP)
>> ? ? ? ? ? ? ? set_tsk_thread_flag(child, TIF_SYSCALL_EMU);
>> @@ -816,6 +825,9 @@ int ptrace_request(struct task_struct *child, long request,
>> ? ? ? case PTRACE_SYSEMU:
>> ? ? ? case PTRACE_SYSEMU_SINGLESTEP:
>> ?#endif
>> +#ifdef CONFIG_SECCOMP_FILTER
>> + ? ? case PTRACE_SECCOMP:
>> +#endif
>> ? ? ? case PTRACE_SYSCALL:
>> ? ? ? case PTRACE_CONT:
>> ? ? ? ? ? ? ? return ptrace_resume(child, request, data);
>> diff --git a/kernel/seccomp.c b/kernel/seccomp.c
>> index c75485c..f9d419f 100644
>> --- a/kernel/seccomp.c
>> +++ b/kernel/seccomp.c
>> @@ -289,6 +289,8 @@ void copy_seccomp(struct seccomp *child,
>> {
>> child->mode = prev->mode;
>> child->filter = get_seccomp_filter(prev->filter);
>> + /* Note, this leaves seccomp tracing enabled across fork. */
>> + child->flags = prev->flags;
>
> What if the child isn't ptraced?
Then falling through with TIF_SYSCALL_TRACE will result in the
SECCOMP_RET_TRACE events to be allowed, but this comes back to the
race. If I can effectively "check" that ptrace did its job, then I
think this becomes a non-issue.
>> }
>>
>> /**
>> @@ -363,6 +365,19 @@ int __secure_computing_int(int this_syscall)
>> syscall_rollback(current, task_pt_regs(current));
>> seccomp_send_sigtrap();
>> return -1;
>> + case SECCOMP_RET_TRACE:
>> + if (!seccomp_traced(&current->seccomp))
>> + return -1;
>> + /*
>> + * Delegate to TIF_SYSCALL_TRACE. This allows fast-path
>> + * seccomp calls to delegate to slow-path if needed.
>> + * Since TIF_SYSCALL_TRACE will be unset on ptrace(2)
>> + * continuation, there should be no direct side
>> + * effects. If TIF_SYSCALL_TRACE is already set, this
>> + * has no effect.
>> + */
>> + set_tsk_thread_flag(current, TIF_SYSCALL_TRACE);
>> + /* Falls through to allow. */
>
> This is nice and simple, but not race-free. You want to check if the ptracer
> handled the event or not. If the ptracer died before handling this then the
> syscall should be denied and the task should be killed.
Hrm. I think there's a way to do this without forcing seccomp to
always go slow path. I'll update the patch and see how it goes.
> Many people would like a PTRACE_O_KILL_TRACEE_IF_DEBUGGER_DIES option,
> Oleg was working on that, among other things. Perhaps re-use that to
> handle this case too?
Well, if you can inject initial code into the tracee, then it can call
prctl(PR_SET_PDEATHSIG, SIGKILL). Then when the tracer dies, the
child dies. If the SIGKILL race in arch_ptrace_... is resolved, then
a SIGKILL that arrives between seccomp and delegation to ptrace should
result in process death. Though perhaps my proposal above will make
seccomp's integration with ptrace less subject to ptrace behaviors.
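(A minimal userspace sketch of that injection idea, assuming the tracer forks the tracee itself; as the follow-up notes, this only covers direct children, not further descendants:)

	#include <signal.h>
	#include <sys/prctl.h>
	#include <sys/ptrace.h>
	#include <unistd.h>

	static void tracee_setup(char **argv)
	{
		/* Die if the parent (the tracer) goes away... */
		if (prctl(PR_SET_PDEATHSIG, SIGKILL) < 0)
			_exit(1);
		/* ...and hand ourselves to the tracer before exec. */
		if (ptrace(PTRACE_TRACEME, 0, NULL, NULL) < 0)
			_exit(1);
		execvp(argv[0], argv);
		_exit(1);
	}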
>> case SECCOMP_RET_ALLOW:
>
> For this and the ERRNO case you could check that PTRACE_O_SECCOMP option and
> decide to do something or not in ptrace.
For ERRNO, I'd prefer not to since it adds implicit behavior to the
rules and, without pulling a ptrace_event()ish call into this code, it
would change the return flow and potentially open up errno, which
should be solid, to races, etc. For ALLOW, sure, but at that point,
just use PTRACE_SYSCALL. Perhaps this can all be ameliorated if I can
get a useful ptrace_entry completed notification.
>> return 0;
>> case SECCOMP_RET_KILL:
>> --
>> 1.7.5.4
>>
>>
>
> Greetings,
As usual, thanks!
will
Hello,
On Fri, February 17, 2012 17:23, Will Drewry wrote:
> On Thu, Feb 16, 2012 at 11:08 PM, Indan Zupancic <[email protected]> wrote:
>>> +/* Indicates if a tracer is attached. */
>>> +#define SECCOMP_FLAGS_TRACED 0
>>
>> That's not the best way to check if a tracer is attached, and if you did use
>> it for that, you don't need to toggle it all the time.
>
> It's logically no different than task->ptrace. If it is less
> desirable, that's fine, but it is functionally equivalent.
Except that when using task->ptrace the ptrace code keeps track of it and
clears it when the ptracer goes away. And you're toggling SECCOMP_FLAGS_TRACED
all the time.
>>> diff --git a/kernel/seccomp.c b/kernel/seccomp.c
>>> index c75485c..f9d419f 100644
>>> --- a/kernel/seccomp.c
>>> +++ b/kernel/seccomp.c
>>> @@ -289,6 +289,8 @@ void copy_seccomp(struct seccomp *child,
>>> {
>>> child->mode = prev->mode;
>>> child->filter = get_seccomp_filter(prev->filter);
>>> + /* Note, this leaves seccomp tracing enabled across fork. */
>>> + child->flags = prev->flags;
>>
>> What if the child isn't ptraced?
>
> Then falling through with TIF_SYSCALL_TRACE will result in the
> SECCOMP_RET_TRACE events to be allowed, but this comes back to the
> race. If I can effectively "check" that ptrace did its job, then I
> think this becomes a non-issue.
Yes. But it would still be sloppy state tracking, which can lead to
all kinds of unlikely but interesting scenarios. If the child is ever
attached to later on, that flag will still be set. The same is true for
any descendant; they will all have that flag copied.
>>> }
>>>
>>> /**
>>> @@ -363,6 +365,19 @@ int __secure_computing_int(int this_syscall)
>>> syscall_rollback(current, task_pt_regs(current));
>>> seccomp_send_sigtrap();
>>> return -1;
>>> + case SECCOMP_RET_TRACE:
>>> + if (!seccomp_traced(&current->seccomp))
>>> + return -1;
>>> + /*
>>> + * Delegate to TIF_SYSCALL_TRACE. This allows fast-path
>>> + * seccomp calls to delegate to slow-path if needed.
>>> + * Since TIF_SYSCALL_TRACE will be unset on ptrace(2)
>>> + * continuation, there should be no direct side
>>> + * effects. If TIF_SYSCALL_TRACE is already set, this
>>> + * has no effect.
>>> + */
>>> + set_tsk_thread_flag(current, TIF_SYSCALL_TRACE);
>>> + /* Falls through to allow. */
>>
>> This is nice and simple, but not race-free. You want to check if the ptracer
>> handled the event or not. If the ptracer died before handling this then the
>> syscall should be denied and the task should be killed.
>
> Hrm. I think there's a way to do this without forcing seccomp to
> always go slow path. I'll update the patch and see how it goes.
You only have to go through the slow path for the SECCOMP_RET_TRACE case.
But yeah, toggling TIF_SYSCALL_TRACE seems the only way to avoid the slow
path, sometimes. The downside is that it's unexpected behaviour which may
clash with arch entry code, so I'm not sure if that's a good idea. I think
always going through the slow path isn't too bad, compared to the ptrace
alternative it's still a lot faster.
>> Many people would like a PTRACE_O_KILL_TRACEE_IF_DEBUGGER_DIES option,
>> Oleg was working on that, among other things. Perhaps re-use that to
>> handle this case too?
>
> Well, if you can inject initial code into the tracee, then it can call
> prctl(PR_SET_PDEATHSIG, SIGKILL). Then when the tracer dies, the
> child dies.
That only works for child tracees, not descendants of the tracee.
> If the SIGKILL race in arch_ptrace_... is resolved, then
> a SIGKILL that arrives between seccomp and delegation to ptrace should
> result in process death. Though perhaps my proposal above will make
> seccomp's integration with ptrace less subject to ptrace behaviors.
Oleg fixed the SIGKILL problem (it wasn't a race), it should go upstream
in the next kernel version, I think.
>>> case SECCOMP_RET_ALLOW:
>>
>> For this and the ERRNO case you could check that PTRACE_O_SECCOMP option and
>> decide to do something or not in ptrace.
>
> For ERRNO, I'd prefer not to since it adds implicit behavior to the
> rules and, without pulling a ptrace_event()ish call into this code, it
> would change the return flow and potentially open up errno, which
> should be solid, to races, etc. For ALLOW, sure, but at that point,
> just use PTRACE_SYSCALL. Perhaps this can all be ameliorated if I can
> get a useful ptrace_entry completed notification.
You don't want ptrace to be able to override the decision? Fair enough.
Or did you mean something else?
Greetings,
Indan
On Thu, 16 Feb 2012, Will Drewry wrote:
> Replaces the seccomp_t typedef with struct seccomp to match modern
> kernel style.
>
> v7: struct seccomp_struct -> struct seccomp
> v6: original inclusion in this series.
>
> Signed-off-by: Will Drewry <[email protected]>
Reviewed-by: James Morris <[email protected]>
--
James Morris
<[email protected]>
On Fri, Feb 17, 2012 at 4:55 PM, Indan Zupancic <[email protected]> wrote:
> Hello,
>
> On Fri, February 17, 2012 17:23, Will Drewry wrote:
>> On Thu, Feb 16, 2012 at 11:08 PM, Indan Zupancic <[email protected]> wrote:
>>>> +/* Indicates if a tracer is attached. */
>>>> +#define SECCOMP_FLAGS_TRACED 0
>>>
>>> That's not the best way to check if a tracer is attached, and if you did use
>>> it for that, you don't need to toggle it all the time.
>>
>> It's logically no different than task->ptrace. If it is less
>> desirable, that's fine, but it is functionally equivalent.
>
> Except that when using task->ptrace the ptrace code keeps track of it and
> clears it when the ptracer goes away. And you're toggling SECCOMP_FLAGS_TRACED
> all the time.
Yep, the code is gone in the coming version. It was ugly to need to
change it everywhere TIF_SYSCALL_TRACE was toggled.
>>>> diff --git a/kernel/seccomp.c b/kernel/seccomp.c
>>>> index c75485c..f9d419f 100644
>>>> --- a/kernel/seccomp.c
>>>> +++ b/kernel/seccomp.c
>>>> @@ -289,6 +289,8 @@ void copy_seccomp(struct seccomp *child,
>>>> {
>>>> child->mode = prev->mode;
>>>> child->filter = get_seccomp_filter(prev->filter);
>>>> + /* Note, this leaves seccomp tracing enabled across fork. */
>>>> + child->flags = prev->flags;
>>>
>>> What if the child isn't ptraced?
>>
>> Then falling through with TIF_SYSCALL_TRACE will result in the
>> SECCOMP_RET_TRACE events to be allowed, but this comes back to the
>> race. If I can effectively "check" that ptrace did its job, then I
>> think this becomes a non-issue.
>
> Yes. But it would still be sloppy state tracking, which can lead to
> all kinds of unlikely but interesting scenarios. If the child is ever
> attached to later on, that flag will still be set. The same is true for
> any descendant; they will all have that flag copied.
Yup - it'd lead to tracehook fall through and an implicit allow. Not
ideal at all.
>>>> }
>>>>
>>>> /**
>>>> @@ -363,6 +365,19 @@ int __secure_computing_int(int this_syscall)
>>>> syscall_rollback(current, task_pt_regs(current));
>>>> seccomp_send_sigtrap();
>>>> return -1;
>>>> + case SECCOMP_RET_TRACE:
>>>> + if (!seccomp_traced(&current->seccomp))
>>>> + return -1;
>>>> + /*
>>>> + * Delegate to TIF_SYSCALL_TRACE. This allows fast-path
>>>> + * seccomp calls to delegate to slow-path if needed.
>>>> + * Since TIF_SYSCALL_TRACE will be unset on ptrace(2)
>>>> + * continuation, there should be no direct side
>>>> + * effects. If TIF_SYSCALL_TRACE is already set, this
>>>> + * has no effect.
>>>> + */
>>>> + set_tsk_thread_flag(current, TIF_SYSCALL_TRACE);
>>>> + /* Falls through to allow. */
>>>
>>> This is nice and simple, but not race-free. You want to check if the ptracer
>>> handled the event or not. If the ptracer died before handling this then the
>>> syscall should be denied and the task should be killed.
>>
>> Hrm. I think there's a way to do this without forcing seccomp to
>> always go slow path. I'll update the patch and see how it goes.
>
> You only have to go through the slow path for the SECCOMP_RET_TRACE case.
You'd have to know at syscall entry time to decide or pre-scan the bpf
filter to see if SECCOMP_RET_TRACE is returned non-programmatically,
then add a TIF flag for slow pathing, but that seems crazy bad.
> But yeah, toggling TIF_SYSCALL_TRACE seems the only way to avoid the slow
> path, sometimes. The downside is that it's unexpected behaviour which may
> clash with arch entry code, so I'm not sure if that's a good idea. I think
> always going through the slow path isn't too bad, compared to the ptrace
> alternative it's still a lot faster.
Since supporting that behavior is documented as a prerequisite for
adding HAVE_ARCH_SECCOMP_FILTER, I don't see how it could be
unexpected behavior. On systems like x86, where seccomp is always in
the slow path, it has no impact. However, it means that if a fast path is
added (like audit), then it will need to know to re-check the
threadinfo flags. I don't want to try to optimize in advance, but
it'd be nice to not close off any options for later. If an explicit
ptrace_event(SECCOMP) call was being made, then we'd be stuck in the
slow path or stuck making the ptrace code have more ifs for
determining if the source was a normal ptrace event or special
seccomp-triggered one. That might be okay as a long-term-thing,
though, since the other option (which the incoming patchset does) is
to add a post-trace callback into seccomp. I'm not sure which is
truly preferable.
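(To make the delegation concrete, here is a sketch of how a slow-path entry hook could react; this is not any particular arch's actual code, and the secure_computing() return convention follows this series' description rather than mainline:)

	/* Sketch of a syscall-entry slow path. */
	long syscall_trace_enter(struct pt_regs *regs)
	{
		long ret = 0;

		/* Returns -1 to deny; for SECCOMP_RET_TRACE it sets
		 * TIF_SYSCALL_TRACE and falls through as "allow". */
		if (secure_computing(syscall_get_nr(current, regs)) == -1)
			return -1;

		/* Because seccomp may have just set the flag, the tracer
		 * now sees an ordinary syscall-entry stop. */
		if (test_thread_flag(TIF_SYSCALL_TRACE) &&
		    tracehook_report_syscall_entry(regs))
			ret = -1;

		return ret;
	}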
>>> Many people would like a PTRACE_O_KILL_TRACEE_IF_DEBUGGER_DIES option,
>>> Oleg was working on that, among other things. Perhaps re-use that to
>>> handle this case too?
>>
>> Well, if you can inject initial code into the tracee, then it can call
>> prctl(PR_SET_PDEATHSIG, SIGKILL). Then when the tracer dies, the
>> child dies.
>
> That only works for child tracees, not descendants of the tracee.
True enough.
>> If the SIGKILL race in arch_ptrace_... is resolved, then
>> a SIGKILL that arrives between seccomp and delegation to ptrace should
>> result in process death. Though perhaps my proposal above will make
>> seccomp's integration with ptrace less subject to ptrace behaviors.
>
> Oleg fixed the SIGKILL problem (it wasn't a race), it should go upstream
> in the next kernel version, I think.
Pick your own name for it then, I guess. The signal lock was held in
ptrace_notify. Then, in order to hand off to the arch_ptrace_notify
code, it releases the lock, then claims it again after. If SIGKILL
was delivered in that time window, then the post-arch-handoff code
would see it, skip notification of the tracer, and allow the syscall
to run prior to terminating the task. I'm excited to see it fixed :)
>>>> case SECCOMP_RET_ALLOW:
>>>
>>> For this and the ERRNO case you could check that PTRACE_O_SECCOMP option and
>>> decide to do something or not in ptrace.
>>
>> For ERRNO, I'd prefer not to since it adds implicit behavior to the
>> rules and, without pulling a ptrace_event()ish call into this code, it
>> would change the return flow and potentially open up errno, which
>> should be solid, to races, etc. For ALLOW, sure, but at that point,
>> just use PTRACE_SYSCALL. Perhaps this can all be ameliorated if I can
>> get a useful ptrace_entry completed notification.
>
> You don't want ptrace to be able to override the decision? Fair enough.
> Or did you mean something else?
Exactly. The first time I went down this path, I let a tracer pick up
any denied syscalls, but that complicated the interactions and
security model quite a bit. I also don't want to add an implicit
dependency on the syscall slow-path for any other return values --
just in case the proposed TIF_SYSCALL_TRACE approach isn't acceptable.
thanks!
will