2020-10-09 23:07:20

by YiFei Zhu

[permalink] [raw]
Subject: [PATCH v4 seccomp 5/5] seccomp/cache: Report cache data through /proc/pid/seccomp_cache

From: YiFei Zhu <[email protected]>

Currently the kernel does not provide an infrastructure to translate
architecture numbers to a human-readable name. Translating syscall
numbers to syscall names is possible through FTRACE_SYSCALL
infrastructure but it does not provide support for compat syscalls.

This will create a file for each PID as /proc/pid/seccomp_cache.
The file will be empty when no seccomp filters are loaded, or be
in the format of:
<arch name> <decimal syscall number> <ALLOW | FILTER>
where ALLOW means the cache is guaranteed to allow the syscall,
and filter means the cache will pass the syscall to the BPF filter.

For the docker default profile on x86_64 it looks like:
x86_64 0 ALLOW
x86_64 1 ALLOW
x86_64 2 ALLOW
x86_64 3 ALLOW
[...]
x86_64 132 ALLOW
x86_64 133 ALLOW
x86_64 134 FILTER
x86_64 135 FILTER
x86_64 136 FILTER
x86_64 137 ALLOW
x86_64 138 ALLOW
x86_64 139 FILTER
x86_64 140 ALLOW
x86_64 141 ALLOW
[...]

This file is guarded by CONFIG_SECCOMP_CACHE_DEBUG with a default
of N because I think certain users of seccomp might not want the
application to know which syscalls are definitely usable. For
the same reason, it is also guarded by CAP_SYS_ADMIN.

Suggested-by: Jann Horn <[email protected]>
Link: https://lore.kernel.org/lkml/CAG48ez3Ofqp4crXGksLmZY6=fGrF_tWyUCg7PBkAetvbbOPeOA@mail.gmail.com/
Signed-off-by: YiFei Zhu <[email protected]>
---
arch/Kconfig | 24 ++++++++++++++
arch/x86/Kconfig | 1 +
arch/x86/include/asm/seccomp.h | 3 ++
fs/proc/base.c | 6 ++++
include/linux/seccomp.h | 5 +++
kernel/seccomp.c | 59 ++++++++++++++++++++++++++++++++++
6 files changed, 98 insertions(+)

diff --git a/arch/Kconfig b/arch/Kconfig
index 21a3675a7a3a..85239a974f04 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -471,6 +471,15 @@ config HAVE_ARCH_SECCOMP_FILTER
results in the system call being skipped immediately.
- seccomp syscall wired up

+config HAVE_ARCH_SECCOMP_CACHE
+ bool
+ help
+ An arch should select this symbol if it provides all of these things:
+ - all the requirements for HAVE_ARCH_SECCOMP_FILTER
+ - SECCOMP_ARCH_NATIVE
+ - SECCOMP_ARCH_NATIVE_NR
+ - SECCOMP_ARCH_NATIVE_NAME
+
config SECCOMP
prompt "Enable seccomp to safely execute untrusted bytecode"
def_bool y
@@ -498,6 +507,21 @@ config SECCOMP_FILTER

See Documentation/userspace-api/seccomp_filter.rst for details.

+config SECCOMP_CACHE_DEBUG
+ bool "Show seccomp filter cache status in /proc/pid/seccomp_cache"
+ depends on SECCOMP
+ depends on SECCOMP_FILTER
+ depends on PROC_FS
+ help
+ This is enables /proc/pid/seccomp_cache interface to monitor
+ seccomp cache data. The file format is subject to change. Reading
+ the file requires CAP_SYS_ADMIN.
+
+ This option is for debugging only. Enabling present the risk that
+ an adversary may be able to infer the seccomp filter logic.
+
+ If unsure, say N.
+
config HAVE_ARCH_STACKLEAK
bool
help
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 1ab22869a765..1a807f89ac77 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -150,6 +150,7 @@ config X86
select HAVE_ARCH_COMPAT_MMAP_BASES if MMU && COMPAT
select HAVE_ARCH_PREL32_RELOCATIONS
select HAVE_ARCH_SECCOMP_FILTER
+ select HAVE_ARCH_SECCOMP_CACHE
select HAVE_ARCH_THREAD_STRUCT_WHITELIST
select HAVE_ARCH_STACKLEAK
select HAVE_ARCH_TRACEHOOK
diff --git a/arch/x86/include/asm/seccomp.h b/arch/x86/include/asm/seccomp.h
index 03365af6165d..cd57c3eabab5 100644
--- a/arch/x86/include/asm/seccomp.h
+++ b/arch/x86/include/asm/seccomp.h
@@ -19,13 +19,16 @@
#ifdef CONFIG_X86_64
# define SECCOMP_ARCH_NATIVE AUDIT_ARCH_X86_64
# define SECCOMP_ARCH_NATIVE_NR NR_syscalls
+# define SECCOMP_ARCH_NATIVE_NAME "x86_64"
# ifdef CONFIG_COMPAT
# define SECCOMP_ARCH_COMPAT AUDIT_ARCH_I386
# define SECCOMP_ARCH_COMPAT_NR IA32_NR_syscalls
+# define SECCOMP_ARCH_COMPAT_NAME "ia32"
# endif
#else /* !CONFIG_X86_64 */
# define SECCOMP_ARCH_NATIVE AUDIT_ARCH_I386
# define SECCOMP_ARCH_NATIVE_NR NR_syscalls
+# define SECCOMP_ARCH_NATIVE_NAME "ia32"
#endif

#include <asm-generic/seccomp.h>
diff --git a/fs/proc/base.c b/fs/proc/base.c
index 617db4e0faa0..a4990410ff05 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -3258,6 +3258,9 @@ static const struct pid_entry tgid_base_stuff[] = {
#ifdef CONFIG_PROC_PID_ARCH_STATUS
ONE("arch_status", S_IRUGO, proc_pid_arch_status),
#endif
+#ifdef CONFIG_SECCOMP_CACHE_DEBUG
+ ONE("seccomp_cache", S_IRUSR, proc_pid_seccomp_cache),
+#endif
};

static int proc_tgid_base_readdir(struct file *file, struct dir_context *ctx)
@@ -3587,6 +3590,9 @@ static const struct pid_entry tid_base_stuff[] = {
#ifdef CONFIG_PROC_PID_ARCH_STATUS
ONE("arch_status", S_IRUGO, proc_pid_arch_status),
#endif
+#ifdef CONFIG_SECCOMP_CACHE_DEBUG
+ ONE("seccomp_cache", S_IRUSR, proc_pid_seccomp_cache),
+#endif
};

static int proc_tid_base_readdir(struct file *file, struct dir_context *ctx)
diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h
index 02aef2844c38..1f028d55142a 100644
--- a/include/linux/seccomp.h
+++ b/include/linux/seccomp.h
@@ -121,4 +121,9 @@ static inline long seccomp_get_metadata(struct task_struct *task,
return -EINVAL;
}
#endif /* CONFIG_SECCOMP_FILTER && CONFIG_CHECKPOINT_RESTORE */
+
+#ifdef CONFIG_SECCOMP_CACHE_DEBUG
+int proc_pid_seccomp_cache(struct seq_file *m, struct pid_namespace *ns,
+ struct pid *pid, struct task_struct *task);
+#endif
#endif /* _LINUX_SECCOMP_H */
diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index 51032b41fe59..a75746d259a5 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -548,6 +548,9 @@ void seccomp_filter_release(struct task_struct *tsk)
{
struct seccomp_filter *orig = tsk->seccomp.filter;

+ /* We are effectively holding the siglock by not having any sighand. */
+ WARN_ON(tsk->sighand != NULL);
+
/* Detach task from its filter tree. */
tsk->seccomp.filter = NULL;
__seccomp_filter_release(orig);
@@ -2308,3 +2311,59 @@ static int __init seccomp_sysctl_init(void)
device_initcall(seccomp_sysctl_init)

#endif /* CONFIG_SYSCTL */
+
+#ifdef CONFIG_SECCOMP_CACHE_DEBUG
+/* Currently CONFIG_SECCOMP_CACHE_DEBUG implies SECCOMP_ARCH_NATIVE */
+static void proc_pid_seccomp_cache_arch(struct seq_file *m, const char *name,
+ const void *bitmap, size_t bitmap_size)
+{
+ int nr;
+
+ for (nr = 0; nr < bitmap_size; nr++) {
+ bool cached = test_bit(nr, bitmap);
+ char *status = cached ? "ALLOW" : "FILTER";
+
+ seq_printf(m, "%s %d %s\n", name, nr, status);
+ }
+}
+
+int proc_pid_seccomp_cache(struct seq_file *m, struct pid_namespace *ns,
+ struct pid *pid, struct task_struct *task)
+{
+ struct seccomp_filter *f;
+ unsigned long flags;
+
+ /*
+ * We don't want some sandboxed process know what their seccomp
+ * filters consist of.
+ */
+ if (!file_ns_capable(m->file, &init_user_ns, CAP_SYS_ADMIN))
+ return -EACCES;
+
+ if (!lock_task_sighand(task, &flags))
+ return 0;
+
+ f = READ_ONCE(task->seccomp.filter);
+ if (!f) {
+ unlock_task_sighand(task, &flags);
+ return 0;
+ }
+
+ /* prevent filter from being freed while we are printing it */
+ __get_seccomp_filter(f);
+ unlock_task_sighand(task, &flags);
+
+ proc_pid_seccomp_cache_arch(m, SECCOMP_ARCH_NATIVE_NAME,
+ f->cache.allow_native,
+ SECCOMP_ARCH_NATIVE_NR);
+
+#ifdef SECCOMP_ARCH_COMPAT
+ proc_pid_seccomp_cache_arch(m, SECCOMP_ARCH_COMPAT_NAME,
+ f->cache.allow_compat,
+ SECCOMP_ARCH_COMPAT_NR);
+#endif /* SECCOMP_ARCH_COMPAT */
+
+ __put_seccomp_filter(f);
+ return 0;
+}
+#endif /* CONFIG_SECCOMP_CACHE_DEBUG */
--
2.28.0


2020-10-10 04:51:32

by kernel test robot

[permalink] [raw]
Subject: Re: [PATCH v4 seccomp 5/5] seccomp/cache: Report cache data through /proc/pid/seccomp_cache

Hi YiFei,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on next-20201009]
[cannot apply to tip/x86/core tip/master linux/master linus/master v5.9-rc8 v5.9-rc7 v5.9-rc6 v5.9-rc8]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url: https://github.com/0day-ci/linux/commits/YiFei-Zhu/seccomp-Add-bitmap-cache-of-constant-allow-filter-results/20201010-013933
base: d67bc7812221606e1886620a357b13f906814af7
config: xtensa-allyesconfig (attached as .config)
compiler: xtensa-linux-gcc (GCC) 9.3.0
reproduce (this is a W=1 build):
wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# https://github.com/0day-ci/linux/commit/a1a1697444ceecfb62796ccd5ba42057c82bd295
git remote add linux-review https://github.com/0day-ci/linux
git fetch --no-tags linux-review YiFei-Zhu/seccomp-Add-bitmap-cache-of-constant-allow-filter-results/20201010-013933
git checkout a1a1697444ceecfb62796ccd5ba42057c82bd295
# save the attached .config to linux build tree
COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-9.3.0 make.cross ARCH=xtensa

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <[email protected]>

All warnings (new ones prefixed by >>):

In file included from include/linux/sched.h:22,
from include/linux/ptrace.h:6,
from arch/xtensa/kernel/asm-offsets.c:21:
>> include/linux/seccomp.h:126:35: warning: 'struct seq_file' declared inside parameter list will not be visible outside of this definition or declaration
126 | int proc_pid_seccomp_cache(struct seq_file *m, struct pid_namespace *ns,
| ^~~~~~~~
--
In file included from include/linux/sched.h:22,
from include/linux/uaccess.h:8,
from fs/proc/base.c:51:
>> include/linux/seccomp.h:126:35: warning: 'struct seq_file' declared inside parameter list will not be visible outside of this definition or declaration
126 | int proc_pid_seccomp_cache(struct seq_file *m, struct pid_namespace *ns,
| ^~~~~~~~
fs/proc/base.c:3261:32: error: initialization of 'int (*)(struct seq_file *, struct pid_namespace *, struct pid *, struct task_struct *)' from incompatible pointer type 'int (*)(struct seq_file *, struct pid_namespace *, struct pid *, struct task_struct *)' [-Werror=incompatible-pointer-types]
3261 | ONE("seccomp_cache", S_IRUSR, proc_pid_seccomp_cache),
| ^~~~~~~~~~~~~~~~~~~~~~
fs/proc/base.c:133:10: note: in definition of macro 'NOD'
133 | .op = OP, \
| ^~
fs/proc/base.c:3261:2: note: in expansion of macro 'ONE'
3261 | ONE("seccomp_cache", S_IRUSR, proc_pid_seccomp_cache),
| ^~~
fs/proc/base.c:3261:32: note: (near initialization for 'tgid_base_stuff[51].op.proc_show')
3261 | ONE("seccomp_cache", S_IRUSR, proc_pid_seccomp_cache),
| ^~~~~~~~~~~~~~~~~~~~~~
fs/proc/base.c:133:10: note: in definition of macro 'NOD'
133 | .op = OP, \
| ^~
fs/proc/base.c:3261:2: note: in expansion of macro 'ONE'
3261 | ONE("seccomp_cache", S_IRUSR, proc_pid_seccomp_cache),
| ^~~
fs/proc/base.c:3593:32: error: initialization of 'int (*)(struct seq_file *, struct pid_namespace *, struct pid *, struct task_struct *)' from incompatible pointer type 'int (*)(struct seq_file *, struct pid_namespace *, struct pid *, struct task_struct *)' [-Werror=incompatible-pointer-types]
3593 | ONE("seccomp_cache", S_IRUSR, proc_pid_seccomp_cache),
| ^~~~~~~~~~~~~~~~~~~~~~
fs/proc/base.c:133:10: note: in definition of macro 'NOD'
133 | .op = OP, \
| ^~
fs/proc/base.c:3593:2: note: in expansion of macro 'ONE'
3593 | ONE("seccomp_cache", S_IRUSR, proc_pid_seccomp_cache),
| ^~~
fs/proc/base.c:3593:32: note: (near initialization for 'tid_base_stuff[45].op.proc_show')
3593 | ONE("seccomp_cache", S_IRUSR, proc_pid_seccomp_cache),
| ^~~~~~~~~~~~~~~~~~~~~~
fs/proc/base.c:133:10: note: in definition of macro 'NOD'
133 | .op = OP, \
| ^~
fs/proc/base.c:3593:2: note: in expansion of macro 'ONE'
3593 | ONE("seccomp_cache", S_IRUSR, proc_pid_seccomp_cache),
| ^~~
cc1: some warnings being treated as errors
--
In file included from include/linux/sched.h:22,
from include/linux/blkdev.h:5,
from fs/hfsplus/inode.c:12:
>> include/linux/seccomp.h:126:35: warning: 'struct seq_file' declared inside parameter list will not be visible outside of this definition or declaration
126 | int proc_pid_seccomp_cache(struct seq_file *m, struct pid_namespace *ns,
| ^~~~~~~~
fs/hfsplus/inode.c: In function 'hfsplus_cat_read_inode':
fs/hfsplus/inode.c:501:16: warning: suggest braces around empty body in an 'if' statement [-Wempty-body]
501 | /* panic? */;
| ^
fs/hfsplus/inode.c:522:16: warning: suggest braces around empty body in an 'if' statement [-Wempty-body]
522 | /* panic? */;
| ^
fs/hfsplus/inode.c: In function 'hfsplus_cat_write_inode':
fs/hfsplus/inode.c:580:16: warning: suggest braces around empty body in an 'if' statement [-Wempty-body]
580 | /* panic? */;
| ^
fs/hfsplus/inode.c:606:16: warning: suggest braces around empty body in an 'if' statement [-Wempty-body]
606 | /* panic? */;
| ^
--
In file included from include/linux/sched.h:22,
from include/linux/mm.h:32,
from include/linux/pagemap.h:8,
from fs/hfs/inode.c:14:
>> include/linux/seccomp.h:126:35: warning: 'struct seq_file' declared inside parameter list will not be visible outside of this definition or declaration
126 | int proc_pid_seccomp_cache(struct seq_file *m, struct pid_namespace *ns,
| ^~~~~~~~
fs/hfs/inode.c: In function 'hfs_write_inode':
fs/hfs/inode.c:464:16: warning: suggest braces around empty body in an 'if' statement [-Wempty-body]
464 | /* panic? */;
| ^
fs/hfs/inode.c:485:16: warning: suggest braces around empty body in an 'if' statement [-Wempty-body]
485 | /* panic? */;
| ^
--
In file included from include/linux/sched.h:22,
from include/linux/mm.h:32,
from include/linux/pagemap.h:8,
from fs/efs/symlink.c:11:
>> include/linux/seccomp.h:126:35: warning: 'struct seq_file' declared inside parameter list will not be visible outside of this definition or declaration
126 | int proc_pid_seccomp_cache(struct seq_file *m, struct pid_namespace *ns,
| ^~~~~~~~
In file included from fs/efs/symlink.c:13:
fs/efs/efs.h:22:19: warning: 'cprt' defined but not used [-Wunused-const-variable=]
22 | static const char cprt[] = "EFS: "EFS_VERSION" - (c) 1999 Al Smith <[email protected]>";
| ^~~~
--
In file included from include/linux/sched.h:22,
from fs/jffs2/nodelist.c:15:
>> include/linux/seccomp.h:126:35: warning: 'struct seq_file' declared inside parameter list will not be visible outside of this definition or declaration
126 | int proc_pid_seccomp_cache(struct seq_file *m, struct pid_namespace *ns,
| ^~~~~~~~
fs/jffs2/nodelist.c: In function 'jffs2_add_frag_to_fragtree':
fs/jffs2/nodelist.c:255:37: warning: suggest braces around empty body in an 'else' statement [-Wempty-body]
255 | this->ofs, this->ofs + this->size);
| ^
fs/jffs2/nodelist.c:278:38: warning: suggest braces around empty body in an 'else' statement [-Wempty-body]
278 | this->ofs, this->ofs+this->size);
| ^
fs/jffs2/nodelist.c: In function 'jffs2_lookup_node_frag':
fs/jffs2/nodelist.c:558:52: warning: suggest braces around empty body in an 'else' statement [-Wempty-body]
558 | dbg_fragtree2("returning NULL, empty fragtree\n");
| ^
--
In file included from include/linux/sched.h:22,
from include/linux/mm.h:32,
from include/linux/bvec.h:14,
from fs/orangefs/inode.c:13:
>> include/linux/seccomp.h:126:35: warning: 'struct seq_file' declared inside parameter list will not be visible outside of this definition or declaration
126 | int proc_pid_seccomp_cache(struct seq_file *m, struct pid_namespace *ns,
| ^~~~~~~~
In file included from fs/orangefs/protocol.h:287,
from fs/orangefs/inode.c:14:
fs/orangefs/orangefs-debug.h:86:18: warning: 'num_kmod_keyword_mask_map' defined but not used [-Wunused-const-variable=]
86 | static const int num_kmod_keyword_mask_map = (int)
| ^~~~~~~~~~~~~~~~~~~~~~~~~
--
In file included from include/linux/sched.h:22,
from fs/btrfs/extent-tree.c:6:
>> include/linux/seccomp.h:126:35: warning: 'struct seq_file' declared inside parameter list will not be visible outside of this definition or declaration
126 | int proc_pid_seccomp_cache(struct seq_file *m, struct pid_namespace *ns,
| ^~~~~~~~
In file included from include/linux/printk.h:7,
from include/linux/kernel.h:16,
from include/asm-generic/bug.h:20,
from ./arch/xtensa/include/generated/asm/bug.h:1,
from include/linux/bug.h:5,
from include/linux/thread_info.h:12,
from arch/xtensa/include/asm/current.h:18,
from include/linux/sched.h:12,
from fs/btrfs/extent-tree.c:6:
fs/btrfs/extent-tree.c: In function '__btrfs_free_extent':
include/linux/kern_levels.h:5:18: warning: format '%lu' expects argument of type 'long unsigned int', but argument 8 has type 'unsigned int' [-Wformat=]
5 | #define KERN_SOH "\001" /* ASCII Start Of Header */
| ^~~~~~
include/linux/kern_levels.h:10:19: note: in expansion of macro 'KERN_SOH'
10 | #define KERN_CRIT KERN_SOH "2" /* critical conditions */
| ^~~~~~~~
fs/btrfs/ctree.h:3148:24: note: in expansion of macro 'KERN_CRIT'
3148 | btrfs_printk(fs_info, KERN_CRIT fmt, ##args)
| ^~~~~~~~~
fs/btrfs/extent-tree.c:3187:4: note: in expansion of macro 'btrfs_crit'
3187 | btrfs_crit(info,
| ^~~~~~~~~~
fs/btrfs/extent-tree.c:3188:83: note: format string is defined here
3188 | "invalid extent item size for key (%llu, %u, %llu) owner %llu, has %u expect >= %lu",
| ~~^
| |
| long unsigned int
| %u
--
In file included from include/linux/sched.h:22,
from include/linux/ptrace.h:6,
from arch/xtensa/kernel/asm-offsets.c:21:
>> include/linux/seccomp.h:126:35: warning: 'struct seq_file' declared inside parameter list will not be visible outside of this definition or declaration
126 | int proc_pid_seccomp_cache(struct seq_file *m, struct pid_namespace *ns,
| ^~~~~~~~

vim +126 include/linux/seccomp.h

124
125 #ifdef CONFIG_SECCOMP_CACHE_DEBUG
> 126 int proc_pid_seccomp_cache(struct seq_file *m, struct pid_namespace *ns,

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/[email protected]


Attachments:
(No filename) (12.55 kB)
.config.gz (64.56 kB)
Download all attachments

2020-10-10 05:07:43

by Jann Horn

[permalink] [raw]
Subject: Re: [PATCH v4 seccomp 5/5] seccomp/cache: Report cache data through /proc/pid/seccomp_cache

On Fri, Oct 9, 2020 at 7:15 PM YiFei Zhu <[email protected]> wrote:
> Currently the kernel does not provide an infrastructure to translate
> architecture numbers to a human-readable name. Translating syscall
> numbers to syscall names is possible through FTRACE_SYSCALL
> infrastructure but it does not provide support for compat syscalls.
>
> This will create a file for each PID as /proc/pid/seccomp_cache.
> The file will be empty when no seccomp filters are loaded, or be
> in the format of:
> <arch name> <decimal syscall number> <ALLOW | FILTER>
> where ALLOW means the cache is guaranteed to allow the syscall,
> and filter means the cache will pass the syscall to the BPF filter.
>
> For the docker default profile on x86_64 it looks like:
> x86_64 0 ALLOW
> x86_64 1 ALLOW
> x86_64 2 ALLOW
> x86_64 3 ALLOW
> [...]
> x86_64 132 ALLOW
> x86_64 133 ALLOW
> x86_64 134 FILTER
> x86_64 135 FILTER
> x86_64 136 FILTER
> x86_64 137 ALLOW
> x86_64 138 ALLOW
> x86_64 139 FILTER
> x86_64 140 ALLOW
> x86_64 141 ALLOW
> [...]
>
> This file is guarded by CONFIG_SECCOMP_CACHE_DEBUG with a default
> of N because I think certain users of seccomp might not want the
> application to know which syscalls are definitely usable. For
> the same reason, it is also guarded by CAP_SYS_ADMIN.
>
> Suggested-by: Jann Horn <[email protected]>
> Link: https://lore.kernel.org/lkml/CAG48ez3Ofqp4crXGksLmZY6=fGrF_tWyUCg7PBkAetvbbOPeOA@mail.gmail.com/
> Signed-off-by: YiFei Zhu <[email protected]>
[...]
> diff --git a/arch/Kconfig b/arch/Kconfig
[...]
> +config SECCOMP_CACHE_DEBUG
> + bool "Show seccomp filter cache status in /proc/pid/seccomp_cache"
> + depends on SECCOMP
> + depends on SECCOMP_FILTER
> + depends on PROC_FS
> + help
> + This is enables /proc/pid/seccomp_cache interface to monitor

nit: s/This is enables/This enables the/

> + seccomp cache data. The file format is subject to change. Reading
> + the file requires CAP_SYS_ADMIN.
> +
> + This option is for debugging only. Enabling present the risk that

nit: *presents

> + an adversary may be able to infer the seccomp filter logic.

[...]
> +int proc_pid_seccomp_cache(struct seq_file *m, struct pid_namespace *ns,
> + struct pid *pid, struct task_struct *task)
> +{
> + struct seccomp_filter *f;
> + unsigned long flags;
> +
> + /*
> + * We don't want some sandboxed process know what their seccomp

s/know/to know/

> + * filters consist of.
> + */
> + if (!file_ns_capable(m->file, &init_user_ns, CAP_SYS_ADMIN))
> + return -EACCES;
> +
> + if (!lock_task_sighand(task, &flags))
> + return 0;

maybe return -ESRCH here so that userspace can distinguish between an
exiting process and a process with no filters?

> + f = READ_ONCE(task->seccomp.filter);
> + if (!f) {
> + unlock_task_sighand(task, &flags);
> + return 0;
> + }
[...]

2020-10-10 23:11:16

by YiFei Zhu

[permalink] [raw]
Subject: Re: [PATCH v4 seccomp 5/5] seccomp/cache: Report cache data through /proc/pid/seccomp_cache

On Fri, Oct 9, 2020 at 6:14 PM Kees Cook <[email protected]> wrote:
> HAVE_ARCH_SECCOMP_CACHE isn't used any more. I think this was left over
> from before.

Oh, I was meant to add this to the dependencies of
SECCOMP_CACHE_DEBUG. Is this something that would make sense?

YiFei Zhu

2020-10-11 02:49:27

by Kees Cook

[permalink] [raw]
Subject: Re: [PATCH v4 seccomp 5/5] seccomp/cache: Report cache data through /proc/pid/seccomp_cache

On Fri, Oct 09, 2020 at 12:14:33PM -0500, YiFei Zhu wrote:
> From: YiFei Zhu <[email protected]>
>
> Currently the kernel does not provide an infrastructure to translate
> architecture numbers to a human-readable name. Translating syscall
> numbers to syscall names is possible through FTRACE_SYSCALL
> infrastructure but it does not provide support for compat syscalls.
>
> This will create a file for each PID as /proc/pid/seccomp_cache.
> The file will be empty when no seccomp filters are loaded, or be
> in the format of:
> <arch name> <decimal syscall number> <ALLOW | FILTER>
> where ALLOW means the cache is guaranteed to allow the syscall,
> and filter means the cache will pass the syscall to the BPF filter.
>
> For the docker default profile on x86_64 it looks like:
> x86_64 0 ALLOW
> x86_64 1 ALLOW
> x86_64 2 ALLOW
> x86_64 3 ALLOW
> [...]
> x86_64 132 ALLOW
> x86_64 133 ALLOW
> x86_64 134 FILTER
> x86_64 135 FILTER
> x86_64 136 FILTER
> x86_64 137 ALLOW
> x86_64 138 ALLOW
> x86_64 139 FILTER
> x86_64 140 ALLOW
> x86_64 141 ALLOW
> [...]
>
> This file is guarded by CONFIG_SECCOMP_CACHE_DEBUG with a default
> of N because I think certain users of seccomp might not want the
> application to know which syscalls are definitely usable. For
> the same reason, it is also guarded by CAP_SYS_ADMIN.
>
> Suggested-by: Jann Horn <[email protected]>
> Link: https://lore.kernel.org/lkml/CAG48ez3Ofqp4crXGksLmZY6=fGrF_tWyUCg7PBkAetvbbOPeOA@mail.gmail.com/
> Signed-off-by: YiFei Zhu <[email protected]>
> ---
> arch/Kconfig | 24 ++++++++++++++
> arch/x86/Kconfig | 1 +
> arch/x86/include/asm/seccomp.h | 3 ++
> fs/proc/base.c | 6 ++++
> include/linux/seccomp.h | 5 +++
> kernel/seccomp.c | 59 ++++++++++++++++++++++++++++++++++
> 6 files changed, 98 insertions(+)
>
> diff --git a/arch/Kconfig b/arch/Kconfig
> index 21a3675a7a3a..85239a974f04 100644
> --- a/arch/Kconfig
> +++ b/arch/Kconfig
> @@ -471,6 +471,15 @@ config HAVE_ARCH_SECCOMP_FILTER
> results in the system call being skipped immediately.
> - seccomp syscall wired up
>
> +config HAVE_ARCH_SECCOMP_CACHE
> + bool
> + help
> + An arch should select this symbol if it provides all of these things:
> + - all the requirements for HAVE_ARCH_SECCOMP_FILTER
> + - SECCOMP_ARCH_NATIVE
> + - SECCOMP_ARCH_NATIVE_NR
> + - SECCOMP_ARCH_NATIVE_NAME
> +
> [...]
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index 1ab22869a765..1a807f89ac77 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -150,6 +150,7 @@ config X86
> select HAVE_ARCH_COMPAT_MMAP_BASES if MMU && COMPAT
> select HAVE_ARCH_PREL32_RELOCATIONS
> select HAVE_ARCH_SECCOMP_FILTER
> + select HAVE_ARCH_SECCOMP_CACHE
> select HAVE_ARCH_THREAD_STRUCT_WHITELIST
> select HAVE_ARCH_STACKLEAK
> select HAVE_ARCH_TRACEHOOK

HAVE_ARCH_SECCOMP_CACHE isn't used any more. I think this was left over
from before.

--
Kees Cook

2020-10-13 11:14:03

by Kees Cook

[permalink] [raw]
Subject: Re: [PATCH v4 seccomp 5/5] seccomp/cache: Report cache data through /proc/pid/seccomp_cache

On Sat, Oct 10, 2020 at 08:26:16AM -0500, YiFei Zhu wrote:
> On Fri, Oct 9, 2020 at 6:14 PM Kees Cook <[email protected]> wrote:
> > HAVE_ARCH_SECCOMP_CACHE isn't used any more. I think this was left over
> > from before.
>
> Oh, I was meant to add this to the dependencies of
> SECCOMP_CACHE_DEBUG. Is this something that would make sense?

I think it's fine to just have this "dangle" with a help text update of
"if seccomp action caching is supported by the architecture, provide the
/proc/$pid ..."

--
Kees Cook

2020-10-13 11:14:43

by YiFei Zhu

[permalink] [raw]
Subject: Re: [PATCH v4 seccomp 5/5] seccomp/cache: Report cache data through /proc/pid/seccomp_cache

On Mon, Oct 12, 2020 at 5:57 PM Kees Cook <[email protected]> wrote:
> I think it's fine to just have this "dangle" with a help text update of
> "if seccomp action caching is supported by the architecture, provide the
> /proc/$pid ..."

I think it would be weird if someone sees this help text and wonder...
"hmm does my architecture support seccomp action caching" and without
a clear pointer to how seccomp action cache works, goes and compiles
the kernel with this config option on for the purpose of knowing if
their arch supports it... Or, is it a common practice in the kernel to
leave dangling configs?

YiFei Zhu

2020-10-23 05:44:11

by YiFei Zhu

[permalink] [raw]
Subject: Re: [PATCH v4 seccomp 5/5] seccomp/cache: Report cache data through /proc/pid/seccomp_cache

On Mon, Oct 12, 2020 at 7:31 PM YiFei Zhu <[email protected]> wrote:
>
> On Mon, Oct 12, 2020 at 5:57 PM Kees Cook <[email protected]> wrote:
> > I think it's fine to just have this "dangle" with a help text update of
> > "if seccomp action caching is supported by the architecture, provide the
> > /proc/$pid ..."
>
> I think it would be weird if someone sees this help text and wonder...
> "hmm does my architecture support seccomp action caching" and without
> a clear pointer to how seccomp action cache works, goes and compiles
> the kernel with this config option on for the purpose of knowing if
> their arch supports it... Or, is it a common practice in the kernel to
> leave dangling configs?

Bump, in case this question was missed. I don't really want to miss
the 5.10 merge window...

YiFei Zhu

2020-10-23 06:36:28

by Kees Cook

[permalink] [raw]
Subject: Re: [PATCH v4 seccomp 5/5] seccomp/cache: Report cache data through /proc/pid/seccomp_cache

On Thu, Oct 22, 2020 at 03:52:20PM -0500, YiFei Zhu wrote:
> On Mon, Oct 12, 2020 at 7:31 PM YiFei Zhu <[email protected]> wrote:
> >
> > On Mon, Oct 12, 2020 at 5:57 PM Kees Cook <[email protected]> wrote:
> > > I think it's fine to just have this "dangle" with a help text update of
> > > "if seccomp action caching is supported by the architecture, provide the
> > > /proc/$pid ..."
> >
> > I think it would be weird if someone sees this help text and wonder...
> > "hmm does my architecture support seccomp action caching" and without
> > a clear pointer to how seccomp action cache works, goes and compiles
> > the kernel with this config option on for the purpose of knowing if
> > their arch supports it... Or, is it a common practice in the kernel to
> > leave dangling configs?
>
> Bump, in case this question was missed.

I've been going back and forth on this, and I think what I've settled
on is I'd like to avoid new CONFIG dependencies just for this feature.
Instead, how about we just fill in SECCOMP_NATIVE and SECCOMP_COMPAT
for all the HAVE_ARCH_SECCOMP_FILTER architectures, and then the
cache reporting can be cleanly tied to CONFIG_SECCOMP_FILTER? It
should be relatively simple to extract those details and make
SECCOMP_ARCH_{NATIVE,COMPAT}_NAME part of the per-arch enabling patches?

> I don't really want to miss the 5.10 merge window...

Sorry, the 5.10 merge window is already closed for stuff that hasn't
already been in -next. Most subsystem maintainers (myself included)
don't take new features into their trees between roughly N-rc6 and
(N+1)-rc1. My plan is to put this in my -next tree after -rc1 is released
(expected to be Oct 25th).

I'd still like to get more specific workload performance numbers too.
The microbenchmark is nice, but getting things like build times under
docker's default seccomp filter, etc would be lovely. I've almost gotten
there, but my benchmarks are still really noisy and CPU isolation
continues to frustrate me. :)

--
Kees Cook

2020-10-23 06:53:10

by YiFei Zhu

[permalink] [raw]
Subject: Re: [PATCH v4 seccomp 5/5] seccomp/cache: Report cache data through /proc/pid/seccomp_cache

On Thu, Oct 22, 2020 at 5:32 PM Kees Cook <[email protected]> wrote:
> I've been going back and forth on this, and I think what I've settled
> on is I'd like to avoid new CONFIG dependencies just for this feature.
> Instead, how about we just fill in SECCOMP_NATIVE and SECCOMP_COMPAT
> for all the HAVE_ARCH_SECCOMP_FILTER architectures, and then the
> cache reporting can be cleanly tied to CONFIG_SECCOMP_FILTER? It
> should be relatively simple to extract those details and make
> SECCOMP_ARCH_{NATIVE,COMPAT}_NAME part of the per-arch enabling patches?

Hmm. So I could enable the cache logic to every architecture (one
patch per arch) that does not have the sparse syscall numbers, and
then have the proc reporting after the arch patches? I could do that.
I don't have test machines to run anything other than x86_64 or ia32,
so they will need a closer look by people more familiar with those
arches.

> I'd still like to get more specific workload performance numbers too.
> The microbenchmark is nice, but getting things like build times under
> docker's default seccomp filter, etc would be lovely. I've almost gotten
> there, but my benchmarks are still really noisy and CPU isolation
> continues to frustrate me. :)

Ok, let me know if I can help.

YiFei Zhu

2020-10-24 02:57:33

by Kees Cook

[permalink] [raw]
Subject: Re: [PATCH v4 seccomp 5/5] seccomp/cache: Report cache data through /proc/pid/seccomp_cache

On Thu, Oct 22, 2020 at 06:40:08PM -0500, YiFei Zhu wrote:
> On Thu, Oct 22, 2020 at 5:32 PM Kees Cook <[email protected]> wrote:
> > I've been going back and forth on this, and I think what I've settled
> > on is I'd like to avoid new CONFIG dependencies just for this feature.
> > Instead, how about we just fill in SECCOMP_NATIVE and SECCOMP_COMPAT
> > for all the HAVE_ARCH_SECCOMP_FILTER architectures, and then the
> > cache reporting can be cleanly tied to CONFIG_SECCOMP_FILTER? It
> > should be relatively simple to extract those details and make
> > SECCOMP_ARCH_{NATIVE,COMPAT}_NAME part of the per-arch enabling patches?
>
> Hmm. So I could enable the cache logic to every architecture (one
> patch per arch) that does not have the sparse syscall numbers, and
> then have the proc reporting after the arch patches? I could do that.
> I don't have test machines to run anything other than x86_64 or ia32,
> so they will need a closer look by people more familiar with those
> arches.

Cool, yes please. It looks like MIPS will need to be skipped for now. I
would have the debug cache reporting patch then depend on
!CONFIG_HAVE_SPARSE_SYSCALL_NR.

> > I'd still like to get more specific workload performance numbers too.
> > The microbenchmark is nice, but getting things like build times under
> > docker's default seccomp filter, etc would be lovely. I've almost gotten
> > there, but my benchmarks are still really noisy and CPU isolation
> > continues to frustrate me. :)
>
> Ok, let me know if I can help.

Do you have a test environment where you can compare the before/after
of repeated kernel build times (or some other sufficiently
complex/interesting) workload under these conditions:

bare metal
docker w/ seccomp policy disabled
docker w/ default seccomp policy

This is what I've been trying to construct, but it's really noisy, so
I've been trying to pin CPUs and NUMA memory nodes, but it's not really
helping yet. :P

--
Kees Cook

2020-10-30 12:29:03

by YiFei Zhu

[permalink] [raw]
Subject: Re: [PATCH v4 seccomp 5/5] seccomp/cache: Report cache data through /proc/pid/seccomp_cache

On Fri, Oct 23, 2020 at 9:51 PM Kees Cook <[email protected]> wrote:
> Do you have a test environment where you can compare the before/after
> of repeated kernel build times (or some other sufficiently
> complex/interesting) workload under these conditions:
>
> bare metal
> docker w/ seccomp policy disabled
> docker w/ default seccomp policy
>
> This is what I've been trying to construct, but it's really noisy, so
> I've been trying to pin CPUs and NUMA memory nodes, but it's not really
> helping yet. :P

Hi, sorry for the delay. The benchmarks took a while to get.

I got a bare metal test machine with Intel(R) Xeon(R) CPU E5-2660 v3 @
2.60GHz, running Ubuntu 18.04. Test kernels are compiled at
57a339117e52 ("selftests/seccomp: Compare bitmap vs filter overhead")
and 3650b228f83a ("Linux 5.10-rc1"), built with Ubuntu's
5.3.0-64-generic's config, then `make olddefconfig`. "Mitigations off"
indicate the kernel was booted with "nospectre_v2 nospectre_v1
no_stf_barrier tsx=off tsx_async_abort=off".

The benchmark was single-job make on x86_64 defconfig of 5.9.1, with
CPU affinity to set only processor #0. Raw results are appended below.
Each boot is tested by running the build directly and inside docker,
with and without seccomp. The commands used are attached below. Each
test is 4 trials, with the middle two (non-minimum, non-maximum) wall
clock time averaged. Results summary:

Mitigations On Mitigations Off
With Cache Without Cache With Cache Without Cache
Native 18:17.38 18:13.78 18:16.08 18:15.67
D. no seccomp 18:15.54 18:17.71 18:17.58 18:16.75
D. + seccomp 20:42.47 20:45.04 18:47.67 18:49.01

To be honest, I'm somewhat surprised that it didn't produce as much of
a dent in the seccomp overhead in this macro benchmark as I had
expected.

Below are commands used and outputs from time command.

Commands used to start the docker containers:
docker run -w /srv/yifeifz2/linux-buildtest \
--tmpfs /srv/yifeifz2/linux-buildtest:exec --rm -it ubuntu:18.04

docker run -w /srv/yifeifz2/linux-buildtest \
--tmpfs /srv/yifeifz2/linux-buildtest:exec --rm -it \
--security-opt seccomp=unconfined ubuntu:18.04

Commands used to install the toolchain inside docker:
apt -y update
apt -y dist-upgrade
apt -y install build-essential wget flex bison time libssl-dev bc libelf-dev

Commands to benchmark on native:
for i in {1..4}; do
mkdir -p /srv/yifeifz2/linux-buildtest
mount -t tmpfs tmpfs /srv/yifeifz2/linux-buildtest
pushd /srv/yifeifz2/linux-buildtest
wget https://cdn.kernel.org/pub/linux/kernel/v5.x/linux-5.9.1.tar.xz
tar xf linux-5.9.1.tar.xz
cd linux-5.9.1
make mrproper
make defconfig
taskset 0x1 time make -j1 > /dev/null
popd
umount /srv/yifeifz2/linux-buildtest
done

Commands to benchmark inside docker:
for i in {1..4}; do
wget https://cdn.kernel.org/pub/linux/kernel/v5.x/linux-5.9.1.tar.xz
tar xf linux-5.9.1.tar.xz
pushd linux-5.9.1
make mrproper
make defconfig
taskset 0x1 time make -j1 > /dev/null
popd
rm -rf linux-5.9.1 linux-5.9.1.tar.xz
done

==== with cache, mitigations on ====

973.52user 113.98system 18:16.51elapsed 99%CPU (0avgtext+0avgdata
239784maxresident)k
0inputs+217152outputs (0major+51937662minor)pagefaults 0swaps

973.74user 115.35system 18:18.41elapsed 99%CPU (0avgtext+0avgdata
239640maxresident)k
0inputs+217152outputs (0major+51933865minor)pagefaults 0swaps

973.31user 114.41system 18:17.37elapsed 99%CPU (0avgtext+0avgdata
239660maxresident)k
72inputs+217152outputs (0major+51936343minor)pagefaults 0swaps

971.76user 116.04system 18:17.39elapsed 99%CPU (0avgtext+0avgdata
239588maxresident)k
0inputs+217152outputs (0major+51936222minor)pagefaults 0swaps


961.44user 121.30system 18:15.30elapsed 98%CPU (0avgtext+0avgdata
239580maxresident)k
0inputs+217152outputs (0major+51555371minor)pagefaults 0swaps

961.86user 119.48system 18:13.96elapsed 98%CPU (0avgtext+0avgdata
239480maxresident)k
0inputs+217152outputs (0major+51552153minor)pagefaults 0swaps

961.68user 121.75system 18:15.78elapsed 98%CPU (0avgtext+0avgdata
239504maxresident)k
0inputs+217152outputs (0major+51559201minor)pagefaults 0swaps

960.80user 122.04system 18:18.99elapsed 98%CPU (0avgtext+0avgdata
239644maxresident)k
0inputs+217152outputs (0major+51557386minor)pagefaults 0swaps


1104.08user 124.48system 20:42.13elapsed 98%CPU (0avgtext+0avgdata
239544maxresident)k
984inputs+217152outputs (21major+51552022minor)pagefaults 0swaps

1101.78user 125.66system 20:40.80elapsed 98%CPU (0avgtext+0avgdata
239692maxresident)k
0inputs+217152outputs (0major+51546446minor)pagefaults 0swaps

1102.98user 126.03system 20:43.09elapsed 98%CPU (0avgtext+0avgdata
239592maxresident)k
0inputs+217152outputs (0major+51551238minor)pagefaults 0swaps

1103.34user 125.32system 20:42.82elapsed 98%CPU (0avgtext+0avgdata
239620maxresident)k
0inputs+217152outputs (0major+51554493minor)pagefaults 0swaps


==== without cache, mitigations on ====

967.19user 115.77system 18:17.20elapsed 98%CPU (0avgtext+0avgdata
239536maxresident)k
25112inputs+217152outputs (166major+51935958minor)pagefaults 0swaps

969.05user 114.18system 18:12.92elapsed 99%CPU (0avgtext+0avgdata
239544maxresident)k
0inputs+217152outputs (0major+51938961minor)pagefaults 0swaps

968.51user 116.50system 18:14.64elapsed 99%CPU (0avgtext+0avgdata
239716maxresident)k
0inputs+217152outputs (0major+51937686minor)pagefaults 0swaps

968.53user 115.13system 18:10.33elapsed 99%CPU (0avgtext+0avgdata
239628maxresident)k
0inputs+217152outputs (0major+51938033minor)pagefaults 0swaps


962.85user 121.56system 18:17.73elapsed 98%CPU (0avgtext+0avgdata
239736maxresident)k
0inputs+217152outputs (0major+51549715minor)pagefaults 0swaps

962.51user 121.74system 18:17.42elapsed 98%CPU (0avgtext+0avgdata
239480maxresident)k
0inputs+217152outputs (0major+51558249minor)pagefaults 0swaps

963.37user 121.24system 18:18.59elapsed 98%CPU (0avgtext+0avgdata
239224maxresident)k
0inputs+217152outputs (0major+51551031minor)pagefaults 0swaps

963.71user 120.75system 18:17.70elapsed 98%CPU (0avgtext+0avgdata
239460maxresident)k
0inputs+217152outputs (0major+51555583minor)pagefaults 0swaps


1103.35user 126.49system 20:45.59elapsed 98%CPU (0avgtext+0avgdata
239600maxresident)k
984inputs+217152outputs (21major+51557916minor)pagefaults 0swaps

1103.01user 126.69system 20:45.36elapsed 98%CPU (0avgtext+0avgdata
239708maxresident)k
232inputs+217152outputs (0major+51560311minor)pagefaults 0swaps

1102.97user 127.13system 20:44.73elapsed 98%CPU (0avgtext+0avgdata
239440maxresident)k
0inputs+217152outputs (0major+51552998minor)pagefaults 0swaps

1103.09user 127.01system 20:44.48elapsed 98%CPU (0avgtext+0avgdata
239448maxresident)k
0inputs+217152outputs (0major+51559328minor)pagefaults 0swaps


==== with cache, mitigations off ====

971.35user 114.45system 18:16.36elapsed 99%CPU (0avgtext+0avgdata
239740maxresident)k
1584inputs+217152outputs (10major+51937572minor)pagefaults 0swaps

971.75user 115.18system 18:16.04elapsed 99%CPU (0avgtext+0avgdata
239648maxresident)k
0inputs+217152outputs (0major+51944016minor)pagefaults 0swaps

972.03user 114.47system 18:16.12elapsed 99%CPU (0avgtext+0avgdata
239368maxresident)k
744inputs+217152outputs (0major+51946745minor)pagefaults 0swaps

970.59user 115.13system 18:15.21elapsed 99%CPU (0avgtext+0avgdata
239736maxresident)k
0inputs+217152outputs (1major+51936971minor)pagefaults 0swaps


964.13user 121.15system 18:17.44elapsed 98%CPU (0avgtext+0avgdata
239496maxresident)k
0inputs+217152outputs (0major+51554855minor)pagefaults 0swaps

964.46user 120.73system 18:16.89elapsed 98%CPU (0avgtext+0avgdata
239492maxresident)k
0inputs+217152outputs (0major+51563668minor)pagefaults 0swaps

964.00user 121.71system 18:18.42elapsed 98%CPU (0avgtext+0avgdata
239504maxresident)k
0inputs+217152outputs (0major+51549101minor)pagefaults 0swaps

963.99user 121.46system 18:17.72elapsed 98%CPU (0avgtext+0avgdata
239644maxresident)k
0inputs+217152outputs (0major+51561705minor)pagefaults 0swaps


993.01user 123.83system 18:47.73elapsed 99%CPU (0avgtext+0avgdata
239648maxresident)k
984inputs+217152outputs (21major+51554203minor)pagefaults 0swaps

991.53user 125.49system 18:47.28elapsed 99%CPU (0avgtext+0avgdata
239380maxresident)k
0inputs+217152outputs (0major+51557014minor)pagefaults 0swaps

992.52user 124.53system 18:47.61elapsed 99%CPU (0avgtext+0avgdata
239344maxresident)k
0inputs+217152outputs (0major+51555681minor)pagefaults 0swaps

993.47user 125.01system 18:48.98elapsed 99%CPU (0avgtext+0avgdata
239448maxresident)k
0inputs+217152outputs (0major+51558830minor)pagefaults 0swaps


==== without cache, mitigations off ====

969.87user 118.18system 18:16.82elapsed 99%CPU (0avgtext+0avgdata
239640maxresident)k
0inputs+217152outputs (0major+51937042minor)pagefaults 0swaps

971.42user 114.62system 18:14.93elapsed 99%CPU (0avgtext+0avgdata
239840maxresident)k
0inputs+217152outputs (0major+51937617minor)pagefaults 0swaps

971.73user 114.40system 18:15.39elapsed 99%CPU (0avgtext+0avgdata
239724maxresident)k
0inputs+217152outputs (0major+51937768minor)pagefaults 0swaps

969.71user 117.13system 18:15.95elapsed 99%CPU (0avgtext+0avgdata
239680maxresident)k
0inputs+217152outputs (0major+51940505minor)pagefaults 0swaps


963.51user 121.32system 18:16.91elapsed 98%CPU (0avgtext+0avgdata
239516maxresident)k
0inputs+217152outputs (0major+51561337minor)pagefaults 0swaps

963.10user 120.75system 18:17.34elapsed 98%CPU (0avgtext+0avgdata
239464maxresident)k
0inputs+217152outputs (0major+51547338minor)pagefaults 0swaps

962.27user 122.48system 18:16.59elapsed 98%CPU (0avgtext+0avgdata
239544maxresident)k
0inputs+217152outputs (0major+51552060minor)pagefaults 0swaps

962.83user 120.21system 18:15.37elapsed 98%CPU (0avgtext+0avgdata
239496maxresident)k
0inputs+217152outputs (0major+51553345minor)pagefaults 0swaps


990.69user 125.78system 18:48.93elapsed 98%CPU (0avgtext+0avgdata
239440maxresident)k
984inputs+217152outputs (21major+51558142minor)pagefaults 0swaps

990.76user 126.01system 18:48.88elapsed 98%CPU (0avgtext+0avgdata
239800maxresident)k
0inputs+217152outputs (0major+51558483minor)pagefaults 0swaps

991.06user 125.99system 18:49.30elapsed 98%CPU (0avgtext+0avgdata
239412maxresident)k
0inputs+217152outputs (0major+51556462minor)pagefaults 0swaps

992.33user 124.77system 18:49.09elapsed 98%CPU (0avgtext+0avgdata
239684maxresident)k
0inputs+217152outputs (0major+51549745minor)pagefaults 0swaps

YiFei Zhu

2020-11-03 13:10:14

by YiFei Zhu

[permalink] [raw]
Subject: Re: [PATCH v4 seccomp 5/5] seccomp/cache: Report cache data through /proc/pid/seccomp_cache

On Fri, Oct 30, 2020 at 7:18 AM YiFei Zhu <[email protected]> wrote:
> I got a bare metal test machine with Intel(R) Xeon(R) CPU E5-2660 v3 @
> 2.60GHz, running Ubuntu 18.04. Test kernels are compiled at
> 57a339117e52 ("selftests/seccomp: Compare bitmap vs filter overhead")
> and 3650b228f83a ("Linux 5.10-rc1"), built with Ubuntu's
> 5.3.0-64-generic's config, then `make olddefconfig`. "Mitigations off"
> indicate the kernel was booted with "nospectre_v2 nospectre_v1
> no_stf_barrier tsx=off tsx_async_abort=off".
>
> The benchmark was single-job make on x86_64 defconfig of 5.9.1, with
> CPU affinity to set only processor #0. Raw results are appended below.
> Each boot is tested by running the build directly and inside docker,
> with and without seccomp. The commands used are attached below. Each
> test is 4 trials, with the middle two (non-minimum, non-maximum) wall
> clock time averaged. Results summary:
>
> Mitigations On Mitigations Off
> With Cache Without Cache With Cache Without Cache
> Native 18:17.38 18:13.78 18:16.08 18:15.67
> D. no seccomp 18:15.54 18:17.71 18:17.58 18:16.75
> D. + seccomp 20:42.47 20:45.04 18:47.67 18:49.01
>
> To be honest, I'm somewhat surprised that it didn't produce as much of
> a dent in the seccomp overhead in this macro benchmark as I had
> expected.

My peers pointed out that in my previous benchmark there are still a
few mitigations left on, and suggested to use "noibrs noibpb nopti
nospectre_v2 nospectre_v1 l1tf=off nospec_store_bypass_disable
no_stf_barrier mds=off tsx=on tsx_async_abort=off mitigations=off".
Results with "Mitigations Off" updated:

Mitigations On Mitigations Off
With Cache Without Cache With Cache Without Cache
Native 18:17.38 18:13.78 17:43.42 17:47.68
D. no seccomp 18:15.54 18:17.71 17:34.59 17:37.54
D. + seccomp 20:42.47 20:45.04 17:35.70 17:37.16

Whether seccomp is on or off seems not to make much of a difference
for this benchmark. Bitmap being enabled does seem to decrease the
overall compilation time but it also affects where seccomp is off, so
the speedup is probably from other factors. We are thinking about
using more syscall-intensive workloads, such as httpd.

Thugh, this does make me wonder, where does the 3-minute overhead with
seccomp with mitigations come from? Is it data cache misses? If that
is the case, can we somehow preload the seccomp bitmap cache maybe? I
mean, mitigations only cause around half a minute slowdown without
seccomp but seccomp somehow amplify the slowdown with an additional
2.5 minutes, so something must be off here.

This is the raw output for the time commands:

==== with cache, mitigations off ====

947.02user 108.62system 17:47.65elapsed 98%CPU (0avgtext+0avgdata
239804maxresident)k
25112inputs+217152outputs (166major+51934447minor)pagefaults 0swaps

947.91user 108.20system 17:46.53elapsed 99%CPU (0avgtext+0avgdata
239576maxresident)k
0inputs+217152outputs (0major+51941524minor)pagefaults 0swaps

948.33user 108.70system 17:47.72elapsed 98%CPU (0avgtext+0avgdata
239604maxresident)k
0inputs+217152outputs (0major+51938566minor)pagefaults 0swaps

948.65user 108.81system 17:48.41elapsed 98%CPU (0avgtext+0avgdata
239692maxresident)k
0inputs+217152outputs (0major+51935349minor)pagefaults 0swaps


932.12user 113.68system 17:37.24elapsed 98%CPU (0avgtext+0avgdata
239660maxresident)k
0inputs+217152outputs (0major+51547571minor)pagefaults 0swap

931.69user 114.12system 17:37.84elapsed 98%CPU (0avgtext+0avgdata
239448maxresident)k
0inputs+217152outputs (0major+51539964minor)pagefaults 0swaps

932.25user 113.39system 17:37.75elapsed 98%CPU (0avgtext+0avgdata
239372maxresident)k
0inputs+217152outputs (0major+51538018minor)pagefaults 0swaps

931.09user 114.25system 17:37.34elapsed 98%CPU (0avgtext+0avgdata
239508maxresident)k
0inputs+217152outputs (0major+51537700minor)pagefaults 0swaps


929.96user 113.42system 17:36.23elapsed 98%CPU (0avgtext+0avgdata
239448maxresident)k
984inputs+217152outputs (22major+51544059minor)pagefaults 0swaps

929.73user 115.13system 17:38.09elapsed 98%CPU (0avgtext+0avgdata
239464maxresident)k
0inputs+217152outputs (0major+51540259minor)pagefaults 0swaps

930.13user 112.71system 17:36.17elapsed 98%CPU (0avgtext+0avgdata
239620maxresident)k
0inputs+217152outputs (0major+51540623minor)pagefaults 0swaps

930.57user 113.02system 17:49.70elapsed 97%CPU (0avgtext+0avgdata
239432maxresident)k
0inputs+217152outputs (0major+51537776minor)pagefaults 0swaps

==== without cache, mitigations off ====

947.59user 108.06system 17:44.56elapsed 99%CPU (0avgtext+0avgdata
239484maxresident)k
25112inputs+217152outputs (167major+51938723minor)pagefaults 0swaps

947.95user 108.58system 17:43.40elapsed 99%CPU (0avgtext+0avgdata
239580maxresident)k
0inputs+217152outputs (0major+51943434minor)pagefaults 0swaps

948.54user 106.62system 17:42.39elapsed 99%CPU (0avgtext+0avgdata
239608maxresident)k
0inputs+217152outputs (0major+51936408minor)pagefaults 0swaps

947.85user 107.92system 17:43.44elapsed 99%CPU (0avgtext+0avgdata
239656maxresident)k
0inputs+217152outputs (0major+51931633minor)pagefaults 0swaps


931.28user 111.16system 17:33.59elapsed 98%CPU (0avgtext+0avgdata
239440maxresident)k
0inputs+217152outputs (0major+51543540minor)pagefaults 0swaps

930.21user 112.56system 17:34.20elapsed 98%CPU (0avgtext+0avgdata
239400maxresident)k
0inputs+217152outputs (0major+51539699minor)pagefaults 0swaps

930.16user 113.74system 17:35.06elapsed 98%CPU (0avgtext+0avgdata
239344maxresident)k
0inputs+217152outputs (0major+51543072minor)pagefaults 0swaps

930.17user 112.77system 17:34.98elapsed 98%CPU (0avgtext+0avgdata
239176maxresident)k
0inputs+217152outputs (0major+51540777minor)pagefaults 0swaps


931.92user 113.31system 17:36.05elapsed 98%CPU (0avgtext+0avgdata
239520maxresident)k
984inputs+217152outputs (22major+51534636minor)pagefaults 0swaps

931.14user 112.81system 17:35.35elapsed 98%CPU (0avgtext+0avgdata
239524maxresident)k
0inputs+217152outputs (0major+51549007minor)pagefaults 0swaps

930.93user 114.56system 17:37.72elapsed 98%CPU (0avgtext+0avgdata
239360maxresident)k
0inputs+217152outputs (0major+51542191minor)pagefaults 0swaps

932.26user 111.54system 17:35.36elapsed 98%CPU (0avgtext+0avgdata
239572maxresident)k
0inputs+217152outputs (0major+51537921minor)pagefaults 0swaps

YiFei Zhu

2020-11-04 00:38:05

by Kees Cook

[permalink] [raw]
Subject: Re: [PATCH v4 seccomp 5/5] seccomp/cache: Report cache data through /proc/pid/seccomp_cache

On Tue, Nov 03, 2020 at 07:00:22AM -0600, YiFei Zhu wrote:
> My peers pointed out that in my previous benchmark there are still a
> few mitigations left on, and suggested to use "noibrs noibpb nopti
> nospectre_v2 nospectre_v1 l1tf=off nospec_store_bypass_disable
> no_stf_barrier mds=off tsx=on tsx_async_abort=off mitigations=off".
> Results with "Mitigations Off" updated:
>
> Mitigations On Mitigations Off
> With Cache Without Cache With Cache Without Cache
> Native 18:17.38 18:13.78 17:43.42 17:47.68
> D. no seccomp 18:15.54 18:17.71 17:34.59 17:37.54
> D. + seccomp 20:42.47 20:45.04 17:35.70 17:37.16
>
> Whether seccomp is on or off seems not to make much of a difference
> for this benchmark. Bitmap being enabled does seem to decrease the
> overall compilation time but it also affects where seccomp is off, so
> the speedup is probably from other factors. We are thinking about
> using more syscall-intensive workloads, such as httpd.

Yeah, this is very interesting. That there is anything measurably _slower_
with the cache is surprising. Though with only 4 runs, I wonder if it's
still noisy? What happens at 10 runs -- more importantly what is the
standard deviation?

> Thugh, this does make me wonder, where does the 3-minute overhead with
> seccomp with mitigations come from? Is it data cache misses? If that
> is the case, can we somehow preload the seccomp bitmap cache maybe? I
> mean, mitigations only cause around half a minute slowdown without
> seccomp but seccomp somehow amplify the slowdown with an additional
> 2.5 minutes, so something must be off here.

I assume this is from Indirect Branch Prediction Barrier (IBPB) and
Single Threaded Indirect Branch Prediction (STIBP) (which get enabled
for threads under seccomp by default).

Try booting with "spectre_v2_user=prctl"

https://www.kernel.org/doc/html/latest/admin-guide/hw-vuln/spectre.html#spectre-mitigation-control-command-line

--
Kees Cook

2020-11-04 11:49:44

by YiFei Zhu

[permalink] [raw]
Subject: Re: [PATCH v4 seccomp 5/5] seccomp/cache: Report cache data through /proc/pid/seccomp_cache

On Tue, Nov 3, 2020 at 6:29 PM Kees Cook <[email protected]> wrote:
> Yeah, this is very interesting. That there is anything measurably _slower_
> with the cache is surprising. Though with only 4 runs, I wonder if it's
> still noisy? What happens at 10 runs -- more importantly what is the
> standard deviation?

I could do that. it just takes such a long time. Each run takes about
20 minutes so with 10 runs per environment, 3 environments (native + 2
docker) per boot, and 4 boots (2 bootparam * 2 compile config), it's
27 hours of compilation. I should probably script it at that point.

> I assume this is from Indirect Branch Prediction Barrier (IBPB) and
> Single Threaded Indirect Branch Prediction (STIBP) (which get enabled
> for threads under seccomp by default).
>
> Try booting with "spectre_v2_user=prctl"

Hmm, to make sure, boot with just "spectre_v2_user=prctl" on the
command line and test the performance of that?

YiFei Zhu

2020-11-05 00:46:12

by Kees Cook

[permalink] [raw]
Subject: Re: [PATCH v4 seccomp 5/5] seccomp/cache: Report cache data through /proc/pid/seccomp_cache

On Wed, Nov 04, 2020 at 05:40:51AM -0600, YiFei Zhu wrote:
> On Tue, Nov 3, 2020 at 6:29 PM Kees Cook <[email protected]> wrote:
> > Yeah, this is very interesting. That there is anything measurably _slower_
> > with the cache is surprising. Though with only 4 runs, I wonder if it's
> > still noisy? What happens at 10 runs -- more importantly what is the
> > standard deviation?
>
> I could do that. it just takes such a long time. Each run takes about
> 20 minutes so with 10 runs per environment, 3 environments (native + 2
> docker) per boot, and 4 boots (2 bootparam * 2 compile config), it's
> 27 hours of compilation. I should probably script it at that point.

Yeah, I was facing the same issues. Though perhaps hackbench (with
multiple CPUs) would be a better test (and it's much faster):

https://lore.kernel.org/lkml/[email protected]/

(I usually run this with a CNT of 20 to get quick results.)

> > I assume this is from Indirect Branch Prediction Barrier (IBPB) and
> > Single Threaded Indirect Branch Prediction (STIBP) (which get enabled
> > for threads under seccomp by default).
> >
> > Try booting with "spectre_v2_user=prctl"
>
> Hmm, to make sure, boot with just "spectre_v2_user=prctl" on the
> command line and test the performance of that?

Right, see if that eliminates the 3 minute jump seen for seccomp.

--
Kees Cook