2023-03-08 07:32:18

by Andrei Vagin

Subject: [PATCH 0/6 v5 RESEND] seccomp: add the synchronous mode for seccomp_unotify

seccomp_unotify allows more privileged processes to perform actions
on behalf of less privileged processes.

In many cases, the workflow is fully synchronous: a target process
triggers a system call and passes control to a supervisor process that
handles the system call and returns control back to the target
process. In this context, "synchronous" means that only one process is
running while the other one is waiting.

The new WF_CURRENT_CPU flag advises the scheduler to move the wakee to
the current CPU. For such synchronous workflows, it makes context
switches a few times faster.

Right now, each interaction takes 12µs. With this patch, it takes about
3µs.

v2: clean up the first patch and add the test.
v3: update commit messages and a few fixes suggested by Kees Cook.
v4: update the third patch to avoid code duplication (suggested by
Peter Zijlstra)
Add the benchmark to the perf bench set.
v5: Update the author email. No code changes.

Kees is ready to take this patch set, but wants to get Acks from the
sched folks.

Cc: Andy Lutomirski <[email protected]>
Cc: Christian Brauner <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Peter Oskolkov <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Tycho Andersen <[email protected]>
Cc: Will Drewry <[email protected]>
Cc: Vincent Guittot <[email protected]>

Andrei Vagin (5):
seccomp: don't use semaphore and wait_queue together
sched: add a few helpers to wake up tasks on the current cpu
seccomp: add the synchronous mode for seccomp_unotify
selftest/seccomp: add a new test for the sync mode of
seccomp_user_notify
perf/benchmark: add a new benchmark for seccomp_unotify

Peter Oskolkov (1):
sched: add WF_CURRENT_CPU and externise ttwu

include/linux/completion.h | 1 +
include/linux/swait.h | 2 +-
include/linux/wait.h | 3 +
include/uapi/linux/seccomp.h | 4 +
kernel/sched/completion.c | 26 ++-
kernel/sched/core.c | 5 +-
kernel/sched/fair.c | 4 +
kernel/sched/sched.h | 13 +-
kernel/sched/swait.c | 8 +-
kernel/sched/wait.c | 5 +
kernel/seccomp.c | 72 +++++++-
tools/arch/x86/include/uapi/asm/unistd_32.h | 3 +
tools/arch/x86/include/uapi/asm/unistd_64.h | 3 +
tools/perf/bench/Build | 1 +
tools/perf/bench/bench.h | 1 +
tools/perf/bench/sched-seccomp-notify.c | 167 ++++++++++++++++++
tools/perf/builtin-bench.c | 1 +
tools/testing/selftests/seccomp/seccomp_bpf.c | 55 ++++++
18 files changed, 346 insertions(+), 28 deletions(-)
create mode 100644 tools/perf/bench/sched-seccomp-notify.c

--
2.37.2



2023-03-08 07:32:25

by Andrei Vagin

Subject: [PATCH 1/6] seccomp: don't use semaphore and wait_queue together

The main reason is to use the new wake_up helpers that will be added
in the following patches. But here are a few other reasons:

* If we use two different primitives, we always need to call both of
them. This patch fixes seccomp_notify_recv, where we forgot to call
wake_up_poll in the error path.

* If we use one primitive, we can control how many waiters are woken
up for each request. Our goal is to wake up just one waiter that will
handle the request. Right now, wake_up_poll can wake up one waiter and
up(&match->notif->request) can wake up one more.

Signed-off-by: Andrei Vagin <[email protected]>
---
kernel/seccomp.c | 41 ++++++++++++++++++++++++++++++++++++-----
1 file changed, 36 insertions(+), 5 deletions(-)

diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index cebf26445f9e..9fca9345111c 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -145,7 +145,7 @@ struct seccomp_kaddfd {
* @notifications: A list of struct seccomp_knotif elements.
*/
struct notification {
- struct semaphore request;
+ atomic_t requests;
u64 next_id;
struct list_head notifications;
};
@@ -1116,7 +1116,7 @@ static int seccomp_do_user_notification(int this_syscall,
list_add_tail(&n.list, &match->notif->notifications);
INIT_LIST_HEAD(&n.addfd);

- up(&match->notif->request);
+ atomic_inc(&match->notif->requests);
wake_up_poll(&match->wqh, EPOLLIN | EPOLLRDNORM);

/*
@@ -1450,6 +1450,37 @@ find_notification(struct seccomp_filter *filter, u64 id)
return NULL;
}

+static int recv_wake_function(wait_queue_entry_t *wait, unsigned int mode, int sync,
+ void *key)
+{
+ /* Avoid a wakeup if event not interesting for us. */
+ if (key && !(key_to_poll(key) & (EPOLLIN | EPOLLERR)))
+ return 0;
+ return autoremove_wake_function(wait, mode, sync, key);
+}
+
+static int recv_wait_event(struct seccomp_filter *filter)
+{
+ DEFINE_WAIT_FUNC(wait, recv_wake_function);
+ int ret;
+
+ if (atomic_dec_if_positive(&filter->notif->requests) >= 0)
+ return 0;
+
+ for (;;) {
+ ret = prepare_to_wait_event(&filter->wqh, &wait, TASK_INTERRUPTIBLE);
+
+ if (atomic_dec_if_positive(&filter->notif->requests) >= 0)
+ break;
+
+ if (ret)
+ return ret;
+
+ schedule();
+ }
+ finish_wait(&filter->wqh, &wait);
+ return 0;
+}

static long seccomp_notify_recv(struct seccomp_filter *filter,
void __user *buf)
@@ -1467,7 +1498,7 @@ static long seccomp_notify_recv(struct seccomp_filter *filter,

memset(&unotif, 0, sizeof(unotif));

- ret = down_interruptible(&filter->notif->request);
+ ret = recv_wait_event(filter);
if (ret < 0)
return ret;

@@ -1515,7 +1546,8 @@ static long seccomp_notify_recv(struct seccomp_filter *filter,
if (should_sleep_killable(filter, knotif))
complete(&knotif->ready);
knotif->state = SECCOMP_NOTIFY_INIT;
- up(&filter->notif->request);
+ atomic_inc(&filter->notif->requests);
+ wake_up_poll(&filter->wqh, EPOLLIN | EPOLLRDNORM);
}
mutex_unlock(&filter->notify_lock);
}
@@ -1777,7 +1809,6 @@ static struct file *init_listener(struct seccomp_filter *filter)
if (!filter->notif)
goto out;

- sema_init(&filter->notif->request, 0);
filter->notif->next_id = get_random_u64();
INIT_LIST_HEAD(&filter->notif->notifications);

--
2.40.0.rc0.216.gc4246ad0f0-goog


2023-03-08 07:32:30

by Andrei Vagin

Subject: [PATCH 2/6] sched: add WF_CURRENT_CPU and externise ttwu

From: Peter Oskolkov <[email protected]>

Add the WF_CURRENT_CPU wake flag that advises the scheduler to
move the wakee to the current CPU. This is useful for fast on-CPU
context switching use cases.

In addition, make ttwu external rather than static so that
the flag can be passed to it from outside of sched/core.c.

Signed-off-by: Peter Oskolkov <[email protected]>
Signed-off-by: Andrei Vagin <[email protected]>
---
kernel/sched/core.c | 3 +--
kernel/sched/fair.c | 4 ++++
kernel/sched/sched.h | 13 ++++++++-----
3 files changed, 13 insertions(+), 7 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index af017e038b48..386a0c40d341 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4123,8 +4123,7 @@ bool ttwu_state_match(struct task_struct *p, unsigned int state, int *success)
* Return: %true if @p->state changes (an actual wakeup was done),
* %false otherwise.
*/
-static int
-try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
+int try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
{
unsigned long flags;
int cpu, success = 0;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7a1b1f855b96..4c67652aa302 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7569,6 +7569,10 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
if (wake_flags & WF_TTWU) {
record_wakee(p);

+ if ((wake_flags & WF_CURRENT_CPU) &&
+ cpumask_test_cpu(cpu, p->cpus_ptr))
+ return cpu;
+
if (sched_energy_enabled()) {
new_cpu = find_energy_efficient_cpu(p, prev_cpu);
if (new_cpu >= 0)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 3e8df6d31c1e..f8420e9ed290 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2093,12 +2093,13 @@ static inline int task_on_rq_migrating(struct task_struct *p)
}

/* Wake flags. The first three directly map to some SD flag value */
-#define WF_EXEC 0x02 /* Wakeup after exec; maps to SD_BALANCE_EXEC */
-#define WF_FORK 0x04 /* Wakeup after fork; maps to SD_BALANCE_FORK */
-#define WF_TTWU 0x08 /* Wakeup; maps to SD_BALANCE_WAKE */
+#define WF_EXEC 0x02 /* Wakeup after exec; maps to SD_BALANCE_EXEC */
+#define WF_FORK 0x04 /* Wakeup after fork; maps to SD_BALANCE_FORK */
+#define WF_TTWU 0x08 /* Wakeup; maps to SD_BALANCE_WAKE */

-#define WF_SYNC 0x10 /* Waker goes to sleep after wakeup */
-#define WF_MIGRATED 0x20 /* Internal use, task got migrated */
+#define WF_SYNC 0x10 /* Waker goes to sleep after wakeup */
+#define WF_MIGRATED 0x20 /* Internal use, task got migrated */
+#define WF_CURRENT_CPU 0x40 /* Prefer to move the wakee to the current CPU. */

#ifdef CONFIG_SMP
static_assert(WF_EXEC == SD_BALANCE_EXEC);
@@ -3232,6 +3233,8 @@ static inline bool is_per_cpu_kthread(struct task_struct *p)
extern void swake_up_all_locked(struct swait_queue_head *q);
extern void __prepare_to_swait(struct swait_queue_head *q, struct swait_queue *wait);

+extern int try_to_wake_up(struct task_struct *tsk, unsigned int state, int wake_flags);
+
#ifdef CONFIG_PREEMPT_DYNAMIC
extern int preempt_dynamic_mode;
extern int sched_dynamic_mode(const char *str);
--
2.40.0.rc0.216.gc4246ad0f0-goog


2023-03-08 07:32:34

by Andrei Vagin

Subject: [PATCH 3/6] sched: add a few helpers to wake up tasks on the current cpu

Add the complete_on_current_cpu and wake_up_poll_on_current_cpu
helpers to wake up tasks on the current CPU.

These two helpers are useful when a task needs to make a synchronous
context switch to another task: it wakes up the target task and goes
to sleep right after that.

One example of such a workload is seccomp user notifications. This
mechanism allows a supervisor process to handle system calls on behalf
of a target process. While the supervisor is handling an intercepted
system call, the target process is blocked in the kernel, waiting for
a response to come back.

On-CPU context switches are much faster than regular ones.

Signed-off-by: Andrei Vagin <[email protected]>
---
include/linux/completion.h | 1 +
include/linux/swait.h | 2 +-
include/linux/wait.h | 3 +++
kernel/sched/completion.c | 26 ++++++++++++++++++--------
kernel/sched/core.c | 2 +-
kernel/sched/swait.c | 8 ++++----
kernel/sched/wait.c | 5 +++++
7 files changed, 33 insertions(+), 14 deletions(-)

diff --git a/include/linux/completion.h b/include/linux/completion.h
index 62b32b19e0a8..fb2915676574 100644
--- a/include/linux/completion.h
+++ b/include/linux/completion.h
@@ -116,6 +116,7 @@ extern bool try_wait_for_completion(struct completion *x);
extern bool completion_done(struct completion *x);

extern void complete(struct completion *);
+extern void complete_on_current_cpu(struct completion *x);
extern void complete_all(struct completion *);

#endif
diff --git a/include/linux/swait.h b/include/linux/swait.h
index 6a8c22b8c2a5..d324419482a0 100644
--- a/include/linux/swait.h
+++ b/include/linux/swait.h
@@ -146,7 +146,7 @@ static inline bool swq_has_sleeper(struct swait_queue_head *wq)

extern void swake_up_one(struct swait_queue_head *q);
extern void swake_up_all(struct swait_queue_head *q);
-extern void swake_up_locked(struct swait_queue_head *q);
+extern void swake_up_locked(struct swait_queue_head *q, int wake_flags);

extern void prepare_to_swait_exclusive(struct swait_queue_head *q, struct swait_queue *wait, int state);
extern long prepare_to_swait_event(struct swait_queue_head *q, struct swait_queue *wait, int state);
diff --git a/include/linux/wait.h b/include/linux/wait.h
index a0307b516b09..5ec7739400f4 100644
--- a/include/linux/wait.h
+++ b/include/linux/wait.h
@@ -210,6 +210,7 @@ __remove_wait_queue(struct wait_queue_head *wq_head, struct wait_queue_entry *wq
}

int __wake_up(struct wait_queue_head *wq_head, unsigned int mode, int nr, void *key);
+void __wake_up_on_current_cpu(struct wait_queue_head *wq_head, unsigned int mode, void *key);
void __wake_up_locked_key(struct wait_queue_head *wq_head, unsigned int mode, void *key);
void __wake_up_locked_key_bookmark(struct wait_queue_head *wq_head,
unsigned int mode, void *key, wait_queue_entry_t *bookmark);
@@ -237,6 +238,8 @@ void __wake_up_pollfree(struct wait_queue_head *wq_head);
#define key_to_poll(m) ((__force __poll_t)(uintptr_t)(void *)(m))
#define wake_up_poll(x, m) \
__wake_up(x, TASK_NORMAL, 1, poll_to_key(m))
+#define wake_up_poll_on_current_cpu(x, m) \
+ __wake_up_on_current_cpu(x, TASK_NORMAL, poll_to_key(m))
#define wake_up_locked_poll(x, m) \
__wake_up_locked_key((x), TASK_NORMAL, poll_to_key(m))
#define wake_up_interruptible_poll(x, m) \
diff --git a/kernel/sched/completion.c b/kernel/sched/completion.c
index d57a5c1c1cd9..3561ab533dd4 100644
--- a/kernel/sched/completion.c
+++ b/kernel/sched/completion.c
@@ -13,6 +13,23 @@
* Waiting for completion is a typically sync point, but not an exclusion point.
*/

+static void complete_with_flags(struct completion *x, int wake_flags)
+{
+ unsigned long flags;
+
+ raw_spin_lock_irqsave(&x->wait.lock, flags);
+
+ if (x->done != UINT_MAX)
+ x->done++;
+ swake_up_locked(&x->wait, wake_flags);
+ raw_spin_unlock_irqrestore(&x->wait.lock, flags);
+}
+
+void complete_on_current_cpu(struct completion *x)
+{
+ return complete_with_flags(x, WF_CURRENT_CPU);
+}
+
/**
* complete: - signals a single thread waiting on this completion
* @x: holds the state of this particular completion
@@ -27,14 +44,7 @@
*/
void complete(struct completion *x)
{
- unsigned long flags;
-
- raw_spin_lock_irqsave(&x->wait.lock, flags);
-
- if (x->done != UINT_MAX)
- x->done++;
- swake_up_locked(&x->wait);
- raw_spin_unlock_irqrestore(&x->wait.lock, flags);
+ complete_with_flags(x, 0);
}
EXPORT_SYMBOL(complete);

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 386a0c40d341..c5f7bfbc4967 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6941,7 +6941,7 @@ asmlinkage __visible void __sched preempt_schedule_irq(void)
int default_wake_function(wait_queue_entry_t *curr, unsigned mode, int wake_flags,
void *key)
{
- WARN_ON_ONCE(IS_ENABLED(CONFIG_SCHED_DEBUG) && wake_flags & ~WF_SYNC);
+ WARN_ON_ONCE(IS_ENABLED(CONFIG_SCHED_DEBUG) && wake_flags & ~(WF_SYNC|WF_CURRENT_CPU));
return try_to_wake_up(curr->private, mode, wake_flags);
}
EXPORT_SYMBOL(default_wake_function);
diff --git a/kernel/sched/swait.c b/kernel/sched/swait.c
index 76b9b796e695..72505cd3b60a 100644
--- a/kernel/sched/swait.c
+++ b/kernel/sched/swait.c
@@ -18,7 +18,7 @@ EXPORT_SYMBOL(__init_swait_queue_head);
* If for some reason it would return 0, that means the previously waiting
* task is already running, so it will observe condition true (or has already).
*/
-void swake_up_locked(struct swait_queue_head *q)
+void swake_up_locked(struct swait_queue_head *q, int wake_flags)
{
struct swait_queue *curr;

@@ -26,7 +26,7 @@ void swake_up_locked(struct swait_queue_head *q)
return;

curr = list_first_entry(&q->task_list, typeof(*curr), task_list);
- wake_up_process(curr->task);
+ try_to_wake_up(curr->task, TASK_NORMAL, wake_flags);
list_del_init(&curr->task_list);
}
EXPORT_SYMBOL(swake_up_locked);
@@ -41,7 +41,7 @@ EXPORT_SYMBOL(swake_up_locked);
void swake_up_all_locked(struct swait_queue_head *q)
{
while (!list_empty(&q->task_list))
- swake_up_locked(q);
+ swake_up_locked(q, 0);
}

void swake_up_one(struct swait_queue_head *q)
@@ -49,7 +49,7 @@ void swake_up_one(struct swait_queue_head *q)
unsigned long flags;

raw_spin_lock_irqsave(&q->lock, flags);
- swake_up_locked(q);
+ swake_up_locked(q, 0);
raw_spin_unlock_irqrestore(&q->lock, flags);
}
EXPORT_SYMBOL(swake_up_one);
diff --git a/kernel/sched/wait.c b/kernel/sched/wait.c
index 133b74730738..47803a0b8d5d 100644
--- a/kernel/sched/wait.c
+++ b/kernel/sched/wait.c
@@ -161,6 +161,11 @@ int __wake_up(struct wait_queue_head *wq_head, unsigned int mode,
}
EXPORT_SYMBOL(__wake_up);

+void __wake_up_on_current_cpu(struct wait_queue_head *wq_head, unsigned int mode, void *key)
+{
+ __wake_up_common_lock(wq_head, mode, 1, WF_CURRENT_CPU, key);
+}
+
/*
* Same as __wake_up but called with the spinlock in wait_queue_head_t held.
*/
--
2.40.0.rc0.216.gc4246ad0f0-goog


2023-03-08 07:32:44

by Andrei Vagin

Subject: [PATCH 5/6] selftest/seccomp: add a new test for the sync mode of seccomp_user_notify

Test output:
# RUN global.user_notification_sync ...
# OK global.user_notification_sync
ok 51 global.user_notification_sync

Signed-off-by: Andrei Vagin <[email protected]>
---
tools/testing/selftests/seccomp/seccomp_bpf.c | 55 +++++++++++++++++++
1 file changed, 55 insertions(+)

diff --git a/tools/testing/selftests/seccomp/seccomp_bpf.c b/tools/testing/selftests/seccomp/seccomp_bpf.c
index 43ec36b179dc..f6a04d88e02f 100644
--- a/tools/testing/selftests/seccomp/seccomp_bpf.c
+++ b/tools/testing/selftests/seccomp/seccomp_bpf.c
@@ -4255,6 +4255,61 @@ TEST(user_notification_addfd_rlimit)
close(memfd);
}

+#ifndef SECCOMP_USER_NOTIF_FD_SYNC_WAKE_UP
+#define SECCOMP_USER_NOTIF_FD_SYNC_WAKE_UP (1UL << 0)
+#define SECCOMP_IOCTL_NOTIF_SET_FLAGS SECCOMP_IOW(4, __u64)
+#endif
+
+TEST(user_notification_sync)
+{
+ struct seccomp_notif req = {};
+ struct seccomp_notif_resp resp = {};
+ int status, listener;
+ pid_t pid;
+ long ret;
+
+ ret = prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0);
+ ASSERT_EQ(0, ret) {
+ TH_LOG("Kernel does not support PR_SET_NO_NEW_PRIVS!");
+ }
+
+ listener = user_notif_syscall(__NR_getppid,
+ SECCOMP_FILTER_FLAG_NEW_LISTENER);
+ ASSERT_GE(listener, 0);
+
+ /* Try to set invalid flags. */
+ EXPECT_SYSCALL_RETURN(-EINVAL,
+ ioctl(listener, SECCOMP_IOCTL_NOTIF_SET_FLAGS, 0xffffffff, 0));
+
+ ASSERT_EQ(ioctl(listener, SECCOMP_IOCTL_NOTIF_SET_FLAGS,
+ SECCOMP_USER_NOTIF_FD_SYNC_WAKE_UP, 0), 0);
+
+ pid = fork();
+ ASSERT_GE(pid, 0);
+ if (pid == 0) {
+ ret = syscall(__NR_getppid);
+ ASSERT_EQ(ret, USER_NOTIF_MAGIC) {
+ _exit(1);
+ }
+ _exit(0);
+ }
+
+ req.pid = 0;
+ ASSERT_EQ(ioctl(listener, SECCOMP_IOCTL_NOTIF_RECV, &req), 0);
+
+ ASSERT_EQ(req.data.nr, __NR_getppid);
+
+ resp.id = req.id;
+ resp.error = 0;
+ resp.val = USER_NOTIF_MAGIC;
+ resp.flags = 0;
+ ASSERT_EQ(ioctl(listener, SECCOMP_IOCTL_NOTIF_SEND, &resp), 0);
+
+ ASSERT_EQ(waitpid(pid, &status, 0), pid);
+ ASSERT_EQ(status, 0);
+}
+
+
/* Make sure PTRACE_O_SUSPEND_SECCOMP requires CAP_SYS_ADMIN. */
FIXTURE(O_SUSPEND_SECCOMP) {
pid_t pid;
--
2.40.0.rc0.216.gc4246ad0f0-goog


2023-03-08 07:32:41

by Andrei Vagin

Subject: [PATCH 4/6] seccomp: add the synchronous mode for seccomp_unotify

seccomp_unotify allows more privileged processes to perform actions
on behalf of less privileged processes.

In many cases, the workflow is fully synchronous: a target process
triggers a system call and passes control to a supervisor process that
handles the system call and returns control back to the target
process. In this context, "synchronous" means that only one process is
running while the other one is waiting.

The WF_CURRENT_CPU flag advises the scheduler to move the wakee to
the current CPU. For such synchronous workflows, it makes context
switches a few times faster.

Right now, each interaction takes 12µs. With this patch, it takes about
3µs.

This change introduces the SECCOMP_USER_NOTIF_FD_SYNC_WAKE_UP flag,
which is used to enable the sync mode.

Signed-off-by: Andrei Vagin <[email protected]>
---
include/uapi/linux/seccomp.h | 4 ++++
kernel/seccomp.c | 31 +++++++++++++++++++++++++++++--
2 files changed, 33 insertions(+), 2 deletions(-)

diff --git a/include/uapi/linux/seccomp.h b/include/uapi/linux/seccomp.h
index 0fdc6ef02b94..dbfc9b37fcae 100644
--- a/include/uapi/linux/seccomp.h
+++ b/include/uapi/linux/seccomp.h
@@ -115,6 +115,8 @@ struct seccomp_notif_resp {
__u32 flags;
};

+#define SECCOMP_USER_NOTIF_FD_SYNC_WAKE_UP (1UL << 0)
+
/* valid flags for seccomp_notif_addfd */
#define SECCOMP_ADDFD_FLAG_SETFD (1UL << 0) /* Specify remote fd */
#define SECCOMP_ADDFD_FLAG_SEND (1UL << 1) /* Addfd and return it, atomically */
@@ -150,4 +152,6 @@ struct seccomp_notif_addfd {
#define SECCOMP_IOCTL_NOTIF_ADDFD SECCOMP_IOW(3, \
struct seccomp_notif_addfd)

+#define SECCOMP_IOCTL_NOTIF_SET_FLAGS SECCOMP_IOW(4, __u64)
+
#endif /* _UAPI_LINUX_SECCOMP_H */
diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index 9fca9345111c..d323edeae7da 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -143,9 +143,12 @@ struct seccomp_kaddfd {
* filter->notify_lock.
* @next_id: The id of the next request.
* @notifications: A list of struct seccomp_knotif elements.
+ * @flags: A set of SECCOMP_USER_NOTIF_FD_* flags.
*/
+
struct notification {
atomic_t requests;
+ u32 flags;
u64 next_id;
struct list_head notifications;
};
@@ -1117,7 +1120,10 @@ static int seccomp_do_user_notification(int this_syscall,
INIT_LIST_HEAD(&n.addfd);

atomic_inc(&match->notif->requests);
- wake_up_poll(&match->wqh, EPOLLIN | EPOLLRDNORM);
+ if (match->notif->flags & SECCOMP_USER_NOTIF_FD_SYNC_WAKE_UP)
+ wake_up_poll_on_current_cpu(&match->wqh, EPOLLIN | EPOLLRDNORM);
+ else
+ wake_up_poll(&match->wqh, EPOLLIN | EPOLLRDNORM);

/*
* This is where we wait for a reply from userspace.
@@ -1593,7 +1599,10 @@ static long seccomp_notify_send(struct seccomp_filter *filter,
knotif->error = resp.error;
knotif->val = resp.val;
knotif->flags = resp.flags;
- complete(&knotif->ready);
+ if (filter->notif->flags & SECCOMP_USER_NOTIF_FD_SYNC_WAKE_UP)
+ complete_on_current_cpu(&knotif->ready);
+ else
+ complete(&knotif->ready);
out:
mutex_unlock(&filter->notify_lock);
return ret;
@@ -1623,6 +1632,22 @@ static long seccomp_notify_id_valid(struct seccomp_filter *filter,
return ret;
}

+static long seccomp_notify_set_flags(struct seccomp_filter *filter,
+ unsigned long flags)
+{
+ long ret;
+
+ if (flags & ~SECCOMP_USER_NOTIF_FD_SYNC_WAKE_UP)
+ return -EINVAL;
+
+ ret = mutex_lock_interruptible(&filter->notify_lock);
+ if (ret < 0)
+ return ret;
+ filter->notif->flags = flags;
+ mutex_unlock(&filter->notify_lock);
+ return 0;
+}
+
static long seccomp_notify_addfd(struct seccomp_filter *filter,
struct seccomp_notif_addfd __user *uaddfd,
unsigned int size)
@@ -1752,6 +1777,8 @@ static long seccomp_notify_ioctl(struct file *file, unsigned int cmd,
case SECCOMP_IOCTL_NOTIF_ID_VALID_WRONG_DIR:
case SECCOMP_IOCTL_NOTIF_ID_VALID:
return seccomp_notify_id_valid(filter, buf);
+ case SECCOMP_IOCTL_NOTIF_SET_FLAGS:
+ return seccomp_notify_set_flags(filter, arg);
}

/* Extensible Argument ioctls */
--
2.40.0.rc0.216.gc4246ad0f0-goog


2023-03-08 07:32:54

by Andrei Vagin

Subject: [PATCH 6/6] perf/benchmark: add a new benchmark for seccomp_unotify

The benchmark is similar to the pipe benchmark. It creates two
processes: one calls system calls, and the other handles them via
seccomp user notifications. It measures the time required to run a
specified number of iterations.

$ ./perf bench sched seccomp-notify --sync-mode --loop 1000000
# Running 'sched/seccomp-notify' benchmark:
# Executed 1000000 system calls

Total time: 2.769 [sec]

2.769629 usecs/op
361059 ops/sec

$ ./perf bench sched seccomp-notify
# Running 'sched/seccomp-notify' benchmark:
# Executed 1000000 system calls

Total time: 8.571 [sec]

8.571119 usecs/op
116670 ops/sec

Signed-off-by: Andrei Vagin <[email protected]>
---
tools/arch/x86/include/uapi/asm/unistd_32.h | 3 +
tools/arch/x86/include/uapi/asm/unistd_64.h | 3 +
tools/perf/bench/Build | 1 +
tools/perf/bench/bench.h | 1 +
tools/perf/bench/sched-seccomp-notify.c | 168 ++++++++++++++++++++
tools/perf/builtin-bench.c | 1 +
6 files changed, 177 insertions(+)
create mode 100644 tools/perf/bench/sched-seccomp-notify.c

diff --git a/tools/arch/x86/include/uapi/asm/unistd_32.h b/tools/arch/x86/include/uapi/asm/unistd_32.h
index 2712d5e03e2e..5fb3589c14bf 100644
--- a/tools/arch/x86/include/uapi/asm/unistd_32.h
+++ b/tools/arch/x86/include/uapi/asm/unistd_32.h
@@ -23,3 +23,6 @@
#ifndef __NR_setns
#define __NR_setns 346
#endif
+#ifndef __NR_seccomp
+#define __NR_seccomp 354
+#endif
diff --git a/tools/arch/x86/include/uapi/asm/unistd_64.h b/tools/arch/x86/include/uapi/asm/unistd_64.h
index a6f7fe84d4df..e0549617f9d7 100644
--- a/tools/arch/x86/include/uapi/asm/unistd_64.h
+++ b/tools/arch/x86/include/uapi/asm/unistd_64.h
@@ -23,3 +23,6 @@
#ifndef __NR_getcpu
#define __NR_getcpu 309
#endif
+#ifndef __NR_seccomp
+#define __NR_seccomp 317
+#endif
diff --git a/tools/perf/bench/Build b/tools/perf/bench/Build
index 6b6155a8ad09..e3ec2c1b0682 100644
--- a/tools/perf/bench/Build
+++ b/tools/perf/bench/Build
@@ -1,5 +1,6 @@
perf-y += sched-messaging.o
perf-y += sched-pipe.o
+perf-y += sched-seccomp-notify.o
perf-y += syscall.o
perf-y += mem-functions.o
perf-y += futex-hash.o
diff --git a/tools/perf/bench/bench.h b/tools/perf/bench/bench.h
index e43893151a3e..9d28510fcf9d 100644
--- a/tools/perf/bench/bench.h
+++ b/tools/perf/bench/bench.h
@@ -21,6 +21,7 @@ extern struct timeval bench__start, bench__end, bench__runtime;
int bench_numa(int argc, const char **argv);
int bench_sched_messaging(int argc, const char **argv);
int bench_sched_pipe(int argc, const char **argv);
+int bench_sched_seccomp_notify(int argc, const char **argv);
int bench_syscall_basic(int argc, const char **argv);
int bench_syscall_getpgid(int argc, const char **argv);
int bench_syscall_execve(int argc, const char **argv);
diff --git a/tools/perf/bench/sched-seccomp-notify.c b/tools/perf/bench/sched-seccomp-notify.c
new file mode 100644
index 000000000000..443f4b43702d
--- /dev/null
+++ b/tools/perf/bench/sched-seccomp-notify.c
@@ -0,0 +1,168 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <subcmd/parse-options.h>
+#include "bench.h"
+
+#include <uapi/linux/filter.h>
+#include <sys/types.h>
+#include <sys/time.h>
+#include <linux/unistd.h>
+#include <sys/syscall.h>
+#include <sys/ioctl.h>
+#include <linux/time64.h>
+#include <linux/seccomp.h>
+#include <sys/prctl.h>
+
+#include <unistd.h>
+#include <limits.h>
+#include <stddef.h>
+#include <stdint.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <signal.h>
+#include <sys/wait.h>
+#include <string.h>
+#include <errno.h>
+#include <assert.h>
+
+#define LOOPS_DEFAULT 1000000UL
+static uint64_t loops = LOOPS_DEFAULT;
+static bool sync_mode;
+
+static const struct option options[] = {
+ OPT_U64('l', "loop", &loops, "Specify number of loops"),
+ OPT_BOOLEAN('s', "sync-mode", &sync_mode,
"Enable the synchronous mode for seccomp notifications"),
+ OPT_END()
+};
+
+static const char * const bench_seccomp_usage[] = {
+ "perf bench sched seccomp-notify <options>",
+ NULL
+};
+
+static int seccomp(unsigned int op, unsigned int flags, void *args)
+{
+ return syscall(__NR_seccomp, op, flags, args);
+}
+
+static int user_notif_syscall(int nr, unsigned int flags)
+{
+ struct sock_filter filter[] = {
+ BPF_STMT(BPF_LD|BPF_W|BPF_ABS,
+ offsetof(struct seccomp_data, nr)),
+ BPF_JUMP(BPF_JMP|BPF_JEQ|BPF_K, nr, 0, 1),
+ BPF_STMT(BPF_RET|BPF_K, SECCOMP_RET_USER_NOTIF),
+ BPF_STMT(BPF_RET|BPF_K, SECCOMP_RET_ALLOW),
+ };
+
+ struct sock_fprog prog = {
+ .len = (unsigned short)ARRAY_SIZE(filter),
+ .filter = filter,
+ };
+
+ return seccomp(SECCOMP_SET_MODE_FILTER, flags, &prog);
+}
+
+#define USER_NOTIF_MAGIC INT_MAX
+static void user_notification_sync_loop(int listener)
+{
+ struct seccomp_notif_resp resp;
+ struct seccomp_notif req;
+ uint64_t nr;
+
+ for (nr = 0; nr < loops; nr++) {
+ memset(&req, 0, sizeof(req));
+ assert(ioctl(listener, SECCOMP_IOCTL_NOTIF_RECV, &req) == 0);
+
+ assert(req.data.nr == __NR_gettid);
+
+ resp.id = req.id;
+ resp.error = 0;
+ resp.val = USER_NOTIF_MAGIC;
+ resp.flags = 0;
+ assert(ioctl(listener, SECCOMP_IOCTL_NOTIF_SEND, &resp) == 0);
+ }
+}
+
+#ifndef SECCOMP_USER_NOTIF_FD_SYNC_WAKE_UP
+#define SECCOMP_USER_NOTIF_FD_SYNC_WAKE_UP (1UL << 0)
+#define SECCOMP_IOCTL_NOTIF_SET_FLAGS SECCOMP_IOW(4, __u64)
+#endif
+int bench_sched_seccomp_notify(int argc, const char **argv)
+{
+ struct timeval start, stop, diff;
+ unsigned long long result_usec = 0;
+ int status, listener;
+ pid_t pid;
+ long ret;
+
+ argc = parse_options(argc, argv, options, bench_seccomp_usage, 0);
+
+ gettimeofday(&start, NULL);
+
+ prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0);
+ listener = user_notif_syscall(__NR_gettid,
+ SECCOMP_FILTER_FLAG_NEW_LISTENER);
+ assert(listener >= 0);
+
+ pid = fork();
+ assert(pid >= 0);
+ if (pid == 0) {
+ assert(prctl(PR_SET_PDEATHSIG, SIGKILL, 0, 0, 0) == 0);
+ while (1) {
+ ret = syscall(__NR_gettid);
+ if (ret == USER_NOTIF_MAGIC)
+ continue;
+ break;
+ }
+ _exit(1);
+ }
+
+ if (sync_mode) {
+ assert(ioctl(listener, SECCOMP_IOCTL_NOTIF_SET_FLAGS,
+ SECCOMP_USER_NOTIF_FD_SYNC_WAKE_UP, 0) == 0);
+ }
+ user_notification_sync_loop(listener);
+
+ kill(pid, SIGKILL);
+ assert(waitpid(pid, &status, 0) == pid);
+ assert(WIFSIGNALED(status));
+ assert(WTERMSIG(status) == SIGKILL);
+
+ gettimeofday(&stop, NULL);
+ timersub(&stop, &start, &diff);
+
+ switch (bench_format) {
+ case BENCH_FORMAT_DEFAULT:
+ printf("# Executed %lu system calls\n\n",
+ loops);
+
+ result_usec = diff.tv_sec * USEC_PER_SEC;
+ result_usec += diff.tv_usec;
+
+ printf(" %14s: %lu.%03lu [sec]\n\n", "Total time",
+ (unsigned long) diff.tv_sec,
+ (unsigned long) (diff.tv_usec / USEC_PER_MSEC));
+
+ printf(" %14lf usecs/op\n",
+ (double)result_usec / (double)loops);
+ printf(" %14d ops/sec\n",
+ (int)((double)loops /
+ ((double)result_usec / (double)USEC_PER_SEC)));
+ break;
+
+ case BENCH_FORMAT_SIMPLE:
+ printf("%lu.%03lu\n",
+ (unsigned long) diff.tv_sec,
+ (unsigned long) (diff.tv_usec / USEC_PER_MSEC));
+ break;
+
+ default:
+ /* reaching here is something disaster */
+ fprintf(stderr, "Unknown format:%d\n", bench_format);
+ exit(1);
+ break;
+ }
+
+ return 0;
+}
diff --git a/tools/perf/builtin-bench.c b/tools/perf/builtin-bench.c
index 814e9afc86f6..db57813fe4e5 100644
--- a/tools/perf/builtin-bench.c
+++ b/tools/perf/builtin-bench.c
@@ -46,6 +46,7 @@ static struct bench numa_benchmarks[] = {
static struct bench sched_benchmarks[] = {
{ "messaging", "Benchmark for scheduling and IPC", bench_sched_messaging },
{ "pipe", "Benchmark for pipe() between two processes", bench_sched_pipe },
+ { "seccomp-notify", "Benchmark for seccomp user notify", bench_sched_seccomp_notify},
{ "all", "Run all scheduler benchmarks", NULL },
{ NULL, NULL, NULL }
};
--
2.40.0.rc0.216.gc4246ad0f0-goog


2023-03-21 18:19:23

by Andrei Vagin

Subject: Re: [PATCH 0/6 v5 RESEND] seccomp: add the synchronous mode for seccomp_unotify

On Tue, Mar 7, 2023 at 11:32 PM Andrei Vagin <[email protected]> wrote:
>
> seccomp_unotify allows more privileged processes to perform actions
> on behalf of less privileged processes.
>
> In many cases, the workflow is fully synchronous: a target process
> triggers a system call and passes control to a supervisor process that
> handles the system call and returns control back to the target
> process. In this context, "synchronous" means that only one process is
> running while the other one is waiting.
>
> The new WF_CURRENT_CPU flag advises the scheduler to move the wakee to
> the current CPU. For such synchronous workflows, it makes context
> switches a few times faster.
>
> Right now, each interaction takes 12µs. With this patch, it takes about
> 3µs.
>
> v2: clean up the first patch and add the test.
> v3: update commit messages and a few fixes suggested by Kees Cook.
> v4: update the third patch to avoid code duplications (suggested by
> Peter Zijlstra)
> Add the benchmark to the perf bench set.
> v5: Update the author email. No code changes.
>
> Kees is ready to take this patch set, but wants to get Acks from the
> sched folks.

Peter, could you review the second and third patches of this series?

Thanks,
Andrei

2023-03-27 10:30:04

by Peter Zijlstra

Subject: Re: [PATCH 0/6 v5 RESEND] seccomp: add the synchronous mode for seccomp_unotify

On Tue, Mar 07, 2023 at 11:31:55PM -0800, Andrei Vagin wrote:

> Kees is ready to take this patch set, but wants to get Acks from the
> sched folks.
>

> Andrei Vagin (4):
> seccomp: don't use semaphore and wait_queue together
> sched: add a few helpers to wake up tasks on the current cpu
> seccomp: add the synchronous mode for seccomp_unotify
> selftest/seccomp: add a new test for the sync mode of
> seccomp_user_notify
>
> Peter Oskolkov (1):
> sched: add WF_CURRENT_CPU and externise ttwu

Acked-by: Peter Zijlstra (Intel) <[email protected]>

2023-04-03 18:33:22

by Andrei Vagin

[permalink] [raw]
Subject: Re: [PATCH 0/6 v5 RESEND] seccomp: add the synchronous mode for seccomp_unotify

On Mon, Mar 27, 2023 at 3:27 AM Peter Zijlstra <[email protected]> wrote:
>
> On Tue, Mar 07, 2023 at 11:31:55PM -0800, Andrei Vagin wrote:
>
> > Kees is ready to take this patch set, but wants to get Acks from the
> > sched folks.
> >
>
> > Andrei Vagin (4):
> > seccomp: don't use semaphore and wait_queue together
> > sched: add a few helpers to wake up tasks on the current cpu
> > seccomp: add the synchronous mode for seccomp_unotify
> > selftest/seccomp: add a new test for the sync mode of
> > seccomp_user_notify
> >
> > Peter Oskolkov (1):
> > sched: add WF_CURRENT_CPU and externise ttwu
>
> Acked-by: Peter Zijlstra (Intel) <[email protected]>

Kees,

Could you take a look at this patch set? You wrote on one of the previous
versions that you were ready to take it if the sched maintainers approved it.
There are no major changes since then. The sched-related part has been
cleaned up according to Peter's comments, and I moved the perf test into the
perf tool.

Thanks,
Andrei

2023-04-06 03:33:06

by Kees Cook

[permalink] [raw]
Subject: Re: [PATCH 0/6 v5 RESEND] seccomp: add the synchronous mode for seccomp_unotify

On April 3, 2023 11:32:00 AM PDT, Andrei Vagin <[email protected]> wrote:
>On Mon, Mar 27, 2023 at 3:27 AM Peter Zijlstra <[email protected]> wrote:
>>
>> On Tue, Mar 07, 2023 at 11:31:55PM -0800, Andrei Vagin wrote:
>>
>> > Kees is ready to take this patch set, but wants to get Acks from the
>> > sched folks.
>> >
>>
>> > Andrei Vagin (4):
>> > seccomp: don't use semaphore and wait_queue together
>> > sched: add a few helpers to wake up tasks on the current cpu
>> > seccomp: add the synchronous mode for seccomp_unotify
>> > selftest/seccomp: add a new test for the sync mode of
>> > seccomp_user_notify
>> >
>> > Peter Oskolkov (1):
>> > sched: add WF_CURRENT_CPU and externise ttwu
>>
>> Acked-by: Peter Zijlstra (Intel) <[email protected]>
>
>Kees,
>
>Could you look at this patch set? You wrote to one of the previous
>versions that you are ready to take it if sched maintainers approve it.
>Here is no major changes from that moment. The sched-related part has
>been cleaned up according with Peter's comments, and I moved the perf
>test to the perf tool.

Hi!

Yes, thanks for keeping this going! I'm catching up after some vacation, but this is on my TODO list. :)

-Kees


--
Kees Cook

2023-04-06 03:51:27

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH 4/6] seccomp: add the synchronous mode for seccomp_unotify

On Tue, Mar 7, 2023 at 11:32 PM Andrei Vagin <[email protected]> wrote:
>
> seccomp_unotify allows more privileged processes do actions on behalf
> of less privileged processes.
>
> In many cases, the workflow is fully synchronous. It means a target
> process triggers a system call and passes controls to a supervisor
> process that handles the system call and returns controls to the target
> process. In this context, "synchronous" means that only one process is
> running and another one is waiting.
>
> There is the WF_CURRENT_CPU flag that is used to advise the scheduler to
> move the wakee to the current CPU. For such synchronous workflows, it
> makes context switches a few times faster.
>
> Right now, each interaction takes 12µs. With this patch, it takes about
> 3µs.

This is great, but:

>
> This change introduce the SECCOMP_USER_NOTIF_FD_SYNC_WAKE_UP flag that
> it used to enable the sync mode.

Other than being faster, what does this flag actually do in terms of
user-visible semantics?

>
> Signed-off-by: Andrei Vagin <[email protected]>
> ---
> include/uapi/linux/seccomp.h | 4 ++++
> kernel/seccomp.c | 31 +++++++++++++++++++++++++++++--
> 2 files changed, 33 insertions(+), 2 deletions(-)
>
> diff --git a/include/uapi/linux/seccomp.h b/include/uapi/linux/seccomp.h
> index 0fdc6ef02b94..dbfc9b37fcae 100644
> --- a/include/uapi/linux/seccomp.h
> +++ b/include/uapi/linux/seccomp.h
> @@ -115,6 +115,8 @@ struct seccomp_notif_resp {
> __u32 flags;
> };
>
> +#define SECCOMP_USER_NOTIF_FD_SYNC_WAKE_UP (1UL << 0)
> +
> /* valid flags for seccomp_notif_addfd */
> #define SECCOMP_ADDFD_FLAG_SETFD (1UL << 0) /* Specify remote fd */
> #define SECCOMP_ADDFD_FLAG_SEND (1UL << 1) /* Addfd and return it, atomically */
> @@ -150,4 +152,6 @@ struct seccomp_notif_addfd {
> #define SECCOMP_IOCTL_NOTIF_ADDFD SECCOMP_IOW(3, \
> struct seccomp_notif_addfd)
>
> +#define SECCOMP_IOCTL_NOTIF_SET_FLAGS SECCOMP_IOW(4, __u64)
> +
> #endif /* _UAPI_LINUX_SECCOMP_H */
> diff --git a/kernel/seccomp.c b/kernel/seccomp.c
> index 9fca9345111c..d323edeae7da 100644
> --- a/kernel/seccomp.c
> +++ b/kernel/seccomp.c
> @@ -143,9 +143,12 @@ struct seccomp_kaddfd {
> * filter->notify_lock.
> * @next_id: The id of the next request.
> * @notifications: A list of struct seccomp_knotif elements.
> + * @flags: A set of SECCOMP_USER_NOTIF_FD_* flags.
> */
> +
> struct notification {
> atomic_t requests;
> + u32 flags;
> u64 next_id;
> struct list_head notifications;
> };
> @@ -1117,7 +1120,10 @@ static int seccomp_do_user_notification(int this_syscall,
> INIT_LIST_HEAD(&n.addfd);
>
> atomic_inc(&match->notif->requests);
> - wake_up_poll(&match->wqh, EPOLLIN | EPOLLRDNORM);
> + if (match->notif->flags & SECCOMP_USER_NOTIF_FD_SYNC_WAKE_UP)
> + wake_up_poll_on_current_cpu(&match->wqh, EPOLLIN | EPOLLRDNORM);
> + else
> + wake_up_poll(&match->wqh, EPOLLIN | EPOLLRDNORM);
>
> /*
> * This is where we wait for a reply from userspace.
> @@ -1593,7 +1599,10 @@ static long seccomp_notify_send(struct seccomp_filter *filter,
> knotif->error = resp.error;
> knotif->val = resp.val;
> knotif->flags = resp.flags;
> - complete(&knotif->ready);
> + if (filter->notif->flags & SECCOMP_USER_NOTIF_FD_SYNC_WAKE_UP)
> + complete_on_current_cpu(&knotif->ready);
> + else
> + complete(&knotif->ready);
> out:
> mutex_unlock(&filter->notify_lock);
> return ret;
> @@ -1623,6 +1632,22 @@ static long seccomp_notify_id_valid(struct seccomp_filter *filter,
> return ret;
> }
>
> +static long seccomp_notify_set_flags(struct seccomp_filter *filter,
> + unsigned long flags)
> +{
> + long ret;
> +
> + if (flags & ~SECCOMP_USER_NOTIF_FD_SYNC_WAKE_UP)
> + return -EINVAL;
> +
> + ret = mutex_lock_interruptible(&filter->notify_lock);
> + if (ret < 0)
> + return ret;
> + filter->notif->flags = flags;
> + mutex_unlock(&filter->notify_lock);
> + return 0;
> +}
> +
> static long seccomp_notify_addfd(struct seccomp_filter *filter,
> struct seccomp_notif_addfd __user *uaddfd,
> unsigned int size)
> @@ -1752,6 +1777,8 @@ static long seccomp_notify_ioctl(struct file *file, unsigned int cmd,
> case SECCOMP_IOCTL_NOTIF_ID_VALID_WRONG_DIR:
> case SECCOMP_IOCTL_NOTIF_ID_VALID:
> return seccomp_notify_id_valid(filter, buf);
> + case SECCOMP_IOCTL_NOTIF_SET_FLAGS:
> + return seccomp_notify_set_flags(filter, arg);
> }
>
> /* Extensible Argument ioctls */
> --
> 2.40.0.rc0.216.gc4246ad0f0-goog
>

2023-04-08 03:22:08

by Chen Yu

[permalink] [raw]
Subject: Re: [PATCH 2/6] sched: add WF_CURRENT_CPU and externise ttwu

On 2023-03-07 at 23:31:57 -0800, Andrei Vagin wrote:
> From: Peter Oskolkov <[email protected]>
>
> Add WF_CURRENT_CPU wake flag that advices the scheduler to
> move the wakee to the current CPU. This is useful for fast on-CPU
> context switching use cases.
>
> In addition, make ttwu external rather than static so that
> the flag could be passed to it from outside of sched/core.c.
>
> Signed-off-by: Peter Oskolkov <[email protected]>
> Signed-off-by: Andrei Vagin <[email protected]>
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -7569,6 +7569,10 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
> if (wake_flags & WF_TTWU) {
> record_wakee(p);
>
> + if ((wake_flags & WF_CURRENT_CPU) &&
> + cpumask_test_cpu(cpu, p->cpus_ptr))
> + return cpu;
> +
I tried to reuse WF_CURRENT_CPU to mitigate the cross-CPU wakeup, but there
are regressions when running some workloads, and these workloads want to be
spread over idle CPUs whenever possible.
The reason for the regression is that the above change chooses the current CPU
no matter what the load/utilization of this CPU is. So tasks are stacked on one
CPU, which hurts throughput/latency. And I believe this issue would be more
severe on systems with a smaller number of CPUs within one LLC (compared to
Intel platforms), such as AMD and Arm64.

I know WF_CURRENT_CPU benefits seccomp, but can we make this change more
generic to benefit other workloads by making the condition that triggers
WF_CURRENT_CPU stricter? Say, only when the current CPU has one runnable task,
and treat the current CPU as the last resort by checking that the wakee's
previous CPU is not idle. In this way, we can enable the WF_CURRENT_CPU flag
dynamically when some condition is met (a short task, for example).

Thanks,
Chenyu

2023-04-10 05:03:29

by Andrei Vagin

[permalink] [raw]
Subject: Re: [PATCH 2/6] sched: add WF_CURRENT_CPU and externise ttwu

On Fri, Apr 7, 2023 at 8:20 PM Chen Yu <[email protected]> wrote:
>
> On 2023-03-07 at 23:31:57 -0800, Andrei Vagin wrote:
> > From: Peter Oskolkov <[email protected]>
> >
> > Add WF_CURRENT_CPU wake flag that advices the scheduler to
> > move the wakee to the current CPU. This is useful for fast on-CPU
> > context switching use cases.
> >
> > In addition, make ttwu external rather than static so that
> > the flag could be passed to it from outside of sched/core.c.
> >
> > Signed-off-by: Peter Oskolkov <[email protected]>
> > Signed-off-by: Andrei Vagin <[email protected]>
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -7569,6 +7569,10 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
> > if (wake_flags & WF_TTWU) {
> > record_wakee(p);
> >
> > + if ((wake_flags & WF_CURRENT_CPU) &&
> > + cpumask_test_cpu(cpu, p->cpus_ptr))
> > + return cpu;
> > +
> I tried to reuse WF_CURRENT_CPU to mitigate the cross-cpu wakeup, however there
> are regressions when running some workloads, and these workloads want to be
> spreaded on idle CPUs whenever possible.
> The reason for the regression is that, above change chooses current CPU no matter
> what the load/utilization of this CPU is. So task are stacked on 1 CPU and hurts
> throughput/latency. And I believe this issue would be more severe on system with
> smaller number of CPU within 1 LLC(when compared to Intel platforms), such as AMD,
> Arm64.

WF_CURRENT_CPU works only in certain conditions. Maybe you saw my
attempt to change how WF_SYNC works:

https://www.spinics.net/lists/kernel/msg4567650.html

Then we found that this idea doesn't work well, which is the reason we have
the separate WF_CURRENT_CPU flag.

>
> I know WF_CURRENT_CPU benefits seccomp, and can we make this change more genefic
> to benefit other workloads, by making the condition to trigger WF_CURRENT_CPU stricter?
> Say, only current CPU has 1 runnable task, and treat current CPU as the last resort by
> checking if the wakee's previous CPU is not idle. In this way, we can enable WF_CURRENT_CPU flag
> dynamically when some condition is met(a short task for example).

We discussed all of these here and here:

https://www.spinics.net/lists/kernel/msg4657545.html

https://lore.kernel.org/lkml/CANaxB-yWkKzhhPMGXCQbtjntJbqZ40FL2qtM2hk7LLWE-ZpbAg@mail.gmail.com/

I like your idea about short-duration tasks, but I think it is a separate
task that has to be done in a separate patch set. Here, I am solving the
problem of optimizing synchronous switches, where one task wakes up another
one and falls asleep immediately after that. Waking up the target task on the
current CPU looks reasonable in this case for a few reasons. First, waking up
a task on the current CPU is cheaper than waking it on another one, and it is
much cheaper than waking it on an idle CPU. Second, when tasks want to do
synchronous switches, they often exchange some data, so warm memory caches
work in our favor.

Thanks,
Andrei

2023-04-10 07:12:31

by Andrei Vagin

[permalink] [raw]
Subject: Re: [PATCH 4/6] seccomp: add the synchronous mode for seccomp_unotify

On Wed, Apr 5, 2023 at 8:42 PM Andy Lutomirski <[email protected]> wrote:
>
> On Tue, Mar 7, 2023 at 11:32 PM Andrei Vagin <[email protected]> wrote:
> >
> > seccomp_unotify allows more privileged processes to do actions on behalf
> > of less privileged processes.
> >
> > In many cases, the workflow is fully synchronous. It means a target
> > process triggers a system call and passes controls to a supervisor
> > process that handles the system call and returns controls to the target
> > process. In this context, "synchronous" means that only one process is
> > running and another one is waiting.
> >
> > There is the WF_CURRENT_CPU flag that is used to advise the scheduler to
> > move the wakee to the current CPU. For such synchronous workflows, it
> > makes context switches a few times faster.
> >
> > Right now, each interaction takes 12µs. With this patch, it takes about
> > 3µs.
>
> This is great, but:
>
> >
> > This change introduce the SECCOMP_USER_NOTIF_FD_SYNC_WAKE_UP flag that
> > it used to enable the sync mode.
>
> Other than being faster, what does this flag actually do in terms of
> user-visible semantics?

In short, the process handling an event wakes up on the same CPU where the
process that triggered the event has been running. Knowing this fact, the
user can decide when it is appropriate to use this flag.

Let's imagine that we have two processes, where one calls syscalls (the
target) and another one handles these syscalls (the supervisor). In this
case, the user should see that both processes are running on the same CPU.

If we have one target process and one supervisor process, they synchronously
swap with each other and don't need to run on a CPU concurrently. But it
becomes more complicated if one supervisor handles a group of target
processes. In this case, setting the SECCOMP_USER_NOTIF_FD_SYNC_WAKE_UP flag
depends on the frequency of events. If the supervisor often has pending
events (doesn't sleep between events), it is better to leave the flag unset
or add more supervisor processes.

Thanks,
Andrei

2023-04-10 17:32:35

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH 2/6] sched: add WF_CURRENT_CPU and externise ttwu

On Sun, Apr 9, 2023 at 9:56 PM Andrei Vagin <[email protected]> wrote:
>
> On Fri, Apr 7, 2023 at 8:20 PM Chen Yu <[email protected]> wrote:
> >
> > On 2023-03-07 at 23:31:57 -0800, Andrei Vagin wrote:
> > > From: Peter Oskolkov <[email protected]>
> > >
> > > Add WF_CURRENT_CPU wake flag that advices the scheduler to
> > > move the wakee to the current CPU. This is useful for fast on-CPU
> > > context switching use cases.
> > >
> > > In addition, make ttwu external rather than static so that
> > > the flag could be passed to it from outside of sched/core.c.
> > >
> > > Signed-off-by: Peter Oskolkov <[email protected]>
> > > Signed-off-by: Andrei Vagin <[email protected]>
> > > --- a/kernel/sched/fair.c
> > > +++ b/kernel/sched/fair.c
> > > @@ -7569,6 +7569,10 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
> > > if (wake_flags & WF_TTWU) {
> > > record_wakee(p);
> > >
> > > + if ((wake_flags & WF_CURRENT_CPU) &&
> > > + cpumask_test_cpu(cpu, p->cpus_ptr))
> > > + return cpu;
> > > +
> > I tried to reuse WF_CURRENT_CPU to mitigate the cross-cpu wakeup, however there
> > are regressions when running some workloads, and these workloads want to be
> > spreaded on idle CPUs whenever possible.
> > The reason for the regression is that, above change chooses current CPU no matter
> > what the load/utilization of this CPU is. So task are stacked on 1 CPU and hurts
> > throughput/latency. And I believe this issue would be more severe on system with
> > smaller number of CPU within 1 LLC(when compared to Intel platforms), such as AMD,
> > Arm64.
>
> WF_CURRENT_CPU works only in certain conditions. Maybe you saw my
> attempt to change how WF_SYNC works:
>
> https://www.spinics.net/lists/kernel/msg4567650.html
>
> Then we've found that this idea doesn't work well, and it is a reason
> why we have the separate WF_CURRENT_CPU flag.
>
> >
> > I know WF_CURRENT_CPU benefits seccomp, and can we make this change more genefic
> > to benefit other workloads, by making the condition to trigger WF_CURRENT_CPU stricter?
> > Say, only current CPU has 1 runnable task, and treat current CPU as the last resort by
> > checking if the wakee's previous CPU is not idle. In this way, we can enable WF_CURRENT_CPU flag
> > dynamically when some condition is met(a short task for example).
>
> We discussed all of these here and here:
>
> https://www.spinics.net/lists/kernel/msg4657545.html
>
> https://lore.kernel.org/lkml/CANaxB-yWkKzhhPMGXCQbtjntJbqZ40FL2qtM2hk7LLWE-ZpbAg@mail.gmail.com/
>
> I like your idea about short-duration tasks, but I think it is a
> separate task and it has to be done in a separate patch set. Here, I
> solve the problem of optimizing synchronous switches when one task wakes
> up another one and falls asleep immediately after that. Waking up the
> target task on the current CPU looks reasonable for a few reasons in
> this case. First, waking up a task on the current CPU is cheaper than on
> another one and it is much cheaper than waking on an idle cpu. Second,
> when tasks want to do synchronous switches, they often exchange some
> data, so memory caches can play on us.

I've contemplated this on occasion for quite a few years, and I think
that part of our issue is that the userspace ABI part doesn't exist
widely. In particular, most of the common ways that user tasks talk
to each other don't have a single system call that can do the
send-a-message-and-start-waiting part all at once. For example, if
task A is running and it wants to wake task B and then sleep:

UNIX sockets (or other sockets): A calls send() or write() or
sendmsg() then recv() or read() or recvmsg() or epoll_wait() or poll()
or select().

Pipes: Same as sockets except no send/recv.

Shared memory: no wakeup or sleep mechanism at all. UMONITOR doesn't count :)

I think io_uring can kind of do a write-and-wait operation, but I
doubt it's wired up for this purpose.


seccomp seems like it should be able to do this straightforwardly on
the transition from the seccomp-sandboxed task to the monitor, but the
reverse direction is tricky.



Anyway, short of a massive API project, I don't see a totally
brilliant solution. But maybe we could take a baby step toward a
general solution by deferring all the hard work of a wakeup a bit so
that, as we grow syscalls and other mechanisms that do wake-and-wait,
we can optimize them automatically. For example, we could perhaps add
a pending wakeup to task_struct, kind of like:

struct task_struct *task_to_wake;

and then, the next time we sleep or return to usermode, we handle the
wakeup. And if we're going to sleep, we can do it as an optimized
synchronous wakeup. And if we try to wake a task while task_to_wake
is set, we just wake everything normally.

(There are refcounting issues here, and maybe this wants to be percpu,
not per task.)

I think it would be really nifty if Linux could somewhat reliably do
this style of synchronous context switch.

PeterZ, is this at all sensible or am I nuts?

--Andy

2023-04-10 18:25:57

by Chen Yu

[permalink] [raw]
Subject: Re: [PATCH 2/6] sched: add WF_CURRENT_CPU and externise ttwu

On 2023-04-09 at 21:56:26 -0700, Andrei Vagin wrote:
> On Fri, Apr 7, 2023 at 8:20 PM Chen Yu <[email protected]> wrote:
> >
> > On 2023-03-07 at 23:31:57 -0800, Andrei Vagin wrote:
> > > From: Peter Oskolkov <[email protected]>
> > >
> > > Add WF_CURRENT_CPU wake flag that advices the scheduler to
> > > move the wakee to the current CPU. This is useful for fast on-CPU
> > > context switching use cases.
> > >
> > > In addition, make ttwu external rather than static so that
> > > the flag could be passed to it from outside of sched/core.c.
> > >
> > > Signed-off-by: Peter Oskolkov <[email protected]>
> > > Signed-off-by: Andrei Vagin <[email protected]>
> > > --- a/kernel/sched/fair.c
> > > +++ b/kernel/sched/fair.c
> > > @@ -7569,6 +7569,10 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
> > > if (wake_flags & WF_TTWU) {
> > > record_wakee(p);
> > >
> > > + if ((wake_flags & WF_CURRENT_CPU) &&
> > > + cpumask_test_cpu(cpu, p->cpus_ptr))
> > > + return cpu;
> > > +
> > I tried to reuse WF_CURRENT_CPU to mitigate the cross-cpu wakeup, however there
> > are regressions when running some workloads, and these workloads want to be
> > spreaded on idle CPUs whenever possible.
> > The reason for the regression is that, above change chooses current CPU no matter
> > what the load/utilization of this CPU is. So task are stacked on 1 CPU and hurts
> > throughput/latency. And I believe this issue would be more severe on system with
> > smaller number of CPU within 1 LLC(when compared to Intel platforms), such as AMD,
> > Arm64.
>
> WF_CURRENT_CPU works only in certain conditions. Maybe you saw my
> attempt to change how WF_SYNC works:
>
> https://www.spinics.net/lists/kernel/msg4567650.html
>
> Then we've found that this idea doesn't work well, and it is a reason
> why we have the separate WF_CURRENT_CPU flag.
>
I see, in seccomp case, even the idle CPU is not a better choice.
> >
> > I know WF_CURRENT_CPU benefits seccomp, and can we make this change more genefic
> > to benefit other workloads, by making the condition to trigger WF_CURRENT_CPU stricter?
> > Say, only current CPU has 1 runnable task, and treat current CPU as the last resort by
> > checking if the wakee's previous CPU is not idle. In this way, we can enable WF_CURRENT_CPU flag
> > dynamically when some condition is met(a short task for example).
>
> We discussed all of these here and here:
>
> https://www.spinics.net/lists/kernel/msg4657545.html
>
> https://lore.kernel.org/lkml/CANaxB-yWkKzhhPMGXCQbtjntJbqZ40FL2qtM2hk7LLWE-ZpbAg@mail.gmail.com/
>
> I like your idea about short-duration tasks, but I think it is a
> separate task and it has to be done in a separate patch set. Here, I
> solve the problem of optimizing synchronous switches when one task wakes
> up another one and falls asleep immediately after that. Waking up the
> target task on the current CPU looks reasonable for a few reasons in
> this case. First, waking up a task on the current CPU is cheaper than on
> another one and it is much cheaper than waking on an idle cpu.
It depends. For a waker and wakee that compete for cache resources and do
not share data, an idle target would sometimes be better.
> Second,
> when tasks want to do synchronous switches, they often exchange some
> data, so memory caches can play on us.
I like the name 'WF_CURRENT_CPU' too : ) and I was thinking about whether this
could become an auto-detect behavior so that others can benefit from it.

If I understand correctly, the scenario this patch deals with is:
task A wakes up task B; task A and task B have a close relationship with each
other (cache sharing, e.g.); when task A falls asleep, choose A's CPU rather
than an idle CPU.

I'm wondering whether the following logic would cover your case:
1. the waker A is a short duration one (waker will fall asleep soon)
2. the waker B is a short duration one (impact of B is minor)
3. the A->wakee_flips is 0 and A->last_wakee = B
4. the A->wakee_flips is 0 and B->last_wakee = A
5. cpu(A)->nr_running = 1

(3 and 4 mean that A and B wake up each other, so it is likely that
they share cache data, and they are good friends to be put together)

If the above conditions are met, choose the current CPU. In this way,
WF_CURRENT_CPU can be set dynamically.

thanks,
Chenyu

2023-04-10 20:56:23

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH 4/6] seccomp: add the synchronous mode for seccomp_unotify



On Sun, Apr 9, 2023, at 11:59 PM, Andrei Vagin wrote:
> On Wed, Apr 5, 2023 at 8:42 PM Andy Lutomirski <[email protected]> wrote:
>>
>> On Tue, Mar 7, 2023 at 11:32 PM Andrei Vagin <[email protected]> wrote:
>> >
>> > seccomp_unotify allows more privileged processes to do actions on behalf
>> > of less privileged processes.
>> >
>> > In many cases, the workflow is fully synchronous. It means a target
>> > process triggers a system call and passes controls to a supervisor
>> > process that handles the system call and returns controls to the target
>> > process. In this context, "synchronous" means that only one process is
>> > running and another one is waiting.
>> >
>> > There is the WF_CURRENT_CPU flag that is used to advise the scheduler to
>> > move the wakee to the current CPU. For such synchronous workflows, it
>> > makes context switches a few times faster.
>> >
>> > Right now, each interaction takes 12µs. With this patch, it takes about
>> > 3µs.
>>
>> This is great, but:
>>
>> >
>> > This change introduce the SECCOMP_USER_NOTIF_FD_SYNC_WAKE_UP flag that
>> > it used to enable the sync mode.
>>
>> Other than being faster, what does this flag actually do in terms of
>> user-visible semantics?
>
> In short, the process handling an event wakes up on the same cpu where the
> process that triggered the event has been running. Knowing this fact, the user
> can understand when it is appropriate to use this flag.
>
> Let's imagine that we have two processes where one calls syscalls (the
> target) and another one handles these syscalls (the supervisor). In this case,
> the user should see that both processes are running on the same cpu.

So I think you're saying it has no semantic effect. It's a performance thing.

>
> If we have one target process and one supervisor process, they synchronously
> swap with each other and don't need to run on cpu concurrently. But
> it becomes more
> complicated if one supervisor handles a group of target processes. In this
> case, setting the SECCOMP_USER_NOTIF_FD_SYNC_WAKE_UP flag depends on the
> frequency of events. If the supervisor often has pending events (doesn't sleep
> between events), it is better to unset the flag or add more supervisor
> processes.

ISTM the kernel ought to be able to handle this much more automatically. The scheduler knows whether the target is running and how busy it has been.

I'm not sure what the right heuristic is in the kernel, but I'm also not convinced that user code has any particular insight here.

2023-04-11 02:10:36

by Chen Yu

[permalink] [raw]
Subject: Re: [PATCH 2/6] sched: add WF_CURRENT_CPU and externise ttwu

On 2023-04-11 at 02:16:56 +0800, Chen Yu wrote:
> On 2023-04-09 at 21:56:26 -0700, Andrei Vagin wrote:
> > On Fri, Apr 7, 2023 at 8:20 PM Chen Yu <[email protected]> wrote:
> > >
> > > On 2023-03-07 at 23:31:57 -0800, Andrei Vagin wrote:
> > > > From: Peter Oskolkov <[email protected]>
> > > >
> > > > Add WF_CURRENT_CPU wake flag that advices the scheduler to
> > > > move the wakee to the current CPU. This is useful for fast on-CPU
> > > > context switching use cases.
> > > >
> > > > In addition, make ttwu external rather than static so that
> > > > the flag could be passed to it from outside of sched/core.c.
> > > >
> > > > Signed-off-by: Peter Oskolkov <[email protected]>
> > > > Signed-off-by: Andrei Vagin <[email protected]>
> > > > --- a/kernel/sched/fair.c
> > > > +++ b/kernel/sched/fair.c
> > > > @@ -7569,6 +7569,10 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
> > > > if (wake_flags & WF_TTWU) {
> > > > record_wakee(p);
> > > >
> > > > + if ((wake_flags & WF_CURRENT_CPU) &&
> > > > + cpumask_test_cpu(cpu, p->cpus_ptr))
> > > > + return cpu;
> > > > +
> > > I tried to reuse WF_CURRENT_CPU to mitigate the cross-cpu wakeup, however there
> > > are regressions when running some workloads, and these workloads want to be
> > > spreaded on idle CPUs whenever possible.
> > > The reason for the regression is that, above change chooses current CPU no matter
> > > what the load/utilization of this CPU is. So task are stacked on 1 CPU and hurts
> > > throughput/latency. And I believe this issue would be more severe on system with
> > > smaller number of CPU within 1 LLC(when compared to Intel platforms), such as AMD,
> > > Arm64.
> >
> > WF_CURRENT_CPU works only in certain conditions. Maybe you saw my
> > attempt to change how WF_SYNC works:
> >
> > https://www.spinics.net/lists/kernel/msg4567650.html
> >
> > Then we've found that this idea doesn't work well, and it is a reason
> > why we have the separate WF_CURRENT_CPU flag.
> >
> I see, in seccomp case, even the idle CPU is not a better choice.
> > >
> > > I know WF_CURRENT_CPU benefits seccomp, and can we make this change more genefic
> > > to benefit other workloads, by making the condition to trigger WF_CURRENT_CPU stricter?
> > > Say, only current CPU has 1 runnable task, and treat current CPU as the last resort by
> > > checking if the wakee's previous CPU is not idle. In this way, we can enable WF_CURRENT_CPU flag
> > > dynamically when some condition is met(a short task for example).
> >
> > We discussed all of these here and here:
> >
> > https://www.spinics.net/lists/kernel/msg4657545.html
> >
> > https://lore.kernel.org/lkml/CANaxB-yWkKzhhPMGXCQbtjntJbqZ40FL2qtM2hk7LLWE-ZpbAg@mail.gmail.com/
> >
> > I like your idea about short-duration tasks, but I think it is a
> > separate task and it has to be done in a separate patch set. Here, I
> > solve the problem of optimizing synchronous switches when one task wakes
> > up another one and falls asleep immediately after that. Waking up the
> > target task on the current CPU looks reasonable for a few reasons in
> > this case. First, waking up a task on the current CPU is cheaper than on
> > another one and it is much cheaper than waking on an idle cpu.
> It depends. For waker and wakee that compete for cache resource and do
> not have share data, sometimes an idle target would be better.
> > Second,
> > when tasks want to do synchronous switches, they often exchange some
> > data, so memory caches can play on us.
> I like the name of 'WF_CURRENT_CPU' too : ) and I was thinking that if this could
> become a auto-detect behavior so others can benefit from this.
>
> If I understand correctly, the scenario this patch deals with is:
> task A wakeups task B, task A and taks B have close relationship with each
> other(cache sharing eg), when task A fall asleep, choose A's CPU, rather than an
> idle CPU.
>
> I'm thinking if the following logic would cover your case:
> 1. the waker A is a short duration one (waker will fall asleep soon)
> 2. the waker B is a short duration one (impact of B is minor)
typo: s/waker B/wakee B/
> 3. the A->wakee_flips is 0 and A->last_wakee = B
> 4. the A->wakee_flips is 0 and B->last_wakee = A
typo: s/A->wakee_flips/B->wakee_flips/

Sorry I typed too quickly yesterday.

thanks,
Chenyu
> 5, cpu(A)->nr_running = 1
>
> (3 and 4 mean that, A and B wake up each other, so it is likely that
> they share cache data, and they are good firends to be put together)
>
> If above conditions are met, choose current CPU. In this way, WF_CURRENT_CPU
> can be set dynamically.
>
> thanks,
> Chenyu

2023-04-12 19:46:48

by Andrei Vagin

[permalink] [raw]
Subject: Re: [PATCH 2/6] sched: add WF_CURRENT_CPU and externise ttwu

On Mon, Apr 10, 2023 at 10:27 AM Andy Lutomirski <[email protected]> wrote:
>
> On Sun, Apr 9, 2023 at 9:56 PM Andrei Vagin <[email protected]> wrote:
> >
> > On Fri, Apr 7, 2023 at 8:20 PM Chen Yu <[email protected]> wrote:
> > >
> > > On 2023-03-07 at 23:31:57 -0800, Andrei Vagin wrote:
> > > > From: Peter Oskolkov <[email protected]>
> > > >
> > > > Add WF_CURRENT_CPU wake flag that advices the scheduler to
> > > > move the wakee to the current CPU. This is useful for fast on-CPU
> > > > context switching use cases.
> > > >
> > > > In addition, make ttwu external rather than static so that
> > > > the flag could be passed to it from outside of sched/core.c.
> > > >
> > > > Signed-off-by: Peter Oskolkov <[email protected]>
> > > > Signed-off-by: Andrei Vagin <[email protected]>
> > > > --- a/kernel/sched/fair.c
> > > > +++ b/kernel/sched/fair.c
> > > > @@ -7569,6 +7569,10 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
> > > > if (wake_flags & WF_TTWU) {
> > > > record_wakee(p);
> > > >
> > > > + if ((wake_flags & WF_CURRENT_CPU) &&
> > > > + cpumask_test_cpu(cpu, p->cpus_ptr))
> > > > + return cpu;
> > > > +
> > > I tried to reuse WF_CURRENT_CPU to mitigate the cross-cpu wakeup, however there
> > > are regressions when running some workloads, and these workloads want to be
> > > spread on idle CPUs whenever possible.
> > > The reason for the regression is that the above change chooses the current CPU no
> > > matter what the load/utilization of this CPU is. So tasks are stacked on one CPU,
> > > which hurts throughput/latency. And I believe this issue would be more severe on
> > > systems with a smaller number of CPUs within one LLC (compared to Intel platforms),
> > > such as AMD or Arm64.
> >
> > WF_CURRENT_CPU works only in certain conditions. Maybe you saw my
> > attempt to change how WF_SYNC works:
> >
> > https://www.spinics.net/lists/kernel/msg4567650.html
> >
> > Then we've found that this idea doesn't work well, and it is a reason
> > why we have the separate WF_CURRENT_CPU flag.
> >
> > >
> > > I know WF_CURRENT_CPU benefits seccomp, but can we make this change more generic
> > > to benefit other workloads, by making the condition that triggers WF_CURRENT_CPU stricter?
> > > Say, only when the current CPU has 1 runnable task, and treat the current CPU as the last
> > > resort by checking whether the wakee's previous CPU is not idle. In this way, we can enable
> > > the WF_CURRENT_CPU flag dynamically when some condition is met (a short task, for example).
> >
> > We discussed all of these here and here:
> >
> > https://www.spinics.net/lists/kernel/msg4657545.html
> >
> > https://lore.kernel.org/lkml/CANaxB-yWkKzhhPMGXCQbtjntJbqZ40FL2qtM2hk7LLWE-ZpbAg@mail.gmail.com/
> >
> > I like your idea about short-duration tasks, but I think it is a
> > separate task and it has to be done in a separate patch set. Here, I
> > solve the problem of optimizing synchronous switches when one task wakes
> > up another one and falls asleep immediately after that. Waking up the
> > target task on the current CPU looks reasonable for a few reasons in
> > this case. First, waking up a task on the current CPU is cheaper than on
> > another one and it is much cheaper than waking on an idle cpu. Second,
> > when tasks want to do synchronous switches, they often exchange some
> > data, so warm CPU caches work in our favor.
>
> I've contemplated this on occasion for quite a few years, and I think
> that part of our issue is that the userspace ABI part doesn't exist
> widely. In particular, most of the common ways that user tasks talk
> to each other don't have a single system call that can do the
> send-a-message-and-start-waiting part all at once. For example, if
> task A is running and it wants to wake task B and then sleep:
>
> UNIX sockets (or other sockets): A calls send() or write() or
> sendmsg() then recv() or read() or recvmsg() or epoll_wait() or poll()
> or select().
>
> Pipes: Same as sockets except no send/recv.
>
> Shared memory: no wakeup or sleep mechanism at all. UMONITOR doesn't count :)

Futexes? There was an attempt to add FUTEX_SWAP a few years ago:
https://www.spinics.net/lists/kernel/msg3871065.html

It wasn't merged upstream in favor of umcg:
https://lore.kernel.org/linux-mm/[email protected]/

Both features solve similar problems; FUTEX_SWAP is simple and
straightforward, while umcg is broader and more complicated.

>
> I think io_uring can kind of do a write-and-wait operation, but I
> doubt it's wired up for this purpose.

I think it may be a good candidate for where this logic could live.

>
>
> seccomp seems like it should be able to do this straightforwardly on
> the transition from the seccomp-sandboxed task to the monitor, but the
> reverse direction is tricky.
>
>
>
> Anyway, short of a massive API project, I don't see a totally
> brilliant solution. But maybe we could take a baby step toward a
> general solution by deferring all the hard work of a wakeup a bit so
> that, as we grow syscalls and other mechanisms that do wake-and-wait,
> we can optimize them automatically. For example, we could perhaps add
> a pending wakeup to task_struct, kind of like:
>
> struct task_struct *task_to_wake;
>
> and then, the next time we sleep or return to usermode, we handle the
> wakeup. And if we're going to sleep, we can do it as an optimized
> synchronous wakeup. And if we try to wake a task while task_to_wake
> is set, we just wake everything normally.

I am not sure that I understand when it has to be set and when it will
take effect. For example, suppose we want to do a write&read syscall
pair: write sets task_to_wake, then the current task resumes without
waking the target task, and only after that does task_to_wake take
effect. In other words, it takes effect only after the next-but-one
return to user mode.

Thanks,
Andrei

>
> (There are refcounting issues here, and maybe this wants to be percpu,
> not per task.)
>
> I think it would be really nifty if Linux could somewhat reliably do
> this style of synchronous context switch.
>
> PeterZ, is this at all sensible or am I nuts?
>
> --Andy

2023-04-17 19:28:53

by Andrei Vagin

[permalink] [raw]
Subject: Re: [PATCH 2/6] sched: add WF_CURRENT_CPU and externise ttwu

On Mon, Apr 10, 2023 at 11:17 AM Chen Yu <[email protected]> wrote:
>
> On 2023-04-09 at 21:56:26 -0700, Andrei Vagin wrote:
> > On Fri, Apr 7, 2023 at 8:20 PM Chen Yu <[email protected]> wrote:
> > >
> > > On 2023-03-07 at 23:31:57 -0800, Andrei Vagin wrote:
> > > > From: Peter Oskolkov <[email protected]>
> > > >
> > > > Add WF_CURRENT_CPU wake flag that advises the scheduler to
> > > > move the wakee to the current CPU. This is useful for fast on-CPU
> > > > context switching use cases.
> > > >
> > > > In addition, make ttwu external rather than static so that
> > > > the flag could be passed to it from outside of sched/core.c.
> > > >
> > > > Signed-off-by: Peter Oskolkov <[email protected]>
> > > > Signed-off-by: Andrei Vagin <[email protected]>
> > > > --- a/kernel/sched/fair.c
> > > > +++ b/kernel/sched/fair.c
> > > > @@ -7569,6 +7569,10 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
> > > > if (wake_flags & WF_TTWU) {
> > > > record_wakee(p);
> > > >
> > > > + if ((wake_flags & WF_CURRENT_CPU) &&
> > > > + cpumask_test_cpu(cpu, p->cpus_ptr))
> > > > + return cpu;
> > > > +
> > > I tried to reuse WF_CURRENT_CPU to mitigate the cross-cpu wakeup, however there
> > > are regressions when running some workloads, and these workloads want to be
> > > spread on idle CPUs whenever possible.
> > > The reason for the regression is that the above change chooses the current CPU no
> > > matter what the load/utilization of this CPU is. So tasks are stacked on one CPU,
> > > which hurts throughput/latency. And I believe this issue would be more severe on
> > > systems with a smaller number of CPUs within one LLC (compared to Intel platforms),
> > > such as AMD or Arm64.
> >
> > WF_CURRENT_CPU works only in certain conditions. Maybe you saw my
> > attempt to change how WF_SYNC works:
> >
> > https://www.spinics.net/lists/kernel/msg4567650.html
> >
> > Then we've found that this idea doesn't work well, and it is a reason
> > why we have the separate WF_CURRENT_CPU flag.
> >
> I see; in the seccomp case, even an idle CPU is not a better choice.
> > >
> > > I know WF_CURRENT_CPU benefits seccomp, but can we make this change more generic
> > > to benefit other workloads, by making the condition that triggers WF_CURRENT_CPU stricter?
> > > Say, only when the current CPU has 1 runnable task, and treat the current CPU as the last
> > > resort by checking whether the wakee's previous CPU is not idle. In this way, we can enable
> > > the WF_CURRENT_CPU flag dynamically when some condition is met (a short task, for example).
> >
> > We discussed all of these here and here:
> >
> > https://www.spinics.net/lists/kernel/msg4657545.html
> >
> > https://lore.kernel.org/lkml/CANaxB-yWkKzhhPMGXCQbtjntJbqZ40FL2qtM2hk7LLWE-ZpbAg@mail.gmail.com/
> >
> > I like your idea about short-duration tasks, but I think it is a
> > separate task and it has to be done in a separate patch set. Here, I
> > solve the problem of optimizing synchronous switches when one task wakes
> > up another one and falls asleep immediately after that. Waking up the
> > target task on the current CPU looks reasonable for a few reasons in
> > this case. First, waking up a task on the current CPU is cheaper than on
> > another one and it is much cheaper than waking on an idle cpu.
> It depends. For waker and wakee that compete for cache resource and do
> not have share data, sometimes an idle target would be better.
> > Second,
> > when tasks want to do synchronous switches, they often exchange some
> > data, so warm CPU caches work in our favor.
> I like the name of 'WF_CURRENT_CPU' too : ) and I was wondering whether this could
> become an auto-detect behavior so others can benefit from it.
>
> If I understand correctly, the scenario this patch deals with is:
> task A wakes up task B; task A and task B have a close relationship with each
> other (e.g. cache sharing); when task A falls asleep, choose A's CPU rather than an
> idle CPU.
>
> I'm thinking if the following logic would cover your case:
> 1. the waker A is a short duration one (waker will fall asleep soon)
> 2. the waker B is a short duration one (impact of B is minor)

In the seccomp case, A or B can be a non-short-duration task, but if
they do synchronous switches, they still get all the benefits of
WF_CURRENT_CPU.

> 3. the A->wakee_flips is 0 and A->last_wakee = B

In our case, the "supervisor" is written in Go, and its goroutines can
be executed from different system threads, so this condition will often
fail too.

> 4. the A->wakee_flips is 0 and B->last_wakee = A
> 5, cpu(A)->nr_running = 1
>
> (3 and 4 mean that, A and B wake up each other, so it is likely that
> they share cache data, and they are good friends to be put together)
>
> If above conditions are met, choose current CPU. In this way, WF_CURRENT_CPU
> can be set dynamically.
>
> thanks,
> Chenyu

2023-04-26 18:55:10

by Andrei Vagin

[permalink] [raw]
Subject: Re: [PATCH 3/6] sched: add a few helpers to wake up tasks on the current cpu

On Wed, Apr 26, 2023 at 7:43 AM Bernd Schubert <[email protected]> wrote:
>
> > Add complete_on_current_cpu, wake_up_poll_on_current_cpu helpers to wake
> > up tasks on the current CPU.
>
> > These two helpers are useful when the task needs to make a synchronous context
> > switch to another task. In this context, synchronous means it wakes up the
> > target task and falls asleep right after that.
>
> > One example of such workloads is seccomp user notify. This mechanism allows
> > the supervisor process to handle system calls on behalf of a target process.
> > While the supervisor is handling an intercepted system call, the target process
> > will be blocked in the kernel, waiting for a response to come back.
>
> > On-CPU context switches are much faster than regular ones.
>
> > Signed-off-by: Andrei Vagin <[email protected]>
>
> Avoiding cpu switches is very desirable for fuse, I'm working on fuse over uring
> with per core queues. In my current branch and running a single threaded bonnie++
> I get about 9000 creates/s when I bind the process to a core, about 7000 creates/s
> when I set SCHED_IDLE for the ring threads and back to 9000 with SCHED_IDLE and
> disabling cpu migration in fs/fuse/dev.c request_wait_answer() before going into
> the waitq and enabling it back after waking up.
>
> I had reported this a few weeks back
> https://lore.kernel.org/lkml/[email protected]/
> and had been pointed to your and Prateeks patch series. I'm now going
> through these series. Interesting part is that a few weeks I didn't need
> SCHED_IDLE, just disabling/enabling migration before/after waking up was
> enough.
>
> [...]
>
> > EXPORT_SYMBOL(swake_up_one);
> > diff --git a/kernel/sched/wait.c b/kernel/sched/wait.c
> > index 133b74730738..47803a0b8d5d 100644
> > --- a/kernel/sched/wait.c
> > +++ b/kernel/sched/wait.c
> > @@ -161,6 +161,11 @@ int __wake_up(struct wait_queue_head *wq_head, unsigned int mode,
> > }
> > EXPORT_SYMBOL(__wake_up);
>
> > +void __wake_up_on_current_cpu(struct wait_queue_head *wq_head, unsigned int mode, void *key)
> > +{
> > + __wake_up_common_lock(wq_head, mode, 1, WF_CURRENT_CPU, key);
> > +}
>
> I'm about to test this instead of migrate_disable/migrate_enable, but the symbol needs
> to be exported - any objection to do that right from the beginning in your patch?

I think EXPORT_SYMBOL_GPL should not trigger any objections and it
covers your case, doesn't it?

Thanks,
Andrei

2023-04-26 19:37:37

by Bernd Schubert

[permalink] [raw]
Subject: Re: [PATCH 3/6] sched: add a few helpers to wake up tasks on the current cpu

On 4/26/23 20:52, Andrei Vagin wrote:
> On Wed, Apr 26, 2023 at 7:43 AM Bernd Schubert <[email protected]> wrote:
>>
>>> Add complete_on_current_cpu, wake_up_poll_on_current_cpu helpers to wake
>>> up tasks on the current CPU.
>>
>>> These two helpers are useful when the task needs to make a synchronous context
>>> switch to another task. In this context, synchronous means it wakes up the
>>> target task and falls asleep right after that.
>>
>>> One example of such workloads is seccomp user notify. This mechanism allows
>>> the supervisor process to handle system calls on behalf of a target process.
>>> While the supervisor is handling an intercepted system call, the target process
>>> will be blocked in the kernel, waiting for a response to come back.
>>
>>> On-CPU context switches are much faster than regular ones.
>>
>>> Signed-off-by: Andrei Vagin <[email protected]>
>>
>> Avoiding cpu switches is very desirable for fuse, I'm working on fuse over uring
>> with per core queues. In my current branch and running a single threaded bonnie++
>> I get about 9000 creates/s when I bind the process to a core, about 7000 creates/s
>> when I set SCHED_IDLE for the ring threads and back to 9000 with SCHED_IDLE and
>> disabling cpu migration in fs/fuse/dev.c request_wait_answer() before going into
>> the waitq and enabling it back after waking up.
>>
>> I had reported this a few weeks back
>> https://lore.kernel.org/lkml/[email protected]/
>> and had been pointed to your and Prateek's patch series. I'm now going
>> through these series. The interesting part is that a few weeks ago I didn't need
>> SCHED_IDLE, just disabling/enabling migration before/after waking up was
>> enough.
>>
>> [...]
>>
>>> EXPORT_SYMBOL(swake_up_one);
>>> diff --git a/kernel/sched/wait.c b/kernel/sched/wait.c
>>> index 133b74730738..47803a0b8d5d 100644
>>> --- a/kernel/sched/wait.c
>>> +++ b/kernel/sched/wait.c
>>> @@ -161,6 +161,11 @@ int __wake_up(struct wait_queue_head *wq_head, unsigned int mode,
>>> }
>>> EXPORT_SYMBOL(__wake_up);
>>
>>> +void __wake_up_on_current_cpu(struct wait_queue_head *wq_head, unsigned int mode, void *key)
>>> +{
>>> + __wake_up_common_lock(wq_head, mode, 1, WF_CURRENT_CPU, key);
>>> +}
>>
>> I'm about to test this instead of migrate_disable/migrate_enable, but the symbol needs
>> to be exported - any objection to do that right from the beginning in your patch?
>
> I think EXPORT_SYMBOL_GPL should not trigger any objections and it
> covers your case, doesn't it?

Ah yes, sure, _GPL is fine. I have applied 2/6 and 3/6 in my branch and then have

wait.h
#define wake_up_interruptible_sync(x) __wake_up_sync((x), TASK_INTERRUPTIBLE)

and am using that in fuse_request_end() - works fine and no migration on wake up.
Though, I still need SCHED_IDLE for the uring thread to avoid a later migration,
will open a separate thread for that.


Thanks,
Bernd


2023-04-26 20:58:41

by Bernd Schubert

[permalink] [raw]
Subject: Re: [PATCH 3/6] sched: add a few helpers to wake up tasks on the current cpu

On 4/26/23 21:35, Bernd Schubert wrote:
> On 4/26/23 20:52, Andrei Vagin wrote:
>> On Wed, Apr 26, 2023 at 7:43 AM Bernd Schubert <[email protected]> wrote:
>>>
>>>> Add complete_on_current_cpu, wake_up_poll_on_current_cpu helpers to
>>>> wake
>>>> up tasks on the current CPU.
>>>
>>>> These two helpers are useful when the task needs to make a
>>>> synchronous context
>>>> switch to another task. In this context, synchronous means it wakes
>>>> up the
>>>> target task and falls asleep right after that.
>>>
>>>> One example of such workloads is seccomp user notify. This
>>>> mechanism allows
>>>> the supervisor process to handle system calls on behalf of a target
>>>> process.
>>>> While the supervisor is handling an intercepted system call, the
>>>> target process
>>>> will be blocked in the kernel, waiting for a response to come back.
>>>
>>>> On-CPU context switches are much faster than regular ones.
>>>
>>>> Signed-off-by: Andrei Vagin <[email protected]>
>>>
>>> Avoiding cpu switches is very desirable for fuse, I'm working on fuse
>>> over uring
>>> with per core queues. In my current branch and running a single
>>> threaded bonnie++
>>> I get about 9000 creates/s when I bind the process to a core, about
>>> 7000 creates/s
>>> when I set SCHED_IDLE for the ring threads and back to 9000 with
>>> SCHED_IDLE and
>>> disabling cpu migration in fs/fuse/dev.c request_wait_answer() before
>>> going into
>>> the waitq and enabling it back after waking up.
>>>
>>> I had reported this a few weeks back
>>> https://lore.kernel.org/lkml/[email protected]/
>>> and had been pointed to your and Prateek's patch series. I'm now going
>>> through these series. The interesting part is that a few weeks ago I didn't need
>>> SCHED_IDLE, just disabling/enabling migration before/after waking up was
>>> enough.
>>>
>>> [...]
>>>
>>>> EXPORT_SYMBOL(swake_up_one);
>>>> diff --git a/kernel/sched/wait.c b/kernel/sched/wait.c
>>>> index 133b74730738..47803a0b8d5d 100644
>>>> --- a/kernel/sched/wait.c
>>>> +++ b/kernel/sched/wait.c
>>>> @@ -161,6 +161,11 @@ int __wake_up(struct wait_queue_head *wq_head,
>>>> unsigned int mode,
>>>>   }
>>>>   EXPORT_SYMBOL(__wake_up);
>>>
>>>> +void __wake_up_on_current_cpu(struct wait_queue_head *wq_head,
>>>> unsigned int mode, void *key)
>>>> +{
>>>> +     __wake_up_common_lock(wq_head, mode, 1, WF_CURRENT_CPU, key);
>>>> +}
>>>
>>> I'm about to test this instead of migrate_disable/migrate_enable, but
>>> the symbol needs
>>> to be exported - any objection to do that right from the beginning in
>>> your patch?
>>
>> I think EXPORT_SYMBOL_GPL should not trigger any objections and it
>> covers your case, doesn't it?
>
> Ah yes, sure, _GPL is fine. I have applied 2/6 and 3/6 in my branch and
> then have
>
> wait.h
> #define wake_up_interruptible_sync(x)    __wake_up_sync((x),
> TASK_INTERRUPTIBLE)

Sorry, no, that was the part that is not actually working; I'm instead
using __wake_up_on_current_cpu(&req->waitq, TASK_NORMAL, NULL) in
fuse_request_end().

>
> and am using that in fuse_request_end() - works fine and no migration
> on wake up.
> Though, I still need SCHED_IDLE for the uring thread to avoid a later
> migration,
> will open a separate thread for that.
>
>
> Thanks,
> Bernd
>
>

2023-06-28 18:53:58

by Andrei Vagin

[permalink] [raw]
Subject: Re: [PATCH 0/6 v5 RESEND] seccomp: add the synchronous mode for seccomp_unotify

On Wed, Apr 5, 2023 at 8:19 PM Kees Cook <[email protected]> wrote:
>
> On April 3, 2023 11:32:00 AM PDT, Andrei Vagin <[email protected]> wrote:
> >On Mon, Mar 27, 2023 at 3:27 AM Peter Zijlstra <[email protected]> wrote:
> >>
> >> On Tue, Mar 07, 2023 at 11:31:55PM -0800, Andrei Vagin wrote:
> >>
> >> > Kees is ready to take this patch set, but wants to get Acks from the
> >> > sched folks.
> >> >
> >>
> >> > Andrei Vagin (4):
> >> > seccomp: don't use semaphore and wait_queue together
> >> > sched: add a few helpers to wake up tasks on the current cpu
> >> > seccomp: add the synchronous mode for seccomp_unotify
> >> > selftest/seccomp: add a new test for the sync mode of
> >> > seccomp_user_notify
> >> >
> >> > Peter Oskolkov (1):
> >> > sched: add WF_CURRENT_CPU and externise ttwu
> >>
> >> Acked-by: Peter Zijlstra (Intel) <[email protected]>
> >
> >Kees,
> >
> >Could you look at this patch set? You wrote to one of the previous
> >versions that you are ready to take it if sched maintainers approve it.
> >There are no major changes since then. The sched-related part has
> >been cleaned up according to Peter's comments, and I moved the perf
> >test to the perf tool.
>
> Hi!
>
> Yes, thanks for keeping this going! I'm catching up after some vacation, but this is on my TODO list. :)

Hi Kees. Do you have any updates on this series?

>
> -Kees
>
>
> --
> Kees Cook

2023-06-28 23:52:07

by Kees Cook

[permalink] [raw]
Subject: Re: [PATCH 0/6 v5 RESEND] seccomp: add the synchronous mode for seccomp_unotify

On Wed, Jun 28, 2023 at 11:44:02AM -0700, Andrei Vagin wrote:
> On Wed, Apr 5, 2023 at 8:19 PM Kees Cook <[email protected]> wrote:
> >
> > On April 3, 2023 11:32:00 AM PDT, Andrei Vagin <[email protected]> wrote:
> > >On Mon, Mar 27, 2023 at 3:27 AM Peter Zijlstra <[email protected]> wrote:
> > >>
> > >> On Tue, Mar 07, 2023 at 11:31:55PM -0800, Andrei Vagin wrote:
> > >>
> > >> > Kees is ready to take this patch set, but wants to get Acks from the
> > >> > sched folks.
> > >> >
> > >>
> > >> > Andrei Vagin (4):
> > >> > seccomp: don't use semaphore and wait_queue together
> > >> > sched: add a few helpers to wake up tasks on the current cpu
> > >> > seccomp: add the synchronous mode for seccomp_unotify
> > >> > selftest/seccomp: add a new test for the sync mode of
> > >> > seccomp_user_notify
> > >> >
> > >> > Peter Oskolkov (1):
> > >> > sched: add WF_CURRENT_CPU and externise ttwu
> > >>
> > >> Acked-by: Peter Zijlstra (Intel) <[email protected]>
> > >
> > >Kees,
> > >
> > >Could you look at this patch set? You wrote to one of the previous
> > >versions that you are ready to take it if sched maintainers approve it.
> > >There are no major changes since then. The sched-related part has
> > >been cleaned up according to Peter's comments, and I moved the perf
> > >test to the perf tool.
> >
> > Hi!
> >
> > Yes, thanks for keeping this going! I'm catching up after some vacation, but this is on my TODO list. :)
>
> Hi Kees. Do you have any updates on this series?

Apologies for the delay!

I've added this to the seccomp tree -- it should show up in -next soon.

-Kees

--
Kees Cook

2023-10-17 08:24:40

by Jiri Slaby

[permalink] [raw]
Subject: Re: [PATCH 6/6] perf/benchmark: add a new benchmark for seccom_unotify

On 08. 03. 23, 8:32, Andrei Vagin wrote:
> --- a/tools/arch/x86/include/uapi/asm/unistd_32.h
> +++ b/tools/arch/x86/include/uapi/asm/unistd_32.h
> @@ -23,3 +23,6 @@
> #ifndef __NR_setns
> #define __NR_setns 346
> #endif
> +#ifdef __NR_seccomp

Ouch. s/ifdef/ifndef/

bench/sched-seccomp-notify.c:46:24: error: '__NR_seccomp' undeclared

Will send a fix.

> +#define __NR_seccomp 354
> +#endif
> diff --git a/tools/arch/x86/include/uapi/asm/unistd_64.h b/tools/arch/x86/include/uapi/asm/unistd_64.h
> index a6f7fe84d4df..e0549617f9d7 100644
> --- a/tools/arch/x86/include/uapi/asm/unistd_64.h
> +++ b/tools/arch/x86/include/uapi/asm/unistd_64.h
> @@ -23,3 +23,6 @@
> #ifndef __NR_getcpu
> #define __NR_getcpu 309
> #endif
> +#ifndef __NR_seccomp

This one is good...

--
js
suse labs