2024-05-23 01:46:04

Subject: [PATCH 0/3 v2] seccomp: improve handling of SECCOMP_IOCTL_NOTIF_RECV

This patch set addresses two problems with the SECCOMP_IOCTL_NOTIF_RECV
ioctl:
* it doesn't return when the seccomp filter becomes unused (all tasks
have exited).
* EPOLLHUP is triggered not when a task exits, but rather when its zombie
is collected.

v2: - Remove unnecessary checks of PF_EXITING.
    - Take siglock with irqs disabled.

Thanks to Oleg for the review and the help with the first version.

Andrei Vagin (3):
  seccomp: interrupt SECCOMP_IOCTL_NOTIF_RECV when all users have exited
  seccomp: release task filters when the task exits
  selftests/seccomp: add test for NOTIF_RECV and unused filters

 kernel/exit.c                                 |  3 +-
 kernel/seccomp.c                              | 38 ++++++++++---
 tools/testing/selftests/seccomp/seccomp_bpf.c | 54 +++++++++++++++++++
 3 files changed, 88 insertions(+), 7 deletions(-)

Cc: Kees Cook <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Will Drewry <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Christian Brauner <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Tycho Andersen <[email protected]>

--
2.45.0.rc1.225.g2a3ae87e7f-goog

2024-05-23 01:46:07

by Andrei Vagin

Subject: [PATCH 1/3] seccomp: interrupt SECCOMP_IOCTL_NOTIF_RECV when all users have exited

Make SECCOMP_IOCTL_NOTIF_RECV return promptly when a seccomp filter
becomes unused, since a filter without users can't trigger any events.

Previously, event listeners had to rely on epoll to detect when all
processes had exited.

The change is based on the 'commit 99cdb8b9a573 ("seccomp: notify about
unused filter")' which implemented (E)POLLHUP notifications.
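
For context, the poll-based detection that listeners needed before this
patch might look roughly like the sketch below. It is not part of the
patch: listener_wait() and enum listener_state are hypothetical names, and
the code only assumes that the listener fd reports POLLHUP once the filter
is unused, as described for commit 99cdb8b9a573.

```c
#include <errno.h>
#include <poll.h>

/* Hypothetical result codes, not kernel or libc API. */
enum listener_state {
	LISTENER_EVENT,		/* a notification is ready to be received */
	LISTENER_UNUSED,	/* POLLHUP: no task uses the filter any more */
	LISTENER_TIMEOUT,	/* nothing happened within timeout_ms */
	LISTENER_ERROR,		/* poll() failed or reported an error */
};

/* Wait on a seccomp listener fd (or any pollable fd) for one state change. */
static enum listener_state listener_wait(int fd, int timeout_ms)
{
	struct pollfd pfd = { .fd = fd, .events = POLLIN };
	int ret;

	/* Restart the wait if a signal interrupts it. */
	do {
		ret = poll(&pfd, 1, timeout_ms);
	} while (ret < 0 && errno == EINTR);

	if (ret < 0)
		return LISTENER_ERROR;
	if (ret == 0)
		return LISTENER_TIMEOUT;
	/* Report pending data first so it is drained before the hangup. */
	if (pfd.revents & POLLIN)
		return LISTENER_EVENT;
	if (pfd.revents & POLLHUP)
		return LISTENER_UNUSED;
	return LISTENER_ERROR;
}
```

A supervisor would call SECCOMP_IOCTL_NOTIF_RECV only after
LISTENER_EVENT and stop after LISTENER_UNUSED; with this patch the blocking
ioctl itself returns in the second case.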

Reviewed-by: Christian Brauner <[email protected]>
Signed-off-by: Andrei Vagin <[email protected]>
---
kernel/seccomp.c | 7 ++++++-
1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index f70e031e06a8..35435e8f1035 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -1466,7 +1466,7 @@ static int recv_wake_function(wait_queue_entry_t *wait, unsigned int mode, int s
 				  void *key)
 {
 	/* Avoid a wakeup if event not interesting for us. */
-	if (key && !(key_to_poll(key) & (EPOLLIN | EPOLLERR)))
+	if (key && !(key_to_poll(key) & (EPOLLIN | EPOLLERR | EPOLLHUP)))
 		return 0;
 	return autoremove_wake_function(wait, mode, sync, key);
 }
@@ -1476,6 +1476,9 @@ static int recv_wait_event(struct seccomp_filter *filter)
 	DEFINE_WAIT_FUNC(wait, recv_wake_function);
 	int ret;

+	if (refcount_read(&filter->users) == 0)
+		return 0;
+
 	if (atomic_dec_if_positive(&filter->notif->requests) >= 0)
 		return 0;

@@ -1484,6 +1487,8 @@ static int recv_wait_event(struct seccomp_filter *filter)

 		if (atomic_dec_if_positive(&filter->notif->requests) >= 0)
 			break;
+		if (refcount_read(&filter->users) == 0)
+			break;

 		if (ret)
 			return ret;
--
2.45.1.288.g0e0cd299f1-goog

2024-05-23 01:46:20

by Andrei Vagin

Subject: [PATCH 2/3] seccomp: release task filters when the task exits

Previously, seccomp filters were released in release_task(), which
required the process to exit and its zombie to be collected. However,
exited threads/processes can't trigger any seccomp events, making it
more logical to release filters when the task exits.

This adjustment simplifies scenarios where a parent is tracing its child
process. The parent process can now handle all events from a seccomp
listening descriptor and then call wait to collect a child zombie.
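
As an illustration of that flow, a supervisor on a kernel with this series
applied might be sketched as follows. drain_then_reap() and notif_handler
are hypothetical names. The loop assumes, as the selftest in patch 3 does,
that SECCOMP_IOCTL_NOTIF_RECV fails with ENOENT once the filter has no
users; ENOENT can have other causes, so treat this as a sketch rather than
a robust implementation.

```c
#include <errno.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <linux/seccomp.h>

/* Hypothetical callback that answers one notification. */
typedef void (*notif_handler)(int listener, const struct seccomp_notif *req);

/*
 * Receive notifications until the filter is reported unused, then
 * collect the child zombie. Returns 0 on success, -1 on error.
 */
static int drain_then_reap(int listener, pid_t child, notif_handler handle,
			   int *status)
{
	struct seccomp_notif req;

	for (;;) {
		/* The kernel expects a zeroed structure on every call. */
		memset(&req, 0, sizeof(req));
		if (ioctl(listener, SECCOMP_IOCTL_NOTIF_RECV, &req) == 0) {
			handle(listener, &req);
			continue;
		}
		if (errno == EINTR)
			continue;
		/* Assumption: ENOENT here means "no users are left". */
		if (errno == ENOENT)
			break;
		return -1;
	}

	/* All events are handled; only now wait for the child. */
	return waitpid(child, status, 0) == child ? 0 : -1;
}
```

Without patch 2 the filter stays in use until the zombie is collected, so
the order shown here (drain first, wait afterwards) would not terminate.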

seccomp_filter_release takes the siglock to avoid races with
seccomp_sync_threads. There was an idea to bypass taking the lock by
checking PF_EXITING, but it can be set without holding siglock if
threads have SIGNAL_GROUP_EXIT. This means it can happen concurrently
with seccomp_filter_release.

Signed-off-by: Andrei Vagin <[email protected]>
---
 kernel/exit.c    |  3 ++-
 kernel/seccomp.c | 22 ++++++++++++++++------
 2 files changed, 18 insertions(+), 7 deletions(-)

diff --git a/kernel/exit.c b/kernel/exit.c
index 41a12630cbbc..23439c021d8d 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -278,7 +278,6 @@ void release_task(struct task_struct *p)
 	}

 	write_unlock_irq(&tasklist_lock);
-	seccomp_filter_release(p);
 	proc_flush_pid(thread_pid);
 	put_pid(thread_pid);
 	release_thread(p);
@@ -836,6 +835,8 @@ void __noreturn do_exit(long code)
 	io_uring_files_cancel();
 	exit_signals(tsk);  /* sets PF_EXITING */

+	seccomp_filter_release(tsk);
+
 	acct_update_integrals(tsk);
 	group_dead = atomic_dec_and_test(&tsk->signal->live);
 	if (group_dead) {
diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index 35435e8f1035..67305e776dd3 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -502,6 +502,9 @@ static inline pid_t seccomp_can_sync_threads(void)
 		/* Skip current, since it is initiating the sync. */
 		if (thread == caller)
 			continue;
+		/* Skip exited threads. */
+		if (thread->flags & PF_EXITING)
+			continue;

 		if (thread->seccomp.mode == SECCOMP_MODE_DISABLED ||
 		    (thread->seccomp.mode == SECCOMP_MODE_FILTER &&
@@ -563,18 +566,18 @@ static void __seccomp_filter_release(struct seccomp_filter *orig)
  * @tsk: task the filter should be released from.
  *
  * This function should only be called when the task is exiting as
- * it detaches it from its filter tree. As such, READ_ONCE() and
- * barriers are not needed here, as would normally be needed.
+ * it detaches it from its filter tree. PF_EXITING has to be set
+ * for the task.
  */
 void seccomp_filter_release(struct task_struct *tsk)
 {
-	struct seccomp_filter *orig = tsk->seccomp.filter;
-
-	/* We are effectively holding the siglock by not having any sighand. */
-	WARN_ON(tsk->sighand != NULL);
+	struct seccomp_filter *orig;

+	spin_lock_irq(&current->sighand->siglock);
+	orig = tsk->seccomp.filter;
 	/* Detach task from its filter tree. */
 	tsk->seccomp.filter = NULL;
+	spin_unlock_irq(&current->sighand->siglock);
 	__seccomp_filter_release(orig);
 }

@@ -602,6 +605,13 @@ static inline void seccomp_sync_threads(unsigned long flags)
 		if (thread == caller)
 			continue;

+		/*
+		 * Skip exited threads. seccomp_filter_release could have
+		 * been already called for this task.
+		 */
+		if (thread->flags & PF_EXITING)
+			continue;
+
 		/* Get a task reference for the new leaf node. */
 		get_seccomp_filter(caller);

--
2.45.1.288.g0e0cd299f1-goog

2024-05-23 01:46:29

by Andrei Vagin

Subject: [PATCH 3/3] selftests/seccomp: add test for NOTIF_RECV and unused filters

Add a new test case to check that SECCOMP_IOCTL_NOTIF_RECV returns once
all tasks using the filter have exited.

Signed-off-by: Andrei Vagin <[email protected]>
---
tools/testing/selftests/seccomp/seccomp_bpf.c | 54 +++++++++++++++++++
1 file changed, 54 insertions(+)

diff --git a/tools/testing/selftests/seccomp/seccomp_bpf.c b/tools/testing/selftests/seccomp/seccomp_bpf.c
index 783ebce8c4de..390781d7c951 100644
--- a/tools/testing/selftests/seccomp/seccomp_bpf.c
+++ b/tools/testing/selftests/seccomp/seccomp_bpf.c
@@ -3954,6 +3954,60 @@ TEST(user_notification_filter_empty)
 	EXPECT_GT((pollfd.revents & POLLHUP) ?: 0, 0);
 }

+TEST(user_ioctl_notification_filter_empty)
+{
+	pid_t pid;
+	long ret;
+	int status, p[2];
+	struct __clone_args args = {
+		.flags = CLONE_FILES,
+		.exit_signal = SIGCHLD,
+	};
+	struct seccomp_notif req = {};
+
+	ret = prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0);
+	ASSERT_EQ(0, ret) {
+		TH_LOG("Kernel does not support PR_SET_NO_NEW_PRIVS!");
+	}
+
+	if (__NR_clone3 < 0)
+		SKIP(return, "Test not built with clone3 support");
+
+	ASSERT_EQ(0, pipe(p));
+
+	pid = sys_clone3(&args, sizeof(args));
+	ASSERT_GE(pid, 0);
+
+	if (pid == 0) {
+		int listener;
+
+		listener = user_notif_syscall(__NR_mknodat, SECCOMP_FILTER_FLAG_NEW_LISTENER);
+		if (listener < 0)
+			_exit(EXIT_FAILURE);
+
+		if (dup2(listener, 200) != 200)
+			_exit(EXIT_FAILURE);
+		close(p[1]);
+		close(listener);
+		sleep(1);
+
+		_exit(EXIT_SUCCESS);
+	}
+	if (read(p[0], &status, 1) != 0)
+		_exit(EXIT_SUCCESS);
+	close(p[0]);
+	/*
+	 * The seccomp filter has become unused so we should be notified once
+	 * the kernel gets around to cleaning up task struct.
+	 */
+	EXPECT_EQ(ioctl(200, SECCOMP_IOCTL_NOTIF_RECV, &req), -1);
+	EXPECT_EQ(errno, ENOENT);
+
+	EXPECT_EQ(waitpid(pid, &status, 0), pid);
+	EXPECT_EQ(true, WIFEXITED(status));
+	EXPECT_EQ(0, WEXITSTATUS(status));
+}
+
 static void *do_thread(void *data)
 {
 	return NULL;
--
2.45.1.288.g0e0cd299f1-goog

2024-05-23 09:00:41

by Oleg Nesterov

Subject: Re: [PATCH 1/3] seccomp: interrupt SECCOMP_IOCTL_NOTIF_RECV when all users have exited

Hi Andrei,

the patch looks good to me even if I don't really understand what
SECCOMP_IOCTL_NOTIF_RECV does. But let me ask a stupid question,

On 05/23, Andrei Vagin wrote:
>
> The change is based on the 'commit 99cdb8b9a573 ("seccomp: notify about
> unused filter")' which implemented (E)POLLHUP notifications.

To me this patch fixes the commit above, because without this change

> @@ -1466,7 +1466,7 @@ static int recv_wake_function(wait_queue_entry_t *wait, unsigned int mode, int s
> void *key)
> {
> /* Avoid a wakeup if event not interesting for us. */
> - if (key && !(key_to_poll(key) & (EPOLLIN | EPOLLERR)))
> + if (key && !(key_to_poll(key) & (EPOLLIN | EPOLLERR | EPOLLHUP)))

__seccomp_filter_orphan() -> wake_up_poll(&orig->wqh, EPOLLHUP) won't
wakeup the task sleeping in recv_wait_event(), right ?

In any case, FWIW

Reviewed-by: Oleg Nesterov <[email protected]>
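
The wake filter question above can be modelled in userspace. The sketch
below is not kernel code: wake_allowed() is a hypothetical helper that
mirrors only the mask test in recv_wake_function(), with a zero key
standing in for a NULL key.

```c
#include <stdbool.h>
#include <stdint.h>
#include <sys/epoll.h>

/*
 * Model of the check in recv_wake_function(): a keyed wakeup is dropped
 * unless its event bits intersect the mask the sleeper cares about.
 * A zero key models the "no key" case, which always wakes the sleeper.
 */
static bool wake_allowed(uint32_t interesting, uint32_t key)
{
	if (key && !(key & interesting))
		return false;
	return true;
}
```

With the old mask (EPOLLIN | EPOLLERR) an EPOLLHUP-only wakeup is dropped
by this model; with EPOLLHUP added to the mask it passes.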

2024-05-23 09:01:48

by Oleg Nesterov

Subject: Re: [PATCH 2/3] seccomp: release task filters when the task exits

On 05/23, Andrei Vagin wrote:
>
> Previously, seccomp filters were released in release_task(), which
> required the process to exit and its zombie to be collected. However,
> exited threads/processes can't trigger any seccomp events, making it
> more logical to release filters when the task exits.
>
> This adjustment simplifies scenarios where a parent is tracing its child
> process. The parent process can now handle all events from a seccomp
> listening descriptor and then call wait to collect a child zombie.
>
> seccomp_filter_release takes the siglock to avoid races with
> seccomp_sync_threads. There was an idea to bypass taking the lock by
> checking PF_EXITING, but it can be set without holding siglock if
> threads have SIGNAL_GROUP_EXIT. This means it can happen concurrently
> with seccomp_filter_release.
>
> Signed-off-by: Andrei Vagin <[email protected]>
> ---
> kernel/exit.c | 3 ++-
> kernel/seccomp.c | 22 ++++++++++++++++------
> 2 files changed, 18 insertions(+), 7 deletions(-)

Reviewed-by: Oleg Nesterov <[email protected]>

2024-05-23 09:35:18

by Oleg Nesterov

Subject: Re: [PATCH 0/3 v2] seccomp: improve handling of SECCOMP_IOCTL_NOTIF_RECV

On 05/23, Andrei Vagin wrote:
>
> This patch set addresses two problems with the SECCOMP_IOCTL_NOTIF_RECV
> ioctl:
> * it doesn't return when the seccomp filter becomes unused (all tasks
> have exited).
> * EPOLLHUP is triggered not when a task exits, but rather when its zombie
> is collected.

It seems that 2/3 also fixes another minor problem.

Suppose that a group leader installs the new filter without
SECCOMP_FILTER_FLAG_TSYNC, exits, and becomes a zombie. It can't be
released until all its sub-threads exit.

After that, without 2/3, SECCOMP_FILTER_FLAG_TSYNC from any other thread
can never succeed: seccomp_can_sync_threads() will check the zombie leader
and is_ancestor() will fail.

Right?

Oleg.
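
If that reading is right, the effect can be illustrated with a small
userspace model of the thread walk. This is only a sketch under that
assumption: struct model_thread, its fields and model_can_sync() are
hypothetical, and filter_compatible stands in for the mode and
is_ancestor() checks done by seccomp_can_sync_threads().

```c
#include <stdbool.h>
#include <stddef.h>
#include <sys/types.h>

/* Hypothetical stand-in for the per-thread state the kernel inspects. */
struct model_thread {
	pid_t tid;
	bool exiting;		/* models PF_EXITING */
	bool filter_compatible;	/* models the mode/is_ancestor() check */
};

/*
 * Walk all threads except the caller. Returns 0 if TSYNC could proceed,
 * or the tid of the first thread that blocks it. skip_exiting selects
 * the behaviour with (true) or without (false) the PF_EXITING check
 * added in 2/3.
 */
static pid_t model_can_sync(const struct model_thread *threads, size_t n,
			    size_t caller, bool skip_exiting)
{
	for (size_t i = 0; i < n; i++) {
		if (i == caller)
			continue;
		if (skip_exiting && threads[i].exiting)
			continue;
		if (!threads[i].filter_compatible)
			return threads[i].tid;
	}
	return 0;
}
```

In this model a zombie leader holding its own filter blocks the sync
unless exiting threads are skipped.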

2024-05-24 17:48:09

by Andrei Vagin

Subject: Re: [PATCH 1/3] seccomp: interrupt SECCOMP_IOCTL_NOTIF_RECV when all users have exited

On Thu, May 23, 2024 at 2:00 AM Oleg Nesterov <[email protected]> wrote:
>
> Hi Andrei,
>
> the patch looks good to me even if I don't really understand what
> SECCOMP_IOCTL_NOTIF_RECV does. But let me ask a stupid question,
>
> On 05/23, Andrei Vagin wrote:
> >
> > The change is based on the 'commit 99cdb8b9a573 ("seccomp: notify about
> > unused filter")' which implemented (E)POLLHUP notifications.
>
> To me this patch fixes the commit above, because without this change

It depends on how we look at it. I think the intention was to make the
epoll/poll/select syscalls return (E)POLLHUP notifications when filters
have been orphaned. Plus, this code looked a bit different at that time
and recv_wake_function used another notification mechanism.

>
> > @@ -1466,7 +1466,7 @@ static int recv_wake_function(wait_queue_entry_t *wait, unsigned int mode, int s
> > void *key)
> > {
> > /* Avoid a wakeup if event not interesting for us. */
> > - if (key && !(key_to_poll(key) & (EPOLLIN | EPOLLERR)))
> > + if (key && !(key_to_poll(key) & (EPOLLIN | EPOLLERR | EPOLLHUP)))
>
> __seccomp_filter_orphan() -> wake_up_poll(&orig->wqh, EPOLLHUP) won't
> wakeup the task sleeping in recv_wait_event(), right ?
>
> In any case, FWIW
>
> Reviewed-by: Oleg Nesterov <[email protected]>

Thanks,
Andrei