2023-03-24 06:39:15

by Alok Tiagi

[permalink] [raw]
Subject: [RFC v4 1/2] file: allow callers to free the old file descriptor after dup2

Allow callers of do_dup2 to get a reference to the file pointer being dup'ed
over. The callers can then replace the file with the new file in the eventpoll
interface or the file table before freeing it.

Signed-off-by: aloktiagi <[email protected]>
---
Changes in v4:
- Address review comment for a cleaner if else block in do_dup2() to free the
file pointer.

Changes in v2:
- Make the commit message more clearer.
- Address review comment to make the interface cleaner so that the caller cannot
forget to initialize the fdfile.
---
fs/file.c | 14 ++++++++++----
1 file changed, 10 insertions(+), 4 deletions(-)

diff --git a/fs/file.c b/fs/file.c
index 4b2346b8a5ee..cbc258504a64 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -1086,7 +1086,7 @@ bool get_close_on_exec(unsigned int fd)
}

static int do_dup2(struct files_struct *files,
- struct file *file, unsigned fd, unsigned flags)
+ struct file *file, unsigned fd, struct file **fdfile, unsigned flags)
__releases(&files->file_lock)
{
struct file *tofree;
@@ -1119,7 +1119,9 @@ __releases(&files->file_lock)
__clear_close_on_exec(fd, fdt);
spin_unlock(&files->file_lock);

- if (tofree)
+ if (fdfile)
+ *fdfile = tofree;
+ else if (tofree)
filp_close(tofree, files);

return fd;
@@ -1132,6 +1134,7 @@ __releases(&files->file_lock)
int replace_fd(unsigned fd, struct file *file, unsigned flags)
{
int err;
+ struct file *fdfile = NULL;
struct files_struct *files = current->files;

if (!file)
@@ -1144,7 +1147,10 @@ int replace_fd(unsigned fd, struct file *file, unsigned flags)
err = expand_files(files, fd);
if (unlikely(err < 0))
goto out_unlock;
- return do_dup2(files, file, fd, flags);
+ err = do_dup2(files, file, fd, &fdfile, flags);
+ if (fdfile)
+ filp_close(fdfile, files);
+ return err;

out_unlock:
spin_unlock(&files->file_lock);
@@ -1237,7 +1243,7 @@ static int ksys_dup3(unsigned int oldfd, unsigned int newfd, int flags)
goto Ebadf;
goto out_unlock;
}
- return do_dup2(files, file, newfd, flags);
+ return do_dup2(files, file, newfd, NULL, flags);

Ebadf:
err = -EBADF;
--
2.34.1


2023-03-24 06:41:24

by Alok Tiagi

[permalink] [raw]
Subject: [RFC v4 2/2] file, epoll: Implement do_replace() and eventpoll_replace()

Introduce a mechanism to replace a file linked in the epoll interface or a file
that has been dup'ed by a file received via the replace_fd() interface.

eventpoll_replace() is called from do_replace() and finds all instances of the
file to be replaced and replaces them with the new file.

do_replace() also replaces the file in the file descriptor table for all fd
numbers referencing it with the new file.

We have a use case where multiple IPv6 only network namespaces can use a single
IPv4 network namespace for IPv4 only egress connectivity by switching their
sockets from IPv6 to IPv4 network namespace. This allows for migration of
systems to IPv6 only while keeping their connectivity to IPv4 only destinations
intact.

Today, we achieve this by setting up seccomp filter to intercept network system
calls like connect() from a container in a container manager which runs in an
IPv4 only network namespace. The container manager creates a new IPv4 connection
and injects the new file descriptor through SECCOMP_NOTIFY_IOCTL_ADDFD replacing
the original file descriptor from the connect() call. This does not work for
cases where the original file descriptor is handed off to a system like epoll
before the connect() call. After a new file descriptor is injected the original
file descriptor being referenced by the epoll fd is not longer valid leading to
failures. As a workaround the container manager when intercepting connect()
loops through all open socket file descriptors to check if they are referencing
the socket attempting the connect() and replace the reference with the to be
injected file descriptor. This workaround is cumbersome and makes the solution
prone to similar yet to be discovered issues.

The above change will enable us remove the workaround in the container manager
and let the kernel handle the replacement correctly.

Signed-off-by: aloktiagi <[email protected]>
---
Changes in v4:
- address review comment to remove the redundant eventpoll_replace() function.
- removed an extra empty line introduced in include/linux/file.h

Changes in v3:
- address review comment and iterate over the file table while holding the
spin_lock(&files->file_lock).
- address review comment and call filp_close() outside the
spin_lock(&files->file_lock).
---
fs/eventpoll.c | 38 ++++++++
fs/file.c | 56 +++++++++++
include/linux/eventpoll.h | 8 ++
tools/testing/selftests/seccomp/seccomp_bpf.c | 97 +++++++++++++++++++
4 files changed, 199 insertions(+)

diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index 64659b110973..958ad995fd45 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -935,6 +935,44 @@ void eventpoll_release_file(struct file *file)
mutex_unlock(&epmutex);
}

+static int ep_insert(struct eventpoll *ep, const struct epoll_event *event,
+ struct file *tfile, int fd, int full_check);
+
+/*
+ * This is called from eventpoll_replace() to replace a linked file in the epoll
+ * interface with a new file received from another process. This is useful in
+ * cases where a process is trying to install a new file for an existing one
+ * that is linked in the epoll interface
+ */
+void eventpoll_replace_file(struct file *toreplace, struct file *file)
+{
+ int fd;
+ struct eventpoll *ep;
+ struct epitem *epi;
+ struct hlist_node *next;
+ struct epoll_event event;
+
+ if (!file_can_poll(file))
+ return;
+
+ mutex_lock(&epmutex);
+ if (unlikely(!toreplace->f_ep)) {
+ mutex_unlock(&epmutex);
+ return;
+ }
+
+ hlist_for_each_entry_safe(epi, next, toreplace->f_ep, fllink) {
+ ep = epi->ep;
+ mutex_lock(&ep->mtx);
+ fd = epi->ffd.fd;
+ event = epi->event;
+ ep_remove(ep, epi);
+ ep_insert(ep, &event, file, fd, 1);
+ mutex_unlock(&ep->mtx);
+ }
+ mutex_unlock(&epmutex);
+}
+
static int ep_alloc(struct eventpoll **pep)
{
int error;
diff --git a/fs/file.c b/fs/file.c
index cbc258504a64..9ccfe39eb622 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -20,10 +20,18 @@
#include <linux/spinlock.h>
#include <linux/rcupdate.h>
#include <linux/close_range.h>
+#include <linux/eventpoll.h>
#include <net/sock.h>

#include "internal.h"

+struct replace_filefd {
+ struct files_struct *files;
+ unsigned fd;
+ struct file *fdfile;
+ struct file *file;
+};
+
unsigned int sysctl_nr_open __read_mostly = 1024*1024;
unsigned int sysctl_nr_open_min = BITS_PER_LONG;
/* our min() is unusable in constant expressions ;-/ */
@@ -1131,6 +1139,52 @@ __releases(&files->file_lock)
return -EBUSY;
}

+static int do_replace_fd_array(const void *v, struct file *tofree, unsigned int n)
+{
+ struct replace_filefd *ffd = (void *)v;
+ struct fdtable *fdt;
+
+ fdt = files_fdtable(ffd->files);
+
+ if ((n != ffd->fd) && (tofree == ffd->fdfile)) {
+ get_file(ffd->file);
+ rcu_assign_pointer(fdt->fd[n], ffd->file);
+ spin_unlock(&ffd->files->file_lock);
+ filp_close(ffd->fdfile, ffd->files);
+ cond_resched();
+ spin_lock(&ffd->files->file_lock);
+ }
+ return 0;
+}
+
+static void do_replace(struct files_struct *files,
+ struct file *file, unsigned fd, struct file *fdfile)
+{
+ unsigned n = 0;
+ struct replace_filefd ffd = {
+ .files = files,
+ .fd = fd,
+ .fdfile = fdfile,
+ .file = file
+ };
+
+ /*
+ * Check if the file referenced by the fd number is linked to the epoll
+ * interface. If yes, replace the reference with the received file in
+ * the epoll interface.
+ */
+ if (fdfile && fdfile->f_ep) {
+ eventpoll_replace_file(fdfile, file);
+ }
+ /*
+ * Install the received file in the file descriptor table for all fd
+ * numbers referencing the same file as the one we are trying to
+ * replace. Do not install it for the fd number received since that is
+ * handled in do_dup2()
+ */
+ iterate_fd(files, n, do_replace_fd_array, &ffd);
+}
+
int replace_fd(unsigned fd, struct file *file, unsigned flags)
{
int err;
@@ -1148,8 +1202,10 @@ int replace_fd(unsigned fd, struct file *file, unsigned flags)
if (unlikely(err < 0))
goto out_unlock;
err = do_dup2(files, file, fd, &fdfile, flags);
+ do_replace(files, file, fd, fdfile);
if (fdfile)
filp_close(fdfile, files);
+
return err;

out_unlock:
diff --git a/include/linux/eventpoll.h b/include/linux/eventpoll.h
index 3337745d81bd..711146e79740 100644
--- a/include/linux/eventpoll.h
+++ b/include/linux/eventpoll.h
@@ -25,6 +25,14 @@ struct file *get_epoll_tfile_raw_ptr(struct file *file, int tfd, unsigned long t
/* Used to release the epoll bits inside the "struct file" */
void eventpoll_release_file(struct file *file);

+/*
+ * This is called from fs/file.c:do_replace() to replace a linked file in the
+ * epoll interface with a new file received from another process. This is useful
+ * in cases where a process is trying to install a new file for an existing one
+ * that is linked in the epoll interface
+ */
+void eventpoll_replace_file(struct file *toreplace, struct file *file);
+
/*
* This is called from inside fs/file_table.c:__fput() to unlink files
* from the eventpoll interface. We need to have this facility to cleanup
diff --git a/tools/testing/selftests/seccomp/seccomp_bpf.c b/tools/testing/selftests/seccomp/seccomp_bpf.c
index 61386e499b77..caf68682519c 100644
--- a/tools/testing/selftests/seccomp/seccomp_bpf.c
+++ b/tools/testing/selftests/seccomp/seccomp_bpf.c
@@ -47,6 +47,7 @@
#include <linux/kcmp.h>
#include <sys/resource.h>
#include <sys/capability.h>
+#include <sys/epoll.h>

#include <unistd.h>
#include <sys/syscall.h>
@@ -4179,6 +4180,102 @@ TEST(user_notification_addfd)
close(memfd);
}

+TEST(user_notification_addfd_with_epoll_replace)
+{
+ char c;
+ pid_t pid;
+ long ret;
+ int status, listener, fd;
+ int efd, sfd[4];
+ struct epoll_event e;
+ struct seccomp_notif_addfd addfd = {};
+ struct seccomp_notif req = {};
+ struct seccomp_notif_resp resp = {};
+
+ ret = prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0);
+ ASSERT_EQ(0, ret) {
+ TH_LOG("Kernel does not support PR_SET_NO_NEW_PRIVS!");
+ }
+
+ listener = user_notif_syscall(__NR_getppid,
+ SECCOMP_FILTER_FLAG_NEW_LISTENER);
+
+ /* Create two socket pairs sfd[0] <-> sfd[1] and sfd[2] <-> sfd[3] */
+ ASSERT_EQ(socketpair(AF_UNIX, SOCK_STREAM, 0, &sfd[0]), 0);
+ ASSERT_EQ(socketpair(AF_UNIX, SOCK_STREAM, 0, &sfd[2]), 0);
+
+ pid = fork();
+ ASSERT_GE(pid, 0);
+
+ if (pid == 0) {
+ efd = epoll_create(1);
+ if (efd == -1)
+ exit(1);
+
+ e.events = EPOLLIN;
+ if (epoll_ctl(efd, EPOLL_CTL_ADD, sfd[0], &e) != 0)
+ exit(1);
+
+ /*
+ * fd will be added here to replace an existing one linked
+ * in the epoll interface.
+ */
+ if (syscall(__NR_getppid) != USER_NOTIF_MAGIC)
+ exit(1);
+
+ /*
+ * Write data to the sfd[3] connected to sfd[2], but due to
+ * the swap, we should see data on sfd[0]
+ */
+ if (write(sfd[3], "w", 1) != 1)
+ exit(1);
+
+ if (epoll_wait(efd, &e, 1, 0) != 1)
+ exit(1);
+
+ if (read(sfd[0], &c, 1) != 1)
+ exit(1);
+
+ if ('w' != c)
+ exit(1);
+
+ if (epoll_ctl(efd, EPOLL_CTL_DEL, sfd[0], &e) != 0)
+ exit(1);
+
+ close(efd);
+ close(sfd[0]);
+ close(sfd[1]);
+ close(sfd[2]);
+ close(sfd[3]);
+ exit(0);
+ }
+
+ ASSERT_EQ(ioctl(listener, SECCOMP_IOCTL_NOTIF_RECV, &req), 0);
+
+ addfd.srcfd = sfd[2];
+ addfd.newfd = sfd[0];
+ addfd.id = req.id;
+ addfd.flags = SECCOMP_ADDFD_FLAG_SETFD;
+ addfd.newfd_flags = O_CLOEXEC;
+
+ /*
+ * Verfiy we can install and replace a file that is linked in the
+ * epoll interface. Replace the socket sfd[0] with sfd[2]
+ */
+ fd = ioctl(listener, SECCOMP_IOCTL_NOTIF_ADDFD, &addfd);
+ EXPECT_EQ(fd, sfd[0]);
+
+ resp.id = req.id;
+ resp.error = 0;
+ resp.val = USER_NOTIF_MAGIC;
+ EXPECT_EQ(ioctl(listener, SECCOMP_IOCTL_NOTIF_SEND, &resp), 0);
+
+ /* Wait for child to finish. */
+ EXPECT_EQ(waitpid(pid, &status, 0), pid);
+ EXPECT_EQ(true, WIFEXITED(status));
+ EXPECT_EQ(0, WEXITSTATUS(status));
+}
+
TEST(user_notification_addfd_rlimit)
{
pid_t pid;
--
2.34.1

2023-03-24 08:26:22

by Christian Brauner

[permalink] [raw]
Subject: Re: [RFC v4 2/2] file, epoll: Implement do_replace() and eventpoll_replace()

On Thu, Mar 23, 2023 at 10:23:56PM +0000, Alok Tiagi wrote:
> On Mon, Mar 20, 2023 at 03:51:03PM +0100, Christian Brauner wrote:
> > On Sat, Mar 18, 2023 at 06:02:48AM +0000, aloktiagi wrote:
> > > Introduce a mechanism to replace a file linked in the epoll interface or a file
> > > that has been dup'ed by a file received via the replace_fd() interface.
> > >
> > > eventpoll_replace() is called from do_replace() and finds all instances of the
> > > file to be replaced and replaces them with the new file.
> > >
> > > do_replace() also replaces the file in the file descriptor table for all fd
> > > numbers referencing it with the new file.
> > >
> > > We have a use case where multiple IPv6 only network namespaces can use a single
> > > IPv4 network namespace for IPv4 only egress connectivity by switching their
> > > sockets from IPv6 to IPv4 network namespace. This allows for migration of
> > > systems to IPv6 only while keeping their connectivity to IPv4 only destinations
> > > intact.
> > >
> > > Today, we achieve this by setting up seccomp filter to intercept network system
> > > calls like connect() from a container in a container manager which runs in an
> > > IPv4 only network namespace. The container manager creates a new IPv4 connection
> > > and injects the new file descriptor through SECCOMP_NOTIFY_IOCTL_ADDFD replacing
> > > the original file descriptor from the connect() call. This does not work for
> > > cases where the original file descriptor is handed off to a system like epoll
> > > before the connect() call. After a new file descriptor is injected the original
> > > file descriptor being referenced by the epoll fd is not longer valid leading to
> > > failures. As a workaround the container manager when intercepting connect()
> > > loops through all open socket file descriptors to check if they are referencing
> > > the socket attempting the connect() and replace the reference with the to be
> > > injected file descriptor. This workaround is cumbersome and makes the solution
> > > prone to similar yet to be discovered issues.
> > >
> > > The above change will enable us remove the workaround in the container manager
> > > and let the kernel handle the replacement correctly.
> > >
> > > Signed-off-by: aloktiagi <[email protected]>
> > > ---
> > > fs/eventpoll.c | 38 ++++++++
> > > fs/file.c | 54 +++++++++++
> > > include/linux/eventpoll.h | 18 ++++
> > > include/linux/file.h | 1 +
> > > tools/testing/selftests/seccomp/seccomp_bpf.c | 97 +++++++++++++++++++
> > > 5 files changed, 208 insertions(+)
> > >
> > > diff --git a/fs/eventpoll.c b/fs/eventpoll.c
> > > index 64659b110973..958ad995fd45 100644
> > > --- a/fs/eventpoll.c
> > > +++ b/fs/eventpoll.c
> > > @@ -935,6 +935,44 @@ void eventpoll_release_file(struct file *file)
> > > mutex_unlock(&epmutex);
> > > }
> > >
> > > +static int ep_insert(struct eventpoll *ep, const struct epoll_event *event,
> > > + struct file *tfile, int fd, int full_check);
> > > +
> > > +/*
> > > + * This is called from eventpoll_replace() to replace a linked file in the epoll
> > > + * interface with a new file received from another process. This is useful in
> > > + * cases where a process is trying to install a new file for an existing one
> > > + * that is linked in the epoll interface
> > > + */
> > > +void eventpoll_replace_file(struct file *toreplace, struct file *file)
> > > +{
> > > + int fd;
> > > + struct eventpoll *ep;
> > > + struct epitem *epi;
> > > + struct hlist_node *next;
> > > + struct epoll_event event;
> > > +
> > > + if (!file_can_poll(file))
> > > + return;
> > > +
> > > + mutex_lock(&epmutex);
> > > + if (unlikely(!toreplace->f_ep)) {
> > > + mutex_unlock(&epmutex);
> > > + return;
> > > + }
> > > +
> > > + hlist_for_each_entry_safe(epi, next, toreplace->f_ep, fllink) {
> > > + ep = epi->ep;
> > > + mutex_lock(&ep->mtx);
> > > + fd = epi->ffd.fd;
> > > + event = epi->event;
> > > + ep_remove(ep, epi);
> > > + ep_insert(ep, &event, file, fd, 1);
> >
> > So, ep_remove() can't fail but ep_insert() can. Maybe that doesn't
> > matter...
> >
> > > + mutex_unlock(&ep->mtx);
> > > + }
> > > + mutex_unlock(&epmutex);
> > > +}
> >
> > Andrew carries a patchset that may impact the locking here:
> >
> > https://lore.kernel.org/linux-fsdevel/323de732635cc3513c1837c6cbb98f012174f994.1678312201.git.pabeni@redhat.com
> >
> > I have to say that this whole thing has a very unpleasant taste to it.
> > Replacing a single fd from seccomp is fine, wading through the fdtable
> > to replace all fds referencing the same file is pretty nasty. Especially
> > with the global epoll mutex involved in all of this.
> >
> > And what limits this to epoll. I'm seriously asking aren't there
> > potentially issues for fds somehow referenced in io_uring instances as
> > well?
> >
> > I'm not convinced this belongs here yet...
>
> Thank you for reviewing and proving a link to Andrew's patch.
>
> I think just replacing a single fd from seccomp leaves this feature in an
> incomplete state. As a user of this feature, it means I need to figure out what
> all file descriptors are referencing this file eg. epoll, dup'ed fds etc. This
> patch is an attempt to complete this seccomp feature and also move the logic of
> figuring out the references to the kernel.

I'm still not convinced.

You're changing the semantics of the replace file feature in seccomp
drastically. Whereas now it means replace the file a single fd refers to
you're now letting it replace multiple fds.

If a the caller has 10 fds open and some of them in epoll instance and
all reference the same file: What if the caller wants 5 of them to refer
to the old file and only 5 of them to be updated? The kernel can't now this.

Callers that now rely on a single fd being replaced and want the other
fds to be left alone will suddenly see all of the references change.

>
> The epmutex is taken when the file is replaced in the epoll interface. This is
> similar to what would happen when eventpoll_release would be called for this
> same file when it is ultimately released from __fput().
>
> This is indeed not limited to epoll and the file descriptor table, but this
> current patch addresses is limited to these interfaces. We can create a separate
> one for io_uring.

The criteria for when it's sensible to update an fd to refer to the new
file and when not are murky here and tailored to this very specific
use-case. We shouldn't be involved in decisions like that. Userspace is
in a much better position to know when it's sensible to replace an fd.

The fdtable is no place to get involved in a game of "if the fd is in a
epoll instance, update the epoll instance, if it's in an io_uring
instance, update the io_uring instance, ...". That's a complete
layerying violation imho.

And even if you'd need to get sign off on this from epoll and io_uring
folks as well before we can just reach into other subsytems from the
fdtable.

There's other worries: what if someone registers the file in another
notification mechanism that doesn't keep a reference to the fd and then
we can't update it from replace_fd() so you'll need yet another
mechanism to update it.

I'm sorry but this all sounds messy. You can do this just fine in
userspace, so please do it in userspace. This all sound very NAKable
right now.

2023-03-24 13:51:03

by Tycho Andersen

[permalink] [raw]
Subject: Re: [RFC v4 2/2] file, epoll: Implement do_replace() and eventpoll_replace()

On Fri, Mar 24, 2023 at 09:23:44AM +0100, Christian Brauner wrote:
> On Thu, Mar 23, 2023 at 10:23:56PM +0000, Alok Tiagi wrote:
> > On Mon, Mar 20, 2023 at 03:51:03PM +0100, Christian Brauner wrote:
> > > On Sat, Mar 18, 2023 at 06:02:48AM +0000, aloktiagi wrote:
> > > > Introduce a mechanism to replace a file linked in the epoll interface or a file
> > > > that has been dup'ed by a file received via the replace_fd() interface.
> > > >
> > > > eventpoll_replace() is called from do_replace() and finds all instances of the
> > > > file to be replaced and replaces them with the new file.
> > > >
> > > > do_replace() also replaces the file in the file descriptor table for all fd
> > > > numbers referencing it with the new file.
> > > >
> > > > We have a use case where multiple IPv6 only network namespaces can use a single
> > > > IPv4 network namespace for IPv4 only egress connectivity by switching their
> > > > sockets from IPv6 to IPv4 network namespace. This allows for migration of
> > > > systems to IPv6 only while keeping their connectivity to IPv4 only destinations
> > > > intact.
> > > >
> > > > Today, we achieve this by setting up seccomp filter to intercept network system
> > > > calls like connect() from a container in a container manager which runs in an
> > > > IPv4 only network namespace. The container manager creates a new IPv4 connection
> > > > and injects the new file descriptor through SECCOMP_NOTIFY_IOCTL_ADDFD replacing
> > > > the original file descriptor from the connect() call. This does not work for
> > > > cases where the original file descriptor is handed off to a system like epoll
> > > > before the connect() call. After a new file descriptor is injected the original
> > > > file descriptor being referenced by the epoll fd is not longer valid leading to
> > > > failures. As a workaround the container manager when intercepting connect()
> > > > loops through all open socket file descriptors to check if they are referencing
> > > > the socket attempting the connect() and replace the reference with the to be
> > > > injected file descriptor. This workaround is cumbersome and makes the solution
> > > > prone to similar yet to be discovered issues.
> > > >
> > > > The above change will enable us remove the workaround in the container manager
> > > > and let the kernel handle the replacement correctly.
> > > >
> > > > Signed-off-by: aloktiagi <[email protected]>
> > > > ---
> > > > fs/eventpoll.c | 38 ++++++++
> > > > fs/file.c | 54 +++++++++++
> > > > include/linux/eventpoll.h | 18 ++++
> > > > include/linux/file.h | 1 +
> > > > tools/testing/selftests/seccomp/seccomp_bpf.c | 97 +++++++++++++++++++
> > > > 5 files changed, 208 insertions(+)
> > > >
> > > > diff --git a/fs/eventpoll.c b/fs/eventpoll.c
> > > > index 64659b110973..958ad995fd45 100644
> > > > --- a/fs/eventpoll.c
> > > > +++ b/fs/eventpoll.c
> > > > @@ -935,6 +935,44 @@ void eventpoll_release_file(struct file *file)
> > > > mutex_unlock(&epmutex);
> > > > }
> > > >
> > > > +static int ep_insert(struct eventpoll *ep, const struct epoll_event *event,
> > > > + struct file *tfile, int fd, int full_check);
> > > > +
> > > > +/*
> > > > + * This is called from eventpoll_replace() to replace a linked file in the epoll
> > > > + * interface with a new file received from another process. This is useful in
> > > > + * cases where a process is trying to install a new file for an existing one
> > > > + * that is linked in the epoll interface
> > > > + */
> > > > +void eventpoll_replace_file(struct file *toreplace, struct file *file)
> > > > +{
> > > > + int fd;
> > > > + struct eventpoll *ep;
> > > > + struct epitem *epi;
> > > > + struct hlist_node *next;
> > > > + struct epoll_event event;
> > > > +
> > > > + if (!file_can_poll(file))
> > > > + return;
> > > > +
> > > > + mutex_lock(&epmutex);
> > > > + if (unlikely(!toreplace->f_ep)) {
> > > > + mutex_unlock(&epmutex);
> > > > + return;
> > > > + }
> > > > +
> > > > + hlist_for_each_entry_safe(epi, next, toreplace->f_ep, fllink) {
> > > > + ep = epi->ep;
> > > > + mutex_lock(&ep->mtx);
> > > > + fd = epi->ffd.fd;
> > > > + event = epi->event;
> > > > + ep_remove(ep, epi);
> > > > + ep_insert(ep, &event, file, fd, 1);
> > >
> > > So, ep_remove() can't fail but ep_insert() can. Maybe that doesn't
> > > matter...
> > >
> > > > + mutex_unlock(&ep->mtx);
> > > > + }
> > > > + mutex_unlock(&epmutex);
> > > > +}
> > >
> > > Andrew carries a patchset that may impact the locking here:
> > >
> > > https://lore.kernel.org/linux-fsdevel/323de732635cc3513c1837c6cbb98f012174f994.1678312201.git.pabeni@redhat.com
> > >
> > > I have to say that this whole thing has a very unpleasant taste to it.
> > > Replacing a single fd from seccomp is fine, wading through the fdtable
> > > to replace all fds referencing the same file is pretty nasty. Especially
> > > with the global epoll mutex involved in all of this.
> > >
> > > And what limits this to epoll. I'm seriously asking aren't there
> > > potentially issues for fds somehow referenced in io_uring instances as
> > > well?
> > >
> > > I'm not convinced this belongs here yet...
> >
> > Thank you for reviewing and proving a link to Andrew's patch.
> >
> > I think just replacing a single fd from seccomp leaves this feature in an
> > incomplete state. As a user of this feature, it means I need to figure out what
> > all file descriptors are referencing this file eg. epoll, dup'ed fds etc. This
> > patch is an attempt to complete this seccomp feature and also move the logic of
> > figuring out the references to the kernel.
>
> I'm still not convinced.
>
> You're changing the semantics of the replace file feature in seccomp
> drastically. Whereas now it means replace the file a single fd refers to
> you're now letting it replace multiple fds.

Yes; the crux of the patch is really the epoll part, not the multiple
replace part. IMO, we should drop the multiple replace bit, as I agree
it is a pretty big change.

The change in semantics w.r.t. epoll() (and eventually others),
though, is important. The way it currently works is not really
helpful.

Perhaps we could add a flag that people could set from SECCOMP_ADDFD
asking for this extra behavior?

> >
> > The epmutex is taken when the file is replaced in the epoll interface. This is
> > similar to what would happen when eventpoll_release would be called for this
> > same file when it is ultimately released from __fput().
> >
> > This is indeed not limited to epoll and the file descriptor table, but this
> > current patch addresses is limited to these interfaces. We can create a separate
> > one for io_uring.
>
> The criteria for when it's sensible to update an fd to refer to the new
> file and when not are murky here and tailored to this very specific
> use-case. We shouldn't be involved in decisions like that. Userspace is
> in a much better position to know when it's sensible to replace an fd.
>
> The fdtable is no place to get involved in a game of "if the fd is in a
> epoll instance, update the epoll instance, if it's in an io_uring
> instance, update the io_uring instance, ...". That's a complete
> layerying violation imho.
>
> And even if you'd need to get sign off on this from epoll and io_uring
> folks as well before we can just reach into other subsytems from the
> fdtable.

Yep, agreed.

> I'm sorry but this all sounds messy. You can do this just fine in
> userspace, so please do it in userspace. This all sound very NAKable
> right now.

We have added lots of APIs for things that are possible from userspace
already that are made easier with a nice API. The seccomp forwarding
functionality itself, pidfd_getfd(), etc. I don't see this particular
bit as a strong argument against.

Tycho

2023-03-27 09:09:09

by Christian Brauner

[permalink] [raw]
Subject: Re: [RFC v4 2/2] file, epoll: Implement do_replace() and eventpoll_replace()

On Fri, Mar 24, 2023 at 07:43:13AM -0600, Tycho Andersen wrote:
> On Fri, Mar 24, 2023 at 09:23:44AM +0100, Christian Brauner wrote:
> > On Thu, Mar 23, 2023 at 10:23:56PM +0000, Alok Tiagi wrote:
> > > On Mon, Mar 20, 2023 at 03:51:03PM +0100, Christian Brauner wrote:
> > > > On Sat, Mar 18, 2023 at 06:02:48AM +0000, aloktiagi wrote:
> > > > > Introduce a mechanism to replace a file linked in the epoll interface or a file
> > > > > that has been dup'ed by a file received via the replace_fd() interface.
> > > > >
> > > > > eventpoll_replace() is called from do_replace() and finds all instances of the
> > > > > file to be replaced and replaces them with the new file.
> > > > >
> > > > > do_replace() also replaces the file in the file descriptor table for all fd
> > > > > numbers referencing it with the new file.
> > > > >
> > > > > We have a use case where multiple IPv6 only network namespaces can use a single
> > > > > IPv4 network namespace for IPv4 only egress connectivity by switching their
> > > > > sockets from IPv6 to IPv4 network namespace. This allows for migration of
> > > > > systems to IPv6 only while keeping their connectivity to IPv4 only destinations
> > > > > intact.
> > > > >
> > > > > Today, we achieve this by setting up seccomp filter to intercept network system
> > > > > calls like connect() from a container in a container manager which runs in an
> > > > > IPv4 only network namespace. The container manager creates a new IPv4 connection
> > > > > and injects the new file descriptor through SECCOMP_NOTIFY_IOCTL_ADDFD replacing
> > > > > the original file descriptor from the connect() call. This does not work for
> > > > > cases where the original file descriptor is handed off to a system like epoll
> > > > > before the connect() call. After a new file descriptor is injected the original
> > > > > file descriptor being referenced by the epoll fd is not longer valid leading to
> > > > > failures. As a workaround the container manager when intercepting connect()
> > > > > loops through all open socket file descriptors to check if they are referencing
> > > > > the socket attempting the connect() and replace the reference with the to be
> > > > > injected file descriptor. This workaround is cumbersome and makes the solution
> > > > > prone to similar yet to be discovered issues.
> > > > >
> > > > > The above change will enable us remove the workaround in the container manager
> > > > > and let the kernel handle the replacement correctly.
> > > > >
> > > > > Signed-off-by: aloktiagi <[email protected]>
> > > > > ---
> > > > > fs/eventpoll.c | 38 ++++++++
> > > > > fs/file.c | 54 +++++++++++
> > > > > include/linux/eventpoll.h | 18 ++++
> > > > > include/linux/file.h | 1 +
> > > > > tools/testing/selftests/seccomp/seccomp_bpf.c | 97 +++++++++++++++++++
> > > > > 5 files changed, 208 insertions(+)
> > > > >
> > > > > diff --git a/fs/eventpoll.c b/fs/eventpoll.c
> > > > > index 64659b110973..958ad995fd45 100644
> > > > > --- a/fs/eventpoll.c
> > > > > +++ b/fs/eventpoll.c
> > > > > @@ -935,6 +935,44 @@ void eventpoll_release_file(struct file *file)
> > > > > mutex_unlock(&epmutex);
> > > > > }
> > > > >
> > > > > +static int ep_insert(struct eventpoll *ep, const struct epoll_event *event,
> > > > > + struct file *tfile, int fd, int full_check);
> > > > > +
> > > > > +/*
> > > > > + * This is called from eventpoll_replace() to replace a linked file in the epoll
> > > > > + * interface with a new file received from another process. This is useful in
> > > > > + * cases where a process is trying to install a new file for an existing one
> > > > > + * that is linked in the epoll interface
> > > > > + */
> > > > > +void eventpoll_replace_file(struct file *toreplace, struct file *file)
> > > > > +{
> > > > > + int fd;
> > > > > + struct eventpoll *ep;
> > > > > + struct epitem *epi;
> > > > > + struct hlist_node *next;
> > > > > + struct epoll_event event;
> > > > > +
> > > > > + if (!file_can_poll(file))
> > > > > + return;
> > > > > +
> > > > > + mutex_lock(&epmutex);
> > > > > + if (unlikely(!toreplace->f_ep)) {
> > > > > + mutex_unlock(&epmutex);
> > > > > + return;
> > > > > + }
> > > > > +
> > > > > + hlist_for_each_entry_safe(epi, next, toreplace->f_ep, fllink) {
> > > > > + ep = epi->ep;
> > > > > + mutex_lock(&ep->mtx);
> > > > > + fd = epi->ffd.fd;
> > > > > + event = epi->event;
> > > > > + ep_remove(ep, epi);
> > > > > + ep_insert(ep, &event, file, fd, 1);
> > > >
> > > > So, ep_remove() can't fail but ep_insert() can. Maybe that doesn't
> > > > matter...
> > > >
> > > > > + mutex_unlock(&ep->mtx);
> > > > > + }
> > > > > + mutex_unlock(&epmutex);
> > > > > +}
> > > >
> > > > Andrew carries a patchset that may impact the locking here:
> > > >
> > > > https://lore.kernel.org/linux-fsdevel/323de732635cc3513c1837c6cbb98f012174f994.1678312201.git.pabeni@redhat.com
> > > >
> > > > I have to say that this whole thing has a very unpleasant taste to it.
> > > > Replacing a single fd from seccomp is fine, wading through the fdtable
> > > > to replace all fds referencing the same file is pretty nasty. Especially
> > > > with the global epoll mutex involved in all of this.
> > > >
> > > > And what limits this to epoll. I'm seriously asking aren't there
> > > > potentially issues for fds somehow referenced in io_uring instances as
> > > > well?
> > > >
> > > > I'm not convinced this belongs here yet...
> > >
> > > Thank you for reviewing and proving a link to Andrew's patch.
> > >
> > > I think just replacing a single fd from seccomp leaves this feature in an
> > > incomplete state. As a user of this feature, it means I need to figure out what
> > > all file descriptors are referencing this file eg. epoll, dup'ed fds etc. This
> > > patch is an attempt to complete this seccomp feature and also move the logic of
> > > figuring out the references to the kernel.
> >
> > I'm still not convinced.
> >
> > You're changing the semantics of the replace file feature in seccomp
> > drastically. Whereas now it means replace the file a single fd refers to
> > you're now letting it replace multiple fds.
>
> Yes; the crux of the patch is really the epoll part, not the multiple
> replace part. IMO, we should drop the multiple replace bit, as I agree
> it is a pretty big change.
>
> The change in semantics w.r.t. epoll() (and eventually others),
> though, is important. The way it currently works is not really
> helpful.
>
> Perhaps we could add a flag that people could set from SECCOMP_ADDFD
> asking for this extra behavior?

So I've been thinking about this. I still maintain that fdtable helpers
such as replace_fd() are the wrong place to solve any of this. Again,
they shouldn't interfer with other subsystems like epoll or io_uring in
any semantically fundamental way. The exception I can see is that when
the type of fd/file requires specific actions to be performed during
replace such as sockets.

The reason why I keep stressing this is because you'll almost certainly
will come around the corner with more patches in the future to update
other places where the replaced fd implicitly references the old file
even though the fdtable has already been updated to point to the new
file. And if you don't someone else might.

And that to me implies that we want per-subsystem helpers to update the
old file references. What references to update should be selectable by
the supervisor for backward compatibility.

I think that's cleaner API wise and doesn't violate layering quite as badly.

Another thing that struck me is that you don't require none of the
fdtable locks to do other updates. So here's my
UNTESTED+NON-COMPILING+DRAFT+STRAWMAN+SKETCH proposal:

diff --git a/include/uapi/linux/seccomp.h b/include/uapi/linux/seccomp.h
index 0fdc6ef02b94..82beca4c24da 100644
--- a/include/uapi/linux/seccomp.h
+++ b/include/uapi/linux/seccomp.h
@@ -118,6 +118,13 @@ struct seccomp_notif_resp {
/* valid flags for seccomp_notif_addfd */
#define SECCOMP_ADDFD_FLAG_SETFD (1UL << 0) /* Specify remote fd */
#define SECCOMP_ADDFD_FLAG_SEND (1UL << 1) /* Addfd and return it, atomically */
+#define SECCOMP_ADDFD_FLAG_EPOLL (1UL << 2) /* Update epoll references */
+#define SECCOMP_ADDFD_FLAG_IOURING (1UL << 3) /* Update io_uring references */
+#define SECCOMP_ADDFD_FLAG_FANOTIFY (1UL << 4) /* Update fanotify references (probably nonsensical, but just here for illustration purposes) */
+
+#define SECOMP_ADDFD_FLAG_EVENTS \
+ (SECCOMP_ADDFD_FLAG_EPOLL | SECCOMP_ADDFD_FLAG_IOURING | \
+ SECCOMP_ADDFD_FLAG_FANOTIFY)

/**
* struct seccomp_notif_addfd
diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index cebf26445f9e..02f8b67a9ba2 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -1066,6 +1066,24 @@ static void seccomp_handle_addfd(struct seccomp_kaddfd *addfd, struct seccomp_kn
fd = receive_fd(addfd->file, addfd->flags);
else
fd = receive_fd_replace(addfd->fd, addfd->file, addfd->flags);
+
+ if (fd > 0 && addfd->ioctl_flags & SECCOMP_ADDFD_FLAG_EPOLL) {
+ /*
+ * - retrieve old struct file that addfd->fd refered to if any.
+ * - call your epoll seccomp api to update the references in the epoll instance
+ */
epoll_seccomp_notify()
+ }
+
+ if (fd > 0 && addfd->ioctl_flags & SECCOMP_ADDFD_FLAG_IO_URING) {
+ /*
+ * - call your io_uring seccomp api to update the references in the io_uring instance
+ */
io_uring_seccomp_notify()
+ }
+
+ .
+ . I WILL DEFO NOT COMPILE
+ .
+
addfd->ret = fd;

if (addfd->ioctl_flags & SECCOMP_ADDFD_FLAG_SEND) {
@@ -1613,12 +1631,16 @@ static long seccomp_notify_addfd(struct seccomp_filter *filter,
if (addfd.newfd_flags & ~O_CLOEXEC)
return -EINVAL;

- if (addfd.flags & ~(SECCOMP_ADDFD_FLAG_SETFD | SECCOMP_ADDFD_FLAG_SEND))
+ if (addfd.flags & ~(SECCOMP_ADDFD_FLAG_SETFD | SECCOMP_ADDFD_FLAG_SEND |
+ SECCOMP_ADDFD_FLAG_EPOLL))
return -EINVAL;

if (addfd.newfd && !(addfd.flags & SECCOMP_ADDFD_FLAG_SETFD))
return -EINVAL;

+ if ((flags & SECCOMP_ADDFD_FLAG_EPOLL) && !add.newfd)
+ return -EINVAL;
+
kaddfd.file = fget(addfd.srcfd);
if (!kaddfd.file)
return -EBADF;

You might want to add a new member to struct seccomp_kaddfd so when you replace
the file you can stash the reference to the old file and pass it to the helpers
where things get updated:

struct seccomp_kaddfd {
struct file *file;
+ struct file *old_file;
int fd;
unsigned int flags;
__u32 ioctl_flags;

With this added you can int the future go to each subsystem and ask them to
expose you a function to update file references after you updated it in
the caller's fdtable. But the VFS/fdtable needs nothing to do with this
after replace_fd() has been called.

So then we can place all of these updates under new addfd flags and
maintain backward compatibility for older seccomp clients which I
pointed out in the other mail (So basically yes, we should do flags as
you also asked/suggested in your reply.).

For convenience we can expose defines like SECOMP_ADDFD_FLAG_EVENTS that
allow userspace to specify that they want the full shebang update.

Since the seccomp addfd uapi struct is extensible you should be pretty
much future prove as well.

I'm sure there's subtleties I miss but I think that's a better way forward.

>
> > >
> > > The epmutex is taken when the file is replaced in the epoll interface. This is
> > > similar to what would happen when eventpoll_release would be called for this
> > > same file when it is ultimately released from __fput().
> > >
> > > This is indeed not limited to epoll and the file descriptor table, but this
> > > current patch addresses is limited to these interfaces. We can create a separate
> > > one for io_uring.
> >
> > The criteria for when it's sensible to update an fd to refer to the new
> > file and when not are murky here and tailored to this very specific
> > use-case. We shouldn't be involved in decisions like that. Userspace is
> > in a much better position to know when it's sensible to replace an fd.
> >
> > The fdtable is no place to get involved in a game of "if the fd is in a
> > epoll instance, update the epoll instance, if it's in an io_uring
> > instance, update the io_uring instance, ...". That's a complete
> > layerying violation imho.
> >
> > And even if you'd need to get sign off on this from epoll and io_uring
> > folks as well before we can just reach into other subsytems from the
> > fdtable.
>
> Yep, agreed.
>
> > I'm sorry but this all sounds messy. You can do this just fine in
> > userspace, so please do it in userspace. This all sound very NAKable
> > right now.
>
> We have added lots of APIs for things that are possible from userspace
> already that are made easier with a nice API. The seccomp forwarding
> functionality itself, pidfd_getfd(), etc. I don't see this particular
> bit as a strong argument against.

It's always a question of how much burden we shift into the kernel or a
specific subsystem. IOW, if this results in really clunky apis then it's
probably not something we should o.

2023-03-27 13:11:19

by Tycho Andersen

[permalink] [raw]
Subject: Re: [RFC v4 2/2] file, epoll: Implement do_replace() and eventpoll_replace()

On Mon, Mar 27, 2023 at 11:01:06AM +0200, Christian Brauner wrote:
> On Fri, Mar 24, 2023 at 07:43:13AM -0600, Tycho Andersen wrote:
> > Perhaps we could add a flag that people could set from SECCOMP_ADDFD
> > asking for this extra behavior?
>
> + if (fd > 0 && addfd->ioctl_flags & SECCOMP_ADDFD_FLAG_EPOLL) {
> + /*
> + * - retrieve old struct file that addfd->fd refered to if any.
> + * - call your epoll seccomp api to update the references in the epoll instance
> + */
> epoll_seccomp_notify()
> + }
> +
> + if (fd > 0 && addfd->ioctl_flags & SECCOMP_ADDFD_FLAG_IO_URING) {
> + /*
> + * - call your io_uring seccomp api to update the references in the io_uring instance
> + */
> io_uring_seccomp_notify()
> + }

Looks reasonable to me, thanks.

Tycho

2023-03-27 13:17:37

by Tycho Andersen

[permalink] [raw]
Subject: Re: [RFC v4 2/2] file, epoll: Implement do_replace() and eventpoll_replace()

On Mon, Mar 27, 2023 at 07:06:46AM -0600, Tycho Andersen wrote:
> On Mon, Mar 27, 2023 at 11:01:06AM +0200, Christian Brauner wrote:
> > On Fri, Mar 24, 2023 at 07:43:13AM -0600, Tycho Andersen wrote:
> > > Perhaps we could add a flag that people could set from SECCOMP_ADDFD
> > > asking for this extra behavior?
> >
> > + if (fd > 0 && addfd->ioctl_flags & SECCOMP_ADDFD_FLAG_EPOLL) {
> > + /*
> > + * - retrieve old struct file that addfd->fd refered to if any.
> > + * - call your epoll seccomp api to update the references in the epoll instance
> > + */
> > epoll_seccomp_notify()
> > + }
> > +
> > + if (fd > 0 && addfd->ioctl_flags & SECCOMP_ADDFD_FLAG_IO_URING) {
> > + /*
> > + * - call your io_uring seccomp api to update the references in the io_uring instance
> > + */
> > io_uring_seccomp_notify()
> > + }
>
> Looks reasonable to me, thanks.

One change I might suggest is only using a single flag bit -- we don't
need to consume all of seccomp's remaining flag bits with the various
subsystems. If you want to do this logic for epoll, you almost
certainly want it for io_uring, select, and whatever else is out
there.

Tycho

2023-03-27 20:53:06

by Alok Tiagi

[permalink] [raw]
Subject: Re: [RFC v4 2/2] file, epoll: Implement do_replace() and eventpoll_replace()

On Mon, Mar 27, 2023 at 07:16:39AM -0600, Tycho Andersen wrote:
> On Mon, Mar 27, 2023 at 07:06:46AM -0600, Tycho Andersen wrote:
> > On Mon, Mar 27, 2023 at 11:01:06AM +0200, Christian Brauner wrote:
> > > On Fri, Mar 24, 2023 at 07:43:13AM -0600, Tycho Andersen wrote:
> > > > Perhaps we could add a flag that people could set from SECCOMP_ADDFD
> > > > asking for this extra behavior?
> > >
> > > + if (fd > 0 && addfd->ioctl_flags & SECCOMP_ADDFD_FLAG_EPOLL) {
> > > + /*
> > > + * - retrieve old struct file that addfd->fd refered to if any.
> > > + * - call your epoll seccomp api to update the references in the epoll instance
> > > + */
> > > epoll_seccomp_notify()
> > > + }
> > > +
> > > + if (fd > 0 && addfd->ioctl_flags & SECCOMP_ADDFD_FLAG_IO_URING) {
> > > + /*
> > > + * - call your io_uring seccomp api to update the references in the io_uring instance
> > > + */
> > > io_uring_seccomp_notify()
> > > + }
> >
> > Looks reasonable to me, thanks.
>
> One change I might suggest is only using a single flag bit -- we don't
> need to consume all of seccomp's remaining flag bits with the various
> subsystems. If you want to do this logic for epoll, you almost
> certainly want it for io_uring, select, and whatever else is out
> there.
>
> Tycho

Thank you for your comment and thoughts on this Christian. The per-subsystem
helper and calling this from seccomp add fd looks like a cleaner API. I'll
address the changes in v5.