Received: by 2002:a05:6a10:a852:0:0:0:0 with SMTP id d18csp675827pxy; Fri, 30 Apr 2021 13:51:23 -0700 (PDT) X-Google-Smtp-Source: ABdhPJydR7rg6HsRJ7SyZUAzx54pVctBIorM60C0JPGB4fzixwuRZTfIkfpdxUH7vYTOsHm6XsnO X-Received: by 2002:aa7:900c:0:b029:27d:7e15:11b2 with SMTP id m12-20020aa7900c0000b029027d7e1511b2mr6240866pfo.45.1619815883651; Fri, 30 Apr 2021 13:51:23 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1619815883; cv=none; d=google.com; s=arc-20160816; b=JxPHcWDDmXTt0ChDTaJVTYKXk5QHWs7SjzNz8OU9jjaN9okjpt+W5S3okPCQRzG8rc uQR5mNHTny5gTmblN0eMu7Al1+WwGyTh+imPPgSJ5JywynsmUlkPVZwjh2PCDiT2dNQ0 dND/MbgILF5Zf8+dmDQcSLQavUIAlj4QQvcH+hwsBiqibRpTTVDUNiU8gsnBItuXeSFQ tnB8kOWXTY8UUrUYa1CcC9BTXsZSn7H/3Nd3hC84Sul9GNu96OSDuXsisgiw6zTicHr+ 61/+dhEn/CjQdMq1xWZqomeBlqicJp2Es0i9y7GiCYXoAiA0CrmFytS5ibWdGmz2L0F4 wS4Q== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:content-transfer-encoding:mime-version :references:in-reply-to:message-id:date:subject:cc:to:from :dkim-signature; bh=oVd2WMDZlRcVHbWT2xgesmGD/N8sDYwKT+s1aPdX054=; b=TST/tGTN1VWu/v1sxEZC+XvzcoKZ97bNtTsfqhu9h/CqkzuBmqUkOYMhePsdnVftsr +VYw/Q6iYwwCcZJWMryuHa+0ttdigHzV6ydmSTUgh6Y1+hn4elZv3Xi01OQpQDFa5E2b eCzxuKa0rvqwdS3TxK0pH72fAjL5E+oHWYNO6cCGeMBTODmY5oArO7/nelUOYQchtUsm aHvL9gFDofPQs6Qi6IUqNmQdFUpnP4baREJ9eRzNjMQLtv+fcoVy+2WFlYjfFDr1t6jn edxvZJ7ySR7odhnG04NMDKr8xv5Ofl10Jx6RflaY0frkonfjTxy3tqGhmB45Qf3ucc1f 8HYg== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@sargun.me header.s=google header.b=ttm6ofaY; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org Return-Path: Received: from vger.kernel.org (vger.kernel.org. [23.128.96.18]) by mx.google.com with ESMTP id e7si4082818pgs.95.2021.04.30.13.51.09; Fri, 30 Apr 2021 13:51:23 -0700 (PDT) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) client-ip=23.128.96.18; Authentication-Results: mx.google.com; dkim=pass header.i=@sargun.me header.s=google header.b=ttm6ofaY; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S236101AbhD3Uuz (ORCPT + 99 others); Fri, 30 Apr 2021 16:50:55 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:46900 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S236031AbhD3Uum (ORCPT ); Fri, 30 Apr 2021 16:50:42 -0400 Received: from mail-qk1-x734.google.com (mail-qk1-x734.google.com [IPv6:2607:f8b0:4864:20::734]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 7410EC06174A for ; Fri, 30 Apr 2021 13:49:53 -0700 (PDT) Received: by mail-qk1-x734.google.com with SMTP id k127so16535432qkc.6 for ; Fri, 30 Apr 2021 13:49:53 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=sargun.me; s=google; h=from:to:cc:subject:date:message-id:in-reply-to:references :mime-version:content-transfer-encoding; bh=oVd2WMDZlRcVHbWT2xgesmGD/N8sDYwKT+s1aPdX054=; b=ttm6ofaY0pxaGM3kpeUodAQJyrehN+55UB9ega22plk3/VUxcl1IDMhqdYnML2Rrk7 5TBcy0RzwB9UfGXP8jzfdt17zH5YxEDmKaFcWDlBsuFVSAhEALdemlEjoRlEhKzXUUqK JAopPYaoqW8yg8vT59s19lnnTAufHseAvGpec= X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references:mime-version:content-transfer-encoding; bh=oVd2WMDZlRcVHbWT2xgesmGD/N8sDYwKT+s1aPdX054=; b=mDx9h+I8BSSo65yEnoGgSbhhSlmI+bidNnORhx9qlQvP297J3NykHQ+lsxvYzRTAZC t7l45EqnIMFOaAABxEKekvhDJLigUrxETUOwarVJRmTXEKkY3h39nkRjCUePEG/XSwib 3FL1se0CKtJI6OkRXA3kK0sB8b6c1/S33393rpD9l6dEiREv/aYDubHMmFI0d878XlOQ gcmQ/cb5Hc7SILw68En8XVM9t3nGujqfPuFIQ058735IUzS8rXilO7kTZyWSaCCBiYWQ DXr2OdCDts/0joOkhVCh/SswQ0549xWFLhZaF56R3Fqa4585+VJncHKU7LeMkr/osORb MJwg== X-Gm-Message-State: AOAM532UL0Jlpp72XytWXMgcbVHxXvk87OBfThwRUGN02PI0TiHXKiax DSgdy1TyqgG6Q2p4y3VdWNYvXw== X-Received: by 2002:a37:7b41:: with SMTP id w62mr7434442qkc.256.1619815792499; Fri, 30 Apr 2021 13:49:52 -0700 (PDT) Received: from ubuntu.netflix.com (136-25-20-203.cab.webpass.net. [136.25.20.203]) by smtp.gmail.com with ESMTPSA id z17sm3161960qtf.10.2021.04.30.13.49.50 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Fri, 30 Apr 2021 13:49:51 -0700 (PDT) From: Sargun Dhillon To: Kees Cook , LKML , Linux Containers Cc: Rodrigo Campos , =?UTF-8?q?Mauricio=20V=C3=A1squez=20Bernal?= , Tycho Andersen , Giuseppe Scrivano , Christian Brauner , Andy Lutomirski , Sargun Dhillon Subject: [PATCH v2 4/5] seccomp: Support atomic "addfd + send reply" Date: Fri, 30 Apr 2021 13:49:38 -0700 Message-Id: <20210430204939.5152-5-sargun@sargun.me> X-Mailer: git-send-email 2.25.1 In-Reply-To: <20210430204939.5152-1-sargun@sargun.me> References: <20210430204939.5152-1-sargun@sargun.me> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org From: Rodrigo Campos Alban Crequy reported a race condition userspace faces when we want to add some fds and make the syscall return them[1] using seccomp notify. The problem is that currently two different ioctl() calls are needed by the process handling the syscalls (agent) for another userspace process (target): SECCOMP_IOCTL_NOTIF_ADDFD to allocate the fd and SECCOMP_IOCTL_NOTIF_SEND to return that value. Therefore, it is possible for the agent to do the first ioctl to add a file descriptor but the target is interrupted (EINTR) before the agent does the second ioctl() call. Other patches in this series add a way to block signals when a syscall is put to wait by seccomp. However, that might be a big hammer for some cases, as the golang runtime uses SIGURG to interrupt threads for GC collection. Sometimes we just don't want to interfere with the GC, for example, and just either add the fd and return it or fail the syscall. With no leaking fds added inadvertly to the target process. This patch adds a flag to the ADDFD ioctl() so it adds the fd and returns that value atomically to the target program, as suggested by Kees Cook[2]. This is done by simply allowing seccomp_do_user_notification() to add the fd and return it in this case. Therefore, in this case the target wakes up from the wait in seccomp_do_user_notification() either to interrupt the syscall or to add the fd and return it. This "allocate an fd and return" functionality is useful for syscalls that return a file descriptor only, like connect(2). Other syscalls that return a file descriptor but not as return value (or return more than one fd), like socketpair(), pipe(), recvmsg with SCM_RIGHTs, will not work with this flag. The way to go to emulate those in cases where a signal might interrupt is to use the functionality to block signals. The struct seccomp_notif_resp, used when doing SECCOMP_IOCTL_NOTIF_SEND ioctl() to send a response to the target, has three more fields that we don't allow to set when doing the addfd ioctl() to also return. The reasons to disallow each field are: * val: This will be set to the new allocated fd. No point taking it from userspace in this case. * error: If this is non-zero, the value is ignored. Therefore, it is pointless in this case as we want to return the value. * flags: The only flag is to let userspace continue to execute the syscall. This seems pointless, as we want the syscall to return the allocated fd. This is why those fields are not possible to set when using this new flag. [1]: https://lore.kernel.org/lkml/CADZs7q4sw71iNHmV8EOOXhUKJMORPzF7thraxZYddTZsxta-KQ@mail.gmail.com/ [2]: https://lore.kernel.org/lkml/202012011322.26DCBC64F2@keescook/ Signed-off-by: Rodrigo Campos Signed-off-by: Sargun Dhillon --- include/uapi/linux/seccomp.h | 1 + kernel/seccomp.c | 49 +++++++++++++++++++++++++++++++++--- 2 files changed, 46 insertions(+), 4 deletions(-) diff --git a/include/uapi/linux/seccomp.h b/include/uapi/linux/seccomp.h index 71dbc1f7889f..6a14c39a6e05 100644 --- a/include/uapi/linux/seccomp.h +++ b/include/uapi/linux/seccomp.h @@ -115,6 +115,7 @@ struct seccomp_notif_resp { /* valid flags for seccomp_notif_addfd */ #define SECCOMP_ADDFD_FLAG_SETFD (1UL << 0) /* Specify remote fd */ +#define SECCOMP_ADDFD_FLAG_SEND (1UL << 1) /* Addfd and return it, atomically */ /** * struct seccomp_notif_addfd diff --git a/kernel/seccomp.c b/kernel/seccomp.c index 7ac1cea6e7f0..eed3294f3df8 100644 --- a/kernel/seccomp.c +++ b/kernel/seccomp.c @@ -109,6 +109,7 @@ struct seccomp_knotif { * installing process should allocate the fd as normal. * @flags: The flags for the new file descriptor. At the moment, only O_CLOEXEC * is allowed. + * @ioctl_flags: The flags used for the seccomp_addfd ioctl. * @ret: The return value of the installing process. It is set to the fd num * upon success (>= 0). * @completion: Indicates that the installing process has completed fd @@ -120,6 +121,7 @@ struct seccomp_kaddfd { struct file *file; int fd; unsigned int flags; + __u32 ioctl_flags; /* To only be set on reply */ int ret; @@ -1064,14 +1066,35 @@ static u64 seccomp_next_notify_id(struct seccomp_filter *filter) return filter->notif->next_id++; } -static void seccomp_handle_addfd(struct seccomp_kaddfd *addfd) +static void seccomp_handle_addfd(struct seccomp_kaddfd *addfd, struct seccomp_knotif *n) { + int fd; + /* * Remove the notification, and reset the list pointers, indicating * that it has been handled. */ list_del_init(&addfd->list); - addfd->ret = receive_fd_replace(addfd->fd, addfd->file, addfd->flags); + fd = receive_fd_replace(addfd->fd, addfd->file, addfd->flags); + + addfd->ret = fd; + + if (addfd->ioctl_flags & SECCOMP_ADDFD_FLAG_SEND) { + /* If we fail reset and return an error to the notifier */ + if (fd < 0) { + n->state = SECCOMP_NOTIFY_SENT; + } else { + /* Return the FD we just added */ + n->flags = 0; + n->error = 0; + n->val = fd; + } + } + + /* + * Mark the notification as completed. From this point, addfd mem + * might be invalidated and we can't safely read it anymore. + */ complete(&addfd->completion); } @@ -1143,7 +1166,7 @@ static int seccomp_do_user_notification(int this_syscall, struct seccomp_kaddfd, list); /* Check if we were woken up by a addfd message */ if (addfd) - seccomp_handle_addfd(addfd); + seccomp_handle_addfd(addfd, &n); } while (n.state != SECCOMP_NOTIFY_REPLIED); @@ -1610,7 +1633,7 @@ static long seccomp_notify_addfd(struct seccomp_filter *filter, if (addfd.newfd_flags & ~O_CLOEXEC) return -EINVAL; - if (addfd.flags & ~SECCOMP_ADDFD_FLAG_SETFD) + if (addfd.flags & ~(SECCOMP_ADDFD_FLAG_SETFD | SECCOMP_ADDFD_FLAG_SEND)) return -EINVAL; if (addfd.newfd && !(addfd.flags & SECCOMP_ADDFD_FLAG_SETFD)) @@ -1620,6 +1643,7 @@ static long seccomp_notify_addfd(struct seccomp_filter *filter, if (!kaddfd.file) return -EBADF; + kaddfd.ioctl_flags = addfd.flags; kaddfd.flags = addfd.newfd_flags; kaddfd.fd = (addfd.flags & SECCOMP_ADDFD_FLAG_SETFD) ? addfd.newfd : -1; @@ -1645,6 +1669,23 @@ static long seccomp_notify_addfd(struct seccomp_filter *filter, goto out_unlock; } + if (addfd.flags & SECCOMP_ADDFD_FLAG_SEND) { + /* + * Disallow queuing an atomic addfd + send reply while there are + * some addfd requests still to process. + * + * There is no clear reason to support it and allows us to keep + * the loop on the other side straight-forward. + */ + if (!list_empty(&knotif->addfd)) { + ret = -EBUSY; + goto out_unlock; + } + + /* Allow exactly only one reply */ + knotif->state = SECCOMP_NOTIFY_REPLIED; + } + list_add(&kaddfd.list, &knotif->addfd); complete(&knotif->ready); mutex_unlock(&filter->notify_lock); -- 2.25.1