References: <20180531144949.24995-1-tycho@tycho.ws> <20180531144949.24995-2-tycho@tycho.ws>
In-Reply-To: <20180531144949.24995-2-tycho@tycho.ws>
From: Jann Horn
Date: Sun, 3 Jun 2018 20:41:01 +0200
Subject: Re: [PATCH v3 1/4] seccomp: add a return code to trap to userspace
To: tycho@tycho.ws
Cc: kernel list , containers@lists.linux-foundation.org, Kees Cook ,
	Andy Lutomirski , Oleg Nesterov , "Eric W. Biederman" ,
	"Serge E. Hallyn" , christian.brauner@ubuntu.com, Tyler Hicks ,
	suda.akihiro@lab.ntt.co.jp, "Tobin C. Harding"
X-Mailing-List: linux-kernel@vger.kernel.org

On Sun, Jun 3, 2018 at 2:29 PM Tycho Andersen wrote:
>
> This patch introduces a means for syscalls matched in seccomp to notify
> some other task that a particular filter has been triggered.
>
> The motivation for this is primarily for use with containers. For example,
> if a container does an init_module(), we obviously don't want to load this
> untrusted code, which may be compiled for the wrong version of the kernel
> anyway. Instead, we could parse the module image, figure out which module
> the container is trying to load and load it on the host.
>
> As another example, containers cannot mknod(), since this checks
> capable(CAP_SYS_ADMIN). However, harmless devices like /dev/null or
> /dev/zero should be ok for containers to mknod, but we'd like to avoid hard
> coding some whitelist in the kernel. Another example is mount(), which has
> many security restrictions for good reason, but configuration or runtime
> knowledge could potentially be used to relax these restrictions.
>
> This patch adds functionality that is already possible via at least two
> other means that I know about, both of which involve ptrace(): first, one
> could ptrace attach, and then iterate through syscalls via PTRACE_SYSCALL.
> Unfortunately this is slow, so a faster version would be to install a
> filter that does SECCOMP_RET_TRACE, which triggers a PTRACE_EVENT_SECCOMP.
> Since ptrace allows only one tracer, if the container runtime is that
> tracer, users inside the container (or outside) trying to debug it will not
> be able to use ptrace, which is annoying. It also means that older
> distributions based on Upstart cannot boot inside containers using ptrace,
> since upstart itself uses ptrace to start services.
>
> The actual implementation of this is fairly small, although getting the
> synchronization right was/is slightly complex.
>
> Finally, it's worth noting that the classic seccomp TOCTOU of reading
> memory data from the task still applies here, but can be avoided with
> careful design of the userspace handler: if the userspace handler reads all
> of the task memory that is necessary before applying its security policy,
> the tracee's subsequent memory edits will not be read by the tracer.
[...]
> @@ -857,13 +1020,28 @@ static long seccomp_set_mode_filter(unsigned int flags,
>  	if (IS_ERR(prepared))
>  		return PTR_ERR(prepared);
>
> +	if (flags & SECCOMP_FILTER_FLAG_GET_LISTENER) {
> +		listener = get_unused_fd_flags(O_RDWR);

I think you want either 0 or O_CLOEXEC here?
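
To illustrate the distinction behind that remark: as far as I know,
get_unused_fd_flags() only looks at its argument for O_CLOEXEC, while the
access mode belongs to the struct file that is installed later. The
following is a userspace sketch, not kernel code, and the helper names
open_devnull(), fd_is_cloexec(), fd_access_mode() and self_check() are
made up for illustration. It shows the same split as seen through
fcntl(): close-on-exec is a descriptor flag read with F_GETFD, and the
access mode is a file status property read with F_GETFL.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: open /dev/null read-write with optional extra flags. */
int open_devnull(int extra_flags)
{
	return open("/dev/null", O_RDWR | extra_flags);
}

/* Returns 1 if close-on-exec is set on fd, 0 if not, -1 on error. */
int fd_is_cloexec(int fd)
{
	int flags = fcntl(fd, F_GETFD);	/* descriptor flags */

	if (flags < 0)
		return -1;
	return (flags & FD_CLOEXEC) ? 1 : 0;
}

/* Returns the access mode (O_RDONLY, O_WRONLY or O_RDWR), -1 on error. */
int fd_access_mode(int fd)
{
	int flags = fcntl(fd, F_GETFL);	/* file status flags */

	if (flags < 0)
		return -1;
	return flags & O_ACCMODE;
}

/* Usage example: O_CLOEXEC changes the descriptor, not the access mode. */
void self_check(void)
{
	int plain = open_devnull(0);
	int cloexec = open_devnull(O_CLOEXEC);

	assert(plain >= 0 && cloexec >= 0);
	assert(fd_is_cloexec(plain) == 0);
	assert(fd_is_cloexec(cloexec) == 1);
	assert(fd_access_mode(plain) == fd_access_mode(cloexec));
	close(plain);
	close(cloexec);
}
```

The access mode is the same for both descriptors; only the close-on-exec
bit differs, which is the only thing the fd allocation flags control.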
> +out_put_fd:
> +	if (flags & SECCOMP_FILTER_FLAG_GET_LISTENER) {
> +		if (ret < 0) {
> +			fput(listener_f);
> +			put_unused_fd(listener);
> +		} else {
> +			fd_install(listener, listener_f);
> +			ret = listener;
> +		}
> +	}
>  out_free:
>  	seccomp_filter_free(prepared);
>  	return ret;
[...]
> +static __poll_t seccomp_notify_poll(struct file *file,
> +				    struct poll_table_struct *poll_tab)
> +{
> +	struct seccomp_filter *filter = file->private_data;
> +	__poll_t ret = 0;
> +	struct seccomp_knotif *cur;
> +
> +	ret = mutex_lock_interruptible(&filter->notify_lock);
> +	if (ret < 0)
> +		return ret;
> +
> +	list_for_each_entry(cur, &filter->notifications, list) {
> +		if (cur->state == SECCOMP_NOTIFY_INIT)
> +			ret |= EPOLLIN | EPOLLRDNORM;
> +		if (cur->state == SECCOMP_NOTIFY_SENT)
> +			ret |= EPOLLOUT | EPOLLWRNORM;
> +	}
> +
> +	mutex_unlock(&filter->notify_lock);
> +
> +	return ret;
> +}

I don't think f_op->poll handlers work like this. AFAIK you're supposed
to use something like poll_wait() to connect the caller to something
like a waitqueue head, so that as soon as the file becomes ready for
reading/writing, any waiting task is notified. See eventfd_poll() in
fs/eventfd.c for a simple example. AFAICS in the current code,
seccomp_notify_poll() only works if an event is pending at the time
seccomp_notify_poll() is called.
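
The behavior a poll handler is expected to provide can be observed from
userspace. The following is a Linux-only sketch using an eventfd, chosen
because eventfd_poll() is the example named above; it does not show any
kernel code, and the function names poll_sees_late_event() and
self_check() are made up for illustration. The poll() call starts while
nothing is pending and should still return once a child process makes
the descriptor readable a little later.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <poll.h>
#include <stdint.h>
#include <sys/eventfd.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Returns 1 if a poll() that began with no event pending is woken by a
 * later write, 0 if it timed out instead, and -1 on error.
 */
int poll_sees_late_event(void)
{
	struct pollfd pfd;
	int efd, ready, woke;
	pid_t pid;

	efd = eventfd(0, EFD_CLOEXEC);
	if (efd < 0)
		return -1;

	pfd.fd = efd;
	pfd.events = POLLIN;
	pfd.revents = 0;

	/* Nothing has been written yet, so a zero-timeout poll sees nothing. */
	if (poll(&pfd, 1, 0) != 0) {
		close(efd);
		return -1;
	}

	pid = fork();
	if (pid < 0) {
		close(efd);
		return -1;
	}
	if (pid == 0) {
		/* Child: make the eventfd readable after the parent is waiting. */
		uint64_t one = 1;
		ssize_t n;

		usleep(100 * 1000);
		n = write(efd, &one, sizeof(one));
		_exit(n == (ssize_t)sizeof(one) ? 0 : 1);
	}

	/* Parent: block in poll() before the event exists. */
	ready = poll(&pfd, 1, 5000);
	woke = (ready == 1 && (pfd.revents & POLLIN)) ? 1 : 0;

	waitpid(pid, NULL, 0);
	close(efd);
	return ready < 0 ? -1 : woke;
}

/* Usage example: the late write is expected to wake the waiting poll(). */
void self_check(void)
{
	assert(poll_sees_late_event() == 1);
}
```

A handler that only reports state at call time and never registers the
caller on a waitqueue would, as I understand it, leave the parent above
sleeping until its timeout expires.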