From: Jann Horn
Date: Wed, 13 Jun 2018 17:32:57 +0200
Subject: Re: [PATCH v3 1/4] seccomp: add a return code to trap to userspace
To: Tycho Andersen
Cc: kernel list, containers@lists.linux-foundation.org, Kees Cook, Andy Lutomirski, Oleg Nesterov, "Eric W. Biederman", "Serge E. Hallyn", Christian Brauner, Tyler Hicks, suda.akihiro@lab.ntt.co.jp, "Tobin C. Harding"
References: <20180531144949.24995-1-tycho@tycho.ws> <20180531144949.24995-2-tycho@tycho.ws> <20180604001812.GE15998@cisco>
In-Reply-To: <20180604001812.GE15998@cisco>
X-Mailing-List: linux-kernel@vger.kernel.org

On Mon, Jun 4, 2018 at 2:18 AM Tycho Andersen wrote:
>
> Hi Jann,
>
> On Sun, Jun 03, 2018 at 08:41:01PM +0200, Jann Horn wrote:
> > On Sun, Jun 3, 2018 at 2:29 PM Tycho Andersen wrote:
> > >
> > > This patch introduces a means for syscalls matched in seccomp to notify
> > > some other task that a particular filter has been triggered.
> > >
> > > The motivation for this is primarily for use with containers. For example,
> > > if a container does an init_module(), we obviously don't want to load this
> > > untrusted code, which may be compiled for the wrong version of the kernel
> > > anyway. Instead, we could parse the module image, figure out which module
> > > the container is trying to load and load it on the host.
> > >
> > > As another example, containers cannot mknod(), since this checks
> > > capable(CAP_SYS_ADMIN). However, harmless devices like /dev/null or
> > > /dev/zero should be ok for containers to mknod, but we'd like to avoid hard
> > > coding some whitelist in the kernel. Another example is mount(), which has
> > > many security restrictions for good reason, but configuration or runtime
> > > knowledge could potentially be used to relax these restrictions.
> > >
> > > This patch adds functionality that is already possible via at least two
> > > other means that I know about, both of which involve ptrace(): first, one
> > > could ptrace attach, and then iterate through syscalls via PTRACE_SYSCALL.
> > > Unfortunately this is slow, so a faster version would be to install a
> > > filter that does SECCOMP_RET_TRACE, which triggers a PTRACE_EVENT_SECCOMP.
> > > Since ptrace allows only one tracer, if the container runtime is that
> > > tracer, users inside the container (or outside) trying to debug it will not
> > > be able to use ptrace, which is annoying. It also means that older
> > > distributions based on Upstart cannot boot inside containers using ptrace,
> > > since upstart itself uses ptrace to start services.
> > >
> > > The actual implementation of this is fairly small, although getting the
> > > synchronization right was/is slightly complex.
> > >
> > > Finally, it's worth noting that the classic seccomp TOCTOU of reading
> > > memory data from the task still applies here, but can be avoided with
> > > careful design of the userspace handler: if the userspace handler reads all
> > > of the task memory that is necessary before applying its security policy,
> > > the tracee's subsequent memory edits will not be read by the tracer.
> > [...]
> > > @@ -857,13 +1020,28 @@ static long seccomp_set_mode_filter(unsigned int flags,
> > >         if (IS_ERR(prepared))
> > >                 return PTR_ERR(prepared);
> > >
> > > +       if (flags & SECCOMP_FILTER_FLAG_GET_LISTENER) {
> > > +               listener = get_unused_fd_flags(O_RDWR);
> >
> > I think you want either 0 or O_CLOEXEC here?
>
> Do we? I suppose it makes sense to be able to set CLOEXEC, but I could
> imagine a case where a handler wanted to fork+exec to handle
> something. I'm happy to make the change, but it's not obvious to me
> that it's what we want by default.

I said "either 0 or O_CLOEXEC" - I just meant that O_RDWR doesn't make
much sense to me here, given that that's not a property of the fd and
will be ignored by the function you're calling.

On whether 0 or O_CLOEXEC is better: If you look at
get_unused_fd_flags() calls in e.g. various ioctl handlers, it's a mix
of places that hardcode 0, places that hardcode O_CLOEXEC, and places
that allow the caller to specify the flag. Either should work - but
personally, I believe that if the caller can't pass a flag,
get_unused_fd_flags(O_CLOEXEC) is the better choice because you can
still clear the O_CLOEXEC flag using fcntl() if necessary, while
setting the flag using fcntl() is potentially racy in a multi-threaded
context.