Date: Wed, 20 May 2020 09:05:23 +0000
From: Sargun Dhillon
To: Kees Cook
Cc: Christian Brauner, Tycho Andersen, Matt Denton, Chris Palmer,
    Jeffrey Vander Stoep, containers@lists.linux-foundation.org,
    linux-api@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: Re: seccomp feature development
Message-ID: <20200520090522.GA25177@ircssh-2.c.rugged-nimbus-611.internal>
References: <202005181120.971232B7B@keescook>
In-Reply-To: <202005181120.971232B7B@keescook>

On Mon, May 18, 2020 at 02:04:57PM -0700, Kees Cook wrote:
> Hi!
>
> This is my attempt at a brain-dump on my plans for nearish-term seccomp
> features. Welcome to my TED talk...
> ;)
>
> These are the things I've been thinking about:
>
> - fd passing
> - deep argument inspection
> - changing structure sizes
> - syscall bitmasks
>
> So, diving right in:
>
>
> ## fd passing
>
> Background: seccomp users want to be able to install an fd in a
> monitored process during a user_notif to emulate "open" calls (or
> similar), possibly across security boundaries, etc.
>
> On the fd passing front, it seems that gaining pidfd_addfd() is the way
> to go as it allows for generic use not tied to seccomp in particular.
> I expect this feature will be developed orthogonally to seccomp (where
> does this stand, BTW?). However, as Sargun has shown[1], seccomp could
> be friendlier to help with using it. Things that need to be resolved:
>
> - report pidnr, or pidfd? It seems the consensus is to pass pidnr, but
>   if we're going to step back and make some design choices here, is
>   there a place for pidfds in seccomp user_notif, in order to avoid
>   needing the user_notif cookie? I think probably not: it's a rather lot
>   of overhead for notifications. It seems it's safe to perform an fd
>   installation with these steps:
>   - get pidnr from user_notif_recv
>   - open pidfd from pidnr
>   - re-verify user_notif cookie is still valid
>   - send new fd via pidfd
>   - reply with user_notif_send
>   - close pidfd
>
> - how to deal with changing sizes of the user_notif structures to
>   include a pidnr. (Which will be its own topic below.)
>
>
> ## deep argument inspection
>
> Background: seccomp users would like to write filters that traverse
> the user pointers passed into many syscalls, but seccomp can't do this
> dereference for a variety of reasons (mostly involving race conditions and
> rearchitecting the entire kernel syscall and copy_from_user() code flows).
>
> During the last plumbers and in conversations since, the grudging
> consensus was reached that having seccomp do this for ALL syscalls was
> likely going to be extremely disruptive for very little gain (i.e.
> many things, like pathnames, have differing lifetimes, aliases, unstable
> kernel object references, etc[6]), but that there was a small subset of
> syscalls for which this WOULD be beneficial, and those are the newly
> created "Extensible Argument" syscalls (is there a better name for this
> design? I'm calling it "EA" for the rest of the email), like clone3(),
> openat2(), etc, which pass a pointer and a size:
>
>     long clone3(struct clone_args *cl_args, size_t size);
>
> I think it should be possible to extend seccomp to examine this structure
> by appending it to seccomp_data, and allowing filters to examine the
> contents. This means that no BPF language extensions are required for
> seccomp, as I'd still prefer to avoid making the eBPF jump (I don't think
> seccomp's design principles work well with maps, kernel helpers, etc,
> and I think the earlier examination of using eBPF for user_notif
> bears this out).
>
> In order for this to work, there are a number of prerequisites:
>
> - argument caching, in two halves: syscall side and seccomp side:
>   - the EA syscalls need to include awareness of potential seccomp
>     hooking. i.e. seccomp may have done the copy_from_user() already and
>     kept a cached copy.
>   - seccomp needs to potentially DO the copy_from_user() itself when it
>     hits these syscalls for a given filter, and put it somewhere for
>     later use by the syscall.
> - the sizes of these EA structs are, by design, growable over time.
>   seccomp and its users need to handle this in a forward and backward
>   compatible way, similar to the design of the EA syscall interface
>   itself.
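
For readers less familiar with the EA size convention mentioned above, here
is a minimal userspace sketch of the forward/backward compatibility rule,
modelled on the semantics of the kernel's copy_struct_from_user() helper.
The names ea_copy_struct() and struct ea_args_v1 are hypothetical and exist
only for illustration; this is not kernel code and not a proposed seccomp
interface.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the version of an EA struct the reader knows. */
struct ea_args_v1 {
	unsigned long long flags;
	unsigned long long mode;
};

/*
 * Userspace model of the EA size rules. Copies a caller-supplied struct of
 * usize bytes into a destination of ksize bytes.
 *
 *  - usize < ksize: older caller; the fields it did not supply read as zero.
 *  - usize > ksize: newer caller; accepted only if every byte beyond ksize
 *    is zero, otherwise -E2BIG is returned.
 */
static int ea_copy_struct(void *dst, size_t ksize,
			  const void *src, size_t usize)
{
	const unsigned char *p = src;
	size_t n = usize < ksize ? usize : ksize;

	assert(dst && src);

	/* Reject fields this reader does not know about unless they are unset. */
	for (size_t i = ksize; i < usize; i++)
		if (p[i] != 0)
			return -E2BIG;

	memset(dst, 0, ksize);	/* zero-fill anything the caller omitted */
	memcpy(dst, src, n);
	return 0;
}
```

A caller passing only the first 8 bytes ends up with mode == 0, and a caller
passing a larger struct succeeds only while the unknown trailing bytes are
zero.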
>
> The argument caching bit is, I think, rather mechanical in nature since
> it's all "just" internal to the kernel: seccomp can likely adjust how it
> allocates seccomp_data (maybe going so far as to have it split across two
> pages with the syscall argument struct always starting on the 2nd page
> boundary), and copying the EA struct into that page, which will be both
> used by the filter and by the syscall. I imagine state tracking ("is
> there a cached EA?", "what is the address of seccomp_data?", "what is
> the address of the EA?") can be associated with the thread struct.
>
> The growing size of the EA struct will need some API design. For filters
> to operate on the contiguous seccomp_data+EA struct, the filter will
> need to know how large seccomp_data is (more on this later), and how
> large the EA struct is. When the filter is written in userspace, it can
> do the math, point into the expected offsets, and get what it needs. For
> this to work correctly in the kernel, though, the seccomp BPF verifier
> needs to know the size of the EA struct as well, so it can correctly
> perform the offset checking (as it currently does for just the
> seccomp_data struct size).
>
> Since there is not really any caller-based "seccomp state" associated
> across seccomp(2) calls, I don't think we can add a new command to tell
> the kernel "I'm expecting the EA struct size to be $foo bytes", since
> the kernel doesn't track who "I" is besides just being "current", which
> doesn't take into account the thread lifetime -- if a process launcher
> knows about one size and the child knows about another, things will get
> confused. The sizes really are just associated with individual filters,
> based on the syscalls they're examining. So, I have thoughts on possible
> solutions:
>
> - create a new seccomp command SECCOMP_SET_MODE_FILTER2 which uses the
>   EA style so we can pass in more than a filter and include also an
>   array of syscall to size mappings. (I don't like this...)
> - create a new filter flag, SECCOMP_FILTER_FLAG_EXTENSIBLE, which changes
>   the meaning of the uarg from "filter" to a EA-style structure with
>   sizes and pointers to the filter and an array of syscall to size
>   mappings. (I like this slightly better, but I still don't like it.)
> - leverage the EA design and just accept anything <= PAGE_SIZE, record
>   the "max offset" value seen during filter verification, and zero-fill
>   the EA struct to that size when constructing the
>   seccomp_data + EA struct that the filter will examine. Then the seccomp
>   filter doesn't care what any of the sizes are, and userspace doesn't
>   care what any of the sizes are. (I like this as it makes the problems
>   to solve contained entirely by the seccomp infrastructure and does not
>   touch user API, but I worry I'm missing some gotcha I haven't
>   considered.)
>

I may be ridiculed for suggesting this approach, and maybe it's a bit
mad-science. I'll be honest, my familiarity with mm is low, and although
I think what I'm describing can be done, I'm unsure of the complexity
and the performance impact.

I was playing with userfaultfd a while ago for somewhat related reasons
and I found a patchset supporting write protection [1]. This got me
wondering whether we could leverage it. What if we had a mechanism to
read a process's memory and simultaneously mark it as read-only?

    #define memstruct_flag_mark_ro  1
    #define notify_on_mod           2

    struct memstruct {
            __u32 flags;
            /* We might even support multiple iovecs... */
            struct iovec *local_iov;
            struct iovec *remote_iov;
    };

Flow:
1. Supervisor (s1) launches child (c1)
2. c1 makes a syscall that triggers seccomp_notif -- let's say clone3 --
   a call where we can't do injection or such on behalf of the child.
3. s1 receives notification + seccomp data
4. s1 calls ioctl(..., SECCOMP_READ_MEM, memstruct) (where memstruct
   is above)
5. When s1 calls this, the pages which are accessed by the iovec get
   marked as read-only, so a fault is triggered if the user process
   tries to change things.
6. s1 says, continue syscall
7. Syscall returns, and pages are reverted to pre-notification state.

Upon fault, if the range being changed lies outside of what was copied
back via the SECCOMP_READ_MEM ioctl, then it is passed through as-is.
If the range is inside, and notify_on_mod is unset, it will SIGSEGV.
If the range is inside, and notify_on_mod is set, we will have to do
a userfaultfd-like notification back down to userspace.

I realize this approach is hacky, and ugly, and potentially goes into
the same territory as userfaultfd, which has proven to be a security
concern, but the benefits are that no kernel, cBPF, etc. need to change.
The only downsides I see are complexity, performance, and the possibility
that some userspace programs innocuously manipulating memory might get
caught in such an iovec.

Thoughts?

> And then, my age-old concern, that maybe doesn't need a solution... I
> remain plagued by the lack of pathname inspection. But I think the
> ToCToU nature of it means we just cannot do it from seccomp. It does
> make filtering openat2()'s EA struct a bit funny... a filter has no idea
> what path it applies to... but that doesn't matter because the object
> the path points to might change[6] during the syscall. Argh.
>
>

I have no idea how to solve this if the process can do stuff like move
files around on disk, or switch out namespaces while in the user
notification.

[1]: https://lwn.net/Articles/786896/
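
To make the fault-handling rules from the write-protect proposal earlier in
this mail concrete, here is a rough userspace model of the decision. It is
only a sketch: classify_fault(), enum fault_action and the upper-case
NOTIFY_ON_MOD are hypothetical names, nothing here is an existing kernel
interface, and it assumes that any overlap with a captured iovec counts as
"inside". It also ignores that real write protection works at page
granularity.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/uio.h>

#define NOTIFY_ON_MOD 2	/* mirrors notify_on_mod from the proposal */

enum fault_action {
	FAULT_PASS_THROUGH,	/* write is outside every captured range */
	FAULT_SIGSEGV,		/* inside a captured range, no notify flag */
	FAULT_NOTIFY,		/* inside a captured range, tell supervisor */
};

/* Nonzero if [addr, addr + len) overlaps the range described by iov. */
static int overlaps(const struct iovec *iov, uintptr_t addr, size_t len)
{
	uintptr_t start = (uintptr_t)iov->iov_base;

	return addr < start + iov->iov_len && start < addr + len;
}

/*
 * Decide what happens to a write fault at [addr, addr + len), given the
 * ranges that were captured (and write-protected) by the supervisor.
 */
static enum fault_action classify_fault(const struct iovec *captured,
					size_t n, unsigned int flags,
					uintptr_t addr, size_t len)
{
	assert(captured || n == 0);

	for (size_t i = 0; i < n; i++) {
		if (overlaps(&captured[i], addr, len))
			return (flags & NOTIFY_ON_MOD) ? FAULT_NOTIFY
						       : FAULT_SIGSEGV;
	}
	return FAULT_PASS_THROUGH;
}
```

With a single captured range covering bytes 16..31 of a buffer, a write to
bytes 0..7 passes through, a write to bytes 20..23 yields FAULT_SIGSEGV
without the flag, and the same write yields FAULT_NOTIFY with it.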