Date: Wed, 15 Jul 2020 16:07:00 -0700
From: Kees Cook
To: Pavel Begunkov
Cc: Miklos Szeredi, Matthew Wilcox, Andy Lutomirski, Jann Horn,
    Stefano Garzarella, Christian Brauner, strace-devel@lists.strace.io,
    io-uring@vger.kernel.org, Linux API, linux-fsdevel@vger.kernel.org,
    linux-kernel@vger.kernel.org, Michael Kerrisk
Subject: Re: strace of io_uring events?
Message-ID: <202007151511.2AA7718@keescook>
References: <20200715171130.GG12769@casper.infradead.org>
    <7c09f6af-653f-db3f-2378-02dca2bc07f7@gmail.com>
    <48cc7eea-5b28-a584-a66c-4eed3fac5e76@gmail.com>
In-Reply-To: <48cc7eea-5b28-a584-a66c-4eed3fac5e76@gmail.com>

Earlier Andy Lutomirski wrote:
> Let’s add some seccomp folks. We probably also want to be able to run
> seccomp-like filters on io_uring requests. So maybe io_uring should call into
> seccomp-and-tracing code for each action.

Okay, I'm finally able to spend time looking at this.
And thank you to the many people who CCed me on this and earlier
discussions (at least Jann, Christian, and Andy).

It *seems* like there is a really clean mapping of SQE OPs to syscalls.
To that end, yes, it should be trivial to add ptrace and seccomp support
(sort of). The trouble comes with doing _interception_, which is how both
ptrace and seccomp are designed.

In the basic case of seccomp, various syscalls are just being checked
for accept/reject. It seems like that would be easy to wire up. For the
more ptrace-y things (SECCOMP_RET_TRAP, SECCOMP_RET_USER_NOTIF, etc),
I think any such results would need to be "upgraded" to "reject". Things
are a bit complex in that seccomp's form of "reject" can be "return
errno" (easy) or it can be "kill thread (or thread_group)" which ...
becomes less clear. (More on this later.)

In the basic case of "I want to run strace", this is really just a
creative use of ptrace in that interception is being used only for
reporting. Does ptrace need to grow a way to create/attach an io_uring
eventfd? Or should there be an entirely different tool for
administrative analysis of io_uring events (kind of how disk IO can be
monitored)?

For io_uring generally, I have a few comments/questions:

- Why did a new syscall get added that couldn't be extended? All new
  syscalls should be using Extended Arguments. :(

- Why aren't the io_uring syscalls in the man-page git? (It seems like
  they're in liburing, but that should document the _library_ not the
  syscalls, yes?)

Speaking to Stefano's proposal[1]:

- There appear to be three classes of desired restrictions:
  - opcodes for io_uring_register() (which can be enforced entirely with
    seccomp right now).
  - opcodes from SQEs (this _could_ be intercepted by seccomp, but is
    not currently written)
  - opcodes of the types of restrictions to restrict... for making sure
    things can't be changed after being set?
    seccomp already enforces that kind of "can only be made stricter"

- Credentials vs no_new_privs needs examination (more on this later)

So, I think, at least for restrictions, seccomp should absolutely be
the place to get this work done. It already covers 2 of the 3 points in
the proposal.

Solving the mapping of seccomp interception types into CQEs (or anything
more severe) will likely inform what it would mean to map ptrace events
to CQEs. So, I think they're related, and we should get seccomp hooked
up right away, and that might help us see how (if) ptrace should be
attached.

Specifically for seccomp, I see at least the following design questions:

- How does no_new_privs play a role in the existing io_uring credential
  management? Using _any_ kind of syscall-effective filtering, whether
  it's seccomp or Stefano's existing proposal, needs to address the
  potential inheritable restrictions across privilege boundaries (which is
  what no_new_privs tries to eliminate). In regular syscall land, this is
  an issue when a filter follows a process through setuid via execve()
  and it gains privileges that now the filter-creator can trick into
  doing weird stuff -- io_uring has a concept of alternative credentials
  so I have to ask about it. (I don't *think* there would be a path to
  install a filter before gaining privilege, but I likely just need to
  do my homework on the io_uring internals.) Regardless, use of seccomp
  by io_uring would need to have this issue "solved" in the sense that
  it must be "safe" to filter io_uring OPs, from a
  privilege-boundary-crossing perspective.

- From which task perspective should filters be applied? It seems like it
  needs to follow the io_uring personalities, as those contain the
  credentials. (This email is a brain-dump so far -- I haven't gone to
  look to see if that means io_uring is literally getting a reference to
  struct cred; I assume so.) Seccomp filters are attached to task_struct.
  However, for v5.9, seccomp will gain a more generalized get/put system
  for having filters attached to the SECCOMP_RET_USER_NOTIF fd. Adding
  more get/put-ers for some part of the io_uring context shouldn't
  be hard.

- How should seccomp return values be applied? Three seem okay:
	SECCOMP_RET_ALLOW: do SQE action normally
	SECCOMP_RET_LOG: do SQE action, log via seccomp
	SECCOMP_RET_ERRNO: skip actions in SQE and pass errno to CQE
  The rest not so much:
	SECCOMP_RET_TRAP: can't send SIGSYS anywhere sane?
	SECCOMP_RET_TRACE: no tracer, can't send SIGSYS?
	SECCOMP_RET_USER_NOTIF: can't do user_notif rewrites?
	SECCOMP_RET_KILL_THREAD: kill which thread?
	SECCOMP_RET_KILL_PROCESS: kill which thread group?
  If TRAP, TRACE, and USER_NOTIF need to be "upgraded" to KILL_THREAD,
  what does KILL_THREAD mean? Does it really mean "shut down the entire
  SQ?" Does it mean kill the worker thread? Does KILL_PROCESS mean kill
  all the tasks with an open mapping for the SQ?

Anyway, I'd love to hear what folks think, but given the very direct
mapping from SQE OPs to syscalls, I really think seccomp needs to be
inserted in here somewhere to maintain any kind of sensible reasoning
about syscall filtering.

-Kees

[1] https://lore.kernel.org/lkml/20200710141945.129329-3-sgarzare@redhat.com/

--
Kees Cook
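
To make the return-value question concrete, here is a minimal userspace
sketch of how a seccomp filter result might be turned into an SQE
outcome. It is a model only, not kernel code and not anything io_uring
or seccomp implements today. The numeric constants mirror
include/uapi/linux/seccomp.h; the function name sqe_disposition() and
the disposition strings ("run", "run_and_log", "fail_cqe", "unresolved")
are hypothetical names made up for illustration. The only assumption
beyond the list above is that a rejected SQE would post a CQE whose res
field carries the negated errno, as io_uring completions already do for
failed operations.

```python
# Action values mirroring include/uapi/linux/seccomp.h.
SECCOMP_RET_KILL_PROCESS = 0x80000000
SECCOMP_RET_KILL_THREAD = 0x00000000
SECCOMP_RET_TRAP = 0x00030000
SECCOMP_RET_ERRNO = 0x00050000
SECCOMP_RET_USER_NOTIF = 0x7FC00000
SECCOMP_RET_TRACE = 0x7FF00000
SECCOMP_RET_LOG = 0x7FFC0000
SECCOMP_RET_ALLOW = 0x7FFF0000

# Masks splitting a filter return value into action and 16-bit data.
SECCOMP_RET_ACTION_FULL = 0xFFFF0000
SECCOMP_RET_DATA = 0x0000FFFF


def sqe_disposition(filter_ret):
    """Map a seccomp filter return value to a (disposition, cqe_res) pair.

    Hypothetical model: only the three "okay" actions get a definite
    answer; everything else is reported as "unresolved" because the
    semantics are still an open design question.
    """
    action = filter_ret & SECCOMP_RET_ACTION_FULL
    data = filter_ret & SECCOMP_RET_DATA

    if action == SECCOMP_RET_ALLOW:
        return ("run", 0)            # do the SQE action normally
    if action == SECCOMP_RET_LOG:
        return ("run_and_log", 0)    # do the SQE action, log via seccomp
    if action == SECCOMP_RET_ERRNO:
        return ("fail_cqe", -data)   # skip the action, errno goes to the CQE
    # TRAP, TRACE, USER_NOTIF, KILL_THREAD, KILL_PROCESS: no sane mapping yet.
    return ("unresolved", 0)


if __name__ == "__main__":
    EACCES = 13
    print(sqe_disposition(SECCOMP_RET_ALLOW))
    print(sqe_disposition(SECCOMP_RET_ERRNO | EACCES))
    print(sqe_disposition(SECCOMP_RET_USER_NOTIF))
```

The "unresolved" bucket is deliberately wide: whether those actions get
upgraded to a kill, and what a kill would even target, is exactly the
part that still needs deciding.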