Subject: Re: strace of io_uring events?
To: Andy Lutomirski
Cc: Andres Freund, Stefano Garzarella, Christoph Hellwig, Kees Cook, Pavel Begunkov, Miklos Szeredi, Matthew Wilcox, Jann Horn, Christian Brauner, strace-devel@lists.strace.io, io-uring@vger.kernel.org, Linux API, Linux FS Devel, LKML, Michael Kerrisk, Stefan Hajnoczi
References: <20200715171130.GG12769@casper.infradead.org> <7c09f6af-653f-db3f-2378-02dca2bc07f7@gmail.com> <48cc7eea-5b28-a584-a66c-4eed3fac5e76@gmail.com> <202007151511.2AA7718@keescook> <20200716131404.bnzsaarooumrp3kx@steredhat> <202007160751.ED56C55@keescook> <20200717080157.ezxapv7pscbqykhl@steredhat.lan> <39a3378a-f8f3-6706-98c8-be7017e64ddb@kernel.dk>
From: Jens Axboe
Message-ID: <65ad6c17-37d0-da30-4121-43554ad8f51f@kernel.dk>
Date: Tue, 21 Jul 2020 12:39:20 -0600
X-Mailing-List: linux-kernel@vger.kernel.org

On 7/21/20 11:44 AM, Andy Lutomirski wrote:
> On Tue, Jul 21, 2020 at 10:30 AM Jens Axboe wrote:
>>
>> On 7/21/20 11:23 AM, Andy Lutomirski wrote:
>>> On Tue, Jul 21, 2020 at 8:31 AM Jens Axboe wrote:
>>>>
>>>> On 7/21/20 9:27 AM, Andy Lutomirski wrote:
>>>>> On Fri, Jul 17, 2020 at 1:02 AM Stefano Garzarella wrote:
>>>>>>
>>>>>> On Thu, Jul 16, 2020 at 08:12:35AM -0700, Kees Cook wrote:
>>>>>>> On Thu, Jul 16, 2020 at 03:14:04PM +0200, Stefano Garzarella wrote:
>>>>>
>>>>>>> access (IIUC) is possible without actually calling any of the io_uring
>>>>>>> syscalls. Is that correct? A process would receive an fd (via SCM_RIGHTS,
>>>>>>> pidfd_getfd, or soon seccomp addfd), and then call mmap() on it to gain
>>>>>>> access to the SQ and CQ, and off it goes? (The only glitch I see is
>>>>>>> waking up the worker thread?)
>>>>>>
>>>>>> It is true only if the io_uring instance is created with the SQPOLL flag (not the
>>>>>> default behaviour and it requires CAP_SYS_ADMIN). In this case the
>>>>>> kthread is created and you can also set a higher idle time for it, so
>>>>>> the wake-up syscall can also be avoided.
>>>>>
>>>>> I stared at the io_uring code for a while, and I'm wondering if we're
>>>>> approaching this the wrong way. It seems to me that most of the
>>>>> complications here come from the fact that io_uring SQEs don't clearly
>>>>> belong to any particular security principal. (We have struct creds,
>>>>> but we don't really have a task or mm.) But I'm also not convinced
>>>>> that io_uring actually supports cross-mm submission except by accident
>>>>> -- as it stands, unless a user is very careful to only submit SQEs
>>>>> that don't use user pointers, the results will be unpredictable.
>>>>
>>>> How so?
>>>
>>> Unless I've missed something, either current->mm or sqo_mm will be
>>> used depending on which thread ends up doing the IO. (And there might
>>> be similar issues with threads.) Having the user memory references
>>> end up somewhere that is an implementation detail seems suboptimal.
>>
>> current->mm is always used from the entering task - obviously if done
>> synchronously, but also if it needs to go async. The only exception is a
>> setup with SQPOLL, in which case ctx->sqo_mm is the mm of the task that
>> set up the ring. SQPOLL requires root privileges to set up, and there's
>> not necessarily any task entering the io_uring at all. It'll just submit
>> sqes with the credentials that are registered with the ring.
>
> Really? I admit I haven't fully followed how the code works, but it
> looks like anything that goes through the io_queue_async_work() path
> will use sqo_mm, and can't most requests that end up blocking end up
> there?
> It looks like, even if SQPOLL is not set, the mm used will
> depend on whether the request ends up blocking and thus getting queued
> for later completion.
>
> Or does some magic I missed make this a nonissue?

No, you are wrong. The logic works as I described it.

>> This is just one known use case, there may very well be others. Outside
>> of SQPOLL, which is special, I don't see a reason to restrict this.
>> Given that you may have a fuller understanding of it after the above
>> explanation, please clearly state what problem you're seeing that
>> warrants a change.
>
> I see two fundamental issues:
>
> 1. The above. This may be less of an issue than it seems to me, but,
> if you submit io from outside sqo_mm, the mm that ends up being used
> depends on whether the IO is completed from io_uring_enter() or from
> the workqueue. For something like Postgres, I guess this is okay
> because the memory is MAP_ANONYMOUS | MAP_SHARED and the pointers all
> point to the same place regardless.

No, that is incorrect. If you disregard SQPOLL, then the 'mm' is always
that of the task that submitted the request.

> 2. If you create an io_uring and io_uring_enter() it from a different
> mm, it's unclear what seccomp is supposed to do. (Or audit, for that
> matter.) Which task did the IO? Which mm did the IO? Whose sandbox
> is supposed to be applied?

Also doesn't seem like a problem, if you understand the 'mm' logic
above. Unless SQPOLL is used, the entering task's mm will be used.
There's no mixing of tasks and mm outside of that.

--
Jens Axboe
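
For readers following the thread, the mm rule stated above can be written down as a toy model. This is only a sketch of the rule as described in this message, not kernel code: `Task`, `Ring`, `mm_for_request`, and the `goes_async` parameter are hypothetical names invented for illustration, and `Ring.creator` merely stands in for the role that ctx->sqo_mm plays.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Task:
    """Hypothetical stand-in for a task; `mm` labels its address space."""
    name: str
    mm: str


@dataclass(frozen=True)
class Ring:
    """Hypothetical stand-in for an io_uring instance."""
    creator: Task          # task that set up the ring (role of ctx->sqo_mm)
    sqpoll: bool = False   # True models a ring created with the SQPOLL flag


def mm_for_request(ring: Ring,
                   entering_task: Optional[Task] = None,
                   goes_async: bool = False) -> str:
    """Return the mm used for a request, per the rule described in the email.

    `goes_async` is accepted only to make explicit that, in this model,
    going async does not change the answer.
    """
    if ring.sqpoll:
        # SQPOLL: no task necessarily enters the ring, so the mm of the
        # task that set up the ring is used.
        return ring.creator.mm
    if entering_task is None:
        raise ValueError("without SQPOLL, a task must enter the ring to submit")
    # Otherwise: always the entering task's mm, sync or async.
    return entering_task.mm


if __name__ == "__main__":
    creator = Task("creator", mm="mm-A")
    other = Task("other", mm="mm-B")
    print(mm_for_request(Ring(creator), other, goes_async=True))
    print(mm_for_request(Ring(creator, sqpoll=True)))
```

In this model, a request submitted by `other` on a ring created by `creator` resolves to `other`'s mm whether or not it goes async. Only a ring with `sqpoll=True` resolves to the creator's mm.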