From: Jens Axboe
Date: Sun, 26 Jan 2020 09:57:29 -0700
Subject: Re: [PATCH 5.4 033/222] io_uring: only allow submit from owning task
To: Andres Freund, Stefan Metzmacher
Cc: Greg Kroah-Hartman, linux-kernel@vger.kernel.org, stable@vger.kernel.org, io-uring
In-Reply-To: <20200126055457.5w4f5jyhkic7cixu@alap3.anarazel.de>
References: <20200122092833.339495161@linuxfoundation.org> <20200122092835.852416399@linuxfoundation.org> <1b4a79c1-6cda-12a8-219b-0c1c146faeff@samba.org> <20200126055457.5w4f5jyhkic7cixu@alap3.anarazel.de>

On 1/25/20 10:54 PM, Andres Freund wrote:
> Hi,
>
> On 2020-01-24 11:38:02 +0100, Stefan Metzmacher wrote:
>> Am 22.01.20 um 10:26 schrieb Greg Kroah-Hartman:
>>> From: Jens Axboe
>>>
>>> commit 44d282796f81eb1debc1d7cb53245b4cb3214cb5 upstream.
>>>
>>> If the credentials or the mm doesn't match, don't allow the task to
>>> submit anything on behalf of this ring. The task that owns the ring can
>>> pass the file descriptor to another task, but we don't want to allow
>>> that task to submit an SQE that then assumes the ring mm and creds if
>>> it needs to go async.
>>>
>>> Cc: stable@vger.kernel.org
>>> Suggested-by: Stefan Metzmacher
>>> Signed-off-by: Jens Axboe
>>> Signed-off-by: Greg Kroah-Hartman
>>>
>>> ---
>>>  fs/io_uring.c | 6 ++++++
>>>  1 file changed, 6 insertions(+)
>>>
>>> --- a/fs/io_uring.c
>>> +++ b/fs/io_uring.c
>>> @@ -3716,6 +3716,12 @@ SYSCALL_DEFINE6(io_uring_enter, unsigned
>>>          wake_up(&ctx->sqo_wait);
>>>          submitted = to_submit;
>>>      } else if (to_submit) {
>>> +        if (current->mm != ctx->sqo_mm ||
>>> +            current_cred() != ctx->creds) {
>>> +            ret = -EPERM;
>>> +            goto out;
>>> +        }
>>> +
>>
>> I thought about this a bit more.
>>
>> I'm not sure if this is actually too restrictive,
>> because it means applications like Samba won't
>> be able to use io_uring at all.
>
> Yea, I think it is too restrictive. In fact, it broke my WIP branch to
> make postgres use io_uring.
>
> Postgres uses a forked process model, with all sub-processes forked off
> one parent process ("postmaster"), sharing MAP_ANONYMOUS|MAP_SHARED
> memory (buffer pool, locks, and lots of other IPC). My WIP branch so far
> has postmaster create a number of io_urings that the different
> processes can then use (with locking if necessary).
>
> In plenty of cases it's fairly important for performance not to require
> an additional context switch to initiate IO, so we cannot delegate
> submitting to an io_uring to a separate process. But it's not feasible
> to have one (or even two) urings for each process either: for one,
> that's just about guaranteed to bring us over the default
> RLIMIT_MEMLOCK limit, and requiring root-only config changes is not an
> option for many (nor user friendly).
>
> Not sharing queues makes it basically impossible to rely on io_uring
> ordering properties when operation interlock is needed. E.g. to
> guarantee that the journal is flushed before some data buffer can be
> written back, being able to make use of links and drains is great - but
> there's one journal for all processes. To be able to guarantee
> anything, all the interlocked writes need to go through one io_uring.
> I've not yet implemented this, so I don't have numbers, but I expect
> pretty significant savings.
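(For concreteness, a minimal, untested sketch of the link-based ordering
pattern mentioned above, using liburing: an fsync of the journal fd is
linked ahead of the data-buffer write, so the write only starts once the
flush has completed. The names wal_fd, data_fd, iov and off are
placeholders, and error handling is trimmed.)

#include <errno.h>
#include <sys/uio.h>
#include <liburing.h>

/* Flush the journal, then write back a data buffer, in one submission. */
static int write_back_buffer(struct io_uring *ring, int wal_fd,
                             int data_fd, const struct iovec *iov, off_t off)
{
    struct io_uring_sqe *sqe;

    /* 1) flush the journal; IOSQE_IO_LINK chains the next SQE to it */
    sqe = io_uring_get_sqe(ring);
    if (!sqe)
        return -EBUSY;
    io_uring_prep_fsync(sqe, wal_fd, 0);
    io_uring_sqe_set_flags(sqe, IOSQE_IO_LINK);

    /* 2) write back the buffer; it only runs if the fsync succeeded */
    sqe = io_uring_get_sqe(ring);
    if (!sqe)
        return -EBUSY;
    io_uring_prep_writev(sqe, data_fd, iov, 1, off);

    return io_uring_submit(ring);
}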
> Not being able to share urings also makes it harder to resolve
> deadlocks:
>
> As we call into both library and user-defined code, we cannot guarantee
> that a specific backend process will promptly (or at all, when waiting
> for some locks) process cqes. There are also sections where we don't
> want to constantly check for ready events, for performance reasons. But
> operations initiated by a process might be blocking other connections:
>
> E.g. process #1 might have initiated transferring a number of blocks
> into postgres' buffer pool via io_uring, and now is busy processing the
> first block that completed. But now process #2 might need one of the
> buffers that had IO queued, but didn't complete in time for #1 to see
> the results. The way I have it set up right now, #2 can simply process
> pending cqes in the relevant queue. Which, in this example, would mark
> the pg buffer pool entry as valid, allowing #2 to continue.
>
> Now, completions can still be read by all processes, so I could
> continue to do the above: but that'd require all potentially needed
> io_urings to be set up in postmaster, before the first fork, and all
> processes to keep all those FDs open (commonly several hundred). Not an
> attractive option either, imo.
>
> Obviously we could solve this by having an sqe result processing thread
> running within each process - but that'd be a very significant new
> overhead. And it'd require making some non-threadsafe code threadsafe,
> which I do not relish tackling as a side-effect of io_uring adoption.
>
> It also turns out to be nice, from a performance / context-switch rate
> angle, to be able to process cqes for submissions by other processes.
> It saves an expensive context switch, and often enough it really
> doesn't matter which process processes the completion (especially for
> readahead). And in other cases it's cheap to just schedule the
> subsequent work from the cqe processor, e.g. initiating readahead of a
> few more blocks into the pg buffer pool. Similarly, there are a few
> cases where it's useful for several processes to submit IO into a uring
> primarily drained by one specific process, to offload the subsequent
> action if that's expensive.
>
> Now, I think there's a valid argument to be made that postgres should
> just use threads, and not be hampered by any of this. But a) that's not
> going to happen all that soon, as it's a large change, and b) it's far
> from free of problems either, especially scalability on larger
> machines, and robustness.
>
>> As even if current_cred() and ctx->creds describe the same set of
>> uids/gids, the pointers won't ever match again, and the != check makes
>> the whole ring unusable.
>
> Indeed. It also seems weird that sqpoll now basically has different
> semantics, allowing the io_uring to be used by multiple processes - a
> task with a different mm can still wake the sqpoll thread up, even.
>
> Since the different processes attached can still write to the io_uring
> mmaped memory, they can still queue sqes, they just can't initiate the
> processing. But the next time the creator of the uring submits, they
> will still be processed - and thus it seems that the kernel needs to
> handle this safely. So I really don't get what this actually achieves?
> Am I missing something here?

Thanks for your detailed report, I'm going to revert this change for 5.5.

-- 
Jens Axboe
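(For reference on the fork-and-share pattern discussed above, a minimal
sketch using liburing, with a NOP request standing in for real I/O: after
fork() the child's mm and cred pointers no longer match the ring
creator's, so with the ownership check applied the child's submit fails
with -EPERM even though the uid/gid are identical. This is illustrative
only, not code from the patch or the thread.)

#include <stdio.h>
#include <unistd.h>
#include <sys/wait.h>
#include <liburing.h>

int main(void)
{
    struct io_uring ring;

    /* parent ("postmaster" in the example above) creates the ring */
    if (io_uring_queue_init(8, &ring, 0) < 0)
        return 1;

    if (fork() == 0) {
        /* child: same fd and shared ring mappings, different mm/creds */
        struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
        int ret;

        io_uring_prep_nop(sqe);
        ret = io_uring_submit(&ring);
        /* with the 5.4.33 check: -EPERM; with it reverted: 1 */
        printf("child submit: %d\n", ret);
        _exit(0);
    }

    wait(NULL);
    io_uring_queue_exit(&ring);
    return 0;
}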