From: Jens Axboe
Date: Sun, 26 Jan 2020 09:57:29 -0700
Subject: Re: [PATCH 5.4 033/222] io_uring: only allow submit from owning task
To: Andres Freund, Stefan Metzmacher
Cc: Greg Kroah-Hartman, linux-kernel@vger.kernel.org, stable@vger.kernel.org, io-uring
In-Reply-To: <20200126055457.5w4f5jyhkic7cixu@alap3.anarazel.de>
References: <20200122092833.339495161@linuxfoundation.org> <20200122092835.852416399@linuxfoundation.org> <1b4a79c1-6cda-12a8-219b-0c1c146faeff@samba.org> <20200126055457.5w4f5jyhkic7cixu@alap3.anarazel.de>

On 1/25/20 10:54 PM, Andres Freund wrote:
> Hi,
>
> On 2020-01-24 11:38:02 +0100, Stefan Metzmacher wrote:
>> Am 22.01.20 um 10:26 schrieb Greg Kroah-Hartman:
>>> From: Jens Axboe
>>>
>>> commit 44d282796f81eb1debc1d7cb53245b4cb3214cb5 upstream.
>>>
>>> If the credentials or the mm doesn't match, don't allow the task to
>>> submit anything on behalf of this ring. The task that owns the ring can
>>> pass the file descriptor to another task, but we don't want to allow
>>> that task to submit an SQE that then assumes the ring mm and creds if
>>> it needs to go async.
>>>
>>> Cc: stable@vger.kernel.org
>>> Suggested-by: Stefan Metzmacher
>>> Signed-off-by: Jens Axboe
>>> Signed-off-by: Greg Kroah-Hartman
>>>
>>> ---
>>>  fs/io_uring.c | 6 ++++++
>>>  1 file changed, 6 insertions(+)
>>>
>>> --- a/fs/io_uring.c
>>> +++ b/fs/io_uring.c
>>> @@ -3716,6 +3716,12 @@ SYSCALL_DEFINE6(io_uring_enter, unsigned
>>>          wake_up(&ctx->sqo_wait);
>>>          submitted = to_submit;
>>>      } else if (to_submit) {
>>> +        if (current->mm != ctx->sqo_mm ||
>>> +            current_cred() != ctx->creds) {
>>> +            ret = -EPERM;
>>> +            goto out;
>>> +        }
>>> +
>>
>> I thought about this a bit more.
>>
>> I'm not sure if this is actually too restrictive,
>> because it means applications like Samba won't
>> be able to use io_uring at all.
>
> Yea, I think it is too restrictive. In fact, it broke my WIP branch to
> make postgres use io_uring.
>
> Postgres uses a forked process model, with all sub-processes forked off
> one parent process ("postmaster"), sharing MAP_ANONYMOUS|MAP_SHARED
> memory (buffer pool, locks, and lots of other IPC). My WIP branch so far
> has postmaster create a number of io_urings that the different
> processes can then use (with locking if necessary).
>
> In plenty of cases it's fairly important for performance not to require
> an additional context switch to initiate IO, so we cannot delegate
> submitting to an io_uring to a separate process. But it's not feasible
> to have one (or even two) urings for each process either: for one,
> that's just about guaranteed to bring us over the default
> RLIMIT_MEMLOCK limit, and requiring root-only config changes is not an
> option for many (nor user friendly).
>
> Not sharing queues makes it basically impossible to rely on io_uring
> ordering properties when operation interlock is needed. E.g. to
> guarantee that the journal is flushed before some data buffer can be
> written back, being able to make use of links and drains is great - but
> there's one journal for all processes. To be able to guarantee
> anything, all the interlocked writes need to go through one io_uring.
> I've not yet implemented this, so I don't have numbers, but I expect
> pretty significant savings.
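(For concreteness, a minimal, untested sketch of the link-based ordering
pattern mentioned above, using liburing: an fsync of the journal fd is
linked ahead of the data-buffer write, so the write only starts once the
flush has completed. The names wal_fd, data_fd, iov and off are
placeholders, and error handling is trimmed.)

#include <errno.h>
#include <sys/uio.h>
#include <liburing.h>

/* Flush the journal, then write back a data buffer, in one submission. */
static int write_back_buffer(struct io_uring *ring, int wal_fd,
                             int data_fd, const struct iovec *iov, off_t off)
{
    struct io_uring_sqe *sqe;

    /* 1) flush the journal; IOSQE_IO_LINK chains the next SQE to it */
    sqe = io_uring_get_sqe(ring);
    if (!sqe)
        return -EBUSY;
    io_uring_prep_fsync(sqe, wal_fd, 0);
    io_uring_sqe_set_flags(sqe, IOSQE_IO_LINK);

    /* 2) write back the buffer; it only runs if the fsync succeeded */
    sqe = io_uring_get_sqe(ring);
    if (!sqe)
        return -EBUSY;
    io_uring_prep_writev(sqe, data_fd, iov, 1, off);

    return io_uring_submit(ring);
}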
> Not being able to share urings also makes it harder to resolve
> deadlocks:
>
> As we call into both library and user-defined code, we cannot guarantee
> that a specific backend process will promptly (or at all, when waiting
> for some locks) process cqes. There are also sections where we don't
> want to constantly check for ready events, for performance reasons. But
> operations initiated by a process might be blocking other connections:
>
> E.g. process #1 might have initiated transferring a number of blocks
> into postgres' buffer pool via io_uring, and now is busy processing the
> first block that completed. But now process #2 might need one of the
> buffers that had IO queued, but didn't complete in time for #1 to see
> the results. The way I have it set up right now, #2 can simply process
> pending cqes in the relevant queue. Which, in this example, would mark
> the pg buffer pool entry as valid, allowing #2 to continue.
>
> Now, completions can still be read by all processes, so I could
> continue to do the above: but that'd require all potentially needed
> io_urings to be set up in postmaster, before the first fork, and all
> processes to keep all those FDs open (commonly several hundred). Not an
> attractive option either, imo.
>
> Obviously we could solve this by having an sqe result processing thread
> running within each process - but that'd be a very significant new
> overhead. And it'd require making some non-threadsafe code threadsafe,
> which I do not relish tackling as a side-effect of io_uring adoption.
>
> It also turns out to be nice, from a performance / context-switch rate
> angle, to be able to process cqes for submissions by other processes.
> It saves an expensive context switch, and often enough it really
> doesn't matter which process processes the completion (especially for
> readahead). And in other cases it's cheap to just schedule the
> subsequent work from the cqe processor, e.g. initiating readahead of a
> few more blocks into the pg buffer pool. Similarly, there are a few
> cases where it's useful for several processes to submit IO into a uring
> primarily drained by one specific process, to offload the subsequent
> action if that's expensive.
>
> Now, I think there's a valid argument to be made that postgres should
> just use threads, and not be hampered by any of this. But a) that's not
> going to happen all that soon, as it's a large change, and b) it's far
> from free of problems either, especially scalability on larger
> machines, and robustness.
>
>> As even if current_cred() and ctx->creds describe the same set of
>> uids/gids, the pointers won't ever match again, and the != check makes
>> the whole ring unusable.
>
> Indeed. It also seems weird that sqpoll now basically has different
> semantics, allowing the io_uring to be used by multiple processes - a
> task with a different mm can still wake the sqpoll thread up, even.
>
> Since the different processes attached can still write to the io_uring
> mmaped memory, they can still queue sqes, they just can't initiate the
> processing. But the next time the creator of the uring submits, they
> will still be processed - and thus it seems that the kernel needs to
> handle this safely. So I really don't get what this actually achieves?
> Am I missing something here?

Thanks for your detailed report, I'm going to revert this change for 5.5.

-- 
Jens Axboe
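(For reference on the fork-and-share pattern discussed above, a minimal
sketch using liburing, with a NOP request standing in for real I/O: after
fork() the child's mm and cred pointers no longer match the ring
creator's, so with the ownership check applied the child's submit fails
with -EPERM even though the uid/gid are identical. This is illustrative
only, not code from the patch or the thread.)

#include <stdio.h>
#include <unistd.h>
#include <sys/wait.h>
#include <liburing.h>

int main(void)
{
    struct io_uring ring;

    /* parent ("postmaster" in the example above) creates the ring */
    if (io_uring_queue_init(8, &ring, 0) < 0)
        return 1;

    if (fork() == 0) {
        /* child: same fd and shared ring mappings, different mm/creds */
        struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
        int ret;

        io_uring_prep_nop(sqe);
        ret = io_uring_submit(&ring);
        /* with the 5.4.33 check: -EPERM; with it reverted: 1 */
        printf("child submit: %d\n", ret);
        _exit(0);
    }

    wait(NULL);
    io_uring_queue_exit(&ring);
    return 0;
}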