Date: Sat, 23 Jan 2021 20:16:36 +0200
From: Lennert Buytenhek
To: Jens Axboe
Cc: linux-kernel@vger.kernel.org, io-uring@vger.kernel.org,
	linux-fsdevel@vger.kernel.org, linux-btrfs@vger.kernel.org
Subject: Re: [RFC PATCH] io_uring: add support for IORING_OP_GETDENTS64
Message-ID: <20210123181636.GA121842@wantstofly.org>
References: <20210123114152.GA120281@wantstofly.org>

On Sat, Jan 23, 2021 at 10:37:25AM -0700, Jens Axboe wrote:
> > IORING_OP_GETDENTS64 behaves like getdents64(2) and takes the same
> > arguments.
> >
> > Signed-off-by: Lennert Buytenhek
> > ---
> > This seems to work OK, but I'd appreciate a review from someone more
> > familiar with io_uring internals than I am, as I'm not entirely sure
> > I did everything quite right.
> >
> > A dumb test program for IORING_OP_GETDENTS64 is available here:
> >
> >   https://krautbox.wantstofly.org/~buytenh/uringfind.c
> >
> > This does more or less what find(1) does: it scans recursively through
> > a directory tree and prints the names of all directories and files it
> > encounters along the way -- but using io_uring.  (The uring version
> > prints the names of encountered files and directories in an order that's
> > determined by SQE completion order, which is somewhat nondeterministic
> > and likely to differ between runs.)
> >
> > On a directory tree with 14-odd million files in it that's on a
> > six-drive (spinning disk) btrfs raid, find(1) takes:
> >
> > # echo 3 > /proc/sys/vm/drop_caches
> > # time find /mnt/repo > /dev/null
> >
> > real    24m7.815s
> > user    0m15.015s
> > sys     0m48.340s
> > #
> >
> > And the io_uring version takes:
> >
> > # echo 3 > /proc/sys/vm/drop_caches
> > # time ./uringfind /mnt/repo > /dev/null
> >
> > real    10m29.064s
> > user    0m4.347s
> > sys     0m1.677s
> > #
> >
> > These timings are repeatable and consistent to within a few seconds.
> >
> > (btrfs seems to be sending most metadata reads to the same drive in the
> > array during this test, even though this filesystem is using the raid1c4
> > profile for metadata, so I suspect that more drive-level parallelism can
> > be extracted with some btrfs tweaks.)
> >
> > The fully cached case also shows some speedup for the io_uring version:
> >
> > # time find /mnt/repo > /dev/null
> >
> > real    0m5.223s
> > user    0m1.926s
> > sys     0m3.268s
> > #
> >
> > vs:
> >
> > # time ./uringfind /mnt/repo > /dev/null
> >
> > real    0m3.604s
> > user    0m2.417s
> > sys     0m0.793s
> > #
> >
> > That said, the point of this patch isn't primarily to enable
> > lightning-fast find(1) or du(1), but more to complete the set of
> > filesystem I/O primitives available via io_uring, so that applications
> > can do all of their filesystem I/O using the same mechanism, without
> > having to manually punt some of their work out to worker threads -- and
> > indeed, an object storage backend server that I wrote a while ago can
> > run with a pure io_uring based event loop with this patch.
>
> The results look nice for sure.

Thanks!  And thank you for having a look.

> One concern is that io_uring generally
> guarantees that any state passed in is stable once submit is done.  For
> the below implementation, that doesn't hold as the linux_dirent64 isn't
> used until later in the process.  That means if you do:
>
> submit_getdents64(ring)
> {
>         struct linux_dirent64 dent;
>         struct io_uring_sqe *sqe;
>
>         sqe = io_uring_get_sqe(ring);
>         io_uring_prep_getdents64(sqe, ..., &dent);
>         io_uring_submit(ring);
> }
>
> other_func(ring)
> {
>         struct io_uring_cqe *cqe;
>
>         submit_getdents64(ring);
>         io_uring_wait_cqe(ring, &cqe);
>
> }
>
> then the kernel side might get garbage by the time the sqe is actually
> submitted.  This is true because you don't use it inline, only from the
> out-of-line async context.  Usually this is solved by having the prep
> side copy in the necessary state, eg see io_openat2_prep() for how we
> make filename and open_how stable by copying them into kernel memory.
> That ensures that if/when these operations need to go async and finish
> out-of-line, the contents are stable and there's no requirement for the
> application to keep them valid once submission is done.
>
> Not sure how best to solve that, since the vfs side relies heavily on
> linux_dirent64 being a user pointer...
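
For context, the copy-at-prep pattern referred to above looks roughly
like this -- a simplified sketch modelled on io_openat2_prep(), with the
error and flag checks left out, so not the verbatim kernel code:

/*
 * The prep handler pulls the user pointers out of the SQE and copies
 * the data they point at into per-request kernel memory, so the
 * application does not have to keep its own copies alive once
 * submission returns.
 */
static int io_openat2_prep(struct io_kiocb *req,
                           const struct io_uring_sqe *sqe)
{
        const char __user *fname = u64_to_user_ptr(READ_ONCE(sqe->addr));
        struct open_how __user *how = u64_to_user_ptr(READ_ONCE(sqe->addr2));
        size_t len = READ_ONCE(sqe->len);
        int ret;

        /* Copy the userspace struct open_how into the request. */
        ret = copy_struct_from_user(&req->open.how, sizeof(req->open.how),
                                    how, len);
        if (ret)
                return ret;

        /* getname() copies the path string into kernel memory. */
        req->open.filename = getname(fname);
        if (IS_ERR(req->open.filename)) {
                ret = PTR_ERR(req->open.filename);
                req->open.filename = NULL;
                return ret;
        }

        return 0;
}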
No data is passed into the kernel via user memory on a getdents64(2)
call, i.e. getdents64(2) only ever writes into the supplied
linux_dirent64 user pointer; it never reads from it.

The only things that we need to keep stable here are the linux_dirent64
pointer itself and the 'count' argument, and those are both passed in
via the SQE and READ_ONCE()d from it in the prep function (a rough
sketch is appended below for illustration).

I think that's probably the source of confusion here?


Cheers,
Lennert
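
The sketch below is illustrative only -- the io_getdents struct and its
field names are invented here and are not taken from the actual patch:

/* Hypothetical per-request state; names invented for this sketch. */
struct io_getdents {
        struct file                     *file;
        struct linux_dirent64 __user    *dirent;
        unsigned int                    count;
};

static int io_getdents64_prep(struct io_kiocb *req,
                              const struct io_uring_sqe *sqe)
{
        struct io_getdents *gd = &req->getdents;

        /*
         * Everything the request needs later is captured at submission
         * time: the buffer pointer and count are READ_ONCE()d straight
         * out of the SQE.  The buffer itself is only ever written to by
         * the op, never read, so no user memory has to stay stable once
         * submission returns.
         */
        gd->dirent = u64_to_user_ptr(READ_ONCE(sqe->addr));
        gd->count = READ_ONCE(sqe->len);

        return 0;
}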