From: Jens Axboe
To: io-uring@vger.kernel.org, linux-kernel@vger.kernel.org
Cc: peterz@infradead.org, andres@anarazel.de, tglx@linutronix.de
Subject: [PATCHSET v5] Add io_uring futex/futexv support
Date: Thu, 21 Sep 2023 12:29:00 -0600
Message-Id: <20230921182908.160080-1-axboe@kernel.dk>

Hi,

This patchset adds support for first futex wake and wait, and then
futexv.

For all of wait/wake/waitv, we support the bitset variant, as the
"normal" variants can be easily implemented on top of that.

PI and requeue are not supported through io_uring, just the above
mentioned parts. This may change in the future, but in the spirit of
keeping this small (and based on what people have been asking for),
this is what we currently have.

When I did these patches, I forgot that Pavel had previously posted a
futex variant for io_uring. The major thing that had been holding me
back when people asked about futexes and io_uring is that I wanted to
do this in what I consider the right way - no usage of io-wq or thread
offload, an actually async implementation that is efficient to use and
doesn't rely on a blocking thread for futex wait/waitv. This is what
this patchset attempts to do, while being minimally invasive on the
futex side. I believe the diffstat reflects that.

As far as I can recall, the first request for futex support with
io_uring came from Andres Freund, working on postgres. His aio rework
of postgres was one of the early adopters of io_uring, and futex
support was a natural extension for that. This is relevant from a
usability point of view, as well as for efficiency and performance. In
Andres's words, for the former:

"Futex wait support in io_uring makes it a lot easier to avoid
 deadlocks in concurrent programs that have their own buffer pool:
 Obviously pages in the application buffer pool have to be locked
 during IO.
 If the initiator of IO A needs to wait for a held lock B, the holder
 of lock B might wait for the IO A to complete. The ability to wait
 for a lock and IO completions at the same time provides an efficient
 way to avoid such deadlocks."

and in terms of efficiency, even without unlocking the full potential
yet, Andres says:

"Futex wake support in io_uring is useful because it allows for more
 efficient directed wakeups. For some "locks" postgres has queues
 implemented in userspace, with wakeup logic that cannot easily be
 implemented with FUTEX_WAKE_BITSET on a single "futex word" (imagine
 waiting for journal flushes to have completed up to a certain point).
 Thus a "lock release" sometimes needs to wake up many processes in a
 row. A quick-and-dirty conversion to doing these wakeups via io_uring
 led to a 3% throughput increase, with 12% fewer context switches,
 albeit in a fairly extreme workload."

Basic io_uring futex support and test cases covering all of the
variants are available in the liburing 'futex' branch:

https://git.kernel.dk/cgit/liburing/log/?h=futex

I originally wrote this code about a month ago and Andres has been
using it with postgres, and I'm not aware of any bugs in it. That's not
to say it's perfect, obviously, and I welcome feedback so we can move
this forward and hash out any potential issues.

In terms of testing, there's a functionality and beat-up test case in
liburing, and I've run all the ltp futex test cases as well to ensure
we didn't inadvertently break anything. It's also been in linux-next
for a long time and I haven't heard any complaints.

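As an illustration of why supporting only the bitset variants is
enough, here is a minimal userspace sketch of the "normal" wait/wake
layered on top of them. It uses the plain futex(2) syscall, not
io_uring, and it is not taken from the patches. The helper names
(futex_wait_bitset() and friends) are made up for this example; only
the FUTEX_* constants and the syscall itself are real interfaces. It
also assumes no timeout is passed, since FUTEX_WAIT_BITSET interprets a
timeout as absolute while FUTEX_WAIT takes a relative one.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <linux/futex.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Hypothetical helpers, named for this sketch only. Each returns the raw
 * syscall result: -1 with errno set on failure.
 */

/* Block while *uaddr == val, until woken by a waker whose mask overlaps. */
static long futex_wait_bitset(uint32_t *uaddr, uint32_t val, uint32_t mask)
{
	/* NULL timeout: wait indefinitely (EAGAIN at once if *uaddr != val). */
	return syscall(SYS_futex, uaddr, FUTEX_WAIT_BITSET | FUTEX_PRIVATE_FLAG,
		       val, NULL, NULL, mask);
}

/* Wake up to nr waiters whose wait mask overlaps with mask. */
static long futex_wake_bitset(uint32_t *uaddr, int nr, uint32_t mask)
{
	return syscall(SYS_futex, uaddr, FUTEX_WAKE_BITSET | FUTEX_PRIVATE_FLAG,
		       nr, NULL, NULL, mask);
}

/* "Normal" wait: a bitset wait that matches every waker. */
static long futex_wait(uint32_t *uaddr, uint32_t val)
{
	return futex_wait_bitset(uaddr, val, FUTEX_BITSET_MATCH_ANY);
}

/* "Normal" wake: a bitset wake that matches every waiter. */
static long futex_wake(uint32_t *uaddr, int nr)
{
	return futex_wake_bitset(uaddr, nr, FUTEX_BITSET_MATCH_ANY);
}
```

With these, futex_wait(&word, expected) fails with EAGAIN if the word
no longer holds the expected value, and futex_wake(&word, 1) returns
the number of waiters woken, which is how the non-bitset operations
behave when no timeout is involved.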
 include/linux/io_uring_types.h |   5 +
 include/uapi/linux/io_uring.h  |   4 +
 io_uring/Makefile              |   1 +
 io_uring/cancel.c              |   5 +
 io_uring/cancel.h              |   4 +
 io_uring/futex.c               | 376 +++++++++++++++++++++++++++++++++
 io_uring/futex.h               |  36 ++++
 io_uring/io_uring.c            |   7 +
 io_uring/opdef.c               |  34 +++
 kernel/futex/futex.h           |  20 ++
 kernel/futex/requeue.c         |   3 +-
 kernel/futex/syscalls.c        |  18 +-
 kernel/futex/waitwake.c        |  49 +++--
 13 files changed, 535 insertions(+), 27 deletions(-)

You can also find the code here:

https://git.kernel.dk/cgit/linux/log/?h=io_uring-futex

V5:
- Rebase on PeterZ's futex2 changes. Pulled in the tip locking/core
  branch.
- Shuffle order of some io_uring patchsets, which changed the value of
  the futex opcodes.

V4:
- Refactor the prep setup so it's fully independent between the
  vectored and non-vectored futex handling.
- Ensure we -EINVAL any futex/futexv wait/waitv that specifies unused
  fields.
- Fix a comment typo
- Update the patches from Peter.
- Fix two kerneldoc warnings
- Add a prep patch moving FUTEX2_MASK to futex.h

-- 
Jens Axboe