Message-ID: <9db0edcf-75c0-d014-6120-514cc37a1a9f@gmail.com>
Date: Thu, 2 Dec 2021 16:25:19 +0000
Subject: Re: [RFC 00/12] io_uring zerocopy send
To: Willem de Bruijn
Cc: io-uring@vger.kernel.org, netdev@vger.kernel.org, linux-kernel@vger.kernel.org, Jakub Kicinski, Jonathan Lemon, "David S. Miller", Eric Dumazet, Hideaki YOSHIFUJI, David Ahern, Jens Axboe
References: <0d82f4e2-730f-4888-ec82-2354ffa9c2d8@gmail.com>
From: Pavel Begunkov
X-Mailing-List: linux-kernel@vger.kernel.org

On 12/2/21 00:36, Willem de Bruijn wrote:
>>>>> 1) we pass a bvec, so no page table walks.
>>>>> 2) zerocopy_sg_from_iter() is just slow, adding a bvec optimised version
>>>>>    still doing page get/put (see 4/12) slashed 4-5%.
>>>>> 3) avoiding get_page/put_page in 5/12
>>>>> 4) completion events are posted into io_uring's CQ, so no
>>>>>    extra recvmsg for getting events
>>>>> 5) no poll(2) in the code because of io_uring
>>>>> 6) lot of time is spent in sock_omalloc()/free allocating ubuf_info.
>>>>>    io_uring caches the structures reducing it to nearly zero-overhead.
>>>>
>>>> Nice set of complementary optimizations.
>>>>
>>>> We have looked at adding some of those as independent additions to
>>>> msg_zerocopy before, such as long-term pinned regions. One issue with
>>>> that is that the pages must remain until the request completes,
>>>> regardless of whether the calling process is alive. So it cannot rely
>>>> on a pinned range held by a process only.
>>>>
>>>> If feasible, it would be preferable if the optimizations can be added
>>>> to msg_zerocopy directly, rather than adding a dependency on io_uring
>>>> to make use of them. But not sure how feasible that is. For some, like
>>>> 4 and 5, the answer is clearly it isn't. 6, it probably is?
>>
>> Forgot about 6), io_uring uses the fact that submissions are
>> done under an per ring mutex, and completions are under a per
>> ring spinlock, so there are two lists for them and no extra
>> locking. Lists are spliced in a batched manner, so it's
>> 1 spinlock per N (e.g. 32) cached ubuf_info's allocations.
>>
>> Any similar guarantees for sockets?
>
> For datagrams it might matter, not sure if it would show up in a
> profile. The current notification mechanism is quite a bit more
> heavyweight than any form of fixed ubuf pool.

Just to give an idea what I'm seeing in profiles: while testing

3 | io_uring (@flush=false, nr_reqs=1) | 96534 | 2.03

I found that removing one extra smp_mb() per request in io_uring gave
around +0.65% of throughput (quick testing). In profiles, the function
it was removed from dropped from 0.93% to 0.09%.
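
To make the two-list scheme from 6) quoted above a bit more concrete,
here is a rough user-space sketch. It is not the io_uring code: struct
ubuf, struct ubuf_cache, cache_alloc(), cache_free_batch() and
demo_lock_count() are made-up names, a pthread mutex stands in for the
per-ring spinlock, and the submission side is simply assumed to be
serialised already (the per-ring mutex in the real thing).

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct ubuf_info; only the link matters here. */
struct ubuf {
	struct ubuf *next;
};

struct ubuf_cache {
	struct ubuf *submit_list;        /* submitter only, no lock needed */
	struct ubuf *complete_list;      /* protected by complete_lock */
	pthread_mutex_t complete_lock;   /* assumption: models the ring spinlock */
	unsigned long lock_acquisitions; /* instrumentation for this example */
};

static void cache_init(struct ubuf_cache *c)
{
	c->submit_list = NULL;
	c->complete_list = NULL;
	c->lock_acquisitions = 0;
	pthread_mutex_init(&c->complete_lock, NULL);
}

/* Submission side: the caller is assumed to be serialised already. */
static struct ubuf *cache_alloc(struct ubuf_cache *c)
{
	struct ubuf *u = c->submit_list;

	if (!u) {
		/* Refill: take the lock once and grab the whole freed list. */
		pthread_mutex_lock(&c->complete_lock);
		c->lock_acquisitions++;
		u = c->complete_list;
		c->complete_list = NULL;
		pthread_mutex_unlock(&c->complete_lock);
		if (!u)
			return calloc(1, sizeof(*u)); /* cache empty, real alloc */
	}
	c->submit_list = u->next;
	return u;
}

/* Completion side: splice a locally built chain head..tail in one go. */
static void cache_free_batch(struct ubuf_cache *c, struct ubuf *head,
			     struct ubuf *tail)
{
	assert(head && tail);
	pthread_mutex_lock(&c->complete_lock);
	c->lock_acquisitions++;
	tail->next = c->complete_list;
	c->complete_list = head;
	pthread_mutex_unlock(&c->complete_lock);
}

/* Free `batch` (>= 1) objects in one splice, allocate them back, and
 * report how many times the lock was taken in total. */
static unsigned long demo_lock_count(int batch)
{
	struct ubuf_cache c;
	struct ubuf *head = NULL, *tail = NULL;
	int i;

	cache_init(&c);
	for (i = 0; i < batch; i++) {
		struct ubuf *u = calloc(1, sizeof(*u));

		if (!u)
			abort();
		u->next = head;
		head = u;
		if (!tail)
			tail = u;
	}
	cache_free_batch(&c, head, tail);
	for (i = 0; i < batch; i++)
		free(cache_alloc(&c));
	pthread_mutex_destroy(&c.complete_lock);
	return c.lock_acquisitions;
}
```

In this toy version demo_lock_count(32) takes the lock twice, once for
the splice and once for the refill, instead of once per object.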

From what I see, alloc+free takes 6-10% for 64KB UDP, it may be great
to have something for MSG_ZEROCOPY, but if that adds additional
locking/atomics, honestly I'd prefer to keep it separate from
io_uring's caching.

I also hope we can optimise generic paths at some point, and the
faster it gets the more such additional locking will hurt, pretty much
how it was with the block layer.

> For TCP this matters less, as multiple sends are not needed and
> completions are coalesced, because in order.
>

-- 
Pavel Begunkov