Received: by 2002:a05:6359:6284:b0:131:369:b2a3 with SMTP id se4csp5473318rwb; Wed, 9 Aug 2023 04:59:52 -0700 (PDT) X-Google-Smtp-Source: AGHT+IFUQH6NmBDhrQIGDBZ5hOVeu4QlEUTNQNLUr7krJYNevl1LRkDAtMiWg9XM65q310ByZIwX X-Received: by 2002:a9d:639a:0:b0:6bd:836:4fc7 with SMTP id w26-20020a9d639a000000b006bd08364fc7mr2723622otk.37.1691582392330; Wed, 09 Aug 2023 04:59:52 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1691582392; cv=none; d=google.com; s=arc-20160816; b=b74XP+tXiJ4wwlFprFuCMBzCypkQwrYxy9ey2cRg0TRQdfuz2YsXaxxnWF5bT0lMQZ dni0mwGT7Lx/jTk2WjDmrhwT0oUiiaaHwVR6oh/yBN87XejPdCkmtr4z3q2lKiiyzQZ7 693KSwtdc9hN7MsOf4faQnZnai3JaSJQxBN1hduetG8DY1EsukJNcQgdrrZ28974Q9Do Wo6VMKwn44sPKdqRZpx5tl9eUs2Hxc/UyGy5YvP9KCXpSNW7L9FH9XDPeJ3W2cEclnBb /g24EHXeNP2mHmHCNwLeQH11zTSA8ud3Ly/3OXC8aKuPvceQdvQ327CLos+SuvwagFWn RmEA== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:content-transfer-encoding:in-reply-to:from :references:to:content-language:subject:cc:user-agent:mime-version :date:message-id:dkim-signature; bh=iM0Ka4PoVC6jRXzqcRNEXzrsGqF57aS+zlRhU4+EYzM=; fh=hslK5lr9+KWk2zIPjXDN2a0BhHrxQpLpwN9b8mvFueU=; b=H/9+lV5fy/MFOXn4wPie5Lzkpnq2HTTCPAv4UoPMTh4un0YkjsGm9BytOXP83mIGxu Adhar3TVmLB8b6S2Kb5LkQdLW803MmHky41c+GNNF8nSv5aqMB91/F/V5pmR3htbhKn4 vFmrke5lc2dSK7uZLEl/243yr1qfOJ9U/CMMBtrFeOlnkMindVyAHLlyvDMFo4FPBKTR HXBX6xNHuNa2Fwbe+n8xO7+TD8ewNcFI0SbPC7ePvHbBQSYCVuebovNg1/3QIyxP5KUm czVk5Zkd54XgVj4q1FQmKAhGlwhYsE8lSv1mJjGXs6b7SrX7Xzg0/NkSuVVd5+4Fosx9 W9fA== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@kernel.org header.s=k20201202 header.b=CfPD9Dpg; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=kernel.org Return-Path: Received: from out1.vger.email (out1.vger.email. [2620:137:e000::1:20]) by mx.google.com with ESMTP id 75-20020a63004e000000b005652a295669si1709982pga.490.2023.08.09.04.59.40; Wed, 09 Aug 2023 04:59:52 -0700 (PDT) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20; Authentication-Results: mx.google.com; dkim=pass header.i=@kernel.org header.s=k20201202 header.b=CfPD9Dpg; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=kernel.org Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S233239AbjHILKA (ORCPT + 99 others); Wed, 9 Aug 2023 07:10:00 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:38756 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S233234AbjHILJ6 (ORCPT ); Wed, 9 Aug 2023 07:09:58 -0400 Received: from dfw.source.kernel.org (dfw.source.kernel.org [IPv6:2604:1380:4641:c500::1]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 10B361724 for ; Wed, 9 Aug 2023 04:09:58 -0700 (PDT) Received: from smtp.kernel.org (relay.kernel.org [52.25.139.140]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (2048 bits)) (No client certificate requested) by dfw.source.kernel.org (Postfix) with ESMTPS id 9C46E6314D for ; Wed, 9 Aug 2023 11:09:57 +0000 (UTC) Received: by smtp.kernel.org (Postfix) with ESMTPSA id DE6A7C433C9; Wed, 9 Aug 2023 11:09:52 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1691579397; bh=3Wv/fOquDhwM4PQubxLEW0sEpc4CURoTiDKWKH6VRuI=; h=Date:Cc:Subject:To:References:From:In-Reply-To:From; b=CfPD9DpgGD7pma4wFzIKTbCQRrEN4QKfDTqWQCL/GzgESJrADcF57CKkt+P0XcOwN N7P6pOwnYnR6Y+u1GkD5KjfC3+2yCGGO1yWabC1QRUguhhjCcXfsnzfKdnyP32n0wq vrH9FxnU0hdH4KNngPLZdArA4WJYOAFvRr1Xq6k9avD86cbutzy/ayFzMWtbkKlzMe YP95BSi4dx7LEkK73HBgEFIXU9bVnpg2Ur3XKM0LLHsaCQdPCpPA0aqo8dz9m1nGys eFXFz5r27/etHlxHRgVOKRu0VFMhZsHJOBv6RJGJSuo+YP9QiW6fKdMFNBS99PmE8E rgu6TkQis76+Q== Message-ID: <68f73855-f206-80a2-a546-3d40864ee176@kernel.org> Date: Wed, 9 Aug 2023 13:09:50 +0200 MIME-Version: 1.0 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:102.0) Gecko/20100101 Thunderbird/102.13.0 Cc: =?UTF-8?Q?Toke_H=c3=b8iland-J=c3=b8rgensen?= , "David S. Miller" , Eric Dumazet , Jakub Kicinski , Paolo Abeni , Alexei Starovoitov , Daniel Borkmann , Jesper Dangaard Brouer , John Fastabend , Maciej Fijalkowski , Jonathan Lemon , Pavel Begunkov , Yunsheng Lin , Kees Cook , Richard Gobert , "open list:NETWORKING DRIVERS" , open list , "open list:XDP (eXpress Data Path)" , Donald Hunter , Dave Tucker Subject: Re: [RFC v3 Optimizing veth xsk performance 0/9] Content-Language: en-US To: =?UTF-8?B?6buE5p2w?= , =?UTF-8?B?QmrDtnJuIFTDtnBlbA==?= , Magnus Karlsson , Maryam Tahhan References: <20230808031913.46965-1-huangjie.albert@bytedance.com> <87v8dpbv5r.fsf@toke.dk> <87msz04mb4.fsf@toke.dk> From: Jesper Dangaard Brouer In-Reply-To: <87msz04mb4.fsf@toke.dk> Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 8bit X-Spam-Status: No, score=-5.2 required=5.0 tests=BAYES_00,DKIMWL_WL_HIGH, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,NICE_REPLY_A, RCVD_IN_DNSWL_MED,SPF_HELO_NONE,SPF_PASS autolearn=ham autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 09/08/2023 11.06, Toke Høiland-Jørgensen wrote: > 黄杰 writes: > >> Toke Høiland-Jørgensen 于2023年8月8日周二 20:01写道: >>> >>> Albert Huang writes: >>> >>>> AF_XDP is a kernel bypass technology that can greatly improve performance. >>>> However,for virtual devices like veth,even with the use of AF_XDP sockets, >>>> there are still many additional software paths that consume CPU resources. >>>> This patch series focuses on optimizing the performance of AF_XDP sockets >>>> for veth virtual devices. Patches 1 to 4 mainly involve preparatory work. >>>> Patch 5 introduces tx queue and tx napi for packet transmission, while >>>> patch 8 primarily implements batch sending for IPv4 UDP packets, and patch 9 >>>> add support for AF_XDP tx need_wakup feature. These optimizations significantly >>>> reduce the software path and support checksum offload. >>>> >>>> I tested those feature with >>>> A typical topology is shown below: >>>> client(send): server:(recv) >>>> veth<-->veth-peer veth1-peer<--->veth1 >>>> 1 | | 7 >>>> |2 6| >>>> | | >>>> bridge<------->eth0(mlnx5)- switch -eth1(mlnx5)<--->bridge1 >>>> 3 4 5 >>>> (machine1) (machine2) >>> >>> I definitely applaud the effort to improve the performance of af_xdp >>> over veth, this is something we have flagged as in need of improvement >>> as well. >>> >>> However, looking through your patch series, I am less sure that the >>> approach you're taking here is the right one. >>> >>> AFAIU (speaking about the TX side here), the main difference between >>> AF_XDP ZC and the regular transmit mode is that in the regular TX mode >>> the stack will allocate an skb to hold the frame and push that down the >>> stack. Whereas in ZC mode, there's a driver NDO that gets called >>> directly, bypassing the skb allocation entirely. >>> >>> In this series, you're implementing the ZC mode for veth, but the driver >>> code ends up allocating an skb anyway. Which seems to be a bit of a >>> weird midpoint between the two modes, and adds a lot of complexity to >>> the driver that (at least conceptually) is mostly just a >>> reimplementation of what the stack does in non-ZC mode (allocate an skb >>> and push it through the stack). >>> >>> So my question is, why not optimise the non-zc path in the stack instead >>> of implementing the zc logic for veth? It seems to me that it would be >>> quite feasible to apply the same optimisations (bulking, and even GRO) >>> to that path and achieve the same benefits, without having to add all >>> this complexity to the veth driver? >>> >>> -Toke >>> >> thanks! >> This idea is really good indeed. You've reminded me, and that's >> something I overlooked. I will now consider implementing the solution >> you've proposed and test the performance enhancement. > > Sounds good, thanks! :) Good to hear, that you want to optimize the non-zc TX path of AF_XDP, as Toke suggests. There is a number of performance issues for AF_XDP non-zc TX that I've talked/complained to Magnus and Bjørn about over the years. I've recently started to work on fixing these myself, in collaboration with Maryam (cc). The most obvious is that non-zc TX uses socket memory accounting for the SKBs that gets allocated. (ZC TX obviously doesn't). IMHO this doesn't make sense as AF_XDP concept is to pre-allocate memory, thus AF_XDP memory limits are already bounded at setup time. Further more, __xsk_generic_xmit() already have a backpressure mechanism based on avail room in the CQ (Completion Queue) . Hint: the call sock_alloc_send_skb() includes/does socket mem accounting. When AF_XDP gets combined with veth (or other layered software devices), the problem gets worse, because: (1) the SKB that gets allocated by xsk_build_skb() doesn't have enough headroom to satisfy XDP requirement XDP_PACKET_HEADROOM. (2) the backing memory type from sock_alloc_send_skb() is not compatible with generic/veth XDP. Both these issues, result in that when peer veth device RX the (AF_XDP) TX packet, then it have to reallocate memory+SKB and copy data *again*. I'm currently[1] looking into how to fix this and have some PoC patches to estimate the performance benefit from avoiding the realloc when entering veth. With packet size 512, the numbers start at 828Kpps and after increase to 1002Kpps (and increase of 20% or 208 nanosec). [1] https://github.com/xdp-project/xdp-project/blob/veth-benchmark01/areas/core/veth_benchmark03.org -- Best regards, Jesper Dangaard Brouer MSc.CS, Sr. Principal Kernel Engineer at Red Hat LinkedIn: http://www.linkedin.com/in/brouer