From: Mina Almasry
Date: Mon, 24 Jul 2023 21:04:23 -0700
Subject: Re: Memory providers multiplexing (Was: [PATCH net-next v4 4/5] page_pool: remove PP_FLAG_PAGE_FRAG flag)
To: Jesper Dangaard Brouer
Cc: Jason Gunthorpe, brouer@redhat.com, Christian König, Hari
    Ramakrishnan, David Ahern, Samiullah Khawaja, Willem de Bruijn,
    Jakub Kicinski, Christoph Hellwig, John Hubbard, Dan Williams,
    Alexander Duyck, Yunsheng Lin, davem@davemloft.net, pabeni@redhat.com,
    netdev@vger.kernel.org, linux-kernel@vger.kernel.org, Lorenzo Bianconi,
    Yisen Zhuang, Salil Mehta, Eric Dumazet, Sunil Goutham, Geetha sowjanya,
    Subbaraya Sundeep, hariprasad, Saeed Mahameed, Leon Romanovsky,
    Felix Fietkau, Ryder Lee, Shayne Chen, Sean Wang, Kalle Valo,
    Matthias Brugger, AngeloGioacchino Del Regno, Jesper Dangaard Brouer,
    Ilias Apalodimas, linux-rdma@vger.kernel.org,
    linux-wireless@vger.kernel.org, linux-arm-kernel@lists.infradead.org,
    linux-mediatek@lists.infradead.org, Jonathan Lemon, logang@deltatee.com,
    Bjorn Helgaas

On Mon, Jul 24, 2023 at 7:56 AM Jesper Dangaard Brouer wrote:
>
> On 17/07/2023 03.53, Mina Almasry wrote:
> > On Fri, Jul 14, 2023 at 8:55 AM Jason Gunthorpe wrote:
> >>
> >> On Fri, Jul 14, 2023 at 07:55:15AM -0700, Mina Almasry wrote:
> >>
> >>> Once the skb frags with struct new_abstraction are in the TCP stack,
> >>> they will need some special handling in code accessing the frags. But
> >>> my RFC already addressed that somewhat because the frags were
> >>> inaccessible in that case. In this case the frags will be both
> >>> inaccessible and will not be struct pages at all (things like
> >>> get_page() will not work), so more special handling will be required,
> >>> maybe.
> >>
> >> It seems sort of reasonable, though there will be interesting concerns
> >> about coherence and synchronization with general purpose DMABUFs that
> >> will need tackling.
> >>
> >> Still, it is such a lot of churn and weirdness on the netdev side, I
> >> think you'd do well to present an actual full application as
> >> justification.
> >>
> >> Yes, you showed you can stick unordered TCP data frags into GPU memory
> >> sort of quickly, but have you gone further with this to actually show
> >> it is useful for a real world GPU centric application?
> >>
> >> BTW your cover letter said 96% utilization, the usual server
> >> configuration is one NIC per GPU, so you were able to hit 1500 Gb/sec
> >> of TCP BW with this?
> >>
> >
> > I do notice that the number of NICs is missing from our public
> > documentation so far, so I will refrain from specifying how many NICs
> > are on those A3 VMs until the information is public. But I think I can
> > confirm that your general thinking is correct: the perf that we're
> > getting is 96.6% of the line rate of each GPU/NIC pair,
>
> What do you mean by 96.6% "line rate"?
> Is this the Ethernet line-rate?
>

Yes, I believe this is the Ethernet line-rate, i.e. the 200 Gbit/sec that
my NICs run at.

> Is the measured throughput the measured TCP data "goodput"?

Yes, it is goodput. Roughly, I believe we add up the return values of
recvmsg() and divide that number by the elapsed time.

> Assuming
> - MTU 1500 bytes (1514 on wire).
> - Ethernet header 14 bytes
> - IP header 20 bytes
> - TCP header 20 bytes
>
> Due to header overhead the goodput will be approx 96.4%.
> - (1514-(14+20+20))/1514 = 0.9643
> - (Not taking the Ethernet interframe gap into account.)
>
> Thus, maybe you have hit Ethernet wire line-rate already?

My MTU is 8244 actually, which gives me an 8192-byte MSS/payload for my
connections. By my math the theoretical max would be 1 - 52/8244 = ~99.3%,
so it looks like I'm dropping ~3% of line rate somewhere in the
implementation.
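For concreteness, here is the overhead math above as a tiny Python sketch.
This is just back-of-envelope arithmetic, not anything from the patch
series; the 14/20/20-byte header sizes are the usual Ethernet/IP/TCP
values, and the 200 Gbit/sec figure is the per-NIC line rate mentioned
above:

    # Rough TCP goodput estimates, ignoring the Ethernet interframe gap,
    # preamble, FCS, and TCP options.
    def goodput_fraction(mtu, eth_hdr=14, ip_hdr=20, tcp_hdr=20):
        wire_bytes = mtu + eth_hdr            # bytes on the wire per frame
        payload    = mtu - ip_hdr - tcp_hdr   # TCP payload per frame
        return payload / wire_bytes

    print(goodput_fraction(1500))     # ~0.964, the 1500-byte MTU case above
    print(goodput_fraction(8244))     # ~0.993, my 8244-byte MTU case

    line_rate_gbit = 200              # per NIC/GPU pair
    print(0.966 * line_rate_gbit)     # ~193 Gbit/s measured goodput per pair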
>
> > and scales linearly
> > for each NIC/GPU pair we've tested with so far. Line rate of each
> > NIC/GPU pair is 200 Gb/sec.
> >
> > So if we have 8 NIC/GPU pairs we'd be hitting 96.6% * 200 * 8 = 1545 GB/sec.
>
> Let's keep our units straight.
> Here you mean 1545 Gbit/sec, which is 193 GBytes/s.
>

Yes! Sorry! I definitely meant 1545 Gbit/sec.

> > If we have, say, 2 NIC/GPU pairs, we'd be hitting 96.6% * 200 * 2 = 384 GB/sec.
>
> Here you mean 384 Gbit/sec, which is 48 GBytes/sec.
>

Correct again!

> > ...
> > etc.
> >
>
> These massive throughput numbers are important, because they *exceed*
> the physical host RAM/DIMM memory speeds.
>
> This is the *real argument* for why software cannot afford to do even a
> single copy of the data from host RAM into GPU memory: the CPU's memory
> throughput to DRAM/DIMMs is insufficient.
>
> My testlab CPU E5-1650 has 4 DDR4 DIMM slots:
> - Data Width: 64 bits (= 8 bytes)
> - Configured Memory Speed: 2400 MT/s
> - Theoretical maximum memory bandwidth: 76.8 GBytes/s (2400*8*4)
>
> Even the theoretical max of 76.8 GBytes/s (614 Gbit/s) is not enough for
> the 193 GBytes/s or 1545 Gbit/s (8 NIC/GPU pairs).
>
> When testing this with the lmbench tool bw_mem, the results (below my
> signature) are in the area of 14.8 GBytes/sec (118 Gbit/s) as soon as the
> working set exceeds the L3 cache size. In practice it looks like main
> memory is limited to reading 118 Gbit/s *once*. (Mina's NICs run at
> 200 Gbit/s.)
>
> Given DDIO can deliver network packets into L3, I also tried to figure
> out the L3 read bandwidth, which I measured to be 42.4 GBytes/sec
> (339 Gbit/s), in hopes that it would be enough, but it was not.
>

Yes, avoiding any memory speed bottleneck as you note is important, but
the second point mentioned in my cover letter is also impactful:

"Alleviate PCIe BW pressure, by limiting data transfer to the lowest
level of the PCIe tree, compared to traditional path which sends data
through the root complex."

Depending on the hardware, this is a bottleneck that we avoid with device
memory TCP. NIC/GPU copies occupy PCIe link bandwidth. In a hierarchy like
this:

        root complex
             |
             | (uplink)
             |
        PCIe switch
          /      \
        NIC      GPU

I believe the uplink from the PCIe switch to the root complex is traversed
twice for RX and twice for TX if the data needs to go through host memory:

RX: NIC -> root complex -> GPU
TX: GPU -> root complex -> NIC

With device memory TCP, and with PCI P2P communication enabled between the
devices under the same PCIe switch, the payload flows directly between the
NIC and the GPU through the PCIe switch and never goes to the root
complex, alleviating the pressure/bottleneck on the link between the PCIe
switch and the root complex. I believe this is a core reason we're able to
scale throughput linearly with NIC/GPU pairs: we don't stress shared
uplink connections, and all the payload data transfer happens beneath the
PCIe switch.
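To put the memory-bandwidth comparison above in one place, here is the
same arithmetic as a small Python sketch. The DIMM parameters and the
lmbench figure are taken from your testlab numbers below, the 200 Gbit/sec
per-pair rate is ours, and the script itself is purely illustrative:

    # Rough comparison of host DRAM bandwidth vs. aggregate NIC goodput.
    ddr4_mt_s  = 2400     # configured memory speed, MT/s
    bytes_wide = 8        # 64-bit data width
    channels   = 4        # populated DIMM slots

    dram_theoretical_gbyte_s = ddr4_mt_s * bytes_wide * channels / 1000  # 76.8 GBytes/s
    dram_measured_gbyte_s    = 14.8   # lmbench bw_mem rd, above L3 cache size

    nic_pairs       = 8
    goodput_gbit_s  = 0.966 * 200 * nic_pairs   # ~1545 Gbit/s aggregate
    goodput_gbyte_s = goodput_gbit_s / 8        # ~193 GBytes/s

    print(dram_theoretical_gbyte_s, dram_measured_gbyte_s, goodput_gbyte_s)
    # Both 76.8 GBytes/s (theoretical) and ~14.8 GBytes/s (measured) are far
    # below the ~193 GBytes/s that 8 NIC/GPU pairs deliver, so bouncing the
    # payload through host RAM is not affordable.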
> --Jesper
> (data below signature)
>
> CPU under test:
>
> $ cat /proc/cpuinfo | egrep -e 'model name|cache size' | head -2
> model name      : Intel(R) Xeon(R) CPU E5-1650 v4 @ 3.60GHz
> cache size      : 15360 KB
>
> Providing some cmdline outputs from lmbench "bw_mem" tool.
> (Output format is "%0.2f %.2f\n", megabytes, megabytes_per_second)
>
> $ taskset -c 2 /usr/lib/lmbench/bin/x86_64-linux-gnu/bw_mem 256M rd
> 256.00 14924.50
>
> $ taskset -c 2 /usr/lib/lmbench/bin/x86_64-linux-gnu/bw_mem 256M wr
> 256.00 9895.25
>
> $ taskset -c 2 /usr/lib/lmbench/bin/x86_64-linux-gnu/bw_mem 256M rdwr
> 256.00 9737.54
>
> $ taskset -c 2 /usr/lib/lmbench/bin/x86_64-linux-gnu/bw_mem 256M bcopy
> 256.00 12462.88
>
> $ taskset -c 2 /usr/lib/lmbench/bin/x86_64-linux-gnu/bw_mem 256M bzero
> 256.00 14869.89
>
> Next output shows reducing size below L3 cache size, which shows an
> increase in speed, likely the L3 bandwidth.
>
> $ taskset -c 2 /usr/lib/lmbench/bin/x86_64-linux-gnu/bw_mem 64M rd
> 64.00 14840.58
>
> $ taskset -c 2 /usr/lib/lmbench/bin/x86_64-linux-gnu/bw_mem 32M rd
> 32.00 14823.97
>
> $ taskset -c 2 /usr/lib/lmbench/bin/x86_64-linux-gnu/bw_mem 16M rd
> 16.00 24743.86
>
> $ taskset -c 2 /usr/lib/lmbench/bin/x86_64-linux-gnu/bw_mem 8M rd
> 8.00 40852.26
>
> $ taskset -c 2 /usr/lib/lmbench/bin/x86_64-linux-gnu/bw_mem 4M rd
> 4.00 42545.65
>
> $ taskset -c 2 /usr/lib/lmbench/bin/x86_64-linux-gnu/bw_mem 2M rd
> 2.00 42447.82
>
> $ taskset -c 2 /usr/lib/lmbench/bin/x86_64-linux-gnu/bw_mem 1M rd
> 1.00 42447.82

--
Thanks,
Mina