Subject: Re: [PATCH RFC v8 02/11] vhost: use batched get_vq_desc version
To: "Michael S. Tsirkin", Eugenio Perez Martin
Cc: Konrad Rzeszutek Wilk, linux-kernel@vger.kernel.org, kvm list,
 virtualization@lists.linux-foundation.org, netdev@vger.kernel.org
References: <20200622114622-mutt-send-email-mst@kernel.org>
 <20200622122546-mutt-send-email-mst@kernel.org>
 <419cc689-adae-7ba4-fe22-577b3986688c@redhat.com>
 <0a83aa03-8e3c-1271-82f5-4c07931edea3@redhat.com>
 <20200709133438-mutt-send-email-mst@kernel.org>
From: Jason Wang
Message-ID: <7dec8cc2-152c-83f4-aa45-8ef9c6aca56d@redhat.com>
Date: Fri, 10 Jul 2020 11:56:10 +0800
In-Reply-To: <20200709133438-mutt-send-email-mst@kernel.org>
X-Mailing-List: linux-kernel@vger.kernel.org

On 2020/7/10 1:37 AM, Michael S. Tsirkin wrote:
> On Thu, Jul 09, 2020 at 06:46:13PM +0200, Eugenio Perez Martin wrote:
>> On Wed, Jul 1, 2020 at 4:10 PM Jason Wang wrote:
>>>
>>> On 2020/7/1 9:04 PM, Eugenio Perez Martin wrote:
>>>> On Wed, Jul 1, 2020 at 2:40 PM Jason Wang wrote:
>>>>> On 2020/7/1 6:43 PM, Eugenio Perez Martin wrote:
>>>>>> On Tue, Jun 23, 2020 at 6:15 PM Eugenio Perez Martin
>>>>>> wrote:
>>>>>>> On Mon, Jun 22, 2020 at 6:29 PM Michael S. Tsirkin wrote:
>>>>>>>> On Mon, Jun 22, 2020 at 06:11:21PM +0200, Eugenio Perez Martin wrote:
>>>>>>>>> On Mon, Jun 22, 2020 at 5:55 PM Michael S. Tsirkin wrote:
>>>>>>>>>> On Fri, Jun 19, 2020 at 08:07:57PM +0200, Eugenio Perez Martin wrote:
>>>>>>>>>>> On Mon, Jun 15, 2020 at 2:28 PM Eugenio Perez Martin
>>>>>>>>>>> wrote:
>>>>>>>>>>>> On Thu, Jun 11, 2020 at 5:22 PM Konrad Rzeszutek Wilk
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>> On Thu, Jun 11, 2020 at 07:34:19AM -0400, Michael S. Tsirkin wrote:
>>>>>>>>>>>>>> As testing shows no performance change, switch to that now.
>>>>>>>>>>>>> What kind of testing? 100GiB? Low latency?
>>>>>>>>>>>>>
>>>>>>>>>>>> Hi Konrad.
>>>>>>>>>>>>
>>>>>>>>>>>> I tested this version of the patch:
>>>>>>>>>>>> https://lkml.org/lkml/2019/10/13/42
>>>>>>>>>>>>
>>>>>>>>>>>> It was tested for throughput with DPDK's testpmd (as described in
>>>>>>>>>>>> http://doc.dpdk.org/guides/howto/virtio_user_as_exceptional_path.html)
>>>>>>>>>>>> and kernel pktgen. No latency tests were performed by me. Maybe it is
>>>>>>>>>>>> interesting to perform a latency test or just a different set of tests
>>>>>>>>>>>> over a recent version.
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks!
>>>>>>>>>>> I have repeated the tests with v9, and results are a little bit different:
>>>>>>>>>>> * If I test opening it with testpmd, I see no change between versions
>>>>>>>>>> OK that is testpmd on guest, right? And vhost-net on the host?
>>>>>>>>>>
>>>>>>>>> Hi Michael.
>>>>>>>>>
>>>>>>>>> No, sorry, as described in
>>>>>>>>> http://doc.dpdk.org/guides/howto/virtio_user_as_exceptional_path.html.
>>>>>>>>> But I could add to test it in the guest too.
>>>>>>>>>
>>>>>>>>> These kinds of raw packets "bursts" do not show performance
>>>>>>>>> differences, but I could test deeper if you think it would be worth
>>>>>>>>> it.
>>>>>>>> Oh ok, so this is without guest, with virtio-user.
>>>>>>>> It might be worth checking dpdk within guest too just
>>>>>>>> as another data point.
>>>>>>>>
>>>>>>> Ok, I will do it!
>>>>>>>
>>>>>>>>>>> * If I forward packets between two vhost-net interfaces in the guest
>>>>>>>>>>> using a linux bridge in the host:
>>>>>>>>>> And here I guess you mean virtio-net in the guest kernel?
>>>>>>>>> Yes, sorry: Two virtio-net interfaces connected with a linux bridge in
>>>>>>>>> the host. More precisely:
>>>>>>>>> * Adding one of the interfaces to another namespace, assigning it an
>>>>>>>>> IP, and starting netserver there.
>>>>>>>>> * Assign another IP in the range manually to the other virtual net
>>>>>>>>> interface, and start the desired test there.
>>>>>>>>>
>>>>>>>>> If you think it would be better to perform them differently please let me know.
>>>>>>>> Not sure why you bother with namespaces since you said you are
>>>>>>>> using L2 bridging. I guess it's unimportant.
>>>>>>>>
>>>>>>> Sorry, I think I should have provided more context about that.
>>>>>>>
>>>>>>> The only reason to use namespaces is to force the traffic of these
>>>>>>> netperf tests to go through the external bridge. To test netperf
>>>>>>> different possibilities than the testpmd (or pktgen or others "blast
>>>>>>> of frames unconditionally" tests).
>>>>>>>
>>>>>>> This way, I make sure that is the same version of everything in the
>>>>>>> guest, and is a little bit easier to manage cpu affinity, start and
>>>>>>> stop testing...
>>>>>>>
>>>>>>> I could use a different VM for sending and receiving, but I find this
>>>>>>> way a faster one and it should not introduce a lot of noise. I can
>>>>>>> test with two VM if you think that this use of network namespace
>>>>>>> introduces too much noise.
>>>>>>>
>>>>>>> Thanks!
>>>>>>>
>>>>>>>>>>> - netperf UDP_STREAM shows a performance increase of 1.8, almost
>>>>>>>>>>> doubling performance. This gets lower as frame size increases.
>>>>>> Regarding UDP_STREAM:
>>>>>> * with event_idx=on: The performance difference is reduced a lot if
>>>>>> applied affinity properly (manually assigning CPU on host/guest and
>>>>>> setting IRQs on guest), making them perform equally with and without
>>>>>> the patch again. Maybe the batching makes the scheduler perform
>>>>>> better.
>>>>> Note that for UDP_STREAM, the result is pretty tricky to analyze. E.g.
>>>>> setting a sndbuf for TAP may help for the performance (reduce the drop).
>>>>>
>>>> Ok, will add that to the test. Thanks!
>>>
>>> Actually, it's better to skip the UDP_STREAM test since:
>>>
>>> - My understanding is that very few applications use a raw UDP stream
>>> - It's hard to analyze (usually you need to count the drop ratio etc)
>>>
>>>
>>>>>>>>>>> - the rest of the tests go noticeably worse: UDP_RR goes from ~6347
>>>>>>>>>>> transactions/sec to 5830
>>>>>> * Regarding UDP_RR, TCP_STREAM, and TCP_RR, proper CPU pinning makes
>>>>>> them perform similarly again, only a very small performance drop
>>>>>> observed. It could be just noise.
>>>>>> ** All of them perform better than vanilla if event_idx=off, not sure
>>>>>> why. I can try to repeat them if you suspect that can be a test
>>>>>> failure.
>>>>>>
>>>>>> * With testpmd and event_idx=off, if I send from the VM to host, I see
>>>>>> a performance increment especially in small packets. The buf api also
>>>>>> increases performance compared with only batching: Sending the minimum
>>>>>> packet size in testpmd makes pps go from 356kpps to 473 kpps.
>>>>> What's your setup for this? The number looks rather low. I'd expected
>>>>> 1-2 Mpps at least.
>>>>>
>>>> Intel(R) Xeon(R) CPU E5-2650 v4 @ 2.20GHz, 2 NUMA nodes of 16G memory
>>>> each, and no device assigned to the NUMA node I'm testing in. Too low
>>>> for testpmd AF_PACKET driver too?
>>>
>>> I don't test AF_PACKET, I guess it should use the V3, which is an
>>> mmap-based zerocopy interface.
>>>
>>> And it might be worth checking the cpu utilization of the vhost thread.
>>> It needs to be stressed to 100%, otherwise there could be a bottleneck
>>> somewhere.
>>>
>>>
>>>>>> Sending
>>>>>> 1024 length UDP-PDU makes it go from 570kpps to 64 kpps.
>>>>>>
>>>>>> Something strange I observe in these tests: I get more pps the bigger
>>>>>> the transmitted buffer size is. Not sure why.
>>>>>>
>>>>>> ** Sending from the host to the VM does not make a big change with the
>>>>>> patches in small packets scenario (minimum, 64 bytes, about 645
>>>>>> without the patch, ~625 with batch and batch+buf api). If the packets
>>>>>> are bigger, I can see a performance increase: with 256 bits,
>>>>> I think you meant bytes?
>>>>>
>>>> Yes, sorry.
>>>>
>>>>>> it goes
>>>>>> from 590kpps to about 600kpps, and in case of 1500 bytes payload it
>>>>>> gets from 348kpps to 528kpps, so it is clearly an improvement.
>>>>>>
>>>>>> * with testpmd and event_idx=on, batching+buf api perform similarly in
>>>>>> both directions.
>>>>>>
>>>>>> All of testpmd tests were performed with no linux bridge, just a
>>>>>> host's tap interface ( in xml),
>>>>> What DPDK driver did you use in the test (AF_PACKET?).
>>>>>
>>>> Yes, both testpmd are using AF_PACKET driver.
>>>
>>> I see, using AF_PACKET means extra layers of issues need to be analyzed
>>> which is probably not good.
>>>
>>>
>>>>>> with a
>>>>>> testpmd txonly and another in rxonly forward mode, and using the
>>>>>> receiving side packets/bytes data. Guest's rps, xps and interrupts,
>>>>>> and host's vhost threads affinity were also tuned in each test to
>>>>>> schedule both testpmd and vhost in different processors.
>>>>> My feeling is that if we start from a simple setup, it would be
>>>>> easier as a start. E.g. start without a VM.
>>>>>
>>>>> 1) TX: testpmd(txonly) -> virtio-user -> vhost_net -> XDP_DROP on TAP
>>>>> 2) RX: pktgen -> TAP -> vhost_net -> testpmd(rxonly)
>>>>>
>>>> Got it. Is there a reason to prefer pktgen over testpmd?
>>>
>>> I think the reason is that with testpmd you must use a userspace kernel
>>> interface (AF_PACKET), and it cannot be as fast as pktgen, since pktgen:
>>>
>>> - talks directly to the xmit of TAP
>>> - can clone the skb
>>>
>> Hi!
>>
>> Here are the results of the tests. Details on [1].
>>
>> Tx:
>> ===
>>
>> For tx packets it seems that the batching patch makes things a little
>> bit worse, but the buf_api outperforms baseline by a 7%:
>>
>> * We start with a baseline of 4208772.571 pps and 269361444.6 bytes/s [2].
>> * When we add the batching, I see a small performance decrease:
>> 4133292.308 pps and 264530707.7 bytes/s.
>> * However, the buf api outperforms the baseline: 4551319.631pps,
>> 291205178.1 bytes/s
>>
>> I don't have numbers on the receiver side since it is just a XDP_DROP.
>> I think it would be interesting to see them.
>>
>> Rx:
>> ===
>>
>> Regarding Rx, the reverse is observed: a small performance increase is
>> observed with batching (~2%), but buf_api makes tests perform equally
>> to baseline.
>>
>> pktgen was called using pktgen_sample01_simple.sh, with the environment:
>> DEV="$tap_name" F_THREAD=1 DST_MAC=$MAC_ADDR COUNT=$((2500000*25))
>> SKB_CLONE=$((2**31))
>>
>> And testpmd is the same as Tx but with forward-mode=rxonly.
>>
>> Pktgen reports:
>> Baseline: 1853025pps 622Mb/sec (622616400bps) errors: 7915231
>> Batch: 1891404pps 635Mb/sec (635511744bps) errors: 4926093
>> Buf_api: 1844008pps 619Mb/sec (619586688bps) errors: 47766692
>>
>> Testpmd reports:
>> Baseline: 1854448pps, 860464156 bps. [3]
>> Batch: 1892844.25pps, 878280070bps.
>> Buf_api: 1846139.75pps, 856609120bps.
>>
>> Any thoughts?
>>
>> Thanks!
>>
>> [1]
>> Testpmd options: -l 1,3
>> --vdev=virtio_user0,mac=01:02:03:04:05:06,path=/dev/vhost-net,queue_size=1024
>> -- --auto-start --stats-period 5 --tx-offloads="$TX_OFFLOADS"
>> --rx-offloads="$RX_OFFLOADS" --txd=4096 --rxd=4096 --burst=512
>> --forward-mode=txonly
>>
>> Where offloads were obtained manually running with
>> --[tr]x-offloads=0x8fff and examining testpmd response:
>> declare -r RX_OFFLOADS=0x81d
>> declare -r TX_OFFLOADS=0x802d
>>
>> All of the tests results are an average of at least 3 samples of
>> testpmd, discarding the obvious deviations at start/end (like warming
>> up or waiting for pktgen to start). The result of pktgen is directly
>> c&p from its output.
>>
>> The numbers do not change very much from one stats printing to another
>> of testpmd.
>>
>> [2] Obtained subtracting each accumulated tx-packets from one stats
>> print to the previous one. If we attend testpmd output about Tx-pps,
>> it counts a little bit less performance, but it follows the same
>> pattern:
>>
>> Testpmd pps/bps stats:
>> Baseline: 3510826.25 pps, 1797887912bps = 224735989bytes/sec
>> Batch: 3448515.571pps, 1765640226bps = 220705028.3bytes/sec
>> Buf api: 3794115.333pps, 1942587286bps = 242823410.8bytes/sec
>>
>> [3] This is obtained using the rx-pps/rx-bps report of testpmd.
>>
>> Seems strange to me that the relation between pps/bps is ~336 this
>> time, and between accumulated pkts/accumulated bytes is ~58. Also, the
>> relation between them is not even close to 8.
>>
>> However, testpmd shows a lot of absolute packets received. If we see
>> the received packets in a period subtracting from the previous one,
>> testpmd tells that receive more pps than pktgen tx-pps:
>> Baseline: ~2222668.667pps 128914784.3bps.
>> Batch: 2269260.933pps, 131617134.9bps
>> Buf_api: 2213226.467pps, 128367135.9bps
> How about playing with the batch size? Make it a mod parameter instead
> of the hard coded 64, and measure for all values 1 to 64 ...

Right, according to the test results, 64 seems to be too aggressive in
the case of TX.

It might also be worth checking:

1) Whether the vhost thread is stressed to 100% CPU utilization; if not,
   there's a bottleneck elsewhere.
2) For the RX test, that the pktgen kthread is running on the same NUMA
   node as virtio-user.

Thanks
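
For check 1), here is a minimal sketch of one way to sample the CPU utilization of a single thread from procfs. It is an illustration only, not something that was used in the tests above. The helper names (`parse_cpu_ticks`, `utilization`, `sample`) are made up; the only thing it relies on is the documented layout of the per-task stat file, where utime and stime are fields 14 and 15 and are counted in clock ticks.

```python
import os
import time


def parse_cpu_ticks(stat_line: str) -> int:
    """Return utime + stime (in clock ticks) from one procfs stat line."""
    # The comm field (field 2) is wrapped in parentheses and may itself
    # contain spaces or ')', so split after the last ')'.
    rest = stat_line.rsplit(")", 1)[1].split()
    # rest[0] is field 3 (state), so fields 14 and 15 are rest[11], rest[12].
    return int(rest[11]) + int(rest[12])


def utilization(ticks_before: int, ticks_after: int, seconds: float, hz: int) -> float:
    """CPU utilization in percent over the sampling interval."""
    return 100.0 * (ticks_after - ticks_before) / (hz * seconds)


def sample(tid: int, seconds: float = 1.0) -> float:
    """Sample one thread (hypothetical helper, Linux only).

    Reads the per-thread file under task/ so that the counters are not
    aggregated over the whole thread group.
    """
    hz = os.sysconf("SC_CLK_TCK")
    path = f"/proc/{tid}/task/{tid}/stat"
    with open(path) as f:
        before = parse_cpu_ticks(f.read())
    time.sleep(seconds)
    with open(path) as f:
        after = parse_cpu_ticks(f.read())
    return utilization(before, after, seconds, hz)
```

Calling `sample(tid)` with the thread id of the vhost worker while the test is running gives a rough percentage; a value well below 100 would suggest the bottleneck is elsewhere. The worker usually shows up in `ps` as `vhost-<owner pid>`, but that naming is an assumption about the kernel in use and should be verified on the test host.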