Subject: Re: [PATCH net] xsk: remove cheap_dma optimization
To: Christoph Hellwig
Cc: Björn Töpel, Daniel Borkmann, maximmi@mellanox.com, konrad.wilk@oracle.com,
    jonathan.lemon@gmail.com, linux-kernel@vger.kernel.org,
    iommu@lists.linux-foundation.org, netdev@vger.kernel.org, bpf@vger.kernel.org,
    davem@davemloft.net, magnus.karlsson@intel.com
References: <20200626134358.90122-1-bjorn.topel@gmail.com>
 <20200627070406.GB11854@lst.de>
 <88d27e1b-dbda-301c-64ba-2391092e3236@intel.com>
 <878626a2-6663-0d75-6339-7b3608aa4e42@arm.com>
 <20200708065014.GA5694@lst.de>
From: Robin Murphy
Message-ID: <79926b59-0eb9-2b88-b1bb-1bd472b10370@arm.com>
Date: Wed, 8 Jul 2020 14:18:39 +0100
In-Reply-To: <20200708065014.GA5694@lst.de>

On 2020-07-08 07:50, Christoph Hellwig wrote:
> On Mon, Jun 29, 2020 at 04:41:16PM +0100, Robin Murphy wrote:
>> On 2020-06-28 18:16, Björn Töpel wrote:
>>>
>>> On 2020-06-27 09:04, Christoph Hellwig wrote:
>>>> On Sat, Jun 27, 2020 at 01:00:19AM +0200, Daniel Borkmann wrote:
>>>>> Given there is roughly a ~5 week window at most in which this removal
>>>>> could still be applied in the worst case, could we come up with a fix /
>>>>> proposal first that moves this into the DMA mapping core? If there is
>>>>> something that can be agreed upon by all parties, then we could avoid
>>>>> re-adding the 9% slowdown. :/
>>>>
>>>> I'd rather turn it upside down - this abuse of the internals blocks work
>>>> that has basically just missed the previous window, and I'm not going
>>>> to wait weeks to sort out the API misuse. But we can add optimizations
>>>> back later if we find a sane way.
>>>>
>>>
>>> I'm not super excited about the performance loss, but I do get
>>> Christoph's frustration about gutting the DMA API and making it harder
>>> for the DMA people to get work done. Let's try to solve this properly
>>> using the proper DMA APIs.
>>>
>>>
>>>> That being said, I really can't see how this would make so much of a
>>>> difference. What architecture and what dma_ops are you using for
>>>> those measurements? What is the workload?
>>>>
>>>
>>> The 9% is for an AF_XDP (fast raw Ethernet socket - think AF_PACKET,
>>> but faster) benchmark: receive the packet from the NIC, and drop it.
>>> The DMA syncs stand out in perf top:
>>>
>>>   28.63%  [kernel]                   [k] i40e_clean_rx_irq_zc
>>>   17.12%  [kernel]                   [k] xp_alloc
>>>    8.80%  [kernel]                   [k] __xsk_rcv_zc
>>>    7.69%  [kernel]                   [k] xdp_do_redirect
>>>    5.35%  bpf_prog_992d9ddc835e5629  [k] bpf_prog_992d9ddc835e5629
>>>    4.77%  [kernel]                   [k] xsk_rcv.part.0
>>>    4.07%  [kernel]                   [k] __xsk_map_redirect
>>>    3.80%  [kernel]                   [k] dma_direct_sync_single_for_cpu
>>>    3.03%  [kernel]                   [k] dma_direct_sync_single_for_device
>>>    2.76%  [kernel]                   [k] i40e_alloc_rx_buffers_zc
>>>    1.83%  [kernel]                   [k] xsk_flush
>>> ...
>>>
>>> For this benchmark the dma_ops are NULL (dma_is_direct() == true), and
>>> the main issue is that SWIOTLB is now unconditionally enabled [1] for
>>> x86, so for each sync we have to check is_swiotlb_buffer(), which
>>> involves some costly indirection.
>>>
>>> That was pretty much what my hack avoided. Instead we did all the checks
>>> upfront, since AF_XDP has long-term DMA mappings, and just set a flag
>>> for that.
>>>
>>> Avoiding the whole "is this address swiotlb" check in
>>> dma_direct_sync_single_for_{cpu,device}() per packet would help a lot.
>>
>> I'm pretty sure that's one of the things we hope to achieve with the
>> generic bypass flag :)
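
For the benefit of anyone skimming the thread, the shape of what the
removed optimisation was buying is roughly the following - purely
illustrative, with made-up names (pool_dma_hint, pool_sync_rx_for_cpu),
not the actual xsk code that was reverted:

#include <linux/dma-mapping.h>

/*
 * Illustrative only: answer "does this long-lived mapping ever need
 * per-access sync work?" once at pool setup and cache the result, instead
 * of re-deriving it (coherency and is_swiotlb_buffer() checks) for every
 * single packet.
 */
struct pool_dma_hint {
	bool need_sync;		/* decided once when the pool is mapped */
};

static inline void pool_sync_rx_for_cpu(struct device *dev,
					struct pool_dma_hint *hint,
					dma_addr_t addr, size_t len)
{
	if (!hint->need_sync)
		return;		/* fast path: one well-predicted branch */
	dma_sync_single_for_cpu(dev, addr, len, DMA_FROM_DEVICE);
}

The sticking point is how need_sync gets decided in the first place
without the caller poking at dma-direct/swiotlb internals - which is
exactly the part that wants to live behind a proper dma-mapping interface.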
>>> Somewhat related to the DMA API: it would have performance benefits for
>>> AF_XDP if the DMA range of the mapped memory was linear, e.g. via IOMMU
>>> utilisation. I've started hacking on something, but it would be nice if
>>> such an API was part of the mapping core.
>>>
>>> Input: array of pages
>>> Output: array of dma addrs (and obviously dev, flags and such)
>>>
>>> For non-IOMMU: len(array of pages) == len(array of dma addrs)
>>> For best-case IOMMU: len(array of dma addrs) == 1 (large linear space)
>>>
>>> But that's for later. :-)
>>
>> FWIW you will typically get that behaviour from IOMMU-based
>> implementations of dma_map_sg() right now, although it's not strictly
>> guaranteed. If you can weather some additional setup cost of calling
>> sg_alloc_table_from_pages() plus walking the list after mapping to test
>> whether you did get a contiguous result, you could start taking advantage
>> of it, as some of the dma-buf code in DRM and v4l2 does already (although
>> those cases actually treat it as a strict dependency rather than an
>> optimisation).
>
> Yikes.

Heh, consider it as iommu_dma_alloc_remap() and
vb2_dc_get_contiguous_size() having a beautiful baby ;) A rough sketch of
that post-mapping walk is appended at the end of this mail.

>> I'm inclined to agree that if we're going to see more of these cases, a
>> new API call that did formally guarantee a DMA-contiguous mapping (either
>> via IOMMU or bounce buffering) or failure might indeed be handy.
>
> I was planning on adding a dma-level API to add more pages to an
> IOMMU batch, but was waiting for at least the intel IOMMU driver to be
> converted to the dma-iommu code (and preferably arm32 and s390 as well).

FWIW I did finally get round to having an initial crack at arm32
recently[1] - of course it needs significant rework already for all the
IOMMU API motion, and I still need to attempt to test any of it (at least
I do have a couple of 32-bit boards here), but with any luck I hope I'll
be able to pick it up again next cycle.

> Here is my old pseudo-code sketch for what I was aiming for from the
> block/nvme perspective. I haven't even implemented it yet, so there might
> be some holes in the design:
>
>
> /*
>  * Returns 0 if batching is possible, a positive number of segments
>  * required if batching is not possible, or a negative value on error.
>  */
> int dma_map_batch_start(struct device *dev, size_t rounded_len,
> 	enum dma_data_direction dir, unsigned long attrs, dma_addr_t *addr);
> int dma_map_batch_add(struct device *dev, dma_addr_t *addr, struct page *page,
> 	unsigned long offset, size_t size);
> int dma_map_batch_end(struct device *dev, int ret, dma_addr_t start_addr);

Just as an initial thought, it's probably nicer to have some kind of
encapsulated state structure to pass around between these calls rather
than a menagerie of bare address pointers, similar to what we did with
iommu_iotlb_gather - something along the lines of the sketch below. An
IOMMU-based backend might not want to commit batch_add() calls
immediately, but instead look for physically-sequential pages and merge
them into larger mappings where it can, and keeping track of things based
only on next_addr, when multiple batch requests could be happening in
parallel for the same device, would get messy fast.

I also don't entirely see how the backend can be expected to determine
the number of segments required in advance - e.g. bounce-buffering could
join two half-page segments into one while an IOMMU typically couldn't,
yet the opposite might also be true of larger multi-page segments.
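
Purely to illustrate the shape I have in mind - field and parameter names
invented on the spot, definitely not a proposed interface:

#include <linux/dma-mapping.h>

/* Hypothetical iommu_iotlb_gather-style carrier for in-flight batch state. */
struct dma_map_batch_state {
	struct device		*dev;
	enum dma_data_direction	dir;
	unsigned long		attrs;
	dma_addr_t		start;	/* DMA address of the whole batch */
	dma_addr_t		next;	/* where the next segment should land */
	size_t			mapped;	/* bytes accepted so far */
};

/* The three calls would then pass the state around instead of bare pointers: */
int dma_map_batch_start(struct dma_map_batch_state *state, struct device *dev,
			size_t rounded_len, enum dma_data_direction dir,
			unsigned long attrs);
int dma_map_batch_add(struct dma_map_batch_state *state, struct page *page,
		      unsigned long offset, size_t size);
int dma_map_batch_end(struct dma_map_batch_state *state);

That would also give a backend somewhere to stash deferred, not-yet-mapped
segments if it wants to merge physically-sequential pages first.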
Robin.

[1] http://www.linux-arm.org/git?p=linux-rm.git;a=shortlog;h=refs/heads/arm/dma

> int blk_dma_map_rq(struct device *dev, struct request *rq,
> 		enum dma_data_direction dir, unsigned long attrs,
> 		dma_addr_t *start_addr, size_t *len)
> {
> 	struct req_iterator iter;
> 	struct bio_vec bvec;
> 	dma_addr_t next_addr;
> 	int ret;
>
> 	if (number_of_segments(req) == 1) {
> 		// plain old dma_map_page();
> 		return 0;
> 	}
>
> 	// XXX: block helper for rounded_len?
> 	*len = length_of_request(req);
> 	ret = dma_map_batch_start(dev, *len, dir, attrs, start_addr);
> 	if (ret)
> 		return ret;
>
> 	next_addr = *start_addr;
> 	rq_for_each_segment(bvec, rq, iter) {
> 		ret = dma_map_batch_add(dev, &next_addr, bvec.bv_page,
> 				bvec.bv_offset, bvec.bv_len);
> 		if (ret)
> 			break;
> 	}
>
> 	return dma_map_batch_end(dev, ret, *start_addr);
> }
>
> dma_addr_t blk_dma_map_bvec(struct device *dev, struct bio_vec *bvec,
> 		enum dma_data_direction dir, unsigned long attrs)
> {
> 	return dma_map_page_attrs(dev, bvec->bv_page, bvec->bv_offset,
> 			bvec->bv_len, dir, attrs);
> }
>
> int queue_rq()
> {
> 	dma_addr_t addr;
> 	int ret;
>
> 	ret = blk_dma_map_rq(dev, rq, dir, attrs, &addr, &len);
> 	if (ret < 0)
> 		return ret;
>
> 	if (ret == 0) {
> 		if (use_sgl()) {
> 			nvme_pci_sgl_set_data(&cmd->dptr.sgl, addr, len);
> 		} else {
> 			set_prps();
> 		}
> 		return 0;
> 	}
>
> 	if (use_sgl()) {
> 		alloc_one_sgl_per_segment();
>
> 		rq_for_each_segment(bvec, rq, iter) {
> 			addr = blk_dma_map_bvec(dev, &bvec, dir, 0);
> 			set_one_sgl();
> 		}
> 	} else {
> 		alloc_one_prp_per_page();
>
> 		rq_for_each_segment(bvec, rq, iter) {
> 			ret = blk_dma_map_bvec(dev, &bvec, dir, 0);
> 			if (ret)
> 				break;
> 			set_prps();
> 		}
> 	}
> }
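
And for completeness, the post-mapping walk mentioned further up: a minimal
sketch, modelled loosely on vb2_dc_get_contiguous_size(), of how a caller
could find out after dma_map_sg() whether it got a single IOVA-contiguous
range. The helper name is made up; only for_each_sg(), sg_dma_address() and
sg_dma_len() are real APIs here.

#include <linux/scatterlist.h>
#include <linux/dma-mapping.h>

/*
 * Returns how many bytes from the start of a DMA-mapped sg_table came back
 * IOVA-contiguous (mapped_nents being the count returned by dma_map_sg()).
 * If this covers the whole buffer, it can be treated as one linear DMA
 * range starting at sg_dma_address(sgt->sgl).
 */
static size_t contiguous_dma_size(struct sg_table *sgt, int mapped_nents)
{
	struct scatterlist *sg;
	dma_addr_t expected = sg_dma_address(sgt->sgl);
	size_t size = 0;
	int i;

	for_each_sg(sgt->sgl, sg, mapped_nents, i) {
		if (sg_dma_address(sg) != expected)
			break;
		expected += sg_dma_len(sg);
		size += sg_dma_len(sg);
	}
	return size;
}

A caller like the xsk pool code could build the table with
sg_alloc_table_from_pages(), map it, and only take the "one linear range"
fast path when contiguous_dma_size() covers the entire pool, falling back
to per-page DMA addresses otherwise.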