Received: by 2002:a05:6359:c8b:b0:c7:702f:21d4 with SMTP id go11csp3561970rwb; Tue, 20 Sep 2022 01:12:09 -0700 (PDT) X-Google-Smtp-Source: AMsMyM4++mla7YVcfsYwXzEjBSQxzIklNEfuuh37xG+xQMfICdrj9zbgTipI4bLddpkqLM4GO8gy X-Received: by 2002:a17:902:8214:b0:178:95c9:bd5d with SMTP id x20-20020a170902821400b0017895c9bd5dmr3817292pln.106.1663661529199; Tue, 20 Sep 2022 01:12:09 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1663661529; cv=none; d=google.com; s=arc-20160816; b=R0qqOKYTqV4DNlhMFKuXrumhjPUFXhYJ15FS7N2Xnrx1RrhmTWR+zkwuM+uFajRFDh 8tZeYlp29wakt0W2gwhy5+0YkFpX3t/c996agdxN1i6XaH5Q/bjBWr5++88GOMPDGXNT +JzZvu4wCsfHcjO4/fMly/56Aa0cXnHOaGFI2rOZqUbj66HmUfAMQ44/mteC43BAZG6J qQ0hTO3hnHg0fZUAukivUfZAXq/dOXhdUWkSLCF9Y87ukcKadlYjhl1EumH4jIq2n2GN mfFOOYwcyU0WTtGj/QxB0aMXB/2zLL1SxOuOgqxVD4ajd7h4aKMgpUXAUJ3W/CWqKw+W 2sHA== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:in-reply-to:content-disposition:mime-version :references:message-id:subject:cc:to:from:date:dkim-signature; bh=H0mXB3qLzd6nyp4U82pUC+0jG9YRP9G0sb9CuKu7rQc=; b=iQhAkpRr2g0tgTHDnRu8CGrnjaT19s3NRsVL3AkCa13FhbmjbLbgLqq8t9XcDEkEYU EkYgDzxPgwhVrIxXl/jLj+S0PTC4XH0r2G5hw5aWxH7RG5VsuB6dDLvoDRhUjVAMynKN UBcXFIDxhbTkCXWdnHvNw0X1XllJtPWRJTZ0mojJ6H3GZibdXPfMTPhQxxjLmxjtBv2L e+lK8nkHOBeQ1Hx5njhGBKowrAaAw+ToGYui/ca/ebbgeSqOpg5cOG2rM9dmvXFMCXmO c8MH64M3slo1Z8m8wjrt9bOfRcxiegmCL0XHgYPp/lMcBMj9yy4cDcXi7neD9a71wiVX y7OA== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@infradead.org header.s=bombadil.20210309 header.b=YCkxzsVu; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org Return-Path: Received: from out1.vger.email (out1.vger.email. [2620:137:e000::1:20]) by mx.google.com with ESMTP id x2-20020a170902fe8200b001789bd50b64si915560plm.319.2022.09.20.01.11.57; Tue, 20 Sep 2022 01:12:09 -0700 (PDT) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20; Authentication-Results: mx.google.com; dkim=pass header.i=@infradead.org header.s=bombadil.20210309 header.b=YCkxzsVu; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S230020AbiITHzg (ORCPT + 99 others); Tue, 20 Sep 2022 03:55:36 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:50352 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229553AbiITHz1 (ORCPT ); Tue, 20 Sep 2022 03:55:27 -0400 Received: from bombadil.infradead.org (bombadil.infradead.org [IPv6:2607:7c80:54:3::133]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 61B062DFA; Tue, 20 Sep 2022 00:55:24 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; d=infradead.org; s=bombadil.20210309; h=In-Reply-To:Content-Type:MIME-Version :References:Message-ID:Subject:Cc:To:From:Date:Sender:Reply-To: Content-Transfer-Encoding:Content-ID:Content-Description; bh=H0mXB3qLzd6nyp4U82pUC+0jG9YRP9G0sb9CuKu7rQc=; b=YCkxzsVunikesQe2vmc4oMq2Hn Jg33iFWFuD4QHdvq6fDWy8Z3gojgL0RTxp2wHVG11PD9SGiG94LnDwCfM0D5ZbFKC//BwQb83f5kt KwsNLeOUgUUU3MHlb6KJoYbFbSJuTfGC5UEoEajTXpzQIwfGeW85tyLb7xxH/Ao7q2EwK2BNuzYWc aNL14xoXbcw6ZY0YsF3QsrggrY8mOuwt3JZRUErGulwoWldPYHn9QdDr2suz3ZVXuunT+RoLnvWlg xDYTJiVsEp/MPK7zZcEyW9nsCtWowlraCvfA3e1XhM3BAyuqxe7ilU6LfUtfHbB5iLnEj7CYeZdr3 XHziMGFA==; Received: from hch by bombadil.infradead.org with local (Exim 4.94.2 #2 (Red Hat Linux)) id 1oaY5v-001WMi-Uv; Tue, 20 Sep 2022 07:55:04 +0000 Date: Tue, 20 Sep 2022 00:55:03 -0700 From: Christoph Hellwig To: Dongli Zhang Cc: iommu@lists.linux.dev, x86@kernel.org, xen-devel@lists.xenproject.org, linux-arm-kernel@lists.infradead.org, linux-mips@vger.kernel.org, linux-kernel@vger.kernel.org, hch@infradead.org, sstabellini@kernel.org, linux@armlinux.org.uk, tsbogend@alpha.franken.de, jgross@suse.com, boris.ostrovsky@oracle.com, tglx@linutronix.de, mingo@redhat.com, bp@alien8.de, dave.hansen@linux.intel.com, m.szyprowski@samsung.com, konrad.wilk@oracle.com, robin.murphy@arm.com, hpa@zytor.com, oleksandr_tyshchenko@epam.com, joe.jin@oracle.com Subject: Re: [PATCH RFC v2 1/1] wiotlb: split buffer into 32-bit default and 64-bit extra zones Message-ID: References: <20220820074250.5460-1-dongli.zhang@oracle.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20220820074250.5460-1-dongli.zhang@oracle.com> X-SRS-Rewrite: SMTP reverse-path rewritten from by bombadil.infradead.org. See http://www.infradead.org/rpr.html X-Spam-Status: No, score=-4.4 required=5.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,RCVD_IN_DNSWL_MED,SPF_HELO_NONE, SPF_NONE autolearn=ham autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Sat, Aug 20, 2022 at 12:42:50AM -0700, Dongli Zhang wrote: > Hello, > > I used to send out RFC v1 to introduce an extra io_tlb_mem (created with > SWIOTLB_ANY) in addition to the default io_tlb_mem (32-bit). The > dev->dma_io_tlb_mem is set to either default or the extra io_tlb_mem, > depending on dma mask. However, that is not good for setting > dev->dma_io_tlb_mem at swiotlb layer transparently as suggested by > Christoph Hellwig. > > https://lore.kernel.org/all/20220609005553.30954-1-dongli.zhang@oracle.com/ > > Therefore, this is another RFC v2 implementation following a different > direction. The core ideas are: > > 1. The swiotlb is splited into two zones, io_tlb_mem->zone[0] (32-bit) and > io_tlb_mem->zone[1] (64-bit). > > struct io_tlb_mem { > struct io_tlb_zone zone[SWIOTLB_NR]; > struct dentry *debugfs; > bool late_alloc; > bool force_bounce; > bool for_alloc; > bool has_extra; > }; > > struct io_tlb_zone { > phys_addr_t start; > phys_addr_t end; > void *vaddr; > unsigned long nslabs; > unsigned long used; > unsigned int nareas; > unsigned int area_nslabs; > struct io_tlb_area *areas; > struct io_tlb_slot *slots; > }; > > 2. By default, only io_tlb_mem->zone[0] is available. The > io_tlb_mem->zone[1] is allocated conditionally if: > > - the "swiotlb=" is configured to allocate extra buffer, and > - the SWIOTLB_EXTRA is set in the flag (this is to make sure arch(s) other > than x86/sev/xen will not enable it until it is fully tested by each > arch, e.g., mips/powerpc). Currently it is enabled for x86 and xen. > > 3. During swiotlb map, whether zone[0] (32-bit) or zone[1] (64-bit > SWIOTLB_ANY) > is used depends on min_not_zero(*dev->dma_mask, dev->bus_dma_limit). > > To test the RFC v2, here is the QEMU command line. > > qemu-system-x86_64 -smp 8 -m 32G -enable-kvm -vnc :5 -hda disk.img \ > -kernel path-to-linux/arch/x86_64/boot/bzImage \ > -append "root=/dev/sda1 init=/sbin/init text console=ttyS0 loglevel=7 swiotlb=32768,4194304,force" \ > -net nic -net user,hostfwd=tcp::5025-:22 \ > -device nvme,drive=nvme01,serial=helloworld -drive file=test.qcow2,if=none,id=nvme01 \ > -serial stdio > > There is below in syslog. The extra 8GB buffer is allocated. > > [ 0.152251] software IO TLB: area num 8. > ... ... > [ 3.706088] PCI-DMA: Using software bounce buffering for IO (SWIOTLB) > [ 3.707334] software IO TLB: mapped default [mem 0x00000000bbfd7000-0x00000000bffd7000] (64MB) > [ 3.708585] software IO TLB: mapped extra [mem 0x000000061cc00000-0x000000081cc00000] (8192MB) > > After the FIO is triggered over NVMe, the 64-bit buffer is used. > > $ cat /sys/kernel/debug/swiotlb/io_tlb_nslabs_extra > 4194304 > $ cat /sys/kernel/debug/swiotlb/io_tlb_used_extra > 327552 > > Would you mind helping if this is the right direction to go? > > Thank you very much! > > Cc: Konrad Wilk > Cc: Joe Jin > Signed-off-by: Dongli Zhang > --- > arch/arm/xen/mm.c | 2 +- > arch/mips/pci/pci-octeon.c | 5 +- > arch/x86/include/asm/xen/swiotlb-xen.h | 2 +- > arch/x86/kernel/pci-dma.c | 6 +- > drivers/xen/swiotlb-xen.c | 18 +- > include/linux/swiotlb.h | 73 +++-- > diff --git a/arch/arm/xen/mm.c b/arch/arm/xen/mm.c > index 3d826c0..4edfa42 100644 > --- a/arch/arm/xen/mm.c > +++ b/arch/arm/xen/mm.c > @@ -125,7 +125,7 @@ static int __init xen_mm_init(void) > return 0; > > /* we can work with the default swiotlb */ > - if (!io_tlb_default_mem.nslabs) { > + if (!io_tlb_default_mem.zone[SWIOTLB_DF].nslabs) { > rc = swiotlb_init_late(swiotlb_size_or_default(), > xen_swiotlb_gfp(), NULL); > if (rc < 0) First thing we need before doing anything about multiple default pools is to get all the knowledge of details hidden inside swiotlb.c. For swiotlb_init_late that seems easy as we can just move the check into it. > diff --git a/arch/mips/pci/pci-octeon.c b/arch/mips/pci/pci-octeon.c > index e457a18..0bf0859 100644 > --- a/arch/mips/pci/pci-octeon.c > +++ b/arch/mips/pci/pci-octeon.c > @@ -654,6 +654,9 @@ static int __init octeon_pci_setup(void) > octeon_pci_mem_resource.end = > octeon_pci_mem_resource.start + (1ul << 30); > } else { > + struct io_tlb_mem *mem = &io_tlb_default_mem; > + struct io_tlb_zone *zone = &mem->zone[SWIOTLB_DF]; > + > /* Remap the Octeon BAR 0 to map 128MB-(128MB+4KB) */ > octeon_npi_write32(CVMX_NPI_PCI_CFG04, 128ul << 20); > octeon_npi_write32(CVMX_NPI_PCI_CFG05, 0); > @@ -664,7 +667,7 @@ static int __init octeon_pci_setup(void) > > /* BAR1 movable regions contiguous to cover the swiotlb */ > octeon_bar1_pci_phys = > - io_tlb_default_mem.start & ~((1ull << 22) - 1); > + zone->start & ~((1ull << 22) - 1); But we'll need to do something about this mess. I'll need help from the octeon maintainer on it. > - x86_swiotlb_flags |= SWIOTLB_ANY; > + x86_swiotlb_flags |= SWIOTLB_ANY | SWIOTLB_EXTRA; I don't think this is how it is suppoed to be. SWIOTLB_ANY already says give me a pool with no addressing constrains. We don't need two pools without that. EXTRA is also not exactly a very helpful name here. > #ifdef CONFIG_X86 > -int xen_swiotlb_fixup(void *buf, unsigned long nslabs) > +int xen_swiotlb_fixup(void *buf, unsigned long nslabs, unsigned int flags) > { > int rc; > unsigned int order = get_order(IO_TLB_SEGSIZE << IO_TLB_SHIFT); > unsigned int i, dma_bits = order + PAGE_SHIFT; > dma_addr_t dma_handle; > phys_addr_t p = virt_to_phys(buf); > + unsigned int max_dma_bits = 32; I think live will be a lot simple if the addressing bits are passed to this function, and not some kind of flags. > +#define SWIOTLB_DF 0 > +#define SWIOTLB_EX 1 > +#define SWIOTLB_NR 2 These names are not very descriptive.