Date: Fri, 7 Jul 2023 12:22:13 +0200
From: Petr Tesařík
To: Greg Kroah-Hartman, "Michael Kelley (LINUX)"
Cc: Petr Tesarik, Stefano Stabellini, Thomas Bogendoerfer,
    Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
    "maintainer:X86 ARCHITECTURE (32-BIT AND 64-BIT)", "H. Peter Anvin",
    "Rafael J. Wysocki", Juergen Gross, Oleksandr Tyshchenko,
    Christoph Hellwig, Marek Szyprowski, Robin Murphy, Andy Shevchenko,
    Hans de Goede, Jason Gunthorpe, Kees Cook, Saravana Kannan,
    "moderated list:XEN HYPERVISOR ARM", "moderated list:ARM PORT",
    open list, "open list:MIPS", "open list:XEN SWIOTLB SUBSYSTEM",
    Roberto Sassu, Kefeng Wang
Subject: Re: [PATCH v3 4/7] swiotlb: if swiotlb is full, fall back to a transient memory pool
Message-ID: <20230707122213.3a7378b5@meshulam.tesarici.cz>
In-Reply-To: <2023070706-humbling-starfish-c68f@gregkh>
References: <34c2a1ba721a7bc496128aac5e20724e4077f1ab.1687859323.git.petr.tesarik.ext@huawei.com>
    <2023070626-boxcar-bubbly-471d@gregkh>
    <2023070706-humbling-starfish-c68f@gregkh>
X-Mailing-List: linux-kernel@vger.kernel.org

On Fri, 7 Jul 2023 10:29:00 +0100
Greg Kroah-Hartman wrote:

> On Thu, Jul 06, 2023 at 02:22:50PM +0000, Michael Kelley (LINUX) wrote:
> > From: Greg Kroah-Hartman Sent: Thursday, July 6, 2023 1:07 AM
> > >
> > > On Thu, Jul 06, 2023 at 03:50:55AM +0000, Michael Kelley (LINUX) wrote:
> > > > From: Petr Tesarik Sent: Tuesday, June 27, 2023 2:54 AM
> > > > >
> > > > > Try to allocate a transient memory pool if no suitable slots can be found,
> > > > > except when allocating from a restricted pool. The transient pool is just
> > > > > big enough for this one bounce buffer. It is inserted into a per-device
> > > > > list of transient memory pools, and it is freed again when the bounce
> > > > > buffer is unmapped.
> > > > >
> > > > > Transient memory pools are kept in an RCU list. A memory barrier is
> > > > > required after adding a new entry, because any address within a transient
> > > > > buffer must be immediately recognized as belonging to the SWIOTLB, even if
> > > > > it is passed to another CPU.
> > > > >
> > > > > Deletion does not require any synchronization beyond RCU ordering
> > > > > guarantees. After a buffer is unmapped, its physical addresses may no
> > > > > longer be passed to the DMA API, so the memory range of the corresponding
> > > > > stale entry in the RCU list never matches. If the memory range gets
> > > > > allocated again, then it happens only after an RCU quiescent state.
> > > > >
> > > > > Since bounce buffers can now be allocated from different pools, add a
> > > > > parameter to swiotlb_alloc_pool() to let the caller know which memory pool
> > > > > is used. Add swiotlb_find_pool() to find the memory pool corresponding to
> > > > > an address. This function is now also used by is_swiotlb_buffer(), because
> > > > > a simple boundary check is no longer sufficient.
> > > > >
> > > > > The logic in swiotlb_alloc_tlb() is taken from __dma_direct_alloc_pages(),
> > > > > simplified and enhanced to use coherent memory pools if needed.
> > > > >
> > > > > Note that this is not the most efficient way to provide a bounce buffer,
> > > > > but when a DMA buffer can't be mapped, something may (and will) actually
> > > > > break. At that point it is better to make an allocation, even if it may be
> > > > > an expensive operation.
> > > >
> > > > I continue to think about swiotlb memory management from the standpoint
> > > > of CoCo VMs that may be quite large with high network and storage loads.
> > > > These VMs are often running mission-critical workloads that can't tolerate
> > > > a bounce buffer allocation failure. To prevent such failures, the swiotlb
> > > > memory size must be overly large, which wastes memory.
> > >
> > > If "mission critical workloads" are in a vm that allows overcommit and
> > > no control over other vms in that same system, then you have worse
> > > problems, sorry.
> > >
> > > Just don't do that.
> > >
> >
> > No, the cases I'm concerned about don't involve memory overcommit.
> >
> > CoCo VMs must use swiotlb bounce buffers to do DMA I/O. Current swiotlb
> > code in the Linux guest allocates a configurable, but fixed, amount of guest
> > memory at boot time for this purpose. But it's hard to know how much
> > swiotlb bounce buffer memory will be needed to handle peak I/O loads.
> > This patch set does dynamic allocation of swiotlb bounce buffer memory,
> > which can help avoid needing to configure an overly large fixed size at boot.
>
> But, as you point out, memory allocation can fail at runtime, so how can
> you "guarantee" that this will work properly anymore if you are going to
> make it dynamic?

In general, there is no guarantee, of course, because bounce buffers
may be requested from interrupt context. I believe Michael is looking
for the SWIOTLB_MAY_SLEEP flag that was introduced in my v2 series, so
new pools can be allocated with GFP_KERNEL instead of GFP_NOWAIT if
possible, and then there is no need to dip into the coherent pool.

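To make the difference concrete, here is a minimal userspace sketch of
that decision. It is only a toy model under stated assumptions, not
code from any version of the series: MAPPING_MAY_SLEEP,
pick_alloc_mode(), alloc_transient() and struct sim_state are
hypothetical names, and the two allocation modes merely stand in for
GFP_KERNEL and GFP_NOWAIT.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins; these are NOT the kernel's definitions. */
#define MAPPING_MAY_SLEEP 0x1u	/* models the SWIOTLB_MAY_SLEEP idea */

enum alloc_mode {
	ALLOC_BLOCKING,		/* models a GFP_KERNEL-style allocation */
	ALLOC_NONBLOCKING,	/* models a GFP_NOWAIT-style allocation */
};

enum alloc_source {
	SRC_NONE,		/* allocation failed */
	SRC_PAGE_ALLOCATOR,	/* served by the regular allocator */
	SRC_COHERENT_POOL,	/* served by the pre-allocated coherent pool */
};

/* Toy allocator state, all sizes in bytes. */
struct sim_state {
	size_t free_now;	/* available without waiting */
	size_t reclaimable;	/* becomes available only if the caller waits */
	size_t coherent_free;	/* left in the coherent pool */
};

/* Only callers that declare they may sleep get a blocking allocation. */
static enum alloc_mode pick_alloc_mode(unsigned int flags)
{
	return (flags & MAPPING_MAY_SLEEP) ? ALLOC_BLOCKING : ALLOC_NONBLOCKING;
}

/* Allocate backing memory for one transient pool of the given size. */
static enum alloc_source alloc_transient(struct sim_state *st, size_t size,
					 unsigned int flags)
{
	enum alloc_mode mode = pick_alloc_mode(flags);

	assert(st != NULL);

	/* A blocking caller can wait until reclaimable memory is freed. */
	if (st->free_now < size && mode == ALLOC_BLOCKING) {
		st->free_now += st->reclaimable;
		st->reclaimable = 0;
	}

	if (st->free_now >= size) {
		st->free_now -= size;
		return SRC_PAGE_ALLOCATOR;
	}

	/* Only the non-blocking path dips into the coherent pool. */
	if (mode == ALLOC_NONBLOCKING && st->coherent_free >= size) {
		st->coherent_free -= size;
		return SRC_COHERENT_POOL;
	}

	return SRC_NONE;
}
```

In this model, a request without MAPPING_MAY_SLEEP that finds no
immediately free memory consumes the coherent pool, while the same
request with the flag set waits for reclaim and leaves the coherent
pool untouched.
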
Well, I have deliberately removed all complexities from my v3 series,
but I have more WIP topic branches in my local repo:

- allow blocking allocations if possible
- allocate a new pool before existing pools are full
- free unused memory pools

I can make a bigger series, or I can send another series as RFC if
this is desired. ATM I don't feel confident enough that my v3 series
will be accepted without major changes, so I haven't invested time
into finalizing the other topic branches.

@Michael: If you know that my plan is to introduce blocking allocations
with a follow-up patch series, is the present approach acceptable?

Petr T