Date: Sun, 17 Sep 2023 11:47:41 +0200
From: Petr Tesařík
To: Catalin Marinas
Cc: Christoph Hellwig, Marek Szyprowski, Robin Murphy,
 "open list:DMA MAPPING HELPERS", open list, Roberto Sassu, Jonathan Corbet
Subject: Re: [PATCH] swiotlb: fix the check whether a device has used software IO TLB
Message-ID: <20230917114741.01a23364@meshulam.tesarici.cz>
References: <20230913114016.17752-1-petr@tesarici.cz>
 <20230913121403.GB4544@lst.de>
 <20230913142656.29e135d6@meshulam.tesarici.cz>
 <20230915111343.01496320@meshulam.tesarici.cz>

Hi Catalin,

thank you for your reply!
On Fri, 15 Sep 2023 18:09:28 +0100
Catalin Marinas wrote:

> On Fri, Sep 15, 2023 at 11:13:43AM +0200, Petr Tesařík wrote:
> > On Thu, 14 Sep 2023 19:28:01 +0100
> > Catalin Marinas wrote:
> > > What do the smp_wmb() barriers in swiotlb_find_slots() and
> > > swiotlb_dyn_alloc() order? The latter is even more unclear as it's at
> > > the end of the function and the "pairing" comment doesn't help.
> >
> > By the time swiotlb_find_slots() returns a valid slot index, the new
> > value of dev->dma_uses_io_tlb must be visible by all CPUs in
> > is_swiotlb_buffer(). The index is used to calculate the bounce buffer
> > address returned to device drivers. This address may be passed to
> > another CPU and used as an argument to is_swiotlb_buffer().
>
> Ah, I remember now. So the smp_wmb() ensures that dma_uses_io_tlb is
> seen by other CPUs before the slot address (presumably passed via
> another memory write). It may be worth updating the comment in the code
> (I'm sure I'll forget it in a month's time). The smp_rmb() before
> READ_ONCE() in this patch is also needed for the same reasons (ordering
> after the read of the address passed to is_swiotlb_buffer()).

I agree. The comment should definitely mention the implicitly assumed
memory write done by device drivers, because it is not obvious which
two writes are being ordered.

> BTW, you may want to use WRITE_ONCE() when setting dma_uses_io_tlb (it
> also matches the READ_ONCE() in is_swiotlb_buffer()). Or you can use
> smp_store_mb() (but check its semantics first).

I can use WRITE_ONCE(), although I believe it does not make much
difference thanks to the barrier provided by smp_wmb(). Using
smp_store_mb() should also work, but it inserts a full memory barrier,
which is more than is needed IMO.

> > I am not sure that the smp_wmb() in swiotlb_dyn_alloc() is needed,
> > but let me explain my thinking. Even if the dev->dma_uses_io_tlb
> > flag is set, is_swiotlb_buffer() returns false unless the buffer
> > address is found in the RCU list of swiotlb pools, which is walked
> > without taking any lock. In some iterations of the patch series,
> > there was a direct call to swiotlb_dyn_alloc(), and an smp_wmb() was
> > necessary to make sure that the list of swiotlb pools cannot be
> > observed as empty by another CPU. You have already explained to me
> > that taking a spin lock in add_mem_pool() is not sufficient, because
> > it does not invalidate a value that might have been speculatively
> > read by another CPU before entering the critical section. OTOH
> > swiotlb_dyn_alloc() is always called through schedule_work() in the
> > current version. If that implicitly provides the necessary barriers,
> > this smp_wmb() can be removed.
>
> Normally you'd need a barrier between the pool update and the flag
> update:
>
> 	list_add_rcu(&pool->node, &mem->pools);
> 	smp_wmb();
> 	WRITE_ONCE(dev->dma_uses_io_tlb, true);

I agree. However, if is_swiotlb_buffer() gets an address allocated from
a swiotlb pool that was added here, then it also had to be returned by
swiotlb_find_slots(), where smp_wmb() makes sure that all writes are
visible before the above-mentioned assumed write of the buffer pointer
into a private data field, performed by driver code.

> The lock around the mem->pools update doesn't help since it only has
> release semantics on the unlock path (which doesn't prevent the
> dma_uses_io_tlb write from being reordered before it).

Yes.

> On the reader side, you need to make sure that if dma_uses_io_tlb is
> true, the mem->pools access was not speculated and read earlier as
> empty, so another dma_rmb():
>
> 	if (READ_ONCE(dev->dma_uses_io_tlb)) {
> 		dma_rmb();

I think this is a typo, and you mean smp_rmb(); at least I can't see
which consistent DMA memory would be accessed here.
> 		swiotlb_find_pool(...);
> 	}
>
> That's missing in the code currently (and rcu_read_lock() only has
> acquire semantics).
>
> However, what confuses me is that mem->pools is populated
> asynchronously via schedule_work(). Not sure how the barriers help
> since the work can be scheduled on any CPU at any point after,
> potentially after dma_uses_io_tlb has been updated.

Ah... You may have a point after all if this sequence of events is
possible:

- CPU 0 writes a new value to mem->pools->next in swiotlb_dyn_alloc().

- CPU 1 observes the new value in swiotlb_find_slots(), even though it
  is not guaranteed by any barrier, allocates a slot and sets the
  dev->dma_uses_io_tlb flag.

- CPU 1 (driver code) writes the returned buffer address into its
  private struct. This write is ordered after dev->dma_uses_io_tlb
  thanks to the smp_wmb() in swiotlb_find_slots().

- CPU 2 (driver code) reads the buffer address, and the DMA core passes
  it to is_swiotlb_buffer(), which contains smp_rmb().

- IIUC CPU 2 is guaranteed to observe the new value of
  dev->dma_uses_io_tlb, but it may still use the old value of
  mem->pools->next, because the write on CPU 0 was not ordered against
  anything. The fact that the new value was observed by CPU 1 does not
  mean that it is also observed by CPU 2.

If this is what can happen, then I'm not sure how these events can be
properly ordered in a lockless fashion.

> On the transient dev->dma_io_tlb_pools updated in
> swiotlb_find_slots(), you'd need the barriers as I mentioned above
> (unless I misunderstood how this pool searching works; not entirely
> sure why swiotlb_find_pool() searches both).

The transient pool list is updated only in swiotlb_find_slots() before
setting dev->dma_uses_io_tlb. AFAICS this write is already ordered by
the same smp_wmb().

> [...]
> > Actually, I wonder when a newly added RCU list element is
> > guaranteed to be visible by all CPUs. Existing documentation deals
> > mainly with element removal, explaining that both the old state and
> > the new state of an RCU list are valid after addition. Is it
> > assumed that the old state of an RCU list after addition is valid
> > indefinitely?
>
> When some write becomes visible to other CPUs, I don't think the
> architectures define precisely (more like in vague terms as
> "eventually"). That's why we can't talk about barriers in relation to
> a single write as the barrier doesn't make a write visible but rather
> makes it visible _before_ a subsequent write becomes visible (whenever
> that may be).

This is understood. It's just that list_add_rcu() updates the next
pointer in the previous list entry using smp_store_release(), ordering
writes on this next pointer, but I don't see the matching loads with
acquire semantics anywhere. It's probably required to detect the RCU
grace period... Anyway, I was merely wondering when the new list entry
is visible on all CPUs, and your answer seems to confirm that the RCU
list primitives do not provide any such guarantee. IOW if correct
operation requires that the new value is observed before something else
(like here), it needs additional synchronization.

Petr T