Received: by 2002:ab2:3141:0:b0:1ed:23cc:44d1 with SMTP id i1csp476019lqg; Fri, 1 Mar 2024 10:42:29 -0800 (PST) X-Forwarded-Encrypted: i=3; AJvYcCXeB5IWMfzBZYyd7SRAoFOEN4LWIMtgwuu54f4Wkszjk4hLcnDywVAvkA0X61UQJw9VC6wZOZi2dvQntLyL1QpRY0izWROe6IVwbiZ1fQ== X-Google-Smtp-Source: AGHT+IEtHRRqdz0oLoPm8i2iYV2iK+EHeTZxWGpT5Y7vvFo6Q9FsJ8kiAnPEI6pfgi5e+IaizfR4 X-Received: by 2002:a17:906:fc16:b0:a44:3bec:20a8 with SMTP id ov22-20020a170906fc1600b00a443bec20a8mr2010957ejb.76.1709318548764; Fri, 01 Mar 2024 10:42:28 -0800 (PST) ARC-Seal: i=2; a=rsa-sha256; t=1709318548; cv=pass; d=google.com; s=arc-20160816; b=uBTpfs4o5VW/pDZBtK1fjwbA6y3w+lsAyZ4aa4VwiOsR4z9fuX9u3U1U4agKPZ/t1a FRramMQL30NGe+5JRZpanKt5XcK1pGNFsfHwJHqRURJHibuUCFp+D8m0VoQzHW0J3s2J o4ECoqJ+AgC4nNEJjD2bvzJW811YgSbSu0dJbZppNQE9+smM6jBI13Ccx8VZ75DEdvc9 Fz4oj5cjnWNuP+mg6y3QwzgsPB1KqFxd4D5xWo5THUKVTPDCnds1E1izF0wf/EimcWEQ RsSVxCS+hatTul8xqDiAZkOYakHP3JPQy8JIG4vbljsZNUhfJWj8Ihm5XzaG+c2R/MmX NrKQ== ARC-Message-Signature: i=2; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=content-transfer-encoding:mime-version:list-unsubscribe :list-subscribe:list-id:precedence:references:in-reply-to:message-id :subject:cc:to:from:date:dkim-signature; bh=0BzVbEMUW+HOHqX604KISbar5ltvn6EhM64KI0u9ocU=; fh=wVOxbXQt7McHwT1Ph3SewFU5h4SGFK1x/O6dX82lPp0=; b=bf+JEHQ8gbvsDqVJpBymyL4eaiyI53z+SflgR5O0Mqd89FLi8fh3XIQ7dgm/xEKreL N8jsn/FN5mO+Pq8Lu/FxpnIyCjVUKA6nPQWZ7moyTxLKnH+v9WDIVwcPPOm16CDkvUxO ga2/wJgsX/LyD8Cy82RCnz5qeD9uCoV5DzN/NUBdua5s0peIyB5HfD/Yohoo4zLyBZE/ uYxhYXywWY8ocY8OHisO7UgO8SrWAVCkwtX86G263IW6N0oJu6fB23EoiTD5q55coKoT zNJW2mnqu50aWJdNqiePdnx+lBq/X7EH5iFLyKlarnQ0lEvbtr7xFlpZ7hDyCVA6La6+ +HLw==; dara=google.com ARC-Authentication-Results: i=2; mx.google.com; dkim=pass header.i=@tesarici.cz header.s=mail header.b=Bmhi2mkZ; arc=pass (i=1 spf=pass spfdomain=tesarici.cz dkim=pass dkdomain=tesarici.cz dmarc=pass fromdomain=tesarici.cz); spf=pass (google.com: domain of linux-kernel+bounces-88933-linux.lists.archive=gmail.com@vger.kernel.org designates 2604:1380:4601:e00::3 as permitted sender) smtp.mailfrom="linux-kernel+bounces-88933-linux.lists.archive=gmail.com@vger.kernel.org"; dmarc=pass (p=QUARANTINE sp=QUARANTINE dis=NONE) header.from=tesarici.cz Return-Path: Received: from am.mirrors.kernel.org (am.mirrors.kernel.org. [2604:1380:4601:e00::3]) by mx.google.com with ESMTPS id s8-20020a170906354800b00a441a40c035si1637606eja.279.2024.03.01.10.42.28 for (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Fri, 01 Mar 2024 10:42:28 -0800 (PST) Received-SPF: pass (google.com: domain of linux-kernel+bounces-88933-linux.lists.archive=gmail.com@vger.kernel.org designates 2604:1380:4601:e00::3 as permitted sender) client-ip=2604:1380:4601:e00::3; Authentication-Results: mx.google.com; dkim=pass header.i=@tesarici.cz header.s=mail header.b=Bmhi2mkZ; arc=pass (i=1 spf=pass spfdomain=tesarici.cz dkim=pass dkdomain=tesarici.cz dmarc=pass fromdomain=tesarici.cz); spf=pass (google.com: domain of linux-kernel+bounces-88933-linux.lists.archive=gmail.com@vger.kernel.org designates 2604:1380:4601:e00::3 as permitted sender) smtp.mailfrom="linux-kernel+bounces-88933-linux.lists.archive=gmail.com@vger.kernel.org"; dmarc=pass (p=QUARANTINE sp=QUARANTINE dis=NONE) header.from=tesarici.cz Received: from smtp.subspace.kernel.org (wormhole.subspace.kernel.org [52.25.139.140]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by am.mirrors.kernel.org (Postfix) with ESMTPS id 1BF541F24B7E for ; Fri, 1 Mar 2024 18:42:28 +0000 (UTC) Received: from localhost.localdomain (localhost.localdomain [127.0.0.1]) by smtp.subspace.kernel.org (Postfix) with ESMTP id 5817B39ACE; Fri, 1 Mar 2024 18:42:19 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=tesarici.cz header.i=@tesarici.cz header.b="Bmhi2mkZ" Received: from bee.tesarici.cz (unknown [77.93.223.253]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 9F2602575F for ; Fri, 1 Mar 2024 18:42:15 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=77.93.223.253 ARC-Seal:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1709318538; cv=none; b=DW8kAYeMgJO+zUmvzkSCgoMhYn5A5bRI0q989Oi2s7VeaU4ENszmhkAUuu52oFtx2PDXJpd7lukdd9L1PDRgMILiWzdwyp68B524BBbyTaSuQuMqU0gPg/T0H8FgT2cquzpxtQDZl8Ghdubtx3F51Jo9gh8Hk7PWwd7lIRPuego= ARC-Message-Signature:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1709318538; c=relaxed/simple; bh=w+8tzDUNxC+0c2kk3d9ykcorijNJa53GYpsyK5+JaZE=; h=Date:From:To:Cc:Subject:Message-ID:In-Reply-To:References: MIME-Version:Content-Type; b=HBILSm8RhskJs+Ftv5hbyR1lPccbzazQ0IZZCoHbYgUzLkndTGDLhqO4l7WEFyZ0lBUnkzQRgM4+3nF2doPmahM7LfIk/akZ6DdoMFUiz7KU5lEUkVmfyOU3IDRjVFPgrTuCMNzu4XAM3agXBqPZg6Ux8x818lARpWjxJ/Gke10= ARC-Authentication-Results:i=1; smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=tesarici.cz; spf=pass smtp.mailfrom=tesarici.cz; dkim=pass (2048-bit key) header.d=tesarici.cz header.i=@tesarici.cz header.b=Bmhi2mkZ; arc=none smtp.client-ip=77.93.223.253 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=tesarici.cz Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=tesarici.cz Received: from meshulam.tesarici.cz (dynamic-2a00-1028-83b8-1e7a-4427-cc85-6706-c595.ipv6.o2.cz [IPv6:2a00:1028:83b8:1e7a:4427:cc85:6706:c595]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange ECDHE (P-256) server-signature RSA-PSS (2048 bits) server-digest SHA256) (No client certificate requested) by bee.tesarici.cz (Postfix) with ESMTPSA id 52C991C37C8; Fri, 1 Mar 2024 19:42:13 +0100 (CET) Authentication-Results: mail.tesarici.cz; dmarc=fail (p=quarantine dis=none) header.from=tesarici.cz DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=tesarici.cz; s=mail; t=1709318533; bh=0BzVbEMUW+HOHqX604KISbar5ltvn6EhM64KI0u9ocU=; h=Date:From:To:Cc:Subject:In-Reply-To:References:From; b=Bmhi2mkZ3so7nx3ERP2V7oB11mo9VdPjZVkhxOBotjJZP6CxSTk+Az+/ZH8rRDjUP NGyu1a3lBsp2D3ADnmZrO31NQtF/tjXYTY1qN7qRur6ZZKqdxWHXmjBwQeJEqH/Qj5 Uc0lftk9fJR+LV3q1ZSReYPjTosYktuTyvI0cWVaQCuje7HjOXjt191XIMvyKYBFLV LvXiyexLut5Oylz+NII5OEnRH1ICf0kX+tdEX5JVJ2xP0M/oykO9qzepMx7rvCTCSJ dy5CmsHXKa4acTWPoBJ0yGqUcBSvcn5hqg99x6tYlZrHGL12uYJCTVM0QrgoRFx5oH WvemhV6V89Jiw== Date: Fri, 1 Mar 2024 19:42:12 +0100 From: Petr =?UTF-8?B?VGVzYcWZw61r?= To: Robin Murphy Cc: Christoph Hellwig , Michael Kelley , Will Deacon , "linux-kernel@vger.kernel.org" , Petr Tesarik , "kernel-team@android.com" , "iommu@lists.linux.dev" , Marek Szyprowski , Dexuan Cui , Nicolin Chen Subject: Re: [PATCH v5 6/6] swiotlb: Remove pointless stride adjustment for allocations >= PAGE_SIZE Message-ID: <20240301194212.3c64c9b2@meshulam.tesarici.cz> In-Reply-To: <8869c8b2-29c3-41e4-8f8a-5bcf9c0d22bb@arm.com> References: <20240228133930.15400-1-will@kernel.org> <20240228133930.15400-7-will@kernel.org> <20240229133346.GA7177@lst.de> <20240229154756.GA10137@lst.de> <20240301163927.18358ee2@meshulam.tesarici.cz> <20240301180853.5ac20b27@meshulam.tesarici.cz> <8869c8b2-29c3-41e4-8f8a-5bcf9c0d22bb@arm.com> X-Mailer: Claws Mail 4.2.0 (GTK 3.24.39; x86_64-suse-linux-gnu) Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: quoted-printable On Fri, 1 Mar 2024 17:54:06 +0000 Robin Murphy wrote: > On 2024-03-01 5:08 pm, Petr Tesa=C5=99=C3=ADk wrote: > > On Fri, 1 Mar 2024 16:39:27 +0100 > > Petr Tesa=C5=99=C3=ADk wrote: > > =20 > >> On Thu, 29 Feb 2024 16:47:56 +0100 > >> Christoph Hellwig wrote: > >> =20 > >>> On Thu, Feb 29, 2024 at 03:44:11PM +0000, Michael Kelley wrote: =20 > >>>> Any thoughts on how that historical behavior should apply if > >>>> the DMA min_align_mask is non-zero, or the alloc_align_mask > >>>> parameter to swiotbl_tbl_map_single() is non-zero? As currently > >>>> used, alloc_align_mask is page aligned if the IOMMU granule is =20 > >>>>> =3D PAGE_SIZE. But a non-zero min_align_mask could mandate =20 > >>>> returning a buffer that is not page aligned. Perhaps do the > >>>> historical behavior only if alloc_align_mask and min_align_mask > >>>> are both zero? =20 > >>> > >>> I think the driver setting min_align_mask is a clear indicator > >>> that the driver requested a specific alignment and the defaults > >>> don't apply. For swiotbl_tbl_map_single as used by dma-iommu > >>> I'd have to tak a closer look at how it is used. =20 > >> > >> I'm not sure it helps in this discussion, but let me dive into a bit > >> of ancient history to understand how we ended up here. > >> > >> IIRC this behaviour was originally motivated by limitations of PC AT > >> hardware. Intel 8237 is a 16-bit DMA controller. To make it somehow > >> usable with addresses up to 16MB (yeah, the infamous DMA zone), IBM > >> added a page register, but it was on a separate chip and it did not > >> increment when the 8237 address rolled over back to zero. Effectively, > >> the page register selected a 64K-aligned window of 64K buffers. > >> Consequently, DMA buffers could not cross a 64K physical boundary. > >> > >> Thanks to how the buddy allocator works, the 64K-boundary constraint > >> was satisfied by allocation size, and drivers took advantage of it when > >> allocating device buffers. IMO software bounce buffers simply followed > >> the same logic that worked for buffers allocated by the buddy allocato= r. > >> > >> OTOH min_align_mask was motivated by NVME which prescribes the value of > >> a certain number of low bits in the DMA address (for simplicity assumed > >> to be identical with the same bits in the physical address). > >> > >> The only pre-existing user of alloc_align_mask is x86 IOMMU code, and > >> IIUC it is used to guarantee that unaligned transactions do not share > >> the IOMMU granule with another device. This whole thing is weird, > >> because swiotlb_tbl_map_single() is called like this: > >> > >> aligned_size =3D iova_align(iovad, size); > >> phys =3D swiotlb_tbl_map_single(dev, phys, size, alig= ned_size, > >> iova_mask(iovad), dir, = attrs); > >> > >> Here: > >> > >> * alloc_size =3D iova_align(iovad, size) > >> * alloc_align_mask =3D iova_mask(iovad) > >> > >> Now, iova_align() rounds up its argument to a multiple of iova granule > >> and iova_mask() is simply "granule - 1". This works, because granule > >> size must be a power of 2, and I assume it must also be >=3D PAGE_SIZE. > >> > >> In that case, the alloc_align_mask argument is not even needed if you > >> adjust the code to match documentation---the resulting buffer will be > >> aligned to a granule boundary by virtue of having a size that is a > >> multiple of the granule size. > >> > >> To sum it up: > >> > >> 1. min_align_mask is by far the most important constraint. Devices will > >> simply stop working if it is not met. > >> 2. Alignment to the smallest PAGE_SIZE order which is greater than or > >> equal to the requested size has been documented, and some drivers > >> may rely on it. > >> 3. alloc_align_mask is a misguided fix for a bug in the above. > >> > >> Correct me if anything of the above is wrong. =20 > >=20 > > I thought about it some more, and I believe I know what should happen > > if the first two constraints appear to be mutually exclusive. > >=20 > > First, the alignment based on size does not guarantee that the resulting > > physical address is aligned. In fact, the lowest IO_TLB_SHIFT bits must > > be always identical to the original buffer address. > >=20 > > Let's take an example request like this: > >=20 > > TLB_SIZE =3D 0x00000800 > > min_align_mask =3D 0x0000ffff > > orig_addr =3D 0x....1234 > > alloc_size =3D 0x00002800 > >=20 > > Minimum alignment mask requires to keep the 1234 at the end. Allocation > > size requires a buffer that is aligned to 16K. Of course, there is no > > 16K-aligned slot with slot_address & 0x7ff =3D=3D 0x200, but if IO_TLB_= SHIFT > > was 14, it would be slot_address & 0x3fff =3D=3D 0 (low IO_TLB_SHIFT are > > masked off). Since the SWIOTLB API does not guarantee any specific > > value of IO_TLB_SHIFT, callers cannot rely on it. That means 0x1234 is a > > perfectly valid bounce buffer address for this example. > >=20 > > The caller may rightfully expect that the 16K granule containing the > > bounce buffer is not shared with any other user. For the above case I > > suggest to increase the allocation size to 0x4000 already in > > swiotlb_tbl_map_single() and treat 0x1234 as the offset from the slot > > address. =20 >=20 > That doesn't make sense - a caller asks to map some range of kernel=20 > addresses and they get back a corresponding range of DMA addresses; they= =20 > cannot make any reasonable assumptions about DMA addresses *outside*=20 > that range. In the example above, the caller has explicitly chosen not=20 > to map the range xxx0000-xxx1234; if they expect the device to actually=20 > access bytes in the DMA range yyy0000-yyy1234, then they should have=20 > mapped the whole range starting from xxx0000 and it is their own error. I agree that the range was not requested. But it is not wrong if SWIOTLB overallocates. In fact, it usually does overallocate because it works with slot granularity. > SWIOTLB does not and cannot provide any memory protection itself, so=20 > there is no functional benefit to automatically over-allocating, all it=20 > will do is waste slots. iommu-dma *can* provide memory protection=20 > between individual mappings via additional layers that SWIOTLB doesn't=20 > know about, so in that case it's iommu-dma's responsibility to=20 > explicitly manage whatever over-allocation is necessary at the SWIOTLB=20 > level to match the IOMMU level. I'm trying to understand what the caller expects to get if they request both buffer alignment (either given implicitly through mapping size or explicitly with an alloc_align_mask) with a min_align_mask and non-zero low bits covered by the buffer alignment. In other words, if iommu_dma_map_page() gets into this situation: * granule size is 4k * device specifies 64k min_align_mask * bit 11 of the original buffer address is non-zero Then you ask for a pair of slots where the first slot has bit 11 =3D=3D 0 (required by alignment to granule size) and also has bit 11 =3D=3D 1 (required to preserve the lowest 16 bits of the original address). Sure, you can fail such a mapping, but is it what the caller expects? Thanks Petr T