From: Tomasz Jeznach
Date: Wed, 24 Apr 2024 16:30:45 -0700
Subject: Re: [PATCH v2 7/7] iommu/riscv: Paging domain support
References: <301244bc3ff5da484b46d3fecc931cdad7d2806f.1713456598.git.tjeznach@rivosinc.com> <20240419125627.GD223006@ziepe.ca>
In-Reply-To: <20240419125627.GD223006@ziepe.ca>
X-Mailing-List: linux-kernel@vger.kernel.org
To: Jason Gunthorpe
Cc: Joerg
Roedel, Will Deacon, Robin Murphy, Paul Walmsley, Palmer Dabbelt, Albert Ou, Anup Patel, Sunil V L, Nick Kossifidis, Sebastien Boeuf, Rob Herring, Krzysztof Kozlowski, Conor Dooley, devicetree@vger.kernel.org, iommu@lists.linux.dev, linux-riscv@lists.infradead.org, linux-kernel@vger.kernel.org, linux@rivosinc.com

On Fri, Apr 19, 2024 at 5:56 AM Jason Gunthorpe wrote:
>
> On Thu, Apr 18, 2024 at 09:32:25AM -0700, Tomasz Jeznach wrote:
>
> > diff --git a/drivers/iommu/riscv/iommu.c b/drivers/iommu/riscv/iommu.c
> > index a4f74588cdc2..32ddc372432d 100644
> > --- a/drivers/iommu/riscv/iommu.c
> > +++ b/drivers/iommu/riscv/iommu.c
> > @@ -46,6 +46,10 @@ MODULE_LICENSE("GPL");
> >  #define dev_to_iommu(dev) \
> >      container_of((dev)->iommu->iommu_dev, struct riscv_iommu_device, iommu)
> >
> > +/* IOMMU PSCID allocation namespace. */
> > +static DEFINE_IDA(riscv_iommu_pscids);
> > +#define RISCV_IOMMU_MAX_PSCID  BIT(20)
> > +
>
> You may consider putting this IDA in the riscv_iommu_device() and move
> the pscid from the domain to the bond?
>

I've been considering containing the IDA inside riscv_iommu_device at some
point, but it made PSCID management more complicated. In the follow-up
patches it is desired for the PSCID to be unique across all IOMMUs in the
system (within the guest's GSCID), as the protection domains might (and
will) be shared between more than a single IOMMU device.

> >  /* Device resource-managed allocations */
> >  struct riscv_iommu_devres {
> >      unsigned long addr;
> > @@ -752,12 +756,77 @@ static int riscv_iommu_ddt_alloc(struct riscv_iommu_device *iommu)
> >      return 0;
> >  }
> >
> > +struct riscv_iommu_bond {
> > +    struct list_head list;
> > +    struct rcu_head rcu;
> > +    struct device *dev;
> > +};
> > +
> > +/* This struct contains protection domain specific IOMMU driver data. */
> > +struct riscv_iommu_domain {
> > +    struct iommu_domain domain;
> > +    struct list_head bonds;
> > +    int pscid;
> > +    int numa_node;
> > +    int amo_enabled:1;
> > +    unsigned int pgd_mode;
> > +    /* paging domain */
> > +    unsigned long pgd_root;
> > +};
>
> Glad to see there is no riscv_iommu_device pointer in the domain!
>
> > +static void riscv_iommu_iotlb_inval(struct riscv_iommu_domain *domain,
> > +                                    unsigned long start, unsigned long end)
> > +{
> > +    struct riscv_iommu_bond *bond;
> > +    struct riscv_iommu_device *iommu;
> > +    struct riscv_iommu_command cmd;
> > +    unsigned long len = end - start + 1;
> > +    unsigned long iova;
> > +
> > +    rcu_read_lock();
> > +    list_for_each_entry_rcu(bond, &domain->bonds, list) {
> > +        iommu = dev_to_iommu(bond->dev);
>
> Pedantically this locking isn't locked right, there is technically
> nothing that prevents bond->dev and the iommu instance struct from
> being freed here. eg iommufd can hit races here if userspace can hot
> unplug devices.
>
> I suggest storing the iommu pointer itself in the bond instead of the
> device then add a synchronize_rcu() to the iommu unregister path.
>

Very good point. Thanks for pointing this out. Reworked to add locking
around list modifications (and to no longer incorrectly rely on the iommu
group mutex locks).

> > +        riscv_iommu_cmd_inval_vma(&cmd);
> > +        riscv_iommu_cmd_inval_set_pscid(&cmd, domain->pscid);
> > +        if (len > 0 && len < RISCV_IOMMU_IOTLB_INVAL_LIMIT) {
> > +            for (iova = start; iova < end; iova += PAGE_SIZE) {
> > +                riscv_iommu_cmd_inval_set_addr(&cmd, iova);
> > +                riscv_iommu_cmd_send(iommu, &cmd, 0);
> > +            }
> > +        } else {
> > +            riscv_iommu_cmd_send(iommu, &cmd, 0);
> > +        }
> > +    }
>
> This seems suboptimal, you probably want to copy the new design that
> Intel is doing where you allocate "bonds" that are already
> de-duplicated. Ie if I have 10 devices on the same iommu sharing the
> domain the above will invalidate the PSCID 10 times. It should only be
> done once.
>
> ie add a "bond" for the (iommu,pscid) and refcount that based on how
> many devices are used. Then another "bond" for the ATS stuff eventually.
>

Agree, not perfect to send duplicate invalidations. This should improve
with follow-up patchsets introducing SVA (reusing the same, extended bond
structure) and an update to send IOTLB range invalidations. For this
change I've decided to go with as simple an implementation as possible
and over-invalidate for domains with multiple devices attached. Hope this
makes sense.

> > +
> > +    list_for_each_entry_rcu(bond, &domain->bonds, list) {
> > +        iommu = dev_to_iommu(bond->dev);
> > +
> > +        riscv_iommu_cmd_iofence(&cmd);
> > +        riscv_iommu_cmd_send(iommu, &cmd, RISCV_IOMMU_QUEUE_TIMEOUT);
> > +    }
> > +    rcu_read_unlock();
> > +}
> > +
>
> > @@ -787,12 +870,390 @@ static int riscv_iommu_attach_domain(struct riscv_iommu_device *iommu,
> >      xchg64(&dc->ta, ta);
> >      xchg64(&dc->tc, tc);
> >
> > -    /* Device context invalidation will be required. Ignoring for now. */
> > +    if (!(tc & RISCV_IOMMU_DC_TC_V))
> > +        continue;
>
> No negative caching in HW?
>

No. Disallowed by the spec.

> > +    /* Invalidate device context cache */
> > +    riscv_iommu_cmd_iodir_inval_ddt(&cmd);
> > +    riscv_iommu_cmd_iodir_set_did(&cmd, fwspec->ids[i]);
> > +    riscv_iommu_cmd_send(iommu, &cmd, 0);
> > +
> > +    if (FIELD_GET(RISCV_IOMMU_PC_FSC_MODE, fsc) == RISCV_IOMMU_DC_FSC_MODE_BARE)
> > +        continue;
> > +
> > +    /* Invalidate last valid PSCID */
> > +    riscv_iommu_cmd_inval_vma(&cmd);
> > +    riscv_iommu_cmd_inval_set_pscid(&cmd, FIELD_GET(RISCV_IOMMU_DC_TA_PSCID, ta));
> > +    riscv_iommu_cmd_send(iommu, &cmd, 0);
> > +    }
> > +
> > +    /* Synchronize directory update */
> > +    riscv_iommu_cmd_iofence(&cmd);
> > +    riscv_iommu_cmd_send(iommu, &cmd, RISCV_IOMMU_IOTINVAL_TIMEOUT);
> > +
> > +    /* Track domain to devices mapping.
> > +     */
> > +    if (bond)
> > +        list_add_rcu(&bond->list, &domain->bonds);
>
> This is in the wrong order, the invalidation on the pscid needs to
> start before the pscid is loaded into HW in the first place otherwise
> concurrent invalidations may miss HW updates.
>
> > +
> > +    /* Remove tracking from previous domain, if needed. */
> > +    iommu_domain = iommu_get_domain_for_dev(dev);
> > +    if (iommu_domain && !!(iommu_domain->type & __IOMMU_DOMAIN_PAGING)) {
>
> No need for !!, && is already booleanizing
>
> > +        domain = iommu_domain_to_riscv(iommu_domain);
> > +        bond = NULL;
> > +        rcu_read_lock();
> > +        list_for_each_entry_rcu(b, &domain->bonds, list) {
> > +            if (b->dev == dev) {
> > +                bond = b;
> > +                break;
> > +            }
> > +        }
> > +        rcu_read_unlock();
> > +
> > +        if (bond) {
> > +            list_del_rcu(&bond->list);
> > +            kfree_rcu(bond, rcu);
> > +        }
> > +    }
> > +
> > +    return 0;
> > +}
>
> > +static inline size_t get_page_size(size_t size)
> > +{
> > +    if (size >= IOMMU_PAGE_SIZE_512G)
> > +        return IOMMU_PAGE_SIZE_512G;
> > +    if (size >= IOMMU_PAGE_SIZE_1G)
> > +        return IOMMU_PAGE_SIZE_1G;
> > +    if (size >= IOMMU_PAGE_SIZE_2M)
> > +        return IOMMU_PAGE_SIZE_2M;
> > +    return IOMMU_PAGE_SIZE_4K;
> > +}
> > +
> > +#define _io_pte_present(pte)    ((pte) & (_PAGE_PRESENT | _PAGE_PROT_NONE))
> > +#define _io_pte_leaf(pte)       ((pte) & _PAGE_LEAF)
> > +#define _io_pte_none(pte)       ((pte) == 0)
> > +#define _io_pte_entry(pn, prot) ((_PAGE_PFN_MASK & ((pn) << _PAGE_PFN_SHIFT)) | (prot))
> > +
> > +static void riscv_iommu_pte_free(struct riscv_iommu_domain *domain,
> > +                                 unsigned long pte, struct list_head *freelist)
> > +{
> > +    unsigned long *ptr;
> > +    int i;
> > +
> > +    if (!_io_pte_present(pte) || _io_pte_leaf(pte))
> > +        return;
> > +
> > +    ptr = (unsigned long *)pfn_to_virt(__page_val_to_pfn(pte));
> > +
> > +    /* Recursively free all sub page table pages */
> > +    for (i = 0; i < PTRS_PER_PTE; i++) {
> > +        pte = READ_ONCE(ptr[i]);
> > +        if (!_io_pte_none(pte)
> > +            && cmpxchg_relaxed(ptr + i, pte, 0) == pte)
> > +            riscv_iommu_pte_free(domain, pte, freelist);
> > +    }
> > +
> > +    if (freelist)
> > +        list_add_tail(&virt_to_page(ptr)->lru, freelist);
> > +    else
> > +        free_page((unsigned long)ptr);
> > +}
>
> Consider putting the page table handling in its own file?
>

It was in a separate file at some point, but merged into iommu.c, as it's
simple enough at only ~300 lines. Probably not worth separating it out.

> > +static int riscv_iommu_attach_paging_domain(struct iommu_domain *iommu_domain,
> > +                                            struct device *dev)
> > +{
> > +    struct riscv_iommu_device *iommu = dev_to_iommu(dev);
> > +    struct riscv_iommu_domain *domain = iommu_domain_to_riscv(iommu_domain);
> > +    struct page *page;
> > +
> > +    if (!riscv_iommu_pt_supported(iommu, domain->pgd_mode))
> > +        return -ENODEV;
> > +
> > +    domain->numa_node = dev_to_node(iommu->dev);
> > +    domain->amo_enabled = !!(iommu->caps & RISCV_IOMMU_CAP_AMO_HWAD);
> > +
> > +    if (!domain->pgd_root) {
> > +        page = alloc_pages_node(domain->numa_node,
> > +                                GFP_KERNEL_ACCOUNT | __GFP_ZERO, 0);
> > +        if (!page)
> > +            return -ENOMEM;
> > +        domain->pgd_root = (unsigned long)page_to_virt(page);
>
> The pgd_root should be allocated by the alloc_paging function, not
> during attach. There is no locking here that will protect against
> concurrent attach and also map before attach should work.
>
> You can pick up the numa affinity from the alloc paging dev pointer
> (note it may be null still in some cases)
>

Good point. Thanks. Will send an update shortly with v3.

> Jason

Ack to all other comments, thank you!

Best,
- Tomasz