From: Zhenguo Yao
Date: Wed, 22 Nov 2023 09:02:13 +0800
In-Reply-To: <20231117152746.3aa55d68.alex.williamson@redhat.com>
References: <20231115020209.4665-1-yaozhenguo@jd.com> <20231117152746.3aa55d68.alex.williamson@redhat.com>
Subject: Re: [PATCH V1] vfio: add attach_group_by_node to control behavior of attaching group to domain
To: Alex Williamson
Cc: yaozhenguo@jd.com, dwmw2@infradead.org, baolu.lu@linux.intel.com, joro@8bytes.org, will@kernel.org, robin.murphy@arm.com, iommu@lists.linux.dev, linux-kernel@vger.kernel.org, kvm@vger.kernel.org, Wenchao Yao, ZiHan Zhou, Jason Gunthorpe

Understood, thanks!

Alex Williamson wrote on Sat, Nov 18, 2023 at 06:27:
>
> On Wed, 15 Nov 2023 10:02:09 +0800
> yaozhenguo wrote:
>
> > From: Zhenguo Yao
> >
> > Groups will attach to one iommu_domain if their ops and
> > enforce_cache_coherency are equal, and all the IOMMU hardware then
> > shares one page table by default. There are performance issues in
> > some scenarios. For example, the host hardware topology:
> >
> > node0 + PCIe RP0 ---+ GPU A100
> >       |             |---+ GPU A100
> >       |             |---+ NIC Mellanox CX6
> >       |             |---+ NIC Mellanox CX6
> >       + PCIe RP1 ---+ GPU A100
> >                     |---+ GPU A100
> >                     |---+ NIC Mellanox CX6
> >                     |---+ NIC Mellanox CX6
> > node1 + PCIe RP0 ---+ GPU A100
> >       |             |---+ GPU A100
> >       |             |---+ NIC Mellanox CX6
> >       |             |---+ NIC Mellanox CX6
> >       + PCIe RP1 ---+ GPU A100
> >                     |---+ GPU A100
> >                     |---+ NIC Mellanox CX6
> >                     |---+ NIC Mellanox CX6
> >
> > We pass through all NICs and GPUs to the VM and emulate the host
> > hardware topology. The Mellanox CX6 ATS feature is enabled, and
> > GPUDirect RDMA is enabled. We test NCCL allreduce in the VM in
> > three cases:
> >
> > Case 1: allreduce test using 4 NICs and 4 GPUs in numa0.
> > Case 2: allreduce test using 4 NICs and 4 GPUs in numa1.
> > Case 3: allreduce test using 8 NICs and 8 GPUs.
> >
> > The results are below:
> >
> > | case   | algbw (GB/s) |
> > | ------ | ------------ |
> > | case1  | 24           |
> > | case2  | 32           |
> > | case3  | 45           |
> >
> > We checked that the IOMMU page table is allocated in numa1 when the
> > VM boots up. So, if an IOTLB miss happens, the IOMMU hardware in
> > numa0 will access the remote page table in numa1, which degrades
> > performance. After applying this patch with attach_group_by_node
> > set to 1, groups in the same node will attach to one domain, and
> > each IOMMU will access its local page table.
> > Performance is improved:
> >
> > | case   | algbw (GB/s) |
> > | ------ | ------------ |
> > | case1  | 32           |
> > | case2  | 32           |
> > | case3  | 63           |
> >
> > Signed-off-by: Zhenguo Yao
> > Co-developed-by: Wenchao Yao
> > Signed-off-by: Wenchao Yao
> > Co-developed-by: ZiHan Zhou
> > Signed-off-by: ZiHan Zhou
> > ---
> >  drivers/iommu/intel/iommu.c     |  8 +++++++-
> >  drivers/vfio/vfio_iommu_type1.c | 33 +++++++++++++++++++++------------
> >  include/linux/iommu.h           |  1 +
> >  3 files changed, 29 insertions(+), 13 deletions(-)
> >
> > diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
> > index 3531b95..2c6d8f0 100644
> > --- a/drivers/iommu/intel/iommu.c
> > +++ b/drivers/iommu/intel/iommu.c
> > @@ -569,8 +569,10 @@ void domain_update_iommu_cap(struct dmar_domain *domain)
> >          * If RHSA is missing, we should default to the device numa domain
> >          * as fall back.
> >          */
> > -       if (domain->nid == NUMA_NO_NODE)
> > +       if (domain->nid == NUMA_NO_NODE) {
> >                 domain->nid = domain_update_device_node(domain);
> > +               domain->domain.nid = domain->nid;
> > +       }
> >
> >         /*
> >          * First-level translation restricts the input-address to a
> > @@ -1767,6 +1769,7 @@ static struct dmar_domain *alloc_domain(unsigned int type)
> >                 return NULL;
> >
> >         domain->nid = NUMA_NO_NODE;
> > +       domain->domain.nid = NUMA_NO_NODE;
> >         if (first_level_by_default(type))
> >                 domain->use_first_level = true;
> >         domain->has_iotlb_device = false;
> > @@ -1808,6 +1811,8 @@ int domain_attach_iommu(struct dmar_domain *domain, struct intel_iommu *iommu)
> >         info->refcnt = 1;
> >         info->did = num;
> >         info->iommu = iommu;
> > +       domain->nid = iommu->node;
> > +       domain->domain.nid = iommu->node;
> >         curr = xa_cmpxchg(&domain->iommu_array, iommu->seq_id,
> >                           NULL, info, GFP_ATOMIC);
> >         if (curr) {
> > @@ -1837,6 +1842,7 @@ void domain_detach_iommu(struct dmar_domain *domain, struct intel_iommu *iommu)
> >         clear_bit(info->did, iommu->domain_ids);
> >         xa_erase(&domain->iommu_array, iommu->seq_id);
> >         domain->nid = NUMA_NO_NODE;
> > +       domain->domain.nid = NUMA_NO_NODE;
> >         domain_update_iommu_cap(domain);
> >         kfree(info);
> >  }
> > diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> > index eacd6ec..6a5641e 100644
> > --- a/drivers/vfio/vfio_iommu_type1.c
> > +++ b/drivers/vfio/vfio_iommu_type1.c
> > @@ -59,6 +59,11 @@
> >  module_param_named(dma_entry_limit, dma_entry_limit, uint, 0644);
> >  MODULE_PARM_DESC(dma_entry_limit,
> >                  "Maximum number of user DMA mappings per container (65535).");
> > +static uint attach_group_by_node;
> > +module_param_named(attach_group_by_node,
> > +                  attach_group_by_node, uint, 0644);
> > +MODULE_PARM_DESC(attach_group_by_node,
> > +                "Attach group to domain when it's in same node");
> >
> >  struct vfio_iommu {
> >         struct list_head domain_list;
> > @@ -2287,19 +2292,23 @@ static int vfio_iommu_type1_attach_group(void *iommu_data,
> >                 if (d->domain->ops == domain->domain->ops &&
> >                     d->enforce_cache_coherency ==
> >                             domain->enforce_cache_coherency) {
> > -                       iommu_detach_group(domain->domain, group->iommu_group);
> > -                       if (!iommu_attach_group(d->domain,
> > -                                               group->iommu_group)) {
> > -                               list_add(&group->next, &d->group_list);
> > -                               iommu_domain_free(domain->domain);
> > -                               kfree(domain);
> > -                               goto done;
> > -                       }
> > +                       if ((attach_group_by_node == 1 &&
> > +                            d->domain->nid == domain->domain->nid) ||
> > +                           attach_group_by_node == 0) {
> > +                               iommu_detach_group(domain->domain, group->iommu_group);
> > +                               if (!iommu_attach_group(d->domain,
> > +                                                       group->iommu_group)) {
> > +                                       list_add(&group->next, &d->group_list);
> > +                                       iommu_domain_free(domain->domain);
> > +                                       kfree(domain);
> > +                                       goto done;
> > +                               }
> >
> > -                       ret = iommu_attach_group(domain->domain,
> > -                                                group->iommu_group);
> > -                       if (ret)
> > -                               goto out_domain;
> > +                               ret = iommu_attach_group(domain->domain,
> > +                                                        group->iommu_group);
> > +                               if (ret)
> > +                                       goto out_domain;
> > +                       }
> >                 }
> >         }
> >
> > diff --git a/include/linux/iommu.h b/include/linux/iommu.h
> > index ec289c1..c1330ed 100644
> > --- a/include/linux/iommu.h
> > +++ b/include/linux/iommu.h
> > @@ -123,6 +123,7 @@ struct iommu_domain {
> >                         int users;
> >                 };
> >         };
> > +       int nid;
> >  };
> >
> >  static inline bool iommu_is_dma_domain(struct iommu_domain *domain)
>
> As I understand what's being done here, we're duplicating
> dmar_domain.nid to iommu_domain.nid, then when enabled by this new
> module option, we'll use this node id as part of the match to determine
> whether to create a new domain within the same container context or
> re-use an existing domain, which may have unfavorable locality.
>
> If we're going to implement a node id on the iommu_domain, it should
> replace the existing use of node id in the device specific structure
> and not simply duplicate it. This should account for non-VT-d use
> cases as well; for example, AMD IOMMU also has a nid field on their
> protection_domain structure. Alternatively this might be implemented
> through iommu_domain_ops so we could query the node association for a
> domain.
>
> I question whether we need this solution at all though. AIUI the
> initial domain is allocated in proximity to the initial group. The
> problem comes when the user asks to add an additional group into the
> same container. Another valid solution would be that the user
> recognizes that these groups are not within the same locality and
> creates a separate container for this group. In fact, if we're using
> QEMU here and created a q35 VM with vIOMMU, each device would have a
> separate address space and therefore a separate container, and we'd
> already avoid the issue this patch tries to solve.
>
> Separate containers per QEMU AddressSpace are a requirement, but QEMU
> might also implement a policy to not re-use vfio containers between
> virtual nodes, such that if each locality were mapped to separate PXBs
> with unique proximities, then simply reflecting the physical locality
> into the VM would be sufficient to avoid this non-optimal domain
> allocation placement.
>
> In any case, the type1 vfio IOMMU backend is in the early stages of
> deprecation, so any choices we make here would also need to be
> reflected in IOMMUFD, in both the compatibility and native interfaces.
> Thanks,
>
> Alex
>
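For illustration, the iommu_domain_ops alternative suggested above
might look roughly like the sketch below. This is only a sketch under
stated assumptions: iommu_domain_ops has no get_numa_node callback
today, and the helper iommu_domain_numa_node() is hypothetical, not
code from this patch or from mainline.

/*
 * Sketch only: assumes a hypothetical get_numa_node hook were added
 * to iommu_domain_ops and implemented per driver (e.g. VT-d returning
 * dmar_domain's nid, AMD returning protection_domain's nid).
 */
#include <linux/iommu.h>
#include <linux/numa.h>

static int iommu_domain_numa_node(struct iommu_domain *domain)
{
        /* Ask the driver where the domain's page tables live. */
        if (domain->ops && domain->ops->get_numa_node)
                return domain->ops->get_numa_node(domain);

        return NUMA_NO_NODE;    /* Driver does not report locality. */
}

vfio_iommu_type1_attach_group() could then gate domain reuse on
locality without duplicating the node id into iommu_domain at all:

        /* Reuse an existing domain only when ops, coherency, and
         * (if the option is enabled) NUMA locality all match. */
        if (d->domain->ops == domain->domain->ops &&
            d->enforce_cache_coherency == domain->enforce_cache_coherency &&
            (!attach_group_by_node ||
             iommu_domain_numa_node(d->domain) ==
             iommu_domain_numa_node(domain->domain))) {
                /* ... attach the group to d->domain as in the patch ... */
        }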