Date: Thu, 20 Dec 2018 14:30:06 -0200
From: Murilo Opsfelder Araujo
To: Alexey Kardashevskiy
Cc: linuxppc-dev@lists.ozlabs.org, Christoph Hellwig, Jose Ricardo Ziviani,
 kvm@vger.kernel.org, Alistair Popple, Daniel Henrique Barboza,
 Alex Williamson, kvm-ppc@vger.kernel.org, linux-kernel@vger.kernel.org,
 Sam Bobroff, Piotr Jaroszynski, Leonardo Augusto Guimarães Garcia,
 Reza Arbab, David Gibson
Subject: Re: [PATCH kernel v7 20/20] vfio_pci: Add NVIDIA GV100GL [Tesla V100 SXM2] subdriver
References: <20181220082350.58113-1-aik@ozlabs.ru> <20181220082350.58113-21-aik@ozlabs.ru>
In-Reply-To: <20181220082350.58113-21-aik@ozlabs.ru>
Message-Id: <20181220163006.GA31987@kermit-br-ibm-com.br.ibm.com>
User-Agent: Mutt/1.10.1 (2018-07-13)
X-Mailing-List: linux-kernel@vger.kernel.org

On Thu, Dec 20, 2018 at 07:23:50PM +1100, Alexey Kardashevskiy wrote:
> POWER9 Witherspoon machines come with 4 or 6 V100 GPUs which are not
> pluggable PCIe devices but still have PCIe links which are used
> for config space and MMIO. In addition to that the GPUs have 6 NVLinks
> which are connected to other GPUs and the POWER9 CPU. POWER9 chips
> have a special unit on a die called an NPU which is an NVLink2 host bus
> adapter with p2p connections to 2 to 3 GPUs, 3 or 2 NVLinks to each.
> These systems also support ATS (address translation services) which is
> a part of the NVLink2 protocol. Such GPUs also share on-board RAM
> (16GB or 32GB) to the system via the same NVLink2 so a CPU has
> cache-coherent access to a GPU RAM.
>
> This exports GPU RAM to the userspace as a new VFIO device region. This
> preregisters the new memory as device memory as it might be used for DMA.
> This inserts pfns from the fault handler as the GPU memory is not onlined
> until the vendor driver is loaded and trained the NVLinks so doing this
> earlier causes low level errors which we fence in the firmware so
> it does not hurt the host system but still better be avoided; for the same
> reason this does not map GPU RAM into the host kernel (usual thing for
> emulated access otherwise).
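
The fault-handler approach described above boils down to a small piece of address arithmetic: the faulting user address is turned into a page offset within the mapping and added to the first page frame of the GPU RAM window. A minimal sketch of that computation, assuming 64K pages and a 40-bit region offset encoding in vfio-pci (both values are assumptions here, and `sketch_nvgpu_fault_pfn` is a hypothetical standalone helper; the real handler works on `struct vm_fault`):

```c
#include <stdint.h>

/* Assumed for illustration only: 64K pages and a 40-bit per-region offset. */
#define SKETCH_PAGE_SHIFT        16
#define SKETCH_VFIO_OFFSET_SHIFT 40

/* Hypothetical helper: which page frame backs a faulting user address. */
static uint64_t sketch_nvgpu_fault_pfn(uint64_t gpu_hpa, uint64_t vm_start,
				       uint64_t vm_pgoff, uint64_t fault_addr)
{
	/* Page index of the faulting address inside the mapping. */
	uint64_t vmf_off = (fault_addr - vm_start) >> SKETCH_PAGE_SHIFT;
	/* First page frame of the GPU RAM window. */
	uint64_t nv2pg = gpu_hpa >> SKETCH_PAGE_SHIFT;
	/* Drop the region index kept in the upper bits of the mmap offset. */
	uint64_t pgoff = vm_pgoff &
		((UINT64_C(1) << (SKETCH_VFIO_OFFSET_SHIFT - SKETCH_PAGE_SHIFT)) - 1);

	return nv2pg + pgoff + vmf_off;
}
```

With these assumptions, a fault three pages into a mapping that was created at page offset 2 of the region resolves to the window's first page frame plus 5.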
>
> This exports an ATSD (Address Translation Shootdown) register of NPU which
> allows TLB invalidations inside GPU for an operating system. The register
> conveniently occupies a single 64k page. It is also presented to
> the userspace as a new VFIO device region. One NPU has 8 ATSD registers,
> each of them can be used for TLB invalidation in a GPU linked to this NPU.
> This allocates one ATSD register per an NVLink bridge allowing passing
> up to 6 registers. Due to the host firmware bug (just recently fixed),
> only 1 ATSD register per NPU was actually advertised to the host system
> so this passes that alone register via the first NVLink bridge device in
> the group which is still enough as QEMU collects them all back and
> presents to the guest via vPHB to mimic the emulated NPU PHB on the host.
>
> In order to provide the userspace with the information about GPU-to-NVLink
> connections, this exports an additional capability called "tgt"
> (which is an abbreviated host system bus address). The "tgt" property
> tells the GPU its own system address and allows the guest driver to
> conglomerate the routing information so each GPU knows how to get directly
> to the other GPUs.
>
> For ATS to work, the nest MMU (an NVIDIA block in a P9 CPU) needs to
> know LPID (a logical partition ID or a KVM guest hardware ID in other
> words) and PID (a memory context ID of a userspace process, not to be
> confused with a linux pid). This assigns a GPU to LPID in the NPU and
> this is why this adds a listener for KVM on an IOMMU group. A PID comes
> via NVLink from a GPU and NPU uses a PID wildcard to pass it through.
>
> This requires coherent memory and ATSD to be available on the host as
> the GPU vendor only supports configurations with both features enabled
> and other configurations are known not to work. Because of this and
> because of the ways the features are advertised to the host system
> (which is a device tree with very platform specific properties),
> this requires enabled POWERNV platform.
>
> The V100 GPUs do not advertise any of these capabilities via the config
> space and there are more than just one device ID so this relies on
> the platform to tell whether these GPUs have special abilities such as
> NVLinks.
>
> Signed-off-by: Alexey Kardashevskiy
> ---
> Changes:
> v6.1:
> * fixed outdated comment about VFIO_REGION_INFO_CAP_NVLINK2_LNKSPD
>
> v6:
> * reworked capabilities - tgt for nvlink and gpu and link-speed
> for nvlink only
>
> v5:
> * do not memremap GPU RAM for emulation, map it only when it is needed
> * allocate 1 ATSD register per NVLink bridge, if none left, then expose
> the region with a zero size
> * separate caps per device type
> * addressed AW review comments
>
> v4:
> * added nvlink-speed to the NPU bridge capability as this turned out to
> be not a constant value
> * instead of looking at the exact device ID (which also changes from system
> to system), now this (indirectly) looks at the device tree to know
> if GPU and NPU support NVLink
>
> v3:
> * reworded the commit log about tgt
> * added tracepoints (do we want them enabled for entire vfio-pci?)
> * added code comments
> * added write|mmap flags to the new regions
> * auto enabled VFIO_PCI_NVLINK2 config option
> * added 'tgt' capability to a GPU so QEMU can recreate ibm,npu and ibm,gpu
> references; there are required by the NVIDIA driver
> * keep notifier registered only for short time
> ---
>  drivers/vfio/pci/Makefile           |   1 +
>  drivers/vfio/pci/trace.h            | 102 ++++++
>  drivers/vfio/pci/vfio_pci_private.h |  14 +
>  include/uapi/linux/vfio.h           |  37 +++
>  drivers/vfio/pci/vfio_pci.c         |  27 +-
>  drivers/vfio/pci/vfio_pci_nvlink2.c | 482 ++++++++++++++++++++++++++++
>  drivers/vfio/pci/Kconfig            |   6 +
>  7 files changed, 667 insertions(+), 2 deletions(-)
>  create mode 100644 drivers/vfio/pci/trace.h
>  create mode 100644 drivers/vfio/pci/vfio_pci_nvlink2.c
>
> diff --git a/drivers/vfio/pci/Makefile b/drivers/vfio/pci/Makefile
> index 76d8ec0..9662c06 100644
> --- a/drivers/vfio/pci/Makefile
> +++ b/drivers/vfio/pci/Makefile
> @@ -1,5 +1,6 @@
>
>  vfio-pci-y := vfio_pci.o vfio_pci_intrs.o vfio_pci_rdwr.o vfio_pci_config.o
>  vfio-pci-$(CONFIG_VFIO_PCI_IGD) += vfio_pci_igd.o
> +vfio-pci-$(CONFIG_VFIO_PCI_NVLINK2) += vfio_pci_nvlink2.o
>
>  obj-$(CONFIG_VFIO_PCI) += vfio-pci.o
> diff --git a/drivers/vfio/pci/trace.h b/drivers/vfio/pci/trace.h
> new file mode 100644
> index 0000000..b80d2d3
> --- /dev/null
> +++ b/drivers/vfio/pci/trace.h
> @@ -0,0 +1,102 @@
> +/* SPDX-License-Identifier: GPL-2.0+ */
> +/*
> + * VFIO PCI mmap/mmap_fault tracepoints
> + *
> + * Copyright (C) 2018 IBM Corp. All rights reserved.
> + * Author: Alexey Kardashevskiy
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + */
> +
> +#undef TRACE_SYSTEM
> +#define TRACE_SYSTEM vfio_pci
> +
> +#if !defined(_TRACE_VFIO_PCI_H) || defined(TRACE_HEADER_MULTI_READ)
> +#define _TRACE_VFIO_PCI_H
> +
> +#include
> +
> +TRACE_EVENT(vfio_pci_nvgpu_mmap_fault,
> +	TP_PROTO(struct pci_dev *pdev, unsigned long hpa, unsigned long ua,
> +			vm_fault_t ret),
> +	TP_ARGS(pdev, hpa, ua, ret),
> +
> +	TP_STRUCT__entry(
> +		__field(const char *, name)
> +		__field(unsigned long, hpa)
> +		__field(unsigned long, ua)
> +		__field(int, ret)
> +	),
> +
> +	TP_fast_assign(
> +		__entry->name = dev_name(&pdev->dev),
> +		__entry->hpa = hpa;
> +		__entry->ua = ua;
> +		__entry->ret = ret;
> +	),
> +
> +	TP_printk("%s: %lx -> %lx ret=%d", __entry->name, __entry->hpa,
> +			__entry->ua, __entry->ret)
> +);
> +
> +TRACE_EVENT(vfio_pci_nvgpu_mmap,
> +	TP_PROTO(struct pci_dev *pdev, unsigned long hpa, unsigned long ua,
> +			unsigned long size, int ret),
> +	TP_ARGS(pdev, hpa, ua, size, ret),
> +
> +	TP_STRUCT__entry(
> +		__field(const char *, name)
> +		__field(unsigned long, hpa)
> +		__field(unsigned long, ua)
> +		__field(unsigned long, size)
> +		__field(int, ret)
> +	),
> +
> +	TP_fast_assign(
> +		__entry->name = dev_name(&pdev->dev),
> +		__entry->hpa = hpa;
> +		__entry->ua = ua;
> +		__entry->size = size;
> +		__entry->ret = ret;
> +	),
> +
> +	TP_printk("%s: %lx -> %lx size=%lx ret=%d", __entry->name, __entry->hpa,
> +			__entry->ua, __entry->size, __entry->ret)
> +);
> +
> +TRACE_EVENT(vfio_pci_npu2_mmap,
> +	TP_PROTO(struct pci_dev *pdev, unsigned long hpa, unsigned long ua,
> +			unsigned long size, int ret),
> +	TP_ARGS(pdev, hpa, ua, size, ret),
> +
> +	TP_STRUCT__entry(
> +		__field(const char *, name)
> +		__field(unsigned long, hpa)
> +		__field(unsigned long, ua)
> +		__field(unsigned long, size)
> +		__field(int, ret)
> +	),
> +
> +	TP_fast_assign(
> +		__entry->name = dev_name(&pdev->dev),
> +		__entry->hpa = hpa;
> +		__entry->ua = ua;
> +		__entry->size = size;
> +		__entry->ret = ret;
> +	),
> +
> +	TP_printk("%s: %lx -> %lx size=%lx ret=%d", __entry->name, __entry->hpa,
> +			__entry->ua, __entry->size, __entry->ret)
> +);
> +
> +#endif /* _TRACE_SUBSYS_H */

I think it's too late, but I guess this line should read:

#endif /* _TRACE_VFIO_PCI_H */

> +
> +#undef TRACE_INCLUDE_PATH
> +#define TRACE_INCLUDE_PATH .
> +#undef TRACE_INCLUDE_FILE
> +#define TRACE_INCLUDE_FILE trace
> +
> +/* This part must be outside protection */
> +#include
> diff --git a/drivers/vfio/pci/vfio_pci_private.h b/drivers/vfio/pci/vfio_pci_private.h
> index 93c1738..127071b 100644
> --- a/drivers/vfio/pci/vfio_pci_private.h
> +++ b/drivers/vfio/pci/vfio_pci_private.h
> @@ -163,4 +163,18 @@ static inline int vfio_pci_igd_init(struct vfio_pci_device *vdev)
>  	return -ENODEV;
>  }
>  #endif
> +#ifdef CONFIG_VFIO_PCI_NVLINK2
> +extern int vfio_pci_nvdia_v100_nvlink2_init(struct vfio_pci_device *vdev);
> +extern int vfio_pci_ibm_npu2_init(struct vfio_pci_device *vdev);
> +#else
> +static inline int vfio_pci_nvdia_v100_nvlink2_init(struct vfio_pci_device *vdev)
> +{
> +	return -ENODEV;
> +}
> +
> +static inline int vfio_pci_ibm_npu2_init(struct vfio_pci_device *vdev)
> +{
> +	return -ENODEV;
> +}
> +#endif
>  #endif /* VFIO_PCI_PRIVATE_H */
> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> index 8131028..5562587 100644
> --- a/include/uapi/linux/vfio.h
> +++ b/include/uapi/linux/vfio.h
> @@ -353,6 +353,21 @@ struct vfio_region_gfx_edid {
>  #define VFIO_DEVICE_GFX_LINK_STATE_DOWN 2
>  };
>
> +/*
> + * 10de vendor sub-type
> + *
> + * NVIDIA GPU NVlink2 RAM is coherent RAM mapped onto the host address space.
> + */
> +#define VFIO_REGION_SUBTYPE_NVIDIA_NVLINK2_RAM (1)
> +
> +/*
> + * 1014 vendor sub-type
> + *
> + * IBM NPU NVlink2 ATSD (Address Translation Shootdown) register of NPU
> + * to do TLB invalidation on a GPU.
> + */
> +#define VFIO_REGION_SUBTYPE_IBM_NVLINK2_ATSD (1)
> +
>  /*
>   * The MSIX mappable capability informs that MSIX data of a BAR can be mmapped
>   * which allows direct access to non-MSIX registers which happened to be within
> @@ -363,6 +378,28 @@ struct vfio_region_gfx_edid {
>   */
>  #define VFIO_REGION_INFO_CAP_MSIX_MAPPABLE 3
>
> +/*
> + * Capability with compressed real address (aka SSA - small system address)
> + * where GPU RAM is mapped on a system bus. Used by a GPU for DMA routing.
> + */
> +#define VFIO_REGION_INFO_CAP_NVLINK2_SSATGT 4
> +
> +struct vfio_region_info_cap_nvlink2_ssatgt {
> +	struct vfio_info_cap_header header;
> +	__u64 tgt;
> +};
> +
> +/*
> + * Capability with an NVLink link speed.
> + */
> +#define VFIO_REGION_INFO_CAP_NVLINK2_LNKSPD 5
> +
> +struct vfio_region_info_cap_nvlink2_lnkspd {
> +	struct vfio_info_cap_header header;
> +	__u32 link_speed;
> +	__u32 __pad;
> +};
> +
>  /**
>   * VFIO_DEVICE_GET_IRQ_INFO - _IOWR(VFIO_TYPE, VFIO_BASE + 9,
>   * struct vfio_irq_info)
> diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
> index 6cb70cf..67c03f2 100644
> --- a/drivers/vfio/pci/vfio_pci.c
> +++ b/drivers/vfio/pci/vfio_pci.c
> @@ -302,14 +302,37 @@ static int vfio_pci_enable(struct vfio_pci_device *vdev)
>  		if (ret) {
>  			dev_warn(&vdev->pdev->dev,
>  				 "Failed to setup Intel IGD regions\n");
> -			vfio_pci_disable(vdev);
> -			return ret;
> +			goto disable_exit;
> +		}
> +	}
> +
> +	if (pdev->vendor == PCI_VENDOR_ID_NVIDIA &&
> +	    IS_ENABLED(CONFIG_VFIO_PCI_NVLINK2)) {
> +		ret = vfio_pci_nvdia_v100_nvlink2_init(vdev);
> +		if (ret && ret != -ENODEV) {
> +			dev_warn(&vdev->pdev->dev,
> +				 "Failed to setup NVIDIA NV2 RAM region\n");
> +			goto disable_exit;
> +		}
> +	}
> +
> +	if (pdev->vendor == PCI_VENDOR_ID_IBM &&
> +	    IS_ENABLED(CONFIG_VFIO_PCI_NVLINK2)) {
> +		ret = vfio_pci_ibm_npu2_init(vdev);
> +		if (ret && ret != -ENODEV) {
> +			dev_warn(&vdev->pdev->dev,
> +				 "Failed to setup NVIDIA NV2 ATSD region\n");
> +			goto disable_exit;
>  		}
>  	}
>
>  	vfio_pci_probe_mmaps(vdev);
>
>  	return 0;
> +
> +disable_exit:
> +	vfio_pci_disable(vdev);
> +	return ret;
>  }
>
>  static void vfio_pci_disable(struct vfio_pci_device *vdev)
> diff --git a/drivers/vfio/pci/vfio_pci_nvlink2.c b/drivers/vfio/pci/vfio_pci_nvlink2.c
> new file mode 100644
> index 0000000..054a2cf
> --- /dev/null
> +++ b/drivers/vfio/pci/vfio_pci_nvlink2.c
> @@ -0,0 +1,482 @@
> +// SPDX-License-Identifier: GPL-2.0+
> +/*
> + * VFIO PCI NVIDIA Whitherspoon GPU support a.k.a. NVLink2.
> + *
> + * Copyright (C) 2018 IBM Corp. All rights reserved.
> + * Author: Alexey Kardashevskiy
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + *
> + * Register an on-GPU RAM region for cacheable access.
> + *
> + * Derived from original vfio_pci_igd.c:
> + * Copyright (C) 2016 Red Hat, Inc. All rights reserved.
> + * Author: Alex Williamson
> + */
> +
> +#include
> +#include
> +#include
> +#include
> +#include
> +#include
> +#include
> +#include "vfio_pci_private.h"
> +
> +#define CREATE_TRACE_POINTS
> +#include "trace.h"
> +
> +EXPORT_TRACEPOINT_SYMBOL_GPL(vfio_pci_nvgpu_mmap_fault);
> +EXPORT_TRACEPOINT_SYMBOL_GPL(vfio_pci_nvgpu_mmap);
> +EXPORT_TRACEPOINT_SYMBOL_GPL(vfio_pci_npu2_mmap);
> +
> +struct vfio_pci_nvgpu_data {
> +	unsigned long gpu_hpa; /* GPU RAM physical address */
> +	unsigned long gpu_tgt; /* TGT address of corresponding GPU RAM */
> +	unsigned long useraddr; /* GPU RAM userspace address */
> +	unsigned long size; /* Size of the GPU RAM window (usually 128GB) */
> +	struct mm_struct *mm;
> +	struct mm_iommu_table_group_mem_t *mem; /* Pre-registered RAM descr. */
> +	struct pci_dev *gpdev;
> +	struct notifier_block group_notifier;
> +};
> +
> +static size_t vfio_pci_nvgpu_rw(struct vfio_pci_device *vdev,
> +		char __user *buf, size_t count, loff_t *ppos, bool iswrite)
> +{
> +	unsigned int i = VFIO_PCI_OFFSET_TO_INDEX(*ppos) - VFIO_PCI_NUM_REGIONS;
> +	struct vfio_pci_nvgpu_data *data = vdev->region[i].data;
> +	loff_t pos = *ppos & VFIO_PCI_OFFSET_MASK;
> +	loff_t posaligned = pos & PAGE_MASK, posoff = pos & ~PAGE_MASK;
> +	size_t sizealigned;
> +	void __iomem *ptr;
> +
> +	if (pos >= vdev->region[i].size)
> +		return -EINVAL;
> +
> +	count = min(count, (size_t)(vdev->region[i].size - pos));
> +
> +	/*
> +	 * We map only a bit of GPU RAM for a short time instead of mapping it
> +	 * for the guest lifetime as:
> +	 *
> +	 * 1) we do not know GPU RAM size, only aperture which is 4-8 times
> +	 * bigger than actual RAM size (16/32GB RAM vs. 128GB aperture);
> +	 * 2) mapping GPU RAM allows CPU to prefetch and if this happens
> +	 * before NVLink bridge is reset (which fences GPU RAM),
> +	 * hardware management interrupts (HMI) might happen, this
> +	 * will freeze NVLink bridge.
> +	 *
> +	 * This is not fast path anyway.
> +	 */
> +	sizealigned = _ALIGN_UP(posoff + count, PAGE_SIZE);
> +	ptr = ioremap_cache(data->gpu_hpa + posaligned, sizealigned);
> +	if (!ptr)
> +		return -EFAULT;
> +
> +	if (iswrite) {
> +		if (copy_from_user(ptr + posoff, buf, count))
> +			count = -EFAULT;
> +		else
> +			*ppos += count;
> +	} else {
> +		if (copy_to_user(buf, ptr + posoff, count))
> +			count = -EFAULT;
> +		else
> +			*ppos += count;
> +	}
> +
> +	iounmap(ptr);
> +
> +	return count;
> +}
> +
> +static void vfio_pci_nvgpu_release(struct vfio_pci_device *vdev,
> +		struct vfio_pci_region *region)
> +{
> +	struct vfio_pci_nvgpu_data *data = region->data;
> +	long ret;
> +
> +	/* If there were any mappings at all... */
> +	if (data->mm) {
> +		ret = mm_iommu_put(data->mm, data->mem);
> +		WARN_ON(ret);
> +
> +		mmdrop(data->mm);
> +	}
> +
> +	vfio_unregister_notifier(&data->gpdev->dev, VFIO_GROUP_NOTIFY,
> +			&data->group_notifier);
> +
> +	pnv_npu2_unmap_lpar_dev(data->gpdev);
> +
> +	kfree(data);
> +}
> +
> +static vm_fault_t vfio_pci_nvgpu_mmap_fault(struct vm_fault *vmf)
> +{
> +	vm_fault_t ret;
> +	struct vm_area_struct *vma = vmf->vma;
> +	struct vfio_pci_region *region = vma->vm_private_data;
> +	struct vfio_pci_nvgpu_data *data = region->data;
> +	unsigned long vmf_off = (vmf->address - vma->vm_start) >> PAGE_SHIFT;
> +	unsigned long nv2pg = data->gpu_hpa >> PAGE_SHIFT;
> +	unsigned long vm_pgoff = vma->vm_pgoff &
> +		((1U << (VFIO_PCI_OFFSET_SHIFT - PAGE_SHIFT)) - 1);
> +	unsigned long pfn = nv2pg + vm_pgoff + vmf_off;
> +
> +	ret = vmf_insert_pfn(vma, vmf->address, pfn);
> +	trace_vfio_pci_nvgpu_mmap_fault(data->gpdev, pfn << PAGE_SHIFT,
> +			vmf->address, ret);
> +
> +	return ret;
> +}
> +
> +static const struct vm_operations_struct vfio_pci_nvgpu_mmap_vmops = {
> +	.fault = vfio_pci_nvgpu_mmap_fault,
> +};
> +
> +static int vfio_pci_nvgpu_mmap(struct vfio_pci_device *vdev,
> +		struct vfio_pci_region *region, struct vm_area_struct *vma)
> +{
> +	int ret;
> +	struct vfio_pci_nvgpu_data *data = region->data;
> +
> +	if (data->useraddr)
> +		return -EPERM;
> +
> +	if (vma->vm_end - vma->vm_start > data->size)
> +		return -EINVAL;
> +
> +	vma->vm_private_data = region;
> +	vma->vm_flags |= VM_PFNMAP;
> +	vma->vm_ops = &vfio_pci_nvgpu_mmap_vmops;
> +
> +	/*
> +	 * Calling mm_iommu_newdev() here once as the region is not
> +	 * registered yet and therefore right initialization will happen now.
> +	 * Other places will use mm_iommu_find() which returns
> +	 * registered @mem and does not go gup().
> +	 */
> +	data->useraddr = vma->vm_start;
> +	data->mm = current->mm;
> +
> +	atomic_inc(&data->mm->mm_count);
> +	ret = (int) mm_iommu_newdev(data->mm, data->useraddr,
> +			(vma->vm_end - vma->vm_start) >> PAGE_SHIFT,
> +			data->gpu_hpa, &data->mem);
> +
> +	trace_vfio_pci_nvgpu_mmap(vdev->pdev, data->gpu_hpa, data->useraddr,
> +			vma->vm_end - vma->vm_start, ret);
> +
> +	return ret;
> +}
> +
> +static int vfio_pci_nvgpu_add_capability(struct vfio_pci_device *vdev,
> +		struct vfio_pci_region *region, struct vfio_info_cap *caps)
> +{
> +	struct vfio_pci_nvgpu_data *data = region->data;
> +	struct vfio_region_info_cap_nvlink2_ssatgt cap = { 0 };
> +
> +	cap.header.id = VFIO_REGION_INFO_CAP_NVLINK2_SSATGT;
> +	cap.header.version = 1;
> +	cap.tgt = data->gpu_tgt;
> +
> +	return vfio_info_add_capability(caps, &cap.header, sizeof(cap));
> +}
> +
> +static const struct vfio_pci_regops vfio_pci_nvgpu_regops = {
> +	.rw = vfio_pci_nvgpu_rw,
> +	.release = vfio_pci_nvgpu_release,
> +	.mmap = vfio_pci_nvgpu_mmap,
> +	.add_capability = vfio_pci_nvgpu_add_capability,
> +};
> +
> +static int vfio_pci_nvgpu_group_notifier(struct notifier_block *nb,
> +		unsigned long action, void *opaque)
> +{
> +	struct kvm *kvm = opaque;
> +	struct vfio_pci_nvgpu_data *data = container_of(nb,
> +			struct vfio_pci_nvgpu_data,
> +			group_notifier);
> +
> +	if (action == VFIO_GROUP_NOTIFY_SET_KVM && kvm &&
> +			pnv_npu2_map_lpar_dev(data->gpdev,
> +				kvm->arch.lpid, MSR_DR | MSR_PR))
> +		return NOTIFY_BAD;
> +
> +	return NOTIFY_OK;
> +}
> +
> +int vfio_pci_nvdia_v100_nvlink2_init(struct vfio_pci_device *vdev)
> +{
> +	int ret;
> +	u64 reg[2];
> +	u64 tgt = 0;
> +	struct device_node *npu_node, *mem_node;
> +	struct pci_dev *npu_dev;
> +	struct vfio_pci_nvgpu_data *data;
> +	uint32_t mem_phandle = 0;
> +	unsigned long events = VFIO_GROUP_NOTIFY_SET_KVM;
> +
> +	/*
> +	 * PCI config space does not tell us about NVLink presense but
> +	 * platform does, use this.
> +	 */
> +	npu_dev = pnv_pci_get_npu_dev(vdev->pdev, 0);
> +	if (!npu_dev)
> +		return -ENODEV;
> +
> +	npu_node = pci_device_to_OF_node(npu_dev);
> +	if (!npu_node)
> +		return -EINVAL;
> +
> +	if (of_property_read_u32(npu_node, "memory-region", &mem_phandle))
> +		return -EINVAL;
> +
> +	mem_node = of_find_node_by_phandle(mem_phandle);
> +	if (!mem_node)
> +		return -EINVAL;
> +
> +	if (of_property_read_variable_u64_array(mem_node, "reg", reg,
> +			ARRAY_SIZE(reg), ARRAY_SIZE(reg)) !=
> +			ARRAY_SIZE(reg))
> +		return -EINVAL;
> +
> +	if (of_property_read_u64(npu_node, "ibm,device-tgt-addr", &tgt)) {
> +		dev_warn(&vdev->pdev->dev, "No ibm,device-tgt-addr found\n");
> +		return -EFAULT;
> +	}
> +
> +	data = kzalloc(sizeof(*data), GFP_KERNEL);
> +	if (!data)
> +		return -ENOMEM;
> +
> +	data->gpu_hpa = reg[0];
> +	data->gpu_tgt = tgt;
> +	data->size = reg[1];
> +
> +	dev_dbg(&vdev->pdev->dev, "%lx..%lx\n", data->gpu_hpa,
> +			data->gpu_hpa + data->size - 1);
> +
> +	data->gpdev = vdev->pdev;
> +	data->group_notifier.notifier_call = vfio_pci_nvgpu_group_notifier;
> +
> +	ret = vfio_register_notifier(&data->gpdev->dev, VFIO_GROUP_NOTIFY,
> +			&events, &data->group_notifier);
> +	if (ret)
> +		goto free_exit;
> +
> +	/*
> +	 * We have just set KVM, we do not need the listener anymore.
> +	 * Also, keeping it registered means that if more than one GPU is
> +	 * assigned, we will get several similar notifiers notifying about
> +	 * the same device again which does not help with anything.
> +	 */
> +	vfio_unregister_notifier(&data->gpdev->dev, VFIO_GROUP_NOTIFY,
> +			&data->group_notifier);
> +
> +	ret = vfio_pci_register_dev_region(vdev,
> +			PCI_VENDOR_ID_NVIDIA | VFIO_REGION_TYPE_PCI_VENDOR_TYPE,
> +			VFIO_REGION_SUBTYPE_NVIDIA_NVLINK2_RAM,
> +			&vfio_pci_nvgpu_regops,
> +			data->size,
> +			VFIO_REGION_INFO_FLAG_READ |
> +			VFIO_REGION_INFO_FLAG_WRITE |
> +			VFIO_REGION_INFO_FLAG_MMAP,
> +			data);
> +	if (ret)
> +		goto free_exit;
> +
> +	return 0;
> +free_exit:
> +	kfree(data);
> +
> +	return ret;
> +}
> +
> +/*
> + * IBM NPU2 bridge
> + */
> +struct vfio_pci_npu2_data {
> +	void *base; /* ATSD register virtual address, for emulated access */
> +	unsigned long mmio_atsd; /* ATSD physical address */
> +	unsigned long gpu_tgt; /* TGT address of corresponding GPU RAM */
> +	unsigned int link_speed; /* The link speed from DT's ibm,nvlink-speed */
> +};
> +
> +static size_t vfio_pci_npu2_rw(struct vfio_pci_device *vdev,
> +		char __user *buf, size_t count, loff_t *ppos, bool iswrite)
> +{
> +	unsigned int i = VFIO_PCI_OFFSET_TO_INDEX(*ppos) - VFIO_PCI_NUM_REGIONS;
> +	struct vfio_pci_npu2_data *data = vdev->region[i].data;
> +	loff_t pos = *ppos & VFIO_PCI_OFFSET_MASK;
> +
> +	if (pos >= vdev->region[i].size)
> +		return -EINVAL;
> +
> +	count = min(count, (size_t)(vdev->region[i].size - pos));
> +
> +	if (iswrite) {
> +		if (copy_from_user(data->base + pos, buf, count))
> +			return -EFAULT;
> +	} else {
> +		if (copy_to_user(buf, data->base + pos, count))
> +			return -EFAULT;
> +	}
> +	*ppos += count;
> +
> +	return count;
> +}
> +
> +static int vfio_pci_npu2_mmap(struct vfio_pci_device *vdev,
> +		struct vfio_pci_region *region, struct vm_area_struct *vma)
> +{
> +	int ret;
> +	struct vfio_pci_npu2_data *data = region->data;
> +	unsigned long req_len = vma->vm_end - vma->vm_start;
> +
> +	if (req_len != PAGE_SIZE)
> +		return -EINVAL;
> +
> +	vma->vm_flags |= VM_PFNMAP;
> +	vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
> +
> +	ret = remap_pfn_range(vma, vma->vm_start, data->mmio_atsd >> PAGE_SHIFT,
> +			req_len, vma->vm_page_prot);
> +	trace_vfio_pci_npu2_mmap(vdev->pdev, data->mmio_atsd, vma->vm_start,
> +			vma->vm_end - vma->vm_start, ret);
> +
> +	return ret;
> +}
> +
> +static void vfio_pci_npu2_release(struct vfio_pci_device *vdev,
> +		struct vfio_pci_region *region)
> +{
> +	struct vfio_pci_npu2_data *data = region->data;
> +
> +	memunmap(data->base);
> +	kfree(data);
> +}
> +
> +static int vfio_pci_npu2_add_capability(struct vfio_pci_device *vdev,
> +		struct vfio_pci_region *region, struct vfio_info_cap *caps)
> +{
> +	struct vfio_pci_npu2_data *data = region->data;
> +	struct vfio_region_info_cap_nvlink2_ssatgt captgt = { 0 };
> +	struct vfio_region_info_cap_nvlink2_lnkspd capspd = { 0 };
> +	int ret;
> +
> +	captgt.header.id = VFIO_REGION_INFO_CAP_NVLINK2_SSATGT;
> +	captgt.header.version = 1;
> +	captgt.tgt = data->gpu_tgt;
> +
> +	capspd.header.id = VFIO_REGION_INFO_CAP_NVLINK2_LNKSPD;
> +	capspd.header.version = 1;
> +	capspd.link_speed = data->link_speed;
> +
> +	ret = vfio_info_add_capability(caps, &captgt.header, sizeof(captgt));
> +	if (ret)
> +		return ret;
> +
> +	return vfio_info_add_capability(caps, &capspd.header, sizeof(capspd));
> +}
> +
> +static const struct vfio_pci_regops vfio_pci_npu2_regops = {
> +	.rw = vfio_pci_npu2_rw,
> +	.mmap = vfio_pci_npu2_mmap,
> +	.release = vfio_pci_npu2_release,
> +	.add_capability = vfio_pci_npu2_add_capability,
> +};
> +
> +int vfio_pci_ibm_npu2_init(struct vfio_pci_device *vdev)
> +{
> +	int ret;
> +	struct vfio_pci_npu2_data *data;
> +	struct device_node *nvlink_dn;
> +	u32 nvlink_index = 0;
> +	struct pci_dev *npdev = vdev->pdev;
> +	struct device_node *npu_node = pci_device_to_OF_node(npdev);
> +	struct pci_controller *hose = pci_bus_to_host(npdev->bus);
> +	u64 mmio_atsd = 0;
> +	u64 tgt = 0;
> +	u32 link_speed = 0xff;
> +
> +	/*
> +	 * PCI config space does not tell us about NVLink presense but
> +	 * platform does, use this.
> +	 */
> +	if (!pnv_pci_get_gpu_dev(vdev->pdev))
> +		return -ENODEV;
> +
> +	/*
> +	 * NPU2 normally has 8 ATSD registers (for concurrency) and 6 links
> +	 * so we can allocate one register per link, using nvlink index as
> +	 * a key.
> +	 * There is always at least one ATSD register so as long as at least
> +	 * NVLink bridge #0 is passed to the guest, ATSD will be available.
> +	 */
> +	nvlink_dn = of_parse_phandle(npdev->dev.of_node, "ibm,nvlink", 0);
> +	if (WARN_ON(of_property_read_u32(nvlink_dn, "ibm,npu-link-index",
> +			&nvlink_index)))
> +		return -ENODEV;
> +
> +	if (of_property_read_u64_index(hose->dn, "ibm,mmio-atsd", nvlink_index,
> +			&mmio_atsd)) {
> +		dev_warn(&vdev->pdev->dev, "No available ATSD found\n");
> +		mmio_atsd = 0;
> +	}
> +
> +	if (of_property_read_u64(npu_node, "ibm,device-tgt-addr", &tgt)) {
> +		dev_warn(&vdev->pdev->dev, "No ibm,device-tgt-addr found\n");
> +		return -EFAULT;
> +	}
> +
> +	if (of_property_read_u32(npu_node, "ibm,nvlink-speed", &link_speed)) {
> +		dev_warn(&vdev->pdev->dev, "No ibm,nvlink-speed found\n");
> +		return -EFAULT;
> +	}
> +
> +	data = kzalloc(sizeof(*data), GFP_KERNEL);
> +	if (!data)
> +		return -ENOMEM;
> +
> +	data->mmio_atsd = mmio_atsd;
> +	data->gpu_tgt = tgt;
> +	data->link_speed = link_speed;
> +	if (data->mmio_atsd) {
> +		data->base = memremap(data->mmio_atsd, SZ_64K, MEMREMAP_WT);
> +		if (!data->base) {
> +			ret = -ENOMEM;
> +			goto free_exit;
> +		}
> +	}
> +
> +	/*
> +	 * We want to expose the capability even if this specific NVLink
> +	 * did not get its own ATSD register because capabilities
> +	 * belong to VFIO regions and normally there will be ATSD register
> +	 * assigned to the NVLink bridge.
> +	 */
> +	ret = vfio_pci_register_dev_region(vdev,
> +			PCI_VENDOR_ID_IBM |
> +			VFIO_REGION_TYPE_PCI_VENDOR_TYPE,
> +			VFIO_REGION_SUBTYPE_IBM_NVLINK2_ATSD,
> +			&vfio_pci_npu2_regops,
> +			data->mmio_atsd ? PAGE_SIZE : 0,
> +			VFIO_REGION_INFO_FLAG_READ |
> +			VFIO_REGION_INFO_FLAG_WRITE |
> +			VFIO_REGION_INFO_FLAG_MMAP,
> +			data);
> +	if (ret)
> +		goto free_exit;
> +
> +	return 0;
> +
> +free_exit:
> +	kfree(data);
> +
> +	return ret;
> +}
> diff --git a/drivers/vfio/pci/Kconfig b/drivers/vfio/pci/Kconfig
> index 42dc1d3..d0f8e4f 100644
> --- a/drivers/vfio/pci/Kconfig
> +++ b/drivers/vfio/pci/Kconfig
> @@ -38,3 +38,9 @@ config VFIO_PCI_IGD
>  	  and LPC bridge config space.
>
>  	  To enable Intel IGD assignment through vfio-pci, say Y.
> +
> +config VFIO_PCI_NVLINK2
> +	def_bool y
> +	depends on VFIO_PCI && PPC_POWERNV
> +	help
> +	  VFIO PCI support for P9 Witherspoon machine with NVIDIA V100 GPUs
> --
> 2.17.1
>

--
Murilo
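
For the "tgt" capability described in the commit message, userspace would presumably find the value by walking the capability chain that comes back with the region info. A rough sketch of such a walk, assuming the usual VFIO convention that each header's `next` field is an offset from the start of the info buffer and that 0 ends the chain. The `sketch_*` names and the local struct mirrors are hypothetical, introduced only so the code builds without kernel headers; the capability ID 4 is taken from the quoted patch:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Local mirror of the capability header layout (assumed: id, version, next). */
struct sketch_cap_header {
	uint16_t id;
	uint16_t version;
	uint32_t next;	/* offset of the next capability, 0 terminates */
};

/* Local mirror of the SSATGT capability from the quoted patch. */
struct sketch_cap_ssatgt {
	struct sketch_cap_header header;
	uint64_t tgt;
};

#define SKETCH_CAP_NVLINK2_SSATGT 4	/* value from the quoted patch */

/*
 * Hypothetical helper: scan the chain starting at cap_offset inside buf.
 * Returns 0 and stores the TGT address on success, -1 if the capability
 * is absent or the chain is malformed.
 */
static int sketch_find_tgt(const void *buf, size_t len, uint32_t cap_offset,
			   uint64_t *tgt)
{
	const uint8_t *p = buf;
	size_t hops = 0;

	while (cap_offset != 0) {
		struct sketch_cap_header hdr;

		/* Reject offsets that would read past the buffer. */
		if (cap_offset > len || len - cap_offset < sizeof(hdr))
			return -1;
		/* Guard against a chain that loops back on itself. */
		if (++hops > len / sizeof(hdr))
			return -1;

		memcpy(&hdr, p + cap_offset, sizeof(hdr));
		if (hdr.id == SKETCH_CAP_NVLINK2_SSATGT) {
			struct sketch_cap_ssatgt cap;

			if (len - cap_offset < sizeof(cap))
				return -1;
			memcpy(&cap, p + cap_offset, sizeof(cap));
			*tgt = cap.tgt;
			return 0;
		}
		cap_offset = hdr.next;
	}
	return -1;
}
```

In a real caller the buffer and the first offset would come from the region info ioctl; here they are plain parameters so the walk can be exercised on a hand-built buffer.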