Received: by 2002:ad5:474a:0:0:0:0:0 with SMTP id i10csp5403739imu; Wed, 19 Dec 2018 10:29:24 -0800 (PST) X-Google-Smtp-Source: AFSGD/VSXdHDP9+w0J9XD1dblyEiuWWsFYNzAsXKq7MDb0Sz+z4/9hbaTI19nwXLJHy4uk0CKWOh X-Received: by 2002:a62:509b:: with SMTP id g27mr21715318pfj.48.1545244164630; Wed, 19 Dec 2018 10:29:24 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1545244164; cv=none; d=google.com; s=arc-20160816; b=lj3soM3ONEIo3cMtI2gl6OufiVNlfg08bLXCI2ERbbEhVTm6I1n8hc/jpET1F8Pw5H 4qUJojChhMSwlLKSUkp4qfujLQAVLKxKPt3k2S8hccBnxa1rlDfFoJKktPd/OfQtcxdJ koyuENQweXtIY3JZkQalgjYtKosleNNN5QsRkrbXcCn0kISS8wIeIUfVSrUGm5tLNq1B ozR8FVUe8veHAQ7qKGiHW8B48KYYPqyAZ9pLIsqsMGjcL8AwHbzibAgDCC2VLwnUpMq1 2Rw+qFq4UijGiO4LMJgRQ4fN9Pr1Wz3ft/ZIys6Q1fMPzQGW7d7CzFHRR/fWCC2gQhUG ANtQ== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:sender:content-transfer-encoding:mime-version :organization:references:in-reply-to:message-id:subject:cc:to:from :date; bh=31K+hoFrcr3h0klnr3CFoYfNkcrfVPRXc3bDKUlpDBM=; b=urEt2NceaZWTYa3D97PHbAuwJq2e0QlejHfjN/Qukc2QDVvR8Ta00eyHPXYjNEEzll xZsCeSGV6uuv+DDmfM827Ja9iiUsD6+XMF+oVxXwdc24DXPwAop4e0fz8P8NJH0s2oEF YUDljx5GrUFHGtsv/3YEmNAdYuJa7jnWKV7Mn6WhV+tdah2YZvcxeZOhnzINX3wcBT6l r/jkJ8CmMceT2l1BihcS+/HdVQgbEtUUenrT9cPyJE9TC8oraxRMn3jcg9koE5sAHbit bddHcWcQ9v2FT/6+WOEYtBR0f2tYPKnMdGSStjWfBhmY20PRKMiHNXC1pJhOP98NyEcu omKQ== ARC-Authentication-Results: i=1; mx.google.com; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=fail (p=NONE sp=NONE dis=NONE) header.from=redhat.com Return-Path: Received: from vger.kernel.org (vger.kernel.org. [209.132.180.67]) by mx.google.com with ESMTP id 30si6643450pgv.191.2018.12.19.10.29.09; Wed, 19 Dec 2018 10:29:24 -0800 (PST) Received-SPF: pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) client-ip=209.132.180.67; Authentication-Results: mx.google.com; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=fail (p=NONE sp=NONE dis=NONE) header.from=redhat.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1729842AbeLSQo4 (ORCPT + 99 others); Wed, 19 Dec 2018 11:44:56 -0500 Received: from mx1.redhat.com ([209.132.183.28]:55176 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1728109AbeLSQoz (ORCPT ); Wed, 19 Dec 2018 11:44:55 -0500 Received: from smtp.corp.redhat.com (int-mx04.intmail.prod.int.phx2.redhat.com [10.5.11.14]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by mx1.redhat.com (Postfix) with ESMTPS id 6845081256; Wed, 19 Dec 2018 16:44:54 +0000 (UTC) Received: from x1.home (ovpn-116-92.phx2.redhat.com [10.3.116.92]) by smtp.corp.redhat.com (Postfix) with ESMTP id 6EA295DD63; Wed, 19 Dec 2018 16:44:53 +0000 (UTC) Date: Wed, 19 Dec 2018 09:44:52 -0700 From: Alex Williamson To: Alexey Kardashevskiy Cc: linuxppc-dev@lists.ozlabs.org, David Gibson , kvm-ppc@vger.kernel.org, Alistair Popple , Reza Arbab , Sam Bobroff , Piotr Jaroszynski , Leonardo Augusto =?UTF-8?B?R3VpbWFyw6Nlcw==?= Garcia , Jose Ricardo Ziviani , Daniel Henrique Barboza , Paul Mackerras , kvm@vger.kernel.org, linux-kernel@vger.kernel.org Subject: Re: [PATCH kernel v6 20/20] vfio_pci: Add NVIDIA GV100GL [Tesla V100 SXM2] subdriver Message-ID: <20181219094452.57ad366e@x1.home> In-Reply-To: <20181219085232.103441-21-aik@ozlabs.ru> References: <20181219085232.103441-1-aik@ozlabs.ru> <20181219085232.103441-21-aik@ozlabs.ru> Organization: Red Hat MIME-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit X-Scanned-By: MIMEDefang 2.79 on 10.5.11.14 X-Greylist: Sender IP whitelisted, not delayed by milter-greylist-4.5.16 (mx1.redhat.com [10.5.110.25]); Wed, 19 Dec 2018 16:44:54 +0000 (UTC) Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org [cc +kvm, +lkml] Ditto list cc comment from 18/20, and doubly so on this with updates to the vfio uapi. Also comment below... On Wed, 19 Dec 2018 19:52:32 +1100 Alexey Kardashevskiy wrote: > POWER9 Witherspoon machines come with 4 or 6 V100 GPUs which are not > pluggable PCIe devices but still have PCIe links which are used > for config space and MMIO. In addition to that the GPUs have 6 NVLinks > which are connected to other GPUs and the POWER9 CPU. POWER9 chips > have a special unit on a die called an NPU which is an NVLink2 host bus > adapter with p2p connections to 2 to 3 GPUs, 3 or 2 NVLinks to each. > These systems also support ATS (address translation services) which is > a part of the NVLink2 protocol. Such GPUs also share on-board RAM > (16GB or 32GB) to the system via the same NVLink2 so a CPU has > cache-coherent access to a GPU RAM. > > This exports GPU RAM to the userspace as a new VFIO device region. This > preregisters the new memory as device memory as it might be used for DMA. > This inserts pfns from the fault handler as the GPU memory is not onlined > until the vendor driver is loaded and trained the NVLinks so doing this > earlier causes low level errors which we fence in the firmware so > it does not hurt the host system but still better be avoided; for the same > reason this does not map GPU RAM into the host kernel (usual thing for > emulated access otherwise). > > This exports an ATSD (Address Translation Shootdown) register of NPU which > allows TLB invalidations inside GPU for an operating system. The register > conveniently occupies a single 64k page. It is also presented to > the userspace as a new VFIO device region. One NPU has 8 ATSD registers, > each of them can be used for TLB invalidation in a GPU linked to this NPU. > This allocates one ATSD register per an NVLink bridge allowing passing > up to 6 registers. Due to the host firmware bug (just recently fixed), > only 1 ATSD register per NPU was actually advertised to the host system > so this passes that alone register via the first NVLink bridge device in > the group which is still enough as QEMU collects them all back and > presents to the guest via vPHB to mimic the emulated NPU PHB on the host. > > In order to provide the userspace with the information about GPU-to-NVLink > connections, this exports an additional capability called "tgt" > (which is an abbreviated host system bus address). The "tgt" property > tells the GPU its own system address and allows the guest driver to > conglomerate the routing information so each GPU knows how to get directly > to the other GPUs. > > For ATS to work, the nest MMU (an NVIDIA block in a P9 CPU) needs to > know LPID (a logical partition ID or a KVM guest hardware ID in other > words) and PID (a memory context ID of a userspace process, not to be > confused with a linux pid). This assigns a GPU to LPID in the NPU and > this is why this adds a listener for KVM on an IOMMU group. A PID comes > via NVLink from a GPU and NPU uses a PID wildcard to pass it through. > > This requires coherent memory and ATSD to be available on the host as > the GPU vendor only supports configurations with both features enabled > and other configurations are known not to work. Because of this and > because of the ways the features are advertised to the host system > (which is a device tree with very platform specific properties), > this requires enabled POWERNV platform. > > The V100 GPUs do not advertise any of these capabilities via the config > space and there are more than just one device ID so this relies on > the platform to tell whether these GPUs have special abilities such as > NVLinks. > > Signed-off-by: Alexey Kardashevskiy > --- > Changes: > v6: > * reworked capabilities - tgt for nvlink and gpu and link-speed > for nvlink only > > v5: > * do not memremap GPU RAM for emulation, map it only when it is needed > * allocate 1 ATSD register per NVLink bridge, if none left, then expose > the region with a zero size > * separate caps per device type > * addressed AW review comments > > v4: > * added nvlink-speed to the NPU bridge capability as this turned out to > be not a constant value > * instead of looking at the exact device ID (which also changes from system > to system), now this (indirectly) looks at the device tree to know > if GPU and NPU support NVLink > > v3: > * reworded the commit log about tgt > * added tracepoints (do we want them enabled for entire vfio-pci?) > * added code comments > * added write|mmap flags to the new regions > * auto enabled VFIO_PCI_NVLINK2 config option > * added 'tgt' capability to a GPU so QEMU can recreate ibm,npu and ibm,gpu > references; there are required by the NVIDIA driver > * keep notifier registered only for short time > --- > drivers/vfio/pci/Makefile | 1 + > drivers/vfio/pci/trace.h | 102 ++++++ > drivers/vfio/pci/vfio_pci_private.h | 14 + > include/uapi/linux/vfio.h | 38 +++ > drivers/vfio/pci/vfio_pci.c | 27 +- > drivers/vfio/pci/vfio_pci_nvlink2.c | 482 ++++++++++++++++++++++++++++ > drivers/vfio/pci/Kconfig | 6 + > 7 files changed, 668 insertions(+), 2 deletions(-) > create mode 100644 drivers/vfio/pci/trace.h > create mode 100644 drivers/vfio/pci/vfio_pci_nvlink2.c > > diff --git a/drivers/vfio/pci/Makefile b/drivers/vfio/pci/Makefile > index 76d8ec0..9662c06 100644 > --- a/drivers/vfio/pci/Makefile > +++ b/drivers/vfio/pci/Makefile > @@ -1,5 +1,6 @@ > > vfio-pci-y := vfio_pci.o vfio_pci_intrs.o vfio_pci_rdwr.o vfio_pci_config.o > vfio-pci-$(CONFIG_VFIO_PCI_IGD) += vfio_pci_igd.o > +vfio-pci-$(CONFIG_VFIO_PCI_NVLINK2) += vfio_pci_nvlink2.o > > obj-$(CONFIG_VFIO_PCI) += vfio-pci.o > diff --git a/drivers/vfio/pci/trace.h b/drivers/vfio/pci/trace.h > new file mode 100644 > index 0000000..b80d2d3 > --- /dev/null > +++ b/drivers/vfio/pci/trace.h > @@ -0,0 +1,102 @@ > +/* SPDX-License-Identifier: GPL-2.0+ */ > +/* > + * VFIO PCI mmap/mmap_fault tracepoints > + * > + * Copyright (C) 2018 IBM Corp. All rights reserved. > + * Author: Alexey Kardashevskiy > + * > + * This program is free software; you can redistribute it and/or modify > + * it under the terms of the GNU General Public License version 2 as > + * published by the Free Software Foundation. > + */ > + > +#undef TRACE_SYSTEM > +#define TRACE_SYSTEM vfio_pci > + > +#if !defined(_TRACE_VFIO_PCI_H) || defined(TRACE_HEADER_MULTI_READ) > +#define _TRACE_VFIO_PCI_H > + > +#include > + > +TRACE_EVENT(vfio_pci_nvgpu_mmap_fault, > + TP_PROTO(struct pci_dev *pdev, unsigned long hpa, unsigned long ua, > + vm_fault_t ret), > + TP_ARGS(pdev, hpa, ua, ret), > + > + TP_STRUCT__entry( > + __field(const char *, name) > + __field(unsigned long, hpa) > + __field(unsigned long, ua) > + __field(int, ret) > + ), > + > + TP_fast_assign( > + __entry->name = dev_name(&pdev->dev), > + __entry->hpa = hpa; > + __entry->ua = ua; > + __entry->ret = ret; > + ), > + > + TP_printk("%s: %lx -> %lx ret=%d", __entry->name, __entry->hpa, > + __entry->ua, __entry->ret) > +); > + > +TRACE_EVENT(vfio_pci_nvgpu_mmap, > + TP_PROTO(struct pci_dev *pdev, unsigned long hpa, unsigned long ua, > + unsigned long size, int ret), > + TP_ARGS(pdev, hpa, ua, size, ret), > + > + TP_STRUCT__entry( > + __field(const char *, name) > + __field(unsigned long, hpa) > + __field(unsigned long, ua) > + __field(unsigned long, size) > + __field(int, ret) > + ), > + > + TP_fast_assign( > + __entry->name = dev_name(&pdev->dev), > + __entry->hpa = hpa; > + __entry->ua = ua; > + __entry->size = size; > + __entry->ret = ret; > + ), > + > + TP_printk("%s: %lx -> %lx size=%lx ret=%d", __entry->name, __entry->hpa, > + __entry->ua, __entry->size, __entry->ret) > +); > + > +TRACE_EVENT(vfio_pci_npu2_mmap, > + TP_PROTO(struct pci_dev *pdev, unsigned long hpa, unsigned long ua, > + unsigned long size, int ret), > + TP_ARGS(pdev, hpa, ua, size, ret), > + > + TP_STRUCT__entry( > + __field(const char *, name) > + __field(unsigned long, hpa) > + __field(unsigned long, ua) > + __field(unsigned long, size) > + __field(int, ret) > + ), > + > + TP_fast_assign( > + __entry->name = dev_name(&pdev->dev), > + __entry->hpa = hpa; > + __entry->ua = ua; > + __entry->size = size; > + __entry->ret = ret; > + ), > + > + TP_printk("%s: %lx -> %lx size=%lx ret=%d", __entry->name, __entry->hpa, > + __entry->ua, __entry->size, __entry->ret) > +); > + > +#endif /* _TRACE_SUBSYS_H */ > + > +#undef TRACE_INCLUDE_PATH > +#define TRACE_INCLUDE_PATH . > +#undef TRACE_INCLUDE_FILE > +#define TRACE_INCLUDE_FILE trace > + > +/* This part must be outside protection */ > +#include > diff --git a/drivers/vfio/pci/vfio_pci_private.h b/drivers/vfio/pci/vfio_pci_private.h > index 93c1738..127071b 100644 > --- a/drivers/vfio/pci/vfio_pci_private.h > +++ b/drivers/vfio/pci/vfio_pci_private.h > @@ -163,4 +163,18 @@ static inline int vfio_pci_igd_init(struct vfio_pci_device *vdev) > return -ENODEV; > } > #endif > +#ifdef CONFIG_VFIO_PCI_NVLINK2 > +extern int vfio_pci_nvdia_v100_nvlink2_init(struct vfio_pci_device *vdev); > +extern int vfio_pci_ibm_npu2_init(struct vfio_pci_device *vdev); > +#else > +static inline int vfio_pci_nvdia_v100_nvlink2_init(struct vfio_pci_device *vdev) > +{ > + return -ENODEV; > +} > + > +static inline int vfio_pci_ibm_npu2_init(struct vfio_pci_device *vdev) > +{ > + return -ENODEV; > +} > +#endif > #endif /* VFIO_PCI_PRIVATE_H */ > diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h > index 8131028..22b825c 100644 > --- a/include/uapi/linux/vfio.h > +++ b/include/uapi/linux/vfio.h > @@ -353,6 +353,21 @@ struct vfio_region_gfx_edid { > #define VFIO_DEVICE_GFX_LINK_STATE_DOWN 2 > }; > > +/* > + * 10de vendor sub-type > + * > + * NVIDIA GPU NVlink2 RAM is coherent RAM mapped onto the host address space. > + */ > +#define VFIO_REGION_SUBTYPE_NVIDIA_NVLINK2_RAM (1) > + > +/* > + * 1014 vendor sub-type > + * > + * IBM NPU NVlink2 ATSD (Address Translation Shootdown) register of NPU > + * to do TLB invalidation on a GPU. > + */ > +#define VFIO_REGION_SUBTYPE_IBM_NVLINK2_ATSD (1) > + > /* > * The MSIX mappable capability informs that MSIX data of a BAR can be mmapped > * which allows direct access to non-MSIX registers which happened to be within > @@ -363,6 +378,29 @@ struct vfio_region_gfx_edid { > */ > #define VFIO_REGION_INFO_CAP_MSIX_MAPPABLE 3 > > +/* > + * Capability with compressed real address (aka SSA - small system address) > + * where GPU RAM is mapped on a system bus. Used by a GPU for DMA routing. > + */ > +#define VFIO_REGION_INFO_CAP_NVLINK2_SSATGT 4 > + > +struct vfio_region_info_cap_nvlink2_ssatgt { > + struct vfio_info_cap_header header; > + __u64 tgt; > +}; > + > +/* > + * Capability with compressed real address (aka SSA - small system address), > + * used to match the NVLink bridge with a GPU. Also contains a link speed. > + */ Comments carried over from previous definitions are no longer accurate. Thanks, Alex > +#define VFIO_REGION_INFO_CAP_NVLINK2_LNKSPD 5 > + > +struct vfio_region_info_cap_nvlink2_lnkspd { > + struct vfio_info_cap_header header; > + __u32 link_speed; > + __u32 __pad; > +}; > + > /** > * VFIO_DEVICE_GET_IRQ_INFO - _IOWR(VFIO_TYPE, VFIO_BASE + 9, > * struct vfio_irq_info) > diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c > index 6cb70cf..67c03f2 100644 > --- a/drivers/vfio/pci/vfio_pci.c > +++ b/drivers/vfio/pci/vfio_pci.c > @@ -302,14 +302,37 @@ static int vfio_pci_enable(struct vfio_pci_device *vdev) > if (ret) { > dev_warn(&vdev->pdev->dev, > "Failed to setup Intel IGD regions\n"); > - vfio_pci_disable(vdev); > - return ret; > + goto disable_exit; > + } > + } > + > + if (pdev->vendor == PCI_VENDOR_ID_NVIDIA && > + IS_ENABLED(CONFIG_VFIO_PCI_NVLINK2)) { > + ret = vfio_pci_nvdia_v100_nvlink2_init(vdev); > + if (ret && ret != -ENODEV) { > + dev_warn(&vdev->pdev->dev, > + "Failed to setup NVIDIA NV2 RAM region\n"); > + goto disable_exit; > + } > + } > + > + if (pdev->vendor == PCI_VENDOR_ID_IBM && > + IS_ENABLED(CONFIG_VFIO_PCI_NVLINK2)) { > + ret = vfio_pci_ibm_npu2_init(vdev); > + if (ret && ret != -ENODEV) { > + dev_warn(&vdev->pdev->dev, > + "Failed to setup NVIDIA NV2 ATSD region\n"); > + goto disable_exit; > } > } > > vfio_pci_probe_mmaps(vdev); > > return 0; > + > +disable_exit: > + vfio_pci_disable(vdev); > + return ret; > } > > static void vfio_pci_disable(struct vfio_pci_device *vdev) > diff --git a/drivers/vfio/pci/vfio_pci_nvlink2.c b/drivers/vfio/pci/vfio_pci_nvlink2.c > new file mode 100644 > index 0000000..054a2cf > --- /dev/null > +++ b/drivers/vfio/pci/vfio_pci_nvlink2.c > @@ -0,0 +1,482 @@ > +// SPDX-License-Identifier: GPL-2.0+ > +/* > + * VFIO PCI NVIDIA Whitherspoon GPU support a.k.a. NVLink2. > + * > + * Copyright (C) 2018 IBM Corp. All rights reserved. > + * Author: Alexey Kardashevskiy > + * > + * This program is free software; you can redistribute it and/or modify > + * it under the terms of the GNU General Public License version 2 as > + * published by the Free Software Foundation. > + * > + * Register an on-GPU RAM region for cacheable access. > + * > + * Derived from original vfio_pci_igd.c: > + * Copyright (C) 2016 Red Hat, Inc. All rights reserved. > + * Author: Alex Williamson > + */ > + > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include "vfio_pci_private.h" > + > +#define CREATE_TRACE_POINTS > +#include "trace.h" > + > +EXPORT_TRACEPOINT_SYMBOL_GPL(vfio_pci_nvgpu_mmap_fault); > +EXPORT_TRACEPOINT_SYMBOL_GPL(vfio_pci_nvgpu_mmap); > +EXPORT_TRACEPOINT_SYMBOL_GPL(vfio_pci_npu2_mmap); > + > +struct vfio_pci_nvgpu_data { > + unsigned long gpu_hpa; /* GPU RAM physical address */ > + unsigned long gpu_tgt; /* TGT address of corresponding GPU RAM */ > + unsigned long useraddr; /* GPU RAM userspace address */ > + unsigned long size; /* Size of the GPU RAM window (usually 128GB) */ > + struct mm_struct *mm; > + struct mm_iommu_table_group_mem_t *mem; /* Pre-registered RAM descr. */ > + struct pci_dev *gpdev; > + struct notifier_block group_notifier; > +}; > + > +static size_t vfio_pci_nvgpu_rw(struct vfio_pci_device *vdev, > + char __user *buf, size_t count, loff_t *ppos, bool iswrite) > +{ > + unsigned int i = VFIO_PCI_OFFSET_TO_INDEX(*ppos) - VFIO_PCI_NUM_REGIONS; > + struct vfio_pci_nvgpu_data *data = vdev->region[i].data; > + loff_t pos = *ppos & VFIO_PCI_OFFSET_MASK; > + loff_t posaligned = pos & PAGE_MASK, posoff = pos & ~PAGE_MASK; > + size_t sizealigned; > + void __iomem *ptr; > + > + if (pos >= vdev->region[i].size) > + return -EINVAL; > + > + count = min(count, (size_t)(vdev->region[i].size - pos)); > + > + /* > + * We map only a bit of GPU RAM for a short time instead of mapping it > + * for the guest lifetime as: > + * > + * 1) we do not know GPU RAM size, only aperture which is 4-8 times > + * bigger than actual RAM size (16/32GB RAM vs. 128GB aperture); > + * 2) mapping GPU RAM allows CPU to prefetch and if this happens > + * before NVLink bridge is reset (which fences GPU RAM), > + * hardware management interrupts (HMI) might happen, this > + * will freeze NVLink bridge. > + * > + * This is not fast path anyway. > + */ > + sizealigned = _ALIGN_UP(posoff + count, PAGE_SIZE); > + ptr = ioremap_cache(data->gpu_hpa + posaligned, sizealigned); > + if (!ptr) > + return -EFAULT; > + > + if (iswrite) { > + if (copy_from_user(ptr + posoff, buf, count)) > + count = -EFAULT; > + else > + *ppos += count; > + } else { > + if (copy_to_user(buf, ptr + posoff, count)) > + count = -EFAULT; > + else > + *ppos += count; > + } > + > + iounmap(ptr); > + > + return count; > +} > + > +static void vfio_pci_nvgpu_release(struct vfio_pci_device *vdev, > + struct vfio_pci_region *region) > +{ > + struct vfio_pci_nvgpu_data *data = region->data; > + long ret; > + > + /* If there were any mappings at all... */ > + if (data->mm) { > + ret = mm_iommu_put(data->mm, data->mem); > + WARN_ON(ret); > + > + mmdrop(data->mm); > + } > + > + vfio_unregister_notifier(&data->gpdev->dev, VFIO_GROUP_NOTIFY, > + &data->group_notifier); > + > + pnv_npu2_unmap_lpar_dev(data->gpdev); > + > + kfree(data); > +} > + > +static vm_fault_t vfio_pci_nvgpu_mmap_fault(struct vm_fault *vmf) > +{ > + vm_fault_t ret; > + struct vm_area_struct *vma = vmf->vma; > + struct vfio_pci_region *region = vma->vm_private_data; > + struct vfio_pci_nvgpu_data *data = region->data; > + unsigned long vmf_off = (vmf->address - vma->vm_start) >> PAGE_SHIFT; > + unsigned long nv2pg = data->gpu_hpa >> PAGE_SHIFT; > + unsigned long vm_pgoff = vma->vm_pgoff & > + ((1U << (VFIO_PCI_OFFSET_SHIFT - PAGE_SHIFT)) - 1); > + unsigned long pfn = nv2pg + vm_pgoff + vmf_off; > + > + ret = vmf_insert_pfn(vma, vmf->address, pfn); > + trace_vfio_pci_nvgpu_mmap_fault(data->gpdev, pfn << PAGE_SHIFT, > + vmf->address, ret); > + > + return ret; > +} > + > +static const struct vm_operations_struct vfio_pci_nvgpu_mmap_vmops = { > + .fault = vfio_pci_nvgpu_mmap_fault, > +}; > + > +static int vfio_pci_nvgpu_mmap(struct vfio_pci_device *vdev, > + struct vfio_pci_region *region, struct vm_area_struct *vma) > +{ > + int ret; > + struct vfio_pci_nvgpu_data *data = region->data; > + > + if (data->useraddr) > + return -EPERM; > + > + if (vma->vm_end - vma->vm_start > data->size) > + return -EINVAL; > + > + vma->vm_private_data = region; > + vma->vm_flags |= VM_PFNMAP; > + vma->vm_ops = &vfio_pci_nvgpu_mmap_vmops; > + > + /* > + * Calling mm_iommu_newdev() here once as the region is not > + * registered yet and therefore right initialization will happen now. > + * Other places will use mm_iommu_find() which returns > + * registered @mem and does not go gup(). > + */ > + data->useraddr = vma->vm_start; > + data->mm = current->mm; > + > + atomic_inc(&data->mm->mm_count); > + ret = (int) mm_iommu_newdev(data->mm, data->useraddr, > + (vma->vm_end - vma->vm_start) >> PAGE_SHIFT, > + data->gpu_hpa, &data->mem); > + > + trace_vfio_pci_nvgpu_mmap(vdev->pdev, data->gpu_hpa, data->useraddr, > + vma->vm_end - vma->vm_start, ret); > + > + return ret; > +} > + > +static int vfio_pci_nvgpu_add_capability(struct vfio_pci_device *vdev, > + struct vfio_pci_region *region, struct vfio_info_cap *caps) > +{ > + struct vfio_pci_nvgpu_data *data = region->data; > + struct vfio_region_info_cap_nvlink2_ssatgt cap = { 0 }; > + > + cap.header.id = VFIO_REGION_INFO_CAP_NVLINK2_SSATGT; > + cap.header.version = 1; > + cap.tgt = data->gpu_tgt; > + > + return vfio_info_add_capability(caps, &cap.header, sizeof(cap)); > +} > + > +static const struct vfio_pci_regops vfio_pci_nvgpu_regops = { > + .rw = vfio_pci_nvgpu_rw, > + .release = vfio_pci_nvgpu_release, > + .mmap = vfio_pci_nvgpu_mmap, > + .add_capability = vfio_pci_nvgpu_add_capability, > +}; > + > +static int vfio_pci_nvgpu_group_notifier(struct notifier_block *nb, > + unsigned long action, void *opaque) > +{ > + struct kvm *kvm = opaque; > + struct vfio_pci_nvgpu_data *data = container_of(nb, > + struct vfio_pci_nvgpu_data, > + group_notifier); > + > + if (action == VFIO_GROUP_NOTIFY_SET_KVM && kvm && > + pnv_npu2_map_lpar_dev(data->gpdev, > + kvm->arch.lpid, MSR_DR | MSR_PR)) > + return NOTIFY_BAD; > + > + return NOTIFY_OK; > +} > + > +int vfio_pci_nvdia_v100_nvlink2_init(struct vfio_pci_device *vdev) > +{ > + int ret; > + u64 reg[2]; > + u64 tgt = 0; > + struct device_node *npu_node, *mem_node; > + struct pci_dev *npu_dev; > + struct vfio_pci_nvgpu_data *data; > + uint32_t mem_phandle = 0; > + unsigned long events = VFIO_GROUP_NOTIFY_SET_KVM; > + > + /* > + * PCI config space does not tell us about NVLink presense but > + * platform does, use this. > + */ > + npu_dev = pnv_pci_get_npu_dev(vdev->pdev, 0); > + if (!npu_dev) > + return -ENODEV; > + > + npu_node = pci_device_to_OF_node(npu_dev); > + if (!npu_node) > + return -EINVAL; > + > + if (of_property_read_u32(npu_node, "memory-region", &mem_phandle)) > + return -EINVAL; > + > + mem_node = of_find_node_by_phandle(mem_phandle); > + if (!mem_node) > + return -EINVAL; > + > + if (of_property_read_variable_u64_array(mem_node, "reg", reg, > + ARRAY_SIZE(reg), ARRAY_SIZE(reg)) != > + ARRAY_SIZE(reg)) > + return -EINVAL; > + > + if (of_property_read_u64(npu_node, "ibm,device-tgt-addr", &tgt)) { > + dev_warn(&vdev->pdev->dev, "No ibm,device-tgt-addr found\n"); > + return -EFAULT; > + } > + > + data = kzalloc(sizeof(*data), GFP_KERNEL); > + if (!data) > + return -ENOMEM; > + > + data->gpu_hpa = reg[0]; > + data->gpu_tgt = tgt; > + data->size = reg[1]; > + > + dev_dbg(&vdev->pdev->dev, "%lx..%lx\n", data->gpu_hpa, > + data->gpu_hpa + data->size - 1); > + > + data->gpdev = vdev->pdev; > + data->group_notifier.notifier_call = vfio_pci_nvgpu_group_notifier; > + > + ret = vfio_register_notifier(&data->gpdev->dev, VFIO_GROUP_NOTIFY, > + &events, &data->group_notifier); > + if (ret) > + goto free_exit; > + > + /* > + * We have just set KVM, we do not need the listener anymore. > + * Also, keeping it registered means that if more than one GPU is > + * assigned, we will get several similar notifiers notifying about > + * the same device again which does not help with anything. > + */ > + vfio_unregister_notifier(&data->gpdev->dev, VFIO_GROUP_NOTIFY, > + &data->group_notifier); > + > + ret = vfio_pci_register_dev_region(vdev, > + PCI_VENDOR_ID_NVIDIA | VFIO_REGION_TYPE_PCI_VENDOR_TYPE, > + VFIO_REGION_SUBTYPE_NVIDIA_NVLINK2_RAM, > + &vfio_pci_nvgpu_regops, > + data->size, > + VFIO_REGION_INFO_FLAG_READ | > + VFIO_REGION_INFO_FLAG_WRITE | > + VFIO_REGION_INFO_FLAG_MMAP, > + data); > + if (ret) > + goto free_exit; > + > + return 0; > +free_exit: > + kfree(data); > + > + return ret; > +} > + > +/* > + * IBM NPU2 bridge > + */ > +struct vfio_pci_npu2_data { > + void *base; /* ATSD register virtual address, for emulated access */ > + unsigned long mmio_atsd; /* ATSD physical address */ > + unsigned long gpu_tgt; /* TGT address of corresponding GPU RAM */ > + unsigned int link_speed; /* The link speed from DT's ibm,nvlink-speed */ > +}; > + > +static size_t vfio_pci_npu2_rw(struct vfio_pci_device *vdev, > + char __user *buf, size_t count, loff_t *ppos, bool iswrite) > +{ > + unsigned int i = VFIO_PCI_OFFSET_TO_INDEX(*ppos) - VFIO_PCI_NUM_REGIONS; > + struct vfio_pci_npu2_data *data = vdev->region[i].data; > + loff_t pos = *ppos & VFIO_PCI_OFFSET_MASK; > + > + if (pos >= vdev->region[i].size) > + return -EINVAL; > + > + count = min(count, (size_t)(vdev->region[i].size - pos)); > + > + if (iswrite) { > + if (copy_from_user(data->base + pos, buf, count)) > + return -EFAULT; > + } else { > + if (copy_to_user(buf, data->base + pos, count)) > + return -EFAULT; > + } > + *ppos += count; > + > + return count; > +} > + > +static int vfio_pci_npu2_mmap(struct vfio_pci_device *vdev, > + struct vfio_pci_region *region, struct vm_area_struct *vma) > +{ > + int ret; > + struct vfio_pci_npu2_data *data = region->data; > + unsigned long req_len = vma->vm_end - vma->vm_start; > + > + if (req_len != PAGE_SIZE) > + return -EINVAL; > + > + vma->vm_flags |= VM_PFNMAP; > + vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot); > + > + ret = remap_pfn_range(vma, vma->vm_start, data->mmio_atsd >> PAGE_SHIFT, > + req_len, vma->vm_page_prot); > + trace_vfio_pci_npu2_mmap(vdev->pdev, data->mmio_atsd, vma->vm_start, > + vma->vm_end - vma->vm_start, ret); > + > + return ret; > +} > + > +static void vfio_pci_npu2_release(struct vfio_pci_device *vdev, > + struct vfio_pci_region *region) > +{ > + struct vfio_pci_npu2_data *data = region->data; > + > + memunmap(data->base); > + kfree(data); > +} > + > +static int vfio_pci_npu2_add_capability(struct vfio_pci_device *vdev, > + struct vfio_pci_region *region, struct vfio_info_cap *caps) > +{ > + struct vfio_pci_npu2_data *data = region->data; > + struct vfio_region_info_cap_nvlink2_ssatgt captgt = { 0 }; > + struct vfio_region_info_cap_nvlink2_lnkspd capspd = { 0 }; > + int ret; > + > + captgt.header.id = VFIO_REGION_INFO_CAP_NVLINK2_SSATGT; > + captgt.header.version = 1; > + captgt.tgt = data->gpu_tgt; > + > + capspd.header.id = VFIO_REGION_INFO_CAP_NVLINK2_LNKSPD; > + capspd.header.version = 1; > + capspd.link_speed = data->link_speed; > + > + ret = vfio_info_add_capability(caps, &captgt.header, sizeof(captgt)); > + if (ret) > + return ret; > + > + return vfio_info_add_capability(caps, &capspd.header, sizeof(capspd)); > +} > + > +static const struct vfio_pci_regops vfio_pci_npu2_regops = { > + .rw = vfio_pci_npu2_rw, > + .mmap = vfio_pci_npu2_mmap, > + .release = vfio_pci_npu2_release, > + .add_capability = vfio_pci_npu2_add_capability, > +}; > + > +int vfio_pci_ibm_npu2_init(struct vfio_pci_device *vdev) > +{ > + int ret; > + struct vfio_pci_npu2_data *data; > + struct device_node *nvlink_dn; > + u32 nvlink_index = 0; > + struct pci_dev *npdev = vdev->pdev; > + struct device_node *npu_node = pci_device_to_OF_node(npdev); > + struct pci_controller *hose = pci_bus_to_host(npdev->bus); > + u64 mmio_atsd = 0; > + u64 tgt = 0; > + u32 link_speed = 0xff; > + > + /* > + * PCI config space does not tell us about NVLink presense but > + * platform does, use this. > + */ > + if (!pnv_pci_get_gpu_dev(vdev->pdev)) > + return -ENODEV; > + > + /* > + * NPU2 normally has 8 ATSD registers (for concurrency) and 6 links > + * so we can allocate one register per link, using nvlink index as > + * a key. > + * There is always at least one ATSD register so as long as at least > + * NVLink bridge #0 is passed to the guest, ATSD will be available. > + */ > + nvlink_dn = of_parse_phandle(npdev->dev.of_node, "ibm,nvlink", 0); > + if (WARN_ON(of_property_read_u32(nvlink_dn, "ibm,npu-link-index", > + &nvlink_index))) > + return -ENODEV; > + > + if (of_property_read_u64_index(hose->dn, "ibm,mmio-atsd", nvlink_index, > + &mmio_atsd)) { > + dev_warn(&vdev->pdev->dev, "No available ATSD found\n"); > + mmio_atsd = 0; > + } > + > + if (of_property_read_u64(npu_node, "ibm,device-tgt-addr", &tgt)) { > + dev_warn(&vdev->pdev->dev, "No ibm,device-tgt-addr found\n"); > + return -EFAULT; > + } > + > + if (of_property_read_u32(npu_node, "ibm,nvlink-speed", &link_speed)) { > + dev_warn(&vdev->pdev->dev, "No ibm,nvlink-speed found\n"); > + return -EFAULT; > + } > + > + data = kzalloc(sizeof(*data), GFP_KERNEL); > + if (!data) > + return -ENOMEM; > + > + data->mmio_atsd = mmio_atsd; > + data->gpu_tgt = tgt; > + data->link_speed = link_speed; > + if (data->mmio_atsd) { > + data->base = memremap(data->mmio_atsd, SZ_64K, MEMREMAP_WT); > + if (!data->base) { > + ret = -ENOMEM; > + goto free_exit; > + } > + } > + > + /* > + * We want to expose the capability even if this specific NVLink > + * did not get its own ATSD register because capabilities > + * belong to VFIO regions and normally there will be ATSD register > + * assigned to the NVLink bridge. > + */ > + ret = vfio_pci_register_dev_region(vdev, > + PCI_VENDOR_ID_IBM | > + VFIO_REGION_TYPE_PCI_VENDOR_TYPE, > + VFIO_REGION_SUBTYPE_IBM_NVLINK2_ATSD, > + &vfio_pci_npu2_regops, > + data->mmio_atsd ? PAGE_SIZE : 0, > + VFIO_REGION_INFO_FLAG_READ | > + VFIO_REGION_INFO_FLAG_WRITE | > + VFIO_REGION_INFO_FLAG_MMAP, > + data); > + if (ret) > + goto free_exit; > + > + return 0; > + > +free_exit: > + kfree(data); > + > + return ret; > +} > diff --git a/drivers/vfio/pci/Kconfig b/drivers/vfio/pci/Kconfig > index 42dc1d3..d0f8e4f 100644 > --- a/drivers/vfio/pci/Kconfig > +++ b/drivers/vfio/pci/Kconfig > @@ -38,3 +38,9 @@ config VFIO_PCI_IGD > and LPC bridge config space. > > To enable Intel IGD assignment through vfio-pci, say Y. > + > +config VFIO_PCI_NVLINK2 > + def_bool y > + depends on VFIO_PCI && PPC_POWERNV > + help > + VFIO PCI support for P9 Witherspoon machine with NVIDIA V100 GPUs