LinuxLists.cc - [PATCH v15 1/1] vfio/nvgrace-gpu: Add vfio pci variant module for grace hopper

2023-12-17 19:11:04

by Ankit Agrawal

[permalink] [raw]

Subject: [PATCH v15 1/1] vfio/nvgrace-gpu: Add vfio pci variant module for grace hopper

From: Ankit Agrawal <[email protected]>

NVIDIA's upcoming Grace Hopper Superchip provides a PCI-like device
for the on-chip GPU that is the logical OS representation of the
internal proprietary chip-to-chip cache coherent interconnect.

The device is peculiar compared to a real PCI device in that whilst
there is a real 64b PCI BAR1 (comprising region 2 & region 3) on the
device, it is not used to access device memory once the faster
chip-to-chip interconnect is initialized (occurs at the time of host
system boot). The device memory is accessed instead using the chip-to-chip
interconnect that is exposed as a contiguous physically addressable
region on the host. This device memory aperture can be obtained from host
ACPI table using device_property_read_u64(), according to the FW
specification. Since the device memory is cache coherent with the CPU,
it can be mmap into the user VMA with a cacheable mapping using
remap_pfn_range() and used like a regular RAM. The device memory
is not added to the host kernel, but mapped directly as this reduces
memory wastage due to struct pages.

There is also a requirement of a reserved 1G uncached region (termed as
resmem) to support the Multi-Instance GPU (MIG) feature [1]. Based on [2],
the requisite properties (uncached, unaligned access) can be achieved
through a VM mapping (S1) of NORMAL_NC and host (S2) mapping with
MemAttr[2:0]=0b101. To provide a different non-cached property to the
reserved 1G region, it needs to be carved out from the device memory and
mapped as a separate region in Qemu VMA with pgprot_writecombine().
pgprot_writecombine() sets the Qemu VMA page properties (pgprot) as
NORMAL_NC.

Provide a VFIO PCI variant driver that adapts the unique device memory
representation into a more standard PCI representation facing userspace.

The variant driver exposes these two regions - the non-cached reserved
(resmem) and the cached rest of the device memory (termed as usemem) as
separate VFIO 64b BAR regions. Since the device implements 64-bit BAR0,
the VFIO PCI variant driver maps the uncached carved out region to the
next available PCI BAR (i.e. comprising of region 2 and 3). The cached
device memory aperture is assigned BAR region 4 and 5. Qemu will then
naturally generate a PCI device in the VM with the uncached aperture
reported as BAR2 region, the cacheable as BAR4. The variant driver provides
emulation for these fake BARs' PCI config space offset registers. The
BAR can also be enabled/disabled through PCI_COMMAND config space register.
The VM driver should enable the BARs (as it does already) to make use
of the device memory.

The memory layout on the host looks like the following:
devmem (memlength)
|--------------------------------------------------|
|-------------cached------------------------|--NC--|
| |
usemem.phys/memphys resmem.phys

PCI BARs need to be aligned to the power-of-2, but the actual memory on the
device may not. A read or write access to the physical address from the
last device PFN up to the next power-of-2 aligned physical address
results in reading ~0 and dropped writes. Note that the GPU device
driver [6] is capable of knowing the exact device memory size through
separate means. The device memory size is primarily kept in the system
ACPI tables for use by the VFIO PCI variant module.

Note that the usemem memory is added by the VM Nvidia device driver [5]
to the VM kernel as memblocks. Hence make the usable memory size memblock
aligned.

Currently there is no provision in KVM for a S2 mapping with
MemAttr[2:0]=0b101, but there is an ongoing effort to provide the same [3].
As previously mentioned, resmem is mapped pgprot_writecombine(), that
sets the Qemu VMA page properties (pgprot) as NORMAL_NC. Using the
proposed changes in [4] and [3], KVM marks the region with
MemAttr[2:0]=0b101 in S2.

This goes along with a qemu series [6] to provides the necessary
implementation of the Grace Hopper Superchip firmware specification so
that the guest operating system can see the correct ACPI modeling for
the coherent GPU device. Verified with the CUDA workload in the VM.

[1] https://www.nvidia.com/en-in/technologies/multi-instance-gpu/
[2] section D8.5.5 of https://developer.arm.com/documentation/ddi0487/latest/
[3] https://lore.kernel.org/all/[email protected]/
[4] https://lore.kernel.org/all/[email protected]/
[5] https://github.com/NVIDIA/open-gpu-kernel-modules
[6] https://lore.kernel.org/all/[email protected]/

Applied over next-20231211.
---
Link for variant driver v14:
https://lore.kernel.org/all/[email protected]/

v14 -> v15
- Added case to handle VFIO_DEVICE_IOEVENTFD to return -EIO as it
is not required on the device.
- Updated the BAR config space handling code to closely resemble
by Yishai Hadas (using range_intersect_range) in
https://lore.kernel.org/all/[email protected]
- Changed the bar pci config register from union to u64.
- Adapted the code to disable BAR when it is disabled through
PCI_COMMAND.
- Exported and reused the do_io_rw to do mmio accesses.
- Added a new header file to keep the newly declared structures.
- Miscellaneous code fixes suggested by Alex Williamson in v14.

v13 -> v14
- Merged the changes for second BAR implementation for MIG support
on the device driver.
https://lore.kernel.org/all/[email protected]/
- Added the missing implementation of sub-word access to fake BARs'
PCI config access. Implemented access algorithm suggested by
Alex Williamson in the comments (Thanks!)
- Added support to BAR accesses on the reserved memory with
Qemu device param x-no-mmap=on.
- Handled endian-ness in the PCI config space access.
- Git commit message change

v12 -> v13
- Added emulation for the PCI config space BAR offset register for
the fake BAR.
- commit message updated with more details on the BAR offset emulation.

v11 -> v12
- More details in commit message on device memory size

v10 -> v11
- Removed sysfs attribute to expose the CPU coherent memory feature
- Addressed review comments

v9 -> v10
- Add new sysfs attribute to expose the CPU coherent memory feature.

v8 -> v9
- Minor code adjustment suggested in v8.

v7 -> v8
- Various field names updated.
- Added a new function to handle VFIO_DEVICE_GET_REGION_INFO ioctl.
- Locking protection for memremap to bar region and other changes
recommended in v7.
- Added code to fail if the devmem size advertized is 0 in system DSDT.

v6 -> v7
- Handled out-of-bound and overflow conditions at various places to validate
input offset and length.
- Added code to return EINVAL for offset beyond region size.

v5 -> v6
- Added the code to handle BAR2 read/write using memremap to the device
memory.

v4 -> v5
- Changed the module name from nvgpu-vfio-pci to nvgrace-gpu-vfio-pci.
- Fixed memory leak and added suggested boundary checks on device memory
mapping.
- Added code to read all Fs and ignored write on region outside of the
physical memory.
- Other miscellaneous cleanup suggestions.

v3 -> v4
- Mapping the available device memory using sparse mmap. The region outside
the device memory is handled by read/write ops.
- Removed the fault handler added in v3.

v2 -> v3
- Added fault handler to map the region outside the physical GPU memory
up to the next power-of-2 to a dummy PFN.
- Changed to select instead of "depends on" VFIO_PCI_CORE for all the
vfio-pci variant driver.
- Code cleanup based on feedback comments.
- Code implemented and tested against v6.4-rc4.

v1 -> v2
- Updated the wording of reference to BAR offset and replaced with
index.
- The GPU memory is exposed at the fixed BAR2_REGION_INDEX.
- Code cleanup based on feedback comments.

Signed-off-by: Ankit Agrawal <[email protected]>
Signed-off-by: Aniket Agashe <[email protected]>
Tested-by: Ankit Agrawal <[email protected]>
---
MAINTAINERS | 6 +
drivers/vfio/pci/Kconfig | 2 +
drivers/vfio/pci/Makefile | 2 +
drivers/vfio/pci/nvgrace-gpu/Kconfig | 10 +
drivers/vfio/pci/nvgrace-gpu/Makefile | 3 +
drivers/vfio/pci/nvgrace-gpu/main.c | 806 ++++++++++++++++++++++++++
drivers/vfio/pci/vfio_pci_rdwr.c | 9 +-
7 files changed, 834 insertions(+), 4 deletions(-)
create mode 100644 drivers/vfio/pci/nvgrace-gpu/Kconfig
create mode 100644 drivers/vfio/pci/nvgrace-gpu/Makefile
create mode 100644 drivers/vfio/pci/nvgrace-gpu/main.c

diff --git a/MAINTAINERS b/MAINTAINERS
index 98f7dd0499f1..6f8f3a6daa43 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -22877,6 +22877,12 @@ L: [email protected]
S: Maintained
F: drivers/vfio/platform/

+VFIO NVIDIA GRACE GPU DRIVER
+M: Ankit Agrawal <[email protected]>
+L: [email protected]
+S: Maintained
+F: drivers/vfio/pci/nvgrace-gpu/
+
VGA_SWITCHEROO
R: Lukas Wunner <[email protected]>
S: Maintained
diff --git a/drivers/vfio/pci/Kconfig b/drivers/vfio/pci/Kconfig
index 8125e5f37832..2456210e85f1 100644
--- a/drivers/vfio/pci/Kconfig
+++ b/drivers/vfio/pci/Kconfig
@@ -65,4 +65,6 @@ source "drivers/vfio/pci/hisilicon/Kconfig"

source "drivers/vfio/pci/pds/Kconfig"

+source "drivers/vfio/pci/nvgrace-gpu/Kconfig"
+
endmenu
diff --git a/drivers/vfio/pci/Makefile b/drivers/vfio/pci/Makefile
index 45167be462d8..1352c65e568a 100644
--- a/drivers/vfio/pci/Makefile
+++ b/drivers/vfio/pci/Makefile
@@ -13,3 +13,5 @@ obj-$(CONFIG_MLX5_VFIO_PCI) += mlx5/
obj-$(CONFIG_HISI_ACC_VFIO_PCI) += hisilicon/

obj-$(CONFIG_PDS_VFIO_PCI) += pds/
+
+obj-$(CONFIG_NVGRACE_GPU_VFIO_PCI) += nvgrace-gpu/
diff --git a/drivers/vfio/pci/nvgrace-gpu/Kconfig b/drivers/vfio/pci/nvgrace-gpu/Kconfig
new file mode 100644
index 000000000000..936e88d8d41d
--- /dev/null
+++ b/drivers/vfio/pci/nvgrace-gpu/Kconfig
@@ -0,0 +1,10 @@
+# SPDX-License-Identifier: GPL-2.0-only
+config NVGRACE_GPU_VFIO_PCI
+ tristate "VFIO support for the GPU in the NVIDIA Grace Hopper Superchip"
+ depends on ARM64 || (COMPILE_TEST && 64BIT)
+ select VFIO_PCI_CORE
+ help
+ VFIO support for the GPU in the NVIDIA Grace Hopper Superchip is
+ required to assign the GPU device using KVM/qemu/etc.
+
+ If you don't know what to do here, say N.
diff --git a/drivers/vfio/pci/nvgrace-gpu/Makefile b/drivers/vfio/pci/nvgrace-gpu/Makefile
new file mode 100644
index 000000000000..3ca8c187897a
--- /dev/null
+++ b/drivers/vfio/pci/nvgrace-gpu/Makefile
@@ -0,0 +1,3 @@
+# SPDX-License-Identifier: GPL-2.0-only
+obj-$(CONFIG_NVGRACE_GPU_VFIO_PCI) += nvgrace-gpu-vfio-pci.o
+nvgrace-gpu-vfio-pci-y := main.o
diff --git a/drivers/vfio/pci/nvgrace-gpu/main.c b/drivers/vfio/pci/nvgrace-gpu/main.c
new file mode 100644
index 000000000000..43387a800c41
--- /dev/null
+++ b/drivers/vfio/pci/nvgrace-gpu/main.c
@@ -0,0 +1,806 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (c) 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved
+ */
+
+#include "nvgrace_gpu_vfio_pci.h"
+
+static bool nvgrace_gpu_vfio_pci_is_fake_bar(int index)
+{
+ return (index == RESMEM_REGION_INDEX ||
+ index == USEMEM_REGION_INDEX);
+}
+
+static void nvgrace_gpu_init_fake_bar_emu_regs(struct vfio_device *core_vdev)
+{
+ struct nvgrace_gpu_vfio_pci_core_device *nvdev = container_of(
+ core_vdev, struct nvgrace_gpu_vfio_pci_core_device,
+ core_device.vdev);
+
+ nvdev->resmem.u64_reg = 0;
+ nvdev->usemem.u64_reg = 0;
+}
+
+/* Choose the structure corresponding to the fake BAR with a given index. */
+struct mem_region *
+nvgrace_gpu_vfio_pci_fake_bar_mem_region(int index,
+ struct nvgrace_gpu_vfio_pci_core_device *nvdev)
+{
+ if (index == USEMEM_REGION_INDEX)
+ return &(nvdev->usemem);
+
+ if (index == RESMEM_REGION_INDEX)
+ return &(nvdev->resmem);
+
+ return NULL;
+}
+
+static int nvgrace_gpu_vfio_pci_open_device(struct vfio_device *core_vdev)
+{
+ struct vfio_pci_core_device *vdev =
+ container_of(core_vdev, struct vfio_pci_core_device, vdev);
+ struct nvgrace_gpu_vfio_pci_core_device *nvdev = container_of(
+ core_vdev, struct nvgrace_gpu_vfio_pci_core_device,
+ core_device.vdev);
+ int ret;
+
+ ret = vfio_pci_core_enable(vdev);
+ if (ret)
+ return ret;
+
+ vfio_pci_core_finish_enable(vdev);
+
+ nvgrace_gpu_init_fake_bar_emu_regs(core_vdev);
+
+ mutex_init(&nvdev->remap_lock);
+
+ return 0;
+}
+
+static void nvgrace_gpu_vfio_pci_close_device(struct vfio_device *core_vdev)
+{
+ struct nvgrace_gpu_vfio_pci_core_device *nvdev = container_of(
+ core_vdev, struct nvgrace_gpu_vfio_pci_core_device,
+ core_device.vdev);
+
+ /* Unmap the mapping to the device memory cached region */
+ if (nvdev->usemem.bar_remap.memaddr) {
+ memunmap(nvdev->usemem.bar_remap.memaddr);
+ nvdev->usemem.bar_remap.memaddr = NULL;
+ }
+
+ /* Unmap the mapping to the device memory non-cached region */
+ if (nvdev->resmem.bar_remap.ioaddr) {
+ iounmap(nvdev->resmem.bar_remap.ioaddr);
+ nvdev->resmem.bar_remap.ioaddr = NULL;
+ }
+
+ mutex_destroy(&nvdev->remap_lock);
+
+ vfio_pci_core_close_device(core_vdev);
+}
+
+static int nvgrace_gpu_vfio_pci_mmap(struct vfio_device *core_vdev,
+ struct vm_area_struct *vma)
+{
+ struct nvgrace_gpu_vfio_pci_core_device *nvdev = container_of(
+ core_vdev, struct nvgrace_gpu_vfio_pci_core_device, core_device.vdev);
+
+ unsigned long start_pfn;
+ unsigned int index;
+ u64 req_len, pgoff, end;
+ int ret = 0;
+ struct mem_region *memregion;
+
+ index = vma->vm_pgoff >> (VFIO_PCI_OFFSET_SHIFT - PAGE_SHIFT);
+
+ memregion = nvgrace_gpu_vfio_pci_fake_bar_mem_region(index, nvdev);
+ if (!memregion)
+ return vfio_pci_core_mmap(core_vdev, vma);
+
+ /*
+ * Request to mmap the BAR. Map to the CPU accessible memory on the
+ * GPU using the memory information gathered from the system ACPI
+ * tables.
+ */
+ pgoff = vma->vm_pgoff &
+ ((1U << (VFIO_PCI_OFFSET_SHIFT - PAGE_SHIFT)) - 1);
+
+ if (check_sub_overflow(vma->vm_end, vma->vm_start, &req_len) ||
+ check_add_overflow(PHYS_PFN(memregion->memphys), pgoff, &start_pfn) ||
+ check_add_overflow(PFN_PHYS(pgoff), req_len, &end))
+ return -EOVERFLOW;
+
+ /*
+ * Check that the mapping request does not go beyond available device
+ * memory size
+ */
+ if (end > memregion->memlength)
+ return -EINVAL;
+
+ /*
+ * The carved out region of the device memory needs the NORMAL_NC
+ * property. Communicate as such to the hypervisor.
+ */
+ if (index == RESMEM_REGION_INDEX)
+ vma->vm_page_prot = pgprot_writecombine(vma->vm_page_prot);
+
+ /*
+ * Perform a PFN map to the memory and back the device BAR by the
+ * GPU memory.
+ *
+ * The available GPU memory size may not be power-of-2 aligned. Map up
+ * to the size of the device memory. If the memory access is beyond the
+ * actual GPU memory size, it will be handled by the vfio_device_ops
+ * read/write.
+ *
+ * During device reset, the GPU is safely disconnected to the CPU
+ * and access to the BAR will be immediately returned preventing
+ * machine check.
+ */
+ ret = remap_pfn_range(vma, vma->vm_start, start_pfn,
+ req_len, vma->vm_page_prot);
+ if (ret)
+ return ret;
+
+ vma->vm_pgoff = start_pfn;
+
+ return 0;
+}
+
+static long
+nvgrace_gpu_vfio_pci_ioctl_get_region_info(struct vfio_device *core_vdev,
+ unsigned long arg)
+{
+ unsigned long minsz = offsetofend(struct vfio_region_info, offset);
+ struct nvgrace_gpu_vfio_pci_core_device *nvdev = container_of(
+ core_vdev, struct nvgrace_gpu_vfio_pci_core_device, core_device.vdev);
+ struct vfio_region_info_cap_sparse_mmap *sparse;
+ struct vfio_info_cap caps = { .buf = NULL, .size = 0 };
+ struct vfio_region_info info;
+ struct mem_region *memregion;
+ uint32_t size;
+ int ret;
+
+ if (copy_from_user(&info, (void __user *)arg, minsz))
+ return -EFAULT;
+
+ if (info.argsz < minsz)
+ return -EINVAL;
+
+ memregion = nvgrace_gpu_vfio_pci_fake_bar_mem_region(info.index, nvdev);
+ if (!memregion)
+ return vfio_pci_core_ioctl(core_vdev,
+ VFIO_DEVICE_GET_REGION_INFO, arg);
+
+ /*
+ * Request to determine the BAR region information. Send the
+ * GPU memory information.
+ */
+ size = struct_size(sparse, areas, 1);
+
+ /*
+ * Setup for sparse mapping for the device memory. Only the
+ * available device memory on the hardware is shown as a
+ * mappable region.
+ */
+ sparse = kzalloc(size, GFP_KERNEL);
+ if (!sparse)
+ return -ENOMEM;
+
+ sparse->nr_areas = 1;
+ sparse->areas[0].offset = 0;
+ sparse->areas[0].size = memregion->memlength;
+ sparse->header.id = VFIO_REGION_INFO_CAP_SPARSE_MMAP;
+ sparse->header.version = 1;
+
+ ret = vfio_info_add_capability(&caps, &sparse->header, size);
+ kfree(sparse);
+ if (ret)
+ return ret;
+
+ info.offset = VFIO_PCI_INDEX_TO_OFFSET(info.index);
+ /*
+ * The region memory size may not be power-of-2 aligned.
+ * Given that the memory as a BAR and may not be
+ * aligned, roundup to the next power-of-2.
+ */
+ info.size = roundup_pow_of_two(memregion->memlength);
+ info.flags = VFIO_REGION_INFO_FLAG_READ |
+ VFIO_REGION_INFO_FLAG_WRITE |
+ VFIO_REGION_INFO_FLAG_MMAP;
+
+ if (caps.size) {
+ info.flags |= VFIO_REGION_INFO_FLAG_CAPS;
+ if (info.argsz < sizeof(info) + caps.size) {
+ info.argsz = sizeof(info) + caps.size;
+ info.cap_offset = 0;
+ } else {
+ vfio_info_cap_shift(&caps, sizeof(info));
+ if (copy_to_user((void __user *)arg +
+ sizeof(info), caps.buf,
+ caps.size)) {
+ kfree(caps.buf);
+ return -EFAULT;
+ }
+ info.cap_offset = sizeof(info);
+ }
+ kfree(caps.buf);
+ }
+ return copy_to_user((void __user *)arg, &info, minsz) ?
+ -EFAULT : 0;
+}
+
+static long nvgrace_gpu_vfio_pci_ioctl(struct vfio_device *core_vdev,
+ unsigned int cmd, unsigned long arg)
+{
+ switch (cmd) {
+ case VFIO_DEVICE_GET_REGION_INFO:
+ return nvgrace_gpu_vfio_pci_ioctl_get_region_info(core_vdev, arg);
+ case VFIO_DEVICE_IOEVENTFD:
+ return -ENOTTY;
+ case VFIO_DEVICE_RESET:
+ nvgrace_gpu_init_fake_bar_emu_regs(core_vdev);
+ fallthrough;
+ default:
+ return vfio_pci_core_ioctl(core_vdev, cmd, arg);
+ }
+}
+
+static bool range_intersect_range(loff_t range1_start, size_t count1,
+ loff_t range2_start, size_t count2,
+ loff_t *start_offset,
+ size_t *intersect_count,
+ size_t *register_offset)
+{
+ if (range1_start <= range2_start &&
+ range1_start + count1 > range2_start) {
+ *start_offset = range2_start - range1_start;
+ *intersect_count = min_t(size_t, count2,
+ range1_start + count1 - range2_start);
+ *register_offset = 0;
+ return true;
+ }
+
+ if (range1_start > range2_start &&
+ range1_start < range2_start + count2) {
+ *start_offset = 0;
+ *intersect_count = min_t(size_t, count1,
+ range2_start + count2 - range1_start);
+ *register_offset = range1_start - range2_start;
+ return true;
+ }
+
+ return false;
+}
+
+/*
+ * Both the usable (usemem) and the reserved (resmem) device memory region
+ * are exposed as a 64b fake BARs in the VM. These fake BARs must respond
+ * to the accesses on their respective PCI config space offsets.
+ *
+ * resmem BAR owns PCI_BASE_ADDRESS_2 & PCI_BASE_ADDRESS_3.
+ * usemem BAR owns PCI_BASE_ADDRESS_4 & PCI_BASE_ADDRESS_5.
+ */
+static ssize_t
+nvgrace_gpu_read_config_emu(struct vfio_device *core_vdev,
+ char __user *buf, size_t count, loff_t *ppos)
+{
+ struct nvgrace_gpu_vfio_pci_core_device *nvdev = container_of(
+ core_vdev, struct nvgrace_gpu_vfio_pci_core_device, core_device.vdev);
+ u64 pos = *ppos & VFIO_PCI_OFFSET_MASK;
+ __le64 val64;
+ size_t bar_size;
+ size_t register_offset;
+ loff_t copy_offset;
+ size_t copy_count;
+ int ret;
+
+ ret = vfio_pci_core_read(core_vdev, buf, count, ppos);
+ if (ret < 0)
+ return ret;
+
+ if (range_intersect_range(pos, count, PCI_BASE_ADDRESS_2, sizeof(val64),
+ &copy_offset, &copy_count, &register_offset)) {
+ bar_size = roundup_pow_of_two(nvdev->resmem.memlength);
+ nvdev->resmem.u64_reg &= ~(bar_size - 1);
+ nvdev->resmem.u64_reg |= PCI_BASE_ADDRESS_MEM_TYPE_64 |
+ PCI_BASE_ADDRESS_MEM_PREFETCH;
+ val64 = cpu_to_le64(nvdev->resmem.u64_reg);
+ if (copy_to_user(buf + copy_offset, (void *)&val64 + register_offset, copy_count))
+ return -EFAULT;
+ }
+
+ if (range_intersect_range(pos, count, PCI_BASE_ADDRESS_4, sizeof(val64),
+ &copy_offset, &copy_count, &register_offset)) {
+ bar_size = roundup_pow_of_two(nvdev->usemem.memlength);
+ nvdev->usemem.u64_reg &= ~(bar_size - 1);
+ nvdev->usemem.u64_reg |= PCI_BASE_ADDRESS_MEM_TYPE_64 |
+ PCI_BASE_ADDRESS_MEM_PREFETCH;
+ val64 = cpu_to_le64(nvdev->usemem.u64_reg);
+ if (copy_to_user(buf + copy_offset, (void *)&val64 + register_offset, copy_count))
+ return -EFAULT;
+ }
+
+ return count;
+}
+
+static ssize_t
+nvgrace_gpu_write_config_emu(struct vfio_device *core_vdev,
+ const char __user *buf, size_t count, loff_t *ppos)
+{
+ struct nvgrace_gpu_vfio_pci_core_device *nvdev = container_of(
+ core_vdev, struct nvgrace_gpu_vfio_pci_core_device, core_device.vdev);
+ u64 pos = *ppos & VFIO_PCI_OFFSET_MASK;
+ __le64 val64 = 0;
+ __le16 val16 = 0;
+ u64 tmp_val;
+ size_t register_offset;
+ loff_t copy_offset;
+ size_t copy_count;
+
+ if (range_intersect_range(pos, count, PCI_BASE_ADDRESS_2, sizeof(val64),
+ &copy_offset, &copy_count, &register_offset)) {
+ if (copy_from_user((void *)&val64, buf + copy_offset, copy_count))
+ return -EFAULT;
+ tmp_val = le64_to_cpu(val64);
+ memcpy((void *)&(nvdev->resmem.u64_reg) + register_offset,
+ (void *)&tmp_val + copy_offset, copy_count);
+ *ppos += copy_count;
+ return copy_count;
+ }
+
+ if (range_intersect_range(pos, count, PCI_BASE_ADDRESS_4, sizeof(val64),
+ &copy_offset, &copy_count, &register_offset)) {
+ if (copy_from_user((void *)&val64, buf + copy_offset, copy_count))
+ return -EFAULT;
+ tmp_val = le64_to_cpu(val64);
+ memcpy((void *)&(nvdev->usemem.u64_reg) + register_offset,
+ (void *)&tmp_val + copy_offset, copy_count);
+ *ppos += copy_count;
+ return copy_count;
+ }
+
+ if (range_intersect_range(pos, count, PCI_COMMAND, sizeof(val16),
+ &copy_offset, &copy_count, &register_offset)) {
+ if (copy_from_user((void *)&val16, buf + copy_offset, copy_count))
+ return -EFAULT;
+
+ if (le16_to_cpu(val16) & PCI_COMMAND_MEMORY)
+ nvdev->bars_disabled = false;
+ else
+ nvdev->bars_disabled = true;
+ }
+
+ return vfio_pci_core_write(core_vdev, buf, count, ppos);
+}
+
+/*
+ * Ad hoc map the device memory in the module kernel VA space. Primarily needed
+ * to support Qemu's device x-no-mmap=on option.
+ *
+ * The usemem region is cacheable memory and hence is memremaped.
+ * The resmem region is non-cached and is mapped using ioremap_wc (NORMAL_NC).
+ */
+static int
+nvgrace_gpu_map_device_mem(struct nvgrace_gpu_vfio_pci_core_device *nvdev,
+ int index)
+{
+ int ret = 0;
+
+ mutex_lock(&nvdev->remap_lock);
+ if (index == USEMEM_REGION_INDEX &&
+ !nvdev->usemem.bar_remap.memaddr) {
+ nvdev->usemem.bar_remap.memaddr
+ = memremap(nvdev->usemem.memphys, nvdev->usemem.memlength, MEMREMAP_WB);
+ if (!nvdev->usemem.bar_remap.memaddr)
+ ret = -ENOMEM;
+ } else if (index == RESMEM_REGION_INDEX &&
+ !nvdev->resmem.bar_remap.ioaddr) {
+ nvdev->resmem.bar_remap.ioaddr
+ = ioremap_wc(nvdev->resmem.memphys, nvdev->resmem.memlength);
+ if (!nvdev->resmem.bar_remap.ioaddr)
+ ret = -ENOMEM;
+ }
+ mutex_unlock(&nvdev->remap_lock);
+
+ return ret;
+}
+
+/*
+ * Read the data from the device memory (mapped either through ioremap
+ * or memremap) into the user buffer.
+ */
+static int
+nvgrace_gpu_map_and_read(struct nvgrace_gpu_vfio_pci_core_device *nvdev,
+ void __user *buf, size_t mem_count, loff_t *ppos)
+{
+ unsigned int index = VFIO_PCI_OFFSET_TO_INDEX(*ppos);
+ u64 offset = *ppos & VFIO_PCI_OFFSET_MASK;
+ int ret = 0;
+
+ /*
+ * Handle read on the BAR regions. Map to the target device memory
+ * physical address and copy to the request read buffer.
+ */
+ ret = nvgrace_gpu_map_device_mem(nvdev, index);
+ if (ret)
+ goto read_exit;
+
+ if (index == USEMEM_REGION_INDEX) {
+ if (copy_to_user(buf, (u8 *)nvdev->usemem.bar_remap.memaddr + offset, mem_count))
+ ret = -EFAULT;
+ } else {
+ return do_io_rw(&nvdev->core_device, false, nvdev->resmem.bar_remap.ioaddr,
+ (char __user *) buf, offset, mem_count, 0, 0, false);
+ }
+
+read_exit:
+ return ret;
+}
+
+/*
+ * Read count bytes from the device memory at an offset. The actual device
+ * memory size (available) may not be a power-of-2. So the driver fakes
+ * the size to a power-of-2 (reported) when exposing to a user space driver.
+ *
+ * Read request beyond the actual device size is filled with ~0, while
+ * those beyond the actual reported size is skipped.
+ *
+ * A read from a negative or an offset greater than reported size, a negative
+ * count are considered error conditions and returned with an -EINVAL.
+ */
+static ssize_t
+nvgrace_gpu_read_mem(struct nvgrace_gpu_vfio_pci_core_device *nvdev,
+ void __user *buf, size_t count, loff_t *ppos)
+{
+ u64 offset = *ppos & VFIO_PCI_OFFSET_MASK;
+ unsigned int index = VFIO_PCI_OFFSET_TO_INDEX(*ppos);
+ struct mem_region *memregion;
+ size_t mem_count, i, bar_size;
+ u8 val = 0xFF;
+ int ret;
+
+ memregion = nvgrace_gpu_vfio_pci_fake_bar_mem_region(index, nvdev);
+ if (!memregion)
+ return -EINVAL;
+
+ bar_size = roundup_pow_of_two(memregion->memlength);
+
+ if (offset >= bar_size)
+ return -EINVAL;
+
+ /* Clip short the read request beyond reported BAR size */
+ count = min(count, bar_size - (size_t)offset);
+
+ /*
+ * Determine how many bytes to be actually read from the device memory.
+ * Read request beyond the actual device memory size is filled with ~0,
+ * while those beyond the actual reported size is skipped.
+ */
+ if (offset >= memregion->memlength)
+ mem_count = 0;
+ else
+ mem_count = min(count, memregion->memlength - (size_t)offset);
+
+ ret = nvgrace_gpu_map_and_read(nvdev, buf, mem_count, ppos);
+ if (ret)
+ return ret;
+
+ /*
+ * Only the device memory present on the hardware is mapped, which may
+ * not be power-of-2 aligned. A read to an offset beyond the device memory
+ * size is filled with ~0.
+ */
+ for (i = mem_count; i < count; i++)
+ put_user(val, (unsigned char __user *)(buf + i));
+
+ *ppos += count;
+ return count;
+}
+
+static ssize_t
+nvgrace_gpu_vfio_pci_read(struct vfio_device *core_vdev,
+ char __user *buf, size_t count, loff_t *ppos)
+{
+ unsigned int index = VFIO_PCI_OFFSET_TO_INDEX(*ppos);
+ ssize_t read_count;
+ struct nvgrace_gpu_vfio_pci_core_device *nvdev = container_of(
+ core_vdev, struct nvgrace_gpu_vfio_pci_core_device,
+ core_device.vdev);
+
+ if (nvgrace_gpu_vfio_pci_is_fake_bar(index)) {
+ /* Check if the bars are disabled, allow access otherwise */
+ down_read(&nvdev->core_device.memory_lock);
+ if (nvdev->bars_disabled) {
+ up_read(&nvdev->core_device.memory_lock);
+ return -EIO;
+ }
+ read_count = nvgrace_gpu_read_mem(nvdev, buf, count, ppos);
+ up_read(&nvdev->core_device.memory_lock);
+
+ return read_count;
+ }
+
+ if (index == VFIO_PCI_CONFIG_REGION_INDEX)
+ return nvgrace_gpu_read_config_emu(core_vdev, buf, count, ppos);
+
+ return vfio_pci_core_read(core_vdev, buf, count, ppos);
+}
+
+/*
+ * Write the data to the device memory (mapped either through ioremap
+ * or memremap) from the user buffer.
+ */
+static int nvgrace_gpu_map_and_write(struct nvgrace_gpu_vfio_pci_core_device *nvdev,
+ const void __user *buf, size_t mem_count, loff_t *ppos)
+{
+ unsigned int index = VFIO_PCI_OFFSET_TO_INDEX(*ppos);
+ loff_t pos = *ppos & VFIO_PCI_OFFSET_MASK;
+ int ret = 0;
+
+ ret = nvgrace_gpu_map_device_mem(nvdev, index);
+ if (ret)
+ goto write_exit;
+
+ if (index == USEMEM_REGION_INDEX) {
+ if (copy_from_user((u8 *)nvdev->usemem.bar_remap.memaddr + pos,
+ buf, mem_count))
+ return -EFAULT;
+ } else {
+ return do_io_rw(&nvdev->core_device, false, nvdev->resmem.bar_remap.ioaddr,
+ (char __user *) buf, pos, mem_count, 0, 0, true);
+ }
+
+write_exit:
+ return ret;
+}
+
+/*
+ * Write count bytes to the device memory at a given offset. The actual device
+ * memory size (available) may not be a power-of-2. So the driver fakes the
+ * size to a power-of-2 (reported) when exposing to a user space driver.
+ *
+ * Write request beyond the actual device size are dropped, while those
+ * beyond the actual reported size are skipped entirely.
+ *
+ * A write to a negative or an offset greater than the reported size, a
+ * negative count are considered error conditions and returned with an -EINVAL.
+ */
+static ssize_t
+nvgrace_gpu_write_mem(struct nvgrace_gpu_vfio_pci_core_device *nvdev,
+ size_t count, loff_t *ppos, const void __user *buf)
+{
+ u64 offset = *ppos & VFIO_PCI_OFFSET_MASK;
+ unsigned int index = VFIO_PCI_OFFSET_TO_INDEX(*ppos);
+ struct mem_region *memregion;
+ size_t mem_count, bar_size;
+ int ret = 0;
+
+ memregion = nvgrace_gpu_vfio_pci_fake_bar_mem_region(index, nvdev);
+ if (!memregion)
+ return -EINVAL;
+
+ bar_size = roundup_pow_of_two(memregion->memlength);
+
+ if (offset >= bar_size)
+ return -EINVAL;
+
+ /* Clip short the write request beyond reported BAR size */
+ count = min(count, bar_size - (size_t)offset);
+
+ /*
+ * Determine how many bytes to be actually written to the device memory.
+ * Do not write to the offset beyond available size.
+ */
+ if (offset >= memregion->memlength)
+ goto exitfn;
+
+ /*
+ * Only the device memory present on the hardware is mapped, which may
+ * not be power-of-2 aligned. Drop access outside the available device
+ * memory on the hardware.
+ */
+ mem_count = min(count, memregion->memlength - (size_t)offset);
+
+ ret = nvgrace_gpu_map_and_write(nvdev, buf, mem_count, ppos);
+ if (ret)
+ return ret;
+
+exitfn:
+ *ppos += count;
+ return count;
+}
+
+static ssize_t
+nvgrace_gpu_vfio_pci_write(struct vfio_device *core_vdev,
+ const char __user *buf, size_t count, loff_t *ppos)
+{
+ unsigned int index = VFIO_PCI_OFFSET_TO_INDEX(*ppos);
+ struct nvgrace_gpu_vfio_pci_core_device *nvdev = container_of(
+ core_vdev, struct nvgrace_gpu_vfio_pci_core_device, core_device.vdev);
+ size_t write_count;
+
+ if (nvgrace_gpu_vfio_pci_is_fake_bar(index)) {
+ /* Check if the bars are disabled, allow access otherwise */
+ down_read(&nvdev->core_device.memory_lock);
+ if (nvdev->bars_disabled) {
+ up_read(&nvdev->core_device.memory_lock);
+ return -EIO;
+ }
+ write_count = nvgrace_gpu_write_mem(nvdev, count, ppos, buf);
+ up_read(&nvdev->core_device.memory_lock);
+ return write_count;
+ }
+
+ if (index == VFIO_PCI_CONFIG_REGION_INDEX)
+ return nvgrace_gpu_write_config_emu(core_vdev, buf, count, ppos);
+
+ return vfio_pci_core_write(core_vdev, buf, count, ppos);
+}
+
+static const struct vfio_device_ops nvgrace_gpu_vfio_pci_ops = {
+ .name = "nvgrace-gpu-vfio-pci",
+ .init = vfio_pci_core_init_dev,
+ .release = vfio_pci_core_release_dev,
+ .open_device = nvgrace_gpu_vfio_pci_open_device,
+ .close_device = nvgrace_gpu_vfio_pci_close_device,
+ .ioctl = nvgrace_gpu_vfio_pci_ioctl,
+ .read = nvgrace_gpu_vfio_pci_read,
+ .write = nvgrace_gpu_vfio_pci_write,
+ .mmap = nvgrace_gpu_vfio_pci_mmap,
+ .request = vfio_pci_core_request,
+ .match = vfio_pci_core_match,
+ .bind_iommufd = vfio_iommufd_physical_bind,
+ .unbind_iommufd = vfio_iommufd_physical_unbind,
+ .attach_ioas = vfio_iommufd_physical_attach_ioas,
+ .detach_ioas = vfio_iommufd_physical_detach_ioas,
+};
+
+static struct
+nvgrace_gpu_vfio_pci_core_device *nvgrace_gpu_drvdata(struct pci_dev *pdev)
+{
+ struct vfio_pci_core_device *core_device = dev_get_drvdata(&pdev->dev);
+
+ return container_of(core_device, struct nvgrace_gpu_vfio_pci_core_device,
+ core_device);
+}
+
+static int
+nvgrace_gpu_vfio_pci_fetch_memory_property(struct pci_dev *pdev,
+ struct nvgrace_gpu_vfio_pci_core_device *nvdev)
+{
+ int ret;
+ u64 memphys, memlength;
+
+ /*
+ * The memory information is present in the system ACPI tables as DSD
+ * properties nvidia,gpu-mem-base-pa and nvidia,gpu-mem-size.
+ */
+ ret = device_property_read_u64(&pdev->dev, "nvidia,gpu-mem-base-pa",
+ &(memphys));
+ if (ret)
+ return ret;
+
+ if (memphys > type_max(phys_addr_t))
+ return -EOVERFLOW;
+
+ ret = device_property_read_u64(&pdev->dev, "nvidia,gpu-mem-size",
+ &(memlength));
+ if (ret)
+ return ret;
+
+ if (memlength > type_max(size_t))
+ return -EOVERFLOW;
+
+ /*
+ * If the C2C link is not up due to an error, the coherent device
+ * memory size is returned as 0. Fail in such case.
+ */
+ if (memlength == 0)
+ return -ENOMEM;
+
+ /*
+ * The VM GPU device driver needs a non-cacheable region to support
+ * the MIG feature. Since the device memory is mapped as NORMAL cached,
+ * carve out a region from the end with a different NORMAL_NC
+ * property (called as reserved memory and represented as resmem). This
+ * region then is exposed as a 64b BAR (region 2 and 3) to the VM, while
+ * exposing the rest (termed as usable memory and represented using usemem)
+ * as cacheable 64b BAR (region 4 and 5).
+ *
+ * devmem (memlength)
+ * |-------------------------------------------------|
+ * | |
+ * usemem.phys/memphys resmem.phys
+ */
+ nvdev->usemem.memphys = memphys;
+
+ /*
+ * The device memory exposed to the VM is added to the kernel by the
+ * VM driver module in chunks of memory block size. Only the usable
+ * memory (usemem) is added to the kernel for usage by the VM
+ * workloads. Make the usable memory size memblock aligned.
+ */
+ if (check_sub_overflow(memlength, RESMEM_SIZE,
+ &nvdev->usemem.memlength)) {
+ ret = -EOVERFLOW;
+ goto done;
+ }
+ nvdev->usemem.memlength = round_down(nvdev->usemem.memlength,
+ MEMBLK_SIZE);
+ if ((check_add_overflow(nvdev->usemem.memphys,
+ nvdev->usemem.memlength, &nvdev->resmem.memphys)) ||
+ (check_sub_overflow(memlength, nvdev->usemem.memlength,
+ &nvdev->resmem.memlength))) {
+ ret = -EOVERFLOW;
+ goto done;
+ }
+
+done:
+ return ret;
+}
+
+static int nvgrace_gpu_vfio_pci_probe(struct pci_dev *pdev,
+ const struct pci_device_id *id)
+{
+ struct nvgrace_gpu_vfio_pci_core_device *nvdev;
+ int ret;
+
+ nvdev = vfio_alloc_device(nvgrace_gpu_vfio_pci_core_device, core_device.vdev,
+ &pdev->dev, &nvgrace_gpu_vfio_pci_ops);
+ if (IS_ERR(nvdev))
+ return PTR_ERR(nvdev);
+
+ dev_set_drvdata(&pdev->dev, nvdev);
+
+ ret = nvgrace_gpu_vfio_pci_fetch_memory_property(pdev, nvdev);
+ if (ret)
+ goto out_put_vdev;
+
+ ret = vfio_pci_core_register_device(&nvdev->core_device);
+ if (ret)
+ goto out_put_vdev;
+
+ return ret;
+
+out_put_vdev:
+ vfio_put_device(&nvdev->core_device.vdev);
+ return ret;
+}
+
+static void nvgrace_gpu_vfio_pci_remove(struct pci_dev *pdev)
+{
+ struct nvgrace_gpu_vfio_pci_core_device *nvdev = nvgrace_gpu_drvdata(pdev);
+ struct vfio_pci_core_device *vdev = &nvdev->core_device;
+
+ vfio_pci_core_unregister_device(vdev);
+ vfio_put_device(&vdev->vdev);
+}
+
+static const struct pci_device_id nvgrace_gpu_vfio_pci_table[] = {
+ /* GH200 120GB */
+ { PCI_DRIVER_OVERRIDE_DEVICE_VFIO(PCI_VENDOR_ID_NVIDIA, 0x2342) },
+ /* GH200 480GB */
+ { PCI_DRIVER_OVERRIDE_DEVICE_VFIO(PCI_VENDOR_ID_NVIDIA, 0x2345) },
+ {}
+};
+
+MODULE_DEVICE_TABLE(pci, nvgrace_gpu_vfio_pci_table);
+
+static struct pci_driver nvgrace_gpu_vfio_pci_driver = {
+ .name = KBUILD_MODNAME,
+ .id_table = nvgrace_gpu_vfio_pci_table,
+ .probe = nvgrace_gpu_vfio_pci_probe,
+ .remove = nvgrace_gpu_vfio_pci_remove,
+ .err_handler = &vfio_pci_core_err_handlers,
+ .driver_managed_dma = true,
+};
+
+module_pci_driver(nvgrace_gpu_vfio_pci_driver);
+
+MODULE_LICENSE("GPL v2");
+MODULE_AUTHOR("Ankit Agrawal <[email protected]>");
+MODULE_AUTHOR("Aniket Agashe <[email protected]>");
+MODULE_DESCRIPTION(
+ "VFIO NVGRACE GPU PF - User Level driver for NVIDIA devices with CPU coherently accessible device memory");
diff --git a/drivers/vfio/pci/vfio_pci_rdwr.c b/drivers/vfio/pci/vfio_pci_rdwr.c
index e27de61ac9fe..ca20440b442d 100644
--- a/drivers/vfio/pci/vfio_pci_rdwr.c
+++ b/drivers/vfio/pci/vfio_pci_rdwr.c
@@ -94,10 +94,10 @@ VFIO_IOREAD(32)
* reads with -1. This is intended for handling MSI-X vector tables and
* leftover space for ROM BARs.
*/
-static ssize_t do_io_rw(struct vfio_pci_core_device *vdev, bool test_mem,
- void __iomem *io, char __user *buf,
- loff_t off, size_t count, size_t x_start,
- size_t x_end, bool iswrite)
+ssize_t do_io_rw(struct vfio_pci_core_device *vdev, bool test_mem,
+ void __iomem *io, char __user *buf,
+ loff_t off, size_t count, size_t x_start,
+ size_t x_end, bool iswrite)
{
ssize_t done = 0;
int ret;
@@ -199,6 +199,7 @@ static ssize_t do_io_rw(struct vfio_pci_core_device *vdev, bool test_mem,

return done;
}
+EXPORT_SYMBOL(do_io_rw);

static int vfio_pci_setup_barmap(struct vfio_pci_core_device *vdev, int bar)
{
--
2.17.1

2023-12-18 15:57:45

by Cédric Le Goater

[permalink] [raw]

Subject: Re: [PATCH v15 1/1] vfio/nvgrace-gpu: Add vfio pci variant module for grace hopper

Hello Ankit,

On 12/17/23 20:10, [email protected] wrote:
> From: Ankit Agrawal <[email protected]>
>
> NVIDIA's upcoming Grace Hopper Superchip provides a PCI-like device
> for the on-chip GPU that is the logical OS representation of the
> internal proprietary chip-to-chip cache coherent interconnect.
>
> The device is peculiar compared to a real PCI device in that whilst
> there is a real 64b PCI BAR1 (comprising region 2 & region 3) on the
> device, it is not used to access device memory once the faster
> chip-to-chip interconnect is initialized (occurs at the time of host
> system boot). The device memory is accessed instead using the chip-to-chip
> interconnect that is exposed as a contiguous physically addressable
> region on the host. This device memory aperture can be obtained from host
> ACPI table using device_property_read_u64(), according to the FW
> specification. Since the device memory is cache coherent with the CPU,
> it can be mmap into the user VMA with a cacheable mapping using
> remap_pfn_range() and used like a regular RAM. The device memory
> is not added to the host kernel, but mapped directly as this reduces
> memory wastage due to struct pages.
>
> There is also a requirement of a reserved 1G uncached region (termed as
> resmem) to support the Multi-Instance GPU (MIG) feature [1]. Based on [2],
> the requisite properties (uncached, unaligned access) can be achieved
> through a VM mapping (S1) of NORMAL_NC and host (S2) mapping with
> MemAttr[2:0]=0b101. To provide a different non-cached property to the
> reserved 1G region, it needs to be carved out from the device memory and
> mapped as a separate region in Qemu VMA with pgprot_writecombine().
> pgprot_writecombine() sets the Qemu VMA page properties (pgprot) as
> NORMAL_NC.
>
> Provide a VFIO PCI variant driver that adapts the unique device memory
> representation into a more standard PCI representation facing userspace.
>
> The variant driver exposes these two regions - the non-cached reserved
> (resmem) and the cached rest of the device memory (termed as usemem) as
> separate VFIO 64b BAR regions. Since the device implements 64-bit BAR0,
> the VFIO PCI variant driver maps the uncached carved out region to the
> next available PCI BAR (i.e. comprising of region 2 and 3). The cached
> device memory aperture is assigned BAR region 4 and 5. Qemu will then
> naturally generate a PCI device in the VM with the uncached aperture
> reported as BAR2 region, the cacheable as BAR4. The variant driver provides
> emulation for these fake BARs' PCI config space offset registers. The
> BAR can also be enabled/disabled through PCI_COMMAND config space register.
> The VM driver should enable the BARs (as it does already) to make use
> of the device memory.
>
> The memory layout on the host looks like the following:
> devmem (memlength)
> |--------------------------------------------------|
> |-------------cached------------------------|--NC--|
> | |
> usemem.phys/memphys resmem.phys
>
> PCI BARs need to be aligned to the power-of-2, but the actual memory on the
> device may not. A read or write access to the physical address from the
> last device PFN up to the next power-of-2 aligned physical address
> results in reading ~0 and dropped writes. Note that the GPU device
> driver [6] is capable of knowing the exact device memory size through
> separate means. The device memory size is primarily kept in the system
> ACPI tables for use by the VFIO PCI variant module.
>
> Note that the usemem memory is added by the VM Nvidia device driver [5]
> to the VM kernel as memblocks. Hence make the usable memory size memblock
> aligned.
>
> Currently there is no provision in KVM for a S2 mapping with
> MemAttr[2:0]=0b101, but there is an ongoing effort to provide the same [3].
> As previously mentioned, resmem is mapped pgprot_writecombine(), that
> sets the Qemu VMA page properties (pgprot) as NORMAL_NC. Using the
> proposed changes in [4] and [3], KVM marks the region with
> MemAttr[2:0]=0b101 in S2.
>
> This goes along with a qemu series [6] to provides the necessary
> implementation of the Grace Hopper Superchip firmware specification so
> that the guest operating system can see the correct ACPI modeling for
> the coherent GPU device. Verified with the CUDA workload in the VM.
>
> [1] https://www.nvidia.com/en-in/technologies/multi-instance-gpu/
> [2] section D8.5.5 of https://developer.arm.com/documentation/ddi0487/latest/
> [3] https://lore.kernel.org/all/[email protected]/
> [4] https://lore.kernel.org/all/[email protected]/
> [5] https://github.com/NVIDIA/open-gpu-kernel-modules
> [6] https://lore.kernel.org/all/[email protected]/
>
> Applied over next-20231211.
> ---
> Link for variant driver v14:
> https://lore.kernel.org/all/[email protected]/
>
> v14 -> v15
> - Added case to handle VFIO_DEVICE_IOEVENTFD to return -EIO as it
> is not required on the device.
> - Updated the BAR config space handling code to closely resemble
> by Yishai Hadas (using range_intersect_range) in
> https://lore.kernel.org/all/[email protected]
> - Changed the bar pci config register from union to u64.
> - Adapted the code to disable BAR when it is disabled through
> PCI_COMMAND.
> - Exported and reused the do_io_rw to do mmio accesses.
> - Added a new header file to keep the newly declared structures.
> - Miscellaneous code fixes suggested by Alex Williamson in v14.

./scripts/checkpatch.pl --strict will give you some tips on how to
improve the changes furthermore.

> v13 -> v14
> - Merged the changes for second BAR implementation for MIG support
> on the device driver.
> https://lore.kernel.org/all/[email protected]/
> - Added the missing implementation of sub-word access to fake BARs'
> PCI config access. Implemented access algorithm suggested by
> Alex Williamson in the comments (Thanks!)
> - Added support to BAR accesses on the reserved memory with
> Qemu device param x-no-mmap=on.
> - Handled endian-ness in the PCI config space access.
> - Git commit message change
>
> v12 -> v13
> - Added emulation for the PCI config space BAR offset register for
> the fake BAR.
> - commit message updated with more details on the BAR offset emulation.
>
> v11 -> v12
> - More details in commit message on device memory size
>
> v10 -> v11
> - Removed sysfs attribute to expose the CPU coherent memory feature
> - Addressed review comments
>
> v9 -> v10
> - Add new sysfs attribute to expose the CPU coherent memory feature.
>
> v8 -> v9
> - Minor code adjustment suggested in v8.
>
> v7 -> v8
> - Various field names updated.
> - Added a new function to handle VFIO_DEVICE_GET_REGION_INFO ioctl.
> - Locking protection for memremap to bar region and other changes
> recommended in v7.
> - Added code to fail if the devmem size advertized is 0 in system DSDT.
>
> v6 -> v7
> - Handled out-of-bound and overflow conditions at various places to validate
> input offset and length.
> - Added code to return EINVAL for offset beyond region size.
>
> v5 -> v6
> - Added the code to handle BAR2 read/write using memremap to the device
> memory.
>
> v4 -> v5
> - Changed the module name from nvgpu-vfio-pci to nvgrace-gpu-vfio-pci.
> - Fixed memory leak and added suggested boundary checks on device memory
> mapping.
> - Added code to read all Fs and ignored write on region outside of the
> physical memory.
> - Other miscellaneous cleanup suggestions.
>
> v3 -> v4
> - Mapping the available device memory using sparse mmap. The region outside
> the device memory is handled by read/write ops.
> - Removed the fault handler added in v3.
>
> v2 -> v3
> - Added fault handler to map the region outside the physical GPU memory
> up to the next power-of-2 to a dummy PFN.
> - Changed to select instead of "depends on" VFIO_PCI_CORE for all the
> vfio-pci variant driver.
> - Code cleanup based on feedback comments.
> - Code implemented and tested against v6.4-rc4.
>
> v1 -> v2
> - Updated the wording of reference to BAR offset and replaced with
> index.
> - The GPU memory is exposed at the fixed BAR2_REGION_INDEX.
> - Code cleanup based on feedback comments.
>
> Signed-off-by: Ankit Agrawal <[email protected]>
> Signed-off-by: Aniket Agashe <[email protected]>
> Tested-by: Ankit Agrawal <[email protected]>
> ---
> MAINTAINERS | 6 +
> drivers/vfio/pci/Kconfig | 2 +
> drivers/vfio/pci/Makefile | 2 +
> drivers/vfio/pci/nvgrace-gpu/Kconfig | 10 +
> drivers/vfio/pci/nvgrace-gpu/Makefile | 3 +
> drivers/vfio/pci/nvgrace-gpu/main.c | 806 ++++++++++++++++++++++++++
> drivers/vfio/pci/vfio_pci_rdwr.c | 9 +-
> 7 files changed, 834 insertions(+), 4 deletions(-)
> create mode 100644 drivers/vfio/pci/nvgrace-gpu/Kconfig
> create mode 100644 drivers/vfio/pci/nvgrace-gpu/Makefile
> create mode 100644 drivers/vfio/pci/nvgrace-gpu/main.c
>
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 98f7dd0499f1..6f8f3a6daa43 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -22877,6 +22877,12 @@ L: [email protected]
> S: Maintained
> F: drivers/vfio/platform/
>
> +VFIO NVIDIA GRACE GPU DRIVER
> +M: Ankit Agrawal <[email protected]>
> +L: [email protected]
> +S: Maintained
> +F: drivers/vfio/pci/nvgrace-gpu/
> +
> VGA_SWITCHEROO
> R: Lukas Wunner <[email protected]>
> S: Maintained
> diff --git a/drivers/vfio/pci/Kconfig b/drivers/vfio/pci/Kconfig
> index 8125e5f37832..2456210e85f1 100644
> --- a/drivers/vfio/pci/Kconfig
> +++ b/drivers/vfio/pci/Kconfig
> @@ -65,4 +65,6 @@ source "drivers/vfio/pci/hisilicon/Kconfig"
>
> source "drivers/vfio/pci/pds/Kconfig"
>
> +source "drivers/vfio/pci/nvgrace-gpu/Kconfig"
> +
> endmenu
> diff --git a/drivers/vfio/pci/Makefile b/drivers/vfio/pci/Makefile
> index 45167be462d8..1352c65e568a 100644
> --- a/drivers/vfio/pci/Makefile
> +++ b/drivers/vfio/pci/Makefile
> @@ -13,3 +13,5 @@ obj-$(CONFIG_MLX5_VFIO_PCI) += mlx5/
> obj-$(CONFIG_HISI_ACC_VFIO_PCI) += hisilicon/
>
> obj-$(CONFIG_PDS_VFIO_PCI) += pds/
> +
> +obj-$(CONFIG_NVGRACE_GPU_VFIO_PCI) += nvgrace-gpu/
> diff --git a/drivers/vfio/pci/nvgrace-gpu/Kconfig b/drivers/vfio/pci/nvgrace-gpu/Kconfig
> new file mode 100644
> index 000000000000..936e88d8d41d
> --- /dev/null
> +++ b/drivers/vfio/pci/nvgrace-gpu/Kconfig
> @@ -0,0 +1,10 @@
> +# SPDX-License-Identifier: GPL-2.0-only
> +config NVGRACE_GPU_VFIO_PCI
> + tristate "VFIO support for the GPU in the NVIDIA Grace Hopper Superchip"
> + depends on ARM64 || (COMPILE_TEST && 64BIT)
> + select VFIO_PCI_CORE
> + help
> + VFIO support for the GPU in the NVIDIA Grace Hopper Superchip is
> + required to assign the GPU device using KVM/qemu/etc.
> +
> + If you don't know what to do here, say N.
> diff --git a/drivers/vfio/pci/nvgrace-gpu/Makefile b/drivers/vfio/pci/nvgrace-gpu/Makefile
> new file mode 100644
> index 000000000000..3ca8c187897a
> --- /dev/null
> +++ b/drivers/vfio/pci/nvgrace-gpu/Makefile
> @@ -0,0 +1,3 @@
> +# SPDX-License-Identifier: GPL-2.0-only
> +obj-$(CONFIG_NVGRACE_GPU_VFIO_PCI) += nvgrace-gpu-vfio-pci.o
> +nvgrace-gpu-vfio-pci-y := main.o
> diff --git a/drivers/vfio/pci/nvgrace-gpu/main.c b/drivers/vfio/pci/nvgrace-gpu/main.c
> new file mode 100644
> index 000000000000..43387a800c41
> --- /dev/null
> +++ b/drivers/vfio/pci/nvgrace-gpu/main.c
> @@ -0,0 +1,806 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/*
> + * Copyright (c) 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved
> + */
> +
> +#include "nvgrace_gpu_vfio_pci.h"

This file doesn't exist.

> +
> +static bool nvgrace_gpu_vfio_pci_is_fake_bar(int index)
> +{
> + return (index == RESMEM_REGION_INDEX ||
> + index == USEMEM_REGION_INDEX);
> +}
> +
> +static void nvgrace_gpu_init_fake_bar_emu_regs(struct vfio_device *core_vdev)
> +{
> + struct nvgrace_gpu_vfio_pci_core_device *nvdev = container_of(
> + core_vdev, struct nvgrace_gpu_vfio_pci_core_device,
> + core_device.vdev);
> +
> + nvdev->resmem.u64_reg = 0;
> + nvdev->usemem.u64_reg = 0;
> +}
> +
> +/* Choose the structure corresponding to the fake BAR with a given index. */
> +struct mem_region *
> +nvgrace_gpu_vfio_pci_fake_bar_mem_region(int index,
> + struct nvgrace_gpu_vfio_pci_core_device *nvdev)
> +{
> + if (index == USEMEM_REGION_INDEX)
> + return &(nvdev->usemem);
> +
> + if (index == RESMEM_REGION_INDEX)
> + return &(nvdev->resmem);
> +
> + return NULL;
> +}
> +
> +static int nvgrace_gpu_vfio_pci_open_device(struct vfio_device *core_vdev)
> +{
> + struct vfio_pci_core_device *vdev =
> + container_of(core_vdev, struct vfio_pci_core_device, vdev);
> + struct nvgrace_gpu_vfio_pci_core_device *nvdev = container_of(
> + core_vdev, struct nvgrace_gpu_vfio_pci_core_device,
> + core_device.vdev);
> + int ret;
> +
> + ret = vfio_pci_core_enable(vdev);
> + if (ret)
> + return ret;
> +
> + vfio_pci_core_finish_enable(vdev);
> +
> + nvgrace_gpu_init_fake_bar_emu_regs(core_vdev);
> +
> + mutex_init(&nvdev->remap_lock);
> +
> + return 0;
> +}
> +
> +static void nvgrace_gpu_vfio_pci_close_device(struct vfio_device *core_vdev)
> +{
> + struct nvgrace_gpu_vfio_pci_core_device *nvdev = container_of(
> + core_vdev, struct nvgrace_gpu_vfio_pci_core_device,
> + core_device.vdev);
> +
> + /* Unmap the mapping to the device memory cached region */
> + if (nvdev->usemem.bar_remap.memaddr) {
> + memunmap(nvdev->usemem.bar_remap.memaddr);
> + nvdev->usemem.bar_remap.memaddr = NULL;
> + }
> +
> + /* Unmap the mapping to the device memory non-cached region */
> + if (nvdev->resmem.bar_remap.ioaddr) {
> + iounmap(nvdev->resmem.bar_remap.ioaddr);
> + nvdev->resmem.bar_remap.ioaddr = NULL;
> + }
> +
> + mutex_destroy(&nvdev->remap_lock);
> +
> + vfio_pci_core_close_device(core_vdev);
> +}
> +
> +static int nvgrace_gpu_vfio_pci_mmap(struct vfio_device *core_vdev,
> + struct vm_area_struct *vma)
> +{
> + struct nvgrace_gpu_vfio_pci_core_device *nvdev = container_of(
> + core_vdev, struct nvgrace_gpu_vfio_pci_core_device, core_device.vdev);
> +
> + unsigned long start_pfn;
> + unsigned int index;
> + u64 req_len, pgoff, end;
> + int ret = 0;
> + struct mem_region *memregion;
> +
> + index = vma->vm_pgoff >> (VFIO_PCI_OFFSET_SHIFT - PAGE_SHIFT);
> +
> + memregion = nvgrace_gpu_vfio_pci_fake_bar_mem_region(index, nvdev);
> + if (!memregion)
> + return vfio_pci_core_mmap(core_vdev, vma);
> +
> + /*
> + * Request to mmap the BAR. Map to the CPU accessible memory on the
> + * GPU using the memory information gathered from the system ACPI
> + * tables.
> + */
> + pgoff = vma->vm_pgoff &
> + ((1U << (VFIO_PCI_OFFSET_SHIFT - PAGE_SHIFT)) - 1);
> +
> + if (check_sub_overflow(vma->vm_end, vma->vm_start, &req_len) ||
> + check_add_overflow(PHYS_PFN(memregion->memphys), pgoff, &start_pfn) ||
> + check_add_overflow(PFN_PHYS(pgoff), req_len, &end))
> + return -EOVERFLOW;
> +
> + /*
> + * Check that the mapping request does not go beyond available device
> + * memory size
> + */
> + if (end > memregion->memlength)
> + return -EINVAL;
> +
> + /*
> + * The carved out region of the device memory needs the NORMAL_NC
> + * property. Communicate as such to the hypervisor.
> + */
> + if (index == RESMEM_REGION_INDEX)
> + vma->vm_page_prot = pgprot_writecombine(vma->vm_page_prot);
> +
> + /*
> + * Perform a PFN map to the memory and back the device BAR by the
> + * GPU memory.
> + *
> + * The available GPU memory size may not be power-of-2 aligned. Map up
> + * to the size of the device memory. If the memory access is beyond the
> + * actual GPU memory size, it will be handled by the vfio_device_ops
> + * read/write.
> + *
> + * During device reset, the GPU is safely disconnected to the CPU
> + * and access to the BAR will be immediately returned preventing
> + * machine check.
> + */
> + ret = remap_pfn_range(vma, vma->vm_start, start_pfn,
> + req_len, vma->vm_page_prot);
> + if (ret)
> + return ret;
> +
> + vma->vm_pgoff = start_pfn;
> +
> + return 0;
> +}
> +
> +static long
> +nvgrace_gpu_vfio_pci_ioctl_get_region_info(struct vfio_device *core_vdev,
> + unsigned long arg)
> +{
> + unsigned long minsz = offsetofend(struct vfio_region_info, offset);
> + struct nvgrace_gpu_vfio_pci_core_device *nvdev = container_of(
> + core_vdev, struct nvgrace_gpu_vfio_pci_core_device, core_device.vdev);
> + struct vfio_region_info_cap_sparse_mmap *sparse;
> + struct vfio_info_cap caps = { .buf = NULL, .size = 0 };
> + struct vfio_region_info info;
> + struct mem_region *memregion;
> + uint32_t size;
> + int ret;
> +
> + if (copy_from_user(&info, (void __user *)arg, minsz))
> + return -EFAULT;
> +
> + if (info.argsz < minsz)
> + return -EINVAL;
> +
> + memregion = nvgrace_gpu_vfio_pci_fake_bar_mem_region(info.index, nvdev);
> + if (!memregion)
> + return vfio_pci_core_ioctl(core_vdev,
> + VFIO_DEVICE_GET_REGION_INFO, arg);
> +
> + /*
> + * Request to determine the BAR region information. Send the
> + * GPU memory information.
> + */
> + size = struct_size(sparse, areas, 1);
> +
> + /*
> + * Setup for sparse mapping for the device memory. Only the
> + * available device memory on the hardware is shown as a
> + * mappable region.
> + */
> + sparse = kzalloc(size, GFP_KERNEL);
> + if (!sparse)
> + return -ENOMEM;
> +
> + sparse->nr_areas = 1;
> + sparse->areas[0].offset = 0;
> + sparse->areas[0].size = memregion->memlength;
> + sparse->header.id = VFIO_REGION_INFO_CAP_SPARSE_MMAP;
> + sparse->header.version = 1;
> +
> + ret = vfio_info_add_capability(&caps, &sparse->header, size);
> + kfree(sparse);
> + if (ret)
> + return ret;
> +
> + info.offset = VFIO_PCI_INDEX_TO_OFFSET(info.index);
> + /*
> + * The region memory size may not be power-of-2 aligned.
> + * Given that the memory as a BAR and may not be
> + * aligned, roundup to the next power-of-2.
> + */
> + info.size = roundup_pow_of_two(memregion->memlength);
> + info.flags = VFIO_REGION_INFO_FLAG_READ |
> + VFIO_REGION_INFO_FLAG_WRITE |
> + VFIO_REGION_INFO_FLAG_MMAP;
> +
> + if (caps.size) {
> + info.flags |= VFIO_REGION_INFO_FLAG_CAPS;
> + if (info.argsz < sizeof(info) + caps.size) {
> + info.argsz = sizeof(info) + caps.size;
> + info.cap_offset = 0;
> + } else {
> + vfio_info_cap_shift(&caps, sizeof(info));
> + if (copy_to_user((void __user *)arg +
> + sizeof(info), caps.buf,
> + caps.size)) {
> + kfree(caps.buf);
> + return -EFAULT;
> + }
> + info.cap_offset = sizeof(info);
> + }
> + kfree(caps.buf);
> + }
> + return copy_to_user((void __user *)arg, &info, minsz) ?
> + -EFAULT : 0;
> +}
> +
> +static long nvgrace_gpu_vfio_pci_ioctl(struct vfio_device *core_vdev,
> + unsigned int cmd, unsigned long arg)
> +{
> + switch (cmd) {
> + case VFIO_DEVICE_GET_REGION_INFO:
> + return nvgrace_gpu_vfio_pci_ioctl_get_region_info(core_vdev, arg);
> + case VFIO_DEVICE_IOEVENTFD:
> + return -ENOTTY;
> + case VFIO_DEVICE_RESET:
> + nvgrace_gpu_init_fake_bar_emu_regs(core_vdev);
> + fallthrough;
> + default:
> + return vfio_pci_core_ioctl(core_vdev, cmd, arg);
> + }
> +}
> +
> +static bool range_intersect_range(loff_t range1_start, size_t count1,
> + loff_t range2_start, size_t count2,
> + loff_t *start_offset,
> + size_t *intersect_count,
> + size_t *register_offset)
> +{
> + if (range1_start <= range2_start &&
> + range1_start + count1 > range2_start) {
> + *start_offset = range2_start - range1_start;
> + *intersect_count = min_t(size_t, count2,
> + range1_start + count1 - range2_start);
> + *register_offset = 0;
> + return true;
> + }
> +
> + if (range1_start > range2_start &&
> + range1_start < range2_start + count2) {
> + *start_offset = 0;
> + *intersect_count = min_t(size_t, count1,
> + range2_start + count2 - range1_start);
> + *register_offset = range1_start - range2_start;
> + return true;
> + }
> +
> + return false;
> +}
> +
> +/*
> + * Both the usable (usemem) and the reserved (resmem) device memory region
> + * are exposed as a 64b fake BARs in the VM. These fake BARs must respond
> + * to the accesses on their respective PCI config space offsets.
> + *
> + * resmem BAR owns PCI_BASE_ADDRESS_2 & PCI_BASE_ADDRESS_3.
> + * usemem BAR owns PCI_BASE_ADDRESS_4 & PCI_BASE_ADDRESS_5.
> + */
> +static ssize_t
> +nvgrace_gpu_read_config_emu(struct vfio_device *core_vdev,
> + char __user *buf, size_t count, loff_t *ppos)
> +{
> + struct nvgrace_gpu_vfio_pci_core_device *nvdev = container_of(
> + core_vdev, struct nvgrace_gpu_vfio_pci_core_device, core_device.vdev);
> + u64 pos = *ppos & VFIO_PCI_OFFSET_MASK;
> + __le64 val64;
> + size_t bar_size;
> + size_t register_offset;
> + loff_t copy_offset;
> + size_t copy_count;
> + int ret;
> +
> + ret = vfio_pci_core_read(core_vdev, buf, count, ppos);
> + if (ret < 0)
> + return ret;
> +
> + if (range_intersect_range(pos, count, PCI_BASE_ADDRESS_2, sizeof(val64),
> + &copy_offset, &copy_count, &register_offset)) {
> + bar_size = roundup_pow_of_two(nvdev->resmem.memlength);
> + nvdev->resmem.u64_reg &= ~(bar_size - 1);
> + nvdev->resmem.u64_reg |= PCI_BASE_ADDRESS_MEM_TYPE_64 |
> + PCI_BASE_ADDRESS_MEM_PREFETCH;
> + val64 = cpu_to_le64(nvdev->resmem.u64_reg);
> + if (copy_to_user(buf + copy_offset, (void *)&val64 + register_offset, copy_count))
> + return -EFAULT;
> + }
> +
> + if (range_intersect_range(pos, count, PCI_BASE_ADDRESS_4, sizeof(val64),
> + &copy_offset, &copy_count, &register_offset)) {
> + bar_size = roundup_pow_of_two(nvdev->usemem.memlength);
> + nvdev->usemem.u64_reg &= ~(bar_size - 1);
> + nvdev->usemem.u64_reg |= PCI_BASE_ADDRESS_MEM_TYPE_64 |
> + PCI_BASE_ADDRESS_MEM_PREFETCH;
> + val64 = cpu_to_le64(nvdev->usemem.u64_reg);
> + if (copy_to_user(buf + copy_offset, (void *)&val64 + register_offset, copy_count))
> + return -EFAULT;
> + }
> +
> + return count;
> +}
> +
> +static ssize_t
> +nvgrace_gpu_write_config_emu(struct vfio_device *core_vdev,
> + const char __user *buf, size_t count, loff_t *ppos)
> +{
> + struct nvgrace_gpu_vfio_pci_core_device *nvdev = container_of(
> + core_vdev, struct nvgrace_gpu_vfio_pci_core_device, core_device.vdev);
> + u64 pos = *ppos & VFIO_PCI_OFFSET_MASK;
> + __le64 val64 = 0;
> + __le16 val16 = 0;
> + u64 tmp_val;
> + size_t register_offset;
> + loff_t copy_offset;
> + size_t copy_count;
> +
> + if (range_intersect_range(pos, count, PCI_BASE_ADDRESS_2, sizeof(val64),
> + &copy_offset, &copy_count, &register_offset)) {
> + if (copy_from_user((void *)&val64, buf + copy_offset, copy_count))
> + return -EFAULT;
> + tmp_val = le64_to_cpu(val64);
> + memcpy((void *)&(nvdev->resmem.u64_reg) + register_offset,
> + (void *)&tmp_val + copy_offset, copy_count);
> + *ppos += copy_count;
> + return copy_count;
> + }
> +
> + if (range_intersect_range(pos, count, PCI_BASE_ADDRESS_4, sizeof(val64),
> + &copy_offset, &copy_count, &register_offset)) {
> + if (copy_from_user((void *)&val64, buf + copy_offset, copy_count))
> + return -EFAULT;
> + tmp_val = le64_to_cpu(val64);
> + memcpy((void *)&(nvdev->usemem.u64_reg) + register_offset,
> + (void *)&tmp_val + copy_offset, copy_count);
> + *ppos += copy_count;
> + return copy_count;
> + }
> +
> + if (range_intersect_range(pos, count, PCI_COMMAND, sizeof(val16),
> + &copy_offset, &copy_count, &register_offset)) {
> + if (copy_from_user((void *)&val16, buf + copy_offset, copy_count))
> + return -EFAULT;
> +
> + if (le16_to_cpu(val16) & PCI_COMMAND_MEMORY)
> + nvdev->bars_disabled = false;
> + else
> + nvdev->bars_disabled = true;
> + }
> +
> + return vfio_pci_core_write(core_vdev, buf, count, ppos);
> +}
> +
> +/*
> + * Ad hoc map the device memory in the module kernel VA space. Primarily needed
> + * to support Qemu's device x-no-mmap=on option.
> + *
> + * The usemem region is cacheable memory and hence is memremaped.
> + * The resmem region is non-cached and is mapped using ioremap_wc (NORMAL_NC).
> + */
> +static int
> +nvgrace_gpu_map_device_mem(struct nvgrace_gpu_vfio_pci_core_device *nvdev,
> + int index)
> +{
> + int ret = 0;
> +
> + mutex_lock(&nvdev->remap_lock);
> + if (index == USEMEM_REGION_INDEX &&
> + !nvdev->usemem.bar_remap.memaddr) {
> + nvdev->usemem.bar_remap.memaddr
> + = memremap(nvdev->usemem.memphys, nvdev->usemem.memlength, MEMREMAP_WB);
> + if (!nvdev->usemem.bar_remap.memaddr)
> + ret = -ENOMEM;
> + } else if (index == RESMEM_REGION_INDEX &&
> + !nvdev->resmem.bar_remap.ioaddr) {
> + nvdev->resmem.bar_remap.ioaddr
> + = ioremap_wc(nvdev->resmem.memphys, nvdev->resmem.memlength);
> + if (!nvdev->resmem.bar_remap.ioaddr)
> + ret = -ENOMEM;
> + }
> + mutex_unlock(&nvdev->remap_lock);
> +
> + return ret;
> +}
> +
> +/*
> + * Read the data from the device memory (mapped either through ioremap
> + * or memremap) into the user buffer.
> + */
> +static int
> +nvgrace_gpu_map_and_read(struct nvgrace_gpu_vfio_pci_core_device *nvdev,
> + void __user *buf, size_t mem_count, loff_t *ppos)
> +{
> + unsigned int index = VFIO_PCI_OFFSET_TO_INDEX(*ppos);
> + u64 offset = *ppos & VFIO_PCI_OFFSET_MASK;
> + int ret = 0;
> +
> + /*
> + * Handle read on the BAR regions. Map to the target device memory
> + * physical address and copy to the request read buffer.
> + */
> + ret = nvgrace_gpu_map_device_mem(nvdev, index);
> + if (ret)
> + goto read_exit;
> +
> + if (index == USEMEM_REGION_INDEX) {
> + if (copy_to_user(buf, (u8 *)nvdev->usemem.bar_remap.memaddr + offset, mem_count))
> + ret = -EFAULT;
> + } else {
> + return do_io_rw(&nvdev->core_device, false, nvdev->resmem.bar_remap.ioaddr,
> + (char __user *) buf, offset, mem_count, 0, 0, false);
> + }
> +
> +read_exit:
> + return ret;
> +}
> +
> +/*
> + * Read count bytes from the device memory at an offset. The actual device
> + * memory size (available) may not be a power-of-2. So the driver fakes
> + * the size to a power-of-2 (reported) when exposing to a user space driver.
> + *
> + * Read request beyond the actual device size is filled with ~0, while
> + * those beyond the actual reported size is skipped.
> + *
> + * A read from a negative or an offset greater than reported size, a negative
> + * count are considered error conditions and returned with an -EINVAL.
> + */
> +static ssize_t
> +nvgrace_gpu_read_mem(struct nvgrace_gpu_vfio_pci_core_device *nvdev,
> + void __user *buf, size_t count, loff_t *ppos)
> +{
> + u64 offset = *ppos & VFIO_PCI_OFFSET_MASK;
> + unsigned int index = VFIO_PCI_OFFSET_TO_INDEX(*ppos);
> + struct mem_region *memregion;
> + size_t mem_count, i, bar_size;
> + u8 val = 0xFF;
> + int ret;
> +
> + memregion = nvgrace_gpu_vfio_pci_fake_bar_mem_region(index, nvdev);
> + if (!memregion)
> + return -EINVAL;
> +
> + bar_size = roundup_pow_of_two(memregion->memlength);
> +
> + if (offset >= bar_size)
> + return -EINVAL;
> +
> + /* Clip short the read request beyond reported BAR size */
> + count = min(count, bar_size - (size_t)offset);
> +
> + /*
> + * Determine how many bytes to be actually read from the device memory.
> + * Read request beyond the actual device memory size is filled with ~0,
> + * while those beyond the actual reported size is skipped.
> + */
> + if (offset >= memregion->memlength)
> + mem_count = 0;
> + else
> + mem_count = min(count, memregion->memlength - (size_t)offset);
> +
> + ret = nvgrace_gpu_map_and_read(nvdev, buf, mem_count, ppos);
> + if (ret)
> + return ret;
> +
> + /*
> + * Only the device memory present on the hardware is mapped, which may
> + * not be power-of-2 aligned. A read to an offset beyond the device memory
> + * size is filled with ~0.
> + */
> + for (i = mem_count; i < count; i++)
> + put_user(val, (unsigned char __user *)(buf + i));
> +
> + *ppos += count;
> + return count;
> +}
> +
> +static ssize_t
> +nvgrace_gpu_vfio_pci_read(struct vfio_device *core_vdev,
> + char __user *buf, size_t count, loff_t *ppos)
> +{
> + unsigned int index = VFIO_PCI_OFFSET_TO_INDEX(*ppos);
> + ssize_t read_count;
> + struct nvgrace_gpu_vfio_pci_core_device *nvdev = container_of(
> + core_vdev, struct nvgrace_gpu_vfio_pci_core_device,
> + core_device.vdev);
> +
> + if (nvgrace_gpu_vfio_pci_is_fake_bar(index)) {
> + /* Check if the bars are disabled, allow access otherwise */
> + down_read(&nvdev->core_device.memory_lock);
> + if (nvdev->bars_disabled) {
> + up_read(&nvdev->core_device.memory_lock);
> + return -EIO;
> + }
> + read_count = nvgrace_gpu_read_mem(nvdev, buf, count, ppos);
> + up_read(&nvdev->core_device.memory_lock);
> +
> + return read_count;
> + }
> +
> + if (index == VFIO_PCI_CONFIG_REGION_INDEX)
> + return nvgrace_gpu_read_config_emu(core_vdev, buf, count, ppos);
> +
> + return vfio_pci_core_read(core_vdev, buf, count, ppos);
> +}
> +
> +/*
> + * Write the data to the device memory (mapped either through ioremap
> + * or memremap) from the user buffer.
> + */
> +static int nvgrace_gpu_map_and_write(struct nvgrace_gpu_vfio_pci_core_device *nvdev,
> + const void __user *buf, size_t mem_count, loff_t *ppos)
> +{
> + unsigned int index = VFIO_PCI_OFFSET_TO_INDEX(*ppos);
> + loff_t pos = *ppos & VFIO_PCI_OFFSET_MASK;
> + int ret = 0;
> +
> + ret = nvgrace_gpu_map_device_mem(nvdev, index);
> + if (ret)
> + goto write_exit;
> +
> + if (index == USEMEM_REGION_INDEX) {
> + if (copy_from_user((u8 *)nvdev->usemem.bar_remap.memaddr + pos,
> + buf, mem_count))
> + return -EFAULT;
> + } else {
> + return do_io_rw(&nvdev->core_device, false, nvdev->resmem.bar_remap.ioaddr,
> + (char __user *) buf, pos, mem_count, 0, 0, true);
> + }
> +
> +write_exit:
> + return ret;
> +}
> +
> +/*
> + * Write count bytes to the device memory at a given offset. The actual device
> + * memory size (available) may not be a power-of-2. So the driver fakes the
> + * size to a power-of-2 (reported) when exposing to a user space driver.
> + *
> + * Write request beyond the actual device size are dropped, while those
> + * beyond the actual reported size are skipped entirely.
> + *
> + * A write to a negative or an offset greater than the reported size, a
> + * negative count are considered error conditions and returned with an -EINVAL.
> + */
> +static ssize_t
> +nvgrace_gpu_write_mem(struct nvgrace_gpu_vfio_pci_core_device *nvdev,
> + size_t count, loff_t *ppos, const void __user *buf)
> +{
> + u64 offset = *ppos & VFIO_PCI_OFFSET_MASK;
> + unsigned int index = VFIO_PCI_OFFSET_TO_INDEX(*ppos);
> + struct mem_region *memregion;
> + size_t mem_count, bar_size;
> + int ret = 0;
> +
> + memregion = nvgrace_gpu_vfio_pci_fake_bar_mem_region(index, nvdev);
> + if (!memregion)
> + return -EINVAL;
> +
> + bar_size = roundup_pow_of_two(memregion->memlength);
> +
> + if (offset >= bar_size)
> + return -EINVAL;
> +
> + /* Clip short the write request beyond reported BAR size */
> + count = min(count, bar_size - (size_t)offset);
> +
> + /*
> + * Determine how many bytes to be actually written to the device memory.
> + * Do not write to the offset beyond available size.
> + */
> + if (offset >= memregion->memlength)
> + goto exitfn;
> +
> + /*
> + * Only the device memory present on the hardware is mapped, which may
> + * not be power-of-2 aligned. Drop access outside the available device
> + * memory on the hardware.
> + */
> + mem_count = min(count, memregion->memlength - (size_t)offset);
> +
> + ret = nvgrace_gpu_map_and_write(nvdev, buf, mem_count, ppos);
> + if (ret)
> + return ret;
> +
> +exitfn:
> + *ppos += count;
> + return count;
> +}
> +
> +static ssize_t
> +nvgrace_gpu_vfio_pci_write(struct vfio_device *core_vdev,
> + const char __user *buf, size_t count, loff_t *ppos)
> +{
> + unsigned int index = VFIO_PCI_OFFSET_TO_INDEX(*ppos);
> + struct nvgrace_gpu_vfio_pci_core_device *nvdev = container_of(
> + core_vdev, struct nvgrace_gpu_vfio_pci_core_device, core_device.vdev);
> + size_t write_count;
> +
> + if (nvgrace_gpu_vfio_pci_is_fake_bar(index)) {
> + /* Check if the bars are disabled, allow access otherwise */
> + down_read(&nvdev->core_device.memory_lock);
> + if (nvdev->bars_disabled) {
> + up_read(&nvdev->core_device.memory_lock);
> + return -EIO;
> + }
> + write_count = nvgrace_gpu_write_mem(nvdev, count, ppos, buf);
> + up_read(&nvdev->core_device.memory_lock);
> + return write_count;
> + }
> +
> + if (index == VFIO_PCI_CONFIG_REGION_INDEX)
> + return nvgrace_gpu_write_config_emu(core_vdev, buf, count, ppos);
> +
> + return vfio_pci_core_write(core_vdev, buf, count, ppos);
> +}
> +
> +static const struct vfio_device_ops nvgrace_gpu_vfio_pci_ops = {
> + .name = "nvgrace-gpu-vfio-pci",
> + .init = vfio_pci_core_init_dev,
> + .release = vfio_pci_core_release_dev,
> + .open_device = nvgrace_gpu_vfio_pci_open_device,
> + .close_device = nvgrace_gpu_vfio_pci_close_device,
> + .ioctl = nvgrace_gpu_vfio_pci_ioctl,
> + .read = nvgrace_gpu_vfio_pci_read,
> + .write = nvgrace_gpu_vfio_pci_write,
> + .mmap = nvgrace_gpu_vfio_pci_mmap,
> + .request = vfio_pci_core_request,
> + .match = vfio_pci_core_match,
> + .bind_iommufd = vfio_iommufd_physical_bind,
> + .unbind_iommufd = vfio_iommufd_physical_unbind,
> + .attach_ioas = vfio_iommufd_physical_attach_ioas,
> + .detach_ioas = vfio_iommufd_physical_detach_ioas,
> +};
> +
> +static struct
> +nvgrace_gpu_vfio_pci_core_device *nvgrace_gpu_drvdata(struct pci_dev *pdev)
> +{
> + struct vfio_pci_core_device *core_device = dev_get_drvdata(&pdev->dev);
> +
> + return container_of(core_device, struct nvgrace_gpu_vfio_pci_core_device,
> + core_device);
> +}
> +
> +static int
> +nvgrace_gpu_vfio_pci_fetch_memory_property(struct pci_dev *pdev,
> + struct nvgrace_gpu_vfio_pci_core_device *nvdev)
> +{
> + int ret;
> + u64 memphys, memlength;
> +
> + /*
> + * The memory information is present in the system ACPI tables as DSD
> + * properties nvidia,gpu-mem-base-pa and nvidia,gpu-mem-size.
> + */
> + ret = device_property_read_u64(&pdev->dev, "nvidia,gpu-mem-base-pa",
> + &(memphys));
> + if (ret)
> + return ret;
> +
> + if (memphys > type_max(phys_addr_t))
> + return -EOVERFLOW;
> +
> + ret = device_property_read_u64(&pdev->dev, "nvidia,gpu-mem-size",
> + &(memlength));
> + if (ret)
> + return ret;
> +
> + if (memlength > type_max(size_t))
> + return -EOVERFLOW;
> +
> + /*
> + * If the C2C link is not up due to an error, the coherent device
> + * memory size is returned as 0. Fail in such case.
> + */
> + if (memlength == 0)
> + return -ENOMEM;
> +
> + /*
> + * The VM GPU device driver needs a non-cacheable region to support
> + * the MIG feature. Since the device memory is mapped as NORMAL cached,
> + * carve out a region from the end with a different NORMAL_NC
> + * property (called as reserved memory and represented as resmem). This
> + * region then is exposed as a 64b BAR (region 2 and 3) to the VM, while
> + * exposing the rest (termed as usable memory and represented using usemem)
> + * as cacheable 64b BAR (region 4 and 5).
> + *
> + * devmem (memlength)
> + * |-------------------------------------------------|
> + * | |
> + * usemem.phys/memphys resmem.phys
> + */
> + nvdev->usemem.memphys = memphys;
> +
> + /*
> + * The device memory exposed to the VM is added to the kernel by the
> + * VM driver module in chunks of memory block size. Only the usable
> + * memory (usemem) is added to the kernel for usage by the VM
> + * workloads. Make the usable memory size memblock aligned.
> + */
> + if (check_sub_overflow(memlength, RESMEM_SIZE,
> + &nvdev->usemem.memlength)) {
> + ret = -EOVERFLOW;
> + goto done;
> + }
> + nvdev->usemem.memlength = round_down(nvdev->usemem.memlength,
> + MEMBLK_SIZE);
> + if ((check_add_overflow(nvdev->usemem.memphys,
> + nvdev->usemem.memlength, &nvdev->resmem.memphys)) ||
> + (check_sub_overflow(memlength, nvdev->usemem.memlength,
> + &nvdev->resmem.memlength))) {
> + ret = -EOVERFLOW;
> + goto done;
> + }
> +
> +done:
> + return ret;
> +}
> +
> +static int nvgrace_gpu_vfio_pci_probe(struct pci_dev *pdev,
> + const struct pci_device_id *id)
> +{
> + struct nvgrace_gpu_vfio_pci_core_device *nvdev;
> + int ret;
> +
> + nvdev = vfio_alloc_device(nvgrace_gpu_vfio_pci_core_device, core_device.vdev,
> + &pdev->dev, &nvgrace_gpu_vfio_pci_ops);
> + if (IS_ERR(nvdev))
> + return PTR_ERR(nvdev);
> +
> + dev_set_drvdata(&pdev->dev, nvdev);
> +
> + ret = nvgrace_gpu_vfio_pci_fetch_memory_property(pdev, nvdev);
> + if (ret)
> + goto out_put_vdev;
> +
> + ret = vfio_pci_core_register_device(&nvdev->core_device);
> + if (ret)
> + goto out_put_vdev;
> +
> + return ret;
> +
> +out_put_vdev:
> + vfio_put_device(&nvdev->core_device.vdev);
> + return ret;
> +}
> +
> +static void nvgrace_gpu_vfio_pci_remove(struct pci_dev *pdev)
> +{
> + struct nvgrace_gpu_vfio_pci_core_device *nvdev = nvgrace_gpu_drvdata(pdev);
> + struct vfio_pci_core_device *vdev = &nvdev->core_device;
> +
> + vfio_pci_core_unregister_device(vdev);
> + vfio_put_device(&vdev->vdev);
> +}
> +
> +static const struct pci_device_id nvgrace_gpu_vfio_pci_table[] = {
> + /* GH200 120GB */
> + { PCI_DRIVER_OVERRIDE_DEVICE_VFIO(PCI_VENDOR_ID_NVIDIA, 0x2342) },
> + /* GH200 480GB */
> + { PCI_DRIVER_OVERRIDE_DEVICE_VFIO(PCI_VENDOR_ID_NVIDIA, 0x2345) },
> + {}
> +};
> +
> +MODULE_DEVICE_TABLE(pci, nvgrace_gpu_vfio_pci_table);
> +
> +static struct pci_driver nvgrace_gpu_vfio_pci_driver = {
> + .name = KBUILD_MODNAME,
> + .id_table = nvgrace_gpu_vfio_pci_table,
> + .probe = nvgrace_gpu_vfio_pci_probe,
> + .remove = nvgrace_gpu_vfio_pci_remove,
> + .err_handler = &vfio_pci_core_err_handlers,
> + .driver_managed_dma = true,
> +};
> +
> +module_pci_driver(nvgrace_gpu_vfio_pci_driver);
> +
> +MODULE_LICENSE("GPL v2");
> +MODULE_AUTHOR("Ankit Agrawal <[email protected]>");
> +MODULE_AUTHOR("Aniket Agashe <[email protected]>");
> +MODULE_DESCRIPTION(
> + "VFIO NVGRACE GPU PF - User Level driver for NVIDIA devices with CPU coherently accessible device memory");
> diff --git a/drivers/vfio/pci/vfio_pci_rdwr.c b/drivers/vfio/pci/vfio_pci_rdwr.c
> index e27de61ac9fe..ca20440b442d 100644
> --- a/drivers/vfio/pci/vfio_pci_rdwr.c
> +++ b/drivers/vfio/pci/vfio_pci_rdwr.c
> @@ -94,10 +94,10 @@ VFIO_IOREAD(32)
> * reads with -1. This is intended for handling MSI-X vector tables and
> * leftover space for ROM BARs.
> */
> -static ssize_t do_io_rw(struct vfio_pci_core_device *vdev, bool test_mem,
> - void __iomem *io, char __user *buf,
> - loff_t off, size_t count, size_t x_start,
> - size_t x_end, bool iswrite)
> +ssize_t do_io_rw(struct vfio_pci_core_device *vdev, bool test_mem,
> + void __iomem *io, char __user *buf,
> + loff_t off, size_t count, size_t x_start,
> + size_t x_end, bool iswrite)
> {
> ssize_t done = 0;
> int ret;
> @@ -199,6 +199,7 @@ static ssize_t do_io_rw(struct vfio_pci_core_device *vdev, bool test_mem,
>
> return done;
> }
> +EXPORT_SYMBOL(do_io_rw);

You should rename this routime with a vfio_ prefix and include its
declaration in include/linux/vfio_pci_core.h. I suppose we want
it to be GPL, right ?

See https://lore.kernel.org/all/[email protected]/

Thanks,

C.

>
> static int vfio_pci_setup_barmap(struct vfio_pci_core_device *vdev, int bar)
> {

2023-12-18 22:17:36

by Alex Williamson

[permalink] [raw]

Subject: Re: [PATCH v15 1/1] vfio/nvgrace-gpu: Add vfio pci variant module for grace hopper

On Mon, 18 Dec 2023 00:40:31 +0530
<[email protected]> wrote:

> From: Ankit Agrawal <[email protected]>
>
> NVIDIA's upcoming Grace Hopper Superchip provides a PCI-like device
> for the on-chip GPU that is the logical OS representation of the
> internal proprietary chip-to-chip cache coherent interconnect.
>
> The device is peculiar compared to a real PCI device in that whilst
> there is a real 64b PCI BAR1 (comprising region 2 & region 3) on the
> device, it is not used to access device memory once the faster
> chip-to-chip interconnect is initialized (occurs at the time of host
> system boot). The device memory is accessed instead using the chip-to-chip
> interconnect that is exposed as a contiguous physically addressable
> region on the host. This device memory aperture can be obtained from host
> ACPI table using device_property_read_u64(), according to the FW
> specification. Since the device memory is cache coherent with the CPU,
> it can be mmap into the user VMA with a cacheable mapping using
> remap_pfn_range() and used like a regular RAM. The device memory
> is not added to the host kernel, but mapped directly as this reduces
> memory wastage due to struct pages.
>
> There is also a requirement of a reserved 1G uncached region (termed as
> resmem) to support the Multi-Instance GPU (MIG) feature [1]. Based on [2],
> the requisite properties (uncached, unaligned access) can be achieved
> through a VM mapping (S1) of NORMAL_NC and host (S2) mapping with
> MemAttr[2:0]=0b101. To provide a different non-cached property to the
> reserved 1G region, it needs to be carved out from the device memory and
> mapped as a separate region in Qemu VMA with pgprot_writecombine().
> pgprot_writecombine() sets the Qemu VMA page properties (pgprot) as
> NORMAL_NC.
>
> Provide a VFIO PCI variant driver that adapts the unique device memory
> representation into a more standard PCI representation facing userspace.
>
> The variant driver exposes these two regions - the non-cached reserved
> (resmem) and the cached rest of the device memory (termed as usemem) as
> separate VFIO 64b BAR regions. Since the device implements 64-bit BAR0,
> the VFIO PCI variant driver maps the uncached carved out region to the
> next available PCI BAR (i.e. comprising of region 2 and 3). The cached
> device memory aperture is assigned BAR region 4 and 5. Qemu will then
> naturally generate a PCI device in the VM with the uncached aperture
> reported as BAR2 region, the cacheable as BAR4. The variant driver provides
> emulation for these fake BARs' PCI config space offset registers. The
> BAR can also be enabled/disabled through PCI_COMMAND config space register.
> The VM driver should enable the BARs (as it does already) to make use
> of the device memory.
>
> The memory layout on the host looks like the following:
> devmem (memlength)
> |--------------------------------------------------|
> |-------------cached------------------------|--NC--|
> | |
> usemem.phys/memphys resmem.phys
>
> PCI BARs need to be aligned to the power-of-2, but the actual memory on the
> device may not. A read or write access to the physical address from the
> last device PFN up to the next power-of-2 aligned physical address
> results in reading ~0 and dropped writes. Note that the GPU device
> driver [6] is capable of knowing the exact device memory size through
> separate means. The device memory size is primarily kept in the system
> ACPI tables for use by the VFIO PCI variant module.
>
> Note that the usemem memory is added by the VM Nvidia device driver [5]
> to the VM kernel as memblocks. Hence make the usable memory size memblock
> aligned.
>
> Currently there is no provision in KVM for a S2 mapping with
> MemAttr[2:0]=0b101, but there is an ongoing effort to provide the same [3].
> As previously mentioned, resmem is mapped pgprot_writecombine(), that
> sets the Qemu VMA page properties (pgprot) as NORMAL_NC. Using the
> proposed changes in [4] and [3], KVM marks the region with
> MemAttr[2:0]=0b101 in S2.
>
> This goes along with a qemu series [6] to provides the necessary
> implementation of the Grace Hopper Superchip firmware specification so
> that the guest operating system can see the correct ACPI modeling for
> the coherent GPU device. Verified with the CUDA workload in the VM.
>
> [1] https://www.nvidia.com/en-in/technologies/multi-instance-gpu/
> [2] section D8.5.5 of https://developer.arm.com/documentation/ddi0487/latest/
> [3] https://lore.kernel.org/all/[email protected]/
> [4] https://lore.kernel.org/all/[email protected]/
> [5] https://github.com/NVIDIA/open-gpu-kernel-modules
> [6] https://lore.kernel.org/all/[email protected]/
>
> Applied over next-20231211.
> ---
> Link for variant driver v14:
> https://lore.kernel.org/all/[email protected]/
>
> v14 -> v15
> - Added case to handle VFIO_DEVICE_IOEVENTFD to return -EIO as it
> is not required on the device.
> - Updated the BAR config space handling code to closely resemble
> by Yishai Hadas (using range_intersect_range) in
> https://lore.kernel.org/all/[email protected]
> - Changed the bar pci config register from union to u64.
> - Adapted the code to disable BAR when it is disabled through
> PCI_COMMAND.
> - Exported and reused the do_io_rw to do mmio accesses.
> - Added a new header file to keep the newly declared structures.
> - Miscellaneous code fixes suggested by Alex Williamson in v14.
>
> v13 -> v14
> - Merged the changes for second BAR implementation for MIG support
> on the device driver.
> https://lore.kernel.org/all/[email protected]/
> - Added the missing implementation of sub-word access to fake BARs'
> PCI config access. Implemented access algorithm suggested by
> Alex Williamson in the comments (Thanks!)
> - Added support to BAR accesses on the reserved memory with
> Qemu device param x-no-mmap=on.
> - Handled endian-ness in the PCI config space access.
> - Git commit message change
>
> v12 -> v13
> - Added emulation for the PCI config space BAR offset register for
> the fake BAR.
> - commit message updated with more details on the BAR offset emulation.
>
> v11 -> v12
> - More details in commit message on device memory size
>
> v10 -> v11
> - Removed sysfs attribute to expose the CPU coherent memory feature
> - Addressed review comments
>
> v9 -> v10
> - Add new sysfs attribute to expose the CPU coherent memory feature.
>
> v8 -> v9
> - Minor code adjustment suggested in v8.
>
> v7 -> v8
> - Various field names updated.
> - Added a new function to handle VFIO_DEVICE_GET_REGION_INFO ioctl.
> - Locking protection for memremap to bar region and other changes
> recommended in v7.
> - Added code to fail if the devmem size advertized is 0 in system DSDT.
>
> v6 -> v7
> - Handled out-of-bound and overflow conditions at various places to validate
> input offset and length.
> - Added code to return EINVAL for offset beyond region size.
>
> v5 -> v6
> - Added the code to handle BAR2 read/write using memremap to the device
> memory.
>
> v4 -> v5
> - Changed the module name from nvgpu-vfio-pci to nvgrace-gpu-vfio-pci.
> - Fixed memory leak and added suggested boundary checks on device memory
> mapping.
> - Added code to read all Fs and ignored write on region outside of the
> physical memory.
> - Other miscellaneous cleanup suggestions.
>
> v3 -> v4
> - Mapping the available device memory using sparse mmap. The region outside
> the device memory is handled by read/write ops.
> - Removed the fault handler added in v3.
>
> v2 -> v3
> - Added fault handler to map the region outside the physical GPU memory
> up to the next power-of-2 to a dummy PFN.
> - Changed to select instead of "depends on" VFIO_PCI_CORE for all the
> vfio-pci variant driver.
> - Code cleanup based on feedback comments.
> - Code implemented and tested against v6.4-rc4.
>
> v1 -> v2
> - Updated the wording of reference to BAR offset and replaced with
> index.
> - The GPU memory is exposed at the fixed BAR2_REGION_INDEX.
> - Code cleanup based on feedback comments.
>
> Signed-off-by: Ankit Agrawal <[email protected]>
> Signed-off-by: Aniket Agashe <[email protected]>
> Tested-by: Ankit Agrawal <[email protected]>
> ---
> MAINTAINERS | 6 +
> drivers/vfio/pci/Kconfig | 2 +
> drivers/vfio/pci/Makefile | 2 +
> drivers/vfio/pci/nvgrace-gpu/Kconfig | 10 +
> drivers/vfio/pci/nvgrace-gpu/Makefile | 3 +
> drivers/vfio/pci/nvgrace-gpu/main.c | 806 ++++++++++++++++++++++++++
> drivers/vfio/pci/vfio_pci_rdwr.c | 9 +-
> 7 files changed, 834 insertions(+), 4 deletions(-)
> create mode 100644 drivers/vfio/pci/nvgrace-gpu/Kconfig
> create mode 100644 drivers/vfio/pci/nvgrace-gpu/Makefile
> create mode 100644 drivers/vfio/pci/nvgrace-gpu/main.c
>
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 98f7dd0499f1..6f8f3a6daa43 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -22877,6 +22877,12 @@ L: [email protected]
> S: Maintained
> F: drivers/vfio/platform/
>
> +VFIO NVIDIA GRACE GPU DRIVER
> +M: Ankit Agrawal <[email protected]>
> +L: [email protected]
> +S: Maintained
> +F: drivers/vfio/pci/nvgrace-gpu/
> +
> VGA_SWITCHEROO
> R: Lukas Wunner <[email protected]>
> S: Maintained
> diff --git a/drivers/vfio/pci/Kconfig b/drivers/vfio/pci/Kconfig
> index 8125e5f37832..2456210e85f1 100644
> --- a/drivers/vfio/pci/Kconfig
> +++ b/drivers/vfio/pci/Kconfig
> @@ -65,4 +65,6 @@ source "drivers/vfio/pci/hisilicon/Kconfig"
>
> source "drivers/vfio/pci/pds/Kconfig"
>
> +source "drivers/vfio/pci/nvgrace-gpu/Kconfig"
> +
> endmenu
> diff --git a/drivers/vfio/pci/Makefile b/drivers/vfio/pci/Makefile
> index 45167be462d8..1352c65e568a 100644
> --- a/drivers/vfio/pci/Makefile
> +++ b/drivers/vfio/pci/Makefile
> @@ -13,3 +13,5 @@ obj-$(CONFIG_MLX5_VFIO_PCI) += mlx5/
> obj-$(CONFIG_HISI_ACC_VFIO_PCI) += hisilicon/
>
> obj-$(CONFIG_PDS_VFIO_PCI) += pds/
> +
> +obj-$(CONFIG_NVGRACE_GPU_VFIO_PCI) += nvgrace-gpu/
> diff --git a/drivers/vfio/pci/nvgrace-gpu/Kconfig b/drivers/vfio/pci/nvgrace-gpu/Kconfig
> new file mode 100644
> index 000000000000..936e88d8d41d
> --- /dev/null
> +++ b/drivers/vfio/pci/nvgrace-gpu/Kconfig
> @@ -0,0 +1,10 @@
> +# SPDX-License-Identifier: GPL-2.0-only
> +config NVGRACE_GPU_VFIO_PCI
> + tristate "VFIO support for the GPU in the NVIDIA Grace Hopper Superchip"
> + depends on ARM64 || (COMPILE_TEST && 64BIT)
> + select VFIO_PCI_CORE
> + help
> + VFIO support for the GPU in the NVIDIA Grace Hopper Superchip is
> + required to assign the GPU device using KVM/qemu/etc.
> +
> + If you don't know what to do here, say N.
> diff --git a/drivers/vfio/pci/nvgrace-gpu/Makefile b/drivers/vfio/pci/nvgrace-gpu/Makefile
> new file mode 100644
> index 000000000000..3ca8c187897a
> --- /dev/null
> +++ b/drivers/vfio/pci/nvgrace-gpu/Makefile
> @@ -0,0 +1,3 @@
> +# SPDX-License-Identifier: GPL-2.0-only
> +obj-$(CONFIG_NVGRACE_GPU_VFIO_PCI) += nvgrace-gpu-vfio-pci.o
> +nvgrace-gpu-vfio-pci-y := main.o
> diff --git a/drivers/vfio/pci/nvgrace-gpu/main.c b/drivers/vfio/pci/nvgrace-gpu/main.c
> new file mode 100644
> index 000000000000..43387a800c41
> --- /dev/null
> +++ b/drivers/vfio/pci/nvgrace-gpu/main.c
> @@ -0,0 +1,806 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/*
> + * Copyright (c) 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved
> + */
> +
> +#include "nvgrace_gpu_vfio_pci.h"

drivers/vfio/pci/nvgrace-gpu/main.c:6:10: fatal error: nvgrace_gpu_vfio_pci.h: No such file or directory
6 | #include "nvgrace_gpu_vfio_pci.h"

> +
> +static bool nvgrace_gpu_vfio_pci_is_fake_bar(int index)
> +{
> + return (index == RESMEM_REGION_INDEX ||
> + index == USEMEM_REGION_INDEX);
> +}

Presumably these macros are defined in the missing header, though we
don't really need a header file just for that. This doesn't need to be
line wrapped, it's short enough with the macros as is.
> +
> +static void nvgrace_gpu_init_fake_bar_emu_regs(struct vfio_device *core_vdev)
> +{
> + struct nvgrace_gpu_vfio_pci_core_device *nvdev = container_of(
> + core_vdev, struct nvgrace_gpu_vfio_pci_core_device,
> + core_device.vdev);
> +
> + nvdev->resmem.u64_reg = 0;
> + nvdev->usemem.u64_reg = 0;
> +}
> +
> +/* Choose the structure corresponding to the fake BAR with a given index. */
> +struct mem_region *
> +nvgrace_gpu_vfio_pci_fake_bar_mem_region(int index,
> + struct nvgrace_gpu_vfio_pci_core_device *nvdev)
> +{
> + if (index == USEMEM_REGION_INDEX)
> + return &(nvdev->usemem);

Parenthesis are unnecessary.

> +
> + if (index == RESMEM_REGION_INDEX)
> + return &(nvdev->resmem);
> +
> + return NULL;
> +}
> +
> +static int nvgrace_gpu_vfio_pci_open_device(struct vfio_device *core_vdev)
> +{
> + struct vfio_pci_core_device *vdev =
> + container_of(core_vdev, struct vfio_pci_core_device, vdev);
> + struct nvgrace_gpu_vfio_pci_core_device *nvdev = container_of(
> + core_vdev, struct nvgrace_gpu_vfio_pci_core_device,
> + core_device.vdev);
> + int ret;
> +
> + ret = vfio_pci_core_enable(vdev);
> + if (ret)
> + return ret;
> +
> + vfio_pci_core_finish_enable(vdev);
> +
> + nvgrace_gpu_init_fake_bar_emu_regs(core_vdev);
> +
> + mutex_init(&nvdev->remap_lock);
> +
> + return 0;
> +}
> +
> +static void nvgrace_gpu_vfio_pci_close_device(struct vfio_device *core_vdev)
> +{
> + struct nvgrace_gpu_vfio_pci_core_device *nvdev = container_of(
> + core_vdev, struct nvgrace_gpu_vfio_pci_core_device,
> + core_device.vdev);
> +
> + /* Unmap the mapping to the device memory cached region */
> + if (nvdev->usemem.bar_remap.memaddr) {
> + memunmap(nvdev->usemem.bar_remap.memaddr);
> + nvdev->usemem.bar_remap.memaddr = NULL;
> + }
> +
> + /* Unmap the mapping to the device memory non-cached region */
> + if (nvdev->resmem.bar_remap.ioaddr) {
> + iounmap(nvdev->resmem.bar_remap.ioaddr);
> + nvdev->resmem.bar_remap.ioaddr = NULL;
> + }
> +
> + mutex_destroy(&nvdev->remap_lock);
> +
> + vfio_pci_core_close_device(core_vdev);
> +}
> +
> +static int nvgrace_gpu_vfio_pci_mmap(struct vfio_device *core_vdev,
> + struct vm_area_struct *vma)
> +{
> + struct nvgrace_gpu_vfio_pci_core_device *nvdev = container_of(
> + core_vdev, struct nvgrace_gpu_vfio_pci_core_device, core_device.vdev);
> +
> + unsigned long start_pfn;
> + unsigned int index;
> + u64 req_len, pgoff, end;
> + int ret = 0;
> + struct mem_region *memregion;
> +
> + index = vma->vm_pgoff >> (VFIO_PCI_OFFSET_SHIFT - PAGE_SHIFT);
> +
> + memregion = nvgrace_gpu_vfio_pci_fake_bar_mem_region(index, nvdev);
> + if (!memregion)
> + return vfio_pci_core_mmap(core_vdev, vma);
> +
> + /*
> + * Request to mmap the BAR. Map to the CPU accessible memory on the
> + * GPU using the memory information gathered from the system ACPI
> + * tables.
> + */
> + pgoff = vma->vm_pgoff &
> + ((1U << (VFIO_PCI_OFFSET_SHIFT - PAGE_SHIFT)) - 1);
> +
> + if (check_sub_overflow(vma->vm_end, vma->vm_start, &req_len) ||
> + check_add_overflow(PHYS_PFN(memregion->memphys), pgoff, &start_pfn) ||
> + check_add_overflow(PFN_PHYS(pgoff), req_len, &end))
> + return -EOVERFLOW;
> +
> + /*
> + * Check that the mapping request does not go beyond available device
> + * memory size
> + */
> + if (end > memregion->memlength)
> + return -EINVAL;
> +
> + /*
> + * The carved out region of the device memory needs the NORMAL_NC
> + * property. Communicate as such to the hypervisor.
> + */
> + if (index == RESMEM_REGION_INDEX)
> + vma->vm_page_prot = pgprot_writecombine(vma->vm_page_prot);
> +
> + /*
> + * Perform a PFN map to the memory and back the device BAR by the
> + * GPU memory.
> + *
> + * The available GPU memory size may not be power-of-2 aligned. Map up
> + * to the size of the device memory. If the memory access is beyond the
> + * actual GPU memory size, it will be handled by the vfio_device_ops
> + * read/write.
> + *
> + * During device reset, the GPU is safely disconnected to the CPU
> + * and access to the BAR will be immediately returned preventing
> + * machine check.
> + */
> + ret = remap_pfn_range(vma, vma->vm_start, start_pfn,
> + req_len, vma->vm_page_prot);
> + if (ret)
> + return ret;
> +
> + vma->vm_pgoff = start_pfn;
> +
> + return 0;
> +}
> +
> +static long
> +nvgrace_gpu_vfio_pci_ioctl_get_region_info(struct vfio_device *core_vdev,
> + unsigned long arg)
> +{
> + unsigned long minsz = offsetofend(struct vfio_region_info, offset);
> + struct nvgrace_gpu_vfio_pci_core_device *nvdev = container_of(
> + core_vdev, struct nvgrace_gpu_vfio_pci_core_device, core_device.vdev);
> + struct vfio_region_info_cap_sparse_mmap *sparse;
> + struct vfio_info_cap caps = { .buf = NULL, .size = 0 };
> + struct vfio_region_info info;
> + struct mem_region *memregion;
> + uint32_t size;
> + int ret;
> +
> + if (copy_from_user(&info, (void __user *)arg, minsz))
> + return -EFAULT;
> +
> + if (info.argsz < minsz)
> + return -EINVAL;
> +
> + memregion = nvgrace_gpu_vfio_pci_fake_bar_mem_region(info.index, nvdev);
> + if (!memregion)
> + return vfio_pci_core_ioctl(core_vdev,
> + VFIO_DEVICE_GET_REGION_INFO, arg);
> +
> + /*
> + * Request to determine the BAR region information. Send the
> + * GPU memory information.
> + */
> + size = struct_size(sparse, areas, 1);
> +
> + /*
> + * Setup for sparse mapping for the device memory. Only the
> + * available device memory on the hardware is shown as a
> + * mappable region.
> + */
> + sparse = kzalloc(size, GFP_KERNEL);
> + if (!sparse)
> + return -ENOMEM;
> +
> + sparse->nr_areas = 1;
> + sparse->areas[0].offset = 0;
> + sparse->areas[0].size = memregion->memlength;
> + sparse->header.id = VFIO_REGION_INFO_CAP_SPARSE_MMAP;
> + sparse->header.version = 1;
> +
> + ret = vfio_info_add_capability(&caps, &sparse->header, size);
> + kfree(sparse);
> + if (ret)
> + return ret;
> +
> + info.offset = VFIO_PCI_INDEX_TO_OFFSET(info.index);
> + /*
> + * The region memory size may not be power-of-2 aligned.
> + * Given that the memory as a BAR and may not be
> + * aligned, roundup to the next power-of-2.
> + */
> + info.size = roundup_pow_of_two(memregion->memlength);
> + info.flags = VFIO_REGION_INFO_FLAG_READ |
> + VFIO_REGION_INFO_FLAG_WRITE |
> + VFIO_REGION_INFO_FLAG_MMAP;

Align these all:

info.flags = VFIO_REGION_INFO_FLAG_READ |
VFIO_REGION_INFO_FLAG_WRITE |
VFIO_REGION_INFO_FLAG_MMAP;

> +
> + if (caps.size) {
> + info.flags |= VFIO_REGION_INFO_FLAG_CAPS;
> + if (info.argsz < sizeof(info) + caps.size) {
> + info.argsz = sizeof(info) + caps.size;
> + info.cap_offset = 0;
> + } else {
> + vfio_info_cap_shift(&caps, sizeof(info));
> + if (copy_to_user((void __user *)arg +
> + sizeof(info), caps.buf,
> + caps.size)) {
> + kfree(caps.buf);
> + return -EFAULT;
> + }
> + info.cap_offset = sizeof(info);
> + }
> + kfree(caps.buf);
> + }
> + return copy_to_user((void __user *)arg, &info, minsz) ?
> + -EFAULT : 0;
> +}
> +
> +static long nvgrace_gpu_vfio_pci_ioctl(struct vfio_device *core_vdev,
> + unsigned int cmd, unsigned long arg)
> +{
> + switch (cmd) {
> + case VFIO_DEVICE_GET_REGION_INFO:
> + return nvgrace_gpu_vfio_pci_ioctl_get_region_info(core_vdev, arg);
> + case VFIO_DEVICE_IOEVENTFD:
> + return -ENOTTY;
> + case VFIO_DEVICE_RESET:
> + nvgrace_gpu_init_fake_bar_emu_regs(core_vdev);
> + fallthrough;
> + default:
> + return vfio_pci_core_ioctl(core_vdev, cmd, arg);
> + }
> +}
> +
> +static bool range_intersect_range(loff_t range1_start, size_t count1,
> + loff_t range2_start, size_t count2,
> + loff_t *start_offset,
> + size_t *intersect_count,
> + size_t *register_offset)
> +{
> + if (range1_start <= range2_start &&
> + range1_start + count1 > range2_start) {
> + *start_offset = range2_start - range1_start;
> + *intersect_count = min_t(size_t, count2,
> + range1_start + count1 - range2_start);
> + *register_offset = 0;
> + return true;
> + }
> +
> + if (range1_start > range2_start &&
> + range1_start < range2_start + count2) {
> + *start_offset = 0;
> + *intersect_count = min_t(size_t, count1,
> + range2_start + count2 - range1_start);
> + *register_offset = range1_start - range2_start;
> + return true;
> + }
> +
> + return false;
> +}

We should put this somewhere shared with virtio-vfio-pci.

> +
> +/*
> + * Both the usable (usemem) and the reserved (resmem) device memory region
> + * are exposed as a 64b fake BARs in the VM. These fake BARs must respond
> + * to the accesses on their respective PCI config space offsets.
> + *
> + * resmem BAR owns PCI_BASE_ADDRESS_2 & PCI_BASE_ADDRESS_3.
> + * usemem BAR owns PCI_BASE_ADDRESS_4 & PCI_BASE_ADDRESS_5.
> + */
> +static ssize_t
> +nvgrace_gpu_read_config_emu(struct vfio_device *core_vdev,
> + char __user *buf, size_t count, loff_t *ppos)
> +{
> + struct nvgrace_gpu_vfio_pci_core_device *nvdev = container_of(
> + core_vdev, struct nvgrace_gpu_vfio_pci_core_device, core_device.vdev);
> + u64 pos = *ppos & VFIO_PCI_OFFSET_MASK;
> + __le64 val64;
> + size_t bar_size;
> + size_t register_offset;
> + loff_t copy_offset;
> + size_t copy_count;
> + int ret;
> +
> + ret = vfio_pci_core_read(core_vdev, buf, count, ppos);
> + if (ret < 0)
> + return ret;
> +
> + if (range_intersect_range(pos, count, PCI_BASE_ADDRESS_2, sizeof(val64),
> + &copy_offset, &copy_count, &register_offset)) {
> + bar_size = roundup_pow_of_two(nvdev->resmem.memlength);
> + nvdev->resmem.u64_reg &= ~(bar_size - 1);
> + nvdev->resmem.u64_reg |= PCI_BASE_ADDRESS_MEM_TYPE_64 |
> + PCI_BASE_ADDRESS_MEM_PREFETCH;
> + val64 = cpu_to_le64(nvdev->resmem.u64_reg);

As suggested and implemented in virtio-vfio-pci, store the value as
little endian, then the write function simply becomes a
copy_from_user(), we only need a cpu native representation of the value
on read.

Between this branch and below it looks like you need a helper function
that returns an __le64 provided a __le64 value, u64 mask, and u64 flags.

> + if (copy_to_user(buf + copy_offset, (void *)&val64 + register_offset, copy_count))
> + return -EFAULT;
> + }
> +
> + if (range_intersect_range(pos, count, PCI_BASE_ADDRESS_4, sizeof(val64),
> + &copy_offset, &copy_count, &register_offset)) {
> + bar_size = roundup_pow_of_two(nvdev->usemem.memlength);
> + nvdev->usemem.u64_reg &= ~(bar_size - 1);
> + nvdev->usemem.u64_reg |= PCI_BASE_ADDRESS_MEM_TYPE_64 |
> + PCI_BASE_ADDRESS_MEM_PREFETCH;
> + val64 = cpu_to_le64(nvdev->usemem.u64_reg);
> + if (copy_to_user(buf + copy_offset, (void *)&val64 + register_offset, copy_count))
> + return -EFAULT;
> + }
> +
> + return count;
> +}
> +
> +static ssize_t
> +nvgrace_gpu_write_config_emu(struct vfio_device *core_vdev,
> + const char __user *buf, size_t count, loff_t *ppos)
> +{
> + struct nvgrace_gpu_vfio_pci_core_device *nvdev = container_of(
> + core_vdev, struct nvgrace_gpu_vfio_pci_core_device, core_device.vdev);
> + u64 pos = *ppos & VFIO_PCI_OFFSET_MASK;
> + __le64 val64 = 0;
> + __le16 val16 = 0;
> + u64 tmp_val;
> + size_t register_offset;
> + loff_t copy_offset;
> + size_t copy_count;
> +
> + if (range_intersect_range(pos, count, PCI_BASE_ADDRESS_2, sizeof(val64),
> + &copy_offset, &copy_count, &register_offset)) {
> + if (copy_from_user((void *)&val64, buf + copy_offset, copy_count))
> + return -EFAULT;
> + tmp_val = le64_to_cpu(val64);
> + memcpy((void *)&(nvdev->resmem.u64_reg) + register_offset,
> + (void *)&tmp_val + copy_offset, copy_count);
> + *ppos += copy_count;
> + return copy_count;
> + }
> +
> + if (range_intersect_range(pos, count, PCI_BASE_ADDRESS_4, sizeof(val64),
> + &copy_offset, &copy_count, &register_offset)) {
> + if (copy_from_user((void *)&val64, buf + copy_offset, copy_count))
> + return -EFAULT;
> + tmp_val = le64_to_cpu(val64);
> + memcpy((void *)&(nvdev->usemem.u64_reg) + register_offset,
> + (void *)&tmp_val + copy_offset, copy_count);
> + *ppos += copy_count;
> + return copy_count;
> + }
> +
> + if (range_intersect_range(pos, count, PCI_COMMAND, sizeof(val16),
> + &copy_offset, &copy_count, &register_offset)) {
> + if (copy_from_user((void *)&val16, buf + copy_offset, copy_count))
> + return -EFAULT;
> +
> + if (le16_to_cpu(val16) & PCI_COMMAND_MEMORY)
> + nvdev->bars_disabled = false;
> + else
> + nvdev->bars_disabled = true;

nvdev->bars_disabled = !(le16_to_cpu(val16) & PCI_COMMAND_MEMORY);

But you're only tracking COMMAND_MEM relative to user writes, memory
will be disabled on reset and should not initially be enabled.

But then also, isn't this really just an empty token of pretending this
is a PCI BAR if only the read/write and not mmap path honor the device
memory enable bit? We'd need to zap and restore vmas mapping these
BARs if this was truly behaving as a PCI memory region.

We discussed previously that access through the coherent memory is
unaffected by device reset, is that also true of this non-cached BAR as
well?

TBH, I'm still struggling with the justification to expose these memory
ranges as BAR space versus attaching them as a device specific region
where QEMU would map them into the VM address space and create ACPI
tables to describe them to reflect the same mechanism in the VM as used
on bare metal. AFAICT the justification boils down to avoiding work in
QEMU and we're sacrificing basic PCI semantics and creating a more
complicated kernel driver to get there. Let's have an on-list
discussion why this is the correct approach.

> + }
> +
> + return vfio_pci_core_write(core_vdev, buf, count, ppos);
> +}
> +
> +/*
> + * Ad hoc map the device memory in the module kernel VA space. Primarily needed
> + * to support Qemu's device x-no-mmap=on option.
> + *
> + * The usemem region is cacheable memory and hence is memremaped.
> + * The resmem region is non-cached and is mapped using ioremap_wc (NORMAL_NC).
> + */
> +static int
> +nvgrace_gpu_map_device_mem(struct nvgrace_gpu_vfio_pci_core_device *nvdev,
> + int index)
> +{
> + int ret = 0;
> +
> + mutex_lock(&nvdev->remap_lock);
> + if (index == USEMEM_REGION_INDEX &&
> + !nvdev->usemem.bar_remap.memaddr) {
> + nvdev->usemem.bar_remap.memaddr
> + = memremap(nvdev->usemem.memphys, nvdev->usemem.memlength, MEMREMAP_WB);
> + if (!nvdev->usemem.bar_remap.memaddr)
> + ret = -ENOMEM;
> + } else if (index == RESMEM_REGION_INDEX &&
> + !nvdev->resmem.bar_remap.ioaddr) {
> + nvdev->resmem.bar_remap.ioaddr
> + = ioremap_wc(nvdev->resmem.memphys, nvdev->resmem.memlength);
> + if (!nvdev->resmem.bar_remap.ioaddr)
> + ret = -ENOMEM;
> + }
> + mutex_unlock(&nvdev->remap_lock);
> +
> + return ret;
> +}
> +
> +/*
> + * Read the data from the device memory (mapped either through ioremap
> + * or memremap) into the user buffer.
> + */
> +static int
> +nvgrace_gpu_map_and_read(struct nvgrace_gpu_vfio_pci_core_device *nvdev,
> + void __user *buf, size_t mem_count, loff_t *ppos)
> +{
> + unsigned int index = VFIO_PCI_OFFSET_TO_INDEX(*ppos);
> + u64 offset = *ppos & VFIO_PCI_OFFSET_MASK;
> + int ret = 0;
> +
> + /*
> + * Handle read on the BAR regions. Map to the target device memory
> + * physical address and copy to the request read buffer.
> + */
> + ret = nvgrace_gpu_map_device_mem(nvdev, index);
> + if (ret)
> + goto read_exit;

We don't need a goto to simply return an error.

> +
> + if (index == USEMEM_REGION_INDEX) {
> + if (copy_to_user(buf, (u8 *)nvdev->usemem.bar_remap.memaddr + offset, mem_count))
> + ret = -EFAULT;
> + } else {
> + return do_io_rw(&nvdev->core_device, false, nvdev->resmem.bar_remap.ioaddr,
> + (char __user *) buf, offset, mem_count, 0, 0, false);

The vfio_device_ops.read prototype defines buf as a char __user*, so
maybe look at why it's being passed as a void __user* rather than
casting.

> + }
> +
> +read_exit:
> + return ret;
> +}
> +
> +/*
> + * Read count bytes from the device memory at an offset. The actual device
> + * memory size (available) may not be a power-of-2. So the driver fakes
> + * the size to a power-of-2 (reported) when exposing to a user space driver.
> + *
> + * Read request beyond the actual device size is filled with ~0, while
> + * those beyond the actual reported size is skipped.
> + *
> + * A read from a negative or an offset greater than reported size, a negative
> + * count are considered error conditions and returned with an -EINVAL.
> + */
> +static ssize_t
> +nvgrace_gpu_read_mem(struct nvgrace_gpu_vfio_pci_core_device *nvdev,
> + void __user *buf, size_t count, loff_t *ppos)
> +{
> + u64 offset = *ppos & VFIO_PCI_OFFSET_MASK;
> + unsigned int index = VFIO_PCI_OFFSET_TO_INDEX(*ppos);
> + struct mem_region *memregion;
> + size_t mem_count, i, bar_size;
> + u8 val = 0xFF;
> + int ret;
> +
> + memregion = nvgrace_gpu_vfio_pci_fake_bar_mem_region(index, nvdev);
> + if (!memregion)
> + return -EINVAL;
> +
> + bar_size = roundup_pow_of_two(memregion->memlength);
> +
> + if (offset >= bar_size)
> + return -EINVAL;
> +
> + /* Clip short the read request beyond reported BAR size */
> + count = min(count, bar_size - (size_t)offset);
> +
> + /*
> + * Determine how many bytes to be actually read from the device memory.
> + * Read request beyond the actual device memory size is filled with ~0,
> + * while those beyond the actual reported size is skipped.
> + */
> + if (offset >= memregion->memlength)
> + mem_count = 0;
> + else
> + mem_count = min(count, memregion->memlength - (size_t)offset);
> +
> + ret = nvgrace_gpu_map_and_read(nvdev, buf, mem_count, ppos);
> + if (ret)
> + return ret;
> +
> + /*
> + * Only the device memory present on the hardware is mapped, which may
> + * not be power-of-2 aligned. A read to an offset beyond the device memory
> + * size is filled with ~0.
> + */
> + for (i = mem_count; i < count; i++)
> + put_user(val, (unsigned char __user *)(buf + i));
> +
> + *ppos += count;
> + return count;
> +}
> +
> +static ssize_t
> +nvgrace_gpu_vfio_pci_read(struct vfio_device *core_vdev,
> + char __user *buf, size_t count, loff_t *ppos)
> +{
> + unsigned int index = VFIO_PCI_OFFSET_TO_INDEX(*ppos);
> + ssize_t read_count;
> + struct nvgrace_gpu_vfio_pci_core_device *nvdev = container_of(
> + core_vdev, struct nvgrace_gpu_vfio_pci_core_device,
> + core_device.vdev);
> +
> + if (nvgrace_gpu_vfio_pci_is_fake_bar(index)) {

As noted in previous version, we don't need this function:

if (nvgrace_gpu_vfio_pci_fake_bar_mem_region(index, nvdev)) {

> + /* Check if the bars are disabled, allow access otherwise */
> + down_read(&nvdev->core_device.memory_lock);
> + if (nvdev->bars_disabled) {
> + up_read(&nvdev->core_device.memory_lock);
> + return -EIO;
> + }

Why do we need bars_disabled here, or at all? If we let do_io_rw()
test memory it would read the command register from vconfig and all of
this is redundant.

> + read_count = nvgrace_gpu_read_mem(nvdev, buf, count, ppos);
> + up_read(&nvdev->core_device.memory_lock);
> +
> + return read_count;
> + }
> +
> + if (index == VFIO_PCI_CONFIG_REGION_INDEX)
> + return nvgrace_gpu_read_config_emu(core_vdev, buf, count, ppos);
> +
> + return vfio_pci_core_read(core_vdev, buf, count, ppos);
> +}
> +
> +/*
> + * Write the data to the device memory (mapped either through ioremap
> + * or memremap) from the user buffer.
> + */
> +static int nvgrace_gpu_map_and_write(struct nvgrace_gpu_vfio_pci_core_device *nvdev,
> + const void __user *buf, size_t mem_count, loff_t *ppos)
> +{
> + unsigned int index = VFIO_PCI_OFFSET_TO_INDEX(*ppos);
> + loff_t pos = *ppos & VFIO_PCI_OFFSET_MASK;
> + int ret = 0;
> +
> + ret = nvgrace_gpu_map_device_mem(nvdev, index);
> + if (ret)
> + goto write_exit;
> +
> + if (index == USEMEM_REGION_INDEX) {
> + if (copy_from_user((u8 *)nvdev->usemem.bar_remap.memaddr + pos,
> + buf, mem_count))
> + return -EFAULT;
> + } else {
> + return do_io_rw(&nvdev->core_device, false, nvdev->resmem.bar_remap.ioaddr,
> + (char __user *) buf, pos, mem_count, 0, 0, true);
> + }
> +
> +write_exit:
> + return ret;
> +}
> +
> +/*
> + * Write count bytes to the device memory at a given offset. The actual device
> + * memory size (available) may not be a power-of-2. So the driver fakes the
> + * size to a power-of-2 (reported) when exposing to a user space driver.
> + *
> + * Write request beyond the actual device size are dropped, while those
> + * beyond the actual reported size are skipped entirely.
> + *
> + * A write to a negative or an offset greater than the reported size, a
> + * negative count are considered error conditions and returned with an -EINVAL.
> + */
> +static ssize_t
> +nvgrace_gpu_write_mem(struct nvgrace_gpu_vfio_pci_core_device *nvdev,
> + size_t count, loff_t *ppos, const void __user *buf)
> +{
> + u64 offset = *ppos & VFIO_PCI_OFFSET_MASK;
> + unsigned int index = VFIO_PCI_OFFSET_TO_INDEX(*ppos);
> + struct mem_region *memregion;
> + size_t mem_count, bar_size;
> + int ret = 0;
> +
> + memregion = nvgrace_gpu_vfio_pci_fake_bar_mem_region(index, nvdev);
> + if (!memregion)
> + return -EINVAL;
> +
> + bar_size = roundup_pow_of_two(memregion->memlength);
> +
> + if (offset >= bar_size)
> + return -EINVAL;
> +
> + /* Clip short the write request beyond reported BAR size */
> + count = min(count, bar_size - (size_t)offset);
> +
> + /*
> + * Determine how many bytes to be actually written to the device memory.
> + * Do not write to the offset beyond available size.
> + */
> + if (offset >= memregion->memlength)
> + goto exitfn;
> +
> + /*
> + * Only the device memory present on the hardware is mapped, which may
> + * not be power-of-2 aligned. Drop access outside the available device
> + * memory on the hardware.
> + */
> + mem_count = min(count, memregion->memlength - (size_t)offset);
> +
> + ret = nvgrace_gpu_map_and_write(nvdev, buf, mem_count, ppos);
> + if (ret)
> + return ret;
> +
> +exitfn:
> + *ppos += count;
> + return count;
> +}
> +
> +static ssize_t
> +nvgrace_gpu_vfio_pci_write(struct vfio_device *core_vdev,
> + const char __user *buf, size_t count, loff_t *ppos)
> +{
> + unsigned int index = VFIO_PCI_OFFSET_TO_INDEX(*ppos);
> + struct nvgrace_gpu_vfio_pci_core_device *nvdev = container_of(
> + core_vdev, struct nvgrace_gpu_vfio_pci_core_device, core_device.vdev);
> + size_t write_count;
> +
> + if (nvgrace_gpu_vfio_pci_is_fake_bar(index)) {
> + /* Check if the bars are disabled, allow access otherwise */
> + down_read(&nvdev->core_device.memory_lock);
> + if (nvdev->bars_disabled) {
> + up_read(&nvdev->core_device.memory_lock);
> + return -EIO;
> + }
> + write_count = nvgrace_gpu_write_mem(nvdev, count, ppos, buf);
> + up_read(&nvdev->core_device.memory_lock);
> + return write_count;
> + }
> +
> + if (index == VFIO_PCI_CONFIG_REGION_INDEX)
> + return nvgrace_gpu_write_config_emu(core_vdev, buf, count, ppos);
> +
> + return vfio_pci_core_write(core_vdev, buf, count, ppos);
> +}
> +
> +static const struct vfio_device_ops nvgrace_gpu_vfio_pci_ops = {
> + .name = "nvgrace-gpu-vfio-pci",
> + .init = vfio_pci_core_init_dev,
> + .release = vfio_pci_core_release_dev,
> + .open_device = nvgrace_gpu_vfio_pci_open_device,
> + .close_device = nvgrace_gpu_vfio_pci_close_device,
> + .ioctl = nvgrace_gpu_vfio_pci_ioctl,
> + .read = nvgrace_gpu_vfio_pci_read,
> + .write = nvgrace_gpu_vfio_pci_write,
> + .mmap = nvgrace_gpu_vfio_pci_mmap,
> + .request = vfio_pci_core_request,
> + .match = vfio_pci_core_match,
> + .bind_iommufd = vfio_iommufd_physical_bind,
> + .unbind_iommufd = vfio_iommufd_physical_unbind,
> + .attach_ioas = vfio_iommufd_physical_attach_ioas,
> + .detach_ioas = vfio_iommufd_physical_detach_ioas,
> +};
> +
> +static struct
> +nvgrace_gpu_vfio_pci_core_device *nvgrace_gpu_drvdata(struct pci_dev *pdev)
> +{
> + struct vfio_pci_core_device *core_device = dev_get_drvdata(&pdev->dev);
> +
> + return container_of(core_device, struct nvgrace_gpu_vfio_pci_core_device,
> + core_device);
> +}
> +
> +static int
> +nvgrace_gpu_vfio_pci_fetch_memory_property(struct pci_dev *pdev,
> + struct nvgrace_gpu_vfio_pci_core_device *nvdev)
> +{
> + int ret;
> + u64 memphys, memlength;
> +
> + /*
> + * The memory information is present in the system ACPI tables as DSD
> + * properties nvidia,gpu-mem-base-pa and nvidia,gpu-mem-size.
> + */
> + ret = device_property_read_u64(&pdev->dev, "nvidia,gpu-mem-base-pa",
> + &(memphys));
> + if (ret)
> + return ret;
> +
> + if (memphys > type_max(phys_addr_t))
> + return -EOVERFLOW;
> +
> + ret = device_property_read_u64(&pdev->dev, "nvidia,gpu-mem-size",
> + &(memlength));
> + if (ret)
> + return ret;
> +
> + if (memlength > type_max(size_t))
> + return -EOVERFLOW;
> +
> + /*
> + * If the C2C link is not up due to an error, the coherent device
> + * memory size is returned as 0. Fail in such case.
> + */
> + if (memlength == 0)
> + return -ENOMEM;
> +
> + /*
> + * The VM GPU device driver needs a non-cacheable region to support
> + * the MIG feature. Since the device memory is mapped as NORMAL cached,
> + * carve out a region from the end with a different NORMAL_NC
> + * property (called as reserved memory and represented as resmem). This
> + * region then is exposed as a 64b BAR (region 2 and 3) to the VM, while
> + * exposing the rest (termed as usable memory and represented using usemem)
> + * as cacheable 64b BAR (region 4 and 5).
> + *
> + * devmem (memlength)
> + * |-------------------------------------------------|
> + * | |
> + * usemem.phys/memphys resmem.phys
> + */
> + nvdev->usemem.memphys = memphys;
> +
> + /*
> + * The device memory exposed to the VM is added to the kernel by the
> + * VM driver module in chunks of memory block size. Only the usable
> + * memory (usemem) is added to the kernel for usage by the VM
> + * workloads. Make the usable memory size memblock aligned.
> + */
> + if (check_sub_overflow(memlength, RESMEM_SIZE,
> + &nvdev->usemem.memlength)) {
> + ret = -EOVERFLOW;
> + goto done;
> + }
> + nvdev->usemem.memlength = round_down(nvdev->usemem.memlength,
> + MEMBLK_SIZE);
> + if ((check_add_overflow(nvdev->usemem.memphys,
> + nvdev->usemem.memlength, &nvdev->resmem.memphys)) ||
> + (check_sub_overflow(memlength, nvdev->usemem.memlength,
> + &nvdev->resmem.memlength))) {
> + ret = -EOVERFLOW;
> + goto done;
> + }
> +
> +done:
> + return ret;
> +}
> +
> +static int nvgrace_gpu_vfio_pci_probe(struct pci_dev *pdev,
> + const struct pci_device_id *id)
> +{
> + struct nvgrace_gpu_vfio_pci_core_device *nvdev;
> + int ret;
> +
> + nvdev = vfio_alloc_device(nvgrace_gpu_vfio_pci_core_device, core_device.vdev,
> + &pdev->dev, &nvgrace_gpu_vfio_pci_ops);
> + if (IS_ERR(nvdev))
> + return PTR_ERR(nvdev);
> +
> + dev_set_drvdata(&pdev->dev, nvdev);
> +
> + ret = nvgrace_gpu_vfio_pci_fetch_memory_property(pdev, nvdev);
> + if (ret)
> + goto out_put_vdev;
> +
> + ret = vfio_pci_core_register_device(&nvdev->core_device);
> + if (ret)
> + goto out_put_vdev;
> +
> + return ret;
> +
> +out_put_vdev:
> + vfio_put_device(&nvdev->core_device.vdev);
> + return ret;
> +}
> +
> +static void nvgrace_gpu_vfio_pci_remove(struct pci_dev *pdev)
> +{
> + struct nvgrace_gpu_vfio_pci_core_device *nvdev = nvgrace_gpu_drvdata(pdev);
> + struct vfio_pci_core_device *vdev = &nvdev->core_device;
> +
> + vfio_pci_core_unregister_device(vdev);
> + vfio_put_device(&vdev->vdev);
> +}
> +
> +static const struct pci_device_id nvgrace_gpu_vfio_pci_table[] = {
> + /* GH200 120GB */
> + { PCI_DRIVER_OVERRIDE_DEVICE_VFIO(PCI_VENDOR_ID_NVIDIA, 0x2342) },
> + /* GH200 480GB */
> + { PCI_DRIVER_OVERRIDE_DEVICE_VFIO(PCI_VENDOR_ID_NVIDIA, 0x2345) },
> + {}
> +};
> +
> +MODULE_DEVICE_TABLE(pci, nvgrace_gpu_vfio_pci_table);
> +
> +static struct pci_driver nvgrace_gpu_vfio_pci_driver = {
> + .name = KBUILD_MODNAME,
> + .id_table = nvgrace_gpu_vfio_pci_table,
> + .probe = nvgrace_gpu_vfio_pci_probe,
> + .remove = nvgrace_gpu_vfio_pci_remove,
> + .err_handler = &vfio_pci_core_err_handlers,
> + .driver_managed_dma = true,
> +};
> +
> +module_pci_driver(nvgrace_gpu_vfio_pci_driver);
> +
> +MODULE_LICENSE("GPL v2");
> +MODULE_AUTHOR("Ankit Agrawal <[email protected]>");
> +MODULE_AUTHOR("Aniket Agashe <[email protected]>");
> +MODULE_DESCRIPTION(
> + "VFIO NVGRACE GPU PF - User Level driver for NVIDIA devices with CPU coherently accessible device memory");
> diff --git a/drivers/vfio/pci/vfio_pci_rdwr.c b/drivers/vfio/pci/vfio_pci_rdwr.c
> index e27de61ac9fe..ca20440b442d 100644
> --- a/drivers/vfio/pci/vfio_pci_rdwr.c
> +++ b/drivers/vfio/pci/vfio_pci_rdwr.c
> @@ -94,10 +94,10 @@ VFIO_IOREAD(32)
> * reads with -1. This is intended for handling MSI-X vector tables and
> * leftover space for ROM BARs.
> */
> -static ssize_t do_io_rw(struct vfio_pci_core_device *vdev, bool test_mem,
> - void __iomem *io, char __user *buf,
> - loff_t off, size_t count, size_t x_start,
> - size_t x_end, bool iswrite)
> +ssize_t do_io_rw(struct vfio_pci_core_device *vdev, bool test_mem,
> + void __iomem *io, char __user *buf,
> + loff_t off, size_t count, size_t x_start,
> + size_t x_end, bool iswrite)

This should be exported in a separate patch and renamed to be
consistent with other vfio-pci-core functions.

> {
> ssize_t done = 0;
> int ret;
> @@ -199,6 +199,7 @@ static ssize_t do_io_rw(struct vfio_pci_core_device *vdev, bool test_mem,
>
> return done;
> }
> +EXPORT_SYMBOL(do_io_rw);

NAK, _GPL. Thanks,

Alex

>
> static int vfio_pci_setup_barmap(struct vfio_pci_core_device *vdev, int bar)
> {

2023-12-21 12:43:44

by Ankit Agrawal

[permalink] [raw]

Subject: Re: [PATCH v15 1/1] vfio/nvgrace-gpu: Add vfio pci variant module for grace hopper

Thanks Alex and Cedric for the review.

>> +/*
>> + * Copyright (c) 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved
>> + */
>> +
>> +#include "nvgrace_gpu_vfio_pci.h"
>
> drivers/vfio/pci/nvgrace-gpu/main.c:6:10: fatal error: nvgrace_gpu_vfio_pci.h: No such file or directory
>??? 6 | #include "nvgrace_gpu_vfio_pci.h"

Yeah, managed to miss the file. Will fix that in the next version.

>> +
>> +static bool nvgrace_gpu_vfio_pci_is_fake_bar(int index)
>> +{
>> +???? return (index == RESMEM_REGION_INDEX ||
>> +???????? index == USEMEM_REGION_INDEX);
>> +}
>
> Presumably these macros are defined in the missing header, though we
> don't really need a header file just for that.? This doesn't need to be
> line wrapped, it's short enough with the macros as is.

Yeah that and the structures are moved to the header file.

>> +???? info.flags = VFIO_REGION_INFO_FLAG_READ |
>> +???????????? VFIO_REGION_INFO_FLAG_WRITE |
>> +???????????? VFIO_REGION_INFO_FLAG_MMAP;
>
> Align these all:
>
>??????? info.flags = VFIO_REGION_INFO_FLAG_READ |
>??????????????????? VFIO_REGION_INFO_FLAG_WRITE |
>???????????????????? VFIO_REGION_INFO_FLAG_MMAP;

Ack.

>> +
>> +static bool range_intersect_range(loff_t range1_start, size_t count1,
>> +?????????????????????????????? loff_t range2_start, size_t count2,
>> +?????????????????????????????? loff_t *start_offset,
>> +?????????????????????????????? size_t *intersect_count,
>> +?????????????????????????????? size_t *register_offset)
>
> We should put this somewhere shared with virtio-vfio-pci.

Yeah, will move to vfio_pci_core.c

>> +
>> +???? if (range_intersect_range(pos, count, PCI_BASE_ADDRESS_2, sizeof(val64),
>> +?????????????????????????????? &copy_offset, &copy_count, &register_offset)) {
>> +???????????? bar_size = roundup_pow_of_two(nvdev->resmem.memlength);
>> +???????????? nvdev->resmem.u64_reg &= ~(bar_size - 1);
>> +???????????? nvdev->resmem.u64_reg |= PCI_BASE_ADDRESS_MEM_TYPE_64 |
>> +????????????????????????????????????? PCI_BASE_ADDRESS_MEM_PREFETCH;
>> +???????????? val64 = cpu_to_le64(nvdev->resmem.u64_reg);
>
> As suggested and implemented in virtio-vfio-pci, store the value as
> little endian, then the write function simply becomes a
> copy_from_user(), we only need a cpu native representation of the value
> on read.

Ack.

>> +
>> +???? if (range_intersect_range(pos, count, PCI_COMMAND, sizeof(val16),
>> +?????????????????????????????? &copy_offset, &copy_count, &register_offset)) {
>> +???????????? if (copy_from_user((void *)&val16, buf + copy_offset, copy_count))
>> +???????????????????? return -EFAULT;
>> +
>> +???????????? if (le16_to_cpu(val16) & PCI_COMMAND_MEMORY)
>> +???????????????????? nvdev->bars_disabled = false;
>> +???????????? else
>> +???????????????????? nvdev->bars_disabled = true;
>
> nvdev->bars_disabled = !(le16_to_cpu(val16) & PCI_COMMAND_MEMORY);
>
> But you're only tracking COMMAND_MEM relative to user writes, memory
> will be disabled on reset and should not initially be enabled.

I suppose you are suggesting we disable during reset and not enable until
VM driver does so through PCI_COMMAND?

> But then also, isn't this really just an empty token of pretending this
> is a PCI BAR if only the read/write and not mmap path honor the device
> memory enable bit?? We'd need to zap and restore vmas mapping these
> BARs if this was truly behaving as a PCI memory region.

I can do that change, but for my information, is this a requirement to be
PCI compliant?

> We discussed previously that access through the coherent memory is
> unaffected by device reset, is that also true of this non-cached BAR as
> well?

Actually, the coherent memory is not entirely unaffected during device reset.
It becomes unavailable (and read access returns ~0) for a brief time during
the reset. The non-cached BAR behaves in the same way as they are just
as it is just a carved out part of device memory.

> TBH, I'm still struggling with the justification to expose these memory
> ranges as BAR space versus attaching them as a device specific region
> where QEMU would map them into the VM address space and create ACPI
> tables to describe them to reflect the same mechanism in the VM as used
> on bare metal.? AFAICT the justification boils down to avoiding work in
> QEMU and we're sacrificing basic PCI semantics and creating a more
> complicated kernel driver to get there.? Let's have an on-list
> discussion why this is the correct approach.

Sorry it isn't clear to me how we are sacrificing PCI semantics here. What
features are we compromising (after we fix the ones you pointed out above)?

And if we managed to make these fake BARs PCI compliant, I suppose the
primary objection is the additional code that we added to make it compliant?

>> +???? ret = nvgrace_gpu_map_device_mem(nvdev, index);
>> +???? if (ret)
>> +???????????? goto read_exit;
>
> We don't need a goto to simply return an error.

Yes, will fix that.

>> +???? if (index == USEMEM_REGION_INDEX) {
>> +???????????? if (copy_to_user(buf, (u8 *)nvdev->usemem.bar_remap.memaddr + offset, mem_count))
>> +???????????????????? ret = -EFAULT;
>> +???? } else {
>> +???????????? return do_io_rw(&nvdev->core_device, false, nvdev->resmem.bar_remap.ioaddr,
>> +???????????????????????????? (char __user *) buf, offset, mem_count, 0, 0, false);
>
> The vfio_device_ops.read prototype defines buf as a char __user*, so
> maybe look at why it's being passed as a void __user* rather than
> casting.

True, will fix that.

>> +???????????? /* Check if the bars are disabled, allow access otherwise */
>> +???????????? down_read(&nvdev->core_device.memory_lock);
>> +???????????? if (nvdev->bars_disabled) {
>> +???????????????????? up_read(&nvdev->core_device.memory_lock);
>> +???????????????????? return -EIO;
>> +???????????? }
>
> Why do we need bars_disabled here, or at all?? If we let do_io_rw()
> test memory it would read the command register from vconfig and all of
> this is redundant.

Yes, and I will make use of the same flag to cover the
USEMEM_REGION_INDEX cacheable device memory accesses.

>> -static ssize_t do_io_rw(struct vfio_pci_core_device *vdev, bool test_mem,
>> -???????????????????? void __iomem *io, char __user *buf,
>> -???????????????????? loff_t off, size_t count, size_t x_start,
>> -???????????????????? size_t x_end, bool iswrite)
>> +ssize_t do_io_rw(struct vfio_pci_core_device *vdev, bool test_mem,
>> +?????????????? void __iomem *io, char __user *buf,
>> +?????????????? loff_t off, size_t count, size_t x_start,
>> +?????????????? size_t x_end, bool iswrite)
>
> This should be exported in a separate patch and renamed to be
> consistent with other vfio-pci-core functions.

Sure, and will rename with vfio_pci_core prefix.

>> @@ -199,6 +199,7 @@ static ssize_t do_io_rw(struct vfio_pci_core_device *vdev, bool test_mem,
>>
>>?????? return done;
>>? }
>> +EXPORT_SYMBOL(do_io_rw);
>
> NAK, _GPL.? Thanks,

Yes, will make the change.

>./scripts/checkpatch.pl --strict will give you some tips on how to
improve the changes furthermore.

Yes, will do that.

2024-01-02 16:10:15

by Alex Williamson

[permalink] [raw]

Subject: Re: [PATCH v15 1/1] vfio/nvgrace-gpu: Add vfio pci variant module for grace hopper

On Thu, 21 Dec 2023 12:43:28 +0000
Ankit Agrawal <[email protected]> wrote:

> Thanks Alex and Cedric for the review.
>
> >> +/*
> >> + * Copyright (c) 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved
> >> + */
> >> +
> >> +#include "nvgrace_gpu_vfio_pci.h"
> >
> > drivers/vfio/pci/nvgrace-gpu/main.c:6:10: fatal error: nvgrace_gpu_vfio_pci.h: No such file or directory
> >    6 | #include "nvgrace_gpu_vfio_pci.h"
>
> Yeah, managed to miss the file. Will fix that in the next version.
>
> >> +
> >> +static bool nvgrace_gpu_vfio_pci_is_fake_bar(int index)
> >> +{
> >> +     return (index == RESMEM_REGION_INDEX ||
> >> +         index == USEMEM_REGION_INDEX);
> >> +}
> >
> > Presumably these macros are defined in the missing header, though we
> > don't really need a header file just for that. This doesn't need to be
> > line wrapped, it's short enough with the macros as is.
>
> Yeah that and the structures are moved to the header file.
>
> >> +     info.flags = VFIO_REGION_INFO_FLAG_READ |
> >> +             VFIO_REGION_INFO_FLAG_WRITE |
> >> +             VFIO_REGION_INFO_FLAG_MMAP;
> >
> > Align these all:
> >
> >        info.flags = VFIO_REGION_INFO_FLAG_READ |
> >                    VFIO_REGION_INFO_FLAG_WRITE |
> >                     VFIO_REGION_INFO_FLAG_MMAP;
>
> Ack.
>
> >> +
> >> +static bool range_intersect_range(loff_t range1_start, size_t count1,
> >> +                               loff_t range2_start, size_t count2,
> >> +                               loff_t *start_offset,
> >> +                               size_t *intersect_count,
> >> +                               size_t *register_offset)
> >
> > We should put this somewhere shared with virtio-vfio-pci.
>
> Yeah, will move to vfio_pci_core.c
>
> >> +
> >> +     if (range_intersect_range(pos, count, PCI_BASE_ADDRESS_2, sizeof(val64),
> >> +                               &copy_offset, &copy_count, &register_offset)) {
> >> +             bar_size = roundup_pow_of_two(nvdev->resmem.memlength);
> >> +             nvdev->resmem.u64_reg &= ~(bar_size - 1);
> >> +             nvdev->resmem.u64_reg |= PCI_BASE_ADDRESS_MEM_TYPE_64 |
> >> +                                      PCI_BASE_ADDRESS_MEM_PREFETCH;
> >> +             val64 = cpu_to_le64(nvdev->resmem.u64_reg);
> >
> > As suggested and implemented in virtio-vfio-pci, store the value as
> > little endian, then the write function simply becomes a
> > copy_from_user(), we only need a cpu native representation of the value
> > on read.
>
> Ack.
>
> >> +
> >> +     if (range_intersect_range(pos, count, PCI_COMMAND, sizeof(val16),
> >> +                               &copy_offset, &copy_count, &register_offset)) {
> >> +             if (copy_from_user((void *)&val16, buf + copy_offset, copy_count))
> >> +                     return -EFAULT;
> >> +
> >> +             if (le16_to_cpu(val16) & PCI_COMMAND_MEMORY)
> >> +                     nvdev->bars_disabled = false;
> >> +             else
> >> +                     nvdev->bars_disabled = true;
> >
> > nvdev->bars_disabled = !(le16_to_cpu(val16) & PCI_COMMAND_MEMORY);
> >
> > But you're only tracking COMMAND_MEM relative to user writes, memory
> > will be disabled on reset and should not initially be enabled.
>
> I suppose you are suggesting we disable during reset and not enable until
> VM driver does so through PCI_COMMAND?

Unless we're working with a VF, MMIO space of a physical device is
governed by the the memory enable bit in the command register, so how
do we justify that these manufactured BARs don't have standard physical
function semantics, ie. they're accessible regardless of the command
register? Further, if we emulate any of those semantics, we should
emulate all of those semantics, r/w as well as mapped access.

> > But then also, isn't this really just an empty token of pretending this
> > is a PCI BAR if only the read/write and not mmap path honor the device
> > memory enable bit? We'd need to zap and restore vmas mapping these
> > BARs if this was truly behaving as a PCI memory region.
>
> I can do that change, but for my information, is this a requirement to be
> PCI compliant?

I think it all falls into the justification of why this variant driver
chooses to expose these coherent and non-coherent ranges as PCI BARs.
A PCI BAR would follow the semantics that its accessibility is governed
by the memory enable bit. The physical platform exposes these as
ranges exposed through ACPI, maybe in part exactly because these ranges
live outside of the PCI device semantics.

> > We discussed previously that access through the coherent memory is
> > unaffected by device reset, is that also true of this non-cached BAR as
> > well?
>
> Actually, the coherent memory is not entirely unaffected during device reset.
> It becomes unavailable (and read access returns ~0) for a brief time during
> the reset. The non-cached BAR behaves in the same way as they are just
> as it is just a carved out part of device memory.

So the reset behavior is at least safe, but in tying them to a PCI BAR
we don't really honor the memory enable bit.

> > TBH, I'm still struggling with the justification to expose these memory
> > ranges as BAR space versus attaching them as a device specific region
> > where QEMU would map them into the VM address space and create ACPI
> > tables to describe them to reflect the same mechanism in the VM as used
> > on bare metal. AFAICT the justification boils down to avoiding work in
> > QEMU and we're sacrificing basic PCI semantics and creating a more
> > complicated kernel driver to get there. Let's have an on-list
> > discussion why this is the correct approach.
>
> Sorry it isn't clear to me how we are sacrificing PCI semantics here. What
> features are we compromising (after we fix the ones you pointed out above)?
>
> And if we managed to make these fake BARs PCI compliant, I suppose the
> primary objection is the additional code that we added to make it compliant?

Yes, it's possible to add support that these ranges honor the memory
enable bit, but it's not trivial and unfortunately even vfio-pci isn't
a great example of this. The fault handling there still has lockdep
issues and arguably has more significant mm issues. OTOH, there are no
fixed semantics or precedents for a device specific region. We could
claim that it only supports mmap and requires no special handling
around device reset or relative to the PCI command register. The
variant driver becomes a trivial implementation that masks BARs 2 & 4
and exposes the ACPI range as a device specific region with only mmap
support. QEMU can then map the device specific region into VM memory
and create an equivalent ACPI table for the guest.

I know Jason had described this device as effectively pre-CXL to
justify the coherent memory mapping, but it seems like there's still a
gap here that we can't simply hand wave that this PCI BAR follows a
different set of semantics. We don't typically endorse complexity in
the kernel only for the purpose of avoiding work in userspace. The
absolute minimum should be some justification of the design decision
and analysis relative to standard PCI behavior. Thanks,

Alex

> >> +     ret = nvgrace_gpu_map_device_mem(nvdev, index);
> >> +     if (ret)
> >> +             goto read_exit;
> >
> > We don't need a goto to simply return an error.
>
> Yes, will fix that.
>
> >> +     if (index == USEMEM_REGION_INDEX) {
> >> +             if (copy_to_user(buf, (u8 *)nvdev->usemem.bar_remap.memaddr + offset, mem_count))
> >> +                     ret = -EFAULT;
> >> +     } else {
> >> +             return do_io_rw(&nvdev->core_device, false, nvdev->resmem.bar_remap.ioaddr,
> >> +                             (char __user *) buf, offset, mem_count, 0, 0, false);
> >
> > The vfio_device_ops.read prototype defines buf as a char __user*, so
> > maybe look at why it's being passed as a void __user* rather than
> > casting.
>
> True, will fix that.
>
> >> +             /* Check if the bars are disabled, allow access otherwise */
> >> +             down_read(&nvdev->core_device.memory_lock);
> >> +             if (nvdev->bars_disabled) {
> >> +                     up_read(&nvdev->core_device.memory_lock);
> >> +                     return -EIO;
> >> +             }
> >
> > Why do we need bars_disabled here, or at all? If we let do_io_rw()
> > test memory it would read the command register from vconfig and all of
> > this is redundant.
>
> Yes, and I will make use of the same flag to cover the
> USEMEM_REGION_INDEX cacheable device memory accesses.
>
> >> -static ssize_t do_io_rw(struct vfio_pci_core_device *vdev, bool test_mem,
> >> -                     void __iomem *io, char __user *buf,
> >> -                     loff_t off, size_t count, size_t x_start,
> >> -                     size_t x_end, bool iswrite)
> >> +ssize_t do_io_rw(struct vfio_pci_core_device *vdev, bool test_mem,
> >> +               void __iomem *io, char __user *buf,
> >> +               loff_t off, size_t count, size_t x_start,
> >> +               size_t x_end, bool iswrite)
> >
> > This should be exported in a separate patch and renamed to be
> > consistent with other vfio-pci-core functions.
>
> Sure, and will rename with vfio_pci_core prefix.
>
> >> @@ -199,6 +199,7 @@ static ssize_t do_io_rw(struct vfio_pci_core_device *vdev, bool test_mem,
> >>
> >>       return done;
> >> }
> >> +EXPORT_SYMBOL(do_io_rw);
> >
> > NAK, _GPL. Thanks,
>
> Yes, will make the change.
>
> >./scripts/checkpatch.pl --strict will give you some tips on how to
> improve the changes furthermore.
>
> Yes, will do that.
>

2024-01-03 16:58:06

by Jason Gunthorpe

[permalink] [raw]

Subject: Re: [PATCH v15 1/1] vfio/nvgrace-gpu: Add vfio pci variant module for grace hopper

On Tue, Jan 02, 2024 at 09:10:01AM -0700, Alex Williamson wrote:

> Yes, it's possible to add support that these ranges honor the memory
> enable bit, but it's not trivial and unfortunately even vfio-pci isn't
> a great example of this.

We talked about this already, the HW architects here confirm there is
no issue with reset and memory enable. You will get all 1's on read
and NOP on write. It doesn't need to implement VMA zap.

> around device reset or relative to the PCI command register. The
> variant driver becomes a trivial implementation that masks BARs 2 & 4
> and exposes the ACPI range as a device specific region with only mmap
> support. QEMU can then map the device specific region into VM memory
> and create an equivalent ACPI table for the guest.

Well, no, probably not. There is an NVIDIA specification for how the
vPCI function should be setup within the VM and it uses the BAR
method, not the ACPI.

There are a lot of VMMs and OSs this needs to support so it must all
be consistent. For better or worse the decision was taken for the vPCI
spec to use BAR not ACPI, in part due to feedback from the broader VMM
ecosystem, and informed by future product plans.

So, if vfio does special regions then qemu and everyone else has to
fix it to meet the spec.

> I know Jason had described this device as effectively pre-CXL to
> justify the coherent memory mapping, but it seems like there's still a
> gap here that we can't simply hand wave that this PCI BAR follows a
> different set of semantics.

I thought all the meaningful differences are fixed now?

The main remaining issue seems to be around the config space
emulation?

> We don't typically endorse complexity in the kernel only for the
> purpose of avoiding work in userspace. The absolute minimum should
> be some justification of the design decision and analysis relative
> to standard PCI behavior. Thanks,

If we strictly took that view in VFIO a lot of stuff wouldn't be here
:)

I've made this argument before and gave up - the ecosystem wants to
support multiple VMMs and the sanest path to do that is via VFIO
kernel drivers that plug into existing vfio-pci support in the VMM
ecosystem.

From a HW supplier perspective it is quite vexing to have to support
all these different (and often proprietary!) VMM implementations. It
is not just top of tree qemu.

If we instead did complex userspace drivers and userspace emulation of
config space and so on then things like the IDXD SIOV support would
look *very* different and not use VFIO at all. That would probably be
somewhat better for security, but I was convinced it is a long and
technically complex road.

At least with this approach the only VMM issue is the NUMA nodes, and
as we have discussed that hackery is to make up for current Linux
kernel SW limitations, not actually reflecting anything about the
HW. If some other OS or future Linux doesn't require the ACPI NUMA
nodes to create an OS visible NUMA object then the VMM will not
require any changes.

Jason

2024-01-03 18:00:38

by Alex Williamson

[permalink] [raw]

Subject: Re: [PATCH v15 1/1] vfio/nvgrace-gpu: Add vfio pci variant module for grace hopper

On Wed, 3 Jan 2024 12:57:27 -0400
Jason Gunthorpe <[email protected]> wrote:

> On Tue, Jan 02, 2024 at 09:10:01AM -0700, Alex Williamson wrote:
>
> > Yes, it's possible to add support that these ranges honor the memory
> > enable bit, but it's not trivial and unfortunately even vfio-pci isn't
> > a great example of this.
>
> We talked about this already, the HW architects here confirm there is
> no issue with reset and memory enable. You will get all 1's on read
> and NOP on write. It doesn't need to implement VMA zap.

We talked about reset, I don't recall that we discussed that coherent
and uncached memory ranges masquerading as PCI BARs here honor the
memory enable bit in the command register.

> > around device reset or relative to the PCI command register. The
> > variant driver becomes a trivial implementation that masks BARs 2 & 4
> > and exposes the ACPI range as a device specific region with only mmap
> > support. QEMU can then map the device specific region into VM memory
> > and create an equivalent ACPI table for the guest.
>
> Well, no, probably not. There is an NVIDIA specification for how the
> vPCI function should be setup within the VM and it uses the BAR
> method, not the ACPI.

Is this specification available? It's a shame we've gotten this far
without a reference to it.

> There are a lot of VMMs and OSs this needs to support so it must all
> be consistent. For better or worse the decision was taken for the vPCI
> spec to use BAR not ACPI, in part due to feedback from the broader VMM
> ecosystem, and informed by future product plans.
>
> So, if vfio does special regions then qemu and everyone else has to
> fix it to meet the spec.

Great, this is the sort of justification and transparency that had not
been previously provided. It is curious that only within the past
couple months the device ABI changed by adding the uncached BAR, so
this hasn't felt like a firm design. Also I believe it's been stated
that the driver supports both the bare metal representation of the
device and this model where the coherent memory is mapped as a BAR, so
I'm not sure what obstacles remain or how we're positioned for future
products if take the bare metal approach.

> > I know Jason had described this device as effectively pre-CXL to
> > justify the coherent memory mapping, but it seems like there's still a
> > gap here that we can't simply hand wave that this PCI BAR follows a
> > different set of semantics.
>
> I thought all the meaningful differences are fixed now?
>
> The main remaining issue seems to be around the config space
> emulation?

In the development of the virtio-vfio-pci variant driver we noted that
r/w access to the IO BAR didn't honor the IO bit in the command
register, which was quickly remedied and now returns -EIO if accessed
while disabled. We were already adding r/w support to the coherent BAR
at the time as vfio doesn't have a means to express a region as only
having mmap support and precedent exists that BAR regions must support
these accesses. So it was suggested that r/w accesses should also
honor the command register memory enable bit, but of course memory BARs
also support mmap, which snowballs into a much more complicated problem
than we have in the case of the virtio IO BAR.

So where do we go? Do we continue down the path of emulating full PCI
semantics relative to these emulated BARs? Does hardware take into
account the memory enable bit of the command register? Do we
re-evaluate the BAR model for a device specific region?

> > We don't typically endorse complexity in the kernel only for the
> > purpose of avoiding work in userspace. The absolute minimum should
> > be some justification of the design decision and analysis relative
> > to standard PCI behavior. Thanks,
>
> If we strictly took that view in VFIO a lot of stuff wouldn't be here
> :)
>
> I've made this argument before and gave up - the ecosystem wants to
> support multiple VMMs and the sanest path to do that is via VFIO
> kernel drivers that plug into existing vfio-pci support in the VMM
> ecosystem.
>
> From a HW supplier perspective it is quite vexing to have to support
> all these different (and often proprietary!) VMM implementations. It
> is not just top of tree qemu.
>
> If we instead did complex userspace drivers and userspace emulation of
> config space and so on then things like the IDXD SIOV support would
> look *very* different and not use VFIO at all. That would probably be
> somewhat better for security, but I was convinced it is a long and
> technically complex road.
>
> At least with this approach the only VMM issue is the NUMA nodes, and
> as we have discussed that hackery is to make up for current Linux
> kernel SW limitations, not actually reflecting anything about the
> HW. If some other OS or future Linux doesn't require the ACPI NUMA
> nodes to create an OS visible NUMA object then the VMM will not
> require any changes.

Yes, I'm satisfied with where we've landed for the NUMA nodes and
generic initiator object. It's an annoying constraint for management
tools but it's better than the original proposal where nodes
automatically popped into existence based on a vfio-pci device.

There's certainly a balancing game of complexity in the driver vs
deferring the work to userspace. From my perspective, I don't have a
good justification for why we're on the emulated BAR path when another
path looks a lot easier. With the apparent increasing complexity of
emulating the memory enable semantics, I felt we needed to get a better
story there and really look at whether those semantics are worthwhile,
or perhaps as alluded, HW takes this into account (though I'm not sure
how).

I'd suggest we take a look at whether we need to continue to pursue
honoring the memory enable bit for these BARs and make a conscious and
documented decision if we choose to ignore it. Ideally we could also
make this shared spec that we're implementing to available to the
community to justify the design decisions here. In the case of
GPUDirect Cliques we had permission to post the spec to the list so it
could be archived to provide a stable link for future reference.
Thanks,

Alex

2024-01-03 19:34:15

by Jason Gunthorpe

[permalink] [raw]

Subject: Re: [PATCH v15 1/1] vfio/nvgrace-gpu: Add vfio pci variant module for grace hopper

On Wed, Jan 03, 2024 at 11:00:16AM -0700, Alex Williamson wrote:
> On Wed, 3 Jan 2024 12:57:27 -0400
> Jason Gunthorpe <[email protected]> wrote:
>
> > On Tue, Jan 02, 2024 at 09:10:01AM -0700, Alex Williamson wrote:
> >
> > > Yes, it's possible to add support that these ranges honor the memory
> > > enable bit, but it's not trivial and unfortunately even vfio-pci isn't
> > > a great example of this.
> >
> > We talked about this already, the HW architects here confirm there is
> > no issue with reset and memory enable. You will get all 1's on read
> > and NOP on write. It doesn't need to implement VMA zap.
>
> We talked about reset, I don't recall that we discussed that coherent
> and uncached memory ranges masquerading as PCI BARs here honor the
> memory enable bit in the command register.

Why do it need to do anything special? If the VM read/writes from
memory that the master bit is disabled on it gets undefined
behavior. The system doesn't crash and it does something reasonable.

> > > around device reset or relative to the PCI command register. The
> > > variant driver becomes a trivial implementation that masks BARs 2 & 4
> > > and exposes the ACPI range as a device specific region with only mmap
> > > support. QEMU can then map the device specific region into VM memory
> > > and create an equivalent ACPI table for the guest.
> >
> > Well, no, probably not. There is an NVIDIA specification for how the
> > vPCI function should be setup within the VM and it uses the BAR
> > method, not the ACPI.
>
> Is this specification available? It's a shame we've gotten this far
> without a reference to it.

No, at this point it is internal short form only.

> > There are a lot of VMMs and OSs this needs to support so it must all
> > be consistent. For better or worse the decision was taken for the vPCI
> > spec to use BAR not ACPI, in part due to feedback from the broader VMM
> > ecosystem, and informed by future product plans.
> >
> > So, if vfio does special regions then qemu and everyone else has to
> > fix it to meet the spec.
>
> Great, this is the sort of justification and transparency that had not
> been previously provided. It is curious that only within the past
> couple months the device ABI changed by adding the uncached BAR, so
> this hasn't felt like a firm design.

That is to work around some unfortunate HW defect, and is connected to
another complex discussion about how ARM memory types need to
work. Originally people hoped this could simply work transparently but
it eventually became clear it was not possible for the VM to degrade
from cachable without VMM help. Thus a mess was born..

> Also I believe it's been stated that the driver supports both the
> bare metal representation of the device and this model where the
> coherent memory is mapped as a BAR, so I'm not sure what obstacles
> remain or how we're positioned for future products if take the bare
> metal approach.

It could work, but it is not really the direction that was decided on
for the vPCI functions.

> > I thought all the meaningful differences are fixed now?
> >
> > The main remaining issue seems to be around the config space
> > emulation?
>
> In the development of the virtio-vfio-pci variant driver we noted that
> r/w access to the IO BAR didn't honor the IO bit in the command
> register, which was quickly remedied and now returns -EIO if accessed
> while disabled. We were already adding r/w support to the coherent BAR
> at the time as vfio doesn't have a means to express a region as only
> having mmap support and precedent exists that BAR regions must support
> these accesses. So it was suggested that r/w accesses should also
> honor the command register memory enable bit, but of course memory BARs
> also support mmap, which snowballs into a much more complicated problem
> than we have in the case of the virtio IO BAR.

I think that has just become too pedantic, accessing the regions with
the enable bits turned off is broadly undefined behavior. So long as
the platform doesn't crash, I think it is fine to behave in a simple
way.

There is no use case for providing stronger emulation of this.

> So where do we go? Do we continue down the path of emulating full PCI
> semantics relative to these emulated BARs? Does hardware take into
> account the memory enable bit of the command register? Do we
> re-evaluate the BAR model for a device specific region?

It has to come out as a BAR in the VM side so all of this can't be
avoided. The simple answer is we don't need to care about enables
because there is no reason to care. VMs don't write to memory with the
enable turned off because on some platforms you will crash the system
if you do that.

> I'd suggest we take a look at whether we need to continue to pursue
> honoring the memory enable bit for these BARs and make a conscious and
> documented decision if we choose to ignore it.

Let's document it.

> Ideally we could also make this shared spec that we're implementing
> to available to the community to justify the design decisions here.
> In the case of GPUDirect Cliques we had permission to post the spec
> to the list so it could be archived to provide a stable link for
> future reference. Thanks,

Ideally. I'll see that people consider it at least.

Jason

2024-01-04 00:24:56

by Alex Williamson

[permalink] [raw]

Subject: Re: [PATCH v15 1/1] vfio/nvgrace-gpu: Add vfio pci variant module for grace hopper

On Wed, 3 Jan 2024 15:33:56 -0400
Jason Gunthorpe <[email protected]> wrote:

> On Wed, Jan 03, 2024 at 11:00:16AM -0700, Alex Williamson wrote:
> > On Wed, 3 Jan 2024 12:57:27 -0400
> > Jason Gunthorpe <[email protected]> wrote:
> >
> > > On Tue, Jan 02, 2024 at 09:10:01AM -0700, Alex Williamson wrote:
> > >
> > > > Yes, it's possible to add support that these ranges honor the memory
> > > > enable bit, but it's not trivial and unfortunately even vfio-pci isn't
> > > > a great example of this.
> > >
> > > We talked about this already, the HW architects here confirm there is
> > > no issue with reset and memory enable. You will get all 1's on read
> > > and NOP on write. It doesn't need to implement VMA zap.
> >
> > We talked about reset, I don't recall that we discussed that coherent
> > and uncached memory ranges masquerading as PCI BARs here honor the
> > memory enable bit in the command register.
>
> Why do it need to do anything special? If the VM read/writes from
> memory that the master bit is disabled on it gets undefined
> behavior. The system doesn't crash and it does something reasonable.

The behavior is actually defined (6.0.1 Table 7-4):

Memory Space Enable - Controls a Function's response to Memory
Space accesses. When this bit is Clear, all received Memory Space
accesses are caused to be handled as Unsupported Requests. When
this bit is Set, the Function is enabled to decode the address and
further process Memory Space accesses.

From there we get into system error handling decisions where some
platforms claim to protect data integrity by generating a fault before
allowing drivers to consume the UR response and others are more lenient.

The former platforms are the primary reason that we guard access to the
device when the memory enable bit is cleared. It seems we've also
already opened the door somewhat given our VF behavior, where we test
pci_dev.no_command_memory to allow access regardless of the state of
the emulated command register bit.

AIUI, the address space enable bits are primarily to prevent the device
from decoding accesses during BAR sizing operations or prior to BAR
programming. BAR sizing operations are purely emulated in vfio-pci and
unprogrammed BARs are ignored (ie. not exposed to userspace), so perhaps
as long as it can be guaranteed that an access with the address space
enable bit cleared cannot generate a system level fault, we actually
have no strong argument to strictly enforce the address space bits.

> > > > around device reset or relative to the PCI command register. The
> > > > variant driver becomes a trivial implementation that masks BARs 2 & 4
> > > > and exposes the ACPI range as a device specific region with only mmap
> > > > support. QEMU can then map the device specific region into VM memory
> > > > and create an equivalent ACPI table for the guest.
> > >
> > > Well, no, probably not. There is an NVIDIA specification for how the
> > > vPCI function should be setup within the VM and it uses the BAR
> > > method, not the ACPI.
> >
> > Is this specification available? It's a shame we've gotten this far
> > without a reference to it.
>
> No, at this point it is internal short form only.
>
> > > There are a lot of VMMs and OSs this needs to support so it must all
> > > be consistent. For better or worse the decision was taken for the vPCI
> > > spec to use BAR not ACPI, in part due to feedback from the broader VMM
> > > ecosystem, and informed by future product plans.
> > >
> > > So, if vfio does special regions then qemu and everyone else has to
> > > fix it to meet the spec.
> >
> > Great, this is the sort of justification and transparency that had not
> > been previously provided. It is curious that only within the past
> > couple months the device ABI changed by adding the uncached BAR, so
> > this hasn't felt like a firm design.
>
> That is to work around some unfortunate HW defect, and is connected to
> another complex discussion about how ARM memory types need to
> work. Originally people hoped this could simply work transparently but
> it eventually became clear it was not possible for the VM to degrade
> from cachable without VMM help. Thus a mess was born..
>
> > Also I believe it's been stated that the driver supports both the
> > bare metal representation of the device and this model where the
> > coherent memory is mapped as a BAR, so I'm not sure what obstacles
> > remain or how we're positioned for future products if take the bare
> > metal approach.
>
> It could work, but it is not really the direction that was decided on
> for the vPCI functions.
>
> > > I thought all the meaningful differences are fixed now?
> > >
> > > The main remaining issue seems to be around the config space
> > > emulation?
> >
> > In the development of the virtio-vfio-pci variant driver we noted that
> > r/w access to the IO BAR didn't honor the IO bit in the command
> > register, which was quickly remedied and now returns -EIO if accessed
> > while disabled. We were already adding r/w support to the coherent BAR
> > at the time as vfio doesn't have a means to express a region as only
> > having mmap support and precedent exists that BAR regions must support
> > these accesses. So it was suggested that r/w accesses should also
> > honor the command register memory enable bit, but of course memory BARs
> > also support mmap, which snowballs into a much more complicated problem
> > than we have in the case of the virtio IO BAR.
>
> I think that has just become too pedantic, accessing the regions with
> the enable bits turned off is broadly undefined behavior. So long as
> the platform doesn't crash, I think it is fine to behave in a simple
> way.
>
> There is no use case for providing stronger emulation of this.

As above, I think I can be convinced this is acceptable given that the
platform and device are essentially one in the same here with
understood lack of a system wide error response.

Now I'm wondering if we should do something different with
virtio-vfio-pci. As a VF, the memory space is effectively always
enabled, governed by the SR-IOV MSE bit on the PF which is assumed to
be static. It doesn't make a lot of sense to track the IO enable bit
for the emulated IO BAR when the memory BAR is always enabled. It's a
fairly trivial amount of code though, so it's not harmful either.

> > So where do we go? Do we continue down the path of emulating full PCI
> > semantics relative to these emulated BARs? Does hardware take into
> > account the memory enable bit of the command register? Do we
> > re-evaluate the BAR model for a device specific region?
>
> It has to come out as a BAR in the VM side so all of this can't be
> avoided. The simple answer is we don't need to care about enables
> because there is no reason to care. VMs don't write to memory with the
> enable turned off because on some platforms you will crash the system
> if you do that.
>
> > I'd suggest we take a look at whether we need to continue to pursue
> > honoring the memory enable bit for these BARs and make a conscious and
> > documented decision if we choose to ignore it.
>
> Let's document it.

Ok, I'd certainly appreciate more background around why it was chosen
to expose the device differently than bare metal in the commit log and
code comments and in particular why we're consciously choosing not to
honor these specific PCI semantics. Presumably we'd remove the -EIO
from the r/w path based on the memory enable state as well.

> > Ideally we could also make this shared spec that we're implementing
> > to available to the community to justify the design decisions here.
> > In the case of GPUDirect Cliques we had permission to post the spec
> > to the list so it could be archived to provide a stable link for
> > future reference. Thanks,
>
> Ideally. I'll see that people consider it at least.

Yep, I think it would have helped move this driver along if such a
document had been shared earlier for the sake of transparency rather
than leaving us to speculate on the motivation and design. Thanks,

Alex

2024-01-04 00:40:34

by Jason Gunthorpe

[permalink] [raw]

Subject: Re: [PATCH v15 1/1] vfio/nvgrace-gpu: Add vfio pci variant module for grace hopper

On Wed, Jan 03, 2024 at 05:24:26PM -0700, Alex Williamson wrote:
> > Why do it need to do anything special? If the VM read/writes from
> > memory that the master bit is disabled on it gets undefined
> > behavior. The system doesn't crash and it does something reasonable.
>
> The behavior is actually defined (6.0.1 Table 7-4):
>
> Memory Space Enable - Controls a Function's response to Memory
> Space accesses. When this bit is Clear, all received Memory Space
> accesses are caused to be handled as Unsupported Requests. When
> this bit is Set, the Function is enabled to decode the address and
> further process Memory Space accesses.
>
> From there we get into system error handling decisions where some
> platforms claim to protect data integrity by generating a fault before
> allowing drivers to consume the UR response and others are more lenient.

Sure PCIe defines more detail, but the actual behavior the SW
experiences when triggering this corner is effective undefined as
"machine crash" is something that actually happens.

> AIUI, the address space enable bits are primarily to prevent the device
> from decoding accesses during BAR sizing operations or prior to BAR
> programming.

Yes. It is not functionally relavent to devices like this that have a
fixed aperture, or to virtual devices that can't move the physical
aperture.

I think the layers have become confused a bit here. The vfio side
should entirely care about kernel self-protection from hostile
userspace, which is why we have to zap/etc.

However the VMM still controls the "address decoder" and if the memory
(or IO) enable is off then the VMM should already prevent the VM
address space from decoding into the VFIO regions at all. Ie it should
unmap it from KVM for mmapable regions, and stop matching the address
for emulated regions.

This is effectively necessary because the VM might choose to reprogram
the BAR registers and move the region, it can't do this atomically so
we have to fully ignore the BAR value when the decoders are disabled.

IOW the corner case of the memory enable disable and the VM touching
the memory is not something the kernel VFIO should be emulating, and
indeed, I think there is probably no reason to allow the VM to
manipulate the physical control either..

> unprogrammed BARs are ignored (ie. not exposed to userspace), so perhaps
> as long as it can be guaranteed that an access with the address space
> enable bit cleared cannot generate a system level fault, we actually
> have no strong argument to strictly enforce the address space bits.

This is what I think, yes.

> > I think that has just become too pedantic, accessing the regions with
> > the enable bits turned off is broadly undefined behavior. So long as
> > the platform doesn't crash, I think it is fine to behave in a simple
> > way.
> >
> > There is no use case for providing stronger emulation of this.
>
> As above, I think I can be convinced this is acceptable given that the
> platform and device are essentially one in the same here with
> understood lack of a system wide error response.

Right

> Now I'm wondering if we should do something different with
> virtio-vfio-pci. As a VF, the memory space is effectively always
> enabled, governed by the SR-IOV MSE bit on the PF which is assumed to
> be static. It doesn't make a lot of sense to track the IO enable bit
> for the emulated IO BAR when the memory BAR is always enabled. It's a
> fairly trivial amount of code though, so it's not harmful either.

As above, it was probably unneeded to put this into VFIO kernel side,
I don't think there is a functional harm to allow it.

Jason

2024-01-09 11:42:56

by Ankit Agrawal

[permalink] [raw]

Subject: Re: [PATCH v15 1/1] vfio/nvgrace-gpu: Add vfio pci variant module for grace hopper

I am working to get the code changes suggested in this thread to get
to the next version.

Moreover based on the other design related comments for BAR approach
and support for the enable/disable bit, comments follow.

> There are a lot of VMMs and OSs this needs to support so it must all
> be consistent. For better or worse the decision was taken for the vPCI
> spec to use BAR not ACPI, in part due to feedback from the broader VMM
> ecosystem, and informed by future product plans.
>
> So, if vfio does special regions then qemu and everyone else has to
> fix it to meet the spec.

Ack, I'm putting this into the commit log as a reason to go with the BAR
approach for future reference.

>>> I'd suggest we take a look at whether we need to continue to pursue
>>> honoring the memory enable bit for these BARs and make a conscious and
>>> documented decision if we choose to ignore it.
>>
>> Let's document it.
> Ok, I'd certainly appreciate more background around why it was chosen
> to expose the device differently than bare metal in the commit log and
> code comments and in particular why we're consciously choosing not to
> honor these specific PCI semantics. Presumably we'd remove the -EIO
> from the r/w path based on the memory enable state as well.

Ack, will add to the commit log and remove EIO.

>> unprogrammed BARs are ignored (ie. not exposed to userspace), so perhaps
>> as long as it can be guaranteed that an access with the address space
>> enable bit cleared cannot generate a system level fault, we actually
>> have no strong argument to strictly enforce the address space bits.
>
> This is what I think, yes.

Ack, so given the access does not cause system level fault, I'll not enforce
the enable/disable bit. Though I'll put this information in the commit log
for reference as well and the reason for this choice.