Subject: [PATCH 0/9] RFC: NVME VFIO mediated device
From: Maxim Levitsky
To: linux-nvme@lists.infradead.org
Cc: linux-kernel@vger.kernel.org, kvm@vger.kernel.org, Jens Axboe,
    Alex Williamson, Keith Busch, Christoph Hellwig, Sagi Grimberg,
    Kirti Wankhede, "David S. Miller", Mauro Carvalho Chehab,
    Greg Kroah-Hartman, Wolfram Sang, Nicolas Ferre, "Paul E. McKenney",
    Paolo Bonzini, Liang Cunming, Liu Changpeng, Fam Zheng, Amnon Ilan,
    John Ferlan
Date: Tue, 19 Mar 2019 16:58:16 +0200
In-Reply-To: <20190319144116.400-1-mlevitsk@redhat.com>
References: <20190319144116.400-1-mlevitsk@redhat.com>

Oops, I placed the subject in the wrong place.

Best regards,
    Maxim Levitsky

On Tue, 2019-03-19 at 16:41 +0200, Maxim Levitsky wrote:
> Date: Tue, 19 Mar 2019 14:45:45 +0200
> Subject: [PATCH 0/9] RFC: NVME VFIO mediated device
>
> Hi everyone!
>
> In this patch series I would like to introduce my take on the problem of
> virtualizing storage as fast as possible, with an emphasis on low latency.
>
> The series implements a kernel, VFIO based, mediated device that allows
> the user to pass through a partition and/or a whole namespace to a guest.
>
> The idea behind this driver is based on the paper you can find at
> https://www.usenix.org/conference/atc18/presentation/peng,
> although note that I started the development independently, before
> reading that paper.
>
> The implementation is also not based on the code used in the paper,
> as I was not able to obtain that source code at the time.
>
> *** Key points about the implementation ***
>
> * A polling kernel thread is used. Polling is stopped after a
>   predefined timeout (1/2 sec by default).
>   Support for a fully interrupt driven mode is planned, and a proof of
>   concept already shows promising results. (A rough sketch of the
>   polling loop follows this list.)
>
> * The guest sees a standard NVMe device - this allows running guests
>   with unmodified drivers, for example Windows guests.
>
> * The NVMe device is shared between host and guest.
>   That means that even a single namespace can be split between host
>   and guest based on different partitions.
>
> * Simple configuration
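>
> As a rough illustration of the polling approach (this is not code from
> the patches; nvme_mdev_vctrl, process_queues() and wait_for_doorbell()
> are made-up placeholders), the IO thread could look like this:
>
> /* Illustrative sketch only - not the code from the patches. */
> #include <linux/kthread.h>
> #include <linux/jiffies.h>
> #include <linux/sched.h>
> #include <linux/types.h>
>
> struct nvme_mdev_vctrl;                                /* placeholder type */
> bool process_queues(struct nvme_mdev_vctrl *vctrl);    /* placeholder: poll the vSQs/vCQs */
> void wait_for_doorbell(struct nvme_mdev_vctrl *vctrl); /* placeholder: sleep until guest IO */
>
> static int mdev_poll_thread(void *data)
> {
> 	struct nvme_mdev_vctrl *vctrl = data;
> 	unsigned long idle_timeout = msecs_to_jiffies(500); /* 1/2 sec default */
> 	unsigned long last_io = jiffies;
>
> 	while (!kthread_should_stop()) {
> 		/* poll the virtual queues; remember when we last saw IO */
> 		if (process_queues(vctrl))
> 			last_io = jiffies;
>
> 		if (time_after(jiffies, last_io + idle_timeout)) {
> 			/* idle for too long: stop burning a CPU and sleep
> 			 * until the guest writes a doorbell again */
> 			wait_for_doorbell(vctrl);
> 			last_io = jiffies;
> 		} else {
> 			cond_resched();
> 		}
> 	}
> 	return 0;
> }
>
> The thread would be created with kthread_create(), bound to the CPU
> selected through the iothread_cpu attribute (see the configuration
> example below), and only then woken up.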
>
> *** Performance ***
>
> Performance was tested on an Intel DC P3700 with a Xeon E5-2620 v2,
> and both latency and throughput are very similar to SPDK.
>
> Soon I will test this on a better server and NVMe device and provide
> more formal performance numbers.
>
> Latency numbers:
> ~80ms - spdk with fio plugin on the host
> ~84ms - nvme driver on the host
> ~87ms - mdev-nvme + nvme driver in the guest
>
> Throughput followed a similar pattern.
>
> *** Configuration example ***
>
> $ modprobe nvme mdev_queues=4
> $ modprobe nvme-mdev
>
> $ UUID=$(uuidgen)
> $ DEVICE='device pci address'
> $ echo $UUID > /sys/bus/pci/devices/$DEVICE/mdev_supported_types/nvme-2Q_V1/create
> $ echo n1p3 > /sys/bus/mdev/devices/$UUID/namespaces/add_namespace # attach host namespace 1, partition 3
> $ echo 11 > /sys/bus/mdev/devices/$UUID/settings/iothread_cpu      # pin the io thread to cpu 11
>
> Afterwards, boot qemu with
> -device vfio-pci,sysfsdev=/sys/bus/mdev/devices/$UUID
>
> Zero configuration is needed on the guest.
>
> *** FAQ ***
>
> * Why do this in the kernel? Why is this better than SPDK?
>
> -> It reuses the existing nvme kernel driver in the host. No new drivers
>    in the guest.
>
> -> It shares the NVMe device between host and guest.
>    Even in fully virtualized configurations, some partitions of an nvme
>    device could be used by guests as block devices while others are
>    passed through with nvme-mdev, to strike a balance between full IO
>    stack emulation and performance.
>
> -> NVME-MDEV is a bit faster, because the in-kernel driver can send
>    interrupts to the guest directly, without a context switch that can
>    be expensive due to the meltdown mitigation.
>
> -> It is able to utilize interrupts to get reasonable performance.
>    This is only implemented as a proof of concept and not included in
>    the patches, but the interrupt driven mode shows reasonable performance.
>
> -> This is a framework that can later be used to support NVMe devices
>    with more of the IO virtualization built in
>    (an IOMMU with PASID support, coupled with a device that supports it).
>
> * Why attach directly to the nvme-pci driver and not use block layer IO?
>
> -> The direct attachment allows for better performance, but I will check
>    the possibility of using block IO, especially for the fabrics drivers.
>
> *** Implementation notes ***
>
> * All guest memory is mapped into the physical nvme device, but not 1:1
>   as vfio-pci would do it. This allows very efficient DMA.
>   To support this, patch 2 adds the ability for a mdev device to listen
>   to the guest's memory map events. Any such memory is immediately pinned
>   and then DMA mapped (a rough sketch of this follows after these notes).
>   (Support for fabric drivers, where this is not possible, exists too;
>   in that case the fabric driver does its own DMA mapping.)
>
> * The nvme core driver is modified to announce the appearance and
>   disappearance of nvme controllers and namespaces, to which the
>   nvme-mdev driver subscribes.
>
> * The nvme-pci driver is modified to expose a raw interface for attaching
>   to, submitting to and polling the IO queues. This allows the mdev driver
>   to submit and poll IO very efficiently. By default one host queue is
>   used per mediated device.
>   (Support for other fabric based host drivers is planned.)
>
> * nvme-mdev doesn't assume the presence of KVM, so any VFIO user,
>   including SPDK or a qemu running with tcg, can use this virtual device.
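>
> As a rough illustration of the first note above (not the actual patch
> code; the function name is a placeholder and the real hook is whatever
> patch 2 defines), handling a map event boils down to pinning the guest
> pages through the existing vfio_pin_pages() helper and DMA mapping them
> for the physical controller:
>
> #include <linux/vfio.h>
> #include <linux/mdev.h>
> #include <linux/iommu.h>
> #include <linux/dma-mapping.h>
> #include <linux/mm.h>
>
> /* Illustrative sketch only; a real driver would pin in batches and
>  * keep an iova -> dma lookup table for translating guest PRPs. */
> static int nvme_mdev_map_guest_range(struct mdev_device *mdev,
> 				     struct device *hw_dev,
> 				     unsigned long iova,
> 				     unsigned long npages)
> {
> 	unsigned long gfn = iova >> PAGE_SHIFT;
> 	unsigned long hpfn;
> 	dma_addr_t dma;
> 	unsigned long i;
> 	int ret;
>
> 	for (i = 0; i < npages; i++, gfn++) {
> 		/* pin one guest page; vfio pins the backing user page */
> 		ret = vfio_pin_pages(mdev_dev(mdev), &gfn, 1,
> 				     IOMMU_READ | IOMMU_WRITE, &hpfn);
> 		if (ret != 1)
> 			return ret < 0 ? ret : -EFAULT;
>
> 		/* map the pinned page for DMA by the physical controller */
> 		dma = dma_map_page(hw_dev, pfn_to_page(hpfn), 0,
> 				   PAGE_SIZE, DMA_BIDIRECTIONAL);
> 		if (dma_mapping_error(hw_dev, dma))
> 			return -ENOMEM;
>
> 		/* record the iova -> dma translation here (omitted) */
> 	}
> 	return 0;
> }
>
> On the corresponding unmap event the driver would dma_unmap_page() and
> vfio_unpin_pages() the same range.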
>
> *** Testing ***
>
> The device was tested with a stock QEMU 3.0 on the host, with the host
> running a 5.0 kernel with nvme-mdev added, and the following hardware:
>  * QEMU nvme virtual device (with a nested guest)
>  * Intel DC P3700 on a Xeon E5-2620 v2 server
>  * Samsung SM981 (in a Thunderbolt enclosure, with my laptop)
>  * the Lenovo NVMe device found in my laptop
>
> The guest was tested with kernels 4.16, 4.18, 4.20 and the same custom
> compiled 5.0 kernel. A Windows 10 guest was tested too, with both
> Microsoft's inbox driver and the open source community NVMe driver
> (https://lists.openfabrics.org/pipermail/nvmewin/2016-December/001420.html).
>
> Testing was mostly done on x86_64, but a 32 bit host/guest combination
> was lightly tested too.
>
> In addition to that, the virtual device was tested with a nested guest,
> by passing the virtual device to it using pci passthrough, the qemu
> userspace nvme driver, and spdk.
>
> PS: I used to contribute to the kernel as a hobby using the
> maximlevitsky@gmail.com address
>
> Maxim Levitsky (9):
>   vfio/mdev: add .request callback
>   nvme/core: add some more values from the spec
>   nvme/core: add NVME_CTRL_SUSPENDED controller state
>   nvme/pci: use the NVME_CTRL_SUSPENDED state
>   nvme/pci: add known admin effects to augument admin effects log page
>   nvme/pci: init shadow doorbell after each reset
>   nvme/core: add mdev interfaces
>   nvme/core: add nvme-mdev core driver
>   nvme/pci: implement the mdev external queue allocation interface
>
>  MAINTAINERS                   |   5 +
>  drivers/nvme/Kconfig          |   1 +
>  drivers/nvme/Makefile         |   1 +
>  drivers/nvme/host/core.c      | 149 +++++-
>  drivers/nvme/host/nvme.h      |  55 ++-
>  drivers/nvme/host/pci.c       | 385 ++++++++++++++-
>  drivers/nvme/mdev/Kconfig     |  16 +
>  drivers/nvme/mdev/Makefile    |   5 +
>  drivers/nvme/mdev/adm.c       | 873 ++++++++++++++++++++++++++++++++++
>  drivers/nvme/mdev/events.c    | 142 ++++++
>  drivers/nvme/mdev/host.c      | 491 +++++++++++++++++++
>  drivers/nvme/mdev/instance.c  | 802 +++++++++++++++++++++++++++++++
>  drivers/nvme/mdev/io.c        | 563 ++++++++++++++++++++++
>  drivers/nvme/mdev/irq.c       | 264 ++++++++++
>  drivers/nvme/mdev/mdev.h      |  56 +++
>  drivers/nvme/mdev/mmio.c      | 591 +++++++++++++++++++++++
>  drivers/nvme/mdev/pci.c       | 247 ++++++++++
>  drivers/nvme/mdev/priv.h      | 700 +++++++++++++++++++++++++++
>  drivers/nvme/mdev/udata.c     | 390 +++++++++++++++
>  drivers/nvme/mdev/vcq.c       | 207 ++++++++
>  drivers/nvme/mdev/vctrl.c     | 514 ++++++++++++++++++++
>  drivers/nvme/mdev/viommu.c    | 322 +++++++++++++
>  drivers/nvme/mdev/vns.c       | 356 ++++++++++++++
>  drivers/nvme/mdev/vsq.c       | 178 +++++++
>  drivers/vfio/mdev/vfio_mdev.c |  11 +
>  include/linux/mdev.h          |   4 +
>  include/linux/nvme.h          |  88 +++-
>  27 files changed, 7375 insertions(+), 41 deletions(-)
>  create mode 100644 drivers/nvme/mdev/Kconfig
>  create mode 100644 drivers/nvme/mdev/Makefile
>  create mode 100644 drivers/nvme/mdev/adm.c
>  create mode 100644 drivers/nvme/mdev/events.c
>  create mode 100644 drivers/nvme/mdev/host.c
>  create mode 100644 drivers/nvme/mdev/instance.c
>  create mode 100644 drivers/nvme/mdev/io.c
>  create mode 100644 drivers/nvme/mdev/irq.c
>  create mode 100644 drivers/nvme/mdev/mdev.h
>  create mode 100644 drivers/nvme/mdev/mmio.c
>  create mode 100644 drivers/nvme/mdev/pci.c
>  create mode 100644 drivers/nvme/mdev/priv.h
>  create mode 100644 drivers/nvme/mdev/udata.c
>  create mode 100644 drivers/nvme/mdev/vcq.c
>  create mode 100644 drivers/nvme/mdev/vctrl.c
>  create mode 100644 drivers/nvme/mdev/viommu.c
>  create mode 100644 drivers/nvme/mdev/vns.c
>  create mode 100644 drivers/nvme/mdev/vsq.c
>