Received: by 2002:a05:6358:d09b:b0:dc:cd0c:909e with SMTP id jc27csp6996373rwb; Mon, 5 Dec 2022 22:47:32 -0800 (PST) X-Google-Smtp-Source: AA0mqf5EfP7KhyqeAOVQW+GQsi0AqVt+1Qz8XFDWbitBdGcZjQeiWoB9SSHxyZFGFie+5xY6ICXb X-Received: by 2002:a17:903:22cc:b0:17e:539:c405 with SMTP id y12-20020a17090322cc00b0017e0539c405mr67992875plg.53.1670309251832; Mon, 05 Dec 2022 22:47:31 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1670309251; cv=none; d=google.com; s=arc-20160816; b=Om/kPXZnlmSf9oRkvMGC9O9i75kPjNSlpeNQRElVZKpCSk7bbmx91lXwIfZXThTdLq fnieFD431v1As3KQ5PrX9sAnXkfNuq4X9/Ei6Pklr/RktVqmBymVHHmeAMali1bcbfCm Gj8BrarpfLtc2TNPH4tpDFn8N44399RiX6c7+LWAE/+Q7QLpsKnQA5F0PZQ7Ey4FlH+I etxvePqpjgOwUYlhJ7pcw0qFtsufqEpM2IZzxJaAothOaGMGT14Ka0JB7swegMO+2AkB awMv2/yy/3EzpTKb76/SAkVdjGvAJB33qm4u7vL3h90snDDy2ibbfZGF3dVXh3JCg36m Ivkg== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:content-transfer-encoding:mime-version :references:in-reply-to:message-id:date:subject:cc:to:from :dkim-signature; bh=MJd2bU2D+lm0WG5JeREqwz/gTrRZIL3tJNevpv7COSI=; b=zxXlsxuiJvAxzYrG8HM41c/8ncCp5cYDbcvwrkxZUwDrut5Fx0hEM9HNZZ2u2wHoM1 EJ8gNKZgKzZ+O90GtPGNjPI5KgmbN8mR8W2pLRmrgyhfCtg/4lM9WoK8ewUheCpHSq7P 5tSrnGuYuN83f3a1DHZ2PiZO4tiij+kY8o4O0RryPjpQ0/ZiiqKDOybU8RRh4G4ocqtj hmxNkqGzCeOmtOsw4qQ9EKhZf6FNGY8prFSYFiQYT8cGnJBlqSjvzQmN5Frjv0NONOiJ SztR13hGtLVDJg4kw3ioDOIVt5Uj9ph1n7ub4YNJlu55psLOKtUxWXAZg0qztEtY6Pm4 Q6hA== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@intel.com header.s=Intel header.b=LYfMyyXZ; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=intel.com Return-Path: Received: from out1.vger.email (out1.vger.email. [2620:137:e000::1:20]) by mx.google.com with ESMTP id s1-20020a632c01000000b0046eecbac47esi17076628pgs.28.2022.12.05.22.47.21; Mon, 05 Dec 2022 22:47:31 -0800 (PST) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20; Authentication-Results: mx.google.com; dkim=pass header.i=@intel.com header.s=Intel header.b=LYfMyyXZ; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=intel.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S233895AbiLFF7o (ORCPT + 79 others); Tue, 6 Dec 2022 00:59:44 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:43246 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S233770AbiLFF7S (ORCPT ); Tue, 6 Dec 2022 00:59:18 -0500 Received: from mga06.intel.com (mga06b.intel.com [134.134.136.31]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 2FAFD2793F; Mon, 5 Dec 2022 21:59:09 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1670306350; x=1701842350; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=VBnJVd+Hc6al1XaW5PlQfaxbMT88bR8HJInkPCC3JHA=; b=LYfMyyXZNP80nirOweJ4PHTgk8NpF29lRc3gTUODDzFtFu7qo0NlSPg+ 61ZNPN1GZJnq7PwTkroyguMtBYbnRQKyehyCTdE1rk4RN953Ln98trDXG CX5YcYy09JiANM2gBJbGDd3RzNNHymFC2a83OcELW+Ni7em+Tzsd/zk9A YV4JrB3KVFowolplSH1d/BAnsBG/XS7hBzIysTOKtGgtKkbIGW7rKwDpY uUhd2cjpdKKPVhvkCJ11wdfYfDe/8D07nKnC9BMDN3YjiVcwRBNclWfYC TOHA2ah0j1h1JEPAk2vHVBovCwcewu1vnyzxb679PrtGTLJR17UoQyE+X A==; X-IronPort-AV: E=McAfee;i="6500,9779,10552"; a="378706773" X-IronPort-AV: E=Sophos;i="5.96,220,1665471600"; d="scan'208";a="378706773" Received: from fmsmga007.fm.intel.com ([10.253.24.52]) by orsmga104.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 05 Dec 2022 21:59:00 -0800 X-ExtLoop1: 1 X-IronPort-AV: E=McAfee;i="6500,9779,10552"; a="648211414" X-IronPort-AV: E=Sophos;i="5.96,220,1665471600"; d="scan'208";a="648211414" Received: from leirao-pc.bj.intel.com ([10.238.156.101]) by fmsmga007.fm.intel.com with ESMTP; 05 Dec 2022 21:58:54 -0800 From: Lei Rao To: kbusch@kernel.org, axboe@fb.com, kch@nvidia.com, hch@lst.de, sagi@grimberg.me, alex.williamson@redhat.com, cohuck@redhat.com, jgg@ziepe.ca, yishaih@nvidia.com, shameerali.kolothum.thodi@huawei.com, kevin.tian@intel.com, mjrosato@linux.ibm.com, linux-kernel@vger.kernel.org, linux-nvme@lists.infradead.org, kvm@vger.kernel.org Cc: eddie.dong@intel.com, yadong.li@intel.com, yi.l.liu@intel.com, Konrad.wilk@oracle.com, stephen@eideticom.com, hang.yuan@intel.com, Lei Rao Subject: [RFC PATCH 5/5] nvme-vfio: Add a document for the NVMe device Date: Tue, 6 Dec 2022 13:58:16 +0800 Message-Id: <20221206055816.292304-6-lei.rao@intel.com> X-Mailer: git-send-email 2.34.1 In-Reply-To: <20221206055816.292304-1-lei.rao@intel.com> References: <20221206055816.292304-1-lei.rao@intel.com> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit X-Spam-Status: No, score=-4.4 required=5.0 tests=BAYES_00,DKIMWL_WL_HIGH, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,RCVD_IN_DNSWL_MED, SPF_HELO_NONE,SPF_NONE autolearn=ham autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org The documentation describes the details of the NVMe hardware extension to support VFIO live migration. Signed-off-by: Lei Rao Signed-off-by: Yadong Li Signed-off-by: Chaitanya Kulkarni Reviewed-by: Eddie Dong Reviewed-by: Hang Yuan --- drivers/vfio/pci/nvme/nvme.txt | 278 +++++++++++++++++++++++++++++++++ 1 file changed, 278 insertions(+) create mode 100644 drivers/vfio/pci/nvme/nvme.txt diff --git a/drivers/vfio/pci/nvme/nvme.txt b/drivers/vfio/pci/nvme/nvme.txt new file mode 100644 index 000000000000..eadcf2082eed --- /dev/null +++ b/drivers/vfio/pci/nvme/nvme.txt @@ -0,0 +1,278 @@ +=========================== +NVMe Live Migration Support +=========================== + +Introduction +------------ +To support live migration, NVMe device designs its own implementation, +including five new specific admin commands and a capability flag in +the vendor-specific field in the identify controller data structure to +support VF's live migration usage. Software can use these live migration +admin commands to get device migration state data size, save and load the +data, suspend and resume the given VF device. They are submitted by software +to the NVMe PF device's admin queue and ignored if placed in the VF device's +admin queue. This is due to the NVMe VF device being passed to the virtual +machine in the virtualization scenario. So VF device's admin queue is not +available for the hypervisor to submit VF device live migration commands. +The capability flag in the identify controller data structure can be used by +software to detect if the NVMe device supports live migration. The following +chapters introduce the detailed format of the commands and the capability flag. + +Definition of opcode for live migration commands +------------------------------------------------ + ++---------------------------+-----------+-----------+------------+ +| | | | | +| Opcode by Field | | | | +| | | | | ++--------+---------+--------+ | | | +| | | | Combined | Namespace | | +| 07 | 06:02 | 01:00 | Opcode | Identifier| Command | +| | | | | used | | ++--------+---------+--------+ | | | +|Generic | Function| Data | | | | +|command | |Transfer| | | | ++--------+---------+--------+-----------+-----------+------------+ +| | +| Vendor SpecificOpcode | ++--------+---------+--------+-----------+-----------+------------+ +| | | | | | Query the | +| 1b | 10001 | 00 | 0xC4 | | data size | ++--------+---------+--------+-----------+-----------+------------+ +| | | | | | Suspend the| +| 1b | 10010 | 00 | 0xC8 | | VF | ++--------+---------+--------+-----------+-----------+------------+ +| | | | | | Resume the | +| 1b | 10011 | 00 | 0xCC | | VF | ++--------+---------+--------+-----------+-----------+------------+ +| | | | | | Save the | +| 1b | 10100 | 10 | 0xD2 | |device data | ++--------+---------+--------+-----------+-----------+------------+ +| | | | | | Load the | +| 1b | 10101 | 01 | 0xD5 | |device data | ++--------+---------+--------+-----------+-----------+------------+ + +Definition of QUERY_DATA_SIZE command +------------------------------------- + ++---------+------------------------------------------------------------------------------------+ +| | | +| Bytes | Description | +| | | ++---------+------------------------------------------------------------------------------------+ +| | | +| | | +| | +-----------+--------------------------------------------------------------------+ | +| | | Bits |Description | | +| | +-----------+--------------------------------------------------------------------+ | +| | | 07:00 |Opcode(OPC):set to 0xC4 to indicate a qeury command | | +| | +-----------+--------------------------------------------------------------------+ | +| | | 09:08 |Fused Operation(FUSE):Please see NVMe SPEC for more details[1] | | +| | +-----------+--------------------------------------------------------------------+ | +| 03:00 | | 13:10 |Reserved | | +| | +-----------+--------------------------------------------------------------------+ | +| | | 15:14 |PRP or SGL for Data Transfer(PSDT): See NVMe SPEC for details[1] | | +| | +-----------+--------------------------------------------------------------------+ | +| | | 31:16 |Command Identifier(CID) | | +| | +-----------+--------------------------------------------------------------------+ | +| | | +| | | ++---------+------------------------------------------------------------------------------------+ +| 39:04 | Reserved | ++---------+------------------------------------------------------------------------------------+ +| 41:40 | VF index: means which VF controller internal data size to query | ++---------+------------------------------------------------------------------------------------+ +| 63:42 | Reserved | ++---------+------------------------------------------------------------------------------------+ + +The QUERY_DATA_SIZE command is used to query the NVMe VF internal data size for live migration. +When the NVMe firmware receives the command, it will return the size of NVMe VF internal +data. The data size depends on how many IO queues are created. + +Definition of SUSPEND command +----------------------------- + ++---------+------------------------------------------------------------------------------------+ +| | | +| Bytes | Description | +| | | ++---------+------------------------------------------------------------------------------------+ +| | | +| | | +| | +-----------+--------------------------------------------------------------------+ | +| | | Bits |Description | | +| | +-----------+--------------------------------------------------------------------+ | +| | | 07:00 |Opcode(OPC):set to 0xC8 to indicate a suspend command | | +| | +-----------+--------------------------------------------------------------------+ | +| | | 09:08 |Fused Operation(FUSE):Please see NVMe specification for details[1] | | +| | +-----------+--------------------------------------------------------------------+ | +| 03:00 | | 13:10 |Reserved | | +| | +-----------+--------------------------------------------------------------------+ | +| | | 15:14 |PRP or SGL for Data Transfer(PSDT):See NVMe SPEC for details[1] | | +| | +-----------+--------------------------------------------------------------------+ | +| | | 31:16 |Command Identifier(CID) | | +| | +-----------+--------------------------------------------------------------------+ | +| | | +| | | ++---------+------------------------------------------------------------------------------------+ +| 39:04 | Reserved | ++---------+------------------------------------------------------------------------------------+ +| 41:40 | VF index: means which VF controller to suspend | ++---------+------------------------------------------------------------------------------------+ +| 63:42 | Reserved | ++---------+------------------------------------------------------------------------------------+ + +The SUSPEND command is used to suspend the NVMe VF controller. When the NVMe firmware receives +this command, it will suspend the NVMe VF controller. + +Definition of RESUME command +---------------------------- + ++---------+------------------------------------------------------------------------------------+ +| | | +| Bytes | Description | +| | | ++---------+------------------------------------------------------------------------------------+ +| | | +| | | +| | +-----------+--------------------------------------------------------------------+ | +| | | Bits |Description | | +| | +-----------+--------------------------------------------------------------------+ | +| | | 07:00 |Opcode(OPC):set to 0xCC to indicate a resume command | | +| | +-----------+--------------------------------------------------------------------+ | +| | | 09:08 |Fused Operation(FUSE):Please see NVMe SPEC for details[1] | | +| | +-----------+--------------------------------------------------------------------+ | +| 03:00 | | 13:10 |Reserved | | +| | +-----------+--------------------------------------------------------------------+ | +| | | 15:14 |PRP or SGL for Data Transfer(PSDT):See NVMe SPEC for details[1] | | +| | +-----------+--------------------------------------------------------------------+ | +| | | 31:16 |Command Identifier(CID) | | +| | +-----------+--------------------------------------------------------------------+ | +| | | +| | | ++---------+------------------------------------------------------------------------------------+ +| 39:04 | Reserved | ++---------+------------------------------------------------------------------------------------+ +| 41:40 | VF index: means which VF controller to resume | ++---------+------------------------------------------------------------------------------------+ +| 63:42 | Reserved | ++---------+------------------------------------------------------------------------------------+ + +The RESUME command is used to resume the NVMe VF controller. When firmware receives this command, +it will restart the NVMe VF controller. + +Definition of SAVE_DATA command +-------------------------- + ++---------+------------------------------------------------------------------------------------+ +| | | +| Bytes | Description | +| | | ++---------+------------------------------------------------------------------------------------+ +| | | +| | | +| | +-----------+--------------------------------------------------------------------+ | +| | | Bits |Description | | +| | +-----------+--------------------------------------------------------------------+ | +| | | 07:00 |Opcode(OPC):set to 0xD2 to indicate a save command | | +| | +-----------+--------------------------------------------------------------------+ | +| | | 09:08 |Fused Operation(FUSE):Please see NVMe SPEC for details[1] | | +| | +-----------+--------------------------------------------------------------------+ | +| 03:00 | | 13:10 |Reserved | | +| | +-----------+--------------------------------------------------------------------+ | +| | | 15:14 |PRP or SGL for Data Transfer(PSDT):See NVMe SPEC for details[1] | | +| | +-----------+--------------------------------------------------------------------+ | +| | | 31:16 |Command Identifier(CID) | | +| | +-----------+--------------------------------------------------------------------+ | +| | | +| | | ++---------+------------------------------------------------------------------------------------+ +| 23:04 | Reserved | ++---------+------------------------------------------------------------------------------------+ +| 31:24 | PRP Entry1:the first PRP entry for the commmand or a PRP List Pointer | ++---------+------------------------------------------------------------------------------------+ +| 39:32 | PRP Entry2:the second address entry(reserved,page base address or PRP List Pointer)| ++---------+------------------------------------------------------------------------------------+ +| 41:40 | VF index: means which VF controller internal data to save | ++---------+------------------------------------------------------------------------------------+ +| 63:42 | Reserved | ++---------+------------------------------------------------------------------------------------+ + +The SAVE_DATA command is used to save the NVMe VF internal data for live migration. When firmware +receives this command, it will save the admin queue states, save some registers, drain IO SQs +and CQs, save every IO queue state, disable the VF controller, and transfer all data to the +host memory through DMA. + +Definition of LOAD_DATA command +-------------------------- + ++---------+------------------------------------------------------------------------------------+ +| | | +| Bytes | Description | +| | | ++---------+------------------------------------------------------------------------------------+ +| | | +| | | +| | +-----------+--------------------------------------------------------------------+ | +| | | Bits |Description | | +| | +-----------+--------------------------------------------------------------------+ | +| | | 07:00 |Opcode(OPC):set to 0xD5 to indicate a load command | | +| | +-----------+--------------------------------------------------------------------+ | +| | | 09:08 |Fused Operation(FUSE):Please see NVMe SPEC for details[1] | | +| | +-----------+--------------------------------------------------------------------+ | +| 03:00 | | 13:10 |Reserved | | +| | +-----------+--------------------------------------------------------------------+ | +| | | 15:14 |PRP or SGL for Data Transfer(PSDT): See NVMe SPEC for details[1] | | +| | +-----------+--------------------------------------------------------------------+ | +| | | 31:16 |Command Identifier(CID) | | +| | +-----------+--------------------------------------------------------------------+ | +| | | +| | | ++---------+------------------------------------------------------------------------------------+ +| 23:04 | Reserved | ++---------+------------------------------------------------------------------------------------+ +| 31:24 | PRP Entry1:the first PRP entry for the commmand or a PRP List Pointer | ++---------+------------------------------------------------------------------------------------+ +| 39:32 | PRP Entry2:the second address entry(reserved,page base address or PRP List Pointer)| ++---------+------------------------------------------------------------------------------------+ +| 41:40 | VF index: means which VF controller internal data to load | ++---------+------------------------------------------------------------------------------------+ +| 47:44 | Size: means the size of the device's internal data to be loaded | ++---------+------------------------------------------------------------------------------------+ +| 63:48 | Reserved | ++---------+------------------------------------------------------------------------------------+ + +The LOAD_DATA command is used to restore the NVMe VF internal data. When firmware receives this +command, it will read the device internal's data from the host memory through DMA, restore the +admin queue states and some registers, and restore every IO queue state. + +Extensions of the vendor-specific field in the identify controller data structure +--------------------------------------------------------------------------------- + ++---------+------+------+------+-------------------------------+ +| | | | | | +| Bytes | I/O |Admin | Disc | Description | +| | | | | | ++---------+------+------+------+-------------------------------+ +| | | | | | +| 01:00 | M | M | R | PCI Vendor ID(VID) | ++---------+------+------+------+-------------------------------+ +| | | | | | +| 03:02 | M | M | R | PCI Subsytem Vendor ID(SSVID) | ++---------+------+------+------+-------------------------------+ +| | | | | | +| ... | ... | ... | ... | ... | ++---------+------+------+------+-------------------------------+ +| | | | | | +| 3072 | O | O | O | Live Migration Support | ++---------+------+------+------+-------------------------------+ +| | | | | | +|4095:3073| O | O | O | Vendor Specific | ++---------+------+------+------+-------------------------------+ + +According to NVMe specification, the bytes from 3072 to 4095 are vendor-specific fields. +NVMe device uses the 3072 bytes in the identify controller data structure to indicate +whether live migration is supported. 0x0 means live migration is not supported. 0x01 means +live migration is supported, and other values are reserved. + +[1] https://nvmexpress.org/wp-content/uploads/NVMe-NVM-Express-2.0a-2021.07.26-Ratified.pdf -- 2.34.1