Return-Path: 
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1753207AbbDRBiY (ORCPT );
	Fri, 17 Apr 2015 21:38:24 -0400
Received: from mga09.intel.com ([134.134.136.24]:59085 "EHLO mga09.intel.com"
	rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
	id S1752589AbbDRBiG (ORCPT );
	Fri, 17 Apr 2015 21:38:06 -0400
X-ExtLoop1: 1
X-IronPort-AV: E=Sophos;i="5.11,598,1422950400"; d="scan'208";a="681970598"
Subject: [PATCH 02/21] ND NFIT-Defined/NVDIMM Subsystem
From: Dan Williams 
To: linux-nvdimm@ml01.01.org
Cc: Boaz Harrosh , Neil Brown , Greg KH ,
	linux-kernel@vger.kernel.org, Andy Lutomirski , Jens Axboe ,
	"H. Peter Anvin" , Christoph Hellwig , Ingo Molnar 
Date: Fri, 17 Apr 2015 21:35:25 -0400
Message-ID: <20150418013525.25237.45181.stgit@dwillia2-desk3.amr.corp.intel.com>
In-Reply-To: <20150418013256.25237.96403.stgit@dwillia2-desk3.amr.corp.intel.com>
References: <20150418013256.25237.96403.stgit@dwillia2-desk3.amr.corp.intel.com>
User-Agent: StGit/0.17.1-8-g92dd
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: 8bit
Sender: linux-kernel-owner@vger.kernel.org
List-ID: 
X-Mailing-List: linux-kernel@vger.kernel.org
Content-Length: 34518
Lines: 948

Maintainer information and documentation for drivers/block/nd/

Cc: Andy Lutomirski 
Cc: Boaz Harrosh 
Cc: H. Peter Anvin 
Cc: Jens Axboe 
Cc: Ingo Molnar 
Cc: Christoph Hellwig 
Cc: Neil Brown 
Cc: Greg KH 
Signed-off-by: Dan Williams 
---
 Documentation/blockdev/nd.txt |  867 +++++++++++++++++++++++++++++++++++++++++
 MAINTAINERS                   |   34 +-
 2 files changed, 895 insertions(+), 6 deletions(-)
 create mode 100644 Documentation/blockdev/nd.txt

diff --git a/Documentation/blockdev/nd.txt b/Documentation/blockdev/nd.txt
new file mode 100644
index 000000000000..bcfdf21063ab
--- /dev/null
+++ b/Documentation/blockdev/nd.txt
@@ -0,0 +1,867 @@
+	The NFIT-Defined/NVDIMM Sub-system (ND)
+
+	nd - kernel abi / device-model & ndctl - userspace helper library
+	linux-nvdimm@lists.01.org
+	v9: April 17th, 2015
+
+
+	Glossary
+
+	Overview
+	    Supporting Documents
+	    Git Trees
+
+	NFIT Terminology and NVDIMM Types
+
+	Why BLK?
+	    PMEM vs BLK (SPA vs BDW)
+	    BLK-REGIONs, PMEM-REGIONs, Atomic Sectors, and DAX
+
+	Example NFIT Diagram
+
+	ND Device Model/ABI and NDCTL API
+	    NDCTL: Context
+	        ndctl: instantiate a new library context example
+
+	    ND/NDCTL: Bus
+	        nd: control class device in /sys/class
+	        nd: bus layout
+	        ndctl: bus enumeration example
+
+	    ND/NDCTL: DIMM (NMEM)
+	        nd: DIMM (NMEM) layout
+	        ndctl: DIMM enumeration example
+
+	    ND/NDCTL: Region
+	        nd: region layout
+	        ndctl: region enumeration example
+	        Why Not Encode the Region Type into the Region Name?
+	        How Do I Determine the Major Type of a Region?
+
+	    ND/NDCTL: Namespace
+	        nd: namespace layout
+	        ndctl: namespace enumeration example
+	        ndctl: namespace creation example
+	        Why the Term “namespace”?
+
+	    ND/NDCTL: Block Translation Table “btt”
+	        nd: btt layout
+	        ndctl: btt creation example
+
+	Summary NDCTL Diagram
+
+
+Glossary
+--------
+
+NFIT: NVDIMM Firmware Interface Table
+
+SPA: System Physical Address; also refers to an NFIT system-physical
+address table entry describing a contiguous persistent memory range.
+
+DPA: DIMM Physical Address, a DIMM-relative offset.  With one DIMM in
+the system there would be a 1:1 SPA:DPA association.  Once more DIMMs
+are added, an interleave-description-table provided by the NFIT is
+needed to decode a SPA to a DPA.
+
+DCR: DIMM Control Region Descriptor, an NFIT sub-table entry conveying
+the vendor, format, revision, and geometry of the related
+block-data-windows.
+
+BDW: Block Data Window Region Descriptor, an NFIT sub-table referenced
+by a DCR locating a set of data transfer apertures and control registers
+in system memory.
+
+PMEM: A Linux block device which provides access to an SPA range.  A
+PMEM device is capable of DAX (see below).
+
+DAX: File system extensions to bypass the page cache and block layer to
+map persistent memory, from a PMEM block device, directly into a process
+address space.
+
+BLK: A Linux block device which accesses NVDIMM storage through a BDW
+(block-data-window aperture).  A BLK device is not amenable to DAX.
+
+DSM: Device Specific Method, refers to a runtime service provided by
+platform firmware to send formatted control/configuration messages to a
+DIMM device.  In ACPI this is an _DSM attribute of an object.
+
+BTT: Block Translation Table.  Persistent memory is byte addressable.
+Existing software may have an expectation that the power-fail-atomicity
+of writes is at least one sector, 512 bytes.  The BTT is an indirection
+table with atomic update semantics to front a PMEM/BLK block device
+driver and present arbitrary atomic sector sizes.
+
+LABEL: Metadata stored on a DIMM device that partitions and identifies
+(persistently names) storage between PMEM and BLK.  It also partitions
+BLK storage to host BTTs with different parameters per BLK-partition.
+Note that traditional partition tables, GPT/MBR, are layered on top of a
+BLK or PMEM device.
+
+
+Overview
+--------
+
+The “NVDIMM Firmware Interface Table” (NFIT) defines a set of tables
+that describe the non-volatile memory resources in a platform.  Platform
+firmware provides this table as well as runtime-services for sending
+control and configuration messages to capable NVDIMM devices.  NFIT is a
+new top-level table in ACPI 6.  The Linux ND subsystem is designed as a
+generic mechanism that can register a binary NFIT from any provider,
+ACPI being just one example of a provider.  The unit test infrastructure
+in the kernel exploits this capability to provide multiple sample NFITs
+via custom test-platform-devices.
+
+
+Supporting Documents
+ACPI 6: http://www.uefi.org/sites/default/files/resources/ACPI_6.0.pdf
+NVDIMM Namespace: http://pmem.io/documents/NVDIMM_Namespace_Spec.pdf
+DSM Interface Example: http://pmem.io/documents/NVDIMM_DSM_Interface_Example.pdf
+Driver Writer’s Guide: http://pmem.io/documents/NVDIMM_Driver_Writers_Guide.pdf
+
+
+Git Trees
+ND: https://git.kernel.org/cgit/linux/kernel/git/djbw/nvdimm.git/log/?h=nd
+NDCTL: https://github.com/pmem/ndctl.git
+PMEM: https://github.com/01org/prd
+
+
+NFIT Terminology and NVDIMM Types
+---------------------------------
+
+Prior to the arrival of the NFIT, non-volatile memory was described to a
+system in various ad-hoc ways.  Usually only the bare minimum was
+provided, namely, a single SPA range where writes are expected to be
+durable after a system power loss.  Now, the NFIT specification
+standardizes not only the description of SPA ranges, but also DCR/BDW
+(block-aperture access) and DSM entry points for control/configuration.
+
+
+For each NFIT-defined I/O interface (SPA, DCR/BDW), ND provides a block
+device driver:
+
+
+1. PMEM (nd_pmem.ko): Drives an NFIT system-physical address (SPA)
+   range.  A SPA range is contiguous in system memory and may be
+   interleaved (hardware memory controller striped) across multiple
+   DIMMs.
+   When a SPA is interleaved, the NFIT optionally provides descriptions
+   of which DIMMs are participating in the interleave.
+
+   Note, while ND describes SPAs with backing DIMM information
+   (ND_NAMESPACE_PMEM) with a different device-type than SPAs without
+   such a description (ND_NAMESPACE_IO), to nd_pmem there is no
+   distinction.  The different device-types are an implementation detail
+   that userspace can exploit to implement policies like “only interface
+   with SPA ranges from certain DIMMs”.
+
+
+2. BLK (nd_blk.ko): This driver performs I/O using a set of DCR/BDW
+   defined apertures.  A set of apertures will all access just one DIMM.
+   Multiple windows allow multiple concurrent accesses, much like
+   tagged-command-queuing, and would likely be used by different threads
+   or different CPUs.
+
+   The NFIT specification defines a standard format for a BDW, but the
+   spec also allows for vendor specific layouts.  As of this writing
+   “nd_blk” only supports the example interface detailed in the “DSM
+   Interface Example”.  If another BDW format arrives in the future it
+   can be added as a new sub-device-type to nd_blk or as a new ND device
+   type with its own driver.
+
+
+Why BLK?
+--------
+
+While PMEM provides direct byte-addressable CPU-load/store access to
+NVDIMM storage, it does not provide the best system RAS (recovery,
+availability, and serviceability) model.  An access to a corrupted SPA
+address causes a CPU exception, while an access to a corrupted address
+through a BDW aperture causes that block window to raise an error status
+in a register.  The latter is more aligned with the standard error model
+that host-bus-adapter attached disks present.  Also, if an administrator
+ever wants to replace a memory device, it is easier to service a system
+at DIMM-module boundaries.  Compare this to PMEM, where data could be
+interleaved in an opaque hardware specific manner across several DIMMs.
+
+
+PMEM vs BLK (SPA vs BDW)
+------------------------
+
+BDWs solve this RAS problem, but their presence is also the major
+contributing factor to the complexity of the ND subsystem.  They
+complicate the implementation because PMEM and BLK alias in DPA space.
+Any given DIMM’s DPA-range may contribute to one or more SPA sets of
+interleaved DIMMs, *and* may also be accessed in its entirety through
+its BDW.  Accessing a DPA through a SPA while simultaneously accessing
+the same DPA through a BDW has undefined results.  For this reason,
+DIMMs with this dual interface configuration include a DSM function to
+store/retrieve a LABEL.  The LABEL effectively partitions the DPA-space
+into exclusive SPA and BDW accessible regions.  For simplicity, a DIMM
+is allowed one PMEM “region” for each interleave set in which it is a
+member.  The remaining DPA space can be carved into an arbitrary number
+of BLK devices with discontiguous extents.
+
+
+BLK-REGIONs, PMEM-REGIONs, Atomic Sectors, and DAX
+--------------------------------------------------
+
+One of the few reasons to allow multiple BLK namespaces per REGION is so
+that each BLK-namespace can be configured with a BTT with unique atomic
+sector sizes.  While a PMEM device can host a BTT, the LABEL
+specification does not provide for a sector size to be specified for a
+PMEM namespace.  This is due to the expectation that the primary usage
+model for PMEM is via DAX, and the BTT is incompatible with DAX.
+However, for the cases where an application or filesystem still needs
+atomic sector update guarantees, it can register a BTT on a PMEM device
+or partition.
+See ND/NDCTL: Block Translation Table “btt”
+
+
+________________
+
+
+Example NFIT Diagram
+
+
+For the remainder of this document the following diagram and device
+names will be referenced for the example sysfs layouts.
+
+
+                      (a)                  (b)           DIMM   BLK-REGION
+            +-------------------+--------+--------+--------+
+  +------+  |       pm0.0       | blk2.0 | pm1.0  | blk2.1 |   0    region2
+  | imc0 +--+- - - region0- - - +--------+        +--------+
+  +--+---+  |       pm0.0       | blk3.0 | pm1.0  | blk3.1 |   1    region3
+     |      +-------------------+--------v        v--------+
+  +--+---+                               |                 |
+  | cpu0 |                                    region1
+  +--+---+                               |                 |
+     |      +----------------------------^        ^--------+
+  +--+---+  |           blk4.0           | pm1.0  | blk4.0 |   2    region4
+  | imc1 +--+----------------------------|        +--------+
+  +------+  |           blk5.0           | pm1.0  | blk5.0 |   3    region5
+            +----------------------------+--------+--------+
+
+
+In this platform we have four DIMMs and two memory controllers in one
+socket.  Each unique interface (BLK or PMEM) to DPA space is identified
+by a region device with a dynamically assigned id (REGION0 - REGION5).
+
+
+1. The first portion of DIMM0 and DIMM1 are interleaved as REGION0.  A
+   single PMEM namespace is created in the REGION0-SPA-range that spans
+   DIMM0 and DIMM1 with a user-specified name of "pm0.0".  Some of that
+   interleaved SPA range is reclaimed as BDW accessed space starting at
+   DPA-offset (a) into each DIMM.  In that reclaimed space we create two
+   BDW "namespaces" from REGION2 and REGION3 where "blk2.0" and "blk3.0"
+   are just human readable names that could be set to any user-desired
+   name in the LABEL.
+
+
+2. In the last portion of DIMM0 and DIMM1 we have an interleaved SPA
+   range, REGION1, that spans those two DIMMs as well as DIMM2 and
+   DIMM3.  Some of REGION1 is allocated to a PMEM namespace named
+   "pm1.0"; the rest is reclaimed in 4 BDW namespaces (one for each DIMM
+   in the interleave set): "blk2.1", "blk3.1", "blk4.0", and "blk5.0".
+
+
+3. The portions of DIMM2 and DIMM3 that do not participate in the
+   REGION1 interleaved SPA range (i.e. the DPA addresses below offset
+   (b)) are also included in the "blk4.0" and "blk5.0" namespaces.
+   Note, this example shows that BDW namespaces don't need to be
+   contiguous in DPA-space.
+
+This bus is provided by the kernel under the device
+/sys/devices/platform/nfit_test.0 when CONFIG_NFIT_TEST is enabled and
+the nfit_test.ko module is loaded.
+
+
+ND Device Model/ABI and NDCTL API
+---------------------------------
+
+What follows is a description of the ND sysfs layout and a corresponding
+object hierarchy diagram as viewed through the NDCTL API.  The example
+sysfs paths and diagrams are relative to the Example NFIT Diagram, which
+is also the NFIT used in the “nd/ndctl” unit test.
+
+
+NDCTL: Context
+Every API call in the NDCTL library requires a context that holds the
+logging parameters and other library instance state.  The library is
+based on the libabc template:
+https://git.kernel.org/cgit/linux/kernel/git/kay/libabc.git/
+
+ndctl: instantiate a new library context example
+
+	struct ndctl_ctx *ctx;
+
+	if (ndctl_new(&ctx) == 0)
+		return ctx;
+	else
+		return NULL;
+
+
+ND/NDCTL: Bus
+A bus has a 1:1 relationship with an NFIT.  The current expectation for
+ACPI based systems is that there is only ever one platform-global NFIT.
+That said, it is trivial to register multiple NFITs; the specification
+does not preclude it.  The infrastructure supports multiple busses and
+we use this capability to test multiple NFIT configurations in the unit
+test.
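+
+As a quick illustration of this multi-bus capability, the following
+sketch walks every bus registered against a library context.  It relies
+only on the ndctl_bus_foreach() / ndctl_bus_get_provider() helpers that
+also appear in the bus enumeration example below:
+
+	static void list_busses(struct ndctl_ctx *ctx)
+	{
+		struct ndctl_bus *bus;
+
+		/* print the provider name of every registered bus */
+		ndctl_bus_foreach(ctx, bus)
+			printf("bus: %s\n", ndctl_bus_get_provider(bus));
+	}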
+
+nd: control class device in /sys/class
+
+This character device accepts DSM messages to be passed to a DIMM
+identified by its NFIT handle.
+
+	/sys/class/nd/ndctl0
+	|-- dev
+	|-- device -> ../../../ndbus0
+	|-- subsystem -> ../../../../../../../class/nd
+
+
+nd: bus layout
+
+	/sys/devices/platform/nfit_test.0/ndbus0
+	|-- btt0
+	|-- btt_seed
+	|-- commands
+	|-- nd
+	|-- nmem0
+	|-- nmem1
+	|-- nmem2
+	|-- nmem3
+	|-- provider
+	|-- region0
+	|-- region1
+	|-- region2
+	|-- region3
+	|-- region4
+	|-- region5
+	|-- revision
+	|-- uevent
+	`-- wait_probe
+
+
+ndctl: bus enumeration example
+
+Find the 'bus' handle that describes the bus from Example NFIT Diagram:
+
+	static struct ndctl_bus *get_bus_by_provider(struct ndctl_ctx *ctx,
+			const char *provider)
+	{
+		struct ndctl_bus *bus;
+
+		ndctl_bus_foreach(ctx, bus)
+			if (strcmp(provider, ndctl_bus_get_provider(bus)) == 0)
+				return bus;
+
+		return NULL;
+	}
+
+	bus = get_bus_by_provider(ctx, "nfit_test.0");
+
+
+ND/NDCTL: DIMM (NMEM)
+
+The DIMM object identifies the NFIT “handle” and a “phys_id” for a given
+memory device.  The “handle” is derived from the DIMM’s physical
+location (socket, memory-controller, channel, slot).  The “phys_id” is
+used for looking up DIMM details in other platform tables.  The handle
+value is also used to send control/configuration messages via ioctl
+through the “ndctl0” device in the given example.  The kernel id (“N” in
+“DIMMN”) for the device is dynamically assigned.  The “vendor”,
+“device”, “revision” and “format” attributes are optionally available if
+the NFIT publishes a DCR (DIMM-control-region) for the given memory
+device.  These latter attributes are only useful in the presence of a
+vendor-specific DIMM.
+
+
+Note that the kernel device name for “DIMMs” is “nmemX”.  The NFIT
+describes these devices via “Memory Device to System Physical Address
+Range Mapping Structure”, and there is no requirement that they actually
+be DIMMs, so we use a more generic name.
+
+
+nd: DIMM (NMEM) layout
+
+	/sys/devices/platform/nfit_test.0/ndbus0/
+	|-- nmem0
+	|   |-- available_slots
+	|   |-- commands
+	|   |-- dev
+	|   |-- device
+	|   |-- devtype
+	|   |-- driver -> ../../../../../bus/nd/drivers/nd_dimm
+	|   |-- format
+	|   |-- handle
+	|   |-- modalias
+	|   |-- phys_id
+	|   |-- revision
+	|   |-- serial
+	|   |-- state
+	|   |-- subsystem -> ../../../../../bus/nd
+	|   |-- uevent
+	|   `-- vendor
+	|-- nmem1
+	[..]
+
+ndctl: DIMM enumeration example
+
+Note, DIMMs are identified by an “nfit_handle”, a 32-bit value where:
+
+	Bit 3:0   DIMM number within the memory channel
+	Bit 7:4   memory channel number
+	Bit 11:8  memory controller ID
+	Bit 15:12 socket ID
+	Bit 27:16 Node Controller ID
+	Bit 31:28 Reserved
+
+	static struct ndctl_dimm *get_dimm_by_handle(struct ndctl_bus *bus,
+			unsigned int handle)
+	{
+		struct ndctl_dimm *dimm;
+
+		ndctl_dimm_foreach(bus, dimm)
+			if (ndctl_dimm_get_handle(dimm) == handle)
+				return dimm;
+
+		return NULL;
+	}
+
+	#define DIMM_HANDLE(n, s, i, c, d) \
+		(((n & 0xfff) << 16) | ((s & 0xf) << 12) | ((i & 0xf) << 8) \
+		 | ((c & 0xf) << 4) | (d & 0xf))
+
+	dimm = get_dimm_by_handle(bus, DIMM_HANDLE(0, 0, 0, 0, 0));
+
+
+ND/NDCTL: Region
+A generic REGION device is registered for each SPA or DCR/BDW.  Per the
+example there are 6 regions: 2 SPAs and 4 BDWs on the “nfit_test.0” bus.
+The primary role of a region is to be a container of “mappings”.  A
+mapping is a tuple of <dimm, dpa-start-offset, length>.
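+
+For illustration, here is a sketch that dumps the mapping tuples of a
+region.  Note, ndctl_mapping_get_offset() and ndctl_mapping_get_length()
+are assumed accessors implied by the tuple above; they do not otherwise
+appear in this document:
+
+	static void dump_mappings(struct ndctl_region *region)
+	{
+		struct ndctl_mapping *map;
+
+		ndctl_mapping_foreach(region, map) {
+			struct ndctl_dimm *dimm = ndctl_mapping_get_dimm(map);
+
+			/* one line per <dimm, dpa-start-offset, length> */
+			printf("dimm: %#x offset: %llu length: %llu\n",
+					ndctl_dimm_get_handle(dimm),
+					ndctl_mapping_get_offset(map),
+					ndctl_mapping_get_length(map));
+		}
+	}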
+
+The ND core provides a driver for these REGION devices.  This driver is
+responsible for reconciling the aliased mappings across all regions,
+parsing the LABEL, if present, and then emitting “namespace” devices
+with the resolved/exclusive DPA-boundaries for an ND PMEM or BLK device
+driver to consume.
+
+In addition to the generic attributes “mappings”, “interleave_ways”,
+and “size”, the REGION device also exports some convenience attributes.
+“nstype” indicates the integer type of namespace-device this region
+emits, “devtype” duplicates the DEVTYPE variable stored by udev at the
+‘add’ event, “modalias” duplicates the MODALIAS variable stored by udev
+at the ‘add’ event, and finally, the optional “spa_index” is provided in
+the case where the region is defined by a SPA.
+
+nd: region layout
+
+	|-- region0
+	|   |-- available_size
+	|   |-- devtype
+	|   |-- driver -> ../../../../../bus/nd/drivers/nd_region
+	|   |-- init_namespaces
+	|   |-- mapping0
+	|   |-- mapping1
+	|   |-- mappings
+	|   |-- modalias
+	|   |-- namespace0.0
+	|   |-- namespace_seed
+	|   |-- nstype
+	|   |-- set_cookie
+	|   |-- size
+	|   |-- spa_index
+	|   |-- subsystem -> ../../../../../bus/nd
+	|   `-- uevent
+	|-- region1
+	|   |-- available_size
+	|   |-- devtype
+	|   |-- driver -> ../../../../../bus/nd/drivers/nd_region
+	|   |-- init_namespaces
+	|   |-- mapping0
+	|   |-- mapping1
+	|   |-- mapping2
+	|   |-- mapping3
+	|   |-- mappings
+	|   |-- modalias
+	|   |-- namespace1.0
+	|   |-- namespace_seed
+	|   |-- nstype
+	|   |-- set_cookie
+	|   |-- size
+	|   |-- spa_index
+	|   |-- subsystem -> ../../../../../bus/nd
+	|   `-- uevent
+	|-- region2
+	[..]
+
+
+ndctl: region enumeration example
+
+Sample region retrieval routines based on NFIT-unique data like
+“spa_index” (interleave set id) for PMEM and “nfit_handle” (dimm id) for
+BLK.
+
+	static struct ndctl_region *get_pmem_region_by_spa_index(struct ndctl_bus *bus,
+			unsigned int spa_index)
+	{
+		struct ndctl_region *region;
+
+		ndctl_region_foreach(bus, region) {
+			if (ndctl_region_get_type(region) != ND_DEVICE_REGION_PMEM)
+				continue;
+			if (ndctl_region_get_spa_index(region) == spa_index)
+				return region;
+		}
+		return NULL;
+	}
+
+	static struct ndctl_region *get_blk_region_by_dimm_handle(struct ndctl_bus *bus,
+			unsigned int handle)
+	{
+		struct ndctl_region *region;
+
+		ndctl_region_foreach(bus, region) {
+			struct ndctl_mapping *map;
+
+			if (ndctl_region_get_type(region) != ND_DEVICE_REGION_BLOCK)
+				continue;
+			ndctl_mapping_foreach(region, map) {
+				struct ndctl_dimm *dimm = ndctl_mapping_get_dimm(map);
+
+				if (ndctl_dimm_get_handle(dimm) == handle)
+					return region;
+			}
+		}
+		return NULL;
+	}
+
+
+Why Not Encode the Region Type into the Region Name?
+
+At first glance it seems, since the NFIT defines just PMEM and BLK
+interface types, that we should simply name REGION devices with
+something derived from those type names.  However, the ND subsystem
+explicitly keeps the REGION name generic and expects userspace to always
+consider the region-attributes, for four reasons:
+
+1. There are already more than two REGION and “namespace” types.  For
+   PMEM there are two subtypes.  As mentioned previously, we have PMEM
+   where the constituent DIMM devices are known, and anonymous PMEM.
+   For BLK regions the NFIT specification already anticipates vendor
+   specific implementations.  The exact distinction of what a region
+   contains is in the region-attributes, not the region-name or the
+   region-devtype.
+
+2. A region with zero child-namespaces is a possible configuration.  For
+   example, the NFIT allows for a DCR to be published without a
+   corresponding BDW.
+   This equates to a DIMM that can only accept control/configuration
+   messages, but allows no I/O through a descendant block device.
+   Again, this “type” is advertised in the attributes (‘mappings’ == 0)
+   and the name does not tell you much.
+
+3. What if a third major interface type arises in the future?  Outside
+   of vendor specific implementations, it’s not difficult to envision a
+   third class of interface type beyond BLK and PMEM.  With a generic
+   name for the REGION level of the device-hierarchy, old userspace
+   implementations can still make sense of new kernel-advertised
+   region-types.  Userspace can always rely on the generic region
+   attributes like “mappings”, “size”, etc., and the expected child
+   devices named “namespace”.  This generic format of the device-model
+   hierarchy allows the ND and NDCTL implementations to be more uniform
+   and future-proof.
+
+4. There are more robust mechanisms for determining the major type of a
+   region than a device name.  See the next section, How Do I Determine
+   the Major Type of a Region?
+
+
+How Do I Determine the Major Type of a Region?
+
+Outside of the blanket recommendation of “use the ndctl library”, or
+simply looking at the kernel header to decode the “nstype” integer
+attribute, here are some other options.
+
+
+1. module alias lookup:
+
+   The whole point of region/namespace device type differentiation is to
+   decide which block-device driver will attach to a given ND namespace.
+   One can simply use the modalias to look up the resulting module.
+   It’s important to note that this method is robust in the presence of
+   a vendor-specific driver down the road.  If a vendor-specific
+   implementation wants to supplant the standard nd_blk driver it can do
+   so with minimal impact to the rest of ND.
+
+   In fact, a vendor may also want to have a vendor-specific
+   region-driver (outside of nd_region).  For example, if a vendor
+   defined its own LABEL format it would need its own region driver to
+   parse that LABEL and emit the resulting namespaces.  The output from
+   module resolution is more accurate than a region-name or
+   region-devtype.
+
+
+2. udev:
+
+   The kernel “devtype” is registered in the udev database
+
+   # udevadm info --path=/devices/platform/nfit_test.0/ndbus0/region0
+   P: /devices/platform/nfit_test.0/ndbus0/region0
+   E: DEVPATH=/devices/platform/nfit_test.0/ndbus0/region0
+   E: DEVTYPE=nd_pmem
+   E: MODALIAS=nd:t2
+   E: SUBSYSTEM=nd
+
+   # udevadm info --path=/devices/platform/nfit_test.0/ndbus0/region4
+   P: /devices/platform/nfit_test.0/ndbus0/region4
+   E: DEVPATH=/devices/platform/nfit_test.0/ndbus0/region4
+   E: DEVTYPE=nd_blk
+   E: MODALIAS=nd:t3
+   E: SUBSYSTEM=nd
+
+   ...and is available as a region attribute, but keep in mind that the
+   “devtype” does not indicate sub-type variations and scripts should
+   really be examining the other attributes.
+
+
+3. type specific attributes:
+
+   As it currently stands, a BDW region will never have a “spa_index”
+   attribute.  A DCR region with a “mappings” value of 0 is, as
+   mentioned above, a DIMM that does not allow I/O.  A PMEM region with
+   a “mappings” value of zero is a simple SPA range.
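+
+To make the library recommendation concrete, here is a sketch that
+classifies a region’s major type using only ndctl_region_get_type() and
+the type constants already used in the region enumeration examples
+above:
+
+	static const char *region_major_type(struct ndctl_region *region)
+	{
+		/* classify by type attribute, never by device name */
+		switch (ndctl_region_get_type(region)) {
+		case ND_DEVICE_REGION_PMEM:
+			return "pmem";
+		case ND_DEVICE_REGION_BLOCK:
+			return "blk";
+		default:
+			return "unknown";
+		}
+	}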
+
+
+ND/NDCTL: Namespace
+
+A REGION, after resolving DPA aliasing and LABEL specified boundaries,
+surfaces one or more “namespace” devices.  The arrival of a “namespace”
+device currently triggers either the nd_blk or nd_pmem driver to load
+and register a disk/block device.
+
+
+nd: namespace layout
+
+Here is a sample layout from the three major types of NAMESPACE, where
+namespace0.0 represents DIMM-info-backed PMEM (note that it has a ‘uuid’
+attribute), namespace2.0 represents a BLK namespace (note that it has a
+‘sector_size’ attribute), and namespace6.0 represents an anonymous PMEM
+namespace (note that it has no ‘uuid’ attribute because it does not
+support a LABEL).
+
+	/sys/devices/platform/nfit_test.0/ndbus0/region0/namespace0.0
+	|-- alt_name
+	|-- devtype
+	|-- dpa_extents
+	|-- modalias
+	|-- resource
+	|-- size
+	|-- subsystem -> ../../../../../../bus/nd
+	|-- type
+	|-- uevent
+	`-- uuid
+	/sys/devices/platform/nfit_test.0/ndbus0/region2/namespace2.0
+	|-- alt_name
+	|-- devtype
+	|-- dpa_extents
+	|-- modalias
+	|-- sector_size
+	|-- size
+	|-- subsystem -> ../../../../../../bus/nd
+	|-- type
+	|-- uevent
+	`-- uuid
+	/sys/devices/platform/nfit_test.1/ndbus1/region6/namespace6.0
+	|-- block
+	|   `-- pmem0
+	|-- devtype
+	|-- driver -> ../../../../../../bus/nd/drivers/pmem
+	|-- modalias
+	|-- resource
+	|-- size
+	|-- subsystem -> ../../../../../../bus/nd
+	|-- type
+	`-- uevent
+
+
+ndctl: namespace enumeration example
+
+Namespaces are indexed relative to their parent region, example below.
+These indexes are mostly static from boot to boot, but the subsystem
+makes no guarantees in this regard.  For a static namespace identifier
+use its ‘uuid’ attribute.
+
+	static struct ndctl_namespace *get_namespace_by_id(struct ndctl_region *region,
+			unsigned int id)
+	{
+		struct ndctl_namespace *ndns;
+
+		ndctl_namespace_foreach(region, ndns)
+			if (ndctl_namespace_get_id(ndns) == id)
+				return ndns;
+
+		return NULL;
+	}
+
+
+ndctl: namespace creation example
+
+Idle namespaces are automatically created by the kernel if a given
+region has enough available capacity to create a new namespace.
+Namespace instantiation involves finding an idle namespace and
+configuring it.  For the most part the setting of namespace attributes
+can occur in any order; the only constraint is that ‘uuid’ must be set
+before ‘size’.  This enables the kernel to track DPA allocations
+internally with a static identifier.
+
+	static int configure_namespace(struct ndctl_region *region,
+			struct ndctl_namespace *ndns,
+			struct namespace_parameters *parameters)
+	{
+		char devname[50];
+
+		snprintf(devname, sizeof(devname), "namespace%d.%d",
+				ndctl_region_get_id(region), parameters->id);
+
+		ndctl_namespace_set_alt_name(ndns, devname);
+		/* 'uuid' must be set prior to setting size! */
+		ndctl_namespace_set_uuid(ndns, parameters->uuid);
+		ndctl_namespace_set_size(ndns, parameters->size);
+		/* unlike pmem namespaces, blk namespaces have a sector size */
+		if (parameters->lbasize)
+			ndctl_namespace_set_sector_size(ndns, parameters->lbasize);
+		return ndctl_namespace_enable(ndns);
+	}
+
+
+Why the Term “namespace”?
+
+1. Why not “volume” for instance?  “volume” ran the risk of confusing ND
+   with a volume manager like device-mapper.
+
+2. The term originated to describe the sub-devices that can be created
+   within an NVMe controller (see the NVMe specification:
+   http://www.nvmexpress.org/specifications/), and NFIT namespaces are
+   meant to parallel the capabilities and configurability of
+   NVMe-namespaces.
+
+
+ND/NDCTL: Block Translation Table “btt”
+
+A BTT (design document: http://pmem.io/2014/09/23/btt.html) is a stacked
+block device driver that fronts either the whole block device or a
+partition of a block device emitted by either a PMEM or BLK NAMESPACE.
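+
+As an aside, here is a sketch of the reverse lookup: find the BTT, if
+any, stacked on a given block device.  ndctl_btt_get_backing_dev() is an
+assumed accessor mirroring the ndctl_btt_set_backing_dev() call in the
+creation example below:
+
+	static struct ndctl_btt *get_btt_by_backing_dev(struct ndctl_bus *bus,
+			const char *bdevpath)
+	{
+		struct ndctl_btt *btt;
+
+		ndctl_btt_foreach(bus, btt)
+			if (ndctl_btt_is_enabled(btt) && strcmp(bdevpath,
+					ndctl_btt_get_backing_dev(btt)) == 0)
+				return btt;
+
+		return NULL;
+	}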
+
+
+nd: btt layout
+
+Every bus will start out with at least one BTT device, the seed device.
+To activate it set the “backing_dev”, “uuid”, and “sector_size”
+attributes and then bind the device to the nd_btt driver.
+
+	/sys/devices/platform/nfit_test.0/ndbus0/btt0/
+	├── backing_dev
+	├── delete
+	├── devtype
+	├── modalias
+	├── sector_size
+	├── subsystem -> ../../../../../bus/nd
+	├── uevent
+	└── uuid
+
+ndctl: btt creation example
+
+Similar to namespaces, an idle BTT device is automatically created per
+bus.  Each time this “seed” btt device is configured and enabled a new
+seed is created.  Creating a BTT configuration involves two steps:
+finding an idle BTT and assigning it to front a PMEM or BLK namespace.
+
+	static struct ndctl_btt *get_idle_btt(struct ndctl_bus *bus)
+	{
+		struct ndctl_btt *btt;
+
+		ndctl_btt_foreach(bus, btt)
+			if (!ndctl_btt_is_enabled(btt)
+					&& !ndctl_btt_is_configured(btt))
+				return btt;
+
+		return NULL;
+	}
+
+	static int configure_btt(struct ndctl_bus *bus,
+			struct btt_parameters *parameters)
+	{
+		struct ndctl_btt *btt = get_idle_btt(bus);
+		char bdevpath[50];
+
+		sprintf(bdevpath, "/dev/%s",
+				ndctl_namespace_get_block_device(parameters->ndns));
+		ndctl_btt_set_uuid(btt, parameters->uuid);
+		ndctl_btt_set_sector_size(btt, parameters->sector_size);
+		ndctl_btt_set_backing_dev(btt, bdevpath);
+		return ndctl_btt_enable(btt);
+	}
+
+
+Once instantiated, an “nd_btt” link will be created under the
+“backing_dev” (pmem0) block device:
+
+	/sys/block/pmem0/
+	├── alignment_offset
+	├── bdi -> ../../../../../../../virtual/bdi/259:0
+	├── capability
+	├── dev
+	├── device -> ../../../namespace0.0
+	├── discard_alignment
+	├── ext_range
+	├── holders
+	├── inflight
+	└── nd_btt -> ../../../../btt0
+
+
+...and a new inactive seed device will appear on the bus.
+
+
+Once a “backing_dev” is disabled, its associated BTT will be
+automatically deleted.  This deletion is only at the device model level.
+In order to destroy a BTT, the “info block” needs to be destroyed.
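+
+A hypothetical usage sketch tying the helpers above together (the
+btt_parameters values are illustrative; a real ‘uuid’ would be freshly
+generated, e.g. with uuid_generate(3)):
+
+	struct btt_parameters params = {
+		.sector_size = 4096,
+		/* front "namespace0.0" from the region found earlier */
+		.ndns = get_namespace_by_id(region, 0),
+	};
+
+	/* params.uuid is assumed to be filled in by the caller */
+	if (configure_btt(bus, &params) < 0)
+		fprintf(stderr, "failed to configure btt\n");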
+
+
+Summary NDCTL Diagram
+---------------------
+
+For the given example above, here is the view of the objects as seen by
+the NDCTL API:
+
+ +---+
+ |CTX|                +---------+   +--------------+   +---------------+
+ +-+-+              +-> REGION0 +---> NAMESPACE0.0 +--> PMEM8 "pm0.0"  |
+   |                | +---------+   +--------------+   +---------------+
+ +-------+          | +---------+   +--------------+   +---------------+
+ | DIMM0 <-+        +-> REGION1 +---> NAMESPACE1.0 +--> PMEM6 "pm1.0"  |
+ +-------+ |        | +---------+   +--------------+   +---------------+
+ | DIMM1 <-+ +-v--+ | +---------+   +--------------+   +---------------+
+ +-------+ +-+BUS0+---> REGION2 +-+-> NAMESPACE2.0 +--> ND6 "blk2.0"   |
+ | DIMM2 <-+ +----+ | +---------+ | +--------------+   +----------------------+
+ +-------+ |        |             +-> NAMESPACE2.1 +--> ND5 "blk2.1"   | BTT2 |
+ | DIMM3 <-+        |               +--------------+   +----------------------+
+ +-------+          | +---------+   +--------------+   +---------------+
+                    +-> REGION3 +-+-> NAMESPACE3.0 +--> ND4 "blk3.0"   |
+                    | +---------+ | +--------------+   +----------------------+
+                    |             +-> NAMESPACE3.1 +--> ND3 "blk3.1"   | BTT1 |
+                    |               +--------------+   +----------------------+
+                    | +---------+   +--------------+   +---------------+
+                    +-> REGION4 +---> NAMESPACE4.0 +--> ND2 "blk4.0"   |
+                    | +---------+   +--------------+   +---------------+
+                    | +---------+   +--------------+   +----------------------+
+                    +-> REGION5 +---> NAMESPACE5.0 +--> ND1 "blk5.0"   | BTT0 |
+                      +---------+   +--------------+   +---------------+------+
diff --git a/MAINTAINERS b/MAINTAINERS
index 4517613dc638..6bc0af450544 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -6666,6 +6666,34 @@ S:	Maintained
 F:	Documentation/hwmon/nct6775
 F:	drivers/hwmon/nct6775.c
 
+ND (NFIT-DEFINED/NVDIMM SUBSYSTEM)
+M:	Dan Williams 
+L:	linux-nvdimm@lists.01.org
+Q:	https://patchwork.kernel.org/project/linux-nvdimm/list/
+S:	Supported
+F:	drivers/block/nd/*
+F:	include/linux/nd.h
+F:	include/uapi/linux/ndctl.h
+
+ND BLOCK APERTURE DRIVER
+M:	Ross Zwisler 
+L:	linux-nvdimm@lists.01.org
+S:	Supported
+F:	drivers/block/nd/blk.c
+F:	drivers/block/nd/region_devs.c
+
+ND BLOCK TRANSLATION TABLE
+M:	Vishal Verma 
+L:	linux-nvdimm@lists.01.org
+S:	Supported
+F:	drivers/block/nd/btt*
+
+ND PERSISTENT MEMORY DRIVER
+M:	Ross Zwisler 
+L:	linux-nvdimm@lists.01.org
+S:	Supported
+F:	drivers/block/nd/pmem.c
+
 NETEFFECT IWARP RNIC DRIVER (IW_NES)
 M:	Faisal Latif 
 L:	linux-rdma@vger.kernel.org
@@ -8071,12 +8099,6 @@ S:	Maintained
 F:	Documentation/blockdev/ramdisk.txt
 F:	drivers/block/brd.c
 
-PERSISTENT MEMORY DRIVER
-M:	Ross Zwisler 
-L:	linux-nvdimm@lists.01.org
-S:	Supported
-F:	drivers/block/pmem.c
-
 RANDOM NUMBER DRIVER
 M:	"Theodore Ts'o" 
 S:	Maintained
-- 
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/