Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1751950AbdHCHsZ (ORCPT ); Thu, 3 Aug 2017 03:48:25 -0400 Received: from mail-pg0-f45.google.com ([74.125.83.45]:36932 "EHLO mail-pg0-f45.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751775AbdHCHsU (ORCPT ); Thu, 3 Aug 2017 03:48:20 -0400 From: Steven Swanson X-Google-Original-From: Steven Swanson Subject: [RFC 01/16] NOVA: Documentation To: linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-nvdimm@lists.01.org Cc: Steven Swanson , dan.j.williams@intel.com Date: Thu, 03 Aug 2017 00:48:17 -0700 Message-ID: <150174649708.104003.4595004262958377346.stgit@hn> In-Reply-To: <150174646416.104003.14042713459553361884.stgit@hn> References: <150174646416.104003.14042713459553361884.stgit@hn> User-Agent: StGit/0.17.1-27-g0d46-dirty MIME-Version: 1.0 Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: 8bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Length: 40365 Lines: 1009 A brief overview is in README.md. Implementation and usage details are in Documentation/filesystems/nova.txt. These two papers provide a detailed, high-level description of NOVA's design goals and approach: NOVA: A Log-structured File system for Hybrid Volatile/Non-volatile Main Memories (http://cseweb.ucsd.edu/~swanson/papers/FAST2016NOVA.pdf) Hardening the NOVA File System (http://cseweb.ucsd.edu/~swanson/papers/TechReport2017HardenedNOVA.pdf) Signed-off-by: Steven Swanson --- Documentation/filesystems/00-INDEX | 2 Documentation/filesystems/nova.txt | 771 ++++++++++++++++++++++++++++++++++++ MAINTAINERS | 8 README.md | 173 ++++++++ 4 files changed, 954 insertions(+) create mode 100644 Documentation/filesystems/nova.txt create mode 100644 README.md diff --git a/Documentation/filesystems/00-INDEX b/Documentation/filesystems/00-INDEX index b7bd6c9009cc..dc5c72273957 100644 --- a/Documentation/filesystems/00-INDEX +++ b/Documentation/filesystems/00-INDEX @@ -95,6 +95,8 @@ nfs/ - nfs-related documentation. nilfs2.txt - info and mount options for the NILFS2 filesystem. +nova.txt + - info on the NOVA filesystem. ntfs.txt - info and mount options for the NTFS filesystem (Windows NT). ocfs2.txt diff --git a/Documentation/filesystems/nova.txt b/Documentation/filesystems/nova.txt new file mode 100644 index 000000000000..af90da1c3fb1 --- /dev/null +++ b/Documentation/filesystems/nova.txt @@ -0,0 +1,771 @@ +The NOVA Filesystem +=================== + +NOVA is a DAX file system designed to maximize performance on hybrid DRAM and +non-volatile main memory (NVMM) systems while providing strong consistency +guarantees. NOVA adapts conventional log-structured file system techniques to +exploit the fast random access that NVMs provide. In particular, it maintains +separate logs for each inode to improve concurrency, and stores file data +outside the log to minimize log size and reduce garbage collection costs. NOVA's +logs provide metadata, data, and mmap atomicity and focus on simplicity and +reliability, keeping complex metadata structures in DRAM to accelerate lookup +operations. + +The main NOVA features include: + + * POSIX semantics + * Directly access (DAX) byte-addressable NVMM without page caching + * Per-CPU NVMM pool to maximize concurrency + * Strong consistency guarantees with 8-byte atomic stores + * Full filesystem snapshot with DAX-mmap support + * Checksums on metadata and file data (crc32c) + * Full metadata replication and RAID-5 parity per file page + * Online filesystem integrity check and corruption recovery + +Filesystem Design +================= +NOVA divides NVMM into five regions. NOVA's 512 B superblock contains global +file system information and the recovery inode. The recovery inode represents a +special file that stores recovery information (e.g., the list of unallocated +NVMM pages). NOVA divides its inode tables into per-CPU stripes. It also +provides per-CPU journals for complex file operations that involve multiple +inodes. The rest of the available NVMM stores logs and file data. + +NOVA is log-structured and stores a separate log for each inode to maximize +concurrency and provide atomicity for operations that affect a single file. The +logs only store metadata and comprise a linked list of 4 KB pages. Log entries +are small – between 32 and 64 bytes. Logs are generally non-contiguous, and log +pages may reside anywhere in NVMM. + +NOVA keeps read-only copies of most file metadata in DRAM during normal +operations, eliminating the need to access metadata in NVMM during reads. + +NOVA uses copy-on-write to provide atomic updates for file data and appends +metadata about the write to the log. For operations that affect multiple inodes +NOVA uses lightweight, fixed-length journals – one per core. + +NOVA divides the allocatable NVMM into multiple regions, one region per CPU +core. A per-core allocator manages each of the regions, minimizing contention +during memory allocation. + +After a system crash, NOVA must scan all the logs to rebuild the memory +allocator state. Since, there are many logs, NOVA aggressively parallelizes the +scan. + +Using NOVA +========== + +NOVA runs on a pmem non-volatile memory region. You can create one of these +regions with the `memmap` kernel command line option. For instance, adding +`memmap=16G!8G` to the kernel boot parameters will reserve 16GB memory starting +from address 8GB, and the kernel will create a `pmem0` block device under the +`/dev` directory. + +After the OS has booted, you can initialize a NOVA instance with the following commands: + + +# modprobe nova +# mount -t NOVA -o init /dev/pmem0 /mnt/ramdisk + + +The above commands create a NOVA instance on `/dev/pmem0` and mounts it on +`/mnt/ramdisk`. + +Nova support several module command line options: + + * metadata_csum: Enable metadata replication and checksums (default 0) + + * data_csum: Compute checksums on file data. (default: 0) + + * data_parity: Compute parity for file data. (default: 0) + + * inplace_data_updates: Update data in place rather than with COW (default: 0) + + * wprotect: Make PMEM unwritable and then use CR0.WP to enable writes as + needed (default: 0). You must also install the nd_pmem module as with + wprotect =1 (e.g., modprobe nd_pmem readonly=1). + +For instance to enable all Nova's data protection features: + +# modprobe nova metadata_csum=1\ + data_csum=1\ + data_parity=1\ + wprotect=1 + +Currently, remounting file systems with different combinations of options may +not work. + +To recover an existing NOVA instance, mount NOVA without the init option, for example: + +# mount -t NOVA /dev/pmem0 /mnt/ramdisk + +### Taking Snapshots + +To create a snapshot: + +# echo 1 > /proc/fs/NOVA//create_snapshot + +To list the current snapshots: + +# cat /proc/fs/NOVA//snapshots + +To mount a snapshot, mount NOVA and specifying the snapshot index, for example: + +# mount -t NOVA -o snapshot= /dev/pmem0 /mnt/ramdisk + +Users should not write to the file system after mounting a snapshot. + +Source File Structure +===================== + + * nova_def.h/nova.h + Defines NOVA macros and key inline functions. + + * balloc.{h,c} + NOVA's block allocator implementation. + + * bbuild.c + Implements recovery routines to restore the in-use inode list, the NVMM + allocator information, and the snapshot table. + + * checksum.c + Contains checksum-related functions to compute and verify checksums on NOVA + data structures and file pages, and also performs recovery actions when + corruptions are detected. + + * dax.c + Implements DAX read/write functions to access file data. NOVA uses + copy-on-write to modify file pages by default, unless inplace data update is + enabled at mount-time. There are also functions to update and verify the + file data integrity information. + + * dir.c + Contains functions to create, update, and remove NOVA dentries. + + * file.c + Implements file-related operations such as open, fallocate, llseek, fsync, + and flush. + + * gc.c + NOVA's garbage collection functions. + + * inode.{h,c} + Creates, reads, and frees NOVA inode tables and inodes. + + * ioctl.c + Implements some ioctl commands to call NOVA's internal functions. + + * journal.{h,c} + For operations that affect multiple inodes NOVA uses lightweight, + fixed-length journals – one per core. This file contains functions to + create and manage the lite journals. + + * log.{h,c} + Functions to manipulate NOVA inode logs, including log page allocation, log + entry creation, commit, modification, and deletion. + + * mprotect.{h,c} + Implements inline functions to enable/disable writing to different NOVA + data structures. + + * namei.c + Functions to create/remove files, directories, and links. It also looks for + the NOVA inode number for a given path name. + + * parity.c + Functions to compute file page parity bits. Each file page is striped in to + equally sized segments (or strips), and one parity strip is calculated using + RAID-5 method. A function to restore a broken data strip is also implemented + in this file. + + * perf.{h,c} + Function performance measurements. It defines + function IDs and call prototypes. Measures primitive functions' + performance, including memory copy functions for DRAM and NVMM, checksum + functions, and XOR parity functions. + + * rebuild.c + When mounting NOVA after a crash, rebuilds NOVA inodes from its logs. There + are also functions to re-calculate checksums and parity bits for file pages + that were mmapped during the crash. + + * snapshot.{h,c} + Code and data structures for taking snapshots. + + * stats.h + Defines data structures and macros that are relevant to gather NOVA usage + statistics. + + * stats.c + Implements routines to gather and print NOVA usage statistics. + + * super.{h,c} + Super block structures and Nova FS layout and entry points for NOVA + mounting and unmounting, initializing or recovering the NOVA super block + and other global file system information. + + * symlink.c + Implements functions to create and read symbolic links in the filesystem. + + * sysfs.c + Implements sysfs entries to take user inputs for taking snapshots, printing + NOVA statistics, and measuring function's performance. + + +FS Layout +====================== + +A Nova file systems resides in single PMEM device. Nova divides the device int +4KB blocks. + + block ++-----------------------------------------------------+ +| 0 | primary super block (struct nova_super_block) | ++-----------------------------------------------------+ +| 1 | Reserved inodes | ++-----------------------------------------------------+ +| 2 | reserved | ++-----------------------------------------------------+ +| 3 | Journal pointers | ++-----------------------------------------------------+ +| 4-5 | Inode pointer tables | ++-----------------------------------------------------+ +| 6 | reserved | ++-----------------------------------------------------+ +| 7 | reserved | ++-----------------------------------------------------+ +| ... | data pages | ++-----------------------------------------------------+ +| n-2 | replica reserved Inodes | ++-----------------------------------------------------+ +| n-1 | replica super block | ++-----------------------------------------------------+ + + + +Superblock and Associated Structures +==================================== + +The beginning of the PMEM device hold the super block and its associated +tables. These include reserved inodes, a table of pointers to the journals +Nova uses for complex operations, and pointers to inodes tables. Nova +maintains replicas of the super block and reserved inodes in the last two +blocks of the PMEM area. + + +Block Allocator/Free Lists +========================== + +Nova uses per-CPU allocators to manage free PMEM blocks. On initialization, +NOVA divides the range of blocks in the PMEM device among the CPUs, and those +blocks are managed solely by that CPU. We call these ranges of "allocation regions". + +Some of the blocks in an allocation region have fixed roles. Here's the +layout: + ++-------------------------------+ +| data checksum blocks | ++-------------------------------+ +| data parity blocks | ++-------------------------------+ +| | +| Allocatable blocks | +| | ++-------------------------------+ +| replica data parity blocks | ++-------------------------------+ +| replica data checksum blocks | ++-------------------------------+ + +The first and last allocation regions, also contain the super block, inode +tables, etc. and their replicas, respectively. + +Each allocator maintains a red-black tree of unallocated ranges (struct +nova_range_node). + +Allocation Functions +-------------------- + +Nova allocate PMEM blocks using two mechanisms: + +1. Static allocation as defined in super.h + +2. Allocation for log and data pages via nova_new_log_blocks() and +nova_new_data_blocks(). + +Both of these functions allow the caller to control whether the allocator +preferes higher addresses for allocation or lower addresses. We use this to +encourage meta data structures and their replicas to be far from one another. + +PMEM Address Translation +------------------------ + +In Nova's persistent data structures, memory locations are given as offsets +from the beginning of the PMEM region. nova_get_block() translates offsets to +PMEM addresses. nova_get_addr_off() performs the reverse translation. + + +Inodes +====== + +Nova maintains per-CPU inode tables, and inode numbers are striped across the +tables (i.e., inos 0, n, 2n,... on cpu 0; inos 1, n + 1, 2n + 1, ... on cpu 1). + +The inodes themselves live in a set of linked lists (one per CPU) of 2MB +blocks. The last 8 bytes of each block points to the next block. Pointers to +heads of these list live in PMEM block INODE_TABLE0_START and are replicated in +PMEM block INODE_TABLE1_START. Additional space for inodes is allocated on +demand. + +To allocate inodes, Nova maintains a per-cpu "inuse_list" in DRAM holds a RB +tree that holds ranges of unallocated inode numbers. + +Logs +==== + +Nova maintains a log for each inode that records updates to the inode's +metadata and holds pointers to the file data. Nova makes updates to file data +and metadata atomic by atomically appending log entries to the log. + +Each inode contains pointers to head and tail of the inode's log. When the log +grows past the end of the last page, nova allocates additional space. For +short logs (less than 1MB) , it doubles the length. For longer logs, it adds a +fixed amount of additional space (1MB). + +Log space is reclaimed during garbage collection. + +Log Entries +----------- + +There are eight kinds of log entry, documented in log.h. The log entries have +several entries in common: + + 1. 'epoch_id' gives the epoch during which the log entry was created. + Creating a snapshot increiments the epoch_id for the file systems. + + 2. 'trans_id' is filesystem-wide, monotone increasing, number assigned each + log entry. It provides an ordering over all FS operations. + + 3. 'invalid' is true if the effects of this entry are dead and the log + entry can be garbage collected. + + 4. 'csum' is a CRC32 checksum for the entry. + +Log structure +------------- + +The logs comprise a linked list of PMEM blocks. The tail of each block + +contains some metadata about the block and pointers to the next block and +block's replica (struct nova_inode_page_tail). + ++----------------+ +| log entry | ++----------------+ +| log entry | ++----------------+ +| ... | ++----------------+ +| tail | +| metadata | +| -> next block | ++----------------+ + + +Journals +======== + +Nova uses a lightweight journaling mechanisms to provide atomicity for +operations that modify more than one on inode. The journals providing logging +for two operations: + +1. Single word updates (JOURNAL_ENTRY) +2. Copying inodes (JOURNAL_INODE) + +The journals are undo logs: Nova creates the journal entries for an operation, +and if the operation does not complete due to a system failure, the recovery +process rolls back the changes using the journal entries. + +To commit, Nova drops the log. + +Nova maintains one journal per CPU. The head and tail pointers for each +journal live in a reserved page near the beginning of the file system. + +During recovery, Nova scans the journals and undoes the operations described by +each entry. + + +File and Directory Access +========================= + +To access file data via read(), Nova maintains a radix tree in DRAM for each +inode (nova_inode_info_header.tree) that maps file offsets to write log +entries. For directories, the same tree maps a hash of filenames to their +corresponding dentry. + +In both cases, the nova populates the tree when the file or directory is opened +by scanning its log. + +MMap and DAX +============ + +NOVA leverages the kernel's DAX mechanisms for mmap and file data access. Nova +maintains a red-black tree in DRAM (nova_inode_info_header.vma_tree) to track +which portions of a file have been mapped. + +Garbage Collection +================== + +Nova recovers log space with a two-phase garbage collection system. When a log +reaches the end of its allocated pages, Nova allocates more space. Then, the +fast GC algorithm scans the log to remove pages that have no valid entries. +Then, it estimates how many pages the logs valid entries would fill. If this +is less than half the number of pages in the log, the second GC phase copies +the valid entries to new pages. + +For example (V=valid; I=invalid): + ++---+ +---+ +---+ +| I | | I | | V | ++---+ +---+ Thorough +---+ +| V | | V | GC | V | ++---+ +---+ =====> +---+ +| I | | I | | V | ++---+ +---+ +---+ +| V | | V | | V | ++---+ +---+ +---+ + | | + V V ++---+ +---+ +| I | | V | ++---+ +---+ +| I | fast GC | I | ++---+ ====> +---+ +| I | | I | ++---+ +---+ +| I | | V | ++---+ +---+ + | + V ++---+ +| V | ++---+ +| I | ++---+ +| I | ++---+ +| V | ++---+ + + +Replication and Checksums +========================= + +Nova protects data and metadat from corruption due to media errors and +"scribbles" -- software errors in the kernels that may overwrite Nova data. + +Replication +----------- + +Nova replicates all PMEM metadata structures (there are a few exceptions. They +are WIP). For structure, there is a primary and an "alternate" (denoted as +"alter" in the code). To ensure that Nova can recover a consistent copy of the +data in case of a failure, Nova first updates the primary, and issues a persist +barrier to ensure that data is written to NVMM. Then it does the same for the +alternate. + +Detection +--------- + +Nova uses two techniques to detect data corruption. For media errors, Nova +should always uses memcpy_from_pmem() to read data from PMEM, usually by +copying the PMEM data structure into DRAM. + +To detect software-caused corruption, Nova uses CRC32 checksums. All the PMEM +data structures in Nova include csum field for this purpose. Nova also +computes CRC32 checksums each 512-byte slice of each data page. + +The checksums are stored in dedicated pages in each CPU's allocation region. + + replica + parity parity + page page + +---+---+---+---+---+---+---+---+ +---+ +---+ +data page 0 | 0 | 1 | 0 | 0 | 1 | 1 | 1 | 0 | | 0 | | 0 | + +---+---+---+---+---+---+---+---+ +---+ +---+ +data page 1 | 0 | 1 | 0 | 0 | 1 | 1 | 1 | 1 | | 1 | | 1 | + +---+---+---+---+---+---+---+---+ +---+ +---+ +data page 2 | 0 | 1 | 0 | 1 | 0 | 1 | 1 | 0 | | 0 | | 0 | + +---+---+---+---+---+---+---+---+ +---+ +---+ +data page 3 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | | 0 | | 0 | + +---+---+---+---+---+---+---+---+ +---+ +---+ + ... ... ... ... + +Recovery +-------- + +Nova uses replication to support recovery of metadata structures and +RAID4-style parity to recover corrupted data. + +If Nova detects corruption of a metadata structure, it restores the structure +using the replica. + +If it detects a corrupt slice of data page, it uses RAID4-style recovery to +restore it. The CRC32 checksums for the page slices are replicated. + +Cautious allocation +------------------- + +To maximize its resilience to software scribbles, Nova allocate metadata +structures and their replicas far from one another. It tries to allocate the +primary copy at a low address and the replica at a high address within the PMEM +region. + +Write Protection +---------------- + +Finally, Nova supports can prevent unintended writes PMEM by mapping the entire +PMEM device as read-only and then disabling _all_ write protection by clearing +the WP bit the CR0 control register when Nova needs to perform a write. The +wprotect mount-time option controls this behavior. + +To map the PMEM device as read-only, we have added a readonly module command +line option to nd_pmem. There is probably a better approach to achieving this +goal. + +Unsafe modes +============ + +Nova support modes that disable some of the protections it provides to improve +perforamnce. + +File data +--------- + +Nova can disable parity and/or checksums on file data (options 'data_parity=0' +and 'data_checksum=0'). Without parity, Nova can detect but not recover from +data corruption. Without checksums, Nova will still detect and recover from +media errors, but not scribbles. + +Nova also supports in-place file updates (option: inplace_data_updates=1). +This breaks atomicity for writes, but improve performance, especially for +sub-page writes, since these require a full page COW in the default mode. + +Metadata +-------- + +Nova can disable metadata checksums and replication (option 'metadata_csum=0'). + + +Snapshots +========= + +Nova supports snapshots to facilitate backups. + +Taking a snapshot +----------------- + +Each Nova file systems has a current epoch_id in the super block and each log +entry has the epoch_id attached to it at creation. When the user creates a +snaphot, Nova increments the epoch_id for the file system and the old epoch_id +identifies the moment the snapshot was taken. + +Nova records the epoch_id and a timestamp in a new log entry (struct +snapshot_info_log_entry) and appends it to the log of the reserved snapshot +inode (NOVA_SNAPSHOT_INODE) in the superblock. + +Nova also maintains a radix tree (nova_sb_info.snapshot_info_tree) of struct +snapshot_info in DRAM indexed by epoch_id. + +Nova also marks all mmap'd pages as read-only and uses COW to preserve file +contents after the snapshot. + +Tracking Live Data +------------------ + +Supporting snapshots requires Nova to preserve file contents from previous +snapshots while also being able to recover the space a snapshot occupied after +its deletion. + +Preserving file contents requires a small change to how Nova implements write +operations. To perform a write, Nova appends a write log entry to the file's +log. The log entry includes pointers to newly-allocated and populated NVMM +pages that hold the written data. If the write overwrites existing data, Nova +locates the previous write log entry for that portion of the file, and performs +an "epoch check" that compares the old log entry's epoch_id to the file +system's current epoch_id. If the comparison matches, the old write log entry +and the file data blocks it points to no longer belong to any snapshot, and +Nova reclaims the data blocks. + +If the epoch_id's do not match, then the data in the old log entry belongs to +an earlier snapshot and Nova leaves the log entry in place. + +Determining when to reclaim data belonging to deleted snapshots requires +additional bookkeeping. For each snapshot, Nova maintains a "snapshot log" +that records the inodes and blocks that belong to that snapshot, but are not +part of the current file system image. + +Nova populates the snapshot log during the epoch check: If the epoch_ids for +the new and old log entries do not match, it appends a log entry (either struct +snapshot_inode_entry or struct snapshot_file_write_entry) to the snapshot log +that the old log entry belongs to. The log entry contains a pointer to the old +log entry, and the filesystem's current epoch_id as the delete_epoch_id. + +To delete a snapshot, Nova removes the snapshot from the list of live snapshots +and appends its log to the following snapshot's log. Then, a background thread +traverses the combined log and reclaims dead inode/data based on the delete +epoch_id: If the delete epoch_id for an entry in the log is less than or equal +to the snapshot's epoch_id, it means the log entry and/or the associated data +blocks are now dead. + +Snapshots and DAX +----------------- + +Taking consistent snapshots while applications are modifying files using +DAX-style mmap requires NOVA to reckon with the order in which stores to NVMM +become persistent (i.e., reach physical NVMM so they will survive a system +failure). These applications rely on the processor's ``memory persistence +model'' [http://dl.acm.org/citation.cfm?id=2665671.2665712] to make guarantees +about when and in what order stores become persistent. These guarantees allow +the application to restore their data to a consistent state during recovery +from a system failure. + +From the application's perspective, reading a snapshot is equivalent to +recovering from a system failure. In both cases, the contents of the +memory-mapped file reflect its state at a moment when application operations +might be in-flight and when the application had no chance to shut down cleanly. + +A naive approach to checkpointing mmap()'d files in NOVA would simply mark each +of the read/write mapped pages as read-only and then do copy-on-write when a +store occurs to preserve the old pages as part of the snapshot. + +However, this approach can leave the snapshot in an inconsistent state: +Setting the page to read-only captures its contents for the +snapshot, and the kernel requires NOVA to set the pages as read-only +one at a time. So, if the order in which NOVA marks pages as read-only +is incompatible with ordering that the application requires, the snapshot will +contain an inconsistent version of the file. + +To resolve this problem, when NOVA starts marking pages as read-only, it blocks +page faults to the read-only mmap()'d pages until it has marked all the pages +read-only and finished taking the snapshot. + +More detail is available in the technical report referenced at the top of this +document. + +We have implemented this functionality in NOVA by adding the 'original_write' +flag to struct vm_area_struct that tracks whether the vm_area_struct is created +with write permission, but has been marked read-only in the course of taking a +snapshot. We have also added a 'dax_cow' operation to struct +vm_operations_struct that the page fault handler runs when applications write +to a page with original_write = 1. NOVA's dax_cow operation +(nova_restore_page_write()) performs the COW, maps the page to a new physical +page and allows writing. + +Saving Snapshot State +--------------------- + +During a clean shutdown, Nova stores the snapshot information to PMEM. + +Nova reserves an inode for storing snapshot information. The log for the inode +contains an entry for each snapshot (struct snapshot_info_log_entry). On +shutdown, Nova allocates one page (struct snapshot_nvmm_page) to store an array +of struct snapshot_nvmm_list. + +Each of these lists (one per CPU) contains head and tail pointers to a linked +list of blocks (just like an inode log). The lists contain a struct +snapshot_file_write_entry or struct snapshot_inode_entry for each operation +that modified file data or an inode. + +Superblock ++--------------------+ +| ... | ++--------------------+ +| Reserved Inodes | ++---+----------------+ +| | ... | ++---+----------------+ +| 7 | Snapshot Inode | +| | head | ++---+----------------+ + / + / + / ++---------+---------+---------+ +| Snap | Snap | Snap | +| epoch=1 | epoch=4 | epoch=11| +| | | | +|nvmm_page|nvmm_page|nvmm_page| ++---------+---------+---------+ + | + | ++----------+ +--------+--------+ +| cpu 0 | | snap | snap | +| head |-->| inode | write | +| | | entry | entry | +| | +--------+--------+ ++----------+ +--------+--------+ +| cpu 1 | | snap | snap | +| head |-->| write | write | +| | | entry | entry | +| | +--------+--------+ ++----------+ +| ... | ++----------+ +--------+ +| cpu 128 | | snap | +| head |-->| inode | +| | | entry | +| | +--------+ ++----------+ + + +Umount and Recovery +=================== + +Clean umount/mount +------------------ + +On a clean unmount, Nova saves the contents of many of its DRAM data structures +to PMEM to accelerate the next mount: + +1. Nova stores the allocator state for each of the per-cpu allocators to the + log of a reserved inode (NOVA_BLOCK_NODE_INO). + +2. Nova stores the per-CPU lists of available inodes (the inuse_list) to the + NOVA_BLOCK_INODELIST1_INO reserved inode. + +3. Nova stores the snapshot state to PMEM as described above. + +After a clean unmount, the following mount restores these data and then +invalidates them. + +Recovery after failures +------------------------ + +In case of a unclean dismount (e.g., system crash), Nova must rebuild these +DRAM structures by scanning the inode logs. Nova log scanning is fast because +per-CPU inode tables and per-inode logs allow for parallel recovery. + +The number of live log entries in an inode log is roughly the number of extents +in the file. As a result, Nova only needs to scan a small fraction of the NVMM +during recovery. + +The Nova failure recovery consists of two steps: + +First, Nova checks its lite weight journals and rolls back any uncommitted +transactions to restore the file system to a consistent state. + +Second, Nova starts a recovery thread on each CPU and scans the inode tables in +parallel, performing log scanning for every valid inode in the inode table. +Nova use different recovery mechanisms for directory inodes and file inodes: +For a directory inode, Nova scans the log's linked list to enumerate the pages +it occupies, but it does not inspect the log's contents. For a file inode, +Nova reads the write entries in the log to enumerate the data pages. + +During the recovery scan Nova builds a bitmap of occupied pages, and rebuilds +the allocator based on the result. After this process completes, the file +system is ready to accept new requests. + +During the same scan, it rebuilds the snapshot information and the list +available inodes. + diff --git a/MAINTAINERS b/MAINTAINERS index 767e9d202adf..cfcee556acc6 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -9108,6 +9108,14 @@ F: drivers/power/supply/bq27xxx_battery_i2c.c F: drivers/power/supply/isp1704_charger.c F: drivers/power/supply/rx51_battery.c +NOVA FILE SYSTEM +M: Andiry Xu +M: Steven Swanson +L: linux-fsdevel@vger.kernel.org +L: linux-nvdimm@lists.01.org +F: Documentation/filesystems/nova.txt +F: fs/nova/ + NTB DRIVER CORE M: Jon Mason M: Dave Jiang diff --git a/README.md b/README.md new file mode 100644 index 000000000000..4f778e99a79e --- /dev/null +++ b/README.md @@ -0,0 +1,173 @@ +# NOVA: NOn-Volatile memory Accelerated log-structured file system + +NOVA's goal is to provide a high-performance, full-featured, production-ready +file system tailored for byte-addressable non-volatile memories (e.g., NVDIMMs +and Intel's soon-to-be-released 3DXpoint DIMMs). It combines design elements +from many other file systems to provide a combination of high-performance, +strong consistency guarantees, and comprehensive data protection. NOVA support +DAX-style mmap and making DAX performs well is a first-order priority in NOVA's +design. NOVA was developed by the [Non-Volatile Systems Laboratory][NVSL] in +the [Computer Science and Engineering Department][CSE] at the [University of +California, San Diego][UCSD]. + + +NOVA is primarily a log-structured file system, but rather than maintain a +single global log for the entire file system, it maintains separate logs for +each file (inode). NOVA breaks the logs into 4KB pages, they need not be +contiguous in memory. The logs only contain metadata. + +File data pages reside outside the log, and log entries for write operations +point to data pages they modify. File modification uses copy-on-write (COW) to +provide atomic file updates. + +For file operations that involve multiple inodes, NOVA use small, fixed-sized +redo logs to atomically append log entries to the logs of the inodes involned. + +This structure keeps logs small and make garbage collection very fast. It also +enables enormous parallelism during recovery from an unclean unmount, since +threads can scan logs in parallel. + +NOVA replicates and checksums all metadata structures and protects file data +with RAID-4-style parity. It supports checkpoints to facilitate backups. + +A more thorough discussion of NOVA's design is avaialable in these two papers: + +**NOVA: A Log-structured File system for Hybrid Volatile/Non-volatile Main Memories** +[PDF](http://cseweb.ucsd.edu/~swanson/papers/FAST2016NOVA.pdf)
+*Jian Xu and Steven Swanson*
+Published in [FAST 2016][FAST2016] + +**Hardening the NOVA File System** +[PDF](http://cseweb.ucsd.edu/~swanson/papers/TechReport2017HardenedNOVA.pdf)
+UCSD-CSE Techreport CS2017-1018 +*Jian Xu, Lu Zhang, Amirsaman Memaripour, Akshatha Gangadharaiah, Amit Borase, Tamires Brito Da Silva, Andy Rudoff, Steven Swanson*
+ +Read on for further details about NOVA's overall design and its current status + +### Compatibilty with Other File Systems + +NOVA aims to be compatible with other Linux file systems. To help verify that it achieves this we run several test suites against NOVA each night. + +* The latest version of XFSTests. ([Current failures](https://github.com/NVSL/linux-nova/issues?q=is%3Aopen+is%3Aissue+label%3AXFSTests)) +* The (Linux testing project)(https://linux-test-project.github.io/) file system tests. +* The (fstest POSIX test suite)[POSIXtest]. + +Currently, nearly all of these tests pass for the `master` branch, and we have +run complex programs on NOVA. There are, of course, many bugs left to fix. + +NOVA uses the standard PMEM kernel interfaces for accessing and managing +persistent memory. + +### Atomicity + +By default, NOVA makes all metadata and file data operations atomic. + +Strong atomicity guarantees make it easier to build reliable applications on +NOVA, and NOVA can provide these guarantees with sacrificing much performance +because NVDIMMs support very fast random access. + +NOVA also supports "unsafe data" and "unsafe metadata" modes that +improve performance in some cases and allows for non-atomic updates of file +data and metadata, respectively. + +### Data Protection + +NOVA aims to protect data against both misdirected writes in the kernel (which +can easily "scribble" over the contents of an NVDIMM) as well as media errors. + +NOVA protects all of its metadata data structures with a combination of +replication and checksums. It protects file data using RAID-5 style parity. + +NOVA can detects data corruption by verifying checksums on each access and by +catching and handling machine check exceptions (MCEs) that arise when the +system's memory controller detects at uncorrectable media error. + +We use a fault injection tool that allows testing of these recovery mechanisms. + +To facilitate backups, NOVA can take snapshots of the current filesystem state +that can be mounted read-only while the current file system is mounted +read-write. + +The tech report list above describes the design of NOVA's data protection system in detail. + +### DAX Support + +Supporting DAX efficiently is a core feature of NOVA and one of the challenges +in designing NOVA is reconciling DAX support which aims to avoid file system +intervention when file data changes, and other features that require such +intervention. + +NOVA's philosophy with respect to DAX is that when a program uses DAX mmap to +to modify a file, the program must take full responsibility for that data and +NOVA must ensure that the memory will behave as expected. At other times, the +file system provides protection. This approach has several implications: + +1. Implementing `msync()` in user space works fine. + +2. While a file is mmap'd, it is not protected by NOVA's RAID-style parity +mechanism, because protecting it would be too expensive. When the file is +unmapped and/or during file system recovery, protection is restored. + +3. The snapshot mechanism must be careful about the order in which in adds +pages to the file's snapshot image. + +### Performance + +The research paper and technical report referenced above compare NOVA's +performance to other file systems. In almost all cases, NOVA outperforms other +DAX-enabled file systems. A notable exception is sub-page updates which incur +COW overheads for the entire page. + +The technical report also illustrates the trade-offs between our protection +mechanisms and performance. + +## Gaps, Missing Features, and Development Status + +Although NOVA is a fully-functional file system, there is still much work left +to be done. In particular, (at least) the following items are currently missing: + +1. There is no mkfs or fsk utility (`mount` takes `-o init` to create a NOVA file system) +2. NOVA doesn't scrub data to prevent corruption from accumulating in infrequently accessed data. +3. NOVA doesn't read bad block information on mount and attempt recovery of the effected data. +4. NOVA only works on x86-64 kernels. +5. NOVA does not currently support extended attributes or ACL. +6. NOVA does not currently prevent writes to mounted snapshots. +7. Using `write()` to modify pages that are mmap'd is not supported. +8. NOVA deoesn't provide quota support. +9. Moving NOVA file systems between machines with different numbers of CPUs does not work. +10. Remounting a NOVA file system with different mount options may fail. + +None of these are fundamental limitations of NOVA's design. Additional bugs +and issues are here [here][https://github.com/NVSL/linux-nova/issues]. + +NOVA is complete and robust enough to run a range of complex applications, but +it is not yet ready for production use. Our current focus is on adding a few +missing features list above and finding/fixing bugs. + +## Building and Using NOVA + +This repo contains a version of the Linux with NOVA included. You should be +able to build and install it just as you would the mainline Linux source. + +### Building NOVA + +To build NOVA, build the kernel with PMEM (`CONFIG_BLK_DEV_PMEM`), DAX (`CONFIG_FS_DAX`) and NOVA (`CONFIG_NOVA_FS`) support. Install as usual. + +## Hacking and Contributing + +The NOVA source code is almost completely contains in the `fs/nova` directory. +The execptions are some small changes in the kernel's memory management system +to support checkpointing. + +`Documentation/filesystems/nova.txt` describes the internals of Nova in more detail. + +If you find bugs, please [report them](https://github.com/NVSL/linux-nova/issues). + +If you have other questions or suggestions you can contact the NOVA developers at [cse-nova-hackers@eng.ucsd.edu](mailto:cse-nova-hackers@eng.ucsd.edu). + + +[NVSL]: http://nvsl.ucsd.edu/ "http://nvsl.ucsd.edu" +[POSIXtest]: http://www.tuxera.com/community/posix-test-suite/ +[FAST2016]: https://www.usenix.org/conference/fast16/technical-sessions +[CSE]: http://cs.ucsd.edu +[UCSD]: http://www.ucsd.edu \ No newline at end of file