From: Andiry Xu
To: linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-nvdimm@lists.01.org
Cc: dan.j.williams@intel.com, andy.rudoff@intel.com, coughlan@redhat.com, swanson@cs.ucsd.edu, david@fromorbit.com, jack@suse.com, swhiteho@redhat.com, miklos@szeredi.hu, andiry.xu@gmail.com, Andiry Xu
Subject: [RFC v2 68/83] File operation: copy-on-write write.
Date: Sat, 10 Mar 2018 10:18:49 -0800
Message-Id: <1520705944-6723-69-git-send-email-jix024@eng.ucsd.edu>
X-Mailer: git-send-email 2.7.4
In-Reply-To: <1520705944-6723-1-git-send-email-jix024@eng.ucsd.edu>
References: <1520705944-6723-1-git-send-email-jix024@eng.ucsd.edu>

From: Andiry Xu

If the file is not mmapped, NOVA performs a copy-on-write (CoW) write.
A CoW write consists of the following steps:

1. Allocate contiguous data pages.
2. Copy data from the user buffer to the data pages. If the write is not
   aligned to the page size, also copy data from the existing pmem pages.
3. Allocate and initialize a file write item and add it to a linked list.
4. Repeat steps 1-3 until all of the user data has been copied to pmem pages.
5. Commit the list of file write items to the log and update the radix tree.
6. Update the log tail pointer once all the items are committed.

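Steps 1-4 above can be sketched in user space as follows. This is only an
illustration under stated assumptions, not the code in this patch:
struct write_item, struct toy_alloc, toy_new_blocks() and cow_build_items()
are hypothetical stand-ins for the NOVA write item, the allocator and the
write loop, and the data copy of step 2 is left out.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_ITEMS 16

/* Hypothetical stand-in for struct nova_file_write_item. */
struct write_item {
	unsigned long pgoff;		/* first file page covered */
	unsigned long blocknr;		/* first allocated data block */
	unsigned long num_pages;	/* pages in this contiguous extent */
};

/* Toy allocator: returns at most max_extent contiguous blocks per call. */
struct toy_alloc {
	unsigned long next_free;
	unsigned long max_extent;
};

static unsigned long toy_new_blocks(struct toy_alloc *a, unsigned long want,
	unsigned long *blocknr)
{
	unsigned long got = want < a->max_extent ? want : a->max_extent;

	*blocknr = a->next_free;
	a->next_free += got + 1;	/* skip one block so extents stay separate */
	return got;
}

/*
 * Steps 1-4: allocate an extent, (copy data), record one write item,
 * and repeat until every page is covered.
 * Returns the number of items, or -1 on failure.
 */
static int cow_build_items(struct toy_alloc *a, unsigned long start_pg,
	unsigned long num_pages, struct write_item *items)
{
	int n = 0;

	assert(num_pages > 0);
	while (num_pages > 0) {
		unsigned long blocknr;
		unsigned long got = toy_new_blocks(a, num_pages, &blocknr);

		if (got == 0 || n == MAX_ITEMS)
			return -1;	/* a real caller would free what it allocated */

		/* Step 2, the copy into the new blocks, would go here. */
		items[n].pgoff = start_pg;
		items[n].blocknr = blocknr;
		items[n].num_pages = got;
		n++;

		start_pg += got;
		num_pages -= got;
	}

	/* Steps 5-6: commit all n items, then move the log tail once. */
	return n;
}
```

The point of the sketch is that one write may produce several items when the
allocator cannot return a single contiguous extent, and that nothing becomes
visible until the whole list has been committed.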
Signed-off-by: Andiry Xu
---
 fs/nova/dax.c  | 149 +++++++++++++++++++++++++++++++++++++++++
 fs/nova/file.c | 208 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 fs/nova/nova.h |   8 +++
 3 files changed, 365 insertions(+)

diff --git a/fs/nova/dax.c b/fs/nova/dax.c
index 1669dc0..9561d8e 100644
--- a/fs/nova/dax.c
+++ b/fs/nova/dax.c
@@ -22,6 +22,113 @@
 #include "inode.h"
 
+static inline int nova_copy_partial_block(struct super_block *sb,
+	struct nova_inode_info_header *sih,
+	struct nova_file_write_entry *entry, unsigned long index,
+	size_t offset, size_t length, void *kmem)
+{
+	void *ptr;
+	int rc = 0;
+	unsigned long nvmm;
+
+	nvmm = get_nvmm(sb, sih, entry, index);
+	ptr = nova_get_block(sb, (nvmm << PAGE_SHIFT));
+
+	if (ptr != NULL) {
+		if (support_clwb)
+			rc = memcpy_mcsafe(kmem + offset, ptr + offset,
+						length);
+		else
+			memcpy_to_pmem_nocache(kmem + offset, ptr + offset,
+						length);
+	}
+
+	/* TODO: If rc < 0, go to MCE data recovery. */
+	return rc;
+}
+
+static inline int nova_handle_partial_block(struct super_block *sb,
+	struct nova_inode_info_header *sih,
+	struct nova_file_write_entry *entry, unsigned long index,
+	size_t offset, size_t length, void *kmem)
+{
+	struct nova_sb_info *sbi = NOVA_SB(sb);
+
+	if (entry == NULL) {
+		/* Fill zero */
+		if (support_clwb)
+			memset(kmem + offset, 0, length);
+		else
+			memcpy_to_pmem_nocache(kmem + offset,
+					sbi->zeroed_page, length);
+	} else {
+		nova_copy_partial_block(sb, sih, entry, index,
+					offset, length, kmem);
+
+	}
+	if (support_clwb)
+		nova_flush_buffer(kmem + offset, length, 0);
+	return 0;
+}
+
+/*
+ * Fill the new start/end block from original blocks.
+ * Do nothing if fully covered; copy if original blocks present;
+ * Fill zero otherwise.
+ */
+int nova_handle_head_tail_blocks(struct super_block *sb,
+	struct inode *inode, loff_t pos, size_t count, void *kmem)
+{
+	struct nova_inode_info *si = NOVA_I(inode);
+	struct nova_inode_info_header *sih = &si->header;
+	size_t offset, eblk_offset;
+	unsigned long start_blk, end_blk, num_blocks;
+	struct nova_file_write_entry *entry;
+	timing_t partial_time;
+	int ret = 0;
+
+	NOVA_START_TIMING(partial_block_t, partial_time);
+	offset = pos & (sb->s_blocksize - 1);
+	num_blocks = ((count + offset - 1) >> sb->s_blocksize_bits) + 1;
+	/* offset in the actual block size block */
+	offset = pos & (nova_inode_blk_size(sih) - 1);
+	start_blk = pos >> sb->s_blocksize_bits;
+	end_blk = start_blk + num_blocks - 1;
+
+	nova_dbg_verbose("%s: %lu blocks\n", __func__, num_blocks);
+	/* We avoid zeroing the alloc'd range, which is going to be overwritten
+	 * by this system call anyway
+	 */
+	nova_dbg_verbose("%s: start offset %lu start blk %lu %p\n", __func__,
+				offset, start_blk, kmem);
+	if (offset != 0) {
+		entry = nova_get_write_entry(sb, sih, start_blk);
+		ret = nova_handle_partial_block(sb, sih, entry,
+						start_blk, 0, offset, kmem);
+		if (ret < 0)
+			return ret;
+	}
+
+	kmem = (void *)((char *)kmem +
+			((num_blocks - 1) << sb->s_blocksize_bits));
+	eblk_offset = (pos + count) & (nova_inode_blk_size(sih) - 1);
+	nova_dbg_verbose("%s: end offset %lu, end blk %lu %p\n", __func__,
+				eblk_offset, end_blk, kmem);
+	if (eblk_offset != 0) {
+		entry = nova_get_write_entry(sb, sih, end_blk);
+
+		ret = nova_handle_partial_block(sb, sih, entry, end_blk,
+						eblk_offset,
+						sb->s_blocksize - eblk_offset,
+						kmem);
+		if (ret < 0)
+			return ret;
+	}
+	NOVA_END_TIMING(partial_block_t, partial_time);
+
+	return ret;
+}
+
 static int nova_reassign_file_tree(struct super_block *sb,
 	struct nova_inode_info_header *sih, u64 begin_tail, u64 end_tail)
 {
@@ -110,3 +217,45 @@ int nova_commit_writes_to_log(struct super_block *sb, struct nova_inode *pi,
 
 	return ret;
 }
+
+int nova_cleanup_incomplete_write(struct super_block *sb,
+	struct nova_inode_info_header *sih, struct list_head *head, int free)
+{
+	struct nova_file_write_item *entry_item, *temp;
+	struct nova_file_write_entry *entry;
+	unsigned long blocknr;
+
+	list_for_each_entry_safe(entry_item, temp, head, list) {
+		entry = &entry_item->entry;
+		blocknr = nova_get_blocknr(sb, entry->block, sih->i_blk_type);
+		nova_free_data_blocks(sb, sih, blocknr, entry->num_pages);
+
+		if (free)
+			nova_free_file_write_item(entry_item);
+	}
+
+	return 0;
+}
+
+void nova_init_file_write_item(struct super_block *sb,
+	struct nova_inode_info_header *sih, struct nova_file_write_item *item,
+	u64 epoch_id, u64 pgoff, int num_pages, u64 blocknr, u32 time,
+	u64 file_size)
+{
+	struct nova_file_write_entry *entry = &item->entry;
+
+	INIT_LIST_HEAD(&item->list);
+	memset(entry, 0, sizeof(struct nova_file_write_entry));
+	entry->entry_type = FILE_WRITE;
+	entry->reassigned = 0;
+	entry->epoch_id = epoch_id;
+	entry->trans_id = sih->trans_id;
+	entry->pgoff = cpu_to_le64(pgoff);
+	entry->num_pages = cpu_to_le32(num_pages);
+	entry->invalid_pages = 0;
+	entry->block = cpu_to_le64(nova_get_block_off(sb, blocknr,
+							sih->i_blk_type));
+	entry->mtime = cpu_to_le32(time);
+
+	entry->size = file_size;
+}
diff --git a/fs/nova/file.c b/fs/nova/file.c
index 842da45..26f15c7 100644
--- a/fs/nova/file.c
+++ b/fs/nova/file.c
@@ -256,10 +256,218 @@ static ssize_t nova_dax_file_read(struct file *filp, char __user *buf,
 	return res;
 }
 
+/*
+ * Perform a COW write. Must hold the inode lock before calling.
+ */
+static ssize_t do_nova_cow_file_write(struct file *filp,
+	const char __user *buf, size_t len, loff_t *ppos)
+{
+	struct address_space *mapping = filp->f_mapping;
+	struct inode *inode = mapping->host;
+	struct nova_inode_info *si = NOVA_I(inode);
+	struct nova_inode_info_header *sih = &si->header;
+	struct super_block *sb = inode->i_sb;
+	struct nova_inode *pi;
+	struct nova_file_write_item *entry_item;
+	struct list_head item_head;
+	struct nova_inode_update update;
+	ssize_t written = 0;
+	loff_t pos;
+	size_t count, offset, copied;
+	unsigned long start_blk, num_blocks;
+	unsigned long total_blocks;
+	unsigned long blocknr = 0;
+	int allocated = 0;
+	void *kmem;
+	u64 file_size;
+	size_t bytes;
+	long status = 0;
+	timing_t cow_write_time, memcpy_time;
+	unsigned long step = 0;
+	ssize_t ret;
+	u64 epoch_id;
+	u32 time;
+
+
+	if (len == 0)
+		return 0;
+
+	sih_lock(sih);
+	NOVA_START_TIMING(cow_write_t, cow_write_time);
+	INIT_LIST_HEAD(&item_head);
+
+	if (!access_ok(VERIFY_READ, buf, len)) {
+		ret = -EFAULT;
+		goto out;
+	}
+	pos = *ppos;
+
+	if (filp->f_flags & O_APPEND)
+		pos = i_size_read(inode);
+
+	count = len;
+
+	pi = nova_get_block(sb, sih->pi_addr);
+
+	offset = pos & (sb->s_blocksize - 1);
+	num_blocks = ((count + offset - 1) >> sb->s_blocksize_bits) + 1;
+	total_blocks = num_blocks;
+	start_blk = pos >> sb->s_blocksize_bits;
+
+	/* offset in the actual block size block */
+
+	ret = file_remove_privs(filp);
+	if (ret)
+		goto out;
+
+	inode->i_ctime = inode->i_mtime = current_time(inode);
+	time = current_time(inode).tv_sec;
+
+	nova_dbgv("%s: inode %lu, offset %lld, count %lu\n",
+			__func__, inode->i_ino, pos, count);
+
+	epoch_id = nova_get_epoch_id(sb);
+	update.tail = sih->log_tail;
+	while (num_blocks > 0) {
+		offset = pos & (nova_inode_blk_size(sih) - 1);
+		start_blk = pos >> sb->s_blocksize_bits;
+
+		/* don't zero-out the allocated blocks */
+		allocated = nova_new_data_blocks(sb, sih, &blocknr, start_blk,
+				num_blocks, ALLOC_NO_INIT, ANY_CPU,
+				ALLOC_FROM_HEAD);
+
+		nova_dbg_verbose("%s: alloc %d blocks @ %lu\n", __func__,
+						allocated, blocknr);
+
+		if (allocated <= 0) {
+			nova_dbg("%s alloc blocks failed %d\n", __func__,
+						allocated);
+			ret = allocated;
+			goto out;
+		}
+
+		step++;
+		bytes = sb->s_blocksize * allocated - offset;
+		if (bytes > count)
+			bytes = count;
+
+		kmem = nova_get_block(inode->i_sb,
+			nova_get_block_off(sb, blocknr, sih->i_blk_type));
+
+		if (offset || ((offset + bytes) & (PAGE_SIZE - 1)) != 0) {
+			ret = nova_handle_head_tail_blocks(sb, inode, pos,
+							bytes, kmem);
+			if (ret)
+				goto out;
+		}
+		/* Now copy from user buf */
+		// nova_dbg("Write: %p\n", kmem);
+		NOVA_START_TIMING(memcpy_w_nvmm_t, memcpy_time);
+		copied = bytes - memcpy_to_pmem_nocache(kmem + offset,
+						buf, bytes);
+		NOVA_END_TIMING(memcpy_w_nvmm_t, memcpy_time);
+
+		if (pos + copied > inode->i_size)
+			file_size = cpu_to_le64(pos + copied);
+		else
+			file_size = cpu_to_le64(inode->i_size);
+
+		entry_item = nova_alloc_file_write_item(sb);
+		if (!entry_item) {
+			ret = -ENOMEM;
+			goto out;
+		}
+
+		nova_init_file_write_item(sb, sih, entry_item, epoch_id,
+					start_blk, allocated, blocknr, time,
+					file_size);
+
+		list_add_tail(&entry_item->list, &item_head);
+
+		nova_dbgv("Write: %p, %lu\n", kmem, copied);
+		if (copied > 0) {
+			status = copied;
+			written += copied;
+			pos += copied;
+			buf += copied;
+			count -= copied;
+			num_blocks -= allocated;
+		}
+		if (unlikely(copied != bytes)) {
+			nova_dbg("%s ERROR!: %p, bytes %lu, copied %lu\n",
+				__func__, kmem, bytes, copied);
+			if (status >= 0)
+				status = -EFAULT;
+		}
+		if (status < 0)
+			break;
+	}
+
+	ret = nova_commit_writes_to_log(sb, pi, inode,
+					&item_head, total_blocks, 1);
+	if (ret < 0) {
+		nova_err(sb, "commit to log failed\n");
+		goto out;
+	}
+
+	ret = written;
+	NOVA_STATS_ADD(cow_write_breaks, step);
+	nova_dbgv("blocks: %lu, %lu\n", inode->i_blocks, sih->i_blocks);
+
+	*ppos = pos;
+	if (pos > inode->i_size) {
+		i_size_write(inode, pos);
+		sih->i_size = pos;
+	}
+
+out:
+	if (ret < 0)
+		nova_cleanup_incomplete_write(sb, sih, &item_head, 1);
+
+	NOVA_END_TIMING(cow_write_t, cow_write_time);
+	NOVA_STATS_ADD(cow_write_bytes, written);
+	sih_unlock(sih);
+
+	return ret;
+}
+
+/*
+ * Acquire locks and perform COW write.
+ */
+ssize_t nova_cow_file_write(struct file *filp,
+	const char __user *buf, size_t len, loff_t *ppos)
+{
+	struct address_space *mapping = filp->f_mapping;
+	struct inode *inode = mapping->host;
+	int ret;
+
+	if (len == 0)
+		return 0;
+
+	sb_start_write(inode->i_sb);
+	inode_lock(inode);
+
+	ret = do_nova_cow_file_write(filp, buf, len, ppos);
+
+	inode_unlock(inode);
+	sb_end_write(inode->i_sb);
+
+	return ret;
+}
+
+
+static ssize_t nova_dax_file_write(struct file *filp, const char __user *buf,
+	size_t len, loff_t *ppos)
+{
+	return nova_cow_file_write(filp, buf, len, ppos);
+}
+
 const struct file_operations nova_dax_file_operations = {
 	.llseek = nova_llseek,
 	.read = nova_dax_file_read,
+	.write = nova_dax_file_write,
 	.open = nova_open,
 	.fsync = nova_fsync,
 	.flush = nova_flush,
diff --git a/fs/nova/nova.h b/fs/nova/nova.h
index dcda02a..1c2205e 100644
--- a/fs/nova/nova.h
+++ b/fs/nova/nova.h
@@ -465,9 +465,17 @@ nova_get_blocknr(struct super_block *sb, u64 block, unsigned short btype)
 /* ====================================================== */
 
 /* dax.c */
+int nova_handle_head_tail_blocks(struct super_block *sb,
+	struct inode *inode, loff_t pos, size_t count, void *kmem);
 int nova_commit_writes_to_log(struct super_block *sb, struct nova_inode *pi,
 	struct inode *inode, struct list_head *head, unsigned long new_blocks,
 	int free);
+int nova_cleanup_incomplete_write(struct super_block *sb,
+	struct nova_inode_info_header *sih, struct list_head *head, int free);
+void nova_init_file_write_item(struct super_block *sb,
+	struct nova_inode_info_header *sih, struct nova_file_write_item *item,
+	u64 epoch_id, u64 pgoff, int num_pages, u64 blocknr, u32 time,
+	u64 file_size);
 
 /* dir.c */
 extern const struct file_operations nova_dir_operations;
-- 
2.7.4
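
The block arithmetic used by do_nova_cow_file_write() and
nova_handle_head_tail_blocks() can be checked in user space with a sketch
like the one below. It is an illustration only: struct cow_span and
cow_span_for() are hypothetical names, and it assumes sb->s_blocksize and
nova_inode_blk_size(sih) are the same power of two, which the patch does
not guarantee.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical user-space mirror of the patch's arithmetic; not kernel code. */
struct cow_span {
	uint64_t start_blk;	/* first file block touched by the write */
	uint64_t num_blocks;	/* number of blocks the write spans */
	size_t head_fill;	/* old bytes to keep before the write, first block */
	size_t tail_fill;	/* old bytes to keep after the write, last block */
};

static struct cow_span cow_span_for(uint64_t pos, size_t count,
	unsigned int blkbits)
{
	const uint64_t blksize = 1ULL << blkbits;
	const uint64_t offset = pos & (blksize - 1);
	const uint64_t end_offset = (pos + count) & (blksize - 1);
	struct cow_span s;

	assert(count > 0);	/* the patch returns early when len == 0 */

	s.start_blk = pos >> blkbits;
	/* Same expression as in the patch: round the byte range up to blocks. */
	s.num_blocks = ((count + offset - 1) >> blkbits) + 1;
	/* Head: [0, offset) of the first block comes from the old data. */
	s.head_fill = (size_t)offset;
	/* Tail: [end_offset, blksize) of the last block comes from the old data. */
	s.tail_fill = end_offset ? (size_t)(blksize - end_offset) : 0;
	return s;
}
```

With 4 KiB blocks (blkbits = 12), a 200-byte write at position 4000 spans two
blocks: the first 4000 bytes of block 0 and the last 3992 bytes of block 1
have to be filled from the existing data, or with zeroes if no write entry
exists for that block.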