2014-02-25 14:25:44

by Matthew Wilcox

Subject: [PATCH v6 00/22] Support ext4 on NV-DIMMs

One of the primary uses for NV-DIMMs is to expose them as a block device
and use a filesystem to store files on the NV-DIMM. While that works,
it currently wastes memory and CPU time buffering the files in the page
cache. We have support in ext2 for bypassing the page cache, but it
has some races which are unfixable in the current design. This series
of patches rewrites the underlying support, and adds support for direct
access to ext4.

This iteration of the patchset renames the "XIP" support to "DAX".
This fixes the confusion between kernel XIP and filesystem XIP. It's not
really about executing in-place; it's about direct access to memory-like
storage, bypassing the page cache. DAX is TLA-compliant, retains the
exciting X, is pronounceable ("Dacks") and is not used elsewhere in
the kernel. The only major use of DAX outside the kernel is the German
stock exchange, and I think that's pretty unlikely to cause confusion.

Patch 2 *still* wants careful review from the MM people.

I want to particularly credit Ross Zwisler for all the effort he's put
into tracking down bugs.

Testing
~~~~~~~

Seven xfstests still fail reliably with DAX that pass reliably without
DAX: ext4/301 generic/{075,091,112,127,223,263}

Two fail randomly for me whether DAX is enabled or not: generic/{299,300}

Eleven fail reliably without DAX: ext4/{302,303,304}
generic/{219,230,231,232,233,235,270} shared/218

Several are not run, either because they need dm_flakey or
CONFIG_FAIL_MAKE_REQUEST, because my 3GB ramdisk is too small, or because
they're not suitable for Linux or for ext4. My current score is 18/127
tests failing with -o dax and 11/127 without.

v6:
- Fix compilation with CONFIG_FS_XIP=n (reported by Ross Zwisler)
- Removed unused argument from xip_io (patch from Ross Zwisler)
- Fixed buffer overrun in xip_io (original patch from Ross Zwisler)
- Prevented reads past i_size (original patch from Ross Zwisler)
- Fixed documentation errors (reported by Randy Dunlap)
- Add handling of BH_New (reported by Dave Chinner)
- Support the way ext4 reports holes (original patch from Ross Zwisler)
- Zero the *end* of new blocks as well as the beginning (reported by Dave
Chinner)
- Rebased on top of 3.14-rc4 plus Kirill's cleanups of __do_fault() which
are in linux-mm (http://marc.info/?l=linux-mm&m=139206489208546&w=2).
- Renamed XIP to DAX
- Fixed writev() so subsequent writes don't overwrite earlier writes
- Remove code in ext4 to clear blocks in DAX files
- Fixed dax_zero_page_range() to actually call memset

v5:
- Improved documentation
- Fixed a couple of warnings emitted by a newer version of gcc
- Fixed a bug where we would read/write the wrong sector in xip_io for I/Os
that were not aligned to PAGE_SIZE
- Dropped PMD fault patch
- Added support for unwritten extents


Matthew Wilcox (21):
Fix XIP fault vs truncate race
Allow page fault handlers to perform the COW
axonram: Fix bug in direct_access
Change direct_access calling convention
Introduce IS_DAX(inode)
Replace XIP read and write with DAX I/O
Replace the XIP page fault handler with the DAX page fault handler
Replace xip_truncate_page with dax_truncate_page
Remove mm/filemap_xip.c
Remove get_xip_mem
Replace ext2_clear_xip_target with dax_clear_blocks
ext2: Remove ext2_xip_verify_sb()
ext2: Remove ext2_use_xip
ext2: Remove xip.c and xip.h
Remove CONFIG_EXT2_FS_XIP and rename CONFIG_FS_XIP to CONFIG_FS_DAX
ext2: Remove ext2_aops_xip
Get rid of most mentions of XIP in ext2
xip: Add xip_zero_page_range
ext4: Make ext4_block_zero_page_range static
ext4: Fix typos
dax: Add reporting of major faults

Ross Zwisler (1):
ext4: Add DAX functionality

Documentation/filesystems/Locking | 3 -
Documentation/filesystems/dax.txt | 84 +++++++
Documentation/filesystems/ext4.txt | 2 +
Documentation/filesystems/xip.txt | 68 ------
arch/powerpc/sysdev/axonram.c | 8 +-
drivers/block/brd.c | 8 +-
drivers/s390/block/dcssblk.c | 19 +-
fs/Kconfig | 21 +-
fs/Makefile | 1 +
fs/dax.c | 451 ++++++++++++++++++++++++++++++++++
fs/exofs/inode.c | 1 -
fs/ext2/Kconfig | 11 -
fs/ext2/Makefile | 1 -
fs/ext2/ext2.h | 9 +-
fs/ext2/file.c | 45 +++-
fs/ext2/inode.c | 37 +--
fs/ext2/namei.c | 13 +-
fs/ext2/super.c | 48 ++--
fs/ext2/xip.c | 91 -------
fs/ext2/xip.h | 26 --
fs/ext4/ext4.h | 8 +-
fs/ext4/file.c | 53 +++-
fs/ext4/indirect.c | 19 +-
fs/ext4/inode.c | 94 +++++---
fs/ext4/namei.c | 10 +-
fs/ext4/super.c | 39 ++-
fs/open.c | 5 +-
include/linux/blkdev.h | 4 +-
include/linux/fs.h | 49 +++-
include/linux/mm.h | 2 +
mm/Makefile | 1 -
mm/fadvise.c | 6 +-
mm/filemap.c | 6 +-
mm/filemap_xip.c | 483 -------------------------------------
mm/madvise.c | 2 +-
mm/memory.c | 44 +++-
36 files changed, 911 insertions(+), 861 deletions(-)
create mode 100644 Documentation/filesystems/dax.txt
delete mode 100644 Documentation/filesystems/xip.txt
create mode 100644 fs/dax.c
delete mode 100644 fs/ext2/xip.c
delete mode 100644 fs/ext2/xip.h
delete mode 100644 mm/filemap_xip.c

--
1.8.5.3


2014-02-25 14:18:54

by Matthew Wilcox

Subject: [PATCH v6 02/22] Allow page fault handlers to perform the COW

Currently COW of an XIP file is done by first bringing in a read-only
mapping, then retrying the fault and copying the page. It is much more
efficient to tell the fault handler that a COW is being attempted (by
passing in the pre-allocated page in the vm_fault structure), and allow
the handler to perform the COW operation itself.

Where the filemap code uses the page lock to protect against truncation
of the file until the PTE has been installed, the XIP code uses the
i_mmap_mutex instead. We must therefore unlock the i_mmap_mutex after
inserting the PTE.

Signed-off-by: Matthew Wilcox <[email protected]>
---
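For anyone reviewing the control flow, the handshake described above can
be modelled in userspace. This is only a sketch under stated assumptions:
struct fault, toy_handler_legacy(), toy_handler_cow() and do_cow() are
hypothetical stand-ins for struct vm_fault, a ->fault handler and
do_cow_fault(); plain byte buffers replace struct page, and all locking
is left out.

```c
#include <assert.h>
#include <string.h>

#define TOY_PAGE_SIZE 16          /* tiny "page" so the model is easy to inspect */
#define FAULT_COWED   0x0080      /* same value as VM_FAULT_COWED in the patch */

/* Hypothetical stand-in for struct vm_fault. */
struct fault {
	unsigned char *cow_page;  /* pre-allocated destination, or NULL */
	unsigned char *page;      /* page returned by the handler, if any */
};

typedef int (*fault_fn)(struct fault *f, const unsigned char *backing);

/* Old-style handler: ignores cow_page and just returns the backing page. */
static int toy_handler_legacy(struct fault *f, const unsigned char *backing)
{
	f->page = (unsigned char *)backing;
	return 0;
}

/* New-style handler: copies straight into cow_page when one is supplied. */
static int toy_handler_cow(struct fault *f, const unsigned char *backing)
{
	if (!f->cow_page)
		return toy_handler_legacy(f, backing);
	memcpy(f->cow_page, backing, TOY_PAGE_SIZE);
	return FAULT_COWED;
}

/* Caller side, mirroring the shape of do_cow_fault() after the patch. */
static int do_cow(fault_fn handler, const unsigned char *backing,
		  unsigned char *new_page)
{
	struct fault f = { .cow_page = new_page, .page = NULL };
	int ret = handler(&f, backing);

	/* Only copy here if the handler did not already do the COW. */
	if (!(ret & FAULT_COWED))
		memcpy(new_page, f.page, TOY_PAGE_SIZE);
	return ret;
}
```

Either handler leaves new_page holding a copy of the backing data; the
difference is who performs the copy, and in the real code which lock the
caller must drop once the PTE is in place.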
include/linux/mm.h | 2 ++
mm/memory.c | 44 ++++++++++++++++++++++++++++++++------------
2 files changed, 34 insertions(+), 12 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index f28f46e..22260c0 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -205,6 +205,7 @@ struct vm_fault {
pgoff_t pgoff; /* Logical page offset based on vma */
void __user *virtual_address; /* Faulting virtual address */

+ struct page *cow_page; /* Handler may choose to COW */
struct page *page; /* ->fault handlers should return a
* page here, unless VM_FAULT_NOPAGE
* is set (which is also implied by
@@ -1000,6 +1001,7 @@ static inline int page_mapped(struct page *page)
#define VM_FAULT_HWPOISON 0x0010 /* Hit poisoned small page */
#define VM_FAULT_HWPOISON_LARGE 0x0020 /* Hit poisoned large page. Index encoded in upper bits */

+#define VM_FAULT_COWED 0x0080 /* ->fault COWed the page instead */
#define VM_FAULT_NOPAGE 0x0100 /* ->fault installed the pte, not return page */
#define VM_FAULT_LOCKED 0x0200 /* ->fault locked the returned page */
#define VM_FAULT_RETRY 0x0400 /* ->fault blocked, must retry */
diff --git a/mm/memory.c b/mm/memory.c
index 7f52c46..c7fc9c5 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3288,7 +3288,8 @@ oom:
}

static int __do_fault(struct vm_area_struct *vma, unsigned long address,
- pgoff_t pgoff, unsigned int flags, struct page **page)
+ pgoff_t pgoff, unsigned int flags,
+ struct page *cow_page, struct page **page)
{
struct vm_fault vmf;
int ret;
@@ -3297,10 +3298,13 @@ static int __do_fault(struct vm_area_struct *vma, unsigned long address,
vmf.pgoff = pgoff;
vmf.flags = flags;
vmf.page = NULL;
+ vmf.cow_page = cow_page;

ret = vma->vm_ops->fault(vma, &vmf);
if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
return ret;
+ if (unlikely(ret & VM_FAULT_COWED))
+ goto out;

if (unlikely(PageHWPoison(vmf.page))) {
if (ret & VM_FAULT_LOCKED)
@@ -3314,6 +3318,7 @@ static int __do_fault(struct vm_area_struct *vma, unsigned long address,
else
VM_BUG_ON_PAGE(!PageLocked(vmf.page), vmf.page);

+ out:
*page = vmf.page;
return ret;
}
@@ -3351,7 +3356,7 @@ static int do_read_fault(struct mm_struct *mm, struct vm_area_struct *vma,
pte_t *pte;
int ret;

- ret = __do_fault(vma, address, pgoff, flags, &fault_page);
+ ret = __do_fault(vma, address, pgoff, flags, NULL, &fault_page);
if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
return ret;

@@ -3368,6 +3373,12 @@ static int do_read_fault(struct mm_struct *mm, struct vm_area_struct *vma,
return ret;
}

+/*
+ * If the fault handler performs the COW, it does not return a page,
+ * so cannot use the page's lock to protect against a concurrent truncate
+ * operation. Instead it returns with the i_mmap_mutex held, which must
+ * be released after the PTE has been inserted.
+ */
static int do_cow_fault(struct mm_struct *mm, struct vm_area_struct *vma,
unsigned long address, pmd_t *pmd,
pgoff_t pgoff, unsigned int flags, pte_t orig_pte)
@@ -3389,25 +3400,34 @@ static int do_cow_fault(struct mm_struct *mm, struct vm_area_struct *vma,
return VM_FAULT_OOM;
}

- ret = __do_fault(vma, address, pgoff, flags, &fault_page);
+ ret = __do_fault(vma, address, pgoff, flags, new_page, &fault_page);
if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
goto uncharge_out;

- copy_user_highpage(new_page, fault_page, address, vma);
+ if (!(ret & VM_FAULT_COWED))
+ copy_user_highpage(new_page, fault_page, address, vma);
__SetPageUptodate(new_page);

pte = pte_offset_map_lock(mm, pmd, address, &ptl);
- if (unlikely(!pte_same(*pte, orig_pte))) {
- pte_unmap_unlock(pte, ptl);
+ if (unlikely(!pte_same(*pte, orig_pte)))
+ goto unlock_out;
+ do_set_pte(vma, address, new_page, pte, true, true);
+ pte_unmap_unlock(pte, ptl);
+ if (ret & VM_FAULT_COWED) {
+ mutex_unlock(&vma->vm_file->f_mapping->i_mmap_mutex);
+ } else {
unlock_page(fault_page);
page_cache_release(fault_page);
- goto uncharge_out;
}
- do_set_pte(vma, address, new_page, pte, true, true);
- pte_unmap_unlock(pte, ptl);
- unlock_page(fault_page);
- page_cache_release(fault_page);
return ret;
+unlock_out:
+ pte_unmap_unlock(pte, ptl);
+ if (ret & VM_FAULT_COWED) {
+ mutex_unlock(&vma->vm_file->f_mapping->i_mmap_mutex);
+ } else {
+ unlock_page(fault_page);
+ page_cache_release(fault_page);
+ }
uncharge_out:
mem_cgroup_uncharge_page(new_page);
page_cache_release(new_page);
@@ -3424,7 +3444,7 @@ static int do_shared_fault(struct mm_struct *mm, struct vm_area_struct *vma,
int dirtied = 0;
int ret, tmp;

- ret = __do_fault(vma, address, pgoff, flags, &fault_page);
+ ret = __do_fault(vma, address, pgoff, flags, NULL, &fault_page);
if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
return ret;

--
1.8.5.3

2014-02-25 14:18:57

by Matthew Wilcox

Subject: [PATCH v6 01/22] Fix XIP fault vs truncate race

Pagecache faults recheck i_size after taking the page lock to ensure that
the fault didn't race against a truncate. We don't have a page to lock
in the XIP case, so use the i_mmap_mutex instead. It is locked in the
truncate path in unmap_mapping_range() after updating i_size. So while
we hold it in the fault path, we are guaranteed that either i_size has
already been updated in the truncate path, or that the truncate will
subsequently call zap_page_range_single() and so remove the mapping we
have just inserted.

There is a window of time in which i_size has been reduced and the
thread has a mapping to a page which will be removed from the file,
but this is harmless as the page will not be allocated to a different
purpose before the thread's access to it is revoked.

Signed-off-by: Matthew Wilcox <[email protected]>
---
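The i_size recheck is the usual round-up-to-pages comparison. A small
userspace sketch of the arithmetic follows; pages_for_size() and
fault_beyond_eof() are hypothetical names, and 4 KiB pages are assumed
(the kernel code uses PAGE_CACHE_SIZE and PAGE_CACHE_SHIFT).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_PAGE_SHIFT 12                       /* assumes 4 KiB pages */
#define TOY_PAGE_SIZE  (1ULL << TOY_PAGE_SHIFT)

/* Number of pages needed to cover i_size bytes, rounding up. */
static uint64_t pages_for_size(uint64_t i_size)
{
	return (i_size + TOY_PAGE_SIZE - 1) >> TOY_PAGE_SHIFT;
}

/* True when a fault at page offset pgoff lies at or beyond EOF (SIGBUS case). */
static bool fault_beyond_eof(uint64_t pgoff, uint64_t i_size)
{
	return pgoff >= pages_for_size(i_size);
}
```

For example, a fault at pgoff 1 is valid while i_size is 8192, but fails
the recheck once a concurrent truncate has reduced i_size to 4096.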
mm/filemap_xip.c | 24 ++++++++++++++++++++++--
1 file changed, 22 insertions(+), 2 deletions(-)

diff --git a/mm/filemap_xip.c b/mm/filemap_xip.c
index d8d9fe3..c8d23e9 100644
--- a/mm/filemap_xip.c
+++ b/mm/filemap_xip.c
@@ -260,8 +260,17 @@ again:
__xip_unmap(mapping, vmf->pgoff);

found:
+ /* We must recheck i_size under i_mmap_mutex */
+ mutex_lock(&mapping->i_mmap_mutex);
+ size = (i_size_read(inode) + PAGE_CACHE_SIZE - 1) >>
+ PAGE_CACHE_SHIFT;
+ if (unlikely(vmf->pgoff >= size)) {
+ mutex_unlock(&mapping->i_mmap_mutex);
+ return VM_FAULT_SIGBUS;
+ }
err = vm_insert_mixed(vma, (unsigned long)vmf->virtual_address,
xip_pfn);
+ mutex_unlock(&mapping->i_mmap_mutex);
if (err == -ENOMEM)
return VM_FAULT_OOM;
/*
@@ -285,16 +294,27 @@ found:
}
if (error != -ENODATA)
goto out;
+
+ /* We must recheck i_size under i_mmap_mutex */
+ mutex_lock(&mapping->i_mmap_mutex);
+ size = (i_size_read(inode) + PAGE_CACHE_SIZE - 1) >>
+ PAGE_CACHE_SHIFT;
+ if (unlikely(vmf->pgoff >= size)) {
+ ret = VM_FAULT_SIGBUS;
+ goto unlock;
+ }
/* not shared and writable, use xip_sparse_page() */
page = xip_sparse_page();
if (!page)
- goto out;
+ goto unlock;
err = vm_insert_page(vma, (unsigned long)vmf->virtual_address,
page);
if (err == -ENOMEM)
- goto out;
+ goto unlock;

ret = VM_FAULT_NOPAGE;
+unlock:
+ mutex_unlock(&mapping->i_mmap_mutex);
out:
write_seqcount_end(&xip_sparse_seq);
mutex_unlock(&xip_sparse_mutex);
--
1.8.5.3

2014-02-25 14:19:11

by Matthew Wilcox

Subject: [PATCH v6 17/22] Get rid of most mentions of XIP in ext2

The only remaining usage is userspace's 'xip' option.
---
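The remount hunk below only renames the flag, but the idiom it touches is
worth spelling out: XOR the old and new option words to detect a change
in one bit, then XOR the bit again to put it back. A minimal userspace
sketch, where refuse_flag_change() is a hypothetical helper and flag is
assumed to be a single bit (or 0 when the option is compiled out):

```c
#include <assert.h>

/*
 * Undo any change to `flag` between old_opts and new_opts, leaving all
 * other bits of new_opts alone. With flag == 0 (option compiled out)
 * the test is always false and new_opts passes through untouched.
 */
static unsigned int refuse_flag_change(unsigned int old_opts,
				       unsigned int new_opts,
				       unsigned int flag)
{
	if ((new_opts ^ old_opts) & flag)
		new_opts ^= flag;   /* flip the bit back to its old value */
	return new_opts;
}
```

This is why defining EXT2_MOUNT_DAX as 0 in the !CONFIG_FS_DAX case is
safe: every test against it becomes a constant false.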
fs/ext2/ext2.h | 6 +++---
fs/ext2/file.c | 2 +-
fs/ext2/inode.c | 6 +++---
fs/ext2/namei.c | 8 ++++----
fs/ext2/super.c | 16 ++++++++--------
5 files changed, 19 insertions(+), 19 deletions(-)

diff --git a/fs/ext2/ext2.h b/fs/ext2/ext2.h
index b8b1c11..0e1fe9d 100644
--- a/fs/ext2/ext2.h
+++ b/fs/ext2/ext2.h
@@ -381,9 +381,9 @@ struct ext2_inode {
#define EXT2_MOUNT_XATTR_USER 0x004000 /* Extended user attributes */
#define EXT2_MOUNT_POSIX_ACL 0x008000 /* POSIX Access Control Lists */
#ifdef CONFIG_FS_DAX
-#define EXT2_MOUNT_XIP 0x010000 /* Execute in place */
+#define EXT2_MOUNT_DAX 0x010000 /* Direct Access */
#else
-#define EXT2_MOUNT_XIP 0
+#define EXT2_MOUNT_DAX 0
#endif
#define EXT2_MOUNT_USRQUOTA 0x020000 /* user quota */
#define EXT2_MOUNT_GRPQUOTA 0x040000 /* group quota */
@@ -789,7 +789,7 @@ extern int ext2_fsync(struct file *file, loff_t start, loff_t end,
int datasync);
extern const struct inode_operations ext2_file_inode_operations;
extern const struct file_operations ext2_file_operations;
-extern const struct file_operations ext2_xip_file_operations;
+extern const struct file_operations ext2_dax_file_operations;

/* inode.c */
extern const struct address_space_operations ext2_aops;
diff --git a/fs/ext2/file.c b/fs/ext2/file.c
index ae7f000..f9bcb9b 100644
--- a/fs/ext2/file.c
+++ b/fs/ext2/file.c
@@ -110,7 +110,7 @@ const struct file_operations ext2_file_operations = {
};

#ifdef CONFIG_FS_DAX
-const struct file_operations ext2_xip_file_operations = {
+const struct file_operations ext2_dax_file_operations = {
.llseek = generic_file_llseek,
.read = do_sync_read,
.write = do_sync_write,
diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
index 7ca76da..3776063 100644
--- a/fs/ext2/inode.c
+++ b/fs/ext2/inode.c
@@ -1285,7 +1285,7 @@ void ext2_set_inode_flags(struct inode *inode)
inode->i_flags |= S_NOATIME;
if (flags & EXT2_DIRSYNC_FL)
inode->i_flags |= S_DIRSYNC;
- if (test_opt(inode->i_sb, XIP))
+ if (test_opt(inode->i_sb, DAX))
inode->i_flags |= S_DAX;
}

@@ -1387,9 +1387,9 @@ struct inode *ext2_iget (struct super_block *sb, unsigned long ino)

if (S_ISREG(inode->i_mode)) {
inode->i_op = &ext2_file_inode_operations;
- if (test_opt(inode->i_sb, XIP)) {
+ if (test_opt(inode->i_sb, DAX)) {
inode->i_mapping->a_ops = &ext2_aops;
- inode->i_fop = &ext2_xip_file_operations;
+ inode->i_fop = &ext2_dax_file_operations;
} else if (test_opt(inode->i_sb, NOBH)) {
inode->i_mapping->a_ops = &ext2_nobh_aops;
inode->i_fop = &ext2_file_operations;
diff --git a/fs/ext2/namei.c b/fs/ext2/namei.c
index 0db888c..148f6e3 100644
--- a/fs/ext2/namei.c
+++ b/fs/ext2/namei.c
@@ -104,9 +104,9 @@ static int ext2_create (struct inode * dir, struct dentry * dentry, umode_t mode
return PTR_ERR(inode);

inode->i_op = &ext2_file_inode_operations;
- if (test_opt(inode->i_sb, XIP)) {
+ if (test_opt(inode->i_sb, DAX)) {
inode->i_mapping->a_ops = &ext2_aops;
- inode->i_fop = &ext2_xip_file_operations;
+ inode->i_fop = &ext2_dax_file_operations;
} else if (test_opt(inode->i_sb, NOBH)) {
inode->i_mapping->a_ops = &ext2_nobh_aops;
inode->i_fop = &ext2_file_operations;
@@ -125,9 +125,9 @@ static int ext2_tmpfile(struct inode *dir, struct dentry *dentry, umode_t mode)
return PTR_ERR(inode);

inode->i_op = &ext2_file_inode_operations;
- if (test_opt(inode->i_sb, XIP)) {
+ if (test_opt(inode->i_sb, DAX)) {
inode->i_mapping->a_ops = &ext2_aops;
- inode->i_fop = &ext2_xip_file_operations;
+ inode->i_fop = &ext2_dax_file_operations;
} else if (test_opt(inode->i_sb, NOBH)) {
inode->i_mapping->a_ops = &ext2_nobh_aops;
inode->i_fop = &ext2_file_operations;
diff --git a/fs/ext2/super.c b/fs/ext2/super.c
index fdcacf7..8062373 100644
--- a/fs/ext2/super.c
+++ b/fs/ext2/super.c
@@ -288,7 +288,7 @@ static int ext2_show_options(struct seq_file *seq, struct dentry *root)
#endif

#ifdef CONFIG_FS_DAX
- if (sbi->s_mount_opt & EXT2_MOUNT_XIP)
+ if (sbi->s_mount_opt & EXT2_MOUNT_DAX)
seq_puts(seq, ",xip");
#endif

@@ -393,7 +393,7 @@ enum {
Opt_resgid, Opt_resuid, Opt_sb, Opt_err_cont, Opt_err_panic,
Opt_err_ro, Opt_nouid32, Opt_nocheck, Opt_debug,
Opt_oldalloc, Opt_orlov, Opt_nobh, Opt_user_xattr, Opt_nouser_xattr,
- Opt_acl, Opt_noacl, Opt_xip, Opt_ignore, Opt_err, Opt_quota,
+ Opt_acl, Opt_noacl, Opt_dax, Opt_ignore, Opt_err, Opt_quota,
Opt_usrquota, Opt_grpquota, Opt_reservation, Opt_noreservation
};

@@ -421,7 +421,7 @@ static const match_table_t tokens = {
{Opt_nouser_xattr, "nouser_xattr"},
{Opt_acl, "acl"},
{Opt_noacl, "noacl"},
- {Opt_xip, "xip"},
+ {Opt_dax, "xip"},
{Opt_grpquota, "grpquota"},
{Opt_ignore, "noquota"},
{Opt_quota, "quota"},
@@ -548,9 +548,9 @@ static int parse_options(char *options, struct super_block *sb)
"(no)acl options not supported");
break;
#endif
- case Opt_xip:
+ case Opt_dax:
#ifdef CONFIG_FS_DAX
- set_opt (sbi->s_mount_opt, XIP);
+ set_opt (sbi->s_mount_opt, DAX);
#else
ext2_msg(sb, KERN_INFO, "xip option not supported");
#endif
@@ -896,7 +896,7 @@ static int ext2_fill_super(struct super_block *sb, void *data, int silent)

blocksize = BLOCK_SIZE << le32_to_cpu(sbi->s_es->s_log_block_size);

- if (sbi->s_mount_opt & EXT2_MOUNT_XIP) {
+ if (sbi->s_mount_opt & EXT2_MOUNT_DAX) {
if (blocksize != PAGE_SIZE) {
ext2_msg(sb, KERN_ERR,
"error: unsupported blocksize for xip");
@@ -1275,10 +1275,10 @@ static int ext2_remount (struct super_block * sb, int * flags, char * data)
((sbi->s_mount_opt & EXT2_MOUNT_POSIX_ACL) ? MS_POSIXACL : 0);

es = sbi->s_es;
- if ((sbi->s_mount_opt ^ old_opts.s_mount_opt) & EXT2_MOUNT_XIP) {
+ if ((sbi->s_mount_opt ^ old_opts.s_mount_opt) & EXT2_MOUNT_DAX) {
ext2_msg(sb, KERN_WARNING, "warning: refusing change of "
"xip flag with busy inodes while remounting");
- sbi->s_mount_opt ^= EXT2_MOUNT_XIP;
+ sbi->s_mount_opt ^= EXT2_MOUNT_DAX;
}
if ((*flags & MS_RDONLY) == (sb->s_flags & MS_RDONLY)) {
spin_unlock(&sbi->s_lock);
--
1.8.5.3

2014-02-25 14:19:28

by Matthew Wilcox

Subject: [PATCH v6 07/22] Replace the XIP page fault handler with the DAX page fault handler

Instead of calling aops->get_xip_mem from the fault handler, the
filesystem passes a get_block_t that is used to find the appropriate
blocks.

Signed-off-by: Matthew Wilcox <[email protected]>
---
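The fault handler does two unit conversions around the get_block call:
page offset to filesystem block on the way in, and filesystem block to
512-byte sector on the way out to direct_access. A userspace sketch of
just that arithmetic, assuming 4 KiB pages and block sizes between 512
bytes and one page; pgoff_to_block() and block_to_sector() are
hypothetical names. In the kernel the second step takes the physical
block number that get_block stored in bh.b_blocknr, not the logical one.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_PAGE_SHIFT 12   /* assumes 4 KiB pages */

/* First filesystem block backing page `pgoff`; blkbits is log2(block size). */
static uint64_t pgoff_to_block(uint64_t pgoff, unsigned int blkbits)
{
	return pgoff << (TOY_PAGE_SHIFT - blkbits);
}

/* 512-byte sector at which filesystem block `blocknr` starts. */
static uint64_t block_to_sector(uint64_t blocknr, unsigned int blkbits)
{
	return blocknr << (blkbits - 9);
}
```

With 1 KiB blocks (blkbits == 10), page 3 starts at block 12, and block
12 starts at sector 24; with 4 KiB blocks, page 3 is block 3, which also
starts at sector 24.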
fs/dax.c | 167 +++++++++++++++++++++++++++++++++++++++++++
fs/ext2/file.c | 35 ++++++++-
include/linux/fs.h | 4 +-
mm/filemap_xip.c | 206 -----------------------------------------------------
4 files changed, 203 insertions(+), 209 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 81099f9..ebcd8fd 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -19,6 +19,8 @@
#include <linux/buffer_head.h>
#include <linux/fs.h>
#include <linux/genhd.h>
+#include <linux/highmem.h>
+#include <linux/mm.h>
#include <linux/mutex.h>
#include <linux/uio.h>

@@ -32,6 +34,16 @@ static long dax_get_addr(struct inode *inode, struct buffer_head *bh,
return ops->direct_access(bdev, sector, addr, &pfn, bh->b_size);
}

+static long dax_get_pfn(struct inode *inode, struct buffer_head *bh,
+ unsigned long *pfn)
+{
+ struct block_device *bdev = bh->b_bdev;
+ const struct block_device_operations *ops = bdev->bd_disk->fops;
+ void *addr;
+ sector_t sector = bh->b_blocknr << (inode->i_blkbits - 9);
+ return ops->direct_access(bdev, sector, &addr, pfn, bh->b_size);
+}
+
static void dax_new_buf(void *addr, unsigned size, unsigned first,
loff_t offset, loff_t end, int rw)
{
@@ -190,3 +202,158 @@ ssize_t dax_do_io(int rw, struct kiocb *iocb, struct inode *inode,
return retval;
}
EXPORT_SYMBOL_GPL(dax_do_io);
+
+/*
+ * The user has performed a load from a hole in the file. Allocating
+ * a new page in the file would cause excessive storage usage for
+ * workloads with sparse files. We allocate a page cache page instead.
+ * We'll kick it out of the page cache if it's ever written to,
+ * otherwise it will simply fall out of the page cache under memory
+ * pressure without ever having been dirtied.
+ */
+static int dax_load_hole(struct address_space *mapping, struct vm_fault *vmf)
+{
+ unsigned long size;
+ struct inode *inode = mapping->host;
+ struct page *page = find_or_create_page(mapping, vmf->pgoff,
+ GFP_KERNEL | __GFP_ZERO);
+ if (!page)
+ return VM_FAULT_OOM;
+ size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT;
+ if (vmf->pgoff >= size) {
+ unlock_page(page);
+ page_cache_release(page);
+ return VM_FAULT_SIGBUS;
+ }
+
+ vmf->page = page;
+ return VM_FAULT_LOCKED;
+}
+
+static void copy_user_bh(struct page *to, struct inode *inode,
+ struct buffer_head *bh, unsigned long vaddr)
+{
+ void *vfrom, *vto;
+ dax_get_addr(inode, bh, &vfrom); /* XXX: error handling */
+ vto = kmap_atomic(to);
+ copy_user_page(vto, vfrom, vaddr, to);
+ kunmap_atomic(vto);
+}
+
+static int do_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
+ get_block_t get_block)
+{
+ struct file *file = vma->vm_file;
+ struct inode *inode = file_inode(file);
+ struct address_space *mapping = file->f_mapping;
+ struct buffer_head bh;
+ unsigned long vaddr = (unsigned long)vmf->virtual_address;
+ sector_t block;
+ pgoff_t size;
+ unsigned long pfn;
+ int error;
+
+ size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT;
+ if (vmf->pgoff >= size)
+ return VM_FAULT_SIGBUS;
+
+ memset(&bh, 0, sizeof(bh));
+ block = (sector_t)vmf->pgoff << (PAGE_SHIFT - inode->i_blkbits);
+ bh.b_size = PAGE_SIZE;
+ error = get_block(inode, block, &bh, 0);
+ if (error || bh.b_size < PAGE_SIZE)
+ return VM_FAULT_SIGBUS;
+
+ if (!buffer_written(&bh) && !vmf->cow_page) {
+ if (vmf->flags & FAULT_FLAG_WRITE) {
+ error = get_block(inode, block, &bh, 1);
+ if (error || bh.b_size < PAGE_SIZE)
+ return VM_FAULT_SIGBUS;
+ } else {
+ return dax_load_hole(mapping, vmf);
+ }
+ }
+
+ /* Recheck i_size under i_mmap_mutex */
+ mutex_lock(&mapping->i_mmap_mutex);
+ size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT;
+ if (unlikely(vmf->pgoff >= size)) {
+ mutex_unlock(&mapping->i_mmap_mutex);
+ return VM_FAULT_SIGBUS;
+ }
+ if (vmf->cow_page) {
+ if (buffer_written(&bh))
+ copy_user_bh(vmf->cow_page, inode, &bh, vaddr);
+ else
+ clear_user_highpage(vmf->cow_page, vaddr);
+ return VM_FAULT_COWED;
+ }
+
+ error = dax_get_pfn(inode, &bh, &pfn);
+ if (error > 0)
+ error = vm_insert_mixed(vma, vaddr, pfn);
+ mutex_unlock(&mapping->i_mmap_mutex);
+ if (error == -ENOMEM)
+ return VM_FAULT_OOM;
+ /* -EBUSY is fine, somebody else faulted on the same PTE */
+ if (error != -EBUSY)
+ BUG_ON(error);
+ return VM_FAULT_NOPAGE;
+}
+
+/**
+ * dax_fault - handle a page fault on an XIP file
+ * @vma: The virtual memory area where the fault occurred
+ * @vmf: The description of the fault
+ * @get_block: The filesystem method used to translate file offsets to blocks
+ *
+ * When a page fault occurs, filesystems may call this helper in their
+ * fault handler for XIP files.
+ */
+int dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
+ get_block_t get_block)
+{
+ int result;
+ struct super_block *sb = file_inode(vma->vm_file)->i_sb;
+
+ sb_start_pagefault(sb);
+ file_update_time(vma->vm_file);
+ result = do_dax_fault(vma, vmf, get_block);
+ sb_end_pagefault(sb);
+
+ return result;
+}
+EXPORT_SYMBOL_GPL(dax_fault);
+
+/**
+ * dax_mkwrite - convert a read-only page to read-write in an XIP file
+ * @vma: The virtual memory area where the fault occurred
+ * @vmf: The description of the fault
+ * @get_block: The filesystem method used to translate file offsets to blocks
+ *
+ * XIP handles reads of holes by adding pages full of zeroes into the
+ * mapping. If the page is subsequently written to, we have to allocate
+ * the page on media and free the page that was in the cache.
+ */
+int dax_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf,
+ get_block_t get_block)
+{
+ int result;
+ struct super_block *sb = file_inode(vma->vm_file)->i_sb;
+
+ sb_start_pagefault(sb);
+ file_update_time(vma->vm_file);
+ result = do_dax_fault(vma, vmf, get_block);
+ sb_end_pagefault(sb);
+
+ if (!(result & VM_FAULT_ERROR)) {
+ struct page *page = vmf->page;
+ unmap_mapping_range(page->mapping,
+ (loff_t)page->index << PAGE_CACHE_SHIFT,
+ PAGE_CACHE_SIZE, 0);
+ delete_from_page_cache(page);
+ }
+
+ return result;
+}
+EXPORT_SYMBOL_GPL(dax_mkwrite);
diff --git a/fs/ext2/file.c b/fs/ext2/file.c
index ef5cf96..e3ce10d 100644
--- a/fs/ext2/file.c
+++ b/fs/ext2/file.c
@@ -25,6 +25,37 @@
#include "xattr.h"
#include "acl.h"

+#ifdef CONFIG_EXT2_FS_XIP
+static int ext2_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
+{
+ return dax_fault(vma, vmf, ext2_get_block);
+}
+
+static int ext2_dax_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
+{
+ return dax_mkwrite(vma, vmf, ext2_get_block);
+}
+
+static const struct vm_operations_struct ext2_dax_vm_ops = {
+ .fault = ext2_dax_fault,
+ .page_mkwrite = ext2_dax_mkwrite,
+ .remap_pages = generic_file_remap_pages,
+};
+
+static int ext2_file_mmap(struct file *file, struct vm_area_struct *vma)
+{
+ if (!IS_DAX(file_inode(file)))
+ return generic_file_mmap(file, vma);
+
+ file_accessed(file);
+ vma->vm_ops = &ext2_dax_vm_ops;
+ vma->vm_flags |= VM_MIXEDMAP;
+ return 0;
+}
+#else
+#define ext2_file_mmap generic_file_mmap
+#endif
+
/*
* Called when filp is released. This happens when all file descriptors
* for a single struct file are closed. Note that different open() calls
@@ -70,7 +101,7 @@ const struct file_operations ext2_file_operations = {
#ifdef CONFIG_COMPAT
.compat_ioctl = ext2_compat_ioctl,
#endif
- .mmap = generic_file_mmap,
+ .mmap = ext2_file_mmap,
.open = dquot_file_open,
.release = ext2_release_file,
.fsync = ext2_fsync,
@@ -89,7 +120,7 @@ const struct file_operations ext2_xip_file_operations = {
#ifdef CONFIG_COMPAT
.compat_ioctl = ext2_compat_ioctl,
#endif
- .mmap = xip_file_mmap,
+ .mmap = ext2_file_mmap,
.open = dquot_file_open,
.release = ext2_release_file,
.fsync = ext2_fsync,
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 07888b9..00ad95e 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -48,6 +48,7 @@ struct cred;
struct swap_info_struct;
struct seq_file;
struct workqueue_struct;
+struct vm_fault;

extern void __init inode_init(void);
extern void __init inode_init_early(void);
@@ -2517,10 +2518,11 @@ extern int generic_file_open(struct inode * inode, struct file * filp);
extern int nonseekable_open(struct inode * inode, struct file * filp);

#ifdef CONFIG_FS_XIP
-extern int xip_file_mmap(struct file * file, struct vm_area_struct * vma);
extern int xip_truncate_page(struct address_space *mapping, loff_t from);
ssize_t dax_do_io(int rw, struct kiocb *, struct inode *, const struct iovec *,
loff_t, unsigned segs, get_block_t, dio_iodone_t, int flags);
+int dax_fault(struct vm_area_struct *, struct vm_fault *, get_block_t);
+int dax_mkwrite(struct vm_area_struct *, struct vm_fault *, get_block_t);
#else
static inline int xip_truncate_page(struct address_space *mapping, loff_t from)
{
diff --git a/mm/filemap_xip.c b/mm/filemap_xip.c
index f7c37a1..9dd45f3 100644
--- a/mm/filemap_xip.c
+++ b/mm/filemap_xip.c
@@ -22,212 +22,6 @@
#include <asm/io.h>

/*
- * We do use our own empty page to avoid interference with other users
- * of ZERO_PAGE(), such as /dev/zero
- */
-static DEFINE_MUTEX(xip_sparse_mutex);
-static seqcount_t xip_sparse_seq = SEQCNT_ZERO(xip_sparse_seq);
-static struct page *__xip_sparse_page;
-
-/* called under xip_sparse_mutex */
-static struct page *xip_sparse_page(void)
-{
- if (!__xip_sparse_page) {
- struct page *page = alloc_page(GFP_HIGHUSER | __GFP_ZERO);
-
- if (page)
- __xip_sparse_page = page;
- }
- return __xip_sparse_page;
-}
-
-/*
- * __xip_unmap is invoked from xip_unmap and
- * xip_write
- *
- * This function walks all vmas of the address_space and unmaps the
- * __xip_sparse_page when found at pgoff.
- */
-static void
-__xip_unmap (struct address_space * mapping,
- unsigned long pgoff)
-{
- struct vm_area_struct *vma;
- struct mm_struct *mm;
- unsigned long address;
- pte_t *pte;
- pte_t pteval;
- spinlock_t *ptl;
- struct page *page;
- unsigned count;
- int locked = 0;
-
- count = read_seqcount_begin(&xip_sparse_seq);
-
- page = __xip_sparse_page;
- if (!page)
- return;
-
-retry:
- mutex_lock(&mapping->i_mmap_mutex);
- vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) {
- mm = vma->vm_mm;
- address = vma->vm_start +
- ((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
- BUG_ON(address < vma->vm_start || address >= vma->vm_end);
- pte = page_check_address(page, mm, address, &ptl, 1);
- if (pte) {
- /* Nuke the page table entry. */
- flush_cache_page(vma, address, pte_pfn(*pte));
- pteval = ptep_clear_flush(vma, address, pte);
- page_remove_rmap(page);
- dec_mm_counter(mm, MM_FILEPAGES);
- BUG_ON(pte_dirty(pteval));
- pte_unmap_unlock(pte, ptl);
- /* must invalidate_page _before_ freeing the page */
- mmu_notifier_invalidate_page(mm, address);
- page_cache_release(page);
- }
- }
- mutex_unlock(&mapping->i_mmap_mutex);
-
- if (locked) {
- mutex_unlock(&xip_sparse_mutex);
- } else if (read_seqcount_retry(&xip_sparse_seq, count)) {
- mutex_lock(&xip_sparse_mutex);
- locked = 1;
- goto retry;
- }
-}
-
-/*
- * xip_fault() is invoked via the vma operations vector for a
- * mapped memory region to read in file data during a page fault.
- *
- * This function is derived from filemap_fault, but used for execute in place
- */
-static int xip_file_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
-{
- struct file *file = vma->vm_file;
- struct address_space *mapping = file->f_mapping;
- struct inode *inode = mapping->host;
- pgoff_t size;
- void *xip_mem;
- unsigned long xip_pfn;
- struct page *page;
- int error;
-
- /* XXX: are VM_FAULT_ codes OK? */
-again:
- size = (i_size_read(inode) + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
- if (vmf->pgoff >= size)
- return VM_FAULT_SIGBUS;
-
- error = mapping->a_ops->get_xip_mem(mapping, vmf->pgoff, 0,
- &xip_mem, &xip_pfn);
- if (likely(!error))
- goto found;
- if (error != -ENODATA)
- return VM_FAULT_OOM;
-
- /* sparse block */
- if ((vma->vm_flags & (VM_WRITE | VM_MAYWRITE)) &&
- (vma->vm_flags & (VM_SHARED | VM_MAYSHARE)) &&
- (!(mapping->host->i_sb->s_flags & MS_RDONLY))) {
- int err;
-
- /* maybe shared writable, allocate new block */
- mutex_lock(&xip_sparse_mutex);
- error = mapping->a_ops->get_xip_mem(mapping, vmf->pgoff, 1,
- &xip_mem, &xip_pfn);
- mutex_unlock(&xip_sparse_mutex);
- if (error)
- return VM_FAULT_SIGBUS;
- /* unmap sparse mappings at pgoff from all other vmas */
- __xip_unmap(mapping, vmf->pgoff);
-
-found:
- /* We must recheck i_size under i_mmap_mutex */
- mutex_lock(&mapping->i_mmap_mutex);
- size = (i_size_read(inode) + PAGE_CACHE_SIZE - 1) >>
- PAGE_CACHE_SHIFT;
- if (unlikely(vmf->pgoff >= size)) {
- mutex_unlock(&mapping->i_mmap_mutex);
- return VM_FAULT_SIGBUS;
- }
- err = vm_insert_mixed(vma, (unsigned long)vmf->virtual_address,
- xip_pfn);
- mutex_unlock(&mapping->i_mmap_mutex);
- if (err == -ENOMEM)
- return VM_FAULT_OOM;
- /*
- * err == -EBUSY is fine, we've raced against another thread
- * that faulted-in the same page
- */
- if (err != -EBUSY)
- BUG_ON(err);
- return VM_FAULT_NOPAGE;
- } else {
- int err, ret = VM_FAULT_OOM;
-
- mutex_lock(&xip_sparse_mutex);
- write_seqcount_begin(&xip_sparse_seq);
- error = mapping->a_ops->get_xip_mem(mapping, vmf->pgoff, 0,
- &xip_mem, &xip_pfn);
- if (unlikely(!error)) {
- write_seqcount_end(&xip_sparse_seq);
- mutex_unlock(&xip_sparse_mutex);
- goto again;
- }
- if (error != -ENODATA)
- goto out;
-
- /* We must recheck i_size under i_mmap_mutex */
- mutex_lock(&mapping->i_mmap_mutex);
- size = (i_size_read(inode) + PAGE_CACHE_SIZE - 1) >>
- PAGE_CACHE_SHIFT;
- if (unlikely(vmf->pgoff >= size)) {
- ret = VM_FAULT_SIGBUS;
- goto unlock;
- }
- /* not shared and writable, use xip_sparse_page() */
- page = xip_sparse_page();
- if (!page)
- goto unlock;
- err = vm_insert_page(vma, (unsigned long)vmf->virtual_address,
- page);
- if (err == -ENOMEM)
- goto unlock;
-
- ret = VM_FAULT_NOPAGE;
-unlock:
- mutex_unlock(&mapping->i_mmap_mutex);
-out:
- write_seqcount_end(&xip_sparse_seq);
- mutex_unlock(&xip_sparse_mutex);
-
- return ret;
- }
-}
-
-static const struct vm_operations_struct xip_file_vm_ops = {
- .fault = xip_file_fault,
- .page_mkwrite = filemap_page_mkwrite,
- .remap_pages = generic_file_remap_pages,
-};
-
-int xip_file_mmap(struct file * file, struct vm_area_struct * vma)
-{
- BUG_ON(!file->f_mapping->a_ops->get_xip_mem);
-
- file_accessed(file);
- vma->vm_ops = &xip_file_vm_ops;
- vma->vm_flags |= VM_MIXEDMAP;
- return 0;
-}
-EXPORT_SYMBOL_GPL(xip_file_mmap);
-
-/*
* truncate a page used for execute in place
* functionality is analog to block_truncate_page but does use get_xip_mem
* to get the page instead of page cache
--
1.8.5.3

2014-02-25 14:19:26

by Matthew Wilcox

Subject: [PATCH v6 13/22] ext2: Remove ext2_use_xip

Replace ext2_use_xip() with test_opt(sb, XIP), which expands to the same code.
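
As a rough illustration of why a dedicated helper is redundant, here is a
minimal userspace sketch rather than the kernel code itself. The struct, the
macro shape and every SKETCH_/sketch_ name are simplified stand-ins assumed
for illustration; the only idea taken from the patch is that the flag is
defined as 0 when the feature is compiled out, so the generic option test
becomes constant-false.

```c
#include <assert.h>

/* Hypothetical stand-in for the build-time option; not a real Kconfig symbol. */
#ifdef SKETCH_FS_XIP
#define SKETCH_MOUNT_XIP  0x010000UL
#else
#define SKETCH_MOUNT_XIP  0UL	/* compiled out: the test below is always false */
#endif
#define SKETCH_MOUNT_NOBH 0x000100UL

/* Simplified stand-in for the per-superblock info that holds mount options. */
struct sketch_sb_info {
	unsigned long s_mount_opt;
};

/* Generic option test: mask the option word with the named flag. */
#define sketch_test_opt(sbi, opt) ((sbi)->s_mount_opt & SKETCH_MOUNT_##opt)

/* Returns nonzero when the direct-access path should be taken. */
static int sketch_wants_xip(const struct sketch_sb_info *sbi)
{
	return sketch_test_opt(sbi, XIP) != 0;
}
```

With the flag defined as 0, the compiler can discard the XIP branch without
any separate #ifdef around the call sites.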

Signed-off-by: Matthew Wilcox <[email protected]>
---
fs/ext2/ext2.h | 4 ++++
fs/ext2/inode.c | 2 +-
fs/ext2/namei.c | 4 ++--
3 files changed, 7 insertions(+), 3 deletions(-)

diff --git a/fs/ext2/ext2.h b/fs/ext2/ext2.h
index d9a17d0..5ecf570 100644
--- a/fs/ext2/ext2.h
+++ b/fs/ext2/ext2.h
@@ -380,7 +380,11 @@ struct ext2_inode {
#define EXT2_MOUNT_NO_UID32 0x000200 /* Disable 32-bit UIDs */
#define EXT2_MOUNT_XATTR_USER 0x004000 /* Extended user attributes */
#define EXT2_MOUNT_POSIX_ACL 0x008000 /* POSIX Access Control Lists */
+#ifdef CONFIG_FS_XIP
#define EXT2_MOUNT_XIP 0x010000 /* Execute in place */
+#else
+#define EXT2_MOUNT_XIP 0
+#endif
#define EXT2_MOUNT_USRQUOTA 0x020000 /* user quota */
#define EXT2_MOUNT_GRPQUOTA 0x040000 /* group quota */
#define EXT2_MOUNT_RESERVATION 0x080000 /* Preallocation */
diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
index a9346a9..2e587e2 100644
--- a/fs/ext2/inode.c
+++ b/fs/ext2/inode.c
@@ -1393,7 +1393,7 @@ struct inode *ext2_iget (struct super_block *sb, unsigned long ino)

if (S_ISREG(inode->i_mode)) {
inode->i_op = &ext2_file_inode_operations;
- if (ext2_use_xip(inode->i_sb)) {
+ if (test_opt(inode->i_sb, XIP)) {
inode->i_mapping->a_ops = &ext2_aops_xip;
inode->i_fop = &ext2_xip_file_operations;
} else if (test_opt(inode->i_sb, NOBH)) {
diff --git a/fs/ext2/namei.c b/fs/ext2/namei.c
index c268d0a..846c356 100644
--- a/fs/ext2/namei.c
+++ b/fs/ext2/namei.c
@@ -105,7 +105,7 @@ static int ext2_create (struct inode * dir, struct dentry * dentry, umode_t mode
return PTR_ERR(inode);

inode->i_op = &ext2_file_inode_operations;
- if (ext2_use_xip(inode->i_sb)) {
+ if (test_opt(inode->i_sb, XIP)) {
inode->i_mapping->a_ops = &ext2_aops_xip;
inode->i_fop = &ext2_xip_file_operations;
} else if (test_opt(inode->i_sb, NOBH)) {
@@ -126,7 +126,7 @@ static int ext2_tmpfile(struct inode *dir, struct dentry *dentry, umode_t mode)
return PTR_ERR(inode);

inode->i_op = &ext2_file_inode_operations;
- if (ext2_use_xip(inode->i_sb)) {
+ if (test_opt(inode->i_sb, XIP)) {
inode->i_mapping->a_ops = &ext2_aops_xip;
inode->i_fop = &ext2_xip_file_operations;
} else if (test_opt(inode->i_sb, NOBH)) {
--
1.8.5.3

2014-02-25 14:19:25

by Matthew Wilcox

Subject: [PATCH v6 12/22] ext2: Remove ext2_xip_verify_sb()

Jan Kara pointed out that calling ext2_xip_verify_sb() in ext2_remount()
doesn't make sense, since changing the XIP option on remount isn't
allowed. It also doesn't make sense to re-check whether blocksize is
supported since it can't change between mounts.

Replace the call to ext2_xip_verify_sb() in ext2_fill_super() with the
equivalent check and delete the definition.
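
The two behaviours described above can be sketched in userspace C. This is
an illustrative model under stated assumptions, not the ext2 code: the flag
value, the function names and the use of -EINVAL as the failure code are all
hypothetical. The first function models a mount-time check that fails rather
than silently dropping the option; the second models refusing a change of
the flag on remount by flipping the bit back with XOR.

```c
#include <assert.h>
#include <errno.h>

#define SKETCH_MOUNT_XIP 0x010000UL	/* hypothetical flag value */

/* Mount-time validation: reject the mount instead of clearing the option. */
static int sketch_check_xip(unsigned long opt, unsigned blocksize,
			    unsigned page_size, int has_direct_access)
{
	if (!(opt & SKETCH_MOUNT_XIP))
		return 0;		/* option not requested: nothing to check */
	if (blocksize != page_size)
		return -EINVAL;		/* unsupported block size */
	if (!has_direct_access)
		return -EINVAL;		/* device cannot map its storage directly */
	return 0;
}

/*
 * Remount: if the XIP bit differs between the old and new option words,
 * XOR flips it back to its old state and leaves every other bit alone.
 */
static unsigned long sketch_keep_xip(unsigned long old_opt,
				     unsigned long new_opt)
{
	if ((new_opt ^ old_opt) & SKETCH_MOUNT_XIP)
		new_opt ^= SKETCH_MOUNT_XIP;
	return new_opt;
}
```

The XOR form works because the branch is only entered when the bit is known
to differ, so flipping it is the same as restoring the old value.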

Signed-off-by: Matthew Wilcox <[email protected]>
---
fs/ext2/super.c | 33 ++++++++++++---------------------
fs/ext2/xip.c | 12 ------------
fs/ext2/xip.h | 2 --
3 files changed, 12 insertions(+), 35 deletions(-)

diff --git a/fs/ext2/super.c b/fs/ext2/super.c
index 20d6697..3a1db39 100644
--- a/fs/ext2/super.c
+++ b/fs/ext2/super.c
@@ -868,9 +868,6 @@ static int ext2_fill_super(struct super_block *sb, void *data, int silent)
((EXT2_SB(sb)->s_mount_opt & EXT2_MOUNT_POSIX_ACL) ?
MS_POSIXACL : 0);

- ext2_xip_verify_sb(sb); /* see if bdev supports xip, unset
- EXT2_MOUNT_XIP if not */
-
if (le32_to_cpu(es->s_rev_level) == EXT2_GOOD_OLD_REV &&
(EXT2_HAS_COMPAT_FEATURE(sb, ~0U) ||
EXT2_HAS_RO_COMPAT_FEATURE(sb, ~0U) ||
@@ -900,11 +897,17 @@ static int ext2_fill_super(struct super_block *sb, void *data, int silent)

blocksize = BLOCK_SIZE << le32_to_cpu(sbi->s_es->s_log_block_size);

- if (ext2_use_xip(sb) && blocksize != PAGE_SIZE) {
- if (!silent)
+ if (sbi->s_mount_opt & EXT2_MOUNT_XIP) {
+ if (blocksize != PAGE_SIZE) {
ext2_msg(sb, KERN_ERR,
- "error: unsupported blocksize for xip");
- goto failed_mount;
+ "error: unsupported blocksize for xip");
+ goto failed_mount;
+ }
+ if (!sb->s_bdev->bd_disk->fops->direct_access) {
+ ext2_msg(sb, KERN_ERR,
+ "error: device does not support xip");
+ goto failed_mount;
+ }
}

/* If the blocksize doesn't match, re-read the thing.. */
@@ -1249,7 +1252,6 @@ static int ext2_remount (struct super_block * sb, int * flags, char * data)
{
struct ext2_sb_info * sbi = EXT2_SB(sb);
struct ext2_super_block * es;
- unsigned long old_mount_opt = sbi->s_mount_opt;
struct ext2_mount_options old_opts;
unsigned long old_sb_flags;
int err;
@@ -1273,22 +1275,11 @@ static int ext2_remount (struct super_block * sb, int * flags, char * data)
sb->s_flags = (sb->s_flags & ~MS_POSIXACL) |
((sbi->s_mount_opt & EXT2_MOUNT_POSIX_ACL) ? MS_POSIXACL : 0);

- ext2_xip_verify_sb(sb); /* see if bdev supports xip, unset
- EXT2_MOUNT_XIP if not */
-
- if ((ext2_use_xip(sb)) && (sb->s_blocksize != PAGE_SIZE)) {
- ext2_msg(sb, KERN_WARNING,
- "warning: unsupported blocksize for xip");
- err = -EINVAL;
- goto restore_opts;
- }
-
es = sbi->s_es;
- if ((sbi->s_mount_opt ^ old_mount_opt) & EXT2_MOUNT_XIP) {
+ if ((sbi->s_mount_opt ^ old_opts.s_mount_opt) & EXT2_MOUNT_XIP) {
ext2_msg(sb, KERN_WARNING, "warning: refusing change of "
"xip flag with busy inodes while remounting");
- sbi->s_mount_opt &= ~EXT2_MOUNT_XIP;
- sbi->s_mount_opt |= old_mount_opt & EXT2_MOUNT_XIP;
+ sbi->s_mount_opt ^= EXT2_MOUNT_XIP;
}
if ((*flags & MS_RDONLY) == (sb->s_flags & MS_RDONLY)) {
spin_unlock(&sbi->s_lock);
diff --git a/fs/ext2/xip.c b/fs/ext2/xip.c
index 132d4da..66ca113 100644
--- a/fs/ext2/xip.c
+++ b/fs/ext2/xip.c
@@ -13,15 +13,3 @@
#include "ext2.h"
#include "xip.h"

-void ext2_xip_verify_sb(struct super_block *sb)
-{
- struct ext2_sb_info *sbi = EXT2_SB(sb);
-
- if ((sbi->s_mount_opt & EXT2_MOUNT_XIP) &&
- !sb->s_bdev->bd_disk->fops->direct_access) {
- sbi->s_mount_opt &= (~EXT2_MOUNT_XIP);
- ext2_msg(sb, KERN_WARNING,
- "warning: ignoring xip option - "
- "not supported by bdev");
- }
-}
diff --git a/fs/ext2/xip.h b/fs/ext2/xip.h
index e7b9f0a..87eeb04 100644
--- a/fs/ext2/xip.h
+++ b/fs/ext2/xip.h
@@ -6,13 +6,11 @@
*/

#ifdef CONFIG_EXT2_FS_XIP
-extern void ext2_xip_verify_sb (struct super_block *);
static inline int ext2_use_xip (struct super_block *sb)
{
struct ext2_sb_info *sbi = EXT2_SB(sb);
return (sbi->s_mount_opt & EXT2_MOUNT_XIP);
}
#else
-#define ext2_xip_verify_sb(sb) do { } while (0)
#define ext2_use_xip(sb) 0
#endif
--
1.8.5.3

2014-02-25 14:20:25

by Matthew Wilcox

Subject: [PATCH v6 21/22] ext4: Fix typos

Comment fix only

Signed-off-by: Matthew Wilcox <[email protected]>
---
fs/ext4/inode.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 9462730..14a9744 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3691,7 +3691,7 @@ void ext4_truncate(struct inode *inode)

/*
* There is a possibility that we're either freeing the inode
- * or it completely new indode. In those cases we might not
+ * or it's a completely new inode. In those cases we might not
* have i_mutex locked because it's not necessary.
*/
if (!(inode->i_state & (I_NEW|I_FREEING)))
--
1.8.5.3

2014-02-25 14:19:24

by Matthew Wilcox

Subject: [PATCH v6 11/22] Replace ext2_clear_xip_target with dax_clear_blocks

This is practically generic code; other filesystems will want to call
it from other places, but there's nothing ext2-specific about it.

Make it a little more generic by allowing it to take a count of the number
of bytes to zero rather than fixing it to a single page. Thanks to Dave
Hansen for suggesting that I need to call cond_resched() if zeroing more
than one page.
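
The loop structure can be modelled in userspace as follows. This is a sketch
under assumptions, not the kernel function: sketch_direct_access() is a
hypothetical stand-in for a direct_access-style lookup that may map only
part of the request, byte offsets replace sectors, and a counter marks the
points where the kernel version would call cond_resched().

```c
#include <assert.h>
#include <string.h>

#define SKETCH_PAGE_SIZE 4096L

/*
 * Hypothetical lookup: point *addr at byte offset `off` in `base` and report
 * how many bytes are contiguously available, capped at `chunk`.
 */
static long sketch_direct_access(unsigned char *base, long off, long want,
				 long chunk, unsigned char **addr)
{
	*addr = base + off;
	return want < chunk ? want : chunk;
}

/*
 * Zero `size` bytes starting at `off`: whole pages first, then any tail.
 * Returns the number of yield points passed, or -1 if the lookup fails.
 */
static long sketch_clear_range(unsigned char *base, long off, long size,
			       long chunk)
{
	long yields = 0;

	while (size > 0) {
		unsigned char *addr;
		long count = sketch_direct_access(base, off, size, chunk, &addr);

		if (count <= 0)
			return -1;
		while (count >= SKETCH_PAGE_SIZE) {
			memset(addr, 0, SKETCH_PAGE_SIZE);
			addr += SKETCH_PAGE_SIZE;
			off += SKETCH_PAGE_SIZE;
			size -= SKETCH_PAGE_SIZE;
			count -= SKETCH_PAGE_SIZE;
			yields++;	/* kernel code would reschedule here */
		}
		if (count > 0) {	/* partial page at the end of this chunk */
			memset(addr, 0, (size_t)count);
			off += count;
			size -= count;
		}
	}
	return yields;
}
```

Yielding once per page keeps the latency bounded when a caller asks for a
large range to be zeroed in one call.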

Signed-off-by: Matthew Wilcox <[email protected]>
---
fs/dax.c | 34 ++++++++++++++++++++++++++++++++++
fs/ext2/inode.c | 8 +++++---
fs/ext2/xip.c | 23 -----------------------
fs/ext2/xip.h | 3 ---
include/linux/fs.h | 6 ++++++
5 files changed, 45 insertions(+), 29 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index a6a1adc..75328bf 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -22,8 +22,42 @@
#include <linux/highmem.h>
#include <linux/mm.h>
#include <linux/mutex.h>
+#include <linux/sched.h>
#include <linux/uio.h>

+int dax_clear_blocks(struct inode *inode, sector_t block, long size)
+{
+ struct block_device *bdev = inode->i_sb->s_bdev;
+ const struct block_device_operations *ops = bdev->bd_disk->fops;
+ sector_t sector = block << (inode->i_blkbits - 9);
+ unsigned long pfn;
+
+ might_sleep();
+ do {
+ void *addr;
+ long count = ops->direct_access(bdev, sector, &addr, &pfn,
+ size);
+ if (count < 0)
+ return count;
+ while (count >= PAGE_SIZE) {
+ clear_page(addr);
+ addr += PAGE_SIZE;
+ size -= PAGE_SIZE;
+ count -= PAGE_SIZE;
+ sector += PAGE_SIZE / 512;
+ cond_resched();
+ }
+ if (count > 0) {
+ memset(addr, 0, count);
+ sector += count / 512;
+ size -= count;
+ }
+ } while (size);
+
+ return 0;
+}
+EXPORT_SYMBOL_GPL(dax_clear_blocks);
+
static long dax_get_addr(struct inode *inode, struct buffer_head *bh,
void **addr)
{
diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
index b156fe8..a9346a9 100644
--- a/fs/ext2/inode.c
+++ b/fs/ext2/inode.c
@@ -733,10 +733,12 @@ static int ext2_get_blocks(struct inode *inode,

if (IS_DAX(inode)) {
/*
- * we need to clear the block
+ * block must be initialised before we put it in the tree
+ * so that it's not found by another thread before it's
+ * initialised
*/
- err = ext2_clear_xip_target (inode,
- le32_to_cpu(chain[depth-1].key));
+ err = dax_clear_blocks(inode, le32_to_cpu(chain[depth-1].key),
+ count << inode->i_blkbits);
if (err) {
mutex_unlock(&ei->truncate_mutex);
goto cleanup;
diff --git a/fs/ext2/xip.c b/fs/ext2/xip.c
index ca745ff..132d4da 100644
--- a/fs/ext2/xip.c
+++ b/fs/ext2/xip.c
@@ -13,29 +13,6 @@
#include "ext2.h"
#include "xip.h"

-static inline long __inode_direct_access(struct inode *inode, sector_t block,
- void **kaddr, unsigned long *pfn, long size)
-{
- struct block_device *bdev = inode->i_sb->s_bdev;
- const struct block_device_operations *ops = bdev->bd_disk->fops;
- sector_t sector = block * (PAGE_SIZE / 512);
- return ops->direct_access(bdev, sector, kaddr, pfn, size);
-}
-
-int
-ext2_clear_xip_target(struct inode *inode, sector_t block)
-{
- void *kaddr;
- unsigned long pfn;
- long size;
-
- size = __inode_direct_access(inode, block, &kaddr, &pfn, PAGE_SIZE);
- if (size < 0)
- return size;
- clear_page(kaddr);
- return 0;
-}
-
void ext2_xip_verify_sb(struct super_block *sb)
{
struct ext2_sb_info *sbi = EXT2_SB(sb);
diff --git a/fs/ext2/xip.h b/fs/ext2/xip.h
index 0fa8b7f..e7b9f0a 100644
--- a/fs/ext2/xip.h
+++ b/fs/ext2/xip.h
@@ -7,8 +7,6 @@

#ifdef CONFIG_EXT2_FS_XIP
extern void ext2_xip_verify_sb (struct super_block *);
-extern int ext2_clear_xip_target (struct inode *, sector_t);
-
static inline int ext2_use_xip (struct super_block *sb)
{
struct ext2_sb_info *sbi = EXT2_SB(sb);
@@ -17,5 +15,4 @@ static inline int ext2_use_xip (struct super_block *sb)
#else
#define ext2_xip_verify_sb(sb) do { } while (0)
#define ext2_use_xip(sb) 0
-#define ext2_clear_xip_target(inode, chain) 0
#endif
diff --git a/include/linux/fs.h b/include/linux/fs.h
index c7945fd..6f7f445 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2516,12 +2516,18 @@ extern int generic_file_open(struct inode * inode, struct file * filp);
extern int nonseekable_open(struct inode * inode, struct file * filp);

#ifdef CONFIG_FS_XIP
+int dax_clear_blocks(struct inode *, sector_t block, long size);
int dax_truncate_page(struct inode *, loff_t from, get_block_t);
ssize_t dax_do_io(int rw, struct kiocb *, struct inode *, const struct iovec *,
loff_t, unsigned segs, get_block_t, dio_iodone_t, int flags);
int dax_fault(struct vm_area_struct *, struct vm_fault *, get_block_t);
int dax_mkwrite(struct vm_area_struct *, struct vm_fault *, get_block_t);
#else
+static inline int dax_clear_blocks(struct inode *i, sector_t blk, long sz)
+{
+ return 0;
+}
+
static inline int dax_truncate_page(struct inode *i, loff_t frm, get_block_t gb)
{
return 0;
--
1.8.5.3

2014-02-25 14:21:15

by Matthew Wilcox

Subject: [PATCH v6 22/22] dax: Add reporting of major faults

If we have to call get_block with the create argument set to 1, then
the filesystem almost certainly had to zero the block, which is an I/O
that should be reported as a major fault.

Note that major faults on DAX files happen for different reasons than
major faults on non-DAX files. DAX files behave as if everything except
file holes is already cached. That's all the more reason to report
major faults when we do have to do I/O; it may be a valuable resource
for sysadmins trying to diagnose performance problems.
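
One way userspace can observe the per-process count is the ru_majflt field
returned by getrusage(). The helper below is a minimal sketch and its name
is hypothetical; whether a given workload on a DAX file actually shows up in
this counter depends on the kernel carrying this patch.

```c
#include <assert.h>
#include <sys/resource.h>

/*
 * Return the number of major page faults charged to the calling process
 * so far, or -1 if the query fails.
 */
static long sketch_major_faults(void)
{
	struct rusage ru;

	if (getrusage(RUSAGE_SELF, &ru) != 0)
		return -1;
	return ru.ru_majflt;
}
```

Sampling the value before and after touching a mapping gives the number of
major faults taken in between.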

Signed-off-by: Matthew Wilcox <[email protected]>
---
fs/dax.c | 8 +++++++-
1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/fs/dax.c b/fs/dax.c
index cdc8012..79a67c5 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -20,10 +20,12 @@
#include <linux/fs.h>
#include <linux/genhd.h>
#include <linux/highmem.h>
+#include <linux/memcontrol.h>
#include <linux/mm.h>
#include <linux/mutex.h>
#include <linux/sched.h>
#include <linux/uio.h>
+#include <linux/vmstat.h>

int dax_clear_blocks(struct inode *inode, sector_t block, long size)
{
@@ -286,6 +288,7 @@ static int do_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
pgoff_t size;
unsigned long pfn;
int error;
+ int major = 0;

size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT;
if (vmf->pgoff >= size)
@@ -301,6 +304,9 @@ static int do_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
if (!buffer_written(&bh) && !vmf->cow_page) {
if (vmf->flags & FAULT_FLAG_WRITE) {
error = get_block(inode, block, &bh, 1);
+ count_vm_event(PGMAJFAULT);
+ mem_cgroup_count_vm_event(vma->vm_mm, PGMAJFAULT);
+ major = VM_FAULT_MAJOR;
if (error || bh.b_size < PAGE_SIZE)
return VM_FAULT_SIGBUS;
} else {
@@ -332,7 +338,7 @@ static int do_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
/* -EBUSY is fine, somebody else faulted on the same PTE */
if (error != -EBUSY)
BUG_ON(error);
- return VM_FAULT_NOPAGE;
+ return VM_FAULT_NOPAGE | major;
}

/**
--
1.8.5.3

2014-02-25 14:21:43

by Matthew Wilcox

Subject: [PATCH v6 18/22] xip: Add xip_zero_page_range

This new function allows us to support hole-punch for DAX files by zeroing
a partial page, as opposed to the dax_truncate_page() function which can
only truncate to the end of the page. Reimplement dax_truncate_page() as
a macro that calls dax_zero_page_range().
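
The relationship between the two operations can be sketched in userspace C.
This is a simplified model with hypothetical sketch_ names: it operates on
one in-memory page, skips the block lookup entirely, and only shows the
length clamp and the "truncate is zero-to-end-of-page" wrapper.

```c
#include <assert.h>
#include <string.h>

#define SKETCH_PAGE_SIZE 4096u

/*
 * Zero up to `length` bytes of `page`, starting at the in-page offset of
 * file position `from`, never crossing the end of the page.
 * Returns the number of bytes actually zeroed.
 */
static unsigned sketch_zero_page_range(unsigned char *page,
				       unsigned long long from,
				       unsigned length)
{
	unsigned offset = (unsigned)(from & (SKETCH_PAGE_SIZE - 1));
	unsigned room = SKETCH_PAGE_SIZE - offset;

	if (length > room)	/* caller asked for more than the page holds */
		length = room;
	memset(page + offset, 0, length);
	return length;
}

/* Truncation as the special case: request a full page and rely on the clamp. */
#define sketch_truncate_page(page, from) \
	sketch_zero_page_range((page), (from), SKETCH_PAGE_SIZE)
```

Passing a full page size as the length lets the clamp work out the distance
to the end of the page, so the wrapper needs no arithmetic of its own.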

Signed-off-by: Matthew Wilcox <[email protected]>
[ported to 3.13-rc2]
Signed-off-by: Ross Zwisler <[email protected]>
---
Documentation/filesystems/dax.txt | 1 +
fs/dax.c | 22 +++++++++++++++-------
include/linux/fs.h | 9 ++++++++-
3 files changed, 24 insertions(+), 8 deletions(-)

diff --git a/Documentation/filesystems/dax.txt b/Documentation/filesystems/dax.txt
index 06f84e5..e5706cc 100644
--- a/Documentation/filesystems/dax.txt
+++ b/Documentation/filesystems/dax.txt
@@ -55,6 +55,7 @@ Filesystem support consists of
for fault and page_mkwrite (which should probably call dax_fault() and
dax_mkwrite(), passing the appropriate get_block() callback)
- calling dax_truncate_page() instead of block_truncate_page() for DAX files
+- calling dax_zero_page_range() instead of zero_user() for DAX files
- ensuring that there is sufficient locking between reads, writes,
truncates and page faults

diff --git a/fs/dax.c b/fs/dax.c
index 75328bf..cdc8012 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -393,13 +393,16 @@ int dax_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf,
EXPORT_SYMBOL_GPL(dax_mkwrite);

/**
- * dax_truncate_page - handle a partial page being truncated in a DAX file
+ * dax_zero_page_range - zero a range within a page of a DAX file
* @inode: The file being truncated
* @from: The file offset that is being truncated to
+ * @length: The number of bytes to zero
* @get_block: The filesystem method used to translate file offsets to blocks
*
- * Similar to block_truncate_page(), this function can be called by a
- * filesystem when it is truncating an DAX file to handle the partial page.
+ * This function can be called by a filesystem when it is zeroing part of a
+ * page in a DAX file. This is intended for hole-punch operations. If
+ * you are truncating a file, the helper function dax_truncate_page() may be
+ * more convenient.
*
* We work in terms of PAGE_CACHE_SIZE here for commonality with
* block_truncate_page(), but we could go down to PAGE_SIZE if the filesystem
@@ -407,12 +410,12 @@ EXPORT_SYMBOL_GPL(dax_mkwrite);
* block size is smaller than PAGE_SIZE, we have to zero the rest of the page
* since the file might be mmaped.
*/
-int dax_truncate_page(struct inode *inode, loff_t from, get_block_t get_block)
+int dax_zero_page_range(struct inode *inode, loff_t from, unsigned length,
+ get_block_t get_block)
{
struct buffer_head bh;
pgoff_t index = from >> PAGE_CACHE_SHIFT;
unsigned offset = from & (PAGE_CACHE_SIZE-1);
- unsigned length = PAGE_CACHE_ALIGN(from) - from;
int err;

/* Block boundary? Nothing to do */
@@ -427,11 +430,16 @@ int dax_truncate_page(struct inode *inode, loff_t from, get_block_t get_block)
if (buffer_written(&bh)) {
void *addr;
err = dax_get_addr(inode, &bh, &addr);
- if (err)
+ if (err < 0)
return err;
+ /*
+ * ext4 sometimes asks to zero past the end of a block. It
+ * really just wants to zero to the end of the block.
+ */
+ length = min_t(unsigned, length, PAGE_CACHE_SIZE - offset);
memset(addr + offset, 0, length);
}

return 0;
}
-EXPORT_SYMBOL_GPL(dax_truncate_page);
+EXPORT_SYMBOL_GPL(dax_zero_page_range);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 19abdb1..6a5091a 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2517,6 +2517,7 @@ extern int nonseekable_open(struct inode * inode, struct file * filp);

#ifdef CONFIG_FS_DAX
int dax_clear_blocks(struct inode *, sector_t block, long size);
+int dax_zero_page_range(struct inode *, loff_t from, unsigned len, get_block_t);
int dax_truncate_page(struct inode *, loff_t from, get_block_t);
ssize_t dax_do_io(int rw, struct kiocb *, struct inode *, const struct iovec *,
loff_t, unsigned segs, get_block_t, dio_iodone_t, int flags);
@@ -2528,7 +2529,8 @@ static inline int dax_clear_blocks(struct inode *i, sector_t blk, long sz)
return 0;
}

-static inline int dax_truncate_page(struct inode *i, loff_t frm, get_block_t gb)
+static inline int dax_zero_page_range(struct inode *inode, loff_t from,
+ unsigned len, get_block_t gb)
{
return 0;
}
@@ -2541,6 +2543,11 @@ static inline ssize_t dax_do_io(int rw, struct kiocb *iocb, struct inode *inode,
}
#endif

+/* Can't be a function because PAGE_CACHE_SIZE is defined in pagemap.h */
+#define dax_truncate_page(inode, from, get_block) \
+ dax_zero_page_range(inode, from, PAGE_CACHE_SIZE, get_block)
+
+
#ifdef CONFIG_BLOCK
typedef void (dio_submit_t)(int rw, struct bio *bio, struct inode *inode,
loff_t file_offset);
--
1.8.5.3

2014-02-25 14:19:23

by Matthew Wilcox

Subject: [PATCH v6 09/22] Remove mm/filemap_xip.c

It is now empty, as all of its contents have been replaced by fs/dax.c.

Signed-off-by: Matthew Wilcox <[email protected]>
---
mm/Makefile | 1 -
mm/filemap_xip.c | 23 -----------------------
2 files changed, 24 deletions(-)
delete mode 100644 mm/filemap_xip.c

diff --git a/mm/Makefile b/mm/Makefile
index 310c90a..454c176 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -47,7 +47,6 @@ obj-$(CONFIG_SLUB) += slub.o
obj-$(CONFIG_KMEMCHECK) += kmemcheck.o
obj-$(CONFIG_FAILSLAB) += failslab.o
obj-$(CONFIG_MEMORY_HOTPLUG) += memory_hotplug.o
-obj-$(CONFIG_FS_XIP) += filemap_xip.o
obj-$(CONFIG_MIGRATION) += migrate.o
obj-$(CONFIG_QUICKLIST) += quicklist.o
obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o
diff --git a/mm/filemap_xip.c b/mm/filemap_xip.c
deleted file mode 100644
index 6316578..0000000
--- a/mm/filemap_xip.c
+++ /dev/null
@@ -1,23 +0,0 @@
-/*
- * linux/mm/filemap_xip.c
- *
- * Copyright (C) 2005 IBM Corporation
- * Author: Carsten Otte <[email protected]>
- *
- * derived from linux/mm/filemap.c - Copyright (C) Linus Torvalds
- *
- */
-
-#include <linux/fs.h>
-#include <linux/pagemap.h>
-#include <linux/export.h>
-#include <linux/uio.h>
-#include <linux/rmap.h>
-#include <linux/mmu_notifier.h>
-#include <linux/sched.h>
-#include <linux/seqlock.h>
-#include <linux/mutex.h>
-#include <linux/gfp.h>
-#include <asm/tlbflush.h>
-#include <asm/io.h>
-
--
1.8.5.3

2014-02-25 14:22:20

by Matthew Wilcox

Subject: [PATCH v6 16/22] ext2: Remove ext2_aops_xip

We shouldn't need a special address_space_operations any more.

Signed-off-by: Matthew Wilcox <[email protected]>
---
fs/ext2/ext2.h | 1 -
fs/ext2/inode.c | 7 +------
fs/ext2/namei.c | 4 ++--
3 files changed, 3 insertions(+), 9 deletions(-)

diff --git a/fs/ext2/ext2.h b/fs/ext2/ext2.h
index b30c3bd..b8b1c11 100644
--- a/fs/ext2/ext2.h
+++ b/fs/ext2/ext2.h
@@ -793,7 +793,6 @@ extern const struct file_operations ext2_xip_file_operations;

/* inode.c */
extern const struct address_space_operations ext2_aops;
-extern const struct address_space_operations ext2_aops_xip;
extern const struct address_space_operations ext2_nobh_aops;

/* namei.c */
diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
index 67124f0..7ca76da 100644
--- a/fs/ext2/inode.c
+++ b/fs/ext2/inode.c
@@ -890,11 +890,6 @@ const struct address_space_operations ext2_aops = {
.error_remove_page = generic_error_remove_page,
};

-const struct address_space_operations ext2_aops_xip = {
- .bmap = ext2_bmap,
- .direct_IO = ext2_direct_IO,
-};
-
const struct address_space_operations ext2_nobh_aops = {
.readpage = ext2_readpage,
.readpages = ext2_readpages,
@@ -1393,7 +1388,7 @@ struct inode *ext2_iget (struct super_block *sb, unsigned long ino)
if (S_ISREG(inode->i_mode)) {
inode->i_op = &ext2_file_inode_operations;
if (test_opt(inode->i_sb, XIP)) {
- inode->i_mapping->a_ops = &ext2_aops_xip;
+ inode->i_mapping->a_ops = &ext2_aops;
inode->i_fop = &ext2_xip_file_operations;
} else if (test_opt(inode->i_sb, NOBH)) {
inode->i_mapping->a_ops = &ext2_nobh_aops;
diff --git a/fs/ext2/namei.c b/fs/ext2/namei.c
index 7ca803f..0db888c 100644
--- a/fs/ext2/namei.c
+++ b/fs/ext2/namei.c
@@ -105,7 +105,7 @@ static int ext2_create (struct inode * dir, struct dentry * dentry, umode_t mode

inode->i_op = &ext2_file_inode_operations;
if (test_opt(inode->i_sb, XIP)) {
- inode->i_mapping->a_ops = &ext2_aops_xip;
+ inode->i_mapping->a_ops = &ext2_aops;
inode->i_fop = &ext2_xip_file_operations;
} else if (test_opt(inode->i_sb, NOBH)) {
inode->i_mapping->a_ops = &ext2_nobh_aops;
@@ -126,7 +126,7 @@ static int ext2_tmpfile(struct inode *dir, struct dentry *dentry, umode_t mode)

inode->i_op = &ext2_file_inode_operations;
if (test_opt(inode->i_sb, XIP)) {
- inode->i_mapping->a_ops = &ext2_aops_xip;
+ inode->i_mapping->a_ops = &ext2_aops;
inode->i_fop = &ext2_xip_file_operations;
} else if (test_opt(inode->i_sb, NOBH)) {
inode->i_mapping->a_ops = &ext2_nobh_aops;
--
1.8.5.3

2014-02-25 14:19:21

by Matthew Wilcox

Subject: [PATCH v6 08/22] Replace xip_truncate_page with dax_truncate_page

It takes a get_block parameter, just like nobh_truncate_page() and
block_truncate_page().
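
The callback style can be illustrated with a small userspace model. All
names here are hypothetical and the callback signature is deliberately
simpler than the kernel's get_block_t: it returns the storage for a page
index, or NULL for a hole. The point is only that the truncate helper asks
the filesystem where the data lives instead of going through a mapping.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_PAGE_SHIFT 12
#define SKETCH_PAGE_SIZE  (1UL << SKETCH_PAGE_SHIFT)

/* Hypothetical callback: storage for page `index`, or NULL for a hole. */
typedef unsigned char *(*sketch_get_block_t)(void *file, unsigned long index);

/*
 * Zero from file position `from` to the end of its page, provided the page
 * is backed by storage. Returns the number of bytes zeroed.
 */
static unsigned long sketch_truncate_page(void *file, unsigned long from,
					  sketch_get_block_t get_block)
{
	unsigned long index = from >> SKETCH_PAGE_SHIFT;
	unsigned long offset = from & (SKETCH_PAGE_SIZE - 1);
	unsigned long length = (SKETCH_PAGE_SIZE - offset) & (SKETCH_PAGE_SIZE - 1);
	unsigned char *addr;

	if (!length)		/* page boundary: nothing to do */
		return 0;
	addr = get_block(file, index);
	if (!addr)		/* hole: nothing to zero */
		return 0;
	memset(addr + offset, 0, length);
	return length;
}

/* Example "file": page 0 is backed by memory, everything else is a hole. */
struct sketch_file {
	unsigned char page0[SKETCH_PAGE_SIZE];
};

static unsigned char *sketch_file_get_block(void *file, unsigned long index)
{
	struct sketch_file *f = file;

	return index == 0 ? f->page0 : NULL;
}
```

Because the lookup is a parameter, the same helper can serve any filesystem
that can supply a suitable callback.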

Signed-off-by: Matthew Wilcox <[email protected]>
---
fs/dax.c | 52 ++++++++++++++++++++++++++++++++++++++++++++++++----
fs/ext2/inode.c | 2 +-
include/linux/fs.h | 4 ++--
mm/filemap_xip.c | 40 ----------------------------------------
4 files changed, 51 insertions(+), 47 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index ebcd8fd..a6a1adc 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -302,13 +302,13 @@ static int do_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
}

/**
- * dax_fault - handle a page fault on an XIP file
+ * dax_fault - handle a page fault on a DAX file
* @vma: The virtual memory area where the fault occurred
* @vmf: The description of the fault
* @get_block: The filesystem method used to translate file offsets to blocks
*
* When a page fault occurs, filesystems may call this helper in their
- * fault handler for XIP files.
+ * fault handler for DAX files.
*/
int dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
get_block_t get_block)
@@ -326,12 +326,12 @@ int dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
EXPORT_SYMBOL_GPL(dax_fault);

/**
- * dax_mkwrite - convert a read-only page to read-write in an XIP file
+ * dax_mkwrite - convert a read-only page to read-write in a DAX file
* @vma: The virtual memory area where the fault occurred
* @vmf: The description of the fault
* @get_block: The filesystem method used to translate file offsets to blocks
*
- * XIP handles reads of holes by adding pages full of zeroes into the
+ * DAX handles reads of holes by adding pages full of zeroes into the
* mapping. If the page is subsequenty written to, we have to allocate
* the page on media and free the page that was in the cache.
*/
@@ -357,3 +357,47 @@ int dax_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf,
return result;
}
EXPORT_SYMBOL_GPL(dax_mkwrite);
+
+/**
+ * dax_truncate_page - handle a partial page being truncated in a DAX file
+ * @inode: The file being truncated
+ * @from: The file offset that is being truncated to
+ * @get_block: The filesystem method used to translate file offsets to blocks
+ *
+ * Similar to block_truncate_page(), this function can be called by a
+ * filesystem when it is truncating an DAX file to handle the partial page.
+ *
+ * We work in terms of PAGE_CACHE_SIZE here for commonality with
+ * block_truncate_page(), but we could go down to PAGE_SIZE if the filesystem
+ * took care of disposing of the unnecessary blocks. Even if the filesystem
+ * block size is smaller than PAGE_SIZE, we have to zero the rest of the page
+ * since the file might be mmaped.
+ */
+int dax_truncate_page(struct inode *inode, loff_t from, get_block_t get_block)
+{
+ struct buffer_head bh;
+ pgoff_t index = from >> PAGE_CACHE_SHIFT;
+ unsigned offset = from & (PAGE_CACHE_SIZE-1);
+ unsigned length = PAGE_CACHE_ALIGN(from) - from;
+ int err;
+
+ /* Block boundary? Nothing to do */
+ if (!length)
+ return 0;
+
+ memset(&bh, 0, sizeof(bh));
+ bh.b_size = PAGE_CACHE_SIZE;
+ err = get_block(inode, index, &bh, 0);
+ if (err < 0)
+ return err;
+ if (buffer_written(&bh)) {
+ void *addr;
+ err = dax_get_addr(inode, &bh, &addr);
+ if (err)
+ return err;
+ memset(addr + offset, 0, length);
+ }
+
+ return 0;
+}
+EXPORT_SYMBOL_GPL(dax_truncate_page);
diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
index f128ebf..252481f 100644
--- a/fs/ext2/inode.c
+++ b/fs/ext2/inode.c
@@ -1207,7 +1207,7 @@ static int ext2_setsize(struct inode *inode, loff_t newsize)
inode_dio_wait(inode);

if (IS_DAX(inode))
- error = xip_truncate_page(inode->i_mapping, newsize);
+ error = dax_truncate_page(inode, newsize, ext2_get_block);
else if (test_opt(inode->i_sb, NOBH))
error = nobh_truncate_page(inode->i_mapping,
newsize, ext2_get_block);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 00ad95e..02eeeb7 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2518,13 +2518,13 @@ extern int generic_file_open(struct inode * inode, struct file * filp);
extern int nonseekable_open(struct inode * inode, struct file * filp);

#ifdef CONFIG_FS_XIP
-extern int xip_truncate_page(struct address_space *mapping, loff_t from);
+int dax_truncate_page(struct inode *, loff_t from, get_block_t);
ssize_t dax_do_io(int rw, struct kiocb *, struct inode *, const struct iovec *,
loff_t, unsigned segs, get_block_t, dio_iodone_t, int flags);
int dax_fault(struct vm_area_struct *, struct vm_fault *, get_block_t);
int dax_mkwrite(struct vm_area_struct *, struct vm_fault *, get_block_t);
#else
-static inline int xip_truncate_page(struct address_space *mapping, loff_t from)
+static inline int dax_truncate_page(struct inode *i, loff_t frm, get_block_t gb)
{
return 0;
}
diff --git a/mm/filemap_xip.c b/mm/filemap_xip.c
index 9dd45f3..6316578 100644
--- a/mm/filemap_xip.c
+++ b/mm/filemap_xip.c
@@ -21,43 +21,3 @@
#include <asm/tlbflush.h>
#include <asm/io.h>

-/*
- * truncate a page used for execute in place
- * functionality is analog to block_truncate_page but does use get_xip_mem
- * to get the page instead of page cache
- */
-int
-xip_truncate_page(struct address_space *mapping, loff_t from)
-{
- pgoff_t index = from >> PAGE_CACHE_SHIFT;
- unsigned offset = from & (PAGE_CACHE_SIZE-1);
- unsigned blocksize;
- unsigned length;
- void *xip_mem;
- unsigned long xip_pfn;
- int err;
-
- BUG_ON(!mapping->a_ops->get_xip_mem);
-
- blocksize = 1 << mapping->host->i_blkbits;
- length = offset & (blocksize - 1);
-
- /* Block boundary? Nothing to do */
- if (!length)
- return 0;
-
- length = blocksize - length;
-
- err = mapping->a_ops->get_xip_mem(mapping, index, 0,
- &xip_mem, &xip_pfn);
- if (unlikely(err)) {
- if (err == -ENODATA)
- /* Hole? No need to truncate */
- return 0;
- else
- return err;
- }
- memset(xip_mem + offset, 0, length);
- return 0;
-}
-EXPORT_SYMBOL_GPL(xip_truncate_page);
--
1.8.5.3

2014-02-25 14:23:33

by Matthew Wilcox

Subject: [PATCH v6 14/22] ext2: Remove xip.c and xip.h

These files are now empty, so delete them.

Signed-off-by: Matthew Wilcox <[email protected]>
---
fs/ext2/Makefile | 1 -
fs/ext2/inode.c | 1 -
fs/ext2/namei.c | 1 -
fs/ext2/super.c | 1 -
fs/ext2/xip.c | 15 ---------------
fs/ext2/xip.h | 16 ----------------
6 files changed, 35 deletions(-)
delete mode 100644 fs/ext2/xip.c
delete mode 100644 fs/ext2/xip.h

diff --git a/fs/ext2/Makefile b/fs/ext2/Makefile
index f42af45..445b0e9 100644
--- a/fs/ext2/Makefile
+++ b/fs/ext2/Makefile
@@ -10,4 +10,3 @@ ext2-y := balloc.o dir.o file.o ialloc.o inode.o \
ext2-$(CONFIG_EXT2_FS_XATTR) += xattr.o xattr_user.o xattr_trusted.o
ext2-$(CONFIG_EXT2_FS_POSIX_ACL) += acl.o
ext2-$(CONFIG_EXT2_FS_SECURITY) += xattr_security.o
-ext2-$(CONFIG_EXT2_FS_XIP) += xip.o
diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
index 2e587e2..67124f0 100644
--- a/fs/ext2/inode.c
+++ b/fs/ext2/inode.c
@@ -34,7 +34,6 @@
#include <linux/aio.h>
#include "ext2.h"
#include "acl.h"
-#include "xip.h"
#include "xattr.h"

static int __ext2_write_inode(struct inode *inode, int do_sync);
diff --git a/fs/ext2/namei.c b/fs/ext2/namei.c
index 846c356..7ca803f 100644
--- a/fs/ext2/namei.c
+++ b/fs/ext2/namei.c
@@ -35,7 +35,6 @@
#include "ext2.h"
#include "xattr.h"
#include "acl.h"
-#include "xip.h"

static inline int ext2_add_nondir(struct dentry *dentry, struct inode *inode)
{
diff --git a/fs/ext2/super.c b/fs/ext2/super.c
index 3a1db39..752ccb4 100644
--- a/fs/ext2/super.c
+++ b/fs/ext2/super.c
@@ -35,7 +35,6 @@
#include "ext2.h"
#include "xattr.h"
#include "acl.h"
-#include "xip.h"

static void ext2_sync_super(struct super_block *sb,
struct ext2_super_block *es, int wait);
diff --git a/fs/ext2/xip.c b/fs/ext2/xip.c
deleted file mode 100644
index 66ca113..0000000
--- a/fs/ext2/xip.c
+++ /dev/null
@@ -1,15 +0,0 @@
-/*
- * linux/fs/ext2/xip.c
- *
- * Copyright (C) 2005 IBM Corporation
- * Author: Carsten Otte ([email protected])
- */
-
-#include <linux/mm.h>
-#include <linux/fs.h>
-#include <linux/genhd.h>
-#include <linux/buffer_head.h>
-#include <linux/blkdev.h>
-#include "ext2.h"
-#include "xip.h"
-
diff --git a/fs/ext2/xip.h b/fs/ext2/xip.h
deleted file mode 100644
index 87eeb04..0000000
--- a/fs/ext2/xip.h
+++ /dev/null
@@ -1,16 +0,0 @@
-/*
- * linux/fs/ext2/xip.h
- *
- * Copyright (C) 2005 IBM Corporation
- * Author: Carsten Otte ([email protected])
- */
-
-#ifdef CONFIG_EXT2_FS_XIP
-static inline int ext2_use_xip (struct super_block *sb)
-{
- struct ext2_sb_info *sbi = EXT2_SB(sb);
- return (sbi->s_mount_opt & EXT2_MOUNT_XIP);
-}
-#else
-#define ext2_use_xip(sb) 0
-#endif
--
1.8.5.3

2014-02-25 14:23:57

by Matthew Wilcox

Subject: [PATCH v6 15/22] Remove CONFIG_EXT2_FS_XIP and rename CONFIG_FS_XIP to CONFIG_FS_DAX

The fewer Kconfig options we have the better. Use the generic
CONFIG_FS_DAX to enable XIP support in ext2 as well as in the core.

Signed-off-by: Matthew Wilcox <[email protected]>
---
fs/Kconfig | 21 ++++++++++++++-------
fs/Makefile | 2 +-
fs/ext2/Kconfig | 11 -----------
fs/ext2/ext2.h | 2 +-
fs/ext2/file.c | 4 ++--
fs/ext2/super.c | 4 ++--
include/linux/fs.h | 4 ++--
7 files changed, 22 insertions(+), 26 deletions(-)

diff --git a/fs/Kconfig b/fs/Kconfig
index 7385e54..620ab73 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -13,13 +13,6 @@ if BLOCK
source "fs/ext2/Kconfig"
source "fs/ext3/Kconfig"
source "fs/ext4/Kconfig"
-
-config FS_XIP
-# execute in place
- bool
- depends on EXT2_FS_XIP
- default y
-
source "fs/jbd/Kconfig"
source "fs/jbd2/Kconfig"

@@ -40,6 +33,20 @@ source "fs/ocfs2/Kconfig"
source "fs/btrfs/Kconfig"
source "fs/nilfs2/Kconfig"

+config FS_DAX
+ bool "Direct Access support"
+ depends on MMU
+ help
+ Direct Access (DAX) can be used on memory-backed block devices.
+ If the block device supports DAX and the filesystem supports DAX,
+ then you can avoid using the pagecache to buffer I/Os. Turning
+ on this option will compile in support for DAX; you will need to
+ mount the filesystem using the -o xip option.
+
+ If you do not have a block device that is capable of using this,
+ or if unsure, say N. Saying Y will increase the size of the kernel
+ by about 2kB.
+
endif # BLOCK

# Posix ACL utility routines
diff --git a/fs/Makefile b/fs/Makefile
index 2f194cd..b7e0a13 100644
--- a/fs/Makefile
+++ b/fs/Makefile
@@ -29,7 +29,7 @@ obj-$(CONFIG_SIGNALFD) += signalfd.o
obj-$(CONFIG_TIMERFD) += timerfd.o
obj-$(CONFIG_EVENTFD) += eventfd.o
obj-$(CONFIG_AIO) += aio.o
-obj-$(CONFIG_FS_XIP) += dax.o
+obj-$(CONFIG_FS_DAX) += dax.o
obj-$(CONFIG_FILE_LOCKING) += locks.o
obj-$(CONFIG_COMPAT) += compat.o compat_ioctl.o
obj-$(CONFIG_BINFMT_AOUT) += binfmt_aout.o
diff --git a/fs/ext2/Kconfig b/fs/ext2/Kconfig
index 14a6780..c634874e 100644
--- a/fs/ext2/Kconfig
+++ b/fs/ext2/Kconfig
@@ -42,14 +42,3 @@ config EXT2_FS_SECURITY

If you are not using a security module that requires using
extended attributes for file security labels, say N.
-
-config EXT2_FS_XIP
- bool "Ext2 execute in place support"
- depends on EXT2_FS && MMU
- help
- Execute in place can be used on memory-backed block devices. If you
- enable this option, you can select to mount block devices which are
- capable of this feature without using the page cache.
-
- If you do not use a block device that is capable of using this,
- or if unsure, say N.
diff --git a/fs/ext2/ext2.h b/fs/ext2/ext2.h
index 5ecf570..b30c3bd 100644
--- a/fs/ext2/ext2.h
+++ b/fs/ext2/ext2.h
@@ -380,7 +380,7 @@ struct ext2_inode {
#define EXT2_MOUNT_NO_UID32 0x000200 /* Disable 32-bit UIDs */
#define EXT2_MOUNT_XATTR_USER 0x004000 /* Extended user attributes */
#define EXT2_MOUNT_POSIX_ACL 0x008000 /* POSIX Access Control Lists */
-#ifdef CONFIG_FS_XIP
+#ifdef CONFIG_FS_DAX
#define EXT2_MOUNT_XIP 0x010000 /* Execute in place */
#else
#define EXT2_MOUNT_XIP 0
diff --git a/fs/ext2/file.c b/fs/ext2/file.c
index e3ce10d..ae7f000 100644
--- a/fs/ext2/file.c
+++ b/fs/ext2/file.c
@@ -25,7 +25,7 @@
#include "xattr.h"
#include "acl.h"

-#ifdef CONFIG_EXT2_FS_XIP
+#ifdef CONFIG_FS_DAX
static int ext2_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
{
return dax_fault(vma, vmf, ext2_get_block);
@@ -109,7 +109,7 @@ const struct file_operations ext2_file_operations = {
.splice_write = generic_file_splice_write,
};

-#ifdef CONFIG_EXT2_FS_XIP
+#ifdef CONFIG_FS_DAX
const struct file_operations ext2_xip_file_operations = {
.llseek = generic_file_llseek,
.read = do_sync_read,
diff --git a/fs/ext2/super.c b/fs/ext2/super.c
index 752ccb4..fdcacf7 100644
--- a/fs/ext2/super.c
+++ b/fs/ext2/super.c
@@ -287,7 +287,7 @@ static int ext2_show_options(struct seq_file *seq, struct dentry *root)
seq_puts(seq, ",grpquota");
#endif

-#if defined(CONFIG_EXT2_FS_XIP)
+#ifdef CONFIG_FS_DAX
if (sbi->s_mount_opt & EXT2_MOUNT_XIP)
seq_puts(seq, ",xip");
#endif
@@ -549,7 +549,7 @@ static int parse_options(char *options, struct super_block *sb)
break;
#endif
case Opt_xip:
-#ifdef CONFIG_EXT2_FS_XIP
+#ifdef CONFIG_FS_DAX
set_opt (sbi->s_mount_opt, XIP);
#else
ext2_msg(sb, KERN_INFO, "xip option not supported");
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 6f7f445..19abdb1 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1677,7 +1677,7 @@ struct super_operations {
#define IS_IMA(inode) ((inode)->i_flags & S_IMA)
#define IS_AUTOMOUNT(inode) ((inode)->i_flags & S_AUTOMOUNT)
#define IS_NOSEC(inode) ((inode)->i_flags & S_NOSEC)
-#ifdef CONFIG_FS_XIP
+#ifdef CONFIG_FS_DAX
#define IS_DAX(inode) ((inode)->i_flags & S_DAX)
#else
#define IS_DAX(inode) 0
@@ -2515,7 +2515,7 @@ extern loff_t fixed_size_llseek(struct file *file, loff_t offset,
extern int generic_file_open(struct inode * inode, struct file * filp);
extern int nonseekable_open(struct inode * inode, struct file * filp);

-#ifdef CONFIG_FS_XIP
+#ifdef CONFIG_FS_DAX
int dax_clear_blocks(struct inode *, sector_t block, long size);
int dax_truncate_page(struct inode *, loff_t from, get_block_t);
ssize_t dax_do_io(int rw, struct kiocb *, struct inode *, const struct iovec *,
--
1.8.5.3

2014-02-25 14:24:22

by Matthew Wilcox

Subject: [PATCH v6 05/22] Introduce IS_DAX(inode)

Use an inode flag to tag inodes which should avoid using the page cache.
Convert ext2 to use it instead of mapping_is_xip().

Signed-off-by: Matthew Wilcox <[email protected]>
---
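Not part of the patch: a minimal user-space sketch of the flag pattern
this change introduces, for readers who want to see it in isolation.
The value of S_DAX is taken from the diff below. Everything prefixed
demo_ or DEMO_ (struct demo_inode, DEMO_FS_DAX, DEMO_S_SYNC,
demo_set_inode_flags) is a hypothetical stand-in, not a kernel name.

```c
#include <assert.h>

#define S_DAX		8192	/* value from this patch */
#define DEMO_S_SYNC	1	/* hypothetical stand-in for another i_flags bit */

/* Hypothetical stand-in for the Kconfig symbol; build with -DDEMO_FS_DAX=0
 * to see IS_DAX() collapse to a constant 0. */
#ifndef DEMO_FS_DAX
#define DEMO_FS_DAX 1
#endif

struct demo_inode {
	unsigned int i_flags;
};

#if DEMO_FS_DAX
#define IS_DAX(inode)	((inode)->i_flags & S_DAX)
#else
#define IS_DAX(inode)	0
#endif

/*
 * Same shape as ext2_set_inode_flags(): clear every bit this function
 * owns, then set the ones that currently apply.  S_DAX has to be in the
 * clear mask, otherwise a stale bit would survive a second call.
 */
static void demo_set_inode_flags(struct demo_inode *inode, int sync_fl,
				 int mounted_with_dax)
{
	inode->i_flags &= ~(DEMO_S_SYNC | S_DAX);
	if (sync_fl)
		inode->i_flags |= DEMO_S_SYNC;
	if (mounted_with_dax)
		inode->i_flags |= S_DAX;
}

/* Small self-check; call it from main() to exercise the sketch. */
static void demo_self_check(void)
{
	struct demo_inode ino = { 0 };

	demo_set_inode_flags(&ino, 0, 1);
	assert(ino.i_flags & S_DAX);
	demo_set_inode_flags(&ino, 1, 0);
	assert(ino.i_flags == DEMO_S_SYNC);
}
```

Because the disabled variant of IS_DAX() is a literal 0, callers such as
"if (IS_DAX(inode))" need no #ifdef of their own; the compiler can drop
the dead branch.
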
fs/ext2/inode.c | 9 ++++++---
fs/ext2/xip.h | 2 --
include/linux/fs.h | 6 ++++++
3 files changed, 12 insertions(+), 5 deletions(-)

diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
index 94ed3684..e7d3192 100644
--- a/fs/ext2/inode.c
+++ b/fs/ext2/inode.c
@@ -731,7 +731,7 @@ static int ext2_get_blocks(struct inode *inode,
goto cleanup;
}

- if (ext2_use_xip(inode->i_sb)) {
+ if (IS_DAX(inode)) {
/*
* we need to clear the block
*/
@@ -1201,7 +1201,7 @@ static int ext2_setsize(struct inode *inode, loff_t newsize)

inode_dio_wait(inode);

- if (mapping_is_xip(inode->i_mapping))
+ if (IS_DAX(inode))
error = xip_truncate_page(inode->i_mapping, newsize);
else if (test_opt(inode->i_sb, NOBH))
error = nobh_truncate_page(inode->i_mapping,
@@ -1273,7 +1273,8 @@ void ext2_set_inode_flags(struct inode *inode)
{
unsigned int flags = EXT2_I(inode)->i_flags;

- inode->i_flags &= ~(S_SYNC|S_APPEND|S_IMMUTABLE|S_NOATIME|S_DIRSYNC);
+ inode->i_flags &= ~(S_SYNC | S_APPEND | S_IMMUTABLE | S_NOATIME |
+ S_DIRSYNC | S_DAX);
if (flags & EXT2_SYNC_FL)
inode->i_flags |= S_SYNC;
if (flags & EXT2_APPEND_FL)
@@ -1284,6 +1285,8 @@ void ext2_set_inode_flags(struct inode *inode)
inode->i_flags |= S_NOATIME;
if (flags & EXT2_DIRSYNC_FL)
inode->i_flags |= S_DIRSYNC;
+ if (test_opt(inode->i_sb, XIP))
+ inode->i_flags |= S_DAX;
}

/* Propagate flags from i_flags to EXT2_I(inode)->i_flags */
diff --git a/fs/ext2/xip.h b/fs/ext2/xip.h
index 18b34d2..29be737 100644
--- a/fs/ext2/xip.h
+++ b/fs/ext2/xip.h
@@ -16,9 +16,7 @@ static inline int ext2_use_xip (struct super_block *sb)
}
int ext2_get_xip_mem(struct address_space *, pgoff_t, int,
void **, unsigned long *);
-#define mapping_is_xip(map) unlikely(map->a_ops->get_xip_mem)
#else
-#define mapping_is_xip(map) 0
#define ext2_xip_verify_sb(sb) do { } while (0)
#define ext2_use_xip(sb) 0
#define ext2_clear_xip_target(inode, chain) 0
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 6082956..42fe4e5 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1640,6 +1640,7 @@ struct super_operations {
#define S_IMA 1024 /* Inode has an associated IMA struct */
#define S_AUTOMOUNT 2048 /* Automount/referral quasi-directory */
#define S_NOSEC 4096 /* no suid or xattr security attributes */
+#define S_DAX 8192 /* Direct Access, avoiding the page cache */

/*
* Note that nosuid etc flags are inode-specific: setting some file-system
@@ -1677,6 +1678,11 @@ struct super_operations {
#define IS_IMA(inode) ((inode)->i_flags & S_IMA)
#define IS_AUTOMOUNT(inode) ((inode)->i_flags & S_AUTOMOUNT)
#define IS_NOSEC(inode) ((inode)->i_flags & S_NOSEC)
+#ifdef CONFIG_FS_XIP
+#define IS_DAX(inode) ((inode)->i_flags & S_DAX)
+#else
+#define IS_DAX(inode) 0
+#endif

/*
* Inode state bits. Protected by inode->i_lock
--
1.8.5.3

2014-02-25 14:19:17

by Matthew Wilcox

Subject: [PATCH v6 20/22] ext4: Add DAX functionality

From: Ross Zwisler <[email protected]>

This is a port of the DAX functionality found in the current version of
ext2.

Signed-off-by: Ross Zwisler <[email protected]>
Reviewed-by: Andreas Dilger <[email protected]>
[heavily tweaked]
Signed-off-by: Matthew Wilcox <[email protected]>
---
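Not part of the patch: a user-space sketch of the remount check added to
ext4_remount() below, which refuses to change the dax flag on a live
filesystem by XOR-ing the bit back to its previous state. The value of
EXT4_MOUNT_DAX is taken from the diff; demo_keep_dax_bit(),
DEMO_MOUNT_OTHER and the "refused" out-parameter are hypothetical names
used only for illustration.

```c
#include <assert.h>

#define EXT4_MOUNT_DAX		0x00200	/* value from this patch */
#define DEMO_MOUNT_OTHER	0x00100	/* hypothetical unrelated option bit */

/*
 * Compare the option word before and after parsing the remount options.
 * If the two differ in the DAX bit, flip that bit in the new word so it
 * matches the old one again, and report that the change was refused.
 * All other bits in new_opts are left as parsed.
 */
static unsigned long demo_keep_dax_bit(unsigned long old_opts,
				       unsigned long new_opts, int *refused)
{
	*refused = 0;
	if ((new_opts ^ old_opts) & EXT4_MOUNT_DAX) {
		new_opts ^= EXT4_MOUNT_DAX;
		*refused = 1;
	}
	return new_opts;
}

/* Small self-check; call it from main() to exercise the sketch. */
static void demo_self_check(void)
{
	int refused;

	/* Trying to drop dax on remount: the bit is restored. */
	assert(demo_keep_dax_bit(EXT4_MOUNT_DAX, DEMO_MOUNT_OTHER, &refused)
	       == (EXT4_MOUNT_DAX | DEMO_MOUNT_OTHER));
	assert(refused == 1);
}
```

The XOR works in both directions: it restores the bit when a remount
tries to drop dax and clears it when a remount tries to add dax, so a
single test covers both cases.
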
Documentation/filesystems/dax.txt | 1 +
Documentation/filesystems/ext4.txt | 2 ++
fs/ext4/ext4.h | 6 +++++
fs/ext4/file.c | 53 +++++++++++++++++++++++++++++++++-----
fs/ext4/indirect.c | 19 +++++++++-----
fs/ext4/inode.c | 52 ++++++++++++++++++++++++-------------
fs/ext4/namei.c | 10 +++++--
fs/ext4/super.c | 39 +++++++++++++++++++++++++++-
8 files changed, 149 insertions(+), 33 deletions(-)

diff --git a/Documentation/filesystems/dax.txt b/Documentation/filesystems/dax.txt
index e5706cc..619dab5 100644
--- a/Documentation/filesystems/dax.txt
+++ b/Documentation/filesystems/dax.txt
@@ -66,6 +66,7 @@ or a write()) work correctly.

These filesystems may be used for inspiration:
- ext2: the second extended filesystem, see Documentation/filesystems/ext2.txt
+- ext4: the fourth extended filesystem, see Documentation/filesystems/ext4.txt


Shortcomings
diff --git a/Documentation/filesystems/ext4.txt b/Documentation/filesystems/ext4.txt
index 919a329..9c511c4 100644
--- a/Documentation/filesystems/ext4.txt
+++ b/Documentation/filesystems/ext4.txt
@@ -386,6 +386,8 @@ max_dir_size_kb=n This limits the size of directories so that any
i_version Enable 64-bit inode version support. This option is
off by default.

+dax Use direct access if possible
+
Data Mode
=========
There are 3 different data modes:
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index e025c29..00e9b79 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -966,6 +966,11 @@ struct ext4_inode_info {
#define EXT4_MOUNT_ERRORS_MASK 0x00070
#define EXT4_MOUNT_MINIX_DF 0x00080 /* Mimics the Minix statfs */
#define EXT4_MOUNT_NOLOAD 0x00100 /* Don't use existing journal*/
+#ifdef CONFIG_FS_DAX
+#define EXT4_MOUNT_DAX 0x00200 /* Execute in place */
+#else
+#define EXT4_MOUNT_DAX 0
+#endif
#define EXT4_MOUNT_DATA_FLAGS 0x00C00 /* Mode for data writes: */
#define EXT4_MOUNT_JOURNAL_DATA 0x00400 /* Write data to journal */
#define EXT4_MOUNT_ORDERED_DATA 0x00800 /* Flush data before commit */
@@ -2581,6 +2586,7 @@ extern const struct file_operations ext4_dir_operations;
/* file.c */
extern const struct inode_operations ext4_file_inode_operations;
extern const struct file_operations ext4_file_operations;
+extern const struct file_operations ext4_dax_file_operations;
extern loff_t ext4_llseek(struct file *file, loff_t offset, int origin);
extern void ext4_unwritten_wait(struct inode *inode);

diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 1a50739..42a8ccd 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -190,7 +190,7 @@ ext4_file_write(struct kiocb *iocb, const struct iovec *iov,
}
}

- if (unlikely(iocb->ki_filp->f_flags & O_DIRECT))
+ if (io_is_direct(iocb->ki_filp))
ret = ext4_file_dio_write(iocb, iov, nr_segs, pos);
else
ret = generic_file_aio_write(iocb, iov, nr_segs, pos);
@@ -198,6 +198,27 @@ ext4_file_write(struct kiocb *iocb, const struct iovec *iov,
return ret;
}

+#ifdef CONFIG_FS_DAX
+static int ext4_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
+{
+ return dax_fault(vma, vmf, ext4_get_block);
+ /* Is this the right get_block? */
+}
+
+static int ext4_dax_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
+{
+ return dax_mkwrite(vma, vmf, ext4_get_block);
+}
+
+static const struct vm_operations_struct ext4_dax_vm_ops = {
+ .fault = ext4_dax_fault,
+ .page_mkwrite = ext4_dax_mkwrite,
+ .remap_pages = generic_file_remap_pages,
+};
+#else
+#define ext4_dax_vm_ops ext4_file_vm_ops
+#endif
+
static const struct vm_operations_struct ext4_file_vm_ops = {
.fault = filemap_fault,
.page_mkwrite = ext4_page_mkwrite,
@@ -206,12 +227,13 @@ static const struct vm_operations_struct ext4_file_vm_ops = {

static int ext4_file_mmap(struct file *file, struct vm_area_struct *vma)
{
- struct address_space *mapping = file->f_mapping;
-
- if (!mapping->a_ops->readpage)
- return -ENOEXEC;
file_accessed(file);
- vma->vm_ops = &ext4_file_vm_ops;
+ if (IS_DAX(file_inode(file))) {
+ vma->vm_ops = &ext4_dax_vm_ops;
+ vma->vm_flags |= VM_MIXEDMAP;
+ } else {
+ vma->vm_ops = &ext4_file_vm_ops;
+ }
return 0;
}

@@ -609,6 +631,25 @@ const struct file_operations ext4_file_operations = {
.fallocate = ext4_fallocate,
};

+#ifdef CONFIG_FS_DAX
+const struct file_operations ext4_dax_file_operations = {
+ .llseek = ext4_llseek,
+ .read = do_sync_read,
+ .write = do_sync_write,
+ .aio_read = generic_file_aio_read,
+ .aio_write = ext4_file_write,
+ .unlocked_ioctl = ext4_ioctl,
+#ifdef CONFIG_COMPAT
+ .compat_ioctl = ext4_compat_ioctl,
+#endif
+ .mmap = ext4_file_mmap,
+ .open = ext4_file_open,
+ .release = ext4_release_file,
+ .fsync = ext4_sync_file,
+ .fallocate = ext4_fallocate,
+};
+#endif
+
const struct inode_operations ext4_file_inode_operations = {
.setattr = ext4_setattr,
.getattr = ext4_getattr,
diff --git a/fs/ext4/indirect.c b/fs/ext4/indirect.c
index 594009f..dbdacef 100644
--- a/fs/ext4/indirect.c
+++ b/fs/ext4/indirect.c
@@ -686,15 +686,22 @@ retry:
inode_dio_done(inode);
goto locked;
}
- ret = __blockdev_direct_IO(rw, iocb, inode,
- inode->i_sb->s_bdev, iov,
- offset, nr_segs,
- ext4_get_block, NULL, NULL, 0);
+ if (IS_DAX(inode))
+ ret = dax_do_io(rw, iocb, inode, iov, offset, nr_segs,
+ ext4_get_block, NULL, 0);
+ else
+ ret = __blockdev_direct_IO(rw, iocb, inode,
+ inode->i_sb->s_bdev, iov, offset,
+ nr_segs, ext4_get_block, NULL, NULL, 0);
inode_dio_done(inode);
} else {
locked:
- ret = blockdev_direct_IO(rw, iocb, inode, iov,
- offset, nr_segs, ext4_get_block);
+ if (IS_DAX(inode))
+ ret = dax_do_io(rw, iocb, inode, iov, offset, nr_segs,
+ ext4_get_block, NULL, 0);
+ else
+ ret = blockdev_direct_IO(rw, iocb, inode, iov,
+ offset, nr_segs, ext4_get_block);

if (unlikely((rw & WRITE) && ret < 0)) {
loff_t isize = i_size_read(inode);
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index ce7341c..9462730 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3140,13 +3140,14 @@ static ssize_t ext4_ext_direct_IO(int rw, struct kiocb *iocb,
get_block_func = ext4_get_block_write;
dio_flags = DIO_LOCKING;
}
- ret = __blockdev_direct_IO(rw, iocb, inode,
- inode->i_sb->s_bdev, iov,
- offset, nr_segs,
- get_block_func,
- ext4_end_io_dio,
- NULL,
- dio_flags);
+ if (IS_DAX(inode))
+ ret = dax_do_io(rw, iocb, inode, iov, offset, nr_segs,
+ get_block_func, ext4_end_io_dio, dio_flags);
+ else
+ ret = __blockdev_direct_IO(rw, iocb, inode,
+ inode->i_sb->s_bdev, iov, offset,
+ nr_segs, get_block_func,
+ ext4_end_io_dio, NULL, dio_flags);

/*
* Put our reference to io_end. This can free the io_end structure e.g.
@@ -3311,14 +3312,7 @@ void ext4_set_aops(struct inode *inode)
inode->i_mapping->a_ops = &ext4_aops;
}

-/*
- * ext4_block_zero_page_range() zeros out a mapping of length 'length'
- * starting from file offset 'from'. The range to be zero'd must
- * be contained with in one block. If the specified range exceeds
- * the end of the block it will be shortened to end of the block
- * that cooresponds to 'from'
- */
-static int ext4_block_zero_page_range(handle_t *handle,
+static int __ext4_block_zero_page_range(handle_t *handle,
struct address_space *mapping, loff_t from, loff_t length)
{
ext4_fsblk_t index = from >> PAGE_CACHE_SHIFT;
@@ -3409,6 +3403,22 @@ unlock:
}

/*
+ * ext4_block_zero_page_range() zeros out a mapping of length 'length'
+ * starting from file offset 'from'. The range to be zero'd must
+ * be contained within one block. If the specified range exceeds
+ * the end of the block it will be shortened to end of the block
+ * that corresponds to 'from'
+ */
+static int ext4_block_zero_page_range(handle_t *handle,
+ struct address_space *mapping, loff_t from, loff_t length)
+{
+ struct inode *inode = mapping->host;
+ if (IS_DAX(inode))
+ return dax_zero_page_range(inode, from, length, ext4_get_block);
+ return __ext4_block_zero_page_range(handle, mapping, from, length);
+}
+
+/*
* ext4_block_truncate_page() zeroes out a mapping from file offset `from'
* up to the end of the block which corresponds to `from'.
* This required during truncate. We need to physically zero the tail end
@@ -3922,7 +3932,8 @@ void ext4_set_inode_flags(struct inode *inode)
{
unsigned int flags = EXT4_I(inode)->i_flags;

- inode->i_flags &= ~(S_SYNC|S_APPEND|S_IMMUTABLE|S_NOATIME|S_DIRSYNC);
+ inode->i_flags &= ~(S_SYNC | S_APPEND | S_IMMUTABLE | S_NOATIME |
+ S_DIRSYNC | S_DAX);
if (flags & EXT4_SYNC_FL)
inode->i_flags |= S_SYNC;
if (flags & EXT4_APPEND_FL)
@@ -3933,6 +3944,8 @@ void ext4_set_inode_flags(struct inode *inode)
inode->i_flags |= S_NOATIME;
if (flags & EXT4_DIRSYNC_FL)
inode->i_flags |= S_DIRSYNC;
+ if (test_opt(inode->i_sb, DAX))
+ inode->i_flags |= S_DAX;
}

/* Propagate flags from i_flags to EXT4_I(inode)->i_flags */
@@ -4184,7 +4197,10 @@ struct inode *ext4_iget(struct super_block *sb, unsigned long ino)

if (S_ISREG(inode->i_mode)) {
inode->i_op = &ext4_file_inode_operations;
- inode->i_fop = &ext4_file_operations;
+ if (test_opt(inode->i_sb, DAX))
+ inode->i_fop = &ext4_dax_file_operations;
+ else
+ inode->i_fop = &ext4_file_operations;
ext4_set_aops(inode);
} else if (S_ISDIR(inode->i_mode)) {
inode->i_op = &ext4_dir_inode_operations;
@@ -4640,7 +4656,7 @@ int ext4_setattr(struct dentry *dentry, struct iattr *attr)
* Truncate pagecache after we've waited for commit
* in data=journal mode to make pages freeable.
*/
- truncate_pagecache(inode, inode->i_size);
+ truncate_pagecache(inode, inode->i_size);
}
/*
* We want to call ext4_truncate() even if attr->ia_size ==
diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c
index d050e04..acb9cca 100644
--- a/fs/ext4/namei.c
+++ b/fs/ext4/namei.c
@@ -2249,7 +2249,10 @@ retry:
err = PTR_ERR(inode);
if (!IS_ERR(inode)) {
inode->i_op = &ext4_file_inode_operations;
- inode->i_fop = &ext4_file_operations;
+ if (test_opt(inode->i_sb, DAX))
+ inode->i_fop = &ext4_dax_file_operations;
+ else
+ inode->i_fop = &ext4_file_operations;
ext4_set_aops(inode);
err = ext4_add_nondir(handle, dentry, inode);
if (!err && IS_DIRSYNC(dir))
@@ -2313,7 +2316,10 @@ retry:
err = PTR_ERR(inode);
if (!IS_ERR(inode)) {
inode->i_op = &ext4_file_inode_operations;
- inode->i_fop = &ext4_file_operations;
+ if (test_opt(inode->i_sb, DAX))
+ inode->i_fop = &ext4_dax_file_operations;
+ else
+ inode->i_fop = &ext4_file_operations;
ext4_set_aops(inode);
d_tmpfile(dentry, inode);
err = ext4_orphan_add(handle, inode);
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 710fed2..c0b7f4c 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -1156,7 +1156,7 @@ enum {
Opt_usrjquota, Opt_grpjquota, Opt_offusrjquota, Opt_offgrpjquota,
Opt_jqfmt_vfsold, Opt_jqfmt_vfsv0, Opt_jqfmt_vfsv1, Opt_quota,
Opt_noquota, Opt_barrier, Opt_nobarrier, Opt_err,
- Opt_usrquota, Opt_grpquota, Opt_i_version,
+ Opt_usrquota, Opt_grpquota, Opt_i_version, Opt_dax,
Opt_stripe, Opt_delalloc, Opt_nodelalloc, Opt_mblk_io_submit,
Opt_nomblk_io_submit, Opt_block_validity, Opt_noblock_validity,
Opt_inode_readahead_blks, Opt_journal_ioprio,
@@ -1218,6 +1218,7 @@ static const match_table_t tokens = {
{Opt_barrier, "barrier"},
{Opt_nobarrier, "nobarrier"},
{Opt_i_version, "i_version"},
+ {Opt_dax, "dax"},
{Opt_stripe, "stripe=%u"},
{Opt_delalloc, "delalloc"},
{Opt_nodelalloc, "nodelalloc"},
@@ -1400,6 +1401,7 @@ static const struct mount_opts {
{Opt_min_batch_time, 0, MOPT_GTE0},
{Opt_inode_readahead_blks, 0, MOPT_GTE0},
{Opt_init_itable, 0, MOPT_GTE0},
+ {Opt_dax, EXT4_MOUNT_DAX, MOPT_SET},
{Opt_stripe, 0, MOPT_GTE0},
{Opt_resuid, 0, MOPT_GTE0},
{Opt_resgid, 0, MOPT_GTE0},
@@ -1638,6 +1640,11 @@ static int handle_mount_opt(struct super_block *sb, char *opt, int token,
}
sbi->s_jquota_fmt = m->mount_opt;
#endif
+#ifndef CONFIG_FS_DAX
+ } else if (token == Opt_dax) {
+ ext4_msg(sb, KERN_INFO, "dax option not supported");
+ return -1;
+#endif
} else {
if (!args->from)
arg = 1;
@@ -3560,6 +3567,11 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
"both data=journal and dioread_nolock");
goto failed_mount;
}
+ if (test_opt(sb, DAX)) {
+ ext4_msg(sb, KERN_ERR, "can't mount with "
+ "both data=journal and dax");
+ goto failed_mount;
+ }
if (test_opt(sb, DELALLOC))
clear_opt(sb, DELALLOC);
}
@@ -3613,6 +3625,19 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
goto failed_mount;
}

+ if (sbi->s_mount_opt & EXT4_MOUNT_DAX) {
+ if (blocksize != PAGE_SIZE) {
+ ext4_msg(sb, KERN_ERR,
+ "error: unsupported blocksize for dax");
+ goto failed_mount;
+ }
+ if (!sb->s_bdev->bd_disk->fops->direct_access) {
+ ext4_msg(sb, KERN_ERR,
+ "error: device does not support dax");
+ goto failed_mount;
+ }
+ }
+
if (sb->s_blocksize != blocksize) {
/* Validate the filesystem blocksize */
if (!sb_set_blocksize(sb, blocksize)) {
@@ -4813,6 +4838,18 @@ static int ext4_remount(struct super_block *sb, int *flags, char *data)
err = -EINVAL;
goto restore_opts;
}
+ if (test_opt(sb, DAX)) {
+ ext4_msg(sb, KERN_ERR, "can't mount with "
+ "both data=journal and dax");
+ err = -EINVAL;
+ goto restore_opts;
+ }
+ }
+
+ if ((sbi->s_mount_opt ^ old_opts.s_mount_opt) & EXT4_MOUNT_DAX) {
+ ext4_msg(sb, KERN_WARNING, "warning: refusing change of "
+ "dax flag with busy inodes while remounting");
+ sbi->s_mount_opt ^= EXT4_MOUNT_DAX;
}

if (sbi->s_mount_flags & EXT4_MF_FS_ABORTED)
--
1.8.5.3

2014-02-25 14:24:51

by Matthew Wilcox

Subject: [PATCH v6 06/22] Replace XIP read and write with DAX I/O

Use the generic AIO infrastructure instead of custom read and write
methods. In addition to giving us support for AIO, this adds the missing
locking between read() and truncate().

Signed-off-by: Matthew Wilcox <[email protected]>
Reviewed-by: Ross Zwisler <[email protected]>
---
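Not part of the patch: a user-space sketch of the zeroing done by
dax_new_buf() in the diff below when a block is newly allocated or
unwritten. The logic mirrors the patch; demo_new_buf(), DEMO_READ and
DEMO_WRITE are hypothetical stand-ins, and loff_t is replaced by
long long so the sketch builds outside the kernel.

```c
#include <assert.h>
#include <string.h>

enum { DEMO_READ = 0, DEMO_WRITE = 1 };	/* stand-ins for READ/WRITE */

/*
 * addr points at the start of a fresh block of 'size' bytes and 'first'
 * is the offset of the I/O within that block.  'offset' and 'end' are
 * the file offsets where the remaining I/O starts and stops.
 */
static void demo_new_buf(unsigned char *addr, unsigned size, unsigned first,
			 long long offset, long long end, int rw)
{
	long long final = end - offset;	/* computed as in the patch */

	assert(first <= size);
	if (rw != DEMO_WRITE) {
		/* A read of a fresh block must see zeroes everywhere. */
		memset(addr, 0, size);
		return;
	}

	/* Write: zero the head of the block in front of the I/O ... */
	if (first > 0)
		memset(addr, 0, first);
	/* ... and everything from 'final' to the end of the block. */
	if (final < size)
		memset(addr + final, 0, (size_t)(size - final));
}

/* Small self-check; call it from main() to exercise the sketch. */
static void demo_self_check(void)
{
	unsigned char buf[16];

	memset(buf, 0xAA, sizeof(buf));
	demo_new_buf(buf, sizeof(buf), 0, 0, 8, DEMO_READ);
	assert(buf[0] == 0 && buf[15] == 0);
}
```

In the write case the zeroing runs before the data is copied in, so any
overlap between the zeroed tail and the range about to be written is
harmless: the copy overwrites it.
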
fs/Makefile | 1 +
fs/dax.c | 192 +++++++++++++++++++++++++++++++++++++++++++
fs/ext2/file.c | 6 +-
fs/ext2/inode.c | 7 +-
include/linux/fs.h | 18 ++++-
mm/filemap.c | 6 +-
mm/filemap_xip.c | 234 -----------------------------------------------------
7 files changed, 219 insertions(+), 245 deletions(-)
create mode 100644 fs/dax.c

diff --git a/fs/Makefile b/fs/Makefile
index 47ac07b..2f194cd 100644
--- a/fs/Makefile
+++ b/fs/Makefile
@@ -29,6 +29,7 @@ obj-$(CONFIG_SIGNALFD) += signalfd.o
obj-$(CONFIG_TIMERFD) += timerfd.o
obj-$(CONFIG_EVENTFD) += eventfd.o
obj-$(CONFIG_AIO) += aio.o
+obj-$(CONFIG_FS_XIP) += dax.o
obj-$(CONFIG_FILE_LOCKING) += locks.o
obj-$(CONFIG_COMPAT) += compat.o compat_ioctl.o
obj-$(CONFIG_BINFMT_AOUT) += binfmt_aout.o
diff --git a/fs/dax.c b/fs/dax.c
new file mode 100644
index 0000000..81099f9
--- /dev/null
+++ b/fs/dax.c
@@ -0,0 +1,192 @@
+/*
+ * fs/dax.c - Direct Access filesystem code
+ * Copyright (c) 2013-2014 Intel Corporation
+ * Author: Matthew Wilcox <[email protected]>
+ * Author: Ross Zwisler <[email protected]>
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for
+ * more details.
+ */
+
+#include <linux/atomic.h>
+#include <linux/blkdev.h>
+#include <linux/buffer_head.h>
+#include <linux/fs.h>
+#include <linux/genhd.h>
+#include <linux/mutex.h>
+#include <linux/uio.h>
+
+static long dax_get_addr(struct inode *inode, struct buffer_head *bh,
+ void **addr)
+{
+ struct block_device *bdev = bh->b_bdev;
+ const struct block_device_operations *ops = bdev->bd_disk->fops;
+ unsigned long pfn;
+ sector_t sector = bh->b_blocknr << (inode->i_blkbits - 9);
+ return ops->direct_access(bdev, sector, addr, &pfn, bh->b_size);
+}
+
+static void dax_new_buf(void *addr, unsigned size, unsigned first,
+ loff_t offset, loff_t end, int rw)
+{
+ loff_t final = end - offset; /* The final byte in this buffer */
+ if (rw != WRITE) {
+ memset(addr, 0, size);
+ return;
+ }
+
+ if (first > 0)
+ memset(addr, 0, first);
+ if (final < size)
+ memset(addr + final, 0, size - final);
+}
+
+static bool buffer_written(struct buffer_head *bh)
+{
+ return buffer_mapped(bh) && !buffer_unwritten(bh);
+}
+
+static ssize_t dax_io(int rw, struct inode *inode, const struct iovec *iov,
+ loff_t start, loff_t end, get_block_t get_block,
+ struct buffer_head *bh)
+{
+ ssize_t retval = 0;
+ unsigned seg = 0;
+ unsigned len;
+ unsigned copied = 0;
+ loff_t offset = start;
+ loff_t max = start;
+ void *addr;
+ bool hole = false;
+
+ if (rw != WRITE)
+ end = min(end, i_size_read(inode));
+
+ while (offset < end) {
+ void __user *buf = iov[seg].iov_base + copied;
+
+ if (max == offset) {
+ sector_t block = offset >> inode->i_blkbits;
+ unsigned first = offset - (block << inode->i_blkbits);
+ long size;
+ memset(bh, 0, sizeof(*bh));
+ bh->b_size = ALIGN(end - offset, PAGE_SIZE);
+ retval = get_block(inode, block, bh, rw == WRITE);
+ if (retval)
+ break;
+ if (rw == WRITE) {
+ if (!buffer_mapped(bh)) {
+ retval = -EIO;
+ break;
+ }
+ hole = false;
+ } else {
+ hole = !buffer_written(bh);
+ }
+
+ if (hole) {
+ addr = NULL;
+ if (buffer_uptodate(bh))
+ size = bh->b_size - first;
+ else
+ size = (1 << inode->i_blkbits) - first;
+ } else {
+ retval = dax_get_addr(inode, bh, &addr);
+ if (retval < 0)
+ break;
+ if (buffer_unwritten(bh) || buffer_new(bh))
+ dax_new_buf(addr, retval, first,
+ offset, end, rw);
+ addr += first;
+ size = retval - first;
+ }
+ max = min(offset + size, end);
+ }
+
+ len = min_t(unsigned, iov[seg].iov_len - copied, max - offset);
+
+ if (rw == WRITE)
+ len -= __copy_from_user_nocache(addr, buf, len);
+ else if (!hole)
+ len -= __copy_to_user(buf, addr, len);
+ else
+ len -= __clear_user(buf, len);
+
+ if (!len)
+ break;
+
+ offset += len;
+ copied += len;
+ addr += len;
+ if (copied == iov[seg].iov_len) {
+ seg++;
+ copied = 0;
+ }
+ }
+
+ return (offset == start) ? retval : offset - start;
+}
+
+/**
+ * dax_do_io - Perform I/O to a DAX file
+ * @rw: READ to read or WRITE to write
+ * @iocb: The control block for this I/O
+ * @inode: The file which the I/O is directed at
+ * @iov: The user addresses to do I/O from or to
+ * @offset: The file offset where the I/O starts
+ * @nr_segs: The length of the iov array
+ * @get_block: The filesystem method used to translate file offsets to blocks
+ * @end_io: A filesystem callback for I/O completion
+ * @flags: See below
+ *
+ * This function uses the same locking scheme as do_blockdev_direct_IO:
+ * If @flags has DIO_LOCKING set, we assume that the i_mutex is held by the
+ * caller for writes. For reads, we take and release the i_mutex ourselves.
+ * If DIO_LOCKING is not set, the filesystem takes care of its own locking.
+ * As with do_blockdev_direct_IO(), we increment i_dio_count while the I/O
+ * is in progress.
+ */
+ssize_t dax_do_io(int rw, struct kiocb *iocb, struct inode *inode,
+ const struct iovec *iov, loff_t offset, unsigned nr_segs,
+ get_block_t get_block, dio_iodone_t end_io, int flags)
+{
+ struct buffer_head bh;
+ unsigned seg;
+ ssize_t retval = -EINVAL;
+ loff_t end = offset;
+
+ for (seg = 0; seg < nr_segs; seg++)
+ end += iov[seg].iov_len;
+
+ if ((flags & DIO_LOCKING) && (rw == READ)) {
+ struct address_space *mapping = inode->i_mapping;
+ mutex_lock(&inode->i_mutex);
+ retval = filemap_write_and_wait_range(mapping, offset, end - 1);
+ if (retval) {
+ mutex_unlock(&inode->i_mutex);
+ goto out;
+ }
+ }
+
+ /* Protects against truncate */
+ atomic_inc(&inode->i_dio_count);
+
+ retval = dax_io(rw, inode, iov, offset, end, get_block, &bh);
+
+ if ((flags & DIO_LOCKING) && (rw == READ))
+ mutex_unlock(&inode->i_mutex);
+
+ inode_dio_done(inode);
+
+ if ((retval > 0) && end_io)
+ end_io(iocb, offset, retval, bh.b_private);
+ out:
+ return retval;
+}
+EXPORT_SYMBOL_GPL(dax_do_io);
diff --git a/fs/ext2/file.c b/fs/ext2/file.c
index 44c36e5..ef5cf96 100644
--- a/fs/ext2/file.c
+++ b/fs/ext2/file.c
@@ -81,8 +81,10 @@ const struct file_operations ext2_file_operations = {
#ifdef CONFIG_EXT2_FS_XIP
const struct file_operations ext2_xip_file_operations = {
.llseek = generic_file_llseek,
- .read = xip_file_read,
- .write = xip_file_write,
+ .read = do_sync_read,
+ .write = do_sync_write,
+ .aio_read = generic_file_aio_read,
+ .aio_write = generic_file_aio_write,
.unlocked_ioctl = ext2_ioctl,
#ifdef CONFIG_COMPAT
.compat_ioctl = ext2_compat_ioctl,
diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
index e7d3192..f128ebf 100644
--- a/fs/ext2/inode.c
+++ b/fs/ext2/inode.c
@@ -858,7 +858,11 @@ ext2_direct_IO(int rw, struct kiocb *iocb, const struct iovec *iov,
struct inode *inode = mapping->host;
ssize_t ret;

- ret = blockdev_direct_IO(rw, iocb, inode, iov, offset, nr_segs,
+ if (IS_DAX(inode))
+ ret = dax_do_io(rw, iocb, inode, iov, offset, nr_segs,
+ ext2_get_block, NULL, DIO_LOCKING);
+ else
+ ret = blockdev_direct_IO(rw, iocb, inode, iov, offset, nr_segs,
ext2_get_block);
if (ret < 0 && (rw & WRITE))
ext2_write_failed(mapping, offset + iov_length(iov, nr_segs));
@@ -888,6 +892,7 @@ const struct address_space_operations ext2_aops = {
const struct address_space_operations ext2_aops_xip = {
.bmap = ext2_bmap,
.get_xip_mem = ext2_get_xip_mem,
+ .direct_IO = ext2_direct_IO,
};

const struct address_space_operations ext2_nobh_aops = {
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 42fe4e5..07888b9 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2517,17 +2517,22 @@ extern int generic_file_open(struct inode * inode, struct file * filp);
extern int nonseekable_open(struct inode * inode, struct file * filp);

#ifdef CONFIG_FS_XIP
-extern ssize_t xip_file_read(struct file *filp, char __user *buf, size_t len,
- loff_t *ppos);
extern int xip_file_mmap(struct file * file, struct vm_area_struct * vma);
-extern ssize_t xip_file_write(struct file *filp, const char __user *buf,
- size_t len, loff_t *ppos);
extern int xip_truncate_page(struct address_space *mapping, loff_t from);
+ssize_t dax_do_io(int rw, struct kiocb *, struct inode *, const struct iovec *,
+ loff_t, unsigned segs, get_block_t, dio_iodone_t, int flags);
#else
static inline int xip_truncate_page(struct address_space *mapping, loff_t from)
{
return 0;
}
+
+static inline ssize_t dax_do_io(int rw, struct kiocb *iocb, struct inode *inode,
+ const struct iovec *iov, loff_t offset, unsigned nr_segs,
+ get_block_t get_block, dio_iodone_t end_io, int flags)
+{
+ return -ENOTTY;
+}
#endif

#ifdef CONFIG_BLOCK
@@ -2677,6 +2682,11 @@ extern int generic_show_options(struct seq_file *m, struct dentry *root);
extern void save_mount_options(struct super_block *sb, char *options);
extern void replace_mount_options(struct super_block *sb, char *options);

+static inline bool io_is_direct(struct file *filp)
+{
+ return (filp->f_flags & O_DIRECT) || IS_DAX(file_inode(filp));
+}
+
static inline ino_t parent_ino(struct dentry *dentry)
{
ino_t res;
diff --git a/mm/filemap.c b/mm/filemap.c
index 7a13f6a..1b7dff6 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1417,8 +1417,7 @@ generic_file_aio_read(struct kiocb *iocb, const struct iovec *iov,
if (retval)
return retval;

- /* coalesce the iovecs and go direct-to-BIO for O_DIRECT */
- if (filp->f_flags & O_DIRECT) {
+ if (io_is_direct(filp)) {
loff_t size;
struct address_space *mapping;
struct inode *inode;
@@ -2468,8 +2467,7 @@ ssize_t __generic_file_aio_write(struct kiocb *iocb, const struct iovec *iov,
if (err)
goto out;

- /* coalesce the iovecs and go direct-to-BIO for O_DIRECT */
- if (unlikely(file->f_flags & O_DIRECT)) {
+ if (io_is_direct(file)) {
loff_t endbyte;
ssize_t written_buffered;

diff --git a/mm/filemap_xip.c b/mm/filemap_xip.c
index c8d23e9..f7c37a1 100644
--- a/mm/filemap_xip.c
+++ b/mm/filemap_xip.c
@@ -42,119 +42,6 @@ static struct page *xip_sparse_page(void)
}

/*
- * This is a file read routine for execute in place files, and uses
- * the mapping->a_ops->get_xip_mem() function for the actual low-level
- * stuff.
- *
- * Note the struct file* is not used at all. It may be NULL.
- */
-static ssize_t
-do_xip_mapping_read(struct address_space *mapping,
- struct file_ra_state *_ra,
- struct file *filp,
- char __user *buf,
- size_t len,
- loff_t *ppos)
-{
- struct inode *inode = mapping->host;
- pgoff_t index, end_index;
- unsigned long offset;
- loff_t isize, pos;
- size_t copied = 0, error = 0;
-
- BUG_ON(!mapping->a_ops->get_xip_mem);
-
- pos = *ppos;
- index = pos >> PAGE_CACHE_SHIFT;
- offset = pos & ~PAGE_CACHE_MASK;
-
- isize = i_size_read(inode);
- if (!isize)
- goto out;
-
- end_index = (isize - 1) >> PAGE_CACHE_SHIFT;
- do {
- unsigned long nr, left;
- void *xip_mem;
- unsigned long xip_pfn;
- int zero = 0;
-
- /* nr is the maximum number of bytes to copy from this page */
- nr = PAGE_CACHE_SIZE;
- if (index >= end_index) {
- if (index > end_index)
- goto out;
- nr = ((isize - 1) & ~PAGE_CACHE_MASK) + 1;
- if (nr <= offset) {
- goto out;
- }
- }
- nr = nr - offset;
- if (nr > len - copied)
- nr = len - copied;
-
- error = mapping->a_ops->get_xip_mem(mapping, index, 0,
- &xip_mem, &xip_pfn);
- if (unlikely(error)) {
- if (error == -ENODATA) {
- /* sparse */
- zero = 1;
- } else
- goto out;
- }
-
- /* If users can be writing to this page using arbitrary
- * virtual addresses, take care about potential aliasing
- * before reading the page on the kernel side.
- */
- if (mapping_writably_mapped(mapping))
- /* address based flush */ ;
-
- /*
- * Ok, we have the mem, so now we can copy it to user space...
- *
- * The actor routine returns how many bytes were actually used..
- * NOTE! This may not be the same as how much of a user buffer
- * we filled up (we may be padding etc), so we can only update
- * "pos" here (the actor routine has to update the user buffer
- * pointers and the remaining count).
- */
- if (!zero)
- left = __copy_to_user(buf+copied, xip_mem+offset, nr);
- else
- left = __clear_user(buf + copied, nr);
-
- if (left) {
- error = -EFAULT;
- goto out;
- }
-
- copied += (nr - left);
- offset += (nr - left);
- index += offset >> PAGE_CACHE_SHIFT;
- offset &= ~PAGE_CACHE_MASK;
- } while (copied < len);
-
-out:
- *ppos = pos + copied;
- if (filp)
- file_accessed(filp);
-
- return (copied ? copied : error);
-}
-
-ssize_t
-xip_file_read(struct file *filp, char __user *buf, size_t len, loff_t *ppos)
-{
- if (!access_ok(VERIFY_WRITE, buf, len))
- return -EFAULT;
-
- return do_xip_mapping_read(filp->f_mapping, &filp->f_ra, filp,
- buf, len, ppos);
-}
-EXPORT_SYMBOL_GPL(xip_file_read);
-
-/*
* __xip_unmap is invoked from xip_unmap and
* xip_write
*
@@ -340,127 +227,6 @@ int xip_file_mmap(struct file * file, struct vm_area_struct * vma)
}
EXPORT_SYMBOL_GPL(xip_file_mmap);

-static ssize_t
-__xip_file_write(struct file *filp, const char __user *buf,
- size_t count, loff_t pos, loff_t *ppos)
-{
- struct address_space * mapping = filp->f_mapping;
- const struct address_space_operations *a_ops = mapping->a_ops;
- struct inode *inode = mapping->host;
- long status = 0;
- size_t bytes;
- ssize_t written = 0;
-
- BUG_ON(!mapping->a_ops->get_xip_mem);
-
- do {
- unsigned long index;
- unsigned long offset;
- size_t copied;
- void *xip_mem;
- unsigned long xip_pfn;
-
- offset = (pos & (PAGE_CACHE_SIZE -1)); /* Within page */
- index = pos >> PAGE_CACHE_SHIFT;
- bytes = PAGE_CACHE_SIZE - offset;
- if (bytes > count)
- bytes = count;
-
- status = a_ops->get_xip_mem(mapping, index, 0,
- &xip_mem, &xip_pfn);
- if (status == -ENODATA) {
- /* we allocate a new page unmap it */
- mutex_lock(&xip_sparse_mutex);
- status = a_ops->get_xip_mem(mapping, index, 1,
- &xip_mem, &xip_pfn);
- mutex_unlock(&xip_sparse_mutex);
- if (!status)
- /* unmap page at pgoff from all other vmas */
- __xip_unmap(mapping, index);
- }
-
- if (status)
- break;
-
- copied = bytes -
- __copy_from_user_nocache(xip_mem + offset, buf, bytes);
-
- if (likely(copied > 0)) {
- status = copied;
-
- if (status >= 0) {
- written += status;
- count -= status;
- pos += status;
- buf += status;
- }
- }
- if (unlikely(copied != bytes))
- if (status >= 0)
- status = -EFAULT;
- if (status < 0)
- break;
- } while (count);
- *ppos = pos;
- /*
- * No need to use i_size_read() here, the i_size
- * cannot change under us because we hold i_mutex.
- */
- if (pos > inode->i_size) {
- i_size_write(inode, pos);
- mark_inode_dirty(inode);
- }
-
- return written ? written : status;
-}
-
-ssize_t
-xip_file_write(struct file *filp, const char __user *buf, size_t len,
- loff_t *ppos)
-{
- struct address_space *mapping = filp->f_mapping;
- struct inode *inode = mapping->host;
- size_t count;
- loff_t pos;
- ssize_t ret;
-
- mutex_lock(&inode->i_mutex);
-
- if (!access_ok(VERIFY_READ, buf, len)) {
- ret=-EFAULT;
- goto out_up;
- }
-
- pos = *ppos;
- count = len;
-
- /* We can write back this queue in page reclaim */
- current->backing_dev_info = mapping->backing_dev_info;
-
- ret = generic_write_checks(filp, &pos, &count, S_ISBLK(inode->i_mode));
- if (ret)
- goto out_backing;
- if (count == 0)
- goto out_backing;
-
- ret = file_remove_suid(filp);
- if (ret)
- goto out_backing;
-
- ret = file_update_time(filp);
- if (ret)
- goto out_backing;
-
- ret = __xip_file_write (filp, buf, count, pos, ppos);
-
- out_backing:
- current->backing_dev_info = NULL;
- out_up:
- mutex_unlock(&inode->i_mutex);
- return ret;
-}
-EXPORT_SYMBOL_GPL(xip_file_write);
-
/*
* truncate a page used for execute in place
* functionality is analog to block_truncate_page but does use get_xip_mem
--
1.8.5.3
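
The dispatch this patch sets up can be sketched in userspace as below.
This is only a model, not kernel code: the struct layouts, the MODEL_*
flag values and the model_* function names are all hypothetical
stand-ins. What it mirrors from the diff is the shape of io_is_direct()
(O_DIRECT opens and DAX inodes both take the direct path) and of
ext2_direct_IO() (DAX inodes are routed to dax_do_io(), everything else
to blockdev_direct_IO()).

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the kernel's O_DIRECT and S_DAX bits. */
#define MODEL_O_DIRECT 0x1u
#define MODEL_S_DAX    0x2u

struct model_inode {
	unsigned int i_flags;
};

struct model_file {
	unsigned int f_flags;
	struct model_inode *inode;
};

enum model_path {
	PATH_PAGE_CACHE,	/* ordinary buffered I/O */
	PATH_BLOCKDEV_DIRECT,	/* blockdev_direct_IO() in the patch */
	PATH_DAX		/* dax_do_io() in the patch */
};

/* Same predicate shape as io_is_direct() in the diff above. */
bool model_io_is_direct(const struct model_file *filp)
{
	return (filp->f_flags & MODEL_O_DIRECT) ||
	       (filp->inode->i_flags & MODEL_S_DAX);
}

/* Which path a read or write would take in this model. */
enum model_path model_pick_path(const struct model_file *filp)
{
	if (!model_io_is_direct(filp))
		return PATH_PAGE_CACHE;
	/* Same branch shape as ext2_direct_IO(): DAX wins over O_DIRECT. */
	if (filp->inode->i_flags & MODEL_S_DAX)
		return PATH_DAX;
	return PATH_BLOCKDEV_DIRECT;
}
```

In this model a file on a DAX inode takes PATH_DAX whether or not it
was opened with O_DIRECT, which is the point of testing io_is_direct()
rather than the O_DIRECT flag alone.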

2014-02-25 14:19:15

by Matthew Wilcox

Subject: [PATCH v6 04/22] Change direct_access calling convention

In order to support accesses to larger chunks of memory, pass in a
'size' parameter (counted in bytes), and return the amount available at
that address.

Signed-off-by: Matthew Wilcox <[email protected]>
---
Documentation/filesystems/xip.txt | 15 +++++++++------
arch/powerpc/sysdev/axonram.c | 6 +++---
drivers/block/brd.c | 8 +++++---
drivers/s390/block/dcssblk.c | 19 ++++++++++---------
fs/ext2/xip.c | 30 +++++++++++++-----------------
include/linux/blkdev.h | 4 ++--
6 files changed, 42 insertions(+), 40 deletions(-)

diff --git a/Documentation/filesystems/xip.txt b/Documentation/filesystems/xip.txt
index 0466ee5..b62eabf 100644
--- a/Documentation/filesystems/xip.txt
+++ b/Documentation/filesystems/xip.txt
@@ -28,12 +28,15 @@ Implementation
Execute-in-place is implemented in three steps: block device operation,
address space operation, and file operations.

-A block device operation named direct_access is used to retrieve a
-reference (pointer) to a block on-disk. The reference is supposed to be
-cpu-addressable, physical address and remain valid until the release operation
-is performed. A struct block_device reference is used to address the device,
-and a sector_t argument is used to identify the individual block. As an
-alternative, memory technology devices can be used for this.
+A block device operation named direct_access is used to translate the
+block device sector number to a page frame number (pfn) that identifies
+the physical page for the memory. It also returns a kernel virtual
+address that can be used to access the memory.
+
+The direct_access method takes a 'size' parameter that indicates the
+number of bytes being requested. The function should return the number
+of bytes that it can provide, although it must not exceed the number of
+bytes requested. It may also return a negative errno if an error occurs.

The block device operation is optional, these block devices support it as of
today:
diff --git a/arch/powerpc/sysdev/axonram.c b/arch/powerpc/sysdev/axonram.c
index 830edc8..1697e29 100644
--- a/arch/powerpc/sysdev/axonram.c
+++ b/arch/powerpc/sysdev/axonram.c
@@ -139,9 +139,9 @@ axon_ram_make_request(struct request_queue *queue, struct bio *bio)
* axon_ram_direct_access - direct_access() method for block device
* @device, @sector, @data: see block_device_operations method
*/
-static int
+static long
axon_ram_direct_access(struct block_device *device, sector_t sector,
- void **kaddr, unsigned long *pfn)
+ void **kaddr, unsigned long *pfn, long size)
{
struct axon_ram_bank *bank = device->bd_disk->private_data;
loff_t offset;
@@ -158,7 +158,7 @@ axon_ram_direct_access(struct block_device *device, sector_t sector,
*kaddr = (void *)(bank->ph_addr + offset);
*pfn = virt_to_phys(*kaddr) >> PAGE_SHIFT;

- return 0;
+ return min_t(unsigned long, size, bank->size - offset);
}

static const struct block_device_operations axon_ram_devops = {
diff --git a/drivers/block/brd.c b/drivers/block/brd.c
index e73b85c..00da60d 100644
--- a/drivers/block/brd.c
+++ b/drivers/block/brd.c
@@ -361,8 +361,8 @@ out:
}

#ifdef CONFIG_BLK_DEV_XIP
-static int brd_direct_access(struct block_device *bdev, sector_t sector,
- void **kaddr, unsigned long *pfn)
+static long brd_direct_access(struct block_device *bdev, sector_t sector,
+ void **kaddr, unsigned long *pfn, long size)
{
struct brd_device *brd = bdev->bd_disk->private_data;
struct page *page;
@@ -379,7 +379,9 @@ static int brd_direct_access(struct block_device *bdev, sector_t sector,
*kaddr = page_address(page);
*pfn = page_to_pfn(page);

- return 0;
+ /* Could optimistically check to see if the next page in the
+ * file is mapped to the next page of physical RAM */
+ return PAGE_SIZE;
}
#endif

diff --git a/drivers/s390/block/dcssblk.c b/drivers/s390/block/dcssblk.c
index ebf41e2..da914b2 100644
--- a/drivers/s390/block/dcssblk.c
+++ b/drivers/s390/block/dcssblk.c
@@ -28,8 +28,8 @@
static int dcssblk_open(struct block_device *bdev, fmode_t mode);
static void dcssblk_release(struct gendisk *disk, fmode_t mode);
static void dcssblk_make_request(struct request_queue *q, struct bio *bio);
-static int dcssblk_direct_access(struct block_device *bdev, sector_t secnum,
- void **kaddr, unsigned long *pfn);
+static long dcssblk_direct_access(struct block_device *bdev, sector_t secnum,
+ void **kaddr, unsigned long *pfn, long size);

static char dcssblk_segments[DCSSBLK_PARM_LEN] = "\0";

@@ -866,25 +866,26 @@ fail:
bio_io_error(bio);
}

-static int
+static long
dcssblk_direct_access (struct block_device *bdev, sector_t secnum,
- void **kaddr, unsigned long *pfn)
+ void **kaddr, unsigned long *pfn, long size)
{
struct dcssblk_dev_info *dev_info;
- unsigned long pgoff;
+ unsigned long offset, dev_sz;

dev_info = bdev->bd_disk->private_data;
if (!dev_info)
return -ENODEV;
+ dev_sz = dev_info->end - dev_info->start;
if (secnum % (PAGE_SIZE/512))
return -EINVAL;
- pgoff = secnum / (PAGE_SIZE / 512);
- if ((pgoff+1)*PAGE_SIZE-1 > dev_info->end - dev_info->start)
+ offset = secnum * 512;
+ if (offset > dev_sz)
return -ERANGE;
- *kaddr = (void *) (dev_info->start+pgoff*PAGE_SIZE);
+ *kaddr = (void *) (dev_info->start + offset);
*pfn = virt_to_phys(*kaddr) >> PAGE_SHIFT;

- return 0;
+ return min_t(unsigned long, size, dev_sz - offset);
}

static void
diff --git a/fs/ext2/xip.c b/fs/ext2/xip.c
index e98171a..fa40091 100644
--- a/fs/ext2/xip.c
+++ b/fs/ext2/xip.c
@@ -13,18 +13,13 @@
#include "ext2.h"
#include "xip.h"

-static inline int
-__inode_direct_access(struct inode *inode, sector_t block,
- void **kaddr, unsigned long *pfn)
+static inline long __inode_direct_access(struct inode *inode, sector_t block,
+ void **kaddr, unsigned long *pfn, long size)
{
struct block_device *bdev = inode->i_sb->s_bdev;
const struct block_device_operations *ops = bdev->bd_disk->fops;
- sector_t sector;
-
- sector = block * (PAGE_SIZE / 512); /* ext2 block to bdev sector */
-
- BUG_ON(!ops->direct_access);
- return ops->direct_access(bdev, sector, kaddr, pfn);
+ sector_t sector = block * (PAGE_SIZE / 512);
+ return ops->direct_access(bdev, sector, kaddr, pfn, size);
}

static inline int
@@ -53,12 +48,13 @@ ext2_clear_xip_target(struct inode *inode, sector_t block)
{
void *kaddr;
unsigned long pfn;
- int rc;
+ long size;

- rc = __inode_direct_access(inode, block, &kaddr, &pfn);
- if (!rc)
- clear_page(kaddr);
- return rc;
+ size = __inode_direct_access(inode, block, &kaddr, &pfn, PAGE_SIZE);
+ if (size < 0)
+ return size;
+ clear_page(kaddr);
+ return 0;
}

void ext2_xip_verify_sb(struct super_block *sb)
@@ -77,7 +73,7 @@ void ext2_xip_verify_sb(struct super_block *sb)
int ext2_get_xip_mem(struct address_space *mapping, pgoff_t pgoff, int create,
void **kmem, unsigned long *pfn)
{
- int rc;
+ long rc;
sector_t block;

/* first, retrieve the sector number */
@@ -86,6 +82,6 @@ int ext2_get_xip_mem(struct address_space *mapping, pgoff_t pgoff, int create,
return rc;

/* retrieve address of the target data */
- rc = __inode_direct_access(mapping->host, block, kmem, pfn);
- return rc;
+ rc = __inode_direct_access(mapping->host, block, kmem, pfn, PAGE_SIZE);
+ return (rc < 0) ? rc : 0;
}
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 4afa4f8..c6f6210 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1560,8 +1560,8 @@ struct block_device_operations {
void (*release) (struct gendisk *, fmode_t);
int (*ioctl) (struct block_device *, fmode_t, unsigned, unsigned long);
int (*compat_ioctl) (struct block_device *, fmode_t, unsigned, unsigned long);
- int (*direct_access) (struct block_device *, sector_t,
- void **, unsigned long *);
+ long (*direct_access) (struct block_device *, sector_t,
+ void **, unsigned long *pfn, long size);
unsigned int (*check_events) (struct gendisk *disk,
unsigned int clearing);
/* ->media_changed() is DEPRECATED, use ->check_events() instead */
--
1.8.5.3
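
A userspace sketch of the new calling convention may help when reading
the driver changes above. It is a model under stated assumptions, not
any driver's code: struct model_ramdev, the MODEL_* constants and the
model_* functions are hypothetical, and the "pfn" here is simply the
page index within the modelled device. What it takes from the patch is
the contract: the caller passes the number of bytes it wants, and the
method returns how many bytes are available at that address (never more
than requested), or a negative errno.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

#define MODEL_SECTOR_SIZE 512L
#define MODEL_PAGE_SHIFT  12

/* Hypothetical memory-backed device: one contiguous region. */
struct model_ramdev {
	unsigned char *base;	/* start of the memory-like storage */
	long size;		/* device size in bytes */
};

/*
 * Model of the new direct_access contract: return the number of bytes
 * available at 'sector', capped at 'size', or a negative errno.
 */
long model_direct_access(struct model_ramdev *dev, long sector,
			 void **kaddr, unsigned long *pfn, long size)
{
	long offset = sector * MODEL_SECTOR_SIZE;
	long avail;

	if (size <= 0 || sector < 0 || offset >= dev->size)
		return -ERANGE;

	*kaddr = dev->base + offset;
	/* Model pfn: page index inside the device, not a real pfn. */
	*pfn = (unsigned long)offset >> MODEL_PAGE_SHIFT;

	avail = dev->size - offset;
	return size < avail ? size : avail;
}

/*
 * Caller side: zero 'len' bytes starting at 'sector', consuming
 * whatever each call hands back. Assumes short returns are a whole
 * number of sectors, which holds for this model.
 */
long model_zero_range(struct model_ramdev *dev, long sector, long len)
{
	long done = 0;

	while (done < len) {
		void *kaddr;
		unsigned long pfn;
		long got = model_direct_access(dev,
				sector + done / MODEL_SECTOR_SIZE,
				&kaddr, &pfn, len - done);

		if (got < 0)
			return done ? done : got;
		memset(kaddr, 0, (size_t)got);
		done += got;
	}
	return done;
}
```

A caller that only ever needs one page, as ext2_clear_xip_target() does
in the patch, can pass PAGE_SIZE and treat any negative return as the
error.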

2014-02-25 14:25:43

by Matthew Wilcox

Subject: [PATCH v6 03/22] axonram: Fix bug in direct_access

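The 'pfn' returned by axonram was completely bogus, and has been since
2008: it was computed from 'kaddr' (the address of the caller's pointer
variable) rather than from '*kaddr' (the address of the memory itself).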

Signed-off-by: Matthew Wilcox <[email protected]>
---
arch/powerpc/sysdev/axonram.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/sysdev/axonram.c b/arch/powerpc/sysdev/axonram.c
index 47b6b9f..830edc8 100644
--- a/arch/powerpc/sysdev/axonram.c
+++ b/arch/powerpc/sysdev/axonram.c
@@ -156,7 +156,7 @@ axon_ram_direct_access(struct block_device *device, sector_t sector,
}

*kaddr = (void *)(bank->ph_addr + offset);
- *pfn = virt_to_phys(kaddr) >> PAGE_SHIFT;
+ *pfn = virt_to_phys(*kaddr) >> PAGE_SHIFT;

return 0;
}
--
1.8.5.3
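
The one-character fix above is easy to miss, so here is a small
userspace illustration of the difference between the two expressions.
Everything in it is hypothetical: model_virt_to_phys() is an identity
stand-in for the kernel's virt_to_phys(), and the two model_*_access()
functions only imitate the last two assignments of
axon_ram_direct_access().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_PAGE_SHIFT 12

/* Identity stand-in for virt_to_phys(); an assumption of this sketch. */
uintptr_t model_virt_to_phys(const void *addr)
{
	return (uintptr_t)addr;
}

/* Pre-fix behaviour: the pfn is derived from the out-parameter itself. */
void model_buggy_access(unsigned char *base, long offset,
			void **kaddr, unsigned long *pfn)
{
	*kaddr = base + offset;
	/* 'kaddr' is where the caller stores its pointer, not the memory. */
	*pfn = (unsigned long)(model_virt_to_phys(kaddr) >> MODEL_PAGE_SHIFT);
}

/* Post-fix behaviour: the pfn is derived from the mapped address. */
void model_fixed_access(unsigned char *base, long offset,
			void **kaddr, unsigned long *pfn)
{
	*kaddr = base + offset;
	*pfn = (unsigned long)(model_virt_to_phys(*kaddr) >> MODEL_PAGE_SHIFT);
}
```

With the buggy variant the returned pfn describes wherever the caller
happened to keep its 'void *' variable (typically its stack), so it has
no relation to the device memory.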

2014-02-25 14:19:13

by Matthew Wilcox

Subject: [PATCH v6 19/22] ext4: Make ext4_block_zero_page_range static

It's only called within inode.c, so make it static, remove its prototype
from ext4.h and move it above all of its callers so it doesn't need a
prototype within inode.c.

Signed-off-by: Matthew Wilcox <[email protected]>
---
fs/ext4/ext4.h | 2 --
fs/ext4/inode.c | 42 +++++++++++++++++++++---------------------
2 files changed, 21 insertions(+), 23 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index d3a534f..e025c29 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -2133,8 +2133,6 @@ extern int ext4_writepage_trans_blocks(struct inode *);
extern int ext4_chunk_trans_blocks(struct inode *, int nrblocks);
extern int ext4_block_truncate_page(handle_t *handle,
struct address_space *mapping, loff_t from);
-extern int ext4_block_zero_page_range(handle_t *handle,
- struct address_space *mapping, loff_t from, loff_t length);
extern int ext4_zero_partial_blocks(handle_t *handle, struct inode *inode,
loff_t lstart, loff_t lend);
extern int ext4_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf);
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 6e39895..ce7341c 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3312,33 +3312,13 @@ void ext4_set_aops(struct inode *inode)
}

/*
- * ext4_block_truncate_page() zeroes out a mapping from file offset `from'
- * up to the end of the block which corresponds to `from'.
- * This required during truncate. We need to physically zero the tail end
- * of that block so it doesn't yield old data if the file is later grown.
- */
-int ext4_block_truncate_page(handle_t *handle,
- struct address_space *mapping, loff_t from)
-{
- unsigned offset = from & (PAGE_CACHE_SIZE-1);
- unsigned length;
- unsigned blocksize;
- struct inode *inode = mapping->host;
-
- blocksize = inode->i_sb->s_blocksize;
- length = blocksize - (offset & (blocksize - 1));
-
- return ext4_block_zero_page_range(handle, mapping, from, length);
-}
-
-/*
* ext4_block_zero_page_range() zeros out a mapping of length 'length'
* starting from file offset 'from'. The range to be zero'd must
* be contained with in one block. If the specified range exceeds
* the end of the block it will be shortened to end of the block
* that cooresponds to 'from'
*/
-int ext4_block_zero_page_range(handle_t *handle,
+static int ext4_block_zero_page_range(handle_t *handle,
struct address_space *mapping, loff_t from, loff_t length)
{
ext4_fsblk_t index = from >> PAGE_CACHE_SHIFT;
@@ -3428,6 +3408,26 @@ unlock:
return err;
}

+/*
+ * ext4_block_truncate_page() zeroes out a mapping from file offset `from'
+ * up to the end of the block which corresponds to `from'.
+ * This required during truncate. We need to physically zero the tail end
+ * of that block so it doesn't yield old data if the file is later grown.
+ */
+int ext4_block_truncate_page(handle_t *handle,
+ struct address_space *mapping, loff_t from)
+{
+ unsigned offset = from & (PAGE_CACHE_SIZE-1);
+ unsigned length;
+ unsigned blocksize;
+ struct inode *inode = mapping->host;
+
+ blocksize = inode->i_sb->s_blocksize;
+ length = blocksize - (offset & (blocksize - 1));
+
+ return ext4_block_zero_page_range(handle, mapping, from, length);
+}
+
int ext4_zero_partial_blocks(handle_t *handle, struct inode *inode,
loff_t lstart, loff_t length)
{
--
1.8.5.3
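
For readers checking the moved function, the length arithmetic in
ext4_block_truncate_page() can be tried out in userspace. This is a
sketch only: MODEL_PAGE_CACHE_SIZE is assumed to be 4096 and
model_truncate_zero_length() is a hypothetical name; the two
expressions are copied from the function body in the diff.

```c
#include <assert.h>

/* Assumed page cache size for this sketch. */
#define MODEL_PAGE_CACHE_SIZE 4096u

/*
 * Number of bytes from file offset 'from' to the end of the block that
 * contains it, using the same expressions as the kernel function.
 */
unsigned model_truncate_zero_length(unsigned long long from,
				    unsigned blocksize)
{
	unsigned offset = (unsigned)(from & (MODEL_PAGE_CACHE_SIZE - 1));

	/* The mask arithmetic only works for power-of-two block sizes. */
	assert(blocksize != 0 && (blocksize & (blocksize - 1)) == 0);

	return blocksize - (offset & (blocksize - 1));
}
```

For example, with 1024-byte blocks and from = 5000, the offset within
the block is 904 and the function returns 120, which takes the zeroed
range up to byte 5120, the next block boundary. When 'from' is already
block-aligned the result is a full block.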

2014-02-25 14:19:09

by Matthew Wilcox

Subject: [PATCH v6 10/22] Remove get_xip_mem

All callers of get_xip_mem() are now gone. Remove checks for it,
initialisers of it, documentation of it and the only implementation of it.

Add documentation for writing a filesystem that supports DAX.

Signed-off-by: Matthew Wilcox <[email protected]>
Reviewed-by: Randy Dunlap <[email protected]>
---
Documentation/filesystems/Locking | 3 --
Documentation/filesystems/dax.txt | 82 +++++++++++++++++++++++++++++++++++++++
Documentation/filesystems/xip.txt | 71 ---------------------------------
fs/exofs/inode.c | 1 -
fs/ext2/inode.c | 1 -
fs/ext2/xip.c | 37 ------------------
fs/ext2/xip.h | 3 --
fs/open.c | 5 +--
include/linux/fs.h | 2 -
mm/fadvise.c | 6 ++-
mm/madvise.c | 2 +-
11 files changed, 88 insertions(+), 125 deletions(-)
create mode 100644 Documentation/filesystems/dax.txt
delete mode 100644 Documentation/filesystems/xip.txt

diff --git a/Documentation/filesystems/Locking b/Documentation/filesystems/Locking
index 5b0c083..2780d47 100644
--- a/Documentation/filesystems/Locking
+++ b/Documentation/filesystems/Locking
@@ -194,8 +194,6 @@ prototypes:
void (*freepage)(struct page *);
int (*direct_IO)(int, struct kiocb *, const struct iovec *iov,
loff_t offset, unsigned long nr_segs);
- int (*get_xip_mem)(struct address_space *, pgoff_t, int, void **,
- unsigned long *);
int (*migratepage)(struct address_space *, struct page *, struct page *);
int (*launder_page)(struct page *);
int (*is_partially_uptodate)(struct page *, read_descriptor_t *, unsigned long);
@@ -220,7 +218,6 @@ invalidatepage: yes
releasepage: yes
freepage: yes
direct_IO:
-get_xip_mem: maybe
migratepage: yes (both)
launder_page: yes
is_partially_uptodate: yes
diff --git a/Documentation/filesystems/dax.txt b/Documentation/filesystems/dax.txt
new file mode 100644
index 0000000..06f84e5
--- /dev/null
+++ b/Documentation/filesystems/dax.txt
@@ -0,0 +1,82 @@
+Execute-in-place for file mappings
+----------------------------------
+
+Motivation
+----------
+
+File mappings are usually performed by mapping page cache pages to
+userspace. In addition, read & write file operations also transfer data
+between the page cache and storage.
+
+For memory backed storage devices that use the block device interface,
+the page cache pages are just copies of the original storage. The
+execute-in-place code removes the extra copy by performing reads and
+writes directly on the memory backed storage device. For file mappings,
+the storage device itself is mapped directly into userspace.
+
+
+Implementation Tips for Block Driver Writers
+--------------------------------------------
+
+To support DAX in your block driver, implement the 'direct_access'
+block device operation. It is used to translate the sector number
+(expressed in units of 512-byte sectors) to a page frame number (pfn)
+that identifies the physical page for the memory. It also returns a
+kernel virtual address that can be used to access the memory.
+
+The direct_access method takes a 'size' parameter that indicates the
+number of bytes being requested. The function should return the number
+of bytes that it can provide, although it must not exceed the number of
+bytes requested. It may also return a negative errno if an error occurs.
+
+In order to support this method, the storage must be byte-accessible by
+the CPU at all times. If your device uses paging techniques to expose
+a large amount of memory through a smaller window, then you cannot
+implement direct_access. Equally, if your device can occasionally
+stall the CPU for an extended period, you should also not attempt to
+implement direct_access.
+
+These block devices may be used for inspiration:
+- axonram: Axon DDR2 device driver
+- brd: RAM backed block device driver
+- dcssblk: s390 dcss block device driver
+
+
+Implementation Tips for Filesystem Writers
+------------------------------------------
+
+Filesystem support consists of
+- adding support to mark inodes as being DAX by setting the S_DAX flag in
+ i_flags
+- implementing the direct_IO address space operation, and calling
+ dax_do_io() instead of blockdev_direct_IO() if S_DAX is set
+- implementing an mmap file operation for DAX files which sets the
+ VM_MIXEDMAP flag on the VMA, and setting the vm_ops to include handlers
+ for fault and page_mkwrite (which should probably call dax_fault() and
+ dax_mkwrite(), passing the appropriate get_block() callback)
+- calling dax_truncate_page() instead of block_truncate_page() for DAX files
+- ensuring that there is sufficient locking between reads, writes,
+ truncates and page faults
+
+The get_block() callback passed to the DAX functions may return
+uninitialised extents. If it does, it must ensure that simultaneous
+calls to get_block() (for example by a page-fault racing with a read()
+or a write()) work correctly.
+
+These filesystems may be used for inspiration:
+- ext2: the second extended filesystem, see Documentation/filesystems/ext2.txt
+
+
+Shortcomings
+------------
+
+Even if the kernel or its modules are stored on a filesystem that supports
+DAX on a block device that supports DAX, they will still be copied into RAM.
+
+Calling get_user_pages() on a range of user memory that has been mmaped
+from a DAX file will fail as there are no 'struct page' to describe
+those pages. This problem is being worked on. That means that O_DIRECT
+reads/writes to those memory ranges from a non-DAX file will fail (note
+that O_DIRECT reads/writes _of a DAX file_ do work, it is the memory
+that is being accessed that is key here). Other things that will not
+work include RDMA, sendfile() and splice().
diff --git a/Documentation/filesystems/xip.txt b/Documentation/filesystems/xip.txt
deleted file mode 100644
index b62eabf..0000000
--- a/Documentation/filesystems/xip.txt
+++ /dev/null
@@ -1,71 +0,0 @@
-Execute-in-place for file mappings
-----------------------------------
-
-Motivation
-----------
-File mappings are performed by mapping page cache pages to userspace. In
-addition, read&write type file operations also transfer data from/to the page
-cache.
-
-For memory backed storage devices that use the block device interface, the page
-cache pages are in fact copies of the original storage. Various approaches
-exist to work around the need for an extra copy. The ramdisk driver for example
-does read the data into the page cache, keeps a reference, and discards the
-original data behind later on.
-
-Execute-in-place solves this issue the other way around: instead of keeping
-data in the page cache, the need to have a page cache copy is eliminated
-completely. With execute-in-place, read&write type operations are performed
-directly from/to the memory backed storage device. For file mappings, the
-storage device itself is mapped directly into userspace.
-
-This implementation was initially written for shared memory segments between
-different virtual machines on s390 hardware to allow multiple machines to
-share the same binaries and libraries.
-
-Implementation
---------------
-Execute-in-place is implemented in three steps: block device operation,
-address space operation, and file operations.
-
-A block device operation named direct_access is used to translate the
-block device sector number to a page frame number (pfn) that identifies
-the physical page for the memory. It also returns a kernel virtual
-address that can be used to access the memory.
-
-The direct_access method takes a 'size' parameter that indicates the
-number of bytes being requested. The function should return the number
-of bytes that it can provide, although it must not exceed the number of
-bytes requested. It may also return a negative errno if an error occurs.
-
-The block device operation is optional, these block devices support it as of
-today:
-- dcssblk: s390 dcss block device driver
-
-An address space operation named get_xip_mem is used to retrieve references
-to a page frame number and a kernel address. To obtain these values a reference
-to an address_space is provided. This function assigns values to the kmem and
-pfn parameters. The third argument indicates whether the function should allocate
-blocks if needed.
-
-This address space operation is mutually exclusive with readpage&writepage that
-do page cache read/write operations.
-The following filesystems support it as of today:
-- ext2: the second extended filesystem, see Documentation/filesystems/ext2.txt
-
-A set of file operations that do utilize get_xip_page can be found in
-mm/filemap_xip.c . The following file operation implementations are provided:
-- aio_read/aio_write
-- readv/writev
-- sendfile
-
-The generic file operations do_sync_read/do_sync_write can be used to implement
-classic synchronous IO calls.
-
-Shortcomings
-------------
-This implementation is limited to storage devices that are cpu addressable at
-all times (no highmem or such). It works well on rom/ram, but enhancements are
-needed to make it work with flash in read+write mode.
-Putting the Linux kernel and/or its modules on a xip filesystem does not mean
-they are not copied.
diff --git a/fs/exofs/inode.c b/fs/exofs/inode.c
index ee4317fa..f9a5bf6 100644
--- a/fs/exofs/inode.c
+++ b/fs/exofs/inode.c
@@ -985,7 +985,6 @@ const struct address_space_operations exofs_aops = {
.direct_IO = exofs_direct_IO,

/* With these NULL has special meaning or default is not exported */
- .get_xip_mem = NULL,
.migratepage = NULL,
.launder_page = NULL,
.is_partially_uptodate = NULL,
diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
index 252481f..b156fe8 100644
--- a/fs/ext2/inode.c
+++ b/fs/ext2/inode.c
@@ -891,7 +891,6 @@ const struct address_space_operations ext2_aops = {

const struct address_space_operations ext2_aops_xip = {
.bmap = ext2_bmap,
- .get_xip_mem = ext2_get_xip_mem,
.direct_IO = ext2_direct_IO,
};

diff --git a/fs/ext2/xip.c b/fs/ext2/xip.c
index fa40091..ca745ff 100644
--- a/fs/ext2/xip.c
+++ b/fs/ext2/xip.c
@@ -22,27 +22,6 @@ static inline long __inode_direct_access(struct inode *inode, sector_t block,
return ops->direct_access(bdev, sector, kaddr, pfn, size);
}

-static inline int
-__ext2_get_block(struct inode *inode, pgoff_t pgoff, int create,
- sector_t *result)
-{
- struct buffer_head tmp;
- int rc;
-
- memset(&tmp, 0, sizeof(struct buffer_head));
- tmp.b_size = 1 << inode->i_blkbits;
- rc = ext2_get_block(inode, pgoff, &tmp, create);
- *result = tmp.b_blocknr;
-
- /* did we get a sparse block (hole in the file)? */
- if (!tmp.b_blocknr && !rc) {
- BUG_ON(create);
- rc = -ENODATA;
- }
-
- return rc;
-}
-
int
ext2_clear_xip_target(struct inode *inode, sector_t block)
{
@@ -69,19 +48,3 @@ void ext2_xip_verify_sb(struct super_block *sb)
"not supported by bdev");
}
}
-
-int ext2_get_xip_mem(struct address_space *mapping, pgoff_t pgoff, int create,
- void **kmem, unsigned long *pfn)
-{
- long rc;
- sector_t block;
-
- /* first, retrieve the sector number */
- rc = __ext2_get_block(mapping->host, pgoff, create, &block);
- if (rc)
- return rc;
-
- /* retrieve address of the target data */
- rc = __inode_direct_access(mapping->host, block, kmem, pfn, PAGE_SIZE);
- return (rc < 0) ? rc : 0;
-}
diff --git a/fs/ext2/xip.h b/fs/ext2/xip.h
index 29be737..0fa8b7f 100644
--- a/fs/ext2/xip.h
+++ b/fs/ext2/xip.h
@@ -14,11 +14,8 @@ static inline int ext2_use_xip (struct super_block *sb)
struct ext2_sb_info *sbi = EXT2_SB(sb);
return (sbi->s_mount_opt & EXT2_MOUNT_XIP);
}
-int ext2_get_xip_mem(struct address_space *, pgoff_t, int,
- void **, unsigned long *);
#else
#define ext2_xip_verify_sb(sb) do { } while (0)
#define ext2_use_xip(sb) 0
#define ext2_clear_xip_target(inode, chain) 0
-#define ext2_get_xip_mem NULL
#endif
diff --git a/fs/open.c b/fs/open.c
index 4b3e1ed..4b16abe 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -665,11 +665,8 @@ int open_check_o_direct(struct file *f)
{
/* NB: we're sure to have correct a_ops only after f_op->open */
if (f->f_flags & O_DIRECT) {
- if (!f->f_mapping->a_ops ||
- ((!f->f_mapping->a_ops->direct_IO) &&
- (!f->f_mapping->a_ops->get_xip_mem))) {
+ if (!f->f_mapping->a_ops || !f->f_mapping->a_ops->direct_IO)
return -EINVAL;
- }
}
return 0;
}
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 02eeeb7..c7945fd 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -372,8 +372,6 @@ struct address_space_operations {
void (*freepage)(struct page *);
ssize_t (*direct_IO)(int, struct kiocb *, const struct iovec *iov,
loff_t offset, unsigned long nr_segs);
- int (*get_xip_mem)(struct address_space *, pgoff_t, int,
- void **, unsigned long *);
/*
* migrate the contents of a page to the specified target. If
* migrate_mode is MIGRATE_ASYNC, it must not block.
diff --git a/mm/fadvise.c b/mm/fadvise.c
index 3bcfd81..1f1925f 100644
--- a/mm/fadvise.c
+++ b/mm/fadvise.c
@@ -28,6 +28,7 @@
SYSCALL_DEFINE4(fadvise64_64, int, fd, loff_t, offset, loff_t, len, int, advice)
{
struct fd f = fdget(fd);
+ struct inode *inode;
struct address_space *mapping;
struct backing_dev_info *bdi;
loff_t endbyte; /* inclusive */
@@ -39,7 +40,8 @@ SYSCALL_DEFINE4(fadvise64_64, int, fd, loff_t, offset, loff_t, len, int, advice)
if (!f.file)
return -EBADF;

- if (S_ISFIFO(file_inode(f.file)->i_mode)) {
+ inode = file_inode(f.file);
+ if (S_ISFIFO(inode->i_mode)) {
ret = -ESPIPE;
goto out;
}
@@ -50,7 +52,7 @@ SYSCALL_DEFINE4(fadvise64_64, int, fd, loff_t, offset, loff_t, len, int, advice)
goto out;
}

- if (mapping->a_ops->get_xip_mem) {
+ if (IS_DAX(inode)) {
switch (advice) {
case POSIX_FADV_NORMAL:
case POSIX_FADV_RANDOM:
diff --git a/mm/madvise.c b/mm/madvise.c
index 539eeb9..b6a2f52 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -236,7 +236,7 @@ static long madvise_willneed(struct vm_area_struct *vma,
if (!file)
return -EBADF;

- if (file->f_mapping->a_ops->get_xip_mem) {
+ if (IS_DAX(file_inode(file))) {
/* no bad return value, but ignore advice */
return 0;
}
--
1.8.5.3

2014-02-26 15:07:25

by Matthew Wilcox

Subject: [PATCH v6 23/22] Bugfixes

On Tue, Feb 25, 2014 at 09:18:16AM -0500, Matthew Wilcox wrote:
> Seven xfstests still fail reliably with DAX that pass reliably without
> DAX: ext4/301 generic/{075,091,112,127,223,263}

With the patches below, we're down to just two additional failures,
ext4/301 and generic/223. I'll fold these patches into the right place
for a v7 of the patchset, but I thought it unsporting to send out a new
version of such a large patchset the day after.

I've now had a review with Kirill of the page fault path. We identified
some ... infelicities in the mm code, but nothing that's worse than the
current XIP code.

commit 714bad38915139f381e28681ab46e2e2f7202556
Author: Matthew Wilcox <[email protected]>
Date: Wed Feb 26 08:00:46 2014 -0500

Only call get_block when necessary

In the dax_io function, we would call get_block when we got to the end
of the current block returned by dax_get_addr(). When using a driver
like PRD, that's fine, but using BRD means that we stop on every page
boundary. The problem is that we lose information from the first call
when doing this. For example, if a write crosses a page boundary, the
first time around the loop the filesystem allocates two pages, BH_New is
set and we zero the start of the first page. The second time around the
loop, the filesystem just returns the existing block with BH_New clear,
so we don't zero the tail of the second page.

This patch adds tracking for how far through the buffer_head we've got,
and will only call get_block() again once we've got to the end of the
previous buffer.

Fixes generic/263. Now generic/{075,091,112,127} also pass; 075 was
looking like the same problem from Ross's investigation.
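
The effect of the extra tracking can be illustrated with a rough user-space model. This is only a sketch, not the kernel code: `count_get_block_calls()`, `extent_bytes`, `chunk_bytes` and `track_bh` are hypothetical names, the model assumes block-aligned offsets and non-zero sizes, and it ignores BH_New handling and zeroing entirely. It only counts how often the filesystem would be asked for a mapping.

```c
#include <assert.h>

/*
 * Count modelled get_block() calls for an I/O over [start, end).
 *
 * extent_bytes: bytes the (hypothetical) filesystem maps per call.
 * chunk_bytes:  contiguous bytes the driver hands back per address
 *               lookup (one page for a BRD-like driver, the whole
 *               range for a PRD-like driver).
 * track_bh:     0 models the old loop, which asks the filesystem again
 *               after every chunk; non-zero models the patched loop,
 *               which asks again only once the previous mapping is
 *               used up.
 *
 * Assumes extent_bytes > 0 and chunk_bytes > 0.
 */
unsigned long count_get_block_calls(unsigned long start, unsigned long end,
				    unsigned long extent_bytes,
				    unsigned long chunk_bytes, int track_bh)
{
	unsigned long offset = start;
	unsigned long bh_max = start;	/* end of the current mapping */
	unsigned long calls = 0;

	while (offset < end) {
		unsigned long next;

		if (!track_bh || offset == bh_max) {
			calls++;
			bh_max = offset + extent_bytes;
		}

		/* Advance by one driver chunk, clamped to mapping and I/O end. */
		next = offset + chunk_bytes;
		if (next > bh_max)
			next = bh_max;
		if (next > end)
			next = end;
		offset = next;
	}
	return calls;
}
```

In this model, a two-page write over a single two-page extent with a page-at-a-time driver costs two calls in the old loop and one in the patched loop, which matches the case described above where the second call loses the BH_New information from the first.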

commit 1f1fd14eb17dc19ecb757f896dd7573af79b5699
Author: Matthew Wilcox <[email protected]>
Date: Tue Feb 25 14:41:30 2014 -0500

Clear new or unwritten blocks in page fault handler

Test generic/263 mmaps the end of the file, writes to it, then checks
the bytes after EOF are zero. They were not being zeroed before, so we
must do it.

commit 4d21ffcf353b8c83a599fe09ae8657ba05da1c76
Author: Matthew Wilcox <[email protected]>
Date: Tue Feb 25 12:06:42 2014 -0500

Initialise cow_page in do_page_mkwrite()

We will end up checking cow_page when turning a hole into a written page,
so it needs to be zero.

fs/dax.c | 47 +++++++++++++++++++++++++++++++++++++----------
mm/memory.c | 1 +
2 files changed, 38 insertions(+), 10 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 6308197..2640db6 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -100,6 +100,19 @@ static bool buffer_written(struct buffer_head *bh)
return buffer_mapped(bh) && !buffer_unwritten(bh);
}

+/*
+ * When ext4 encounters a hole, it likes to return without modifying the
+ * buffer_head which means that we can't trust b_size. To cope with this,
+ * we set b_state to 0 before calling get_block and, if any bit is set, we
+ * know we can trust b_size. Unfortunate, really, since ext4 does know
+ * precisely how long a hole is and would save us time calling get_block
+ * repeatedly.
+ */
+static bool buffer_size_valid(struct buffer_head *bh)
+{
+ return bh->b_state != 0;
+}
+
static ssize_t dax_io(int rw, struct inode *inode, const struct iovec *iov,
loff_t start, loff_t end, get_block_t get_block,
struct buffer_head *bh)
@@ -110,6 +123,7 @@ static ssize_t dax_io(int rw, struct inode *inode, const struct iovec *iov,
unsigned copied = 0;
loff_t offset = start;
loff_t max = start;
+ loff_t bh_max = start;
void *addr;
bool hole = false;

@@ -119,15 +133,27 @@ static ssize_t dax_io(int rw, struct inode *inode, const struct iovec *iov,
while (offset < end) {
void __user *buf = iov[seg].iov_base + copied;

- if (max == offset) {
+ if (offset == max) {
sector_t block = offset >> inode->i_blkbits;
unsigned first = offset - (block << inode->i_blkbits);
long size;
- memset(bh, 0, sizeof(*bh));
- bh->b_size = ALIGN(end - offset, PAGE_SIZE);
- retval = get_block(inode, block, bh, rw == WRITE);
- if (retval)
- break;
+
+ if (offset == bh_max) {
+ bh->b_size = PAGE_ALIGN(end - offset);
+ bh->b_state = 0;
+ retval = get_block(inode, block, bh,
+ rw == WRITE);
+ if (retval)
+ break;
+ if (!buffer_size_valid(bh))
+ bh->b_size = 1 << inode->i_blkbits;
+ bh_max = offset - first + bh->b_size;
+ } else {
+ unsigned done = bh->b_size - (bh_max -
+ (offset - first));
+ bh->b_blocknr += done >> inode->i_blkbits;
+ bh->b_size -= done;
+ }
if (rw == WRITE) {
if (!buffer_mapped(bh)) {
retval = -EIO;
@@ -140,10 +166,7 @@ static ssize_t dax_io(int rw, struct inode *inode, const struct iovec *iov,

if (hole) {
addr = NULL;
- if (buffer_uptodate(bh))
- size = bh->b_size - first;
- else
- size = (1 << inode->i_blkbits) - first;
+ size = bh->b_size - first;
} else {
retval = dax_get_addr(inode, bh, &addr);
if (retval < 0)
@@ -209,6 +232,7 @@ ssize_t dax_do_io(int rw, struct kiocb *iocb, struct inode *inode,
ssize_t retval = -EINVAL;
loff_t end = offset;

+ memset(&bh, 0, sizeof(bh));
for (seg = 0; seg < nr_segs; seg++)
end += iov[seg].iov_len;

@@ -314,6 +338,9 @@ static int do_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
}
}

+ if (buffer_unwritten(&bh) || buffer_new(&bh))
+ dax_clear_blocks(inode, bh.b_blocknr, bh.b_size);
+
/* Recheck i_size under i_mmap_mutex */
mutex_lock(&mapping->i_mmap_mutex);
size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT;
diff --git a/mm/memory.c b/mm/memory.c
index 4e1bdee..3c6b8b2 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2672,6 +2672,7 @@ static int do_page_mkwrite(struct vm_area_struct *vma, struct page *page,
vmf.pgoff = page->index;
vmf.flags = FAULT_FLAG_WRITE|FAULT_FLAG_MKWRITE;
vmf.page = page;
+ vmf.cow_page = NULL;

ret = vma->vm_ops->page_mkwrite(vma, &vmf);
if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE)))

2014-02-27 14:01:31

by Florian Weimer

Subject: Re: [PATCH v6 00/22] Support ext4 on NV-DIMMs

On 02/25/2014 03:18 PM, Matthew Wilcox wrote:
> One of the primary uses for NV-DIMMs is to expose them as a block device
> and use a filesystem to store files on the NV-DIMM. While that works,
> it currently wastes memory and CPU time buffering the files in the page
> cache. We have support in ext2 for bypassing the page cache, but it
> has some races which are unfixable in the current design. This series
> of patches rewrite the underlying support, and add support for direct
> access to ext4.

I'm wondering if there is a potential security issue lurking here.

Some distributions use udisks2 to grant permission to local console
users to create new loop devices from files. File systems on these
block devices are then mounted. This is a replacement for several file
systems implemented in user space, and for the users, this is a good
thing because the in-kernel implementations are generally of higher quality.

What happens if we have DAX support in the entire stack, and an
enterprising user mounts a file system? Will she be able to fuzz the
file system or binfmt loaders concurrently, changing the bits while they
are being read?

Currently, it appears that the loop device duplicates pages in the page
cache, so this does not seem to be possible, but DAX support might
change this.

--
Florian Weimer / Red Hat Product Security Team

2014-02-27 16:29:29

by Matthew Wilcox

Subject: Re: [PATCH v6 00/22] Support ext4 on NV-DIMMs

On Thu, Feb 27, 2014 at 03:01:03PM +0100, Florian Weimer wrote:
> On 02/25/2014 03:18 PM, Matthew Wilcox wrote:
> >One of the primary uses for NV-DIMMs is to expose them as a block device
> >and use a filesystem to store files on the NV-DIMM. While that works,
> >it currently wastes memory and CPU time buffering the files in the page
> >cache. We have support in ext2 for bypassing the page cache, but it
> >has some races which are unfixable in the current design. This series
> >of patches rewrite the underlying support, and add support for direct
> >access to ext4.
>
> I'm wondering if there is a potential security issue lurking here.
>
> Some distributions use udisks2 to grant permission to local console
> users to create new loop devices from files. File systems on these
> block devices are then mounted. This is a replacement for several
> file systems implemented in user space, and for the users, this is a
> good thing because the in-kernel implementations are generally of
> higher quality.

Just to be sure I understand; the user owns the file (so can change any
bit in it at will), and the loop device is used to present that file
to the filesystem as a block device to be mounted? Have we fuzz-tested
all the filesystems enough to be sure that's safe? :-)

> What happens if we have DAX support in the entire stack, and an
> enterprising user mounts a file system? Will she be able to fuzz
> the file system or binfmt loaders concurrently, changing the bits
> while they are being read?
>
> Currently, it appears that the loop device duplicates pages in the
> page cache, so this does not seem to be possible, but DAX support
> might change this.

I haven't looked at adding DAX support to the loop device, although
that would make sense. At the moment, neither ext2 nor ext4 (our only
DAX-supporting filesystems) use DAX for their metadata, only for user
data. As far as fuzzing the binfmt loaders ... are these filesystems not
forced to be at least nosuid? I might go so far as to make them noexec.

Thanks for thinking about this. I didn't know allowing users to mount
files they owned was something distros actually did. Have we considered
prohibiting the user from modifying the file while it's mounted, eg
forcing its permissions to 0 or pretending it's immutable?

2014-02-27 16:36:19

by Florian Weimer

Subject: Re: [PATCH v6 00/22] Support ext4 on NV-DIMMs

On 02/27/2014 05:29 PM, Matthew Wilcox wrote:

>> Some distributions use udisks2 to grant permission to local console
>> users to create new loop devices from files. File systems on these
>> block devices are then mounted. This is a replacement for several
>> file systems implemented in user space, and for the users, this is a
>> good thing because the in-kernel implementations are generally of
>> higher quality.
>
> Just to be sure I understand; the user owns the file (so can change any
> bit in it at will), and the loop device is used to present that file
> to the filesystem as a block device to be mounted?

Yes, that's a fair summary.

> Have we fuzz-tested
> all the filesystems enough to be sure that's safe? :-)

It raised some eyebrows. But I've looked at some of the userspace
alternatives, and I can see why we ended up with this.

>> What happens if we have DAX support in the entire stack, and an
>> enterprising user mounts a file system? Will she be able to fuzz
>> the file system or binfmt loaders concurrently, changing the bits
>> while they are being read?
>>
>> Currently, it appears that the loop device duplicates pages in the
>> page cache, so this does not seem to be possible, but DAX support
>> might change this.
>
> I haven't looked at adding DAX support to the loop device, although
> that would make sense. At the moment, neither ext2 nor ext4 (our only
> DAX-supporting filesystems) use DAX for their metadata, only for user
> data. As far as fuzzing the binfmt loaders ... are these filesystems not
> forced to be at least nosuid?

The kernel binfmt parser runs as root even without a SUID bit. :)

> I might go so far as to make them noexec.

Oh, that's an interesting idea.

> Thanks for thinking about this. I didn't know allowing users to mount
> files they owned was something distros actually did. Have we considered
> prohibiting the user from modifying the file while it's mounted, eg
> forcing its permissions to 0 or pretending it's immutable?

Perhaps like "Text file busy" for executables? How reliable is that in
practice?

Changing file permissions doesn't affect already open descriptors and
might not always be possible (the file system might be mounted
read-only, but still be modifiable beneath).

--
Florian Weimer / Red Hat Product Security Team

2014-02-28 17:56:45

by Toshi Kani

Subject: Re: [PATCH v6 07/22] Replace the XIP page fault handler with the DAX page fault handler

On Tue, 2014-02-25 at 09:18 -0500, Matthew Wilcox wrote:
> Instead of calling aops->get_xip_mem from the fault handler, the
> filesystem passes a get_block_t that is used to find the appropriate
> blocks.
:
> +static int do_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
> + get_block_t get_block)
> +{
> + struct file *file = vma->vm_file;
> + struct inode *inode = file_inode(file);
> + struct address_space *mapping = file->f_mapping;
> + struct buffer_head bh;
> + unsigned long vaddr = (unsigned long)vmf->virtual_address;
> + sector_t block;
> + pgoff_t size;
> + unsigned long pfn;
> + int error;
> +
> + size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT;
> + if (vmf->pgoff >= size)
> + return VM_FAULT_SIGBUS;
> +
> + memset(&bh, 0, sizeof(bh));
> + block = (sector_t)vmf->pgoff << (PAGE_SHIFT - inode->i_blkbits);
> + bh.b_size = PAGE_SIZE;
> + error = get_block(inode, block, &bh, 0);
> + if (error || bh.b_size < PAGE_SIZE)
> + return VM_FAULT_SIGBUS;

I am learning the code and have some questions. The original code,
xip_file_fault(), jumps to found: and calls vm_insert_mixed() when
get_xip_mem(,,0,,) succeeded. If get_xip_mem() returns -ENODATA, it
calls either get_xip_mem(,,1,,) or xip_sparse_page(). In this new
function, it looks to me that get_block(,,,0) returns 0 for both cases
(success and -ENODATA previously), which are dealt with in the same way. Is
that right? If so, is there any reason for the change? Also, isn't it
possible to call get_block(,,,1) even if get_block(,,,0) found a block?

Thanks,
-Toshi

> +
> + if (!buffer_written(&bh) && !vmf->cow_page) {
> + if (vmf->flags & FAULT_FLAG_WRITE) {
> + error = get_block(inode, block, &bh, 1);
> + if (error || bh.b_size < PAGE_SIZE)
> + return VM_FAULT_SIGBUS;
> + } else {
> + return dax_load_hole(mapping, vmf);
> + }
> + }
> +

2014-02-28 20:20:35

by Matthew Wilcox

Subject: Re: [PATCH v6 07/22] Replace the XIP page fault handler with the DAX page fault handler

On Fri, Feb 28, 2014 at 10:49:31AM -0700, Toshi Kani wrote:
> On Tue, 2014-02-25 at 09:18 -0500, Matthew Wilcox wrote:
> > Instead of calling aops->get_xip_mem from the fault handler, the
> > filesystem passes a get_block_t that is used to find the appropriate
> > blocks.
> :
> > +static int do_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
> > + get_block_t get_block)
> > +{
> > + struct file *file = vma->vm_file;
> > + struct inode *inode = file_inode(file);
> > + struct address_space *mapping = file->f_mapping;
> > + struct buffer_head bh;
> > + unsigned long vaddr = (unsigned long)vmf->virtual_address;
> > + sector_t block;
> > + pgoff_t size;
> > + unsigned long pfn;
> > + int error;
> > +
> > + size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT;
> > + if (vmf->pgoff >= size)
> > + return VM_FAULT_SIGBUS;
> > +
> > + memset(&bh, 0, sizeof(bh));
> > + block = (sector_t)vmf->pgoff << (PAGE_SHIFT - inode->i_blkbits);
> > + bh.b_size = PAGE_SIZE;
> > + error = get_block(inode, block, &bh, 0);
> > + if (error || bh.b_size < PAGE_SIZE)
> > + return VM_FAULT_SIGBUS;
>
> I am learning the code and have some questions.

Hi Toshi,

Glad to see you're looking at it. Let me try to help ...

> The original code,
> xip_file_fault(), jumps to found: and calls vm_insert_mixed() when
> get_xip_mem(,,0,,) succeeded. If get_xip_mem() returns -ENODATA, it
> calls either get_xip_mem(,,1,,) or xip_sparse_page(). In this new
> function, it looks to me that get_block(,,,0) returns 0 for both cases
> (success and -ENODATA previously), which are dealt with in the same way. Is
> that right? If so, is there any reason for the change?

Yes, get_xip_mem() returned -ENODATA for a hole. That was a suboptimal
interface because filesystems are actually capable of returning more
information than that, eg how long the hole is (ext4 *doesn't*, but I
consider that to be a bug).
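
To make the difference between the two interfaces concrete, here is a simplified user-space sketch. Everything in it is a stand-in: `struct model_bh`, `model_get_xip_mem()` and `model_get_block()` are hypothetical names, and the real buffer_head carries far more state than this. The old style can only say "hole" through an error code; the new style returns success and lets the caller inspect the buffer, which leaves room for the length of the range.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, heavily simplified stand-in for struct buffer_head. */
struct model_bh {
	bool mapped;	/* models buffer_mapped() */
	size_t size;	/* models b_size: length of the extent or hole */
};

/* Old style: a hole is reported only as -ENODATA, with no length. */
int model_get_xip_mem(bool is_hole)
{
	return is_hole ? -ENODATA : 0;
}

/*
 * New style: return 0 for both cases and describe the range in the
 * buffer. A filesystem that knows how long the hole is can say so
 * through the size field.
 */
int model_get_block(bool is_hole, size_t range_len, struct model_bh *bh)
{
	bh->mapped = !is_hole;
	bh->size = range_len;
	return 0;
}
```

A caller of the second form tells a hole from a mapped block by checking `bh->mapped` after a zero return, rather than by comparing the return value against -ENODATA.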

I don't get to decide what the get_block() interface looks like. It's the
standard way that the VFS calls back into the filesystem and has been
around for probably close to twenty years at this point. I'm still trying
to understand exactly what the contract is for get_block() ... I have
a document that I'm working on to try to explain it, but it's tough going!

> Also, isn't it
> possible to call get_block(,,,1) even if get_block(,,,0) found a block?

The code in question looks like this:

error = get_block(inode, block, &bh, 0);
if (error || bh.b_size < PAGE_SIZE)
goto sigbus;

if (!buffer_written(&bh) && !vmf->cow_page) {
if (vmf->flags & FAULT_FLAG_WRITE) {
error = get_block(inode, block, &bh, 1);

where buffer_written is defined as:
return buffer_mapped(bh) && !buffer_unwritten(bh);

Doing some boolean algebra, that's:

if (!buffer_mapped || buffer_unwritten)

In either case, we want to tell the filesystem that we're writing to
this block. At least, that's my current understanding of the get_block()
interface. I'm open to correction here!
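
The boolean step can be checked exhaustively in user space. The sketch below uses a hypothetical `struct bh_model` with just the two bits in question; it is not the kernel's buffer_head, and the function names are made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the two buffer_head state bits involved. */
struct bh_model {
	bool mapped;	/* models buffer_mapped(bh) */
	bool unwritten;	/* models buffer_unwritten(bh) */
};

/* Mirrors the buffer_written() definition quoted above. */
bool model_buffer_written(const struct bh_model *bh)
{
	return bh->mapped && !bh->unwritten;
}

/* The expanded form: a hole, or a block allocated but never written. */
bool model_needs_allocating_call(const struct bh_model *bh)
{
	return !bh->mapped || bh->unwritten;
}

/* Number of (mapped, unwritten) combinations where the two forms disagree. */
int count_mismatches(void)
{
	int mismatches = 0;

	for (int m = 0; m <= 1; m++) {
		for (int u = 0; u <= 1; u++) {
			struct bh_model bh = { .mapped = m != 0,
					       .unwritten = u != 0 };

			if (!model_buffer_written(&bh) !=
			    model_needs_allocating_call(&bh))
				mismatches++;
		}
	}
	return mismatches;
}
```

With only four combinations to try, `count_mismatches()` covers the whole truth table, so a zero result confirms that `!buffer_written` is exactly "not mapped, or unwritten". A block that is mapped and already written is the one state that skips the second get_block() call.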

2014-02-28 22:25:07

by Toshi Kani

Subject: Re: [PATCH v6 07/22] Replace the XIP page fault handler with the DAX page fault handler

On Fri, 2014-02-28 at 15:20 -0500, Matthew Wilcox wrote:
> On Fri, Feb 28, 2014 at 10:49:31AM -0700, Toshi Kani wrote:
> > On Tue, 2014-02-25 at 09:18 -0500, Matthew Wilcox wrote:
:
> Glad to see you're looking at it. Let me try to help ...

Hi Matt,

Thanks for the help. This is really nice work, and I am hoping to
help with it... (some day! :-)

> > The original code,
> > xip_file_fault(), jumps to found: and calls vm_insert_mixed() when
> > get_xip_mem(,,0,,) succeeded. If get_xip_mem() returns -ENODATA, it
> > calls either get_xip_mem(,,1,,) or xip_sparse_page(). In this new
> > function, it looks to me that get_block(,,,0) returns 0 for both cases
> > (success and -ENODATA previously), which are dealt with in the same way. Is
> > that right? If so, is there any reason for the change?
>
> Yes, get_xip_mem() returned -ENODATA for a hole. That was a suboptimal
> interface because filesystems are actually capable of returning more
> information than that, eg how long the hole is (ext4 *doesn't*, but I
> consider that to be a bug).
>
> I don't get to decide what the get_block() interface looks like. It's the
> standard way that the VFS calls back into the filesystem and has been
> around for probably close to twenty years at this point. I'm still trying
> to understand exactly what the contract is for get_block() ... I have
> a document that I'm working on to try to explain it, but it's tough going!

Got it. Yes, get_block() is a beast for a file system newbie like me.
Thanks for working on the document.

> > Also, isn't it
> > possible to call get_block(,,,1) even if get_block(,,,0) found a block?
>
> The code in question looks like this:
>
> error = get_block(inode, block, &bh, 0);
> if (error || bh.b_size < PAGE_SIZE)
> goto sigbus;
>
> if (!buffer_written(&bh) && !vmf->cow_page) {
> if (vmf->flags & FAULT_FLAG_WRITE) {
> error = get_block(inode, block, &bh, 1);
>
> where buffer_written is defined as:
> return buffer_mapped(bh) && !buffer_unwritten(bh);
>
> Doing some boolean algebra, that's:
>
> if (!buffer_mapped || buffer_unwritten)

Oh, I see! When the first get_block(,,,0) succeeded, this buffer is
mapped. So, it won't go into this path.

> In either case, we want to tell the filesystem that we're writing to
> this block. At least, that's my current understanding of the get_block()
> interface. I'm open to correction here!

Thanks again!
-Toshi