2014-07-22 19:49:48

by Matthew Wilcox

Subject: [PATCH v8 00/22] Support ext4 on NV-DIMMs

One of the primary uses for NV-DIMMs is to expose them as a block device
and use a filesystem to store files on the NV-DIMM. While that works,
it currently wastes memory and CPU time buffering the files in the page
cache. We have support in ext2 for bypassing the page cache, but it
has some races which are unfixable in the current design. This series
of patches rewrites the underlying support and adds support for direct
access to ext4.

I would particularly welcome feedback from mm people on patch 5 ("Add
vm_replace_mixed()") and from fs people on patch 7 ("Add copy_to_iter(),
copy_from_iter() and iov_iter_zero()").

This iteration of the patchset rebases to 3.16-rc6 and makes substantial
changes based on Jan Kara's feedback:

- Handle errors in copy_user_bh()
- Change the calling convention for dax_get_addr() / dax_get_pfn() to take a
blkbits argument instead of an inode argument
- Cache inode->i_blkbits in a local variable
- Rename the file offset to 'pos' to fit the rest of the VFS
- Add a FIXME to fall back to buffered I/O if the filesystem doesn't
support filling a hole from within the direct I/O path. Mysterious
and complex are the ways of get_block implementations.
- Move the call to inode_dio_done() to after end_io() to fix a race
- Add a comment about why we have to recheck i_size under the page lock
- Use vm_insert_page() in the COW path instead of returning VM_FAULT_COWED
- Handle errors coming back from dax_get_pfn() correctly in do_dax_fault()
- Remove zero pages from the process's address space before trying to
replace them with the PFN of the newly allocated block
- Factor out bdev_direct_access() to support partitioning properly
- Rework the i_mmap_mutex locking to remove an inversion vs the page lock
- Make the ext2 rename of -o xip to -o dax more graceful
- Only update file mtime/ctime on a write fault, not a read fault


Matthew Wilcox (21):
Fix XIP fault vs truncate race
Allow page fault handlers to perform the COW
axonram: Fix bug in direct_access
Change direct_access calling convention
Add vm_replace_mixed()
Introduce IS_DAX(inode)
Add copy_to_iter(), copy_from_iter() and iov_iter_zero()
Replace XIP read and write with DAX I/O
Replace ext2_clear_xip_target with dax_clear_blocks
Replace the XIP page fault handler with the DAX page fault handler
Replace xip_truncate_page with dax_truncate_page
Replace XIP documentation with DAX documentation
Remove get_xip_mem
ext2: Remove ext2_xip_verify_sb()
ext2: Remove ext2_use_xip
ext2: Remove xip.c and xip.h
Remove CONFIG_EXT2_FS_XIP and rename CONFIG_FS_XIP to CONFIG_FS_DAX
ext2: Remove ext2_aops_xip
Get rid of most mentions of XIP in ext2
xip: Add xip_zero_page_range
brd: Rename XIP to DAX

Ross Zwisler (1):
ext4: Add DAX functionality

Documentation/filesystems/Locking | 3 -
Documentation/filesystems/dax.txt | 91 +++++++
Documentation/filesystems/ext4.txt | 2 +
Documentation/filesystems/xip.txt | 68 -----
arch/powerpc/sysdev/axonram.c | 14 +-
drivers/block/Kconfig | 13 +-
drivers/block/brd.c | 22 +-
drivers/s390/block/dcssblk.c | 19 +-
fs/Kconfig | 21 +-
fs/Makefile | 1 +
fs/block_dev.c | 28 +++
fs/dax.c | 503 +++++++++++++++++++++++++++++++++++++
fs/exofs/inode.c | 1 -
fs/ext2/Kconfig | 11 -
fs/ext2/Makefile | 1 -
fs/ext2/ext2.h | 10 +-
fs/ext2/file.c | 45 +++-
fs/ext2/inode.c | 38 +--
fs/ext2/namei.c | 13 +-
fs/ext2/super.c | 53 ++--
fs/ext2/xip.c | 91 -------
fs/ext2/xip.h | 26 --
fs/ext4/ext4.h | 6 +
fs/ext4/file.c | 53 +++-
fs/ext4/indirect.c | 18 +-
fs/ext4/inode.c | 51 ++--
fs/ext4/namei.c | 10 +-
fs/ext4/super.c | 39 ++-
fs/open.c | 5 +-
include/linux/blkdev.h | 6 +-
include/linux/fs.h | 49 +++-
include/linux/mm.h | 9 +-
include/linux/uio.h | 3 +
mm/Makefile | 1 -
mm/fadvise.c | 6 +-
mm/filemap.c | 6 +-
mm/filemap_xip.c | 483 -----------------------------------
mm/iov_iter.c | 237 +++++++++++++++--
mm/madvise.c | 2 +-
mm/memory.c | 45 ++--
40 files changed, 1228 insertions(+), 875 deletions(-)
create mode 100644 Documentation/filesystems/dax.txt
delete mode 100644 Documentation/filesystems/xip.txt
create mode 100644 fs/dax.c
delete mode 100644 fs/ext2/xip.c
delete mode 100644 fs/ext2/xip.h
delete mode 100644 mm/filemap_xip.c

--
2.0.0


2014-07-22 19:48:33

by Matthew Wilcox

Subject: [PATCH v8 03/22] axonram: Fix bug in direct_access

The 'pfn' returned by axonram was completely bogus, and has been since
2008: virt_to_phys() was applied to the 'kaddr' out-parameter itself
instead of the kernel address stored in it.

Signed-off-by: Matthew Wilcox <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
---
arch/powerpc/sysdev/axonram.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/sysdev/axonram.c b/arch/powerpc/sysdev/axonram.c
index 47b6b9f..830edc8 100644
--- a/arch/powerpc/sysdev/axonram.c
+++ b/arch/powerpc/sysdev/axonram.c
@@ -156,7 +156,7 @@ axon_ram_direct_access(struct block_device *device, sector_t sector,
}

*kaddr = (void *)(bank->ph_addr + offset);
- *pfn = virt_to_phys(kaddr) >> PAGE_SHIFT;
+ *pfn = virt_to_phys(*kaddr) >> PAGE_SHIFT;

return 0;
}
--
2.0.0

2014-07-22 19:48:39

by Matthew Wilcox

Subject: [PATCH v8 18/22] ext2: Remove ext2_aops_xip

We shouldn't need a special address_space_operations structure any more.

Signed-off-by: Matthew Wilcox <[email protected]>
---
fs/ext2/ext2.h | 1 -
fs/ext2/inode.c | 7 +------
fs/ext2/namei.c | 4 ++--
3 files changed, 3 insertions(+), 9 deletions(-)

diff --git a/fs/ext2/ext2.h b/fs/ext2/ext2.h
index b30c3bd..b8b1c11 100644
--- a/fs/ext2/ext2.h
+++ b/fs/ext2/ext2.h
@@ -793,7 +793,6 @@ extern const struct file_operations ext2_xip_file_operations;

/* inode.c */
extern const struct address_space_operations ext2_aops;
-extern const struct address_space_operations ext2_aops_xip;
extern const struct address_space_operations ext2_nobh_aops;

/* namei.c */
diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
index 154cbcf..034fd42 100644
--- a/fs/ext2/inode.c
+++ b/fs/ext2/inode.c
@@ -891,11 +891,6 @@ const struct address_space_operations ext2_aops = {
.error_remove_page = generic_error_remove_page,
};

-const struct address_space_operations ext2_aops_xip = {
- .bmap = ext2_bmap,
- .direct_IO = ext2_direct_IO,
-};
-
const struct address_space_operations ext2_nobh_aops = {
.readpage = ext2_readpage,
.readpages = ext2_readpages,
@@ -1394,7 +1389,7 @@ struct inode *ext2_iget (struct super_block *sb, unsigned long ino)
if (S_ISREG(inode->i_mode)) {
inode->i_op = &ext2_file_inode_operations;
if (test_opt(inode->i_sb, XIP)) {
- inode->i_mapping->a_ops = &ext2_aops_xip;
+ inode->i_mapping->a_ops = &ext2_aops;
inode->i_fop = &ext2_xip_file_operations;
} else if (test_opt(inode->i_sb, NOBH)) {
inode->i_mapping->a_ops = &ext2_nobh_aops;
diff --git a/fs/ext2/namei.c b/fs/ext2/namei.c
index 7ca803f..0db888c 100644
--- a/fs/ext2/namei.c
+++ b/fs/ext2/namei.c
@@ -105,7 +105,7 @@ static int ext2_create (struct inode * dir, struct dentry * dentry, umode_t mode

inode->i_op = &ext2_file_inode_operations;
if (test_opt(inode->i_sb, XIP)) {
- inode->i_mapping->a_ops = &ext2_aops_xip;
+ inode->i_mapping->a_ops = &ext2_aops;
inode->i_fop = &ext2_xip_file_operations;
} else if (test_opt(inode->i_sb, NOBH)) {
inode->i_mapping->a_ops = &ext2_nobh_aops;
@@ -126,7 +126,7 @@ static int ext2_tmpfile(struct inode *dir, struct dentry *dentry, umode_t mode)

inode->i_op = &ext2_file_inode_operations;
if (test_opt(inode->i_sb, XIP)) {
- inode->i_mapping->a_ops = &ext2_aops_xip;
+ inode->i_mapping->a_ops = &ext2_aops;
inode->i_fop = &ext2_xip_file_operations;
} else if (test_opt(inode->i_sb, NOBH)) {
inode->i_mapping->a_ops = &ext2_nobh_aops;
--
2.0.0

2014-07-22 19:48:45

by Matthew Wilcox

Subject: [PATCH v8 06/22] Introduce IS_DAX(inode)

Use an inode flag to tag inodes which should avoid using the page cache.
Convert ext2 to use it instead of mapping_is_xip().

Signed-off-by: Matthew Wilcox <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
---
fs/ext2/inode.c | 9 ++++++---
fs/ext2/xip.h | 2 --
include/linux/fs.h | 6 ++++++
3 files changed, 12 insertions(+), 5 deletions(-)

diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
index 36d35c3..0cb0448 100644
--- a/fs/ext2/inode.c
+++ b/fs/ext2/inode.c
@@ -731,7 +731,7 @@ static int ext2_get_blocks(struct inode *inode,
goto cleanup;
}

- if (ext2_use_xip(inode->i_sb)) {
+ if (IS_DAX(inode)) {
/*
* we need to clear the block
*/
@@ -1201,7 +1201,7 @@ static int ext2_setsize(struct inode *inode, loff_t newsize)

inode_dio_wait(inode);

- if (mapping_is_xip(inode->i_mapping))
+ if (IS_DAX(inode))
error = xip_truncate_page(inode->i_mapping, newsize);
else if (test_opt(inode->i_sb, NOBH))
error = nobh_truncate_page(inode->i_mapping,
@@ -1273,7 +1273,8 @@ void ext2_set_inode_flags(struct inode *inode)
{
unsigned int flags = EXT2_I(inode)->i_flags;

- inode->i_flags &= ~(S_SYNC|S_APPEND|S_IMMUTABLE|S_NOATIME|S_DIRSYNC);
+ inode->i_flags &= ~(S_SYNC | S_APPEND | S_IMMUTABLE | S_NOATIME |
+ S_DIRSYNC | S_DAX);
if (flags & EXT2_SYNC_FL)
inode->i_flags |= S_SYNC;
if (flags & EXT2_APPEND_FL)
@@ -1284,6 +1285,8 @@ void ext2_set_inode_flags(struct inode *inode)
inode->i_flags |= S_NOATIME;
if (flags & EXT2_DIRSYNC_FL)
inode->i_flags |= S_DIRSYNC;
+ if (test_opt(inode->i_sb, XIP))
+ inode->i_flags |= S_DAX;
}

/* Propagate flags from i_flags to EXT2_I(inode)->i_flags */
diff --git a/fs/ext2/xip.h b/fs/ext2/xip.h
index 18b34d2..29be737 100644
--- a/fs/ext2/xip.h
+++ b/fs/ext2/xip.h
@@ -16,9 +16,7 @@ static inline int ext2_use_xip (struct super_block *sb)
}
int ext2_get_xip_mem(struct address_space *, pgoff_t, int,
void **, unsigned long *);
-#define mapping_is_xip(map) unlikely(map->a_ops->get_xip_mem)
#else
-#define mapping_is_xip(map) 0
#define ext2_xip_verify_sb(sb) do { } while (0)
#define ext2_use_xip(sb) 0
#define ext2_clear_xip_target(inode, chain) 0
diff --git a/include/linux/fs.h b/include/linux/fs.h
index e11d60c..a94e5d7 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1579,6 +1579,7 @@ struct super_operations {
#define S_IMA 1024 /* Inode has an associated IMA struct */
#define S_AUTOMOUNT 2048 /* Automount/referral quasi-directory */
#define S_NOSEC 4096 /* no suid or xattr security attributes */
+#define S_DAX 8192 /* Direct Access, avoiding the page cache */

/*
* Note that nosuid etc flags are inode-specific: setting some file-system
@@ -1616,6 +1617,11 @@ struct super_operations {
#define IS_IMA(inode) ((inode)->i_flags & S_IMA)
#define IS_AUTOMOUNT(inode) ((inode)->i_flags & S_AUTOMOUNT)
#define IS_NOSEC(inode) ((inode)->i_flags & S_NOSEC)
+#ifdef CONFIG_FS_XIP
+#define IS_DAX(inode) ((inode)->i_flags & S_DAX)
+#else
+#define IS_DAX(inode) 0
+#endif

/*
* Inode state bits. Protected by inode->i_lock
--
2.0.0

2014-07-22 19:48:55

by Matthew Wilcox

Subject: [PATCH v8 11/22] Replace xip_truncate_page with dax_truncate_page

It takes a get_block parameter just like nobh_truncate_page() and
block_truncate_page().

Signed-off-by: Matthew Wilcox <[email protected]>
---
fs/dax.c | 44 ++++++++++++++++++++++++++++++++++++++++++++
fs/ext2/inode.c | 2 +-
include/linux/fs.h | 4 ++--
mm/filemap_xip.c | 40 ----------------------------------------
4 files changed, 47 insertions(+), 43 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 4ab4890..b9bc5e2 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -449,3 +449,47 @@ int dax_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf,
return dax_fault(vma, vmf, get_block);
}
EXPORT_SYMBOL_GPL(dax_mkwrite);
+
+/**
+ * dax_truncate_page - handle a partial page being truncated in a DAX file
+ * @inode: The file being truncated
+ * @from: The file offset that is being truncated to
+ * @get_block: The filesystem method used to translate file offsets to blocks
+ *
+ * Similar to block_truncate_page(), this function can be called by a
+ * filesystem when it is truncating a DAX file to handle the partial page.
+ *
+ * We work in terms of PAGE_CACHE_SIZE here for commonality with
+ * block_truncate_page(), but we could go down to PAGE_SIZE if the filesystem
+ * took care of disposing of the unnecessary blocks. Even if the filesystem
+ * block size is smaller than PAGE_SIZE, we have to zero the rest of the page
+ * since the file might be mmaped.
+ */
+int dax_truncate_page(struct inode *inode, loff_t from, get_block_t get_block)
+{
+ struct buffer_head bh;
+ pgoff_t index = from >> PAGE_CACHE_SHIFT;
+ unsigned offset = from & (PAGE_CACHE_SIZE-1);
+ unsigned length = PAGE_CACHE_ALIGN(from) - from;
+ int err;
+
+ /* Block boundary? Nothing to do */
+ if (!length)
+ return 0;
+
+ memset(&bh, 0, sizeof(bh));
+ bh.b_size = PAGE_CACHE_SIZE;
+ err = get_block(inode, index, &bh, 0);
+ if (err < 0)
+ return err;
+ if (buffer_written(&bh)) {
+ void *addr;
+ err = dax_get_addr(&bh, &addr, inode->i_blkbits);
+ if (err < 0)
+ return err;
+ memset(addr + offset, 0, length);
+ }
+
+ return 0;
+}
+EXPORT_SYMBOL_GPL(dax_truncate_page);
diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
index 52978b8..5ac0a34 100644
--- a/fs/ext2/inode.c
+++ b/fs/ext2/inode.c
@@ -1210,7 +1210,7 @@ static int ext2_setsize(struct inode *inode, loff_t newsize)
inode_dio_wait(inode);

if (IS_DAX(inode))
- error = xip_truncate_page(inode->i_mapping, newsize);
+ error = dax_truncate_page(inode, newsize, ext2_get_block);
else if (test_opt(inode->i_sb, NOBH))
error = nobh_truncate_page(inode->i_mapping,
newsize, ext2_get_block);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index d4259e1..29fafaf 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2465,7 +2465,7 @@ extern int nonseekable_open(struct inode * inode, struct file * filp);

#ifdef CONFIG_FS_XIP
int dax_clear_blocks(struct inode *, sector_t block, long size);
-extern int xip_truncate_page(struct address_space *mapping, loff_t from);
+int dax_truncate_page(struct inode *, loff_t from, get_block_t);
ssize_t dax_do_io(int rw, struct kiocb *, struct inode *, struct iov_iter *,
loff_t, get_block_t, dio_iodone_t, int flags);
int dax_fault(struct vm_area_struct *, struct vm_fault *, get_block_t);
@@ -2476,7 +2476,7 @@ static inline int dax_clear_blocks(struct inode *i, sector_t blk, long sz)
return 0;
}

-static inline int xip_truncate_page(struct address_space *mapping, loff_t from)
+static inline int dax_truncate_page(struct inode *i, loff_t frm, get_block_t gb)
{
return 0;
}
diff --git a/mm/filemap_xip.c b/mm/filemap_xip.c
index 9dd45f3..6316578 100644
--- a/mm/filemap_xip.c
+++ b/mm/filemap_xip.c
@@ -21,43 +21,3 @@
#include <asm/tlbflush.h>
#include <asm/io.h>

-/*
- * truncate a page used for execute in place
- * functionality is analog to block_truncate_page but does use get_xip_mem
- * to get the page instead of page cache
- */
-int
-xip_truncate_page(struct address_space *mapping, loff_t from)
-{
- pgoff_t index = from >> PAGE_CACHE_SHIFT;
- unsigned offset = from & (PAGE_CACHE_SIZE-1);
- unsigned blocksize;
- unsigned length;
- void *xip_mem;
- unsigned long xip_pfn;
- int err;
-
- BUG_ON(!mapping->a_ops->get_xip_mem);
-
- blocksize = 1 << mapping->host->i_blkbits;
- length = offset & (blocksize - 1);
-
- /* Block boundary? Nothing to do */
- if (!length)
- return 0;
-
- length = blocksize - length;
-
- err = mapping->a_ops->get_xip_mem(mapping, index, 0,
- &xip_mem, &xip_pfn);
- if (unlikely(err)) {
- if (err == -ENODATA)
- /* Hole? No need to truncate */
- return 0;
- else
- return err;
- }
- memset(xip_mem + offset, 0, length);
- return 0;
-}
-EXPORT_SYMBOL_GPL(xip_truncate_page);
--
2.0.0

2014-07-22 19:48:48

by Matthew Wilcox

Subject: [PATCH v8 14/22] ext2: Remove ext2_xip_verify_sb()

Jan Kara pointed out that calling ext2_xip_verify_sb() in ext2_remount()
doesn't make sense, since changing the XIP option on remount isn't
allowed. It also doesn't make sense to re-check whether blocksize is
supported since it can't change between mounts.

Replace the call to ext2_xip_verify_sb() in ext2_fill_super() with the
equivalent check and delete the definition.

Signed-off-by: Matthew Wilcox <[email protected]>
---
fs/ext2/super.c | 33 ++++++++++++---------------------
fs/ext2/xip.c | 12 ------------
fs/ext2/xip.h | 2 --
3 files changed, 12 insertions(+), 35 deletions(-)

diff --git a/fs/ext2/super.c b/fs/ext2/super.c
index 3750031..3ac6555 100644
--- a/fs/ext2/super.c
+++ b/fs/ext2/super.c
@@ -868,9 +868,6 @@ static int ext2_fill_super(struct super_block *sb, void *data, int silent)
((EXT2_SB(sb)->s_mount_opt & EXT2_MOUNT_POSIX_ACL) ?
MS_POSIXACL : 0);

- ext2_xip_verify_sb(sb); /* see if bdev supports xip, unset
- EXT2_MOUNT_XIP if not */
-
if (le32_to_cpu(es->s_rev_level) == EXT2_GOOD_OLD_REV &&
(EXT2_HAS_COMPAT_FEATURE(sb, ~0U) ||
EXT2_HAS_RO_COMPAT_FEATURE(sb, ~0U) ||
@@ -900,11 +897,17 @@ static int ext2_fill_super(struct super_block *sb, void *data, int silent)

blocksize = BLOCK_SIZE << le32_to_cpu(sbi->s_es->s_log_block_size);

- if (ext2_use_xip(sb) && blocksize != PAGE_SIZE) {
- if (!silent)
+ if (sbi->s_mount_opt & EXT2_MOUNT_XIP) {
+ if (blocksize != PAGE_SIZE) {
ext2_msg(sb, KERN_ERR,
- "error: unsupported blocksize for xip");
- goto failed_mount;
+ "error: unsupported blocksize for xip");
+ goto failed_mount;
+ }
+ if (!sb->s_bdev->bd_disk->fops->direct_access) {
+ ext2_msg(sb, KERN_ERR,
+ "error: device does not support xip");
+ goto failed_mount;
+ }
}

/* If the blocksize doesn't match, re-read the thing.. */
@@ -1249,7 +1252,6 @@ static int ext2_remount (struct super_block * sb, int * flags, char * data)
{
struct ext2_sb_info * sbi = EXT2_SB(sb);
struct ext2_super_block * es;
- unsigned long old_mount_opt = sbi->s_mount_opt;
struct ext2_mount_options old_opts;
unsigned long old_sb_flags;
int err;
@@ -1274,22 +1276,11 @@ static int ext2_remount (struct super_block * sb, int * flags, char * data)
sb->s_flags = (sb->s_flags & ~MS_POSIXACL) |
((sbi->s_mount_opt & EXT2_MOUNT_POSIX_ACL) ? MS_POSIXACL : 0);

- ext2_xip_verify_sb(sb); /* see if bdev supports xip, unset
- EXT2_MOUNT_XIP if not */
-
- if ((ext2_use_xip(sb)) && (sb->s_blocksize != PAGE_SIZE)) {
- ext2_msg(sb, KERN_WARNING,
- "warning: unsupported blocksize for xip");
- err = -EINVAL;
- goto restore_opts;
- }
-
es = sbi->s_es;
- if ((sbi->s_mount_opt ^ old_mount_opt) & EXT2_MOUNT_XIP) {
+ if ((sbi->s_mount_opt ^ old_opts.s_mount_opt) & EXT2_MOUNT_XIP) {
ext2_msg(sb, KERN_WARNING, "warning: refusing change of "
"xip flag with busy inodes while remounting");
- sbi->s_mount_opt &= ~EXT2_MOUNT_XIP;
- sbi->s_mount_opt |= old_mount_opt & EXT2_MOUNT_XIP;
+ sbi->s_mount_opt ^= EXT2_MOUNT_XIP;
}
if ((*flags & MS_RDONLY) == (sb->s_flags & MS_RDONLY)) {
spin_unlock(&sbi->s_lock);
diff --git a/fs/ext2/xip.c b/fs/ext2/xip.c
index 132d4da..66ca113 100644
--- a/fs/ext2/xip.c
+++ b/fs/ext2/xip.c
@@ -13,15 +13,3 @@
#include "ext2.h"
#include "xip.h"

-void ext2_xip_verify_sb(struct super_block *sb)
-{
- struct ext2_sb_info *sbi = EXT2_SB(sb);
-
- if ((sbi->s_mount_opt & EXT2_MOUNT_XIP) &&
- !sb->s_bdev->bd_disk->fops->direct_access) {
- sbi->s_mount_opt &= (~EXT2_MOUNT_XIP);
- ext2_msg(sb, KERN_WARNING,
- "warning: ignoring xip option - "
- "not supported by bdev");
- }
-}
diff --git a/fs/ext2/xip.h b/fs/ext2/xip.h
index e7b9f0a..87eeb04 100644
--- a/fs/ext2/xip.h
+++ b/fs/ext2/xip.h
@@ -6,13 +6,11 @@
*/

#ifdef CONFIG_EXT2_FS_XIP
-extern void ext2_xip_verify_sb (struct super_block *);
static inline int ext2_use_xip (struct super_block *sb)
{
struct ext2_sb_info *sbi = EXT2_SB(sb);
return (sbi->s_mount_opt & EXT2_MOUNT_XIP);
}
#else
-#define ext2_xip_verify_sb(sb) do { } while (0)
#define ext2_use_xip(sb) 0
#endif
--
2.0.0

2014-07-22 19:49:21

by Matthew Wilcox

Subject: [PATCH v8 22/22] brd: Rename XIP to DAX

From: Matthew Wilcox <[email protected]>

Since this relates to FS_XIP, not KERNEL_XIP, it should be called
DAX instead of XIP.

Signed-off-by: Matthew Wilcox <[email protected]>
---
drivers/block/Kconfig | 13 +++++++------
drivers/block/brd.c | 14 +++++++-------
fs/Kconfig | 4 ++--
3 files changed, 16 insertions(+), 15 deletions(-)

diff --git a/drivers/block/Kconfig b/drivers/block/Kconfig
index 014a1cf..1b8094d 100644
--- a/drivers/block/Kconfig
+++ b/drivers/block/Kconfig
@@ -393,14 +393,15 @@ config BLK_DEV_RAM_SIZE
The default value is 4096 kilobytes. Only change this if you know
what you are doing.

-config BLK_DEV_XIP
- bool "Support XIP filesystems on RAM block device"
- depends on BLK_DEV_RAM
+config BLK_DEV_RAM_DAX
+ bool "Support Direct Access (DAX) to RAM block devices"
+ depends on BLK_DEV_RAM && FS_DAX
default n
help
- Support XIP filesystems (such as ext2 with XIP support on) on
- top of block ram device. This will slightly enlarge the kernel, and
- will prevent RAM block device backing store memory from being
+ Support filesystems using DAX to access RAM block devices. This
+ avoids double-buffering data in the page cache before copying it
+ to the block device. Answering Y will slightly enlarge the kernel,
+ and will prevent RAM block device backing store memory from being
allocated from highmem (only a problem for highmem systems).

config CDROM_PKTCDVD
diff --git a/drivers/block/brd.c b/drivers/block/brd.c
index 96e4c96..33a39e7 100644
--- a/drivers/block/brd.c
+++ b/drivers/block/brd.c
@@ -97,13 +97,13 @@ static struct page *brd_insert_page(struct brd_device *brd, sector_t sector)
* Must use NOIO because we don't want to recurse back into the
* block or filesystem layers from page reclaim.
*
- * Cannot support XIP and highmem, because our ->direct_access
- * routine for XIP must return memory that is always addressable.
- * If XIP was reworked to use pfns and kmap throughout, this
+ * Cannot support DAX and highmem, because our ->direct_access
+ * routine for DAX must return memory that is always addressable.
+ * If DAX was reworked to use pfns and kmap throughout, this
* restriction might be able to be lifted.
*/
gfp_flags = GFP_NOIO | __GFP_ZERO;
-#ifndef CONFIG_BLK_DEV_XIP
+#ifndef CONFIG_BLK_DEV_RAM_DAX
gfp_flags |= __GFP_HIGHMEM;
#endif
page = alloc_page(gfp_flags);
@@ -369,7 +369,7 @@ static int brd_rw_page(struct block_device *bdev, sector_t sector,
return err;
}

-#ifdef CONFIG_BLK_DEV_XIP
+#ifdef CONFIG_BLK_DEV_RAM_DAX
static long brd_direct_access(struct block_device *bdev, sector_t sector,
void **kaddr, unsigned long *pfn, long size)
{
@@ -392,6 +392,8 @@ static long brd_direct_access(struct block_device *bdev, sector_t sector,
* file is mapped to the next page of physical RAM */
return min_t(long, PAGE_SIZE, size);
}
+#else
+#define brd_direct_access NULL
#endif

static int brd_ioctl(struct block_device *bdev, fmode_t mode,
@@ -432,9 +434,7 @@ static const struct block_device_operations brd_fops = {
.owner = THIS_MODULE,
.rw_page = brd_rw_page,
.ioctl = brd_ioctl,
-#ifdef CONFIG_BLK_DEV_XIP
.direct_access = brd_direct_access,
-#endif
};

/*
diff --git a/fs/Kconfig b/fs/Kconfig
index a9eb53d..117900f 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -34,7 +34,7 @@ source "fs/btrfs/Kconfig"
source "fs/nilfs2/Kconfig"

config FS_DAX
- bool "Direct Access support"
+ bool "Direct Access (DAX) support"
depends on MMU
help
Direct Access (DAX) can be used on memory-backed block devices.
@@ -45,7 +45,7 @@ config FS_DAX

If you do not have a block device that is capable of using this,
or if unsure, say N. Saying Y will increase the size of the kernel
- by about 2kB.
+ by about 5kB.

endif # BLOCK

--
2.0.0

2014-07-22 19:49:47

by Matthew Wilcox

Subject: [PATCH v8 09/22] Replace ext2_clear_xip_target with dax_clear_blocks

This is practically generic code; other filesystems will want to call
it from other places, but there's nothing ext2-specific about it.

Make it a little more generic by allowing it to take a count of the number
of bytes to zero rather than fixing it to a single page. Thanks to Dave
Hansen for suggesting that I need to call cond_resched() if zeroing more
than one page.

Signed-off-by: Matthew Wilcox <[email protected]>
---
fs/dax.c | 35 +++++++++++++++++++++++++++++++++++
fs/ext2/inode.c | 8 +++++---
fs/ext2/xip.c | 14 --------------
fs/ext2/xip.h | 3 ---
include/linux/fs.h | 6 ++++++
5 files changed, 46 insertions(+), 20 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 108c68e..02e226f 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -20,8 +20,43 @@
#include <linux/fs.h>
#include <linux/genhd.h>
#include <linux/mutex.h>
+#include <linux/sched.h>
#include <linux/uio.h>

+int dax_clear_blocks(struct inode *inode, sector_t block, long size)
+{
+ struct block_device *bdev = inode->i_sb->s_bdev;
+ sector_t sector = block << (inode->i_blkbits - 9);
+
+ might_sleep();
+ do {
+ void *addr;
+ unsigned long pfn;
+ long count;
+
+ count = bdev_direct_access(bdev, sector, &addr, &pfn, size);
+ if (count < 0)
+ return count;
+ while (count > 0) {
+ unsigned pgsz = PAGE_SIZE - offset_in_page(addr);
+ if (pgsz > count)
+ pgsz = count;
+ if (pgsz < PAGE_SIZE)
+ memset(addr, 0, pgsz);
+ else
+ clear_page(addr);
+ addr += pgsz;
+ size -= pgsz;
+ count -= pgsz;
+ sector += pgsz / 512;
+ cond_resched();
+ }
+ } while (size);
+
+ return 0;
+}
+EXPORT_SYMBOL_GPL(dax_clear_blocks);
+
static long dax_get_addr(struct buffer_head *bh, void **addr, unsigned blkbits)
{
unsigned long pfn;
diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
index 3ccd5fd..52978b8 100644
--- a/fs/ext2/inode.c
+++ b/fs/ext2/inode.c
@@ -733,10 +733,12 @@ static int ext2_get_blocks(struct inode *inode,

if (IS_DAX(inode)) {
/*
- * we need to clear the block
+ * block must be initialised before we put it in the tree
+ * so that it's not found by another thread before it's
+ * initialised
*/
- err = ext2_clear_xip_target (inode,
- le32_to_cpu(chain[depth-1].key));
+ err = dax_clear_blocks(inode, le32_to_cpu(chain[depth-1].key),
+ 1 << inode->i_blkbits);
if (err) {
mutex_unlock(&ei->truncate_mutex);
goto cleanup;
diff --git a/fs/ext2/xip.c b/fs/ext2/xip.c
index bbc5fec..8cfca3a 100644
--- a/fs/ext2/xip.c
+++ b/fs/ext2/xip.c
@@ -42,20 +42,6 @@ __ext2_get_block(struct inode *inode, pgoff_t pgoff, int create,
return rc;
}

-int
-ext2_clear_xip_target(struct inode *inode, sector_t block)
-{
- void *kaddr;
- unsigned long pfn;
- long size;
-
- size = __inode_direct_access(inode, block, &kaddr, &pfn, PAGE_SIZE);
- if (size < 0)
- return size;
- clear_page(kaddr);
- return 0;
-}
-
void ext2_xip_verify_sb(struct super_block *sb)
{
struct ext2_sb_info *sbi = EXT2_SB(sb);
diff --git a/fs/ext2/xip.h b/fs/ext2/xip.h
index 29be737..b2592f2 100644
--- a/fs/ext2/xip.h
+++ b/fs/ext2/xip.h
@@ -7,8 +7,6 @@

#ifdef CONFIG_EXT2_FS_XIP
extern void ext2_xip_verify_sb (struct super_block *);
-extern int ext2_clear_xip_target (struct inode *, sector_t);
-
static inline int ext2_use_xip (struct super_block *sb)
{
struct ext2_sb_info *sbi = EXT2_SB(sb);
@@ -19,6 +17,5 @@ int ext2_get_xip_mem(struct address_space *, pgoff_t, int,
#else
#define ext2_xip_verify_sb(sb) do { } while (0)
#define ext2_use_xip(sb) 0
-#define ext2_clear_xip_target(inode, chain) 0
#define ext2_get_xip_mem NULL
#endif
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 87a135a..e38138b 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2463,11 +2463,17 @@ extern int generic_file_open(struct inode * inode, struct file * filp);
extern int nonseekable_open(struct inode * inode, struct file * filp);

#ifdef CONFIG_FS_XIP
+int dax_clear_blocks(struct inode *, sector_t block, long size);
extern int xip_file_mmap(struct file * file, struct vm_area_struct * vma);
extern int xip_truncate_page(struct address_space *mapping, loff_t from);
ssize_t dax_do_io(int rw, struct kiocb *, struct inode *, struct iov_iter *,
loff_t, get_block_t, dio_iodone_t, int flags);
#else
+static inline int dax_clear_blocks(struct inode *i, sector_t blk, long sz)
+{
+ return 0;
+}
+
static inline int xip_truncate_page(struct address_space *mapping, loff_t from)
{
return 0;
--
2.0.0

2014-07-22 19:50:48

by Matthew Wilcox

Subject: [PATCH v8 07/22] Add copy_to_iter(), copy_from_iter() and iov_iter_zero()

From: Matthew Wilcox <[email protected]>

For DAX, we want to be able to copy between iovecs and kernel addresses
that don't necessarily have a struct page. This is a fairly simple
rearrangement for bvec iters to kmap the pages outside and pass them in,
but for user iovecs it gets more complicated because we might try various
different ways to kmap the memory. Duplicating the existing logic works
out best in this case.

We need to be able to write zeroes to an iovec for reads from unwritten
ranges in a file. This is performed by the new iov_iter_zero() function,
again patterned after the existing code that handles iovec iterators.

Signed-off-by: Matthew Wilcox <[email protected]>
---
include/linux/uio.h | 3 +
mm/iov_iter.c | 237 ++++++++++++++++++++++++++++++++++++++++++++++++----
2 files changed, 226 insertions(+), 14 deletions(-)

diff --git a/include/linux/uio.h b/include/linux/uio.h
index 09a7cff..86fb3e9 100644
--- a/include/linux/uio.h
+++ b/include/linux/uio.h
@@ -80,6 +80,9 @@ size_t copy_page_to_iter(struct page *page, size_t offset, size_t bytes,
struct iov_iter *i);
size_t copy_page_from_iter(struct page *page, size_t offset, size_t bytes,
struct iov_iter *i);
+size_t copy_to_iter(void *addr, size_t bytes, struct iov_iter *i);
+size_t copy_from_iter(void *addr, size_t bytes, struct iov_iter *i);
+size_t iov_iter_zero(size_t bytes, struct iov_iter *);
unsigned long iov_iter_alignment(const struct iov_iter *i);
void iov_iter_init(struct iov_iter *i, int direction, const struct iovec *iov,
unsigned long nr_segs, size_t count);
diff --git a/mm/iov_iter.c b/mm/iov_iter.c
index 7b5dbd1..e95ed80 100644
--- a/mm/iov_iter.c
+++ b/mm/iov_iter.c
@@ -4,6 +4,96 @@
#include <linux/slab.h>
#include <linux/vmalloc.h>

+static size_t copy_to_iter_iovec(void *from, size_t bytes, struct iov_iter *i)
+{
+ size_t skip, copy, left, wanted;
+ const struct iovec *iov;
+ char __user *buf;
+
+ if (unlikely(bytes > i->count))
+ bytes = i->count;
+
+ if (unlikely(!bytes))
+ return 0;
+
+ wanted = bytes;
+ iov = i->iov;
+ skip = i->iov_offset;
+ buf = iov->iov_base + skip;
+ copy = min(bytes, iov->iov_len - skip);
+
+ left = __copy_to_user(buf, from, copy);
+ copy -= left;
+ skip += copy;
+ from += copy;
+ bytes -= copy;
+ while (unlikely(!left && bytes)) {
+ iov++;
+ buf = iov->iov_base;
+ copy = min(bytes, iov->iov_len);
+ left = __copy_to_user(buf, from, copy);
+ copy -= left;
+ skip = copy;
+ from += copy;
+ bytes -= copy;
+ }
+
+ if (skip == iov->iov_len) {
+ iov++;
+ skip = 0;
+ }
+ i->count -= wanted - bytes;
+ i->nr_segs -= iov - i->iov;
+ i->iov = iov;
+ i->iov_offset = skip;
+ return wanted - bytes;
+}
+
+static size_t copy_from_iter_iovec(void *to, size_t bytes, struct iov_iter *i)
+{
+ size_t skip, copy, left, wanted;
+ const struct iovec *iov;
+ char __user *buf;
+
+ if (unlikely(bytes > i->count))
+ bytes = i->count;
+
+ if (unlikely(!bytes))
+ return 0;
+
+ wanted = bytes;
+ iov = i->iov;
+ skip = i->iov_offset;
+ buf = iov->iov_base + skip;
+ copy = min(bytes, iov->iov_len - skip);
+
+ left = __copy_from_user(to, buf, copy);
+ copy -= left;
+ skip += copy;
+ to += copy;
+ bytes -= copy;
+ while (unlikely(!left && bytes)) {
+ iov++;
+ buf = iov->iov_base;
+ copy = min(bytes, iov->iov_len);
+ left = __copy_from_user(to, buf, copy);
+ copy -= left;
+ skip = copy;
+ to += copy;
+ bytes -= copy;
+ }
+
+ if (skip == iov->iov_len) {
+ iov++;
+ skip = 0;
+ }
+ i->count -= wanted - bytes;
+ i->nr_segs -= iov - i->iov;
+ i->iov = iov;
+ i->iov_offset = skip;
+ return wanted - bytes;
+}
+
static size_t copy_page_to_iter_iovec(struct page *page, size_t offset, size_t bytes,
struct iov_iter *i)
{
@@ -166,6 +256,50 @@ done:
return wanted - bytes;
}

+static size_t zero_iovec(size_t bytes, struct iov_iter *i)
+{
+ size_t skip, copy, left, wanted;
+ const struct iovec *iov;
+ char __user *buf;
+
+ if (unlikely(bytes > i->count))
+ bytes = i->count;
+
+ if (unlikely(!bytes))
+ return 0;
+
+ wanted = bytes;
+ iov = i->iov;
+ skip = i->iov_offset;
+ buf = iov->iov_base + skip;
+ copy = min(bytes, iov->iov_len - skip);
+
+ left = __clear_user(buf, copy);
+ copy -= left;
+ skip += copy;
+ bytes -= copy;
+
+ while (unlikely(!left && bytes)) {
+ iov++;
+ buf = iov->iov_base;
+ copy = min(bytes, iov->iov_len);
+ left = __clear_user(buf, copy);
+ copy -= left;
+ skip = copy;
+ bytes -= copy;
+ }
+
+ if (skip == iov->iov_len) {
+ iov++;
+ skip = 0;
+ }
+ i->count -= wanted - bytes;
+ i->nr_segs -= iov - i->iov;
+ i->iov = iov;
+ i->iov_offset = skip;
+ return wanted - bytes;
+}
+
static size_t __iovec_copy_from_user_inatomic(char *vaddr,
const struct iovec *iov, size_t base, size_t bytes)
{
@@ -412,12 +546,17 @@ static void memcpy_to_page(struct page *page, size_t offset, char *from, size_t
kunmap_atomic(to);
}

-static size_t copy_page_to_iter_bvec(struct page *page, size_t offset, size_t bytes,
- struct iov_iter *i)
+static void memzero_page(struct page *page, size_t offset, size_t len)
+{
+ char *addr = kmap_atomic(page);
+ memset(addr + offset, 0, len);
+ kunmap_atomic(addr);
+}
+
+static size_t copy_to_iter_bvec(void *from, size_t bytes, struct iov_iter *i)
{
size_t skip, copy, wanted;
const struct bio_vec *bvec;
- void *kaddr, *from;

if (unlikely(bytes > i->count))
bytes = i->count;
@@ -430,8 +569,6 @@ static size_t copy_page_to_iter_bvec(struct page *page, size_t offset, size_t by
skip = i->iov_offset;
copy = min_t(size_t, bytes, bvec->bv_len - skip);

- kaddr = kmap_atomic(page);
- from = kaddr + offset;
memcpy_to_page(bvec->bv_page, skip + bvec->bv_offset, from, copy);
skip += copy;
from += copy;
@@ -444,7 +581,6 @@ static size_t copy_page_to_iter_bvec(struct page *page, size_t offset, size_t by
from += copy;
bytes -= copy;
}
- kunmap_atomic(kaddr);
if (skip == bvec->bv_len) {
bvec++;
skip = 0;
@@ -456,12 +592,10 @@ static size_t copy_page_to_iter_bvec(struct page *page, size_t offset, size_t by
return wanted - bytes;
}

-static size_t copy_page_from_iter_bvec(struct page *page, size_t offset, size_t bytes,
- struct iov_iter *i)
+static size_t copy_from_iter_bvec(void *to, size_t bytes, struct iov_iter *i)
{
size_t skip, copy, wanted;
const struct bio_vec *bvec;
- void *kaddr, *to;

if (unlikely(bytes > i->count))
bytes = i->count;
@@ -473,10 +607,6 @@ static size_t copy_page_from_iter_bvec(struct page *page, size_t offset, size_t
bvec = i->bvec;
skip = i->iov_offset;

- kaddr = kmap_atomic(page);
-
- to = kaddr + offset;
-
copy = min(bytes, bvec->bv_len - skip);

memcpy_from_page(to, bvec->bv_page, bvec->bv_offset + skip, copy);
@@ -493,7 +623,6 @@ static size_t copy_page_from_iter_bvec(struct page *page, size_t offset, size_t
to += copy;
bytes -= copy;
}
- kunmap_atomic(kaddr);
if (skip == bvec->bv_len) {
bvec++;
skip = 0;
@@ -505,6 +634,61 @@ static size_t copy_page_from_iter_bvec(struct page *page, size_t offset, size_t
return wanted;
}

+static size_t copy_page_to_iter_bvec(struct page *page, size_t offset,
+ size_t bytes, struct iov_iter *i)
+{
+ void *kaddr = kmap_atomic(page);
+ size_t wanted = copy_to_iter_bvec(kaddr + offset, bytes, i);
+ kunmap_atomic(kaddr);
+ return wanted;
+}
+
+static size_t copy_page_from_iter_bvec(struct page *page, size_t offset,
+ size_t bytes, struct iov_iter *i)
+{
+ void *kaddr = kmap_atomic(page);
+ size_t wanted = copy_from_iter_bvec(kaddr + offset, bytes, i);
+ kunmap_atomic(kaddr);
+ return wanted;
+}
+
+static size_t zero_bvec(size_t bytes, struct iov_iter *i)
+{
+ size_t skip, copy, wanted;
+ const struct bio_vec *bvec;
+
+ if (unlikely(bytes > i->count))
+ bytes = i->count;
+
+ if (unlikely(!bytes))
+ return 0;
+
+ wanted = bytes;
+ bvec = i->bvec;
+ skip = i->iov_offset;
+ copy = min_t(size_t, bytes, bvec->bv_len - skip);
+
+ memzero_page(bvec->bv_page, skip + bvec->bv_offset, copy);
+ skip += copy;
+ bytes -= copy;
+ while (bytes) {
+ bvec++;
+ copy = min(bytes, (size_t)bvec->bv_len);
+ memzero_page(bvec->bv_page, bvec->bv_offset, copy);
+ skip = copy;
+ bytes -= copy;
+ }
+ if (skip == bvec->bv_len) {
+ bvec++;
+ skip = 0;
+ }
+ i->count -= wanted - bytes;
+ i->nr_segs -= bvec - i->bvec;
+ i->bvec = bvec;
+ i->iov_offset = skip;
+ return wanted - bytes;
+}
+
static size_t copy_from_user_bvec(struct page *page,
struct iov_iter *i, unsigned long offset, size_t bytes)
{
@@ -669,6 +853,31 @@ size_t copy_page_from_iter(struct page *page, size_t offset, size_t bytes,
}
EXPORT_SYMBOL(copy_page_from_iter);

+size_t copy_to_iter(void *addr, size_t bytes, struct iov_iter *i)
+{
+ if (i->type & ITER_BVEC)
+ return copy_to_iter_bvec(addr, bytes, i);
+ else
+ return copy_to_iter_iovec(addr, bytes, i);
+}
+
+size_t copy_from_iter(void *addr, size_t bytes, struct iov_iter *i)
+{
+ if (i->type & ITER_BVEC)
+ return copy_from_iter_bvec(addr, bytes, i);
+ else
+ return copy_from_iter_iovec(addr, bytes, i);
+}
+
+size_t iov_iter_zero(size_t bytes, struct iov_iter *i)
+{
+ if (i->type & ITER_BVEC) {
+ return zero_bvec(bytes, i);
+ } else {
+ return zero_iovec(bytes, i);
+ }
+}
+
size_t iov_iter_copy_from_user_atomic(struct page *page,
struct iov_iter *i, unsigned long offset, size_t bytes)
{
--
2.0.0
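[Editor's note: the segment walk implemented by copy_to_iter_iovec() above can be modelled in userspace. This is a sketch with mock types (miovec/miter), and memcpy stands in for __copy_to_user(), so short copies ("left") never occur and the caller must supply enough segment capacity; it is not the kernel API.]

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Mock of struct iovec / struct iov_iter for illustration only */
struct miovec { char *base; size_t len; };
struct miter {
	const struct miovec *iov;
	size_t nr_segs;
	size_t iov_offset;	/* offset into the current segment */
	size_t count;		/* bytes remaining in the iterator */
};

/* Userspace model of copy_to_iter_iovec(): copy into the current
 * segment, then walk forward through segments until bytes run out,
 * finally advancing the iterator state. */
static size_t mock_copy_to_iter(const char *from, size_t bytes, struct miter *i)
{
	const struct miovec *iov = i->iov;
	size_t skip = i->iov_offset;
	size_t wanted, copy;

	if (bytes > i->count)
		bytes = i->count;
	if (!bytes)
		return 0;
	wanted = bytes;

	copy = bytes < iov->len - skip ? bytes : iov->len - skip;
	memcpy(iov->base + skip, from, copy);
	skip += copy; from += copy; bytes -= copy;

	while (bytes) {			/* move on to the next segment */
		iov++;
		copy = bytes < iov->len ? bytes : iov->len;
		memcpy(iov->base, from, copy);
		skip = copy; from += copy; bytes -= copy;
	}
	if (skip == iov->len) {		/* segment consumed exactly */
		iov++;
		skip = 0;
	}
	i->count -= wanted - bytes;
	i->nr_segs -= (size_t)(iov - i->iov);
	i->iov = iov;
	i->iov_offset = skip;
	return wanted - bytes;
}
```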

2014-07-22 19:51:06

by Matthew Wilcox

Subject: [PATCH v8 02/22] Allow page fault handlers to perform the COW

Currently COW of an XIP file is done by first bringing in a read-only
mapping, then retrying the fault and copying the page. It is much more
efficient to tell the fault handler that a COW is being attempted (by
passing in the pre-allocated page in the vm_fault structure), and allow
the handler to perform the COW operation itself.
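[Editor's note: the shape of the change can be sketched in userspace. All names below are illustrative stand-ins, not the kernel API: the caller pre-allocates the destination page and passes it in, and the handler copies the data itself instead of forcing a read-only mapping plus a retried fault.]

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define PAGE_SZ 16	/* toy page size for the model */

/* Mock of struct vm_fault: cow_page is pre-allocated by the caller
 * when a COW is being attempted, NULL otherwise. */
struct mock_vmf {
	char *cow_page;
	const char *page;	/* what a plain read fault would return */
};

/* Mock fault handler: if cow_page is set, perform the copy directly
 * and report that no page needs to be mapped by the caller. */
static int mock_fault(struct mock_vmf *vmf, const char *file_data)
{
	if (vmf->cow_page) {
		memcpy(vmf->cow_page, file_data, PAGE_SZ);
		return 1;	/* handler did the COW itself */
	}
	vmf->page = file_data;	/* ordinary read fault */
	return 0;
}
```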

Signed-off-by: Matthew Wilcox <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
---
include/linux/mm.h | 1 +
mm/memory.c | 11 +++++++----
2 files changed, 8 insertions(+), 4 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index e03dd29..e04f531 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -208,6 +208,7 @@ struct vm_fault {
pgoff_t pgoff; /* Logical page offset based on vma */
void __user *virtual_address; /* Faulting virtual address */

+ struct page *cow_page; /* Handler may choose to COW */
struct page *page; /* ->fault handlers should return a
* page here, unless VM_FAULT_NOPAGE
* is set (which is also implied by
diff --git a/mm/memory.c b/mm/memory.c
index d67fd9f..42bf429 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2003,6 +2003,7 @@ static int do_page_mkwrite(struct vm_area_struct *vma, struct page *page,
vmf.pgoff = page->index;
vmf.flags = FAULT_FLAG_WRITE|FAULT_FLAG_MKWRITE;
vmf.page = page;
+ vmf.cow_page = NULL;

ret = vma->vm_ops->page_mkwrite(vma, &vmf);
if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE)))
@@ -2689,7 +2690,8 @@ oom:
}

static int __do_fault(struct vm_area_struct *vma, unsigned long address,
- pgoff_t pgoff, unsigned int flags, struct page **page)
+ pgoff_t pgoff, unsigned int flags,
+ struct page *cow_page, struct page **page)
{
struct vm_fault vmf;
int ret;
@@ -2698,6 +2700,7 @@ static int __do_fault(struct vm_area_struct *vma, unsigned long address,
vmf.pgoff = pgoff;
vmf.flags = flags;
vmf.page = NULL;
+ vmf.cow_page = cow_page;

ret = vma->vm_ops->fault(vma, &vmf);
if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
@@ -2890,7 +2893,7 @@ static int do_read_fault(struct mm_struct *mm, struct vm_area_struct *vma,
pte_unmap_unlock(pte, ptl);
}

- ret = __do_fault(vma, address, pgoff, flags, &fault_page);
+ ret = __do_fault(vma, address, pgoff, flags, NULL, &fault_page);
if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
return ret;

@@ -2929,7 +2932,7 @@ static int do_cow_fault(struct mm_struct *mm, struct vm_area_struct *vma,
return VM_FAULT_OOM;
}

- ret = __do_fault(vma, address, pgoff, flags, &fault_page);
+ ret = __do_fault(vma, address, pgoff, flags, new_page, &fault_page);
if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
goto uncharge_out;

@@ -2965,7 +2968,7 @@ static int do_shared_fault(struct mm_struct *mm, struct vm_area_struct *vma,
int dirtied = 0;
int ret, tmp;

- ret = __do_fault(vma, address, pgoff, flags, &fault_page);
+ ret = __do_fault(vma, address, pgoff, flags, NULL, &fault_page);
if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
return ret;

--
2.0.0

2014-07-22 19:51:47

by Matthew Wilcox

Subject: [PATCH v8 01/22] Fix XIP fault vs truncate race

Pagecache faults recheck i_size after taking the page lock to ensure that
the fault didn't race against a truncate. We don't have a page to lock
in the XIP case, so use the i_mmap_mutex instead. It is locked in the
truncate path in unmap_mapping_range() after updating i_size. So while
we hold it in the fault path, we are guaranteed that either i_size has
already been updated in the truncate path, or that the truncate will
subsequently call zap_page_range_single() and so remove the mapping we
have just inserted.

There is a window of time in which i_size has been reduced and the
thread has a mapping to a page which will be removed from the file,
but this is harmless as the page will not be allocated to a different
purpose before the thread's access to it is revoked.
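[Editor's note: the ordering argument above can be sketched with a userspace mutex. Truncate updates i_size first and only then takes the lock to zap mappings; the fault path rechecks i_size under that same lock, so it either sees the new size or its freshly inserted mapping is removed afterwards. Names are illustrative, not the kernel locking API.]

```c
#include <assert.h>
#include <pthread.h>

static pthread_mutex_t i_mmap_mutex = PTHREAD_MUTEX_INITIALIZER;
static long i_size = 4096;

/* Fault path: recheck i_size under i_mmap_mutex before inserting.
 * Returns 0 on success, -1 (SIGBUS in the kernel) if pgoff is now
 * beyond EOF. */
static int fault_insert(long pgoff, long page_size, int *mapped)
{
	int ret = 0;

	pthread_mutex_lock(&i_mmap_mutex);
	if (pgoff >= (i_size + page_size - 1) / page_size)
		ret = -1;		/* raced with truncate: refuse */
	else
		*mapped = 1;		/* vm_insert_mixed() stand-in */
	pthread_mutex_unlock(&i_mmap_mutex);
	return ret;
}

/* Truncate path: i_size is updated before the lock is taken, so any
 * fault that wins the lock race sees the reduced size. */
static void truncate_to(long new_size, int *mapped)
{
	i_size = new_size;
	pthread_mutex_lock(&i_mmap_mutex);
	*mapped = 0;			/* unmap_mapping_range() stand-in */
	pthread_mutex_unlock(&i_mmap_mutex);
}
```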

Signed-off-by: Matthew Wilcox <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
---
mm/filemap_xip.c | 24 ++++++++++++++++++++++--
1 file changed, 22 insertions(+), 2 deletions(-)

diff --git a/mm/filemap_xip.c b/mm/filemap_xip.c
index d8d9fe3..c8d23e9 100644
--- a/mm/filemap_xip.c
+++ b/mm/filemap_xip.c
@@ -260,8 +260,17 @@ again:
__xip_unmap(mapping, vmf->pgoff);

found:
+ /* We must recheck i_size under i_mmap_mutex */
+ mutex_lock(&mapping->i_mmap_mutex);
+ size = (i_size_read(inode) + PAGE_CACHE_SIZE - 1) >>
+ PAGE_CACHE_SHIFT;
+ if (unlikely(vmf->pgoff >= size)) {
+ mutex_unlock(&mapping->i_mmap_mutex);
+ return VM_FAULT_SIGBUS;
+ }
err = vm_insert_mixed(vma, (unsigned long)vmf->virtual_address,
xip_pfn);
+ mutex_unlock(&mapping->i_mmap_mutex);
if (err == -ENOMEM)
return VM_FAULT_OOM;
/*
@@ -285,16 +294,27 @@ found:
}
if (error != -ENODATA)
goto out;
+
+ /* We must recheck i_size under i_mmap_mutex */
+ mutex_lock(&mapping->i_mmap_mutex);
+ size = (i_size_read(inode) + PAGE_CACHE_SIZE - 1) >>
+ PAGE_CACHE_SHIFT;
+ if (unlikely(vmf->pgoff >= size)) {
+ ret = VM_FAULT_SIGBUS;
+ goto unlock;
+ }
/* not shared and writable, use xip_sparse_page() */
page = xip_sparse_page();
if (!page)
- goto out;
+ goto unlock;
err = vm_insert_page(vma, (unsigned long)vmf->virtual_address,
page);
if (err == -ENOMEM)
- goto out;
+ goto unlock;

ret = VM_FAULT_NOPAGE;
+unlock:
+ mutex_unlock(&mapping->i_mmap_mutex);
out:
write_seqcount_end(&xip_sparse_seq);
mutex_unlock(&xip_sparse_mutex);
--
2.0.0

2014-07-22 19:52:09

by Matthew Wilcox

Subject: [PATCH v8 13/22] Remove get_xip_mem

All callers of get_xip_mem() are now gone. Remove checks for it,
initialisers of it, documentation of it and the only implementation of it.
Also remove mm/filemap_xip.c as it is now empty.

Signed-off-by: Matthew Wilcox <[email protected]>
---
Documentation/filesystems/Locking | 3 ---
fs/exofs/inode.c | 1 -
fs/ext2/inode.c | 1 -
fs/ext2/xip.c | 45 ---------------------------------------
fs/ext2/xip.h | 3 ---
fs/open.c | 5 +----
include/linux/fs.h | 2 --
mm/Makefile | 1 -
mm/fadvise.c | 6 ++++--
mm/filemap_xip.c | 23 --------------------
mm/madvise.c | 2 +-
11 files changed, 6 insertions(+), 86 deletions(-)
delete mode 100644 mm/filemap_xip.c

diff --git a/Documentation/filesystems/Locking b/Documentation/filesystems/Locking
index b18dd17..958cf45 100644
--- a/Documentation/filesystems/Locking
+++ b/Documentation/filesystems/Locking
@@ -197,8 +197,6 @@ prototypes:
int (*releasepage) (struct page *, int);
void (*freepage)(struct page *);
int (*direct_IO)(int, struct kiocb *, struct iov_iter *iter, loff_t offset);
- int (*get_xip_mem)(struct address_space *, pgoff_t, int, void **,
- unsigned long *);
int (*migratepage)(struct address_space *, struct page *, struct page *);
int (*launder_page)(struct page *);
int (*is_partially_uptodate)(struct page *, unsigned long, unsigned long);
@@ -223,7 +221,6 @@ invalidatepage: yes
releasepage: yes
freepage: yes
direct_IO:
-get_xip_mem: maybe
migratepage: yes (both)
launder_page: yes
is_partially_uptodate: yes
diff --git a/fs/exofs/inode.c b/fs/exofs/inode.c
index 3f9cafd..c408a53 100644
--- a/fs/exofs/inode.c
+++ b/fs/exofs/inode.c
@@ -985,7 +985,6 @@ const struct address_space_operations exofs_aops = {
.direct_IO = exofs_direct_IO,

/* With these NULL has special meaning or default is not exported */
- .get_xip_mem = NULL,
.migratepage = NULL,
.launder_page = NULL,
.is_partially_uptodate = NULL,
diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
index 5ac0a34..59d6c7d 100644
--- a/fs/ext2/inode.c
+++ b/fs/ext2/inode.c
@@ -894,7 +894,6 @@ const struct address_space_operations ext2_aops = {

const struct address_space_operations ext2_aops_xip = {
.bmap = ext2_bmap,
- .get_xip_mem = ext2_get_xip_mem,
.direct_IO = ext2_direct_IO,
};

diff --git a/fs/ext2/xip.c b/fs/ext2/xip.c
index 8cfca3a..132d4da 100644
--- a/fs/ext2/xip.c
+++ b/fs/ext2/xip.c
@@ -13,35 +13,6 @@
#include "ext2.h"
#include "xip.h"

-static inline long __inode_direct_access(struct inode *inode, sector_t block,
- void **kaddr, unsigned long *pfn, long size)
-{
- struct block_device *bdev = inode->i_sb->s_bdev;
- sector_t sector = block * (PAGE_SIZE / 512);
- return bdev_direct_access(bdev, sector, kaddr, pfn, size);
-}
-
-static inline int
-__ext2_get_block(struct inode *inode, pgoff_t pgoff, int create,
- sector_t *result)
-{
- struct buffer_head tmp;
- int rc;
-
- memset(&tmp, 0, sizeof(struct buffer_head));
- tmp.b_size = 1 << inode->i_blkbits;
- rc = ext2_get_block(inode, pgoff, &tmp, create);
- *result = tmp.b_blocknr;
-
- /* did we get a sparse block (hole in the file)? */
- if (!tmp.b_blocknr && !rc) {
- BUG_ON(create);
- rc = -ENODATA;
- }
-
- return rc;
-}
-
void ext2_xip_verify_sb(struct super_block *sb)
{
struct ext2_sb_info *sbi = EXT2_SB(sb);
@@ -54,19 +25,3 @@ void ext2_xip_verify_sb(struct super_block *sb)
"not supported by bdev");
}
}
-
-int ext2_get_xip_mem(struct address_space *mapping, pgoff_t pgoff, int create,
- void **kmem, unsigned long *pfn)
-{
- long rc;
- sector_t block;
-
- /* first, retrieve the sector number */
- rc = __ext2_get_block(mapping->host, pgoff, create, &block);
- if (rc)
- return rc;
-
- /* retrieve address of the target data */
- rc = __inode_direct_access(mapping->host, block, kmem, pfn, PAGE_SIZE);
- return (rc < 0) ? rc : 0;
-}
diff --git a/fs/ext2/xip.h b/fs/ext2/xip.h
index b2592f2..e7b9f0a 100644
--- a/fs/ext2/xip.h
+++ b/fs/ext2/xip.h
@@ -12,10 +12,7 @@ static inline int ext2_use_xip (struct super_block *sb)
struct ext2_sb_info *sbi = EXT2_SB(sb);
return (sbi->s_mount_opt & EXT2_MOUNT_XIP);
}
-int ext2_get_xip_mem(struct address_space *, pgoff_t, int,
- void **, unsigned long *);
#else
#define ext2_xip_verify_sb(sb) do { } while (0)
#define ext2_use_xip(sb) 0
-#define ext2_get_xip_mem NULL
#endif
diff --git a/fs/open.c b/fs/open.c
index 36662d0..1a9f0d6 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -656,11 +656,8 @@ int open_check_o_direct(struct file *f)
{
/* NB: we're sure to have correct a_ops only after f_op->open */
if (f->f_flags & O_DIRECT) {
- if (!f->f_mapping->a_ops ||
- ((!f->f_mapping->a_ops->direct_IO) &&
- (!f->f_mapping->a_ops->get_xip_mem))) {
+ if (!f->f_mapping->a_ops || !f->f_mapping->a_ops->direct_IO)
return -EINVAL;
- }
}
return 0;
}
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 29fafaf..fc6f6bb 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -349,8 +349,6 @@ struct address_space_operations {
int (*releasepage) (struct page *, gfp_t);
void (*freepage)(struct page *);
ssize_t (*direct_IO)(int, struct kiocb *, struct iov_iter *iter, loff_t offset);
- int (*get_xip_mem)(struct address_space *, pgoff_t, int,
- void **, unsigned long *);
/*
* migrate the contents of a page to the specified target. If
* migrate_mode is MIGRATE_ASYNC, it must not block.
diff --git a/mm/Makefile b/mm/Makefile
index 4064f3e..34d1598 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -47,7 +47,6 @@ obj-$(CONFIG_SLUB) += slub.o
obj-$(CONFIG_KMEMCHECK) += kmemcheck.o
obj-$(CONFIG_FAILSLAB) += failslab.o
obj-$(CONFIG_MEMORY_HOTPLUG) += memory_hotplug.o
-obj-$(CONFIG_FS_XIP) += filemap_xip.o
obj-$(CONFIG_MIGRATION) += migrate.o
obj-$(CONFIG_QUICKLIST) += quicklist.o
obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o
diff --git a/mm/fadvise.c b/mm/fadvise.c
index 3bcfd81..1f1925f 100644
--- a/mm/fadvise.c
+++ b/mm/fadvise.c
@@ -28,6 +28,7 @@
SYSCALL_DEFINE4(fadvise64_64, int, fd, loff_t, offset, loff_t, len, int, advice)
{
struct fd f = fdget(fd);
+ struct inode *inode;
struct address_space *mapping;
struct backing_dev_info *bdi;
loff_t endbyte; /* inclusive */
@@ -39,7 +40,8 @@ SYSCALL_DEFINE4(fadvise64_64, int, fd, loff_t, offset, loff_t, len, int, advice)
if (!f.file)
return -EBADF;

- if (S_ISFIFO(file_inode(f.file)->i_mode)) {
+ inode = file_inode(f.file);
+ if (S_ISFIFO(inode->i_mode)) {
ret = -ESPIPE;
goto out;
}
@@ -50,7 +52,7 @@ SYSCALL_DEFINE4(fadvise64_64, int, fd, loff_t, offset, loff_t, len, int, advice)
goto out;
}

- if (mapping->a_ops->get_xip_mem) {
+ if (IS_DAX(inode)) {
switch (advice) {
case POSIX_FADV_NORMAL:
case POSIX_FADV_RANDOM:
diff --git a/mm/filemap_xip.c b/mm/filemap_xip.c
deleted file mode 100644
index 6316578..0000000
--- a/mm/filemap_xip.c
+++ /dev/null
@@ -1,23 +0,0 @@
-/*
- * linux/mm/filemap_xip.c
- *
- * Copyright (C) 2005 IBM Corporation
- * Author: Carsten Otte <[email protected]>
- *
- * derived from linux/mm/filemap.c - Copyright (C) Linus Torvalds
- *
- */
-
-#include <linux/fs.h>
-#include <linux/pagemap.h>
-#include <linux/export.h>
-#include <linux/uio.h>
-#include <linux/rmap.h>
-#include <linux/mmu_notifier.h>
-#include <linux/sched.h>
-#include <linux/seqlock.h>
-#include <linux/mutex.h>
-#include <linux/gfp.h>
-#include <asm/tlbflush.h>
-#include <asm/io.h>
-
diff --git a/mm/madvise.c b/mm/madvise.c
index a402f8f..dee5dff 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -236,7 +236,7 @@ static long madvise_willneed(struct vm_area_struct *vma,
if (!file)
return -EBADF;

- if (file->f_mapping->a_ops->get_xip_mem) {
+ if (IS_DAX(file_inode(file))) {
/* no bad return value, but ignore advice */
return 0;
}
--
2.0.0

2014-07-22 19:52:32

by Matthew Wilcox

Subject: [PATCH v8 08/22] Replace XIP read and write with DAX I/O

Use the generic AIO infrastructure instead of custom read and write
methods. In addition to giving us support for AIO, this adds the missing
locking between read() and truncate().
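[Editor's note: the read side of dax_io() below reduces to two cases per extent: a mapped, written extent is copied straight out of the directly addressable memory, while a hole yields zeroes without touching storage. A userspace sketch with hypothetical helper names (the real code walks buffer_heads filled in by get_block()):]

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Mock extent: addr == NULL models a hole in the file */
struct mock_extent {
	const char *addr;
	size_t len;
};

/* Model of the per-extent read dispatch in dax_io(): copy from the
 * mapped address (the copy_to_iter() path) or emit zeroes for a hole
 * (the iov_iter_zero() path). Returns bytes produced. */
static size_t mock_dax_read(char *dst, size_t bytes,
			    const struct mock_extent *ext)
{
	size_t n = bytes < ext->len ? bytes : ext->len;

	if (ext->addr)
		memcpy(dst, ext->addr, n);
	else
		memset(dst, 0, n);
	return n;
}
```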

Signed-off-by: Matthew Wilcox <[email protected]>
Reviewed-by: Ross Zwisler <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
---
fs/Makefile | 1 +
fs/dax.c | 195 ++++++++++++++++++++++++++++++++++++++++++++
fs/ext2/file.c | 6 +-
fs/ext2/inode.c | 8 +-
include/linux/fs.h | 18 ++++-
mm/filemap.c | 6 +-
mm/filemap_xip.c | 234 -----------------------------------------------------
7 files changed, 223 insertions(+), 245 deletions(-)
create mode 100644 fs/dax.c

diff --git a/fs/Makefile b/fs/Makefile
index 4030cbf..03f65091 100644
--- a/fs/Makefile
+++ b/fs/Makefile
@@ -28,6 +28,7 @@ obj-$(CONFIG_SIGNALFD) += signalfd.o
obj-$(CONFIG_TIMERFD) += timerfd.o
obj-$(CONFIG_EVENTFD) += eventfd.o
obj-$(CONFIG_AIO) += aio.o
+obj-$(CONFIG_FS_XIP) += dax.o
obj-$(CONFIG_FILE_LOCKING) += locks.o
obj-$(CONFIG_COMPAT) += compat.o compat_ioctl.o
obj-$(CONFIG_BINFMT_AOUT) += binfmt_aout.o
diff --git a/fs/dax.c b/fs/dax.c
new file mode 100644
index 0000000..108c68e
--- /dev/null
+++ b/fs/dax.c
@@ -0,0 +1,195 @@
+/*
+ * fs/dax.c - Direct Access filesystem code
+ * Copyright (c) 2013-2014 Intel Corporation
+ * Author: Matthew Wilcox <[email protected]>
+ * Author: Ross Zwisler <[email protected]>
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for
+ * more details.
+ */
+
+#include <linux/atomic.h>
+#include <linux/blkdev.h>
+#include <linux/buffer_head.h>
+#include <linux/fs.h>
+#include <linux/genhd.h>
+#include <linux/mutex.h>
+#include <linux/uio.h>
+
+static long dax_get_addr(struct buffer_head *bh, void **addr, unsigned blkbits)
+{
+ unsigned long pfn;
+ sector_t sector = bh->b_blocknr << (blkbits - 9);
+ return bdev_direct_access(bh->b_bdev, sector, addr, &pfn, bh->b_size);
+}
+
+static void dax_new_buf(void *addr, unsigned size, unsigned first, loff_t pos,
+ loff_t end)
+{
+ loff_t final = end - pos + first; /* The final byte of the buffer */
+
+ if (first > 0)
+ memset(addr, 0, first);
+ if (final < size)
+ memset(addr + final, 0, size - final);
+}
+
+static bool buffer_written(struct buffer_head *bh)
+{
+ return buffer_mapped(bh) && !buffer_unwritten(bh);
+}
+
+/*
+ * When ext4 encounters a hole, it returns without modifying the buffer_head
+ * which means that we can't trust b_size. To cope with this, we set b_state
+ * to 0 before calling get_block and, if any bit is set, we know we can trust
+ * b_size. Unfortunate, really, since ext4 knows precisely how long a hole is
+ * and would save us time calling get_block repeatedly.
+ */
+static bool buffer_size_valid(struct buffer_head *bh)
+{
+ return bh->b_state != 0;
+}
+
+static ssize_t dax_io(int rw, struct inode *inode, struct iov_iter *iter,
+ loff_t start, loff_t end, get_block_t get_block,
+ struct buffer_head *bh)
+{
+ ssize_t retval = 0;
+ loff_t pos = start;
+ loff_t max = start;
+ loff_t bh_max = start;
+ void *addr;
+ bool hole = false;
+
+ if (rw != WRITE)
+ end = min(end, i_size_read(inode));
+
+ while (pos < end) {
+ unsigned len;
+ if (pos == max) {
+ unsigned blkbits = inode->i_blkbits;
+ sector_t block = pos >> blkbits;
+ unsigned first = pos - (block << blkbits);
+ long size;
+
+ if (pos == bh_max) {
+ bh->b_size = PAGE_ALIGN(end - pos);
+ bh->b_state = 0;
+ retval = get_block(inode, block, bh,
+ rw == WRITE);
+ if (retval)
+ break;
+ if (!buffer_size_valid(bh))
+ bh->b_size = 1 << blkbits;
+ bh_max = pos - first + bh->b_size;
+ } else {
+ unsigned done = bh->b_size -
+ (bh_max - (pos - first));
+ bh->b_blocknr += done >> blkbits;
+ bh->b_size -= done;
+ }
+ if (rw == WRITE) {
+ if (!buffer_mapped(bh)) {
+ retval = -EIO;
+ /* FIXME: fall back to buffered I/O */
+ break;
+ }
+ hole = false;
+ } else {
+ hole = !buffer_written(bh);
+ }
+
+ if (hole) {
+ addr = NULL;
+ size = bh->b_size - first;
+ } else {
+ retval = dax_get_addr(bh, &addr, blkbits);
+ if (retval < 0)
+ break;
+ if (buffer_unwritten(bh) || buffer_new(bh))
+ dax_new_buf(addr, retval, first, pos,
+ end);
+ addr += first;
+ size = retval - first;
+ }
+ max = min(pos + size, end);
+ }
+
+ if (rw == WRITE)
+ len = copy_from_iter(addr, max - pos, iter);
+ else if (!hole)
+ len = copy_to_iter(addr, max - pos, iter);
+ else
+ len = iov_iter_zero(max - pos, iter);
+
+ if (!len)
+ break;
+
+ pos += len;
+ addr += len;
+ }
+
+ return (pos == start) ? retval : pos - start;
+}
+
+/**
+ * dax_do_io - Perform I/O to a DAX file
+ * @rw: READ to read or WRITE to write
+ * @iocb: The control block for this I/O
+ * @inode: The file which the I/O is directed at
+ * @iter: The addresses to do I/O from or to
+ * @pos: The file offset where the I/O starts
+ * @get_block: The filesystem method used to translate file offsets to blocks
+ * @end_io: A filesystem callback for I/O completion
+ * @flags: See below
+ *
+ * This function uses the same locking scheme as do_blockdev_direct_IO:
+ * If @flags has DIO_LOCKING set, we assume that the i_mutex is held by the
+ * caller for writes. For reads, we take and release the i_mutex ourselves.
+ * If DIO_LOCKING is not set, the filesystem takes care of its own locking.
+ * As with do_blockdev_direct_IO(), we increment i_dio_count while the I/O
+ * is in progress.
+ */
+ssize_t dax_do_io(int rw, struct kiocb *iocb, struct inode *inode,
+ struct iov_iter *iter, loff_t pos,
+ get_block_t get_block, dio_iodone_t end_io, int flags)
+{
+ struct buffer_head bh;
+ ssize_t retval = -EINVAL;
+ loff_t end = pos + iov_iter_count(iter);
+
+ memset(&bh, 0, sizeof(bh));
+
+ if ((flags & DIO_LOCKING) && (rw == READ)) {
+ struct address_space *mapping = inode->i_mapping;
+ mutex_lock(&inode->i_mutex);
+ retval = filemap_write_and_wait_range(mapping, pos, end - 1);
+ if (retval) {
+ mutex_unlock(&inode->i_mutex);
+ goto out;
+ }
+ }
+
+ /* Protects against truncate */
+ atomic_inc(&inode->i_dio_count);
+
+ retval = dax_io(rw, inode, iter, pos, end, get_block, &bh);
+
+ if ((flags & DIO_LOCKING) && (rw == READ))
+ mutex_unlock(&inode->i_mutex);
+
+ if ((retval > 0) && end_io)
+ end_io(iocb, pos, retval, bh.b_private);
+
+ inode_dio_done(inode);
+ out:
+ return retval;
+}
+EXPORT_SYMBOL_GPL(dax_do_io);
diff --git a/fs/ext2/file.c b/fs/ext2/file.c
index 7c87b22..a247123 100644
--- a/fs/ext2/file.c
+++ b/fs/ext2/file.c
@@ -81,8 +81,10 @@ const struct file_operations ext2_file_operations = {
#ifdef CONFIG_EXT2_FS_XIP
const struct file_operations ext2_xip_file_operations = {
.llseek = generic_file_llseek,
- .read = xip_file_read,
- .write = xip_file_write,
+ .read = new_sync_read,
+ .write = new_sync_write,
+ .read_iter = generic_file_read_iter,
+ .write_iter = generic_file_write_iter,
.unlocked_ioctl = ext2_ioctl,
#ifdef CONFIG_COMPAT
.compat_ioctl = ext2_compat_ioctl,
diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
index 0cb0448..3ccd5fd 100644
--- a/fs/ext2/inode.c
+++ b/fs/ext2/inode.c
@@ -859,7 +859,12 @@ ext2_direct_IO(int rw, struct kiocb *iocb, struct iov_iter *iter,
size_t count = iov_iter_count(iter);
ssize_t ret;

- ret = blockdev_direct_IO(rw, iocb, inode, iter, offset, ext2_get_block);
+ if (IS_DAX(inode))
+ ret = dax_do_io(rw, iocb, inode, iter, offset, ext2_get_block,
+ NULL, DIO_LOCKING);
+ else
+ ret = blockdev_direct_IO(rw, iocb, inode, iter, offset,
+ ext2_get_block);
if (ret < 0 && (rw & WRITE))
ext2_write_failed(mapping, offset + count);
return ret;
@@ -888,6 +893,7 @@ const struct address_space_operations ext2_aops = {
const struct address_space_operations ext2_aops_xip = {
.bmap = ext2_bmap,
.get_xip_mem = ext2_get_xip_mem,
+ .direct_IO = ext2_direct_IO,
};

const struct address_space_operations ext2_nobh_aops = {
diff --git a/include/linux/fs.h b/include/linux/fs.h
index a94e5d7..87a135a 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2463,17 +2463,22 @@ extern int generic_file_open(struct inode * inode, struct file * filp);
extern int nonseekable_open(struct inode * inode, struct file * filp);

#ifdef CONFIG_FS_XIP
-extern ssize_t xip_file_read(struct file *filp, char __user *buf, size_t len,
- loff_t *ppos);
extern int xip_file_mmap(struct file * file, struct vm_area_struct * vma);
-extern ssize_t xip_file_write(struct file *filp, const char __user *buf,
- size_t len, loff_t *ppos);
extern int xip_truncate_page(struct address_space *mapping, loff_t from);
+ssize_t dax_do_io(int rw, struct kiocb *, struct inode *, struct iov_iter *,
+ loff_t, get_block_t, dio_iodone_t, int flags);
#else
static inline int xip_truncate_page(struct address_space *mapping, loff_t from)
{
return 0;
}
+
+static inline ssize_t dax_do_io(int rw, struct kiocb *iocb, struct inode *inode,
+ struct iov_iter *iter, loff_t pos,
+ get_block_t get_block, dio_iodone_t end_io, int flags)
+{
+ return -ENOTTY;
+}
#endif

#ifdef CONFIG_BLOCK
@@ -2630,6 +2635,11 @@ extern int generic_show_options(struct seq_file *m, struct dentry *root);
extern void save_mount_options(struct super_block *sb, char *options);
extern void replace_mount_options(struct super_block *sb, char *options);

+static inline bool io_is_direct(struct file *filp)
+{
+ return (filp->f_flags & O_DIRECT) || IS_DAX(file_inode(filp));
+}
+
static inline ino_t parent_ino(struct dentry *dentry)
{
ino_t res;
diff --git a/mm/filemap.c b/mm/filemap.c
index dafb06f..e2b5ccf 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1681,8 +1681,7 @@ generic_file_read_iter(struct kiocb *iocb, struct iov_iter *iter)
loff_t *ppos = &iocb->ki_pos;
loff_t pos = *ppos;

- /* coalesce the iovecs and go direct-to-BIO for O_DIRECT */
- if (file->f_flags & O_DIRECT) {
+ if (io_is_direct(file)) {
struct address_space *mapping = file->f_mapping;
struct inode *inode = mapping->host;
size_t count = iov_iter_count(iter);
@@ -2558,8 +2557,7 @@ ssize_t __generic_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
if (err)
goto out;

- /* coalesce the iovecs and go direct-to-BIO for O_DIRECT */
- if (unlikely(file->f_flags & O_DIRECT)) {
+ if (io_is_direct(file)) {
loff_t endbyte;

written = generic_file_direct_write(iocb, from, pos);
diff --git a/mm/filemap_xip.c b/mm/filemap_xip.c
index c8d23e9..f7c37a1 100644
--- a/mm/filemap_xip.c
+++ b/mm/filemap_xip.c
@@ -42,119 +42,6 @@ static struct page *xip_sparse_page(void)
}

/*
- * This is a file read routine for execute in place files, and uses
- * the mapping->a_ops->get_xip_mem() function for the actual low-level
- * stuff.
- *
- * Note the struct file* is not used at all. It may be NULL.
- */
-static ssize_t
-do_xip_mapping_read(struct address_space *mapping,
- struct file_ra_state *_ra,
- struct file *filp,
- char __user *buf,
- size_t len,
- loff_t *ppos)
-{
- struct inode *inode = mapping->host;
- pgoff_t index, end_index;
- unsigned long offset;
- loff_t isize, pos;
- size_t copied = 0, error = 0;
-
- BUG_ON(!mapping->a_ops->get_xip_mem);
-
- pos = *ppos;
- index = pos >> PAGE_CACHE_SHIFT;
- offset = pos & ~PAGE_CACHE_MASK;
-
- isize = i_size_read(inode);
- if (!isize)
- goto out;
-
- end_index = (isize - 1) >> PAGE_CACHE_SHIFT;
- do {
- unsigned long nr, left;
- void *xip_mem;
- unsigned long xip_pfn;
- int zero = 0;
-
- /* nr is the maximum number of bytes to copy from this page */
- nr = PAGE_CACHE_SIZE;
- if (index >= end_index) {
- if (index > end_index)
- goto out;
- nr = ((isize - 1) & ~PAGE_CACHE_MASK) + 1;
- if (nr <= offset) {
- goto out;
- }
- }
- nr = nr - offset;
- if (nr > len - copied)
- nr = len - copied;
-
- error = mapping->a_ops->get_xip_mem(mapping, index, 0,
- &xip_mem, &xip_pfn);
- if (unlikely(error)) {
- if (error == -ENODATA) {
- /* sparse */
- zero = 1;
- } else
- goto out;
- }
-
- /* If users can be writing to this page using arbitrary
- * virtual addresses, take care about potential aliasing
- * before reading the page on the kernel side.
- */
- if (mapping_writably_mapped(mapping))
- /* address based flush */ ;
-
- /*
- * Ok, we have the mem, so now we can copy it to user space...
- *
- * The actor routine returns how many bytes were actually used..
- * NOTE! This may not be the same as how much of a user buffer
- * we filled up (we may be padding etc), so we can only update
- * "pos" here (the actor routine has to update the user buffer
- * pointers and the remaining count).
- */
- if (!zero)
- left = __copy_to_user(buf+copied, xip_mem+offset, nr);
- else
- left = __clear_user(buf + copied, nr);
-
- if (left) {
- error = -EFAULT;
- goto out;
- }
-
- copied += (nr - left);
- offset += (nr - left);
- index += offset >> PAGE_CACHE_SHIFT;
- offset &= ~PAGE_CACHE_MASK;
- } while (copied < len);
-
-out:
- *ppos = pos + copied;
- if (filp)
- file_accessed(filp);
-
- return (copied ? copied : error);
-}
-
-ssize_t
-xip_file_read(struct file *filp, char __user *buf, size_t len, loff_t *ppos)
-{
- if (!access_ok(VERIFY_WRITE, buf, len))
- return -EFAULT;
-
- return do_xip_mapping_read(filp->f_mapping, &filp->f_ra, filp,
- buf, len, ppos);
-}
-EXPORT_SYMBOL_GPL(xip_file_read);
-
-/*
* __xip_unmap is invoked from xip_unmap and
* xip_write
*
@@ -340,127 +227,6 @@ int xip_file_mmap(struct file * file, struct vm_area_struct * vma)
}
EXPORT_SYMBOL_GPL(xip_file_mmap);

-static ssize_t
-__xip_file_write(struct file *filp, const char __user *buf,
- size_t count, loff_t pos, loff_t *ppos)
-{
- struct address_space * mapping = filp->f_mapping;
- const struct address_space_operations *a_ops = mapping->a_ops;
- struct inode *inode = mapping->host;
- long status = 0;
- size_t bytes;
- ssize_t written = 0;
-
- BUG_ON(!mapping->a_ops->get_xip_mem);
-
- do {
- unsigned long index;
- unsigned long offset;
- size_t copied;
- void *xip_mem;
- unsigned long xip_pfn;
-
- offset = (pos & (PAGE_CACHE_SIZE -1)); /* Within page */
- index = pos >> PAGE_CACHE_SHIFT;
- bytes = PAGE_CACHE_SIZE - offset;
- if (bytes > count)
- bytes = count;
-
- status = a_ops->get_xip_mem(mapping, index, 0,
- &xip_mem, &xip_pfn);
- if (status == -ENODATA) {
- /* we allocate a new page unmap it */
- mutex_lock(&xip_sparse_mutex);
- status = a_ops->get_xip_mem(mapping, index, 1,
- &xip_mem, &xip_pfn);
- mutex_unlock(&xip_sparse_mutex);
- if (!status)
- /* unmap page at pgoff from all other vmas */
- __xip_unmap(mapping, index);
- }
-
- if (status)
- break;
-
- copied = bytes -
- __copy_from_user_nocache(xip_mem + offset, buf, bytes);
-
- if (likely(copied > 0)) {
- status = copied;
-
- if (status >= 0) {
- written += status;
- count -= status;
- pos += status;
- buf += status;
- }
- }
- if (unlikely(copied != bytes))
- if (status >= 0)
- status = -EFAULT;
- if (status < 0)
- break;
- } while (count);
- *ppos = pos;
- /*
- * No need to use i_size_read() here, the i_size
- * cannot change under us because we hold i_mutex.
- */
- if (pos > inode->i_size) {
- i_size_write(inode, pos);
- mark_inode_dirty(inode);
- }
-
- return written ? written : status;
-}
-
-ssize_t
-xip_file_write(struct file *filp, const char __user *buf, size_t len,
- loff_t *ppos)
-{
- struct address_space *mapping = filp->f_mapping;
- struct inode *inode = mapping->host;
- size_t count;
- loff_t pos;
- ssize_t ret;
-
- mutex_lock(&inode->i_mutex);
-
- if (!access_ok(VERIFY_READ, buf, len)) {
- ret=-EFAULT;
- goto out_up;
- }
-
- pos = *ppos;
- count = len;
-
- /* We can write back this queue in page reclaim */
- current->backing_dev_info = mapping->backing_dev_info;
-
- ret = generic_write_checks(filp, &pos, &count, S_ISBLK(inode->i_mode));
- if (ret)
- goto out_backing;
- if (count == 0)
- goto out_backing;
-
- ret = file_remove_suid(filp);
- if (ret)
- goto out_backing;
-
- ret = file_update_time(filp);
- if (ret)
- goto out_backing;
-
- ret = __xip_file_write (filp, buf, count, pos, ppos);
-
- out_backing:
- current->backing_dev_info = NULL;
- out_up:
- mutex_unlock(&inode->i_mutex);
- return ret;
-}
-EXPORT_SYMBOL_GPL(xip_file_write);
-
/*
* truncate a page used for execute in place
* functionality is analog to block_truncate_page but does use get_xip_mem
--
2.0.0

2014-07-22 19:48:38

by Matthew Wilcox

Subject: [PATCH v8 15/22] ext2: Remove ext2_use_xip

Replace ext2_use_xip() with test_opt(XIP), which expands to the same code.

Signed-off-by: Matthew Wilcox <[email protected]>
---
fs/ext2/ext2.h | 4 ++++
fs/ext2/inode.c | 2 +-
fs/ext2/namei.c | 4 ++--
3 files changed, 7 insertions(+), 3 deletions(-)

diff --git a/fs/ext2/ext2.h b/fs/ext2/ext2.h
index d9a17d0..5ecf570 100644
--- a/fs/ext2/ext2.h
+++ b/fs/ext2/ext2.h
@@ -380,7 +380,11 @@ struct ext2_inode {
#define EXT2_MOUNT_NO_UID32 0x000200 /* Disable 32-bit UIDs */
#define EXT2_MOUNT_XATTR_USER 0x004000 /* Extended user attributes */
#define EXT2_MOUNT_POSIX_ACL 0x008000 /* POSIX Access Control Lists */
+#ifdef CONFIG_FS_XIP
#define EXT2_MOUNT_XIP 0x010000 /* Execute in place */
+#else
+#define EXT2_MOUNT_XIP 0
+#endif
#define EXT2_MOUNT_USRQUOTA 0x020000 /* user quota */
#define EXT2_MOUNT_GRPQUOTA 0x040000 /* group quota */
#define EXT2_MOUNT_RESERVATION 0x080000 /* Preallocation */
diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
index 59d6c7d..cba3833 100644
--- a/fs/ext2/inode.c
+++ b/fs/ext2/inode.c
@@ -1394,7 +1394,7 @@ struct inode *ext2_iget (struct super_block *sb, unsigned long ino)

if (S_ISREG(inode->i_mode)) {
inode->i_op = &ext2_file_inode_operations;
- if (ext2_use_xip(inode->i_sb)) {
+ if (test_opt(inode->i_sb, XIP)) {
inode->i_mapping->a_ops = &ext2_aops_xip;
inode->i_fop = &ext2_xip_file_operations;
} else if (test_opt(inode->i_sb, NOBH)) {
diff --git a/fs/ext2/namei.c b/fs/ext2/namei.c
index c268d0a..846c356 100644
--- a/fs/ext2/namei.c
+++ b/fs/ext2/namei.c
@@ -105,7 +105,7 @@ static int ext2_create (struct inode * dir, struct dentry * dentry, umode_t mode
return PTR_ERR(inode);

inode->i_op = &ext2_file_inode_operations;
- if (ext2_use_xip(inode->i_sb)) {
+ if (test_opt(inode->i_sb, XIP)) {
inode->i_mapping->a_ops = &ext2_aops_xip;
inode->i_fop = &ext2_xip_file_operations;
} else if (test_opt(inode->i_sb, NOBH)) {
@@ -126,7 +126,7 @@ static int ext2_tmpfile(struct inode *dir, struct dentry *dentry, umode_t mode)
return PTR_ERR(inode);

inode->i_op = &ext2_file_inode_operations;
- if (ext2_use_xip(inode->i_sb)) {
+ if (test_opt(inode->i_sb, XIP)) {
inode->i_mapping->a_ops = &ext2_aops_xip;
inode->i_fop = &ext2_xip_file_operations;
} else if (test_opt(inode->i_sb, NOBH)) {
--
2.0.0

2014-07-22 19:53:21

by Matthew Wilcox

Subject: [PATCH v8 21/22] ext4: Add DAX functionality

From: Ross Zwisler <[email protected]>

This is a port of the DAX functionality found in the current version of
ext2.

Signed-off-by: Ross Zwisler <[email protected]>
Reviewed-by: Andreas Dilger <[email protected]>
[heavily tweaked]
Signed-off-by: Matthew Wilcox <[email protected]>
---
Documentation/filesystems/dax.txt | 1 +
Documentation/filesystems/ext4.txt | 2 ++
fs/ext4/ext4.h | 6 +++++
fs/ext4/file.c | 53 +++++++++++++++++++++++++++++++++-----
fs/ext4/indirect.c | 18 +++++++++----
fs/ext4/inode.c | 51 +++++++++++++++++++++++-------------
fs/ext4/namei.c | 10 +++++--
fs/ext4/super.c | 39 +++++++++++++++++++++++++++-
8 files changed, 148 insertions(+), 32 deletions(-)

diff --git a/Documentation/filesystems/dax.txt b/Documentation/filesystems/dax.txt
index 1fd3a6f..741e4cb 100644
--- a/Documentation/filesystems/dax.txt
+++ b/Documentation/filesystems/dax.txt
@@ -73,6 +73,7 @@ or a write()) work correctly.

These filesystems may be used for inspiration:
- ext2: the second extended filesystem, see Documentation/filesystems/ext2.txt
+- ext4: the fourth extended filesystem, see Documentation/filesystems/ext4.txt


Shortcomings
diff --git a/Documentation/filesystems/ext4.txt b/Documentation/filesystems/ext4.txt
index 919a329..9c511c4 100644
--- a/Documentation/filesystems/ext4.txt
+++ b/Documentation/filesystems/ext4.txt
@@ -386,6 +386,8 @@ max_dir_size_kb=n This limits the size of directories so that any
i_version Enable 64-bit inode version support. This option is
off by default.

+dax Use direct access if possible
+
Data Mode
=========
There are 3 different data modes:
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 7cc5a0e..a9687b3 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -970,6 +970,11 @@ struct ext4_inode_info {
#define EXT4_MOUNT_ERRORS_MASK 0x00070
#define EXT4_MOUNT_MINIX_DF 0x00080 /* Mimics the Minix statfs */
#define EXT4_MOUNT_NOLOAD 0x00100 /* Don't use existing journal*/
+#ifdef CONFIG_FS_DAX
+#define EXT4_MOUNT_DAX 0x00200 /* Execute in place */
+#else
+#define EXT4_MOUNT_DAX 0
+#endif
#define EXT4_MOUNT_DATA_FLAGS 0x00C00 /* Mode for data writes: */
#define EXT4_MOUNT_JOURNAL_DATA 0x00400 /* Write data to journal */
#define EXT4_MOUNT_ORDERED_DATA 0x00800 /* Flush data before commit */
@@ -2557,6 +2562,7 @@ extern const struct file_operations ext4_dir_operations;
/* file.c */
extern const struct inode_operations ext4_file_inode_operations;
extern const struct file_operations ext4_file_operations;
+extern const struct file_operations ext4_dax_file_operations;
extern loff_t ext4_llseek(struct file *file, loff_t offset, int origin);

/* inline.c */
diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 8695f70..9c7bde5 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -95,7 +95,7 @@ ext4_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
struct inode *inode = file_inode(iocb->ki_filp);
struct mutex *aio_mutex = NULL;
struct blk_plug plug;
- int o_direct = file->f_flags & O_DIRECT;
+ int o_direct = io_is_direct(file);
int overwrite = 0;
size_t length = iov_iter_count(from);
ssize_t ret;
@@ -191,6 +191,27 @@ errout:
return ret;
}

+#ifdef CONFIG_FS_DAX
+static int ext4_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
+{
+ return dax_fault(vma, vmf, ext4_get_block);
+ /* Is this the right get_block? */
+}
+
+static int ext4_dax_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
+{
+ return dax_mkwrite(vma, vmf, ext4_get_block);
+}
+
+static const struct vm_operations_struct ext4_dax_vm_ops = {
+ .fault = ext4_dax_fault,
+ .page_mkwrite = ext4_dax_mkwrite,
+ .remap_pages = generic_file_remap_pages,
+};
+#else
+#define ext4_dax_vm_ops ext4_file_vm_ops
+#endif
+
static const struct vm_operations_struct ext4_file_vm_ops = {
.fault = filemap_fault,
.map_pages = filemap_map_pages,
@@ -200,12 +221,13 @@ static const struct vm_operations_struct ext4_file_vm_ops = {

static int ext4_file_mmap(struct file *file, struct vm_area_struct *vma)
{
- struct address_space *mapping = file->f_mapping;
-
- if (!mapping->a_ops->readpage)
- return -ENOEXEC;
file_accessed(file);
- vma->vm_ops = &ext4_file_vm_ops;
+ if (IS_DAX(file_inode(file))) {
+ vma->vm_ops = &ext4_dax_vm_ops;
+ vma->vm_flags |= VM_MIXEDMAP;
+ } else {
+ vma->vm_ops = &ext4_file_vm_ops;
+ }
return 0;
}

@@ -604,6 +626,25 @@ const struct file_operations ext4_file_operations = {
.fallocate = ext4_fallocate,
};

+#ifdef CONFIG_FS_DAX
+const struct file_operations ext4_dax_file_operations = {
+ .llseek = ext4_llseek,
+ .read = new_sync_read,
+ .write = new_sync_write,
+ .read_iter = generic_file_read_iter,
+ .write_iter = ext4_file_write_iter,
+ .unlocked_ioctl = ext4_ioctl,
+#ifdef CONFIG_COMPAT
+ .compat_ioctl = ext4_compat_ioctl,
+#endif
+ .mmap = ext4_file_mmap,
+ .open = ext4_file_open,
+ .release = ext4_release_file,
+ .fsync = ext4_sync_file,
+ .fallocate = ext4_fallocate,
+};
+#endif
+
const struct inode_operations ext4_file_inode_operations = {
.setattr = ext4_setattr,
.getattr = ext4_getattr,
diff --git a/fs/ext4/indirect.c b/fs/ext4/indirect.c
index fd69da1..a07578c 100644
--- a/fs/ext4/indirect.c
+++ b/fs/ext4/indirect.c
@@ -691,14 +691,22 @@ retry:
inode_dio_done(inode);
goto locked;
}
- ret = __blockdev_direct_IO(rw, iocb, inode,
- inode->i_sb->s_bdev, iter, offset,
- ext4_get_block, NULL, NULL, 0);
+ if (IS_DAX(inode))
+ ret = dax_do_io(rw, iocb, inode, iter, offset,
+ ext4_get_block, NULL, 0);
+ else
+ ret = __blockdev_direct_IO(rw, iocb, inode,
+ inode->i_sb->s_bdev, iter, offset,
+ ext4_get_block, NULL, NULL, 0);
inode_dio_done(inode);
} else {
locked:
- ret = blockdev_direct_IO(rw, iocb, inode, iter,
- offset, ext4_get_block);
+ if (IS_DAX(inode))
+ ret = dax_do_io(rw, iocb, inode, iter, offset,
+ ext4_get_block, NULL, DIO_LOCKING);
+ else
+ ret = blockdev_direct_IO(rw, iocb, inode, iter,
+ offset, ext4_get_block);

if (unlikely((rw & WRITE) && ret < 0)) {
loff_t isize = i_size_read(inode);
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 8a06473..b7e7655 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3173,13 +3173,14 @@ static ssize_t ext4_ext_direct_IO(int rw, struct kiocb *iocb,
get_block_func = ext4_get_block_write;
dio_flags = DIO_LOCKING;
}
- ret = __blockdev_direct_IO(rw, iocb, inode,
- inode->i_sb->s_bdev, iter,
- offset,
- get_block_func,
- ext4_end_io_dio,
- NULL,
- dio_flags);
+ if (IS_DAX(inode))
+ ret = dax_do_io(rw, iocb, inode, iter, offset, get_block_func,
+ ext4_end_io_dio, dio_flags);
+ else
+ ret = __blockdev_direct_IO(rw, iocb, inode,
+ inode->i_sb->s_bdev, iter, offset,
+ get_block_func,
+ ext4_end_io_dio, NULL, dio_flags);

/*
* Put our reference to io_end. This can free the io_end structure e.g.
@@ -3343,14 +3344,7 @@ void ext4_set_aops(struct inode *inode)
inode->i_mapping->a_ops = &ext4_aops;
}

-/*
- * ext4_block_zero_page_range() zeros out a mapping of length 'length'
- * starting from file offset 'from'. The range to be zero'd must
- * be contained with in one block. If the specified range exceeds
- * the end of the block it will be shortened to end of the block
- * that cooresponds to 'from'
- */
-static int ext4_block_zero_page_range(handle_t *handle,
+static int __ext4_block_zero_page_range(handle_t *handle,
struct address_space *mapping, loff_t from, loff_t length)
{
ext4_fsblk_t index = from >> PAGE_CACHE_SHIFT;
@@ -3441,6 +3435,22 @@ unlock:
}

/*
+ * ext4_block_zero_page_range() zeros out a mapping of length 'length'
+ * starting from file offset 'from'. The range to be zero'd must
+ * be contained with in one block. If the specified range exceeds
+ * the end of the block it will be shortened to end of the block
+ * that cooresponds to 'from'
+ */
+static int ext4_block_zero_page_range(handle_t *handle,
+ struct address_space *mapping, loff_t from, loff_t length)
+{
+ struct inode *inode = mapping->host;
+ if (IS_DAX(inode))
+ return dax_zero_page_range(inode, from, length, ext4_get_block);
+ return __ext4_block_zero_page_range(handle, mapping, from, length);
+}
+
+/*
* ext4_block_truncate_page() zeroes out a mapping from file offset `from'
* up to the end of the block which corresponds to `from'.
* This required during truncate. We need to physically zero the tail end
@@ -3961,8 +3971,10 @@ void ext4_set_inode_flags(struct inode *inode)
new_fl |= S_NOATIME;
if (flags & EXT4_DIRSYNC_FL)
new_fl |= S_DIRSYNC;
+ if (test_opt(inode->i_sb, DAX))
+ new_fl |= S_DAX;
inode_set_flags(inode, new_fl,
- S_SYNC|S_APPEND|S_IMMUTABLE|S_NOATIME|S_DIRSYNC);
+ S_SYNC|S_APPEND|S_IMMUTABLE|S_NOATIME|S_DIRSYNC|S_DAX);
}

/* Propagate flags from i_flags to EXT4_I(inode)->i_flags */
@@ -4216,7 +4228,10 @@ struct inode *ext4_iget(struct super_block *sb, unsigned long ino)

if (S_ISREG(inode->i_mode)) {
inode->i_op = &ext4_file_inode_operations;
- inode->i_fop = &ext4_file_operations;
+ if (test_opt(inode->i_sb, DAX))
+ inode->i_fop = &ext4_dax_file_operations;
+ else
+ inode->i_fop = &ext4_file_operations;
ext4_set_aops(inode);
} else if (S_ISDIR(inode->i_mode)) {
inode->i_op = &ext4_dir_inode_operations;
@@ -4686,7 +4701,7 @@ int ext4_setattr(struct dentry *dentry, struct iattr *attr)
* Truncate pagecache after we've waited for commit
* in data=journal mode to make pages freeable.
*/
- truncate_pagecache(inode, inode->i_size);
+ truncate_pagecache(inode, inode->i_size);
}
/*
* We want to call ext4_truncate() even if attr->ia_size ==
diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c
index 3520ab8..2ef6adf 100644
--- a/fs/ext4/namei.c
+++ b/fs/ext4/namei.c
@@ -2251,7 +2251,10 @@ retry:
err = PTR_ERR(inode);
if (!IS_ERR(inode)) {
inode->i_op = &ext4_file_inode_operations;
- inode->i_fop = &ext4_file_operations;
+ if (test_opt(inode->i_sb, DAX))
+ inode->i_fop = &ext4_dax_file_operations;
+ else
+ inode->i_fop = &ext4_file_operations;
ext4_set_aops(inode);
err = ext4_add_nondir(handle, dentry, inode);
if (!err && IS_DIRSYNC(dir))
@@ -2315,7 +2318,10 @@ retry:
err = PTR_ERR(inode);
if (!IS_ERR(inode)) {
inode->i_op = &ext4_file_inode_operations;
- inode->i_fop = &ext4_file_operations;
+ if (test_opt(inode->i_sb, DAX))
+ inode->i_fop = &ext4_dax_file_operations;
+ else
+ inode->i_fop = &ext4_file_operations;
ext4_set_aops(inode);
d_tmpfile(dentry, inode);
err = ext4_orphan_add(handle, inode);
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 6df7bc6..41bb01e 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -1162,7 +1162,7 @@ enum {
Opt_usrjquota, Opt_grpjquota, Opt_offusrjquota, Opt_offgrpjquota,
Opt_jqfmt_vfsold, Opt_jqfmt_vfsv0, Opt_jqfmt_vfsv1, Opt_quota,
Opt_noquota, Opt_barrier, Opt_nobarrier, Opt_err,
- Opt_usrquota, Opt_grpquota, Opt_i_version,
+ Opt_usrquota, Opt_grpquota, Opt_i_version, Opt_dax,
Opt_stripe, Opt_delalloc, Opt_nodelalloc, Opt_mblk_io_submit,
Opt_nomblk_io_submit, Opt_block_validity, Opt_noblock_validity,
Opt_inode_readahead_blks, Opt_journal_ioprio,
@@ -1224,6 +1224,7 @@ static const match_table_t tokens = {
{Opt_barrier, "barrier"},
{Opt_nobarrier, "nobarrier"},
{Opt_i_version, "i_version"},
+ {Opt_dax, "dax"},
{Opt_stripe, "stripe=%u"},
{Opt_delalloc, "delalloc"},
{Opt_nodelalloc, "nodelalloc"},
@@ -1406,6 +1407,7 @@ static const struct mount_opts {
{Opt_min_batch_time, 0, MOPT_GTE0},
{Opt_inode_readahead_blks, 0, MOPT_GTE0},
{Opt_init_itable, 0, MOPT_GTE0},
+ {Opt_dax, EXT4_MOUNT_DAX, MOPT_SET},
{Opt_stripe, 0, MOPT_GTE0},
{Opt_resuid, 0, MOPT_GTE0},
{Opt_resgid, 0, MOPT_GTE0},
@@ -1642,6 +1644,11 @@ static int handle_mount_opt(struct super_block *sb, char *opt, int token,
}
sbi->s_jquota_fmt = m->mount_opt;
#endif
+#ifndef CONFIG_FS_DAX
+ } else if (token == Opt_dax) {
+ ext4_msg(sb, KERN_INFO, "dax option not supported");
+ return -1;
+#endif
} else {
if (!args->from)
arg = 1;
@@ -3575,6 +3582,11 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
"both data=journal and dioread_nolock");
goto failed_mount;
}
+ if (test_opt(sb, DAX)) {
+ ext4_msg(sb, KERN_ERR, "can't mount with "
+ "both data=journal and dax");
+ goto failed_mount;
+ }
if (test_opt(sb, DELALLOC))
clear_opt(sb, DELALLOC);
}
@@ -3638,6 +3650,19 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
goto failed_mount;
}

+ if (sbi->s_mount_opt & EXT4_MOUNT_DAX) {
+ if (blocksize != PAGE_SIZE) {
+ ext4_msg(sb, KERN_ERR,
+ "error: unsupported blocksize for dax");
+ goto failed_mount;
+ }
+ if (!sb->s_bdev->bd_disk->fops->direct_access) {
+ ext4_msg(sb, KERN_ERR,
+ "error: device does not support dax");
+ goto failed_mount;
+ }
+ }
+
if (sb->s_blocksize != blocksize) {
/* Validate the filesystem blocksize */
if (!sb_set_blocksize(sb, blocksize)) {
@@ -4846,6 +4871,18 @@ static int ext4_remount(struct super_block *sb, int *flags, char *data)
err = -EINVAL;
goto restore_opts;
}
+ if (test_opt(sb, DAX)) {
+ ext4_msg(sb, KERN_ERR, "can't mount with "
+ "both data=journal and dax");
+ err = -EINVAL;
+ goto restore_opts;
+ }
+ }
+
+ if ((sbi->s_mount_opt ^ old_opts.s_mount_opt) & EXT4_MOUNT_DAX) {
+ ext4_msg(sb, KERN_WARNING, "warning: refusing change of "
+ "dax flag with busy inodes while remounting");
+ sbi->s_mount_opt ^= EXT4_MOUNT_DAX;
}

if (sbi->s_mount_flags & EXT4_MF_FS_ABORTED)
--
2.0.0

2014-07-22 19:53:27

by Matthew Wilcox

Subject: [PATCH v8 20/22] xip: Add xip_zero_page_range

This new function allows us to support hole-punch for XIP files by zeroing
a partial page, as opposed to the xip_truncate_page() function which can
only truncate to the end of the page. Reimplement xip_truncate_page() as
a macro that calls xip_zero_page_range().

Signed-off-by: Matthew Wilcox <[email protected]>
[ported to 3.13-rc2]
Signed-off-by: Ross Zwisler <[email protected]>
---
Documentation/filesystems/dax.txt | 1 +
fs/dax.c | 20 ++++++++++++++------
include/linux/fs.h | 9 ++++++++-
3 files changed, 23 insertions(+), 7 deletions(-)

diff --git a/Documentation/filesystems/dax.txt b/Documentation/filesystems/dax.txt
index 6441766..1fd3a6f 100644
--- a/Documentation/filesystems/dax.txt
+++ b/Documentation/filesystems/dax.txt
@@ -62,6 +62,7 @@ Filesystem support consists of
for fault and page_mkwrite (which should probably call dax_fault() and
dax_mkwrite(), passing the appropriate get_block() callback)
- calling dax_truncate_page() instead of block_truncate_page() for DAX files
+- calling dax_zero_page_range() instead of zero_user() for DAX files
- ensuring that there is sufficient locking between reads, writes,
truncates and page faults

diff --git a/fs/dax.c b/fs/dax.c
index b9bc5e2..39b95b1 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -451,13 +451,16 @@ int dax_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf,
EXPORT_SYMBOL_GPL(dax_mkwrite);

/**
- * dax_truncate_page - handle a partial page being truncated in a DAX file
+ * dax_zero_page_range - zero a range within a page of a DAX file
* @inode: The file being truncated
* @from: The file offset that is being truncated to
+ * @length: The number of bytes to zero
* @get_block: The filesystem method used to translate file offsets to blocks
*
- * Similar to block_truncate_page(), this function can be called by a
- * filesystem when it is truncating an DAX file to handle the partial page.
+ * This function can be called by a filesystem when it is zeroing part of a
+ * page in a DAX file. This is intended for hole-punch operations. If
+ * you are truncating a file, the helper function dax_truncate_page() may be
+ * more convenient.
*
* We work in terms of PAGE_CACHE_SIZE here for commonality with
* block_truncate_page(), but we could go down to PAGE_SIZE if the filesystem
@@ -465,12 +468,12 @@ EXPORT_SYMBOL_GPL(dax_mkwrite);
* block size is smaller than PAGE_SIZE, we have to zero the rest of the page
* since the file might be mmaped.
*/
-int dax_truncate_page(struct inode *inode, loff_t from, get_block_t get_block)
+int dax_zero_page_range(struct inode *inode, loff_t from, unsigned length,
+ get_block_t get_block)
{
struct buffer_head bh;
pgoff_t index = from >> PAGE_CACHE_SHIFT;
unsigned offset = from & (PAGE_CACHE_SIZE-1);
- unsigned length = PAGE_CACHE_ALIGN(from) - from;
int err;

/* Block boundary? Nothing to do */
@@ -487,9 +490,14 @@ int dax_truncate_page(struct inode *inode, loff_t from, get_block_t get_block)
err = dax_get_addr(&bh, &addr, inode->i_blkbits);
if (err < 0)
return err;
+ /*
+ * ext4 sometimes asks to zero past the end of a block. It
+ * really just wants to zero to the end of the block.
+ */
+ length = min_t(unsigned, length, PAGE_CACHE_SIZE - offset);
memset(addr + offset, 0, length);
}

return 0;
}
-EXPORT_SYMBOL_GPL(dax_truncate_page);
+EXPORT_SYMBOL_GPL(dax_zero_page_range);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index a75a045..44facd9 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2463,6 +2463,7 @@ extern int nonseekable_open(struct inode * inode, struct file * filp);

#ifdef CONFIG_FS_DAX
int dax_clear_blocks(struct inode *, sector_t block, long size);
+int dax_zero_page_range(struct inode *, loff_t from, unsigned len, get_block_t);
int dax_truncate_page(struct inode *, loff_t from, get_block_t);
ssize_t dax_do_io(int rw, struct kiocb *, struct inode *, struct iov_iter *,
loff_t, get_block_t, dio_iodone_t, int flags);
@@ -2474,7 +2475,8 @@ static inline int dax_clear_blocks(struct inode *i, sector_t blk, long sz)
return 0;
}

-static inline int dax_truncate_page(struct inode *i, loff_t frm, get_block_t gb)
+static inline int dax_zero_page_range(struct inode *inode, loff_t from,
+ unsigned len, get_block_t gb)
{
return 0;
}
@@ -2487,6 +2489,11 @@ static inline ssize_t dax_do_io(int rw, struct kiocb *iocb, struct inode *inode,
}
#endif

+/* Can't be a function because PAGE_CACHE_SIZE is defined in pagemap.h */
+#define dax_truncate_page(inode, from, get_block) \
+ dax_zero_page_range(inode, from, PAGE_CACHE_SIZE, get_block)
+
+
#ifdef CONFIG_BLOCK
typedef void (dio_submit_t)(int rw, struct bio *bio, struct inode *inode,
loff_t file_offset);
--
2.0.0

2014-07-22 19:53:19

by Matthew Wilcox

Subject: [PATCH v8 05/22] Add vm_replace_mixed()

From: Matthew Wilcox <[email protected]>

vm_insert_mixed() will fail if there is already a valid PTE at that
location. The DAX code would rather replace the previous value with
the new PTE.

Signed-off-by: Matthew Wilcox <[email protected]>
---
include/linux/mm.h | 8 ++++++--
mm/memory.c | 34 +++++++++++++++++++++-------------
2 files changed, 27 insertions(+), 15 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index e04f531..8d1194c 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1958,8 +1958,12 @@ int remap_pfn_range(struct vm_area_struct *, unsigned long addr,
int vm_insert_page(struct vm_area_struct *, unsigned long addr, struct page *);
int vm_insert_pfn(struct vm_area_struct *vma, unsigned long addr,
unsigned long pfn);
-int vm_insert_mixed(struct vm_area_struct *vma, unsigned long addr,
- unsigned long pfn);
+int __vm_insert_mixed(struct vm_area_struct *vma, unsigned long addr,
+ unsigned long pfn, bool replace);
+#define vm_insert_mixed(vma, addr, pfn) \
+ __vm_insert_mixed(vma, addr, pfn, false)
+#define vm_replace_mixed(vma, addr, pfn) \
+ __vm_insert_mixed(vma, addr, pfn, true)
int vm_iomap_memory(struct vm_area_struct *vma, phys_addr_t start, unsigned long len);


diff --git a/mm/memory.c b/mm/memory.c
index 42bf429..cf06c97 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1476,7 +1476,7 @@ pte_t *__get_locked_pte(struct mm_struct *mm, unsigned long addr,
* pages reserved for the old functions anyway.
*/
static int insert_page(struct vm_area_struct *vma, unsigned long addr,
- struct page *page, pgprot_t prot)
+ struct page *page, pgprot_t prot, bool replace)
{
struct mm_struct *mm = vma->vm_mm;
int retval;
@@ -1492,8 +1492,12 @@ static int insert_page(struct vm_area_struct *vma, unsigned long addr,
if (!pte)
goto out;
retval = -EBUSY;
- if (!pte_none(*pte))
- goto out_unlock;
+ if (!pte_none(*pte)) {
+ if (!replace)
+ goto out_unlock;
+ VM_BUG_ON(!mutex_is_locked(&vma->vm_file->f_mapping->i_mmap_mutex));
+ zap_page_range_single(vma, addr, PAGE_SIZE, NULL);
+ }

/* Ok, finally just insert the thing.. */
get_page(page);
@@ -1549,12 +1553,12 @@ int vm_insert_page(struct vm_area_struct *vma, unsigned long addr,
BUG_ON(vma->vm_flags & VM_PFNMAP);
vma->vm_flags |= VM_MIXEDMAP;
}
- return insert_page(vma, addr, page, vma->vm_page_prot);
+ return insert_page(vma, addr, page, vma->vm_page_prot, false);
}
EXPORT_SYMBOL(vm_insert_page);

static int insert_pfn(struct vm_area_struct *vma, unsigned long addr,
- unsigned long pfn, pgprot_t prot)
+ unsigned long pfn, pgprot_t prot, bool replace)
{
struct mm_struct *mm = vma->vm_mm;
int retval;
@@ -1566,8 +1570,12 @@ static int insert_pfn(struct vm_area_struct *vma, unsigned long addr,
if (!pte)
goto out;
retval = -EBUSY;
- if (!pte_none(*pte))
- goto out_unlock;
+ if (!pte_none(*pte)) {
+ if (!replace)
+ goto out_unlock;
+ VM_BUG_ON(!mutex_is_locked(&vma->vm_file->f_mapping->i_mmap_mutex));
+ zap_page_range_single(vma, addr, PAGE_SIZE, NULL);
+ }

/* Ok, finally just insert the thing.. */
entry = pte_mkspecial(pfn_pte(pfn, prot));
@@ -1620,14 +1628,14 @@ int vm_insert_pfn(struct vm_area_struct *vma, unsigned long addr,
if (track_pfn_insert(vma, &pgprot, pfn))
return -EINVAL;

- ret = insert_pfn(vma, addr, pfn, pgprot);
+ ret = insert_pfn(vma, addr, pfn, pgprot, false);

return ret;
}
EXPORT_SYMBOL(vm_insert_pfn);

-int vm_insert_mixed(struct vm_area_struct *vma, unsigned long addr,
- unsigned long pfn)
+int __vm_insert_mixed(struct vm_area_struct *vma, unsigned long addr,
+ unsigned long pfn, bool replace)
{
BUG_ON(!(vma->vm_flags & VM_MIXEDMAP));

@@ -1645,11 +1653,11 @@ int vm_insert_mixed(struct vm_area_struct *vma, unsigned long addr,
struct page *page;

page = pfn_to_page(pfn);
- return insert_page(vma, addr, page, vma->vm_page_prot);
+ return insert_page(vma, addr, page, vma->vm_page_prot, replace);
}
- return insert_pfn(vma, addr, pfn, vma->vm_page_prot);
+ return insert_pfn(vma, addr, pfn, vma->vm_page_prot, replace);
}
-EXPORT_SYMBOL(vm_insert_mixed);
+EXPORT_SYMBOL(__vm_insert_mixed);

/*
* maps a range of physical memory into the requested pages. the old
--
2.0.0

2014-07-22 19:53:15

by Matthew Wilcox

Subject: [PATCH v8 12/22] Replace XIP documentation with DAX documentation

From: Matthew Wilcox <[email protected]>

Based on the original XIP documentation, this documents the current
state of affairs, and includes instructions on how users can enable DAX
if their devices and kernel support it.

Signed-off-by: Matthew Wilcox <[email protected]>
Reviewed-by: Randy Dunlap <[email protected]>
---
Documentation/filesystems/dax.txt | 89 +++++++++++++++++++++++++++++++++++++++
Documentation/filesystems/xip.txt | 71 -------------------------------
2 files changed, 89 insertions(+), 71 deletions(-)
create mode 100644 Documentation/filesystems/dax.txt
delete mode 100644 Documentation/filesystems/xip.txt

diff --git a/Documentation/filesystems/dax.txt b/Documentation/filesystems/dax.txt
new file mode 100644
index 0000000..6441766
--- /dev/null
+++ b/Documentation/filesystems/dax.txt
@@ -0,0 +1,89 @@
+Direct Access for files
+-----------------------
+
+Motivation
+----------
+
+The page cache is usually used to buffer reads and writes to files.
+It is also used to provide the pages which are mapped into userspace
+by a call to mmap.
+
+For block devices that are memory-like, the page cache pages would be
+unnecessary copies of the original storage. The DAX code removes the
+extra copy by performing reads and writes directly to the storage device.
+For file mappings, the storage device is mapped directly into userspace.
+
+
+Usage
+-----
+
+If you have a block device which supports DAX, you can make a filesystem
+on it as usual. When mounting it, use the -o dax option manually
+or add 'dax' to the options in /etc/fstab.
+
+
+Implementation Tips for Block Driver Writers
+--------------------------------------------
+
+To support DAX in your block driver, implement the 'direct_access'
+block device operation. It is used to translate the sector number
+(expressed in units of 512-byte sectors) to a page frame number (pfn)
+that identifies the physical page for the memory. It also returns a
+kernel virtual address that can be used to access the memory.
+
+The direct_access method takes a 'size' parameter that indicates the
+number of bytes being requested. The function should return the number
+of bytes that it can provide, although it must not exceed the number of
+bytes requested. It may also return a negative errno if an error occurs.
+
+In order to support this method, the storage must be byte-accessible by
+the CPU at all times. If your device uses paging techniques to expose
+a large amount of memory through a smaller window, then you cannot
+implement direct_access. Equally, if your device can occasionally
+stall the CPU for an extended period, you should also not attempt to
+implement direct_access.
+
+These block devices may be used for inspiration:
+- axonram: Axon DDR2 device driver
+- brd: RAM backed block device driver
+- dcssblk: s390 dcss block device driver
+
+
+Implementation Tips for Filesystem Writers
+------------------------------------------
+
+Filesystem support consists of
+- adding support to mark inodes as being DAX by setting the S_DAX flag in
+ i_flags
+- implementing the direct_IO address space operation, and calling
+ dax_do_io() instead of blockdev_direct_IO() if S_DAX is set
+- implementing an mmap file operation for DAX files which sets the
+ VM_MIXEDMAP flag on the VMA, and setting the vm_ops to include handlers
+ for fault and page_mkwrite (which should probably call dax_fault() and
+ dax_mkwrite(), passing the appropriate get_block() callback)
+- calling dax_truncate_page() instead of block_truncate_page() for DAX files
+- ensuring that there is sufficient locking between reads, writes,
+ truncates and page faults
+
+The get_block() callback passed to the DAX functions may return
+uninitialised extents. If it does, it must ensure that simultaneous
+calls to get_block() (for example by a page-fault racing with a read()
+or a write()) work correctly.
+
+These filesystems may be used for inspiration:
+- ext2: the second extended filesystem, see Documentation/filesystems/ext2.txt
+
+
+Shortcomings
+------------
+
+Even if the kernel or its modules are stored on a filesystem that supports
+DAX on a block device that supports DAX, they will still be copied into RAM.
+
+Calling get_user_pages() on a range of user memory that has been mmaped
+from a DAX file will fail, as there are no 'struct page' structures to
+describe those pages. This problem is being worked on. It means that
+O_DIRECT reads/writes that use those memory ranges as the user buffer
+will fail (note that O_DIRECT reads/writes _of a DAX file_ do work; it
+is the memory being accessed that is key here). Other things that will
+not work include RDMA, sendfile() and splice().
diff --git a/Documentation/filesystems/xip.txt b/Documentation/filesystems/xip.txt
deleted file mode 100644
index b62eabf..0000000
--- a/Documentation/filesystems/xip.txt
+++ /dev/null
@@ -1,71 +0,0 @@
-Execute-in-place for file mappings
-----------------------------------
-
-Motivation
-----------
-File mappings are performed by mapping page cache pages to userspace. In
-addition, read&write type file operations also transfer data from/to the page
-cache.
-
-For memory backed storage devices that use the block device interface, the page
-cache pages are in fact copies of the original storage. Various approaches
-exist to work around the need for an extra copy. The ramdisk driver for example
-does read the data into the page cache, keeps a reference, and discards the
-original data behind later on.
-
-Execute-in-place solves this issue the other way around: instead of keeping
-data in the page cache, the need to have a page cache copy is eliminated
-completely. With execute-in-place, read&write type operations are performed
-directly from/to the memory backed storage device. For file mappings, the
-storage device itself is mapped directly into userspace.
-
-This implementation was initially written for shared memory segments between
-different virtual machines on s390 hardware to allow multiple machines to
-share the same binaries and libraries.
-
-Implementation
---------------
-Execute-in-place is implemented in three steps: block device operation,
-address space operation, and file operations.
-
-A block device operation named direct_access is used to translate the
-block device sector number to a page frame number (pfn) that identifies
-the physical page for the memory. It also returns a kernel virtual
-address that can be used to access the memory.
-
-The direct_access method takes a 'size' parameter that indicates the
-number of bytes being requested. The function should return the number
-of bytes that it can provide, although it must not exceed the number of
-bytes requested. It may also return a negative errno if an error occurs.
-
-The block device operation is optional, these block devices support it as of
-today:
-- dcssblk: s390 dcss block device driver
-
-An address space operation named get_xip_mem is used to retrieve references
-to a page frame number and a kernel address. To obtain these values a reference
-to an address_space is provided. This function assigns values to the kmem and
-pfn parameters. The third argument indicates whether the function should allocate
-blocks if needed.
-
-This address space operation is mutually exclusive with readpage&writepage that
-do page cache read/write operations.
-The following filesystems support it as of today:
-- ext2: the second extended filesystem, see Documentation/filesystems/ext2.txt
-
-A set of file operations that do utilize get_xip_page can be found in
-mm/filemap_xip.c . The following file operation implementations are provided:
-- aio_read/aio_write
-- readv/writev
-- sendfile
-
-The generic file operations do_sync_read/do_sync_write can be used to implement
-classic synchronous IO calls.
-
-Shortcomings
-------------
-This implementation is limited to storage devices that are cpu addressable at
-all times (no highmem or such). It works well on rom/ram, but enhancements are
-needed to make it work with flash in read+write mode.
-Putting the Linux kernel and/or its modules on a xip filesystem does not mean
-they are not copied.
--
2.0.0

2014-07-22 19:54:07

by Matthew Wilcox

Subject: [PATCH v8 10/22] Replace the XIP page fault handler with the DAX page fault handler

Instead of calling aops->get_xip_mem from the fault handler, the
filesystem passes a get_block_t that is used to find the appropriate
blocks.

Signed-off-by: Matthew Wilcox <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
---
fs/dax.c | 221 +++++++++++++++++++++++++++++++++++++++++++++++++++++
fs/ext2/file.c | 35 ++++++++-
include/linux/fs.h | 4 +-
mm/filemap_xip.c | 206 -------------------------------------------------
4 files changed, 257 insertions(+), 209 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 02e226f..4ab4890 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -19,9 +19,13 @@
#include <linux/buffer_head.h>
#include <linux/fs.h>
#include <linux/genhd.h>
+#include <linux/highmem.h>
+#include <linux/memcontrol.h>
+#include <linux/mm.h>
#include <linux/mutex.h>
#include <linux/sched.h>
#include <linux/uio.h>
+#include <linux/vmstat.h>

int dax_clear_blocks(struct inode *inode, sector_t block, long size)
{
@@ -64,6 +68,14 @@ static long dax_get_addr(struct buffer_head *bh, void **addr, unsigned blkbits)
return bdev_direct_access(bh->b_bdev, sector, addr, &pfn, bh->b_size);
}

+static long dax_get_pfn(struct buffer_head *bh, unsigned long *pfn,
+ unsigned blkbits)
+{
+ void *addr;
+ sector_t sector = bh->b_blocknr << (blkbits - 9);
+ return bdev_direct_access(bh->b_bdev, sector, &addr, pfn, bh->b_size);
+}
+
static void dax_new_buf(void *addr, unsigned size, unsigned first, loff_t pos,
loff_t end)
{
@@ -228,3 +240,212 @@ ssize_t dax_do_io(int rw, struct kiocb *iocb, struct inode *inode,
return retval;
}
EXPORT_SYMBOL_GPL(dax_do_io);
+
+/*
+ * The user has performed a load from a hole in the file. Allocating
+ * a new page in the file would cause excessive storage usage for
+ * workloads with sparse files. We allocate a page cache page instead.
+ * We'll kick it out of the page cache if it's ever written to,
+ * otherwise it will simply fall out of the page cache under memory
+ * pressure without ever having been dirtied.
+ */
+static int dax_load_hole(struct address_space *mapping, struct page *page,
+ struct vm_fault *vmf)
+{
+ unsigned long size;
+ struct inode *inode = mapping->host;
+ if (!page)
+ page = find_or_create_page(mapping, vmf->pgoff,
+ GFP_KERNEL | __GFP_ZERO);
+ if (!page)
+ return VM_FAULT_OOM;
+ /* Recheck i_size under page lock to avoid truncate race */
+ size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT;
+ if (vmf->pgoff >= size) {
+ unlock_page(page);
+ page_cache_release(page);
+ return VM_FAULT_SIGBUS;
+ }
+
+ vmf->page = page;
+ return VM_FAULT_LOCKED;
+}
+
+static int copy_user_bh(struct page *to, struct buffer_head *bh,
+ unsigned blkbits, unsigned long vaddr)
+{
+ void *vfrom, *vto;
+ if (dax_get_addr(bh, &vfrom, blkbits) < 0)
+ return -EIO;
+ vto = kmap_atomic(to);
+ copy_user_page(vto, vfrom, vaddr, to);
+ kunmap_atomic(vto);
+ return 0;
+}
+
+static int do_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
+ get_block_t get_block)
+{
+ struct file *file = vma->vm_file;
+ struct inode *inode = file_inode(file);
+ struct address_space *mapping = file->f_mapping;
+ struct page *page;
+ struct buffer_head bh;
+ unsigned long vaddr = (unsigned long)vmf->virtual_address;
+ unsigned blkbits = inode->i_blkbits;
+ sector_t block;
+ pgoff_t size;
+ unsigned long pfn;
+ int error;
+ int major = 0;
+
+ size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT;
+ if (vmf->pgoff >= size)
+ return VM_FAULT_SIGBUS;
+
+ memset(&bh, 0, sizeof(bh));
+ block = (sector_t)vmf->pgoff << (PAGE_SHIFT - blkbits);
+ bh.b_size = PAGE_SIZE;
+
+ repeat:
+ page = find_get_page(mapping, vmf->pgoff);
+ if (page) {
+ if (!lock_page_or_retry(page, vma->vm_mm, vmf->flags)) {
+ page_cache_release(page);
+ return VM_FAULT_RETRY;
+ }
+ if (unlikely(page->mapping != mapping)) {
+ unlock_page(page);
+ page_cache_release(page);
+ goto repeat;
+ }
+ } else {
+ mutex_lock(&mapping->i_mmap_mutex);
+ }
+
+ /* Guard against a race with truncate */
+ size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT;
+ if (unlikely(vmf->pgoff >= size))
+ goto sigbus;
+
+ error = get_block(inode, block, &bh, 0);
+ if (error || bh.b_size < PAGE_SIZE)
+ goto sigbus;
+
+ if (!buffer_written(&bh) && !vmf->cow_page) {
+ if (vmf->flags & FAULT_FLAG_WRITE) {
+ error = get_block(inode, block, &bh, 1);
+ count_vm_event(PGMAJFAULT);
+ mem_cgroup_count_vm_event(vma->vm_mm, PGMAJFAULT);
+ major = VM_FAULT_MAJOR;
+ if (error || bh.b_size < PAGE_SIZE)
+ goto sigbus;
+ } else {
+ mutex_unlock(&mapping->i_mmap_mutex);
+ return dax_load_hole(mapping, page, vmf);
+ }
+ }
+
+ if (vmf->cow_page) {
+ struct page *new_page = vmf->cow_page;
+ if (buffer_written(&bh))
+ error = copy_user_bh(new_page, &bh, blkbits, vaddr);
+ else
+ clear_user_highpage(new_page, vaddr);
+ if (!error) {
+ __SetPageUptodate(new_page);
+ vm_insert_page(vma, vaddr, new_page);
+ }
+ if (page) {
+ unlock_page(page);
+ page_cache_release(page);
+ } else {
+ mutex_unlock(&mapping->i_mmap_mutex);
+ }
+ return error ? VM_FAULT_SIGBUS : VM_FAULT_NOPAGE;
+ }
+
+ if (buffer_unwritten(&bh) || buffer_new(&bh))
+ dax_clear_blocks(inode, bh.b_blocknr, bh.b_size);
+
+ error = dax_get_pfn(&bh, &pfn, blkbits);
+ if (error > 0)
+ error = vm_replace_mixed(vma, vaddr, pfn);
+
+ if (!page) {
+ mutex_unlock(&mapping->i_mmap_mutex);
+ /*
+ * We may have raced with another thread which has inserted
+ * a zero page at this address. Remove it now if we did.
+ */
+ page = find_lock_page(mapping, vmf->pgoff);
+ }
+
+ if (page) {
+ delete_from_page_cache(page);
+ unmap_mapping_range(mapping, vmf->pgoff << PAGE_SHIFT,
+ PAGE_CACHE_SIZE, 0);
+ unlock_page(page);
+ page_cache_release(page);
+ }
+
+ if (error == -ENOMEM)
+ return VM_FAULT_OOM;
+ /* -EBUSY is fine, somebody else faulted on the same PTE */
+ if (error != -EBUSY)
+ BUG_ON(error);
+ return VM_FAULT_NOPAGE | major;
+
+ sigbus:
+ if (page) {
+ unlock_page(page);
+ page_cache_release(page);
+ } else {
+ mutex_unlock(&mapping->i_mmap_mutex);
+ }
+ return VM_FAULT_SIGBUS;
+}
+
+/**
+ * dax_fault - handle a page fault on a DAX file
+ * @vma: The virtual memory area where the fault occurred
+ * @vmf: The description of the fault
+ * @get_block: The filesystem method used to translate file offsets to blocks
+ *
+ * When a page fault occurs, filesystems may call this helper in their
+ * fault handler for DAX files.
+ */
+int dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
+ get_block_t get_block)
+{
+ int result;
+ struct super_block *sb = file_inode(vma->vm_file)->i_sb;
+
+ if (vmf->flags & FAULT_FLAG_WRITE) {
+ sb_start_pagefault(sb);
+ file_update_time(vma->vm_file);
+ }
+ result = do_dax_fault(vma, vmf, get_block);
+ if (vmf->flags & FAULT_FLAG_WRITE)
+ sb_end_pagefault(sb);
+
+ return result;
+}
+EXPORT_SYMBOL_GPL(dax_fault);
+
+/**
+ * dax_mkwrite - convert a read-only page to read-write in a DAX file
+ * @vma: The virtual memory area where the fault occurred
+ * @vmf: The description of the fault
+ * @get_block: The filesystem method used to translate file offsets to blocks
+ *
+ * DAX handles reads of holes by adding pages full of zeroes into the
+ * mapping. If the page is subsequently written to, we have to allocate
+ * the page on media and free the page that was in the cache.
+ */
+int dax_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf,
+ get_block_t get_block)
+{
+ return dax_fault(vma, vmf, get_block);
+}
+EXPORT_SYMBOL_GPL(dax_mkwrite);
diff --git a/fs/ext2/file.c b/fs/ext2/file.c
index a247123..da8dc64 100644
--- a/fs/ext2/file.c
+++ b/fs/ext2/file.c
@@ -25,6 +25,37 @@
#include "xattr.h"
#include "acl.h"

+#ifdef CONFIG_EXT2_FS_XIP
+static int ext2_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
+{
+ return dax_fault(vma, vmf, ext2_get_block);
+}
+
+static int ext2_dax_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
+{
+ return dax_mkwrite(vma, vmf, ext2_get_block);
+}
+
+static const struct vm_operations_struct ext2_dax_vm_ops = {
+ .fault = ext2_dax_fault,
+ .page_mkwrite = ext2_dax_mkwrite,
+ .remap_pages = generic_file_remap_pages,
+};
+
+static int ext2_file_mmap(struct file *file, struct vm_area_struct *vma)
+{
+ if (!IS_DAX(file_inode(file)))
+ return generic_file_mmap(file, vma);
+
+ file_accessed(file);
+ vma->vm_ops = &ext2_dax_vm_ops;
+ vma->vm_flags |= VM_MIXEDMAP;
+ return 0;
+}
+#else
+#define ext2_file_mmap generic_file_mmap
+#endif
+
/*
* Called when filp is released. This happens when all file descriptors
* for a single struct file are closed. Note that different open() calls
@@ -70,7 +101,7 @@ const struct file_operations ext2_file_operations = {
#ifdef CONFIG_COMPAT
.compat_ioctl = ext2_compat_ioctl,
#endif
- .mmap = generic_file_mmap,
+ .mmap = ext2_file_mmap,
.open = dquot_file_open,
.release = ext2_release_file,
.fsync = ext2_fsync,
@@ -89,7 +120,7 @@ const struct file_operations ext2_xip_file_operations = {
#ifdef CONFIG_COMPAT
.compat_ioctl = ext2_compat_ioctl,
#endif
- .mmap = xip_file_mmap,
+ .mmap = ext2_file_mmap,
.open = dquot_file_open,
.release = ext2_release_file,
.fsync = ext2_fsync,
diff --git a/include/linux/fs.h b/include/linux/fs.h
index e38138b..d4259e1 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -49,6 +49,7 @@ struct swap_info_struct;
struct seq_file;
struct workqueue_struct;
struct iov_iter;
+struct vm_fault;

extern void __init inode_init(void);
extern void __init inode_init_early(void);
@@ -2464,10 +2465,11 @@ extern int nonseekable_open(struct inode * inode, struct file * filp);

#ifdef CONFIG_FS_XIP
int dax_clear_blocks(struct inode *, sector_t block, long size);
-extern int xip_file_mmap(struct file * file, struct vm_area_struct * vma);
extern int xip_truncate_page(struct address_space *mapping, loff_t from);
ssize_t dax_do_io(int rw, struct kiocb *, struct inode *, struct iov_iter *,
loff_t, get_block_t, dio_iodone_t, int flags);
+int dax_fault(struct vm_area_struct *, struct vm_fault *, get_block_t);
+int dax_mkwrite(struct vm_area_struct *, struct vm_fault *, get_block_t);
#else
static inline int dax_clear_blocks(struct inode *i, sector_t blk, long sz)
{
diff --git a/mm/filemap_xip.c b/mm/filemap_xip.c
index f7c37a1..9dd45f3 100644
--- a/mm/filemap_xip.c
+++ b/mm/filemap_xip.c
@@ -22,212 +22,6 @@
#include <asm/io.h>

/*
- * We do use our own empty page to avoid interference with other users
- * of ZERO_PAGE(), such as /dev/zero
- */
-static DEFINE_MUTEX(xip_sparse_mutex);
-static seqcount_t xip_sparse_seq = SEQCNT_ZERO(xip_sparse_seq);
-static struct page *__xip_sparse_page;
-
-/* called under xip_sparse_mutex */
-static struct page *xip_sparse_page(void)
-{
- if (!__xip_sparse_page) {
- struct page *page = alloc_page(GFP_HIGHUSER | __GFP_ZERO);
-
- if (page)
- __xip_sparse_page = page;
- }
- return __xip_sparse_page;
-}
-
-/*
- * __xip_unmap is invoked from xip_unmap and
- * xip_write
- *
- * This function walks all vmas of the address_space and unmaps the
- * __xip_sparse_page when found at pgoff.
- */
-static void
-__xip_unmap (struct address_space * mapping,
- unsigned long pgoff)
-{
- struct vm_area_struct *vma;
- struct mm_struct *mm;
- unsigned long address;
- pte_t *pte;
- pte_t pteval;
- spinlock_t *ptl;
- struct page *page;
- unsigned count;
- int locked = 0;
-
- count = read_seqcount_begin(&xip_sparse_seq);
-
- page = __xip_sparse_page;
- if (!page)
- return;
-
-retry:
- mutex_lock(&mapping->i_mmap_mutex);
- vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) {
- mm = vma->vm_mm;
- address = vma->vm_start +
- ((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
- BUG_ON(address < vma->vm_start || address >= vma->vm_end);
- pte = page_check_address(page, mm, address, &ptl, 1);
- if (pte) {
- /* Nuke the page table entry. */
- flush_cache_page(vma, address, pte_pfn(*pte));
- pteval = ptep_clear_flush(vma, address, pte);
- page_remove_rmap(page);
- dec_mm_counter(mm, MM_FILEPAGES);
- BUG_ON(pte_dirty(pteval));
- pte_unmap_unlock(pte, ptl);
- /* must invalidate_page _before_ freeing the page */
- mmu_notifier_invalidate_page(mm, address);
- page_cache_release(page);
- }
- }
- mutex_unlock(&mapping->i_mmap_mutex);
-
- if (locked) {
- mutex_unlock(&xip_sparse_mutex);
- } else if (read_seqcount_retry(&xip_sparse_seq, count)) {
- mutex_lock(&xip_sparse_mutex);
- locked = 1;
- goto retry;
- }
-}
-
-/*
- * xip_fault() is invoked via the vma operations vector for a
- * mapped memory region to read in file data during a page fault.
- *
- * This function is derived from filemap_fault, but used for execute in place
- */
-static int xip_file_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
-{
- struct file *file = vma->vm_file;
- struct address_space *mapping = file->f_mapping;
- struct inode *inode = mapping->host;
- pgoff_t size;
- void *xip_mem;
- unsigned long xip_pfn;
- struct page *page;
- int error;
-
- /* XXX: are VM_FAULT_ codes OK? */
-again:
- size = (i_size_read(inode) + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
- if (vmf->pgoff >= size)
- return VM_FAULT_SIGBUS;
-
- error = mapping->a_ops->get_xip_mem(mapping, vmf->pgoff, 0,
- &xip_mem, &xip_pfn);
- if (likely(!error))
- goto found;
- if (error != -ENODATA)
- return VM_FAULT_OOM;
-
- /* sparse block */
- if ((vma->vm_flags & (VM_WRITE | VM_MAYWRITE)) &&
- (vma->vm_flags & (VM_SHARED | VM_MAYSHARE)) &&
- (!(mapping->host->i_sb->s_flags & MS_RDONLY))) {
- int err;
-
- /* maybe shared writable, allocate new block */
- mutex_lock(&xip_sparse_mutex);
- error = mapping->a_ops->get_xip_mem(mapping, vmf->pgoff, 1,
- &xip_mem, &xip_pfn);
- mutex_unlock(&xip_sparse_mutex);
- if (error)
- return VM_FAULT_SIGBUS;
- /* unmap sparse mappings at pgoff from all other vmas */
- __xip_unmap(mapping, vmf->pgoff);
-
-found:
- /* We must recheck i_size under i_mmap_mutex */
- mutex_lock(&mapping->i_mmap_mutex);
- size = (i_size_read(inode) + PAGE_CACHE_SIZE - 1) >>
- PAGE_CACHE_SHIFT;
- if (unlikely(vmf->pgoff >= size)) {
- mutex_unlock(&mapping->i_mmap_mutex);
- return VM_FAULT_SIGBUS;
- }
- err = vm_insert_mixed(vma, (unsigned long)vmf->virtual_address,
- xip_pfn);
- mutex_unlock(&mapping->i_mmap_mutex);
- if (err == -ENOMEM)
- return VM_FAULT_OOM;
- /*
- * err == -EBUSY is fine, we've raced against another thread
- * that faulted-in the same page
- */
- if (err != -EBUSY)
- BUG_ON(err);
- return VM_FAULT_NOPAGE;
- } else {
- int err, ret = VM_FAULT_OOM;
-
- mutex_lock(&xip_sparse_mutex);
- write_seqcount_begin(&xip_sparse_seq);
- error = mapping->a_ops->get_xip_mem(mapping, vmf->pgoff, 0,
- &xip_mem, &xip_pfn);
- if (unlikely(!error)) {
- write_seqcount_end(&xip_sparse_seq);
- mutex_unlock(&xip_sparse_mutex);
- goto again;
- }
- if (error != -ENODATA)
- goto out;
-
- /* We must recheck i_size under i_mmap_mutex */
- mutex_lock(&mapping->i_mmap_mutex);
- size = (i_size_read(inode) + PAGE_CACHE_SIZE - 1) >>
- PAGE_CACHE_SHIFT;
- if (unlikely(vmf->pgoff >= size)) {
- ret = VM_FAULT_SIGBUS;
- goto unlock;
- }
- /* not shared and writable, use xip_sparse_page() */
- page = xip_sparse_page();
- if (!page)
- goto unlock;
- err = vm_insert_page(vma, (unsigned long)vmf->virtual_address,
- page);
- if (err == -ENOMEM)
- goto unlock;
-
- ret = VM_FAULT_NOPAGE;
-unlock:
- mutex_unlock(&mapping->i_mmap_mutex);
-out:
- write_seqcount_end(&xip_sparse_seq);
- mutex_unlock(&xip_sparse_mutex);
-
- return ret;
- }
-}
-
-static const struct vm_operations_struct xip_file_vm_ops = {
- .fault = xip_file_fault,
- .page_mkwrite = filemap_page_mkwrite,
- .remap_pages = generic_file_remap_pages,
-};
-
-int xip_file_mmap(struct file * file, struct vm_area_struct * vma)
-{
- BUG_ON(!file->f_mapping->a_ops->get_xip_mem);
-
- file_accessed(file);
- vma->vm_ops = &xip_file_vm_ops;
- vma->vm_flags |= VM_MIXEDMAP;
- return 0;
-}
-EXPORT_SYMBOL_GPL(xip_file_mmap);
-
-/*
* truncate a page used for execute in place
* functionality is analog to block_truncate_page but does use get_xip_mem
* to get the page instead of page cache
--
2.0.0

2014-07-22 19:54:31

by Matthew Wilcox

Subject: [PATCH v8 19/22] Get rid of most mentions of XIP in ext2

To help people transition, accept the 'xip' mount option (and report it
in /proc/mounts), but print a message encouraging people to switch over
to the 'dax' option.
---
fs/ext2/ext2.h | 13 +++++++------
fs/ext2/file.c | 2 +-
fs/ext2/inode.c | 6 +++---
fs/ext2/namei.c | 8 ++++----
fs/ext2/super.c | 25 ++++++++++++++++---------
5 files changed, 31 insertions(+), 23 deletions(-)

diff --git a/fs/ext2/ext2.h b/fs/ext2/ext2.h
index b8b1c11..46133a0 100644
--- a/fs/ext2/ext2.h
+++ b/fs/ext2/ext2.h
@@ -380,14 +380,15 @@ struct ext2_inode {
#define EXT2_MOUNT_NO_UID32 0x000200 /* Disable 32-bit UIDs */
#define EXT2_MOUNT_XATTR_USER 0x004000 /* Extended user attributes */
#define EXT2_MOUNT_POSIX_ACL 0x008000 /* POSIX Access Control Lists */
-#ifdef CONFIG_FS_DAX
-#define EXT2_MOUNT_XIP 0x010000 /* Execute in place */
-#else
-#define EXT2_MOUNT_XIP 0
-#endif
+#define EXT2_MOUNT_XIP 0x010000 /* Obsolete, use DAX */
#define EXT2_MOUNT_USRQUOTA 0x020000 /* user quota */
#define EXT2_MOUNT_GRPQUOTA 0x040000 /* group quota */
#define EXT2_MOUNT_RESERVATION 0x080000 /* Preallocation */
+#ifdef CONFIG_FS_DAX
+#define EXT2_MOUNT_DAX 0x100000 /* Direct Access */
+#else
+#define EXT2_MOUNT_DAX 0
+#endif


#define clear_opt(o, opt) o &= ~EXT2_MOUNT_##opt
@@ -789,7 +790,7 @@ extern int ext2_fsync(struct file *file, loff_t start, loff_t end,
int datasync);
extern const struct inode_operations ext2_file_inode_operations;
extern const struct file_operations ext2_file_operations;
-extern const struct file_operations ext2_xip_file_operations;
+extern const struct file_operations ext2_dax_file_operations;

/* inode.c */
extern const struct address_space_operations ext2_aops;
diff --git a/fs/ext2/file.c b/fs/ext2/file.c
index 46b333d..5b8cab5 100644
--- a/fs/ext2/file.c
+++ b/fs/ext2/file.c
@@ -110,7 +110,7 @@ const struct file_operations ext2_file_operations = {
};

#ifdef CONFIG_FS_DAX
-const struct file_operations ext2_xip_file_operations = {
+const struct file_operations ext2_dax_file_operations = {
.llseek = generic_file_llseek,
.read = new_sync_read,
.write = new_sync_write,
diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
index 034fd42..6434bc0 100644
--- a/fs/ext2/inode.c
+++ b/fs/ext2/inode.c
@@ -1286,7 +1286,7 @@ void ext2_set_inode_flags(struct inode *inode)
inode->i_flags |= S_NOATIME;
if (flags & EXT2_DIRSYNC_FL)
inode->i_flags |= S_DIRSYNC;
- if (test_opt(inode->i_sb, XIP))
+ if (test_opt(inode->i_sb, DAX))
inode->i_flags |= S_DAX;
}

@@ -1388,9 +1388,9 @@ struct inode *ext2_iget (struct super_block *sb, unsigned long ino)

if (S_ISREG(inode->i_mode)) {
inode->i_op = &ext2_file_inode_operations;
- if (test_opt(inode->i_sb, XIP)) {
+ if (test_opt(inode->i_sb, DAX)) {
inode->i_mapping->a_ops = &ext2_aops;
- inode->i_fop = &ext2_xip_file_operations;
+ inode->i_fop = &ext2_dax_file_operations;
} else if (test_opt(inode->i_sb, NOBH)) {
inode->i_mapping->a_ops = &ext2_nobh_aops;
inode->i_fop = &ext2_file_operations;
diff --git a/fs/ext2/namei.c b/fs/ext2/namei.c
index 0db888c..148f6e3 100644
--- a/fs/ext2/namei.c
+++ b/fs/ext2/namei.c
@@ -104,9 +104,9 @@ static int ext2_create (struct inode * dir, struct dentry * dentry, umode_t mode
return PTR_ERR(inode);

inode->i_op = &ext2_file_inode_operations;
- if (test_opt(inode->i_sb, XIP)) {
+ if (test_opt(inode->i_sb, DAX)) {
inode->i_mapping->a_ops = &ext2_aops;
- inode->i_fop = &ext2_xip_file_operations;
+ inode->i_fop = &ext2_dax_file_operations;
} else if (test_opt(inode->i_sb, NOBH)) {
inode->i_mapping->a_ops = &ext2_nobh_aops;
inode->i_fop = &ext2_file_operations;
@@ -125,9 +125,9 @@ static int ext2_tmpfile(struct inode *dir, struct dentry *dentry, umode_t mode)
return PTR_ERR(inode);

inode->i_op = &ext2_file_inode_operations;
- if (test_opt(inode->i_sb, XIP)) {
+ if (test_opt(inode->i_sb, DAX)) {
inode->i_mapping->a_ops = &ext2_aops;
- inode->i_fop = &ext2_xip_file_operations;
+ inode->i_fop = &ext2_dax_file_operations;
} else if (test_opt(inode->i_sb, NOBH)) {
inode->i_mapping->a_ops = &ext2_nobh_aops;
inode->i_fop = &ext2_file_operations;
diff --git a/fs/ext2/super.c b/fs/ext2/super.c
index 2b58c48..29633627 100644
--- a/fs/ext2/super.c
+++ b/fs/ext2/super.c
@@ -290,6 +290,8 @@ static int ext2_show_options(struct seq_file *seq, struct dentry *root)
#ifdef CONFIG_FS_DAX
if (sbi->s_mount_opt & EXT2_MOUNT_XIP)
seq_puts(seq, ",xip");
+ if (sbi->s_mount_opt & EXT2_MOUNT_DAX)
+ seq_puts(seq, ",dax");
#endif

if (!test_opt(sb, RESERVATION))
@@ -393,7 +395,7 @@ enum {
Opt_resgid, Opt_resuid, Opt_sb, Opt_err_cont, Opt_err_panic,
Opt_err_ro, Opt_nouid32, Opt_nocheck, Opt_debug,
Opt_oldalloc, Opt_orlov, Opt_nobh, Opt_user_xattr, Opt_nouser_xattr,
- Opt_acl, Opt_noacl, Opt_xip, Opt_ignore, Opt_err, Opt_quota,
+ Opt_acl, Opt_noacl, Opt_xip, Opt_dax, Opt_ignore, Opt_err, Opt_quota,
Opt_usrquota, Opt_grpquota, Opt_reservation, Opt_noreservation
};

@@ -422,6 +424,7 @@ static const match_table_t tokens = {
{Opt_acl, "acl"},
{Opt_noacl, "noacl"},
{Opt_xip, "xip"},
+ {Opt_dax, "dax"},
{Opt_grpquota, "grpquota"},
{Opt_ignore, "noquota"},
{Opt_quota, "quota"},
@@ -549,10 +552,14 @@ static int parse_options(char *options, struct super_block *sb)
break;
#endif
case Opt_xip:
+ ext2_msg(sb, KERN_INFO, "use dax instead of xip");
+ set_opt(sbi->s_mount_opt, XIP);
+ /* Fall through */
+ case Opt_dax:
#ifdef CONFIG_FS_DAX
- set_opt (sbi->s_mount_opt, XIP);
+ set_opt(sbi->s_mount_opt, DAX);
#else
- ext2_msg(sb, KERN_INFO, "xip option not supported");
+ ext2_msg(sb, KERN_INFO, "dax option not supported");
#endif
break;

@@ -896,15 +903,15 @@ static int ext2_fill_super(struct super_block *sb, void *data, int silent)

blocksize = BLOCK_SIZE << le32_to_cpu(sbi->s_es->s_log_block_size);

- if (sbi->s_mount_opt & EXT2_MOUNT_XIP) {
+ if (sbi->s_mount_opt & EXT2_MOUNT_DAX) {
if (blocksize != PAGE_SIZE) {
ext2_msg(sb, KERN_ERR,
- "error: unsupported blocksize for xip");
+ "error: unsupported blocksize for dax");
goto failed_mount;
}
if (!sb->s_bdev->bd_disk->fops->direct_access) {
ext2_msg(sb, KERN_ERR,
- "error: device does not support xip");
+ "error: device does not support dax");
goto failed_mount;
}
}
@@ -1276,10 +1283,10 @@ static int ext2_remount (struct super_block * sb, int * flags, char * data)
((sbi->s_mount_opt & EXT2_MOUNT_POSIX_ACL) ? MS_POSIXACL : 0);

es = sbi->s_es;
- if ((sbi->s_mount_opt ^ old_opts.s_mount_opt) & EXT2_MOUNT_XIP) {
+ if ((sbi->s_mount_opt ^ old_opts.s_mount_opt) & EXT2_MOUNT_DAX) {
ext2_msg(sb, KERN_WARNING, "warning: refusing change of "
- "xip flag with busy inodes while remounting");
- sbi->s_mount_opt ^= EXT2_MOUNT_XIP;
+ "dax flag with busy inodes while remounting");
+ sbi->s_mount_opt ^= EXT2_MOUNT_DAX;
}
if ((*flags & MS_RDONLY) == (sb->s_flags & MS_RDONLY)) {
spin_unlock(&sbi->s_lock);
--
2.0.0

2014-07-22 19:54:53

by Matthew Wilcox

Subject: [PATCH v8 17/22] Remove CONFIG_EXT2_FS_XIP and rename CONFIG_FS_XIP to CONFIG_FS_DAX

The fewer Kconfig options we have, the better. Use the generic
CONFIG_FS_DAX to enable XIP support in ext2 as well as in the core.

Signed-off-by: Matthew Wilcox <[email protected]>
---
fs/Kconfig | 21 ++++++++++++++-------
fs/Makefile | 2 +-
fs/ext2/Kconfig | 11 -----------
fs/ext2/ext2.h | 2 +-
fs/ext2/file.c | 4 ++--
fs/ext2/super.c | 4 ++--
include/linux/fs.h | 4 ++--
7 files changed, 22 insertions(+), 26 deletions(-)

diff --git a/fs/Kconfig b/fs/Kconfig
index 312393f..a9eb53d 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -13,13 +13,6 @@ if BLOCK
source "fs/ext2/Kconfig"
source "fs/ext3/Kconfig"
source "fs/ext4/Kconfig"
-
-config FS_XIP
-# execute in place
- bool
- depends on EXT2_FS_XIP
- default y
-
source "fs/jbd/Kconfig"
source "fs/jbd2/Kconfig"

@@ -40,6 +33,20 @@ source "fs/ocfs2/Kconfig"
source "fs/btrfs/Kconfig"
source "fs/nilfs2/Kconfig"

+config FS_DAX
+ bool "Direct Access support"
+ depends on MMU
+ help
+ Direct Access (DAX) can be used on memory-backed block devices.
+ If the block device supports DAX and the filesystem supports DAX,
+ then you can avoid using the pagecache to buffer I/Os. Turning
+ on this option will compile in support for DAX; you will need to
+ mount the filesystem using the -o xip option.
+
+ If you do not have a block device that is capable of using this,
+ or if unsure, say N. Saying Y will increase the size of the kernel
+ by about 2kB.
+
endif # BLOCK

# Posix ACL utility routines
diff --git a/fs/Makefile b/fs/Makefile
index 03f65091..11d8e1b 100644
--- a/fs/Makefile
+++ b/fs/Makefile
@@ -28,7 +28,7 @@ obj-$(CONFIG_SIGNALFD) += signalfd.o
obj-$(CONFIG_TIMERFD) += timerfd.o
obj-$(CONFIG_EVENTFD) += eventfd.o
obj-$(CONFIG_AIO) += aio.o
-obj-$(CONFIG_FS_XIP) += dax.o
+obj-$(CONFIG_FS_DAX) += dax.o
obj-$(CONFIG_FILE_LOCKING) += locks.o
obj-$(CONFIG_COMPAT) += compat.o compat_ioctl.o
obj-$(CONFIG_BINFMT_AOUT) += binfmt_aout.o
diff --git a/fs/ext2/Kconfig b/fs/ext2/Kconfig
index 14a6780..c634874e 100644
--- a/fs/ext2/Kconfig
+++ b/fs/ext2/Kconfig
@@ -42,14 +42,3 @@ config EXT2_FS_SECURITY

If you are not using a security module that requires using
extended attributes for file security labels, say N.
-
-config EXT2_FS_XIP
- bool "Ext2 execute in place support"
- depends on EXT2_FS && MMU
- help
- Execute in place can be used on memory-backed block devices. If you
- enable this option, you can select to mount block devices which are
- capable of this feature without using the page cache.
-
- If you do not use a block device that is capable of using this,
- or if unsure, say N.
diff --git a/fs/ext2/ext2.h b/fs/ext2/ext2.h
index 5ecf570..b30c3bd 100644
--- a/fs/ext2/ext2.h
+++ b/fs/ext2/ext2.h
@@ -380,7 +380,7 @@ struct ext2_inode {
#define EXT2_MOUNT_NO_UID32 0x000200 /* Disable 32-bit UIDs */
#define EXT2_MOUNT_XATTR_USER 0x004000 /* Extended user attributes */
#define EXT2_MOUNT_POSIX_ACL 0x008000 /* POSIX Access Control Lists */
-#ifdef CONFIG_FS_XIP
+#ifdef CONFIG_FS_DAX
#define EXT2_MOUNT_XIP 0x010000 /* Execute in place */
#else
#define EXT2_MOUNT_XIP 0
diff --git a/fs/ext2/file.c b/fs/ext2/file.c
index da8dc64..46b333d 100644
--- a/fs/ext2/file.c
+++ b/fs/ext2/file.c
@@ -25,7 +25,7 @@
#include "xattr.h"
#include "acl.h"

-#ifdef CONFIG_EXT2_FS_XIP
+#ifdef CONFIG_FS_DAX
static int ext2_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
{
return dax_fault(vma, vmf, ext2_get_block);
@@ -109,7 +109,7 @@ const struct file_operations ext2_file_operations = {
.splice_write = iter_file_splice_write,
};

-#ifdef CONFIG_EXT2_FS_XIP
+#ifdef CONFIG_FS_DAX
const struct file_operations ext2_xip_file_operations = {
.llseek = generic_file_llseek,
.read = new_sync_read,
diff --git a/fs/ext2/super.c b/fs/ext2/super.c
index 747e293..2b58c48 100644
--- a/fs/ext2/super.c
+++ b/fs/ext2/super.c
@@ -287,7 +287,7 @@ static int ext2_show_options(struct seq_file *seq, struct dentry *root)
seq_puts(seq, ",grpquota");
#endif

-#if defined(CONFIG_EXT2_FS_XIP)
+#ifdef CONFIG_FS_DAX
if (sbi->s_mount_opt & EXT2_MOUNT_XIP)
seq_puts(seq, ",xip");
#endif
@@ -549,7 +549,7 @@ static int parse_options(char *options, struct super_block *sb)
break;
#endif
case Opt_xip:
-#ifdef CONFIG_EXT2_FS_XIP
+#ifdef CONFIG_FS_DAX
set_opt (sbi->s_mount_opt, XIP);
#else
ext2_msg(sb, KERN_INFO, "xip option not supported");
diff --git a/include/linux/fs.h b/include/linux/fs.h
index fc6f6bb..a75a045 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1616,7 +1616,7 @@ struct super_operations {
#define IS_IMA(inode) ((inode)->i_flags & S_IMA)
#define IS_AUTOMOUNT(inode) ((inode)->i_flags & S_AUTOMOUNT)
#define IS_NOSEC(inode) ((inode)->i_flags & S_NOSEC)
-#ifdef CONFIG_FS_XIP
+#ifdef CONFIG_FS_DAX
#define IS_DAX(inode) ((inode)->i_flags & S_DAX)
#else
#define IS_DAX(inode) 0
@@ -2461,7 +2461,7 @@ extern loff_t fixed_size_llseek(struct file *file, loff_t offset,
extern int generic_file_open(struct inode * inode, struct file * filp);
extern int nonseekable_open(struct inode * inode, struct file * filp);

-#ifdef CONFIG_FS_XIP
+#ifdef CONFIG_FS_DAX
int dax_clear_blocks(struct inode *, sector_t block, long size);
int dax_truncate_page(struct inode *, loff_t from, get_block_t);
ssize_t dax_do_io(int rw, struct kiocb *, struct inode *, struct iov_iter *,
--
2.0.0

2014-07-22 19:53:13

by Matthew Wilcox

Subject: [PATCH v8 04/22] Change direct_access calling convention

In order to support accesses to larger chunks of memory, pass in a
'size' parameter (counted in bytes), and return the amount available at
that address.

Support partitioning the underlying block device through a new helper
function, bdev_direct_access(), since partition handling should be done
in the block layer, not in the filesystem or the device driver.

Signed-off-by: Matthew Wilcox <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
---
Documentation/filesystems/xip.txt | 15 +++++++++------
arch/powerpc/sysdev/axonram.c | 12 ++++--------
drivers/block/brd.c | 8 +++++---
drivers/s390/block/dcssblk.c | 19 ++++++++++---------
fs/block_dev.c | 28 ++++++++++++++++++++++++++++
fs/ext2/xip.c | 31 +++++++++++++------------------
include/linux/blkdev.h | 6 ++++--
7 files changed, 73 insertions(+), 46 deletions(-)

diff --git a/Documentation/filesystems/xip.txt b/Documentation/filesystems/xip.txt
index 0466ee5..b62eabf 100644
--- a/Documentation/filesystems/xip.txt
+++ b/Documentation/filesystems/xip.txt
@@ -28,12 +28,15 @@ Implementation
Execute-in-place is implemented in three steps: block device operation,
address space operation, and file operations.

-A block device operation named direct_access is used to retrieve a
-reference (pointer) to a block on-disk. The reference is supposed to be
-cpu-addressable, physical address and remain valid until the release operation
-is performed. A struct block_device reference is used to address the device,
-and a sector_t argument is used to identify the individual block. As an
-alternative, memory technology devices can be used for this.
+A block device operation named direct_access is used to translate the
+block device sector number to a page frame number (pfn) that identifies
+the physical page for the memory. It also returns a kernel virtual
+address that can be used to access the memory.
+
+The direct_access method takes a 'size' parameter that indicates the
+number of bytes being requested. The function should return the number
+of bytes that it can provide, although it must not exceed the number of
+bytes requested. It may also return a negative errno if an error occurs.

The block device operation is optional, these block devices support it as of
today:
diff --git a/arch/powerpc/sysdev/axonram.c b/arch/powerpc/sysdev/axonram.c
index 830edc8..3ee1c08 100644
--- a/arch/powerpc/sysdev/axonram.c
+++ b/arch/powerpc/sysdev/axonram.c
@@ -139,17 +139,13 @@ axon_ram_make_request(struct request_queue *queue, struct bio *bio)
* axon_ram_direct_access - direct_access() method for block device
* @device, @sector, @data: see block_device_operations method
*/
-static int
+static long
axon_ram_direct_access(struct block_device *device, sector_t sector,
- void **kaddr, unsigned long *pfn)
+ void **kaddr, unsigned long *pfn, long size)
{
struct axon_ram_bank *bank = device->bd_disk->private_data;
- loff_t offset;
+ loff_t offset = (loff_t)sector << AXON_RAM_SECTOR_SHIFT;

- offset = sector;
- if (device->bd_part != NULL)
- offset += device->bd_part->start_sect;
- offset <<= AXON_RAM_SECTOR_SHIFT;
if (offset >= bank->size) {
dev_err(&bank->device->dev, "Access outside of address space\n");
return -ERANGE;
@@ -158,7 +154,7 @@ axon_ram_direct_access(struct block_device *device, sector_t sector,
*kaddr = (void *)(bank->ph_addr + offset);
*pfn = virt_to_phys(*kaddr) >> PAGE_SHIFT;

- return 0;
+ return min_t(long, size, bank->size - offset);
}

static const struct block_device_operations axon_ram_devops = {
diff --git a/drivers/block/brd.c b/drivers/block/brd.c
index c7d138e..96e4c96 100644
--- a/drivers/block/brd.c
+++ b/drivers/block/brd.c
@@ -370,8 +370,8 @@ static int brd_rw_page(struct block_device *bdev, sector_t sector,
}

#ifdef CONFIG_BLK_DEV_XIP
-static int brd_direct_access(struct block_device *bdev, sector_t sector,
- void **kaddr, unsigned long *pfn)
+static long brd_direct_access(struct block_device *bdev, sector_t sector,
+ void **kaddr, unsigned long *pfn, long size)
{
struct brd_device *brd = bdev->bd_disk->private_data;
struct page *page;
@@ -388,7 +388,9 @@ static int brd_direct_access(struct block_device *bdev, sector_t sector,
*kaddr = page_address(page);
*pfn = page_to_pfn(page);

- return 0;
+ /* Could optimistically check to see if the next page in the
+ * file is mapped to the next page of physical RAM */
+ return min_t(long, PAGE_SIZE, size);
}
#endif

diff --git a/drivers/s390/block/dcssblk.c b/drivers/s390/block/dcssblk.c
index 0f47175..58958d1 100644
--- a/drivers/s390/block/dcssblk.c
+++ b/drivers/s390/block/dcssblk.c
@@ -28,8 +28,8 @@
static int dcssblk_open(struct block_device *bdev, fmode_t mode);
static void dcssblk_release(struct gendisk *disk, fmode_t mode);
static void dcssblk_make_request(struct request_queue *q, struct bio *bio);
-static int dcssblk_direct_access(struct block_device *bdev, sector_t secnum,
- void **kaddr, unsigned long *pfn);
+static long dcssblk_direct_access(struct block_device *bdev, sector_t secnum,
+ void **kaddr, unsigned long *pfn, long size);

static char dcssblk_segments[DCSSBLK_PARM_LEN] = "\0";

@@ -866,25 +866,26 @@ fail:
bio_io_error(bio);
}

-static int
+static long
dcssblk_direct_access (struct block_device *bdev, sector_t secnum,
- void **kaddr, unsigned long *pfn)
+ void **kaddr, unsigned long *pfn, long size)
{
struct dcssblk_dev_info *dev_info;
- unsigned long pgoff;
+ unsigned long offset, dev_sz;

dev_info = bdev->bd_disk->private_data;
if (!dev_info)
return -ENODEV;
+ dev_sz = dev_info->end - dev_info->start;
if (secnum % (PAGE_SIZE/512))
return -EINVAL;
- pgoff = secnum / (PAGE_SIZE / 512);
- if ((pgoff+1)*PAGE_SIZE-1 > dev_info->end - dev_info->start)
+ offset = secnum * 512;
+ if (offset > dev_sz)
return -ERANGE;
- *kaddr = (void *) (dev_info->start+pgoff*PAGE_SIZE);
+ *kaddr = (void *) (dev_info->start + offset);
*pfn = virt_to_phys(*kaddr) >> PAGE_SHIFT;

- return 0;
+ return min_t(long, size, dev_sz - offset);
}

static void
diff --git a/fs/block_dev.c b/fs/block_dev.c
index 6d72746..f1a158e 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -427,6 +427,34 @@ int bdev_write_page(struct block_device *bdev, sector_t sector,
}
EXPORT_SYMBOL_GPL(bdev_write_page);

+/**
+ * bdev_direct_access() - Get the address for directly-accessible memory
+ * @bdev: The device containing the memory
+ * @sector: The offset within the device
+ * @addr: Where to put the address of the memory
+ * @pfn: The Page Frame Number for the memory
+ * @size: The number of bytes requested
+ *
+ * If a block device is made up of directly addressable memory, this function
+ * will tell the caller the PFN and the address of the memory. The address
+ * may be directly dereferenced within the kernel without the need to call
+ * ioremap(), kmap() or similar. The PFN is suitable for inserting into
+ * page tables.
+ *
+ * Return: negative errno if an error occurs, otherwise the number of bytes
+ * accessible at this address.
+ */
+long bdev_direct_access(struct block_device *bdev, sector_t sector,
+ void **addr, unsigned long *pfn, long size)
+{
+ const struct block_device_operations *ops = bdev->bd_disk->fops;
+ if (!ops->direct_access)
+ return -EOPNOTSUPP;
+ return ops->direct_access(bdev, sector + get_start_sect(bdev), addr,
+ pfn, size);
+}
+EXPORT_SYMBOL_GPL(bdev_direct_access);
+
/*
* pseudo-fs
*/
diff --git a/fs/ext2/xip.c b/fs/ext2/xip.c
index e98171a..bbc5fec 100644
--- a/fs/ext2/xip.c
+++ b/fs/ext2/xip.c
@@ -13,18 +13,12 @@
#include "ext2.h"
#include "xip.h"

-static inline int
-__inode_direct_access(struct inode *inode, sector_t block,
- void **kaddr, unsigned long *pfn)
+static inline long __inode_direct_access(struct inode *inode, sector_t block,
+ void **kaddr, unsigned long *pfn, long size)
{
struct block_device *bdev = inode->i_sb->s_bdev;
- const struct block_device_operations *ops = bdev->bd_disk->fops;
- sector_t sector;
-
- sector = block * (PAGE_SIZE / 512); /* ext2 block to bdev sector */
-
- BUG_ON(!ops->direct_access);
- return ops->direct_access(bdev, sector, kaddr, pfn);
+ sector_t sector = block * (PAGE_SIZE / 512);
+ return bdev_direct_access(bdev, sector, kaddr, pfn, size);
}

static inline int
@@ -53,12 +47,13 @@ ext2_clear_xip_target(struct inode *inode, sector_t block)
{
void *kaddr;
unsigned long pfn;
- int rc;
+ long size;

- rc = __inode_direct_access(inode, block, &kaddr, &pfn);
- if (!rc)
- clear_page(kaddr);
- return rc;
+ size = __inode_direct_access(inode, block, &kaddr, &pfn, PAGE_SIZE);
+ if (size < 0)
+ return size;
+ clear_page(kaddr);
+ return 0;
}

void ext2_xip_verify_sb(struct super_block *sb)
@@ -77,7 +72,7 @@ void ext2_xip_verify_sb(struct super_block *sb)
int ext2_get_xip_mem(struct address_space *mapping, pgoff_t pgoff, int create,
void **kmem, unsigned long *pfn)
{
- int rc;
+ long rc;
sector_t block;

/* first, retrieve the sector number */
@@ -86,6 +81,6 @@ int ext2_get_xip_mem(struct address_space *mapping, pgoff_t pgoff, int create,
return rc;

/* retrieve address of the target data */
- rc = __inode_direct_access(mapping->host, block, kmem, pfn);
- return rc;
+ rc = __inode_direct_access(mapping->host, block, kmem, pfn, PAGE_SIZE);
+ return (rc < 0) ? rc : 0;
}
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 8699bcf..bc5ea9e 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1613,8 +1613,8 @@ struct block_device_operations {
int (*rw_page)(struct block_device *, sector_t, struct page *, int rw);
int (*ioctl) (struct block_device *, fmode_t, unsigned, unsigned long);
int (*compat_ioctl) (struct block_device *, fmode_t, unsigned, unsigned long);
- int (*direct_access) (struct block_device *, sector_t,
- void **, unsigned long *);
+ long (*direct_access) (struct block_device *, sector_t,
+ void **, unsigned long *pfn, long size);
unsigned int (*check_events) (struct gendisk *disk,
unsigned int clearing);
/* ->media_changed() is DEPRECATED, use ->check_events() instead */
@@ -1632,6 +1632,8 @@ extern int __blkdev_driver_ioctl(struct block_device *, fmode_t, unsigned int,
extern int bdev_read_page(struct block_device *, sector_t, struct page *);
extern int bdev_write_page(struct block_device *, sector_t, struct page *,
struct writeback_control *);
+extern long bdev_direct_access(struct block_device *, sector_t, void **addr,
+ unsigned long *pfn, long size);
#else /* CONFIG_BLOCK */

struct block_device;
--
2.0.0

2014-07-22 19:55:20

by Matthew Wilcox

Subject: [PATCH v8 16/22] ext2: Remove xip.c and xip.h

These files are now empty, so delete them.

Signed-off-by: Matthew Wilcox <[email protected]>
---
fs/ext2/Makefile | 1 -
fs/ext2/inode.c | 1 -
fs/ext2/namei.c | 1 -
fs/ext2/super.c | 1 -
fs/ext2/xip.c | 15 ---------------
fs/ext2/xip.h | 16 ----------------
6 files changed, 35 deletions(-)
delete mode 100644 fs/ext2/xip.c
delete mode 100644 fs/ext2/xip.h

diff --git a/fs/ext2/Makefile b/fs/ext2/Makefile
index f42af45..445b0e9 100644
--- a/fs/ext2/Makefile
+++ b/fs/ext2/Makefile
@@ -10,4 +10,3 @@ ext2-y := balloc.o dir.o file.o ialloc.o inode.o \
ext2-$(CONFIG_EXT2_FS_XATTR) += xattr.o xattr_user.o xattr_trusted.o
ext2-$(CONFIG_EXT2_FS_POSIX_ACL) += acl.o
ext2-$(CONFIG_EXT2_FS_SECURITY) += xattr_security.o
-ext2-$(CONFIG_EXT2_FS_XIP) += xip.o
diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
index cba3833..154cbcf 100644
--- a/fs/ext2/inode.c
+++ b/fs/ext2/inode.c
@@ -34,7 +34,6 @@
#include <linux/aio.h>
#include "ext2.h"
#include "acl.h"
-#include "xip.h"
#include "xattr.h"

static int __ext2_write_inode(struct inode *inode, int do_sync);
diff --git a/fs/ext2/namei.c b/fs/ext2/namei.c
index 846c356..7ca803f 100644
--- a/fs/ext2/namei.c
+++ b/fs/ext2/namei.c
@@ -35,7 +35,6 @@
#include "ext2.h"
#include "xattr.h"
#include "acl.h"
-#include "xip.h"

static inline int ext2_add_nondir(struct dentry *dentry, struct inode *inode)
{
diff --git a/fs/ext2/super.c b/fs/ext2/super.c
index 3ac6555..747e293 100644
--- a/fs/ext2/super.c
+++ b/fs/ext2/super.c
@@ -35,7 +35,6 @@
#include "ext2.h"
#include "xattr.h"
#include "acl.h"
-#include "xip.h"

static void ext2_sync_super(struct super_block *sb,
struct ext2_super_block *es, int wait);
diff --git a/fs/ext2/xip.c b/fs/ext2/xip.c
deleted file mode 100644
index 66ca113..0000000
--- a/fs/ext2/xip.c
+++ /dev/null
@@ -1,15 +0,0 @@
-/*
- * linux/fs/ext2/xip.c
- *
- * Copyright (C) 2005 IBM Corporation
- * Author: Carsten Otte ([email protected])
- */
-
-#include <linux/mm.h>
-#include <linux/fs.h>
-#include <linux/genhd.h>
-#include <linux/buffer_head.h>
-#include <linux/blkdev.h>
-#include "ext2.h"
-#include "xip.h"
-
diff --git a/fs/ext2/xip.h b/fs/ext2/xip.h
deleted file mode 100644
index 87eeb04..0000000
--- a/fs/ext2/xip.h
+++ /dev/null
@@ -1,16 +0,0 @@
-/*
- * linux/fs/ext2/xip.h
- *
- * Copyright (C) 2005 IBM Corporation
- * Author: Carsten Otte ([email protected])
- */
-
-#ifdef CONFIG_EXT2_FS_XIP
-static inline int ext2_use_xip (struct super_block *sb)
-{
- struct ext2_sb_info *sbi = EXT2_SB(sb);
- return (sbi->s_mount_opt & EXT2_MOUNT_XIP);
-}
-#else
-#define ext2_use_xip(sb) 0
-#endif
--
2.0.0

2014-07-23 09:10:47

by Jan Kara

Subject: Re: [PATCH v8 05/22] Add vm_replace_mixed()

On Tue 22-07-14 15:47:53, Matthew Wilcox wrote:
> From: Matthew Wilcox <[email protected]>
>
> vm_insert_mixed() will fail if there is already a valid PTE at that
> location. The DAX code would rather replace the previous value with
> the new PTE.
>
> Signed-off-by: Matthew Wilcox <[email protected]>
This looks good to me although I'm not an expert in this area. So just:
Acked-by: Jan Kara <[email protected]>

Honza

> ---
> include/linux/mm.h | 8 ++++++--
> mm/memory.c | 34 +++++++++++++++++++++-------------
> 2 files changed, 27 insertions(+), 15 deletions(-)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index e04f531..8d1194c 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1958,8 +1958,12 @@ int remap_pfn_range(struct vm_area_struct *, unsigned long addr,
> int vm_insert_page(struct vm_area_struct *, unsigned long addr, struct page *);
> int vm_insert_pfn(struct vm_area_struct *vma, unsigned long addr,
> unsigned long pfn);
> -int vm_insert_mixed(struct vm_area_struct *vma, unsigned long addr,
> - unsigned long pfn);
> +int __vm_insert_mixed(struct vm_area_struct *vma, unsigned long addr,
> + unsigned long pfn, bool replace);
> +#define vm_insert_mixed(vma, addr, pfn) \
> + __vm_insert_mixed(vma, addr, pfn, false)
> +#define vm_replace_mixed(vma, addr, pfn) \
> + __vm_insert_mixed(vma, addr, pfn, true)
> int vm_iomap_memory(struct vm_area_struct *vma, phys_addr_t start, unsigned long len);
>
>
> diff --git a/mm/memory.c b/mm/memory.c
> index 42bf429..cf06c97 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -1476,7 +1476,7 @@ pte_t *__get_locked_pte(struct mm_struct *mm, unsigned long addr,
> * pages reserved for the old functions anyway.
> */
> static int insert_page(struct vm_area_struct *vma, unsigned long addr,
> - struct page *page, pgprot_t prot)
> + struct page *page, pgprot_t prot, bool replace)
> {
> struct mm_struct *mm = vma->vm_mm;
> int retval;
> @@ -1492,8 +1492,12 @@ static int insert_page(struct vm_area_struct *vma, unsigned long addr,
> if (!pte)
> goto out;
> retval = -EBUSY;
> - if (!pte_none(*pte))
> - goto out_unlock;
> + if (!pte_none(*pte)) {
> + if (!replace)
> + goto out_unlock;
> + VM_BUG_ON(!mutex_is_locked(&vma->vm_file->f_mapping->i_mmap_mutex));
> + zap_page_range_single(vma, addr, PAGE_SIZE, NULL);
> + }
>
> /* Ok, finally just insert the thing.. */
> get_page(page);
> @@ -1549,12 +1553,12 @@ int vm_insert_page(struct vm_area_struct *vma, unsigned long addr,
> BUG_ON(vma->vm_flags & VM_PFNMAP);
> vma->vm_flags |= VM_MIXEDMAP;
> }
> - return insert_page(vma, addr, page, vma->vm_page_prot);
> + return insert_page(vma, addr, page, vma->vm_page_prot, false);
> }
> EXPORT_SYMBOL(vm_insert_page);
>
> static int insert_pfn(struct vm_area_struct *vma, unsigned long addr,
> - unsigned long pfn, pgprot_t prot)
> + unsigned long pfn, pgprot_t prot, bool replace)
> {
> struct mm_struct *mm = vma->vm_mm;
> int retval;
> @@ -1566,8 +1570,12 @@ static int insert_pfn(struct vm_area_struct *vma, unsigned long addr,
> if (!pte)
> goto out;
> retval = -EBUSY;
> - if (!pte_none(*pte))
> - goto out_unlock;
> + if (!pte_none(*pte)) {
> + if (!replace)
> + goto out_unlock;
> + VM_BUG_ON(!mutex_is_locked(&vma->vm_file->f_mapping->i_mmap_mutex));
> + zap_page_range_single(vma, addr, PAGE_SIZE, NULL);
> + }
>
> /* Ok, finally just insert the thing.. */
> entry = pte_mkspecial(pfn_pte(pfn, prot));
> @@ -1620,14 +1628,14 @@ int vm_insert_pfn(struct vm_area_struct *vma, unsigned long addr,
> if (track_pfn_insert(vma, &pgprot, pfn))
> return -EINVAL;
>
> - ret = insert_pfn(vma, addr, pfn, pgprot);
> + ret = insert_pfn(vma, addr, pfn, pgprot, false);
>
> return ret;
> }
> EXPORT_SYMBOL(vm_insert_pfn);
>
> -int vm_insert_mixed(struct vm_area_struct *vma, unsigned long addr,
> - unsigned long pfn)
> +int __vm_insert_mixed(struct vm_area_struct *vma, unsigned long addr,
> + unsigned long pfn, bool replace)
> {
> BUG_ON(!(vma->vm_flags & VM_MIXEDMAP));
>
> @@ -1645,11 +1653,11 @@ int vm_insert_mixed(struct vm_area_struct *vma, unsigned long addr,
> struct page *page;
>
> page = pfn_to_page(pfn);
> - return insert_page(vma, addr, page, vma->vm_page_prot);
> + return insert_page(vma, addr, page, vma->vm_page_prot, replace);
> }
> - return insert_pfn(vma, addr, pfn, vma->vm_page_prot);
> + return insert_pfn(vma, addr, pfn, vma->vm_page_prot, replace);
> }
> -EXPORT_SYMBOL(vm_insert_mixed);
> +EXPORT_SYMBOL(__vm_insert_mixed);
>
> /*
> * maps a range of physical memory into the requested pages. the old
> --
> 2.0.0
>
> --
--
Jan Kara <[email protected]>
SUSE Labs, CR

2014-07-23 11:22:10

by Kirill A. Shutemov

Subject: Re: [PATCH v8 01/22] Fix XIP fault vs truncate race

On Tue, Jul 22, 2014 at 03:47:49PM -0400, Matthew Wilcox wrote:
> Pagecache faults recheck i_size after taking the page lock to ensure that
> the fault didn't race against a truncate. We don't have a page to lock
> in the XIP case, so use the i_mmap_mutex instead. It is locked in the
> truncate path in unmap_mapping_range() after updating i_size. So while
> we hold it in the fault path, we are guaranteed that either i_size has
> already been updated in the truncate path, or that the truncate will
> subsequently call zap_page_range_single() and so remove the mapping we
> have just inserted.
>
> There is a window of time in which i_size has been reduced and the
> thread has a mapping to a page which will be removed from the file,
> but this is harmless as the page will not be allocated to a different
> purpose before the thread's access to it is revoked.
>
> Signed-off-by: Matthew Wilcox <[email protected]>
> Reviewed-by: Jan Kara <[email protected]>
> ---
> mm/filemap_xip.c | 24 ++++++++++++++++++++++--
> 1 file changed, 22 insertions(+), 2 deletions(-)
>
> diff --git a/mm/filemap_xip.c b/mm/filemap_xip.c
> index d8d9fe3..c8d23e9 100644
> --- a/mm/filemap_xip.c
> +++ b/mm/filemap_xip.c
> @@ -260,8 +260,17 @@ again:
> __xip_unmap(mapping, vmf->pgoff);
>
> found:
> + /* We must recheck i_size under i_mmap_mutex */
> + mutex_lock(&mapping->i_mmap_mutex);
> + size = (i_size_read(inode) + PAGE_CACHE_SIZE - 1) >>
> + PAGE_CACHE_SHIFT;

round_up() ?

Acked-by: Kirill A. Shutemov <[email protected]>

--
Kirill A. Shutemov

2014-07-23 11:23:58

by Kirill A. Shutemov

Subject: Re: [PATCH v8 02/22] Allow page fault handlers to perform the COW

On Tue, Jul 22, 2014 at 03:47:50PM -0400, Matthew Wilcox wrote:
> Currently COW of an XIP file is done by first bringing in a read-only
> mapping, then retrying the fault and copying the page. It is much more
> efficient to tell the fault handler that a COW is being attempted (by
> passing in the pre-allocated page in the vm_fault structure), and allow
> the handler to perform the COW operation itself.
>
> Signed-off-by: Matthew Wilcox <[email protected]>
> Reviewed-by: Jan Kara <[email protected]>

Acked-by: Kirill A. Shutemov <[email protected]>

--
Kirill A. Shutemov

2014-07-23 11:25:19

by Kirill A. Shutemov

Subject: Re: [PATCH v8 03/22] axonram: Fix bug in direct_access

On Tue, Jul 22, 2014 at 03:47:51PM -0400, Matthew Wilcox wrote:
> The 'pfn' returned by axonram was completely bogus, and has been since
> 2008.
>
> Signed-off-by: Matthew Wilcox <[email protected]>
> Reviewed-by: Jan Kara <[email protected]>

Acked-by: Kirill A. Shutemov <[email protected]>
--
Kirill A. Shutemov

2014-07-23 11:46:17

by Kirill A. Shutemov

Subject: Re: [PATCH v8 05/22] Add vm_replace_mixed()

On Tue, Jul 22, 2014 at 03:47:53PM -0400, Matthew Wilcox wrote:
> From: Matthew Wilcox <[email protected]>
>
> vm_insert_mixed() will fail if there is already a valid PTE at that
> location. The DAX code would rather replace the previous value with
> the new PTE.
>
> Signed-off-by: Matthew Wilcox <[email protected]>
> ---
> include/linux/mm.h | 8 ++++++--
> mm/memory.c | 34 +++++++++++++++++++++-------------
> 2 files changed, 27 insertions(+), 15 deletions(-)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index e04f531..8d1194c 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1958,8 +1958,12 @@ int remap_pfn_range(struct vm_area_struct *, unsigned long addr,
> int vm_insert_page(struct vm_area_struct *, unsigned long addr, struct page *);
> int vm_insert_pfn(struct vm_area_struct *vma, unsigned long addr,
> unsigned long pfn);
> -int vm_insert_mixed(struct vm_area_struct *vma, unsigned long addr,
> - unsigned long pfn);
> +int __vm_insert_mixed(struct vm_area_struct *vma, unsigned long addr,
> + unsigned long pfn, bool replace);
> +#define vm_insert_mixed(vma, addr, pfn) \
> + __vm_insert_mixed(vma, addr, pfn, false)
> +#define vm_replace_mixed(vma, addr, pfn) \
> + __vm_insert_mixed(vma, addr, pfn, true)
> int vm_iomap_memory(struct vm_area_struct *vma, phys_addr_t start, unsigned long len);
>
>
> diff --git a/mm/memory.c b/mm/memory.c
> index 42bf429..cf06c97 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -1476,7 +1476,7 @@ pte_t *__get_locked_pte(struct mm_struct *mm, unsigned long addr,
> * pages reserved for the old functions anyway.
> */
> static int insert_page(struct vm_area_struct *vma, unsigned long addr,
> - struct page *page, pgprot_t prot)
> + struct page *page, pgprot_t prot, bool replace)
> {
> struct mm_struct *mm = vma->vm_mm;
> int retval;
> @@ -1492,8 +1492,12 @@ static int insert_page(struct vm_area_struct *vma, unsigned long addr,
> if (!pte)
> goto out;
> retval = -EBUSY;
> - if (!pte_none(*pte))
> - goto out_unlock;
> + if (!pte_none(*pte)) {
> + if (!replace)
> + goto out_unlock;
> + VM_BUG_ON(!mutex_is_locked(&vma->vm_file->f_mapping->i_mmap_mutex));
> + zap_page_range_single(vma, addr, PAGE_SIZE, NULL);

zap_page_range_single() takes ptl by itself in zap_pte_range(). It's not
going to work.

And zap_page_range*() is a pretty heavy weapon for shooting down one pte
that we already have a pointer to. Why?

--
Kirill A. Shutemov

2014-07-23 12:10:42

by Kirill A. Shutemov

Subject: Re: [PATCH v8 10/22] Replace the XIP page fault handler with the DAX page fault handler

On Tue, Jul 22, 2014 at 03:47:58PM -0400, Matthew Wilcox wrote:
> Instead of calling aops->get_xip_mem from the fault handler, the
> filesystem passes a get_block_t that is used to find the appropriate
> blocks.
>
> Signed-off-by: Matthew Wilcox <[email protected]>
> Reviewed-by: Jan Kara <[email protected]>
> ---
> fs/dax.c | 221 +++++++++++++++++++++++++++++++++++++++++++++++++++++
> fs/ext2/file.c | 35 ++++++++-
> include/linux/fs.h | 4 +-
> mm/filemap_xip.c | 206 -------------------------------------------------
> 4 files changed, 257 insertions(+), 209 deletions(-)
>
...

> +/**
> + * dax_fault - handle a page fault on a DAX file
> + * @vma: The virtual memory area where the fault occurred
> + * @vmf: The description of the fault
> + * @get_block: The filesystem method used to translate file offsets to blocks
> + *
> + * When a page fault occurs, filesystems may call this helper in their
> + * fault handler for DAX files.
> + */
> +int dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
> + get_block_t get_block)
> +{
> + int result;
> + struct super_block *sb = file_inode(vma->vm_file)->i_sb;
> +
> + if (vmf->flags & FAULT_FLAG_WRITE) {

Nobody seems to call sb_start_pagefault() in a fault handler.
Do you mean FAULT_FLAG_MKWRITE?

> + sb_start_pagefault(sb);
> + file_update_time(vma->vm_file);
> + }
> + result = do_dax_fault(vma, vmf, get_block);
> + if (vmf->flags & FAULT_FLAG_WRITE)
> + sb_end_pagefault(sb);
> +
> + return result;
> +}
> +EXPORT_SYMBOL_GPL(dax_fault);
> +
> +/**
> + * dax_mkwrite - convert a read-only page to read-write in a DAX file
> + * @vma: The virtual memory area where the fault occurred
> + * @vmf: The description of the fault
> + * @get_block: The filesystem method used to translate file offsets to blocks
> + *
> + * DAX handles reads of holes by adding pages full of zeroes into the
> + * mapping. If the page is subsequently written to, we have to allocate
> + * the page on media and free the page that was in the cache.
> + */
> +int dax_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf,
> + get_block_t get_block)
> +{
> + return dax_fault(vma, vmf, get_block);
> +}
> +EXPORT_SYMBOL_GPL(dax_mkwrite);

I don't think we want to introduce a new exported symbol just for a dummy
wrapper. Just use ".page_mkwrite = foo_fault,"; perf and selinux do
this. Or add a #define to the header file if you want better readability.

--
Kirill A. Shutemov

2014-07-23 12:30:39

by Kirill A. Shutemov

Subject: Re: [PATCH v8 00/22] Support ext4 on NV-DIMMs

On Tue, Jul 22, 2014 at 03:47:48PM -0400, Matthew Wilcox wrote:
> One of the primary uses for NV-DIMMs is to expose them as a block device
> and use a filesystem to store files on the NV-DIMM. While that works,
> it currently wastes memory and CPU time buffering the files in the page
> cache. We have support in ext2 for bypassing the page cache, but it
> has some races which are unfixable in the current design. This series
> of patches rewrite the underlying support, and add support for direct
> access to ext4.

Matthew, as discussed before, your patchset makes excessive use of
i_mmap_mutex. Are you going to address this later? Or what's the plan?

--
Kirill A. Shutemov

2014-07-23 13:52:27

by Matthew Wilcox

Subject: Re: [PATCH v8 05/22] Add vm_replace_mixed()

On Wed, Jul 23, 2014 at 02:45:40PM +0300, Kirill A. Shutemov wrote:
> On Tue, Jul 22, 2014 at 03:47:53PM -0400, Matthew Wilcox wrote:
> > From: Matthew Wilcox <[email protected]>
> >
> > vm_insert_mixed() will fail if there is already a valid PTE at that
> > location. The DAX code would rather replace the previous value with
> > the new PTE.

> > @@ -1492,8 +1492,12 @@ static int insert_page(struct vm_area_struct *vma, unsigned long addr,
> > if (!pte)
> > goto out;
> > retval = -EBUSY;
> > - if (!pte_none(*pte))
> > - goto out_unlock;
> > + if (!pte_none(*pte)) {
> > + if (!replace)
> > + goto out_unlock;
> > + VM_BUG_ON(!mutex_is_locked(&vma->vm_file->f_mapping->i_mmap_mutex));
> > + zap_page_range_single(vma, addr, PAGE_SIZE, NULL);
>
> zap_page_range_single() takes ptl by itself in zap_pte_range(). It's not
> going to work.

I have a test program that exercises this path ... it seems to work!
Following the code, I don't understand why it does. Maybe it's not
exercising this path after all? I've attached the program (so that I
have an "oh, duh" moment about 5 seconds after sending the email).

> And zap_page_range*() is a pretty heavy weapon for shooting down one pte
> that we already have a pointer to. Why?

I'd love to use a lighter-weight weapon! What would you recommend using,
zap_pte_range()?


Attachments:
(No filename) (1.31 kB)
dax-mmap-write-hole.c (766.00 B)

2014-07-23 13:55:19

by Matthew Wilcox

Subject: Re: [PATCH v8 10/22] Replace the XIP page fault handler with the DAX page fault handler

On Wed, Jul 23, 2014 at 03:10:25PM +0300, Kirill A. Shutemov wrote:
> > +int dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
> > + get_block_t get_block)
> > +{
> > + int result;
> > + struct super_block *sb = file_inode(vma->vm_file)->i_sb;
> > +
> > + if (vmf->flags & FAULT_FLAG_WRITE) {
>
> Nobody seems to call sb_start_pagefault() in a fault handler.
> Do you mean FAULT_FLAG_MKWRITE?

We need to call sb_start_pagefault() if we're going to make a modification
to the filesystem. Admittedly, we don't know if we're going to make a
modification at this point, but if we're taking a write fault on a hole,
we will be. We can skip the call to sb_start_pagefault() if we're taking
a read fault.

> > +int dax_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf,
> > + get_block_t get_block)
> > +{
> > + return dax_fault(vma, vmf, get_block);
> > +}
> > +EXPORT_SYMBOL_GPL(dax_mkwrite);
>
> I don't think we want to introduce a new exported symbol just for a dummy
> wrapper. Just use ".page_mkwrite = foo_fault,"; perf and selinux do
> this. Or add a #define to the header file if you want better readability.

They were different at one time ... agreed, I can just make them an alias
for now.

2014-07-23 13:59:51

by Matthew Wilcox

Subject: Re: [PATCH v8 00/22] Support ext4 on NV-DIMMs

On Wed, Jul 23, 2014 at 03:30:28PM +0300, Kirill A. Shutemov wrote:
> On Tue, Jul 22, 2014 at 03:47:48PM -0400, Matthew Wilcox wrote:
> > One of the primary uses for NV-DIMMs is to expose them as a block device
> > and use a filesystem to store files on the NV-DIMM. While that works,
> > it currently wastes memory and CPU time buffering the files in the page
> > cache. We have support in ext2 for bypassing the page cache, but it
> > has some races which are unfixable in the current design. This series
> > of patches rewrite the underlying support, and add support for direct
> > access to ext4.
>
> Matthew, as discussed before, your patchset makes excessive use of
> i_mmap_mutex. Are you going to address this later? Or what's the plan?

Yes, it'll be addressed later. I have some ideas, but I'd like to get
some experience with just how bad this single mutex is before trying to
split it.

2014-07-23 14:21:46

by Kirill A. Shutemov

Subject: Re: [PATCH v8 05/22] Add vm_replace_mixed()

On Wed, Jul 23, 2014 at 09:52:22AM -0400, Matthew Wilcox wrote:
> On Wed, Jul 23, 2014 at 02:45:40PM +0300, Kirill A. Shutemov wrote:
> > On Tue, Jul 22, 2014 at 03:47:53PM -0400, Matthew Wilcox wrote:
> > > From: Matthew Wilcox <[email protected]>
> > >
> > > vm_insert_mixed() will fail if there is already a valid PTE at that
> > > location. The DAX code would rather replace the previous value with
> > > the new PTE.
>
> > > @@ -1492,8 +1492,12 @@ static int insert_page(struct vm_area_struct *vma, unsigned long addr,
> > > if (!pte)
> > > goto out;
> > > retval = -EBUSY;
> > > - if (!pte_none(*pte))
> > > - goto out_unlock;
> > > + if (!pte_none(*pte)) {
> > > + if (!replace)
> > > + goto out_unlock;
> > > + VM_BUG_ON(!mutex_is_locked(&vma->vm_file->f_mapping->i_mmap_mutex));
> > > + zap_page_range_single(vma, addr, PAGE_SIZE, NULL);
> >
> > zap_page_range_single() takes ptl by itself in zap_pte_range(). It's not
> > going to work.
>
> I have a test program that exercises this path ... it seems to work!
> Following the code, I don't understand why it does. Maybe it's not
> exercising this path after all? I've attached the program (so that I
> have an "oh, duh" moment about 5 seconds after sending the email).

See below.

>
> > And zap_page_range*() is a pretty heavy weapon for shooting down one pte
> > that we already have a pointer to. Why?
>
> I'd love to use a lighter-weight weapon! What would you recommend using,
> zap_pte_range()?

The most straightforward way: extract the body of the pte loop from
zap_pte_range() into a separate function -- zap_pte() -- and use it.

> #include <stdio.h>
> #include <stdlib.h>
> #include <string.h>
> #include <sys/types.h>
> #include <sys/mman.h>
> #include <fcntl.h>
> #include <unistd.h>
> #include <errno.h>
>
> int
> main(int argc, char *argv[])
> {
> 	int fd;
> 	void *addr;
> 	char buf[4096];
>
> 	if (argc != 2) {
> 		fprintf(stderr, "usage: %s filename\n", argv[0]);
> 		exit(1);
> 	}
>
> 	if ((fd = open(argv[1], O_CREAT|O_RDWR, 0666)) < 0) {
> 		perror(argv[1]);
> 		exit(1);
> 	}
>
> 	if (ftruncate(fd, 4096) < 0) {

Shouldn't this be ftruncate(fd, 0)? Otherwise the memcpy() below will
fault in a page from backing storage, not a hole, and the write will not
replace anything.

> 		perror("ftruncate");
> 		exit(1);
> 	}
>
> 	if ((addr = mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_SHARED,
> 			fd, 0)) == MAP_FAILED) {
> 		perror("mmap");
> 		exit(1);
> 	}
>
> 	close(fd);
>
> 	/* first read */
> 	memcpy(buf, addr, 4096);
>
> 	/* now write a bit */
> 	memcpy(addr, buf, 8);
>
> 	printf("%s: test passed.\n", argv[0]);
> 	exit(0);
> }


--
Kirill A. Shutemov

2014-07-23 14:27:57

by Matthew Wilcox

Subject: Re: [PATCH v8 05/22] Add vm_replace_mixed()

On Wed, Jul 23, 2014 at 05:20:48PM +0300, Kirill A. Shutemov wrote:
> On Wed, Jul 23, 2014 at 09:52:22AM -0400, Matthew Wilcox wrote:
> > I'd love to use a lighter-weight weapon! What would you recommend using,
> > zap_pte_range()?
>
> The most straightforward way: extract the body of the pte loop from
> zap_pte_range() into a separate function -- zap_pte() -- and use it.

OK, I can do that. What about the other parts of zap_page_range(),
do I need to call them?

lru_add_drain();
tlb_gather_mmu(&tlb, mm, address, end);
update_hiwater_rss(mm);
mmu_notifier_invalidate_range_start(mm, address, end);
[ unmap_single_vma(&tlb, vma, address, end, details);]
mmu_notifier_invalidate_range_end(mm, address, end);
tlb_finish_mmu(&tlb, address, end);

> > if ((fd = open(argv[1], O_CREAT|O_RDWR, 0666)) < 0) {
> > perror(argv[1]);
> > exit(1);
> > }
> >
> > if (ftruncate(fd, 4096) < 0) {
>
> Shouldn't this be ftruncate(fd, 0)? Otherwise the memcpy() below will
> fault in a page from backing storage, not a hole, and the write will not
> replace anything.

Ah, it was starting with a new file, hence the O_CREAT up above.

2014-07-23 14:35:22

by Matthew Wilcox

Subject: Re: [PATCH v8 00/22] Support ext4 on NV-DIMMs

On Wed, Jul 23, 2014 at 07:10:32AM -0700, Howard Chu wrote:
> Matthew Wilcox wrote:
> >One of the primary uses for NV-DIMMs is to expose them as a block device
> >and use a filesystem to store files on the NV-DIMM. While that works,
> >it currently wastes memory and CPU time buffering the files in the page
> >cache. We have support in ext2 for bypassing the page cache, but it
> >has some races which are unfixable in the current design. This series
> >of patches rewrite the underlying support, and add support for direct
> >access to ext4.
>
> This is an awful lot of work to go thru just to get a glorified ext4
> RAMdisk. RAMdisks are one of the worst possible uses for RAM, requiring
> users to explicitly copy files to them before getting any benefit. Using RAM
> for a page cache instead brings benefits to all file accesses without
> requiring any user intervention.

Perhaps you misunderstand the problem. There are many different kinds
of NV-DIMM out there today with different performance characteristics.
One that has been described to me has write times 1000x slower than read
times. In that situation, you can't possibly "just use it as page cache";
you need to place the read-often, write-rarely files on that media.

> If the NVDIMM range was reserved for exclusive use of the page cache, then
> you would have an avenue to get persistence/safety for every filesystem
> mounted on a machine, not just a special case ext4.

No, you wouldn't; you'd also need a mechanism to store the state
of the page cache persistently. And you have to make sure that the
filesystem does the appropriate cache invalidations. By going the route
taken here, we can use the existing caching mechanisms (e.g. FS-Cache),
which have solved all the hard problems of making sure that local caches
are coherent with storage.

2014-07-23 14:42:54

by Howard Chu

Subject: Re: [PATCH v8 00/22] Support ext4 on NV-DIMMs

Matthew Wilcox wrote:
> One of the primary uses for NV-DIMMs is to expose them as a block device
> and use a filesystem to store files on the NV-DIMM. While that works,
> it currently wastes memory and CPU time buffering the files in the page
> cache. We have support in ext2 for bypassing the page cache, but it
> has some races which are unfixable in the current design. This series
> of patches rewrite the underlying support, and add support for direct
> access to ext4.

This is an awful lot of work to go thru just to get a glorified ext4 RAMdisk.
RAMdisks are one of the worst possible uses for RAM, requiring users to
explicitly copy files to them before getting any benefit. Using RAM for a page
cache instead brings benefits to all file accesses without requiring any user
intervention.

If the NVDIMM range was reserved for exclusive use of the page cache, then you
would have an avenue to get persistence/safety for every filesystem mounted on
a machine, not just a special case ext4.

--
-- Howard Chu
CTO, Symas Corp. http://www.symas.com
Director, Highland Sun http://highlandsun.com/hyc/
Chief Architect, OpenLDAP http://www.openldap.org/project/

2014-07-23 15:28:09

by Howard Chu

Subject: Re: [PATCH v8 00/22] Support ext4 on NV-DIMMs

Matthew Wilcox wrote:
> On Wed, Jul 23, 2014 at 07:10:32AM -0700, Howard Chu wrote:
>> Matthew Wilcox wrote:
>>> One of the primary uses for NV-DIMMs is to expose them as a block device
>>> and use a filesystem to store files on the NV-DIMM. While that works,
>>> it currently wastes memory and CPU time buffering the files in the page
>>> cache. We have support in ext2 for bypassing the page cache, but it
>>> has some races which are unfixable in the current design. This series
>>> of patches rewrite the underlying support, and add support for direct
>>> access to ext4.
>>
>> This is an awful lot of work to go thru just to get a glorified ext4
>> RAMdisk. RAMdisks are one of the worst possible uses for RAM, requiring
>> users to explicitly copy files to them before getting any benefit. Using RAM
>> for a page cache instead brings benefits to all file accesses without
>> requiring any user intervention.
>
> Perhaps you misunderstand the problem. There are many different kinds
> of NV-DIMM out there today with different performance characteristics.
> One that has been described to me has write times 1000x slower than read
> times. In that situation, you can't possibly "just use it as page cache";
> you need to place the read-often, write-rarely files on that media.

Thanks for the clarification. Yes, I was assuming "NVDIMM" to mean something
with DRAM performance + persistence, like http://www.agigatech.com/ddr3.php or
http://www.vikingtechnology.com/nvdimm-technology. That's also the definition
in Wikipedia (http://en.wikipedia.org/wiki/NVDIMM):

	(NVDIMM) is a computer memory DRAM DIMM that retains data even when
	electrical power is removed

What NVDIMM did you have in mind? It doesn't sound like a Flash DIMM, which is
nowhere near DRAM performance for reads or writes. The closest so far would be
STT-MRAM, but the capacity/density isn't really there yet.

>> If the NVDIMM range was reserved for exclusive use of the page cache, then
>> you would have an avenue to get persistence/safety for every filesystem
>> mounted on a machine, not just a special case ext4.
>
> No, you wouldn't; you'd also need a mechanism to store the state
> of the page cache persistently.

OK, so you have to arrange for the page cache manager to use the NVRAM for its
own data structures as well.

> And you have to make sure that the
> filesystem does the appropriate cache invalidations. By going the route
> taken here, we can use the existing caching mechanisms (e.g. FS-Cache),
> which have solved all the hard problems of making sure that local caches
> are coherent with storage.

OK. Thanks again for the explanation.

--
-- Howard Chu
CTO, Symas Corp. http://www.symas.com
Director, Highland Sun http://highlandsun.com/hyc/
Chief Architect, OpenLDAP http://www.openldap.org/project/

2014-07-23 15:55:15

by Kirill A. Shutemov

Subject: Re: [PATCH v8 05/22] Add vm_replace_mixed()

On Wed, Jul 23, 2014 at 10:27:45AM -0400, Matthew Wilcox wrote:
> On Wed, Jul 23, 2014 at 05:20:48PM +0300, Kirill A. Shutemov wrote:
> > On Wed, Jul 23, 2014 at 09:52:22AM -0400, Matthew Wilcox wrote:
> > > I'd love to use a lighter-weight weapon! What would you recommend using,
> > > zap_pte_range()?
> >
> > The most straightforward way: extract the body of the pte loop from
> > zap_pte_range() into a separate function -- zap_pte() -- and use it.
>
> OK, I can do that. What about the other parts of zap_page_range(),
> do I need to call them?
>
> lru_add_drain();

No, I guess not.

> tlb_gather_mmu(&tlb, mm, address, end);
> tlb_finish_mmu(&tlb, address, end);

The new zap_pte() should tolerate tlb == NULL and do flush_tlb_page() or
pte_clear_*flush or something similar.

> update_hiwater_rss(mm);

No: you cannot end up with lower rss after replace, iiuc.

> mmu_notifier_invalidate_range_start(mm, address, end);
> mmu_notifier_invalidate_range_end(mm, address, end);

mmu_notifier_invalidate_page() should be enough.

> > > if ((fd = open(argv[1], O_CREAT|O_RDWR, 0666)) < 0) {
> > > perror(argv[1]);
> > > exit(1);
> > > }
> > >
> > > if (ftruncate(fd, 4096) < 0) {
> >
> > Shouldn't this be ftruncate(fd, 0)? Otherwise the memcpy() below will
> > fault in a page from backing storage, not a hole, and the write will not
> > replace anything.
>
> Ah, it was starting with a new file, hence the O_CREAT up above.

Do you mean you pointed it at a new file every time? O_CREAT doesn't truncate
the file if it already exists, iirc.

--
Kirill A. Shutemov

2014-07-23 15:58:45

by Boaz Harrosh

Subject: Re: [PATCH v8 00/22] Support ext4 on NV-DIMMs

On 07/22/2014 10:47 PM, Matthew Wilcox wrote:
> One of the primary uses for NV-DIMMs is to expose them as a block device
> and use a filesystem to store files on the NV-DIMM. While that works,
> it currently wastes memory and CPU time buffering the files in the page
> cache. We have support in ext2 for bypassing the page cache, but it
> has some races which are unfixable in the current design. This series
> of patches rewrite the underlying support, and add support for direct
> access to ext4.
>
> I would particularly welcome feedback from mm people on patch 5 ("Add
> vm_replace_mixed()") and from fs people on patch 7 ("Add copy_to_iter(),
> copy_from_iter() and iov_iter_zero()").
>
> This iteration of the patchset rebases to 3.16-rc6 and makes substantial
> changes based on Jan Kara's feedback:
>

Hi Matthew

Thank you for all your hard work. I will review and test this series, and
report.

Could you please push this tree to GitHub? It used to be on the prd
tree; if you could just add another branch there, it would be cool.
(https://github.com/01org/prd)

I've been testing performance with the 3.14-based tree, comparing a post-pmfs
fs that uses prd, ext4-dax, and shmem (same memory working-set sizes):
ext4-dax runs over prd, same as pmfs, with 32G + 32G of system memory;
shmem uses the same 32G working sets but on 64G of system memory (no prd).

On reads they are all about the same, with pmfs-prd a bit on top.
But on writes, shmem is far better. I'll investigate this further; it
does not make sense and looks like a serialization of writes somewhere.
(All numbers rounded.)

fio 4K random reads 24 threads (different files):
pmfs-prd: 7,300,000
ext4-dax: 6,200,000
shmem: 6,150,000

fio 4k random writes 24 threads (different files):
pmfs-prd: 900,000
ext4-dax: 800,000
shmem: 3,500,000
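
[For reference, a fio job file approximating the workloads above might look
like the following; the mount point, file size, runtime, and ioengine choice
are assumptions, not necessarily the configuration used for these numbers:]

```ini
; 24 threads, one file each, 4k random I/O against a DAX mount
[global]
directory=/mnt/ext4-dax   ; assumed mount point
ioengine=mmap             ; exercises the mmap/fault path
bs=4k
size=1g
numjobs=24
time_based
runtime=60
group_reporting

[randread]
rw=randread

[randwrite]
stonewall                 ; start after the read phase completes
rw=randwrite
```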

BTW:
With the new pmfs-prd, which I have also ported to the DAX tree: the old
pmfs was copy/pasting the xip-write and xip-mmap generic code with some
private hacks, but was still using the xip-read stuff. I have now moved the
old xip-read stuff inside as well (as the write path did before) and so
ported it to the DAX tree, which removed these functions. I have spent a
couple of hours comparing the DAX read/write/mmap paths with the current
pmfs read/write/mmap paths. Though they are very similar and clearly
originated from the same code, the new DAX code is even more alien to pmfs,
especially with the current use of buffer_heads. (It could be made to work,
but with lots of extra shimming.) So the current DAX form is more "block"
based and not well suited to a pure memory-based FS. If any such filesystems
come up, they will need their own interfaces, more similar to shmem's
private implementation.

Above you said, regarding the old XIP code:
".. but it has some races which are unfixable in the current design"

I have tested pmfs-prd under xfstests and I'm passing lots of them. I cannot
easily spot the race you are talking about. (The failures look more like dir
seeks and missing stuff.) Could you point me to a test I should run to try
and find it?

Thanks, have a good day
Boaz

> - Handles errors in copy_user_bh()
> - Changes calling convention for dax_get_addr() / dax_get_pfn() to take a
> blkbits argument instead of an inode argument
> - Cache inode->i_blkbits in a local variable
> - Rename file offset to 'pos' to fit the rest of the VFS
> - Added a FIXME to fall back to buffered I/O if the filesystem doesn't
> support filling a hole from within the direct I/O path. Mysterious
> and complex are the ways of get_block implementations.
> - Moved the call to inode_dio_done() to after end_io() to fix a race
> - Added a comment about why we have to recheck i_size under the page lock
> - Use vm_insert_page() in the COW path instead of returning VM_FAULT_COWED
> - Handle errors coming back from dax_get_pfn() correctly in do_dax_fault()
> - Removes zero pages from the process' address space before trying to
> replace them with the PFN of the newly allocated block
> - Factor out bdev_direct_access() to support partitioning properly
> - Rework the i_mmap_mutex locking to remove an inversion vs the page lock
> - Make the ext2 rename of -o xip to -o dax more graceful
> - Only update file mtime/ctime on a write fault, not a read fault
>
>
> Matthew Wilcox (21):
> Fix XIP fault vs truncate race
> Allow page fault handlers to perform the COW
> axonram: Fix bug in direct_access
> Change direct_access calling convention
> Add vm_replace_mixed()
> Introduce IS_DAX(inode)
> Add copy_to_iter(), copy_from_iter() and iov_iter_zero()
> Replace XIP read and write with DAX I/O
> Replace ext2_clear_xip_target with dax_clear_blocks
> Replace the XIP page fault handler with the DAX page fault handler
> Replace xip_truncate_page with dax_truncate_page
> Replace XIP documentation with DAX documentation
> Remove get_xip_mem
> ext2: Remove ext2_xip_verify_sb()
> ext2: Remove ext2_use_xip
> ext2: Remove xip.c and xip.h
> Remove CONFIG_EXT2_FS_XIP and rename CONFIG_FS_XIP to CONFIG_FS_DAX
> ext2: Remove ext2_aops_xip
> Get rid of most mentions of XIP in ext2
> xip: Add xip_zero_page_range
> brd: Rename XIP to DAX
>
> Ross Zwisler (1):
> ext4: Add DAX functionality
>
> Documentation/filesystems/Locking | 3 -
> Documentation/filesystems/dax.txt | 91 +++++++
> Documentation/filesystems/ext4.txt | 2 +
> Documentation/filesystems/xip.txt | 68 -----
> arch/powerpc/sysdev/axonram.c | 14 +-
> drivers/block/Kconfig | 13 +-
> drivers/block/brd.c | 22 +-
> drivers/s390/block/dcssblk.c | 19 +-
> fs/Kconfig | 21 +-
> fs/Makefile | 1 +
> fs/block_dev.c | 28 +++
> fs/dax.c | 503 +++++++++++++++++++++++++++++++++++++
> fs/exofs/inode.c | 1 -
> fs/ext2/Kconfig | 11 -
> fs/ext2/Makefile | 1 -
> fs/ext2/ext2.h | 10 +-
> fs/ext2/file.c | 45 +++-
> fs/ext2/inode.c | 38 +--
> fs/ext2/namei.c | 13 +-
> fs/ext2/super.c | 53 ++--
> fs/ext2/xip.c | 91 -------
> fs/ext2/xip.h | 26 --
> fs/ext4/ext4.h | 6 +
> fs/ext4/file.c | 53 +++-
> fs/ext4/indirect.c | 18 +-
> fs/ext4/inode.c | 51 ++--
> fs/ext4/namei.c | 10 +-
> fs/ext4/super.c | 39 ++-
> fs/open.c | 5 +-
> include/linux/blkdev.h | 6 +-
> include/linux/fs.h | 49 +++-
> include/linux/mm.h | 9 +-
> include/linux/uio.h | 3 +
> mm/Makefile | 1 -
> mm/fadvise.c | 6 +-
> mm/filemap.c | 6 +-
> mm/filemap_xip.c | 483 -----------------------------------
> mm/iov_iter.c | 237 +++++++++++++++--
> mm/madvise.c | 2 +-
> mm/memory.c | 45 ++--
> 40 files changed, 1228 insertions(+), 875 deletions(-)
> create mode 100644 Documentation/filesystems/dax.txt
> delete mode 100644 Documentation/filesystems/xip.txt
> create mode 100644 fs/dax.c
> delete mode 100644 fs/ext2/xip.c
> delete mode 100644 fs/ext2/xip.h
> delete mode 100644 mm/filemap_xip.c
>

2014-07-23 16:57:16

by Boaz Harrosh

Subject: Re: [PATCH v8 10/22] Replace the XIP page fault handler with the DAX page fault handler

On 07/22/2014 10:47 PM, Matthew Wilcox wrote:
> Instead of calling aops->get_xip_mem from the fault handler, the
> filesystem passes a get_block_t that is used to find the appropriate
> blocks.
>
> Signed-off-by: Matthew Wilcox <[email protected]>
> Reviewed-by: Jan Kara <[email protected]>
> ---
> fs/dax.c | 221 +++++++++++++++++++++++++++++++++++++++++++++++++++++
> fs/ext2/file.c | 35 ++++++++-
> include/linux/fs.h | 4 +-
> mm/filemap_xip.c | 206 -------------------------------------------------
> 4 files changed, 257 insertions(+), 209 deletions(-)
>
> diff --git a/fs/dax.c b/fs/dax.c
> index 02e226f..4ab4890 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -19,9 +19,13 @@
> #include <linux/buffer_head.h>
> #include <linux/fs.h>
> #include <linux/genhd.h>
> +#include <linux/highmem.h>
> +#include <linux/memcontrol.h>
> +#include <linux/mm.h>
> #include <linux/mutex.h>
> #include <linux/sched.h>
> #include <linux/uio.h>
> +#include <linux/vmstat.h>
>
> int dax_clear_blocks(struct inode *inode, sector_t block, long size)
> {
> @@ -64,6 +68,14 @@ static long dax_get_addr(struct buffer_head *bh, void **addr, unsigned blkbits)
> return bdev_direct_access(bh->b_bdev, sector, addr, &pfn, bh->b_size);
> }
>
> +static long dax_get_pfn(struct buffer_head *bh, unsigned long *pfn,
> + unsigned blkbits)
> +{
> + void *addr;
> + sector_t sector = bh->b_blocknr << (blkbits - 9);
> + return bdev_direct_access(bh->b_bdev, sector, &addr, pfn, bh->b_size);
> +}
> +
> static void dax_new_buf(void *addr, unsigned size, unsigned first, loff_t pos,
> loff_t end)
> {
> @@ -228,3 +240,212 @@ ssize_t dax_do_io(int rw, struct kiocb *iocb, struct inode *inode,
> return retval;
> }
> EXPORT_SYMBOL_GPL(dax_do_io);
> +
> +/*
> + * The user has performed a load from a hole in the file. Allocating
> + * a new page in the file would cause excessive storage usage for
> + * workloads with sparse files. We allocate a page cache page instead.
> + * We'll kick it out of the page cache if it's ever written to,
> + * otherwise it will simply fall out of the page cache under memory
> + * pressure without ever having been dirtied.
> + */

Do you like this? I understand that you cannot put the ZERO page or any
such global page in the page cache, since each instance needs its own
list_head/index/mapping and so on. But why use any page at all?

Use a global ZERO page, either the system-global one or one static to this
code. Map it into the faulting application's VMA using its pfn
(page_to_pfn), just like you do with real DAX blocks from prd.

Then at mkwrite you can compare the current pte's pfn to
page_to_pfn(ZERO_PAGE), or just redo the get_block and, if it is a hole,
you know it was the ZERO_PAGE, and so use vm_replace_mixed() to install a
freshly allocated real dax-block for writing.

Surely we should be able to stay in the pfn and virtual-pointer domain
without using any pages, even for this trivial zero-blocks stuff, no?

All that said, there is something I do not understand in this code, and in
the old code as well. Please help.

Say app A reads a hole, then app B reads the same hole. Both now point to
the same zero-page pfn. Now say app B writes to that hole: mkwrite will
convert it to a real dax-block pfn and map the new pfn into the faulting
VMA. But what about app A? Will it still read the old pfn? Who loops over
all the VMAs that have a mapping there and invalidates those mappings?

Same with truncate. App A mmap-reads a block, app B does a read-mmap and
then a truncate. Who loops over all VMA mappings of these blocks to
invalidate them? With the page cache and pages we have a list of all VMAs
that currently map a page, but with dax-pfns (dax-blocks) we do *not* have
a page struct. Who keeps the list of currently active VMA mappings?

(I know it must be there; I just can't spot where in the code. I probably
need to read this more carefully.)

Thanks
Boaz

> +static int dax_load_hole(struct address_space *mapping, struct page *page,
> + struct vm_fault *vmf)
> +{
> + unsigned long size;
> + struct inode *inode = mapping->host;
> + if (!page)
> + page = find_or_create_page(mapping, vmf->pgoff,
> + GFP_KERNEL | __GFP_ZERO);
> + if (!page)
> + return VM_FAULT_OOM;
> + /* Recheck i_size under page lock to avoid truncate race */
> + size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT;
> + if (vmf->pgoff >= size) {
> + unlock_page(page);
> + page_cache_release(page);
> + return VM_FAULT_SIGBUS;
> + }
> +
> + vmf->page = page;
> + return VM_FAULT_LOCKED;
> +}
> +
> +static int copy_user_bh(struct page *to, struct buffer_head *bh,
> + unsigned blkbits, unsigned long vaddr)
> +{
> + void *vfrom, *vto;
> + if (dax_get_addr(bh, &vfrom, blkbits) < 0)
> + return -EIO;
> + vto = kmap_atomic(to);
> + copy_user_page(vto, vfrom, vaddr, to);
> + kunmap_atomic(vto);
> + return 0;
> +}
> +
> +static int do_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
> + get_block_t get_block)
> +{
> + struct file *file = vma->vm_file;
> + struct inode *inode = file_inode(file);
> + struct address_space *mapping = file->f_mapping;
> + struct page *page;
> + struct buffer_head bh;
> + unsigned long vaddr = (unsigned long)vmf->virtual_address;
> + unsigned blkbits = inode->i_blkbits;
> + sector_t block;
> + pgoff_t size;
> + unsigned long pfn;
> + int error;
> + int major = 0;
> +
> + size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT;
> + if (vmf->pgoff >= size)
> + return VM_FAULT_SIGBUS;
> +
> + memset(&bh, 0, sizeof(bh));
> + block = (sector_t)vmf->pgoff << (PAGE_SHIFT - blkbits);
> + bh.b_size = PAGE_SIZE;
> +
> + repeat:
> + page = find_get_page(mapping, vmf->pgoff);
> + if (page) {
> + if (!lock_page_or_retry(page, vma->vm_mm, vmf->flags)) {
> + page_cache_release(page);
> + return VM_FAULT_RETRY;
> + }
> + if (unlikely(page->mapping != mapping)) {
> + unlock_page(page);
> + page_cache_release(page);
> + goto repeat;
> + }
> + } else {
> + mutex_lock(&mapping->i_mmap_mutex);
> + }
> +
> + /* Guard against a race with truncate */
> + size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT;
> + if (unlikely(vmf->pgoff >= size))
> + goto sigbus;
> +
> + error = get_block(inode, block, &bh, 0);
> + if (error || bh.b_size < PAGE_SIZE)
> + goto sigbus;
> +
> + if (!buffer_written(&bh) && !vmf->cow_page) {
> + if (vmf->flags & FAULT_FLAG_WRITE) {
> + error = get_block(inode, block, &bh, 1);
> + count_vm_event(PGMAJFAULT);
> + mem_cgroup_count_vm_event(vma->vm_mm, PGMAJFAULT);
> + major = VM_FAULT_MAJOR;
> + if (error || bh.b_size < PAGE_SIZE)
> + goto sigbus;
> + } else {
> + mutex_unlock(&mapping->i_mmap_mutex);
> + return dax_load_hole(mapping, page, vmf);
> + }
> + }
> +
> + if (vmf->cow_page) {
> + struct page *new_page = vmf->cow_page;
> + if (buffer_written(&bh))
> + error = copy_user_bh(new_page, &bh, blkbits, vaddr);
> + else
> + clear_user_highpage(new_page, vaddr);
> + if (!error) {
> + __SetPageUptodate(new_page);
> + vm_insert_page(vma, vaddr, new_page);
> + }
> + if (page) {
> + unlock_page(page);
> + page_cache_release(page);
> + } else {
> + mutex_unlock(&mapping->i_mmap_mutex);
> + }
> + return error ? VM_FAULT_SIGBUS : VM_FAULT_NOPAGE;
> + }
> +
> + if (buffer_unwritten(&bh) || buffer_new(&bh))
> + dax_clear_blocks(inode, bh.b_blocknr, bh.b_size);
> +
> + error = dax_get_pfn(&bh, &pfn, blkbits);
> + if (error > 0)
> + error = vm_replace_mixed(vma, vaddr, pfn);
> +
> + if (!page) {
> + mutex_unlock(&mapping->i_mmap_mutex);
> + /*
> + * We may have raced with another thread which has inserted
> + * a zero page at this address. Remove it now if we did.
> + */
> + page = find_lock_page(mapping, vmf->pgoff);
> + }
> +
> + if (page) {
> + delete_from_page_cache(page);
> + unmap_mapping_range(mapping, vmf->pgoff << PAGE_SHIFT,
> + PAGE_CACHE_SIZE, 0);
> + unlock_page(page);
> + page_cache_release(page);
> + }
> +
> + if (error == -ENOMEM)
> + return VM_FAULT_OOM;
> + /* -EBUSY is fine, somebody else faulted on the same PTE */
> + if (error != -EBUSY)
> + BUG_ON(error);
> + return VM_FAULT_NOPAGE | major;
> +
> + sigbus:
> + if (page) {
> + unlock_page(page);
> + page_cache_release(page);
> + } else {
> + mutex_unlock(&mapping->i_mmap_mutex);
> + }
> + return VM_FAULT_SIGBUS;
> +}
> +
> +/**
> + * dax_fault - handle a page fault on a DAX file
> + * @vma: The virtual memory area where the fault occurred
> + * @vmf: The description of the fault
> + * @get_block: The filesystem method used to translate file offsets to blocks
> + *
> + * When a page fault occurs, filesystems may call this helper in their
> + * fault handler for DAX files.
> + */
> +int dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
> + get_block_t get_block)
> +{
> + int result;
> + struct super_block *sb = file_inode(vma->vm_file)->i_sb;
> +
> + if (vmf->flags & FAULT_FLAG_WRITE) {
> + sb_start_pagefault(sb);
> + file_update_time(vma->vm_file);
> + }
> + result = do_dax_fault(vma, vmf, get_block);
> + if (vmf->flags & FAULT_FLAG_WRITE)
> + sb_end_pagefault(sb);
> +
> + return result;
> +}
> +EXPORT_SYMBOL_GPL(dax_fault);
> +
> +/**
> + * dax_mkwrite - convert a read-only page to read-write in a DAX file
> + * @vma: The virtual memory area where the fault occurred
> + * @vmf: The description of the fault
> + * @get_block: The filesystem method used to translate file offsets to blocks
> + *
> + * DAX handles reads of holes by adding pages full of zeroes into the
> + * mapping. If the page is subsequently written to, we have to allocate
> + * the page on media and free the page that was in the cache.
> + */
> +int dax_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf,
> + get_block_t get_block)
> +{
> + return dax_fault(vma, vmf, get_block);
> +}
> +EXPORT_SYMBOL_GPL(dax_mkwrite);
> diff --git a/fs/ext2/file.c b/fs/ext2/file.c
> index a247123..da8dc64 100644
> --- a/fs/ext2/file.c
> +++ b/fs/ext2/file.c
> @@ -25,6 +25,37 @@
> #include "xattr.h"
> #include "acl.h"
>
> +#ifdef CONFIG_EXT2_FS_XIP
> +static int ext2_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
> +{
> + return dax_fault(vma, vmf, ext2_get_block);
> +}
> +
> +static int ext2_dax_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
> +{
> + return dax_mkwrite(vma, vmf, ext2_get_block);
> +}
> +
> +static const struct vm_operations_struct ext2_dax_vm_ops = {
> + .fault = ext2_dax_fault,
> + .page_mkwrite = ext2_dax_mkwrite,
> + .remap_pages = generic_file_remap_pages,
> +};
> +
> +static int ext2_file_mmap(struct file *file, struct vm_area_struct *vma)
> +{
> + if (!IS_DAX(file_inode(file)))
> + return generic_file_mmap(file, vma);
> +
> + file_accessed(file);
> + vma->vm_ops = &ext2_dax_vm_ops;
> + vma->vm_flags |= VM_MIXEDMAP;
> + return 0;
> +}
> +#else
> +#define ext2_file_mmap generic_file_mmap
> +#endif
> +
> /*
> * Called when filp is released. This happens when all file descriptors
> * for a single struct file are closed. Note that different open() calls
> @@ -70,7 +101,7 @@ const struct file_operations ext2_file_operations = {
> #ifdef CONFIG_COMPAT
> .compat_ioctl = ext2_compat_ioctl,
> #endif
> - .mmap = generic_file_mmap,
> + .mmap = ext2_file_mmap,
> .open = dquot_file_open,
> .release = ext2_release_file,
> .fsync = ext2_fsync,
> @@ -89,7 +120,7 @@ const struct file_operations ext2_xip_file_operations = {
> #ifdef CONFIG_COMPAT
> .compat_ioctl = ext2_compat_ioctl,
> #endif
> - .mmap = xip_file_mmap,
> + .mmap = ext2_file_mmap,
> .open = dquot_file_open,
> .release = ext2_release_file,
> .fsync = ext2_fsync,
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index e38138b..d4259e1 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -49,6 +49,7 @@ struct swap_info_struct;
> struct seq_file;
> struct workqueue_struct;
> struct iov_iter;
> +struct vm_fault;
>
> extern void __init inode_init(void);
> extern void __init inode_init_early(void);
> @@ -2464,10 +2465,11 @@ extern int nonseekable_open(struct inode * inode, struct file * filp);
>
> #ifdef CONFIG_FS_XIP
> int dax_clear_blocks(struct inode *, sector_t block, long size);
> -extern int xip_file_mmap(struct file * file, struct vm_area_struct * vma);
> extern int xip_truncate_page(struct address_space *mapping, loff_t from);
> ssize_t dax_do_io(int rw, struct kiocb *, struct inode *, struct iov_iter *,
> loff_t, get_block_t, dio_iodone_t, int flags);
> +int dax_fault(struct vm_area_struct *, struct vm_fault *, get_block_t);
> +int dax_mkwrite(struct vm_area_struct *, struct vm_fault *, get_block_t);
> #else
> static inline int dax_clear_blocks(struct inode *i, sector_t blk, long sz)
> {
> diff --git a/mm/filemap_xip.c b/mm/filemap_xip.c
> index f7c37a1..9dd45f3 100644
> --- a/mm/filemap_xip.c
> +++ b/mm/filemap_xip.c
> @@ -22,212 +22,6 @@
> #include <asm/io.h>
>
> /*
> - * We do use our own empty page to avoid interference with other users
> - * of ZERO_PAGE(), such as /dev/zero
> - */
> -static DEFINE_MUTEX(xip_sparse_mutex);
> -static seqcount_t xip_sparse_seq = SEQCNT_ZERO(xip_sparse_seq);
> -static struct page *__xip_sparse_page;
> -
> -/* called under xip_sparse_mutex */
> -static struct page *xip_sparse_page(void)
> -{
> -	if (!__xip_sparse_page) {
> -		struct page *page = alloc_page(GFP_HIGHUSER | __GFP_ZERO);
> -
> -		if (page)
> -			__xip_sparse_page = page;
> -	}
> -	return __xip_sparse_page;
> -}
> -
> -/*
> - * __xip_unmap is invoked from xip_unmap and
> - * xip_write
> - *
> - * This function walks all vmas of the address_space and unmaps the
> - * __xip_sparse_page when found at pgoff.
> - */
> -static void
> -__xip_unmap (struct address_space * mapping,
> -		     unsigned long pgoff)
> -{
> -	struct vm_area_struct *vma;
> -	struct mm_struct *mm;
> -	unsigned long address;
> -	pte_t *pte;
> -	pte_t pteval;
> -	spinlock_t *ptl;
> -	struct page *page;
> -	unsigned count;
> -	int locked = 0;
> -
> -	count = read_seqcount_begin(&xip_sparse_seq);
> -
> -	page = __xip_sparse_page;
> -	if (!page)
> -		return;
> -
> -retry:
> -	mutex_lock(&mapping->i_mmap_mutex);
> -	vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) {
> -		mm = vma->vm_mm;
> -		address = vma->vm_start +
> -			((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
> -		BUG_ON(address < vma->vm_start || address >= vma->vm_end);
> -		pte = page_check_address(page, mm, address, &ptl, 1);
> -		if (pte) {
> -			/* Nuke the page table entry. */
> -			flush_cache_page(vma, address, pte_pfn(*pte));
> -			pteval = ptep_clear_flush(vma, address, pte);
> -			page_remove_rmap(page);
> -			dec_mm_counter(mm, MM_FILEPAGES);
> -			BUG_ON(pte_dirty(pteval));
> -			pte_unmap_unlock(pte, ptl);
> -			/* must invalidate_page _before_ freeing the page */
> -			mmu_notifier_invalidate_page(mm, address);
> -			page_cache_release(page);
> -		}
> -	}
> -	mutex_unlock(&mapping->i_mmap_mutex);
> -
> -	if (locked) {
> -		mutex_unlock(&xip_sparse_mutex);
> -	} else if (read_seqcount_retry(&xip_sparse_seq, count)) {
> -		mutex_lock(&xip_sparse_mutex);
> -		locked = 1;
> -		goto retry;
> -	}
> -}
> -
> -/*
> - * xip_fault() is invoked via the vma operations vector for a
> - * mapped memory region to read in file data during a page fault.
> - *
> - * This function is derived from filemap_fault, but used for execute in place
> - */
> -static int xip_file_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
> -{
> -	struct file *file = vma->vm_file;
> -	struct address_space *mapping = file->f_mapping;
> -	struct inode *inode = mapping->host;
> -	pgoff_t size;
> -	void *xip_mem;
> -	unsigned long xip_pfn;
> -	struct page *page;
> -	int error;
> -
> -	/* XXX: are VM_FAULT_ codes OK? */
> -again:
> -	size = (i_size_read(inode) + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
> -	if (vmf->pgoff >= size)
> -		return VM_FAULT_SIGBUS;
> -
> -	error = mapping->a_ops->get_xip_mem(mapping, vmf->pgoff, 0,
> -						&xip_mem, &xip_pfn);
> -	if (likely(!error))
> -		goto found;
> -	if (error != -ENODATA)
> -		return VM_FAULT_OOM;
> -
> -	/* sparse block */
> -	if ((vma->vm_flags & (VM_WRITE | VM_MAYWRITE)) &&
> -	    (vma->vm_flags & (VM_SHARED | VM_MAYSHARE)) &&
> -	    (!(mapping->host->i_sb->s_flags & MS_RDONLY))) {
> -		int err;
> -
> -		/* maybe shared writable, allocate new block */
> -		mutex_lock(&xip_sparse_mutex);
> -		error = mapping->a_ops->get_xip_mem(mapping, vmf->pgoff, 1,
> -							&xip_mem, &xip_pfn);
> -		mutex_unlock(&xip_sparse_mutex);
> -		if (error)
> -			return VM_FAULT_SIGBUS;
> -		/* unmap sparse mappings at pgoff from all other vmas */
> -		__xip_unmap(mapping, vmf->pgoff);
> -
> -found:
> -		/* We must recheck i_size under i_mmap_mutex */
> -		mutex_lock(&mapping->i_mmap_mutex);
> -		size = (i_size_read(inode) + PAGE_CACHE_SIZE - 1) >>
> -							PAGE_CACHE_SHIFT;
> -		if (unlikely(vmf->pgoff >= size)) {
> -			mutex_unlock(&mapping->i_mmap_mutex);
> -			return VM_FAULT_SIGBUS;
> -		}
> -		err = vm_insert_mixed(vma, (unsigned long)vmf->virtual_address,
> -							xip_pfn);
> -		mutex_unlock(&mapping->i_mmap_mutex);
> -		if (err == -ENOMEM)
> -			return VM_FAULT_OOM;
> -		/*
> -		 * err == -EBUSY is fine, we've raced against another thread
> -		 * that faulted-in the same page
> -		 */
> -		if (err != -EBUSY)
> -			BUG_ON(err);
> -		return VM_FAULT_NOPAGE;
> -	} else {
> -		int err, ret = VM_FAULT_OOM;
> -
> -		mutex_lock(&xip_sparse_mutex);
> -		write_seqcount_begin(&xip_sparse_seq);
> -		error = mapping->a_ops->get_xip_mem(mapping, vmf->pgoff, 0,
> -							&xip_mem, &xip_pfn);
> -		if (unlikely(!error)) {
> -			write_seqcount_end(&xip_sparse_seq);
> -			mutex_unlock(&xip_sparse_mutex);
> -			goto again;
> -		}
> -		if (error != -ENODATA)
> -			goto out;
> -
> -		/* We must recheck i_size under i_mmap_mutex */
> -		mutex_lock(&mapping->i_mmap_mutex);
> -		size = (i_size_read(inode) + PAGE_CACHE_SIZE - 1) >>
> -							PAGE_CACHE_SHIFT;
> -		if (unlikely(vmf->pgoff >= size)) {
> -			ret = VM_FAULT_SIGBUS;
> -			goto unlock;
> -		}
> -		/* not shared and writable, use xip_sparse_page() */
> -		page = xip_sparse_page();
> -		if (!page)
> -			goto unlock;
> -		err = vm_insert_page(vma, (unsigned long)vmf->virtual_address,
> -					page);
> -		if (err == -ENOMEM)
> -			goto unlock;
> -
> -		ret = VM_FAULT_NOPAGE;
> -unlock:
> -		mutex_unlock(&mapping->i_mmap_mutex);
> -out:
> -		write_seqcount_end(&xip_sparse_seq);
> -		mutex_unlock(&xip_sparse_mutex);
> -
> -		return ret;
> -	}
> -}
> -
> -static const struct vm_operations_struct xip_file_vm_ops = {
> -	.fault		= xip_file_fault,
> -	.page_mkwrite	= filemap_page_mkwrite,
> -	.remap_pages	= generic_file_remap_pages,
> -};
> -
> -int xip_file_mmap(struct file * file, struct vm_area_struct * vma)
> -{
> -	BUG_ON(!file->f_mapping->a_ops->get_xip_mem);
> -
> -	file_accessed(file);
> -	vma->vm_ops = &xip_file_vm_ops;
> -	vma->vm_flags |= VM_MIXEDMAP;
> -	return 0;
> -}
> -EXPORT_SYMBOL_GPL(xip_file_mmap);
> -
> -/*
> - * truncate a page used for execute in place
> - * functionality is analog to block_truncate_page but does use get_xip_mem
> - * to get the page instead of page cache
>

2014-07-23 19:52:05

by Matthew Wilcox

Subject: Re: [PATCH v8 00/22] Support ext4 on NV-DIMMs

On Wed, Jul 23, 2014 at 06:58:38PM +0300, Boaz Harrosh wrote:
> Have you please pushed this tree to git hub. It used to be on the prd
> tree, if you could just add another branch there, it would be cool.
> (https://github.com/01org/prd)

Ross handles the care & feeding of that tree ... he'll push that branch
out soon.

> With reads they are all the same, with pmfs-prd a bit on top.
> But with writes, shmem is by far better. I'll investigate this further; it
> does not make sense - looks like a serialization of writes somewhere.

I imagine there's locking that needs to happen for real filesystems that
shmem can happily ignore. Maybe one of the journal locks?

> (All numbers rounded)
>
> fio 4K random reads 24 threads (different files):
> pmfs-prd: 7,300,000
> ext4-dax: 6,200,000
> shmem: 6,150,000
>
> fio 4k random writes 24 threads (different files):
> pmfs-prd: 900,000
> ext4-dax: 800,000
> shmem: 3,500,000
>
> BTW:
> With the new pmfs-prd, which I have also ported to the DAX tree - the old
> pmfs was copy/pasting xip-write and xip-mmap from generic code with some
> private hacks, but was still using the xip-read stuff. I have just moved the
> old xip-read stuff inside (like the write path did before) and so ported to
> the dax tree, which removed these functions. I have spent a couple of hours
> looking at the DAX read/write/mmap paths vs the current pmfs read/write/mmap
> paths. Though they are very similar and clearly originated from the same
> code, the new DAX code is even more alien to pmfs, especially with the
> current use of buffer-heads. (It could be made to work, but with lots of
> extra shimming.) So the current DAX form is more "block" based and not very
> suited to a pure memory-based FS. If any of those come up, they will need
> their own interfaces more similar to shmem's private implementation.

I think you misunderstand the point of buffer_head. It's not "this is
block based", it's "this is how filesystems describe their extents to
the VFS/block layer".
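To illustrate that role, here is a userspace sketch with a mocked-up buffer_head. The struct and the `toy_` names are simplified stand-ins for the kernel's definitions, not the real ext4 code: the point is only that get_block fills in a (physical block, length) extent description rather than performing block I/O.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in for the fields of struct buffer_head that a
 * get_block implementation fills in to describe an extent. */
struct toy_buffer_head {
	unsigned long long b_blocknr;	/* physical block number */
	size_t b_size;			/* bytes mapped starting there */
	int b_mapped;			/* roughly the BH_Mapped flag */
};

/* Toy "filesystem": logical blocks 0..99 map linearly to physical
 * blocks starting at 1000, as one contiguous 100-block extent. */
static int toy_get_block(unsigned long long lblock,
			 struct toy_buffer_head *bh, size_t blocksize)
{
	if (lblock >= 100)
		return -1;		/* hole / out of range */
	bh->b_blocknr = 1000 + lblock;
	/* Report the whole remaining extent, not just one block --
	 * this is how a caller like DAX learns how far it can go. */
	bh->b_size = (100 - lblock) * blocksize;
	bh->b_mapped = 1;
	return 0;
}
```

The filesystem stays in charge of its own mapping; the caller only sees "logical range X lives at physical range Y", which is exactly the extent-description role described above.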

> Above you said regarding the old xip:
> ".. but it has some races which are unfixable in the current design"
>
> I have tested pmfs-prd under xfstest and I'm passing lots of them. I cannot
> easily spot the race you are talking about. (More like dir seeks and missing
> stuff) Could you point me to a test I should run, to try and find it?

xfstests seems to have more "check correct functionality" tests than
"find races" tests. I'm trying to persuade our QA people to write more
tests for xfstests with little success so far. Consider truncate()
vs page fault. Consider truncate() versus read(). Consider write()
to a hole vs page fault on that hole.
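The kind of race test Matthew is asking for can be sketched as a userspace skeleton like the one below. The function and path names are illustrative; a real xfstest would additionally write a recognizable pattern and verify that readers never observe a torn mix of old and new data.

```c
#include <assert.h>
#include <fcntl.h>
#include <pthread.h>
#include <sys/types.h>
#include <unistd.h>

static int fd;
static volatile int stop;

/* One thread grows and shrinks the file as fast as it can. */
static void *truncator(void *arg)
{
	(void)arg;
	while (!stop) {
		ftruncate(fd, 0);
		ftruncate(fd, 4096);
	}
	return NULL;
}

/* The other thread reads concurrently: pread() may legitimately
 * return short reads or zero bytes while the file shrinks, but it
 * must never fail outright or return garbage past EOF. */
static int run_race_test(const char *path, int iterations)
{
	char buf[4096];
	pthread_t t;
	int i;

	fd = open(path, O_CREAT | O_RDWR | O_TRUNC, 0666);
	if (fd < 0)
		return -1;
	pthread_create(&t, NULL, truncator, NULL);
	for (i = 0; i < iterations; i++) {
		ssize_t n = pread(fd, buf, sizeof(buf), 0);
		if (n < 0)
			break;
	}
	stop = 1;
	pthread_join(t, NULL);
	close(fd);
	return i == iterations ? 0 : -1;
}
```

The same skeleton generalizes to truncate() vs page fault (replace pread with a load through mmap) and write-to-hole vs fault (replace the truncator with a writer).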

Oh, and as for pmfs itself ... consider the fact that it uses RCU for
locking, except the only call to synchronize_rcu() is commented out.
So it only has RCU read locks, which compile to, er, nothing. If the
RCU were actually protecting anything, the fact that it takes mutexes
while in an RCU locked section would rather invalidate the RCU protection.

So, yeah. Actually getting the locking right has an effect on
performance.

2014-07-23 19:57:51

by Matthew Wilcox

Subject: Re: [PATCH v8 10/22] Replace the XIP page fault handler with the DAX page fault handler

On Wed, Jul 23, 2014 at 07:57:09PM +0300, Boaz Harrosh wrote:
> > +/*
> > + * The user has performed a load from a hole in the file. Allocating
> > + * a new page in the file would cause excessive storage usage for
> > + * workloads with sparse files. We allocate a page cache page instead.
> > + * We'll kick it out of the page cache if it's ever written to,
> > + * otherwise it will simply fall out of the page cache under memory
> > + * pressure without ever having been dirtied.
> > + */
>
> Do you like this? I understand that you cannot use the ZERO page or
> such a global page in a page cache, since each instance needs its own
> list_head/index/mapping and so on. But why use any page at all?
>
> Use a global ZERO page, either the system global or one static-local to
> this system. Map it into the current application VMA in question, using its
> pfn (page_to_pfn), just like you do with real DAX-blocks from prd.

I must admit to not understanding the MM particularly well. There would
seem to be problems with rmap when doing this kind of trick. Also, this
is how reading from holes on regular filesystems work (except for the
part about kicking it out of page cache on a write). A third reason is
that there are some forms of PMem which are terribly slow to write to.
I have a longer-term plan to support these memories by transparently
caching them in DRAM and only writing back to the media on flush/sync.

> Say app A reads a hole, then app B reads the same hole. Both now point to
> the same zero page pfn. Now say app B writes to that hole: mkwrite will
> convert it to a real dax-block pfn and will map the new pfn in the faulting
> vma. But what about app A - will it read the old pfn? Who loops over all
> VMAs that have some mapping and invalidates those mappings?

That's the call to unmap_mapping_range().

> Same with truncate. App A mmap-reads a block, app B does a read-mmap and
> then a truncate. Who loops over all VMA mappings of these blocks to
> invalidate them? With the page cache and pages we have a list of all VMAs
> that currently have mappings of a page, but with dax-pfns (dax-blocks) we do
> *not* have a page struct; who keeps the list of currently active
> vma-mappings?

Same solution ... there's a list in the address_space of all the VMAs who
have it mapped. See truncate_pagecache() in mm/truncate.c (filesystems
usually call truncate_setsize()).

2014-07-23 20:55:35

by Theodore Ts'o

Subject: Re: [PATCH v8 00/22] Support ext4 on NV-DIMMs

On Wed, Jul 23, 2014 at 08:28:03AM -0700, Howard Chu wrote:
> >Perhaps you misunderstand the problem. There are many different kinds
> >of NV-DIMM out there today with different performance characteristics.
> >One that has been described to me has write times 1000x slower than read
> >times. In that situation, you can't possibly "just use it as page cache";
> >you need to place the read-often; write-rarely files on that media.
>
> Thanks for the clarification. Yes, I was assuming "NVDIMM" to mean something
> with DRAM performance + persistence, like http://www.agigatech.com/ddr3.php
> or http://www.vikingtechnology.com/nvdimm-technology . That's also the
> definition in Wikipedia http://en.wikipedia.org/wiki/NVDIMM

One example might be PCM (Phase Change Memory). PCM has read speeds
comparable to RAM or NOR flash, so you can certainly execute out of
PCM. However, writes generally tend to be bottlenecked on power
(since you need to effectively melt the PCM cell to cause it to change
from crystalline to amorphous state and vice versa).

This year's FAST conference quoted a commercially available
SATA-attached PCM device that was 16x faster than MLC SSD for 4k
reads, but 3.4x slower than MLC SSD for 4k writes, and it was
basically limited by how much power/heat could be fed into or removed
from the PCM chip. See:
https://www.usenix.org/conference/fast14/technical-sessions/presentation/kim
starting at about 2:00 to 10:30.

Cheers,

- Ted

2014-07-24 01:37:07

by Zhang, Tianfei

Subject: RE: [PATCH v8 05/22] Add vm_replace_mixed()

There is a double page_table_lock issue here; wouldn't free-and-realloc be simpler and more readable?

+	if (!pte_none(*pte)) {
+		if (!replace)
+			goto out_unlock;
+		VM_BUG_ON(!mutex_is_locked(&vma->vm_file->f_mapping->i_mmap_mutex));
+		pte_unmap_unlock(pte, ptl);
+		zap_page_range_single(vma, addr, PAGE_SIZE, NULL);
+		pte = get_locked_pte(mm, addr, &ptl);
+	}

Best,
Figo

> -----Original Message-----
> From: [email protected] [mailto:[email protected]] On
> Behalf Of Kirill A. Shutemov
> Sent: Wednesday, July 23, 2014 11:55 PM
> To: Matthew Wilcox
> Cc: Wilcox, Matthew R; [email protected]; [email protected];
> [email protected]
> Subject: Re: [PATCH v8 05/22] Add vm_replace_mixed()
>
> On Wed, Jul 23, 2014 at 10:27:45AM -0400, Matthew Wilcox wrote:
> > On Wed, Jul 23, 2014 at 05:20:48PM +0300, Kirill A. Shutemov wrote:
> > > On Wed, Jul 23, 2014 at 09:52:22AM -0400, Matthew Wilcox wrote:
> > > > I'd love to use a lighter-weight weapon! What would you recommend
> > > > using, zap_pte_range()?
> > >
> > > The most straight-forward way: extract body of pte cycle from
> > > zap_pte_range() to separate function -- zap_pte() -- and use it.
> >
> > OK, I can do that. What about the other parts of zap_page_range(), do
> > I need to call them?
> >
> > lru_add_drain();
>
> No, I guess..
>
> > tlb_gather_mmu(&tlb, mm, address, end);
> > tlb_finish_mmu(&tlb, address, end);
>
> New zap_pte() should tolerate tlb == NULL and do flush_tlb_page() or
> pte_clear_*flush or something.
>
> > update_hiwater_rss(mm);
>
> No: you cannot end up with lower rss after replace, iiuc.
>
> > mmu_notifier_invalidate_range_start(mm, address, end);
> > mmu_notifier_invalidate_range_end(mm, address, end);
>
> mmu_notifier_invalidate_page() should be enough.
>
> > > > if ((fd = open(argv[1], O_CREAT|O_RDWR, 0666)) < 0) {
> > > > perror(argv[1]);
> > > > exit(1);
> > > > }
> > > >
> > > > if (ftruncate(fd, 4096) < 0) {
> > >
> > > Shouldn't this be ftruncate(fd, 0)? Otherwise the memcpy() below
> > > will fault in page from backing storage, not hole and write will not
> > > replace anything.
> >
> > Ah, it was starting with a new file, hence the O_CREAT up above.
>
> Do you mean you pointed to new file all the time? O_CREAT doesn't truncate
> file if it exists, iirc.
>
> --
> Kirill A. Shutemov
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to
> [email protected]. For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: [email protected]
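Kirill's point about O_CREAT is easy to demonstrate from userspace; the file name below is arbitrary. O_CREAT only creates the file if it is missing, while truncation requires O_TRUNC (or an explicit ftruncate()):

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Write 5 bytes, then re-open with O_CREAT (no O_TRUNC) and report
 * the file size: the existing contents must survive the re-open. */
static off_t size_after_reopen(const char *path)
{
	struct stat st;
	int fd = open(path, O_CREAT | O_RDWR | O_TRUNC, 0666);

	write(fd, "hello", 5);
	close(fd);

	/* O_CREAT on an existing file is a no-op: no truncation. */
	fd = open(path, O_CREAT | O_RDWR, 0666);
	fstat(fd, &st);
	close(fd);
	unlink(path);
	return st.st_size;
}
```

So a test program that re-runs against a leftover file will see the old contents, which is why the ftruncate() call in Matthew's reproducer matters.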

2014-07-24 18:51:17

by Ross Zwisler

Subject: Re: [PATCH v8 00/22] Support ext4 on NV-DIMMs

On Wed, 2014-07-23 at 15:50 -0400, Matthew Wilcox wrote:
> On Wed, Jul 23, 2014 at 06:58:38PM +0300, Boaz Harrosh wrote:
> > Have you please pushed this tree to git hub. It used to be on the prd
> > tree, if you could just add another branch there, it would be cool.
> > (https://github.com/01org/prd)
>
> Ross handles the care & feeding of that tree ... he'll push that branch
> out soon.

I've updated the master branch of PRD's GitHub repo
(https://github.com/01org/prd) so that it is v3.16-rc6 + DAX v8 + PRD.

- Ross


2014-07-25 19:45:21

by Matthew Wilcox

Subject: Re: [PATCH v8 05/22] Add vm_replace_mixed()

On Wed, Jul 23, 2014 at 06:55:00PM +0300, Kirill A. Shutemov wrote:
> > update_hiwater_rss(mm);
>
> No: you cannot end up with lower rss after replace, iiuc.

Actually, you can ... when we replace a real page with a PFN, our rss
decreases.

> Do you mean you pointed to new file all the time? O_CREAT doesn't truncate
> file if it exists, iirc.

It was pointing to a new file. Still not sure why that one failed to trigger
the problem. The slightly modified version attached triggered the problem
*just fine* :-)

I've attached all the patches in my tree so far. For the v9 patch kit,
I'll keep patch 3 as a separate patch, but roll patches 1, 2 and 4 into
other patches.

I am seeing something odd though. When I run double-map with debugging
printks inserted in strategic spots in the kernel, I see four calls to
do_dax_fault(). The first two, as expected, are the loads from the two
mapped addresses. The third is via mkwrite, but then the fourth time
I get a regular page fault for write, and I don't understand why I get it.

Any ideas?


[ 880.966632] do_dax_fault: fault a8 page = (null) bh state 0 written 0 addr 7ff598835000
[ 880.966637] dax_load_hole: page = ffffea0002784730
[ 882.114387] do_dax_fault: fault a8 page = ffffea0002784730 bh state 0 written 0 addr 7ff598834000
[ 882.114389] dax_load_hole: page = ffffea0002784730
[ 882.780013] do_dax_fault: fault 5 page = ffffea0002784730 bh state 0 written 0 addr 7ff598835000
[ 882.780095] insert_pfn: pte = 8000000108200225
[ 882.780096] do_dax_fault: page = ffffea0002784730 pfn = 108200 error = 0
[ 882.780098] CPU: 1 PID: 1511 Comm: double-map Not tainted 3.16.0-rc6+ #89
[ 882.780099] Hardware name: Gigabyte Technology Co., Ltd. To be filled by O.E.M./Q87M-D2H, BIOS F6 08/03/2013
[ 882.780100] 0000000000000000 ffff8800b41e3ba0 ffffffff815c6e73 ffffea0002784730
[ 882.780102] ffff8800b41e3c88 ffffffff8124c319 00007ff598835000 ffff8800b2436ac0
[ 882.780104] 0000000000000005 ffffffffa01e3020 0000000000000004 ffff880000000005
[ 882.780106] Call Trace:
[ 882.780110] [<ffffffff815c6e73>] dump_stack+0x4d/0x66
[ 882.780114] [<ffffffff8124c319>] do_dax_fault+0x569/0x6f0
[ 882.780133] [<ffffffff8124c4df>] dax_fault+0x3f/0x90
[ 882.780136] [<ffffffff81023b2e>] ? native_sched_clock+0x2e/0xb0
[ 882.780137] [<ffffffff8124c53e>] dax_mkwrite+0xe/0x10
[ 882.780143] [<ffffffffa01db955>] ext4_dax_mkwrite+0x15/0x20 [ext4]
[ 882.780146] [<ffffffff811ab627>] do_page_mkwrite+0x47/0xb0
[ 882.780148] [<ffffffff811ad7f2>] do_wp_page+0x6e2/0x990
[ 882.780150] [<ffffffff811aff1b>] handle_mm_fault+0x6ab/0xf70
[ 882.780154] [<ffffffff81062d2c>] __do_page_fault+0x1ec/0x5b0
[ 882.780163] [<ffffffff81063112>] do_page_fault+0x22/0x30
[ 882.780165] [<ffffffff815d1b78>] page_fault+0x28/0x30
[ 882.780204] do_dax_fault: fault a9 page = (null) bh state 20 written 1 addr 7ff598835000
[ 882.780206] insert_pfn: pte = 8000000108200225
[ 882.780207] do_dax_fault: page = (null) pfn = 108200 error = 0
[ 882.780208] CPU: 1 PID: 1511 Comm: double-map Not tainted 3.16.0-rc6+ #89
[ 882.780208] Hardware name: Gigabyte Technology Co., Ltd. To be filled by O.E.M./Q87M-D2H, BIOS F6 08/03/2013
[ 882.780209] 0000000000000000 ffff8800b41e3bc0 ffffffff815c6e73 0000000000000000
[ 882.780211] ffff8800b41e3ca8 ffffffff8124c319 00007ff598835000 ffff8800b41e3c08
[ 882.780213] ffff8800a2a60608 ffffffffa01e3020 0000000000000000 ffff8800000000a9
[ 882.780214] Call Trace:
[ 882.780216] [<ffffffff815c6e73>] dump_stack+0x4d/0x66
[ 882.780218] [<ffffffff8124c319>] do_dax_fault+0x569/0x6f0
[ 882.780232] [<ffffffff8124c4df>] dax_fault+0x3f/0x90
[ 882.780238] [<ffffffffa01db975>] ext4_dax_fault+0x15/0x20 [ext4]
[ 882.780240] [<ffffffff811ab6d1>] __do_fault+0x41/0xd0
[ 882.780241] [<ffffffff811ae8a5>] do_shared_fault.isra.56+0x35/0x220
[ 882.780243] [<ffffffff811afb73>] handle_mm_fault+0x303/0xf70
[ 882.780246] [<ffffffff81062d2c>] __do_page_fault+0x1ec/0x5b0
[ 882.780254] [<ffffffff81063112>] do_page_fault+0x22/0x30
[ 882.780255] [<ffffffff815d1b78>] page_fault+0x28/0x30


diff --git a/fs/dax.c b/fs/dax.c
index b4fdfd9..4b0f928 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -257,6 +257,7 @@ static int dax_load_hole(struct address_space *mapping, struct page *page,
 	if (!page)
 		page = find_or_create_page(mapping, vmf->pgoff,
 						GFP_KERNEL | __GFP_ZERO);
+printk("%s: page = %p\n", __func__, page);
 	if (!page)
 		return VM_FAULT_OOM;
 	/* Recheck i_size under page lock to avoid truncate race */
@@ -332,6 +333,7 @@ static int do_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 	if (error || bh.b_size < PAGE_SIZE)
 		goto sigbus;
 
+printk("%s: fault %x page = %p bh state %lx written %d addr %lx\n", __func__, vmf->flags, page, bh.b_state, buffer_written(&bh), vaddr);
 	if (!buffer_written(&bh) && !vmf->cow_page) {
 		if (vmf->flags & FAULT_FLAG_WRITE) {
 			error = get_block(inode, block, &bh, 1);
@@ -372,6 +374,8 @@ static int do_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 	error = dax_get_pfn(&bh, &pfn, blkbits);
 	if (error > 0)
 		error = vm_replace_mixed(vma, vaddr, pfn);
+printk("%s: page = %p pfn = %lx error = %d\n", __func__, page, pfn, error);
+if ((vmf->flags & FAULT_FLAG_WRITE) || !(vmf->flags & FAULT_FLAG_USER)) dump_stack();
 
 	if (!page) {
 		mutex_unlock(&mapping->i_mmap_mutex);
diff --git a/mm/memory.c b/mm/memory.c
index a8e17ce..189716c 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1614,6 +1614,7 @@ static int insert_pfn(struct vm_area_struct *vma, unsigned long addr,
 	/* Ok, finally just insert the thing.. */
 	entry = pte_mkspecial(pfn_pte(pfn, prot));
 	set_pte_at(mm, addr, pte, entry);
+printk("%s: pte = %llx\n", __func__, pte_val(entry));
 	update_mmu_cache(vma, addr, pte);	/* XXX: why not for insert_page? */
 
 	retval = 0;


Attachments:
(No filename) (5.73 kB)
0001-dax-Only-unlock-the-i_mmap_mutex-if-we-locked-it.patch (842.00 B)
0002-dax-Call-delete_from_page_cache-after-unmap_mapping_.patch (928.00 B)
0003-Factor-zap_pte-out-of-zap_pte_range.patch (6.19 kB)
0004-mm-Introduce-zap_pte_single.patch (4.29 kB)
double-map.c (1.01 kB)

2014-07-28 13:26:56

by Kirill A. Shutemov

Subject: Re: [PATCH v8 05/22] Add vm_replace_mixed()

On Fri, Jul 25, 2014 at 03:44:50PM -0400, Matthew Wilcox wrote:
> On Wed, Jul 23, 2014 at 06:55:00PM +0300, Kirill A. Shutemov wrote:
> > > update_hiwater_rss(mm);
> >
> > No: you cannot end up with lower rss after replace, iiuc.
>
> Actually, you can ... when we replace a real page with a PFN, our rss
> decreases.

Okay.

> > Do you mean you pointed to new file all the time? O_CREAT doesn't truncate
> > file if it exists, iirc.
>
> It was pointing to a new file. Still not sure why that one failed to trigger
> the problem. The slightly modified version attached triggered the problem
> *just fine* :-)
>
> I've attached all the patches in my tree so far. For the v9 patch kit,
> I'll keep patch 3 as a separate patch, but roll patches 1, 2 and 4 into
> other patches.
>
> I am seeing something odd though. When I run double-map with debugging
> printks inserted in strategic spots in the kernel, I see four calls to
> do_dax_fault(). The first two, as expected, are the loads from the two
> mapped addresses. The third is via mkwrite, but then the fourth time
> I get a regular page fault for write, and I don't understand why I get it.
>
> Any ideas?

unmap_mapping_range() clears pte you've just set by vm_replace_mixed() on
third fault.

And the locking looks wrong: it seems you need to hold i_mmap_mutex while
replacing the hole page with a pfn. Your VM_BUG_ON() in zap_pte_single()
triggers on my setup.

> +static void zap_pte_single(struct vm_area_struct *vma, pte_t *pte,
> +		unsigned long addr)
> +{
> +	struct mm_struct *mm = vma->vm_mm;
> +	int force_flush = 0;
> +	int rss[NR_MM_COUNTERS];
> +
> +	VM_BUG_ON(!mutex_is_locked(&vma->vm_file->f_mapping->i_mmap_mutex));

It's the wrong place for VM_BUG_ON(): zap_pte_single() on an anon mapping
should work fine.

> +
> +	init_rss_vec(rss);

A vector to commit a single update to the mm counters? What about an inline
counter update for the rss == NULL case?

> +	update_hiwater_rss(mm);
> +	flush_cache_page(vma, addr, pte_pfn(*pte));
> +	zap_pte(NULL, vma, pte, addr, NULL, rss, &force_flush);
> +	flush_tlb_page(vma, addr);
> +	add_mm_rss_vec(mm, rss);
> +}
> +

--
Kirill A. Shutemov

2014-07-29 01:56:18

by Zhang, Tianfei

Subject: RE: [PATCH v8 05/22] Add vm_replace_mixed()



> -----Original Message-----
> From: [email protected] [mailto:[email protected]] On
> Behalf Of Kirill A. Shutemov
> Sent: Monday, July 28, 2014 9:26 PM
> To: Matthew Wilcox
> Cc: Wilcox, Matthew R; [email protected]; [email protected];
> [email protected]
> Subject: Re: [PATCH v8 05/22] Add vm_replace_mixed()
>
> On Fri, Jul 25, 2014 at 03:44:50PM -0400, Matthew Wilcox wrote:
> > On Wed, Jul 23, 2014 at 06:55:00PM +0300, Kirill A. Shutemov wrote:
> > > > update_hiwater_rss(mm);
> > >
> > > No: you cannot end up with lower rss after replace, iiuc.
> >
> > Actually, you can ... when we replace a real page with a PFN, our rss
> > decreases.
>
> Okay.
>
> > > Do you mean you pointed to new file all the time? O_CREAT doesn't
> > > truncate file if it exists, iirc.
> >
> > It was pointing to a new file. Still not sure why that one failed to
> > trigger the problem. The slightly modified version attached triggered
> > the problem *just fine* :-)
> >
> > I've attached all the patches in my tree so far. For the v9 patch
> > kit, I'll keep patch 3 as a separate patch, but roll patches 1, 2 and
> > 4 into other patches.
> >
> > I am seeing something odd though. When I run double-map with
> > debugging printks inserted in strategic spots in the kernel, I see
> > four calls to do_dax_fault(). The first two, as expected, are the
> > loads from the two mapped addresses. The third is via mkwrite, but
> > then the fourth time I get a regular page fault for write, and I don't
> understand why I get it.
> >
> > Any ideas?
>
> unmap_mapping_range() clears pte you've just set by vm_replace_mixed() on
> third fault.
>
> And locking looks wrong: it seems you need to hold i_mmap_mutex while
> replacing hole page with pfn. Your VM_BUG_ON() in zap_pte_single() triggers
> on my setup.
>
> > +static void zap_pte_single(struct vm_area_struct *vma, pte_t *pte,
> > +		unsigned long addr)
> > +{
> > +	struct mm_struct *mm = vma->vm_mm;
> > +	int force_flush = 0;
> > +	int rss[NR_MM_COUNTERS];
> > +
> > +	VM_BUG_ON(!mutex_is_locked(&vma->vm_file->f_mapping->i_mmap_mutex));
>
> It's the wrong place for VM_BUG_ON(): zap_pte_single() on an anon mapping
> should work fine.

Hi Shutemov:
1. I am confused: why does insert_page() drop the PageAnon case, while insert_pfn() does not?

2.
remap_vmalloc_range() -> vm_insert_page() -> insert_page() -> inc_mm_counter_fast(mm, MM_FILEPAGES);

So in this scenario the vmalloc page may not be a page cache page; why increase MM_FILEPAGES?


>
> > +
> > +	init_rss_vec(rss);
>
> Vector to commit single update to mm counters? What about inline counters
> update for rss == NULL case?
>
> > +	update_hiwater_rss(mm);
> > +	flush_cache_page(vma, addr, pte_pfn(*pte));
> > +	zap_pte(NULL, vma, pte, addr, NULL, rss, &force_flush);
> > +	flush_tlb_page(vma, addr);
> > +	add_mm_rss_vec(mm, rss);
> > +}
> > +
>
> --
> Kirill A. Shutemov
>

2014-07-30 16:03:31

by Boaz Harrosh

Subject: Re: [PATCH v8 04/22] Change direct_access calling convention

On 07/22/2014 10:47 PM, Matthew Wilcox wrote:
> In order to support accesses to larger chunks of memory, pass in a
> 'size' parameter (counted in bytes), and return the amount available at
> that address.
>
> Support partitioning the underlying block device through a new helper
> function, bdev_direct_access(), since partition handling should be done
> in the block layer, not the filesystem, nor device driver.
>
> Signed-off-by: Matthew Wilcox <[email protected]>
> Reviewed-by: Jan Kara <[email protected]>
> ---
>  Documentation/filesystems/xip.txt | 15 +++++++++------
>  arch/powerpc/sysdev/axonram.c     | 12 ++++--------
>  drivers/block/brd.c               |  8 +++++---
>  drivers/s390/block/dcssblk.c      | 19 ++++++++++---------
>  fs/block_dev.c                    | 28 ++++++++++++++++++++++++++++
>  fs/ext2/xip.c                     | 31 +++++++++++++------------------
>  include/linux/blkdev.h            |  6 ++++--
>  7 files changed, 73 insertions(+), 46 deletions(-)
>
> diff --git a/Documentation/filesystems/xip.txt b/Documentation/filesystems/xip.txt
> index 0466ee5..b62eabf 100644
> --- a/Documentation/filesystems/xip.txt
> +++ b/Documentation/filesystems/xip.txt
> @@ -28,12 +28,15 @@ Implementation
> Execute-in-place is implemented in three steps: block device operation,
> address space operation, and file operations.
>
> -A block device operation named direct_access is used to retrieve a
> -reference (pointer) to a block on-disk. The reference is supposed to be
> -cpu-addressable, physical address and remain valid until the release operation
> -is performed. A struct block_device reference is used to address the device,
> -and a sector_t argument is used to identify the individual block. As an
> -alternative, memory technology devices can be used for this.
> +A block device operation named direct_access is used to translate the
> +block device sector number to a page frame number (pfn) that identifies
> +the physical page for the memory. It also returns a kernel virtual
> +address that can be used to access the memory.
> +
> +The direct_access method takes a 'size' parameter that indicates the
> +number of bytes being requested. The function should return the number
> +of bytes that it can provide, although it must not exceed the number of
> +bytes requested. It may also return a negative errno if an error occurs.
>
> The block device operation is optional, these block devices support it as of
> today:
> diff --git a/arch/powerpc/sysdev/axonram.c b/arch/powerpc/sysdev/axonram.c
> index 830edc8..3ee1c08 100644
> --- a/arch/powerpc/sysdev/axonram.c
> +++ b/arch/powerpc/sysdev/axonram.c
> @@ -139,17 +139,13 @@ axon_ram_make_request(struct request_queue *queue, struct bio *bio)
>   * axon_ram_direct_access - direct_access() method for block device
>   * @device, @sector, @data: see block_device_operations method
>   */
> -static int
> +static long
>  axon_ram_direct_access(struct block_device *device, sector_t sector,
> -		       void **kaddr, unsigned long *pfn)
> +		       void **kaddr, unsigned long *pfn, long size)
>  {
>  	struct axon_ram_bank *bank = device->bd_disk->private_data;
> -	loff_t offset;
> +	loff_t offset = (loff_t)sector << AXON_RAM_SECTOR_SHIFT;
>  
> -	offset = sector;
> -	if (device->bd_part != NULL)
> -		offset += device->bd_part->start_sect;
> -	offset <<= AXON_RAM_SECTOR_SHIFT;
>  	if (offset >= bank->size) {
>  		dev_err(&bank->device->dev, "Access outside of address space\n");
>  		return -ERANGE;
> @@ -158,7 +154,7 @@ axon_ram_direct_access(struct block_device *device, sector_t sector,
>  	*kaddr = (void *)(bank->ph_addr + offset);
>  	*pfn = virt_to_phys(*kaddr) >> PAGE_SHIFT;
>  
> -	return 0;
> +	return min_t(long, size, bank->size - offset);
>  }
>  
>  static const struct block_device_operations axon_ram_devops = {
> diff --git a/drivers/block/brd.c b/drivers/block/brd.c
> index c7d138e..96e4c96 100644
> --- a/drivers/block/brd.c
> +++ b/drivers/block/brd.c
> @@ -370,8 +370,8 @@ static int brd_rw_page(struct block_device *bdev, sector_t sector,
>  }
>  
>  #ifdef CONFIG_BLK_DEV_XIP
> -static int brd_direct_access(struct block_device *bdev, sector_t sector,
> -			void **kaddr, unsigned long *pfn)
> +static long brd_direct_access(struct block_device *bdev, sector_t sector,
> +			void **kaddr, unsigned long *pfn, long size)
>  {
>  	struct brd_device *brd = bdev->bd_disk->private_data;
>  	struct page *page;

Matthew hi, you need to remove the wrong checks, both alignment and
size, from this driver.
(The alignment check is redundant, but the size check is wrong.)

These checks should be guaranteed by the blk wrapper. (See below.)

> @@ -388,7 +388,9 @@ static int brd_direct_access(struct block_device *bdev, sector_t sector,
>  	*kaddr = page_address(page);
>  	*pfn = page_to_pfn(page);
>  
> -	return 0;
> +	/* Could optimistically check to see if the next page in the
> +	 * file is mapped to the next page of physical RAM */
> +	return min_t(long, PAGE_SIZE, size);
>  }
>  #endif
>
> diff --git a/drivers/s390/block/dcssblk.c b/drivers/s390/block/dcssblk.c
> index 0f47175..58958d1 100644
> --- a/drivers/s390/block/dcssblk.c
> +++ b/drivers/s390/block/dcssblk.c
> @@ -28,8 +28,8 @@
>  static int dcssblk_open(struct block_device *bdev, fmode_t mode);
>  static void dcssblk_release(struct gendisk *disk, fmode_t mode);
>  static void dcssblk_make_request(struct request_queue *q, struct bio *bio);
> -static int dcssblk_direct_access(struct block_device *bdev, sector_t secnum,
> -				 void **kaddr, unsigned long *pfn);
> +static long dcssblk_direct_access(struct block_device *bdev, sector_t secnum,
> +				 void **kaddr, unsigned long *pfn, long size);
>  
>  static char dcssblk_segments[DCSSBLK_PARM_LEN] = "\0";
>  
> @@ -866,25 +866,26 @@ fail:
>  	bio_io_error(bio);
>  }
>  
> -static int
> +static long
>  dcssblk_direct_access (struct block_device *bdev, sector_t secnum,
> -			void **kaddr, unsigned long *pfn)
> +			void **kaddr, unsigned long *pfn, long size)
>  {
>  	struct dcssblk_dev_info *dev_info;
> -	unsigned long pgoff;
> +	unsigned long offset, dev_sz;
>  
>  	dev_info = bdev->bd_disk->private_data;
>  	if (!dev_info)
>  		return -ENODEV;
> +	dev_sz = dev_info->end - dev_info->start;
>  	if (secnum % (PAGE_SIZE/512))
>  		return -EINVAL;
> -	pgoff = secnum / (PAGE_SIZE / 512);
> -	if ((pgoff+1)*PAGE_SIZE-1 > dev_info->end - dev_info->start)
> +	offset = secnum * 512;
> +	if (offset > dev_sz)
>  		return -ERANGE;
> -	*kaddr = (void *) (dev_info->start+pgoff*PAGE_SIZE);
> +	*kaddr = (void *) (dev_info->start + offset);
>  	*pfn = virt_to_phys(*kaddr) >> PAGE_SHIFT;
>  
> -	return 0;
> +	return min_t(long, size, dev_sz - offset);
>  }
>  
>  static void
> diff --git a/fs/block_dev.c b/fs/block_dev.c
> index 6d72746..f1a158e 100644
> --- a/fs/block_dev.c
> +++ b/fs/block_dev.c
> @@ -427,6 +427,34 @@ int bdev_write_page(struct block_device *bdev, sector_t sector,
> }
> EXPORT_SYMBOL_GPL(bdev_write_page);
>
> +/**
> + * bdev_direct_access() - Get the address for directly-accessible memory
> + * @bdev: The device containing the memory
> + * @sector: The offset within the device
> + * @addr: Where to put the address of the memory
> + * @pfn: The Page Frame Number for the memory
> + * @size: The number of bytes requested
> + *
> + * If a block device is made up of directly addressable memory, this function
> + * will tell the caller the PFN and the address of the memory. The address
> + * may be directly dereferenced within the kernel without the need to call
> + * ioremap(), kmap() or similar. The PFN is suitable for inserting into
> + * page tables.
> + *
> + * Return: negative errno if an error occurs, otherwise the number of bytes
> + * accessible at this address.
> + */
> +long bdev_direct_access(struct block_device *bdev, sector_t sector,
> + void **addr, unsigned long *pfn, long size)
> +{
> + const struct block_device_operations *ops = bdev->bd_disk->fops;
> + if (!ops->direct_access)
> + return -EOPNOTSUPP;

You need to check alignment on PAGE_SIZE since this API requires it, due
to pfn being defined as a page number.

(Unless you want to define another output param like page_offset,
but that exercise can be left to the caller.)

You also need to check against the partition boundary, so something like:

+ if (sector & (PAGE_SECTORS-1))
+ return -EINVAL;
+ if (unlikely(sector + size > part_nr_sects_read(bdev->bd_part)))
+ return -ERANGE;

Then perhaps you can remove that check from drivers

> + return ops->direct_access(bdev, sector + get_start_sect(bdev), addr,
> + pfn, size);
> +}
> +EXPORT_SYMBOL_GPL(bdev_direct_access);
> +
> /*
> * pseudo-fs
> */
> diff --git a/fs/ext2/xip.c b/fs/ext2/xip.c
> index e98171a..bbc5fec 100644
> --- a/fs/ext2/xip.c
> +++ b/fs/ext2/xip.c
> @@ -13,18 +13,12 @@
> #include "ext2.h"
> #include "xip.h"
>
> -static inline int
> -__inode_direct_access(struct inode *inode, sector_t block,
> - void **kaddr, unsigned long *pfn)
> +static inline long __inode_direct_access(struct inode *inode, sector_t block,
> + void **kaddr, unsigned long *pfn, long size)
> {
> struct block_device *bdev = inode->i_sb->s_bdev;
> - const struct block_device_operations *ops = bdev->bd_disk->fops;
> - sector_t sector;
> -
> - sector = block * (PAGE_SIZE / 512); /* ext2 block to bdev sector */
> -
> - BUG_ON(!ops->direct_access);
> - return ops->direct_access(bdev, sector, kaddr, pfn);
> + sector_t sector = block * (PAGE_SIZE / 512);

Style: Need a space between declarations and code (have you run checkpatch?)

> + return bdev_direct_access(bdev, sector, kaddr, pfn, size);
> }
>
> static inline int
> @@ -53,12 +47,13 @@ ext2_clear_xip_target(struct inode *inode, sector_t block)
> {
> void *kaddr;
> unsigned long pfn;
> - int rc;
> + long size;
>
> - rc = __inode_direct_access(inode, block, &kaddr, &pfn);
> - if (!rc)
> - clear_page(kaddr);
> - return rc;
> + size = __inode_direct_access(inode, block, &kaddr, &pfn, PAGE_SIZE);
> + if (size < 0)

if(size < PAGE_SIZE), No?

You are going to memset PAGE_SIZE below in clear_page.

> + return size;
> + clear_page(kaddr);
> + return 0;
> }
>
> void ext2_xip_verify_sb(struct super_block *sb)
> @@ -77,7 +72,7 @@ void ext2_xip_verify_sb(struct super_block *sb)
> int ext2_get_xip_mem(struct address_space *mapping, pgoff_t pgoff, int create,
> void **kmem, unsigned long *pfn)
> {
> - int rc;
> + long rc;
> sector_t block;
>
> /* first, retrieve the sector number */
> @@ -86,6 +81,6 @@ int ext2_get_xip_mem(struct address_space *mapping, pgoff_t pgoff, int create,
> return rc;
>
> /* retrieve address of the target data */
> - rc = __inode_direct_access(mapping->host, block, kmem, pfn);
> - return rc;
> + rc = __inode_direct_access(mapping->host, block, kmem, pfn, PAGE_SIZE);
> + return (rc < 0) ? rc : 0;
> }
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index 8699bcf..bc5ea9e 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -1613,8 +1613,8 @@ struct block_device_operations {
> int (*rw_page)(struct block_device *, sector_t, struct page *, int rw);
> int (*ioctl) (struct block_device *, fmode_t, unsigned, unsigned long);
> int (*compat_ioctl) (struct block_device *, fmode_t, unsigned, unsigned long);
> - int (*direct_access) (struct block_device *, sector_t,
> - void **, unsigned long *);
> + long (*direct_access) (struct block_device *, sector_t,
> + void **, unsigned long *pfn, long size);
> unsigned int (*check_events) (struct gendisk *disk,
> unsigned int clearing);
> /* ->media_changed() is DEPRECATED, use ->check_events() instead */
> @@ -1632,6 +1632,8 @@ extern int __blkdev_driver_ioctl(struct block_device *, fmode_t, unsigned int,
> extern int bdev_read_page(struct block_device *, sector_t, struct page *);
> extern int bdev_write_page(struct block_device *, sector_t, struct page *,
> struct writeback_control *);
> +extern long bdev_direct_access(struct block_device *, sector_t, void **addr,
> + unsigned long *pfn, long size);
> #else /* CONFIG_BLOCK */
>
> struct block_device;
>

Thanks
Boaz

2014-07-30 16:12:52

by Boaz Harrosh

[permalink] [raw]
Subject: Re: [PATCH v8 04/22] Change direct_access calling convention

On 07/30/2014 07:03 PM, Boaz Harrosh wrote:
<>
>> +long bdev_direct_access(struct block_device *bdev, sector_t sector,
>> + void **addr, unsigned long *pfn, long size)
>> +{
>> + const struct block_device_operations *ops = bdev->bd_disk->fops;
>> + if (!ops->direct_access)
>> + return -EOPNOTSUPP;
>
> You need to check alignment on PAGE_SIZE since this API requires it, do
> to pfn defined to a page_number.
>
> (Unless you want to define another output-param like page_offset.
> but this exercise can be left to the caller)
>
> You also need to check against the partition boundary. so something like:
>
> + if (sector & (PAGE_SECTORS-1))
> + return -EINVAL;
> + if (unlikely(sector + size > part_nr_sects_read(bdev->bd_part)))

Of course, I was wrong here: size is in bytes, not in sectors. Which points
out that maybe this API needs to be in sectors.

[Actually it needs to be in pages, both size and offset, because of the
return of pfn, but it's your call.]

Anyway, my code above needs to be fixed with
(size + 512 - 1) / 512

Thanks
Boaz

> + return -ERANGE;
>
> Then perhaps you can remove that check from drivers
>
>> + return ops->direct_access(bdev, sector + get_start_sect(bdev), addr,
>> + pfn, size);
>> +}
>> +EXPORT_SYMBOL_GPL(bdev_direct_access);
<>

2014-07-30 19:45:08

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [PATCH v8 04/22] Change direct_access calling convention

On Wed, Jul 30, 2014 at 07:03:24PM +0300, Boaz Harrosh wrote:
> > +long bdev_direct_access(struct block_device *bdev, sector_t sector,
> > + void **addr, unsigned long *pfn, long size)
> > +{
> > + const struct block_device_operations *ops = bdev->bd_disk->fops;
> > + if (!ops->direct_access)
> > + return -EOPNOTSUPP;
>
> You need to check alignment on PAGE_SIZE since this API requires it, do
> to pfn defined to a page_number.
>
> (Unless you want to define another output-param like page_offset.
> but this exercise can be left to the caller)
>
> You also need to check against the partition boundary. so something like:
>
> + if (sector & (PAGE_SECTORS-1))
> + return -EINVAL;

Mmm. PAGE_SECTORS is private to brd (and also private to bcache!) at
this point. We've got a real mess of defines of SECTOR_SIZE, SECTORSIZE,
SECTOR_SHIFT and so on, dotted throughout various random include files.
I am not the river to flush those Augean stables today.

I'll go with this, from the dcssblk driver:

if (sector % (PAGE_SIZE / 512))
return -EINVAL;

> + if (unlikely(sector + size > part_nr_sects_read(bdev->bd_part)))
> + return -ERANGE;
>
> Then perhaps you can remove that check from drivers

As noted in your followup, size is in terms of bytes. Perhaps it should
be named 'length' to make it more clear that it's a byte count, not a
sector count?

In any case, this looks best to me:

if ((sector + DIV_ROUND_UP(size, 512)) >
part_nr_sects_read(bdev->bd_part))
return -ERANGE;


> Style: Need a space between declaration and code (have you check-patch)

That's a bullshit check. I don't know why it's in checkpatch.

> > + if (size < 0)
>
> if(size < PAGE_SIZE), No?

No, absolutely not. PAGE_SIZE is unsigned long, which (if I understand
my C integer promotions correctly) means that 'size' gets promoted to
an unsigned long, and we compare them unsigned, so errors will never be
caught by this check.

2014-07-30 20:34:39

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [PATCH v8 04/22] Change direct_access calling convention

On Wed, Jul 30, 2014 at 07:12:44PM +0300, Boaz Harrosh wrote:
> Off course I was wrong here size is in bytes not in sectors. Which points
> out that maybe this API needs to be in sectors.
>
> [Actually it needs to be in pages both size and offset, because of return of
> pfn, but its your call.]

I considered a number of options here. The VM wants things to be in pages.
The filesystem wants things to be in block size. The block device wants
things to be in sectors. It's all a mess when you try and converge.
Everybody understands bytes.

Here's what I've come up with on top of patch 4/22. Let me know what
you think.


diff --git a/arch/powerpc/sysdev/axonram.c b/arch/powerpc/sysdev/axonram.c
index 3ee1c08..741293f 100644
--- a/arch/powerpc/sysdev/axonram.c
+++ b/arch/powerpc/sysdev/axonram.c
@@ -146,11 +146,6 @@ axon_ram_direct_access(struct block_device *device, sector_t sector,
struct axon_ram_bank *bank = device->bd_disk->private_data;
loff_t offset = (loff_t)sector << AXON_RAM_SECTOR_SHIFT;

- if (offset >= bank->size) {
- dev_err(&bank->device->dev, "Access outside of address space\n");
- return -ERANGE;
- }
-
*kaddr = (void *)(bank->ph_addr + offset);
*pfn = virt_to_phys(*kaddr) >> PAGE_SHIFT;

diff --git a/drivers/block/brd.c b/drivers/block/brd.c
index 33a39e7..3483458 100644
--- a/drivers/block/brd.c
+++ b/drivers/block/brd.c
@@ -378,10 +378,6 @@ static long brd_direct_access(struct block_device *bdev, sector_t sector,

if (!brd)
return -ENODEV;
- if (sector & (PAGE_SECTORS-1))
- return -EINVAL;
- if (sector + PAGE_SECTORS > get_capacity(bdev->bd_disk))
- return -ERANGE;
page = brd_insert_page(brd, sector);
if (!page)
return -ENOSPC;
diff --git a/drivers/s390/block/dcssblk.c b/drivers/s390/block/dcssblk.c
index 58958d1..2ee5556 100644
--- a/drivers/s390/block/dcssblk.c
+++ b/drivers/s390/block/dcssblk.c
@@ -877,11 +877,7 @@ dcssblk_direct_access (struct block_device *bdev, sector_t secnum,
if (!dev_info)
return -ENODEV;
dev_sz = dev_info->end - dev_info->start;
- if (secnum % (PAGE_SIZE/512))
- return -EINVAL;
offset = secnum * 512;
- if (offset > dev_sz)
- return -ERANGE;
*kaddr = (void *) (dev_info->start + offset);
*pfn = virt_to_phys(*kaddr) >> PAGE_SHIFT;

diff --git a/fs/block_dev.c b/fs/block_dev.c
index f1a158e..93ebdd53 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -450,8 +450,14 @@ long bdev_direct_access(struct block_device *bdev, sector_t sector,
const struct block_device_operations *ops = bdev->bd_disk->fops;
if (!ops->direct_access)
return -EOPNOTSUPP;
- return ops->direct_access(bdev, sector + get_start_sect(bdev), addr,
- pfn, size);
+ if ((sector + DIV_ROUND_UP(size, 512)) >
+ part_nr_sects_read(bdev->bd_part))
+ return -ERANGE;
+ sector += get_start_sect(bdev);
+ if (sector % (PAGE_SIZE / 512))
+ return -EINVAL;
+ size = ops->direct_access(bdev, sector, addr, pfn, size);
+ return size ? size : -ERANGE;
}
EXPORT_SYMBOL_GPL(bdev_direct_access);

2014-07-31 10:11:49

by Boaz Harrosh

[permalink] [raw]
Subject: Re: [PATCH v8 04/22] Change direct_access calling convention

On 07/30/2014 10:45 PM, Matthew Wilcox wrote:
<>
>> + if (sector & (PAGE_SECTORS-1))
>> + return -EINVAL;
>
> Mmm. PAGE_SECTORS is private to brd (and also private to bcache!) at
> this point. We've got a real mess of defines of SECTOR_SIZE, SECTORSIZE,
> SECTOR_SHIFT and so on, dotted throughout various random include files.
> I am not the river to flush those Augean stables today.
>
> I'll go with this, from the dcssblk driver:
>
> if (sector % (PAGE_SIZE / 512))
> return -EINVAL;
>

Sigh, right, sure, I did not mean to pick that fight. Works as well.

<>
>> Style: Need a space between declaration and code (have you check-patch)
>
> That's a bullshit check. I don't know why it's in checkpatch.
>

I did not invent the rules. But I do respect them. I think the merit
of sticking to some common style is much higher than any particular
style choice. Though this particular one I do like, because of the
C rule that forces all declarations before code, so it makes maintenance
easier. In any case, maintainers are supposed to run checkpatch
before submission; some do ;-)

<>
>>> + if (size < 0)
>>
>> if(size < PAGE_SIZE), No?
>
> No, absolutely not. PAGE_SIZE is unsigned long, which (if I understand
> my C integer promotions correctly) means that 'size' gets promoted to
> an unsigned long, and we compare them unsigned, so errors will never be
> caught by this check.

Good point. I agree that you need a cast, i.e.

if (size < (long)PAGE_SIZE)

The reason I'm saying this is because of a bug I actually hit when
playing with partitioning and fdisk: it turned out that the last partition's
size was not page aligned, and code that checked for (< 0) crashed because
prd returned the last two sectors of the partition. Since your API is sector
based this can happen for you here: before you memset a PAGE_SIZE in
clear_page you need to test that there is space, no?

>
>

Thanks
Boaz

2014-07-31 10:17:04

by Boaz Harrosh

[permalink] [raw]
Subject: Re: [PATCH v8 04/22] Change direct_access calling convention

On 07/30/2014 11:34 PM, Matthew Wilcox wrote:
> On Wed, Jul 30, 2014 at 07:12:44PM +0300, Boaz Harrosh wrote:
>> Off course I was wrong here size is in bytes not in sectors. Which points
>> out that maybe this API needs to be in sectors.
>>
>> [Actually it needs to be in pages both size and offset, because of return of
>> pfn, but its your call.]
>
> I considered a number of options here. The VM wants things to be in pages.
> The filesystem wants things to be in block size. The block device wants
> things to be in sectors. It's all a mess when you try and converge.
> Everybody understands bytes.
>
> Here's what I've come up with on top of patch 4/22. Let me know what
> you think.
>
>
> diff --git a/arch/powerpc/sysdev/axonram.c b/arch/powerpc/sysdev/axonram.c
> index 3ee1c08..741293f 100644
> --- a/arch/powerpc/sysdev/axonram.c
> +++ b/arch/powerpc/sysdev/axonram.c
> @@ -146,11 +146,6 @@ axon_ram_direct_access(struct block_device *device, sector_t sector,
> struct axon_ram_bank *bank = device->bd_disk->private_data;
> loff_t offset = (loff_t)sector << AXON_RAM_SECTOR_SHIFT;
>
> - if (offset >= bank->size) {
> - dev_err(&bank->device->dev, "Access outside of address space\n");
> - return -ERANGE;
> - }
> -
> *kaddr = (void *)(bank->ph_addr + offset);
> *pfn = virt_to_phys(*kaddr) >> PAGE_SHIFT;
>
> diff --git a/drivers/block/brd.c b/drivers/block/brd.c
> index 33a39e7..3483458 100644
> --- a/drivers/block/brd.c
> +++ b/drivers/block/brd.c
> @@ -378,10 +378,6 @@ static long brd_direct_access(struct block_device *bdev, sector_t sector,
>
> if (!brd)
> return -ENODEV;
> - if (sector & (PAGE_SECTORS-1))
> - return -EINVAL;
> - if (sector + PAGE_SECTORS > get_capacity(bdev->bd_disk))
> - return -ERANGE;
> page = brd_insert_page(brd, sector);
> if (!page)
> return -ENOSPC;
> diff --git a/drivers/s390/block/dcssblk.c b/drivers/s390/block/dcssblk.c
> index 58958d1..2ee5556 100644
> --- a/drivers/s390/block/dcssblk.c
> +++ b/drivers/s390/block/dcssblk.c
> @@ -877,11 +877,7 @@ dcssblk_direct_access (struct block_device *bdev, sector_t secnum,
> if (!dev_info)
> return -ENODEV;
> dev_sz = dev_info->end - dev_info->start;
> - if (secnum % (PAGE_SIZE/512))
> - return -EINVAL;
> offset = secnum * 512;
> - if (offset > dev_sz)
> - return -ERANGE;
> *kaddr = (void *) (dev_info->start + offset);
> *pfn = virt_to_phys(*kaddr) >> PAGE_SHIFT;
>
> diff --git a/fs/block_dev.c b/fs/block_dev.c
> index f1a158e..93ebdd53 100644
> --- a/fs/block_dev.c
> +++ b/fs/block_dev.c
> @@ -450,8 +450,14 @@ long bdev_direct_access(struct block_device *bdev, sector_t sector,
> const struct block_device_operations *ops = bdev->bd_disk->fops;
> if (!ops->direct_access)
> return -EOPNOTSUPP;
> - return ops->direct_access(bdev, sector + get_start_sect(bdev), addr,
> - pfn, size);
> + if ((sector + DIV_ROUND_UP(size, 512)) >
> + part_nr_sects_read(bdev->bd_part))
> + return -ERANGE;
> + sector += get_start_sect(bdev);
> + if (sector % (PAGE_SIZE / 512))
> + return -EINVAL;
> + size = ops->direct_access(bdev, sector, addr, pfn, size);
> + return size ? size : -ERANGE;
> }
> EXPORT_SYMBOL_GPL(bdev_direct_access);
>
>

Very nice, yes.

Only the check at ext2 is missing, I think. Once you send the complete patch
I'll review it again. It would be very helpful if you sent this patch ahead
of the queue, even for 3.17. It will help with merging later, I think.

Thanks
Boaz

2014-07-31 14:13:23

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [PATCH v8 04/22] Change direct_access calling convention

On Thu, Jul 31, 2014 at 01:11:42PM +0300, Boaz Harrosh wrote:
> >>> + if (size < 0)
> >>
> >> if(size < PAGE_SIZE), No?
> >
> > No, absolutely not. PAGE_SIZE is unsigned long, which (if I understand
> > my C integer promotions correctly) means that 'size' gets promoted to
> > an unsigned long, and we compare them unsigned, so errors will never be
> > caught by this check.
>
> Good point I agree that you need a cast ie.
>
> if(size < (long)PAGE_SIZE)
>
> The reason I'm saying this is because of a bug I actually hit when
> playing with partitioning and fdisk, it came out that the last partition's
> size was not page aligned, and code that checked for (< 0) crashed because
> prd returned the last two sectors of the partition, since your API is sector
> based this can happen for you here, before you are memseting a PAGE_SIZE
> you need to test there is space, No?

Not in ext2/ext4. It requires block size == PAGE_SIZE, so it's never
going to request the last partial block in a partition.

2014-07-31 15:28:43

by Boaz Harrosh

[permalink] [raw]
Subject: Re: [PATCH v8 04/22] Change direct_access calling convention

On 07/31/2014 05:13 PM, Matthew Wilcox wrote:
> On Thu, Jul 31, 2014 at 01:11:42PM +0300, Boaz Harrosh wrote:
>>>>> + if (size < 0)
>>>>
>>>> if(size < PAGE_SIZE), No?
>>>
>>> No, absolutely not. PAGE_SIZE is unsigned long, which (if I understand
>>> my C integer promotions correctly) means that 'size' gets promoted to
>>> an unsigned long, and we compare them unsigned, so errors will never be
>>> caught by this check.
>>
>> Good point I agree that you need a cast ie.
>>
>> if(size < (long)PAGE_SIZE)
>>
>> The reason I'm saying this is because of a bug I actually hit when
>> playing with partitioning and fdisk, it came out that the last partition's
>> size was not page aligned, and code that checked for (< 0) crashed because
>> prd returned the last two sectors of the partition, since your API is sector
>> based this can happen for you here, before you are memseting a PAGE_SIZE
>> you need to test there is space, No?
>
> Not in ext2/ext4. It requires block size == PAGE_SIZE, so it's never
> going to request the last partial block in a partition.
>

OK, cool then.

Matthew, what is your opinion about this: do we need to push for removal
of the partition dead code, which never worked for brd, or do we need to
push for fixing it and implementing new partition support for brd?

Also another thing I saw is that if we leave the flag
GENHD_FL_SUPPRESS_PARTITION_INFO

then mount -U UUID stops working, regardless of partitions or not,
because the kernel will not list us in /proc/partitions.
I'll submit another patch to remove it.

BTW, I hit another funny bug where the partition beginning was not
4K aligned; apparently fdisk lets you do this if the total size is small
enough (like 4096, which is the default for brd), so I ended up accessing
sector zero, the super-block, and failing because of the alignment check at
direct_access().
Do you know of any API that brd/prd can use to prevent fdisk from doing this?
I'm looking at it right now; I just thought it was worth asking.

Thanks for everything
Boaz

2014-07-31 17:22:13

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [PATCH v8 04/22] Change direct_access calling convention

On Thu, Jul 31, 2014 at 06:28:37PM +0300, Boaz Harrosh wrote:
> Matthew what is your opinion about this, do we need to push for removal
> of the partition dead code which never worked for brd, or we need to push
> for fixing and implementing new partition support for brd?

Fixing the code gets my vote. brd is useful for testing things ... and
sometimes we need to test things that involve partitions.

> Also another thing I saw is that if we leave the flag
> GENHD_FL_SUPPRESS_PARTITION_INFO
>
> then mount -U UUID stops to work, regardless of partitions or not,
> this is because Kernel will not put us on /proc/patitions.
> I'll submit another patch to remove it.

Yes, we should probably fix that too.

> BTW I hit another funny bug where the partition beginning was not
> 4K aligned apparently fdisk lets you do this if the total size is small
> enough (like 4096 which is default for brd) so I ended up with accessing
> sec zero, the supper-block, failing because of the alignment check at
> direct_access().

That's why I added on the partition start before doing the alignment
check :-)

> Do you know of any API that brd/prd can do to not let fdisk do this?
> I'm looking at it right now I just thought it is worth asking.

I think it's enough to refuse the mount. That feels like a patch to
ext2/4 (or maybe ext2/4 has a way to start the filesystem on a different
block boundary?)

2014-07-31 18:04:15

by Boaz Harrosh

[permalink] [raw]
Subject: Re: [PATCH v8 04/22] Change direct_access calling convention

On 07/31/2014 08:19 PM, Matthew Wilcox wrote:
> On Thu, Jul 31, 2014 at 06:28:37PM +0300, Boaz Harrosh wrote:
>> Matthew what is your opinion about this, do we need to push for removal
>> of the partition dead code which never worked for brd, or we need to push
>> for fixing and implementing new partition support for brd?
>
> Fixing the code gets my vote. brd is useful for testing things ... and
> sometimes we need to test things that involve partitions.
>

OK, I'm on it; it's what I'm doing today.

rrr, I managed to completely trash my VM by doing 'make install' of
util-linux, and after reboot it never recovered. I remember that
mount complained about a now-missing library and I forgot and rebooted;
that was the end of that. Anyway, I installed a new fc20 system; I had
wanted that for a long time over my old fc18.

>> Also another thing I saw is that if we leave the flag
>> GENHD_FL_SUPPRESS_PARTITION_INFO
>>
>> then mount -U UUID stops to work, regardless of partitions or not,
>> this is because Kernel will not put us on /proc/patitions.
>> I'll submit another patch to remove it.
>
> Yes, we should probably fix that too.
>

Yes, this is good stuff. I found out about the gpt option in fdisk;
that's really good stuff because it gives you a PARTUUID even before
the mkfs, and the partitions are so much more logical.
But only without that flag.

>> BTW I hit another funny bug where the partition beginning was not
>> 4K aligned apparently fdisk lets you do this if the total size is small
>> enough (like 4096 which is default for brd) so I ended up with accessing
>> sec zero, the supper-block, failing because of the alignment check at
>> direct_access().
>
> That's why I added on the partition start before doing the alignment
> check :-)
>
Yes, exactly; I had very similar code to yours. I have moved to your code
now. The first patch in the set is your patch 4/22 squashed with the
modifications you sent, then my fix, then the getgeo patch, then the removal
of the flag.

But I'm still fighting fdisk's sector math; I can't for the life of me
figure it out, and it is all too easy to create a partition scheme
that has an unaligned first/last sector.

I can observe the misalignment when the partitions are first
created, and I can detect it at prd_probe time.

I can probably fix it with this logic:
when first detecting a new partition, i.e. if bd_part->start_sect
is not aligned, round it up to PAGE_SIZE; then subtract the fixed-up
amount from bd_part->nr_sects and round bd_part->nr_sects down to PAGE_SIZE.
This way I still live inside the confined space that fdisk gave me, but only
do I/O within the largest aligned space. The leftover sectors are just
wasted space.


>> Do you know of any API that brd/prd can do to not let fdisk do this?
>> I'm looking at it right now I just thought it is worth asking.
>
> I think it's enough to refuse the mount. That feels like a patch to
> ext2/4 (or maybe ext2/4 has a way to start the filesystem on a different
> block boundary?)
>

We should not leave this to the filesystems to do again and again all over.
I wonder if there is some getgeo or some disk-properties info somewhere that
I can set to force the core block layer to do this for me; I'm surprised
that this is the first place we have hit this problem.

Thanks
Boaz

2014-07-31 20:30:47

by Zwisler, Ross

[permalink] [raw]
Subject: Re: [PATCH v8 04/22] Change direct_access calling convention

On Thu, 2014-07-31 at 21:04 +0300, Boaz Harrosh wrote:
> On 07/31/2014 08:19 PM, Matthew Wilcox wrote:
> > On Thu, Jul 31, 2014 at 06:28:37PM +0300, Boaz Harrosh wrote:
> >> Matthew what is your opinion about this, do we need to push for removal
> >> of the partition dead code which never worked for brd, or we need to push
> >> for fixing and implementing new partition support for brd?
> >
> > Fixing the code gets my vote. brd is useful for testing things ... and
> > sometimes we need to test things that involve partitions.
> >
>
> OK I'm on it, its what I'm doing today.
>
> rrr I manged to completely trash my vm by doing 'make install' of
> util-linux and after reboot it never recovered, I remember that
> mount complained about a now missing library and I forgot and rebooted,
> that was the end of that. Anyway I installed a new fc20 system wanted
> that for a long time over my old fc18

Ah, I'm already working on this as well. :) If you want you can wait for my
patches to BRD & test - they should be out this week.

I'm planning on adding get_geo() and doing dynamic minors as is done in NVMe.

- Ross
