2013-12-17 19:18:39

by Matthew Wilcox

[permalink] [raw]
Subject: [PATCH v3 0/3] Add XIP support to ext4

For v3, we've addressed the problem with unwritten extents that Dave
Chinner pointed out. Rather than testing bh_unwritten() as he suggested,
I opted for checking !buffer_mapped() as block_read_full_page() in
fs/buffer.c does. While I'm in that function I renamed the buffer_head
from 'tmp' to 'bh' to follow normal usage.

I also folded the "ext4: Add xip hole punching" patch into the main
"Add XIP functionality" patch since otherwise we're introducing broken
functionality in the middle of the series.

Matthew Wilcox (2):
Fix XIP fault vs truncate race
xip: Add xip_zero_page_range

Ross Zwisler (1):
ext4: Add XIP functionality

Documentation/filesystems/ext4.txt | 2 +
Documentation/filesystems/xip.txt | 3 ++
fs/Kconfig | 2 +-
fs/ext4/Kconfig | 11 +++++
fs/ext4/Makefile | 1 +
fs/ext4/ext4.h | 4 +-
fs/ext4/file.c | 17 ++++++++
fs/ext4/inode.c | 86 +++++++++++++++++++++++++++-----------
fs/ext4/namei.c | 11 ++++-
fs/ext4/super.c | 36 +++++++++++++++-
fs/ext4/xip.c | 78 ++++++++++++++++++++++++++++++++++
fs/ext4/xip.h | 24 +++++++++++
include/linux/fs.h | 8 ++++
mm/filemap_xip.c | 55 ++++++++++++++++--------
14 files changed, 289 insertions(+), 49 deletions(-)
create mode 100644 fs/ext4/xip.c
create mode 100644 fs/ext4/xip.h

--
1.8.4.rc3



2013-12-17 19:18:26

by Matthew Wilcox

[permalink] [raw]
Subject: [PATCH v3 1/3] Fix XIP fault vs truncate race

Pagecache faults recheck i_size after taking the page lock to ensure that
the fault didn't race against a truncate. We don't have a page to lock
in the XIP case, so use the i_mmap_mutex instead. It is locked in the
truncate path in unmap_mapping_range() after updating i_size. So while
we hold it in the fault path, we are guaranteed that either i_size has
already been updated in the truncate path, or that the truncate will
subsequently call zap_page_range_single() and so remove the mapping we
have just inserted.

There is a window of time in which i_size has been reduced and the
thread has a mapping to a page which will be removed from the file,
but this is harmless as the page will not be allocated to a different
purpose before the thread's access to it is revoked.

Signed-off-by: Matthew Wilcox <[email protected]>
---
mm/filemap_xip.c | 24 ++++++++++++++++++++++--
1 file changed, 22 insertions(+), 2 deletions(-)

diff --git a/mm/filemap_xip.c b/mm/filemap_xip.c
index d8d9fe3..c8d23e9 100644
--- a/mm/filemap_xip.c
+++ b/mm/filemap_xip.c
@@ -260,8 +260,17 @@ again:
__xip_unmap(mapping, vmf->pgoff);

found:
+ /* We must recheck i_size under i_mmap_mutex */
+ mutex_lock(&mapping->i_mmap_mutex);
+ size = (i_size_read(inode) + PAGE_CACHE_SIZE - 1) >>
+ PAGE_CACHE_SHIFT;
+ if (unlikely(vmf->pgoff >= size)) {
+ mutex_unlock(&mapping->i_mmap_mutex);
+ return VM_FAULT_SIGBUS;
+ }
err = vm_insert_mixed(vma, (unsigned long)vmf->virtual_address,
xip_pfn);
+ mutex_unlock(&mapping->i_mmap_mutex);
if (err == -ENOMEM)
return VM_FAULT_OOM;
/*
@@ -285,16 +294,27 @@ found:
}
if (error != -ENODATA)
goto out;
+
+ /* We must recheck i_size under i_mmap_mutex */
+ mutex_lock(&mapping->i_mmap_mutex);
+ size = (i_size_read(inode) + PAGE_CACHE_SIZE - 1) >>
+ PAGE_CACHE_SHIFT;
+ if (unlikely(vmf->pgoff >= size)) {
+ ret = VM_FAULT_SIGBUS;
+ goto unlock;
+ }
/* not shared and writable, use xip_sparse_page() */
page = xip_sparse_page();
if (!page)
- goto out;
+ goto unlock;
err = vm_insert_page(vma, (unsigned long)vmf->virtual_address,
page);
if (err == -ENOMEM)
- goto out;
+ goto unlock;

ret = VM_FAULT_NOPAGE;
+unlock:
+ mutex_unlock(&mapping->i_mmap_mutex);
out:
write_seqcount_end(&xip_sparse_seq);
mutex_unlock(&xip_sparse_mutex);
--
1.8.4.rc3


2013-12-17 19:18:27

by Matthew Wilcox

[permalink] [raw]
Subject: [PATCH v3 2/3] xip: Add xip_zero_page_range

This new function allows us to support hole-punch for XIP files by zeroing
a partial page, as opposed to the xip_truncate_page() function which can
only truncate to the end of the page. Reimplement xip_truncate_page() as
a caller of xip_zero_page_range().

Signed-off-by: Matthew Wilcox <[email protected]>
[ported to 3.13-rc2]
Signed-off-by: Ross Zwisler <[email protected]>
---
include/linux/fs.h | 8 ++++++++
mm/filemap_xip.c | 31 +++++++++++++++----------------
2 files changed, 23 insertions(+), 16 deletions(-)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index 121f11f..a063bff 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2509,11 +2509,19 @@ extern int xip_file_mmap(struct file * file, struct vm_area_struct * vma);
extern ssize_t xip_file_write(struct file *filp, const char __user *buf,
size_t len, loff_t *ppos);
extern int xip_truncate_page(struct address_space *mapping, loff_t from);
+extern int xip_zero_page_range(struct address_space *, loff_t from,
+ unsigned length);
#else
static inline int xip_truncate_page(struct address_space *mapping, loff_t from)
{
return 0;
}
+
+static inline int xip_zero_page_range(struct address_space *mapping,
+ loff_t from, unsigned length)
+{
+ return 0;
+}
#endif

#ifdef CONFIG_BLOCK
diff --git a/mm/filemap_xip.c b/mm/filemap_xip.c
index c8d23e9..d808b72 100644
--- a/mm/filemap_xip.c
+++ b/mm/filemap_xip.c
@@ -461,33 +461,19 @@ xip_file_write(struct file *filp, const char __user *buf, size_t len,
}
EXPORT_SYMBOL_GPL(xip_file_write);

-/*
- * truncate a page used for execute in place
- * functionality is analog to block_truncate_page but does use get_xip_mem
- * to get the page instead of page cache
- */
-int
-xip_truncate_page(struct address_space *mapping, loff_t from)
+int xip_zero_page_range(struct address_space *mapping, loff_t from,
+ unsigned length)
{
pgoff_t index = from >> PAGE_CACHE_SHIFT;
unsigned offset = from & (PAGE_CACHE_SIZE-1);
- unsigned blocksize;
- unsigned length;
void *xip_mem;
unsigned long xip_pfn;
int err;

- BUG_ON(!mapping->a_ops->get_xip_mem);
-
- blocksize = 1 << mapping->host->i_blkbits;
- length = offset & (blocksize - 1);
-
/* Block boundary? Nothing to do */
if (!length)
return 0;

- length = blocksize - length;
-
err = mapping->a_ops->get_xip_mem(mapping, index, 0,
&xip_mem, &xip_pfn);
if (unlikely(err)) {
@@ -497,7 +483,20 @@ xip_truncate_page(struct address_space *mapping, loff_t from)
else
return err;
}
+
memset(xip_mem + offset, 0, length);
return 0;
}
+EXPORT_SYMBOL_GPL(xip_zero_page_range);
+
+/*
+ * truncate a page used for execute in place
+ * functionality is analog to block_truncate_page but does use get_xip_mem
+ * to get the page instead of page cache
+ */
+int xip_truncate_page(struct address_space *mapping, loff_t from)
+{
+ unsigned length = PAGE_CACHE_ALIGN(from) - from;
+ return xip_zero_page_range(mapping, from, length);
+}
EXPORT_SYMBOL_GPL(xip_truncate_page);
--
1.8.4.rc3


2013-12-17 19:18:40

by Matthew Wilcox

[permalink] [raw]
Subject: [PATCH v3 3/3] ext4: Add XIP functionality

From: Ross Zwisler <[email protected]>

This is a port of the XIP functionality found in the current version of
ext2. This patch set is intended to achieve feature parity with XIP in
ext2 rather than non-XIP in ext4.  In particular, it lacks support for
splice and AIO.  We'll be submitting patches in the future to add that
functionality, but we think this is a good start.

The motivation behind this work is that we believe that the XIP feature
will begin to find new uses as various persistent memory devices and
technologies come on to the market. Having direct, byte-addressable
access to persistent memory without having an additional copy in the
page cache can be a win in terms of I/O latency and overall memory
usage.

This patch applies cleanly to v3.13-rc2, and was tested using brd as our
block driver.

Signed-off-by: Ross Zwisler <[email protected]>
Reviewed-by: Andreas Dilger <[email protected]>
Signed-off-by: Matthew Wilcox <[email protected]>
---
Documentation/filesystems/ext4.txt | 2 +
Documentation/filesystems/xip.txt | 3 ++
fs/Kconfig | 2 +-
fs/ext4/Kconfig | 11 +++++
fs/ext4/Makefile | 1 +
fs/ext4/ext4.h | 4 +-
fs/ext4/file.c | 17 ++++++++
fs/ext4/inode.c | 86 +++++++++++++++++++++++++++-----------
fs/ext4/namei.c | 11 ++++-
fs/ext4/super.c | 36 +++++++++++++++-
fs/ext4/xip.c | 78 ++++++++++++++++++++++++++++++++++
fs/ext4/xip.h | 24 +++++++++++
12 files changed, 244 insertions(+), 31 deletions(-)
create mode 100644 fs/ext4/xip.c
create mode 100644 fs/ext4/xip.h

diff --git a/Documentation/filesystems/ext4.txt b/Documentation/filesystems/ext4.txt
index 919a329..c32c398 100644
--- a/Documentation/filesystems/ext4.txt
+++ b/Documentation/filesystems/ext4.txt
@@ -386,6 +386,8 @@ max_dir_size_kb=n This limits the size of directories so that any
i_version Enable 64-bit inode version support. This option is
off by default.

+xip Use execute in place (no caching) if possible.
+
Data Mode
=========
There are 3 different data modes:
diff --git a/Documentation/filesystems/xip.txt b/Documentation/filesystems/xip.txt
index 0466ee5..3df2d36 100644
--- a/Documentation/filesystems/xip.txt
+++ b/Documentation/filesystems/xip.txt
@@ -37,6 +37,8 @@ alternative, memory technology devices can be used for this.

The block device operation is optional, these block devices support it as of
today:
+- axonram: Axon DDR2 device driver
+- brd: Ram backed block device driver
- dcssblk: s390 dcss block device driver

An address space operation named get_xip_mem is used to retrieve references
@@ -49,6 +51,7 @@ This address space operation is mutually exclusive with readpage&writepage that
do page cache read/write operations.
The following filesystems support it as of today:
- ext2: the second extended filesystem, see Documentation/filesystems/ext2.txt
+- ext4: the fourth extended filesystem, see Documentation/filesystems/ext4.txt

A set of file operations that do utilize get_xip_page can be found in
mm/filemap_xip.c . The following file operation implementations are provided:
diff --git a/fs/Kconfig b/fs/Kconfig
index c229f82..595cc00 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -17,7 +17,7 @@ source "fs/ext4/Kconfig"
config FS_XIP
# execute in place
bool
- depends on EXT2_FS_XIP
+ depends on EXT2_FS_XIP || EXT4_FS_XIP
default y

source "fs/jbd/Kconfig"
diff --git a/fs/ext4/Kconfig b/fs/ext4/Kconfig
index efea5d5..62952cb 100644
--- a/fs/ext4/Kconfig
+++ b/fs/ext4/Kconfig
@@ -73,3 +73,14 @@ config EXT4_DEBUG
If you select Y here, then you will be able to turn on debugging
with a command such as:
echo 1 > /sys/module/ext4/parameters/mballoc_debug
+
+config EXT4_FS_XIP
+ bool "Ext4 execute in place support"
+ depends on EXT4_FS && MMU
+ help
+ Execute in place can be used on memory-backed block devices. If you
+ enable this option, you can select to mount block devices which are
+ capable of this feature without using the page cache.
+
+ If you do not use a block device that is capable of using this,
+ or if unsure, say N.
diff --git a/fs/ext4/Makefile b/fs/ext4/Makefile
index 0310fec..3f1ec56 100644
--- a/fs/ext4/Makefile
+++ b/fs/ext4/Makefile
@@ -12,3 +12,4 @@ ext4-y := balloc.o bitmap.o dir.o file.o fsync.o ialloc.o inode.o page-io.o \

ext4-$(CONFIG_EXT4_FS_POSIX_ACL) += acl.o
ext4-$(CONFIG_EXT4_FS_SECURITY) += xattr_security.o
+ext4-$(CONFIG_EXT4_FS_XIP) += xip.o
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index e618503..d083ad9 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -954,6 +954,7 @@ struct ext4_inode_info {
#define EXT4_MOUNT_ERRORS_MASK 0x00070
#define EXT4_MOUNT_MINIX_DF 0x00080 /* Mimics the Minix statfs */
#define EXT4_MOUNT_NOLOAD 0x00100 /* Don't use existing journal*/
+#define EXT4_MOUNT_XIP 0x00200 /* Execute in place */
#define EXT4_MOUNT_DATA_FLAGS 0x00C00 /* Mode for data writes: */
#define EXT4_MOUNT_JOURNAL_DATA 0x00400 /* Write data to journal */
#define EXT4_MOUNT_ORDERED_DATA 0x00800 /* Flush data before commit */
@@ -2121,8 +2122,6 @@ extern int ext4_writepage_trans_blocks(struct inode *);
extern int ext4_chunk_trans_blocks(struct inode *, int nrblocks);
extern int ext4_block_truncate_page(handle_t *handle,
struct address_space *mapping, loff_t from);
-extern int ext4_block_zero_page_range(handle_t *handle,
- struct address_space *mapping, loff_t from, loff_t length);
extern int ext4_zero_partial_blocks(handle_t *handle, struct inode *inode,
loff_t lstart, loff_t lend);
extern int ext4_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf);
@@ -2571,6 +2570,7 @@ extern const struct file_operations ext4_dir_operations;
/* file.c */
extern const struct inode_operations ext4_file_inode_operations;
extern const struct file_operations ext4_file_operations;
+extern const struct file_operations ext4_xip_file_operations;
extern loff_t ext4_llseek(struct file *file, loff_t offset, int origin);
extern void ext4_unwritten_wait(struct inode *inode);

diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 3da2194..b9499b2 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -609,6 +609,23 @@ const struct file_operations ext4_file_operations = {
.fallocate = ext4_fallocate,
};

+#ifdef CONFIG_EXT4_FS_XIP
+const struct file_operations ext4_xip_file_operations = {
+ .llseek = ext4_llseek,
+ .read = xip_file_read,
+ .write = xip_file_write,
+ .unlocked_ioctl = ext4_ioctl,
+#ifdef CONFIG_COMPAT
+ .compat_ioctl = ext4_compat_ioctl,
+#endif
+ .mmap = xip_file_mmap,
+ .open = ext4_file_open,
+ .release = ext4_release_file,
+ .fsync = ext4_sync_file,
+ .fallocate = ext4_fallocate,
+};
+#endif
+
const struct inode_operations ext4_file_inode_operations = {
.setattr = ext4_setattr,
.getattr = ext4_getattr,
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 0757634..7b50832 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -43,6 +43,7 @@
#include "xattr.h"
#include "acl.h"
#include "truncate.h"
+#include "xip.h"

#include <trace/events/ext4.h>

@@ -663,6 +664,23 @@ found:
WARN_ON(1);
}

+ if (ext4_use_xip(inode->i_sb)) {
+ ext4_fsblk_t fs_blk;
+
+ for (fs_blk = map->m_pblk;
+ fs_blk < map->m_pblk + map->m_len; fs_blk++) {
+ /*
+ * we need to clear the block
+ */
+ ret = ext4_clear_xip_target(inode, fs_blk);
+
+ if (ret) {
+ retval = ret;
+ goto has_zeroout;
+ }
+ }
+ }
+
/*
* If the extent has been zeroed out, we don't need to update
* extent status tree.
@@ -3270,6 +3288,11 @@ static const struct address_space_operations ext4_aops = {
.error_remove_page = generic_error_remove_page,
};

+const struct address_space_operations ext4_xip_aops = {
+ .bmap = ext4_bmap,
+ .get_xip_mem = ext4_get_xip_mem,
+};
+
static const struct address_space_operations ext4_journalled_aops = {
.readpage = ext4_readpage,
.readpages = ext4_readpages,
@@ -3317,40 +3340,22 @@ void ext4_set_aops(struct inode *inode)
default:
BUG();
}
- if (test_opt(inode->i_sb, DELALLOC))
+ if (ext4_use_xip(inode->i_sb))
+ inode->i_mapping->a_ops = &ext4_xip_aops;
+ else if (test_opt(inode->i_sb, DELALLOC))
inode->i_mapping->a_ops = &ext4_da_aops;
else
inode->i_mapping->a_ops = &ext4_aops;
}

/*
- * ext4_block_truncate_page() zeroes out a mapping from file offset `from'
- * up to the end of the block which corresponds to `from'.
- * This required during truncate. We need to physically zero the tail end
- * of that block so it doesn't yield old data if the file is later grown.
- */
-int ext4_block_truncate_page(handle_t *handle,
- struct address_space *mapping, loff_t from)
-{
- unsigned offset = from & (PAGE_CACHE_SIZE-1);
- unsigned length;
- unsigned blocksize;
- struct inode *inode = mapping->host;
-
- blocksize = inode->i_sb->s_blocksize;
- length = blocksize - (offset & (blocksize - 1));
-
- return ext4_block_zero_page_range(handle, mapping, from, length);
-}
-
-/*
- * ext4_block_zero_page_range() zeros out a mapping of length 'length'
+ * __ext4_block_zero_page_range() zeros out a mapping of length 'length'
* starting from file offset 'from'. The range to be zero'd must
* be contained with in one block. If the specified range exceeds
* the end of the block it will be shortened to end of the block
* that cooresponds to 'from'
*/
-int ext4_block_zero_page_range(handle_t *handle,
+static int __ext4_block_zero_page_range(handle_t *handle,
struct address_space *mapping, loff_t from, loff_t length)
{
ext4_fsblk_t index = from >> PAGE_CACHE_SHIFT;
@@ -3440,6 +3445,34 @@ unlock:
return err;
}

+static int ext4_block_zero_page_range(handle_t *handle,
+ struct address_space *mapping, loff_t from, loff_t length)
+{
+ if (mapping_is_xip(mapping))
+ return xip_zero_page_range(mapping, from, length);
+ return __ext4_block_zero_page_range(handle, mapping, from, length);
+}
+
+/*
+ * ext4_block_truncate_page() zeroes out a mapping from file offset `from'
+ * up to the end of the block which corresponds to `from'.
+ * This required during truncate. We need to physically zero the tail end
+ * of that block so it doesn't yield old data if the file is later grown.
+ */
+int ext4_block_truncate_page(handle_t *handle,
+ struct address_space *mapping, loff_t from)
+{
+ unsigned offset = from & (PAGE_CACHE_SIZE-1);
+ unsigned length;
+ unsigned blocksize;
+ struct inode *inode = mapping->host;
+
+ blocksize = inode->i_sb->s_blocksize;
+ length = blocksize - (offset & (blocksize - 1));
+
+ return ext4_block_zero_page_range(handle, mapping, from, length);
+}
+
int ext4_zero_partial_blocks(handle_t *handle, struct inode *inode,
loff_t lstart, loff_t length)
{
@@ -4201,7 +4234,10 @@ struct inode *ext4_iget(struct super_block *sb, unsigned long ino)

if (S_ISREG(inode->i_mode)) {
inode->i_op = &ext4_file_inode_operations;
- inode->i_fop = &ext4_file_operations;
+ if (ext4_use_xip(inode->i_sb))
+ inode->i_fop = &ext4_xip_file_operations;
+ else
+ inode->i_fop = &ext4_file_operations;
ext4_set_aops(inode);
} else if (S_ISDIR(inode->i_mode)) {
inode->i_op = &ext4_dir_inode_operations;
@@ -4653,7 +4689,7 @@ int ext4_setattr(struct dentry *dentry, struct iattr *attr)
* Truncate pagecache after we've waited for commit
* in data=journal mode to make pages freeable.
*/
- truncate_pagecache(inode, inode->i_size);
+ truncate_pagecache(inode, inode->i_size);
}
/*
* We want to call ext4_truncate() even if attr->ia_size ==
diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c
index 5a0408d..20a9cf8 100644
--- a/fs/ext4/namei.c
+++ b/fs/ext4/namei.c
@@ -39,6 +39,7 @@

#include "xattr.h"
#include "acl.h"
+#include "xip.h"

#include <trace/events/ext4.h>
/*
@@ -2250,7 +2251,10 @@ retry:
err = PTR_ERR(inode);
if (!IS_ERR(inode)) {
inode->i_op = &ext4_file_inode_operations;
- inode->i_fop = &ext4_file_operations;
+ if (ext4_use_xip(inode->i_sb))
+ inode->i_fop = &ext4_xip_file_operations;
+ else
+ inode->i_fop = &ext4_file_operations;
ext4_set_aops(inode);
err = ext4_add_nondir(handle, dentry, inode);
if (!err && IS_DIRSYNC(dir))
@@ -2314,7 +2318,10 @@ retry:
err = PTR_ERR(inode);
if (!IS_ERR(inode)) {
inode->i_op = &ext4_file_inode_operations;
- inode->i_fop = &ext4_file_operations;
+ if (ext4_use_xip(inode->i_sb))
+ inode->i_fop = &ext4_xip_file_operations;
+ else
+ inode->i_fop = &ext4_file_operations;
ext4_set_aops(inode);
d_tmpfile(dentry, inode);
err = ext4_orphan_add(handle, inode);
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index c977f4e..144dfd5 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -50,6 +50,7 @@
#include "xattr.h"
#include "acl.h"
#include "mballoc.h"
+#include "xip.h"

#define CREATE_TRACE_POINTS
#include <trace/events/ext4.h>
@@ -1162,7 +1163,7 @@ enum {
Opt_inode_readahead_blks, Opt_journal_ioprio,
Opt_dioread_nolock, Opt_dioread_lock,
Opt_discard, Opt_nodiscard, Opt_init_itable, Opt_noinit_itable,
- Opt_max_dir_size_kb,
+ Opt_max_dir_size_kb, Opt_xip,
};

static const match_table_t tokens = {
@@ -1243,6 +1244,7 @@ static const match_table_t tokens = {
{Opt_removed, "reservation"}, /* mount option from ext2/3 */
{Opt_removed, "noreservation"}, /* mount option from ext2/3 */
{Opt_removed, "journal=%u"}, /* mount option from ext2/3 */
+ {Opt_xip, "xip"},
{Opt_err, NULL},
};

@@ -1436,6 +1438,7 @@ static const struct mount_opts {
{Opt_jqfmt_vfsv0, QFMT_VFS_V0, MOPT_QFMT},
{Opt_jqfmt_vfsv1, QFMT_VFS_V1, MOPT_QFMT},
{Opt_max_dir_size_kb, 0, MOPT_GTE0},
+ {Opt_xip, EXT4_MOUNT_XIP, MOPT_SET},
{Opt_err, 0, 0}
};

@@ -1638,6 +1641,11 @@ static int handle_mount_opt(struct super_block *sb, char *opt, int token,
}
sbi->s_jquota_fmt = m->mount_opt;
#endif
+#ifndef CONFIG_EXT4_FS_XIP
+ } else if (token == Opt_xip) {
+ ext4_msg(sb, KERN_INFO, "xip option not supported");
+ return -1;
+#endif
} else {
if (!args->from)
arg = 1;
@@ -3553,11 +3561,23 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
}
if (test_opt(sb, DELALLOC))
clear_opt(sb, DELALLOC);
+ if (test_opt(sb, XIP)) {
+ ext4_msg(sb, KERN_ERR, "can't mount with "
+ "both data=journal and xip");
+ goto failed_mount;
+ }
}

sb->s_flags = (sb->s_flags & ~MS_POSIXACL) |
(test_opt(sb, POSIX_ACL) ? MS_POSIXACL : 0);

+ if ((sbi->s_mount_opt & EXT4_MOUNT_XIP) &&
+ !sb->s_bdev->bd_disk->fops->direct_access) {
+ ext4_msg(sb, KERN_ERR, "can't mount with xip - "
+ "not supported by bdev");
+ goto failed_mount;
+ }
+
if (le32_to_cpu(es->s_rev_level) == EXT4_GOOD_OLD_REV &&
(EXT4_HAS_COMPAT_FEATURE(sb, ~0U) ||
EXT4_HAS_RO_COMPAT_FEATURE(sb, ~0U) ||
@@ -3604,6 +3624,12 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
goto failed_mount;
}

+ if (ext4_use_xip(sb) && blocksize != PAGE_SIZE) {
+ ext4_msg(sb, KERN_ERR, "Unsupported blocksize %d for xip",
+ blocksize);
+ goto failed_mount;
+ }
+
if (sb->s_blocksize != blocksize) {
/* Validate the filesystem blocksize */
if (!sb_set_blocksize(sb, blocksize)) {
@@ -4740,6 +4766,7 @@ static int ext4_remount(struct super_block *sb, int *flags, char *data)
struct ext4_super_block *es;
struct ext4_sb_info *sbi = EXT4_SB(sb);
unsigned long old_sb_flags;
+ unsigned long old_mount_opt = sbi->s_mount_opt;
struct ext4_mount_options old_opts;
int enable_quota = 0;
ext4_group_t g;
@@ -4808,6 +4835,13 @@ static int ext4_remount(struct super_block *sb, int *flags, char *data)

es = sbi->s_es;

+ if ((sbi->s_mount_opt ^ old_mount_opt) & EXT4_MOUNT_XIP) {
+ ext4_msg(sb, KERN_WARNING, "warning: refusing change of "
+ "xip flag while remounting");
+ sbi->s_mount_opt &= ~EXT4_MOUNT_XIP;
+ sbi->s_mount_opt |= old_mount_opt & EXT4_MOUNT_XIP;
+ }
+
if (sbi->s_journal) {
ext4_init_journal_params(sb, sbi->s_journal);
set_task_ioprio(sbi->s_journal->j_task, journal_ioprio);
diff --git a/fs/ext4/xip.c b/fs/ext4/xip.c
new file mode 100644
index 0000000..5853345
--- /dev/null
+++ b/fs/ext4/xip.c
@@ -0,0 +1,78 @@
+/*
+ * linux/fs/ext4/xip.c
+ *
+ * Copyright (C) 2005 IBM Corporation
+ * Author: Carsten Otte ([email protected])
+ */
+
+#include <linux/mm.h>
+#include <linux/fs.h>
+#include <linux/genhd.h>
+#include <linux/buffer_head.h>
+#include <linux/blkdev.h>
+#include "ext4.h"
+#include "xip.h"
+
+static inline int
+__inode_direct_access(struct inode *inode, sector_t block,
+ void **kaddr, unsigned long *pfn)
+{
+ struct block_device *bdev = inode->i_sb->s_bdev;
+ const struct block_device_operations *ops = bdev->bd_disk->fops;
+ sector_t sector;
+
+ sector = block * (PAGE_SIZE / 512); /* ext4 block to bdev sector */
+
+ BUG_ON(!ops->direct_access);
+ return ops->direct_access(bdev, sector, kaddr, pfn);
+}
+
+static inline int
+__ext4_get_block(struct inode *inode, pgoff_t pgoff, int create,
+ sector_t *result)
+{
+ struct buffer_head bh;
+ int rc;
+
+ memset(&bh, 0, sizeof(bh));
+ bh.b_size = inode->i_sb->s_blocksize;
+ rc = ext4_get_block(inode, pgoff, &bh, create);
+ *result = bh.b_blocknr;
+
+ /* did we get a sparse block (hole in the file)? */
+ if (!rc && !buffer_mapped(&bh)) {
+ BUG_ON(create);
+ rc = -ENODATA;
+ }
+
+ return rc;
+}
+
+int
+ext4_clear_xip_target(struct inode *inode, sector_t block)
+{
+ void *kaddr;
+ unsigned long pfn;
+ int rc;
+
+ rc = __inode_direct_access(inode, block, &kaddr, &pfn);
+ if (!rc)
+ clear_page(kaddr);
+ return rc;
+}
+
+int ext4_get_xip_mem(struct address_space *mapping, pgoff_t pgoff, int create,
+ void **kmem, unsigned long *pfn)
+{
+ int rc;
+ sector_t block;
+
+ /* first, retrieve the sector number */
+ rc = __ext4_get_block(mapping->host, pgoff, create, &block);
+ if (rc)
+ return rc;
+
+ /* retrieve address of the target data */
+ rc = __inode_direct_access(mapping->host, block, kmem, pfn);
+ return rc;
+}
diff --git a/fs/ext4/xip.h b/fs/ext4/xip.h
new file mode 100644
index 0000000..af0d553
--- /dev/null
+++ b/fs/ext4/xip.h
@@ -0,0 +1,24 @@
+/*
+ * linux/fs/ext4/xip.h
+ *
+ * Copyright (C) 2005 IBM Corporation
+ * Author: Carsten Otte ([email protected])
+ */
+
+#ifdef CONFIG_EXT4_FS_XIP
+extern int ext4_clear_xip_target(struct inode *, sector_t);
+
+static inline int ext4_use_xip(struct super_block *sb)
+{
+ struct ext4_sb_info *sbi = EXT4_SB(sb);
+ return sbi->s_mount_opt & EXT4_MOUNT_XIP;
+}
+int ext4_get_xip_mem(struct address_space *, pgoff_t, int,
+ void **, unsigned long *);
+#define mapping_is_xip(map) unlikely(map->a_ops->get_xip_mem)
+#else
+#define mapping_is_xip(map) 0
+#define ext4_use_xip(sb) 0
+#define ext4_clear_xip_target(inode, chain) 0
+#define ext4_get_xip_mem NULL
+#endif
--
1.8.4.rc3

2013-12-17 22:30:50

by Dave Chinner

[permalink] [raw]
Subject: Re: [PATCH v3 0/3] Add XIP support to ext4

On Tue, Dec 17, 2013 at 02:18:25PM -0500, Matthew Wilcox wrote:
> For v3, we've addressed the problem with unwritten extents that Dave
> Chinner pointed out.

No, you haven't addressed the problem. There is nothing in this
patch set that converts an unwritten extent after it is written to.
Hence on every subsequent read will return zeros because the block
is still marked as unwritten.

Further, write page faults won't do unwritten extent conversion or
block allocation, either, because:

+#ifdef CONFIG_EXT4_FS_XIP
+const struct file_operations ext4_xip_file_operations = {
+ .llseek = ext4_llseek,
+ .read = xip_file_read,
+ .write = xip_file_write,
+ .unlocked_ioctl = ext4_ioctl,
+#ifdef CONFIG_COMPAT
+ .compat_ioctl = ext4_compat_ioctl,
+#endif
+ .mmap = xip_file_mmap,
+ .open = ext4_file_open,
+ .release = ext4_release_file,
+ .fsync = ext4_sync_file,
+ .fallocate = ext4_fallocate,
+};

You wire .mmap up to xip_file_mmap, which wires up .page_mkwrite
like this:

static const struct vm_operations_struct xip_file_vm_ops = {
.fault = xip_file_fault,
.page_mkwrite = filemap_page_mkwrite,
.remap_pages = generic_file_remap_pages,
};

and filemap_page_mkwrite() does none of the special stuff that
ext4_page_mkwrite() does for handling unwritten extents, allocating
blocks for faults over holes in files, etc.

We actually have an xfstests test that test whether mmap and
unwritten extents work correctly - xfs/166 - but there's nothing
XFS specific about it anymore. it could easily be made generic
simply by replacing xfs_bmap with the xfs_io fiemap command....

Also, you haven't address the read vs truncate races I pointed out.
That is, buffered read currently serialises against truncate via a
combination of inode size checks and page locks. i.e. after each
page is locked, it is checked to see if it is beyond EOF before
the read proceeds into that page. the XIP path does not have any
page locks, nor read IO locks, and so is not in any way serialised
against a truncate changing the size of the inode while the read is
in progress.

Cheers,

Dave.
--
Dave Chinner
[email protected]

2013-12-18 02:31:43

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [PATCH v3 0/3] Add XIP support to ext4

On Wed, Dec 18, 2013 at 09:30:50AM +1100, Dave Chinner wrote:
> On Tue, Dec 17, 2013 at 02:18:25PM -0500, Matthew Wilcox wrote:
> > For v3, we've addressed the problem with unwritten extents that Dave
> > Chinner pointed out.
>
> No, you haven't addressed the problem. There is nothing in this
> patch set that converts an unwritten extent after it is written to.
> Hence on every subsequent read will return zeros because the block
> is still marked as unwritten.

I don't understand. Here's the path as I understand it:

xip_file_write -> __xip_file_write -> ext4_get_xip_mem(create=0),
returns -ENODATA. So we call ext4_get_xip_mem again, this time with
create=1 which causes ext4_get_block() to allocate blocks.

> Further, write page faults won't do unwritten extent conversion or
> block allocation, either, because:
>
> You wire .mmap up to xip_file_mmap, which wires up .page_mkwrite
> like this:
>
> static const struct vm_operations_struct xip_file_vm_ops = {
> .fault = xip_file_fault,
> .page_mkwrite = filemap_page_mkwrite,
> .remap_pages = generic_file_remap_pages,
> };
>
> and filemap_page_mkwrite() does none of the special stuff that
> ext4_page_mkwrite() does for handling unwritten extents, allocating
> blocks for faults over holes in files, etc.

Again, I don't think that's a problem. The first time we take a page
fault, we call xip_file_fault() which installs a PFN map if there's
no hole. If there is a hole, and the mapping is writable, it calls
get_xip_mem with create=1 again, causing the extent to be allocated,
so we never get an unwritten extent mapped to userspace.

> We actually have an xfstests test that test whether mmap and
> unwritten extents work correctly - xfs/166 - but there's nothing
> XFS specific about it anymore. it could easily be made generic
> simply by replacing xfs_bmap with the xfs_io fiemap command....

Thanks. I'll put that on the increasingly-long todo list ...

> Also, you haven't address the read vs truncate races I pointed out.
> That is, buffered read currently serialises against truncate via a
> combination of inode size checks and page locks. i.e. after each
> page is locked, it is checked to see if it is beyond EOF before
> the read proceeds into that page. the XIP path does not have any
> page locks, nor read IO locks, and so is not in any way serialised
> against a truncate changing the size of the inode while the read is
> in progress.

Umm ... what do you think patch 1/3 does? If you think it doesn't fix
the race, I need you to explain why.

--
Matthew Wilcox Intel Open Source Technology Centre
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours. We can't possibly take such
a retrograde step."

2013-12-18 05:01:36

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH v3 0/3] Add XIP support to ext4

On Tue, Dec 17, 2013 at 07:31:43PM -0700, Matthew Wilcox wrote:
> On Wed, Dec 18, 2013 at 09:30:50AM +1100, Dave Chinner wrote:
> > No, you haven't addressed the problem. There is nothing in this
> > patch set that converts an unwritten extent after it is written to.
> > Hence on every subsequent read will return zeros because the block
> > is still marked as unwritten.
>
> I don't understand. Here's the path as I understand it:
>
> xip_file_write -> __xip_file_write -> ext4_get_xip_mem(create=0),
> returns -ENODATA. So we call ext4_get_xip_mem again, this time with
> create=1 which causes ext4_get_block() to allocate blocks.

When Dave says that the extent is unwritten, what he means is that the
block as been allocated, but it is marked as being uninitialized.
Since the block is uninitialized we must not read from that block;
instead, if the user issues a read request to an uninitialized block,
we must return all zero's for that block (lest we reveal stale data).
And if we try to write to an uninitialized block, *after* we write to
the data block, we have to clear the uninitalized block, which in some
cases might mean splitting the extent --- if we have an extent which
maps logical blocks 0 to 5 to physical blocks 100 to 105, and we write
to block #2, will need to change that single uninitialized extent to
three extents --- one covering blocks logical blocks 0-1, one covering
logical block 2, and one covering logical blocks 3-5, where the first
and third would be marked uninitialized, and the second would be
marked initialized. Since we potentially need to convert one extent
to three extents, this might involve an extent tree node split.

You keep talking about allocated vs unallocated, and create=0 and
create=1, but even for an allocated block, that block may be marked
initialized or uninitialized --- and if it is marked uninitialized,
xip_file_write must call a file system-specific callback to allow this
conversion to take place.

Now for persistent memory, if writes are infinitely fast and free
(i.e., there are no significant write cycle limitations), there is a
terrible, terrible hack we could do, which would involve fallocate()
never creating uninitialized extents, but instead writing zero's to
all of the allocated blocks. And whenever we open a file, before we
return a file descriptor, we could scan all of the extent blocks
looking for uninitialized extents, and if so, convert them to
initialized extents by writing zeros to all of the blocks at open
time.

This would be horribly slow for hard drives, but typically the devices
that use XIP are either read-only, or fairly fast. It's really ugly,
and if it turns out that persistent memory is not infinitely fast and
have infinite write cycles (and my understanding is that persistent
memory is either extraordinarily expensive, or not as perfect in terms
speed and wear resistenace as some of their boosters claim), it's not
clear that trying to evade dealing with uninitialized extents is a
smart way to go.

In other words, suppose somone calls fallocate on a 2GB region on an
XIP mounted file system. Would you be happy forcing 2GB's worth of
writes at fallocate time(), just because we don't want to deal with
adding a file system callback in xip_file_write()?

Regards,

- Ted

2013-12-18 12:33:19

by Dave Chinner

[permalink] [raw]
Subject: Re: [PATCH v3 0/3] Add XIP support to ext4

On Tue, Dec 17, 2013 at 07:31:43PM -0700, Matthew Wilcox wrote:
> On Wed, Dec 18, 2013 at 09:30:50AM +1100, Dave Chinner wrote:
> > On Tue, Dec 17, 2013 at 02:18:25PM -0500, Matthew Wilcox wrote:
> > > For v3, we've addressed the problem with unwritten extents that Dave
> > > Chinner pointed out.
> >
> > No, you haven't addressed the problem. There is nothing in this
> > patch set that converts an unwritten extent after it is written to.
> > Hence on every subsequent read will return zeros because the block
> > is still marked as unwritten.
>
> I don't understand. Here's the path as I understand it:
>
> xip_file_write -> __xip_file_write -> ext4_get_xip_mem(create=0),
> returns -ENODATA. So we call ext4_get_xip_mem again, this time with
> create=1 which causes ext4_get_block() to allocate blocks.

Ted has already answered this, so I'll skip it.

> > Also, you haven't address the read vs truncate races I pointed out.
> > That is, buffered read currently serialises against truncate via a
> > combination of inode size checks and page locks. i.e. after each
> > page is locked, it is checked to see if it is beyond EOF before
> > the read proceeds into that page. the XIP path does not have any
> > page locks, nor read IO locks, and so is not in any way serialised
> > against a truncate changing the size of the inode while the read is
> > in progress.
>
> Umm ... what do you think patch 1/3 does? If you think it doesn't fix
> the race, I need you to explain why.

That's something that happens with a mmap page fault. I'm talking
about read(2) calls which end up in xip_file_read() ->
do_xip_mapping_read().

do_xip_mapping_read() samples the inode size before the copy loop,
and then never looks at it again. The read doesn't hold any locks,
because the way it's wired up it jumps straight from the .read
method, so it goes:

vfs_read()
xip_file_read()
do_xip_mapping_read().

No locks are held, so we can race with a truncate at any point in
time. That will change the inode size, and because
do_xip_mapping_read() is not serialised in any way nor does it
check the inode size on each loop, it never detects that the inode
size has changed and so can be reading from beyond the new EOF.

Now, I have no idea what ext4 does when asked to map blocks beyond
EOF, but that will define the behaviour that do_xip_mapping_read()
has when a truncate race occurs(*). But it the behaviour is most
definitely filesystem dependent, and that is most definitely wrong.

There needs to be serialisation against truncate provided in some
way, and that's what the page cache page locks provide non-XIP
buffered read paths. We don't have them in the XIP infrastructure,
and so there's a problem here.

Holding the i_mutex is not sufficient, as the locks that need to be
held to provide this serialisation are owned by the filesystems, not
the generic code. Hence XIP needs to use the normal .aio_read path
and have the filesystems call do_xip_mapping_read() when the
appropriate locks have been gained.

And, in fact, the XIP write(2) path needs to go through filesystems
to be locked correctly just like the read path. Buffered writes need
to be serialised against reads and other filesystem operations, and
holding the i_mutex is not sufficient for that purpose - again, the
locks tha tneed to be held are defined by the filesystem, not the
XIP infrastructure....

The only saving grace is that XIP is so old it doesn't use the
multi-block mapping code that all filesystems now provide to
mpage_readpages(), and so once the blocks have been removed from the
inode the mapping.

Like I said, the XIP code is needs a heap of infrastructure work
before we can hook a modern filesystem up to it in a sane way.

(*) As a side issue, the XIP ext4 block mapping code now has a call
chain that looks like:

ext4_xip_get_mem
__ext4_get_block
ext4_get_block
_ext4_get_block
....

You might want to have a think about some of the names and
abstractions being used, because naming like that is going to make
reading stack traces from kernels compiled without frame pointers
a nightmare....

Cheers,

Dave.
--
Dave Chinner
[email protected]

2013-12-18 14:27:52

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [PATCH v3 0/3] Add XIP support to ext4

On Wed, Dec 18, 2013 at 12:01:27AM -0500, Theodore Ts'o wrote:
> On Tue, Dec 17, 2013 at 07:31:43PM -0700, Matthew Wilcox wrote:
> > On Wed, Dec 18, 2013 at 09:30:50AM +1100, Dave Chinner wrote:
> > > No, you haven't addressed the problem. There is nothing in this
> > > patch set that converts an unwritten extent after it is written to.
> > > Hence on every subsequent read will return zeros because the block
> > > is still marked as unwritten.
> >
> > I don't understand. Here's the path as I understand it:
> >
> > xip_file_write -> __xip_file_write -> ext4_get_xip_mem(create=0),
> > returns -ENODATA. So we call ext4_get_xip_mem again, this time with
> > create=1 which causes ext4_get_block() to allocate blocks.
>
> When Dave says that the extent is unwritten, what he means is that the
> block as been allocated, but it is marked as being uninitialized.
> Since the block is uninitialized we must not read from that block;
> instead, if the user issues a read request to an uninitialized block,
> we must return all zero's for that block (lest we reveal stale data).
> And if we try to write to an uninitialized block, *after* we write to
> the data block, we have to clear the uninitalized block, which in some
> cases might mean splitting the extent --- if we have an extent which
> maps logical blocks 0 to 5 to physical blocks 100 to 105, and we write
> to block #2, will need to change that single uninitialized extent to
> three extents --- one covering blocks logical blocks 0-1, one covering
> logical block 2, and one covering logical blocks 3-5, where the first
> and third would be marked uninitialized, and the second would be
> marked initialized. Since we potentially need to convert one extent
> to three extents, this might involve an extent tree node split.

So I think we do all that. If xip_file_read() sees a block which is
!buffer_mapped, it fills with zeroes. If xip_file_write() sees a block
which is !buffer_mapped, it asks ext4_get_block to map it by passing
in create=1. Part of the patch includes zeroing the newly allocated
block under i_data_sem before calling ext4_es_insert_extent(), which I
think is enough to prevent reading stale data.

> You keep talking about allocated vs unallocated, and create=0 and
> create=1, but even for an allocated block, that block may be marked
> initialized or uninitialized --- and if it is marked uninitialized,
> xip_file_write must call a file system-specific callback to allow this
> conversion to take place.

Could you take pity on me and tell me what flags I need to check in the
buffer_head to determine this state of affairs?

> In other words, suppose somone calls fallocate on a 2GB region on an
> XIP mounted file system. Would you be happy forcing 2GB's worth of
> writes at fallocate time(), just because we don't want to deal with
> adding a file system callback in xip_file_write()?

I think there is a callback in xip_file_write(), and it's get_xip_mem().
>From what you're saying, it sounds like it's just not doing enough.

--
Matthew Wilcox Intel Open Source Technology Centre
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours. We can't possibly take such
a retrograde step."

2013-12-18 15:22:34

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [PATCH v3 0/3] Add XIP support to ext4

On Wed, Dec 18, 2013 at 11:33:19PM +1100, Dave Chinner wrote:
> That's something that happens with a mmap page fault. I'm talking
> about read(2) calls which end up in xip_file_read() ->
> do_xip_mapping_read().

*light goes on*! Thank you! I'll spin up a patch to fix that.

One approach would be grabbing the i_mmap_mutex while we copy the data
to userspace. That's not great from a scalability point of view, and I
think there's a great need to ensure we can't deadlock against a fault
(eg the classic read() into an mmap() of the same file), but I think
it's doable.

> There needs to be serialisation against truncate provided in some
> way, and that's what the page cache page locks provide non-XIP
> buffered read paths. We don't have them in the XIP infrastructure,
> and so there's a problem here.
>
> Holding the i_mutex is not sufficient, as the locks that need to be
> held to provide this serialisation are owned by the filesystems, not
> the generic code. Hence XIP needs to use the normal .aio_read path
> and have the filesystems call do_xip_mapping_read() when the
> appropriate locks have been gained.
>
> And, in fact, the XIP write(2) path needs to go through filesystems
> to be locked correctly just like the read path. Buffered writes need
> to be serialised against reads and other filesystem operations, and
> holding the i_mutex is not sufficient for that purpose - again, the
> locks tha tneed to be held are defined by the filesystem, not the
> XIP infrastructure....

I see, I see ... I'm going to have to think about this some more.

> The only saving grace is that XIP is so old it doesn't use the
> multi-block mapping code that all filesystems now provide to
> mpage_readpages(), and so once the blocks have been removed from the
> inode the mapping.

I think maybe you lost some words from that final sentence? I have patches
that attempt to make the XIP code work in larger quantities than single
pages, so that's in progress.

> Like I said, the XIP code is needs a heap of infrastructure work
> before we can hook a modern filesystem up to it in a sane way.

Oh, I quite agree. I was just so focused on the problems with mmap vs
truncate I didn't stop to consider the problems with read vs truncate.
I assumed the original designers had covered that (and in fairness,
maybe they did, but it got broken at some point in the last eight years).

> (*) As a side issue, the XIP ext4 block mapping code now has a call
> chain that looks like:
>
> ext4_xip_get_mem
> __ext4_get_block
> ext4_get_block
> _ext4_get_block
> ....
>
> You might want to have a think about some of the names and
> abstractions being used, because naming like that is going to make
> reading stack traces from kernels compiled without frame pointers
> a nightmare....

Yeah, I already took care of that in my tree ... just didn't get round
to postng it yet, plus as it stands it depends (purely textually) on
some other patches.

commit 7983604d53b5925aa2741beec05622266c4fbb9e
Author: Matthew Wilcox <[email protected]>
Date: Tue Dec 17 15:47:38 2013 -0500

ext2,ext4: Inline __ext*_get_block into ext*_get_xip_mem

__ext*_get_block each only have one caller, and it's easier to read
the code when they're inlined into their caller.

Signed-off-by: Matthew Wilcox <[email protected]>

diff --git a/fs/ext2/xip.c b/fs/ext2/xip.c
index fa40091..0ea6475 100644
--- a/fs/ext2/xip.c
+++ b/fs/ext2/xip.c
@@ -22,27 +22,6 @@ static inline long __inode_direct_access(struct inode *inode, sector_t block,
return ops->direct_access(bdev, sector, kaddr, pfn, size);
}

-static inline int
-__ext2_get_block(struct inode *inode, pgoff_t pgoff, int create,
- sector_t *result)
-{
- struct buffer_head tmp;
- int rc;
-
- memset(&tmp, 0, sizeof(struct buffer_head));
- tmp.b_size = 1 << inode->i_blkbits;
- rc = ext2_get_block(inode, pgoff, &tmp, create);
- *result = tmp.b_blocknr;
-
- /* did we get a sparse block (hole in the file)? */
- if (!tmp.b_blocknr && !rc) {
- BUG_ON(create);
- rc = -ENODATA;
- }
-
- return rc;
-}
-
int
ext2_clear_xip_target(struct inode *inode, sector_t block)
{
@@ -73,15 +52,22 @@ void ext2_xip_verify_sb(struct super_block *sb)
int ext2_get_xip_mem(struct address_space *mapping, pgoff_t pgoff, int create,
void **kmem, unsigned long *pfn)
{
+ struct inode *inode = mapping->host;
+ struct buffer_head bh;
long rc;
- sector_t block;

- /* first, retrieve the sector number */
- rc = __ext2_get_block(mapping->host, pgoff, create, &block);
+ /* Convert file offset to block number */
+ memset(&bh, 0, sizeof(bh));
+ bh.b_size = 1 << inode->i_blkbits;
+ rc = ext2_get_block(inode, pgoff, &bh, create);
if (rc)
return rc;
+ if (!buffer_mapped(&bh)) {
+ BUG_ON(create);
+ return -ENODATA;
+ }

/* retrieve address of the target data */
- rc = __inode_direct_access(mapping->host, block, kmem, pfn, PAGE_SIZE);
+ rc = __inode_direct_access(inode, bh.b_blocknr, kmem, pfn, PAGE_SIZE);
return (rc < 0) ? rc : 0;
}
diff --git a/fs/ext4/xip.c b/fs/ext4/xip.c
index 0868a40..0192f20 100644
--- a/fs/ext4/xip.c
+++ b/fs/ext4/xip.c
@@ -22,27 +22,6 @@ static inline long __inode_direct_access(struct inode *inode, sector_t block,
return ops->direct_access(bdev, sector, kaddr, pfn, size);
}

-static inline int
-__ext4_get_block(struct inode *inode, pgoff_t pgoff, int create,
- sector_t *result)
-{
- struct buffer_head bh;
- int rc;
-
- memset(&bh, 0, sizeof(bh));
- bh.b_size = inode->i_sb->s_blocksize;
- rc = ext4_get_block(inode, pgoff, &bh, create);
- *result = bh.b_blocknr;
-
- /* did we get a sparse block (hole in the file)? */
- if (!rc && !buffer_mapped(&bh)) {
- BUG_ON(create);
- rc = -ENODATA;
- }
-
- return rc;
-}
-
int ext4_clear_xip_target(struct inode *inode, sector_t block)
{
void *kaddr;
@@ -59,15 +38,22 @@ int ext4_clear_xip_target(struct inode *inode, sector_t block)
int ext4_get_xip_mem(struct address_space *mapping, pgoff_t pgoff, int create,
void **kmem, unsigned long *pfn)
{
+ struct inode *inode = mapping->host;
+ struct buffer_head bh;
long rc;
- sector_t block;

- /* first, retrieve the sector number */
- rc = __ext4_get_block(mapping->host, pgoff, create, &block);
+ /* Convert file offset to block number */
+ memset(&bh, 0, sizeof(bh));
+ bh.b_size = inode->i_sb->s_blocksize;
+ rc = ext4_get_block(inode, pgoff, &bh, create);
if (rc)
return rc;
+ if (!buffer_mapped(&bh)) {
+ BUG_ON(create);
+ return -ENODATA;
+ }

/* retrieve address of the target data */
- rc = __inode_direct_access(mapping->host, block, kmem, pfn, PAGE_SIZE);
+ rc = __inode_direct_access(inode, bh.b_blocknr, kmem, pfn, PAGE_SIZE);
return (rc < 0) ? rc : 0;
}

--
Matthew Wilcox Intel Open Source Technology Centre
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours. We can't possibly take such
a retrograde step."

2013-12-18 18:13:07

by Eric Sandeen

[permalink] [raw]
Subject: Re: [PATCH v3 0/3] Add XIP support to ext4

On 12/17/13, 4:30 PM, Dave Chinner wrote:
> On Tue, Dec 17, 2013 at 02:18:25PM -0500, Matthew Wilcox wrote:
>> For v3, we've addressed the problem with unwritten extents that Dave
>> Chinner pointed out.
>
> No, you haven't addressed the problem. There is nothing in this
> patch set that converts an unwritten extent after it is written to.
> Hence on every subsequent read will return zeros because the block
> is still marked as unwritten.

I'd really strongly encourage you guys to try running through xfstests.

IF you haven't seen this stuff fail yet, you're apparently not
testing these paths - I'm guessing xfstests may expose these holes
quite well.

I'd be happy to help you get up & running on xfstests if you need a
hand.

-Eric


2013-12-19 00:48:36

by Dave Chinner

[permalink] [raw]
Subject: Re: [PATCH v3 0/3] Add XIP support to ext4

On Wed, Dec 18, 2013 at 08:22:34AM -0700, Matthew Wilcox wrote:
> On Wed, Dec 18, 2013 at 11:33:19PM +1100, Dave Chinner wrote:
> > That's something that happens with a mmap page fault. I'm talking
> > about read(2) calls which end up in xip_file_read() ->
> > do_xip_mapping_read().
>
> *light goes on*! Thank you! I'll spin up a patch to fix that.
>
> One approach would be grabbing the i_mmap_mutex while we copy the data
> to userspace. That's not great from a scalability point of view, and I
> think there's a great need to ensure we can't deadlock against a fault
> (eg the classic read() into an mmap() of the same file), but I think
> it's doable.

i_mmap_mutex only covers the mmap page faults, not the changes to
the filesystem that the truncate does.

We already have a model for handling non page cache based IO paths:
Direct IO has to handle this read/truncate race condition without
page lock serialisation, and it works just fine. Go and look at
inode_dio_wait() rather than reinventing the wheel.

> > There needs to be serialisation against truncate provided in some
> > way, and that's what the page cache page locks provide non-XIP
> > buffered read paths. We don't have them in the XIP infrastructure,
> > and so there's a problem here.
> >
> > Holding the i_mutex is not sufficient, as the locks that need to be
> > held to provide this serialisation are owned by the filesystems, not
> > the generic code. Hence XIP needs to use the normal .aio_read path
> > and have the filesystems call do_xip_mapping_read() when the
> > appropriate locks have been gained.
> >
> > And, in fact, the XIP write(2) path needs to go through filesystems
> > to be locked correctly just like the read path. Buffered writes need
> > to be serialised against reads and other filesystem operations, and
> > holding the i_mutex is not sufficient for that purpose - again, the
> > locks tha tneed to be held are defined by the filesystem, not the
> > XIP infrastructure....
>
> I see, I see ... I'm going to have to think about this some more.

I've already explained how to do this - remember my comments about
using the direct IO setup model to weave in and out of the
filesystems appropriately?

> > The only saving grace is that XIP is so old it doesn't use the
> > multi-block mapping code that all filesystems now provide to
> > mpage_readpages(), and so once the blocks have been removed from the
> > inode the mapping.

"ext4 will not find an extent and return a hole, hence returning zeros
rather than stale data. But this won't work on XFS, because extents
beyond EOF are allowed, are common, and need to be handled specially
by the write IO path."

> I think maybe you lost some words from that final sentence? I have patches
> that attempt to make the XIP code work in larger quantities than single
> pages, so that's in progress.

And at that point you need external serialisation against truncate,
because mapping a large extent with no other serialisation means it
will be considered valid even when it isn't. The fine-grained page
lock/mapping check is what prevents this truncate race in the
buffered read path, and there's nothing like that in the XIP code so
multi-block mappings are going to expose the read/truncate race far
worse.

> > Like I said, the XIP code is needs a heap of infrastructure work
> > before we can hook a modern filesystem up to it in a sane way.
>
> Oh, I quite agree. I was just so focused on the problems with mmap vs
> truncate I didn't stop to consider the problems with read vs truncate.
> I assumed the original designers had covered that (and in fairness,
> maybe they did, but it got broken at some point in the last eight years).

I don't think it ever worked correctly - XIP has the same IO path
requirements for truncate serialisation as direct IO has but the
XIP code appears to have never considered this truncate problem.

This is one of the reasons I'm recommending that XIP follow the
direct IO model for the read/write IO path: we already have working,
tested infrastructure that solves this problem, and XIP can simply
hook into that and these problems just go away.

Cheers,

Dave.
--
Dave Chinner
[email protected]

2013-12-19 01:05:31

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [PATCH v3 0/3] Add XIP support to ext4

On Thu, Dec 19, 2013 at 11:48:31AM +1100, Dave Chinner wrote:
> We already have a model for handling non page cache based IO paths:
> Direct IO has to handle this read/truncate race condition without
> page lock serialisation, and it works just fine. Go and look at
> inode_dio_wait() rather than reinventing the wheel.

I've spent most of today looking at that code. Here's where I'm at
right now. It doesn't even compile.

diff --git a/fs/ext2/xip.h b/fs/ext2/xip.h
index 18b34d2..29be737 100644
--- a/fs/ext2/xip.h
+++ b/fs/ext2/xip.h
@@ -16,9 +16,7 @@ static inline int ext2_use_xip (struct super_block *sb)
}
int ext2_get_xip_mem(struct address_space *, pgoff_t, int,
void **, unsigned long *);
-#define mapping_is_xip(map) unlikely(map->a_ops->get_xip_mem)
#else
-#define mapping_is_xip(map) 0
#define ext2_xip_verify_sb(sb) do { } while (0)
#define ext2_use_xip(sb) 0
#define ext2_clear_xip_target(inode, chain) 0
diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index b9499b2..66d2b35 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -190,7 +190,8 @@ ext4_file_write(struct kiocb *iocb, const struct iovec *iov,
}
}

- if (unlikely(iocb->ki_filp->f_flags & O_DIRECT))
+ if (unlikely(iocb->ki_filp->f_flags & O_DIRECT) ||
+ (mapping_is_xip(inode->i_mapping)))
ret = ext4_file_dio_write(iocb, iov, nr_segs, pos);
else
ret = generic_file_aio_write(iocb, iov, nr_segs, pos);
@@ -612,8 +613,10 @@ const struct file_operations ext4_file_operations = {
#ifdef CONFIG_EXT4_FS_XIP
const struct file_operations ext4_xip_file_operations = {
.llseek = ext4_llseek,
- .read = xip_file_read,
- .write = xip_file_write,
+ .read = do_sync_read,
+ .write = do_sync_write,
+ .aio_read = generic_file_aio_read,
+ .aio_write = ext4_file_write,
.unlocked_ioctl = ext4_ioctl,
#ifdef CONFIG_COMPAT
.compat_ioctl = ext4_compat_ioctl,
diff --git a/fs/ext4/indirect.c b/fs/ext4/indirect.c
index 594009f..ae760d9 100644
--- a/fs/ext4/indirect.c
+++ b/fs/ext4/indirect.c
@@ -686,15 +686,22 @@ retry:
inode_dio_done(inode);
goto locked;
}
- ret = __blockdev_direct_IO(rw, iocb, inode,
- inode->i_sb->s_bdev, iov,
- offset, nr_segs,
- ext4_get_block, NULL, NULL, 0);
+ if (mapping_is_xip(file->f_mapping))
+ ret = xip_io(rw, iocb, inode, iov, offset, nr_segs,
+ ext4_get_block, NULL, 0);
+ else
+ ret = __blockdev_direct_IO(rw, iocb, inode,
+ inode->i_sb->s_bdev, iov, offset,
+ nr_segs, ext4_get_block, NULL, NULL, 0);
inode_dio_done(inode);
} else {
locked:
- ret = blockdev_direct_IO(rw, iocb, inode, iov,
- offset, nr_segs, ext4_get_block);
+ if (mapping_is_xip(file->f_mapping))
+ ret = xip_io(rw, iocb, inode, iov, offset, nr_segs,
+ ext4_get_block, NULL, 0);
+ else
+ ret = blockdev_direct_IO(rw, iocb, inode, iov,
+ offset, nr_segs, ext4_get_block);

if (unlikely((rw & WRITE) && ret < 0)) {
loff_t isize = i_size_read(inode);
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 7b50832..0303412 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3170,13 +3170,14 @@ static ssize_t ext4_ext_direct_IO(int rw, struct kiocb *iocb,
get_block_func = ext4_get_block_write;
dio_flags = DIO_LOCKING;
}
- ret = __blockdev_direct_IO(rw, iocb, inode,
- inode->i_sb->s_bdev, iov,
- offset, nr_segs,
- get_block_func,
- ext4_end_io_dio,
- NULL,
- dio_flags);
+ if (mapping_is_xip(file->f_mapping))
+ ret = xip_io(rw, iocb, inode, iov, offset, nr_segs,
+ get_block_func, ext4_end_io_dio, dio_flags);
+ else
+ ret = __blockdev_direct_IO(rw, iocb, inode,
+ inode->i_sb->s_bdev, iov, offset,
+ nr_segs, get_block_func,
+ ext4_end_io_dio, NULL, dio_flags);

/*
* Put our reference to io_end. This can free the io_end structure e.g.
@@ -3291,6 +3292,7 @@ static const struct address_space_operations ext4_aops = {
const struct address_space_operations ext4_xip_aops = {
.bmap = ext4_bmap,
.get_xip_mem = ext4_get_xip_mem,
+ .direct_IO = ext4_direct_IO,
};

static const struct address_space_operations ext4_journalled_aops = {
diff --git a/fs/ext4/xip.h b/fs/ext4/xip.h
index af0d553..b279dae 100644
--- a/fs/ext4/xip.h
+++ b/fs/ext4/xip.h
@@ -15,9 +15,7 @@ static inline int ext4_use_xip(struct super_block *sb)
}
int ext4_get_xip_mem(struct address_space *, pgoff_t, int,
void **, unsigned long *);
-#define mapping_is_xip(map) unlikely(map->a_ops->get_xip_mem)
#else
-#define mapping_is_xip(map) 0
#define ext4_use_xip(sb) 0
#define ext4_clear_xip_target(inode, chain) 0
#define ext4_get_xip_mem NULL
diff --git a/include/linux/fs.h b/include/linux/fs.h
index a063bff..94a8e50 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2511,6 +2511,14 @@ extern ssize_t xip_file_write(struct file *filp, const char __user *buf,
extern int xip_truncate_page(struct address_space *mapping, loff_t from);
extern int xip_zero_page_range(struct address_space *, loff_t from,
unsigned length);
+extern ssize_t xip_io(int rw, struct kiocb *, struct inode *,
+ const struct iovec *, loff_t, unsigned segs,
+ get_block_t, dio_iodone_t, int flags);
+
+static inline bool mapping_is_xip(struct address_space *mapping)
+{
+ return mapping->a_ops->get_xip_mem != NULL;
+}
#else
static inline int xip_truncate_page(struct address_space *mapping, loff_t from)
{
@@ -2522,6 +2530,18 @@ static inline int xip_zero_page_range(struct address_space *mapping,
{
return 0;
}
+
+static inline bool mapping_is_xip(struct address_space *mapping)
+{
+ return false;
+}
+
+static inline ssize_t xip_io(int rw, struct kiocb *iocb, struct inode *inode,
+ const struct iovec *iov, loff_t offset, unsigned nr_segs,
+ get_block_t get_block, dio_iodone_t end_io, int flags)
+{
+ return -ENOTTY;
+}
#endif

#ifdef CONFIG_BLOCK
diff --git a/mm/filemap_xip.c b/mm/filemap_xip.c
index d808b72..38caad3 100644
--- a/mm/filemap_xip.c
+++ b/mm/filemap_xip.c
@@ -8,6 +8,7 @@
*
*/

+#include <linux/buffer_head.h>
#include <linux/fs.h>
#include <linux/pagemap.h>
#include <linux/export.h>
@@ -41,6 +42,103 @@ static struct page *xip_sparse_page(void)
return __xip_sparse_page;
}

+static ssize_t __xip_io(int rw, struct inode *inode, const struct iovec *iov,
+ loff_t offset, loff_t end, unsigned nr_segs,
+ get_block_t get_block, struct buffer_head *bh)
+{
+ ssize_t retval;
+ unsigned seg = 0;
+ unsigned len;
+ unsigned copied = 0;
+ loff_t max = offset;
+
+ while (offset < end) {
+ void __user *buf = iov[seg].iov_base + copied;
+ bool hole;
+
+ if (max == offset) {
+ sector_t block = offset >> inode->i_blkbits;
+ memset(bh, 0, sizeof(*bh));
+ bh->b_size = ALIGN(end - offset, PAGE_SIZE);
+ retval = get_block(inode, block, bh, 0);
+ if (retval)
+ break;
+ if (buffer_mapped(bh))
+ hole = false;
+ else
+ hole = true;
+ if (rw == WRITE && hole) {
+ bh->b_size = ALIGN(end - offset, PAGE_SIZE);
+ retval = get_block(inode, block, bh, 1);
+ if (retval)
+ break;
+ }
+ max = offset + bh->b_size;
+ }
+
+ len = min_t(unsigned, iov[seg].iov_len - copied, max - offset);
+
+ if (rw == WRITE)
+ len -= __copy_from_user_nocache(addr, buf, len);
+ else if (!hole)
+ len -= __copy_to_user(buf, addr, len);
+ else
+ len -= __clear_user(buf, len);
+
+ offset += len;
+ copied += len;
+ if (copied == iov[seg].iov_len) {
+ seg++;
+ copied = 0;
+ }
+ }
+
+ return retval;
+}
+
+/*
+ * Perform I/O to an XIP file. We follow the same rules as
+ * __blockdev_direct_IO with respect to locking
+ */
+ssize_t xip_io(int rw, struct kiocb *iocb, struct inode *inode,
+ const struct iovec *iov, loff_t offset, unsigned nr_segs,
+ get_block_t get_block, dio_iodone_t end_io, int flags)
+{
+ struct buffer_head bh;
+ unsigned seg;
+ ssize_t retval = -EINVAL;
+ loff_t end = offset;
+
+ for (seg = 0; seg < nr_segs; seg++)
+ end += iov[seg].iov_len;
+
+ if ((flags & DIO_LOCKING) && (rw == READ)) {
+ struct address_space *mapping = inode->i_mapping;
+ mutex_lock(&inode->i_mutex);
+ retval = filemap_write_and_wait_range(mapping, offset, end - 1);
+ if (retval) {
+ mutex_unlock(&inode->i_mutex);
+ goto out;
+ }
+ }
+
+ /* Protects against truncate */
+ atomic_inc(&inode->i_dio_count);
+
+ retval = __xip_io(rw, inode, iov, offset, end, nr_segs, get_block, &bh);
+
+ if ((flags & DIO_LOCKING) && (rw == READ))
+ mutex_unlock(&inode->i_mutex);
+
+ inode_dio_done(inode);
+
+ if (end_io)
+ end_io(iocb, offset, transferred, bh.b_private);
+ out:
+ return retval;
+}
+EXPORT_SYMBOL_GPL(xip_io);
+
/*
* This is a file read routine for execute in place files, and uses
* the mapping->a_ops->get_xip_mem() function for the actual low-level

--
Matthew Wilcox Intel Open Source Technology Centre
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours. We can't possibly take such
a retrograde step."

2013-12-19 01:58:44

by Dave Chinner

[permalink] [raw]
Subject: Re: [PATCH v3 0/3] Add XIP support to ext4

On Wed, Dec 18, 2013 at 06:05:31PM -0700, Matthew Wilcox wrote:
> On Thu, Dec 19, 2013 at 11:48:31AM +1100, Dave Chinner wrote:
> > We already have a model for handling non page cache based IO paths:
> > Direct IO has to handle this read/truncate race condition without
> > page lock serialisation, and it works just fine. Go and look at
> > inode_dio_wait() rather than reinventing the wheel.
>
> I've spent most of today looking at that code. Here's where I'm at
> right now. It doesn't even compile.

Comments in line.

>
> diff --git a/fs/ext2/xip.h b/fs/ext2/xip.h
> index 18b34d2..29be737 100644
> --- a/fs/ext2/xip.h
> +++ b/fs/ext2/xip.h
> @@ -16,9 +16,7 @@ static inline int ext2_use_xip (struct super_block *sb)
> }
> int ext2_get_xip_mem(struct address_space *, pgoff_t, int,
> void **, unsigned long *);
> -#define mapping_is_xip(map) unlikely(map->a_ops->get_xip_mem)
> #else
> -#define mapping_is_xip(map) 0
> #define ext2_xip_verify_sb(sb) do { } while (0)
> #define ext2_use_xip(sb) 0
> #define ext2_clear_xip_target(inode, chain) 0
> diff --git a/fs/ext4/file.c b/fs/ext4/file.c
> index b9499b2..66d2b35 100644
> --- a/fs/ext4/file.c
> +++ b/fs/ext4/file.c
> @@ -190,7 +190,8 @@ ext4_file_write(struct kiocb *iocb, const struct iovec *iov,
> }
> }
>
> - if (unlikely(iocb->ki_filp->f_flags & O_DIRECT))
> + if (unlikely(iocb->ki_filp->f_flags & O_DIRECT) ||
> + (mapping_is_xip(inode->i_mapping)))

I suspect a helper function a good idea here. Something like
"is_io_direct(iocb->ki_filp)"

> index 594009f..ae760d9 100644
> --- a/fs/ext4/indirect.c
> +++ b/fs/ext4/indirect.c
> @@ -686,15 +686,22 @@ retry:
> inode_dio_done(inode);
> goto locked;
> }
> - ret = __blockdev_direct_IO(rw, iocb, inode,
> - inode->i_sb->s_bdev, iov,
> - offset, nr_segs,
> - ext4_get_block, NULL, NULL, 0);
> + if (mapping_is_xip(file->f_mapping))
> + ret = xip_io(rw, iocb, inode, iov, offset, nr_segs,
> + ext4_get_block, NULL, 0);

xip_direct_io() might be a better name here...

> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -2511,6 +2511,14 @@ extern ssize_t xip_file_write(struct file *filp, const char __user *buf,
> extern int xip_truncate_page(struct address_space *mapping, loff_t from);
> extern int xip_zero_page_range(struct address_space *, loff_t from,
> unsigned length);
> +extern ssize_t xip_io(int rw, struct kiocb *, struct inode *,
> + const struct iovec *, loff_t, unsigned segs,
> + get_block_t, dio_iodone_t, int flags);
> +
> +static inline bool mapping_is_xip(struct address_space *mapping)
> +{
> + return mapping->a_ops->get_xip_mem != NULL;
> +}

I think that you should put a flag in the mapping for this, rather
than chase pointers to do the check.

> +static ssize_t __xip_io(int rw, struct inode *inode, const struct iovec *iov,
> + loff_t offset, loff_t end, unsigned nr_segs,
> + get_block_t get_block, struct buffer_head *bh)
> +{
> + ssize_t retval;
> + unsigned seg = 0;
> + unsigned len;
> + unsigned copied = 0;
> + loff_t max = offset;
> +
> + while (offset < end) {
> + void __user *buf = iov[seg].iov_base + copied;
> + bool hole;
> +
> + if (max == offset) {
> + sector_t block = offset >> inode->i_blkbits;
> + memset(bh, 0, sizeof(*bh));
> + bh->b_size = ALIGN(end - offset, PAGE_SIZE);
> + retval = get_block(inode, block, bh, 0);
> + if (retval)
> + break;
> + if (buffer_mapped(bh))
> + hole = false;
> + else
> + hole = true;
> + if (rw == WRITE && hole) {
> + bh->b_size = ALIGN(end - offset, PAGE_SIZE);
> + retval = get_block(inode, block, bh, 1);
> + if (retval)
> + break;
> + }

Why do two write mappings here? If it's a write, then we always want
to fill a hole, so the create value sent to getblock is:

retval = get_block(inode, block, bh, rw == WRITE);
if (retval)
break;
if (buffer_mapped(bh))
hole = false;
else
hole = true;

And that's all you need.

> + max = offset + bh->b_size;
> + }
> +
> + len = min_t(unsigned, iov[seg].iov_len - copied, max - offset);
> +
> + if (rw == WRITE)
> + len -= __copy_from_user_nocache(addr, buf, len);
> + else if (!hole)
> + len -= __copy_to_user(buf, addr, len);
> + else
> + len -= __clear_user(buf, len);
> +
> + offset += len;
> + copied += len;
> + if (copied == iov[seg].iov_len) {
> + seg++;
> + copied = 0;
> + }
> + }
> +
> + return retval;
> +}
> +
> +/*
> + * Perform I/O to an XIP file. We follow the same rules as
> + * __blockdev_direct_IO with respect to locking
> + */

OK, that's interesting, because it means that it will be different
to normal buffered page cache IO. It will allow concurrent
overlapping reads and writes - something that POSIX does not allow -
and places the burden of synchronising concurrent reads and writes
on the application.

That is different to the current XIP, which serialises writes
against each other, but not against reads. That's not strictly POSIX
compliant, either, but it prevents concurrent overlapping writes
from occurring and so behaves more like applications expect buffered
IO to work.

For persistent memory, I'd prefer that we have concurrent write Io
capabilities from the start, but I'm not sure we should just do this
without first talking about it....

> +ssize_t xip_io(int rw, struct kiocb *iocb, struct inode *inode,
> + const struct iovec *iov, loff_t offset, unsigned nr_segs,
> + get_block_t get_block, dio_iodone_t end_io, int flags)
> +{
> + struct buffer_head bh;
> + unsigned seg;
> + ssize_t retval = -EINVAL;
> + loff_t end = offset;
> +
> + for (seg = 0; seg < nr_segs; seg++)
> + end += iov[seg].iov_len;
> +
> + if ((flags & DIO_LOCKING) && (rw == READ)) {
> + struct address_space *mapping = inode->i_mapping;
> + mutex_lock(&inode->i_mutex);
> + retval = filemap_write_and_wait_range(mapping, offset, end - 1);
> + if (retval) {
> + mutex_unlock(&inode->i_mutex);
> + goto out;
> + }
> + }
> +
> + /* Protects against truncate */
> + atomic_inc(&inode->i_dio_count);
> +
> + retval = __xip_io(rw, inode, iov, offset, end, nr_segs, get_block, &bh);

Can we avoid using "__" prefixes for new code? xip_do_direct_io() is
a much better name....

> +
> + if ((flags & DIO_LOCKING) && (rw == READ))
> + mutex_unlock(&inode->i_mutex);
> +
> + inode_dio_done(inode);
> +
> + if (end_io)
> + end_io(iocb, offset, transferred, bh.b_private);

And that solves the unwritten extent problem for the IO path. Now we
just need to solve it for the mmap path. That, I suspect will
require a custom .page_mkwrite handler....

Cheers,

Dave.
--
Dave Chinner
[email protected]

2013-12-19 02:07:59

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH v3 0/3] Add XIP support to ext4

On Wed, Dec 18, 2013 at 07:27:49AM -0700, Matthew Wilcox wrote:
>
> I think there is a callback in xip_file_write(), and it's get_xip_mem().
> From what you're saying, it sounds like it's just not doing enough.

The problem is that git_xip_mem() is called before we write to the
memory, right?

We need to convert the uninit extents to be marked as initialized in
*after* the write has been sent to the storage medium.

> Could you take pity on me and tell me what flags I need to check in the
> buffer_head to determine this state of affairs?

The buffer head flag is BH_Unwritten, but note that how BH_Delay and
BH_Boundary are handled is very file system specific. There is no
guarantee that all file systems will use these flags, nor will use
them in the same way. (I've been considering changing ext4 to not
attach buffer heads to the pages at all -- except for DIO, which
requires them today -- because iterating over all of the buffer heads
attached to the pages is a pain in the tuckus, and ext4 now has an
extent status tree which we could use to get this information without
needing to lock the pages.)

It's much better to give a way for the file system to attach a
callback to the XIP request object.

Cheers,

- Ted

2013-12-19 04:12:43

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [PATCH v3 0/3] Add XIP support to ext4

On Wed, Dec 18, 2013 at 09:07:59PM -0500, Theodore Ts'o wrote:
> On Wed, Dec 18, 2013 at 07:27:49AM -0700, Matthew Wilcox wrote:
> >
> > I think there is a callback in xip_file_write(), and it's get_xip_mem().
> > From what you're saying, it sounds like it's just not doing enough.
>
> The problem is that git_xip_mem() is called before we write to the
> memory, right?
>
> We need to convert the uninit extents to be marked as initialized in
> *after* the write has been sent to the storage medium.

Now that I've spent the best part of a day looking at the ext4 code, I
still don't think there's a problem here. With the way the XIP code is
currently written (calling ext4_get_block with create=1), we won't get an
uninitialised extent in the caller. Instead, we'll get one that's been
zeroed (the zeroing is part of patch 3/3 and done only for xip files).

As I understand it, when ext4 uses direct I/O, it can pass
ext4_get_block_write() as the get_block method, which uses the
magic EXT4_GET_BLOCKS_IO_CREATE_EXT flag to permit the allocation of
uninitialised extents. But the regular ext4_get_block cannot create
uninitialised extents (it can return them in the case of create=0,
and we handle that correctly as a hole).

Moreover, I don't see, eg, block_read_full_page() handling uninitialised
extents specially, so I'm pretty sure the regular ext4_get_block()
can't create uninitialised extents.

--
Matthew Wilcox Intel Open Source Technology Centre
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours. We can't possibly take such
a retrograde step."

2013-12-19 04:38:00

by Dave Chinner

[permalink] [raw]
Subject: Re: [PATCH v3 0/3] Add XIP support to ext4

On Wed, Dec 18, 2013 at 09:12:41PM -0700, Matthew Wilcox wrote:
> On Wed, Dec 18, 2013 at 09:07:59PM -0500, Theodore Ts'o wrote:
> > On Wed, Dec 18, 2013 at 07:27:49AM -0700, Matthew Wilcox wrote:
> > >
> > > I think there is a callback in xip_file_write(), and it's get_xip_mem().
> > > From what you're saying, it sounds like it's just not doing enough.
> >
> > The problem is that git_xip_mem() is called before we write to the
> > memory, right?
> >
> > We need to convert the uninit extents to be marked as initialized in
> > *after* the write has been sent to the storage medium.
>
> Now that I've spent the best part of a day looking at the ext4 code, I
> still don't think there's a problem here. With the way the XIP code is
> currently written (calling ext4_get_block with create=1), we won't get an
> uninitialised extent in the caller. Instead, we'll get one that's been
> zeroed (the zeroing is part of patch 3/3 and done only for xip files).

You will with XFS. And you will when an application uses fallocate()
to allocate the space before the write(2) call. So it's irrelevant
what the write call does - the generic infrastructure needs to
handle the fact that it can be writing into an unwritten region and
be required to call a filesystem specific IO completion handler to
deal with it.

> As I understand it, when ext4 uses direct I/O, it can pass
> ext4_get_block_write() as the get_block method, which uses the
> magic EXT4_GET_BLOCKS_IO_CREATE_EXT flag to permit the allocation of
> uninitialised extents. But the regular ext4_get_block cannot create
> uninitialised extents (it can return them in the case of create=0,
> and we handle that correctly as a hole).

Stop thinking that ext4 is the entire world. XFS creates unwritten
extents via the direct IO getblock callout when create=1 is set an
dthe write lands in a hole. It then uses the IO completion callout
ot convert them to written extents once the write IO is complete.

Cheers,

Dave.
--
Dave Chinner
[email protected]

2013-12-19 05:43:03

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH v3 0/3] Add XIP support to ext4

On Wed, Dec 18, 2013 at 09:12:41PM -0700, Matthew Wilcox wrote:
> Now that I've spent the best part of a day looking at the ext4 code, I
> still don't think there's a problem here. With the way the XIP code is
> currently written (calling ext4_get_block with create=1), we won't get an
> uninitialised extent in the caller. Instead, we'll get one that's been
> zeroed (the zeroing is part of patch 3/3 and done only for xip files).

If the block was originally allocated via fallocate(2), it will be
marked as uninitialized. When you call ext4_get_block(), if the block
has been allocated, it will be returned --- and ext4_map_block() as
called by ext4_get_block() does ****not*** clear the uninitialized
flag. It can't do so because it would be racy; you can only clear the
flag once the data blocks has been written.

As far as patch 3/3, it clears the pages in the page cache, but it
doesn't zap them in the XIP storage device. But it only does this on
the code path when it allocated a block. But if the block has already
been previously mapped via fallocate(2), you never hit this section of
code.

Regards,

- Ted

2013-12-19 15:20:49

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [PATCH v3 0/3] Add XIP support to ext4

On Thu, Dec 19, 2013 at 12:43:03AM -0500, Theodore Ts'o wrote:
> On Wed, Dec 18, 2013 at 09:12:41PM -0700, Matthew Wilcox wrote:
> > Now that I've spent the best part of a day looking at the ext4 code, I
> > still don't think there's a problem here. With the way the XIP code is
> > currently written (calling ext4_get_block with create=1), we won't get an
> > uninitialised extent in the caller. Instead, we'll get one that's been
> > zeroed (the zeroing is part of patch 3/3 and done only for xip files).
>
> If the block was originally allocated via fallocate(2), it will be
> marked as uninitialized. When you call ext4_get_block(), if the block
> has been allocated, it will be returned --- and ext4_map_block() as
> called by ext4_get_block() does ****not*** clear the uninitialized
> flag. It can't do so because it would be racy; you can only clear the
> flag once the data blocks has been written.
>
> As far as patch 3/3, it clears the pages in the page cache, but it
> doesn't zap them in the XIP storage device. But it only does this on
> the code path when it allocated a block. But if the block has already
> been previously mapped via fallocate(2), you never hit this section of
> code.

Umm. That sounds like the real bug then. Any page returned from
get_xip_mem must be initialised, because we may be about to map it
into userspace.

We could have ext4_get_xip_mem() check buffer_unwritten(); if it's set,
zero the blocks and call ext4_convert_unwritten_extents(). Would that
work?

--
Matthew Wilcox Intel Open Source Technology Centre
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours. We can't possibly take such
a retrograde step."

2013-12-19 15:32:15

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [PATCH v3 0/3] Add XIP support to ext4

On Thu, Dec 19, 2013 at 12:58:44PM +1100, Dave Chinner wrote:
> >
> > - if (unlikely(iocb->ki_filp->f_flags & O_DIRECT))
> > + if (unlikely(iocb->ki_filp->f_flags & O_DIRECT) ||
> > + (mapping_is_xip(inode->i_mapping)))
>
> I suspect a helper function a good idea here. Something like
> "is_io_direct(iocb->ki_filp)"

Seems like a good idea.

> > index 594009f..ae760d9 100644
> > --- a/fs/ext4/indirect.c
> > +++ b/fs/ext4/indirect.c
> > @@ -686,15 +686,22 @@ retry:
> > inode_dio_done(inode);
> > goto locked;
> > }
> > - ret = __blockdev_direct_IO(rw, iocb, inode,
> > - inode->i_sb->s_bdev, iov,
> > - offset, nr_segs,
> > - ext4_get_block, NULL, NULL, 0);
> > + if (mapping_is_xip(file->f_mapping))
> > + ret = xip_io(rw, iocb, inode, iov, offset, nr_segs,
> > + ext4_get_block, NULL, 0);
>
> xip_direct_io() might be a better name here...

I you're a man who his functions verbs :-)

> > +static inline bool mapping_is_xip(struct address_space *mapping)
> > +{
> > + return mapping->a_ops->get_xip_mem != NULL;
> > +}
>
> I think that you should put a flag in the mapping for this, rather
> than chase pointers to do the check.

Probably. I think we may end up without a get_xip_mem() aop by the time
we're finished.

> > + retval = get_block(inode, block, bh, 0);
> > + if (retval)
> > + break;
> > + if (buffer_mapped(bh))
> > + hole = false;
> > + else
> > + hole = true;
> > + if (rw == WRITE && hole) {
> > + bh->b_size = ALIGN(end - offset, PAGE_SIZE);
> > + retval = get_block(inode, block, bh, 1);
> > + if (retval)
> > + break;
> > + }
>
> Why do two write mappings here? If it's a write, then we always want
> to fill a hole, so the create value sent to getblock is:

Yeah, there's a missing piece here. At the moment, I'm supposed to take
the stupid xip_sparse_mutex before filling a hole, and call __xip_unmap
after filling it. I think that has to go away, and once that's done,
I agree with your optimisation.

> > +/*
> > + * Perform I/O to an XIP file. We follow the same rules as
> > + * __blockdev_direct_IO with respect to locking
> > + */
>
> OK, that's interesting, because it means that it will be different
> to normal buffered page cache IO. It will allow concurrent
> overlapping reads and writes - something that POSIX does not allow -
> and places the burden of synchronising concurrent reads and writes
> on the application.
>
> That is different to the current XIP, which serialises writes
> against each other, but not against reads. That's not strictly POSIX
> compliant, either, but it prevents concurrent overlapping writes
> from occurring and so behaves more like applications expect buffered
> IO to work.
>
> For persistent memory, I'd prefer that we have concurrent write Io
> capabilities from the start, but I'm not sure we should just do this
> without first talking about it....

I think you're right. Let's drag this topic out to lkml and make sure
Linus is aware before we go too much further.

> > + /* Protects against truncate */
> > + atomic_inc(&inode->i_dio_count);
> > +
> > + retval = __xip_io(rw, inode, iov, offset, end, nr_segs, get_block, &bh);
>
> Can we avoid using "__" prefixes for new code? xip_do_direct_io() is
> a much better name....

Then it won't fit on a single line ;-) I have no attachment to the name,
but isn't all xip IO direct?

> > +
> > + if ((flags & DIO_LOCKING) && (rw == READ))
> > + mutex_unlock(&inode->i_mutex);
> > +
> > + inode_dio_done(inode);
> > +
> > + if (end_io)
> > + end_io(iocb, offset, transferred, bh.b_private);
>
> And that solves the unwritten extent problem for the IO path. Now we
> just need to solve it for the mmap path. That, I suspect will
> require a custom .page_mkwrite handler....

No, page_mkwrite() never gets called. At this point, I'm thinking a
custom ->fault handler that looks something like this:

static int ext4_xip_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
{
return xip_fault(vma, vmf, ext4_get_block_write, ext4_end_io_dio);
}

But I'll report back further when I've had a chance to see how it
turns out.

--
Matthew Wilcox Intel Open Source Technology Centre
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours. We can't possibly take such
a retrograde step."

2013-12-19 16:17:38

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH v3 0/3] Add XIP support to ext4

On Thu, Dec 19, 2013 at 08:20:49AM -0700, Matthew Wilcox wrote:
> We could have ext4_get_xip_mem() check buffer_unwritten(); if it's set,
> zero the blocks and call ext4_convert_unwritten_extents(). Would that
> work?

That would work, but we'd be doing a double write to each data block
--- first writing all zero's, and then later, writing the actual data.
So this would halve our write bandwidth, and double the write wear on
the device.

And of course, if we did this on the read path, we'd be unnecessarily
doing an extent conversion.

Regards,

- Ted

2013-12-19 17:12:05

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [PATCH v3 0/3] Add XIP support to ext4

On Thu, Dec 19, 2013 at 11:17:28AM -0500, Theodore Ts'o wrote:
> On Thu, Dec 19, 2013 at 08:20:49AM -0700, Matthew Wilcox wrote:
> > We could have ext4_get_xip_mem() check buffer_unwritten(); if it's set,
> > zero the blocks and call ext4_convert_unwritten_extents(). Would that
> > work?
>
> That would work, but we'd be doing a double write to each data block
> --- first writing all zero's, and then later, writing the actual data.
> So this would halve our write bandwidth, and double the write wear on
> the device.

Once this patch is done that I'm working on in the other thread, the
only place we'll be calling get_xip_mem() is in the page fault path,
which always has to zero no matter whether it's a load or a store.

As I said in my last reply to Dave:

At this point, I'm thinking a
custom ->fault handler that looks something like this:

static int ext4_xip_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
{
return xip_fault(vma, vmf, ext4_get_block_write, ext4_end_io_dio);
}

... I think it'll actually be ext4_get_block_fault, not _write, and it
will include code to zero the returned blocks if they're uninitialised.

--
Matthew Wilcox Intel Open Source Technology Centre
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours. We can't possibly take such
a retrograde step."

2013-12-19 17:18:54

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH v3 0/3] Add XIP support to ext4

On Thu, Dec 19, 2013 at 10:12:02AM -0700, Matthew Wilcox wrote:
>
> ... I think it'll actually be ext4_get_block_fault, not _write, and it
> will include code to zero the returned blocks if they're uninitialised.

I assume what you mean here is if we see that the blocks are
uninitialized, we don't need to read from the persistent memory at
all; we can just map in a zeroed page, hopefully one from our stock of
pre-zeroed pages. Yes?

- Ted

2013-12-19 23:47:00

by Dave Chinner

[permalink] [raw]
Subject: Re: [PATCH v3 0/3] Add XIP support to ext4

On Thu, Dec 19, 2013 at 08:32:13AM -0700, Matthew Wilcox wrote:
> On Thu, Dec 19, 2013 at 12:58:44PM +1100, Dave Chinner wrote:
> > > + retval = get_block(inode, block, bh, 0);
> > > + if (retval)
> > > + break;
> > > + if (buffer_mapped(bh))
> > > + hole = false;
> > > + else
> > > + hole = true;
> > > + if (rw == WRITE && hole) {
> > > + bh->b_size = ALIGN(end - offset, PAGE_SIZE);
> > > + retval = get_block(inode, block, bh, 1);
> > > + if (retval)
> > > + break;
> > > + }
> >
> > Why do two write mappings here? If it's a write, then we always want
> > to fill a hole, so the create value sent to getblock is:
>
> Yeah, there's a missing piece here. At the moment, I'm supposed to take
> the stupid xip_sparse_mutex before filling a hole, and call __xip_unmap
> after filling it. I think that has to go away, and once that's done,
> I agree with your optimisation.

Then perhaps we need to get rid of the xip_sparse_mutex first? :/

> > > +/*
> > > + * Perform I/O to an XIP file. We follow the same rules as
> > > + * __blockdev_direct_IO with respect to locking
> > > + */
> >
> > OK, that's interesting, because it means that it will be different
> > to normal buffered page cache IO. It will allow concurrent
> > overlapping reads and writes - something that POSIX does not allow -
> > and places the burden of synchronising concurrent reads and writes
> > on the application.
> >
> > That is different to the current XIP, which serialises writes
> > against each other, but not against reads. That's not strictly POSIX
> > compliant, either, but it prevents concurrent overlapping writes
> > from occurring and so behaves more like applications expect buffered
> > IO to work.
> >
> > For persistent memory, I'd prefer that we have concurrent write Io
> > capabilities from the start, but I'm not sure we should just do this
> > without first talking about it....
>
> I think you're right. Let's drag this topic out to lkml and make sure
> Linus is aware before we go too much further.

Sure.

Keep in mind we've been discussing making normal buffered IO have
the same concurrency model as direct IO, but with the additional
provision that concurrent IO to the same range is serialised (i.e.
the IO range locks we've preveiously mentioned in the truncate/mmap
discussion). They would also be used for direct IO to avoid the
overallping IO issue there, too.


> > > + /* Protects against truncate */
> > > + atomic_inc(&inode->i_dio_count);
> > > +
> > > + retval = __xip_io(rw, inode, iov, offset, end, nr_segs, get_block, &bh);
> >
> > Can we avoid using "__" prefixes for new code? xip_do_direct_io() is
> > a much better name....
>
> Then it won't fit on a single line ;-) I have no attachment to the name,
> but isn't all xip IO direct?

mmap based IO does not use the "direct IO" path....

> > > + if ((flags & DIO_LOCKING) && (rw == READ))
> > > + mutex_unlock(&inode->i_mutex);
> > > +
> > > + inode_dio_done(inode);
> > > +
> > > + if (end_io)
> > > + end_io(iocb, offset, transferred, bh.b_private);
> >
> > And that solves the unwritten extent problem for the IO path. Now we
> > just need to solve it for the mmap path. That, I suspect will
> > require a custom .page_mkwrite handler....
>
> No, page_mkwrite() never gets called. At this point, I'm thinking a
> custom ->fault handler that looks something like this:

And that's another difference to the normal filesystem mmap paths.
.fault is a read-only operation for filesystems and
.page-mkwrite is the write-fault modification path. i.e.
.fault is only supposed to populate the page into the page
cache by reading it from disk via ->readpage(s). It does not do
block allocation - if the fault is over a hole then a new, zeroed
page is placed in the page cache regardless of whether it is a read
or write page fault.

->page_mkwrite is then used by page fault infrstructure to inform
filesystem that a write fault has occurred and they may need to
allocate backing store for the page, or convert unwritten extents to
written.

What xip_file_fault() does is ask the fielsystem to allocate blocks
for writeable mappings, rather than just inserting a sparse page
over holes and unwritten extents. That fails to handle unwritten
extents correctly - they remain unwritten despite the fact that
userspace can now write to the page.

IOWs, xip_file_fault() needs to drop the allocation of blocks and
only ever insert mappings for pages that have data in them or sprase
pages for holes and unwritten extents. Then the filesystem needs to
provide it's own ->page_mkwrite callout to do block allocation and
unwritten extent conversion on write page faults, and the XIP code
needs to provide a helper function to replace the sparse page in the
mappings with the correct page mapped from the backing device after
allocation or unwritten extent conversion.

That will make XIP behave almost identically to the normal page
cache based page fault path, requiring only a small addition to the
filesystem page_mkwrite handler to support XIP...


> static int ext4_xip_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
> {
> return xip_fault(vma, vmf, ext4_get_block_write, ext4_end_io_dio);
> }

I think the xip fault handler should be generic as there's no reason
for it to do anything other that read-only operations. It's the
page_mkwrite callout that needs custom code for each filesystem.

Cheers,

Dave.
--
Dave Chinner
[email protected]

2013-12-20 16:45:30

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [PATCH v3 0/3] Add XIP support to ext4

On Fri, Dec 20, 2013 at 10:46:54AM +1100, Dave Chinner wrote:
> Then perhaps we need to get rid of the xip_sparse_mutex first? :/

Yeah, already done in my tree. Just finishing up a few other pieces.

> > > And that solves the unwritten extent problem for the IO path. Now we
> > > just need to solve it for the mmap path. That, I suspect will
> > > require a custom .page_mkwrite handler....
> >
> > No, page_mkwrite() never gets called. At this point, I'm thinking a
> > custom ->fault handler that looks something like this:
>
> And that's another difference to the normal filesystem mmap paths.
> .fault is a read-only operation for filesystems and
> .page-mkwrite is the write-fault modification path. i.e.
> .fault is only supposed to populate the page into the page
> cache by reading it from disk via ->readpage(s). It does not do
> block allocation - if the fault is over a hole then a new, zeroed
> page is placed in the page cache regardless of whether it is a read
> or write page fault.
>
> ->page_mkwrite is then used by page fault infrstructure to inform
> filesystem that a write fault has occurred and they may need to
> allocate backing store for the page, or convert unwritten extents to
> written.
>
> What xip_file_fault() does is ask the fielsystem to allocate blocks
> for writeable mappings, rather than just inserting a sparse page
> over holes and unwritten extents. That fails to handle unwritten
> extents correctly - they remain unwritten despite the fact that
> userspace can now write to the page.

I agree with you up to this point. But xip_file_fault() uses the same
get_block_t callback to allocate blocks that block_page_mkwrite() does.
So there's no real difference from the fs' point of view.

> IOWs, xip_file_fault() needs to drop the allocation of blocks and
> only ever insert mappings for pages that have data in them or sprase
> pages for holes and unwritten extents. Then the filesystem needs to
> provide it's own ->page_mkwrite callout to do block allocation and
> unwritten extent conversion on write page faults, and the XIP code
> needs to provide a helper function to replace the sparse page in the
> mappings with the correct page mapped from the backing device after
> allocation or unwritten extent conversion.
>
> That will make XIP behave almost identically to the normal page
> cache based page fault path, requiring only a small addition to the
> filesystem page_mkwrite handler to support XIP...

I decided to see if there was anything particularly hard about the XFS
code in this area. I really think it's just this for you:

+++ b/fs/xfs/xfs_file.c
@@ -957,12 +957,27 @@ xfs_file_readdir(
return 0;
}

+static int xfs_xip_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
+{
+ return xip_fault(vma, vmf, xfs_get_blocks);
+}
+
+static const struct vm_operations_struct xfs_xip_vm_ops = {
+ .fault = xfs_xip_fault,
+ .remap_pages = generic_file_remap_pages,
+};
+
STATIC int
xfs_file_mmap(
struct file *filp,
struct vm_area_struct *vma)
{
- vma->vm_ops = &xfs_file_vm_ops;
+ if (IS_XIP(file_inode(filp))) {
+ vma->vm_ops = &xfs_xip_vm_ops;
+ vma->vm_flags |= VM_MIXEDMAP;
+ } else {
+ vma->vm_ops = &xfs_file_vm_ops;
+ }

file_accessed(filp);
return 0;


> > static int ext4_xip_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
> > {
> > return xip_fault(vma, vmf, ext4_get_block_write, ext4_end_io_dio);
> > }
>
> I think the xip fault handler should be generic as there's no reason
> for it to do anything other that read-only operations. It's the
> page_mkwrite callout that needs custom code for each filesystem.

With no struct page for the XIP memory, it's just not feasible to do it
that way.

--
Matthew Wilcox Intel Open Source Technology Centre
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours. We can't possibly take such
a retrograde step."

2013-12-20 18:17:33

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [PATCH v3 0/3] Add XIP support to ext4

On Thu, Dec 19, 2013 at 12:18:48PM -0500, Theodore Ts'o wrote:
> On Thu, Dec 19, 2013 at 10:12:02AM -0700, Matthew Wilcox wrote:
> >
> > ... I think it'll actually be ext4_get_block_fault, not _write, and it
> > will include code to zero the returned blocks if they're uninitialised.
>
> I assume what you mean here is if we see that the blocks are
> uninitialized, we don't need to read from the persistent memory at
> all; we can just map in a zeroed page, hopefully one from our stock of
> pre-zeroed pages. Yes?

Maybe. We have a tension here between wanting to avoid unnecessary
writes to the media (as you say, wear is going to be important for some
media, if not all) and wanting to not fragment files (both for extent
tree compactness and so that we can use PMD or even PGD mappings if the
stars align). It'll be up to the filesystem whether it chooses to satisfy
the get_block request with something prezeroed, or something that aligns
nicely. Ideally, it'll be able to find a block of storage that does both!

Actually, I now see a second way to read what you wrote. If you meant
"we can map in ZERO_PAGE or one of its analogs", then no. The amount
of cruft that optimisation added to the filemap_xip code is horrendous.
I don't think it's a particularly common workload (mmap a holey file,
read lots of zeroes out of it without ever writing to it), so I think
it's far better to allocate a page of storage and zero it.

--
Matthew Wilcox Intel Open Source Technology Centre
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours. We can't possibly take such
a retrograde step."

2013-12-20 19:34:55

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH v3 0/3] Add XIP support to ext4

On Fri, Dec 20, 2013 at 11:17:31AM -0700, Matthew Wilcox wrote:
>
> Maybe. We have a tension here between wanting to avoid unnecessary
> writes to the media (as you say, wear is going to be important for some
> media, if not all) and wanting to not fragment files (both for extent
> tree compactness and so that we can use PMD or even PGD mappings if the
> stars align). It'll be up to the filesystem whether it chooses to satisfy
> the get_block request with something prezeroed, or something that aligns
> nicely. Ideally, it'll be able to find a block of storage that does both!
>
> Actually, I now see a second way to read what you wrote. If you meant
> "we can map in ZERO_PAGE or one of its analogs", then no. The amount
> of cruft that optimisation added to the filemap_xip code is horrendous.
> I don't think it's a particularly common workload (mmap a holey file,
> read lots of zeroes out of it without ever writing to it), so I think
> it's far better to allocate a page of storage and zero it.

It seems that you're primarily focused about allocated versus
unallocated blocks, and what I think Dave and I are trying to point
out is the distinction between initialized and uninitialized blocks
(which are already allocated).

So I was thinking about the case where the blocks were already
allocated and mapped --- so we have a logical -> physical block
mapping already established. However, if the blocks were allocated
via fallocate(2), so they are unallocated, although they will be
well-aligned.

Which means that if you pre-zero at read time, at that point you will
be fragmenting the extent tree, and the blocks are already
well-aligned so it's in fact better to fault in a zero page at read
time when we are dealing with an allocated, but not-yet-initialized
block.

Also, one of the ways which we handle fragmentation is via delayed
allocation. That is, we don't make the allocation decision until the
last possible second. We do lose this optimization for direct I/O,
since that's part of the nature of the beast --- but there's no reason
not to have it for XIP writes --- especially if the goal is to be able
to support persistent memory storage devices in a first class way,
instead of a one-off hack for demonstration purposes....

- Ted

2013-12-20 20:11:03

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [PATCH v3 0/3] Add XIP support to ext4

On Fri, Dec 20, 2013 at 02:34:55PM -0500, Theodore Ts'o wrote:
> On Fri, Dec 20, 2013 at 11:17:31AM -0700, Matthew Wilcox wrote:
> > Maybe. We have a tension here between wanting to avoid unnecessary
> > writes to the media (as you say, wear is going to be important for some
> > media, if not all) and wanting to not fragment files (both for extent
> > tree compactness and so that we can use PMD or even PGD mappings if the
> > stars align). It'll be up to the filesystem whether it chooses to satisfy
> > the get_block request with something prezeroed, or something that aligns
> > nicely. Ideally, it'll be able to find a block of storage that does both!
> >
> > Actually, I now see a second way to read what you wrote. If you meant
> > "we can map in ZERO_PAGE or one of its analogs", then no. The amount
> > of cruft that optimisation added to the filemap_xip code is horrendous.
> > I don't think it's a particularly common workload (mmap a holey file,
> > read lots of zeroes out of it without ever writing to it), so I think
> > it's far better to allocate a page of storage and zero it.
>
> It seems that you're primarily focused about allocated versus
> unallocated blocks, and what I think Dave and I are trying to point
> out is the distinction between initialized and uninitialized blocks
> (which are already allocated).

I understand the difference for filesystems on block devices; filesystem
blocks can be:

* allocated, initialised (eg: after they've been written to)
* allocated, uninitialised (eg: after fallocate)
* unallocated, initialised (eg: written to in the page cache)
* unallocated, uninitialised

A filesystem on top of an XIP device can't have an unallocated,
initialised block. There's no page cache to buffer the write in, so at
the point where you're going to store to it, you have to allocate.

You also can't mmap() an allocated, uninitialised block. It's got to
be initialised before we can insert a PTE that points to it.

> So I was thinking about the case where the blocks were already
> allocated and mapped --- so we have a logical -> physical block
> mapping already established. However, if the blocks were allocated
> via fallocate(2), so they are unallocated, although they will be
> well-aligned.
>
> Which means that if you pre-zero at read time, at that point you will
> be fragmenting the extent tree, and the blocks are already
> well-aligned so it's in fact better to fault in a zero page at read
> time when we are dealing with an allocated, but not-yet-initialized
> block.

Just to check here, you mean a ZERO_PAGE, right? Or a page cache page
that has been zeroed?

> Also, one of the ways which we handle fragmentation is via delayed
> allocation. That is, we don't make the allocation decision until the
> last possible second. We do lose this optimization for direct I/O,
> since that's part of the nature of the beast --- but there's no reason
> not to have it for XIP writes --- especially if the goal is to be able
> to support persistent memory storage devices in a first class way,
> instead of a one-off hack for demonstration purposes....

I think it's also the nature of the beast that you lose this optimisation
for XIP writes. Sure, there are multiple ways we could do _not exactly_
XIP to take advantage of persistent memory, and I think we should discuss
them, but we should leave XIP as meaning XIP. I have some great ideas
about how a hybrid approach could work ...

--
Matthew Wilcox Intel Open Source Technology Centre
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours. We can't possibly take such
a retrograde step."

2013-12-23 03:16:20

by Dave Chinner

[permalink] [raw]
Subject: Re: [PATCH v3 0/3] Add XIP support to ext4

On Fri, Dec 20, 2013 at 11:17:31AM -0700, Matthew Wilcox wrote:
> On Thu, Dec 19, 2013 at 12:18:48PM -0500, Theodore Ts'o wrote:
> > On Thu, Dec 19, 2013 at 10:12:02AM -0700, Matthew Wilcox wrote:
> > >
> > > ... I think it'll actually be ext4_get_block_fault, not _write, and it
> > > will include code to zero the returned blocks if they're uninitialised.
> >
> > I assume what you mean here is if we see that the blocks are
> > uninitialized, we don't need to read from the persistent memory at
> > all; we can just map in a zeroed page, hopefully one from our stock of
> > pre-zeroed pages. Yes?
>
> Maybe. We have a tension here between wanting to avoid unnecessary
> writes to the media (as you say, wear is going to be important for some
> media, if not all) and wanting to not fragment files (both for extent
> tree compactness and so that we can use PMD or even PGD mappings if the
> stars align). It'll be up to the filesystem whether it chooses to satisfy
> the get_block request with something prezeroed, or something that aligns
> nicely. Ideally, it'll be able to find a block of storage that does both!

Yes, XFS with it's per-file extent size hints does that already.
Like I've said previously - this is a solved problem and all we need
is the IO completion callbacks to make sub-extent size IOs convert
the allocated unwritten regions appropriately...

> Actually, I now see a second way to read what you wrote. If you meant
> "we can map in ZERO_PAGE or one of its analogs", then no. The amount
> of cruft that optimisation added to the filemap_xip code is horrendous.
> I don't think it's a particularly common workload (mmap a holey file,
> read lots of zeroes out of it without ever writing to it), so I think
> it's far better to allocate a page of storage and zero it.

Happens far more often than you think in scientific calculations.
Sparse matrices are extremely common, and it's a valid optimistion
to walk then with mmap and have all the uninitialised vectors simply
return zero without having storage space allocated. In this sort of
situation, you really don't want to be allocating and zeroing
persistent memory just because a terabyte sized sparse identity
matrix was mmapped and read in it's entirity during a calculation....

Persistent memory needs to handle sparse files efficiently. I'd
suggest that we already have very well tested mechanism to do
that: the mapping tree on each inode. use the radix tree to index
the space, mapping either a zero page into each hole index that is
mapped read only, and replace it with an allocated, zeroed mapping
at page_mkwrite() time. i.e. use the mapping radix tree to point at
all the pages we've mapped from the backing device rather than just
mapping an anonymous memory address from the backing device
into userspace.

That also opens the door for easily retrofitting buffered writes
into persistent memory if we need them (e.g. mmap() of encrypted
persistent memory).

Once again, we don't need to re-invent the wheel because we already
have one designed specifically for this purpose....

Cheers,

Dave.
--
Dave Chinner
[email protected]

2013-12-23 03:36:47

by Dave Chinner

[permalink] [raw]
Subject: Re: [PATCH v3 0/3] Add XIP support to ext4

On Fri, Dec 20, 2013 at 01:11:00PM -0700, Matthew Wilcox wrote:
> On Fri, Dec 20, 2013 at 02:34:55PM -0500, Theodore Ts'o wrote:
> > On Fri, Dec 20, 2013 at 11:17:31AM -0700, Matthew Wilcox wrote:
> > > Maybe. We have a tension here between wanting to avoid unnecessary
> > > writes to the media (as you say, wear is going to be important for some
> > > media, if not all) and wanting to not fragment files (both for extent
> > > tree compactness and so that we can use PMD or even PGD mappings if the
> > > stars align). It'll be up to the filesystem whether it chooses to satisfy
> > > the get_block request with something prezeroed, or something that aligns
> > > nicely. Ideally, it'll be able to find a block of storage that does both!
> > >
> > > Actually, I now see a second way to read what you wrote. If you meant
> > > "we can map in ZERO_PAGE or one of its analogs", then no. The amount
> > > of cruft that optimisation added to the filemap_xip code is horrendous.
> > > I don't think it's a particularly common workload (mmap a holey file,
> > > read lots of zeroes out of it without ever writing to it), so I think
> > > it's far better to allocate a page of storage and zero it.
> >
> > It seems that you're primarily focused about allocated versus
> > unallocated blocks, and what I think Dave and I are trying to point
> > out is the distinction between initialized and uninitialized blocks
> > (which are already allocated).
>
> I understand the difference for filesystems on block devices; filesystem
> blocks can be:
>
> * allocated, initialised (eg: after they've been written to)
> * allocated, uninitialised (eg: after fallocate)

* allocated, encoded (e.g. encrypted, compressed, deduplicated, etc)

> * unallocated, initialised (eg: written to in the page cache)

i thikn you are refering to delayed allocation here? i.e. the blocks
are reserved, and since they aren't allocated they can't be
considered initialised or uninitialised.

IOWs, I think you are conflating "page cache dirty" with "filesystem
block initialised" but you can't do that because no filesystem
blocks have been allocated.

> * unallocated, uninitialised

That's called free space, and the page cache/XIP will never see
that.

> A filesystem on top of an XIP device can't have an unallocated,
> initialised block. There's no page cache to buffer the write in, so at
> the point where you're going to store to it, you have to allocate.

Like I've consistently said: XIP needs to support page cache
buffering.

> You also can't mmap() an allocated, uninitialised block. It's got to
> be initialised before we can insert a PTE that points to it.

No, the memory that mmap() points to needs to be initialised.
Whether that's the same thing as the backing store is another
matter. If we are using persistent memory, we can use XIP if the
backing store allows it, otherwise we need to use the page cache to
buffer the mmap() operations.

> > So I was thinking about the case where the blocks were already
> > allocated and mapped --- so we have a logical -> physical block
> > mapping already established. However, if the blocks were allocated
> > via fallocate(2), so they are unallocated, although they will be
> > well-aligned.
> >
> > Which means that if you pre-zero at read time, at that point you will
> > be fragmenting the extent tree, and the blocks are already
> > well-aligned so it's in fact better to fault in a zero page at read
> > time when we are dealing with an allocated, but not-yet-initialized
> > block.
>
> Just to check here, you mean a ZERO_PAGE, right? Or a page cache page
> that has been zeroed?
>
> > Also, one of the ways which we handle fragmentation is via delayed
> > allocation. That is, we don't make the allocation decision until the
> > last possible second. We do lose this optimization for direct I/O,
> > since that's part of the nature of the beast --- but there's no reason
> > not to have it for XIP writes --- especially if the goal is to be able
> > to support persistent memory storage devices in a first class way,
> > instead of a one-off hack for demonstration purposes....
>
> I think it's also the nature of the beast that you lose this optimisation
> for XIP writes. Sure, there are multiple ways we could do _not exactly_
> XIP to take advantage of persistent memory, and I think we should discuss
> them, but we should leave XIP as meaning XIP. I have some great ideas
> about how a hybrid approach could work ...

The problem is that you can't do generic, system wide XIP without
considering all different possibilities. How do you do XIP on a COW
based filesystem, where the block mapping changes every time a block
is written? You can do XIP for read, but you can't for write, and
write needs to invalidate read XIP mappings. Indeed, some hardware
types can do XIP for read, but cannot do it for write (e.g. flash).
They need buffering to allow writes to be done for XIP mappings.

What I'm trying to say is that I think the whole idea of XIP is
separate from the page cache is completely the wrong way to go about
fixing it. XIP should simply be a method of mapping backing device
pages into the existing per-inode mapping tree. If we need to
encode, remap, etc because of constraints of the configuration (be
it filesystem implementation or block device encodings) then we just
use the normal buffered IO path, with the ->writepages path hitting
the block layer to do the memcpy or encoding into persistent
memory. Otherwise we just hit the direct IO path we've been talking
about up to this point...

Cheers,

Dave.
--
Dave Chinner
[email protected]

2013-12-23 03:45:56

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [PATCH v3 0/3] Add XIP support to ext4

On Mon, Dec 23, 2013 at 02:36:41PM +1100, Dave Chinner wrote:
> What I'm trying to say is that I think the whole idea of XIP is
> separate from the page cache is completely the wrong way to go about
> fixing it. XIP should simply be a method of mapping backing device
> pages into the existing per-inode mapping tree. If we need to
> encode, remap, etc because of constraints of the configuration (be
> it filesystem implementation or block device encodings) then we just
> use the normal buffered IO path, with the ->writepages path hitting
> the block layer to do the memcpy or encoding into persistent
> memory. Otherwise we just hit the direct IO path we've been talking
> about up to this point...

That's a very filesystem person way of thinking about the problem :-)
The problem is that you've now pushed it off on the MM people. A page
in the page cache needs a struct page to represent it. If you've got
70x as much persistent memory as you have volatile memory, then you just
filled all of your volatile memory with struct pages to describe the
persistent memory. I don't remember if you were around for the joys
of dealing with 16GB+ i386 machines, but the unholy messes created to
avoid running out of the 800MB or so of lowmem are still with us.

I mean, sure, it's doable. But it's got its own tradeoffs and they
aren't pleasant for many workloads. We could talk about ways to work
around it, like making struct page be able to describe larger chunks of
memory, but I don't think I'm capable of that amount of surgery to the VM.

--
Matthew Wilcox Intel Open Source Technology Centre
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours. We can't possibly take such
a retrograde step."

2013-12-23 04:14:47

by Dave Chinner

[permalink] [raw]
Subject: Re: [PATCH v3 0/3] Add XIP support to ext4

On Fri, Dec 20, 2013 at 09:45:30AM -0700, Matthew Wilcox wrote:
> On Fri, Dec 20, 2013 at 10:46:54AM +1100, Dave Chinner wrote:
> > Then perhaps we need to get rid of the xip_sparse_mutex first? :/
>
> Yeah, already done in my tree. Just finishing up a few other pieces.
>
> > > > And that solves the unwritten extent problem for the IO path. Now we
> > > > just need to solve it for the mmap path. That, I suspect will
> > > > require a custom .page_mkwrite handler....
> > >
> > > No, page_mkwrite() never gets called. At this point, I'm thinking a
> > > custom ->fault handler that looks something like this:
> >
> > And that's another difference to the normal filesystem mmap paths.
> > .fault is a read-only operation for filesystems and
> > .page-mkwrite is the write-fault modification path. i.e.
> > .fault is only supposed to populate the page into the page
> > cache by reading it from disk via ->readpage(s). It does not do
> > block allocation - if the fault is over a hole then a new, zeroed
> > page is placed in the page cache regardless of whether it is a read
> > or write page fault.
> >
> > ->page_mkwrite is then used by page fault infrstructure to inform
> > filesystem that a write fault has occurred and they may need to
> > allocate backing store for the page, or convert unwritten extents to
> > written.
> >
> > What xip_file_fault() does is ask the fielsystem to allocate blocks
> > for writeable mappings, rather than just inserting a sparse page
> > over holes and unwritten extents. That fails to handle unwritten
> > extents correctly - they remain unwritten despite the fact that
> > userspace can now write to the page.
>
> I agree with you up to this point. But xip_file_fault() uses the same
> get_block_t callback to allocate blocks that block_page_mkwrite() does.
> So there's no real difference from the fs' point of view.

I beg to differ. You're adding a new allocation path from page
faults into the filesystem code. We already have one - it's called
page_mkwrite.


> > IOWs, xip_file_fault() needs to drop the allocation of blocks and
> > only ever insert mappings for pages that have data in them or sprase
> > pages for holes and unwritten extents. Then the filesystem needs to
> > provide it's own ->page_mkwrite callout to do block allocation and
> > unwritten extent conversion on write page faults, and the XIP code
> > needs to provide a helper function to replace the sparse page in the
> > mappings with the correct page mapped from the backing device after
> > allocation or unwritten extent conversion.
> >
> > That will make XIP behave almost identically to the normal page
> > cache based page fault path, requiring only a small addition to the
> > filesystem page_mkwrite handler to support XIP...
>
> I decided to see if there was anything particularly hard about the XFS
> code in this area. I really think it's just this for you:
>
> +++ b/fs/xfs/xfs_file.c
> @@ -957,12 +957,27 @@ xfs_file_readdir(
> return 0;
> }
>
> +static int xfs_xip_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
> +{
> + return xip_fault(vma, vmf, xfs_get_blocks);
> +}
> +
> +static const struct vm_operations_struct xfs_xip_vm_ops = {
> + .fault = xfs_xip_fault,
> + .remap_pages = generic_file_remap_pages,
> +};

No, it doesn't. ->page_mkwrite re-enters the filesystem back in
->write_begin and ->write_end to do zeroing and allocation and to
set up tracking to trigger unwritten extent conversion, etc.

So, off the top of my head, how does xip_fault():

- avoid triggering delayed allocation?
- trigger unwritten extent conversion to occur?
- ensure that partial blocks at EOF are zeroed correctly?
- deal with partial write failure? (e.g. unwritten extent
conversion fails)

AFAICT, it can't. This is all stuff that is specific to the
filesystem imlpementation, and stuff that is already handled by the
page_mkwrite paths....

> > > static int ext4_xip_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
> > > {
> > > return xip_fault(vma, vmf, ext4_get_block_write, ext4_end_io_dio);
> > > }
> >
> > I think the xip fault handler should be generic as there's no reason
> > for it to do anything other that read-only operations. It's the
> > page_mkwrite callout that needs custom code for each filesystem.
>
> With no struct page for the XIP memory, it's just not feasible to do it
> that way.

Yet we are making the major assumption that XIP memory is made up of
pages that can be mapped directly into page tables. So, it's pages
and we optimise filesystems to deal with native page sizes, but XIP
says it's not pages and so we have to deal with it completely
differently?

Perhaps XIP pages need to have struct pages allocated for them at
device initialisation time, such that all xip_get_mem turns into
xip_find_get_page() and the page gets inserted directly into the
mapping tree on the inode? All of a sudden, XIP looks almost the
same as the normal mmap path, and filesystems only need to ensure
that they zero the page and/or convert it if necessary before the
page fault returns...

Seriously, just because the current XIP code uses some hack to work
with *one device*, it doesn't mean that's the best way to approach
the problem. XIP is not something magic and different - it's just a
different way of inserting a page into the page cache....

Cheers,

Dave.
--
Dave Chinner
[email protected]

2013-12-23 04:32:48

by Dave Chinner

[permalink] [raw]
Subject: Re: [PATCH v3 0/3] Add XIP support to ext4

On Sun, Dec 22, 2013 at 08:45:54PM -0700, Matthew Wilcox wrote:
> On Mon, Dec 23, 2013 at 02:36:41PM +1100, Dave Chinner wrote:
> > What I'm trying to say is that I think the whole idea of XIP is
> > separate from the page cache is completely the wrong way to go about
> > fixing it. XIP should simply be a method of mapping backing device
> > pages into the existing per-inode mapping tree. If we need to
> > encode, remap, etc because of constraints of the configuration (be
> > it filesystem implementation or block device encodings) then we just
> > use the normal buffered IO path, with the ->writepages path hitting
> > the block layer to do the memcpy or encoding into persistent
> > memory. Otherwise we just hit the direct IO path we've been talking
> > about up to this point...
>
> That's a very filesystem person way of thinking about the problem :-)
> The problem is that you've now pushed it off on the MM people. A page
> in the page cache needs a struct page to represent it. If you've got

Ever crossed you mind that perhaps persistent memory could store
them? They don't need to be in volatile RAM, especially if
persistent memory is as addressable as volatile RAM. So, problem
solved - you just use part of persistent memory to track all the
pages of persistent memory used for storage....

> 70x as much persistent memory as you have volatile memory, then you just
> filled all of your volatile memory with struct pages to describe the
> persistent memory. I don't remember if you were around for the joys
> of dealing with 16GB+ i386 machines, but the unholy messes created to
> avoid running out of the 800MB or so of lowmem are still with us.

The lowmem/highmem problem was caused by the kernel not being able
to directly address the high memory on those machines. That's not a
problem with persistent memory - the kernel can address the
persistent memory directly, and so there is nothing stopping the
kernel from storing the indexing information in persistent memory,
even if it doesn't use the persistent nature of the memory...

> I mean, sure, it's doable. But it's got its own tradeoffs and they
> aren't pleasant for many workloads. We could talk about ways to work
> around it, like making struct page be able to describe larger chunks of
> memory, but I don't think I'm capable of that amount of surgery to the VM.

I don't think it requires major surgery - it should be no different
to initialising a region of volatile memory, like we do for every
node on NUMA machines....

Cheers,

Dave.
--
Dave Chinner
[email protected]

2013-12-23 06:56:47

by Dave Chinner

[permalink] [raw]
Subject: Re: [PATCH v3 0/3] Add XIP support to ext4

On Sun, Dec 22, 2013 at 08:45:54PM -0700, Matthew Wilcox wrote:
> On Mon, Dec 23, 2013 at 02:36:41PM +1100, Dave Chinner wrote:
> > What I'm trying to say is that I think the whole idea of XIP is
> > separate from the page cache is completely the wrong way to go about
> > fixing it. XIP should simply be a method of mapping backing device
> > pages into the existing per-inode mapping tree. If we need to
> > encode, remap, etc because of constraints of the configuration (be
> > it filesystem implementation or block device encodings) then we just
> > use the normal buffered IO path, with the ->writepages path hitting
> > the block layer to do the memcpy or encoding into persistent
> > memory. Otherwise we just hit the direct IO path we've been talking
> > about up to this point...
>
> That's a very filesystem person way of thinking about the problem :-)
> The problem is that you've now pushed it off on the MM people.

I didn't comment on this before, but now I've had a bit of time to
think about it, it's become obvious to me that there is a
fundamental disconnect here. To risk stating the obvious, but
persistent memory is just memory and someone has to manage it.

I'll state up front that I do spend a fair bit of time in memory
management code - all the shrinker scaling for NUMA systems that
landed recently was stuff I originally wrote. I'm spending time
reviewing patches to get memcg awareness into the shrinkers and
filesystem caches. Persistent memory has a lot of overlap between
the MM and FS subsystems, just like shrinkers overlap lots of
different subsystems...

So from a filesystem perspective, we move data in and out of pages
of memory that are managed by the memory management subsystem, and
we move that data to and from filesystem blocks via an IO path.

The management of the memory that filesystems use is actually
the responsibility of the memory management subsystem - allocation,
reclaim, tracking, etc are all handled by the mm subsystem. That has
tendrils down into filesystem code - writeback for cleaning pages,
shrinkers for freeing inodes, dentries and other filesystem caches,
etc.

Persistent memory may be physically different to volatile memory,
but it is still exposed as byte addressable, mappable pages of
memory to the OS. Hence it could be treated in exactly the same way
that volatile memory pages are treated.

That is, a persistent memory device could be considered to be a
block device with a page sized sector. i.e. a 1:1 mapping between
the block device address space and the persistent memory page. A
filesystem tracks sectors in the block device address space with
filesystem metadata to expose the storage in a namespace, but that's
not the same thing as using managing how persistent memory is
exposed to virtual addresses in userspace. The former is data
indexing, the latter is a data access.

In terms of data indexing, the inode mapping tree is used to track
the relationship between the file offset of the user data, the
memory backing the data and the block index in the filesystem. That
realtionship is read from filesystem metadata.

For data access, the memory backing the data is tracked via
a struct page allocated out of volatile system memory. To get that
data to/from the backing storage, we need to perform an IO
operation on the memory backing the data, and we determine where to
get that from via the data index...

In the case of XIP, we still have the same data index relationship.
The difference is in the data access - XIP gets the backing memory
from the block device rather than from the free memory the VM.
However, we don't get a struct page - we get an opaque handle we
cannot use for data indexing purposes, and hence we need unique IO
paths to deal with this difference.

If the persistent memory device can hand us struct pages rather than
mapped memory handles, we don't need to change our data indexing
methods, nor do we need to change the way data in the page cache is
accessed. mmap() gets direct access, just like the current XIP, but
we can use all of the smarts filesystems have for optimal
block allocation.

Further, if the persistent memory device implements an IO implementation
(->make_request) like brd does (brd_make_request), then we get double
buffered persistent memory that we can use for things like stacked
IO devices that encode the data that is being stored. It all ends up
completely transparent to the filesystem, the mm subsystem, the
users, etc. XIP just works automatically when it can, otherwise it
just behaves like a really fast block device....

IOWs, I don't see XIP as something that should be tacked on to the
side of the filesystems and bypass the normal IO paths. it's
somethign that should be integrated directly and used automatically
if it can be used. And that requires persistent memory to be treated
as pages just like volatile memory.

That's how I see persistent memory fitting into the FS/MM world. It
needs help from both the FS and MM subsystems, and to try to
shoe-horn it completely into one or the other just won't work in the
long run.

The reality is that you're on a steep learning curve here, Willy.
What filesystems do and the way they interact with the MM subsystem
interact is a whole lot more complex that you realised. I know that
XIP is not a new concept (I writing XIP stuff 20 years ago on 68000s
with a whole 6MB of battery backed SRAM), but filesystems and the
page cache have got a whole lot more complex since ext2....

Cheers,

Dave.
--
Dave Chinner
[email protected]

2013-12-23 14:51:42

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH v3 0/3] Add XIP support to ext4

On Mon, Dec 23, 2013 at 05:56:41PM +1100, Dave Chinner wrote:
> IOWs, I don't see XIP as something that should be tacked on to the
> side of the filesystems and bypass the normal IO paths. it's
> somethign that should be integrated directly and used automatically
> if it can be used. And that requires persistent memory to be treated
> as pages just like volatile memory.

I agree with Dave's suggestions here completely. However, for the
long term...

> If the persistent memory device can hand us struct pages rather than
> mapped memory handles, we don't need to change our data indexing
> methods, nor do we need to change the way data in the page cache is
> accessed. mmap() gets direct access, just like the current XIP, but
> we can use all of the smarts filesystems have for optimal
> block allocation.

I suspect in the long term, or a least for persistent memory, the mm
subsystem is going to have to move away from having a struct page for
every 4k region of memory. This just doesn't scale very well, and
especially in the case of persistent memory, where we might in the
long run have terrabytes of the stuff (assuming the price comes down,
but hopefully they will), it probably makes sense to have some kind of
struct page_extent structure which covers a region of pages.

After all, there are portions of the struct page which aren't needed
for persistent memory. We don't need to scan persistent memory and we
don't need to deactivate pages of persistent memory (since it's not
like we can reuse the persistent memory page for anything else so long
as it belongs to an allocated file).

But all of this is in the long-term. In the short term, I suspect the
reality is that whatever interfaces we try to set up for persistent
memory, they will probably have to be rewritten and rototilled at
least once or twice more.

After all, this stuff isn't available in real life, and we don't know
which variants of the technology will win out, in terms of
price/performance and market acceptance. Some of them may require
double-buffering where you map in the page read-only, but if you need
to modify the page, you really want to make a copy-on-write to
traditional RAM (because writes, though faster than flash and byte
addressable, might still be slow enough --- say, 2x or 3x --- that you
wouldn't want to be doing a lot of calculations and modifications to
the persistent memory.)

So we do need to worry about overdesign. Maybe it will be better to
do something simple/crappy/stupid initially, with the understanding
that it might not be the kindest as far as write cycles or
performance, but with the expectation that it will have to be
rewritten later. After all, this isn't like a file system format
where once we have committed to the format, it is really painful to
change.

How many times have we written the network stack? Three, four, five
times depending on how you count? It's true we haven't been as lucky
at being able to rev block layer when it has desparately needed
rototilling and rewriting in the past, but maybe we should accept the
fact that we _will_ get it wrong this second time that we rewrite XIP,
and just try to make it be less painful when we need to rewrite it the
for the third time...

- Ted

2013-12-24 16:27:16

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [PATCH v3 0/3] Add XIP support to ext4

On Mon, Dec 23, 2013 at 02:16:16PM +1100, Dave Chinner wrote:
> > Actually, I now see a second way to read what you wrote. If you meant
> > "we can map in ZERO_PAGE or one of its analogs", then no. The amount
> > of cruft that optimisation added to the filemap_xip code is horrendous.
> > I don't think it's a particularly common workload (mmap a holey file,
> > read lots of zeroes out of it without ever writing to it), so I think
> > it's far better to allocate a page of storage and zero it.
>
> Happens far more often than you think in scientific calculations.
> Sparse matrices are extremely common, and it's a valid optimistion
> to walk then with mmap and have all the uninitialised vectors simply
> return zero without having storage space allocated. In this sort of
> situation, you really don't want to be allocating and zeroing
> persistent memory just because a terabyte sized sparse identity
> matrix was mmapped and read in it's entirity during a calculation....

It turns out not to be too bad. I think the real problem with the old
XIP code was that they tried to microoptimise by using a single zero
page for every hole rather than doing what the generic pagecache code
does and use a page per hole. Patch at the end of this mail.

> Persistent memory needs to handle sparse files efficiently. I'd
> suggest that we already have very well tested mechanism to do
> that: the mapping tree on each inode. use the radix tree to index
> the space, mapping either a zero page into each hole index that is
> mapped read only, and replace it with an allocated, zeroed mapping
> at page_mkwrite() time. i.e. use the mapping radix tree to point at
> all the pages we've mapped from the backing device rather than just
> mapping an anonymous memory address from the backing device
> into userspace.

We could reuse the radix tree for things that aren't pages, like
swp_to_radix_entry() does. I don't see what that will give us over
the current system.

> That also opens the door for easily retrofitting buffered writes
> into persistent memory if we need them (e.g. mmap() of encrypted
> persistent memory).

I don't see why we'd do it that way. If we're layering software
encryption between the app and the storage then it's no longer direct
access. It's no longer XIP. You'd just use the normal bio paths.

Optimisation for reading sparse pages:

diff --git a/fs/ext2/file.c b/fs/ext2/file.c
index 7d6e492..b28bc6f 100644
--- a/fs/ext2/file.c
+++ b/fs/ext2/file.c
@@ -31,6 +31,11 @@ static int ext2_xip_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
return xip_fault(vma, vmf, ext2_get_block);
}

+static int ext2_xip_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
+{
+ return xip_mkwrite(vma, vmf, ext2_get_block);
+}
+
static int ext2_xip_pmd_fault(struct vm_area_struct *vma, unsigned long addr,
pmd_t *pmd, unsigned int flags)
{
@@ -40,6 +45,7 @@ static int ext2_xip_pmd_fault(struct vm_area_struct *vma, unsigned long addr,
static const struct vm_operations_struct ext2_xip_vm_ops = {
.fault = ext2_xip_fault,
.pmd_fault = ext2_xip_pmd_fault,
+ .page_mkwrite = ext2_xip_mkwrite,
.remap_pages = generic_file_remap_pages,
};

diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 6211f56..4eea421 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -211,9 +211,15 @@ static int ext4_xip_pmd_fault(struct vm_area_struct *vma, unsigned long addr,
return xip_pmd_fault(vma, addr, pmd, flags, ext4_get_block);
}

+static int ext4_xip_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
+{
+ return xip_mkwrite(vma, vmf, ext4_get_block);
+}
+
static const struct vm_operations_struct ext4_xip_vm_ops = {
.fault = ext4_xip_fault,
.pmd_fault = ext4_xip_pmd_fault,
+ .page_mkwrite = ext4_xip_mkwrite,
.remap_pages = generic_file_remap_pages,
};
#else
diff --git a/fs/xip.c b/fs/xip.c
index b6a8e0c..8049703 100644
--- a/fs/xip.c
+++ b/fs/xip.c
@@ -187,7 +202,26 @@ ssize_t xip_do_io(int rw, struct kiocb *iocb, struct inode *inode,
}
EXPORT_SYMBOL_GPL(xip_do_io);

+static int xip_read_hole(struct address_space *mapping, struct vm_fault *vmf)
+{
+ unsigned long size;
+ struct inode *inode = mapping->host;
+ struct page *page = find_or_create_page(mapping, vmf->pgoff,
+ GFP_KERNEL | __GFP_ZERO);
+ if (!page)
+ return VM_FAULT_OOM;
+ size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT;
+ if (vmf->pgoff >= size) {
+ unlock_page(page);
+ page_cache_release(page);
+ return VM_FAULT_SIGBUS;
+ }
+
+ vmf->page = page;
+ return VM_FAULT_LOCKED;
+}
+
static void copy_user_bh(struct page *to, struct inode *inode,
struct buffer_head *bh, unsigned long vaddr)
{
@@ -226,12 +252,14 @@ static int do_xip_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
bh.b_size = PAGE_SIZE;
error = get_block(inode, block, &bh, 0);

- /* Don't allocate backing store if we're going to COW a hole */
if (!error && !buffer_mapped(&bh) && !vmf->cow_page) {
- error = get_block(inode, block, &bh, 1);
- count_vm_event(PGMAJFAULT);
- mem_cgroup_count_vm_event(vma->vm_mm, PGMAJFAULT);
- major = VM_FAULT_MAJOR;
+ if (vmf->flags & FAULT_FLAG_WRITE) {
+ error = get_block(inode, block, &bh, 1);
+ count_vm_event(PGMAJFAULT);
+ mem_cgroup_count_vm_event(vma->vm_mm, PGMAJFAULT);
+ major = VM_FAULT_MAJOR;
+ } else
+ return xip_read_hole(mapping, vmf);
}

if (error)
@@ -279,19 +316,45 @@ int xip_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
}
EXPORT_SYMBOL_GPL(xip_fault);

+/**
+ * xip_mkwrite - convert a read-only page to read-write in an XIP file
+ * @vma: The virtual memory area where the fault occurred
+ * @vmf: The description of the fault
+ * @get_block: The filesystem method used to translate file offsets to blocks
+ *
+ * XIP handles reads of holes by adding pages full of zeroes into the
+ * mapping. If the page is subsequenty written to, we have to allocate
+ * the page on media and free the page that was in the cache.
+ */
+int xip_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf,
+ get_block_t get_block)
+{
+ int result;
+ struct super_block *sb = file_inode(vma->vm_file)->i_sb;
+
+ sb_start_pagefault(sb);
+ file_update_time(vma->vm_file);
+ result = do_xip_fault(vma, vmf, get_block);
+ sb_end_pagefault(sb);
+
+ if (!(result & VM_FAULT_ERROR)) {
+ struct page *page = vmf->page;
+ unmap_mapping_range(page->mapping,
+ (loff_t)page->index << PAGE_CACHE_SHIFT,
+ PAGE_CACHE_SIZE, 0);
+ delete_from_page_cache(page);
+ }
+
+ return result;
+}
+EXPORT_SYMBOL_GPL(xip_mkwrite);
+
/*
* The 'colour' (ie low bits) within a PMD of a page offset. This comes up
* more often than one might expect in the below function.
diff --git a/mm/memory.c b/mm/memory.c
index 0d332cf..2a9cfb2 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2708,6 +2708,7 @@ static int do_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
vmf.pgoff = old_page->index;
vmf.flags = FAULT_FLAG_WRITE|FAULT_FLAG_MKWRITE;
vmf.page = old_page;
+ vmf.cow_page = NULL;

/*
* Notify the address space that the page is about to

--
Matthew Wilcox Intel Open Source Technology Centre
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours. We can't possibly take such
a retrograde step."