The original intent of this series was to add a per-inode DAX flag to ext4
so that it would be consistent with XFS. In my travels I found and fixed
several related issues in both ext4 and XFS.
I'm not fully happy with the ways that ext4 DAX interacts with conflicting
features (journaling, inline data and encryption). My goal with this
series was to make all these interactions as consistent as possilble, and
of course to make them safe. If anyone has ideas for improvements, I'm
very open.
Ross Zwisler (9):
ext4: remove duplicate extended attributes defs
xfs: always use DAX if mount option is used
xfs: validate bdev support for DAX inode flag
ext4: add ext4_should_use_dax()
ext4: ext4_change_inode_journal_flag error handling
ext4: safely transition S_DAX on journaling changes
ext4: prevent data corruption with inline data + DAX
ext4: add sanity check for encryption + DAX
ext4: add per-inode DAX flag
fs/ext4/ext4.h | 47 ++++++---------------------------------------
fs/ext4/ext4_jbd2.h | 16 ++++++++++++++++
fs/ext4/inline.c | 10 ----------
fs/ext4/inode.c | 45 ++++++++++++++++++++++++-------------------
fs/ext4/ioctl.c | 55 +++++++++++++++++++++++++++++++++++++++++++++++++++--
fs/ext4/super.c | 8 ++++++++
fs/xfs/xfs_ioctl.c | 14 +++++++++++---
7 files changed, 119 insertions(+), 76 deletions(-)
--
2.9.5
Currently only the blocksize is checked, but we should really be calling
bdev_dax_supported() which also tests to make sure we can get a
struct dax_device and that the dax_direct_access() path is working.
This is the same check that we do for the "-o dax" mount option in
xfs_fs_fill_super().
Signed-off-by: Ross Zwisler <[email protected]>
CC: [email protected]
---
fs/xfs/xfs_ioctl.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
index 8155ddc..dac195d 100644
--- a/fs/xfs/xfs_ioctl.c
+++ b/fs/xfs/xfs_ioctl.c
@@ -1075,6 +1075,7 @@ xfs_ioctl_setattr_dax_invalidate(
int *join_flags)
{
struct inode *inode = VFS_I(ip);
+ struct super_block *sb = inode->i_sb;
int error;
*join_flags = 0;
@@ -1087,7 +1088,7 @@ xfs_ioctl_setattr_dax_invalidate(
if (fa->fsx_xflags & FS_XFLAG_DAX) {
if (!(S_ISREG(inode->i_mode) || S_ISDIR(inode->i_mode)))
return -EINVAL;
- if (ip->i_mount->m_sb.sb_blocksize != PAGE_SIZE)
+ if (bdev_dax_supported(sb, sb->s_blocksize) < 0)
return -EINVAL;
}
--
2.9.5
The following commit:
commit 9b7365fc1c82 ("ext4: add FS_IOC_FSSETXATTR/FS_IOC_FSGETXATTR
interface support")
added several defines related to extended attributes to ext4.h. They were
added within an #ifndef FS_IOC_FSGETXATTR block with the comment:
/* Until the uapi changes get merged for project quota... */
Those uapi changes were merged by this commit:
commit 334e580a6f97 ("fs: XFS_IOC_FS[SG]SETXATTR to FS_IOC_FS[SG]ETXATTR
promotion")
so all the definitions needed by ext4 are available in
include/uapi/linux/fs.h. Remove the duplicates from ext4.h.
Signed-off-by: Ross Zwisler <[email protected]>
Cc: Li Xi <[email protected]>
Cc: Theodore Ts'o <[email protected]>
Cc: Andreas Dilger <[email protected]>
Cc: Jan Kara <[email protected]>
Cc: Dave Chinner <[email protected]>
---
fs/ext4/ext4.h | 37 -------------------------------------
1 file changed, 37 deletions(-)
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index a2bb7d2..c950278 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -644,43 +644,6 @@ enum {
#define EXT4_IOC_GET_ENCRYPTION_PWSALT FS_IOC_GET_ENCRYPTION_PWSALT
#define EXT4_IOC_GET_ENCRYPTION_POLICY FS_IOC_GET_ENCRYPTION_POLICY
-#ifndef FS_IOC_FSGETXATTR
-/* Until the uapi changes get merged for project quota... */
-
-#define FS_IOC_FSGETXATTR _IOR('X', 31, struct fsxattr)
-#define FS_IOC_FSSETXATTR _IOW('X', 32, struct fsxattr)
-
-/*
- * Structure for FS_IOC_FSGETXATTR and FS_IOC_FSSETXATTR.
- */
-struct fsxattr {
- __u32 fsx_xflags; /* xflags field value (get/set) */
- __u32 fsx_extsize; /* extsize field value (get/set)*/
- __u32 fsx_nextents; /* nextents field value (get) */
- __u32 fsx_projid; /* project identifier (get/set) */
- unsigned char fsx_pad[12];
-};
-
-/*
- * Flags for the fsx_xflags field
- */
-#define FS_XFLAG_REALTIME 0x00000001 /* data in realtime volume */
-#define FS_XFLAG_PREALLOC 0x00000002 /* preallocated file extents */
-#define FS_XFLAG_IMMUTABLE 0x00000008 /* file cannot be modified */
-#define FS_XFLAG_APPEND 0x00000010 /* all writes append */
-#define FS_XFLAG_SYNC 0x00000020 /* all writes synchronous */
-#define FS_XFLAG_NOATIME 0x00000040 /* do not update access time */
-#define FS_XFLAG_NODUMP 0x00000080 /* do not include in backups */
-#define FS_XFLAG_RTINHERIT 0x00000100 /* create with rt bit set */
-#define FS_XFLAG_PROJINHERIT 0x00000200 /* create with parents projid */
-#define FS_XFLAG_NOSYMLINKS 0x00000400 /* disallow symlink creation */
-#define FS_XFLAG_EXTSIZE 0x00000800 /* extent size allocator hint */
-#define FS_XFLAG_EXTSZINHERIT 0x00001000 /* inherit inode extent size */
-#define FS_XFLAG_NODEFRAG 0x00002000 /* do not defragment */
-#define FS_XFLAG_FILESTREAM 0x00004000 /* use filestream allocator */
-#define FS_XFLAG_HASATTR 0x80000000 /* no DIFLAG for this */
-#endif /* !defined(FS_IOC_FSGETXATTR) */
-
#define EXT4_IOC_FSGETXATTR FS_IOC_FSGETXATTR
#define EXT4_IOC_FSSETXATTR FS_IOC_FSSETXATTR
--
2.9.5
The current code has an issue where the user can't reliably tell whether or
not DAX is being used to service page faults and I/O when the DAX mount
option is used. In this case each inode within the mounted filesystem
starts with S_DAX set due to the mount option, but it can be cleared if
someone touches the individual inode flag.
For example:
# mount | grep dax
/dev/pmem0 on /mnt type xfs
(rw,relatime,seclabel,attr2,dax,inode64,sunit=4096,swidth=4096,noquota)
# touch /mnt/a /mnt/b # both files currently use DAX
# xfs_io -c "lsattr" /mnt/* # neither has the DAX inode option set
----------e----- /mnt/a
----------e----- /mnt/b
# xfs_io -c "chattr -x" /mnt/a # this clears S_DAX for /mnt/a
# xfs_io -c "lsattr" /mnt/*
----------e----- /mnt/a
----------e----- /mnt/b
We end up with both /mnt/a and /mnt/b looking identical from the point of
view of the mount option and from lsattr, but one is using DAX and the
other is not.
Fix this by always doing DAX I/O when either the mount option is set or
when the DAX inode flag is set. This means that DAX will always be used
for all inodes on a filesystem mounted with -o dax, making the usage
reliable and detectable.
Signed-off-by: Ross Zwisler <[email protected]>
CC: [email protected]
---
fs/xfs/xfs_ioctl.c | 11 +++++++++--
1 file changed, 9 insertions(+), 2 deletions(-)
diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
index 9c0c7a9..8155ddc 100644
--- a/fs/xfs/xfs_ioctl.c
+++ b/fs/xfs/xfs_ioctl.c
@@ -1008,7 +1008,7 @@ xfs_diflags_to_linux(
inode->i_flags |= S_NOATIME;
else
inode->i_flags &= ~S_NOATIME;
- if (xflags & FS_XFLAG_DAX)
+ if ((xflags & FS_XFLAG_DAX) || (ip->i_mount->m_flags & XFS_MOUNT_DAX))
inode->i_flags |= S_DAX;
else
inode->i_flags &= ~S_DAX;
@@ -1091,7 +1091,14 @@ xfs_ioctl_setattr_dax_invalidate(
return -EINVAL;
}
- /* If the DAX state is not changing, we have nothing to do here. */
+ /*
+ * If the DAX state is not changing, we have nothing to do here. If
+ * the DAX mount option was used we will update the DAX inode flag as
+ * the user requested but we will continue to use DAX for I/O and page
+ * faults regardless of how the inode flag is set.
+ */
+ if (ip->i_mount->m_flags & XFS_MOUNT_DAX)
+ return 0;
if ((fa->fsx_xflags & FS_XFLAG_DAX) && IS_DAX(inode))
return 0;
if (!(fa->fsx_xflags & FS_XFLAG_DAX) && !IS_DAX(inode))
--
2.9.5
This helper, in the spirit of ext4_should_dioread_nolock() et al., replaces
the complex conditional in ext4_set_inode_flags() and will soon be called
in multiple places.
Signed-off-by: Ross Zwisler <[email protected]>
---
fs/ext4/ext4_jbd2.h | 15 +++++++++++++++
fs/ext4/inode.c | 4 +---
2 files changed, 16 insertions(+), 3 deletions(-)
diff --git a/fs/ext4/ext4_jbd2.h b/fs/ext4/ext4_jbd2.h
index 48143e3..65e2aa9 100644
--- a/fs/ext4/ext4_jbd2.h
+++ b/fs/ext4/ext4_jbd2.h
@@ -462,4 +462,19 @@ static inline int ext4_should_dioread_nolock(struct inode *inode)
return 1;
}
+static inline bool ext4_should_use_dax(struct inode *inode)
+{
+ if (!test_opt(inode->i_sb, DAX))
+ return false;
+ if (!S_ISREG(inode->i_mode))
+ return false;
+ if (ext4_should_journal_data(inode))
+ return false;
+ if (ext4_has_inline_data(inode))
+ return false;
+ if (ext4_encrypted_inode(inode))
+ return false;
+ return true;
+}
+
#endif /* _EXT4_JBD2_H */
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index c774bdc..864fb94 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -4599,9 +4599,7 @@ void ext4_set_inode_flags(struct inode *inode)
new_fl |= S_NOATIME;
if (flags & EXT4_DIRSYNC_FL)
new_fl |= S_DIRSYNC;
- if (test_opt(inode->i_sb, DAX) && S_ISREG(inode->i_mode) &&
- !ext4_should_journal_data(inode) && !ext4_has_inline_data(inode) &&
- !ext4_encrypted_inode(inode))
+ if (ext4_should_use_dax(inode))
new_fl |= S_DAX;
inode_set_flags(inode, new_fl,
S_SYNC|S_APPEND|S_IMMUTABLE|S_NOATIME|S_DIRSYNC|S_DAX);
--
2.9.5
Follow the lead of XFS and add a per-inode flag that allows the choice of
whether or not to use DAX to be made for individual files. As with XFS
this flag can also be set on a directory, in which case it will be
inherited by new inodes (both subdirectories and files) created within that
directory.
This inode flag can be viewed and manipulated using xfs_io, just as is done
with the XFS version of this flag:
# mount /dev/pmem0 /mnt/
# mount | grep /mnt
/dev/pmem0 on /mnt type ext4 (rw,relatime,seclabel,data=ordered)
# touch /mnt/a
# xfs_io -c lsattr /mnt/a
---------------- /mnt/a
# xfs_io -c 'chattr +x' /mnt/a
# xfs_io -c lsattr /mnt/a
--------------x- /mnt/a
We will use DAX in ext4 if either the mount option is used or if the inode
flag is set.
The 'chattr +x' functionality of xfs_io has been in place for foreign
filesystems (ext4) since xfsprogs v4.8.0. Support for 'lsattr' on foreign
filesystems was added with:
commit e13a325cd7a9 ("xfs_io: allow lsattr & lsproj on foreign filesystems")
which can currently be found in the "for-next" branch.
The rules for the per-inode DAX flag in ext4 are pretty much the same as
the rules for the '-o dax' mount option:
1) When DAX and journaling are both requested on the same inode, journaling
will win and DAX will be turned off. The S_DAX transitions due to
journaling mode changes are done safely by locking out page faults and I/O,
doing a writeback and an invalidate.
2) The DAX inode flag will fail with -EINVAL for filesystems that can
contain inline data.
3) If you try and use DAX on an encrypted inode, the inode will accept the
DAX inode flag but DAX will never be used.
Signed-off-by: Ross Zwisler <[email protected]>
---
The behavior of the DAX inode flag when combined with journaling, inline
data and encryption was meant to stay as consistent as possible with the
'-o dax' mount option. If there is a better way to handle these conflics,
please let me know.
---
fs/ext4/ext4.h | 10 ++++++----
fs/ext4/ext4_jbd2.h | 5 +++--
fs/ext4/inode.c | 2 +-
fs/ext4/ioctl.c | 55 +++++++++++++++++++++++++++++++++++++++++++++++++++--
4 files changed, 63 insertions(+), 9 deletions(-)
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index c950278..9d8e067 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -399,10 +399,11 @@ struct flex_groups {
#define EXT4_EOFBLOCKS_FL 0x00400000 /* Blocks allocated beyond EOF */
#define EXT4_INLINE_DATA_FL 0x10000000 /* Inode has inline data. */
#define EXT4_PROJINHERIT_FL 0x20000000 /* Create with parents projid */
+#define EXT4_DAX_FL 0x40000000 /* use DAX for IO */
#define EXT4_RESERVED_FL 0x80000000 /* reserved for ext4 lib */
-#define EXT4_FL_USER_VISIBLE 0x304BDFFF /* User visible flags */
-#define EXT4_FL_USER_MODIFIABLE 0x204BC0FF /* User modifiable flags */
+#define EXT4_FL_USER_VISIBLE 0x704BDFFF /* User visible flags */
+#define EXT4_FL_USER_MODIFIABLE 0x604BC0FF /* User modifiable flags */
/* Flags we can manipulate with through EXT4_IOC_FSSETXATTR */
#define EXT4_FL_XFLAG_VISIBLE (EXT4_SYNC_FL | \
@@ -410,14 +411,15 @@ struct flex_groups {
EXT4_APPEND_FL | \
EXT4_NODUMP_FL | \
EXT4_NOATIME_FL | \
- EXT4_PROJINHERIT_FL)
+ EXT4_PROJINHERIT_FL | \
+ EXT4_DAX_FL)
/* Flags that should be inherited by new inodes from their parent. */
#define EXT4_FL_INHERITED (EXT4_SECRM_FL | EXT4_UNRM_FL | EXT4_COMPR_FL |\
EXT4_SYNC_FL | EXT4_NODUMP_FL | EXT4_NOATIME_FL |\
EXT4_NOCOMPR_FL | EXT4_JOURNAL_DATA_FL |\
EXT4_NOTAIL_FL | EXT4_DIRSYNC_FL |\
- EXT4_PROJINHERIT_FL)
+ EXT4_PROJINHERIT_FL | EXT4_DAX_FL)
/* Flags that are appropriate for regular files (all but dir-specific ones). */
#define EXT4_REG_FLMASK (~(EXT4_DIRSYNC_FL | EXT4_TOPDIR_FL))
diff --git a/fs/ext4/ext4_jbd2.h b/fs/ext4/ext4_jbd2.h
index 65e2aa9..14b7a84 100644
--- a/fs/ext4/ext4_jbd2.h
+++ b/fs/ext4/ext4_jbd2.h
@@ -462,9 +462,10 @@ static inline int ext4_should_dioread_nolock(struct inode *inode)
return 1;
}
-static inline bool ext4_should_use_dax(struct inode *inode)
+static inline bool ext4_should_use_dax(struct inode *inode,
+ bool dax_inode_flag)
{
- if (!test_opt(inode->i_sb, DAX))
+ if (!(test_opt(inode->i_sb, DAX) || dax_inode_flag))
return false;
if (!S_ISREG(inode->i_mode))
return false;
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index facb5ae..022ca58 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -4599,7 +4599,7 @@ void ext4_set_inode_flags(struct inode *inode)
new_fl |= S_NOATIME;
if (flags & EXT4_DIRSYNC_FL)
new_fl |= S_DIRSYNC;
- if (ext4_should_use_dax(inode))
+ if (ext4_should_use_dax(inode, !!(flags & EXT4_DAX_FL)))
new_fl |= S_DAX;
inode_set_flags(inode, new_fl,
S_SYNC|S_APPEND|S_IMMUTABLE|S_NOATIME|S_DIRSYNC|S_DAX);
diff --git a/fs/ext4/ioctl.c b/fs/ext4/ioctl.c
index afb66d4..8626a94 100644
--- a/fs/ext4/ioctl.c
+++ b/fs/ext4/ioctl.c
@@ -22,6 +22,7 @@
#include <linux/fsmap.h>
#include "fsmap.h"
#include <trace/events/ext4.h>
+#include <linux/dax.h>
/**
* Swap memory between @a and @b for @len bytes.
@@ -205,6 +206,41 @@ static int uuid_is_zero(__u8 u[16])
}
#endif
+static int ext4_ioctl_dax_invalidate(struct inode *inode, bool dax_inode_flag)
+{
+ bool old_dax = !!(inode->i_flags & S_DAX);
+ bool new_dax = ext4_should_use_dax(inode, dax_inode_flag);
+ struct super_block *sb = inode->i_sb;
+
+ lockdep_assert_held(&inode->i_rwsem);
+ lockdep_assert_held(&EXT4_I(inode)->i_mmap_sem);
+
+ if (dax_inode_flag) {
+ if (ext4_has_feature_inline_data(sb)) {
+ ext4_msg(sb, KERN_ERR, "Cannot use DAX on a filesystem"
+ " that may contain inline data");
+ return -EINVAL;
+ }
+ if (!(S_ISREG(inode->i_mode) || S_ISDIR(inode->i_mode)))
+ return -EINVAL;
+ if (bdev_dax_supported(sb, sb->s_blocksize) < 0)
+ return -EINVAL;
+ }
+
+ if (old_dax != new_dax) {
+ int err;
+
+ err = filemap_write_and_wait(inode->i_mapping);
+ if (err)
+ return err;
+
+ err = invalidate_inode_pages2(inode->i_mapping);
+ if (err)
+ return err;
+ }
+ return 0;
+}
+
static int ext4_ioctl_setflags(struct inode *inode,
unsigned int flags)
{
@@ -258,10 +294,15 @@ static int ext4_ioctl_setflags(struct inode *inode,
goto flags_out;
}
+ down_write(&ei->i_mmap_sem);
+ err = ext4_ioctl_dax_invalidate(inode, !!(flags & EXT4_DAX_FL));
+ if (err)
+ goto unlock_out;
+
handle = ext4_journal_start(inode, EXT4_HT_INODE, 1);
if (IS_ERR(handle)) {
err = PTR_ERR(handle);
- goto flags_out;
+ goto unlock_out;
}
if (IS_SYNC(inode))
ext4_handle_sync(handle);
@@ -286,6 +327,7 @@ static int ext4_ioctl_setflags(struct inode *inode,
err = ext4_mark_iloc_dirty(handle, inode, &iloc);
flags_err:
+ up_write(&ei->i_mmap_sem);
ext4_journal_stop(handle);
if (err)
goto flags_out;
@@ -303,6 +345,10 @@ static int ext4_ioctl_setflags(struct inode *inode,
flags_out:
return err;
+
+unlock_out:
+ up_write(&ei->i_mmap_sem);
+ return err;
}
#ifdef CONFIG_QUOTA
@@ -425,12 +471,15 @@ static inline __u32 ext4_iflags_to_xflags(unsigned long iflags)
xflags |= FS_XFLAG_NOATIME;
if (iflags & EXT4_PROJINHERIT_FL)
xflags |= FS_XFLAG_PROJINHERIT;
+ if (iflags & EXT4_DAX_FL)
+ xflags |= FS_XFLAG_DAX;
return xflags;
}
#define EXT4_SUPPORTED_FS_XFLAGS (FS_XFLAG_SYNC | FS_XFLAG_IMMUTABLE | \
FS_XFLAG_APPEND | FS_XFLAG_NODUMP | \
- FS_XFLAG_NOATIME | FS_XFLAG_PROJINHERIT)
+ FS_XFLAG_NOATIME | FS_XFLAG_PROJINHERIT | \
+ FS_XFLAG_DAX)
/* Transfer xflags flags to internal */
static inline unsigned long ext4_xflags_to_iflags(__u32 xflags)
@@ -449,6 +498,8 @@ static inline unsigned long ext4_xflags_to_iflags(__u32 xflags)
iflags |= EXT4_NOATIME_FL;
if (xflags & FS_XFLAG_PROJINHERIT)
iflags |= EXT4_PROJINHERIT_FL;
+ if (xflags & FS_XFLAG_DAX)
+ iflags |= EXT4_DAX_FL;
return iflags;
}
--
2.9.5
If an inode has inline data it is currently prevented from using DAX by a
check in ext4_should_use_dax(). When the inode grows inline data via
ext4_create_inline_data() or removes its inline data via
ext4_destroy_inline_data_nolock(), the value of S_DAX can change.
Currently these changes are unsafe because we don't hold off page faults
and I/O, write back dirty radix tree entries and invalidate all mappings.
This work is done in XFS via xfs_ioctl_setattr_dax_invalidate(), for
example. This unsafe transitioning of S_DAX could potentially lead to data
corruption.
Fix this issue by preventing the DAX mount option from being used on
filesystems that were created to support inline data. Inline data is an
option given to mkfs.ext4.
We prevent DAX from being used with inline data as opposed to trying to
safely manage the transition of S_DAX because of the locking complexities:
1) filemap_write_and_wait() eventually calls ext4_writepages(), which
acquires the sbi->s_journal_flag_rwsem. This lock ranks above the
jbdw_handle which is eventually taken by ext4_journal_start(). This
essentially means that the writeback has to happen outside of the context
of an active journal handle (outside of ext4_journal_start() to
ext4_journal_stop().)
2) To lock out page faults we take a write lock on the ei->i_mmap_sem, and
this lock again ranks above the jbd2_handle taken by ext4_journal_start().
So, as with the writeback code in 1) above we have to take ei->i_mmap_sem
outside of the context of an active journal handle.
We are able to work around both of these restrictions and safely transition
S_DAX when we change the journaling mode, but for inline data it's much
harder because each of ext4_create_inline_data() and
ext4_destroy_inline_data_nolock() are passed in journal handles that have
already been started.
To be able to safely writeback and invalidate our dirty inode data we'd
either need to uplevel the locking, writeback and invalidate into all the
callers of those two functions, or we'd need to stop our current journal
handle, do the appropriate locking, writeback and invalidate, unlock and
restart the journal handle.
These both seem too complex, and I don't know if we have a valid use case
where we need to combine a filesystem with inline data and DAX, so just
prevent them from being used together.
Signed-off-by: Ross Zwisler <[email protected]>
CC: [email protected]
---
fs/ext4/inline.c | 10 ----------
fs/ext4/super.c | 5 +++++
2 files changed, 5 insertions(+), 10 deletions(-)
diff --git a/fs/ext4/inline.c b/fs/ext4/inline.c
index 28c5c3a..fd95019 100644
--- a/fs/ext4/inline.c
+++ b/fs/ext4/inline.c
@@ -302,11 +302,6 @@ static int ext4_create_inline_data(handle_t *handle,
EXT4_I(inode)->i_inline_size = len + EXT4_MIN_INLINE_DATA_SIZE;
ext4_clear_inode_flag(inode, EXT4_INODE_EXTENTS);
ext4_set_inode_flag(inode, EXT4_INODE_INLINE_DATA);
- /*
- * Propagate changes to inode->i_flags as well - e.g. S_DAX may
- * get cleared
- */
- ext4_set_inode_flags(inode);
get_bh(is.iloc.bh);
error = ext4_mark_iloc_dirty(handle, inode, &is.iloc);
@@ -451,11 +446,6 @@ static int ext4_destroy_inline_data_nolock(handle_t *handle,
}
}
ext4_clear_inode_flag(inode, EXT4_INODE_INLINE_DATA);
- /*
- * Propagate changes to inode->i_flags as well - e.g. S_DAX may
- * get set.
- */
- ext4_set_inode_flags(inode);
get_bh(is.iloc.bh);
error = ext4_mark_iloc_dirty(handle, inode, &is.iloc);
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index d61a70e2..d549dfb 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -3686,6 +3686,11 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
}
if (sbi->s_mount_opt & EXT4_MOUNT_DAX) {
+ if (ext4_has_feature_inline_data(sb)) {
+ ext4_msg(sb, KERN_ERR, "Cannot use DAX on a filesystem"
+ " that may contain inline data");
+ goto failed_mount;
+ }
err = bdev_dax_supported(sb, blocksize);
if (err)
goto failed_mount;
--
2.9.5
We prevent DAX from being used on inodes which are using ext4's built in
encryption via a check in ext4_should_use_dax(). We do have what appears
to be an unsafe transition of S_DAX in ext4_set_context(), though, where
S_DAX can get disabled without us doing a proper writeback + invalidate.
I actually think we are safe in this case because of the following:
1) You can't encrypt an existing file. Encryption can only be set on an
empty directory, with new inodes in that directory being created with
encryption turned on, so I don't think it's possible to turn encryption on
for a file that has open DAX mmaps or outstanding I/Os.
2) There is no way to turn encryption off on a given file. Once an inode
is encrypted, it stays encrypted for the life of that inode, so we don't
have to worry about the case where we turn encryption off and S_DAX
suddenly turns on.
3) The only way we end up in ext4_set_context() to turn on encryption is
when we are creating a new file in the encrypted directory. This happens
as part of ext4_create() before the inode has been allowed to do any I/O.
Here's the call tree:
ext4_create()
__ext4_new_inode()
ext4_set_inode_flags() // sets S_DAX
fscrypt_inherit_context()
fscrypt_get_encryption_info();
ext4_set_context() // sets EXT4_INODE_ENCRYPT, clears S_DAX
So, I actually think it's safe to transition S_DAX in ext4_set_context()
without any locking, writebacks or invalidations. I've added a
WARN_ON_ONCE() sanity check to make sure that we are notified if we ever
encounter a case where we are encrypting an inode that already has data,
in which case we need to add code to safely transition S_DAX.
Signed-off-by: Ross Zwisler <[email protected]>
CC: [email protected]
---
fs/ext4/super.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index d549dfb..6604a18 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -1159,6 +1159,9 @@ static int ext4_set_context(struct inode *inode, const void *ctx, size_t len,
if (inode->i_ino == EXT4_ROOT_INO)
return -EPERM;
+ if (WARN_ON_ONCE(IS_DAX(inode) && i_size_read(inode)))
+ return -EINVAL;
+
res = ext4_convert_inline_data(inode);
if (res)
return res;
--
2.9.5
Rework the error handling in ext4_change_inode_journal_flag() so that
multiple paths can re-use portions of the same cleanup code via gotos
instead of each path doing their own cleanup. This will benefit later
patches that add more paths to this function that must be unwound on error.
Signed-off-by: Ross Zwisler <[email protected]>
CC: [email protected]
---
fs/ext4/inode.c | 24 +++++++++++++-----------
1 file changed, 13 insertions(+), 11 deletions(-)
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 864fb94..d218991 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -5953,11 +5953,8 @@ int ext4_change_inode_journal_flag(struct inode *inode, int val)
if (val) {
down_write(&EXT4_I(inode)->i_mmap_sem);
err = filemap_write_and_wait(inode->i_mapping);
- if (err < 0) {
- up_write(&EXT4_I(inode)->i_mmap_sem);
- ext4_inode_resume_unlocked_dio(inode);
- return err;
- }
+ if (err < 0)
+ goto out_unlock;
}
percpu_down_write(&sbi->s_journal_flag_rwsem);
@@ -5975,12 +5972,8 @@ int ext4_change_inode_journal_flag(struct inode *inode, int val)
ext4_set_inode_flag(inode, EXT4_INODE_JOURNAL_DATA);
else {
err = jbd2_journal_flush(journal);
- if (err < 0) {
- jbd2_journal_unlock_updates(journal);
- percpu_up_write(&sbi->s_journal_flag_rwsem);
- ext4_inode_resume_unlocked_dio(inode);
- return err;
- }
+ if (err < 0)
+ goto out_journal_unlock;
ext4_clear_inode_flag(inode, EXT4_INODE_JOURNAL_DATA);
}
ext4_set_aops(inode);
@@ -6009,6 +6002,15 @@ int ext4_change_inode_journal_flag(struct inode *inode, int val)
ext4_std_error(inode->i_sb, err);
return err;
+
+out_journal_unlock:
+ jbd2_journal_unlock_updates(journal);
+ percpu_up_write(&sbi->s_journal_flag_rwsem);
+out_unlock:
+ if (val)
+ up_write(&EXT4_I(inode)->i_mmap_sem);
+ ext4_inode_resume_unlocked_dio(inode);
+ return err;
}
static int ext4_bh_unmapped(handle_t *handle, struct buffer_head *bh)
--
2.9.5
The IOCTL path which switches the journaling mode for an inode is currently
unsafe because it doesn't properly do a writeback and invalidation on the
inode. In XFS, for example, safe transitions of S_DAX are handled by
xfs_ioctl_setattr_dax_invalidate() which locks out page faults and I/O,
does a writeback via filemap_write_and_wait() and an invalidation via
invalidate_inode_pages2().
Without this in place we can see the following kernel warning when we try
and insert a DAX exceptional entry but find that a dirty page cache page is
still in the mapping->radix_tree:
WARNING: CPU: 4 PID: 1052 at mm/filemap.c:262 __delete_from_page_cache+0x375/0x550
Modules linked in: dax_pmem nd_pmem device_dax nd_btt nfit libnvdimm
CPU: 4 PID: 1052 Comm: small Not tainted 4.13.0-rc6-00055-gac26931 #3
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.9.3-1.fc25 04/01/2014
task: ffff88020ccd0000 task.stack: ffffc900021d4000
RIP: 0010:__delete_from_page_cache+0x375/0x550
RSP: 0000:ffffc900021d7b90 EFLAGS: 00010002
RAX: 002fffc00001123d RBX: ffffffffffffffff RCX: ffff8801d9440d68
RDX: 0000000000000000 RSI: ffffffff81fd5b84 RDI: ffffffff81f6f0e5
RBP: ffffc900021d7be0 R08: 0000000000000000 R09: ffff8801f9938c70
R10: 0000000000000021 R11: ffff8801f9938c91 R12: ffff8801d9440d70
R13: ffffea0007fdda80 R14: 0000000000000001 R15: ffff8801d9440d68
FS: 00007feacc041700(0000) GS:ffff880211800000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000010420000 CR3: 000000020cfd8000 CR4: 00000000000006e0
Call Trace:
dax_insert_mapping_entry+0x158/0x2c0
dax_iomap_fault+0x1020/0x1bb0
ext4_dax_huge_fault+0xc8/0x160
ext4_dax_fault+0x10/0x20
__do_fault+0x20/0x110
__handle_mm_fault+0x97d/0x1120
handle_mm_fault+0x188/0x2f0
__do_page_fault+0x28f/0x590
trace_do_page_fault+0x58/0x2c0
do_async_page_fault+0x2c/0x90
async_page_fault+0x28/0x30
I'm pretty sure we could make a test that shows userspace visible data
corruption as well in this scenario.
Make it safe to change the journaling mode and turn on or off S_DAX by
adding locking to properly lock out page faults (i_mmap_sem) and then doing
the writeback and invalidate. I/O is already held off because all callers
of ext4_ioctl_setflags() hold the inode lock.
The locking for this new code is complex because of the following:
1) filemap_write_and_wait() eventually calls ext4_writepages(), which
acquires the sbi->s_journal_flag_rwsem. This lock ranks above the
jbdw_handle which is eventually taken by ext4_journal_start(). This
essentially means that the writeback has to happen outside of the context
of an active journal handle (outside of ext4_journal_start() to
ext4_journal_stop().)
2) To lock out page faults we take a write lock on the ei->i_mmap_sem, and
this lock again ranks above the jbd2_handle taken by ext4_journal_start().
So, as with the writeback code in 1) above we have to take ei->i_mmap_sem
outside of the context of an active journal handle.
Signed-off-by: Ross Zwisler <[email protected]>
CC: [email protected]
---
Please let me know if my "context of an active journal handle" terminology
makes sense, or if some other phrasing is more common.
I'll work on an xfstest which shows this data corruption.
---
fs/ext4/inode.c | 25 +++++++++++++++----------
1 file changed, 15 insertions(+), 10 deletions(-)
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index d218991..facb5ae 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -5949,13 +5949,15 @@ int ext4_change_inode_journal_flag(struct inode *inode, int val)
* fallocate or buffered writes in dioread_nolock mode covered by
* dirty data which can be converted only after flushing the dirty
* data (and journalled aops don't know how to handle these cases).
+ *
+ * Because the changes to our journaling mode may also modify S_DAX we
+ * need to hold off I/O and page faults so we can write back any dirty
+ * data and invalidate all mappings.
*/
- if (val) {
- down_write(&EXT4_I(inode)->i_mmap_sem);
- err = filemap_write_and_wait(inode->i_mapping);
- if (err < 0)
- goto out_unlock;
- }
+ down_write(&EXT4_I(inode)->i_mmap_sem);
+ err = filemap_write_and_wait(inode->i_mapping);
+ if (err < 0)
+ goto out_unlock;
percpu_down_write(&sbi->s_journal_flag_rwsem);
jbd2_journal_lock_updates(journal);
@@ -5976,6 +5978,11 @@ int ext4_change_inode_journal_flag(struct inode *inode, int val)
goto out_journal_unlock;
ext4_clear_inode_flag(inode, EXT4_INODE_JOURNAL_DATA);
}
+
+ err = invalidate_inode_pages2(inode->i_mapping);
+ if (err < 0)
+ goto out_journal_unlock;
+
ext4_set_aops(inode);
/*
* Update inode->i_flags after EXT4_INODE_JOURNAL_DATA was updated.
@@ -5986,8 +5993,7 @@ int ext4_change_inode_journal_flag(struct inode *inode, int val)
jbd2_journal_unlock_updates(journal);
percpu_up_write(&sbi->s_journal_flag_rwsem);
- if (val)
- up_write(&EXT4_I(inode)->i_mmap_sem);
+ up_write(&EXT4_I(inode)->i_mmap_sem);
ext4_inode_resume_unlocked_dio(inode);
/* Finally we can mark the inode as dirty. */
@@ -6007,8 +6013,7 @@ int ext4_change_inode_journal_flag(struct inode *inode, int val)
jbd2_journal_unlock_updates(journal);
percpu_up_write(&sbi->s_journal_flag_rwsem);
out_unlock:
- if (val)
- up_write(&EXT4_I(inode)->i_mmap_sem);
+ up_write(&EXT4_I(inode)->i_mmap_sem);
ext4_inode_resume_unlocked_dio(inode);
return err;
}
--
2.9.5
On 9/5/17 5:35 PM, Ross Zwisler wrote:
> The original intent of this series was to add a per-inode DAX flag to ext4
> so that it would be consistent with XFS. In my travels I found and fixed
> several related issues in both ext4 and XFS.
Hi Ross -
hch had a lot of reasons to nuke the dax flag from orbit, and we just
/disabled/ it in xfs due to its habit of crashing the kernel...
so a couple questions:
1) does this series pass hch's "test the per-inode DAX flag" fstest?
2) do we have an agreement that we need this flag at all, or is this
just a parity item because xfs has^whad a per-inode flag?
Thanks,
-Eric
> I'm not fully happy with the ways that ext4 DAX interacts with conflicting
> features (journaling, inline data and encryption). My goal with this
> series was to make all these interactions as consistent as possilble, and
> of course to make them safe. If anyone has ideas for improvements, I'm
> very open.
>
> Ross Zwisler (9):
> ext4: remove duplicate extended attributes defs
> xfs: always use DAX if mount option is used
> xfs: validate bdev support for DAX inode flag
> ext4: add ext4_should_use_dax()
> ext4: ext4_change_inode_journal_flag error handling
> ext4: safely transition S_DAX on journaling changes
> ext4: prevent data corruption with inline data + DAX
> ext4: add sanity check for encryption + DAX
> ext4: add per-inode DAX flag
>
> fs/ext4/ext4.h | 47 ++++++---------------------------------------
> fs/ext4/ext4_jbd2.h | 16 ++++++++++++++++
> fs/ext4/inline.c | 10 ----------
> fs/ext4/inode.c | 45 ++++++++++++++++++++++++-------------------
> fs/ext4/ioctl.c | 55 +++++++++++++++++++++++++++++++++++++++++++++++++++--
> fs/ext4/super.c | 8 ++++++++
> fs/xfs/xfs_ioctl.c | 14 +++++++++++---
> 7 files changed, 119 insertions(+), 76 deletions(-)
>
On Tue 05-09-17 16:35:33, Ross Zwisler wrote:
> The following commit:
>
> commit 9b7365fc1c82 ("ext4: add FS_IOC_FSSETXATTR/FS_IOC_FSGETXATTR
> interface support")
>
> added several defines related to extended attributes to ext4.h. They were
> added within an #ifndef FS_IOC_FSGETXATTR block with the comment:
>
> /* Until the uapi changes get merged for project quota... */
>
> Those uapi changes were merged by this commit:
>
> commit 334e580a6f97 ("fs: XFS_IOC_FS[SG]SETXATTR to FS_IOC_FS[SG]ETXATTR
> promotion")
>
> so all the definitions needed by ext4 are available in
> include/uapi/linux/fs.h. Remove the duplicates from ext4.h.
>
> Signed-off-by: Ross Zwisler <[email protected]>
> Cc: Li Xi <[email protected]>
> Cc: Theodore Ts'o <[email protected]>
> Cc: Andreas Dilger <[email protected]>
> Cc: Jan Kara <[email protected]>
> Cc: Dave Chinner <[email protected]>
Yeah, good cleanup. You can add:
Reviewed-by: Jan Kara <[email protected]>
Honza
> ---
> fs/ext4/ext4.h | 37 -------------------------------------
> 1 file changed, 37 deletions(-)
>
> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> index a2bb7d2..c950278 100644
> --- a/fs/ext4/ext4.h
> +++ b/fs/ext4/ext4.h
> @@ -644,43 +644,6 @@ enum {
> #define EXT4_IOC_GET_ENCRYPTION_PWSALT FS_IOC_GET_ENCRYPTION_PWSALT
> #define EXT4_IOC_GET_ENCRYPTION_POLICY FS_IOC_GET_ENCRYPTION_POLICY
>
> -#ifndef FS_IOC_FSGETXATTR
> -/* Until the uapi changes get merged for project quota... */
> -
> -#define FS_IOC_FSGETXATTR _IOR('X', 31, struct fsxattr)
> -#define FS_IOC_FSSETXATTR _IOW('X', 32, struct fsxattr)
> -
> -/*
> - * Structure for FS_IOC_FSGETXATTR and FS_IOC_FSSETXATTR.
> - */
> -struct fsxattr {
> - __u32 fsx_xflags; /* xflags field value (get/set) */
> - __u32 fsx_extsize; /* extsize field value (get/set)*/
> - __u32 fsx_nextents; /* nextents field value (get) */
> - __u32 fsx_projid; /* project identifier (get/set) */
> - unsigned char fsx_pad[12];
> -};
> -
> -/*
> - * Flags for the fsx_xflags field
> - */
> -#define FS_XFLAG_REALTIME 0x00000001 /* data in realtime volume */
> -#define FS_XFLAG_PREALLOC 0x00000002 /* preallocated file extents */
> -#define FS_XFLAG_IMMUTABLE 0x00000008 /* file cannot be modified */
> -#define FS_XFLAG_APPEND 0x00000010 /* all writes append */
> -#define FS_XFLAG_SYNC 0x00000020 /* all writes synchronous */
> -#define FS_XFLAG_NOATIME 0x00000040 /* do not update access time */
> -#define FS_XFLAG_NODUMP 0x00000080 /* do not include in backups */
> -#define FS_XFLAG_RTINHERIT 0x00000100 /* create with rt bit set */
> -#define FS_XFLAG_PROJINHERIT 0x00000200 /* create with parents projid */
> -#define FS_XFLAG_NOSYMLINKS 0x00000400 /* disallow symlink creation */
> -#define FS_XFLAG_EXTSIZE 0x00000800 /* extent size allocator hint */
> -#define FS_XFLAG_EXTSZINHERIT 0x00001000 /* inherit inode extent size */
> -#define FS_XFLAG_NODEFRAG 0x00002000 /* do not defragment */
> -#define FS_XFLAG_FILESTREAM 0x00004000 /* use filestream allocator */
> -#define FS_XFLAG_HASATTR 0x80000000 /* no DIFLAG for this */
> -#endif /* !defined(FS_IOC_FSGETXATTR) */
> -
> #define EXT4_IOC_FSGETXATTR FS_IOC_FSGETXATTR
> #define EXT4_IOC_FSSETXATTR FS_IOC_FSSETXATTR
>
> --
> 2.9.5
>
--
Jan Kara <[email protected]>
SUSE Labs, CR
On Tue 05-09-17 16:35:38, Ross Zwisler wrote:
> The IOCTL path which switches the journaling mode for an inode is currently
> unsafe because it doesn't properly do a writeback and invalidation on the
> inode. In XFS, for example, safe transitions of S_DAX are handled by
> xfs_ioctl_setattr_dax_invalidate() which locks out page faults and I/O,
> does a writeback via filemap_write_and_wait() and an invalidation via
> invalidate_inode_pages2().
>
> Without this in place we can see the following kernel warning when we try
> and insert a DAX exceptional entry but find that a dirty page cache page is
> still in the mapping->radix_tree:
>
> WARNING: CPU: 4 PID: 1052 at mm/filemap.c:262 __delete_from_page_cache+0x375/0x550
> Modules linked in: dax_pmem nd_pmem device_dax nd_btt nfit libnvdimm
> CPU: 4 PID: 1052 Comm: small Not tainted 4.13.0-rc6-00055-gac26931 #3
> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.9.3-1.fc25 04/01/2014
> task: ffff88020ccd0000 task.stack: ffffc900021d4000
> RIP: 0010:__delete_from_page_cache+0x375/0x550
> RSP: 0000:ffffc900021d7b90 EFLAGS: 00010002
> RAX: 002fffc00001123d RBX: ffffffffffffffff RCX: ffff8801d9440d68
> RDX: 0000000000000000 RSI: ffffffff81fd5b84 RDI: ffffffff81f6f0e5
> RBP: ffffc900021d7be0 R08: 0000000000000000 R09: ffff8801f9938c70
> R10: 0000000000000021 R11: ffff8801f9938c91 R12: ffff8801d9440d70
> R13: ffffea0007fdda80 R14: 0000000000000001 R15: ffff8801d9440d68
> FS: 00007feacc041700(0000) GS:ffff880211800000(0000) knlGS:0000000000000000
> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 0000000010420000 CR3: 000000020cfd8000 CR4: 00000000000006e0
> Call Trace:
> dax_insert_mapping_entry+0x158/0x2c0
> dax_iomap_fault+0x1020/0x1bb0
> ext4_dax_huge_fault+0xc8/0x160
> ext4_dax_fault+0x10/0x20
> __do_fault+0x20/0x110
> __handle_mm_fault+0x97d/0x1120
> handle_mm_fault+0x188/0x2f0
> __do_page_fault+0x28f/0x590
> trace_do_page_fault+0x58/0x2c0
> do_async_page_fault+0x2c/0x90
> async_page_fault+0x28/0x30
>
> I'm pretty sure we could make a test that shows userspace visible data
> corruption as well in this scenario.
>
> Make it safe to change the journaling mode and turn on or off S_DAX by
> adding locking to properly lock out page faults (i_mmap_sem) and then doing
> the writeback and invalidate. I/O is already held off because all callers
> of ext4_ioctl_setflags() hold the inode lock.
Yeah, this is a good point. It is just that this is not enough as I
discovered in [1]. You also need to tear down & recreate VMAs when changing
DAX flag which is a bit tricky. So for now I think returning EBUSY when
file is mmaped and we'd like to flip DAX flag is the best solution. Hmm?
[1] https://www.spinics.net/lists/linux-xfs/msg09859.html
> The locking for this new code is complex because of the following:
>
> 1) filemap_write_and_wait() eventually calls ext4_writepages(), which
> acquires the sbi->s_journal_flag_rwsem. This lock ranks above the
> jbdw_handle which is eventually taken by ext4_journal_start(). This
> essentially means that the writeback has to happen outside of the context
> of an active journal handle (outside of ext4_journal_start() to
> ext4_journal_stop().)
>
> 2) To lock out page faults we take a write lock on the ei->i_mmap_sem, and
> this lock again ranks above the jbd2_handle taken by ext4_journal_start().
> So, as with the writeback code in 1) above we have to take ei->i_mmap_sem
> outside of the context of an active journal handle.
Welcome to the joy of fs locking ;)
> Signed-off-by: Ross Zwisler <[email protected]>
> CC: [email protected]
Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR
On Tue, Sep 05, 2017 at 09:12:35PM -0500, Eric Sandeen wrote:
> On 9/5/17 5:35 PM, Ross Zwisler wrote:
> > The original intent of this series was to add a per-inode DAX flag to ext4
> > so that it would be consistent with XFS. In my travels I found and fixed
> > several related issues in both ext4 and XFS.
>
> Hi Ross -
>
> hch had a lot of reasons to nuke the dax flag from orbit, and we just
> /disabled/ it in xfs due to its habit of crashing the kernel...
Ah, sorry, I wasn't CC'd on those threads and missed them. For any interested
bystanders:
https://www.spinics.net/lists/linux-ext4/msg57840.html
https://www.spinics.net/lists/linux-xfs/msg09831.html
https://www.spinics.net/lists/linux-xfs/msg10124.html
> so a couple questions:
>
> 1) does this series pass hch's "test the per-inode DAX flag" fstest?
Nope, it has the exact same problems as the XFS per-inode DAX flag.
> 2) do we have an agreement that we need this flag at all, or is this
> just a parity item because xfs has^whad a per-inode flag?
It was for parity, and because it allows admins finer grained control over
their system. Basically all things discussed in response to Lukas's original
patch in the first link above.
The way this series ended up the first 8 patches were all fixes for the
existing code, and only patch 9 introduced the new per-inode flag. I'll drop
patch 9 for now and rework the first 8 patches so we can get safer behavior of
the existing DAX mount option in ext4. We can try patch 9 again later if we
come to an agreement that re-enables the XFS per-inode DAX option.
On Wed, Sep 06, 2017 at 11:47:00AM +0200, Jan Kara wrote:
> On Tue 05-09-17 16:35:38, Ross Zwisler wrote:
> > The IOCTL path which switches the journaling mode for an inode is currently
> > unsafe because it doesn't properly do a writeback and invalidation on the
> > inode. In XFS, for example, safe transitions of S_DAX are handled by
> > xfs_ioctl_setattr_dax_invalidate() which locks out page faults and I/O,
> > does a writeback via filemap_write_and_wait() and an invalidation via
> > invalidate_inode_pages2().
> >
> > Without this in place we can see the following kernel warning when we try
> > and insert a DAX exceptional entry but find that a dirty page cache page is
> > still in the mapping->radix_tree:
> >
> > WARNING: CPU: 4 PID: 1052 at mm/filemap.c:262 __delete_from_page_cache+0x375/0x550
> > Modules linked in: dax_pmem nd_pmem device_dax nd_btt nfit libnvdimm
> > CPU: 4 PID: 1052 Comm: small Not tainted 4.13.0-rc6-00055-gac26931 #3
> > Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.9.3-1.fc25 04/01/2014
> > task: ffff88020ccd0000 task.stack: ffffc900021d4000
> > RIP: 0010:__delete_from_page_cache+0x375/0x550
> > RSP: 0000:ffffc900021d7b90 EFLAGS: 00010002
> > RAX: 002fffc00001123d RBX: ffffffffffffffff RCX: ffff8801d9440d68
> > RDX: 0000000000000000 RSI: ffffffff81fd5b84 RDI: ffffffff81f6f0e5
> > RBP: ffffc900021d7be0 R08: 0000000000000000 R09: ffff8801f9938c70
> > R10: 0000000000000021 R11: ffff8801f9938c91 R12: ffff8801d9440d70
> > R13: ffffea0007fdda80 R14: 0000000000000001 R15: ffff8801d9440d68
> > FS: 00007feacc041700(0000) GS:ffff880211800000(0000) knlGS:0000000000000000
> > CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > CR2: 0000000010420000 CR3: 000000020cfd8000 CR4: 00000000000006e0
> > Call Trace:
> > dax_insert_mapping_entry+0x158/0x2c0
> > dax_iomap_fault+0x1020/0x1bb0
> > ext4_dax_huge_fault+0xc8/0x160
> > ext4_dax_fault+0x10/0x20
> > __do_fault+0x20/0x110
> > __handle_mm_fault+0x97d/0x1120
> > handle_mm_fault+0x188/0x2f0
> > __do_page_fault+0x28f/0x590
> > trace_do_page_fault+0x58/0x2c0
> > do_async_page_fault+0x2c/0x90
> > async_page_fault+0x28/0x30
> >
> > I'm pretty sure we could make a test that shows userspace visible data
> > corruption as well in this scenario.
> >
> > Make it safe to change the journaling mode and turn on or off S_DAX by
> > adding locking to properly lock out page faults (i_mmap_sem) and then doing
> > the writeback and invalidate. I/O is already held off because all callers
> > of ext4_ioctl_setflags() hold the inode lock.
>
> Yeah, this is a good point. It is just that this is not enough as I
> discovered in [1]. You also need to tear down & recreate VMAs when changing
> DAX flag which is a bit tricky. So for now I think returning EBUSY when
> file is mmaped and we'd like to flip DAX flag is the best solution. Hmm?
>
> [1] https://www.spinics.net/lists/linux-xfs/msg09859.html
Yea, thanks for the link, I totally missed this discussion (obviously).
Cool, I'll rework this for v2.
> > The locking for this new code is complex because of the following:
> >
> > 1) filemap_write_and_wait() eventually calls ext4_writepages(), which
> > acquires the sbi->s_journal_flag_rwsem. This lock ranks above the
> > jbdw_handle which is eventually taken by ext4_journal_start(). This
> > essentially means that the writeback has to happen outside of the context
> > of an active journal handle (outside of ext4_journal_start() to
> > ext4_journal_stop().)
> >
> > 2) To lock out page faults we take a write lock on the ei->i_mmap_sem, and
> > this lock again ranks above the jbd2_handle taken by ext4_journal_start().
> > So, as with the writeback code in 1) above we have to take ei->i_mmap_sem
> > outside of the context of an active journal handle.
>
> Welcome to the joy of fs locking ;)
:) Well, I feel like I learned a lot more about ext4 during this patch set!
> > Signed-off-by: Ross Zwisler <[email protected]>
> > CC: [email protected]
>
> Honza
>
> --
> Jan Kara <[email protected]>
> SUSE Labs, CR
On Sep 5, 2017, at 4:35 PM, Ross Zwisler <[email protected]> wrote:
>
> If an inode has inline data it is currently prevented from using DAX by a
> check in ext4_should_use_dax(). When the inode grows inline data via
> ext4_create_inline_data() or removes its inline data via
> ext4_destroy_inline_data_nolock(), the value of S_DAX can change.
>
> Currently these changes are unsafe because we don't hold off page faults
> and I/O, write back dirty radix tree entries and invalidate all mappings.
> This work is done in XFS via xfs_ioctl_setattr_dax_invalidate(), for
> example. This unsafe transitioning of S_DAX could potentially lead to data
> corruption.
>
> Fix this issue by preventing the DAX mount option from being used on
> filesystems that were created to support inline data. Inline data is an
> option given to mkfs.ext4.
>
> We prevent DAX from being used with inline data as opposed to trying to
> safely manage the transition of S_DAX because of the locking complexities:
>
> 1) filemap_write_and_wait() eventually calls ext4_writepages(), which
> acquires the sbi->s_journal_flag_rwsem. This lock ranks above the
> jbdw_handle which is eventually taken by ext4_journal_start(). This
> essentially means that the writeback has to happen outside of the context
> of an active journal handle (outside of ext4_journal_start() to
> ext4_journal_stop().)
>
> 2) To lock out page faults we take a write lock on the ei->i_mmap_sem, and
> this lock again ranks above the jbd2_handle taken by ext4_journal_start().
> So, as with the writeback code in 1) above we have to take ei->i_mmap_sem
> outside of the context of an active journal handle.
>
> We are able to work around both of these restrictions and safely transition
> S_DAX when we change the journaling mode, but for inline data it's much
> harder because each of ext4_create_inline_data() and
> ext4_destroy_inline_data_nolock() are passed in journal handles that have
> already been started.
>
> To be able to safely writeback and invalidate our dirty inode data we'd
> either need to uplevel the locking, writeback and invalidate into all the
> callers of those two functions, or we'd need to stop our current journal
> handle, do the appropriate locking, writeback and invalidate, unlock and
> restart the journal handle.
>
> These both seem too complex, and I don't know if we have a valid use case
> where we need to combine a filesystem with inline data and DAX, so just
> prevent them from being used together.
The one reason I can see to use inline data + DAX is that inline data saves
space for very small files, even if the performance improvement is minimal.
Since NVDIMMs are still relatively expensive, storing very small files and
directories directly in the inode is probably worthwhile.
That said, there are still occasional bugs in the inline data code, so it
makes sense to ensure these two features are not enabled at the same time
if they don't play well together.
> Signed-off-by: Ross Zwisler <[email protected]>
> CC: [email protected]
> ---
> fs/ext4/inline.c | 10 ----------
> fs/ext4/super.c | 5 +++++
> 2 files changed, 5 insertions(+), 10 deletions(-)
>
> diff --git a/fs/ext4/inline.c b/fs/ext4/inline.c
> index 28c5c3a..fd95019 100644
> --- a/fs/ext4/inline.c
> +++ b/fs/ext4/inline.c
> @@ -302,11 +302,6 @@ static int ext4_create_inline_data(handle_t *handle,
> EXT4_I(inode)->i_inline_size = len + EXT4_MIN_INLINE_DATA_SIZE;
> ext4_clear_inode_flag(inode, EXT4_INODE_EXTENTS);
> ext4_set_inode_flag(inode, EXT4_INODE_INLINE_DATA);
> - /*
> - * Propagate changes to inode->i_flags as well - e.g. S_DAX may
> - * get cleared
> - */
> - ext4_set_inode_flags(inode);
What about other flags in the inode? It doesn't make sense to drop this
entirely. The S_DAX flag shouldn't be set if the inode has the inline
data flag set, according to ext4_set_inode_flags().
> get_bh(is.iloc.bh);
> error = ext4_mark_iloc_dirty(handle, inode, &is.iloc);
>
> @@ -451,11 +446,6 @@ static int ext4_destroy_inline_data_nolock(handle_t *handle,
> }
> }
> ext4_clear_inode_flag(inode, EXT4_INODE_INLINE_DATA);
> - /*
> - * Propagate changes to inode->i_flags as well - e.g. S_DAX may
> - * get set.
> - */
> - ext4_set_inode_flags(inode);
Same?
> get_bh(is.iloc.bh);
> error = ext4_mark_iloc_dirty(handle, inode, &is.iloc);
> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> index d61a70e2..d549dfb 100644
> --- a/fs/ext4/super.c
> +++ b/fs/ext4/super.c
> @@ -3686,6 +3686,11 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
> }
>
> if (sbi->s_mount_opt & EXT4_MOUNT_DAX) {
> + if (ext4_has_feature_inline_data(sb)) {
> + ext4_msg(sb, KERN_ERR, "Cannot use DAX on a filesystem"
> + " that may contain inline data");
> + goto failed_mount;
> + }
Wouldn't it be enough to just prevent modification of inodes that are stored
in the inode? It should be OK to read such files. At a minimum that means
there should not be an error in case of read-only mounting. A better choice
would be to return an error only at runtime in case of open-for-write, or
only if the file is actually being written.
Cheers, Andreas
On Wed, Sep 06, 2017 at 02:55:31PM -0600, Andreas Dilger wrote:
> On Sep 5, 2017, at 4:35 PM, Ross Zwisler <[email protected]> wrote:
> >
> > If an inode has inline data it is currently prevented from using DAX by a
> > check in ext4_should_use_dax(). When the inode grows inline data via
> > ext4_create_inline_data() or removes its inline data via
> > ext4_destroy_inline_data_nolock(), the value of S_DAX can change.
> >
> > Currently these changes are unsafe because we don't hold off page faults
> > and I/O, write back dirty radix tree entries and invalidate all mappings.
> > This work is done in XFS via xfs_ioctl_setattr_dax_invalidate(), for
> > example. This unsafe transitioning of S_DAX could potentially lead to data
> > corruption.
> >
> > Fix this issue by preventing the DAX mount option from being used on
> > filesystems that were created to support inline data. Inline data is an
> > option given to mkfs.ext4.
> >
> > We prevent DAX from being used with inline data as opposed to trying to
> > safely manage the transition of S_DAX because of the locking complexities:
> >
> > 1) filemap_write_and_wait() eventually calls ext4_writepages(), which
> > acquires the sbi->s_journal_flag_rwsem. This lock ranks above the
> > jbdw_handle which is eventually taken by ext4_journal_start(). This
> > essentially means that the writeback has to happen outside of the context
> > of an active journal handle (outside of ext4_journal_start() to
> > ext4_journal_stop().)
> >
> > 2) To lock out page faults we take a write lock on the ei->i_mmap_sem, and
> > this lock again ranks above the jbd2_handle taken by ext4_journal_start().
> > So, as with the writeback code in 1) above we have to take ei->i_mmap_sem
> > outside of the context of an active journal handle.
> >
> > We are able to work around both of these restrictions and safely transition
> > S_DAX when we change the journaling mode, but for inline data it's much
> > harder because each of ext4_create_inline_data() and
> > ext4_destroy_inline_data_nolock() are passed in journal handles that have
> > already been started.
> >
> > To be able to safely writeback and invalidate our dirty inode data we'd
> > either need to uplevel the locking, writeback and invalidate into all the
> > callers of those two functions, or we'd need to stop our current journal
> > handle, do the appropriate locking, writeback and invalidate, unlock and
> > restart the journal handle.
> >
> > These both seem too complex, and I don't know if we have a valid use case
> > where we need to combine a filesystem with inline data and DAX, so just
> > prevent them from being used together.
>
> The one reason I can see to use inline data + DAX is that inline data saves
> space for very small files, even if the performance improvement is minimal.
> Since NVDIMMs are still relatively expensive, storing very small files and
> directories directly in the inode is probably worthwhile.
>
> That said, there are still occasional bugs in the inline data code, so it
> makes sense to ensure these two features are not enabled at the same time
> if they don't play well together.
Yep, unfortunately it's increasingly looking like we just can't ever safely
transition S_DAX at runtime:
https://www.spinics.net/lists/linux-xfs/msg09859.html
I think the current thought is to just avoid it completely until/unless we can
figure out a way to make it safe that is worthwhile.
> > Signed-off-by: Ross Zwisler <[email protected]>
> > CC: [email protected]
> > ---
> > fs/ext4/inline.c | 10 ----------
> > fs/ext4/super.c | 5 +++++
> > 2 files changed, 5 insertions(+), 10 deletions(-)
> >
> > diff --git a/fs/ext4/inline.c b/fs/ext4/inline.c
> > index 28c5c3a..fd95019 100644
> > --- a/fs/ext4/inline.c
> > +++ b/fs/ext4/inline.c
> > @@ -302,11 +302,6 @@ static int ext4_create_inline_data(handle_t *handle,
> > EXT4_I(inode)->i_inline_size = len + EXT4_MIN_INLINE_DATA_SIZE;
> > ext4_clear_inode_flag(inode, EXT4_INODE_EXTENTS);
> > ext4_set_inode_flag(inode, EXT4_INODE_INLINE_DATA);
> > - /*
> > - * Propagate changes to inode->i_flags as well - e.g. S_DAX may
> > - * get cleared
> > - */
> > - ext4_set_inode_flags(inode);
>
> What about other flags in the inode? It doesn't make sense to drop this
> entirely. The S_DAX flag shouldn't be set if the inode has the inline
> data flag set, according to ext4_set_inode_flags().
AFAICT all the other flags in the inode are independent of
EXT4_INODE_INLINE_DATA. The only one that could switch as a side-effect of
EXT4_INODE_INLINE_DATA is S_DAX.
The lines I'm deleting were added explicitly for the side-effect transition of
S_DAX (commit a3caa24b703794), so I don't think we still need them at all if
we prevent S_DAX from transitioning.
Ditto for the other deletion.
>
> > get_bh(is.iloc.bh);
> > error = ext4_mark_iloc_dirty(handle, inode, &is.iloc);
> >
> > @@ -451,11 +446,6 @@ static int ext4_destroy_inline_data_nolock(handle_t *handle,
> > }
> > }
> > ext4_clear_inode_flag(inode, EXT4_INODE_INLINE_DATA);
> > - /*
> > - * Propagate changes to inode->i_flags as well - e.g. S_DAX may
> > - * get set.
> > - */
> > - ext4_set_inode_flags(inode);
>
> Same?
>
> > get_bh(is.iloc.bh);
> > error = ext4_mark_iloc_dirty(handle, inode, &is.iloc);
> > diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> > index d61a70e2..d549dfb 100644
> > --- a/fs/ext4/super.c
> > +++ b/fs/ext4/super.c
> > @@ -3686,6 +3686,11 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
> > }
> >
> > if (sbi->s_mount_opt & EXT4_MOUNT_DAX) {
> > + if (ext4_has_feature_inline_data(sb)) {
> > + ext4_msg(sb, KERN_ERR, "Cannot use DAX on a filesystem"
> > + " that may contain inline data");
> > + goto failed_mount;
> > + }
>
> Wouldn't it be enough to just prevent modification of inodes that are stored
> in the inode? It should be OK to read such files. At a minimum that means
> there should not be an error in case of read-only mounting. A better choice
> would be to return an error only at runtime in case of open-for-write, or
> only if the file is actually being written.
Hmm...yes? I think this *could* work - we'd have to prevent anything that
would make the inode switch from using inline data to using extents. This
includes writes (all writes? Or just ones that would trigger the
transition?), fallocate, etc.
I think we could make this work, but the consequence would be that
applications would receive I/O errors or errors for seemingly random syscalls,
and they might have a hard time understanding why.
It seems cleaner and simpler to me to just disallow the combination of inline
data + DAX, and to print a descriptive message in dmesg? The combination of
the two features would end up being pretty unfriendly anyway, and the
implementation to make it safe would probably be more complex.
On Wed, Sep 6, 2017 at 10:07 AM, Ross Zwisler
<[email protected]> wrote:
> On Tue, Sep 05, 2017 at 09:12:35PM -0500, Eric Sandeen wrote:
>> On 9/5/17 5:35 PM, Ross Zwisler wrote:
>> > The original intent of this series was to add a per-inode DAX flag to ext4
>> > so that it would be consistent with XFS. In my travels I found and fixed
>> > several related issues in both ext4 and XFS.
>>
>> Hi Ross -
>>
>> hch had a lot of reasons to nuke the dax flag from orbit, and we just
>> /disabled/ it in xfs due to its habit of crashing the kernel...
>
> Ah, sorry, I wasn't CC'd on those threads and missed them. For any interested
> bystanders:
>
> https://www.spinics.net/lists/linux-ext4/msg57840.html
> https://www.spinics.net/lists/linux-xfs/msg09831.html
> https://www.spinics.net/lists/linux-xfs/msg10124.html
>
>> so a couple questions:
>>
>> 1) does this series pass hch's "test the per-inode DAX flag" fstest?
>
> Nope, it has the exact same problems as the XFS per-inode DAX flag.
>
>> 2) do we have an agreement that we need this flag at all, or is this
>> just a parity item because xfs has^whad a per-inode flag?
>
> It was for parity, and because it allows admins finer grained control over
> their system. Basically all things discussed in response to Lukas's original
> patch in the first link above.
I think it's more than parity. When pmem is slower than page cache it
is actively harmful to have DAX enabled globally for a filesystem. So,
not only should we push for per-inode DAX control, we should also push
to deprecate the mount option. I agree with Christoph that we should
try to automatically and transparently enable DAX where it makes
sense, but we also need a finer-grained mechanism than a mount flag to
force the behavior one way or the other.
On Thu, Sep 07, 2017 at 01:54:45PM -0700, Dan Williams wrote:
> On Wed, Sep 6, 2017 at 10:07 AM, Ross Zwisler
> <[email protected]> wrote:
> > On Tue, Sep 05, 2017 at 09:12:35PM -0500, Eric Sandeen wrote:
> >> On 9/5/17 5:35 PM, Ross Zwisler wrote:
> >> > The original intent of this series was to add a per-inode DAX flag to ext4
> >> > so that it would be consistent with XFS. In my travels I found and fixed
> >> > several related issues in both ext4 and XFS.
> >>
> >> Hi Ross -
> >>
> >> hch had a lot of reasons to nuke the dax flag from orbit, and we just
> >> /disabled/ it in xfs due to its habit of crashing the kernel...
> >
> > Ah, sorry, I wasn't CC'd on those threads and missed them. For any interested
> > bystanders:
> >
> > https://www.spinics.net/lists/linux-ext4/msg57840.html
> > https://www.spinics.net/lists/linux-xfs/msg09831.html
> > https://www.spinics.net/lists/linux-xfs/msg10124.html
> >
> >> so a couple questions:
> >>
> >> 1) does this series pass hch's "test the per-inode DAX flag" fstest?
> >
> > Nope, it has the exact same problems as the XFS per-inode DAX flag.
> >
> >> 2) do we have an agreement that we need this flag at all, or is this
> >> just a parity item because xfs has^whad a per-inode flag?
> >
> > It was for parity, and because it allows admins finer grained control over
> > their system. Basically all things discussed in response to Lukas's original
> > patch in the first link above.
>
> I think it's more than parity. When pmem is slower than page cache it
> is actively harmful to have DAX enabled globally for a filesystem. So,
> not only should we push for per-inode DAX control, we should also push
> to deprecate the mount option. I agree with Christoph that we should
> try to automatically and transparently enable DAX where it makes
> sense, but we also need a finer-grained mechanism than a mount flag to
> force the behavior one way or the other.
Yep, agreed. I'll play with how to make this work after I've sorted out all
the data corruptions I've found. :)
On Sep 7, 2017, at 3:13 PM, Ross Zwisler <[email protected]> wrote:
>
> On Thu, Sep 07, 2017 at 01:54:45PM -0700, Dan Williams wrote:
>> On Wed, Sep 6, 2017 at 10:07 AM, Ross Zwisler
>> <[email protected]> wrote:
>>> On Tue, Sep 05, 2017 at 09:12:35PM -0500, Eric Sandeen wrote:
>>>> On 9/5/17 5:35 PM, Ross Zwisler wrote:
>>>>> The original intent of this series was to add a per-inode DAX flag to ext4
>>>>> so that it would be consistent with XFS. In my travels I found and fixed
>>>>> several related issues in both ext4 and XFS.
>>>>
>>>> Hi Ross -
>>>>
>>>> hch had a lot of reasons to nuke the dax flag from orbit, and we just
>>>> /disabled/ it in xfs due to its habit of crashing the kernel...
>>>
>>> Ah, sorry, I wasn't CC'd on those threads and missed them. For any interested
>>> bystanders:
>>>
>>> https://www.spinics.net/lists/linux-ext4/msg57840.html
>>> https://www.spinics.net/lists/linux-xfs/msg09831.html
>>> https://www.spinics.net/lists/linux-xfs/msg10124.html
>>>
>>>> so a couple questions:
>>>>
>>>> 1) does this series pass hch's "test the per-inode DAX flag" fstest?
>>>
>>> Nope, it has the exact same problems as the XFS per-inode DAX flag.
>>>
>>>> 2) do we have an agreement that we need this flag at all, or is this
>>>> just a parity item because xfs has^whad a per-inode flag?
>>>
>>> It was for parity, and because it allows admins finer grained control over
>>> their system. Basically all things discussed in response to Lukas's original
>>> patch in the first link above.
>>
>> I think it's more than parity. When pmem is slower than page cache it
>> is actively harmful to have DAX enabled globally for a filesystem. So,
>> not only should we push for per-inode DAX control, we should also push
>> to deprecate the mount option. I agree with Christoph that we should
>> try to automatically and transparently enable DAX where it makes
>> sense, but we also need a finer-grained mechanism than a mount flag to
>> force the behavior one way or the other.
>
> Yep, agreed. I'll play with how to make this work after I've sorted out all
> the data corruptions I've found. :)
It seems that the majority of problems are from enabling/disabling S_DAX
on an inode that already has dirty data. However, I wonder if this could
be prevented at runtime, and only allow S_DAX to be set when the inode is
first instantiated, and wouldn't be allowed to change after that? Setting
or clearing the per-inode DAX flag might still be allowed, but it wouldn't
be enabled until the inode is next fetched into cache? Similarly, for
inodes that have conflicting features (e.g. inline data or encryption)
would not be allowed to enable S_DAX.
My assumption here is that it is possible to fall back to always using
page cache for such inodes, and flush the data to pmem via the block
interface for inodes that don't have S_DAX set?
That would allow the vast majority of cases to work out of the box, or in
a few rare cases where the DAX feature is being changed (e.g. inline data
inode on disk growing to external disk blocks) would use the page cache
until such a time that the inode is dropped from cache and reloaded (at
worst the next remount).
Cheers, Andreas
On Thu, Sep 07, 2017 at 03:26:10PM -0600, Andreas Dilger wrote:
> On Sep 7, 2017, at 3:13 PM, Ross Zwisler <[email protected]> wrote:
> >
> > On Thu, Sep 07, 2017 at 01:54:45PM -0700, Dan Williams wrote:
> >> On Wed, Sep 6, 2017 at 10:07 AM, Ross Zwisler
> >> <[email protected]> wrote:
> >>> On Tue, Sep 05, 2017 at 09:12:35PM -0500, Eric Sandeen wrote:
> >>>> On 9/5/17 5:35 PM, Ross Zwisler wrote:
> >>>>> The original intent of this series was to add a per-inode DAX flag to ext4
> >>>>> so that it would be consistent with XFS. In my travels I found and fixed
> >>>>> several related issues in both ext4 and XFS.
> >>>>
> >>>> Hi Ross -
> >>>>
> >>>> hch had a lot of reasons to nuke the dax flag from orbit, and we just
> >>>> /disabled/ it in xfs due to its habit of crashing the kernel...
> >>>
> >>> Ah, sorry, I wasn't CC'd on those threads and missed them. For any interested
> >>> bystanders:
> >>>
> >>> https://www.spinics.net/lists/linux-ext4/msg57840.html
> >>> https://www.spinics.net/lists/linux-xfs/msg09831.html
> >>> https://www.spinics.net/lists/linux-xfs/msg10124.html
> >>>
> >>>> so a couple questions:
> >>>>
> >>>> 1) does this series pass hch's "test the per-inode DAX flag" fstest?
> >>>
> >>> Nope, it has the exact same problems as the XFS per-inode DAX flag.
> >>>
> >>>> 2) do we have an agreement that we need this flag at all, or is this
> >>>> just a parity item because xfs has^whad a per-inode flag?
> >>>
> >>> It was for parity, and because it allows admins finer grained control over
> >>> their system. Basically all things discussed in response to Lukas's original
> >>> patch in the first link above.
> >>
> >> I think it's more than parity. When pmem is slower than page cache it
> >> is actively harmful to have DAX enabled globally for a filesystem. So,
> >> not only should we push for per-inode DAX control, we should also push
> >> to deprecate the mount option. I agree with Christoph that we should
> >> try to automatically and transparently enable DAX where it makes
> >> sense, but we also need a finer-grained mechanism than a mount flag to
> >> force the behavior one way or the other.
> >
> > Yep, agreed. I'll play with how to make this work after I've sorted out all
> > the data corruptions I've found. :)
>
> It seems that the majority of problems are from enabling/disabling S_DAX
> on an inode that already has dirty data.
I don't think it's precisely about dirty data, more about having mappings set
up and I/Os in flight, even if those are read operations. Tomorrow I'll post
some xfstests for the data corruptions due to DAX + each of inline data and
journaling, and those both happen because we set up one mapping to page cache,
and one to DAX. Once either is written to they become out of sync.
> However, I wonder if this could
> be prevented at runtime, and only allow S_DAX to be set when the inode is
> first instantiated, and wouldn't be allowed to change after that? Setting
> or clearing the per-inode DAX flag might still be allowed, but it wouldn't
> be enabled until the inode is next fetched into cache? Similarly, for
> inodes that have conflicting features (e.g. inline data or encryption)
> would not be allowed to enable S_DAX.
Ooh, this seems interesting. This would ensure that S_DAX transitions
couldn't ever race with I/Os or mmaps(). I had some other ideas for how to
handle this, but I think your idea is more promising. :)
I guess with this solution we'd need:
a) A good way of letting the user detect the state where they had set the DAX
inode flag, but that it wasn't yet in use by the inode.
b) A reliable way of flushing the inode from the filesystem cache, so that the
next time an open() happens they get the new behavior. The way I usually do
this is via umount/remount, but there is probably already a way to do this?
> My assumption here is that it is possible to fall back to always using
> page cache for such inodes, and flush the data to pmem via the block
> interface for inodes that don't have S_DAX set?
Correct.
> That would allow the vast majority of cases to work out of the box, or in
> a few rare cases where the DAX feature is being changed (e.g. inline data
> inode on disk growing to external disk blocks) would use the page cache
> until such a time that the inode is dropped from cache and reloaded (at
> worst the next remount).
Ah, yep, this has the potential to solve those cases as well. Seems
promising, to me at least. :)
On Thu, Sep 07, 2017 at 03:51:48PM -0600, Ross Zwisler wrote:
> On Thu, Sep 07, 2017 at 03:26:10PM -0600, Andreas Dilger wrote:
> > However, I wonder if this could
> > be prevented at runtime, and only allow S_DAX to be set when the inode is
> > first instantiated, and wouldn't be allowed to change after that? Setting
> > or clearing the per-inode DAX flag might still be allowed, but it wouldn't
> > be enabled until the inode is next fetched into cache? Similarly, for
> > inodes that have conflicting features (e.g. inline data or encryption)
> > would not be allowed to enable S_DAX.
>
> Ooh, this seems interesting. This would ensure that S_DAX transitions
> couldn't ever race with I/Os or mmaps(). I had some other ideas for how to
> handle this, but I think your idea is more promising. :)
IMO, that's an awful admin interface - it can't be done on demand
(i.e. when needed) because we can't force an inode to be evicted
from the cache. And then we have the "why the hell did that just
change" problem if an inode is evicted due to memory pressure and
then immediately reinstantiated by the running workload. That's a
recipe for driving admins insane...
> I guess with this solution we'd need:
>
> a) A good way of letting the user detect the state where they had set the DAX
> inode flag, but that it wasn't yet in use by the inode.
>
> b) A reliable way of flushing the inode from the filesystem cache, so that the
> next time an open() happens they get the new behavior. The way I usually do
> this is via umount/remount, but there is probably already a way to do this?
Not if it's referenced. And if it's not referenced, then the only
hammer we have is Brutus^Wdrop_caches. That's not an option for
production machines.
Neat idea, but one I'd already thought of and discarded as "not
practical from an admin perspective".
Cheers,
Dave.
--
Dave Chinner
[email protected]
On Fri, Sep 08, 2017 at 08:12:01AM +1000, Dave Chinner wrote:
> On Thu, Sep 07, 2017 at 03:51:48PM -0600, Ross Zwisler wrote:
> > On Thu, Sep 07, 2017 at 03:26:10PM -0600, Andreas Dilger wrote:
> > > However, I wonder if this could
> > > be prevented at runtime, and only allow S_DAX to be set when the inode is
> > > first instantiated, and wouldn't be allowed to change after that? Setting
> > > or clearing the per-inode DAX flag might still be allowed, but it wouldn't
> > > be enabled until the inode is next fetched into cache? Similarly, for
> > > inodes that have conflicting features (e.g. inline data or encryption)
> > > would not be allowed to enable S_DAX.
> >
> > Ooh, this seems interesting. This would ensure that S_DAX transitions
> > couldn't ever race with I/Os or mmaps(). I had some other ideas for how to
> > handle this, but I think your idea is more promising. :)
>
> IMO, that's an awful admin interface - it can't be done on demand
> (i.e. when needed) because we can't force an inode to be evicted
> from the cache. And then we have the "why the hell did that just
> change" problem if an inode is evicted due to memory pressure and
> then immediately reinstantiated by the running workload. That's a
> recipe for driving admins insane...
>
> > I guess with this solution we'd need:
> >
> > a) A good way of letting the user detect the state where they had set the DAX
> > inode flag, but that it wasn't yet in use by the inode.
> >
> > b) A reliable way of flushing the inode from the filesystem cache, so that the
> > next time an open() happens they get the new behavior. The way I usually do
> > this is via umount/remount, but there is probably already a way to do this?
>
> Not if it's referenced. And if it's not referenced, then the only
> hammer we have is Brutus^Wdrop_caches. That's not an option for
> production machines.
>
> Neat idea, but one I'd already thought of and discarded as "not
> practical from an admin perspective".
Okay, so other ideas (which you have also probably already though of) include:
1) Just return -EBUSY if anyone tries to change the DAX flag of an inode with
open mappings or any open file handles. To prevent TOCTOU races we'd have to
do some additional locking while actually changing the flag.
2) Be more drastic and follow the flow of ext4 file based encryption, only
allowing the inode flag to be set by an admin on an empty directory. Files in
that directory will inherit it when they are created, and we don't provide a
way to clear. If you want your file to not use DAX, move it to a different
directory (which I think for ext4 encryption turns it into a new inode).
Other ideas?
On Thu, Sep 07, 2017 at 04:19:00PM -0600, Ross Zwisler wrote:
> On Fri, Sep 08, 2017 at 08:12:01AM +1000, Dave Chinner wrote:
> > On Thu, Sep 07, 2017 at 03:51:48PM -0600, Ross Zwisler wrote:
> > > On Thu, Sep 07, 2017 at 03:26:10PM -0600, Andreas Dilger wrote:
> > > > However, I wonder if this could
> > > > be prevented at runtime, and only allow S_DAX to be set when the inode is
> > > > first instantiated, and wouldn't be allowed to change after that? Setting
> > > > or clearing the per-inode DAX flag might still be allowed, but it wouldn't
> > > > be enabled until the inode is next fetched into cache? Similarly, for
> > > > inodes that have conflicting features (e.g. inline data or encryption)
> > > > would not be allowed to enable S_DAX.
> > >
> > > Ooh, this seems interesting. This would ensure that S_DAX transitions
> > > couldn't ever race with I/Os or mmaps(). I had some other ideas for how to
> > > handle this, but I think your idea is more promising. :)
> >
> > IMO, that's an awful admin interface - it can't be done on demand
> > (i.e. when needed) because we can't force an inode to be evicted
> > from the cache. And then we have the "why the hell did that just
> > change" problem if an inode is evicted due to memory pressure and
> > then immediately reinstantiated by the running workload. That's a
> > recipe for driving admins insane...
> >
> > > I guess with this solution we'd need:
> > >
> > > a) A good way of letting the user detect the state where they had set the DAX
> > > inode flag, but that it wasn't yet in use by the inode.
> > >
> > > b) A reliable way of flushing the inode from the filesystem cache, so that the
> > > next time an open() happens they get the new behavior. The way I usually do
> > > this is via umount/remount, but there is probably already a way to do this?
> >
> > Not if it's referenced. And if it's not referenced, then the only
> > hammer we have is Brutus^Wdrop_caches. That's not an option for
> > production machines.
> >
> > Neat idea, but one I'd already thought of and discarded as "not
> > practical from an admin perspective".
>
> Okay, so other ideas (which you have also probably already though of) include:
>
> 1) Just return -EBUSY if anyone tries to change the DAX flag of an inode with
> open mappings or any open file handles.
You have to have an open fd to change the flag. :)
> To prevent TOCTOU races we'd have to
> do some additional locking while actually changing the flag.
I think that make sense - the fundamental problem is that the
mappings are different between dax and non-dax, and that we can't
properly lock out page faults to to prevent sending a racing
page fault down the wrong path.
> 2) Be more drastic and follow the flow of ext4 file based encryption, only
> allowing the inode flag to be set by an admin on an empty directory. Files in
> that directory will inherit it when they are created, and we don't provide a
> way to clear. If you want your file to not use DAX, move it to a different
> directory (which I think for ext4 encryption turns it into a new inode).
Seems like the wrong model to me - moving application data files
is a PITA because you've also go to change the app config to point
at the new location...
> Other ideas?
IMO, we need to fix the page fault path so we don't look at inode
flags to determine processing behaviour during the fault. Fault
processing as DAX or non-dax needs to be determined by the page
fault code and communicated to the fs via the vmf as the contents
of the vmf for a dax fault can be invalid for a non-dax fault. Fixing
that problem (i.e. make DAX is a property of the mapping and
instantiate it from the inode only at mmap() time) means all the
page fault vs inode flag race problems go away and we have a model
that is much more robust if we want to expand it in future.
Combine that with -EBUSY when there are active mappings as you've
proposed above and I think we've got a much more solid solution to
the problem.
-Dave.
--
Dave Chinner
[email protected]
On Fri 08-09-17 09:25:43, Dave Chinner wrote:
> On Thu, Sep 07, 2017 at 04:19:00PM -0600, Ross Zwisler wrote:
> > On Fri, Sep 08, 2017 at 08:12:01AM +1000, Dave Chinner wrote:
> > > On Thu, Sep 07, 2017 at 03:51:48PM -0600, Ross Zwisler wrote:
> > > > On Thu, Sep 07, 2017 at 03:26:10PM -0600, Andreas Dilger wrote:
> > > > > However, I wonder if this could
> > > > > be prevented at runtime, and only allow S_DAX to be set when the inode is
> > > > > first instantiated, and wouldn't be allowed to change after that? Setting
> > > > > or clearing the per-inode DAX flag might still be allowed, but it wouldn't
> > > > > be enabled until the inode is next fetched into cache? Similarly, for
> > > > > inodes that have conflicting features (e.g. inline data or encryption)
> > > > > would not be allowed to enable S_DAX.
> > > >
> > > > Ooh, this seems interesting. This would ensure that S_DAX transitions
> > > > couldn't ever race with I/Os or mmaps(). I had some other ideas for how to
> > > > handle this, but I think your idea is more promising. :)
> > >
> > > IMO, that's an awful admin interface - it can't be done on demand
> > > (i.e. when needed) because we can't force an inode to be evicted
> > > from the cache. And then we have the "why the hell did that just
> > > change" problem if an inode is evicted due to memory pressure and
> > > then immediately reinstantiated by the running workload. That's a
> > > recipe for driving admins insane...
> > >
> > > > I guess with this solution we'd need:
> > > >
> > > > a) A good way of letting the user detect the state where they had set the DAX
> > > > inode flag, but that it wasn't yet in use by the inode.
> > > >
> > > > b) A reliable way of flushing the inode from the filesystem cache, so that the
> > > > next time an open() happens they get the new behavior. The way I usually do
> > > > this is via umount/remount, but there is probably already a way to do this?
> > >
> > > Not if it's referenced. And if it's not referenced, then the only
> > > hammer we have is Brutus^Wdrop_caches. That's not an option for
> > > production machines.
> > >
> > > Neat idea, but one I'd already thought of and discarded as "not
> > > practical from an admin perspective".
> >
> > Okay, so other ideas (which you have also probably already though of) include:
> >
> > 1) Just return -EBUSY if anyone tries to change the DAX flag of an inode with
> > open mappings or any open file handles.
>
> You have to have an open fd to change the flag. :)
Yeah, open file handles don't matter and we can serialize against IO in
progress, that's not a big deal. Established mappings are difficult to deal
with.
> > To prevent TOCTOU races we'd have to
> > do some additional locking while actually changing the flag.
>
> I think that make sense - the fundamental problem is that the
> mappings are different between dax and non-dax, and that we can't
> properly lock out page faults to to prevent sending a racing
> page fault down the wrong path.
>
> > 2) Be more drastic and follow the flow of ext4 file based encryption, only
> > allowing the inode flag to be set by an admin on an empty directory. Files in
> > that directory will inherit it when they are created, and we don't provide a
> > way to clear. If you want your file to not use DAX, move it to a different
> > directory (which I think for ext4 encryption turns it into a new inode).
>
> Seems like the wrong model to me - moving application data files
> is a PITA because you've also go to change the app config to point
> at the new location...
Agreed.
> > Other ideas?
>
> IMO, we need to fix the page fault path so we don't look at inode
> flags to determine processing behaviour during the fault. Fault
> processing as DAX or non-dax needs to be determined by the page
> fault code and communicated to the fs via the vmf as the contents
> of the vmf for a dax fault can be invalid for a non-dax fault. Fixing
> that problem (i.e. make DAX is a property of the mapping and
> instantiate it from the inode only at mmap() time) means all the
> page fault vs inode flag race problems go away and we have a model
> that is much more robust if we want to expand it in future.
In fact, the real problem is only with .page_mkwrite and .pfn_mkwrite
callbacks. For those setup of 'vmf' differs. For .fault or .huge_fault the
vmf is the same regardless whether we do DAX or non-DAX fault. But it seems
difficult to me to determine DAX / non-DAX fault in vmf since locks
necessary to stabilize S_DAX flag are acquired only in filesystem-specific
handlers (and the locks themselves are fs specific).
So the only way I see of dealing safely with these races is careful
checking in .page_mkwrite and .pfn_mkwrite after necessary locks are
obtained and bail out doing nothing if state is inconsistent. VM will retry
the fault and we'll get to the correct handler next time.
But if we disallow any mappings when switching S_DAX flag, then all the
above is moot and there can be no races... We just have to be sure to block
new mappings of the file while switching the flag.
Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR
On Fri, Sep 08, 2017 at 09:25:43AM +1000, Dave Chinner wrote:
> > Okay, so other ideas (which you have also probably already though of) include:
> >
> > 1) Just return -EBUSY if anyone tries to change the DAX flag of an inode with
> > open mappings or any open file handles.
>
> You have to have an open fd to change the flag. :)
What if we only allow the S_DAX flag to be *set*, when i_size and
i_blocks is zero? We could also require that only one file descriptor
be open against the inode, and that it be opened O_RDONLY.
- Ted
On Fri 08-09-17 11:39:13, Ted Tso wrote:
> On Fri, Sep 08, 2017 at 09:25:43AM +1000, Dave Chinner wrote:
> > > Okay, so other ideas (which you have also probably already though of) include:
> > >
> > > 1) Just return -EBUSY if anyone tries to change the DAX flag of an inode with
> > > open mappings or any open file handles.
> >
> > You have to have an open fd to change the flag. :)
>
> What if we only allow the S_DAX flag to be *set*, when i_size and
> i_blocks is zero? We could also require that only one file descriptor
> be open against the inode, and that it be opened O_RDONLY.
We could do something like that but IMHO it will be a pain to use (e.g.
think how difficult it would be to switch your existing database to use DAX
for data files). We can make transition reliable whenever
inode->i_mapping->i_mmap RB tree is empty (effectively: whenever the file
is not mmaped). And that should be relaxed enough for most usecases... But
I agree that it will be somewhat tricky to prevent creation of new mappings
while we are switching S_DAX flag so it needs more though.
Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR