This series does the work needed to safely re-enable the XFS per-inode DAX
flag. This includes fixes to make use of the DAX inode flag more safe and
consistent, fixes to the read and write I/O path locking to make S_DAX
transitions safe, and some code that prevents the DAX inode flag from
transitioning when any mappings are set up.
This series has passed my fstests regression testing both with and without
DAX, and it also passes Christoph's regression test for the inode flag:
https://www.spinics.net/lists/linux-xfs/msg10124.html
My goal is to get feedback on this approach and on the XFS implementation,
and then to do a similar implementation for ext4 based on my previous ext4
DAX inode flag patches:
https://patchwork.kernel.org/patch/9939743/
These patches apply cleanly to v4.14-rc2.
Ross Zwisler (7):
xfs: always use DAX if mount option is used
xfs: validate bdev support for DAX inode flag
xfs: protect S_DAX transitions in XFS read path
xfs: protect S_DAX transitions in XFS write path
xfs: introduce xfs_is_dax_state_changing
mm, fs: introduce file_operations->post_mmap()
xfs: re-enable XFS per-inode DAX
fs/xfs/xfs_file.c | 172 ++++++++++++++++++++++-------------------------------
fs/xfs/xfs_ioctl.c | 47 ++++++++++++---
include/linux/fs.h | 1 +
mm/mmap.c | 2 +
4 files changed, 114 insertions(+), 108 deletions(-)
--
2.9.5
Re-enable the XFS per-inode DAX flag, preventing S_DAX from changing when
any mappings are present.
Signed-off-by: Ross Zwisler <[email protected]>
---
fs/xfs/xfs_ioctl.c | 17 +++++++++++++++--
1 file changed, 15 insertions(+), 2 deletions(-)
diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
index 386b437..7a24dd5 100644
--- a/fs/xfs/xfs_ioctl.c
+++ b/fs/xfs/xfs_ioctl.c
@@ -1012,12 +1012,10 @@ xfs_diflags_to_linux(
inode->i_flags |= S_NOATIME;
else
inode->i_flags &= ~S_NOATIME;
-#if 0 /* disabled until the flag switching races are sorted out */
if ((xflags & FS_XFLAG_DAX) || (ip->i_mount->m_flags & XFS_MOUNT_DAX))
inode->i_flags |= S_DAX;
else
inode->i_flags &= ~S_DAX;
-#endif
}
static bool
@@ -1049,6 +1047,8 @@ xfs_ioctl_setattr_xflags(
{
struct xfs_mount *mp = ip->i_mount;
uint64_t di_flags2;
+ struct address_space *mapping = VFS_I(ip)->i_mapping;
+ bool dax_changing;
/* Can't change realtime flag if any extents are allocated. */
if ((ip->i_d.di_nextents || ip->i_delayed_blks) &&
@@ -1084,10 +1084,23 @@ xfs_ioctl_setattr_xflags(
if (di_flags2 && ip->i_d.di_version < 3)
return -EINVAL;
+ dax_changing = xfs_is_dax_state_changing(fa->fsx_xflags, ip);
+ if (dax_changing) {
+ i_mmap_lock_read(mapping);
+ if (mapping_mapped(mapping)) {
+ i_mmap_unlock_read(mapping);
+ return -EBUSY;
+ }
+ }
+
ip->i_d.di_flags = xfs_flags2diflags(ip, fa->fsx_xflags);
ip->i_d.di_flags2 = di_flags2;
xfs_diflags_to_linux(ip);
+
+ if (dax_changing)
+ i_mmap_unlock_read(mapping);
+
xfs_trans_ichgtime(tp, ip, XFS_ICHGTIME_CHG);
xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE);
XFS_STATS_INC(mp, xs_ig_attrchg);
--
2.9.5
Pull this code out of xfs_ioctl_setattr_dax_invalidate() as it will be used
in multiple places soon.
Signed-off-by: Ross Zwisler <[email protected]>
---
fs/xfs/xfs_ioctl.c | 34 +++++++++++++++++++++++-----------
1 file changed, 23 insertions(+), 11 deletions(-)
diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
index 0433aef..386b437 100644
--- a/fs/xfs/xfs_ioctl.c
+++ b/fs/xfs/xfs_ioctl.c
@@ -1020,6 +1020,27 @@ xfs_diflags_to_linux(
#endif
}
+static bool
+xfs_is_dax_state_changing(
+ unsigned int xflags,
+ struct xfs_inode *ip)
+{
+ struct inode *inode = VFS_I(ip);
+
+ /*
+ * If the DAX mount option was used we will update the DAX inode flag
+ * as the user requested but we will continue to use DAX for I/O and
+ * page faults regardless of how the inode flag is set.
+ */
+ if (ip->i_mount->m_flags & XFS_MOUNT_DAX)
+ return false;
+ if ((xflags & FS_XFLAG_DAX) && IS_DAX(inode))
+ return false;
+ if (!(xflags & FS_XFLAG_DAX) && !IS_DAX(inode))
+ return false;
+ return true;
+}
+
static int
xfs_ioctl_setattr_xflags(
struct xfs_trans *tp,
@@ -1105,17 +1126,8 @@ xfs_ioctl_setattr_dax_invalidate(
return -EINVAL;
}
- /*
- * If the DAX state is not changing, we have nothing to do here. If
- * the DAX mount option was used we will update the DAX inode flag as
- * the user requested but we will continue to use DAX for I/O and page
- * faults regardless of how the inode flag is set.
- */
- if (ip->i_mount->m_flags & XFS_MOUNT_DAX)
- return 0;
- if ((fa->fsx_xflags & FS_XFLAG_DAX) && IS_DAX(inode))
- return 0;
- if (!(fa->fsx_xflags & FS_XFLAG_DAX) && !IS_DAX(inode))
+ /* If the DAX state is not changing, we have nothing to do here. */
+ if (!xfs_is_dax_state_changing(fa->fsx_xflags, ip))
return 0;
/* lock, flush and invalidate mapping in preparation for flag change */
--
2.9.5
Currently only the blocksize is checked, but we should really be calling
bdev_dax_supported() which also tests to make sure we can get a
struct dax_device and that the dax_direct_access() path is working.
This is the same check that we do for the "-o dax" mount option in
xfs_fs_fill_super().
Signed-off-by: Ross Zwisler <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
---
fs/xfs/xfs_ioctl.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
index 26faeb9..0433aef 100644
--- a/fs/xfs/xfs_ioctl.c
+++ b/fs/xfs/xfs_ioctl.c
@@ -1088,6 +1088,7 @@ xfs_ioctl_setattr_dax_invalidate(
int *join_flags)
{
struct inode *inode = VFS_I(ip);
+ struct super_block *sb = inode->i_sb;
int error;
*join_flags = 0;
@@ -1100,7 +1101,7 @@ xfs_ioctl_setattr_dax_invalidate(
if (fa->fsx_xflags & FS_XFLAG_DAX) {
if (!(S_ISREG(inode->i_mode) || S_ISDIR(inode->i_mode)))
return -EINVAL;
- if (ip->i_mount->m_sb.sb_blocksize != PAGE_SIZE)
+ if (bdev_dax_supported(sb, sb->s_blocksize) < 0)
return -EINVAL;
}
--
2.9.5
Before support for the per-inode DAX flag was disabled the XFS the code had
an issue where the user couldn't reliably tell whether or not DAX was being
used to service page faults and I/O when the DAX mount option was used. In
this case each inode within the mounted filesystem started with S_DAX set
due to the mount option, but it could be cleared if someone touched the
individual inode flag.
For example (v4.13 and before):
# mount | grep dax
/dev/pmem0 on /mnt type xfs
(rw,relatime,seclabel,attr2,dax,inode64,sunit=4096,swidth=4096,noquota)
# touch /mnt/a /mnt/b # both files currently use DAX
# xfs_io -c "lsattr" /mnt/* # neither has the DAX inode option set
----------e----- /mnt/a
----------e----- /mnt/b
# xfs_io -c "chattr -x" /mnt/a # this clears S_DAX for /mnt/a
# xfs_io -c "lsattr" /mnt/*
----------e----- /mnt/a
----------e----- /mnt/b
We end up with both /mnt/a and /mnt/b looking identical from the point of
view of the mount option and from lsattr, but one is using DAX and the
other is not.
Fix this by always doing DAX I/O when either the mount option is set or
when the DAX inode flag is set. This means that DAX will always be used
for all inodes on a filesystem mounted with -o dax, making the usage
reliable and detectable.
Signed-off-by: Ross Zwisler <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
---
fs/xfs/xfs_ioctl.c | 11 +++++++++--
1 file changed, 9 insertions(+), 2 deletions(-)
diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
index 5049e8a..26faeb9 100644
--- a/fs/xfs/xfs_ioctl.c
+++ b/fs/xfs/xfs_ioctl.c
@@ -1013,7 +1013,7 @@ xfs_diflags_to_linux(
else
inode->i_flags &= ~S_NOATIME;
#if 0 /* disabled until the flag switching races are sorted out */
- if (xflags & FS_XFLAG_DAX)
+ if ((xflags & FS_XFLAG_DAX) || (ip->i_mount->m_flags & XFS_MOUNT_DAX))
inode->i_flags |= S_DAX;
else
inode->i_flags &= ~S_DAX;
@@ -1104,7 +1104,14 @@ xfs_ioctl_setattr_dax_invalidate(
return -EINVAL;
}
- /* If the DAX state is not changing, we have nothing to do here. */
+ /*
+ * If the DAX state is not changing, we have nothing to do here. If
+ * the DAX mount option was used we will update the DAX inode flag as
+ * the user requested but we will continue to use DAX for I/O and page
+ * faults regardless of how the inode flag is set.
+ */
+ if (ip->i_mount->m_flags & XFS_MOUNT_DAX)
+ return 0;
if ((fa->fsx_xflags & FS_XFLAG_DAX) && IS_DAX(inode))
return 0;
if (!(fa->fsx_xflags & FS_XFLAG_DAX) && !IS_DAX(inode))
--
2.9.5
When mappings are created the vma->vm_flags that they use vary based on
whether the inode being mapped is using DAX or not. This setup happens in
XFS via mmap_region()=>call_mmap()=>xfs_file_mmap().
For us to be able to safely use the DAX per-inode flag we need to prevent
S_DAX transitions when any mappings are present, and we will do that by
looking at the address_space->i_mmap tree and returning -EBUSY if any
mappings are present.
Unfortunately at the time that the filesystem's file_operations->mmap()
entry point is called the mapping has not yet been added to the
address_space->i_mmap tree. This means that at that point in time we
cannot determine whether or not the mapping will be set up to support DAX.
Fix this by adding a new file_operations entry called post_mmap() which is
called after the mapping has been added to the address_space->i_mmap tree.
This post_mmap() op now happens at a time when we can be sure whether the
mapping will use DAX or not, and we can set up the vma->vm_flags
appropriately.
Signed-off-by: Ross Zwisler <[email protected]>
---
fs/xfs/xfs_file.c | 15 ++++++++++++++-
include/linux/fs.h | 1 +
mm/mmap.c | 2 ++
3 files changed, 17 insertions(+), 1 deletion(-)
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 2816858..9d66aaa 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -1087,9 +1087,21 @@ xfs_file_mmap(
{
file_accessed(filp);
vma->vm_ops = &xfs_file_vm_ops;
+ return 0;
+}
+
+/* This call happens during mmap(), after the vma has been inserted into the
+ * inode->i_mapping->i_mmap tree. At this point the decision on whether or
+ * not to use DAX for this mapping has been set and will not change for the
+ * duration of the mapping.
+ */
+STATIC void
+xfs_file_post_mmap(
+ struct file *filp,
+ struct vm_area_struct *vma)
+{
if (IS_DAX(file_inode(filp)))
vma->vm_flags |= VM_MIXEDMAP | VM_HUGEPAGE;
- return 0;
}
const struct file_operations xfs_file_operations = {
@@ -1103,6 +1115,7 @@ const struct file_operations xfs_file_operations = {
.compat_ioctl = xfs_file_compat_ioctl,
#endif
.mmap = xfs_file_mmap,
+ .post_mmap = xfs_file_post_mmap,
.open = xfs_file_open,
.release = xfs_file_release,
.fsync = xfs_file_fsync,
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 339e737..7c06838 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1701,6 +1701,7 @@ struct file_operations {
long (*unlocked_ioctl) (struct file *, unsigned int, unsigned long);
long (*compat_ioctl) (struct file *, unsigned int, unsigned long);
int (*mmap) (struct file *, struct vm_area_struct *);
+ void (*post_mmap) (struct file *, struct vm_area_struct *);
int (*open) (struct inode *, struct file *);
int (*flush) (struct file *, fl_owner_t id);
int (*release) (struct inode *, struct file *);
diff --git a/mm/mmap.c b/mm/mmap.c
index 680506f..ee7458a 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1711,6 +1711,8 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
vma_link(mm, vma, prev, rb_link, rb_parent);
/* Once vma denies write, undo our temporary denial count */
if (file) {
+ if (file->f_op->post_mmap)
+ file->f_op->post_mmap(file, vma);
if (vm_flags & VM_SHARED)
mapping_unmap_writable(file->f_mapping);
if (vm_flags & VM_DENYWRITE)
--
2.9.5
In the current XFS read I/O path we check IS_DAX() in xfs_file_read_iter()
to decide whether to do DAX I/O, direct I/O or buffered I/O. This check is
done without holding the XFS_IOLOCK, though, which means that if we allow
S_DAX to be manipulated via the inode flag we can run into this race:
CPU 0 CPU 1
----- -----
xfs_file_read_iter()
IS_DAX() << returns false
xfs_ioctl_setattr()
xfs_ioctl_setattr_dax_invalidate()
xfs_ilock(XFS_MMAPLOCK|XFS_IOLOCK)
sets S_DAX
releases XFS_MMAPLOCK and XFS_IOLOCK
xfs_file_buffered_aio_read()
does buffered I/O to DAX inode, death
Fix this by ensuring that we only check S_DAX when we hold the XFS_IOLOCK
in the read path.
Signed-off-by: Ross Zwisler <[email protected]>
---
fs/xfs/xfs_file.c | 42 +++++++++++++-----------------------------
1 file changed, 13 insertions(+), 29 deletions(-)
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index ebdd0bd..ca4c8fd 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -207,7 +207,6 @@ xfs_file_dio_aio_read(
{
struct xfs_inode *ip = XFS_I(file_inode(iocb->ki_filp));
size_t count = iov_iter_count(to);
- ssize_t ret;
trace_xfs_file_direct_read(ip, count, iocb->ki_pos);
@@ -215,12 +214,7 @@ xfs_file_dio_aio_read(
return 0; /* skip atime */
file_accessed(iocb->ki_filp);
-
- xfs_ilock(ip, XFS_IOLOCK_SHARED);
- ret = iomap_dio_rw(iocb, to, &xfs_iomap_ops, NULL);
- xfs_iunlock(ip, XFS_IOLOCK_SHARED);
-
- return ret;
+ return iomap_dio_rw(iocb, to, &xfs_iomap_ops, NULL);
}
static noinline ssize_t
@@ -230,23 +224,14 @@ xfs_file_dax_read(
{
struct xfs_inode *ip = XFS_I(iocb->ki_filp->f_mapping->host);
size_t count = iov_iter_count(to);
- ssize_t ret = 0;
trace_xfs_file_dax_read(ip, count, iocb->ki_pos);
if (!count)
return 0; /* skip atime */
- if (!xfs_ilock_nowait(ip, XFS_IOLOCK_SHARED)) {
- if (iocb->ki_flags & IOCB_NOWAIT)
- return -EAGAIN;
- xfs_ilock(ip, XFS_IOLOCK_SHARED);
- }
- ret = dax_iomap_rw(iocb, to, &xfs_iomap_ops);
- xfs_iunlock(ip, XFS_IOLOCK_SHARED);
-
file_accessed(iocb->ki_filp);
- return ret;
+ return dax_iomap_rw(iocb, to, &xfs_iomap_ops);
}
STATIC ssize_t
@@ -255,19 +240,9 @@ xfs_file_buffered_aio_read(
struct iov_iter *to)
{
struct xfs_inode *ip = XFS_I(file_inode(iocb->ki_filp));
- ssize_t ret;
trace_xfs_file_buffered_read(ip, iov_iter_count(to), iocb->ki_pos);
-
- if (!xfs_ilock_nowait(ip, XFS_IOLOCK_SHARED)) {
- if (iocb->ki_flags & IOCB_NOWAIT)
- return -EAGAIN;
- xfs_ilock(ip, XFS_IOLOCK_SHARED);
- }
- ret = generic_file_read_iter(iocb, to);
- xfs_iunlock(ip, XFS_IOLOCK_SHARED);
-
- return ret;
+ return generic_file_read_iter(iocb, to);
}
STATIC ssize_t
@@ -276,7 +251,8 @@ xfs_file_read_iter(
struct iov_iter *to)
{
struct inode *inode = file_inode(iocb->ki_filp);
- struct xfs_mount *mp = XFS_I(inode)->i_mount;
+ struct xfs_inode *ip = XFS_I(inode);
+ struct xfs_mount *mp = ip->i_mount;
ssize_t ret = 0;
XFS_STATS_INC(mp, xs_read_calls);
@@ -284,6 +260,12 @@ xfs_file_read_iter(
if (XFS_FORCED_SHUTDOWN(mp))
return -EIO;
+ if (!xfs_ilock_nowait(ip, XFS_IOLOCK_SHARED)) {
+ if (iocb->ki_flags & IOCB_NOWAIT)
+ return -EAGAIN;
+ xfs_ilock(ip, XFS_IOLOCK_SHARED);
+ }
+
if (IS_DAX(inode))
ret = xfs_file_dax_read(iocb, to);
else if (iocb->ki_flags & IOCB_DIRECT)
@@ -291,6 +273,8 @@ xfs_file_read_iter(
else
ret = xfs_file_buffered_aio_read(iocb, to);
+ xfs_iunlock(ip, XFS_IOLOCK_SHARED);
+
if (ret > 0)
XFS_STATS_ADD(mp, xs_read_bytes, ret);
return ret;
--
2.9.5
In the current XFS write I/O path we check IS_DAX() in
xfs_file_write_iter() to decide whether to do DAX I/O, direct I/O or
buffered I/O. This check is done without holding the XFS_IOLOCK, though,
which means that if we allow S_DAX to be manipulated via the inode flag we
can run into this race:
CPU 0 CPU 1
----- -----
xfs_file_write_iter()
IS_DAX() << returns false
xfs_ioctl_setattr()
xfs_ioctl_setattr_dax_invalidate()
xfs_ilock(XFS_MMAPLOCK|XFS_IOLOCK)
sets S_DAX
releases XFS_MMAPLOCK and XFS_IOLOCK
xfs_file_buffered_aio_write()
does buffered I/O to DAX inode, death
Fix this by ensuring that we only check S_DAX when we hold the XFS_IOLOCK
in the write path.
Signed-off-by: Ross Zwisler <[email protected]>
---
fs/xfs/xfs_file.c | 115 +++++++++++++++++++++---------------------------------
1 file changed, 44 insertions(+), 71 deletions(-)
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index ca4c8fd..2816858 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -308,9 +308,7 @@ xfs_zero_eof(
/*
* Common pre-write limit and setup checks.
*
- * Called with the iolocked held either shared and exclusive according to
- * @iolock, and returns with it held. Might upgrade the iolock to exclusive
- * if called for a direct write beyond i_size.
+ * Called with the iolock held in exclusive mode.
*/
STATIC ssize_t
xfs_file_aio_write_checks(
@@ -322,7 +320,6 @@ xfs_file_aio_write_checks(
struct inode *inode = file->f_mapping->host;
struct xfs_inode *ip = XFS_I(inode);
ssize_t error = 0;
- size_t count = iov_iter_count(from);
bool drained_dio = false;
restart:
@@ -335,21 +332,9 @@ xfs_file_aio_write_checks(
return error;
/*
- * For changing security info in file_remove_privs() we need i_rwsem
- * exclusively.
- */
- if (*iolock == XFS_IOLOCK_SHARED && !IS_NOSEC(inode)) {
- xfs_iunlock(ip, *iolock);
- *iolock = XFS_IOLOCK_EXCL;
- xfs_ilock(ip, *iolock);
- goto restart;
- }
- /*
* If the offset is beyond the size of the file, we need to zero any
* blocks that fall between the existing EOF and the start of this
- * write. If zeroing is needed and we are currently holding the
- * iolock shared, we need to update it to exclusive which implies
- * having to redo all checks before.
+ * write.
*
* We need to serialise against EOF updates that occur in IO
* completions here. We want to make sure that nobody is changing the
@@ -365,12 +350,6 @@ xfs_file_aio_write_checks(
spin_unlock(&ip->i_flags_lock);
if (!drained_dio) {
- if (*iolock == XFS_IOLOCK_SHARED) {
- xfs_iunlock(ip, *iolock);
- *iolock = XFS_IOLOCK_EXCL;
- xfs_ilock(ip, *iolock);
- iov_iter_reexpand(from, count);
- }
/*
* We now have an IO submission barrier in place, but
* AIO can do EOF updates during IO completion and hence
@@ -491,7 +470,8 @@ xfs_dio_write_end_io(
STATIC ssize_t
xfs_file_dio_aio_write(
struct kiocb *iocb,
- struct iov_iter *from)
+ struct iov_iter *from,
+ int *iolock)
{
struct file *file = iocb->ki_filp;
struct address_space *mapping = file->f_mapping;
@@ -500,7 +480,6 @@ xfs_file_dio_aio_write(
struct xfs_mount *mp = ip->i_mount;
ssize_t ret = 0;
int unaligned_io = 0;
- int iolock;
size_t count = iov_iter_count(from);
struct xfs_buftarg *target = XFS_IS_REALTIME_INODE(ip) ?
mp->m_rtdev_targp : mp->m_ddev_targp;
@@ -510,11 +489,12 @@ xfs_file_dio_aio_write(
return -EINVAL;
/*
- * Don't take the exclusive iolock here unless the I/O is unaligned to
- * the file system block size. We don't need to consider the EOF
- * extension case here because xfs_file_aio_write_checks() will relock
- * the inode as necessary for EOF zeroing cases and fill out the new
- * inode size as appropriate.
+ * We hold the exclusive iolock via our caller. After the common
+ * write checks we will demote it to a shared iolock unless the I/O is
+ * unaligned to the file system block size. We don't need to consider
+ * the EOF extension case here because xfs_file_aio_write_checks()
+ * will deal with EOF zeroing cases and fill out the new inode size as
+ * appropriate.
*/
if ((iocb->ki_pos & mp->m_blockmask) ||
((iocb->ki_pos + count) & mp->m_blockmask)) {
@@ -528,26 +508,17 @@ xfs_file_dio_aio_write(
trace_xfs_reflink_bounce_dio_write(ip, iocb->ki_pos, count);
return -EREMCHG;
}
- iolock = XFS_IOLOCK_EXCL;
- } else {
- iolock = XFS_IOLOCK_SHARED;
}
- if (!xfs_ilock_nowait(ip, iolock)) {
- if (iocb->ki_flags & IOCB_NOWAIT)
- return -EAGAIN;
- xfs_ilock(ip, iolock);
- }
- ret = xfs_file_aio_write_checks(iocb, from, &iolock);
+ ret = xfs_file_aio_write_checks(iocb, from, iolock);
if (ret)
goto out;
count = iov_iter_count(from);
/*
* If we are doing unaligned IO, wait for all other IO to drain,
- * otherwise demote the lock if we had to take the exclusive lock
- * for other reasons in xfs_file_aio_write_checks.
+ * otherwise demote the lock.
*/
if (unaligned_io) {
/* If we are going to wait for other DIO to finish, bail */
@@ -557,15 +528,14 @@ xfs_file_dio_aio_write(
} else {
inode_dio_wait(inode);
}
- } else if (iolock == XFS_IOLOCK_EXCL) {
+ } else if (*iolock == XFS_IOLOCK_EXCL) {
xfs_ilock_demote(ip, XFS_IOLOCK_EXCL);
- iolock = XFS_IOLOCK_SHARED;
+ *iolock = XFS_IOLOCK_SHARED;
}
trace_xfs_file_direct_write(ip, count, iocb->ki_pos);
ret = iomap_dio_rw(iocb, from, &xfs_iomap_ops, xfs_dio_write_end_io);
out:
- xfs_iunlock(ip, iolock);
/*
* No fallback to buffered IO on errors for XFS, direct IO will either
@@ -578,22 +548,16 @@ xfs_file_dio_aio_write(
static noinline ssize_t
xfs_file_dax_write(
struct kiocb *iocb,
- struct iov_iter *from)
+ struct iov_iter *from,
+ int *iolock)
{
struct inode *inode = iocb->ki_filp->f_mapping->host;
struct xfs_inode *ip = XFS_I(inode);
- int iolock = XFS_IOLOCK_EXCL;
ssize_t ret, error = 0;
size_t count;
loff_t pos;
- if (!xfs_ilock_nowait(ip, iolock)) {
- if (iocb->ki_flags & IOCB_NOWAIT)
- return -EAGAIN;
- xfs_ilock(ip, iolock);
- }
-
- ret = xfs_file_aio_write_checks(iocb, from, &iolock);
+ ret = xfs_file_aio_write_checks(iocb, from, iolock);
if (ret)
goto out;
@@ -607,14 +571,14 @@ xfs_file_dax_write(
error = xfs_setfilesize(ip, pos, ret);
}
out:
- xfs_iunlock(ip, iolock);
return error ? error : ret;
}
STATIC ssize_t
xfs_file_buffered_aio_write(
struct kiocb *iocb,
- struct iov_iter *from)
+ struct iov_iter *from,
+ int *iolock)
{
struct file *file = iocb->ki_filp;
struct address_space *mapping = file->f_mapping;
@@ -622,16 +586,12 @@ xfs_file_buffered_aio_write(
struct xfs_inode *ip = XFS_I(inode);
ssize_t ret;
int enospc = 0;
- int iolock;
if (iocb->ki_flags & IOCB_NOWAIT)
return -EOPNOTSUPP;
write_retry:
- iolock = XFS_IOLOCK_EXCL;
- xfs_ilock(ip, iolock);
-
- ret = xfs_file_aio_write_checks(iocb, from, &iolock);
+ ret = xfs_file_aio_write_checks(iocb, from, iolock);
if (ret)
goto out;
@@ -653,32 +613,35 @@ xfs_file_buffered_aio_write(
* running at the same time.
*/
if (ret == -EDQUOT && !enospc) {
- xfs_iunlock(ip, iolock);
+ xfs_iunlock(ip, *iolock);
enospc = xfs_inode_free_quota_eofblocks(ip);
if (enospc)
- goto write_retry;
+ goto lock_retry;
enospc = xfs_inode_free_quota_cowblocks(ip);
if (enospc)
- goto write_retry;
- iolock = 0;
+ goto lock_retry;
+ *iolock = 0;
} else if (ret == -ENOSPC && !enospc) {
struct xfs_eofblocks eofb = {0};
enospc = 1;
xfs_flush_inodes(ip->i_mount);
- xfs_iunlock(ip, iolock);
+ xfs_iunlock(ip, *iolock);
eofb.eof_flags = XFS_EOF_FLAGS_SYNC;
xfs_icache_free_eofblocks(ip->i_mount, &eofb);
xfs_icache_free_cowblocks(ip->i_mount, &eofb);
- goto write_retry;
+ goto lock_retry;
}
current->backing_dev_info = NULL;
out:
- if (iolock)
- xfs_iunlock(ip, iolock);
return ret;
+lock_retry:
+ xfs_ilock(ip, *iolock);
+ if (IS_DAX(inode))
+ return -EAGAIN;
+ goto write_retry;
}
STATIC ssize_t
@@ -692,6 +655,7 @@ xfs_file_write_iter(
struct xfs_inode *ip = XFS_I(inode);
ssize_t ret;
size_t ocount = iov_iter_count(from);
+ int iolock = XFS_IOLOCK_EXCL;
XFS_STATS_INC(ip->i_mount, xs_write_calls);
@@ -701,8 +665,14 @@ xfs_file_write_iter(
if (XFS_FORCED_SHUTDOWN(ip->i_mount))
return -EIO;
+ if (!xfs_ilock_nowait(ip, iolock)) {
+ if (iocb->ki_flags & IOCB_NOWAIT)
+ return -EAGAIN;
+ xfs_ilock(ip, iolock);
+ }
+
if (IS_DAX(inode))
- ret = xfs_file_dax_write(iocb, from);
+ ret = xfs_file_dax_write(iocb, from, &iolock);
else if (iocb->ki_flags & IOCB_DIRECT) {
/*
* Allow a directio write to fall back to a buffered
@@ -710,14 +680,17 @@ xfs_file_write_iter(
* CoW. In all other directio scenarios we do not
* allow an operation to fall back to buffered mode.
*/
- ret = xfs_file_dio_aio_write(iocb, from);
+ ret = xfs_file_dio_aio_write(iocb, from, &iolock);
if (ret == -EREMCHG)
goto buffered;
} else {
buffered:
- ret = xfs_file_buffered_aio_write(iocb, from);
+ ret = xfs_file_buffered_aio_write(iocb, from, &iolock);
}
+ if (iolock)
+ xfs_iunlock(ip, iolock);
+
if (ret > 0) {
XFS_STATS_ADD(ip->i_mount, xs_write_bytes, ret);
--
2.9.5
On Mon, Sep 25, 2017 at 4:14 PM, Ross Zwisler
<[email protected]> wrote:
> When mappings are created the vma->vm_flags that they use vary based on
> whether the inode being mapped is using DAX or not. This setup happens in
> XFS via mmap_region()=>call_mmap()=>xfs_file_mmap().
>
> For us to be able to safely use the DAX per-inode flag we need to prevent
> S_DAX transitions when any mappings are present, and we will do that by
> looking at the address_space->i_mmap tree and returning -EBUSY if any
> mappings are present.
>
> Unfortunately at the time that the filesystem's file_operations->mmap()
> entry point is called the mapping has not yet been added to the
> address_space->i_mmap tree. This means that at that point in time we
> cannot determine whether or not the mapping will be set up to support DAX.
>
> Fix this by adding a new file_operations entry called post_mmap() which is
> called after the mapping has been added to the address_space->i_mmap tree.
> This post_mmap() op now happens at a time when we can be sure whether the
> mapping will use DAX or not, and we can set up the vma->vm_flags
> appropriately.
>
> Signed-off-by: Ross Zwisler <[email protected]>
> ---
> fs/xfs/xfs_file.c | 15 ++++++++++++++-
> include/linux/fs.h | 1 +
> mm/mmap.c | 2 ++
> 3 files changed, 17 insertions(+), 1 deletion(-)
>
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index 2816858..9d66aaa 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -1087,9 +1087,21 @@ xfs_file_mmap(
> {
> file_accessed(filp);
> vma->vm_ops = &xfs_file_vm_ops;
> + return 0;
> +}
> +
> +/* This call happens during mmap(), after the vma has been inserted into the
> + * inode->i_mapping->i_mmap tree. At this point the decision on whether or
> + * not to use DAX for this mapping has been set and will not change for the
> + * duration of the mapping.
> + */
> +STATIC void
> +xfs_file_post_mmap(
> + struct file *filp,
> + struct vm_area_struct *vma)
> +{
> if (IS_DAX(file_inode(filp)))
> vma->vm_flags |= VM_MIXEDMAP | VM_HUGEPAGE;
It's not clear to me what this is actually protecting? vma_is_dax()
returns true regardless of the vm_flags state , so what is the benefit
to delaying the vm_flags setting to ->post_mmap()?
Also, why is this a file_operation and not a vm_operation?
On Mon, Sep 25, 2017 at 05:14:00PM -0600, Ross Zwisler wrote:
> In the current XFS read I/O path we check IS_DAX() in xfs_file_read_iter()
> to decide whether to do DAX I/O, direct I/O or buffered I/O. This check is
> done without holding the XFS_IOLOCK, though, which means that if we allow
> S_DAX to be manipulated via the inode flag we can run into this race:
>
> CPU 0 CPU 1
> ----- -----
> xfs_file_read_iter()
> IS_DAX() << returns false
> xfs_ioctl_setattr()
> xfs_ioctl_setattr_dax_invalidate()
> xfs_ilock(XFS_MMAPLOCK|XFS_IOLOCK)
> sets S_DAX
> releases XFS_MMAPLOCK and XFS_IOLOCK
> xfs_file_buffered_aio_read()
> does buffered I/O to DAX inode, death
>
> Fix this by ensuring that we only check S_DAX when we hold the XFS_IOLOCK
> in the read path.
>
> Signed-off-by: Ross Zwisler <[email protected]>
> ---
> fs/xfs/xfs_file.c | 42 +++++++++++++-----------------------------
> 1 file changed, 13 insertions(+), 29 deletions(-)
>
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index ebdd0bd..ca4c8fd 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -207,7 +207,6 @@ xfs_file_dio_aio_read(
> {
> struct xfs_inode *ip = XFS_I(file_inode(iocb->ki_filp));
> size_t count = iov_iter_count(to);
> - ssize_t ret;
>
> trace_xfs_file_direct_read(ip, count, iocb->ki_pos);
>
> @@ -215,12 +214,7 @@ xfs_file_dio_aio_read(
> return 0; /* skip atime */
>
> file_accessed(iocb->ki_filp);
> -
> - xfs_ilock(ip, XFS_IOLOCK_SHARED);
> - ret = iomap_dio_rw(iocb, to, &xfs_iomap_ops, NULL);
> - xfs_iunlock(ip, XFS_IOLOCK_SHARED);
> -
> - return ret;
> + return iomap_dio_rw(iocb, to, &xfs_iomap_ops, NULL);
This puts file_accessed under the XFS_IOLOCK_SHARED now. Is that a
safe/sane thing to do for DIO?
Cheers,
Dave.
--
Dave Chinner
[email protected]
On Mon, Sep 25, 2017 at 05:14:01PM -0600, Ross Zwisler wrote:
> In the current XFS write I/O path we check IS_DAX() in
> xfs_file_write_iter() to decide whether to do DAX I/O, direct I/O or
> buffered I/O. This check is done without holding the XFS_IOLOCK, though,
> which means that if we allow S_DAX to be manipulated via the inode flag we
> can run into this race:
>
> CPU 0 CPU 1
> ----- -----
> xfs_file_write_iter()
> IS_DAX() << returns false
> xfs_ioctl_setattr()
> xfs_ioctl_setattr_dax_invalidate()
> xfs_ilock(XFS_MMAPLOCK|XFS_IOLOCK)
> sets S_DAX
> releases XFS_MMAPLOCK and XFS_IOLOCK
> xfs_file_buffered_aio_write()
> does buffered I/O to DAX inode, death
>
> Fix this by ensuring that we only check S_DAX when we hold the XFS_IOLOCK
> in the write path.
NACK. This breaks concurrent direct IO write semantics. We must not
take XFS_IOLOCK_EXCL on direct IO writes unless it is absolutely
necessary - there are lots of applications out there that rely on
these semantics for performance.
CHeers,
Dave.
--
Dave Chinner
[email protected]
On Mon, Sep 25, 2017 at 05:13:58PM -0600, Ross Zwisler wrote:
> Before support for the per-inode DAX flag was disabled the XFS the code had
> an issue where the user couldn't reliably tell whether or not DAX was being
> used to service page faults and I/O when the DAX mount option was used. In
> this case each inode within the mounted filesystem started with S_DAX set
> due to the mount option, but it could be cleared if someone touched the
> individual inode flag.
>
> For example (v4.13 and before):
>
> # mount | grep dax
> /dev/pmem0 on /mnt type xfs
> (rw,relatime,seclabel,attr2,dax,inode64,sunit=4096,swidth=4096,noquota)
>
> # touch /mnt/a /mnt/b # both files currently use DAX
>
> # xfs_io -c "lsattr" /mnt/* # neither has the DAX inode option set
> ----------e----- /mnt/a
> ----------e----- /mnt/b
>
> # xfs_io -c "chattr -x" /mnt/a # this clears S_DAX for /mnt/a
>
> # xfs_io -c "lsattr" /mnt/*
> ----------e----- /mnt/a
> ----------e----- /mnt/b
That's really a bug in the lsattr code, yes? If we've cleared the
S_DAX flag for the inode, then why is it being reported in lsattr?
Or if we failed to clear the S_DAX flag in the 'chattr -x' call,
then isn't that the bug that needs fixing?
Remember, the whole point of the dax inode flag was to be able to
override the mount option setting so that admins could turn off/on
dax for the things that didn't/did work with DAX correctly so they
didn't need multiple filesystems on pmem to segregate the apps that
did/didn't work with DAX...
Cheers,
Dave.
--
Dave Chinner
[email protected]
On Mon, Sep 25, 2017 at 05:14:04PM -0600, Ross Zwisler wrote:
> Re-enable the XFS per-inode DAX flag, preventing S_DAX from changing when
> any mappings are present.
>
> Signed-off-by: Ross Zwisler <[email protected]>
> ---
> fs/xfs/xfs_ioctl.c | 17 +++++++++++++++--
> 1 file changed, 15 insertions(+), 2 deletions(-)
>
> diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
> index 386b437..7a24dd5 100644
> --- a/fs/xfs/xfs_ioctl.c
> +++ b/fs/xfs/xfs_ioctl.c
> @@ -1012,12 +1012,10 @@ xfs_diflags_to_linux(
> inode->i_flags |= S_NOATIME;
> else
> inode->i_flags &= ~S_NOATIME;
> -#if 0 /* disabled until the flag switching races are sorted out */
> if ((xflags & FS_XFLAG_DAX) || (ip->i_mount->m_flags & XFS_MOUNT_DAX))
> inode->i_flags |= S_DAX;
> else
> inode->i_flags &= ~S_DAX;
> -#endif
> }
>
> static bool
> @@ -1049,6 +1047,8 @@ xfs_ioctl_setattr_xflags(
> {
> struct xfs_mount *mp = ip->i_mount;
> uint64_t di_flags2;
> + struct address_space *mapping = VFS_I(ip)->i_mapping;
> + bool dax_changing;
>
> /* Can't change realtime flag if any extents are allocated. */
> if ((ip->i_d.di_nextents || ip->i_delayed_blks) &&
> @@ -1084,10 +1084,23 @@ xfs_ioctl_setattr_xflags(
> if (di_flags2 && ip->i_d.di_version < 3)
> return -EINVAL;
>
> + dax_changing = xfs_is_dax_state_changing(fa->fsx_xflags, ip);
> + if (dax_changing) {
> + i_mmap_lock_read(mapping);
> + if (mapping_mapped(mapping)) {
> + i_mmap_unlock_read(mapping);
> + return -EBUSY;
> + }
> + }
> +
> ip->i_d.di_flags = xfs_flags2diflags(ip, fa->fsx_xflags);
> ip->i_d.di_flags2 = di_flags2;
>
> xfs_diflags_to_linux(ip);
> +
> + if (dax_changing)
> + i_mmap_unlock_read(mapping);
Is this safe to be taking here under the ILOCK_EXCL? i.e. this is
the lock order here:
IOLOCK_EXCL -> MMAPLOCK_EXCL -> ILOCK_EXCL -> i_mmap_rwsem
The truncate path must run outside the ILOCK
context, and it does this order via unmap_mapping_range:
IOLOCK_EXCL -> MMAPLOCK_EXCL -> i_mmap_rwsem
On a page fault, we do:
mmap_sem -> MMAPLOCK_EXCL -> page lock -> ILOCK_EXCL
Which gives the order
IOLOCK_EXCL
-> mmap_sem
-> MMAPLOCK_EXCL
-> page lock
-> ILOCK_EXCL
-> i_mmap_rwsem
What I'm not clear on is what the orders between page locks and
pte locks and i_mapping_tree_lock and i_mmap_rwsem. If there's any
locks that the filesystem can take above the ILOCK that are also
taken under the i_mmap_rwsem, then we have a deadlock vector.
Historically we've avoided any mm/ level interactions under the
ILOCK_EXCL because of it's location in the page fault path locking
order (e.g. lockdep will go nuts if we take a page fault with the
ILOCK held). Hence I'm extremely wary of putting any other mm/ level
locks under the ILOCK like this without a clear explanation of the
locking orders and why it won't deadlock....
Cheers,
Dave.
--
Dave Chinner
[email protected]
We can't just take locking one level up, as we need differnet locking
for different kinds of I/O.
I think you probably want an IOCB_DAX flag to check IS_DAX once and
then stick to it, similar to what we do for direct I/O.
> +static bool
> +xfs_is_dax_state_changing(
> + unsigned int xflags,
> + struct xfs_inode *ip)
And I have no fricking idea what 'is_dax_state_changing' is supposed
to mean for the caller. This needs a better name and/or a comment
explaining the function.
On Mon, Sep 25, 2017 at 05:14:03PM -0600, Ross Zwisler wrote:
> When mappings are created the vma->vm_flags that they use vary based on
> whether the inode being mapped is using DAX or not. This setup happens in
> XFS via mmap_region()=>call_mmap()=>xfs_file_mmap().
>
> For us to be able to safely use the DAX per-inode flag we need to prevent
> S_DAX transitions when any mappings are present, and we will do that by
> looking at the address_space->i_mmap tree and returning -EBUSY if any
> mappings are present.
>
> Unfortunately at the time that the filesystem's file_operations->mmap()
> entry point is called the mapping has not yet been added to the
> address_space->i_mmap tree. This means that at that point in time we
> cannot determine whether or not the mapping will be set up to support DAX.
>
> Fix this by adding a new file_operations entry called post_mmap() which is
> called after the mapping has been added to the address_space->i_mmap tree.
> This post_mmap() op now happens at a time when we can be sure whether the
> mapping will use DAX or not, and we can set up the vma->vm_flags
> appropriately.
Just like in the read/write path we'll need a flag that is passed down
from the VM based on checking IS_DAX once and exactly once instead.
On Mon, Sep 25, 2017 at 05:14:04PM -0600, Ross Zwisler wrote:
> Re-enable the XFS per-inode DAX flag, preventing S_DAX from changing when
> any mappings are present.
Before we re-enable it please come up with a coherent description
of the per-inode DAX flag that makes sense to a user. We'll also need
to find a good place to document it, e.g. a new ioctl_setflags man
page.
On Mon, Sep 25, 2017 at 05:13:59PM -0600, Ross Zwisler wrote:
> Currently only the blocksize is checked, but we should really be calling
> bdev_dax_supported() which also tests to make sure we can get a
> struct dax_device and that the dax_direct_access() path is working.
>
> This is the same check that we do for the "-o dax" mount option in
> xfs_fs_fill_super().
>
> Signed-off-by: Ross Zwisler <[email protected]>
> Reviewed-by: Christoph Hellwig <[email protected]>
I think we just want to pick this up ASAP. And between my vague
memoried and that reviewed-by tag it already was part of a different
series, wasn't it?
On Tue 26-09-17 09:38:12, Dave Chinner wrote:
> On Mon, Sep 25, 2017 at 05:13:58PM -0600, Ross Zwisler wrote:
> > Before support for the per-inode DAX flag was disabled the XFS the code had
> > an issue where the user couldn't reliably tell whether or not DAX was being
> > used to service page faults and I/O when the DAX mount option was used. In
> > this case each inode within the mounted filesystem started with S_DAX set
> > due to the mount option, but it could be cleared if someone touched the
> > individual inode flag.
> >
> > For example (v4.13 and before):
> >
> > # mount | grep dax
> > /dev/pmem0 on /mnt type xfs
> > (rw,relatime,seclabel,attr2,dax,inode64,sunit=4096,swidth=4096,noquota)
> >
> > # touch /mnt/a /mnt/b # both files currently use DAX
> >
> > # xfs_io -c "lsattr" /mnt/* # neither has the DAX inode option set
> > ----------e----- /mnt/a
> > ----------e----- /mnt/b
> >
> > # xfs_io -c "chattr -x" /mnt/a # this clears S_DAX for /mnt/a
> >
> > # xfs_io -c "lsattr" /mnt/*
> > ----------e----- /mnt/a
> > ----------e----- /mnt/b
>
> That's really a bug in the lsattr code, yes? If we've cleared the
> S_DAX flag for the inode, then why is it being reported in lsattr?
> Or if we failed to clear the S_DAX flag in the 'chattr -x' call,
> then isn't that the bug that needs fixing?
>
> Remember, the whole point of the dax inode flag was to be able to
> override the mount option setting so that admins could turn off/on
> dax for the things that didn't/did work with DAX correctly so they
> didn't need multiple filesystems on pmem to segregate the apps that
> did/didn't work with DAX...
So I think there is some confusion that is created by the fact that whether
DAX is used or not is controlled by both a mount option and an inode flag.
We could define that "Inode flag always wins" which is what you seem to
suggest above but then mount option has no practical effect since on-disk
S_DAX flag will always overrule it.
Ross suggests that DAX should be used if "Inode flag or mount option is
set". Which is similar to how e.g. noatime inode flag works but does not
allow to selectively disable DAX.
So if we wanted both mount option to work and selective disabling of DAX,
we would need three states of inode setting - force DAX, disable DAX,
inherit from mount option.
Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR
On Tue, Sep 26, 2017 at 11:35:48AM +0200, Jan Kara wrote:
> On Tue 26-09-17 09:38:12, Dave Chinner wrote:
> > On Mon, Sep 25, 2017 at 05:13:58PM -0600, Ross Zwisler wrote:
> > > Before support for the per-inode DAX flag was disabled the XFS the code had
> > > an issue where the user couldn't reliably tell whether or not DAX was being
> > > used to service page faults and I/O when the DAX mount option was used. In
> > > this case each inode within the mounted filesystem started with S_DAX set
> > > due to the mount option, but it could be cleared if someone touched the
> > > individual inode flag.
> > >
> > > For example (v4.13 and before):
> > >
> > > # mount | grep dax
> > > /dev/pmem0 on /mnt type xfs
> > > (rw,relatime,seclabel,attr2,dax,inode64,sunit=4096,swidth=4096,noquota)
> > >
> > > # touch /mnt/a /mnt/b # both files currently use DAX
> > >
> > > # xfs_io -c "lsattr" /mnt/* # neither has the DAX inode option set
> > > ----------e----- /mnt/a
> > > ----------e----- /mnt/b
> > >
> > > # xfs_io -c "chattr -x" /mnt/a # this clears S_DAX for /mnt/a
> > >
> > > # xfs_io -c "lsattr" /mnt/*
> > > ----------e----- /mnt/a
> > > ----------e----- /mnt/b
> >
> > That's really a bug in the lsattr code, yes? If we've cleared the
> > S_DAX flag for the inode, then why is it being reported in lsattr?
> > Or if we failed to clear the S_DAX flag in the 'chattr -x' call,
> > then isn't that the bug that needs fixing?
> >
> > Remember, the whole point of the dax inode flag was to be able to
> > override the mount option setting so that admins could turn off/on
> > dax for the things that didn't/did work with DAX correctly so they
> > didn't need multiple filesystems on pmem to segregate the apps that
> > did/didn't work with DAX...
>
> So I think there is some confusion that is created by the fact that whether
> DAX is used or not is controlled by both a mount option and an inode flag.
> We could define that "Inode flag always wins" which is what you seem to
> suggest above but then mount option has no practical effect since on-disk
> S_DAX flag will always overrule it.
Well, quite frankly, I never wanted the mount option for XFS. It was
supposed to be for initial testing only, then we'd /always/ use the
the inode flags. For a filesystem to default to using DAX, we
set the DAX flag on the root inode at mkfs time, and then everything
inode flag based just works.
But it seems that we're now stuck with the stupid, blunt, brute
force mount option because that's what the first commit on ext4
used. Now we're just about stuck with this silly "but we can't turn
it off" problem because of the mount option overriding everything.
If we have to keep the mount option, then lets fix it to mean "mount
option sets inheritable inode flag on directory creation" and
/maybe/ "mount option sets inode flag on file creation".
This then allows the inode flag to control everything else. i.e the
mount option sets the initial flag value rather than the behaviour
of the inode. The behaviour of the inode should be entirely
controlled by the inode flag, hence after initial creation the
chattr +/-x commands do what they advertise regardless of the mount
option value.
Yes, it means that existing users are going to have to run chattr -R
+x on their pmem filesystems to get the inode flags on disk, but
this is all tagged with EXPERIMENTAL and this is the sort of change
that is expected from experimental functionality.
Cheers,
Dave.
--
Dave Chinner
[email protected]
On Mon, Sep 25, 2017 at 11:32 PM, Christoph Hellwig <[email protected]> wrote:
> We can't just take locking one level up, as we need differnet locking
> for different kinds of I/O.
>
> I think you probably want an IOCB_DAX flag to check IS_DAX once and
> then stick to it, similar to what we do for direct I/O.
I wonder if this works better with a reference count mechanism
per-file so that we don't need a hold a lock over the whole
transition. Similar to request_queue reference counting, when DAX is
being turned off we block new references and drain the in-flight ones.
On Tue, Sep 26, 2017 at 06:59:37AM -0700, Dan Williams wrote:
> > I think you probably want an IOCB_DAX flag to check IS_DAX once and
> > then stick to it, similar to what we do for direct I/O.
>
> I wonder if this works better with a reference count mechanism
> per-file so that we don't need a hold a lock over the whole
> transition. Similar to request_queue reference counting, when DAX is
> being turned off we block new references and drain the in-flight ones.
Maybe. But that assumes we want to be stuck in a perpetual binary
DAX on/off state on a given file. Which makes not only for an awkward
interface (inode or mount flag), but also might be fundamentally the
wrong thing to do for some media where you'd happily read directly
from it but rather buffer writes in DRAM.
On Tue, Sep 26, 2017 at 09:09:57PM +1000, Dave Chinner wrote:
> Well, quite frankly, I never wanted the mount option for XFS. It was
> supposed to be for initial testing only, then we'd /always/ use the
> the inode flags. For a filesystem to default to using DAX, we
> set the DAX flag on the root inode at mkfs time, and then everything
> inode flag based just works.
And I deeply fundamentally disagree. The mount option is a nice
enough big hammer to try a mode without encoding nitty gritty details
into the application ABI.
The per-inode persistent flag is the biggest nightmare ever, as we see
in all these discussions about it.
What does it even mean? Right now it forces direct addressing as long
as the underlying media supports that. But what about media that
you directly access but you really don't want to because it's really slow?
Or media that is so god damn fast that you never want to buffer? Or
media where you want to buffer for writes (or at least some of them)
but not for reads?
It encodes a very specific mechanism for an early direct access
implementation into the ABI. What we really need is for applications
to declare an intent, not specify a particular mechanism.
On Tue, Sep 26, 2017 at 08:36:50AM +0200, Christoph Hellwig wrote:
> On Mon, Sep 25, 2017 at 05:13:59PM -0600, Ross Zwisler wrote:
> > Currently only the blocksize is checked, but we should really be calling
> > bdev_dax_supported() which also tests to make sure we can get a
> > struct dax_device and that the dax_direct_access() path is working.
> >
> > This is the same check that we do for the "-o dax" mount option in
> > xfs_fs_fill_super().
> >
> > Signed-off-by: Ross Zwisler <[email protected]>
> > Reviewed-by: Christoph Hellwig <[email protected]>
>
> I think we just want to pick this up ASAP. And between my vague
> memoried and that reviewed-by tag it already was part of a different
> series, wasn't it?
Yep, the first 2 patches were part of this series:
https://lkml.org/lkml/2017/9/7/552
which you reviewed. I included them in this series because the later patches
needed to build on them. It looks like they are now in Darrick's
xfs-4.14-fixes branch, but haven't yet made it upstream.
On Tue, Sep 26, 2017 at 04:37:43PM +0200, Christoph Hellwig wrote:
> On Tue, Sep 26, 2017 at 09:09:57PM +1000, Dave Chinner wrote:
> > Well, quite frankly, I never wanted the mount option for XFS. It was
> > supposed to be for initial testing only, then we'd /always/ use the
> > the inode flags. For a filesystem to default to using DAX, we
> > set the DAX flag on the root inode at mkfs time, and then everything
> > inode flag based just works.
>
> And I deeply fundamentally disagree. The mount option is a nice
> enough big hammer to try a mode without encoding nitty gritty details
> into the application ABI.
>
> The per-inode persistent flag is the biggest nightmare ever, as we see
> in all these discussions about it.
>
> What does it even mean? Right now it forces direct addressing as long
> as the underlying media supports that. But what about media that
> you directly access but you really don't want to because it's really slow?
> Or media that is so god damn fast that you never want to buffer? Or
> media where you want to buffer for writes (or at least some of them)
> but not for reads?
>
> It encodes a very specific mechanism for an early direct access
> implementation into the ABI. What we really need is for applications
> to declare an intent, not specify a particular mechanism.
I agree that Christoph's idea about having the system intelligently adjust to
use DAX based on performance information it gathers about the underlying
persistent memory (probably via the HMAT on x86_64 systems) is interesting,
but I think we're still a ways away from that.
FWIW, as my patches suggest and Jan observed I think that we should allow
users to turn on DAX by treating the inode flag and the mount flag as an 'or'
operation. i.e. you get DAX if either the mount option is specified or if the
inode flag is set, and you can continue to manipulate the per-inode flag as
you want regardless of the mount option. I think this provides maximum
flexibility of the mechanism to select DAX without enforcing policy.
In the end, though, I think what's really important is that we figure out what
the various options mean, have the same story for both XFS and ext4, and
document it as hch suggested in response to my patch 7 in this series.
Does it make sense at this point to just start a "dax" man page that can
contain info about the mount options, inode flags, kernel config options, how
to get PMDs, etc? Or does this documentation need to be sprinkled around more
in existing man pages?
On Tue, Sep 26, 2017 at 11:16:38AM -0600, Ross Zwisler wrote:
> On Tue, Sep 26, 2017 at 08:36:50AM +0200, Christoph Hellwig wrote:
> > On Mon, Sep 25, 2017 at 05:13:59PM -0600, Ross Zwisler wrote:
> > > Currently only the blocksize is checked, but we should really be calling
> > > bdev_dax_supported() which also tests to make sure we can get a
> > > struct dax_device and that the dax_direct_access() path is working.
> > >
> > > This is the same check that we do for the "-o dax" mount option in
> > > xfs_fs_fill_super().
> > >
> > > Signed-off-by: Ross Zwisler <[email protected]>
> > > Reviewed-by: Christoph Hellwig <[email protected]>
> >
> > I think we just want to pick this up ASAP. And between my vague
> > memoried and that reviewed-by tag it already was part of a different
> > series, wasn't it?
>
> Yep, the first 2 patches were part of this series:
>
> https://lkml.org/lkml/2017/9/7/552
>
> which you reviewed. I included them in this series because the later patches
> needed to build on them. It looks like they are now in Darrick's
> xfs-4.14-fixes branch, but haven't yet made it upstream.
I'm pulling that first patch from -fixes because Dave & Christoph
started discussing it again in this new thread after I'd pushed the
patch from September 7th to korg.
The second patch looks fine, it'll stay.
--D
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
On 9/26/17 6:09 AM, Dave Chinner wrote:
> On Tue, Sep 26, 2017 at 11:35:48AM +0200, Jan Kara wrote:
>> On Tue 26-09-17 09:38:12, Dave Chinner wrote:
>>> On Mon, Sep 25, 2017 at 05:13:58PM -0600, Ross Zwisler wrote:
>>>> Before support for the per-inode DAX flag was disabled the XFS the code had
>>>> an issue where the user couldn't reliably tell whether or not DAX was being
>>>> used to service page faults and I/O when the DAX mount option was used. In
>>>> this case each inode within the mounted filesystem started with S_DAX set
>>>> due to the mount option, but it could be cleared if someone touched the
>>>> individual inode flag.
>>>>
>>>> For example (v4.13 and before):
>>>>
>>>> # mount | grep dax
>>>> /dev/pmem0 on /mnt type xfs
>>>> (rw,relatime,seclabel,attr2,dax,inode64,sunit=4096,swidth=4096,noquota)
>>>>
>>>> # touch /mnt/a /mnt/b # both files currently use DAX
>>>>
>>>> # xfs_io -c "lsattr" /mnt/* # neither has the DAX inode option set
>>>> ----------e----- /mnt/a
>>>> ----------e----- /mnt/b
>>>>
>>>> # xfs_io -c "chattr -x" /mnt/a # this clears S_DAX for /mnt/a
>>>>
>>>> # xfs_io -c "lsattr" /mnt/*
>>>> ----------e----- /mnt/a
>>>> ----------e----- /mnt/b
>>>
>>> That's really a bug in the lsattr code, yes? If we've cleared the
>>> S_DAX flag for the inode, then why is it being reported in lsattr?
>>> Or if we failed to clear the S_DAX flag in the 'chattr -x' call,
>>> then isn't that the bug that needs fixing?
>>>
>>> Remember, the whole point of the dax inode flag was to be able to
>>> override the mount option setting so that admins could turn off/on
>>> dax for the things that didn't/did work with DAX correctly so they
>>> didn't need multiple filesystems on pmem to segregate the apps that
>>> did/didn't work with DAX...
>>
>> So I think there is some confusion that is created by the fact that whether
>> DAX is used or not is controlled by both a mount option and an inode flag.
>> We could define that "Inode flag always wins" which is what you seem to
>> suggest above but then mount option has no practical effect since on-disk
>> S_DAX flag will always overrule it.
>
> Well, quite frankly, I never wanted the mount option for XFS. It was
> supposed to be for initial testing only, then we'd /always/ use the
> the inode flags. For a filesystem to default to using DAX, we
> set the DAX flag on the root inode at mkfs time, and then everything
> inode flag based just works.
>
> But it seems that we're now stuck with the stupid, blunt, brute
> force mount option because that's what the first commit on ext4
> used. Now we're just about stuck with this silly "but we can't turn
> it off" problem because of the mount option overriding everything.
I don't think the existence of a mount option in ext4 makes us any
more "stuck" than the mount option in xfs does.
fs/xfs/xfs_super.c: "DAX enabled. Warning: EXPERIMENTAL, use at your own risk");
fs/ext4/super.c: "DAX enabled. Warning: EXPERIMENTAL, use at your own risk");
so when^wif this argument ever gets settled, I think there is plenty
of latitude to do the right thing, potentially breaking the old thing.
> If we have to keep the mount option, then lets fix it to mean "mount
> option sets inheritable inode flag on directory creation" and
> /maybe/ "mount option sets inode flag on file creation".
>
> This then allows the inode flag to control everything else. i.e the
> mount option sets the initial flag value rather than the behaviour
> of the inode. The behaviour of the inode should be entirely
> controlled by the inode flag, hence after initial creation the
> chattr +/-x commands do what they advertise regardless of the mount
> option value.
>
> Yes, it means that existing users are going to have to run chattr -R
> +x on their pmem filesystems to get the inode flags on disk, but
> this is all tagged with EXPERIMENTAL and this is the sort of change
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> that is expected from experimental functionality.
Right.
-Eric
> Cheers,
>
> Dave.
>
On Tue, Sep 26, 2017 at 7:33 AM, Christoph Hellwig <[email protected]> wrote:
> On Tue, Sep 26, 2017 at 06:59:37AM -0700, Dan Williams wrote:
>> > I think you probably want an IOCB_DAX flag to check IS_DAX once and
>> > then stick to it, similar to what we do for direct I/O.
>>
>> I wonder if this works better with a reference count mechanism
>> per-file so that we don't need a hold a lock over the whole
>> transition. Similar to request_queue reference counting, when DAX is
>> being turned off we block new references and drain the in-flight ones.
>
> Maybe. But that assumes we want to be stuck in a perpetual binary
> DAX on/off state on a given file. Which makes not only for an awkward
> interface (inode or mount flag), but also might be fundamentally the
> wrong thing to do for some media where you'd happily read directly
> from it but rather buffer writes in DRAM.
I think we'll always need an explicit override available, but yes we
need to think about what the override looks like in the context of a
kernel that is able to automatically pick the right I/O policy
relative to the media type. A potential mixed policy for reads vs
writes makes sense. Where would this finer grained I/O policy
selection go other than more inode flags?
On Tue, Sep 26, 2017 at 09:38:12AM +1000, Dave Chinner wrote:
> On Mon, Sep 25, 2017 at 05:13:58PM -0600, Ross Zwisler wrote:
> > Before support for the per-inode DAX flag was disabled the XFS the code had
> > an issue where the user couldn't reliably tell whether or not DAX was being
> > used to service page faults and I/O when the DAX mount option was used. In
> > this case each inode within the mounted filesystem started with S_DAX set
> > due to the mount option, but it could be cleared if someone touched the
> > individual inode flag.
> >
> > For example (v4.13 and before):
> >
> > # mount | grep dax
> > /dev/pmem0 on /mnt type xfs
> > (rw,relatime,seclabel,attr2,dax,inode64,sunit=4096,swidth=4096,noquota)
> >
> > # touch /mnt/a /mnt/b # both files currently use DAX
> >
> > # xfs_io -c "lsattr" /mnt/* # neither has the DAX inode option set
> > ----------e----- /mnt/a
> > ----------e----- /mnt/b
> >
> > # xfs_io -c "chattr -x" /mnt/a # this clears S_DAX for /mnt/a
> >
> > # xfs_io -c "lsattr" /mnt/*
> > ----------e----- /mnt/a
> > ----------e----- /mnt/b
>
> That's really a bug in the lsattr code, yes? If we've cleared the
> S_DAX flag for the inode, then why is it being reported in lsattr?
> Or if we failed to clear the S_DAX flag in the 'chattr -x' call,
> then isn't that the bug that needs fixing?
No, I think lsattr/chattr are working correctly. In both the examples above
the DAX inode flag (which is represeted by an 'x') is never set. S_DAX is the
in-memory inode flag (not the on-media inode flag) which is not manipulated
directly by lsattr/chattr, but instead reflects whether the inode is actually
using DAX or not.
Manipulating and displaying the on-media inode flag works as expected with
lsattr/chattr:
# xfs_io -c "lsattr" ./a
---------------- ./a
# xfs_io -c "chattr +x" ./a
# xfs_io -c "lsattr" ./a
--------------x- ./a
- Ross
On Mon, Sep 25, 2017 at 04:38:45PM -0700, Dan Williams wrote:
> On Mon, Sep 25, 2017 at 4:14 PM, Ross Zwisler
> <[email protected]> wrote:
> > When mappings are created the vma->vm_flags that they use vary based on
> > whether the inode being mapped is using DAX or not. This setup happens in
> > XFS via mmap_region()=>call_mmap()=>xfs_file_mmap().
> >
> > For us to be able to safely use the DAX per-inode flag we need to prevent
> > S_DAX transitions when any mappings are present, and we will do that by
> > looking at the address_space->i_mmap tree and returning -EBUSY if any
> > mappings are present.
> >
> > Unfortunately at the time that the filesystem's file_operations->mmap()
> > entry point is called the mapping has not yet been added to the
> > address_space->i_mmap tree. This means that at that point in time we
> > cannot determine whether or not the mapping will be set up to support DAX.
> >
> > Fix this by adding a new file_operations entry called post_mmap() which is
> > called after the mapping has been added to the address_space->i_mmap tree.
> > This post_mmap() op now happens at a time when we can be sure whether the
> > mapping will use DAX or not, and we can set up the vma->vm_flags
> > appropriately.
> >
> > Signed-off-by: Ross Zwisler <[email protected]>
> > ---
> > fs/xfs/xfs_file.c | 15 ++++++++++++++-
> > include/linux/fs.h | 1 +
> > mm/mmap.c | 2 ++
> > 3 files changed, 17 insertions(+), 1 deletion(-)
> >
> > diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> > index 2816858..9d66aaa 100644
> > --- a/fs/xfs/xfs_file.c
> > +++ b/fs/xfs/xfs_file.c
> > @@ -1087,9 +1087,21 @@ xfs_file_mmap(
> > {
> > file_accessed(filp);
> > vma->vm_ops = &xfs_file_vm_ops;
> > + return 0;
> > +}
> > +
> > +/* This call happens during mmap(), after the vma has been inserted into the
> > + * inode->i_mapping->i_mmap tree. At this point the decision on whether or
> > + * not to use DAX for this mapping has been set and will not change for the
> > + * duration of the mapping.
> > + */
> > +STATIC void
> > +xfs_file_post_mmap(
> > + struct file *filp,
> > + struct vm_area_struct *vma)
> > +{
> > if (IS_DAX(file_inode(filp)))
> > vma->vm_flags |= VM_MIXEDMAP | VM_HUGEPAGE;
>
> It's not clear to me what this is actually protecting? vma_is_dax()
> returns true regardless of the vm_flags state , so what is the benefit
> to delaying the vm_flags setting to ->post_mmap()?
Right, but the point is that until the vma has been inserted into the
inode->i_mapping->i_mmap tree, the results of IS_DAX() don't matter because it
can still change. Until this insertion happens we cannot know whether or not
we should set up the vma->vm_flags to support DAX mappings (i.e. have
VM_MIXEDMAP and VM_HUGEPAGE set). This decision can only be made (in this
proposed scheme) *after* the inode->i_mapping->i_mmap tree has been
populated, which means we need another call into the filesystem after this
insertion has happened.
We don't want to mess with the existing file_operations->mmap() call because
in many filesystems that does sanity checking and setup that you really want
to have happen *before* the mapping is completed and inserted into the
inode->i_mapping->i_mmap tree.
> Also, why is this a file_operation and not a vm_operation?
Because ->mmap() is also a file_operation, and this is an analogous call from
the mmap code that needs to happen at a different time. Or are you suggesting
that file_operations->mmap() should be moved to be a vm_operation? If not,
why would one be in one operations table and one in another?
On Tue, Sep 26, 2017 at 08:36:11AM +0200, Christoph Hellwig wrote:
> On Mon, Sep 25, 2017 at 05:14:04PM -0600, Ross Zwisler wrote:
> > Re-enable the XFS per-inode DAX flag, preventing S_DAX from changing when
> > any mappings are present.
>
> Before we re-enable it please come up with a coherent description
> of the per-inode DAX flag that makes sense to a user. We'll also need
> to find a good place to document it, e.g. a new ioctl_setflags man
> page.
I agree that documentation is a great place to start, if we can just agree on
what we want the functionality to be. :)
On Tue, Sep 26, 2017 at 11:57 AM, Ross Zwisler
<[email protected]> wrote:
> On Mon, Sep 25, 2017 at 04:38:45PM -0700, Dan Williams wrote:
>> On Mon, Sep 25, 2017 at 4:14 PM, Ross Zwisler
>> <[email protected]> wrote:
>> > When mappings are created the vma->vm_flags that they use vary based on
>> > whether the inode being mapped is using DAX or not. This setup happens in
>> > XFS via mmap_region()=>call_mmap()=>xfs_file_mmap().
>> >
>> > For us to be able to safely use the DAX per-inode flag we need to prevent
>> > S_DAX transitions when any mappings are present, and we will do that by
>> > looking at the address_space->i_mmap tree and returning -EBUSY if any
>> > mappings are present.
>> >
>> > Unfortunately at the time that the filesystem's file_operations->mmap()
>> > entry point is called the mapping has not yet been added to the
>> > address_space->i_mmap tree. This means that at that point in time we
>> > cannot determine whether or not the mapping will be set up to support DAX.
>> >
>> > Fix this by adding a new file_operations entry called post_mmap() which is
>> > called after the mapping has been added to the address_space->i_mmap tree.
>> > This post_mmap() op now happens at a time when we can be sure whether the
>> > mapping will use DAX or not, and we can set up the vma->vm_flags
>> > appropriately.
>> >
>> > Signed-off-by: Ross Zwisler <[email protected]>
>> > ---
>> > fs/xfs/xfs_file.c | 15 ++++++++++++++-
>> > include/linux/fs.h | 1 +
>> > mm/mmap.c | 2 ++
>> > 3 files changed, 17 insertions(+), 1 deletion(-)
>> >
>> > diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
>> > index 2816858..9d66aaa 100644
>> > --- a/fs/xfs/xfs_file.c
>> > +++ b/fs/xfs/xfs_file.c
>> > @@ -1087,9 +1087,21 @@ xfs_file_mmap(
>> > {
>> > file_accessed(filp);
>> > vma->vm_ops = &xfs_file_vm_ops;
>> > + return 0;
>> > +}
>> > +
>> > +/* This call happens during mmap(), after the vma has been inserted into the
>> > + * inode->i_mapping->i_mmap tree. At this point the decision on whether or
>> > + * not to use DAX for this mapping has been set and will not change for the
>> > + * duration of the mapping.
>> > + */
>> > +STATIC void
>> > +xfs_file_post_mmap(
>> > + struct file *filp,
>> > + struct vm_area_struct *vma)
>> > +{
>> > if (IS_DAX(file_inode(filp)))
>> > vma->vm_flags |= VM_MIXEDMAP | VM_HUGEPAGE;
>>
>> It's not clear to me what this is actually protecting? vma_is_dax()
>> returns true regardless of the vm_flags state , so what is the benefit
>> to delaying the vm_flags setting to ->post_mmap()?
>
> Right, but the point is that until the vma has been inserted into the
> inode->i_mapping->i_mmap tree, the results of IS_DAX() don't matter because it
> can still change. Until this insertion happens we cannot know whether or not
> we should set up the vma->vm_flags to support DAX mappings (i.e. have
> VM_MIXEDMAP and VM_HUGEPAGE set).
Those flags are not DAX flags. The side effect of these being set on
non-DAX mappings is that we effectively auto madvise(MADV_HUGEPAGE)
and enable some page-less insertion paths. Both of those are
effectively no-ops for normal mappings since normal mappings always
have an associated struct page and the THP policy these days is
usually "always".
> This decision can only be made (in this
> proposed scheme) *after* the inode->i_mapping->i_mmap tree has been
> populated, which means we need another call into the filesystem after this
> insertion has happened.
I get that, but it seems over-engineered and something that can also
be safely cleaned up after the fact by the code path that is disabling
DAX.
> We don't want to mess with the existing file_operations->mmap() call because
> in many filesystems that does sanity checking and setup that you really want
> to have happen *before* the mapping is completed and inserted into the
> inode->i_mapping->i_mmap tree.
>
>> Also, why is this a file_operation and not a vm_operation?
>
> Because ->mmap() is also a file_operation, and this is an analogous call from
> the mmap code that needs to happen at a different time. Or are you suggesting
> that file_operations->mmap() should be moved to be a vm_operation? If not,
> why would one be in one operations table and one in another?
Growing something as widely used as file_operations for this one-off
fixup feels like overkill. vm_operations is not much better, but it at
least constrains the data structure growth to something closer to the
problem space.
On Tue, Sep 26, 2017 at 11:30:57AM -0600, Ross Zwisler wrote:
> On Tue, Sep 26, 2017 at 04:37:43PM +0200, Christoph Hellwig wrote:
> > On Tue, Sep 26, 2017 at 09:09:57PM +1000, Dave Chinner wrote:
> > > Well, quite frankly, I never wanted the mount option for XFS. It was
> > > supposed to be for initial testing only, then we'd /always/ use the
> > > the inode flags. For a filesystem to default to using DAX, we
> > > set the DAX flag on the root inode at mkfs time, and then everything
> > > inode flag based just works.
> >
> > And I deeply fundamentally disagree. The mount option is a nice
> > enough big hammer to try a mode without encoding nitty gritty details
> > into the application ABI.
> >
> > The per-inode persistent flag is the biggest nightmare ever, as we see
> > in all these discussions about it.
> >
> > What does it even mean? Right now it forces direct addressing as long
> > as the underlying media supports that. But what about media that
> > you directly access but you really don't want to because it's really slow?
> > Or media that is so god damn fast that you never want to buffer? Or
> > media where you want to buffer for writes (or at least some of them)
> > but not for reads?
> >
> > It encodes a very specific mechanism for an early direct access
> > implementation into the ABI. What we really need is for applications
> > to declare an intent, not specify a particular mechanism.
>
> I agree that Christoph's idea about having the system intelligently adjust to
> use DAX based on performance information it gathers about the underlying
> persistent memory (probably via the HMAT on x86_64 systems) is interesting,
> but I think we're still a ways away from that.
>
> FWIW, as my patches suggest and Jan observed I think that we should allow
> users to turn on DAX by treating the inode flag and the mount flag as an 'or'
> operation. i.e. you get DAX if either the mount option is specified or if the
> inode flag is set, and you can continue to manipulate the per-inode flag as
> you want regardless of the mount option. I think this provides maximum
> flexibility of the mechanism to select DAX without enforcing policy.
>
> In the end, though, I think what's really important is that we figure out what
> the various options mean, have the same story for both XFS and ext4, and
> document it as hch suggested in response to my patch 7 in this series.
Agreed. We have a fundamental conflict between letting the sysadmin or
user decide how they want an inode to behave vs. letting the kernel make
all the decisions based on whatever information it gathers.
I'm pulled this patch out of -fixes and for-next because I feel strongly
discouraged about taking any more patches that change the user-control
parts of the DAX implementation until we reach a consensus on what to
do.
Given that DAX and pmem support in filesystems is still experimental,
I'm open to changing the interface as needed. Where do we think we'll
be in a few years once ACPI or whatever reaches the point of being able
to tell the kernel about the general performance characteristics of the
pmem? What choices about the interface do we need to make now so that
we can get there while minimizing the number of insufficient interfaces
to deprecate?
My personal guess is that most programs will not care enough to want to
make a syscall so we might as well give them the most performant option
available.
Roughly speaking, here are the use cases I can think of:
* Regular buffered read/write: we can let the kernel decide if it wants
to push the IO through the page cache, directly access the pmem, or
some future combination of the two.
* O_DIRECT read/write: Straight to pmem.
* Regular mmap: This seems fairly agnostic to how we actually make the
memory mapping work, right?
* MAP_DIRECT/MAP_SYNC mmap: If userspace actually goes to the trouble
of making sure the whole range is allocated and pre-written and uses
these flags then they get direct access.
I've wondered off and on if an acceptable solution is to define a number
of things surrounding an inode for which XFS /could/ optimize, and let
the user tell us which one thing matters most to them: total manual
control over everything like we do now, sequential io, random io,
fastest mmap access possible, most direct access to storage, etc.
If you set a hint other than full manual control then XFS reserves the
right to change inode flags at any time to satisfy the hint.
For the most part I'm in favor of Christoph's suggestion to let the
kernel decide on its own, and I don't see the point in encoding details
of the storage medium access strategy on the disk, particularly since
filesystems are supposed to be fairly independent of storage. But
frankly, so many people have asked me over the years if there's some way
to influence the decision-making that I won't quite let go of file hints
as a way to influence the decisions XFS makes around storage media.
> Does it make sense at this point to just start a "dax" man page that can
> contain info about the mount options, inode flags, kernel config options, how
> to get PMDs, etc? Or does this documentation need to be sprinkled around more
> in existing man pages?
Personally it'd be a lot easier to tell internal groups to go look at a
single documentation page that discusses everything you'd want to know
about enabling dax -- how to control it, how to make large page table
entries work, etc. Some of those things will get into fs internals,
however, which probably belong in the ext4/xfs manpages. I suggest
laying out the general details in a single dax manpage and pointing
people at each fs's documentation for specific details.
--D
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
On Tue, Sep 26, 2017 at 12:19:21PM -0700, Dan Williams wrote:
> On Tue, Sep 26, 2017 at 11:57 AM, Ross Zwisler
<>
> > This decision can only be made (in this
> > proposed scheme) *after* the inode->i_mapping->i_mmap tree has been
> > populated, which means we need another call into the filesystem after this
> > insertion has happened.
>
> I get that, but it seems over-engineered and something that can also
> be safely cleaned up after the fact by the code path that is disabling
> DAX.
I don't think you can safely clean it up after the fact because some thread
might have already called ->mmap() to set up the vma->vm_flags for their new
mapping, but they haven't added it to inode->i_mapping->i_mmap.
The inode->i_mapping->i_mmap tree is the only way (that I know of at least)
that the filesystem has any idea about about the mapping. This is the method
by which we would try and clean up mapping flags, if we were to do so, and
it's the only way that the filesystem can know whether or not mappings exist.
The only way that I could think of to make this safely work is to have the
insertion into the inode->i_mapping->i_mmap tree be our sync point. After
that the filesystem and the mapping code can communicate on the state of DAX,
but before that I think it's basically indeterminate.
On Tue, Sep 26, 2017 at 2:06 PM, Ross Zwisler
<[email protected]> wrote:
> On Tue, Sep 26, 2017 at 12:19:21PM -0700, Dan Williams wrote:
>> On Tue, Sep 26, 2017 at 11:57 AM, Ross Zwisler
> <>
>> > This decision can only be made (in this
>> > proposed scheme) *after* the inode->i_mapping->i_mmap tree has been
>> > populated, which means we need another call into the filesystem after this
>> > insertion has happened.
>>
>> I get that, but it seems over-engineered and something that can also
>> be safely cleaned up after the fact by the code path that is disabling
>> DAX.
>
> I don't think you can safely clean it up after the fact because some thread
> might have already called ->mmap() to set up the vma->vm_flags for their new
> mapping, but they haven't added it to inode->i_mapping->i_mmap.
If madvise(MADV_NOHUGEPAGE) can dynamically change vm_flags, then the
DAX disable path can as well. VM_MIXEDMAP looks to be a nop for normal
memory mappings.
> The inode->i_mapping->i_mmap tree is the only way (that I know of at least)
> that the filesystem has any idea about about the mapping. This is the method
> by which we would try and clean up mapping flags, if we were to do so, and
> it's the only way that the filesystem can know whether or not mappings exist.
>
> The only way that I could think of to make this safely work is to have the
> insertion into the inode->i_mapping->i_mmap tree be our sync point. After
> that the filesystem and the mapping code can communicate on the state of DAX,
> but before that I think it's basically indeterminate.
If we lose the race and leak VM_HUGEPAGE to a non-DAX mapping what
breaks? I'd rather be in favor of not setting VM_HUGEPAGE at all in
the ->mmap() handler and let the default THP policy take over. In
fact, see transparent_hugepage_enabled() we already auto-enable huge
page support for dax mappings regardless of VM_HUGEPAGE.
On Tue, Sep 26, 2017 at 12:48:30PM -0700, Darrick J. Wong wrote:
> For the most part I'm in favor of Christoph's suggestion to let the
> kernel decide on its own, and I don't see the point in encoding details
> of the storage medium access strategy on the disk, particularly since
> filesystems are supposed to be fairly independent of storage. But
> frankly, so many people have asked me over the years if there's some way
> to influence the decision-making that I won't quite let go of file hints
> as a way to influence the decisions XFS makes around storage media.
And that's pretty much it. The discussion here is not about whether
there should be a flag, but what semantics it should have when the
flag is not set. If "flag not set" means "kernel selects
automatically", then that's fine by me.
But history tells us that users and admins want a way to be able to
override the kernel's automatic behaviours because they are /never
100% correct/ for everyone. There are always exceptions, otherwise
we wouldn't have the great plethora of mkfs, mount, proc and sysfs
options for our filesystems or storage. Anyone who says "the kernel
will always do the right thing for everyone automatically" is living
in a dream world.
Note: I agree that the kernel should do the right thing w.r.t. DAX
automatically. We don't need a mount option for that - we can probe
for dax support automatically and use it automatically already.
However, in a world where the kernel automatically uses that
functionality when it is present, admins and users need a way to
solve the "default behaviour is bad for me, let me control this
manually" problem. That's where the inode flags come in....
i.e. What I'm advocating is a model DAX gets enabled automatically
if the underlying device supports is using whatever the kernel
thinks is optimal at the time the access is made, but the user can
override/direct behvaiour on a case by case basis via persistent
inode flags/xattrs/whatever.
Cheers,
Dave.
--
Dave Chinner
[email protected]
On Tue, Sep 26, 2017 at 11:30:57AM -0600, Ross Zwisler wrote:
> I agree that Christoph's idea about having the system intelligently adjust to
> use DAX based on performance information it gathers about the underlying
> persistent memory (probably via the HMAT on x86_64 systems) is interesting,
> but I think we're still a ways away from that.
So what are the missing blockers for a getting started?
> FWIW, as my patches suggest and Jan observed I think that we should allow
> users to turn on DAX by treating the inode flag and the mount flag as an 'or'
> operation. i.e. you get DAX if either the mount option is specified or if the
> inode flag is set, and you can continue to manipulate the per-inode flag as
> you want regardless of the mount option. I think this provides maximum
> flexibility of the mechanism to select DAX without enforcing policy.
IFF we stick to the dax flag that's the only workable way. The only
major issue I still see with that is that this allows unprivilegued
users to enable DAX on a any file they own / have write access to.
So there isn't really any way to effectively disable the DAX path
by the sysadmin.
> Does it make sense at this point to just start a "dax" man page that can
> contain info about the mount options, inode flags, kernel config options, how
> to get PMDs, etc? Or does this documentation need to be sprinkled around more
> in existing man pages?
A dax manpage would be good.
On Tue 26-09-17 14:41:53, Dan Williams wrote:
> On Tue, Sep 26, 2017 at 2:06 PM, Ross Zwisler
> <[email protected]> wrote:
> > On Tue, Sep 26, 2017 at 12:19:21PM -0700, Dan Williams wrote:
> >> On Tue, Sep 26, 2017 at 11:57 AM, Ross Zwisler
> > <>
> >> > This decision can only be made (in this
> >> > proposed scheme) *after* the inode->i_mapping->i_mmap tree has been
> >> > populated, which means we need another call into the filesystem after this
> >> > insertion has happened.
> >>
> >> I get that, but it seems over-engineered and something that can also
> >> be safely cleaned up after the fact by the code path that is disabling
> >> DAX.
> >
> > I don't think you can safely clean it up after the fact because some thread
> > might have already called ->mmap() to set up the vma->vm_flags for their new
> > mapping, but they haven't added it to inode->i_mapping->i_mmap.
>
> If madvise(MADV_NOHUGEPAGE) can dynamically change vm_flags, then the
> DAX disable path can as well. VM_MIXEDMAP looks to be a nop for normal
> memory mappings.
>
> > The inode->i_mapping->i_mmap tree is the only way (that I know of at least)
> > that the filesystem has any idea about about the mapping. This is the method
> > by which we would try and clean up mapping flags, if we were to do so, and
> > it's the only way that the filesystem can know whether or not mappings exist.
> >
> > The only way that I could think of to make this safely work is to have the
> > insertion into the inode->i_mapping->i_mmap tree be our sync point. After
> > that the filesystem and the mapping code can communicate on the state of DAX,
> > but before that I think it's basically indeterminate.
>
> If we lose the race and leak VM_HUGEPAGE to a non-DAX mapping what
> breaks? I'd rather be in favor of not setting VM_HUGEPAGE at all in
> the ->mmap() handler and let the default THP policy take over. In
> fact, see transparent_hugepage_enabled() we already auto-enable huge
> page support for dax mappings regardless of VM_HUGEPAGE.
Hum, this is an interesting option. So do you suggest that filesystems
supporting DAX would always setup mappings with VM_MIXEDMAP and without
VM_HUGEPAGE and thus we'd get rid of dependency on S_DAX flag in ->mmap?
That could actually work. The only possible issue I can see is that
VM_MIXEDMAP is still slightly different from normal page mappings and it
could have some performance implications - e.g. copy_page_range() does more
work on VM_MIXEDMAP mappings but not on normal page mappings.
Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR
On Wed, Sep 27, 2017 at 4:35 AM, Jan Kara <[email protected]> wrote:
> On Tue 26-09-17 14:41:53, Dan Williams wrote:
>> On Tue, Sep 26, 2017 at 2:06 PM, Ross Zwisler
>> <[email protected]> wrote:
>> > On Tue, Sep 26, 2017 at 12:19:21PM -0700, Dan Williams wrote:
>> >> On Tue, Sep 26, 2017 at 11:57 AM, Ross Zwisler
>> > <>
>> >> > This decision can only be made (in this
>> >> > proposed scheme) *after* the inode->i_mapping->i_mmap tree has been
>> >> > populated, which means we need another call into the filesystem after this
>> >> > insertion has happened.
>> >>
>> >> I get that, but it seems over-engineered and something that can also
>> >> be safely cleaned up after the fact by the code path that is disabling
>> >> DAX.
>> >
>> > I don't think you can safely clean it up after the fact because some thread
>> > might have already called ->mmap() to set up the vma->vm_flags for their new
>> > mapping, but they haven't added it to inode->i_mapping->i_mmap.
>>
>> If madvise(MADV_NOHUGEPAGE) can dynamically change vm_flags, then the
>> DAX disable path can as well. VM_MIXEDMAP looks to be a nop for normal
>> memory mappings.
>>
>> > The inode->i_mapping->i_mmap tree is the only way (that I know of at least)
>> > that the filesystem has any idea about about the mapping. This is the method
>> > by which we would try and clean up mapping flags, if we were to do so, and
>> > it's the only way that the filesystem can know whether or not mappings exist.
>> >
>> > The only way that I could think of to make this safely work is to have the
>> > insertion into the inode->i_mapping->i_mmap tree be our sync point. After
>> > that the filesystem and the mapping code can communicate on the state of DAX,
>> > but before that I think it's basically indeterminate.
>>
>> If we lose the race and leak VM_HUGEPAGE to a non-DAX mapping what
>> breaks? I'd rather be in favor of not setting VM_HUGEPAGE at all in
>> the ->mmap() handler and let the default THP policy take over. In
>> fact, see transparent_hugepage_enabled() we already auto-enable huge
>> page support for dax mappings regardless of VM_HUGEPAGE.
>
> Hum, this is an interesting option. So do you suggest that filesystems
> supporting DAX would always setup mappings with VM_MIXEDMAP and without
> VM_HUGEPAGE and thus we'd get rid of dependency on S_DAX flag in ->mmap?
> That could actually work. The only possible issue I can see is that
> VM_MIXEDMAP is still slightly different from normal page mappings and it
> could have some performance implications - e.g. copy_page_range() does more
> work on VM_MIXEDMAP mappings but not on normal page mappings.
We can also get rid of VM_MIXEDMAP if we disable DAX in the
!pfn_t_has_page() case.
On Wed 27-09-17 07:00:53, Dan Williams wrote:
> On Wed, Sep 27, 2017 at 4:35 AM, Jan Kara <[email protected]> wrote:
> > On Tue 26-09-17 14:41:53, Dan Williams wrote:
> >> On Tue, Sep 26, 2017 at 2:06 PM, Ross Zwisler
> >> <[email protected]> wrote:
> >> > On Tue, Sep 26, 2017 at 12:19:21PM -0700, Dan Williams wrote:
> >> >> On Tue, Sep 26, 2017 at 11:57 AM, Ross Zwisler
> >> > <>
> >> >> > This decision can only be made (in this
> >> >> > proposed scheme) *after* the inode->i_mapping->i_mmap tree has been
> >> >> > populated, which means we need another call into the filesystem after this
> >> >> > insertion has happened.
> >> >>
> >> >> I get that, but it seems over-engineered and something that can also
> >> >> be safely cleaned up after the fact by the code path that is disabling
> >> >> DAX.
> >> >
> >> > I don't think you can safely clean it up after the fact because some thread
> >> > might have already called ->mmap() to set up the vma->vm_flags for their new
> >> > mapping, but they haven't added it to inode->i_mapping->i_mmap.
> >>
> >> If madvise(MADV_NOHUGEPAGE) can dynamically change vm_flags, then the
> >> DAX disable path can as well. VM_MIXEDMAP looks to be a nop for normal
> >> memory mappings.
> >>
> >> > The inode->i_mapping->i_mmap tree is the only way (that I know of at least)
> >> > that the filesystem has any idea about about the mapping. This is the method
> >> > by which we would try and clean up mapping flags, if we were to do so, and
> >> > it's the only way that the filesystem can know whether or not mappings exist.
> >> >
> >> > The only way that I could think of to make this safely work is to have the
> >> > insertion into the inode->i_mapping->i_mmap tree be our sync point. After
> >> > that the filesystem and the mapping code can communicate on the state of DAX,
> >> > but before that I think it's basically indeterminate.
> >>
> >> If we lose the race and leak VM_HUGEPAGE to a non-DAX mapping what
> >> breaks? I'd rather be in favor of not setting VM_HUGEPAGE at all in
> >> the ->mmap() handler and let the default THP policy take over. In
> >> fact, see transparent_hugepage_enabled() we already auto-enable huge
> >> page support for dax mappings regardless of VM_HUGEPAGE.
> >
> > Hum, this is an interesting option. So do you suggest that filesystems
> > supporting DAX would always setup mappings with VM_MIXEDMAP and without
> > VM_HUGEPAGE and thus we'd get rid of dependency on S_DAX flag in ->mmap?
> > That could actually work. The only possible issue I can see is that
> > VM_MIXEDMAP is still slightly different from normal page mappings and it
> > could have some performance implications - e.g. copy_page_range() does more
> > work on VM_MIXEDMAP mappings but not on normal page mappings.
>
> We can also get rid of VM_MIXEDMAP if we disable DAX in the
> !pfn_t_has_page() case.
Yeah, although it would be a pity to require struct page just to avoid
having to set VM_MIXEDMAP flag...
Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR
On Wed, Sep 27, 2017 at 8:07 AM, Jan Kara <[email protected]> wrote:
> On Wed 27-09-17 07:00:53, Dan Williams wrote:
>> On Wed, Sep 27, 2017 at 4:35 AM, Jan Kara <[email protected]> wrote:
>> > On Tue 26-09-17 14:41:53, Dan Williams wrote:
>> >> On Tue, Sep 26, 2017 at 2:06 PM, Ross Zwisler
>> >> <[email protected]> wrote:
>> >> > On Tue, Sep 26, 2017 at 12:19:21PM -0700, Dan Williams wrote:
>> >> >> On Tue, Sep 26, 2017 at 11:57 AM, Ross Zwisler
>> >> > <>
>> >> >> > This decision can only be made (in this
>> >> >> > proposed scheme) *after* the inode->i_mapping->i_mmap tree has been
>> >> >> > populated, which means we need another call into the filesystem after this
>> >> >> > insertion has happened.
>> >> >>
>> >> >> I get that, but it seems over-engineered and something that can also
>> >> >> be safely cleaned up after the fact by the code path that is disabling
>> >> >> DAX.
>> >> >
>> >> > I don't think you can safely clean it up after the fact because some thread
>> >> > might have already called ->mmap() to set up the vma->vm_flags for their new
>> >> > mapping, but they haven't added it to inode->i_mapping->i_mmap.
>> >>
>> >> If madvise(MADV_NOHUGEPAGE) can dynamically change vm_flags, then the
>> >> DAX disable path can as well. VM_MIXEDMAP looks to be a nop for normal
>> >> memory mappings.
>> >>
>> >> > The inode->i_mapping->i_mmap tree is the only way (that I know of at least)
>> >> > that the filesystem has any idea about about the mapping. This is the method
>> >> > by which we would try and clean up mapping flags, if we were to do so, and
>> >> > it's the only way that the filesystem can know whether or not mappings exist.
>> >> >
>> >> > The only way that I could think of to make this safely work is to have the
>> >> > insertion into the inode->i_mapping->i_mmap tree be our sync point. After
>> >> > that the filesystem and the mapping code can communicate on the state of DAX,
>> >> > but before that I think it's basically indeterminate.
>> >>
>> >> If we lose the race and leak VM_HUGEPAGE to a non-DAX mapping what
>> >> breaks? I'd rather be in favor of not setting VM_HUGEPAGE at all in
>> >> the ->mmap() handler and let the default THP policy take over. In
>> >> fact, see transparent_hugepage_enabled() we already auto-enable huge
>> >> page support for dax mappings regardless of VM_HUGEPAGE.
>> >
>> > Hum, this is an interesting option. So do you suggest that filesystems
>> > supporting DAX would always setup mappings with VM_MIXEDMAP and without
>> > VM_HUGEPAGE and thus we'd get rid of dependency on S_DAX flag in ->mmap?
>> > That could actually work. The only possible issue I can see is that
>> > VM_MIXEDMAP is still slightly different from normal page mappings and it
>> > could have some performance implications - e.g. copy_page_range() does more
>> > work on VM_MIXEDMAP mappings but not on normal page mappings.
>>
>> We can also get rid of VM_MIXEDMAP if we disable DAX in the
>> !pfn_t_has_page() case.
>
> Yeah, although it would be a pity to require struct page just to avoid
> having to set VM_MIXEDMAP flag...
Yes, but the real motivation is fixing all the basic things that break
without struct page, like ptrace and direct-i/o. The removal of
needing to set VM_MIXEDMAP is just a nice side effect. I'll send a
patch because DAX without pages has too many surprise failure cases.
On Wed, Sep 27, 2017 at 01:35:27PM +0200, Jan Kara wrote:
> On Tue 26-09-17 14:41:53, Dan Williams wrote:
> > On Tue, Sep 26, 2017 at 2:06 PM, Ross Zwisler
> > <[email protected]> wrote:
> > > On Tue, Sep 26, 2017 at 12:19:21PM -0700, Dan Williams wrote:
> > >> On Tue, Sep 26, 2017 at 11:57 AM, Ross Zwisler
> > > <>
> > >> > This decision can only be made (in this
> > >> > proposed scheme) *after* the inode->i_mapping->i_mmap tree has been
> > >> > populated, which means we need another call into the filesystem after this
> > >> > insertion has happened.
> > >>
> > >> I get that, but it seems over-engineered and something that can also
> > >> be safely cleaned up after the fact by the code path that is disabling
> > >> DAX.
> > >
> > > I don't think you can safely clean it up after the fact because some thread
> > > might have already called ->mmap() to set up the vma->vm_flags for their new
> > > mapping, but they haven't added it to inode->i_mapping->i_mmap.
> >
> > If madvise(MADV_NOHUGEPAGE) can dynamically change vm_flags, then the
> > DAX disable path can as well. VM_MIXEDMAP looks to be a nop for normal
> > memory mappings.
> >
> > > The inode->i_mapping->i_mmap tree is the only way (that I know of at least)
> > > that the filesystem has any idea about about the mapping. This is the method
> > > by which we would try and clean up mapping flags, if we were to do so, and
> > > it's the only way that the filesystem can know whether or not mappings exist.
> > >
> > > The only way that I could think of to make this safely work is to have the
> > > insertion into the inode->i_mapping->i_mmap tree be our sync point. After
> > > that the filesystem and the mapping code can communicate on the state of DAX,
> > > but before that I think it's basically indeterminate.
> >
> > If we lose the race and leak VM_HUGEPAGE to a non-DAX mapping what
> > breaks? I'd rather be in favor of not setting VM_HUGEPAGE at all in
> > the ->mmap() handler and let the default THP policy take over. In
> > fact, see transparent_hugepage_enabled() we already auto-enable huge
> > page support for dax mappings regardless of VM_HUGEPAGE.
>
> Hum, this is an interesting option. So do you suggest that filesystems
> supporting DAX would always setup mappings with VM_MIXEDMAP and without
> VM_HUGEPAGE and thus we'd get rid of dependency on S_DAX flag in ->mmap?
> That could actually work. The only possible issue I can see is that
> VM_MIXEDMAP is still slightly different from normal page mappings and it
> could have some performance implications - e.g. copy_page_range() does more
> work on VM_MIXEDMAP mappings but not on normal page mappings.
It looks like having VM_MIXEDMAP always set for filesystems that support DAX
might affect their memory's NUMA migration in the non-DAX case?
8e76d4e sched, numa: do not hint for NUMA balancing on VM_MIXEDMAP mappings
On Wed, Sep 27, 2017 at 8:39 AM, Ross Zwisler
<[email protected]> wrote:
> On Wed, Sep 27, 2017 at 01:35:27PM +0200, Jan Kara wrote:
[..]
>> Hum, this is an interesting option. So do you suggest that filesystems
>> supporting DAX would always setup mappings with VM_MIXEDMAP and without
>> VM_HUGEPAGE and thus we'd get rid of dependency on S_DAX flag in ->mmap?
>> That could actually work. The only possible issue I can see is that
>> VM_MIXEDMAP is still slightly different from normal page mappings and it
>> could have some performance implications - e.g. copy_page_range() does more
>> work on VM_MIXEDMAP mappings but not on normal page mappings.
>
> It looks like having VM_MIXEDMAP always set for filesystems that support DAX
> might affect their memory's NUMA migration in the non-DAX case?
>
> 8e76d4e sched, numa: do not hint for NUMA balancing on VM_MIXEDMAP mappings
Addressed separately here:
c1ef8e2c0235 mm: disable numa migration faults for dax vmas
On Tue, Sep 26, 2017 at 11:40:01PM -0700, Christoph Hellwig wrote:
> On Tue, Sep 26, 2017 at 11:30:57AM -0600, Ross Zwisler wrote:
> > I agree that Christoph's idea about having the system intelligently adjust to
> > use DAX based on performance information it gathers about the underlying
> > persistent memory (probably via the HMAT on x86_64 systems) is interesting,
> > but I think we're still a ways away from that.
>
> So what are the missing blockers for a getting started?
Well, I don't know if platforms that support HMAT + PMEM are widely available,
but we have all the details in the ACPI spec, so we could begin to code it up
and things will "just work" when platforms arrive.
> > FWIW, as my patches suggest and Jan observed I think that we should allow
> > users to turn on DAX by treating the inode flag and the mount flag as an 'or'
> > operation. i.e. you get DAX if either the mount option is specified or if the
> > inode flag is set, and you can continue to manipulate the per-inode flag as
> > you want regardless of the mount option. I think this provides maximum
> > flexibility of the mechanism to select DAX without enforcing policy.
>
> IFF we stick to the dax flag that's the only workable way. The only
> major issue I still see with that is that this allows unprivilegued
> users to enable DAX on a any file they own / have write access to.
> So there isn't really any way to effectively disable the DAX path
> by the sysadmin.
Hum, I wonder if maybe we need/want three different mount modes? What about:
autodax (the default): the filesystem is free to use DAX or not, as it sees
fit and thinks is optimal. For the time being we can make this mean "don't
use DAX", and phase in DAX usage as we add support for the HMAT, etc.
Users can manually turn on DAX for a given inode by setting the DAX inode
flag, but there is no way for the user to *prevent* DAX for an inode - the
kernel can always choose to turn it on.
MAP_DIRECT and MAP_SYNC work.
nodax: Don't use DAX. The kernel won't choose to use DAX, and any DAX inode
flags will be ignored. This gives the sysadmin the override that I think
you're looking for. The user can still manipulate the inode flags as they see
fit.
MAP_DIRECT and MAP_SYNC both fail.
dax: Use DAX for all inodes in the filesystem. Again the inode flags are
essentially ignored, but the user can manipulate the inode flags as they see
fit. This is basically unchanged from how it works today, modulo the bug
where DAX can get turned off if you unset the inode flag where it wasn't even
set (patch 1 in my series).
MAP_DIRECT and MAP_SYNC work.
> > Does it make sense at this point to just start a "dax" man page that can
> > contain info about the mount options, inode flags, kernel config options, how
> > to get PMDs, etc? Or does this documentation need to be sprinkled around more
> > in existing man pages?
>
> A dax manpage would be good.
Okay, I'll start with a manpage, and once we agree on whats in there we can
start working on code again. :)