2020-02-27 05:32:53

by Ira Weiny

[permalink] [raw]
Subject: [PATCH V5 00/12] Enable per-file/per-directory DAX operations V5

From: Ira Weiny <[email protected]>

Changes from V4:
* Open code the aops lock rather than add it to the xfs_ilock()
subsystem (Darrick's comments were obsoleted by this change)
* Fix lkp build suggestions and bugs

Changes from V3:
* Remove global locking... :-D
* put back per inode locking and remove pre-mature optimizations
* Fix issues with Directories having IS_DAX() set
* Fix kernel crash issues reported by Jeff
* Add some clean up patches
* Consolidate diflags to iflags functions
* Update/add documentation
* Reorder/rename patches quite a bit

Changes from V2:

* Move i_dax_sem to be a global percpu_rw_sem rather than per inode
Internal discussions with Dan determined this would be easier,
just as performant, and slightly less overhead that having it
in the SB as suggested by Jan
* Fix locking order in comments and throughout code
* Change "mode" to "state" throughout commits
* Add CONFIG_FS_DAX wrapper to disable inode_[un]lock_state() when not
configured
* Add static branch for which is activated by a device which supports
DAX in XFS
* Change "lock/unlock" to up/down read/write as appropriate
Previous names were over simplified
* Update comments/documentation

* Remove the xfs specific lock to the vfs (global) layer.
* Fix i_dax_sem locking order and comments

* Move 'i_mapped' count from struct inode to struct address_space and
rename it to mmap_count
* Add inode_has_mappings() call

* Fix build issues
* Clean up syntax spacing and minor issues
* Update man page text for STATX_ATTR_DAX
* Add reviewed-by's
* Rebase to 5.6

Rename patch:
from: fs/xfs: Add lock/unlock state to xfs
to: fs/xfs: Add write DAX lock to xfs layer
Add patch:
fs/xfs: Clarify lockdep dependency for xfs_isilocked()
Drop patch:
fs/xfs: Fix truncate up


At LSF/MM'19 [1] [2] we discussed applications that overestimate memory
consumption due to their inability to detect whether the kernel will
instantiate page cache for a file, and cases where a global dax enable via a
mount option is too coarse.

The following patch series enables selecting the use of DAX on individual files
and/or directories on xfs, and lays some groundwork to do so in ext4. In this
scheme the dax mount option can be omitted to allow the per-file property to
take effect.

The insight at LSF/MM was to separate the per-mount or per-file "physical"
capability switch from an "effective" attribute for the file.

At LSF/MM we discussed the difficulties of switching the DAX state of a file
with active mappings / page cache. It was thought the races could be avoided
by limiting DAX state flips to 0-length files.

However, this turns out to not be true.[3] This is because address space
operations (a_ops) may be in use at any time the inode is referenced and users
have expressed a desire to be able to change the DAX state on a file with data
in it. For those reasons this patch set allows changing the DAX state flag on
a file as long as it is not current mapped.

Details of when and how DAX state can be changed on a file is included in a
documentation patch.

It should be noted that the physical DAX flag inheritance is not shown in this
patch set as it was maintained from previous work on XFS. The physical DAX
flag and it's inheritance will need to be added to other file systems for user
control.

As submitted this works on real hardware testing.


[1] https://lwn.net/Articles/787973/
[2] https://lwn.net/Articles/787233/
[3] https://lkml.org/lkml/2019/10/20/96
[4] https://patchwork.kernel.org/patch/11310511/


To: [email protected]
Cc: Alexander Viro <[email protected]>
Cc: "Darrick J. Wong" <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Dave Chinner <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: "Theodore Y. Ts'o" <[email protected]>
Cc: Jan Kara <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]


Ira Weiny (12):
fs/xfs: Remove unnecessary initialization of i_rwsem
fs: Remove unneeded IS_DAX() check
fs/stat: Define DAX statx attribute
fs/xfs: Isolate the physical DAX flag from enabled
fs/xfs: Create function xfs_inode_enable_dax()
fs: Add locking for a dynamic address space operations state
fs: Prevent DAX state change if file is mmap'ed
fs/xfs: Hold off aops users while changing DAX state
fs/xfs: Clean up locking in dax invalidate
fs/xfs: Allow toggle of effective DAX flag
fs/xfs: Remove xfs_diflags_to_linux()
Documentation/dax: Update Usage section

Documentation/filesystems/dax.txt | 84 +++++++++++++++++++++++++-
Documentation/filesystems/vfs.rst | 16 +++++
fs/attr.c | 1 +
fs/inode.c | 16 ++++-
fs/iomap/buffered-io.c | 1 +
fs/open.c | 4 ++
fs/stat.c | 5 ++
fs/xfs/xfs_icache.c | 5 +-
fs/xfs/xfs_inode.h | 2 +
fs/xfs/xfs_ioctl.c | 98 +++++++++++++++----------------
fs/xfs/xfs_iops.c | 69 +++++++++++++++-------
include/linux/fs.h | 73 ++++++++++++++++++++++-
include/uapi/linux/stat.h | 1 +
mm/fadvise.c | 7 ++-
mm/filemap.c | 4 ++
mm/huge_memory.c | 1 +
mm/khugepaged.c | 2 +
mm/mmap.c | 19 +++++-
mm/util.c | 9 ++-
19 files changed, 328 insertions(+), 89 deletions(-)

--
2.21.0


2020-02-27 05:33:29

by Ira Weiny

[permalink] [raw]
Subject: [PATCH V5 12/12] Documentation/dax: Update Usage section

From: Ira Weiny <[email protected]>

Update the Usage section to reflect the new individual dax selection
functionality.

Signed-off-by: Ira Weiny <[email protected]>
---
Documentation/filesystems/dax.txt | 84 ++++++++++++++++++++++++++++++-
1 file changed, 82 insertions(+), 2 deletions(-)

diff --git a/Documentation/filesystems/dax.txt b/Documentation/filesystems/dax.txt
index 679729442fd2..32e37c550f76 100644
--- a/Documentation/filesystems/dax.txt
+++ b/Documentation/filesystems/dax.txt
@@ -20,8 +20,88 @@ Usage
If you have a block device which supports DAX, you can make a filesystem
on it as usual. The DAX code currently only supports files with a block
size equal to your kernel's PAGE_SIZE, so you may need to specify a block
-size when creating the filesystem. When mounting it, use the "-o dax"
-option on the command line or add 'dax' to the options in /etc/fstab.
+size when creating the filesystem.
+
+Enabling DAX on an individual file basis (XFS)
+----------------------------------------------
+
+There are 2 per file dax flags. One is a physical configuration setting and
+the other a currently enabled state.
+
+The physical configuration setting is maintained on individual file and
+directory inodes. It is preserved within the file system. This 'physical'
+config setting can be set using an ioctl and/or an application such as "xfs_io
+-c 'chattr [-+]x'". Files and directories automatically inherit their physical
+dax setting from their parent directory when created. Therefore, setting the
+physical dax setting at directory creation time can be used to set a default
+behavior for that sub-tree. Doing so on the root directory acts to set a
+default for the entire file system.
+
+To clarify inheritance here are 3 examples:
+
+Example A:
+
+mkdir -p a/b/c
+xfs_io 'chattr +x' a
+mkdir a/b/c/d
+mkdir a/e
+
+ dax: a,e
+ no dax: b,c,d
+
+Example B:
+
+mkdir a
+xfs_io 'chattr +x' a
+mkdir -p a/b/c/d
+
+ dax: a,b,c,d
+ no dax:
+
+Example C:
+
+mkdir -p a/b/c
+xfs_io 'chattr +x' c
+mkdir a/b/c/d
+
+ dax: c,d
+ no dax: a,b
+
+
+The current inode enabled state is set when a file inode is loaded and it is
+determined that the underlying media supports dax.
+
+statx can be used to query the file's current enabled state. NOTE that a
+directory will never be operating in a dax state. Therefore, the dax config
+state must be queried to see what config state a file or sub-directory will
+inherit from a directory.
+
+NOTE: Setting a file or directory's config state with xfs_io is possible even
+if the underlying media does not support dax.
+
+
+Enabling dax on a file system wide basis ('-o dax' mount option)
+----------------------------------------------------------------
+
+The physical dax configuration of all files can be overridden using a mount
+option. In summary:
+
+ (physical flag || mount option) && capable device == dax in effect
+ ( <xfs_io> || <'-o dax'> ) && capable device == <statx dax true>
+
+To enable the mount override, use "-o dax" on the command line or add
+'dax' to the options in /etc/fstab
+
+Using the mount option does not change the physical configured state of
+individual files. Therefore, remounting _without_ the mount option will allow
+the file system to set file's enabled state directly based on their config
+setting.
+
+NOTE: Setting a file or directory's physical config state is possible while the
+file system is mounted with the dax override. However, the file's enabled
+state will continue to be overridden and "dax enabled" until the mount option
+is removed and a remount performed. At that point the file's physical config
+state dictates the enabled state.


Implementation Tips for Block Driver Writers
--
2.21.0

2020-02-27 05:33:45

by Ira Weiny

[permalink] [raw]
Subject: [PATCH V5 09/12] fs/xfs: Clean up locking in dax invalidate

From: Ira Weiny <[email protected]>

Define a variable to hold the lock flags to ensure that the correct
locks are returned or released on error.

Signed-off-by: Ira Weiny <[email protected]>
---
fs/xfs/xfs_ioctl.c | 10 ++++++----
1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
index 6c4d4ea3b6b6..40ae791cfb41 100644
--- a/fs/xfs/xfs_ioctl.c
+++ b/fs/xfs/xfs_ioctl.c
@@ -1190,7 +1190,7 @@ xfs_ioctl_setattr_dax_invalidate(
int *join_flags)
{
struct inode *inode = VFS_I(ip);
- int error;
+ int error, flags;

*join_flags = 0;

@@ -1205,8 +1205,10 @@ xfs_ioctl_setattr_dax_invalidate(
if (S_ISDIR(inode->i_mode))
return 0;

+ flags = XFS_MMAPLOCK_EXCL | XFS_IOLOCK_EXCL;
+
/* lock, flush and invalidate mapping in preparation for flag change */
- xfs_ilock(ip, XFS_MMAPLOCK_EXCL | XFS_IOLOCK_EXCL);
+ xfs_ilock(ip, flags);

/*
* If there is a mapping in place we must remain in our current state.
@@ -1223,11 +1225,11 @@ xfs_ioctl_setattr_dax_invalidate(
if (error)
goto out_unlock;

- *join_flags = XFS_MMAPLOCK_EXCL | XFS_IOLOCK_EXCL;
+ *join_flags = flags;
return 0;

out_unlock:
- xfs_iunlock(ip, XFS_MMAPLOCK_EXCL | XFS_IOLOCK_EXCL);
+ xfs_iunlock(ip, flags);
return error;

}
--
2.21.0

2020-02-27 05:33:51

by Ira Weiny

[permalink] [raw]
Subject: [PATCH V5 11/12] fs/xfs: Remove xfs_diflags_to_linux()

From: Ira Weiny <[email protected]>

The functionality in xfs_diflags_to_linux() is now identical to
xfs_diflags_to_iflags().

Remove xfs_diflags_to_linux() and call xfs_diflags_to_iflags() directly.

While we are here simplify xfs_diflags_to_iflags() to take just struct
xfs_inode.

And use xfs_ip2xflags() to ensure future diflags are included correctly.

Signed-off-by: kbuild test robot <[email protected]>
Signed-off-by: Ira Weiny <[email protected]>

---
Changes from V4:
Pick up lkp build suggestion (make xfs_inode_enable_dax() static)
---
fs/xfs/xfs_inode.h | 2 +-
fs/xfs/xfs_ioctl.c | 32 +-------------------------------
fs/xfs/xfs_iops.c | 31 +++++++++++++++++++------------
3 files changed, 21 insertions(+), 44 deletions(-)

diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
index 7d4968915dec..2c8c2804f88b 100644
--- a/fs/xfs/xfs_inode.h
+++ b/fs/xfs/xfs_inode.h
@@ -467,7 +467,7 @@ int xfs_break_layouts(struct inode *inode, uint *iolock,
extern void xfs_setup_inode(struct xfs_inode *ip);
extern void xfs_setup_iops(struct xfs_inode *ip);
extern void xfs_setup_a_ops(struct xfs_inode *ip);
-extern bool xfs_inode_enable_dax(struct xfs_inode *ip);
+extern void xfs_diflags_to_iflags(struct xfs_inode *ip);

/*
* When setting up a newly allocated inode, we need to call
diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
index 38e91de44a6f..bd67359a3251 100644
--- a/fs/xfs/xfs_ioctl.c
+++ b/fs/xfs/xfs_ioctl.c
@@ -1100,36 +1100,6 @@ xfs_flags2diflags2(
return di_flags2;
}

-STATIC void
-xfs_diflags_to_linux(
- struct xfs_inode *ip)
-{
- struct inode *inode = VFS_I(ip);
- unsigned int xflags = xfs_ip2xflags(ip);
-
- if (xflags & FS_XFLAG_IMMUTABLE)
- inode->i_flags |= S_IMMUTABLE;
- else
- inode->i_flags &= ~S_IMMUTABLE;
- if (xflags & FS_XFLAG_APPEND)
- inode->i_flags |= S_APPEND;
- else
- inode->i_flags &= ~S_APPEND;
- if (xflags & FS_XFLAG_SYNC)
- inode->i_flags |= S_SYNC;
- else
- inode->i_flags &= ~S_SYNC;
- if (xflags & FS_XFLAG_NOATIME)
- inode->i_flags |= S_NOATIME;
- else
- inode->i_flags &= ~S_NOATIME;
-
- if (xfs_inode_enable_dax(ip))
- inode->i_flags |= S_DAX;
- else
- inode->i_flags &= ~S_DAX;
-}
-
static int
xfs_ioctl_setattr_xflags(
struct xfs_trans *tp,
@@ -1167,7 +1137,7 @@ xfs_ioctl_setattr_xflags(
ip->i_d.di_flags = xfs_flags2diflags(ip, fa->fsx_xflags);
ip->i_d.di_flags2 = di_flags2;

- xfs_diflags_to_linux(ip);
+ xfs_diflags_to_iflags(ip);
xfs_trans_ichgtime(tp, ip, XFS_ICHGTIME_CHG);
xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE);
XFS_STATS_INC(mp, xs_ig_attrchg);
diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
index 8f023be39705..c63cf0b73b75 100644
--- a/fs/xfs/xfs_iops.c
+++ b/fs/xfs/xfs_iops.c
@@ -1259,7 +1259,7 @@ xfs_inode_supports_dax(
return xfs_inode_buftarg(ip)->bt_daxdev != NULL;
}

-bool
+static bool
xfs_inode_enable_dax(
struct xfs_inode *ip)
{
@@ -1273,26 +1273,33 @@ xfs_inode_enable_dax(
return false;
}

-STATIC void
+void
xfs_diflags_to_iflags(
- struct inode *inode,
struct xfs_inode *ip)
{
- uint16_t flags = ip->i_d.di_flags;
-
- inode->i_flags &= ~(S_IMMUTABLE | S_APPEND | S_SYNC |
- S_NOATIME | S_DAX);
+ struct inode *inode = VFS_I(ip);
+ uint16_t diflags = xfs_ip2xflags(ip);

- if (flags & XFS_DIFLAG_IMMUTABLE)
+ if (diflags & FS_XFLAG_IMMUTABLE)
inode->i_flags |= S_IMMUTABLE;
- if (flags & XFS_DIFLAG_APPEND)
+ else
+ inode->i_flags &= ~S_IMMUTABLE;
+ if (diflags & FS_XFLAG_APPEND)
inode->i_flags |= S_APPEND;
- if (flags & XFS_DIFLAG_SYNC)
+ else
+ inode->i_flags &= ~S_APPEND;
+ if (diflags & FS_XFLAG_SYNC)
inode->i_flags |= S_SYNC;
- if (flags & XFS_DIFLAG_NOATIME)
+ else
+ inode->i_flags &= ~S_SYNC;
+ if (diflags & FS_XFLAG_NOATIME)
inode->i_flags |= S_NOATIME;
+ else
+ inode->i_flags &= ~S_NOATIME;
if (xfs_inode_enable_dax(ip))
inode->i_flags |= S_DAX;
+ else
+ inode->i_flags &= ~S_DAX;
}

/*
@@ -1321,7 +1328,7 @@ xfs_setup_inode(
inode->i_gid = xfs_gid_to_kgid(ip->i_d.di_gid);

i_size_write(inode, ip->i_d.di_size);
- xfs_diflags_to_iflags(inode, ip);
+ xfs_diflags_to_iflags(ip);

if (S_ISDIR(inode->i_mode)) {
/*
--
2.21.0

2020-02-27 05:34:44

by Ira Weiny

[permalink] [raw]
Subject: [PATCH V5 03/12] fs/stat: Define DAX statx attribute

From: Ira Weiny <[email protected]>

In order for users to determine if a file is currently operating in DAX
state (effective DAX). Define a statx attribute value and set that
attribute if the effective DAX flag is set.

To go along with this we propose the following addition to the statx man
page:

STATX_ATTR_DAX

The file is in the DAX (cpu direct access) state. DAX state
attempts to minimize software cache effects for both I/O and
memory mappings of this file. It requires a file system which
has been configured to support DAX.

DAX generally assumes all accesses are via cpu load / store
instructions which can minimize overhead for small accesses, but
may adversely affect cpu utilization for large transfers.

File I/O is done directly to/from user-space buffers and memory
mapped I/O may be performed with direct memory mappings that
bypass kernel page cache.

While the DAX property tends to result in data being transferred
synchronously, it does not give the same guarantees of O_SYNC
where data and the necessary metadata are transferred together.

A DAX file may support being mapped with the MAP_SYNC flag,
which enables a program to use CPU cache flush instructions to
persist CPU store operations without an explicit fsync(2). See
mmap(2) for more information.

Reviewed-by: Jan Kara <[email protected]>
Reviewed-by: Darrick J. Wong <[email protected]>
Signed-off-by: Ira Weiny <[email protected]>

---
Changes from V2:
Update man page text with comments from Darrick, Jan, Dan, and
Dave.
---
fs/stat.c | 3 +++
include/uapi/linux/stat.h | 1 +
2 files changed, 4 insertions(+)

diff --git a/fs/stat.c b/fs/stat.c
index 030008796479..894699c74dde 100644
--- a/fs/stat.c
+++ b/fs/stat.c
@@ -79,6 +79,9 @@ int vfs_getattr_nosec(const struct path *path, struct kstat *stat,
if (IS_AUTOMOUNT(inode))
stat->attributes |= STATX_ATTR_AUTOMOUNT;

+ if (IS_DAX(inode))
+ stat->attributes |= STATX_ATTR_DAX;
+
if (inode->i_op->getattr)
return inode->i_op->getattr(path, stat, request_mask,
query_flags);
diff --git a/include/uapi/linux/stat.h b/include/uapi/linux/stat.h
index ad80a5c885d5..e5f9d5517f6b 100644
--- a/include/uapi/linux/stat.h
+++ b/include/uapi/linux/stat.h
@@ -169,6 +169,7 @@ struct statx {
#define STATX_ATTR_ENCRYPTED 0x00000800 /* [I] File requires key to decrypt in fs */
#define STATX_ATTR_AUTOMOUNT 0x00001000 /* Dir: Automount trigger */
#define STATX_ATTR_VERITY 0x00100000 /* [I] Verity protected file */
+#define STATX_ATTR_DAX 0x00002000 /* [I] File is DAX */


#endif /* _UAPI_LINUX_STAT_H */
--
2.21.0

2020-02-27 05:34:51

by Ira Weiny

[permalink] [raw]
Subject: [PATCH V5 08/12] fs/xfs: Hold off aops users while changing DAX state

From: Ira Weiny <[email protected]>

XFS requires the use of the aops of an inode to be quiesced prior to
changing the aops vector to/from the DAX vector.

Take the aops write lock while changing DAX state.

Signed-off-by: Ira Weiny <[email protected]>

---
Changes from v4:
Open code the aops write lock
Obsolete: Clean up lock name in comments
Obsolete: Change #define: XFS_DAX_EXCL => XFS_AOPSLOCK_EXCL

Changes from v3:
Change locking function names to reflect changes in previous
patches.

Changes from V2:
Change name of patch (WAS: fs/xfs: Add lock/unlock state to xfs)
Remove the xfs specific lock and move to the vfs layer.
We still use XFS_LOCK_DAX_EXCL to be able to pass this
flag through to the transaction code. But we no longer
have a lock specific to xfs. This removes a lot of code
from the XFS layer, preps us for using this in ext4, and
is actually more straight forward now that all the
locking requirements are better known.

Fix locking order comment
Rework for new 'state' names
(Other comments on the previous patch are not applicable with
new patch as much of the code was removed in favor of the vfs
level lock)
---
fs/xfs/xfs_ioctl.c | 11 +++++++++++
1 file changed, 11 insertions(+)

diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
index 498fae2ef9f6..6c4d4ea3b6b6 100644
--- a/fs/xfs/xfs_ioctl.c
+++ b/fs/xfs/xfs_ioctl.c
@@ -1444,7 +1444,11 @@ xfs_ioctl_setattr(
* or cancel time, so need to be passed through to
* xfs_ioctl_setattr_get_trans() so it can apply them to the join call
* appropriately.
+ *
+ * Further we hold off aops users until we have completed any potential
+ * changing of aops due to attribute changes.
*/
+ inode_aops_down_write(VFS_I(ip));
code = xfs_ioctl_setattr_dax_invalidate(ip, fa, &join_flags);
if (code)
goto error_free_dquots;
@@ -1527,6 +1531,7 @@ xfs_ioctl_setattr(
xfs_qm_dqrele(udqp);
xfs_qm_dqrele(pdqp);

+ inode_aops_up_write(VFS_I(ip));
return code;

error_trans_cancel:
@@ -1534,6 +1539,7 @@ xfs_ioctl_setattr(
error_free_dquots:
xfs_qm_dqrele(udqp);
xfs_qm_dqrele(pdqp);
+ inode_aops_up_write(VFS_I(ip));
return code;
}

@@ -1603,7 +1609,11 @@ xfs_ioc_setxflags(
* or cancel time, so need to be passed through to
* xfs_ioctl_setattr_get_trans() so it can apply them to the join call
* appropriately.
+ *
+ * Further we hold off aops users until we have completed any potential
+ * changing of aops due to attribute changes.
*/
+ inode_aops_down_write(VFS_I(ip));
error = xfs_ioctl_setattr_dax_invalidate(ip, &fa, &join_flags);
if (error)
goto out_drop_write;
@@ -1630,6 +1640,7 @@ xfs_ioc_setxflags(
error = xfs_trans_commit(tp);
out_drop_write:
mnt_drop_write_file(filp);
+ inode_aops_up_write(VFS_I(ip));
return error;
}

--
2.21.0

2020-03-05 15:53:15

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH V5 00/12] Enable per-file/per-directory DAX operations V5

FYI, I still will fully NAK any series that adds additional locks
and thus atomic instructions to basically every fs call, and grows
the inode by a rw_semaphore plus and atomic64_t. I also think the
whole idea of switching operation vectors at runtime is fatally flawed
and we should never add such code, nevermind just for a fringe usecase
of a fringe feature.

On Wed, Feb 26, 2020 at 09:24:30PM -0800, [email protected] wrote:
> From: Ira Weiny <[email protected]>
>
> Changes from V4:
> * Open code the aops lock rather than add it to the xfs_ilock()
> subsystem (Darrick's comments were obsoleted by this change)
> * Fix lkp build suggestions and bugs
>
> Changes from V3:
> * Remove global locking... :-D
> * put back per inode locking and remove pre-mature optimizations
> * Fix issues with Directories having IS_DAX() set
> * Fix kernel crash issues reported by Jeff
> * Add some clean up patches
> * Consolidate diflags to iflags functions
> * Update/add documentation
> * Reorder/rename patches quite a bit
>
> Changes from V2:
>
> * Move i_dax_sem to be a global percpu_rw_sem rather than per inode
> Internal discussions with Dan determined this would be easier,
> just as performant, and slightly less overhead that having it
> in the SB as suggested by Jan
> * Fix locking order in comments and throughout code
> * Change "mode" to "state" throughout commits
> * Add CONFIG_FS_DAX wrapper to disable inode_[un]lock_state() when not
> configured
> * Add static branch for which is activated by a device which supports
> DAX in XFS
> * Change "lock/unlock" to up/down read/write as appropriate
> Previous names were over simplified
> * Update comments/documentation
>
> * Remove the xfs specific lock to the vfs (global) layer.
> * Fix i_dax_sem locking order and comments
>
> * Move 'i_mapped' count from struct inode to struct address_space and
> rename it to mmap_count
> * Add inode_has_mappings() call
>
> * Fix build issues
> * Clean up syntax spacing and minor issues
> * Update man page text for STATX_ATTR_DAX
> * Add reviewed-by's
> * Rebase to 5.6
>
> Rename patch:
> from: fs/xfs: Add lock/unlock state to xfs
> to: fs/xfs: Add write DAX lock to xfs layer
> Add patch:
> fs/xfs: Clarify lockdep dependency for xfs_isilocked()
> Drop patch:
> fs/xfs: Fix truncate up
>
>
> At LSF/MM'19 [1] [2] we discussed applications that overestimate memory
> consumption due to their inability to detect whether the kernel will
> instantiate page cache for a file, and cases where a global dax enable via a
> mount option is too coarse.
>
> The following patch series enables selecting the use of DAX on individual files
> and/or directories on xfs, and lays some groundwork to do so in ext4. In this
> scheme the dax mount option can be omitted to allow the per-file property to
> take effect.
>
> The insight at LSF/MM was to separate the per-mount or per-file "physical"
> capability switch from an "effective" attribute for the file.
>
> At LSF/MM we discussed the difficulties of switching the DAX state of a file
> with active mappings / page cache. It was thought the races could be avoided
> by limiting DAX state flips to 0-length files.
>
> However, this turns out to not be true.[3] This is because address space
> operations (a_ops) may be in use at any time the inode is referenced and users
> have expressed a desire to be able to change the DAX state on a file with data
> in it. For those reasons this patch set allows changing the DAX state flag on
> a file as long as it is not current mapped.
>
> Details of when and how DAX state can be changed on a file is included in a
> documentation patch.
>
> It should be noted that the physical DAX flag inheritance is not shown in this
> patch set as it was maintained from previous work on XFS. The physical DAX
> flag and it's inheritance will need to be added to other file systems for user
> control.
>
> As submitted this works on real hardware testing.
>
>
> [1] https://lwn.net/Articles/787973/
> [2] https://lwn.net/Articles/787233/
> [3] https://lkml.org/lkml/2019/10/20/96
> [4] https://patchwork.kernel.org/patch/11310511/
>
>
> To: [email protected]
> Cc: Alexander Viro <[email protected]>
> Cc: "Darrick J. Wong" <[email protected]>
> Cc: Dan Williams <[email protected]>
> Cc: Dave Chinner <[email protected]>
> Cc: Christoph Hellwig <[email protected]>
> Cc: "Theodore Y. Ts'o" <[email protected]>
> Cc: Jan Kara <[email protected]>
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
>
>
> Ira Weiny (12):
> fs/xfs: Remove unnecessary initialization of i_rwsem
> fs: Remove unneeded IS_DAX() check
> fs/stat: Define DAX statx attribute
> fs/xfs: Isolate the physical DAX flag from enabled
> fs/xfs: Create function xfs_inode_enable_dax()
> fs: Add locking for a dynamic address space operations state
> fs: Prevent DAX state change if file is mmap'ed
> fs/xfs: Hold off aops users while changing DAX state
> fs/xfs: Clean up locking in dax invalidate
> fs/xfs: Allow toggle of effective DAX flag
> fs/xfs: Remove xfs_diflags_to_linux()
> Documentation/dax: Update Usage section
>
> Documentation/filesystems/dax.txt | 84 +++++++++++++++++++++++++-
> Documentation/filesystems/vfs.rst | 16 +++++
> fs/attr.c | 1 +
> fs/inode.c | 16 ++++-
> fs/iomap/buffered-io.c | 1 +
> fs/open.c | 4 ++
> fs/stat.c | 5 ++
> fs/xfs/xfs_icache.c | 5 +-
> fs/xfs/xfs_inode.h | 2 +
> fs/xfs/xfs_ioctl.c | 98 +++++++++++++++----------------
> fs/xfs/xfs_iops.c | 69 +++++++++++++++-------
> include/linux/fs.h | 73 ++++++++++++++++++++++-
> include/uapi/linux/stat.h | 1 +
> mm/fadvise.c | 7 ++-
> mm/filemap.c | 4 ++
> mm/huge_memory.c | 1 +
> mm/khugepaged.c | 2 +
> mm/mmap.c | 19 +++++-
> mm/util.c | 9 ++-
> 19 files changed, 328 insertions(+), 89 deletions(-)
>
> --
> 2.21.0
---end quoted text---

2020-03-09 17:05:02

by Ira Weiny

[permalink] [raw]
Subject: Re: [PATCH V5 00/12] Enable per-file/per-directory DAX operations V5

On Thu, Mar 05, 2020 at 04:51:44PM +0100, Christoph Hellwig wrote:
> FYI, I still will fully NAK any series that adds additional locks
> and thus atomic instructions to basically every fs call, and grows
> the inode by a rw_semaphore plus and atomic64_t. I also think the
> whole idea of switching operation vectors at runtime is fatally flawed
> and we should never add such code, nevermind just for a fringe usecase
> of a fringe feature.

Being new to this area of the kernel I'm not clear on the history...

It was my understanding that the per-file flag support was a requirement to
removing the experimental designation from DAX. Is this still the case?

Ira

>
> On Wed, Feb 26, 2020 at 09:24:30PM -0800, [email protected] wrote:
> > From: Ira Weiny <[email protected]>
> >
> > Changes from V4:
> > * Open code the aops lock rather than add it to the xfs_ilock()
> > subsystem (Darrick's comments were obsoleted by this change)
> > * Fix lkp build suggestions and bugs
> >
> > Changes from V3:
> > * Remove global locking... :-D
> > * put back per inode locking and remove pre-mature optimizations
> > * Fix issues with Directories having IS_DAX() set
> > * Fix kernel crash issues reported by Jeff
> > * Add some clean up patches
> > * Consolidate diflags to iflags functions
> > * Update/add documentation
> > * Reorder/rename patches quite a bit
> >
> > Changes from V2:
> >
> > * Move i_dax_sem to be a global percpu_rw_sem rather than per inode
> > Internal discussions with Dan determined this would be easier,
> > just as performant, and slightly less overhead that having it
> > in the SB as suggested by Jan
> > * Fix locking order in comments and throughout code
> > * Change "mode" to "state" throughout commits
> > * Add CONFIG_FS_DAX wrapper to disable inode_[un]lock_state() when not
> > configured
> > * Add static branch for which is activated by a device which supports
> > DAX in XFS
> > * Change "lock/unlock" to up/down read/write as appropriate
> > Previous names were over simplified
> > * Update comments/documentation
> >
> > * Remove the xfs specific lock to the vfs (global) layer.
> > * Fix i_dax_sem locking order and comments
> >
> > * Move 'i_mapped' count from struct inode to struct address_space and
> > rename it to mmap_count
> > * Add inode_has_mappings() call
> >
> > * Fix build issues
> > * Clean up syntax spacing and minor issues
> > * Update man page text for STATX_ATTR_DAX
> > * Add reviewed-by's
> > * Rebase to 5.6
> >
> > Rename patch:
> > from: fs/xfs: Add lock/unlock state to xfs
> > to: fs/xfs: Add write DAX lock to xfs layer
> > Add patch:
> > fs/xfs: Clarify lockdep dependency for xfs_isilocked()
> > Drop patch:
> > fs/xfs: Fix truncate up
> >
> >
> > At LSF/MM'19 [1] [2] we discussed applications that overestimate memory
> > consumption due to their inability to detect whether the kernel will
> > instantiate page cache for a file, and cases where a global dax enable via a
> > mount option is too coarse.
> >
> > The following patch series enables selecting the use of DAX on individual files
> > and/or directories on xfs, and lays some groundwork to do so in ext4. In this
> > scheme the dax mount option can be omitted to allow the per-file property to
> > take effect.
> >
> > The insight at LSF/MM was to separate the per-mount or per-file "physical"
> > capability switch from an "effective" attribute for the file.
> >
> > At LSF/MM we discussed the difficulties of switching the DAX state of a file
> > with active mappings / page cache. It was thought the races could be avoided
> > by limiting DAX state flips to 0-length files.
> >
> > However, this turns out to not be true.[3] This is because address space
> > operations (a_ops) may be in use at any time the inode is referenced and users
> > have expressed a desire to be able to change the DAX state on a file with data
> > in it. For those reasons this patch set allows changing the DAX state flag on
> > a file as long as it is not current mapped.
> >
> > Details of when and how DAX state can be changed on a file is included in a
> > documentation patch.
> >
> > It should be noted that the physical DAX flag inheritance is not shown in this
> > patch set as it was maintained from previous work on XFS. The physical DAX
> > flag and it's inheritance will need to be added to other file systems for user
> > control.
> >
> > As submitted this works on real hardware testing.
> >
> >
> > [1] https://lwn.net/Articles/787973/
> > [2] https://lwn.net/Articles/787233/
> > [3] https://lkml.org/lkml/2019/10/20/96
> > [4] https://patchwork.kernel.org/patch/11310511/
> >
> >
> > To: [email protected]
> > Cc: Alexander Viro <[email protected]>
> > Cc: "Darrick J. Wong" <[email protected]>
> > Cc: Dan Williams <[email protected]>
> > Cc: Dave Chinner <[email protected]>
> > Cc: Christoph Hellwig <[email protected]>
> > Cc: "Theodore Y. Ts'o" <[email protected]>
> > Cc: Jan Kara <[email protected]>
> > Cc: [email protected]
> > Cc: [email protected]
> > Cc: [email protected]
> >
> >
> > Ira Weiny (12):
> > fs/xfs: Remove unnecessary initialization of i_rwsem
> > fs: Remove unneeded IS_DAX() check
> > fs/stat: Define DAX statx attribute
> > fs/xfs: Isolate the physical DAX flag from enabled
> > fs/xfs: Create function xfs_inode_enable_dax()
> > fs: Add locking for a dynamic address space operations state
> > fs: Prevent DAX state change if file is mmap'ed
> > fs/xfs: Hold off aops users while changing DAX state
> > fs/xfs: Clean up locking in dax invalidate
> > fs/xfs: Allow toggle of effective DAX flag
> > fs/xfs: Remove xfs_diflags_to_linux()
> > Documentation/dax: Update Usage section
> >
> > Documentation/filesystems/dax.txt | 84 +++++++++++++++++++++++++-
> > Documentation/filesystems/vfs.rst | 16 +++++
> > fs/attr.c | 1 +
> > fs/inode.c | 16 ++++-
> > fs/iomap/buffered-io.c | 1 +
> > fs/open.c | 4 ++
> > fs/stat.c | 5 ++
> > fs/xfs/xfs_icache.c | 5 +-
> > fs/xfs/xfs_inode.h | 2 +
> > fs/xfs/xfs_ioctl.c | 98 +++++++++++++++----------------
> > fs/xfs/xfs_iops.c | 69 +++++++++++++++-------
> > include/linux/fs.h | 73 ++++++++++++++++++++++-
> > include/uapi/linux/stat.h | 1 +
> > mm/fadvise.c | 7 ++-
> > mm/filemap.c | 4 ++
> > mm/huge_memory.c | 1 +
> > mm/khugepaged.c | 2 +
> > mm/mmap.c | 19 +++++-
> > mm/util.c | 9 ++-
> > 19 files changed, 328 insertions(+), 89 deletions(-)
> >
> > --
> > 2.21.0
> ---end quoted text---

2020-03-11 06:40:04

by Dave Chinner

[permalink] [raw]
Subject: Re: [PATCH V5 00/12] Enable per-file/per-directory DAX operations V5

On Tue, Mar 10, 2020 at 08:36:14PM -0700, Darrick J. Wong wrote:
> There are still other things that need to be ironed out WRT pmem:
>
> a) reflink and page/pfn/whatever sharing -- fix the mm or (ab)use the
> xfs buffer cache, or something worse?

I don't think we need either. We just need to remove the DAX page
association for hwpoison that requires the struct page to store the
mapping and index. Get rid of that and we should be able to safely
map the same page into different inode address spaces at the same
time. When we write a shared page, we COW it immediately and replace
the page in the inode's mapping tree, so we can't actually write to
a shared page...

IOWs, the dax_associate_page() related functionality probably needs
to be a filesystem callout - part of the aops vector, I think, so
that device dax can still use it. That way XFS can go it's own way,
while ext4 and device dax can continue to use the existing mechanism
mechanisn that is currently implemented....

XFS can then make use of rmapbt to find the owners on a bad page
notification, and run the "kill userspace dead dead dead" lookup on
each mapping/index tuple rather than pass it around on a struct
page. i.e. we'll do a kill scan for each mapping/index owner tuple
we find, not just one.

That requires converting all the current vma killer code to pass
mapping/index tuples around rather than the struct page. That kill
code doesn't actually need the struct page, it just needs the
mapping/index tuple to match to the vmas that have it mapped into
userspace.

> b) getting our stories straight on how to clear poison, and whether or
> not we can come up with a better story for ZERO_FILE_RANGE on pmem. In
> the ideal world I'd love to see Z_F_R actually memset(0) the pmem and
> clear poison, at least if the file->pmem mappings were contiguous.

Are you talking about ZFR from userspace through the filesystem (how
do you clear poison in free space?) or ZFR on the dax device fro
either userspace or the kernel filesystem code?

> c) wiring up xfs to hwpoison, or wiring up hwpoison to xfs, or otherwise
> figuring out how to get storage to tell xfs that it lost something so
> that maybe xfs can fix it quickly

Yup, I think that's a dax device callback into the filesystem. i.e
the hwpoison gets delivered to the dax device, which then calls the
fs function rather than do it's current "dax_lock_page(), kill
userspace dead dead dead, dax_unlock_page()" dance. The filesystem
can do a much more intricate dance and wreak far more havoc on
userspace than what the dax device can do.....

Copious amounts of unused time are things I don't have,
unfortunately. Only having 7 fingers to type with right now doesn't
help, either.

Cheers,

Dave.
--
Dave Chinner
[email protected]

2020-03-11 17:09:40

by Dan Williams

[permalink] [raw]
Subject: Re: [PATCH V5 00/12] Enable per-file/per-directory DAX operations V5

On Tue, Mar 10, 2020 at 11:44 PM Christoph Hellwig <[email protected]> wrote:
>
> On Wed, Mar 11, 2020 at 05:39:42PM +1100, Dave Chinner wrote:
> > IOWs, the dax_associate_page() related functionality probably needs
> > to be a filesystem callout - part of the aops vector, I think, so
> > that device dax can still use it. That way XFS can go it's own way,
> > while ext4 and device dax can continue to use the existing mechanism
> > mechanisn that is currently implemented....
>
> s/XFS/XFS with rmap/, as most XFS file systems currently don't have
> that enabled we'll also need to keep the legacy path around.

Agree, it needs to be an opt-in capability as ext4 and xfs w/o rmap
will stay on the legacy path for the foreseeable future.

2020-03-12 00:49:57

by Dave Chinner

[permalink] [raw]
Subject: Re: [PATCH V5 00/12] Enable per-file/per-directory DAX operations V5

On Wed, Mar 11, 2020 at 07:44:12AM +0100, Christoph Hellwig wrote:
> On Wed, Mar 11, 2020 at 05:39:42PM +1100, Dave Chinner wrote:
> > IOWs, the dax_associate_page() related functionality probably needs
> > to be a filesystem callout - part of the aops vector, I think, so
> > that device dax can still use it. That way XFS can go it's own way,
> > while ext4 and device dax can continue to use the existing mechanism
> > mechanisn that is currently implemented....
>
> s/XFS/XFS with rmap/, as most XFS file systems currently don't have
> that enabled we'll also need to keep the legacy path around.

Sure, that's trivially easy to handle in the XFS code once the
callouts are in place.

But, quite frankly, we can enforce rmap to be enabled
enabled because nobody is using a reflink enabled FS w/ DAX right
now. Everyone will have to mkfs their filesystems anyway to enable
reflink+dax, so we simply don't allow reflink+dax to be enabled
unless rmap is also enabled. Simple, easy, trivial.

Cheers,

Dave.
--
Dave Chinner
[email protected]

2020-03-12 07:27:58

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH V5 00/12] Enable per-file/per-directory DAX operations V5

On Thu, Mar 12, 2020 at 11:49:32AM +1100, Dave Chinner wrote:
> > > IOWs, the dax_associate_page() related functionality probably needs
> > > to be a filesystem callout - part of the aops vector, I think, so
> > > that device dax can still use it. That way XFS can go it's own way,
> > > while ext4 and device dax can continue to use the existing mechanism
> > > mechanisn that is currently implemented....
> >
> > s/XFS/XFS with rmap/, as most XFS file systems currently don't have
> > that enabled we'll also need to keep the legacy path around.
>
> Sure, that's trivially easy to handle in the XFS code once the
> callouts are in place.
>
> But, quite frankly, we can enforce rmap to be enabled
> enabled because nobody is using a reflink enabled FS w/ DAX right
> now. Everyone will have to mkfs their filesystems anyway to enable
> reflink+dax, so we simply don't allow reflink+dax to be enabled
> unless rmap is also enabled. Simple, easy, trivial.

True, I think rmap will be required for DAX+reflink. But we still
need the legacy infrastructure for error reporting.

2020-03-16 09:54:08

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH V5 00/12] Enable per-file/per-directory DAX operations V5

On Wed 11-03-20 10:07:18, Dan Williams wrote:
> On Tue, Mar 10, 2020 at 11:30 PM Christoph Hellwig <[email protected]> wrote:
> >
> > On Tue, Mar 10, 2020 at 08:36:14PM -0700, Darrick J. Wong wrote:
> > > 1) Leave the inode flag (FS_XFLAG_DAX) as it is, and export the S_DAX
> > > status via statx. Document that changes to FS_XFLAG_DAX do not take
> > > effect immediately and that one must check statx to find out the real
> > > mode. If we choose this, I would also deprecate the dax mount option;
> > > send in my mkfs.xfs patch to make it so that you can set FS_XFLAG_DAX on
> > > all files at mkfs time; and we can finally lay this whole thing to rest.
> > > This is the closest to what we have today.
> > >
> > > 2) Withdraw FS_XFLAG_DAX entirely, and let the kernel choose based on
> > > usage patterns, hardware heuristics, or spiteful arbitrariness.
> >
> > 3) Only allow changing FS_XFLAG_DAX on directories or files that do
> > not have blocks allocated to them yet, and side step all the hard
> > problems.
>
> This sounds reasonable to me.
>
> As for deprecating the mount option, I think at a minimum it needs to
> continue be accepted as an option even if it is ignored to not break
> existing setups.

Agreed. But that's how we usually deprecate mount options. Also I'd say
that statx() support for reporting DAX state and some education of
programmers using DAX is required before we deprecate the mount option
since currently applications check 'dax' mount option to determine how much
memory they need to set aside for page cache before they consume everything
else on the machine...

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2020-03-16 09:55:42

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH V5 00/12] Enable per-file/per-directory DAX operations V5

On Mon, Mar 16, 2020 at 10:52:24AM +0100, Jan Kara wrote:
> > This sounds reasonable to me.
> >
> > As for deprecating the mount option, I think at a minimum it needs to
> > continue be accepted as an option even if it is ignored to not break
> > existing setups.
>
> Agreed. But that's how we usually deprecate mount options. Also I'd say
> that statx() support for reporting DAX state and some education of
> programmers using DAX is required before we deprecate the mount option
> since currently applications check 'dax' mount option to determine how much
> memory they need to set aside for page cache before they consume everything
> else on the machine...

I don't even think we should deprecate it. It isn't painful to maintain
and actually useful for testing. Instead we should expand it into a
tristate:

dax=off
dax=flag
dax=always

where the existing "dax" option maps to "dax=always" and nodax maps
to "dax=off". and dax=flag becomes the default for DAX capable devices.

2020-04-01 04:01:13

by Darrick J. Wong

[permalink] [raw]
Subject: Re: [PATCH V5 00/12] Enable per-file/per-directory DAX operations V5

On Mon, Mar 16, 2020 at 10:55:09AM +0100, Christoph Hellwig wrote:
> On Mon, Mar 16, 2020 at 10:52:24AM +0100, Jan Kara wrote:
> > > This sounds reasonable to me.
> > >
> > > As for deprecating the mount option, I think at a minimum it needs to
> > > continue be accepted as an option even if it is ignored to not break
> > > existing setups.
> >
> > Agreed. But that's how we usually deprecate mount options. Also I'd say
> > that statx() support for reporting DAX state and some education of
> > programmers using DAX is required before we deprecate the mount option
> > since currently applications check 'dax' mount option to determine how much
> > memory they need to set aside for page cache before they consume everything
> > else on the machine...
>
> I don't even think we should deprecate it. It isn't painful to maintain
> and actually useful for testing. Instead we should expand it into a
> tristate:
>
> dax=off
> dax=flag
> dax=always
>
> where the existing "dax" option maps to "dax=always" and nodax maps
> to "dax=off". and dax=flag becomes the default for DAX capable devices.

That works for me. In summary:

- Applications must call statx to discover the current S_DAX state.

- There exists an advisory file inode flag FS_XFLAG_DAX that can be
changed on files that have no blocks allocated to them. Changing
this flag does not necessarily change the S_DAX state immediately
but programs can query the S_DAX state via statx.

If FS_XFLAG_DAX is set and the fs is on pmem then it will always
enable S_DAX at inode load time; if FS_XFLAG_DAX is not set, it will
never enable S_DAX. Unless overridden...

- There exists a dax= mount option. dax=off means "never set S_DAX,
ignore FS_XFLAG_DAX"; dax=always means "always set S_DAX (at least on
pmem), ignore FS_XFLAG_DAX"; and dax=iflag means "follow FS_XFLAG_DAX"
and is the default. "dax" by itself means "dax=always". "nodax"
means "dax=off".

- There exists an advisory directory inode flag FS_XFLAG_DAX that can
be changed at any time. The flag state is copied into any files or
subdirectories created within that directory. If programs require
that file access runs in S_DAX mode, they'll have to create those
files themselves inside a directory with FS_XFLAG_DAX set, or mount
the fs with dax=always.

Ok? Let's please get this part finished for 5.8, then we can get back
to arguing about fs-rmap and reflink and dax and whatnot.

--D

2020-04-01 10:25:43

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH V5 00/12] Enable per-file/per-directory DAX operations V5

On Wed 01-04-20 04:00:21, Darrick J. Wong wrote:
> On Mon, Mar 16, 2020 at 10:55:09AM +0100, Christoph Hellwig wrote:
> > On Mon, Mar 16, 2020 at 10:52:24AM +0100, Jan Kara wrote:
> > > > This sounds reasonable to me.
> > > >
> > > > As for deprecating the mount option, I think at a minimum it needs to
> > > > continue be accepted as an option even if it is ignored to not break
> > > > existing setups.
> > >
> > > Agreed. But that's how we usually deprecate mount options. Also I'd say
> > > that statx() support for reporting DAX state and some education of
> > > programmers using DAX is required before we deprecate the mount option
> > > since currently applications check 'dax' mount option to determine how much
> > > memory they need to set aside for page cache before they consume everything
> > > else on the machine...
> >
> > I don't even think we should deprecate it. It isn't painful to maintain
> > and actually useful for testing. Instead we should expand it into a
> > tristate:
> >
> > dax=off
> > dax=flag
> > dax=always
> >
> > where the existing "dax" option maps to "dax=always" and nodax maps
> > to "dax=off". and dax=flag becomes the default for DAX capable devices.
>
> That works for me. In summary:
>
> - Applications must call statx to discover the current S_DAX state.
>
> - There exists an advisory file inode flag FS_XFLAG_DAX that can be
> changed on files that have no blocks allocated to them. Changing
> this flag does not necessarily change the S_DAX state immediately
> but programs can query the S_DAX state via statx.

I generally like the proposal but I think the fact that toggling
FS_XFLAG_DAX will not have immediate effect on S_DAX will cause quite some
confusion and ultimately bug reports. I'm thinking whether we could somehow
improve this... For example an ioctl that would try to make set inode flags
effective by evicting the inode (and returning EBUSY if the eviction is
impossible for some reason)?

Honza

>
> If FS_XFLAG_DAX is set and the fs is on pmem then it will always
> enable S_DAX at inode load time; if FS_XFLAG_DAX is not set, it will
> never enable S_DAX. Unless overridden...
>
> - There exists a dax= mount option. dax=off means "never set S_DAX,
> ignore FS_XFLAG_DAX"; dax=always means "always set S_DAX (at least on
> pmem), ignore FS_XFLAG_DAX"; and dax=iflag means "follow FS_XFLAG_DAX"
> and is the default. "dax" by itself means "dax=always". "nodax"
> means "dax=off".
>
> - There exists an advisory directory inode flag FS_XFLAG_DAX that can
> be changed at any time. The flag state is copied into any files or
> subdirectories created within that directory. If programs require
> that file access runs in S_DAX mode, they'll have to create those
> files themselves inside a directory with FS_XFLAG_DAX set, or mount
> the fs with dax=always.
>
> Ok? Let's please get this part finished for 5.8, then we can get back
> to arguing about fs-rmap and reflink and dax and whatnot.
>
> --D
--
Jan Kara <[email protected]>
SUSE Labs, CR

2020-04-02 08:54:45

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH V5 00/12] Enable per-file/per-directory DAX operations V5

On Wed, Apr 01, 2020 at 12:25:11PM +0200, Jan Kara wrote:
> > - Applications must call statx to discover the current S_DAX state.
> >
> > - There exists an advisory file inode flag FS_XFLAG_DAX that can be
> > changed on files that have no blocks allocated to them. Changing
> > this flag does not necessarily change the S_DAX state immediately
> > but programs can query the S_DAX state via statx.
>
> I generally like the proposal but I think the fact that toggling
> FS_XFLAG_DAX will not have immediate effect on S_DAX will cause quite some
> confusion and ultimately bug reports. I'm thinking whether we could somehow
> improve this... For example an ioctl that would try to make set inode flags
> effective by evicting the inode (and returning EBUSY if the eviction is
> impossible for some reason)?

I'd just return an error for that case, don't play silly games like
evicting the inode.

2020-04-02 21:00:23

by Ira Weiny

[permalink] [raw]
Subject: Re: [PATCH V5 00/12] Enable per-file/per-directory DAX operations V5

On Thu, Apr 02, 2020 at 10:53:27AM +0200, Christoph Hellwig wrote:
> On Wed, Apr 01, 2020 at 12:25:11PM +0200, Jan Kara wrote:
> > > - Applications must call statx to discover the current S_DAX state.
> > >
> > > - There exists an advisory file inode flag FS_XFLAG_DAX that can be
> > > changed on files that have no blocks allocated to them. Changing
> > > this flag does not necessarily change the S_DAX state immediately
> > > but programs can query the S_DAX state via statx.
> >
> > I generally like the proposal but I think the fact that toggling
> > FS_XFLAG_DAX will not have immediate effect on S_DAX will cause quite some
> > confusion and ultimately bug reports. I'm thinking whether we could somehow
> > improve this... For example an ioctl that would try to make set inode flags
> > effective by evicting the inode (and returning EBUSY if the eviction is
> > impossible for some reason)?
>
> I'd just return an error for that case, don't play silly games like
> evicting the inode.

I think I agree with Christoph here. But I want to clarify. I was heading in
a direction of failing the ioctl completely. But we could have the flag change
with an appropriate error which could let the user know the change has been
delayed.

But I don't immediately see what error code is appropriate for such an
indication. Candidates I can envision:

EAGAIN
ERESTART
EUSERS
EINPROGRESS

None are perfect but I'm leaning toward EINPROGRESS.

Ira

2020-04-03 04:40:31

by Ira Weiny

[permalink] [raw]
Subject: Re: [PATCH V5 00/12] Enable per-file/per-directory DAX operations V5

On Wed, Apr 01, 2020 at 04:00:21AM +0000, Darrick J. Wong wrote:
> On Mon, Mar 16, 2020 at 10:55:09AM +0100, Christoph Hellwig wrote:
> > On Mon, Mar 16, 2020 at 10:52:24AM +0100, Jan Kara wrote:
> > > > This sounds reasonable to me.
> > > >
> > > > As for deprecating the mount option, I think at a minimum it needs to
> > > > continue be accepted as an option even if it is ignored to not break
> > > > existing setups.
> > >
> > > Agreed. But that's how we usually deprecate mount options. Also I'd say
> > > that statx() support for reporting DAX state and some education of
> > > programmers using DAX is required before we deprecate the mount option
> > > since currently applications check 'dax' mount option to determine how much
> > > memory they need to set aside for page cache before they consume everything
> > > else on the machine...
> >
> > I don't even think we should deprecate it. It isn't painful to maintain
> > and actually useful for testing. Instead we should expand it into a
> > tristate:
> >
> > dax=off
> > dax=flag
> > dax=always
> >
> > where the existing "dax" option maps to "dax=always" and nodax maps
> > to "dax=off". and dax=flag becomes the default for DAX capable devices.
>
> That works for me. In summary:
>
> - Applications must call statx to discover the current S_DAX state.
>
> - There exists an advisory file inode flag FS_XFLAG_DAX that can be
> changed on files that have no blocks allocated to them. Changing
> this flag does not necessarily change the S_DAX state immediately
> but programs can query the S_DAX state via statx.
>
> If FS_XFLAG_DAX is set and the fs is on pmem then it will always
> enable S_DAX at inode load time; if FS_XFLAG_DAX is not set, it will
> never enable S_DAX. Unless overridden...
>
> - There exists a dax= mount option. dax=off means "never set S_DAX,
> ignore FS_XFLAG_DAX"; dax=always means "always set S_DAX (at least on
> pmem), ignore FS_XFLAG_DAX"; and dax=iflag means "follow FS_XFLAG_DAX"
> and is the default. "dax" by itself means "dax=always". "nodax"
> means "dax=off".
>
> - There exists an advisory directory inode flag FS_XFLAG_DAX that can
> be changed at any time. The flag state is copied into any files or
> subdirectories created within that directory. If programs require
> that file access runs in S_DAX mode, they'll have to create those
> files themselves inside a directory with FS_XFLAG_DAX set, or mount
> the fs with dax=always.

One other thing to add here. They _can_ set the FS_XFLAG_DAX on a file with
data and force an eviction to get S_DAX to change.

I think that is a nice reason to have a different error code returned.

>
> Ok? Let's please get this part finished for 5.8, then we can get back
> to arguing about fs-rmap and reflink and dax and whatnot.

I'm happy to see you motivated to get this in.

I'm starting with a new xfstest to make sure we agree on the semantics prior to
more patches. I hope to have the xfstest patch sent tomorrow sometime.

Ira

>
> --D

2020-04-03 07:28:31

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH V5 00/12] Enable per-file/per-directory DAX operations V5

On Thu, Apr 02, 2020 at 01:55:19PM -0700, Ira Weiny wrote:
> > I'd just return an error for that case, don't play silly games like
> > evicting the inode.
>
> I think I agree with Christoph here. But I want to clarify. I was heading in
> a direction of failing the ioctl completely. But we could have the flag change
> with an appropriate error which could let the user know the change has been
> delayed.
>
> But I don't immediately see what error code is appropriate for such an
> indication. Candidates I can envision:
>
> EAGAIN
> ERESTART
> EUSERS
> EINPROGRESS
>
> None are perfect but I'm leaning toward EINPROGRESS.

I really, really dislike that idea. The whole point of not forcing
evictions is to make it clear - no this inode is "busy" you can't
do that. A reasonably smart application can try to evict itself.

But returning an error and doing a lazy change anyway is straight from
the playbook for arcane and confusing API designs.

2020-04-03 15:49:44

by Ira Weiny

[permalink] [raw]
Subject: Re: [PATCH V5 00/12] Enable per-file/per-directory DAX operations V5

On Fri, Apr 03, 2020 at 09:27:31AM +0200, Christoph Hellwig wrote:
> On Thu, Apr 02, 2020 at 01:55:19PM -0700, Ira Weiny wrote:
> > > I'd just return an error for that case, don't play silly games like
> > > evicting the inode.
> >
> > I think I agree with Christoph here. But I want to clarify. I was heading in
> > a direction of failing the ioctl completely. But we could have the flag change
> > with an appropriate error which could let the user know the change has been
> > delayed.
> >
> > But I don't immediately see what error code is appropriate for such an
> > indication. Candidates I can envision:
> >
> > EAGAIN
> > ERESTART
> > EUSERS
> > EINPROGRESS
> >
> > None are perfect but I'm leaning toward EINPROGRESS.
>
> I really, really dislike that idea. The whole point of not forcing
> evictions is to make it clear - no this inode is "busy" you can't
> do that. A reasonably smart application can try to evict itself.

I don't understand. What Darrick proposed would never need any evictions. If
the file has blocks allocated the FS_XFLAG_DAX flag can not be changed. So I
don't see what good eviction would do at all.

>
> But returning an error and doing a lazy change anyway is straight from
> the playbook for arcane and confusing API designs.

Jan countered with a proposal that the FS_XFLAG_DAX does change with blocks
allocated. But that S_DAX would change on eviction. Adding that some eviction
ioctl could be added.

You then proposed just returning an error for that case. (This lead me to
believe that you were ok with an eviction based change of S_DAX.)

So I agreed that changing S_DAX could be delayed until an explicit eviction.
But, to aid the 'smart application', a different error code could be used to
indicate that the FS_XFLAG_DAX had been changed but that until that explicit
eviction occurs S_DAX would remain.

So I don't fully follow what you mean by 'lazy change'?

Do you still really, really dislike an explicit eviction method for changing
the S_DAX flag?

If FS_XFLAG_DAX can never be changed on a file with blocks allocated and the
user wants to change the mode of operations on their 'data'; they would have to
create a new file with the proper setting and move the data there. For example
copy the file into a directory marked FS_XFLAG_DAX==true?

I'm ok with either interface as I think both could be clear if documented.

Jan?

Ira

2020-04-03 18:21:49

by Ira Weiny

[permalink] [raw]
Subject: Re: [PATCH V5 00/12] Enable per-file/per-directory DAX operations V5

On Fri, Apr 03, 2020 at 07:03:38PM +0200, Jan Kara wrote:
> On Fri 03-04-20 08:48:29, Ira Weiny wrote:
> > On Fri, Apr 03, 2020 at 09:27:31AM +0200, Christoph Hellwig wrote:
> > > On Thu, Apr 02, 2020 at 01:55:19PM -0700, Ira Weiny wrote:
> > > > > I'd just return an error for that case, don't play silly games like
> > > > > evicting the inode.
> > > >
> > > > I think I agree with Christoph here. But I want to clarify. I was heading in
> > > > a direction of failing the ioctl completely. But we could have the flag change
> > > > with an appropriate error which could let the user know the change has been
> > > > delayed.
> > > >
> > > > But I don't immediately see what error code is appropriate for such an
> > > > indication. Candidates I can envision:
> > > >
> > > > EAGAIN
> > > > ERESTART
> > > > EUSERS
> > > > EINPROGRESS
> > > >
> > > > None are perfect but I'm leaning toward EINPROGRESS.
> > >
> > > I really, really dislike that idea. The whole point of not forcing
> > > evictions is to make it clear - no this inode is "busy" you can't
> > > do that. A reasonably smart application can try to evict itself.
> >
> > I don't understand. What Darrick proposed would never need any
> > evictions. If the file has blocks allocated the FS_XFLAG_DAX flag can
> > not be changed. So I don't see what good eviction would do at all.
>
> I guess there's some confusion here (may well be than on my side). Darrick
> propose that we can switch FS_XFLAG_DAX only when file has no blocks
> allocated - fine by me. But that still does not mean than we can switch
> S_DAX immediately, does it? Because that would still mean we need to switch
> aops on living inode and that's ... difficult and Christoph didn't want to
> clutter the code with it.
>
> So I've understood Darrick's proposal as: Just switch FS_XFLAG_DAX flag,
> S_DAX flag will magically switch when inode gets evicted and the inode gets
> reloaded from the disk again. Did I misunderstand anything?
>
> And my thinking was that this is surprising behavior for the user and so it
> will likely generate lots of bug reports along the lines of "DAX inode flag
> does not work!". So I was pondering how to make the behavior less
> confusing... The ioctl I've suggested was just a poor attempt at that.

Ok but then I don't understand Christophs comment to "just return an error for
that case"? Which case?

>
> > > But returning an error and doing a lazy change anyway is straight from
> > > the playbook for arcane and confusing API designs.
> >
> > Jan countered with a proposal that the FS_XFLAG_DAX does change with
> > blocks allocated. But that S_DAX would change on eviction. Adding that
> > some eviction ioctl could be added.
>
> No, I didn't mean that we can change FS_XFLAG_DAX with blocks allocated. I
> was still speaking about the case without blocks allocated.

Ah ok good point. But again what 'error' do we return when FS_XFLAG_DAX
changed but S_DAX did not?

>
> > You then proposed just returning an error for that case. (This lead me to
> > believe that you were ok with an eviction based change of S_DAX.)
> >
> > So I agreed that changing S_DAX could be delayed until an explicit eviction.
> > But, to aid the 'smart application', a different error code could be used to
> > indicate that the FS_XFLAG_DAX had been changed but that until that explicit
> > eviction occurs S_DAX would remain.
> >
> > So I don't fully follow what you mean by 'lazy change'?
> >
> > Do you still really, really dislike an explicit eviction method for changing
> > the S_DAX flag?
> >
> > If FS_XFLAG_DAX can never be changed on a file with blocks allocated and the
> > user wants to change the mode of operations on their 'data'; they would have to
> > create a new file with the proper setting and move the data there. For example
> > copy the file into a directory marked FS_XFLAG_DAX==true?
> >
> > I'm ok with either interface as I think both could be clear if documented.
>
> I agree that what Darrick suggested is technically easily doable and can be
> documented. But it is not natural behavior (i.e., different than all inode
> flags we have) and we know how careful people are when reading
> documentation...
>

Ok For 5.8 why don't we not allow FS_XFLAG_DAX to be changed on files _at_
_all_...

In summary:

- Applications must call statx to discover the current S_DAX state.

- There exists an advisory file inode flag FS_XFLAG_DAX that is set based on
the parent directory FS_XFLAG_DAX inode flag. (There is no way to change
this flag after file creation.)

If FS_XFLAG_DAX is set and the fs is on pmem then it will enable S_DAX at
inode load time; if FS_XFLAG_DAX is not set, it will not enable S_DAX.
Unless overridden...

- There exists a dax= mount option.

"-o dax=off" means "never set S_DAX, ignore FS_XFLAG_DAX"
"-o nodax" means "dax=off"
"-o dax=always" means "always set S_DAX (at least on pmem), ignore FS_XFLAG_DAX"
"-o dax" by itself means "dax=always"
"-o dax=iflag" means "follow FS_XFLAG_DAX" and is the default

- There exists an advisory directory inode flag FS_XFLAG_DAX that can be
changed at any time. The flag state is copied into any files or
subdirectories when they are created within that directory. If programs
require file access runs in S_DAX mode, they'll have to create those files
inside a directory with FS_XFLAG_DAX set, or mount the fs with an
appropriate dax mount option.


???

Ira

2020-04-03 18:31:40

by Darrick J. Wong

[permalink] [raw]
Subject: Re: [PATCH V5 00/12] Enable per-file/per-directory DAX operations V5

On Fri, Apr 03, 2020 at 07:03:38PM +0200, Jan Kara wrote:
> On Fri 03-04-20 08:48:29, Ira Weiny wrote:
> > On Fri, Apr 03, 2020 at 09:27:31AM +0200, Christoph Hellwig wrote:
> > > On Thu, Apr 02, 2020 at 01:55:19PM -0700, Ira Weiny wrote:
> > > > > I'd just return an error for that case, don't play silly games like
> > > > > evicting the inode.
> > > >
> > > > I think I agree with Christoph here. But I want to clarify. I was heading in
> > > > a direction of failing the ioctl completely. But we could have the flag change
> > > > with an appropriate error which could let the user know the change has been
> > > > delayed.
> > > >
> > > > But I don't immediately see what error code is appropriate for such an
> > > > indication. Candidates I can envision:
> > > >
> > > > EAGAIN
> > > > ERESTART
> > > > EUSERS
> > > > EINPROGRESS
> > > >
> > > > None are perfect but I'm leaning toward EINPROGRESS.
> > >
> > > I really, really dislike that idea. The whole point of not forcing
> > > evictions is to make it clear - no this inode is "busy" you can't
> > > do that. A reasonably smart application can try to evict itself.
> >
> > I don't understand. What Darrick proposed would never need any
> > evictions. If the file has blocks allocated the FS_XFLAG_DAX flag can
> > not be changed. So I don't see what good eviction would do at all.
>
> I guess there's some confusion here (may well be than on my side). Darrick
> propose that we can switch FS_XFLAG_DAX only when file has no blocks
> allocated - fine by me. But that still does not mean than we can switch
> S_DAX immediately, does it? Because that would still mean we need to switch
> aops on living inode and that's ... difficult and Christoph didn't want to
> clutter the code with it.

IIRC, the reason Ira was trying to introduce this file operations lock
is because there isn't any other safe way to change the operations
dynamically, because some of those operations don't start taking any
locks at all until after we've accessed the file operations pointer.

Now that I think about it some more, that's also means we can't change
S_DAX even on files without blocks allocated. Look at fallocate, it
doesn't even take i_rwsem until we've called into (say)
xfs_file_fallocate.

Ok, so with that in mind, I think we simply have to say that
FS_XFLAG_DAX is 100% advisory, and S_DAX will magically switch some time
in the future or after the next umount/mount cycle.

FSSETXATTR can't evict an inode it has just set FS_XFLAG_DAX on, because
it applies that change to the fd passed into ioctl(), which means that the
caller has a reference to the fd -> file -> inode, which means the inode
is unevictable until after the call completes.

IOWs,

> So I've understood Darrick's proposal as: Just switch FS_XFLAG_DAX flag,
> S_DAX flag will magically switch when inode gets evicted and the inode gets
> reloaded from the disk again. Did I misunderstand anything?

That was /my/ understanding. :)

> And my thinking was that this is surprising behavior for the user and so it
> will likely generate lots of bug reports along the lines of "DAX inode flag
> does not work!". So I was pondering how to make the behavior less
> confusing... The ioctl I've suggested was just a poor attempt at that.

At best, userspace uses FSSETXATTR to change FS_XFLAG_DAX, and then
calls some as-yet-undefined ioctl to try to evict the inode from memory.
I'm not sure how you'd actually do that, though, considering that you'd
have to close the fd and as soon as that happens the inode can disappear
permanently.

Ok, that's not sane. Forget I ever wrote that.

> > > But returning an error and doing a lazy change anyway is straight from
> > > the playbook for arcane and confusing API designs.
> >
> > Jan countered with a proposal that the FS_XFLAG_DAX does change with
> > blocks allocated. But that S_DAX would change on eviction. Adding that
> > some eviction ioctl could be added.
>
> No, I didn't mean that we can change FS_XFLAG_DAX with blocks allocated. I
> was still speaking about the case without blocks allocated.
>
> > You then proposed just returning an error for that case. (This lead me to
> > believe that you were ok with an eviction based change of S_DAX.)
> >
> > So I agreed that changing S_DAX could be delayed until an explicit eviction.
> > But, to aid the 'smart application', a different error code could be used to
> > indicate that the FS_XFLAG_DAX had been changed but that until that explicit
> > eviction occurs S_DAX would remain.

There's no point in returning a magic error code from FSSETXATTR, as I
realized above. To restate: FSSETXATTR can't evict the inode, so it
would always have to return the magic error code. The best we can do is
tell userspace that they can set the advisory FS_XFLAG_DAX flag and then
check STATX_ATTR_DAX immediately afterwards. If some day in the future
we get smarter and can change it immediately, the statx output will
reflect that.

> > So I don't fully follow what you mean by 'lazy change'?
> >
> > Do you still really, really dislike an explicit eviction method for changing
> > the S_DAX flag?
> >
> > If FS_XFLAG_DAX can never be changed on a file with blocks allocated and the
> > user wants to change the mode of operations on their 'data'; they would have to
> > create a new file with the proper setting and move the data there. For example
> > copy the file into a directory marked FS_XFLAG_DAX==true?
> >
> > I'm ok with either interface as I think both could be clear if documented.
>
> I agree that what Darrick suggested is technically easily doable and can be
> documented. But it is not natural behavior (i.e., different than all inode
> flags we have) and we know how careful people are when reading
> documentation...

To reflect all that I've rambled in this thread, I withdraw the previous
paragraph and submit this one for consideration:

- There exists an advisory file inode flag FS_XFLAG_DAX.

If FS_XFLAG_DAX is set and the fs is on pmem then it will always
enable S_DAX at inode load time; if FS_XFLAG_DAX is not set, it will
never enable S_DAX. The advice can be overridden by mount option.

Changing this flag does not necessarily change the S_DAX state
immediately but programs can query the S_DAX state via statx to
detect when the new advice has gone into effect.

Consider this from the perspective of minimizing changes to userspace
programs between now and the future. If your program really wants
S_DAX, you can write:

fd = open(...);

ioctl(fd, FSGETXATTR, &fsx);

if (fsx.xflags & FS_XFLAG_DAX)
return 0;

fsx.xflags |= FS_XFLAG_DAX;
ioctl(fd, FSSETXATTR, &fsx);

do {
statx(fd, STATX_GIVE_ME_EVERYTHING, &statx);
if (statx.attrs & STATX_ATTR_DAX)
return 0;

/* do stupid magic to evict things */
} while (we haven't gotten bored and wandered away);

This code snippet will work with the current limitations of the kernel,
and it'll continue working even on Linux v5000 where we finally figure
out how to change S_DAX on the fly.

--D

>
>
> Honza
> --
> Jan Kara <[email protected]>
> SUSE Labs, CR

2020-04-03 18:43:53

by Darrick J. Wong

[permalink] [raw]
Subject: Re: [PATCH V5 00/12] Enable per-file/per-directory DAX operations V5

On Fri, Apr 03, 2020 at 11:18:43AM -0700, Ira Weiny wrote:
> On Fri, Apr 03, 2020 at 07:03:38PM +0200, Jan Kara wrote:
> > On Fri 03-04-20 08:48:29, Ira Weiny wrote:
> > > On Fri, Apr 03, 2020 at 09:27:31AM +0200, Christoph Hellwig wrote:
> > > > On Thu, Apr 02, 2020 at 01:55:19PM -0700, Ira Weiny wrote:
> > > > > > I'd just return an error for that case, don't play silly games like
> > > > > > evicting the inode.
> > > > >
> > > > > I think I agree with Christoph here. But I want to clarify. I was heading in
> > > > > a direction of failing the ioctl completely. But we could have the flag change
> > > > > with an appropriate error which could let the user know the change has been
> > > > > delayed.
> > > > >
> > > > > But I don't immediately see what error code is appropriate for such an
> > > > > indication. Candidates I can envision:
> > > > >
> > > > > EAGAIN
> > > > > ERESTART
> > > > > EUSERS
> > > > > EINPROGRESS
> > > > >
> > > > > None are perfect but I'm leaning toward EINPROGRESS.
> > > >
> > > > I really, really dislike that idea. The whole point of not forcing
> > > > evictions is to make it clear - no this inode is "busy" you can't
> > > > do that. A reasonably smart application can try to evict itself.
> > >
> > > I don't understand. What Darrick proposed would never need any
> > > evictions. If the file has blocks allocated the FS_XFLAG_DAX flag can
> > > not be changed. So I don't see what good eviction would do at all.
> >
> > I guess there's some confusion here (may well be than on my side). Darrick
> > propose that we can switch FS_XFLAG_DAX only when file has no blocks
> > allocated - fine by me. But that still does not mean than we can switch
> > S_DAX immediately, does it? Because that would still mean we need to switch
> > aops on living inode and that's ... difficult and Christoph didn't want to
> > clutter the code with it.
> >
> > So I've understood Darrick's proposal as: Just switch FS_XFLAG_DAX flag,
> > S_DAX flag will magically switch when inode gets evicted and the inode gets
> > reloaded from the disk again. Did I misunderstand anything?
> >
> > And my thinking was that this is surprising behavior for the user and so it
> > will likely generate lots of bug reports along the lines of "DAX inode flag
> > does not work!". So I was pondering how to make the behavior less
> > confusing... The ioctl I've suggested was just a poor attempt at that.
>
> Ok but then I don't understand Christophs comment to "just return an error for
> that case"? Which case?
>
> >
> > > > But returning an error and doing a lazy change anyway is straight from
> > > > the playbook for arcane and confusing API designs.
> > >
> > > Jan countered with a proposal that the FS_XFLAG_DAX does change with
> > > blocks allocated. But that S_DAX would change on eviction. Adding that
> > > some eviction ioctl could be added.
> >
> > No, I didn't mean that we can change FS_XFLAG_DAX with blocks allocated. I
> > was still speaking about the case without blocks allocated.
>
> Ah ok good point. But again what 'error' do we return when FS_XFLAG_DAX
> changed but S_DAX did not?
>
> >
> > > You then proposed just returning an error for that case. (This lead me to
> > > believe that you were ok with an eviction based change of S_DAX.)
> > >
> > > So I agreed that changing S_DAX could be delayed until an explicit eviction.
> > > But, to aid the 'smart application', a different error code could be used to
> > > indicate that the FS_XFLAG_DAX had been changed but that until that explicit
> > > eviction occurs S_DAX would remain.
> > >
> > > So I don't fully follow what you mean by 'lazy change'?
> > >
> > > Do you still really, really dislike an explicit eviction method for changing
> > > the S_DAX flag?
> > >
> > > If FS_XFLAG_DAX can never be changed on a file with blocks allocated and the
> > > user wants to change the mode of operations on their 'data'; they would have to
> > > create a new file with the proper setting and move the data there. For example
> > > copy the file into a directory marked FS_XFLAG_DAX==true?
> > >
> > > I'm ok with either interface as I think both could be clear if documented.
> >
> > I agree that what Darrick suggested is technically easily doable and can be
> > documented. But it is not natural behavior (i.e., different than all inode
> > flags we have) and we know how careful people are when reading
> > documentation...
> >
>
> Ok For 5.8 why don't we not allow FS_XFLAG_DAX to be changed on files _at_
> _all_...
>
> In summary:
>
> - Applications must call statx to discover the current S_DAX state.

Ok.

> - There exists an advisory file inode flag FS_XFLAG_DAX that is set based on
> the parent directory FS_XFLAG_DAX inode flag. (There is no way to change
> this flag after file creation.)
>
> If FS_XFLAG_DAX is set and the fs is on pmem then it will enable S_DAX at
> inode load time; if FS_XFLAG_DAX is not set, it will not enable S_DAX.
> Unless overridden...

Ok, fine with me. :)

> - There exists a dax= mount option.
>
> "-o dax=off" means "never set S_DAX, ignore FS_XFLAG_DAX"
> "-o nodax" means "dax=off"

I surveyed the three fses that support dax and found that none of the
filesystems actually have a 'nodax' flag. Now would be the time not to
add such a thing, and make people specify dax=off instead. It would
be handy if we could have a single fsparam_enum for figuring out the dax
mount options.

> "-o dax=always" means "always set S_DAX (at least on pmem), ignore FS_XFLAG_DAX"
> "-o dax" by itself means "dax=always"
> "-o dax=iflag" means "follow FS_XFLAG_DAX" and is the default
>
> - There exists an advisory directory inode flag FS_XFLAG_DAX that can be
> changed at any time. The flag state is copied into any files or
> subdirectories when they are created within that directory. If programs
> require file access runs in S_DAX mode, they'll have to create those files

"...they must create..."

> inside a directory with FS_XFLAG_DAX set, or mount the fs with an
> appropriate dax mount option.

Otherwise seems ok to me.

--D

>
>
> ???
>
> Ira
>