2020-04-07 18:30:36

by Ira Weiny

[permalink] [raw]
Subject: [PATCH V6 0/8] Enable per-file/per-directory DAX operations V6

From: Ira Weiny <[email protected]>

Changes from V5:
* make dax mount option a tri-state
* Reject changes to FS_XFLAG_DAX for regular files
- Allow only on directories
* Update documentation

At LSF/MM'19 [1] [2] we discussed applications that overestimate memory
consumption due to their inability to detect whether the kernel will
instantiate page cache for a file, and cases where a global dax enable via a
mount option is too coarse.

The following patch series enables the use of DAX on individual files and/or
directories on xfs, and lays some groundwork to do so in ext4. It further
enhances the dax mount option to be a tri-state of 'always', 'never', or
'iflag' (default). Furthermore, it maintians '-o dax' to be equivalent to '-o
dax=always'.

The insight at LSF/MM was to separate the per-mount or per-file "physical"
(FS_XFLAG_DAX) capability switch from an "effective" (S_DAX) attribute for the
file.

At LSF/MM we discussed the difficulties of switching the DAX state of a file
with active mappings / page cache. It was thought the races could be avoided
by limiting DAX state flips to 0-length files.

However, this turns out to not be true.[3][5] This is because address space
operations (a_ops) may be in use at any time the inode is referenced.

For this reason direct manipulation of the FS_XFLAG_DAX file is prohibited on
files in this patch set. File can only inherit this flag from their parent
directory on creation.

Details of when and how DAX state can be changed on a file is included in a
documentation patch.

It should be noted that FS_XFLAG_DAX inheritance is not shown in this patch set
as it was maintained from previous work on XFS. FS_XFLAG_DAX and it's
inheritance will need to be added to other file systems for user control.


[1] https://lwn.net/Articles/787973/
[2] https://lwn.net/Articles/787233/
[3] https://lkml.org/lkml/2019/10/20/96
[4] https://patchwork.kernel.org/patch/11310511/
[5] https://lore.kernel.org/lkml/[email protected]/


To: [email protected]
Cc: "Darrick J. Wong" <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Dave Chinner <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: "Theodore Y. Ts'o" <[email protected]>
Cc: Jan Kara <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]


Changes from V4:
* Open code the aops lock rather than add it to the xfs_ilock()
subsystem (Darrick's comments were obsoleted by this change)
* Fix lkp build suggestions and bugs

Changes from V3:
* Remove global locking... :-D
* put back per inode locking and remove pre-mature optimizations
* Fix issues with Directories having IS_DAX() set
* Fix kernel crash issues reported by Jeff
* Add some clean up patches
* Consolidate diflags to iflags functions
* Update/add documentation
* Reorder/rename patches quite a bit

Changes from V2:

* Move i_dax_sem to be a global percpu_rw_sem rather than per inode
Internal discussions with Dan determined this would be easier,
just as performant, and slightly less overhead that having it
in the SB as suggested by Jan
* Fix locking order in comments and throughout code
* Change "mode" to "state" throughout commits
* Add CONFIG_FS_DAX wrapper to disable inode_[un]lock_state() when not
configured
* Add static branch for which is activated by a device which supports
DAX in XFS
* Change "lock/unlock" to up/down read/write as appropriate
Previous names were over simplified
* Update comments/documentation

* Remove the xfs specific lock to the vfs (global) layer.
* Fix i_dax_sem locking order and comments

* Move 'i_mapped' count from struct inode to struct address_space and
rename it to mmap_count
* Add inode_has_mappings() call

* Fix build issues
* Clean up syntax spacing and minor issues
* Update man page text for STATX_ATTR_DAX
* Add reviewed-by's
* Rebase to 5.6

Rename patch:
from: fs/xfs: Add lock/unlock state to xfs
to: fs/xfs: Add write DAX lock to xfs layer
Add patch:
fs/xfs: Clarify lockdep dependency for xfs_isilocked()
Drop patch:
fs/xfs: Fix truncate up

Ira Weiny (8):
fs/xfs: Remove unnecessary initialization of i_rwsem
fs: Remove unneeded IS_DAX() check
fs/stat: Define DAX statx attribute
fs/xfs: Make DAX mount option a tri-state
fs/xfs: Create function xfs_inode_enable_dax()
fs/xfs: Combine xfs_diflags_to_linux() and xfs_diflags_to_iflags()
fs/xfs: Change xfs_ioctl_setattr_dax_invalidate() to
xfs_ioctl_dax_check()
Documentation/dax: Update Usage section

Documentation/filesystems/dax.txt | 94 +++++++++++++++++++++-
fs/stat.c | 3 +
fs/xfs/xfs_icache.c | 4 +-
fs/xfs/xfs_inode.h | 1 +
fs/xfs/xfs_ioctl.c | 124 +++---------------------------
fs/xfs/xfs_iops.c | 62 ++++++++++-----
fs/xfs/xfs_mount.h | 26 ++++++-
fs/xfs/xfs_super.c | 34 ++++++--
include/linux/fs.h | 2 +-
include/uapi/linux/stat.h | 1 +
10 files changed, 206 insertions(+), 145 deletions(-)

--
2.25.1


2020-04-07 18:30:39

by Ira Weiny

[permalink] [raw]
Subject: [PATCH V6 1/8] fs/xfs: Remove unnecessary initialization of i_rwsem

From: Ira Weiny <[email protected]>

An earlier call of xfs_reinit_inode() from xfs_iget_cache_hit() already
handles initialization of i_rwsem.

Doing so again is unneeded.

Signed-off-by: Ira Weiny <[email protected]>

---
Changes from V4:
Update commit message to make it clear the xfs_iget_cache_hit()
is actually doing the initialization via xfs_reinit_inode()

New for V4:

NOTE: This was found while ensuring the new i_aops_sem was properly
handled. It seems like this is a layering violation so I think it is
worth cleaning up so as to not confuse others.
---
fs/xfs/xfs_icache.c | 4 +---
1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index 8dc2e5414276..836a1f09be03 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -419,6 +419,7 @@ xfs_iget_cache_hit(
spin_unlock(&ip->i_flags_lock);
rcu_read_unlock();

+ ASSERT(!rwsem_is_locked(&inode->i_rwsem));
error = xfs_reinit_inode(mp, inode);
if (error) {
bool wake;
@@ -452,9 +453,6 @@ xfs_iget_cache_hit(
ip->i_sick = 0;
ip->i_checked = 0;

- ASSERT(!rwsem_is_locked(&inode->i_rwsem));
- init_rwsem(&inode->i_rwsem);
-
spin_unlock(&ip->i_flags_lock);
spin_unlock(&pag->pag_ici_lock);
} else {
--
2.25.1

2020-04-07 18:30:49

by Ira Weiny

[permalink] [raw]
Subject: [PATCH V6 2/8] fs: Remove unneeded IS_DAX() check

From: Ira Weiny <[email protected]>

Remove the check because DAX now has it's own read/write methods and
file systems which support DAX check IS_DAX() prior to IOCB_DIRECT on
their own. Therefore, it does not matter if the file state is DAX when
the iocb flags are created.

Reviewed-by: Dave Chinner <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
Signed-off-by: Ira Weiny <[email protected]>

---
Changes from v3:
Reword commit message.
Reordered to be a 'pre-cleanup' patch
---
include/linux/fs.h | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index abedbffe2c9e..f97b99c36cee 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -3389,7 +3389,7 @@ extern int file_update_time(struct file *file);

static inline bool io_is_direct(struct file *filp)
{
- return (filp->f_flags & O_DIRECT) || IS_DAX(filp->f_mapping->host);
+ return (filp->f_flags & O_DIRECT);
}

static inline bool vma_is_dax(struct vm_area_struct *vma)
--
2.25.1

2020-04-07 18:30:52

by Ira Weiny

[permalink] [raw]
Subject: [PATCH V6 3/8] fs/stat: Define DAX statx attribute

From: Ira Weiny <[email protected]>

In order for users to determine if a file is currently operating in DAX
state (effective DAX). Define a statx attribute value and set that
attribute if the effective DAX flag is set.

To go along with this we propose the following addition to the statx man
page:

STATX_ATTR_DAX

The file is in the DAX (cpu direct access) state. DAX state
attempts to minimize software cache effects for both I/O and
memory mappings of this file. It requires a file system which
has been configured to support DAX.

DAX generally assumes all accesses are via cpu load / store
instructions which can minimize overhead for small accesses, but
may adversely affect cpu utilization for large transfers.

File I/O is done directly to/from user-space buffers and memory
mapped I/O may be performed with direct memory mappings that
bypass kernel page cache.

While the DAX property tends to result in data being transferred
synchronously, it does not give the same guarantees of O_SYNC
where data and the necessary metadata are transferred together.

A DAX file may support being mapped with the MAP_SYNC flag,
which enables a program to use CPU cache flush instructions to
persist CPU store operations without an explicit fsync(2). See
mmap(2) for more information.

Reviewed-by: Jan Kara <[email protected]>
Reviewed-by: Darrick J. Wong <[email protected]>
Signed-off-by: Ira Weiny <[email protected]>

---
Changes from V2:
Update man page text with comments from Darrick, Jan, Dan, and
Dave.
---
fs/stat.c | 3 +++
include/uapi/linux/stat.h | 1 +
2 files changed, 4 insertions(+)

diff --git a/fs/stat.c b/fs/stat.c
index 030008796479..894699c74dde 100644
--- a/fs/stat.c
+++ b/fs/stat.c
@@ -79,6 +79,9 @@ int vfs_getattr_nosec(const struct path *path, struct kstat *stat,
if (IS_AUTOMOUNT(inode))
stat->attributes |= STATX_ATTR_AUTOMOUNT;

+ if (IS_DAX(inode))
+ stat->attributes |= STATX_ATTR_DAX;
+
if (inode->i_op->getattr)
return inode->i_op->getattr(path, stat, request_mask,
query_flags);
diff --git a/include/uapi/linux/stat.h b/include/uapi/linux/stat.h
index ad80a5c885d5..e5f9d5517f6b 100644
--- a/include/uapi/linux/stat.h
+++ b/include/uapi/linux/stat.h
@@ -169,6 +169,7 @@ struct statx {
#define STATX_ATTR_ENCRYPTED 0x00000800 /* [I] File requires key to decrypt in fs */
#define STATX_ATTR_AUTOMOUNT 0x00001000 /* Dir: Automount trigger */
#define STATX_ATTR_VERITY 0x00100000 /* [I] Verity protected file */
+#define STATX_ATTR_DAX 0x00002000 /* [I] File is DAX */


#endif /* _UAPI_LINUX_STAT_H */
--
2.25.1

2020-04-07 18:31:03

by Ira Weiny

[permalink] [raw]
Subject: [PATCH V6 4/8] fs/xfs: Make DAX mount option a tri-state

From: Ira Weiny <[email protected]>

As agreed upon[1]. We make the dax mount option a tri-state. '-o dax'
continues to operate the same. We add 'always', 'never', and 'iflag'
(default).

[1] https://lore.kernel.org/lkml/[email protected]/

Signed-off-by: Ira Weiny <[email protected]>

---
Changes from v5:
New Patch
---
fs/xfs/xfs_iops.c | 2 +-
fs/xfs/xfs_mount.h | 26 +++++++++++++++++++++++++-
fs/xfs/xfs_super.c | 34 +++++++++++++++++++++++++++++-----
3 files changed, 55 insertions(+), 7 deletions(-)

diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
index 81f2f93caec0..1ec4a36917bd 100644
--- a/fs/xfs/xfs_iops.c
+++ b/fs/xfs/xfs_iops.c
@@ -1248,7 +1248,7 @@ xfs_inode_supports_dax(
return false;

/* DAX mount option or DAX iflag must be set. */
- if (!(mp->m_flags & XFS_MOUNT_DAX) &&
+ if (xfs_mount_dax_mode(mp) != XFS_DAX_ALWAYS &&
!(ip->i_d.di_flags2 & XFS_DIFLAG2_DAX))
return false;

diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index 88ab09ed29e7..ce027ee06692 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -233,7 +233,31 @@ typedef struct xfs_mount {
allocator */
#define XFS_MOUNT_NOATTR2 (1ULL << 25) /* disable use of attr2 format */

-#define XFS_MOUNT_DAX (1ULL << 62) /* TEST ONLY! */
+/* DAX flag is a 2 bit field representing a tri-state for dax
+ * iflag, always, never
+ * We reserve/document the 2 bits using dax field/field2
+ */
+#define XFS_DAX_FIELD_MASK 0x3ULL
+#define XFS_DAX_FIELD_SHIFT 62
+#define XFS_MOUNT_DAX_FIELD (1ULL << 62)
+#define XFS_MOUNT_DAX_FIELD2 (1ULL << 63)
+
+enum {
+ XFS_DAX_IFLAG = 0,
+ XFS_DAX_ALWAYS = 1,
+ XFS_DAX_NEVER = 2,
+};
+
+static inline void xfs_mount_set_dax(struct xfs_mount *mp, u32 val)
+{
+ mp->m_flags &= ~(XFS_DAX_FIELD_MASK << XFS_DAX_FIELD_SHIFT);
+ mp->m_flags |= ((val & XFS_DAX_FIELD_MASK) << XFS_DAX_FIELD_SHIFT);
+}
+
+static inline u32 xfs_mount_dax_mode(struct xfs_mount *mp)
+{
+ return (mp->m_flags >> XFS_DAX_FIELD_SHIFT) & XFS_DAX_FIELD_MASK;
+}

/*
* Max and min values for mount-option defined I/O
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index 2094386af8ac..d2fd465eeed5 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -47,6 +47,13 @@ static struct kset *xfs_kset; /* top-level xfs sysfs dir */
static struct xfs_kobj xfs_dbg_kobj; /* global debug sysfs attrs */
#endif

+static const struct constant_table dax_param_enums[] = {
+ {"iflag", XFS_DAX_IFLAG },
+ {"always", XFS_DAX_ALWAYS },
+ {"never", XFS_DAX_NEVER },
+ {}
+};
+
/*
* Table driven mount option parser.
*/
@@ -59,7 +66,7 @@ enum {
Opt_filestreams, Opt_quota, Opt_noquota, Opt_usrquota, Opt_grpquota,
Opt_prjquota, Opt_uquota, Opt_gquota, Opt_pquota,
Opt_uqnoenforce, Opt_gqnoenforce, Opt_pqnoenforce, Opt_qnoenforce,
- Opt_discard, Opt_nodiscard, Opt_dax,
+ Opt_discard, Opt_nodiscard, Opt_dax, Opt_dax_enum,
};

static const struct fs_parameter_spec xfs_fs_parameters[] = {
@@ -103,6 +110,7 @@ static const struct fs_parameter_spec xfs_fs_parameters[] = {
fsparam_flag("discard", Opt_discard),
fsparam_flag("nodiscard", Opt_nodiscard),
fsparam_flag("dax", Opt_dax),
+ fsparam_enum("dax", Opt_dax_enum, dax_param_enums),
{}
};

@@ -129,7 +137,6 @@ xfs_fs_show_options(
{ XFS_MOUNT_GRPID, ",grpid" },
{ XFS_MOUNT_DISCARD, ",discard" },
{ XFS_MOUNT_LARGEIO, ",largeio" },
- { XFS_MOUNT_DAX, ",dax" },
{ 0, NULL }
};
struct xfs_mount *mp = XFS_M(root->d_sb);
@@ -185,6 +192,20 @@ xfs_fs_show_options(
if (!(mp->m_qflags & XFS_ALL_QUOTA_ACCT))
seq_puts(m, ",noquota");

+ switch (xfs_mount_dax_mode(mp)) {
+ case XFS_DAX_IFLAG:
+ seq_puts(m, ",dax=iflag");
+ break;
+ case XFS_DAX_ALWAYS:
+ seq_puts(m, ",dax=always");
+ break;
+ case XFS_DAX_NEVER:
+ seq_puts(m, ",dax=never");
+ break;
+ default:
+ break;
+ }
+
return 0;
}

@@ -1244,7 +1265,10 @@ xfs_fc_parse_param(
return 0;
#ifdef CONFIG_FS_DAX
case Opt_dax:
- mp->m_flags |= XFS_MOUNT_DAX;
+ xfs_mount_set_dax(mp, XFS_DAX_ALWAYS);
+ return 0;
+ case Opt_dax_enum:
+ xfs_mount_set_dax(mp, result.uint_32);
return 0;
#endif
default:
@@ -1437,7 +1461,7 @@ xfs_fc_fill_super(
if (XFS_SB_VERSION_NUM(&mp->m_sb) == XFS_SB_VERSION_5)
sb->s_flags |= SB_I_VERSION;

- if (mp->m_flags & XFS_MOUNT_DAX) {
+ if (xfs_mount_dax_mode(mp) == XFS_DAX_ALWAYS) {
bool rtdev_is_dax = false, datadev_is_dax;

xfs_warn(mp,
@@ -1451,7 +1475,7 @@ xfs_fc_fill_super(
if (!rtdev_is_dax && !datadev_is_dax) {
xfs_alert(mp,
"DAX unsupported by block device. Turning off DAX.");
- mp->m_flags &= ~XFS_MOUNT_DAX;
+ xfs_mount_set_dax(mp, XFS_DAX_NEVER);
}
if (xfs_sb_version_hasreflink(&mp->m_sb)) {
xfs_alert(mp,
--
2.25.1

2020-04-07 18:31:19

by Ira Weiny

[permalink] [raw]
Subject: [PATCH V6 6/8] fs/xfs: Combine xfs_diflags_to_linux() and xfs_diflags_to_iflags()

From: Ira Weiny <[email protected]>

The functionality in xfs_diflags_to_linux() and xfs_diflags_to_iflags() are
nearly identical. The only difference is that *_to_linux() is called after
inode setup and disallows changing the DAX flag.

Combining them can be done with a flag which indicates if this is the initial
setup to allow the DAX flag to be properly set only at init time.

So remove xfs_diflags_to_linux() and call the modified xfs_diflags_to_iflags()
directly.

While we are here simplify xfs_diflags_to_iflags() to take struct xfs_inode and
use xfs_ip2xflags() to ensure future diflags are included correctly.

Signed-off-by: Ira Weiny <[email protected]>

---
Changes from V5:
The functions are no longer identical so we can only combine
them rather than deleting one completely. This is reflected in
the new init parameter.
---
fs/xfs/xfs_inode.h | 1 +
fs/xfs/xfs_ioctl.c | 33 +--------------------------------
fs/xfs/xfs_iops.c | 42 +++++++++++++++++++++++++++---------------
3 files changed, 29 insertions(+), 47 deletions(-)

diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
index 492e53992fa9..e76ed9ca17f7 100644
--- a/fs/xfs/xfs_inode.h
+++ b/fs/xfs/xfs_inode.h
@@ -466,6 +466,7 @@ int xfs_break_layouts(struct inode *inode, uint *iolock,
/* from xfs_iops.c */
extern void xfs_setup_inode(struct xfs_inode *ip);
extern void xfs_setup_iops(struct xfs_inode *ip);
+extern void xfs_diflags_to_iflags(struct xfs_inode *ip, bool init);

/*
* When setting up a newly allocated inode, we need to call
diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
index d42de92cb283..c6cd92ef4a05 100644
--- a/fs/xfs/xfs_ioctl.c
+++ b/fs/xfs/xfs_ioctl.c
@@ -1100,37 +1100,6 @@ xfs_flags2diflags2(
return di_flags2;
}

-STATIC void
-xfs_diflags_to_linux(
- struct xfs_inode *ip)
-{
- struct inode *inode = VFS_I(ip);
- unsigned int xflags = xfs_ip2xflags(ip);
-
- if (xflags & FS_XFLAG_IMMUTABLE)
- inode->i_flags |= S_IMMUTABLE;
- else
- inode->i_flags &= ~S_IMMUTABLE;
- if (xflags & FS_XFLAG_APPEND)
- inode->i_flags |= S_APPEND;
- else
- inode->i_flags &= ~S_APPEND;
- if (xflags & FS_XFLAG_SYNC)
- inode->i_flags |= S_SYNC;
- else
- inode->i_flags &= ~S_SYNC;
- if (xflags & FS_XFLAG_NOATIME)
- inode->i_flags |= S_NOATIME;
- else
- inode->i_flags &= ~S_NOATIME;
-#if 0 /* disabled until the flag switching races are sorted out */
- if (xflags & FS_XFLAG_DAX)
- inode->i_flags |= S_DAX;
- else
- inode->i_flags &= ~S_DAX;
-#endif
-}
-
static int
xfs_ioctl_setattr_xflags(
struct xfs_trans *tp,
@@ -1168,7 +1137,7 @@ xfs_ioctl_setattr_xflags(
ip->i_d.di_flags = xfs_flags2diflags(ip, fa->fsx_xflags);
ip->i_d.di_flags2 = di_flags2;

- xfs_diflags_to_linux(ip);
+ xfs_diflags_to_iflags(ip, false);
xfs_trans_ichgtime(tp, ip, XFS_ICHGTIME_CHG);
xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE);
XFS_STATS_INC(mp, xs_ig_attrchg);
diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
index e07f7b641226..a4ac8568c8c7 100644
--- a/fs/xfs/xfs_iops.c
+++ b/fs/xfs/xfs_iops.c
@@ -1259,7 +1259,7 @@ xfs_inode_supports_dax(
return xfs_inode_buftarg(ip)->bt_daxdev != NULL;
}

-STATIC bool
+static bool
xfs_inode_enable_dax(
struct xfs_inode *ip)
{
@@ -1272,26 +1272,38 @@ xfs_inode_enable_dax(
return false;
}

-STATIC void
+void
xfs_diflags_to_iflags(
- struct inode *inode,
- struct xfs_inode *ip)
+ struct xfs_inode *ip,
+ bool init)
{
- uint16_t flags = ip->i_d.di_flags;
-
- inode->i_flags &= ~(S_IMMUTABLE | S_APPEND | S_SYNC |
- S_NOATIME | S_DAX);
+ struct inode *inode = VFS_I(ip);
+ uint diflags = xfs_ip2xflags(ip);

- if (flags & XFS_DIFLAG_IMMUTABLE)
+ if (diflags & FS_XFLAG_IMMUTABLE)
inode->i_flags |= S_IMMUTABLE;
- if (flags & XFS_DIFLAG_APPEND)
+ else
+ inode->i_flags &= ~S_IMMUTABLE;
+ if (diflags & FS_XFLAG_APPEND)
inode->i_flags |= S_APPEND;
- if (flags & XFS_DIFLAG_SYNC)
+ else
+ inode->i_flags &= ~S_APPEND;
+ if (diflags & FS_XFLAG_SYNC)
inode->i_flags |= S_SYNC;
- if (flags & XFS_DIFLAG_NOATIME)
+ else
+ inode->i_flags &= ~S_SYNC;
+ if (diflags & FS_XFLAG_NOATIME)
inode->i_flags |= S_NOATIME;
- if (xfs_inode_enable_dax(ip))
- inode->i_flags |= S_DAX;
+ else
+ inode->i_flags &= ~S_NOATIME;
+
+ /* Only toggle the dax flag when initializing */
+ if (init) {
+ if (xfs_inode_enable_dax(ip))
+ inode->i_flags |= S_DAX;
+ else
+ inode->i_flags &= ~S_DAX;
+ }
}

/*
@@ -1320,7 +1332,7 @@ xfs_setup_inode(
inode->i_gid = xfs_gid_to_kgid(ip->i_d.di_gid);

i_size_write(inode, ip->i_d.di_size);
- xfs_diflags_to_iflags(inode, ip);
+ xfs_diflags_to_iflags(ip, true);

if (S_ISDIR(inode->i_mode)) {
/*
--
2.25.1

2020-04-07 18:31:55

by Ira Weiny

[permalink] [raw]
Subject: [PATCH V6 7/8] fs/xfs: Change xfs_ioctl_setattr_dax_invalidate() to xfs_ioctl_dax_check()

From: Ira Weiny <[email protected]>

We only support changing FS_XFLAG_DAX on directories. Files get their
flag from the parent directory on creation only. So no data
invalidation needs to happen.

Alter the xfs_ioctl_setattr_dax_invalidate() to be
xfs_ioctl_dax_check().

This also allows use to remove the join_flags logic.

Signed-off-by: Ira Weiny <[email protected]>

---
Changes from v5:
New patch
---
fs/xfs/xfs_ioctl.c | 91 +++++-----------------------------------------
1 file changed, 10 insertions(+), 81 deletions(-)

diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
index c6cd92ef4a05..5472faab7c4f 100644
--- a/fs/xfs/xfs_ioctl.c
+++ b/fs/xfs/xfs_ioctl.c
@@ -1145,63 +1145,18 @@ xfs_ioctl_setattr_xflags(
}

/*
- * If we are changing DAX flags, we have to ensure the file is clean and any
- * cached objects in the address space are invalidated and removed. This
- * requires us to lock out other IO and page faults similar to a truncate
- * operation. The locks need to be held until the transaction has been committed
- * so that the cache invalidation is atomic with respect to the DAX flag
- * manipulation.
+ * Only directories are allowed to change dax flags
*/
static int
xfs_ioctl_setattr_dax_invalidate(
- struct xfs_inode *ip,
- struct fsxattr *fa,
- int *join_flags)
+ struct xfs_inode *ip)
{
struct inode *inode = VFS_I(ip);
- struct super_block *sb = inode->i_sb;
- int error;
-
- *join_flags = 0;
-
- /*
- * It is only valid to set the DAX flag on regular files and
- * directories on filesystems where the block size is equal to the page
- * size. On directories it serves as an inherited hint so we don't
- * have to check the device for dax support or flush pagecache.
- */
- if (fa->fsx_xflags & FS_XFLAG_DAX) {
- struct xfs_buftarg *target = xfs_inode_buftarg(ip);
-
- if (!bdev_dax_supported(target->bt_bdev, sb->s_blocksize))
- return -EINVAL;
- }
-
- /* If the DAX state is not changing, we have nothing to do here. */
- if ((fa->fsx_xflags & FS_XFLAG_DAX) && IS_DAX(inode))
- return 0;
- if (!(fa->fsx_xflags & FS_XFLAG_DAX) && !IS_DAX(inode))
- return 0;
-
- if (S_ISDIR(inode->i_mode))
- return 0;

- /* lock, flush and invalidate mapping in preparation for flag change */
- xfs_ilock(ip, XFS_MMAPLOCK_EXCL | XFS_IOLOCK_EXCL);
- error = filemap_write_and_wait(inode->i_mapping);
- if (error)
- goto out_unlock;
- error = invalidate_inode_pages2(inode->i_mapping);
- if (error)
- goto out_unlock;
+ if (!S_ISDIR(inode->i_mode))
+ return -EINVAL;

- *join_flags = XFS_MMAPLOCK_EXCL | XFS_IOLOCK_EXCL;
return 0;
-
-out_unlock:
- xfs_iunlock(ip, XFS_MMAPLOCK_EXCL | XFS_IOLOCK_EXCL);
- return error;
-
}

/*
@@ -1209,17 +1164,10 @@ xfs_ioctl_setattr_dax_invalidate(
* have permission to do so. On success, return a clean transaction and the
* inode locked exclusively ready for further operation specific checks. On
* failure, return an error without modifying or locking the inode.
- *
- * The inode might already be IO locked on call. If this is the case, it is
- * indicated in @join_flags and we take full responsibility for ensuring they
- * are unlocked from now on. Hence if we have an error here, we still have to
- * unlock them. Otherwise, once they are joined to the transaction, they will
- * be unlocked on commit/cancel.
*/
static struct xfs_trans *
xfs_ioctl_setattr_get_trans(
- struct xfs_inode *ip,
- int join_flags)
+ struct xfs_inode *ip)
{
struct xfs_mount *mp = ip->i_mount;
struct xfs_trans *tp;
@@ -1236,8 +1184,7 @@ xfs_ioctl_setattr_get_trans(
goto out_unlock;

xfs_ilock(ip, XFS_ILOCK_EXCL);
- xfs_trans_ijoin(tp, ip, XFS_ILOCK_EXCL | join_flags);
- join_flags = 0;
+ xfs_trans_ijoin(tp, ip, XFS_ILOCK_EXCL);

/*
* CAP_FOWNER overrides the following restrictions:
@@ -1258,8 +1205,6 @@ xfs_ioctl_setattr_get_trans(
out_cancel:
xfs_trans_cancel(tp);
out_unlock:
- if (join_flags)
- xfs_iunlock(ip, join_flags);
return ERR_PTR(error);
}

@@ -1386,7 +1331,6 @@ xfs_ioctl_setattr(
struct xfs_dquot *pdqp = NULL;
struct xfs_dquot *olddquot = NULL;
int code;
- int join_flags = 0;

trace_xfs_ioctl_setattr(ip);

@@ -1410,18 +1354,11 @@ xfs_ioctl_setattr(
return code;
}

- /*
- * Changing DAX config may require inode locking for mapping
- * invalidation. These need to be held all the way to transaction commit
- * or cancel time, so need to be passed through to
- * xfs_ioctl_setattr_get_trans() so it can apply them to the join call
- * appropriately.
- */
- code = xfs_ioctl_setattr_dax_invalidate(ip, fa, &join_flags);
+ code = xfs_ioctl_setattr_dax_invalidate(ip);
if (code)
goto error_free_dquots;

- tp = xfs_ioctl_setattr_get_trans(ip, join_flags);
+ tp = xfs_ioctl_setattr_get_trans(ip);
if (IS_ERR(tp)) {
code = PTR_ERR(tp);
goto error_free_dquots;
@@ -1552,7 +1489,6 @@ xfs_ioc_setxflags(
struct fsxattr fa;
struct fsxattr old_fa;
unsigned int flags;
- int join_flags = 0;
int error;

if (copy_from_user(&flags, arg, sizeof(flags)))
@@ -1569,18 +1505,11 @@ xfs_ioc_setxflags(
if (error)
return error;

- /*
- * Changing DAX config may require inode locking for mapping
- * invalidation. These need to be held all the way to transaction commit
- * or cancel time, so need to be passed through to
- * xfs_ioctl_setattr_get_trans() so it can apply them to the join call
- * appropriately.
- */
- error = xfs_ioctl_setattr_dax_invalidate(ip, &fa, &join_flags);
+ error = xfs_ioctl_setattr_dax_invalidate(ip);
if (error)
goto out_drop_write;

- tp = xfs_ioctl_setattr_get_trans(ip, join_flags);
+ tp = xfs_ioctl_setattr_get_trans(ip);
if (IS_ERR(tp)) {
error = PTR_ERR(tp);
goto out_drop_write;
--
2.25.1

2020-04-07 18:32:37

by Ira Weiny

[permalink] [raw]
Subject: [PATCH V6 8/8] Documentation/dax: Update Usage section

From: Ira Weiny <[email protected]>

Update the Usage section to reflect the new individual dax selection
functionality.

Signed-off-by: Ira Weiny <[email protected]>

---
Changes from V5:
Update to reflect the agreed upon semantics
https://lore.kernel.org/lkml/[email protected]/
---
Documentation/filesystems/dax.txt | 94 ++++++++++++++++++++++++++++++-
1 file changed, 91 insertions(+), 3 deletions(-)

diff --git a/Documentation/filesystems/dax.txt b/Documentation/filesystems/dax.txt
index 679729442fd2..d84e8101cf8a 100644
--- a/Documentation/filesystems/dax.txt
+++ b/Documentation/filesystems/dax.txt
@@ -17,11 +17,99 @@ For file mappings, the storage device is mapped directly into userspace.
Usage
-----

-If you have a block device which supports DAX, you can make a filesystem
+If you have a block device which supports DAX, you can make a file system
on it as usual. The DAX code currently only supports files with a block
size equal to your kernel's PAGE_SIZE, so you may need to specify a block
-size when creating the filesystem. When mounting it, use the "-o dax"
-option on the command line or add 'dax' to the options in /etc/fstab.
+size when creating the file system.
+
+Enabling DAX on an individual file basis (XFS)
+----------------------------------------------
+
+There are 2 per file dax flags. One is a physical inode setting (FS_XFLAG_DAX) and
+the other a currently enabled state (S_DAX).
+
+FS_XFLAG_DAX is maintained on individual file and directory inodes. It is
+preserved within the file system. This 'physical' config setting can be set on
+directories using an ioctl and/or an application such as "xfs_io -c 'chattr
+[-+]x'". Files and directories automatically inherit FS_XFLAG_DAX from their
+parent directory _when_ _created_. Therefore, setting FS_XFLAG_DAX at
+directory creation time can be used to set a default behavior for an entire
+sub-tree. (Doing so on the root directory acts to set a default for the entire
+file system.)
+
+To clarify inheritance here are 3 examples:
+
+Example A:
+
+mkdir -p a/b/c
+xfs_io 'chattr +x' a
+mkdir a/b/c/d
+mkdir a/e
+
+ dax: a,e
+ no dax: b,c,d
+
+Example B:
+
+mkdir a
+xfs_io 'chattr +x' a
+mkdir -p a/b/c/d
+
+ dax: a,b,c,d
+ no dax:
+
+Example C:
+
+mkdir -p a/b/c
+xfs_io 'chattr +x' c
+mkdir a/b/c/d
+
+ dax: c,d
+ no dax: a,b
+
+
+The current enabled state (S_DAX) is set when a file inode is loaded based on
+the underlying media support and the file systems dax mount option setting. See
+below.
+
+statx can be used to query S_DAX. NOTE that a directory will never have S_DAX
+set and therefore statx will always return false. FS_XFLAG_DAX can be queried
+with ioctl or xfs_io on directories.
+
+NOTE: Setting FS_XFLAG_DAX on a directory is possible even if the underlying
+media does not support dax. Furthermore, files and directories will continue
+to inherit FS_XLFAG_DAX even if the underlying media does not support dax.
+
+
+overriding FS_XFLAG_DAX (the dax= mount option)
+-----------------------------------------------
+
+The dax mount option is a tri-state option (never, always, iflag):
+
+ "-o dax=never" means "never set S_DAX, ignore FS_XFLAG_DAX"
+ "-o dax=always" means "always set S_DAX (at least on pmem), ignore FS_XFLAG_DAX"
+ "-o dax" by itself means "dax=always" to remain compatible with older
+ kernels
+ "-o dax=iflag" means "follow FS_XFLAG_DAX"
+
+The default state is 'iflag'. The following algorithm is used to determine the
+effective mode of the file S_DAX on a capable device.
+
+ S_DAX &= FS_XFLAG_DAX;
+
+ if (dax_mount == "always")
+ S_DAX = true;
+ else if (dax_mount == "off"
+ S_DAX = false;
+
+Using the mount option does not change the physical configured state of
+individual files.
+
+NOTE: Setting FS_XFLAG_DAX on a directory is possible while the file system is
+mounted with the dax override. In addition, files and directories will inherit
+FS_XFLAG_DAX as normal while the file system is overriden. However, the file's
+enabled state will continue to be the mount option until remounted with
+dax=iflag.


Implementation Tips for Block Driver Writers
--
2.25.1

2020-04-07 23:47:12

by Dave Chinner

[permalink] [raw]
Subject: Re: [PATCH V6 1/8] fs/xfs: Remove unnecessary initialization of i_rwsem

On Tue, Apr 07, 2020 at 11:29:51AM -0700, [email protected] wrote:
> From: Ira Weiny <[email protected]>
>
> An earlier call of xfs_reinit_inode() from xfs_iget_cache_hit() already
> handles initialization of i_rwsem.
>
> Doing so again is unneeded.
>
> Signed-off-by: Ira Weiny <[email protected]>
>
> ---
> Changes from V4:
> Update commit message to make it clear the xfs_iget_cache_hit()
> is actually doing the initialization via xfs_reinit_inode()
>
> New for V4:
>
> NOTE: This was found while ensuring the new i_aops_sem was properly
> handled. It seems like this is a layering violation so I think it is
> worth cleaning up so as to not confuse others.
> ---
> fs/xfs/xfs_icache.c | 4 +---
> 1 file changed, 1 insertion(+), 3 deletions(-)
>
> diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
> index 8dc2e5414276..836a1f09be03 100644
> --- a/fs/xfs/xfs_icache.c
> +++ b/fs/xfs/xfs_icache.c
> @@ -419,6 +419,7 @@ xfs_iget_cache_hit(
> spin_unlock(&ip->i_flags_lock);
> rcu_read_unlock();
>
> + ASSERT(!rwsem_is_locked(&inode->i_rwsem));
> error = xfs_reinit_inode(mp, inode);
> if (error) {
> bool wake;
> @@ -452,9 +453,6 @@ xfs_iget_cache_hit(
> ip->i_sick = 0;
> ip->i_checked = 0;
>
> - ASSERT(!rwsem_is_locked(&inode->i_rwsem));
> - init_rwsem(&inode->i_rwsem);
> -
> spin_unlock(&ip->i_flags_lock);
> spin_unlock(&pag->pag_ici_lock);
> } else {

Looks good.

Reviewed-by: Dave Chinner <[email protected]>
--
Dave Chinner
[email protected]

2020-04-07 23:48:20

by Dave Chinner

[permalink] [raw]
Subject: Re: [PATCH V6 3/8] fs/stat: Define DAX statx attribute

On Tue, Apr 07, 2020 at 11:29:53AM -0700, [email protected] wrote:
> From: Ira Weiny <[email protected]>
>
> In order for users to determine if a file is currently operating in DAX
> state (effective DAX). Define a statx attribute value and set that
> attribute if the effective DAX flag is set.
>
> To go along with this we propose the following addition to the statx man
> page:
>
> STATX_ATTR_DAX
>
> The file is in the DAX (cpu direct access) state. DAX state
> attempts to minimize software cache effects for both I/O and
> memory mappings of this file. It requires a file system which
> has been configured to support DAX.
>
> DAX generally assumes all accesses are via cpu load / store
> instructions which can minimize overhead for small accesses, but
> may adversely affect cpu utilization for large transfers.
>
> File I/O is done directly to/from user-space buffers and memory
> mapped I/O may be performed with direct memory mappings that
> bypass kernel page cache.
>
> While the DAX property tends to result in data being transferred
> synchronously, it does not give the same guarantees of O_SYNC
> where data and the necessary metadata are transferred together.
>
> A DAX file may support being mapped with the MAP_SYNC flag,
> which enables a program to use CPU cache flush instructions to
> persist CPU store operations without an explicit fsync(2). See
> mmap(2) for more information.
>
> Reviewed-by: Jan Kara <[email protected]>
> Reviewed-by: Darrick J. Wong <[email protected]>
> Signed-off-by: Ira Weiny <[email protected]>
>
> ---
> Changes from V2:
> Update man page text with comments from Darrick, Jan, Dan, and
> Dave.
> ---
> fs/stat.c | 3 +++
> include/uapi/linux/stat.h | 1 +
> 2 files changed, 4 insertions(+)
>
> diff --git a/fs/stat.c b/fs/stat.c
> index 030008796479..894699c74dde 100644
> --- a/fs/stat.c
> +++ b/fs/stat.c
> @@ -79,6 +79,9 @@ int vfs_getattr_nosec(const struct path *path, struct kstat *stat,
> if (IS_AUTOMOUNT(inode))
> stat->attributes |= STATX_ATTR_AUTOMOUNT;
>
> + if (IS_DAX(inode))
> + stat->attributes |= STATX_ATTR_DAX;
> +
> if (inode->i_op->getattr)
> return inode->i_op->getattr(path, stat, request_mask,
> query_flags);
> diff --git a/include/uapi/linux/stat.h b/include/uapi/linux/stat.h
> index ad80a5c885d5..e5f9d5517f6b 100644
> --- a/include/uapi/linux/stat.h
> +++ b/include/uapi/linux/stat.h
> @@ -169,6 +169,7 @@ struct statx {
> #define STATX_ATTR_ENCRYPTED 0x00000800 /* [I] File requires key to decrypt in fs */
> #define STATX_ATTR_AUTOMOUNT 0x00001000 /* Dir: Automount trigger */
> #define STATX_ATTR_VERITY 0x00100000 /* [I] Verity protected file */
> +#define STATX_ATTR_DAX 0x00002000 /* [I] File is DAX */
>
>
> #endif /* _UAPI_LINUX_STAT_H */

Looks fine. Man page text seems ok, too.

Reviewed-by: Dave Chinner <[email protected]>
--
Dave Chinner
[email protected]

2020-04-07 23:59:49

by Dave Chinner

[permalink] [raw]
Subject: Re: [PATCH V6 4/8] fs/xfs: Make DAX mount option a tri-state

On Tue, Apr 07, 2020 at 11:29:54AM -0700, [email protected] wrote:
> From: Ira Weiny <[email protected]>
>
> As agreed upon[1]. We make the dax mount option a tri-state. '-o dax'
> continues to operate the same. We add 'always', 'never', and 'iflag'
> (default).
>
> [1] https://lore.kernel.org/lkml/[email protected]/
>
> Signed-off-by: Ira Weiny <[email protected]>
>
> ---
> Changes from v5:
> New Patch
> ---
> fs/xfs/xfs_iops.c | 2 +-
> fs/xfs/xfs_mount.h | 26 +++++++++++++++++++++++++-
> fs/xfs/xfs_super.c | 34 +++++++++++++++++++++++++++++-----
> 3 files changed, 55 insertions(+), 7 deletions(-)
>
> diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
> index 81f2f93caec0..1ec4a36917bd 100644
> --- a/fs/xfs/xfs_iops.c
> +++ b/fs/xfs/xfs_iops.c
> @@ -1248,7 +1248,7 @@ xfs_inode_supports_dax(
> return false;
>
> /* DAX mount option or DAX iflag must be set. */
> - if (!(mp->m_flags & XFS_MOUNT_DAX) &&
> + if (xfs_mount_dax_mode(mp) != XFS_DAX_ALWAYS &&
> !(ip->i_d.di_flags2 & XFS_DIFLAG2_DAX))
> return false;
>
> diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
> index 88ab09ed29e7..ce027ee06692 100644
> --- a/fs/xfs/xfs_mount.h
> +++ b/fs/xfs/xfs_mount.h
> @@ -233,7 +233,31 @@ typedef struct xfs_mount {
> allocator */
> #define XFS_MOUNT_NOATTR2 (1ULL << 25) /* disable use of attr2 format */
>
> -#define XFS_MOUNT_DAX (1ULL << 62) /* TEST ONLY! */
> +/* DAX flag is a 2 bit field representing a tri-state for dax
> + * iflag, always, never
> + * We reserve/document the 2 bits using dax field/field2
> + */
> +#define XFS_DAX_FIELD_MASK 0x3ULL
> +#define XFS_DAX_FIELD_SHIFT 62
> +#define XFS_MOUNT_DAX_FIELD (1ULL << 62)
> +#define XFS_MOUNT_DAX_FIELD2 (1ULL << 63)
> +
> +enum {
> + XFS_DAX_IFLAG = 0,
> + XFS_DAX_ALWAYS = 1,
> + XFS_DAX_NEVER = 2,
> +};
> +
> +static inline void xfs_mount_set_dax(struct xfs_mount *mp, u32 val)
> +{
> + mp->m_flags &= ~(XFS_DAX_FIELD_MASK << XFS_DAX_FIELD_SHIFT);
> + mp->m_flags |= ((val & XFS_DAX_FIELD_MASK) << XFS_DAX_FIELD_SHIFT);
> +}
> +
> +static inline u32 xfs_mount_dax_mode(struct xfs_mount *mp)
> +{
> + return (mp->m_flags >> XFS_DAX_FIELD_SHIFT) & XFS_DAX_FIELD_MASK;
> +}

This is overly complex. Just use 2 flags:

#define XFS_MOUNT_DAX_ALWAYS (1ULL << 26)
#define XFS_MOUNT_DAX_NEVER (1ULL << 27)

and if no mount flag is set, we use the inode flag....

> @@ -59,7 +66,7 @@ enum {
> Opt_filestreams, Opt_quota, Opt_noquota, Opt_usrquota, Opt_grpquota,
> Opt_prjquota, Opt_uquota, Opt_gquota, Opt_pquota,
> Opt_uqnoenforce, Opt_gqnoenforce, Opt_pqnoenforce, Opt_qnoenforce,
> - Opt_discard, Opt_nodiscard, Opt_dax,
> + Opt_discard, Opt_nodiscard, Opt_dax, Opt_dax_enum,
> };
>
> static const struct fs_parameter_spec xfs_fs_parameters[] = {
> @@ -103,6 +110,7 @@ static const struct fs_parameter_spec xfs_fs_parameters[] = {
> fsparam_flag("discard", Opt_discard),
> fsparam_flag("nodiscard", Opt_nodiscard),
> fsparam_flag("dax", Opt_dax),
> + fsparam_enum("dax", Opt_dax_enum, dax_param_enums),
> {}
> };
>
> @@ -129,7 +137,6 @@ xfs_fs_show_options(
> { XFS_MOUNT_GRPID, ",grpid" },
> { XFS_MOUNT_DISCARD, ",discard" },
> { XFS_MOUNT_LARGEIO, ",largeio" },
> - { XFS_MOUNT_DAX, ",dax" },
+ { XFS_MOUNT_DAX_ALWAYS, ",dax=always" },
+ { XFS_MOUNT_DAX_NEVER, ",dax=never" },

> { 0, NULL }
> };
> struct xfs_mount *mp = XFS_M(root->d_sb);
> @@ -185,6 +192,20 @@ xfs_fs_show_options(
> if (!(mp->m_qflags & XFS_ALL_QUOTA_ACCT))
> seq_puts(m, ",noquota");
>
> + switch (xfs_mount_dax_mode(mp)) {
> + case XFS_DAX_IFLAG:
> + seq_puts(m, ",dax=iflag");
> + break;
> + case XFS_DAX_ALWAYS:
> + seq_puts(m, ",dax=always");
> + break;
> + case XFS_DAX_NEVER:
> + seq_puts(m, ",dax=never");
> + break;
> + default:
> + break;
> + }

if (!(mp->m_flags & (XFS_MOUNT_DAX_ALWAYS | XFS_MOUNT_DAX_NEVER))
seq_puts(m, ",dax=iflag");

> +
> return 0;
> }
>
> @@ -1244,7 +1265,10 @@ xfs_fc_parse_param(
> return 0;
> #ifdef CONFIG_FS_DAX
> case Opt_dax:
> - mp->m_flags |= XFS_MOUNT_DAX;
> + xfs_mount_set_dax(mp, XFS_DAX_ALWAYS);
> + return 0;
> + case Opt_dax_enum:
> + xfs_mount_set_dax(mp, result.uint_32);
> return 0;
> #endif
> default:
> @@ -1437,7 +1461,7 @@ xfs_fc_fill_super(
> if (XFS_SB_VERSION_NUM(&mp->m_sb) == XFS_SB_VERSION_5)
> sb->s_flags |= SB_I_VERSION;
>
> - if (mp->m_flags & XFS_MOUNT_DAX) {
> + if (xfs_mount_dax_mode(mp) == XFS_DAX_ALWAYS) {

if (mp->m_flags & XFS_MOUNT_DAX_ALWAYS) {

> bool rtdev_is_dax = false, datadev_is_dax;
>
> xfs_warn(mp,
> @@ -1451,7 +1475,7 @@ xfs_fc_fill_super(
> if (!rtdev_is_dax && !datadev_is_dax) {
> xfs_alert(mp,
> "DAX unsupported by block device. Turning off DAX.");
> - mp->m_flags &= ~XFS_MOUNT_DAX;
> + xfs_mount_set_dax(mp, XFS_DAX_NEVER);
> }
> if (xfs_sb_version_hasreflink(&mp->m_sb)) {
> xfs_alert(mp,
> --
> 2.25.1
>
>

--
Dave Chinner
[email protected]

2020-04-08 02:09:06

by Dave Chinner

[permalink] [raw]
Subject: Re: [PATCH V6 6/8] fs/xfs: Combine xfs_diflags_to_linux() and xfs_diflags_to_iflags()

On Tue, Apr 07, 2020 at 11:29:56AM -0700, [email protected] wrote:
> From: Ira Weiny <[email protected]>
>
> The functionality in xfs_diflags_to_linux() and xfs_diflags_to_iflags() are
> nearly identical. The only difference is that *_to_linux() is called after
> inode setup and disallows changing the DAX flag.
>
> Combining them can be done with a flag which indicates if this is the initial
> setup to allow the DAX flag to be properly set only at init time.
>
> So remove xfs_diflags_to_linux() and call the modified xfs_diflags_to_iflags()
> directly.
>
> While we are here simplify xfs_diflags_to_iflags() to take struct xfs_inode and
> use xfs_ip2xflags() to ensure future diflags are included correctly.
>
> Signed-off-by: Ira Weiny <[email protected]>
>
> ---
> Changes from V5:
> The functions are no longer identical so we can only combine
> them rather than deleting one completely. This is reflected in
> the new init parameter.
> ---
> fs/xfs/xfs_inode.h | 1 +
> fs/xfs/xfs_ioctl.c | 33 +--------------------------------
> fs/xfs/xfs_iops.c | 42 +++++++++++++++++++++++++++---------------
> 3 files changed, 29 insertions(+), 47 deletions(-)
>
> diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
> index 492e53992fa9..e76ed9ca17f7 100644
> --- a/fs/xfs/xfs_inode.h
> +++ b/fs/xfs/xfs_inode.h
> @@ -466,6 +466,7 @@ int xfs_break_layouts(struct inode *inode, uint *iolock,
> /* from xfs_iops.c */
> extern void xfs_setup_inode(struct xfs_inode *ip);
> extern void xfs_setup_iops(struct xfs_inode *ip);
> +extern void xfs_diflags_to_iflags(struct xfs_inode *ip, bool init);
>
> /*
> * When setting up a newly allocated inode, we need to call
> diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
> index d42de92cb283..c6cd92ef4a05 100644
> --- a/fs/xfs/xfs_ioctl.c
> +++ b/fs/xfs/xfs_ioctl.c
> @@ -1100,37 +1100,6 @@ xfs_flags2diflags2(
> return di_flags2;
> }
>
> -STATIC void
> -xfs_diflags_to_linux(
> - struct xfs_inode *ip)
> -{
> - struct inode *inode = VFS_I(ip);
> - unsigned int xflags = xfs_ip2xflags(ip);
> -
> - if (xflags & FS_XFLAG_IMMUTABLE)
> - inode->i_flags |= S_IMMUTABLE;
> - else
> - inode->i_flags &= ~S_IMMUTABLE;
> - if (xflags & FS_XFLAG_APPEND)
> - inode->i_flags |= S_APPEND;
> - else
> - inode->i_flags &= ~S_APPEND;
> - if (xflags & FS_XFLAG_SYNC)
> - inode->i_flags |= S_SYNC;
> - else
> - inode->i_flags &= ~S_SYNC;
> - if (xflags & FS_XFLAG_NOATIME)
> - inode->i_flags |= S_NOATIME;
> - else
> - inode->i_flags &= ~S_NOATIME;
> -#if 0 /* disabled until the flag switching races are sorted out */
> - if (xflags & FS_XFLAG_DAX)
> - inode->i_flags |= S_DAX;
> - else
> - inode->i_flags &= ~S_DAX;
> -#endif

So this variant will set the flag in the inode if the disk inode
flag is set, otherwise it will clear it. It does it with if/else
branches.


> diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
> index e07f7b641226..a4ac8568c8c7 100644
> --- a/fs/xfs/xfs_iops.c
> +++ b/fs/xfs/xfs_iops.c
> @@ -1259,7 +1259,7 @@ xfs_inode_supports_dax(
> return xfs_inode_buftarg(ip)->bt_daxdev != NULL;
> }
>
> -STATIC bool
> +static bool
> xfs_inode_enable_dax(
> struct xfs_inode *ip)
> {

This belongs in the previous patch.

> @@ -1272,26 +1272,38 @@ xfs_inode_enable_dax(
> return false;
> }
>
> -STATIC void
> +void
> xfs_diflags_to_iflags(
> - struct inode *inode,
> - struct xfs_inode *ip)
> + struct xfs_inode *ip,
> + bool init)
> {
> - uint16_t flags = ip->i_d.di_flags;
> -
> - inode->i_flags &= ~(S_IMMUTABLE | S_APPEND | S_SYNC |
> - S_NOATIME | S_DAX);

And this code cleared all the flags in the inode first, then
set them if the disk inode flag is set. This does not require
branches, resulting in more readable code and better code
generation.

> + struct inode *inode = VFS_I(ip);
> + uint diflags = xfs_ip2xflags(ip);
>
> - if (flags & XFS_DIFLAG_IMMUTABLE)
> + if (diflags & FS_XFLAG_IMMUTABLE)
> inode->i_flags |= S_IMMUTABLE;
> - if (flags & XFS_DIFLAG_APPEND)
> + else
> + inode->i_flags &= ~S_IMMUTABLE;

> + if (diflags & FS_XFLAG_APPEND)
> inode->i_flags |= S_APPEND;
> - if (flags & XFS_DIFLAG_SYNC)
> + else
> + inode->i_flags &= ~S_APPEND;
> + if (diflags & FS_XFLAG_SYNC)
> inode->i_flags |= S_SYNC;
> - if (flags & XFS_DIFLAG_NOATIME)
> + else
> + inode->i_flags &= ~S_SYNC;
> + if (diflags & FS_XFLAG_NOATIME)
> inode->i_flags |= S_NOATIME;
> - if (xfs_inode_enable_dax(ip))
> - inode->i_flags |= S_DAX;
> + else
> + inode->i_flags &= ~S_NOATIME;
> +
> + /* Only toggle the dax flag when initializing */
> + if (init) {
> + if (xfs_inode_enable_dax(ip))
> + inode->i_flags |= S_DAX;
> + else
> + inode->i_flags &= ~S_DAX;
> + }
> }

IOWs, this:

struct inode *inode = VFS_I(ip);
unsigned int xflags = xfs_ip2xflags(ip);
unsigned int flags = 0;

if (xflags & FS_XFLAG_IMMUTABLE)
flags |= S_IMMUTABLE;
if (xflags & FS_XFLAG_APPEND)
flags |= S_APPEND;
if (xflags & FS_XFLAG_SYNC)
flags |= S_SYNC;
if (xflags & FS_XFLAG_NOATIME)
flags |= S_NOATIME;
if ((xflags & FS_XFLAG_DAX) && init)
flags |= S_DAX;

inode->i_flags &= ~(S_IMMUTABLE | S_APPEND | S_SYNC | S_NOATIME);
inode->i_flags |= flags;

ends up being much easier to read and results in better code
generation. And we don't need to clear the S_DAX flag when "init" is
set, because we are starting from an inode that has no flags set
(because init!)...

Cheers,

Dave.
--
Dave Chinner
[email protected]

2020-04-08 02:23:50

by Dave Chinner

[permalink] [raw]
Subject: Re: [PATCH V6 7/8] fs/xfs: Change xfs_ioctl_setattr_dax_invalidate() to xfs_ioctl_dax_check()

On Tue, Apr 07, 2020 at 11:29:57AM -0700, [email protected] wrote:
> From: Ira Weiny <[email protected]>
>
> We only support changing FS_XFLAG_DAX on directories. Files get their
> flag from the parent directory on creation only. So no data
> invalidation needs to happen.

Which leads me to ask: how are users and/or admins supposed to
remove the flag from regular files once it is set in the filesystem?

Only being able to override the flag via the "dax=never" mount
option means that once the flag is set, nobody can ever remove it
and they can only globally turn off dax if it gets set incorrectly.
It also means a global interrupt because all apps on the filesystem
need to be stopped so the filesystem can be unmounted and mounted
again with dax=never. This is highly unfriendly to admins and users.

IOWs, we _must_ be able to clear this inode flag on regular inodes
in some way. I don't care if it doesn't change the current in-memory
state, but we must be able to clear the flags so that the next time
the inodes are instantiated DAX is not enabled for those files...

Cheers,

Dave.
--
Dave Chinner
[email protected]

2020-04-08 15:38:36

by Darrick J. Wong

[permalink] [raw]
Subject: Re: [PATCH V6 7/8] fs/xfs: Change xfs_ioctl_setattr_dax_invalidate() to xfs_ioctl_dax_check()

On Tue, Apr 07, 2020 at 11:29:57AM -0700, [email protected] wrote:
> From: Ira Weiny <[email protected]>
>
> We only support changing FS_XFLAG_DAX on directories. Files get their
> flag from the parent directory on creation only. So no data
> invalidation needs to happen.
>
> Alter the xfs_ioctl_setattr_dax_invalidate() to be
> xfs_ioctl_dax_check().
>
> This also allows use to remove the join_flags logic.
>
> Signed-off-by: Ira Weiny <[email protected]>
>
> ---
> Changes from v5:
> New patch
> ---
> fs/xfs/xfs_ioctl.c | 91 +++++-----------------------------------------
> 1 file changed, 10 insertions(+), 81 deletions(-)
>
> diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
> index c6cd92ef4a05..5472faab7c4f 100644
> --- a/fs/xfs/xfs_ioctl.c
> +++ b/fs/xfs/xfs_ioctl.c
> @@ -1145,63 +1145,18 @@ xfs_ioctl_setattr_xflags(
> }
>
> /*
> - * If we are changing DAX flags, we have to ensure the file is clean and any
> - * cached objects in the address space are invalidated and removed. This
> - * requires us to lock out other IO and page faults similar to a truncate
> - * operation. The locks need to be held until the transaction has been committed
> - * so that the cache invalidation is atomic with respect to the DAX flag
> - * manipulation.
> + * Only directories are allowed to change dax flags
> */
> static int
> xfs_ioctl_setattr_dax_invalidate(
> - struct xfs_inode *ip,
> - struct fsxattr *fa,
> - int *join_flags)
> + struct xfs_inode *ip)
> {
> struct inode *inode = VFS_I(ip);
> - struct super_block *sb = inode->i_sb;
> - int error;
> -
> - *join_flags = 0;
> -
> - /*
> - * It is only valid to set the DAX flag on regular files and
> - * directories on filesystems where the block size is equal to the page
> - * size. On directories it serves as an inherited hint so we don't
> - * have to check the device for dax support or flush pagecache.
> - */
> - if (fa->fsx_xflags & FS_XFLAG_DAX) {
> - struct xfs_buftarg *target = xfs_inode_buftarg(ip);
> -
> - if (!bdev_dax_supported(target->bt_bdev, sb->s_blocksize))
> - return -EINVAL;
> - }
> -
> - /* If the DAX state is not changing, we have nothing to do here. */
> - if ((fa->fsx_xflags & FS_XFLAG_DAX) && IS_DAX(inode))
> - return 0;
> - if (!(fa->fsx_xflags & FS_XFLAG_DAX) && !IS_DAX(inode))
> - return 0;

Does the !S_ISDIR check below apply unconditionally even if we weren't
trying to change the DAX flag?

> - if (S_ISDIR(inode->i_mode))
> - return 0;
>
> - /* lock, flush and invalidate mapping in preparation for flag change */
> - xfs_ilock(ip, XFS_MMAPLOCK_EXCL | XFS_IOLOCK_EXCL);
> - error = filemap_write_and_wait(inode->i_mapping);
> - if (error)
> - goto out_unlock;
> - error = invalidate_inode_pages2(inode->i_mapping);
> - if (error)
> - goto out_unlock;
> + if (!S_ISDIR(inode->i_mode))
> + return -EINVAL;

If this entire function collapses to an S_ISDIR check then you might
as well just hoist this one piece to the caller. Also, where is
xfs_ioctl_dax_check?

<confused>

--D

>
> - *join_flags = XFS_MMAPLOCK_EXCL | XFS_IOLOCK_EXCL;
> return 0;
> -
> -out_unlock:
> - xfs_iunlock(ip, XFS_MMAPLOCK_EXCL | XFS_IOLOCK_EXCL);
> - return error;
> -
> }
>
> /*
> @@ -1209,17 +1164,10 @@ xfs_ioctl_setattr_dax_invalidate(
> * have permission to do so. On success, return a clean transaction and the
> * inode locked exclusively ready for further operation specific checks. On
> * failure, return an error without modifying or locking the inode.
> - *
> - * The inode might already be IO locked on call. If this is the case, it is
> - * indicated in @join_flags and we take full responsibility for ensuring they
> - * are unlocked from now on. Hence if we have an error here, we still have to
> - * unlock them. Otherwise, once they are joined to the transaction, they will
> - * be unlocked on commit/cancel.
> */
> static struct xfs_trans *
> xfs_ioctl_setattr_get_trans(
> - struct xfs_inode *ip,
> - int join_flags)
> + struct xfs_inode *ip)
> {
> struct xfs_mount *mp = ip->i_mount;
> struct xfs_trans *tp;
> @@ -1236,8 +1184,7 @@ xfs_ioctl_setattr_get_trans(
> goto out_unlock;
>
> xfs_ilock(ip, XFS_ILOCK_EXCL);
> - xfs_trans_ijoin(tp, ip, XFS_ILOCK_EXCL | join_flags);
> - join_flags = 0;
> + xfs_trans_ijoin(tp, ip, XFS_ILOCK_EXCL);
>
> /*
> * CAP_FOWNER overrides the following restrictions:
> @@ -1258,8 +1205,6 @@ xfs_ioctl_setattr_get_trans(
> out_cancel:
> xfs_trans_cancel(tp);
> out_unlock:
> - if (join_flags)
> - xfs_iunlock(ip, join_flags);
> return ERR_PTR(error);
> }
>
> @@ -1386,7 +1331,6 @@ xfs_ioctl_setattr(
> struct xfs_dquot *pdqp = NULL;
> struct xfs_dquot *olddquot = NULL;
> int code;
> - int join_flags = 0;
>
> trace_xfs_ioctl_setattr(ip);
>
> @@ -1410,18 +1354,11 @@ xfs_ioctl_setattr(
> return code;
> }
>
> - /*
> - * Changing DAX config may require inode locking for mapping
> - * invalidation. These need to be held all the way to transaction commit
> - * or cancel time, so need to be passed through to
> - * xfs_ioctl_setattr_get_trans() so it can apply them to the join call
> - * appropriately.
> - */
> - code = xfs_ioctl_setattr_dax_invalidate(ip, fa, &join_flags);
> + code = xfs_ioctl_setattr_dax_invalidate(ip);
> if (code)
> goto error_free_dquots;
>
> - tp = xfs_ioctl_setattr_get_trans(ip, join_flags);
> + tp = xfs_ioctl_setattr_get_trans(ip);
> if (IS_ERR(tp)) {
> code = PTR_ERR(tp);
> goto error_free_dquots;
> @@ -1552,7 +1489,6 @@ xfs_ioc_setxflags(
> struct fsxattr fa;
> struct fsxattr old_fa;
> unsigned int flags;
> - int join_flags = 0;
> int error;
>
> if (copy_from_user(&flags, arg, sizeof(flags)))
> @@ -1569,18 +1505,11 @@ xfs_ioc_setxflags(
> if (error)
> return error;
>
> - /*
> - * Changing DAX config may require inode locking for mapping
> - * invalidation. These need to be held all the way to transaction commit
> - * or cancel time, so need to be passed through to
> - * xfs_ioctl_setattr_get_trans() so it can apply them to the join call
> - * appropriately.
> - */
> - error = xfs_ioctl_setattr_dax_invalidate(ip, &fa, &join_flags);
> + error = xfs_ioctl_setattr_dax_invalidate(ip);
> if (error)
> goto out_drop_write;
>
> - tp = xfs_ioctl_setattr_get_trans(ip, join_flags);
> + tp = xfs_ioctl_setattr_get_trans(ip);
> if (IS_ERR(tp)) {
> error = PTR_ERR(tp);
> goto out_drop_write;
> --
> 2.25.1
>

2020-04-08 19:43:38

by Ira Weiny

[permalink] [raw]
Subject: Re: [PATCH V6 6/8] fs/xfs: Combine xfs_diflags_to_linux() and xfs_diflags_to_iflags()

On Wed, Apr 08, 2020 at 12:08:27PM +1000, Dave Chinner wrote:
> On Tue, Apr 07, 2020 at 11:29:56AM -0700, [email protected] wrote:
> > From: Ira Weiny <[email protected]>

[snip]

> >
> > -STATIC void
> > -xfs_diflags_to_linux(
> > - struct xfs_inode *ip)
> > -{
> > - struct inode *inode = VFS_I(ip);
> > - unsigned int xflags = xfs_ip2xflags(ip);
> > -
> > - if (xflags & FS_XFLAG_IMMUTABLE)
> > - inode->i_flags |= S_IMMUTABLE;
> > - else
> > - inode->i_flags &= ~S_IMMUTABLE;
> > - if (xflags & FS_XFLAG_APPEND)
> > - inode->i_flags |= S_APPEND;
> > - else
> > - inode->i_flags &= ~S_APPEND;
> > - if (xflags & FS_XFLAG_SYNC)
> > - inode->i_flags |= S_SYNC;
> > - else
> > - inode->i_flags &= ~S_SYNC;
> > - if (xflags & FS_XFLAG_NOATIME)
> > - inode->i_flags |= S_NOATIME;
> > - else
> > - inode->i_flags &= ~S_NOATIME;
> > -#if 0 /* disabled until the flag switching races are sorted out */
> > - if (xflags & FS_XFLAG_DAX)
> > - inode->i_flags |= S_DAX;
> > - else
> > - inode->i_flags &= ~S_DAX;
> > -#endif
>
> So this variant will set the flag in the inode if the disk inode
> flag is set, otherwise it will clear it. It does it with if/else
> branches.
>
>
> > diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
> > index e07f7b641226..a4ac8568c8c7 100644
> > --- a/fs/xfs/xfs_iops.c
> > +++ b/fs/xfs/xfs_iops.c
> > @@ -1259,7 +1259,7 @@ xfs_inode_supports_dax(
> > return xfs_inode_buftarg(ip)->bt_daxdev != NULL;
> > }
> >
> > -STATIC bool
> > +static bool
> > xfs_inode_enable_dax(
> > struct xfs_inode *ip)
> > {
>
> This belongs in the previous patch.

Ah yea... Sorry.

Fixed in V7

>
> > @@ -1272,26 +1272,38 @@ xfs_inode_enable_dax(
> > return false;
> > }
> >
> > -STATIC void
> > +void
> > xfs_diflags_to_iflags(
> > - struct inode *inode,
> > - struct xfs_inode *ip)
> > + struct xfs_inode *ip,
> > + bool init)
> > {
> > - uint16_t flags = ip->i_d.di_flags;
> > -
> > - inode->i_flags &= ~(S_IMMUTABLE | S_APPEND | S_SYNC |
> > - S_NOATIME | S_DAX);
>
> And this code cleared all the flags in the inode first, then
> set them if the disk inode flag is set. This does not require
> branches, resulting in more readable code and better code
> generation.
>
> > + struct inode *inode = VFS_I(ip);
> > + uint diflags = xfs_ip2xflags(ip);
> >
> > - if (flags & XFS_DIFLAG_IMMUTABLE)
> > + if (diflags & FS_XFLAG_IMMUTABLE)
> > inode->i_flags |= S_IMMUTABLE;
> > - if (flags & XFS_DIFLAG_APPEND)
> > + else
> > + inode->i_flags &= ~S_IMMUTABLE;
>
> > + if (diflags & FS_XFLAG_APPEND)
> > inode->i_flags |= S_APPEND;
> > - if (flags & XFS_DIFLAG_SYNC)
> > + else
> > + inode->i_flags &= ~S_APPEND;
> > + if (diflags & FS_XFLAG_SYNC)
> > inode->i_flags |= S_SYNC;
> > - if (flags & XFS_DIFLAG_NOATIME)
> > + else
> > + inode->i_flags &= ~S_SYNC;
> > + if (diflags & FS_XFLAG_NOATIME)
> > inode->i_flags |= S_NOATIME;
> > - if (xfs_inode_enable_dax(ip))
> > - inode->i_flags |= S_DAX;
> > + else
> > + inode->i_flags &= ~S_NOATIME;
> > +
> > + /* Only toggle the dax flag when initializing */
> > + if (init) {
> > + if (xfs_inode_enable_dax(ip))
> > + inode->i_flags |= S_DAX;
> > + else
> > + inode->i_flags &= ~S_DAX;
> > + }
> > }
>
> IOWs, this:
>
> struct inode *inode = VFS_I(ip);
> unsigned int xflags = xfs_ip2xflags(ip);
> unsigned int flags = 0;
>
> if (xflags & FS_XFLAG_IMMUTABLE)
> flags |= S_IMMUTABLE;
> if (xflags & FS_XFLAG_APPEND)
> flags |= S_APPEND;
> if (xflags & FS_XFLAG_SYNC)
> flags |= S_SYNC;
> if (xflags & FS_XFLAG_NOATIME)
> flags |= S_NOATIME;
> if ((xflags & FS_XFLAG_DAX) && init)
> flags |= S_DAX;
>
> inode->i_flags &= ~(S_IMMUTABLE | S_APPEND | S_SYNC | S_NOATIME);
> inode->i_flags |= flags;
>
> ends up being much easier to read and results in better code
> generation. And we don't need to clear the S_DAX flag when "init" is
> set, because we are starting from an inode that has no flags set
> (because init!)...

This sounds good but I think we need a slight modification to make the function equivalent in functionality.

void
xfs_diflags_to_iflags(
struct xfs_inode *ip,
bool init)
{
struct inode *inode = VFS_I(ip);
unsigned int xflags = xfs_ip2xflags(ip);
unsigned int flags = 0;

inode->i_flags &= ~(S_IMMUTABLE | S_APPEND | S_SYNC | S_NOATIME |
S_DAX);

if (xflags & FS_XFLAG_IMMUTABLE)
flags |= S_IMMUTABLE;
if (xflags & FS_XFLAG_APPEND)
flags |= S_APPEND;
if (xflags & FS_XFLAG_SYNC)
flags |= S_SYNC;
if (xflags & FS_XFLAG_NOATIME)
flags |= S_NOATIME;
if (init && xfs_inode_enable_dax(ip))
flags |= S_DAX;

inode->i_flags |= flags;
}

I've queued this for v7.

Thanks,
Ira

>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> [email protected]

2020-04-08 19:56:29

by Ira Weiny

[permalink] [raw]
Subject: Re: [PATCH V6 7/8] fs/xfs: Change xfs_ioctl_setattr_dax_invalidate() to xfs_ioctl_dax_check()

On Wed, Apr 08, 2020 at 08:37:17AM -0700, Darrick J. Wong wrote:
> On Tue, Apr 07, 2020 at 11:29:57AM -0700, [email protected] wrote:
> > From: Ira Weiny <[email protected]>
> >
> > We only support changing FS_XFLAG_DAX on directories. Files get their
> > flag from the parent directory on creation only. So no data
> > invalidation needs to happen.
> >
> > Alter the xfs_ioctl_setattr_dax_invalidate() to be
> > xfs_ioctl_dax_check().
> >
> > This also allows use to remove the join_flags logic.
> >
> > Signed-off-by: Ira Weiny <[email protected]>
> >
> > ---
> > Changes from v5:
> > New patch
> > ---
> > fs/xfs/xfs_ioctl.c | 91 +++++-----------------------------------------
> > 1 file changed, 10 insertions(+), 81 deletions(-)
> >
> > diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
> > index c6cd92ef4a05..5472faab7c4f 100644
> > --- a/fs/xfs/xfs_ioctl.c
> > +++ b/fs/xfs/xfs_ioctl.c
> > @@ -1145,63 +1145,18 @@ xfs_ioctl_setattr_xflags(
> > }
> >
> > /*
> > - * If we are changing DAX flags, we have to ensure the file is clean and any
> > - * cached objects in the address space are invalidated and removed. This
> > - * requires us to lock out other IO and page faults similar to a truncate
> > - * operation. The locks need to be held until the transaction has been committed
> > - * so that the cache invalidation is atomic with respect to the DAX flag
> > - * manipulation.
> > + * Only directories are allowed to change dax flags
> > */
> > static int
> > xfs_ioctl_setattr_dax_invalidate(
> > - struct xfs_inode *ip,
> > - struct fsxattr *fa,
> > - int *join_flags)
> > + struct xfs_inode *ip)
> > {
> > struct inode *inode = VFS_I(ip);
> > - struct super_block *sb = inode->i_sb;
> > - int error;
> > -
> > - *join_flags = 0;
> > -
> > - /*
> > - * It is only valid to set the DAX flag on regular files and
> > - * directories on filesystems where the block size is equal to the page
> > - * size. On directories it serves as an inherited hint so we don't
> > - * have to check the device for dax support or flush pagecache.
> > - */
> > - if (fa->fsx_xflags & FS_XFLAG_DAX) {
> > - struct xfs_buftarg *target = xfs_inode_buftarg(ip);
> > -
> > - if (!bdev_dax_supported(target->bt_bdev, sb->s_blocksize))
> > - return -EINVAL;
> > - }
> > -
> > - /* If the DAX state is not changing, we have nothing to do here. */
> > - if ((fa->fsx_xflags & FS_XFLAG_DAX) && IS_DAX(inode))
> > - return 0;
> > - if (!(fa->fsx_xflags & FS_XFLAG_DAX) && !IS_DAX(inode))
> > - return 0;
>
> Does the !S_ISDIR check below apply unconditionally even if we weren't
> trying to change the DAX flag?

:-/

shoot... I really screwed this up...

>
> > - if (S_ISDIR(inode->i_mode))
> > - return 0;
> >
> > - /* lock, flush and invalidate mapping in preparation for flag change */
> > - xfs_ilock(ip, XFS_MMAPLOCK_EXCL | XFS_IOLOCK_EXCL);
> > - error = filemap_write_and_wait(inode->i_mapping);
> > - if (error)
> > - goto out_unlock;
> > - error = invalidate_inode_pages2(inode->i_mapping);
> > - if (error)
> > - goto out_unlock;
> > + if (!S_ISDIR(inode->i_mode))
> > + return -EINVAL;
>
> If this entire function collapses to an S_ISDIR check then you might
> as well just hoist this one piece to the caller.

Oops... yea this is broken... All broken.

> Also, where is
> xfs_ioctl_dax_check?

I forgot to change the name.

>
> <confused>

Much less so than me... :-/

I'll clean it up.

Thanks, this was really bad on my part...
Ira

>
> --D
>
> >
> > - *join_flags = XFS_MMAPLOCK_EXCL | XFS_IOLOCK_EXCL;
> > return 0;
> > -
> > -out_unlock:
> > - xfs_iunlock(ip, XFS_MMAPLOCK_EXCL | XFS_IOLOCK_EXCL);
> > - return error;
> > -
> > }
> >
> > /*
> > @@ -1209,17 +1164,10 @@ xfs_ioctl_setattr_dax_invalidate(
> > * have permission to do so. On success, return a clean transaction and the
> > * inode locked exclusively ready for further operation specific checks. On
> > * failure, return an error without modifying or locking the inode.
> > - *
> > - * The inode might already be IO locked on call. If this is the case, it is
> > - * indicated in @join_flags and we take full responsibility for ensuring they
> > - * are unlocked from now on. Hence if we have an error here, we still have to
> > - * unlock them. Otherwise, once they are joined to the transaction, they will
> > - * be unlocked on commit/cancel.
> > */
> > static struct xfs_trans *
> > xfs_ioctl_setattr_get_trans(
> > - struct xfs_inode *ip,
> > - int join_flags)
> > + struct xfs_inode *ip)
> > {
> > struct xfs_mount *mp = ip->i_mount;
> > struct xfs_trans *tp;
> > @@ -1236,8 +1184,7 @@ xfs_ioctl_setattr_get_trans(
> > goto out_unlock;
> >
> > xfs_ilock(ip, XFS_ILOCK_EXCL);
> > - xfs_trans_ijoin(tp, ip, XFS_ILOCK_EXCL | join_flags);
> > - join_flags = 0;
> > + xfs_trans_ijoin(tp, ip, XFS_ILOCK_EXCL);
> >
> > /*
> > * CAP_FOWNER overrides the following restrictions:
> > @@ -1258,8 +1205,6 @@ xfs_ioctl_setattr_get_trans(
> > out_cancel:
> > xfs_trans_cancel(tp);
> > out_unlock:
> > - if (join_flags)
> > - xfs_iunlock(ip, join_flags);
> > return ERR_PTR(error);
> > }
> >
> > @@ -1386,7 +1331,6 @@ xfs_ioctl_setattr(
> > struct xfs_dquot *pdqp = NULL;
> > struct xfs_dquot *olddquot = NULL;
> > int code;
> > - int join_flags = 0;
> >
> > trace_xfs_ioctl_setattr(ip);
> >
> > @@ -1410,18 +1354,11 @@ xfs_ioctl_setattr(
> > return code;
> > }
> >
> > - /*
> > - * Changing DAX config may require inode locking for mapping
> > - * invalidation. These need to be held all the way to transaction commit
> > - * or cancel time, so need to be passed through to
> > - * xfs_ioctl_setattr_get_trans() so it can apply them to the join call
> > - * appropriately.
> > - */
> > - code = xfs_ioctl_setattr_dax_invalidate(ip, fa, &join_flags);
> > + code = xfs_ioctl_setattr_dax_invalidate(ip);
> > if (code)
> > goto error_free_dquots;
> >
> > - tp = xfs_ioctl_setattr_get_trans(ip, join_flags);
> > + tp = xfs_ioctl_setattr_get_trans(ip);
> > if (IS_ERR(tp)) {
> > code = PTR_ERR(tp);
> > goto error_free_dquots;
> > @@ -1552,7 +1489,6 @@ xfs_ioc_setxflags(
> > struct fsxattr fa;
> > struct fsxattr old_fa;
> > unsigned int flags;
> > - int join_flags = 0;
> > int error;
> >
> > if (copy_from_user(&flags, arg, sizeof(flags)))
> > @@ -1569,18 +1505,11 @@ xfs_ioc_setxflags(
> > if (error)
> > return error;
> >
> > - /*
> > - * Changing DAX config may require inode locking for mapping
> > - * invalidation. These need to be held all the way to transaction commit
> > - * or cancel time, so need to be passed through to
> > - * xfs_ioctl_setattr_get_trans() so it can apply them to the join call
> > - * appropriately.
> > - */
> > - error = xfs_ioctl_setattr_dax_invalidate(ip, &fa, &join_flags);
> > + error = xfs_ioctl_setattr_dax_invalidate(ip);
> > if (error)
> > goto out_drop_write;
> >
> > - tp = xfs_ioctl_setattr_get_trans(ip, join_flags);
> > + tp = xfs_ioctl_setattr_get_trans(ip);
> > if (IS_ERR(tp)) {
> > error = PTR_ERR(tp);
> > goto out_drop_write;
> > --
> > 2.25.1
> >

2020-04-08 21:57:46

by Dave Chinner

[permalink] [raw]
Subject: Re: [PATCH V6 7/8] fs/xfs: Change xfs_ioctl_setattr_dax_invalidate() to xfs_ioctl_dax_check()

On Wed, Apr 08, 2020 at 11:58:03AM +0200, Jan Kara wrote:
> On Wed 08-04-20 12:23:18, Dave Chinner wrote:
> > On Tue, Apr 07, 2020 at 11:29:57AM -0700, [email protected] wrote:
> > > From: Ira Weiny <[email protected]>
> > >
> > > We only support changing FS_XFLAG_DAX on directories. Files get their
> > > flag from the parent directory on creation only. So no data
> > > invalidation needs to happen.
> >
> > Which leads me to ask: how are users and/or admins supposed to
> > remove the flag from regular files once it is set in the filesystem?
> >
> > Only being able to override the flag via the "dax=never" mount
> > option means that once the flag is set, nobody can ever remove it
> > and they can only globally turn off dax if it gets set incorrectly.
> > It also means a global interrupt because all apps on the filesystem
> > need to be stopped so the filesystem can be unmounted and mounted
> > again with dax=never. This is highly unfriendly to admins and users.
> >
> > IOWs, we _must_ be able to clear this inode flag on regular inodes
> > in some way. I don't care if it doesn't change the current in-memory
> > state, but we must be able to clear the flags so that the next time
> > the inodes are instantiated DAX is not enabled for those files...
>
> Well, there's one way to clear the flag: delete the file. If you still care
> about the data, you can copy the data first. It isn't very convenient, I
> agree, and effectively means restarting whatever application that is using
> the file.

Restarting the application is fine. Having to backup/restore or copy
the entire data set just to turn off an inode flag? That's not a
viable management strategy. We could be talking about terabytes of
data here.

I explained how we can safely remove the flag in the other branch of
this thread...

> But it seems like more understandable API than letting user clear
> the on-disk flag but the inode will still use DAX until kernel decides to
> evict the inode

Certainly doesn't seem that way to me. "stop app, clear flags, drop
caches, restart app" is a pretty simple, easy thing to do for an
admin.

Especially compared to process that is effectively "stop app, backup
data set, delete data set, clear flags, restore data set, restart
app"

> - because that often means you need to restart the
> application using the file anyway for the flag change to have any effect.

That's a trivial requirement compared to the downtime and resource
cost of a data set backup/restore just to clear inode flags....

Cheers,

Dave.
--
Dave Chinner
[email protected]

2020-04-08 21:57:46

by Dave Chinner

[permalink] [raw]
Subject: Re: [PATCH V6 6/8] fs/xfs: Combine xfs_diflags_to_linux() and xfs_diflags_to_iflags()

On Wed, Apr 08, 2020 at 10:09:23AM -0700, Ira Weiny wrote:
> On Wed, Apr 08, 2020 at 12:08:27PM +1000, Dave Chinner wrote:
> > On Tue, Apr 07, 2020 at 11:29:56AM -0700, [email protected] wrote:
> > > From: Ira Weiny <[email protected]>
>
> [snip]
>
> > >
> > > -STATIC void
> > > -xfs_diflags_to_linux(
> > > - struct xfs_inode *ip)
> > > -{
> > > - struct inode *inode = VFS_I(ip);
> > > - unsigned int xflags = xfs_ip2xflags(ip);
> > > -
> > > - if (xflags & FS_XFLAG_IMMUTABLE)
> > > - inode->i_flags |= S_IMMUTABLE;
> > > - else
> > > - inode->i_flags &= ~S_IMMUTABLE;
> > > - if (xflags & FS_XFLAG_APPEND)
> > > - inode->i_flags |= S_APPEND;
> > > - else
> > > - inode->i_flags &= ~S_APPEND;
> > > - if (xflags & FS_XFLAG_SYNC)
> > > - inode->i_flags |= S_SYNC;
> > > - else
> > > - inode->i_flags &= ~S_SYNC;
> > > - if (xflags & FS_XFLAG_NOATIME)
> > > - inode->i_flags |= S_NOATIME;
> > > - else
> > > - inode->i_flags &= ~S_NOATIME;
> > > -#if 0 /* disabled until the flag switching races are sorted out */
> > > - if (xflags & FS_XFLAG_DAX)
> > > - inode->i_flags |= S_DAX;
> > > - else
> > > - inode->i_flags &= ~S_DAX;
> > > -#endif
> >
> > So this variant will set the flag in the inode if the disk inode
> > flag is set, otherwise it will clear it. It does it with if/else
> > branches.
> >
> >
> > > diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
> > > index e07f7b641226..a4ac8568c8c7 100644
> > > --- a/fs/xfs/xfs_iops.c
> > > +++ b/fs/xfs/xfs_iops.c
> > > @@ -1259,7 +1259,7 @@ xfs_inode_supports_dax(
> > > return xfs_inode_buftarg(ip)->bt_daxdev != NULL;
> > > }
> > >
> > > -STATIC bool
> > > +static bool
> > > xfs_inode_enable_dax(
> > > struct xfs_inode *ip)
> > > {
> >
> > This belongs in the previous patch.
>
> Ah yea... Sorry.
>
> Fixed in V7
>
> >
> > > @@ -1272,26 +1272,38 @@ xfs_inode_enable_dax(
> > > return false;
> > > }
> > >
> > > -STATIC void
> > > +void
> > > xfs_diflags_to_iflags(
> > > - struct inode *inode,
> > > - struct xfs_inode *ip)
> > > + struct xfs_inode *ip,
> > > + bool init)
> > > {
> > > - uint16_t flags = ip->i_d.di_flags;
> > > -
> > > - inode->i_flags &= ~(S_IMMUTABLE | S_APPEND | S_SYNC |
> > > - S_NOATIME | S_DAX);
> >
> > And this code cleared all the flags in the inode first, then
> > set them if the disk inode flag is set. This does not require
> > branches, resulting in more readable code and better code
> > generation.
> >
> > > + struct inode *inode = VFS_I(ip);
> > > + uint diflags = xfs_ip2xflags(ip);
> > >
> > > - if (flags & XFS_DIFLAG_IMMUTABLE)
> > > + if (diflags & FS_XFLAG_IMMUTABLE)
> > > inode->i_flags |= S_IMMUTABLE;
> > > - if (flags & XFS_DIFLAG_APPEND)
> > > + else
> > > + inode->i_flags &= ~S_IMMUTABLE;
> >
> > > + if (diflags & FS_XFLAG_APPEND)
> > > inode->i_flags |= S_APPEND;
> > > - if (flags & XFS_DIFLAG_SYNC)
> > > + else
> > > + inode->i_flags &= ~S_APPEND;
> > > + if (diflags & FS_XFLAG_SYNC)
> > > inode->i_flags |= S_SYNC;
> > > - if (flags & XFS_DIFLAG_NOATIME)
> > > + else
> > > + inode->i_flags &= ~S_SYNC;
> > > + if (diflags & FS_XFLAG_NOATIME)
> > > inode->i_flags |= S_NOATIME;
> > > - if (xfs_inode_enable_dax(ip))
> > > - inode->i_flags |= S_DAX;
> > > + else
> > > + inode->i_flags &= ~S_NOATIME;
> > > +
> > > + /* Only toggle the dax flag when initializing */
> > > + if (init) {
> > > + if (xfs_inode_enable_dax(ip))
> > > + inode->i_flags |= S_DAX;
> > > + else
> > > + inode->i_flags &= ~S_DAX;
> > > + }
> > > }
> >
> > IOWs, this:
> >
> > struct inode *inode = VFS_I(ip);
> > unsigned int xflags = xfs_ip2xflags(ip);
> > unsigned int flags = 0;
> >
> > if (xflags & FS_XFLAG_IMMUTABLE)
> > flags |= S_IMMUTABLE;
> > if (xflags & FS_XFLAG_APPEND)
> > flags |= S_APPEND;
> > if (xflags & FS_XFLAG_SYNC)
> > flags |= S_SYNC;
> > if (xflags & FS_XFLAG_NOATIME)
> > flags |= S_NOATIME;
> > if ((xflags & FS_XFLAG_DAX) && init)
> > flags |= S_DAX;
> >
> > inode->i_flags &= ~(S_IMMUTABLE | S_APPEND | S_SYNC | S_NOATIME);
> > inode->i_flags |= flags;
> >
> > ends up being much easier to read and results in better code
> > generation. And we don't need to clear the S_DAX flag when "init" is
> > set, because we are starting from an inode that has no flags set
> > (because init!)...
>
> This sounds good but I think we need a slight modification to make the function equivalent in functionality.
>
> void
> xfs_diflags_to_iflags(
> struct xfs_inode *ip,
> bool init)
> {
> struct inode *inode = VFS_I(ip);
> unsigned int xflags = xfs_ip2xflags(ip);
> unsigned int flags = 0;
>
> inode->i_flags &= ~(S_IMMUTABLE | S_APPEND | S_SYNC | S_NOATIME |
> S_DAX);

We don't want to clear the dax flag here, ever, if it is already
set. That is an externally visible change and opens us up (again) to
races where IS_DAX() changes half way through a fault path. IOWs, avoiding
clearing the DAX flag was something I did explicitly in the above
code fragment.

And it makes the logic clearer by pre-calculating the new flags,
then clearing and setting the inode flags together, rather than
having the spearated at the top and bottom of the function.

THis leads to an obvious conclusion: if we never clear the in memory
S_DAX flag, we can actually clear the on-disk flag safely, so that
next time the inode cycles into memory it won't be using DAX. IOWs,
admins can stop the applications, clear the DAX flag and drop
caches. This should result in the inode being recycled and when the
app is restarted it will run without DAX. No ned for deleting files,
copying large data sets, etc just to turn off an inode flag.

Cheers,

Dave.
--
Dave Chinner
[email protected]

2020-04-08 21:59:51

by Dan Williams

[permalink] [raw]
Subject: Re: [PATCH V6 6/8] fs/xfs: Combine xfs_diflags_to_linux() and xfs_diflags_to_iflags()

On Wed, Apr 8, 2020 at 2:02 PM Dave Chinner <[email protected]> wrote:
>
> On Wed, Apr 08, 2020 at 10:09:23AM -0700, Ira Weiny wrote:
> > On Wed, Apr 08, 2020 at 12:08:27PM +1000, Dave Chinner wrote:
> > > On Tue, Apr 07, 2020 at 11:29:56AM -0700, [email protected] wrote:
> > > > From: Ira Weiny <[email protected]>
> >
> > [snip]
> >
> > > >
> > > > -STATIC void
> > > > -xfs_diflags_to_linux(
> > > > - struct xfs_inode *ip)
> > > > -{
> > > > - struct inode *inode = VFS_I(ip);
> > > > - unsigned int xflags = xfs_ip2xflags(ip);
> > > > -
> > > > - if (xflags & FS_XFLAG_IMMUTABLE)
> > > > - inode->i_flags |= S_IMMUTABLE;
> > > > - else
> > > > - inode->i_flags &= ~S_IMMUTABLE;
> > > > - if (xflags & FS_XFLAG_APPEND)
> > > > - inode->i_flags |= S_APPEND;
> > > > - else
> > > > - inode->i_flags &= ~S_APPEND;
> > > > - if (xflags & FS_XFLAG_SYNC)
> > > > - inode->i_flags |= S_SYNC;
> > > > - else
> > > > - inode->i_flags &= ~S_SYNC;
> > > > - if (xflags & FS_XFLAG_NOATIME)
> > > > - inode->i_flags |= S_NOATIME;
> > > > - else
> > > > - inode->i_flags &= ~S_NOATIME;
> > > > -#if 0 /* disabled until the flag switching races are sorted out */
> > > > - if (xflags & FS_XFLAG_DAX)
> > > > - inode->i_flags |= S_DAX;
> > > > - else
> > > > - inode->i_flags &= ~S_DAX;
> > > > -#endif
> > >
> > > So this variant will set the flag in the inode if the disk inode
> > > flag is set, otherwise it will clear it. It does it with if/else
> > > branches.
> > >
> > >
> > > > diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
> > > > index e07f7b641226..a4ac8568c8c7 100644
> > > > --- a/fs/xfs/xfs_iops.c
> > > > +++ b/fs/xfs/xfs_iops.c
> > > > @@ -1259,7 +1259,7 @@ xfs_inode_supports_dax(
> > > > return xfs_inode_buftarg(ip)->bt_daxdev != NULL;
> > > > }
> > > >
> > > > -STATIC bool
> > > > +static bool
> > > > xfs_inode_enable_dax(
> > > > struct xfs_inode *ip)
> > > > {
> > >
> > > This belongs in the previous patch.
> >
> > Ah yea... Sorry.
> >
> > Fixed in V7
> >
> > >
> > > > @@ -1272,26 +1272,38 @@ xfs_inode_enable_dax(
> > > > return false;
> > > > }
> > > >
> > > > -STATIC void
> > > > +void
> > > > xfs_diflags_to_iflags(
> > > > - struct inode *inode,
> > > > - struct xfs_inode *ip)
> > > > + struct xfs_inode *ip,
> > > > + bool init)
> > > > {
> > > > - uint16_t flags = ip->i_d.di_flags;
> > > > -
> > > > - inode->i_flags &= ~(S_IMMUTABLE | S_APPEND | S_SYNC |
> > > > - S_NOATIME | S_DAX);
> > >
> > > And this code cleared all the flags in the inode first, then
> > > set them if the disk inode flag is set. This does not require
> > > branches, resulting in more readable code and better code
> > > generation.
> > >
> > > > + struct inode *inode = VFS_I(ip);
> > > > + uint diflags = xfs_ip2xflags(ip);
> > > >
> > > > - if (flags & XFS_DIFLAG_IMMUTABLE)
> > > > + if (diflags & FS_XFLAG_IMMUTABLE)
> > > > inode->i_flags |= S_IMMUTABLE;
> > > > - if (flags & XFS_DIFLAG_APPEND)
> > > > + else
> > > > + inode->i_flags &= ~S_IMMUTABLE;
> > >
> > > > + if (diflags & FS_XFLAG_APPEND)
> > > > inode->i_flags |= S_APPEND;
> > > > - if (flags & XFS_DIFLAG_SYNC)
> > > > + else
> > > > + inode->i_flags &= ~S_APPEND;
> > > > + if (diflags & FS_XFLAG_SYNC)
> > > > inode->i_flags |= S_SYNC;
> > > > - if (flags & XFS_DIFLAG_NOATIME)
> > > > + else
> > > > + inode->i_flags &= ~S_SYNC;
> > > > + if (diflags & FS_XFLAG_NOATIME)
> > > > inode->i_flags |= S_NOATIME;
> > > > - if (xfs_inode_enable_dax(ip))
> > > > - inode->i_flags |= S_DAX;
> > > > + else
> > > > + inode->i_flags &= ~S_NOATIME;
> > > > +
> > > > + /* Only toggle the dax flag when initializing */
> > > > + if (init) {
> > > > + if (xfs_inode_enable_dax(ip))
> > > > + inode->i_flags |= S_DAX;
> > > > + else
> > > > + inode->i_flags &= ~S_DAX;
> > > > + }
> > > > }
> > >
> > > IOWs, this:
> > >
> > > struct inode *inode = VFS_I(ip);
> > > unsigned int xflags = xfs_ip2xflags(ip);
> > > unsigned int flags = 0;
> > >
> > > if (xflags & FS_XFLAG_IMMUTABLE)
> > > flags |= S_IMMUTABLE;
> > > if (xflags & FS_XFLAG_APPEND)
> > > flags |= S_APPEND;
> > > if (xflags & FS_XFLAG_SYNC)
> > > flags |= S_SYNC;
> > > if (xflags & FS_XFLAG_NOATIME)
> > > flags |= S_NOATIME;
> > > if ((xflags & FS_XFLAG_DAX) && init)
> > > flags |= S_DAX;
> > >
> > > inode->i_flags &= ~(S_IMMUTABLE | S_APPEND | S_SYNC | S_NOATIME);
> > > inode->i_flags |= flags;
> > >
> > > ends up being much easier to read and results in better code
> > > generation. And we don't need to clear the S_DAX flag when "init" is
> > > set, because we are starting from an inode that has no flags set
> > > (because init!)...
> >
> > This sounds good but I think we need a slight modification to make the function equivalent in functionality.
> >
> > void
> > xfs_diflags_to_iflags(
> > struct xfs_inode *ip,
> > bool init)
> > {
> > struct inode *inode = VFS_I(ip);
> > unsigned int xflags = xfs_ip2xflags(ip);
> > unsigned int flags = 0;
> >
> > inode->i_flags &= ~(S_IMMUTABLE | S_APPEND | S_SYNC | S_NOATIME |
> > S_DAX);
>
> We don't want to clear the dax flag here, ever, if it is already
> set. That is an externally visible change and opens us up (again) to
> races where IS_DAX() changes half way through a fault path. IOWs, avoiding
> clearing the DAX flag was something I did explicitly in the above
> code fragment.
>
> And it makes the logic clearer by pre-calculating the new flags,
> then clearing and setting the inode flags together, rather than
> having the spearated at the top and bottom of the function.
>
> THis leads to an obvious conclusion: if we never clear the in memory
> S_DAX flag, we can actually clear the on-disk flag safely, so that
> next time the inode cycles into memory it won't be using DAX. IOWs,
> admins can stop the applications, clear the DAX flag and drop
> caches. This should result in the inode being recycled and when the
> app is restarted it will run without DAX. No ned for deleting files,
> copying large data sets, etc just to turn off an inode flag.

Makes sense, but is that sufficient? I recall you saying there might
be a multitude of other reasons that the inode is not evicted, not the
least of which is races [1]. Does this need another flag, lets call it
"dax toggle" to track the "I requested the inode to clear the flag,
but on cache-flush + restart the inode never got evicted" case. S_DAX
almost plays this role, but it loses the ability to audit which files
are pending an inode eviction event. So the dax-toggle flag indicates
to the kernel to xor the toggle value with the inode flag on inode
instantiation and the dax inode flag is never directly manipulated by
the ioctl path.

[1]: http://lore.kernel.org/r/[email protected]

2020-04-08 22:07:51

by Ira Weiny

[permalink] [raw]
Subject: Re: [PATCH V6 6/8] fs/xfs: Combine xfs_diflags_to_linux() and xfs_diflags_to_iflags()

On Thu, Apr 09, 2020 at 07:02:36AM +1000, Dave Chinner wrote:
> On Wed, Apr 08, 2020 at 10:09:23AM -0700, Ira Weiny wrote:

[snip]

> >
> > This sounds good but I think we need a slight modification to make the function equivalent in functionality.
> >
> > void
> > xfs_diflags_to_iflags(
> > struct xfs_inode *ip,
> > bool init)
> > {
> > struct inode *inode = VFS_I(ip);
> > unsigned int xflags = xfs_ip2xflags(ip);
> > unsigned int flags = 0;
> >
> > inode->i_flags &= ~(S_IMMUTABLE | S_APPEND | S_SYNC | S_NOATIME |
> > S_DAX);
>
> We don't want to clear the dax flag here, ever, if it is already
> set. That is an externally visible change and opens us up (again) to
> races where IS_DAX() changes half way through a fault path. IOWs, avoiding
> clearing the DAX flag was something I did explicitly in the above
> code fragment.

<sigh> yes... you are correct.

But I don't like depending on the caller to clear the S_DAX flag if
xfs_inode_enable_dax() is false. IMO this function should clear the flag in
that case for consistency...

This is part of the reason I used the if/else logic from xfs_diflags_to_linux()
originally. It is very explicit.

>
> And it makes the logic clearer by pre-calculating the new flags,
> then clearing and setting the inode flags together, rather than
> having the spearated at the top and bottom of the function.

But this will not clear the S_DAX flag even if init is true. To me that is a
potential for confusion down the road.

>
> THis leads to an obvious conclusion: if we never clear the in memory
> S_DAX flag, we can actually clear the on-disk flag safely, so that
> next time the inode cycles into memory it won't be using DAX. IOWs,
> admins can stop the applications, clear the DAX flag and drop
> caches. This should result in the inode being recycled and when the
> app is restarted it will run without DAX. No ned for deleting files,
> copying large data sets, etc just to turn off an inode flag.

We already discussed evicting the inode and it was determined to be too
confusing.[*]

Furthermore, if we did want an interface like that why not allow the on-disk
flag to be set as well as cleared?

IMO, this function should set all of the flags consistently including S_DAX.

Ira

[*] https://lore.kernel.org/lkml/[email protected]/

>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> [email protected]

2020-04-08 22:11:32

by Ira Weiny

[permalink] [raw]
Subject: Re: [PATCH V6 6/8] fs/xfs: Combine xfs_diflags_to_linux() and xfs_diflags_to_iflags()

On Wed, Apr 08, 2020 at 02:28:30PM -0700, Dan Williams wrote:
> On Wed, Apr 8, 2020 at 2:02 PM Dave Chinner <[email protected]> wrote:
> >

[snip]

> > >
> > > void
> > > xfs_diflags_to_iflags(
> > > struct xfs_inode *ip,
> > > bool init)
> > > {
> > > struct inode *inode = VFS_I(ip);
> > > unsigned int xflags = xfs_ip2xflags(ip);
> > > unsigned int flags = 0;
> > >
> > > inode->i_flags &= ~(S_IMMUTABLE | S_APPEND | S_SYNC | S_NOATIME |
> > > S_DAX);
> >
> > We don't want to clear the dax flag here, ever, if it is already
> > set. That is an externally visible change and opens us up (again) to
> > races where IS_DAX() changes half way through a fault path. IOWs, avoiding
> > clearing the DAX flag was something I did explicitly in the above
> > code fragment.
> >
> > And it makes the logic clearer by pre-calculating the new flags,
> > then clearing and setting the inode flags together, rather than
> > having the spearated at the top and bottom of the function.
> >
> > THis leads to an obvious conclusion: if we never clear the in memory
> > S_DAX flag, we can actually clear the on-disk flag safely, so that
> > next time the inode cycles into memory it won't be using DAX. IOWs,
> > admins can stop the applications, clear the DAX flag and drop
> > caches. This should result in the inode being recycled and when the
> > app is restarted it will run without DAX. No ned for deleting files,
> > copying large data sets, etc just to turn off an inode flag.
>
> Makes sense, but is that sufficient? I recall you saying there might
> be a multitude of other reasons that the inode is not evicted, not the
> least of which is races [1]. Does this need another flag, lets call it
> "dax toggle" to track the "I requested the inode to clear the flag,
> but on cache-flush + restart the inode never got evicted" case. S_DAX
> almost plays this role, but it loses the ability to audit which files
> are pending an inode eviction event. So the dax-toggle flag indicates
> to the kernel to xor the toggle value with the inode flag on inode
> instantiation and the dax inode flag is never directly manipulated by
> the ioctl path.
>
> [1]: http://lore.kernel.org/r/[email protected]

FWIW I think we should continue down this simplified interface and get this
done for 5.8. If we can come up with a way for delayed mode change I'm all for
looking into that. But there has been too much controversy/difficulty about
changing the bit on a file.

So let's table this idea until >= 5.9

Ira

2020-04-08 22:28:03

by Ira Weiny

[permalink] [raw]
Subject: Re: [PATCH V6 7/8] fs/xfs: Change xfs_ioctl_setattr_dax_invalidate() to xfs_ioctl_dax_check()

On Thu, Apr 09, 2020 at 07:09:50AM +1000, Dave Chinner wrote:
> On Wed, Apr 08, 2020 at 11:58:03AM +0200, Jan Kara wrote:
> > On Wed 08-04-20 12:23:18, Dave Chinner wrote:
> > > On Tue, Apr 07, 2020 at 11:29:57AM -0700, [email protected] wrote:
> > > > From: Ira Weiny <[email protected]>
> > > >
> > > > We only support changing FS_XFLAG_DAX on directories. Files get their
> > > > flag from the parent directory on creation only. So no data
> > > > invalidation needs to happen.
> > >
> > > Which leads me to ask: how are users and/or admins supposed to
> > > remove the flag from regular files once it is set in the filesystem?
> > >
> > > Only being able to override the flag via the "dax=never" mount
> > > option means that once the flag is set, nobody can ever remove it
> > > and they can only globally turn off dax if it gets set incorrectly.
> > > It also means a global interrupt because all apps on the filesystem
> > > need to be stopped so the filesystem can be unmounted and mounted
> > > again with dax=never. This is highly unfriendly to admins and users.
> > >
> > > IOWs, we _must_ be able to clear this inode flag on regular inodes
> > > in some way. I don't care if it doesn't change the current in-memory
> > > state, but we must be able to clear the flags so that the next time
> > > the inodes are instantiated DAX is not enabled for those files...
> >
> > Well, there's one way to clear the flag: delete the file. If you still care
> > about the data, you can copy the data first. It isn't very convenient, I
> > agree, and effectively means restarting whatever application that is using
> > the file.
>
> Restarting the application is fine. Having to backup/restore or copy
> the entire data set just to turn off an inode flag? That's not a
> viable management strategy. We could be talking about terabytes of
> data here.
>
> I explained how we can safely remove the flag in the other branch of
> this thread...
>
> > But it seems like more understandable API than letting user clear
> > the on-disk flag but the inode will still use DAX until kernel decides to
> > evict the inode
>
> Certainly doesn't seem that way to me. "stop app, clear flags, drop
> caches, restart app" is a pretty simple, easy thing to do for an
> admin.

I want to be clear here: I think this is reasonable. However, I don't see
consensus for that interface.

Christoph in particular said that a 'lazy change' is: "... straight from
the playbook for arcane and confusing API designs."

"But returning an error and doing a lazy change anyway is straight from
the playbook for arcane and confusing API designs."

-- https://lore.kernel.org/lkml/[email protected]/

Did I somehow misunderstand this?

Again for this patch set, 5.8, lets leave that alone for now. I think if we
disable setting this on files right now we can still allow it in the future as
another step forward.

>
> Especially compared to process that is effectively "stop app, backup
> data set, delete data set, clear flags, restore data set, restart
> app"
>
> > - because that often means you need to restart the
> > application using the file anyway for the flag change to have any effect.
>
> That's a trivial requirement compared to the downtime and resource
> cost of a data set backup/restore just to clear inode flags....
>

I agree but others do not. This still provides a baby step forward and some
granularity for those who plan out the creation of their files.

Ira

2020-04-08 23:21:39

by Dave Chinner

[permalink] [raw]
Subject: Re: [PATCH V6 6/8] fs/xfs: Combine xfs_diflags_to_linux() and xfs_diflags_to_iflags()

On Wed, Apr 08, 2020 at 03:07:35PM -0700, Ira Weiny wrote:
> On Thu, Apr 09, 2020 at 07:02:36AM +1000, Dave Chinner wrote:
> > On Wed, Apr 08, 2020 at 10:09:23AM -0700, Ira Weiny wrote:
>
> [snip]
>
> > >
> > > This sounds good but I think we need a slight modification to make the function equivalent in functionality.
> > >
> > > void
> > > xfs_diflags_to_iflags(
> > > struct xfs_inode *ip,
> > > bool init)
> > > {
> > > struct inode *inode = VFS_I(ip);
> > > unsigned int xflags = xfs_ip2xflags(ip);
> > > unsigned int flags = 0;
> > >
> > > inode->i_flags &= ~(S_IMMUTABLE | S_APPEND | S_SYNC | S_NOATIME |
> > > S_DAX);
> >
> > We don't want to clear the dax flag here, ever, if it is already
> > set. That is an externally visible change and opens us up (again) to
> > races where IS_DAX() changes half way through a fault path. IOWs, avoiding
> > clearing the DAX flag was something I did explicitly in the above
> > code fragment.
>
> <sigh> yes... you are correct.
>
> But I don't like depending on the caller to clear the S_DAX flag if
> xfs_inode_enable_dax() is false. IMO this function should clear the flag in
> that case for consistency...

No. We simply cannot do that here except in the init case when the
inode is not yet visible to userspace. In which case, we know -for
certain- that the S_DAX is not set, and hence we do not need to
clear it. Initial conditions matter!

If you want to make sure of this, add this:

ASSERT(!(IS_DAX(inode) && init));

And now we'll catch inodes that incorrectly have S_DAX set at init
time.

> > memory S_DAX flag, we can actually clear the on-disk flag
> > safely, so that next time the inode cycles into memory it won't
> > be using DAX. IOWs, admins can stop the applications, clear the
> > DAX flag and drop caches. This should result in the inode being
> > recycled and when the app is restarted it will run without DAX.
> > No ned for deleting files, copying large data sets, etc just to
> > turn off an inode flag.
>
> We already discussed evicting the inode and it was determined to
> be too confusing.[*]

That discussion did not even consider how admins are supposed to
clear the inode flag once it is set on disk. It was entirely
focussed around "we can't change in memory S_DAX state" and how the
tri-state mount option to "override" the on-disk flag could be done.

Nobody noticed that being unable to rmeove the on-disk flag means
the admin's only option to turn off dax for an application is to
turn it off for everything, filesystem wide, which requires:

1. stopping the app.
2. stopping every other app using the filesystem
3. unmounting the filesystem
4. changing to dax=never mount option
5. mounting the filesystem
6. restarting all apps.

It's a hard stop for everything using the filesystem, and it changes
the runtime environment for all applications, not just the one that
needs DAX turned off. Not to mention that if it's the root
filesystem that is using DAX, then it's a full system reboot needed
to change the mount options.

IMO, this is a non-starter from a production point of view - testing
and qualification of all applications rather than just the affected
app is required to make this sort of change. It simply does not
follow the "minimal change to fix the problem" rules for managing
issues in production environments.

So, pLease explain to me how this process:

1. stop the app
2. remove inode flags via xfs_io
3. run drop_caches
4. start the app

is worse than requiring admins to unmount the filesystem to turn off
DAX for an application.

> Furthermore, if we did want an interface like that why not allow
> the on-disk flag to be set as well as cleared?

Well, why not - it's why I implemented the flag in the first place!
The only problem we have here is how to safely change the in-memory
DAX state, and that largely has nothing to do with setting/clearing
the on-disk flag....

Cheers,

Dave.

--
Dave Chinner
[email protected]

2020-04-08 23:48:51

by Dave Chinner

[permalink] [raw]
Subject: Re: [PATCH V6 7/8] fs/xfs: Change xfs_ioctl_setattr_dax_invalidate() to xfs_ioctl_dax_check()

On Wed, Apr 08, 2020 at 03:26:36PM -0700, Ira Weiny wrote:
> On Thu, Apr 09, 2020 at 07:09:50AM +1000, Dave Chinner wrote:
> > On Wed, Apr 08, 2020 at 11:58:03AM +0200, Jan Kara wrote:
> > I explained how we can safely remove the flag in the other branch of
> > this thread...
> >
> > > But it seems like more understandable API than letting user clear
> > > the on-disk flag but the inode will still use DAX until kernel decides to
> > > evict the inode
> >
> > Certainly doesn't seem that way to me. "stop app, clear flags, drop
> > caches, restart app" is a pretty simple, easy thing to do for an
> > admin.
>
> I want to be clear here: I think this is reasonable. However, I don't see
> consensus for that interface.
>
> Christoph in particular said that a 'lazy change' is: "... straight from
> the playbook for arcane and confusing API designs."
>
> "But returning an error and doing a lazy change anyway is straight from
> the playbook for arcane and confusing API designs."
>
> -- https://lore.kernel.org/lkml/[email protected]/
>
> Did I somehow misunderstand this?

Yes. Clearing the on-disk flag successfully should not return an
error.

What is wrong is having it clear the flag successfully and returning
an error because the operation doesn't take immediate effect, then
having the change take effect later after telling the application
there was an error.

That's what Christoph was saying is "straight from the playbook for
arcane and confusing API designs."

There's absolutely nothing wrong with setting/clearing the on-disk
flag and having the change take effect some time later depending on
some external context. We've done this sort of thing for a -long
time- and it's not XFS specific at all.

e.g. changing the on-disk APPEND flag doesn't change the write
behaviour of currently open files - it only affects the behaviour of
future file opens. IOWs, we can have the flag set on disk, but we
can still write randomly to the inode as long as we have a file
descriptor that was opened before the APPEND on disk flag was set.

That's exactly the same class of behaviour as we are talking about
here for the on-disk DAX flag.

> > Especially compared to process that is effectively "stop app, backup
> > data set, delete data set, clear flags, restore data set, restart
> > app"
> >
> > > - because that often means you need to restart the
> > > application using the file anyway for the flag change to have any effect.
> >
> > That's a trivial requirement compared to the downtime and resource
> > cost of a data set backup/restore just to clear inode flags....
>
> I agree but others do not. This still provides a baby step forward and some

It's not a baby step forward. We can't expose a behaviour to
userspace and then decide to change it completely at some later
date. We have to think through the entire admin model before
setting it in concrete.

If an admin operation can set an optional persistent feature flags
on a file, then there *must* be admin operations that can remove
that persistent feature flag from said files. This has *nothing to
do with DAX* - it's a fundamental principle of balanced system
design.

Cheers,

Dave.
--
Dave Chinner
[email protected]

2020-04-08 23:59:02

by Dave Chinner

[permalink] [raw]
Subject: Re: [PATCH V6 6/8] fs/xfs: Combine xfs_diflags_to_linux() and xfs_diflags_to_iflags()

On Wed, Apr 08, 2020 at 02:28:30PM -0700, Dan Williams wrote:
> On Wed, Apr 8, 2020 at 2:02 PM Dave Chinner <[email protected]> wrote:
> > THis leads to an obvious conclusion: if we never clear the in memory
> > S_DAX flag, we can actually clear the on-disk flag safely, so that
> > next time the inode cycles into memory it won't be using DAX. IOWs,
> > admins can stop the applications, clear the DAX flag and drop
> > caches. This should result in the inode being recycled and when the
> > app is restarted it will run without DAX. No ned for deleting files,
> > copying large data sets, etc just to turn off an inode flag.
>
> Makes sense, but is that sufficient? I recall you saying there might
> be a multitude of other reasons that the inode is not evicted, not the
> least of which is races [1]. Does this need another flag, lets call it
> "dax toggle" to track the "I requested the inode to clear the flag,
> but on cache-flush + restart the inode never got evicted" case.

You mean something like XFS_IDONTCACHE?

i.e. the functionality already exists in XFS to selectively evict an
inode from cache when the last reference to it is dropped rather
than let it go to the LRUs and hang around in memory.

That flag can be set when changing the on disk DAX flag, and we can
tweak how it works so new cache hits don't clear it (as happens
now). Hence the only thing that can prevent eviction are active
references.

That means we'll still need to stop the application and drop_caches,
because we need to close all the files and purge the dentries that
hold references to the inode before it can be evicted.

Cheers,

Dave.
--
Dave Chinner
[email protected]

2020-04-09 00:12:32

by Ira Weiny

[permalink] [raw]
Subject: Re: [PATCH V6 6/8] fs/xfs: Combine xfs_diflags_to_linux() and xfs_diflags_to_iflags()

On Thu, Apr 09, 2020 at 09:21:06AM +1000, Dave Chinner wrote:
> On Wed, Apr 08, 2020 at 03:07:35PM -0700, Ira Weiny wrote:
> > On Thu, Apr 09, 2020 at 07:02:36AM +1000, Dave Chinner wrote:
> > > On Wed, Apr 08, 2020 at 10:09:23AM -0700, Ira Weiny wrote:
> >
> > [snip]
> >
> > > >
> > > > This sounds good but I think we need a slight modification to make the function equivalent in functionality.
> > > >
> > > > void
> > > > xfs_diflags_to_iflags(
> > > > struct xfs_inode *ip,
> > > > bool init)
> > > > {
> > > > struct inode *inode = VFS_I(ip);
> > > > unsigned int xflags = xfs_ip2xflags(ip);
> > > > unsigned int flags = 0;
> > > >
> > > > inode->i_flags &= ~(S_IMMUTABLE | S_APPEND | S_SYNC | S_NOATIME |
> > > > S_DAX);
> > >
> > > We don't want to clear the dax flag here, ever, if it is already
> > > set. That is an externally visible change and opens us up (again) to
> > > races where IS_DAX() changes half way through a fault path. IOWs, avoiding
> > > clearing the DAX flag was something I did explicitly in the above
> > > code fragment.
> >
> > <sigh> yes... you are correct.
> >
> > But I don't like depending on the caller to clear the S_DAX flag if
> > xfs_inode_enable_dax() is false. IMO this function should clear the flag in
> > that case for consistency...
>
> No. We simply cannot do that here except in the init case when the
> inode is not yet visible to userspace. In which case, we know -for
> certain- that the S_DAX is not set, and hence we do not need to
> clear it. Initial conditions matter!
>
> If you want to make sure of this, add this:
>
> ASSERT(!(IS_DAX(inode) && init));
>
> And now we'll catch inodes that incorrectly have S_DAX set at init
> time.

Ok, that will work. Also documents that expected initial condition.

>
> > > memory S_DAX flag, we can actually clear the on-disk flag
> > > safely, so that next time the inode cycles into memory it won't
> > > be using DAX. IOWs, admins can stop the applications, clear the
> > > DAX flag and drop caches. This should result in the inode being
> > > recycled and when the app is restarted it will run without DAX.
> > > No ned for deleting files, copying large data sets, etc just to
> > > turn off an inode flag.
> >
> > We already discussed evicting the inode and it was determined to
> > be too confusing.[*]
>
> That discussion did not even consider how admins are supposed to
> clear the inode flag once it is set on disk.

I think this is a bit unfair. I think we were all considering it... and I
still think this patch set is a step in the right direction.

> It was entirely
> focussed around "we can't change in memory S_DAX state"

Not true. There were many ideas on how to change the FS_XFLAG_DAX with some
sort of delayed S_DAX state to avoid changing S_DAX on an in memory inode.

I made the comment:

"... I want to clarify. ... we could have the flag change with an
appropriate error which could let the user know the change has been
delayed."

-- https://lore.kernel.org/lkml/[email protected]/

Jan made multiple comments:

"I generally like the proposal but I think the fact that toggling
FS_XFLAG_DAX will not have immediate effect on S_DAX will cause quite
some confusion and ultimately bug reports."

-- https://lore.kernel.org/lkml/[email protected]/


"Just switch FS_XFLAG_DAX flag, S_DAX flag will magically switch when
inode gets evicted and the inode gets reloaded from the disk again. Did
I misunderstand anything?

And my thinking was that this is surprising behavior for the user and
so it will likely generate lots of bug reports along the lines of "DAX
inode flag does not work!"."

-- https://lore.kernel.org/lkml/[email protected]/

Darrick also had similar ideas/comments.

Christoph did say:

"A reasonably smart application can try to evict itself."

-- https://lore.kernel.org/lkml/[email protected]/

Which I was unclear about???

Christoph does this mean you would be ok with changing the FS_XFLAG_DAX on disk
and letting S_DAX change later?

> and how the
> tri-state mount option to "override" the on-disk flag could be done.
>
> Nobody noticed that being unable to rmeove the on-disk flag means
> the admin's only option to turn off dax for an application is to
> turn it off for everything, filesystem wide, which requires:

No. This is not entirely true. While I don't like the idea of having to copy
data (and I agree with your points) it is possible to do.

>
> 1. stopping the app.
> 2. stopping every other app using the filesystem
> 3. unmounting the filesystem
> 4. changing to dax=never mount option

I don't understand why we need to unmount and mount with dax=never?

> 5. mounting the filesystem
> 6. restarting all apps.
>
> It's a hard stop for everything using the filesystem, and it changes
> the runtime environment for all applications, not just the one that
> needs DAX turned off. Not to mention that if it's the root
> filesystem that is using DAX, then it's a full system reboot needed
> to change the mount options.
>
> IMO, this is a non-starter from a production point of view - testing
> and qualification of all applications rather than just the affected
> app is required to make this sort of change. It simply does not
> follow the "minimal change to fix the problem" rules for managing
> issues in production environments.
>
> So, pLease explain to me how this process:
>
> 1. stop the app
> 2. remove inode flags via xfs_io
> 3. run drop_caches
> 4. start the app
>
> is worse than requiring admins to unmount the filesystem to turn off
> DAX for an application.

Jan? Christoph?

>
> > Furthermore, if we did want an interface like that why not allow
> > the on-disk flag to be set as well as cleared?
>
> Well, why not - it's why I implemented the flag in the first place!
> The only problem we have here is how to safely change the in-memory
> DAX state, and that largely has nothing to do with setting/clearing
> the on-disk flag....

With the above change to xfs_diflags_to_iflags() I think we are ok here.

Ira

2020-04-09 00:22:22

by Ira Weiny

[permalink] [raw]
Subject: Re: [PATCH V6 6/8] fs/xfs: Combine xfs_diflags_to_linux() and xfs_diflags_to_iflags()

On Thu, Apr 09, 2020 at 09:58:36AM +1000, Dave Chinner wrote:
> On Wed, Apr 08, 2020 at 02:28:30PM -0700, Dan Williams wrote:
> > On Wed, Apr 8, 2020 at 2:02 PM Dave Chinner <[email protected]> wrote:
> > > THis leads to an obvious conclusion: if we never clear the in memory
> > > S_DAX flag, we can actually clear the on-disk flag safely, so that
> > > next time the inode cycles into memory it won't be using DAX. IOWs,
> > > admins can stop the applications, clear the DAX flag and drop
> > > caches. This should result in the inode being recycled and when the
> > > app is restarted it will run without DAX. No ned for deleting files,
> > > copying large data sets, etc just to turn off an inode flag.
> >
> > Makes sense, but is that sufficient? I recall you saying there might
> > be a multitude of other reasons that the inode is not evicted, not the
> > least of which is races [1]. Does this need another flag, lets call it
> > "dax toggle" to track the "I requested the inode to clear the flag,
> > but on cache-flush + restart the inode never got evicted" case.
>
> You mean something like XFS_IDONTCACHE?
>
> i.e. the functionality already exists in XFS to selectively evict an
> inode from cache when the last reference to it is dropped rather
> than let it go to the LRUs and hang around in memory.
>
> That flag can be set when changing the on disk DAX flag, and we can
> tweak how it works so new cache hits don't clear it (as happens
> now). Hence the only thing that can prevent eviction are active
> references.
>
> That means we'll still need to stop the application and drop_caches,
> because we need to close all the files and purge the dentries that
> hold references to the inode before it can be evicted.

That sounds like a great idea...

Jan? Christoph?

I will reiterate though that any delayed S_DAX change be done as a follow on
series to the one proposed here.

Ira

2020-04-09 00:50:28

by Dave Chinner

[permalink] [raw]
Subject: Re: [PATCH V6 6/8] fs/xfs: Combine xfs_diflags_to_linux() and xfs_diflags_to_iflags()

On Wed, Apr 08, 2020 at 05:12:06PM -0700, Ira Weiny wrote:
> On Thu, Apr 09, 2020 at 09:21:06AM +1000, Dave Chinner wrote:
> > On Wed, Apr 08, 2020 at 03:07:35PM -0700, Ira Weiny wrote:
> > > On Thu, Apr 09, 2020 at 07:02:36AM +1000, Dave Chinner wrote:
> > > > On Wed, Apr 08, 2020 at 10:09:23AM -0700, Ira Weiny wrote:
> > >
> > > [snip]
> > >
> > > > >
> > > > > This sounds good but I think we need a slight modification to make the function equivalent in functionality.
> > > > >
> > > > > void
> > > > > xfs_diflags_to_iflags(
> > > > > struct xfs_inode *ip,
> > > > > bool init)
> > > > > {
> > > > > struct inode *inode = VFS_I(ip);
> > > > > unsigned int xflags = xfs_ip2xflags(ip);
> > > > > unsigned int flags = 0;
> > > > >
> > > > > inode->i_flags &= ~(S_IMMUTABLE | S_APPEND | S_SYNC | S_NOATIME |
> > > > > S_DAX);
> > > >
> > > > We don't want to clear the dax flag here, ever, if it is already
> > > > set. That is an externally visible change and opens us up (again) to
> > > > races where IS_DAX() changes half way through a fault path. IOWs, avoiding
> > > > clearing the DAX flag was something I did explicitly in the above
> > > > code fragment.
> > >
> > > <sigh> yes... you are correct.
> > >
> > > But I don't like depending on the caller to clear the S_DAX flag if
> > > xfs_inode_enable_dax() is false. IMO this function should clear the flag in
> > > that case for consistency...
> >
> > No. We simply cannot do that here except in the init case when the
> > inode is not yet visible to userspace. In which case, we know -for
> > certain- that the S_DAX is not set, and hence we do not need to
> > clear it. Initial conditions matter!
> >
> > If you want to make sure of this, add this:
> >
> > ASSERT(!(IS_DAX(inode) && init));
> >
> > And now we'll catch inodes that incorrectly have S_DAX set at init
> > time.
>
> Ok, that will work. Also documents that expected initial condition.
>
> >
> > > > memory S_DAX flag, we can actually clear the on-disk flag
> > > > safely, so that next time the inode cycles into memory it won't
> > > > be using DAX. IOWs, admins can stop the applications, clear the
> > > > DAX flag and drop caches. This should result in the inode being
> > > > recycled and when the app is restarted it will run without DAX.
> > > > No ned for deleting files, copying large data sets, etc just to
> > > > turn off an inode flag.
> > >
> > > We already discussed evicting the inode and it was determined to
> > > be too confusing.[*]
> >
> > That discussion did not even consider how admins are supposed to
> > clear the inode flag once it is set on disk.
>
> I think this is a bit unfair. I think we were all considering it... and I
> still think this patch set is a step in the right direction.
>
> > It was entirely
> > focussed around "we can't change in memory S_DAX state"
>
> Not true. There were many ideas on how to change the FS_XFLAG_DAX with some
> sort of delayed S_DAX state to avoid changing S_DAX on an in memory inode.
>
> I made the comment:
>
> "... I want to clarify. ... we could have the flag change with an
> appropriate error which could let the user know the change has been
> delayed."
>
> -- https://lore.kernel.org/lkml/[email protected]/
>
> Jan made multiple comments:
>
> "I generally like the proposal but I think the fact that toggling
> FS_XFLAG_DAX will not have immediate effect on S_DAX will cause quite
> some confusion and ultimately bug reports."
>
> -- https://lore.kernel.org/lkml/[email protected]/
>
>
> "Just switch FS_XFLAG_DAX flag, S_DAX flag will magically switch when
> inode gets evicted and the inode gets reloaded from the disk again. Did
> I misunderstand anything?
>
> And my thinking was that this is surprising behavior for the user and
> so it will likely generate lots of bug reports along the lines of "DAX
> inode flag does not work!"."
>
> -- https://lore.kernel.org/lkml/[email protected]/
>
> Darrick also had similar ideas/comments.

None of these consider how the admin is supposed to remove
the flag once it is set. They all talk about issues that result
from setting/clearing the flag inside the kernel, and don't consider
the administration impacts of having an unclearable flag on disk
that the kernel uses for operational policy decisions.

Nobody said "hey, wait, if we don't allow the flag to be cleared
once the flag is set, how do we actually remove it so the admin can
stop overriding them globally with the mount option? If the kernel
can't clear the flag, then what mechanism are we going to provide to
let them clear it without interrupting production?"

> Christoph did say:
>
> "A reasonably smart application can try to evict itself."
>
> -- https://lore.kernel.org/lkml/[email protected]/

I'd love to know how an unprivileged application can force the
eviction of an inode from cache. If the application cannot evict
files it is using reliably, then it's no better solution than
drop_caches. And given that userspace has to take references to
files to access them, by definition a file that userspace is trying
to evict will have active references and hence cannot be evicted.

However, if userspace can reliably evict inodes that it can access,
then we have a timing attack vector that we need to address. i.e.
by now everyone here should know that user controlled cache eviction
is the basic requirement for creating most OS and CPU level timing
attacks....

> > and how the
> > tri-state mount option to "override" the on-disk flag could be done.
> >
> > Nobody noticed that being unable to rmeove the on-disk flag means
> > the admin's only option to turn off dax for an application is to
> > turn it off for everything, filesystem wide, which requires:
>
> No. This is not entirely true. While I don't like the idea of having to copy
> data (and I agree with your points) it is possible to do.
>
> >
> > 1. stopping the app.
> > 2. stopping every other app using the filesystem
> > 3. unmounting the filesystem
> > 4. changing to dax=never mount option
>
> I don't understand why we need to unmount and mount with dax=never?

1. IIUC your patches correctly, that's exactly how you implemented
the dax=... mount option.

2. If you don't unmount the filesystem and only require a remount,
then changing the mount option has all the same "delayed effect"
issues that changing the on-disk flag has. i.e. We can't change the
in-memory inode state until the inode has been evicted from cache
and a remount doesn't guarantee that cache eviction will succeed.

Cheers,

Dave.
--
Dave Chinner
[email protected]

2020-04-09 12:30:40

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH V6 7/8] fs/xfs: Change xfs_ioctl_setattr_dax_invalidate() to xfs_ioctl_dax_check()

On Thu, Apr 09, 2020 at 09:48:17AM +1000, Dave Chinner wrote:
> > Christoph in particular said that a 'lazy change' is: "... straight from
> > the playbook for arcane and confusing API designs."
> >
> > "But returning an error and doing a lazy change anyway is straight from
> > the playbook for arcane and confusing API designs."
> >
> > -- https://lore.kernel.org/lkml/[email protected]/
> >
> > Did I somehow misunderstand this?
>
> Yes. Clearing the on-disk flag successfully should not return an
> error.
>
> What is wrong is having it clear the flag successfully and returning
> an error because the operation doesn't take immediate effect, then
> having the change take effect later after telling the application
> there was an error.
>
> That's what Christoph was saying is "straight from the playbook for
> arcane and confusing API designs."

Yes.

> There's absolutely nothing wrong with setting/clearing the on-disk
> flag and having the change take effect some time later depending on
> some external context. We've done this sort of thing for a -long
> time- and it's not XFS specific at all.
>
> e.g. changing the on-disk APPEND flag doesn't change the write
> behaviour of currently open files - it only affects the behaviour of
> future file opens. IOWs, we can have the flag set on disk, but we
> can still write randomly to the inode as long as we have a file
> descriptor that was opened before the APPEND on disk flag was set.
>
> That's exactly the same class of behaviour as we are talking about
> here for the on-disk DAX flag.

Some people consider that a bug, though. But I don't think we can
change that now. In general I don't think APIs that don't take
immediate effect are all that great, but in some cases we can live
with them if they are properly documented. But APIs that return
an error, but actually take effect later anyway are just crazy.

2020-04-09 12:42:11

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH V6 6/8] fs/xfs: Combine xfs_diflags_to_linux() and xfs_diflags_to_iflags()

On Wed, Apr 08, 2020 at 05:22:03PM -0700, Ira Weiny wrote:
> > You mean something like XFS_IDONTCACHE?
> >
> > i.e. the functionality already exists in XFS to selectively evict an
> > inode from cache when the last reference to it is dropped rather
> > than let it go to the LRUs and hang around in memory.
> >
> > That flag can be set when changing the on disk DAX flag, and we can
> > tweak how it works so new cache hits don't clear it (as happens
> > now). Hence the only thing that can prevent eviction are active
> > references.
> >
> > That means we'll still need to stop the application and drop_caches,
> > because we need to close all the files and purge the dentries that
> > hold references to the inode before it can be evicted.
>
> That sounds like a great idea...
>
> Jan? Christoph?

Sounds ok. Note that we could also pretty trivially lift
XFS_IDONTCACHE to the VFS if we need to apply the same scheme to
ext4.

2020-04-09 14:59:49

by Ira Weiny

[permalink] [raw]
Subject: Re: [PATCH V6 2/8] fs: Remove unneeded IS_DAX() check

On Thu, Apr 09, 2020 at 09:31:34AM +0200, Christoph Hellwig wrote:
> On Tue, Apr 07, 2020 at 11:29:52AM -0700, [email protected] wrote:
> > static inline bool io_is_direct(struct file *filp)
> > {
> > - return (filp->f_flags & O_DIRECT) || IS_DAX(filp->f_mapping->host);
> > + return (filp->f_flags & O_DIRECT);
> > }
>
> As requested last time: Can you please also just remove io_is_direct?

FWIW I just found this mail in my junk folder... My fault I know... :-/

Regardless I did not see that request last time but I can do that,

Done for V7
Ira

2020-04-09 15:06:13

by Ira Weiny

[permalink] [raw]
Subject: Re: [PATCH V6 4/8] fs/xfs: Make DAX mount option a tri-state

On Wed, Apr 08, 2020 at 10:48:01AM +1000, Dave Chinner wrote:
> On Tue, Apr 07, 2020 at 05:09:04PM -0700, Ira Weiny wrote:
> > On Wed, Apr 08, 2020 at 09:59:09AM +1000, Dave Chinner wrote:
> > >
> > > This is overly complex. Just use 2 flags:
> >
> > LOL... I was afraid someone would say that. At first I used 2 flags with
> > fsparam_string, but then I realized Darrick suggested fsparam_enum:
>
> Well, I'm not concerned about the fsparam enum, it's just that
> encoding an integer into a flags bit field is just ... messy.
> Especially when encoding that state can be done with just 2 flags.
>
> If you want to keep the xfs_mount_dax_mode() wrapper, then:
>
> static inline uint32_t xfs_mount_dax_mode(struct xfs_mount *mp)
> {
> if (mp->m_flags & XFS_MOUNT_DAX_NEVER)
> return XFS_DAX_NEVER;
> if (mp->m_flags & XFS_MOUNT_DAX_ALWAYS)
> return XFS_DAX_ALWAYS;
> return XFS_DAX_IFLAG;
> }
>
> but once it's encoded in flags like this, the wrapper really isn't
> necessary...

Done for v7

>
> Also, while I think of it, can we change "iflag" to "inode". i.e.
> the DAX state is held on the inode. Saying it comes from an "inode
> flag" encodes the implementation into the user interface. i.e. it
> could well be held in an xattr on the inode on another filesystem,
> so we shouldn't mention "flag" in the user API....

Sure "inode" is fine with me. Easy change, done for v7

Ira

2020-04-09 15:31:15

by Ira Weiny

[permalink] [raw]
Subject: Re: [PATCH V6 6/8] fs/xfs: Combine xfs_diflags_to_linux() and xfs_diflags_to_iflags()

On Wed, Apr 08, 2020 at 05:30:21PM -0700, Darrick J. Wong wrote:

[snip]

>
> But you're right, this thing keeps swirling around and around and around
> because we can't ever get to agreement on this. Maybe I'll just become
> XFS BOFH MAINTAINER and make a decision like this:
>
> 1 Applications must call statx to discover the current S_DAX state.
>
> 2 There exists an advisory file inode flag FS_XFLAG_DAX that is set based on
> the parent directory FS_XFLAG_DAX inode flag. This advisory flag can be
> changed after file creation, but it does not immediately affect the S_DAX
> state.
>
> If FS_XFLAG_DAX is set and the fs is on pmem then it will enable S_DAX at
> inode load time; if FS_XFLAG_DAX is not set, it will not enable S_DAX.
> Unless overridden...
>
> 3 There exists a dax= mount option.
>
> "-o dax=never" means "never set S_DAX, ignore FS_XFLAG_DAX"
> "-o dax=always" means "always set S_DAX (at least on pmem), ignore FS_XFLAG_DAX"
> "-o dax" by itself means "dax=always"
> "-o dax=iflag" means "follow FS_XFLAG_DAX" and is the default

per-Dave '-o dax=inode'

>
> 4 There exists an advisory directory inode flag FS_XFLAG_DAX that can be
> changed at any time. The flag state is copied into any files or
> subdirectories when they are created within that directory.

Good.

> If programs
> require file access runs in S_DAX mode, they must create those files
> inside a directory with FS_XFLAG_DAX set, or mount the fs with an
> appropriate dax mount option.

Why do we need this to be true? If the FS_XFLAG_DAX flag can be cleared why
not set it and allow the S_DAX change to occur later just like clearing it?
The logic is exactly the same.

>
> 5 Programs that require a specific file access mode (DAX or not DAX) must

s/must/can/

> do one of the following:
>
> (a) create files in directories with the FS_XFLAG_DAX flag set as needed;

Again if we allow clearing the flag why not setting? So this is 1 option they
'can' do.

>
> (b) have the administrator set an override via mount option;
>
> (c) if they need to change a file's FS_XFLAG_DAX flag so that it does not
> match the S_DAX state (as reported by statx), they must cause the
> kernel to evict the inode from memory. This can be done by:
>
> i> closing the file;
> ii> re-opening the file and using statx to see if the fs has
> changed the S_DAX flag;

i and ii need to be 1 step the user must follow.

> iii> if not, either unmount and remount the filesystem, or
> closing the file and using drop_caches.
>
> 6 I no longer think it's too wild to require that users who want to
> squeeze every last bit of performance out of the particular rough and
> tumble bits of their storage also be exposed to the difficulties of
> what happens when the operating system can't totally virtualize those
> hardware capabilities. Your high performance sports car is not a
> Toyota minivan, as it were.

I'm good with this statement. But I think we need to clean up the verbiage for
the documentation... ;-)

Thanks for the summary. I like these to get everyone on the same page. :-D
Ira

>
> I think (like Dave said) that if you set XFS_IDONTCACHE on the inode
> when you change the DAX flag, the VFS will kill the inode the instant
> the last user close()s the file. Then 5.c.ii will actually work.
>
> --D
>
> > >
> > > > Furthermore, if we did want an interface like that why not allow
> > > > the on-disk flag to be set as well as cleared?
> > >
> > > Well, why not - it's why I implemented the flag in the first place!
> > > The only problem we have here is how to safely change the in-memory
> > > DAX state, and that largely has nothing to do with setting/clearing
> > > the on-disk flag....
> >
> > With the above change to xfs_diflags_to_iflags() I think we are ok here.
> >
> > Ira
> >

2020-04-09 17:15:53

by Darrick J. Wong

[permalink] [raw]
Subject: Re: [PATCH V6 6/8] fs/xfs: Combine xfs_diflags_to_linux() and xfs_diflags_to_iflags()

On Thu, Apr 09, 2020 at 08:29:44AM -0700, Ira Weiny wrote:
> On Wed, Apr 08, 2020 at 05:30:21PM -0700, Darrick J. Wong wrote:
>
> [snip]
>
> >
> > But you're right, this thing keeps swirling around and around and around
> > because we can't ever get to agreement on this. Maybe I'll just become
> > XFS BOFH MAINTAINER and make a decision like this:
> >
> > 1 Applications must call statx to discover the current S_DAX state.
> >
> > 2 There exists an advisory file inode flag FS_XFLAG_DAX that is set based on
> > the parent directory FS_XFLAG_DAX inode flag. This advisory flag can be
> > changed after file creation, but it does not immediately affect the S_DAX
> > state.
> >
> > If FS_XFLAG_DAX is set and the fs is on pmem then it will enable S_DAX at
> > inode load time; if FS_XFLAG_DAX is not set, it will not enable S_DAX.
> > Unless overridden...
> >
> > 3 There exists a dax= mount option.
> >
> > "-o dax=never" means "never set S_DAX, ignore FS_XFLAG_DAX"
> > "-o dax=always" means "always set S_DAX (at least on pmem), ignore FS_XFLAG_DAX"
> > "-o dax" by itself means "dax=always"
> > "-o dax=iflag" means "follow FS_XFLAG_DAX" and is the default
>
> per-Dave '-o dax=inode'

Ok.

> >
> > 4 There exists an advisory directory inode flag FS_XFLAG_DAX that can be
> > changed at any time. The flag state is copied into any files or
> > subdirectories when they are created within that directory.
>
> Good.
>
> > If programs
> > require file access runs in S_DAX mode, they must create those files
> > inside a directory with FS_XFLAG_DAX set, or mount the fs with an
> > appropriate dax mount option.
>
> Why do we need this to be true? If the FS_XFLAG_DAX flag can be cleared why
> not set it and allow the S_DAX change to occur later just like clearing it?
> The logic is exactly the same.

I think I'll just delete this sentence since I started pushing all that
verbiage towards (5).

To answer your question, yes, FS_XFLAG_DAX can be set on a file at any
time, the same as it can be cleared at any time. Sorry that was
unclear, I'll fix that for the next draft (below).

> >
> > 5 Programs that require a specific file access mode (DAX or not DAX) must
>
> s/must/can/

Ok.

> > do one of the following:
> >
> > (a) create files in directories with the FS_XFLAG_DAX flag set as needed;
>
> Again if we allow clearing the flag why not setting? So this is 1 option they
> 'can' do.
>
> >
> > (b) have the administrator set an override via mount option;
> >
> > (c) if they need to change a file's FS_XFLAG_DAX flag so that it does not
> > match the S_DAX state (as reported by statx), they must cause the
> > kernel to evict the inode from memory. This can be done by:
> >
> > i> closing the file;
> > ii> re-opening the file and using statx to see if the fs has
> > changed the S_DAX flag;
>
> i and ii need to be 1 step the user must follow.

Ok, I'll combine these.

> > iii> if not, either unmount and remount the filesystem, or
> > closing the file and using drop_caches.
> >
> > 6 I no longer think it's too wild to require that users who want to
> > squeeze every last bit of performance out of the particular rough and
> > tumble bits of their storage also be exposed to the difficulties of
> > what happens when the operating system can't totally virtualize those
> > hardware capabilities. Your high performance sports car is not a
> > Toyota minivan, as it were.
>
> I'm good with this statement. But I think we need to clean up the verbiage for
> the documentation... ;-)

Heh. :)

> Thanks for the summary. I like these to get everyone on the same page. :-D

And today:

1. There exists an in-kernel access mode flag S_DAX that is set when
file accesses go directly to persistent memory, bypassing the page
cache. Applications must call statx to discover the current S_DAX
state (STATX_ATTR_DAX).

2. There exists an advisory file inode flag FS_XFLAG_DAX that is
inherited from the parent directory FS_XFLAG_DAX inode flag at file
creation time. This advisory flag can be set or cleared at any
time, but doing so does not immediately affect the S_DAX state.

Unless overridden by mount options (see (3)), if FS_XFLAG_DAX is set
and the fs is on pmem then it will enable S_DAX at inode load time;
if FS_XFLAG_DAX is not set, it will not enable S_DAX.

3. There exists a dax= mount option.

"-o dax=never" means "never set S_DAX, ignore FS_XFLAG_DAX."

"-o dax=always" means "always set S_DAX (at least on pmem),
and ignore FS_XFLAG_DAX."

"-o dax" is an alias for "dax=always".

"-o dax=inode" means "follow FS_XFLAG_DAX" and is the default.

4. There exists an advisory directory inode flag FS_XFLAG_DAX that can
be set or cleared at any time. The flag state is copied into any
files or subdirectories when they are created within that directory.

5. Programs that require a specific file access mode (DAX or not DAX)
can do one of the following:

(a) Create files in directories that the FS_XFLAG_DAX flag set as
needed; or

(b) Have the administrator set an override via mount option; or

(c) Set or clear the file's FS_XFLAG_DAX flag as needed. Programs
must then cause the kernel to evict the inode from memory. This
can be done by:

i> Closing the file and re-opening the file and using statx to
see if the fs has changed the S_DAX flag; and

ii> If the file still does not have the desired S_DAX access
mode, either unmount and remount the filesystem, or close
the file and use drop_caches.

6. It's not unreasonable that users who want to squeeze every last bit
of performance out of the particular rough and tumble bits of their
storage also be exposed to the difficulties of what happens when the
operating system can't totally virtualize those hardware
capabilities. Your high performance sports car is not a Toyota
minivan, as it were.

Given our overnight discussions, I don't think it'll be difficult to
hoist XFS_IDONTCACHE to the VFS so that 5.c.i is enough to change the
S_DAX state if nobody else is using the file.

--D

> >
> > > Furthermore, if we did want an interface like that why not allow
> > > the on-disk flag to be set as well as cleared?
> >
> > Well, why not - it's why I implemented the flag in the first place!
> > The only problem we have here is how to safely change the in-memory
> > DAX state, and that largely has nothing to do with setting/clearing
> > the on-disk flag....
>
> With the above change to xfs_diflags_to_iflags() I think we are ok here.
>
> Ira
>
--D

> Ira
>
> >
> > I think (like Dave said) that if you set XFS_IDONTCACHE on the inode
> > when you change the DAX flag, the VFS will kill the inode the instant
> > the last user close()s the file. Then 5.c.ii will actually work.
> >
> > --D
> >
> > > >
> > > > > Furthermore, if we did want an interface like that why not allow
> > > > > the on-disk flag to be set as well as cleared?
> > > >
> > > > Well, why not - it's why I implemented the flag in the first place!
> > > > The only problem we have here is how to safely change the in-memory
> > > > DAX state, and that largely has nothing to do with setting/clearing
> > > > the on-disk flag....
> > >
> > > With the above change to xfs_diflags_to_iflags() I think we are ok here.
> > >
> > > Ira
> > >

2020-04-09 17:21:30

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH V6 6/8] fs/xfs: Combine xfs_diflags_to_linux() and xfs_diflags_to_iflags()

On Thu 09-04-20 09:59:27, Darrick J. Wong wrote:
> And today:
>
> 1. There exists an in-kernel access mode flag S_DAX that is set when
> file accesses go directly to persistent memory, bypassing the page
> cache. Applications must call statx to discover the current S_DAX
> state (STATX_ATTR_DAX).
>
> 2. There exists an advisory file inode flag FS_XFLAG_DAX that is
> inherited from the parent directory FS_XFLAG_DAX inode flag at file
> creation time. This advisory flag can be set or cleared at any
> time, but doing so does not immediately affect the S_DAX state.
>
> Unless overridden by mount options (see (3)), if FS_XFLAG_DAX is set
> and the fs is on pmem then it will enable S_DAX at inode load time;
> if FS_XFLAG_DAX is not set, it will not enable S_DAX.
>
> 3. There exists a dax= mount option.
>
> "-o dax=never" means "never set S_DAX, ignore FS_XFLAG_DAX."
>
> "-o dax=always" means "always set S_DAX (at least on pmem),
> and ignore FS_XFLAG_DAX."
>
> "-o dax" is an alias for "dax=always".
>
> "-o dax=inode" means "follow FS_XFLAG_DAX" and is the default.
>
> 4. There exists an advisory directory inode flag FS_XFLAG_DAX that can
> be set or cleared at any time. The flag state is copied into any
> files or subdirectories when they are created within that directory.
>
> 5. Programs that require a specific file access mode (DAX or not DAX)
> can do one of the following:
>
> (a) Create files in directories that the FS_XFLAG_DAX flag set as
> needed; or
>
> (b) Have the administrator set an override via mount option; or
>
> (c) Set or clear the file's FS_XFLAG_DAX flag as needed. Programs
> must then cause the kernel to evict the inode from memory. This
> can be done by:
>
> i> Closing the file and re-opening the file and using statx to
> see if the fs has changed the S_DAX flag; and
>
> ii> If the file still does not have the desired S_DAX access
> mode, either unmount and remount the filesystem, or close
> the file and use drop_caches.
>
> 6. It's not unreasonable that users who want to squeeze every last bit
> of performance out of the particular rough and tumble bits of their
> storage also be exposed to the difficulties of what happens when the
> operating system can't totally virtualize those hardware
> capabilities. Your high performance sports car is not a Toyota
> minivan, as it were.
>
> Given our overnight discussions, I don't think it'll be difficult to
> hoist XFS_IDONTCACHE to the VFS so that 5.c.i is enough to change the
> S_DAX state if nobody else is using the file.

I still find the "S_DAX changes on inode eviction" confusing but I
guess it's as good as it gets and with XFS_IDONTCACHE lifted to VFS,
it seems acceptable so OK from me.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2020-04-09 20:52:19

by Ira Weiny

[permalink] [raw]
Subject: Re: [PATCH V6 6/8] fs/xfs: Combine xfs_diflags_to_linux() and xfs_diflags_to_iflags()

On Thu, Apr 09, 2020 at 02:41:27PM +0200, Christoph Hellwig wrote:
> On Wed, Apr 08, 2020 at 05:22:03PM -0700, Ira Weiny wrote:
> > > You mean something like XFS_IDONTCACHE?
> > >
> > > i.e. the functionality already exists in XFS to selectively evict an
> > > inode from cache when the last reference to it is dropped rather
> > > than let it go to the LRUs and hang around in memory.
> > >
> > > That flag can be set when changing the on disk DAX flag, and we can
> > > tweak how it works so new cache hits don't clear it (as happens
> > > now). Hence the only thing that can prevent eviction are active
> > > references.
> > >
> > > That means we'll still need to stop the application and drop_caches,
> > > because we need to close all the files and purge the dentries that
> > > hold references to the inode before it can be evicted.
> >
> > That sounds like a great idea...
> >
> > Jan? Christoph?
>
> Sounds ok. Note that we could also pretty trivially lift
> XFS_IDONTCACHE to the VFS if we need to apply the same scheme to
> ext4.

Yes I have been slowing working on ext4 in the background. So lifting
XXX_IDONTCACHE to VFS will be part of that series.

Thanks,
Ira

2020-04-09 20:56:50

by Ira Weiny

[permalink] [raw]
Subject: Re: [PATCH V6 6/8] fs/xfs: Combine xfs_diflags_to_linux() and xfs_diflags_to_iflags()

On Thu, Apr 09, 2020 at 09:59:27AM -0700, Darrick J. Wong wrote:

[snip]

> And today:
>
> 1. There exists an in-kernel access mode flag S_DAX that is set when
> file accesses go directly to persistent memory, bypassing the page
> cache. Applications must call statx to discover the current S_DAX
> state (STATX_ATTR_DAX).
>
> 2. There exists an advisory file inode flag FS_XFLAG_DAX that is
> inherited from the parent directory FS_XFLAG_DAX inode flag at file
> creation time. This advisory flag can be set or cleared at any
> time, but doing so does not immediately affect the S_DAX state.
>
> Unless overridden by mount options (see (3)), if FS_XFLAG_DAX is set
> and the fs is on pmem then it will enable S_DAX at inode load time;
> if FS_XFLAG_DAX is not set, it will not enable S_DAX.
>
> 3. There exists a dax= mount option.
>
> "-o dax=never" means "never set S_DAX, ignore FS_XFLAG_DAX."
>
> "-o dax=always" means "always set S_DAX (at least on pmem),
> and ignore FS_XFLAG_DAX."
>
> "-o dax" is an alias for "dax=always".
>
> "-o dax=inode" means "follow FS_XFLAG_DAX" and is the default.
>
> 4. There exists an advisory directory inode flag FS_XFLAG_DAX that can
> be set or cleared at any time. The flag state is copied into any
> files or subdirectories when they are created within that directory.
>
> 5. Programs that require a specific file access mode (DAX or not DAX)
> can do one of the following:
>
> (a) Create files in directories that the FS_XFLAG_DAX flag set as
> needed; or
>
> (b) Have the administrator set an override via mount option; or
>
> (c) Set or clear the file's FS_XFLAG_DAX flag as needed. Programs
> must then cause the kernel to evict the inode from memory. This
> can be done by:
>
> i> Closing the file and re-opening the file and using statx to
> see if the fs has changed the S_DAX flag; and
>
> ii> If the file still does not have the desired S_DAX access
> mode, either unmount and remount the filesystem, or close
> the file and use drop_caches.
>
> 6. It's not unreasonable that users who want to squeeze every last bit
> of performance out of the particular rough and tumble bits of their
> storage also be exposed to the difficulties of what happens when the
> operating system can't totally virtualize those hardware
> capabilities. Your high performance sports car is not a Toyota
> minivan, as it were.
>
> Given our overnight discussions, I don't think it'll be difficult to
> hoist XFS_IDONTCACHE to the VFS so that 5.c.i is enough to change the
> S_DAX state if nobody else is using the file.

Agreed!

One note on implementation, I plan to get XFS_IDONTCACHE tested with XFS and
leave hoisting it to VFS for the series which enables ext4 as that is when we
would need such hoisting.

Thanks everyone for another round of discussions! :-D

Ira

2020-04-10 00:29:39

by Dave Chinner

[permalink] [raw]
Subject: Re: [PATCH V6 6/8] fs/xfs: Combine xfs_diflags_to_linux() and xfs_diflags_to_iflags()

On Thu, Apr 09, 2020 at 02:40:31PM +0200, Christoph Hellwig wrote:
> On Thu, Apr 09, 2020 at 10:49:21AM +1000, Dave Chinner wrote:
> > > Christoph did say:
> > >
> > > "A reasonably smart application can try to evict itself."
> > >
> > > -- https://lore.kernel.org/lkml/[email protected]/
> >
> > I'd love to know how an unprivileged application can force the
> > eviction of an inode from cache.
>
> Where did the "unprivileged" suddenly come from?

I'm assuming that applications are being run without the root
permissions needed to run drop_caches. i.e. the apps are
unprivileged, and therefore can't brute force inode cache eviction.
That's why I'm asking what mechanism these applications are using to
evict inodes on demand without requiring elevated privileges,
because I can't see how they'd acheive this...

Cheers,

Dave.
--
Dave Chinner
[email protected]