2019-10-20 16:00:40

by Ira Weiny

[permalink] [raw]
Subject: [PATCH 0/5] Enable per-file/directory DAX operations

From: Ira Weiny <[email protected]>

At LSF/MM'19 [1] [2] we discussed applications that overestimate memory
consumption due to their inability to detect whether the kernel will
instantiate page cache for a file, and cases where a global dax enable via a
mount option is too coarse.

The following patch series enables selecting the use of DAX on individual files
and/or directories on xfs, and lays some groundwork to do so in ext4. In this
scheme the dax mount option can be omitted to allow the per-file property to
take effect.

The insight at LSF/MM was to separate the per-mount or per-file "physical"
capability switch from an "effective" attribute for the file.

At LSF/MM we discussed the difficulties of switching the mode of a file with
active mappings / page cache. Rather than solve those races the decision was to
just limit mode flips to 0-length files.

Finally, the physical DAX flag inheritance is maintained from previous work on
XFS but should be added for other file systems for consistence.


[1] https://lwn.net/Articles/787973/
[2] https://lwn.net/Articles/787233/

To: [email protected]
Cc: Alexander Viro <[email protected]>
Cc: "Darrick J. Wong" <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Dave Chinner <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: "Theodore Y. Ts'o" <[email protected]>
Cc: Jan Kara <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]

Ira Weiny (5):
fs/stat: Define DAX statx attribute
fs/xfs: Isolate the physical DAX flag from effective
fs/xfs: Separate functionality of xfs_inode_supports_dax()
fs/xfs: Clean up DAX support check
fs/xfs: Allow toggle of physical DAX flag

fs/stat.c | 3 +++
fs/xfs/xfs_ioctl.c | 32 ++++++++++++++------------------
fs/xfs/xfs_iops.c | 36 ++++++++++++++++++++++++++++++------
fs/xfs/xfs_iops.h | 2 ++
include/uapi/linux/stat.h | 1 +
5 files changed, 50 insertions(+), 24 deletions(-)

--
2.20.1


2019-10-20 16:00:40

by Ira Weiny

[permalink] [raw]
Subject: [PATCH 1/5] fs/stat: Define DAX statx attribute

From: Ira Weiny <[email protected]>

In order for users to determine if a file is currently operating in DAX
mode (effective DAX). Define a statx attribute value and set that
attribute if the effective DAX flag is set.

To go along with this we propose the following addition to the statx man
page:

STATX_ATTR_DAX

DAX (cpu direct access) is a file mode that attempts to minimize
software cache effects for both I/O and memory mappings of this
file. It requires a capable device, a compatible filesystem
block size, and filesystem opt-in. It generally assumes all
accesses are via cpu load / store instructions which can
minimize overhead for small accesses, but adversely affect cpu
utilization for large transfers. File I/O is done directly
to/from user-space buffers. While the DAX property tends to
result in data being transferred synchronously it does not give
the guarantees of synchronous I/O that data and necessary
metadata are transferred. Memory mapped I/O may be performed
with direct mappings that bypass system memory buffering. Again
while memory-mapped I/O tends to result in data being
transferred synchronously it does not guarantee synchronous
metadata updates. A dax file may optionally support being mapped
with the MAP_SYNC flag which does allow cpu store operations to
be considered synchronous modulo cpu cache effects.

Signed-off-by: Ira Weiny <[email protected]>
---
fs/stat.c | 3 +++
include/uapi/linux/stat.h | 1 +
2 files changed, 4 insertions(+)

diff --git a/fs/stat.c b/fs/stat.c
index c38e4c2e1221..59ca360c1ffb 100644
--- a/fs/stat.c
+++ b/fs/stat.c
@@ -77,6 +77,9 @@ int vfs_getattr_nosec(const struct path *path, struct kstat *stat,
if (IS_AUTOMOUNT(inode))
stat->attributes |= STATX_ATTR_AUTOMOUNT;

+ if (inode->i_flags & S_DAX)
+ stat->attributes |= STATX_ATTR_DAX;
+
if (inode->i_op->getattr)
return inode->i_op->getattr(path, stat, request_mask,
query_flags);
diff --git a/include/uapi/linux/stat.h b/include/uapi/linux/stat.h
index 7b35e98d3c58..5b0962121ef7 100644
--- a/include/uapi/linux/stat.h
+++ b/include/uapi/linux/stat.h
@@ -169,6 +169,7 @@ struct statx {
#define STATX_ATTR_ENCRYPTED 0x00000800 /* [I] File requires key to decrypt in fs */

#define STATX_ATTR_AUTOMOUNT 0x00001000 /* Dir: Automount trigger */
+#define STATX_ATTR_DAX 0x00002000 /* [I] File is DAX */


#endif /* _UAPI_LINUX_STAT_H */
--
2.20.1

2019-10-20 16:00:41

by Ira Weiny

[permalink] [raw]
Subject: [PATCH 4/5] fs/xfs: Clean up DAX support check

From: Ira Weiny <[email protected]>

Rather than open coding xfs_inode_supports_dax() in
xfs_ioctl_setattr_dax_invalidate() export xfs_inode_supports_dax() and
call it in preparation for swapping dax flags.

This also means updating xfs_inode_supports_dax() to return true for a
directory.

Signed-off-by: Ira Weiny <[email protected]>
---
fs/xfs/xfs_ioctl.c | 17 +++--------------
fs/xfs/xfs_iops.c | 8 ++++++--
fs/xfs/xfs_iops.h | 2 ++
3 files changed, 11 insertions(+), 16 deletions(-)

diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
index 0ea326290cca..d3a7340d3209 100644
--- a/fs/xfs/xfs_ioctl.c
+++ b/fs/xfs/xfs_ioctl.c
@@ -1299,24 +1299,13 @@ xfs_ioctl_setattr_dax_invalidate(
int *join_flags)
{
struct inode *inode = VFS_I(ip);
- struct super_block *sb = inode->i_sb;
int error;

*join_flags = 0;

- /*
- * It is only valid to set the DAX flag on regular files and
- * directories on filesystems where the block size is equal to the page
- * size. On directories it serves as an inherited hint so we don't
- * have to check the device for dax support or flush pagecache.
- */
- if (fa->fsx_xflags & FS_XFLAG_DAX) {
- if (!(S_ISREG(inode->i_mode) || S_ISDIR(inode->i_mode)))
- return -EINVAL;
- if (!bdev_dax_supported(xfs_find_bdev_for_inode(VFS_I(ip)),
- sb->s_blocksize))
- return -EINVAL;
- }
+ if ((fa->fsx_xflags & FS_XFLAG_DAX) == FS_XFLAG_DAX &&
+ !xfs_inode_supports_dax(ip))
+ return -EINVAL;

/* If the DAX state is not changing, we have nothing to do here. */
if ((fa->fsx_xflags & FS_XFLAG_DAX) &&
diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
index 9e5c79aa1b54..535b87ffc135 100644
--- a/fs/xfs/xfs_iops.c
+++ b/fs/xfs/xfs_iops.c
@@ -1216,14 +1216,18 @@ xfs_inode_mount_is_dax(
}

/* Figure out if this file actually supports DAX. */
-static bool
+bool
xfs_inode_supports_dax(
struct xfs_inode *ip)
{
struct xfs_mount *mp = ip->i_mount;

/* Only supported on non-reflinked files. */
- if (!S_ISREG(VFS_I(ip)->i_mode) || xfs_is_reflink_inode(ip))
+ if (xfs_is_reflink_inode(ip))
+ return false;
+
+ /* Only supported on regular files and directories. */
+ if (!(S_ISREG(VFS_I(ip)->i_mode) || S_ISDIR(VFS_I(ip)->i_mode)))
return false;

/* Block size must match page size */
diff --git a/fs/xfs/xfs_iops.h b/fs/xfs/xfs_iops.h
index 4d24ff309f59..f24fec8de1d6 100644
--- a/fs/xfs/xfs_iops.h
+++ b/fs/xfs/xfs_iops.h
@@ -24,4 +24,6 @@ extern int xfs_setattr_nonsize(struct xfs_inode *ip, struct iattr *vap,
extern int xfs_vn_setattr_nonsize(struct dentry *dentry, struct iattr *vap);
extern int xfs_vn_setattr_size(struct dentry *dentry, struct iattr *vap);

+extern bool xfs_inode_supports_dax(struct xfs_inode *ip);
+
#endif /* __XFS_IOPS_H__ */
--
2.20.1

2019-10-20 16:00:43

by Ira Weiny

[permalink] [raw]
Subject: [PATCH 5/5] fs/xfs: Allow toggle of physical DAX flag

From: Ira Weiny <[email protected]>

Switching between DAX and non-DAX on a file is racy with respect to data
operations. However, if no data is involved the flag is safe to switch.

Allow toggling the physical flag if a file is empty. The file length
check is not racy with respect to other operations as it is performed
under XFS_MMAPLOCK_EXCL and XFS_IOLOCK_EXCL locks.

Signed-off-by: Ira Weiny <[email protected]>
---
fs/xfs/xfs_ioctl.c | 9 +++++++--
1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
index d3a7340d3209..3839684c249b 100644
--- a/fs/xfs/xfs_ioctl.c
+++ b/fs/xfs/xfs_ioctl.c
@@ -33,6 +33,7 @@
#include "xfs_sb.h"
#include "xfs_ag.h"
#include "xfs_health.h"
+#include "libxfs/xfs_dir2.h"

#include <linux/mount.h>
#include <linux/namei.h>
@@ -1232,12 +1233,10 @@ xfs_diflags_to_linux(
inode->i_flags |= S_NOATIME;
else
inode->i_flags &= ~S_NOATIME;
-#if 0 /* disabled until the flag switching races are sorted out */
if (xflags & FS_XFLAG_DAX)
inode->i_flags |= S_DAX;
else
inode->i_flags &= ~S_DAX;
-#endif
}

static int
@@ -1320,6 +1319,12 @@ xfs_ioctl_setattr_dax_invalidate(

/* lock, flush and invalidate mapping in preparation for flag change */
xfs_ilock(ip, XFS_MMAPLOCK_EXCL | XFS_IOLOCK_EXCL);
+
+ if (i_size_read(inode) != 0) {
+ error = -EOPNOTSUPP;
+ goto out_unlock;
+ }
+
error = filemap_write_and_wait(inode->i_mapping);
if (error)
goto out_unlock;
--
2.20.1

2019-10-20 16:00:53

by Ira Weiny

[permalink] [raw]
Subject: [PATCH 3/5] fs/xfs: Separate functionality of xfs_inode_supports_dax()

From: Ira Weiny <[email protected]>

xfs_inode_supports_dax() should reflect if the inode can support DAX not
that it is enabled for DAX. Leave that to other helper functions.

Change the caller of xfs_inode_supports_dax() to call
xfs_inode_use_dax() which reflects new logic to override the effective
DAX flag with either the mount option or the physical DAX flag.

To make the logic clear create 2 helper functions for the mount and
physical flag.

Signed-off-by: Ira Weiny <[email protected]>
---
fs/xfs/xfs_iops.c | 32 ++++++++++++++++++++++++++------
1 file changed, 26 insertions(+), 6 deletions(-)

diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
index fe285d123d69..9e5c79aa1b54 100644
--- a/fs/xfs/xfs_iops.c
+++ b/fs/xfs/xfs_iops.c
@@ -1206,6 +1206,15 @@ static const struct inode_operations xfs_inline_symlink_inode_operations = {
.update_time = xfs_vn_update_time,
};

+static bool
+xfs_inode_mount_is_dax(
+ struct xfs_inode *ip)
+{
+ struct xfs_mount *mp = ip->i_mount;
+
+ return (mp->m_flags & XFS_MOUNT_DAX) == XFS_MOUNT_DAX;
+}
+
/* Figure out if this file actually supports DAX. */
static bool
xfs_inode_supports_dax(
@@ -1217,11 +1226,6 @@ xfs_inode_supports_dax(
if (!S_ISREG(VFS_I(ip)->i_mode) || xfs_is_reflink_inode(ip))
return false;

- /* DAX mount option or DAX iflag must be set. */
- if (!(mp->m_flags & XFS_MOUNT_DAX) &&
- !(ip->i_d.di_flags2 & XFS_DIFLAG2_DAX))
- return false;
-
/* Block size must match page size */
if (mp->m_sb.sb_blocksize != PAGE_SIZE)
return false;
@@ -1230,6 +1234,22 @@ xfs_inode_supports_dax(
return xfs_find_daxdev_for_inode(VFS_I(ip)) != NULL;
}

+static bool
+xfs_inode_is_dax(
+ struct xfs_inode *ip)
+{
+ return (ip->i_d.di_flags2 & XFS_DIFLAG2_DAX) == XFS_DIFLAG2_DAX;
+}
+
+static bool
+xfs_inode_use_dax(
+ struct xfs_inode *ip)
+{
+ return xfs_inode_supports_dax(ip) &&
+ (xfs_inode_mount_is_dax(ip) ||
+ xfs_inode_is_dax(ip));
+}
+
STATIC void
xfs_diflags_to_iflags(
struct inode *inode,
@@ -1248,7 +1268,7 @@ xfs_diflags_to_iflags(
inode->i_flags |= S_SYNC;
if (flags & XFS_DIFLAG_NOATIME)
inode->i_flags |= S_NOATIME;
- if (xfs_inode_supports_dax(ip))
+ if (xfs_inode_use_dax(ip))
inode->i_flags |= S_DAX;
}

--
2.20.1

2019-10-21 00:46:10

by Dave Chinner

[permalink] [raw]
Subject: Re: [PATCH 5/5] fs/xfs: Allow toggle of physical DAX flag

On Sun, Oct 20, 2019 at 08:59:35AM -0700, [email protected] wrote:
> From: Ira Weiny <[email protected]>
>
> Switching between DAX and non-DAX on a file is racy with respect to data
> operations. However, if no data is involved the flag is safe to switch.
>
> Allow toggling the physical flag if a file is empty. The file length
> check is not racy with respect to other operations as it is performed
> under XFS_MMAPLOCK_EXCL and XFS_IOLOCK_EXCL locks.
>
> Signed-off-by: Ira Weiny <[email protected]>
> ---
> fs/xfs/xfs_ioctl.c | 9 +++++++--
> 1 file changed, 7 insertions(+), 2 deletions(-)
>
> diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
> index d3a7340d3209..3839684c249b 100644
> --- a/fs/xfs/xfs_ioctl.c
> +++ b/fs/xfs/xfs_ioctl.c
> @@ -33,6 +33,7 @@
> #include "xfs_sb.h"
> #include "xfs_ag.h"
> #include "xfs_health.h"
> +#include "libxfs/xfs_dir2.h"
>
> #include <linux/mount.h>
> #include <linux/namei.h>
> @@ -1232,12 +1233,10 @@ xfs_diflags_to_linux(
> inode->i_flags |= S_NOATIME;
> else
> inode->i_flags &= ~S_NOATIME;
> -#if 0 /* disabled until the flag switching races are sorted out */
> if (xflags & FS_XFLAG_DAX)
> inode->i_flags |= S_DAX;
> else
> inode->i_flags &= ~S_DAX;
> -#endif

This code has bit-rotted. See xfs_setup_iops(), where we now have a
different inode->i_mapping->a_ops for DAX inodes.

That, fundamentally, is the issue here - it's not setting/clearing
the DAX flag that is the issue, it's doing a swap of the
mapping->a_ops while there may be other code using that ops
structure.

IOWs, if there is any code anywhere in the kernel that
calls an address space op without holding one of the three locks we
hold here (i_rwsem, MMAPLOCK, ILOCK) then it can race with the swap
of the address space operations.

By limiting the address space swap to file sizes of zero, we rule
out the page fault path (mmap of a zero length file segv's with an
access beyond EOF on the first read/write page fault, right?).
However, other aops callers that might run unlocked and do the wrong
thing if the aops pointer is swapped between check of the aop method
existing and actually calling it even if the file size is zero?

A quick look shows that FIBMAP (ioctl_fibmap())) looks susceptible
to such a race condition with the current definitions of the XFS DAX
aops. I'm guessing there will be others, but I haven't looked
further than this...

> /* lock, flush and invalidate mapping in preparation for flag change */
> xfs_ilock(ip, XFS_MMAPLOCK_EXCL | XFS_IOLOCK_EXCL);
> +
> + if (i_size_read(inode) != 0) {
> + error = -EOPNOTSUPP;
> + goto out_unlock;
> + }

Wrong error. Should be the same as whatever is returned when we try
to change the extent size hint and can't because the file is
non-zero in length (-EINVAL, I think). Also needs a comment
explainging why this check exists, and probably better written as
i_size_read() > 0 ....

Cheers,

Dave.
--
Dave Chinner
[email protected]

2019-10-25 19:16:08

by Boaz Harrosh

[permalink] [raw]
Subject: Re: [PATCH 0/5] Enable per-file/directory DAX operations

On 25/10/2019 00:35, Dave Chinner wrote:
> On Thu, Oct 24, 2019 at 05:05:45PM +0300, Boaz Harrosh wrote:
<>
>> Yes I said that as well.
>
> You said "it must be set between creation and first write",
> stating the requirement for an on-disk flag to work.

Sorry for not being clear. Its not new, I do not explain myself
very well.

The above quote is if you set/clear the flag directly.

> I'm describing how that requirement is actually implemented. i.e. what
> you are stating is something we actually implemented years ago...
>

What you are talking about is when the flag is inherited from parent.
And I did mention that option as well.

[Which is my concern because my main point is that I want creation+write
to be none-DAX. Then after writer is done. Switch to DAX-on so READs can
be fast and not take any memory.
And you talked about another case when I start DAX-on and then for
say, use for RDMA turn it off later.
]

This flag is Unique because current RFC has an i_size == 0 check
that other flags do not have

>>> I also seem
>>> to recall that there was a need to take some vm level lock to really
>>> prevent page fault races, and that we can't safely take that in a
>>> safe combination with all the filesystem locks we need.
>>>
>>
>> We do not really care with page fault races in the Kernel as long
>
> Oh yes we do. A write fault is a 2-part operation - a read fault to
> populate the pte and mapping, then a write fault (->page_mkwrite) to
> do all the filesystem work needed to dirty the page and pte.
>
> The read fault sets up the state for the write fault, and if we
> change the aops between these two operations, then the
> ->page_mkwrite implementation goes kaboom.
>
> This isn't a theoretical problem - this is exactly the race
> condition that lead us to disabling the flag in the first place.
> There is no serialisation between the read and write parts of the
> page fault iand the filesystem changing the DAX flag and ops vector,
> and so fixing this problem requires hold yet more locks in the
> filesystem path to completely lock out page fault processing on the
> inode's mapping.
>

Again sorry that I do not explain very good.

Already on the read fault we populate the xarray,
My point was that if I want to set the DAX mode I must enforce that
there are no other parallel users on my inode. The check that the
xarray is empty is my convoluted way to check that there are no other
users except me. If xarray is not empty I bail out with EBUISY

I agree this is stupid.

Which is the same stupid as i_size==0 check. Both have the same
intention, to make sure that nothing is going on in parallel to
me.

>> as I protect the xarray access and these are protected well if we
>> take truncate locking. But we have a bigger problem that you pointed
>> out with the change of the operations vector pointer.
>>
>> I was thinking about this last night. One way to do this is with
>> file-exclusive-lock. Correct me if I'm wrong:
>> file-exclusive-readwrite-lock means any other openers will fail and
>> if there are openers already the lock will fail. Which is what we want
>> no?
>
> The filesystem ioctls and page faults have no visibility of file
> locks. They don't know and can't find out in a sane manner that an
> inode has a single -user- reference.
>

This is not true. The FS has full control. It is not too hard a work
to take a file-excl-lock from within the IOCTL implementation. then
unlock. that is option 1. Option 2 of the App taking the lock then
for the check we might need a new export from the lock-subsystem.

> And it introduces a new problem for any application using the
> fssetxattr() ioctl - accidentally not setting the S_DAX flag to be
> unmodified will now fail, and that means such a change breaks
> existing applications. Sure, you can say they are "buggy
> applications", but the fact is this user API change breaks them.
>

I am not sure I completely understood. let me try ...

The app wants to set some foo flag. It can set the ignore mask to all
1(s) except the flag it wants to set/clear.
And/or get_current_flags(); modify foo_flag; set_flags().

Off course in both the ignore case or when the DAX bit does not change
we do not go on a locking rampage.

So I'm not sure I see the problem

> Hence I don't think we can change the user API for setting/clearing
> this flag like this.
>

Yes we must not modify behavior of Apps that are modifing other flags.

> Cheers,
> Dave.
>

Perhaps we always go by the directory. And then do an mv dir_DAX/foo dir_NODAX/foo
to have an effective change. In hard links the first one at iget time before populating
the inode cache takes affect. (And never change the flag on the fly)
(Just brain storming here)

Or perhaps another API?

Thanks Dave
Boaz

2019-11-02 04:27:02

by Dan Williams

[permalink] [raw]
Subject: Re: [PATCH 0/5] Enable per-file/directory DAX operations

On Fri, Nov 1, 2019 at 3:47 PM Dave Chinner <[email protected]> wrote:
[..]
> No, the flag does not get turned on until we've solved the problems
> that resulted in us turning it off. We've gone over this mutliple
> times, and nobody has solved the issues that need solving - everyone
> seems to just hack around the issues rather than solving it
> properly. If we thought taking some kind of shortcut full of
> compromises and gotchas was the right solution, we would have never
> turned the flag off in the first place.

My fault. I thought the effective vs physical distinction was worth
taking an incremental step forward. Ira was continuing to look at the
a_ops issue in the meantime.

2019-11-08 13:13:53

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH 5/5] fs/xfs: Allow toggle of physical DAX flag

On Mon 21-10-19 15:49:31, Ira Weiny wrote:
> On Mon, Oct 21, 2019 at 11:45:36AM +1100, Dave Chinner wrote:
> > On Sun, Oct 20, 2019 at 08:59:35AM -0700, [email protected] wrote:
> > That, fundamentally, is the issue here - it's not setting/clearing
> > the DAX flag that is the issue, it's doing a swap of the
> > mapping->a_ops while there may be other code using that ops
> > structure.
> >
> > IOWs, if there is any code anywhere in the kernel that
> > calls an address space op without holding one of the three locks we
> > hold here (i_rwsem, MMAPLOCK, ILOCK) then it can race with the swap
> > of the address space operations.
> >
> > By limiting the address space swap to file sizes of zero, we rule
> > out the page fault path (mmap of a zero length file segv's with an
> > access beyond EOF on the first read/write page fault, right?).
>
> Yes I checked that and thought we were safe here...
>
> > However, other aops callers that might run unlocked and do the wrong
> > thing if the aops pointer is swapped between check of the aop method
> > existing and actually calling it even if the file size is zero?
> >
> > A quick look shows that FIBMAP (ioctl_fibmap())) looks susceptible
> > to such a race condition with the current definitions of the XFS DAX
> > aops. I'm guessing there will be others, but I haven't looked
> > further than this...
>
> I'll check for others and think on what to do about this. ext4 will have the
> same problem I think. :-(

Just as a datapoint, ext4 is bold and sets inode->i_mapping->a_ops on
existing inodes when switching journal data flag and so far it has not
blown up. What we did to deal with issues Dave describes is that we
introduced percpu rw-semaphore guarding switching of aops and then inside
problematic functions redirect callbacks in the right direction under this
semaphore. Somewhat ugly but it seems to work.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2019-11-08 13:47:44

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH 5/5] fs/xfs: Allow toggle of physical DAX flag

On Fri 08-11-19 14:12:38, Jan Kara wrote:
> On Mon 21-10-19 15:49:31, Ira Weiny wrote:
> > On Mon, Oct 21, 2019 at 11:45:36AM +1100, Dave Chinner wrote:
> > > On Sun, Oct 20, 2019 at 08:59:35AM -0700, [email protected] wrote:
> > > That, fundamentally, is the issue here - it's not setting/clearing
> > > the DAX flag that is the issue, it's doing a swap of the
> > > mapping->a_ops while there may be other code using that ops
> > > structure.
> > >
> > > IOWs, if there is any code anywhere in the kernel that
> > > calls an address space op without holding one of the three locks we
> > > hold here (i_rwsem, MMAPLOCK, ILOCK) then it can race with the swap
> > > of the address space operations.
> > >
> > > By limiting the address space swap to file sizes of zero, we rule
> > > out the page fault path (mmap of a zero length file segv's with an
> > > access beyond EOF on the first read/write page fault, right?).
> >
> > Yes I checked that and thought we were safe here...
> >
> > > However, other aops callers that might run unlocked and do the wrong
> > > thing if the aops pointer is swapped between check of the aop method
> > > existing and actually calling it even if the file size is zero?
> > >
> > > A quick look shows that FIBMAP (ioctl_fibmap())) looks susceptible
> > > to such a race condition with the current definitions of the XFS DAX
> > > aops. I'm guessing there will be others, but I haven't looked
> > > further than this...
> >
> > I'll check for others and think on what to do about this. ext4 will have the
> > same problem I think. :-(
>
> Just as a datapoint, ext4 is bold and sets inode->i_mapping->a_ops on
> existing inodes when switching journal data flag and so far it has not
> blown up. What we did to deal with issues Dave describes is that we
> introduced percpu rw-semaphore guarding switching of aops and then inside
> problematic functions redirect callbacks in the right direction under this
> semaphore. Somewhat ugly but it seems to work.

Thinking about this some more, perhaps this scheme could be actually
transformed in something workable. We could have a global (or maybe per-sb
but I'm not sure it's worth it) percpu rwsem and we could transform aops
calls into:

percpu_down_read(aops_rwsem);
do_call();
percpu_up_read(aops_rwsem);

With some macro magic it needn't be even that ugly.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2019-11-11 16:09:34

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH 5/5] fs/xfs: Allow toggle of physical DAX flag

On Fri 08-11-19 11:36:13, Ira Weiny wrote:
> On Fri, Nov 08, 2019 at 02:46:06PM +0100, Jan Kara wrote:
> > On Fri 08-11-19 14:12:38, Jan Kara wrote:
> > > On Mon 21-10-19 15:49:31, Ira Weiny wrote:
> > > > On Mon, Oct 21, 2019 at 11:45:36AM +1100, Dave Chinner wrote:
> > > > > On Sun, Oct 20, 2019 at 08:59:35AM -0700, [email protected] wrote:
> > > > > That, fundamentally, is the issue here - it's not setting/clearing
> > > > > the DAX flag that is the issue, it's doing a swap of the
> > > > > mapping->a_ops while there may be other code using that ops
> > > > > structure.
> > > > >
> > > > > IOWs, if there is any code anywhere in the kernel that
> > > > > calls an address space op without holding one of the three locks we
> > > > > hold here (i_rwsem, MMAPLOCK, ILOCK) then it can race with the swap
> > > > > of the address space operations.
> > > > >
> > > > > By limiting the address space swap to file sizes of zero, we rule
> > > > > out the page fault path (mmap of a zero length file segv's with an
> > > > > access beyond EOF on the first read/write page fault, right?).
> > > >
> > > > Yes I checked that and thought we were safe here...
> > > >
> > > > > However, other aops callers that might run unlocked and do the wrong
> > > > > thing if the aops pointer is swapped between check of the aop method
> > > > > existing and actually calling it even if the file size is zero?
> > > > >
> > > > > A quick look shows that FIBMAP (ioctl_fibmap())) looks susceptible
> > > > > to such a race condition with the current definitions of the XFS DAX
> > > > > aops. I'm guessing there will be others, but I haven't looked
> > > > > further than this...
> > > >
> > > > I'll check for others and think on what to do about this. ext4 will have the
> > > > same problem I think. :-(
> > >
> > > Just as a datapoint, ext4 is bold and sets inode->i_mapping->a_ops on
> > > existing inodes when switching journal data flag and so far it has not
> > > blown up. What we did to deal with issues Dave describes is that we
> > > introduced percpu rw-semaphore guarding switching of aops and then inside
> > > problematic functions redirect callbacks in the right direction under this
> > > semaphore. Somewhat ugly but it seems to work.
>
> Ah I am glad you brought this up. I had not seen this before.
>
> Is that s_journal_flag_rwsem?

Yes.

> In the general case I don't think that correctly protects against:
>
> if (a_ops->call)
> a_ops->call();
>
> Because not all operations are defined in both ext4_aops and
> ext4_journalled_aops. Specifically migratepage.
>
> move_to_new_page() specifically follows the pattern above with migratepage. So
> is there a bug here?

Looks like there could be.

> > Thinking about this some more, perhaps this scheme could be actually
> > transformed in something workable. We could have a global (or maybe per-sb
> > but I'm not sure it's worth it) percpu rwsem and we could transform aops
> > calls into:
> >
> > percpu_down_read(aops_rwsem);
> > do_call();
> > percpu_up_read(aops_rwsem);
> >
> > With some macro magic it needn't be even that ugly.
>
> I think this is safer. And what I have been investigating/coding up.
> Because that also would protect against the above with:
>
> percpu_down_read(aops_rwsem);
> if (a_ops->call)
> a_ops->call();
> percpu_up_read(aops_rwsem);
>
> However I have been looking at SRCU because we also have patterns like:
>
>
> generic_file_buffered_read
> if (a_ops->is_partially_uptodate)
> a_ops->is_partially_uptodate()
> page_cache_sync_readahead
> force_page_cache_readahead
> if (!a_ops->readpage && !a_ops->readpages)
> return;
> __do_page_cache_readahead
> read_pages
> if (a_ops->readpages)
> a_ops->readpages()
> a_ops->readpage
>
>
> So we would have to pass the a_ops through to use a rwsem. Where SRCU I
> think would be fine to just take the SRCU read lock multiple times. Am I
> wrong?

So the idea I had would not solve this issue because we'd release the rwsem
once we return from ->is_partially_uptodate(). This example shows that we
actually expect consistency among different aops as they are called in
sequence and that's much more difficult to achieve than just a consistency
within single aop call.

> We also have a 3rd (2nd?) issue. There are callers who check for the
> presence of an operation to be used later. For example do_dentry_open():
>
> do_dentry_open()
> {
> ...
> if (<flags> & O_DIRECT)
> if (!<a_ops> || !<a_ops>->direct_IO)
> return -EINVAL;
> ...
> }
>
> After this open direct_IO better be there AFAICT so changing the a_ops
> later would not be good. For ext4 direct_IO is defined for all the
> a_ops... so I guess that is not a big deal. However, is the user really
> getting the behavior they expect in this case?

In this particular case I don't think there's any practical harm for any
filesystem but in general this is another instance where consistency of
aops over time is assumed.

> I'm afraid of requiring FSs to have to follow rules in defining their a_ops.
> Because I'm afraid maintaining those rules would be hard and would eventually
> lead to crashes when someone did it wrong.

I guess this very much depends on the rules. But yes, anything non-obvious
or hard to check would quickly lead to bugs, I agree. But IMHO fully
general solution to above problems would clutter the generic code in rather
ugly way as well because usage of aops is pretty widespread in mm and fs
code. It isn't just a few places that call them...

But I think we could significantly reduce the problem by looking at what's
in aops. We have lots of operations there that operate on pages. If we
mandate that before and during switching of aops, you must make sure
there's nothing in page cache for the inode, you've already dealt with 90%
of the problems.

Beside these we have:
* write_begin - that creates page in page cache so above rule should stop
it as well
* bmap - honestly I'd be inclined to just move this to inode_operations
just like fiemap. There's nothing about address_space in its functionality.
* swap_activate / swap_deactivate - Either I'd move these to
file_operations (what's there about address_space, right), or since all
instances of this only care about the inode, we can as well just pass
only inode to the function and move it to inode_operations.

And then the really problematic ones:
* direct_IO - Logically with how the IO path is structured, it belongs in
aops so I wouldn't move it. With the advance of iomap it is on its way to
being removed altogether but that will take a long time to happen
completely. So for now I'd mandate that direct_IO path must be locked out
while switching aops.
* readpages - these should be locked out by the rule that page creation is
forbidden.
* writepages - these need to be locked out when switching aops.

And that should be it. So I don't think there's a need for reference-counting
of aops in the generic code, especially since I don't think it can be done
in an elegant way (but feel free to correct me). I think that just
providing a way to lock-out above three calls would be enough.

> So for this 3rd (2nd) case I think we should simply take a reference to the
> a_ops and fail changing the mode. For the DAX case that means the user is best
> served by taking a write lease on the file to ensure there are no other opens
> which could cause issues.
>
> Would that work for changing the journaling mode?
>
> And I _think_ this is the only issue we have with this right now. But if other
> callers of a_ops needed the pattern of using the a_ops at a time across context
> changes they would need to ensure this reference was taken.
>
> What I have come up with thus far is an interface like:
>
> /*
> * as_get_a_ops() -- safely get the a_ops from the address_space specified
> *
> * @as: address space to get a_ops from
> * @ref: used to indicate if a reference is required on this a_ops
> * @tok: srcu token to be returned in as_put_a_ops()
> *
> * The a_ops returned is protected from changing until as_put_a_ops().
> *
> * If ref is specified then ref must also be specified in as_put_a_ops() to
> * release this reference. In this case a reference is taken on the a_ops
> * which will prevent it from changing until the reference is released.
> *
> * References should _ONLY_ be taken when the a_ops needs to be constant
> * across a user context switch because doing so will block changing the a_ops
> * until that reference is released.
> *
> * Examples of using a reference are checks for specific a_ops pointers which
> * are expected to support functionality at a later date (example direct_IO)
> */
> static inline const struct address_space_operations *
> as_get_a_ops(struct address_space *as, bool ref, int *tok)
> {
> ...
> }
>
> static inline void
> as_assign_a_ops(struct address_space *as,
> const struct address_space_operations *a_ops)
> {
> ...
> }
>
> static inline void as_put_a_ops(struct address_space *as, int tok, bool ref)
> {
> ...
> }
>
>
> I'm still working out the details of using SRCU and a ref count. I have made
> at least 1 complete pass of all the a_ops users and I think this would cover
> them all.

Well, my concern with the use of interface like this is:

a) The clutter in the generic code
b) It's difficult to make this work with SRCU because presumably you want
to use synchronize_srcu() while switching aops. But then you have three
operations to do:
1) switch aops
2) set inode flag
3) synchronize_srcu

and depending on the order in which you do these either "old aops"
operations will see inode with a flag or "new aops" will see the inode
without a flag and either can confuse those functions...

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2019-11-11 23:55:47

by Ira Weiny

[permalink] [raw]
Subject: Re: [PATCH 5/5] fs/xfs: Allow toggle of physical DAX flag

>
> > In the general case I don't think that correctly protects against:
> >
> > if (a_ops->call)
> > a_ops->call();
> >
> > Because not all operations are defined in both ext4_aops and
> > ext4_journalled_aops. Specifically migratepage.
> >
> > move_to_new_page() specifically follows the pattern above with migratepage. So
> > is there a bug here?
>
> Looks like there could be.

Ok I'm going to leave this alone and whatever I come up with try and make sure
that the ext4 journal a_ops fits.

[snip]

> >
> > However I have been looking at SRCU because we also have patterns like:
> >
> >
> > generic_file_buffered_read
> > if (a_ops->is_partially_uptodate)
> > a_ops->is_partially_uptodate()
> > page_cache_sync_readahead
> > force_page_cache_readahead
> > if (!a_ops->readpage && !a_ops->readpages)
> > return;
> > __do_page_cache_readahead
> > read_pages
> > if (a_ops->readpages)
> > a_ops->readpages()
> > a_ops->readpage
> >
> >
> > So we would have to pass the a_ops through to use a rwsem. Where SRCU I
> > think would be fine to just take the SRCU read lock multiple times. Am I
> > wrong?
>
> So the idea I had would not solve this issue because we'd release the rwsem
> once we return from ->is_partially_uptodate(). This example shows that we
> actually expect consistency among different aops as they are called in
> sequence and that's much more difficult to achieve than just a consistency
> within single aop call.

I can't be sure but is seems like consistency for an operation is somewhat
important. OR we have to develop rules around when filesystems can, or can not
change a_ops so that the consistency does not matter.

I think what scares me the most is that I'm not really sure what those rules
are...

>
> > We also have a 3rd (2nd?) issue. There are callers who check for the
> > presence of an operation to be used later. For example do_dentry_open():
> >
> > do_dentry_open()
> > {
> > ...
> > if (<flags> & O_DIRECT)
> > if (!<a_ops> || !<a_ops>->direct_IO)
> > return -EINVAL;
> > ...
> > }
> >
> > After this open direct_IO better be there AFAICT so changing the a_ops
> > later would not be good. For ext4 direct_IO is defined for all the
> > a_ops... so I guess that is not a big deal. However, is the user really
> > getting the behavior they expect in this case?
>
> In this particular case I don't think there's any practical harm for any
> filesystem but in general this is another instance where consistency of
> aops over time is assumed.

Yes... But this is just a more complex situation to maintain consistency.
Again I don't have any idea if consistency is required.

But the current definition/use of a_ops is very static (with the exception of
the ext4 case you gave), so I'm thinking that much of the code is written with
the assumption that the vectors do not change.

>
> > I'm afraid of requiring FSs to have to follow rules in defining their a_ops.
> > Because I'm afraid maintaining those rules would be hard and would eventually
> > lead to crashes when someone did it wrong.
>
> I guess this very much depends on the rules. But yes, anything non-obvious
> or hard to check would quickly lead to bugs, I agree. But IMHO fully
> general solution to above problems would clutter the generic code in rather
> ugly way as well because usage of aops is pretty widespread in mm and fs
> code. It isn't just a few places that call them...

I agree it does clutter the code. and the patches I have are not pretty and
I'm not even sure they are not broken... So yea I'm open to ideas!

>
> But I think we could significantly reduce the problem by looking at what's
> in aops. We have lots of operations there that operate on pages. If we
> mandate that before and during switching of aops, you must make sure
> there's nothing in page cache for the inode, you've already dealt with 90%
> of the problems.

Sounds promising...

But digging into this it looks like we need a similar rule for the DAX side to
have no mappings outstanding. Perhaps you meant that when you said page cache?

>
> Beside these we have:
> * write_begin - that creates page in page cache so above rule should stop
> it as well

Just to make sure I understand, do you propose that we put a check in
pagecache_write_[begin|end]() to protect from these calls while changing?

I just want to be sure because I've wondered if we can get away with minimal
checks, or checks on individual functions, in the generic code. But that
seemed kind of ugly as well.

> * bmap - honestly I'd be inclined to just move this to inode_operations
> just like fiemap. There's nothing about address_space in its functionality.

Hmmm... What about these call paths?

ext4_bmap()
filemap_write_and_wait()
filemap_fdatawrite()
__filemap_fdatawrite()
__filemap_fdatawrite_range()
do_writepages()
a_ops->writepages()

or

xfs_vm_bmap()
iomap_bmap()
filemap_write_and_wait()
...
:-/

maybe we should leave it and have it covered under the page cache rule you
propose?


> * swap_activate / swap_deactivate - Either I'd move these to
> file_operations (what's there about address_space, right), or since all
> instances of this only care about the inode, we can as well just pass
> only inode to the function and move it to inode_operations.

XFS calls iomap_swapfile_activate() which calls vfs_fsync() which needs the
file. Seems like file_operations would be better.

I like the idea of cleaning this up a lot. I've gone ahead with a couple of
patches to do this. At least this simplifies things a little bit...

>
> And then the really problematic ones:
> * direct_IO - Logically with how the IO path is structured, it belongs in
> aops so I wouldn't move it. With the advance of iomap it is on its way to
> being removed altogether but that will take a long time to happen
> completely. So for now I'd mandate that direct_IO path must be locked out
> while switching aops.

How do we lock this out between checking for this support on open and using it
later?

I think if we go down this path we have to make a rule that says that direct_IO
must be defined for both a_ops. Right now for our 2 use case we are lucky that
direct_IO is also defined as the same function. So there is little danger of
odd behavior.

Let me explore more.

> * readpages - these should be locked out by the rule that page creation is
> forbidden.

Agreed. Same question applies here as for pagecache_write_[begin|end]().
Should I special case a check for this?

> * writepages - these need to be locked out when switching aops.

If nothing is in the pagecache could we ignore this as well?

>
> And that should be it. So I don't think there's a need for reference-counting
> of aops in the generic code, especially since I don't think it can be done
> in an elegant way (but feel free to correct me). I think that just
> providing a way to lock-out above three calls would be enough.

Ok, I've been thinking about this whole email more today. And what if we add a
couple of FS specific lockout callbacks in struct address_space itself.

If they are defined the FS has dynamic a_ops capability and these 3 functions
will need to be locked out by a rw_sem controlled by the FS.

We can then document the "rules" for dynamic a_ops better for FS's to support
them by referencing the special cases and the fact that dynamic a_ops requires
these callbacks to be defined.

It clutters the generic code a bit, but not as much as my idea. At the same
time it helps to self document the special cases in both the FS's and the core
code.

[snip]

> >
> >
> > I'm still working out the details of using SRCU and a ref count. I have made
> > at least 1 complete pass of all the a_ops users and I think this would cover
> > them all.
>
> Well, my concern with the use of interface like this is:
>
> a) The clutter in the generic code

There is a bit of clutter. What I'm most concerned about is the amount of
special casing. I still think it would be cleaner in the long run... And
force some structure on the use of a_ops. But after looking at it more there
may be a middle ground.

> b) It's difficult to make this work with SRCU because presumably you want
> to use synchronize_srcu() while switching aops. But then you have three
> operations to do:
> 1) switch aops
> 2) set inode flag
> 3) synchronize_srcu
>
> and depending on the order in which you do these either "old aops"
> operations will see inode with a flag or "new aops" will see the inode
> without a flag and either can confuse those functions...

Yes that might be a challenge.

Ira