2019-10-25 19:16:59

by Dave Chinner

[permalink] [raw]
Subject: Re: [PATCH 0/5] Enable per-file/directory DAX operations

On Fri, Oct 25, 2019 at 02:29:04AM +0300, Boaz Harrosh wrote:
> On 25/10/2019 00:35, Dave Chinner wrote:
> > On Thu, Oct 24, 2019 at 05:05:45PM +0300, Boaz Harrosh wrote:
> > This isn't a theoretical problem - this is exactly the race
> > condition that lead us to disabling the flag in the first place.
> > There is no serialisation between the read and write parts of the
> > page fault iand the filesystem changing the DAX flag and ops vector,
> > and so fixing this problem requires hold yet more locks in the
> > filesystem path to completely lock out page fault processing on the
> > inode's mapping.
> >
>
> Again sorry that I do not explain very good.
>
> Already on the read fault we populate the xarray,

On a write fault we can have an empty xarray slot so the write fault
needs to both populate the xarray slot (read fault) and process the
write fault.

> My point was that if I want to set the DAX mode I must enforce that
> there are no other parallel users on my inode. The check that the
> xarray is empty is my convoluted way to check that there are no other
> users except me. If xarray is not empty I bail out with EBUISY

Checking the xarray being empty is racy. The moment you drop the
mapping lock, the page fault can populate a slot in the mapping that
you just checked was empty. And then you swap the aops between the
population and the ->page-mkwrite() call in the page fault
that is running, and things go boom.

Unless there's something new in the page fault path that nobody has
noticed in the past couple of years, this TOCTOU race hasn't been
solved....

> Perhaps we always go by the directory. And then do an mv dir_DAX/foo dir_NODAX/foo

The inode is instatiated before the rename is run, so it's set up
with it's old dir config, not the new one. So this ends up with the
same problem of haivng to change the S_DAX flag and aops vector
dynamically on rename. Same problem, not a solution.

> to have an effective change. In hard links the first one at iget time before populating
> the inode cache takes affect.

If something like a find or backup program brings the inode into
cache, the app may not even get the behaviour it wants, and it can't
change it until the inode is evicted from cache, which may be never.
Nobody wants implicit/random/uncontrollable/unchangeable behaviour
like this.

> (And never change the flag on the fly)
> (Just brain storming here)

We went over all this ground when we disabled the flag in the first
place. We disabled the flag because we couldn't come up with a sane
way to flip the ops vector short of tracking the number of aops
calls in progress at any given time. i.e. reference counting the
aops structure, but that's hard to do with a const ops structure,
and so it got disabled rather than allowing users to crash
kernels....

Cheers,

-Dave.
--
Dave Chinner
[email protected]


2019-10-25 19:17:41

by Boaz Harrosh

[permalink] [raw]
Subject: Re: [PATCH 0/5] Enable per-file/directory DAX operations

On 25/10/2019 03:36, Dave Chinner wrote:
> On Fri, Oct 25, 2019 at 02:29:04AM +0300, Boaz Harrosh wrote:
<>

>> Perhaps we always go by the directory. And then do an mv dir_DAX/foo dir_NODAX/foo
>
> The inode is instatiated before the rename is run, so it's set up
> with it's old dir config, not the new one. So this ends up with the
> same problem of haivng to change the S_DAX flag and aops vector
> dynamically on rename. Same problem, not a solution.
>

Yes Admin needs a inode-drop_caches after the mv if she/he wants an effective
change.

>> to have an effective change. In hard links the first one at iget time before populating
>> the inode cache takes affect.
>
> If something like a find or backup program brings the inode into
> cache, the app may not even get the behaviour it wants, and it can't
> change it until the inode is evicted from cache, which may be never.

inode-drop-caches. (echo 2 > /proc/sys/vm/drop_caches)

> Nobody wants implicit/random/uncontrollable/unchangeable behaviour
> like this.
>

You mean in the case of hard links between different mode directories?
I agree it is not so good. I do not like it too.

<>
> We went over all this ground when we disabled the flag in the first
> place. We disabled the flag because we couldn't come up with a sane
> way to flip the ops vector short of tracking the number of aops
> calls in progress at any given time. i.e. reference counting the
> aops structure, but that's hard to do with a const ops structure,
> and so it got disabled rather than allowing users to crash
> kernels....
>

Do you mean dropping this patchset all together? I missed that.

Current patchset with the i_size == 0 thing is really bad I think.
Its the same has dropping the direct change all together and only
supporting inheritance from parent.
[Which again for me is really not interesting]

> Cheers,
> -Dave.

Lets sleep on it. Please remind me if xfs supports clone + DAX

Thanks Dave
Boaz

2019-10-25 20:58:07

by Ira Weiny

[permalink] [raw]
Subject: Re: [PATCH 0/5] Enable per-file/directory DAX operations

On Fri, Oct 25, 2019 at 11:36:03AM +1100, Dave Chinner wrote:
> On Fri, Oct 25, 2019 at 02:29:04AM +0300, Boaz Harrosh wrote:
> > On 25/10/2019 00:35, Dave Chinner wrote:
>
> If something like a find or backup program brings the inode into
> cache, the app may not even get the behaviour it wants, and it can't
> change it until the inode is evicted from cache, which may be never.

Why would this be never?

> Nobody wants implicit/random/uncontrollable/unchangeable behaviour
> like this.

I'm thinking this could work with a bit of effort on the users part. While the
behavior does have a bit of uncertainty, I feel like there has to be a way to
get the inode to drop from the cache when a final iput() happens on the inode.

Admin programs should not leave files open forever, without the users knowing
about it. So I don't understand why the inode could not be evicted from the
cache if the FS knew that this change had been made and the inode needs to be
"re-loaded". See below...

>
> > (And never change the flag on the fly)
> > (Just brain storming here)
>
> We went over all this ground when we disabled the flag in the first
> place. We disabled the flag because we couldn't come up with a sane
> way to flip the ops vector short of tracking the number of aops
> calls in progress at any given time. i.e. reference counting the
> aops structure, but that's hard to do with a const ops structure,
> and so it got disabled rather than allowing users to crash
> kernels....

Agreed. We can't change the a_ops without some guarantee that no one is using
the file. Which means we need all fds to close and a final iput(). I thought
that would mean an eviction of the inode and a subsequent reload.

Yesterday I coded up the following (applies on top of this series) but I can't
seem to get it to work because I believe xfs is keeping a reference on the
inode. What am I missing? I think if I could get xfs to recognize that the
inode needs to be cleared from it's cache this would work, with some caveats.

Currently this works if I remount the fs or if I use <procfs>/drop_caches like
Boaz mentioned.

Isn't there a way to get xfs to do that on it's own?

Ira


From 7762afd95a3e21a782dffd2fd8e13ae4a23b5e4a Mon Sep 17 00:00:00 2001
From: Ira Weiny <[email protected]>
Date: Fri, 25 Oct 2019 13:32:07 -0700
Subject: [PATCH] squash: Allow phys change on any length file

delay the changing of the effective bit to when the inode is re-read
into the cache.

Currently a work in Progress because xfs seems to cache the inodes as
well and I'm not sure how to get xfs to release it's reference.
---
fs/xfs/xfs_ioctl.c | 18 +++++++-----------
include/linux/fs.h | 5 ++++-
2 files changed, 11 insertions(+), 12 deletions(-)

diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
index 89cf47ef273e..4d730d5781d9 100644
--- a/fs/xfs/xfs_ioctl.c
+++ b/fs/xfs/xfs_ioctl.c
@@ -1233,10 +1233,13 @@ xfs_diflags_to_linux(
inode->i_flags |= S_NOATIME;
else
inode->i_flags &= ~S_NOATIME;
- if (xflags & FS_XFLAG_DAX)
- inode->i_flags |= S_DAX;
- else
- inode->i_flags &= ~S_DAX;
+ /* NOTE: we do not allow the physical DAX flag to immediately change
+ * the effective IS_DAX() flag tell the VFS layer to remove the inode
+ * from the cache on the final iput() to force recreation on the next
+ * 'fresh' open */
+ if (((xflags & FS_XFLAG_DAX) && !IS_DAX(inode)) ||
+ (!(xflags & FS_XFLAG_DAX) && IS_DAX(inode)))
+ inode->i_flags |= S_REVALIDATE;
}

static int
@@ -1320,13 +1323,6 @@ xfs_ioctl_setattr_dax_invalidate(
/* lock, flush and invalidate mapping in preparation for flag change */
xfs_ilock(ip, XFS_MMAPLOCK_EXCL | XFS_IOLOCK_EXCL);

- /* File size must be zero to avoid races with asynchronous page
- * faults */
- if (i_size_read(inode) > 0) {
- error = -EINVAL;
- goto out_unlock;
- }
-
error = filemap_write_and_wait(inode->i_mapping);
if (error)
goto out_unlock;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 0b4d8fc79e0f..4e9b7bf53c86 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1998,6 +1998,7 @@ struct super_operations {
#define S_ENCRYPTED 16384 /* Encrypted file (using fs/crypto/) */
#define S_CASEFOLD 32768 /* Casefolded file */
#define S_VERITY 65536 /* Verity file (using fs/verity/) */
+#define S_REVALIDATE 131072 /* Drop inode from cache on final put */

/*
* Note that nosuid etc flags are inode-specific: setting some file-system
@@ -2040,6 +2041,7 @@ static inline bool sb_rdonly(const struct super_block *sb) { return sb->s_flags
#define IS_ENCRYPTED(inode) ((inode)->i_flags & S_ENCRYPTED)
#define IS_CASEFOLDED(inode) ((inode)->i_flags & S_CASEFOLD)
#define IS_VERITY(inode) ((inode)->i_flags & S_VERITY)
+#define IS_REVALIDATE(inode) ((inode)->i_flags & S_REVALIDATE)

#define IS_WHITEOUT(inode) (S_ISCHR(inode->i_mode) && \
(inode)->i_rdev == WHITEOUT_DEV)
@@ -3027,7 +3029,8 @@ extern int inode_needs_sync(struct inode *inode);
extern int generic_delete_inode(struct inode *inode);
static inline int generic_drop_inode(struct inode *inode)
{
- return !inode->i_nlink || inode_unhashed(inode);
+ return !inode->i_nlink || inode_unhashed(inode) ||
+ IS_REVALIDATE(inode);
}

extern struct inode *ilookup5_nowait(struct super_block *sb,
--
2.20.1


2019-10-28 11:44:01

by Dave Chinner

[permalink] [raw]
Subject: Re: [PATCH 0/5] Enable per-file/directory DAX operations

On Fri, Oct 25, 2019 at 01:49:26PM -0700, Ira Weiny wrote:
> On Fri, Oct 25, 2019 at 11:36:03AM +1100, Dave Chinner wrote:
> > On Fri, Oct 25, 2019 at 02:29:04AM +0300, Boaz Harrosh wrote:
> > > On 25/10/2019 00:35, Dave Chinner wrote:
> >
> > If something like a find or backup program brings the inode into
> > cache, the app may not even get the behaviour it wants, and it can't
> > change it until the inode is evicted from cache, which may be never.
>
> Why would this be never?

Because only unreferenced inodes can be removed from cache. As long
as something holds a reference or repeatedly accesses the inode such
that reclaim always skips it because it is referenced, it will never
get evicted from the cache.

IOWs, "never" in the practical sense, not "never" in the theoretical
sense.

> > Nobody wants implicit/random/uncontrollable/unchangeable behaviour
> > like this.
>
> I'm thinking this could work with a bit of effort on the users part. While the
> behavior does have a bit of uncertainty, I feel like there has to be a way to
> get the inode to drop from the cache when a final iput() happens on the inode.

Keep in mind that the final iput()->evict() process doesn't mean the
inode is going to get removed from all filesystem inode caches, just
the VFS level cache. The filesystem can still have internal
references to the inode, and still be doing work on the inode that
the VFS knows nothing about. XFS definitely fits into this category.

XFS will, however, re-initialise the inode aops structure if the VFS
then does another lookup on the inode while it is in this
"reclaimed" state, so from the VFS perspective it looks like a
newly instantiated inodes on the next lookup. We don't actually need
to do this for large parts of the inode as it is already still in
the valid state from the evict() call. It's an implementation
simplification that means we always re-init the ops vectors attached
to the inode rather than just the fields that need to be
re-initialised.

IOWs, evict/reinit changing the aops vector because the on disk dax
flag changed on XFS works by luck right now, not intent....

> Admin programs should not leave files open forever, without the users knowing
> about it. So I don't understand why the inode could not be evicted from the
> cache if the FS knew that this change had been made and the inode needs to be
> "re-loaded". See below...

Doesn't need to be an open file - inodes are pinned in memory by the
reference the dentry holds on it. Hence as long as there are
actively referenced dentries that point at the inode, the inode
cannot be reclaimed. Hard links mean multiple dentries could pin the
inode, too.

> > > (And never change the flag on the fly)
> > > (Just brain storming here)
> >
> > We went over all this ground when we disabled the flag in the first
> > place. We disabled the flag because we couldn't come up with a sane
> > way to flip the ops vector short of tracking the number of aops
> > calls in progress at any given time. i.e. reference counting the
> > aops structure, but that's hard to do with a const ops structure,
> > and so it got disabled rather than allowing users to crash
> > kernels....
>
> Agreed. We can't change the a_ops without some guarantee that no one is using
> the file. Which means we need all fds to close and a final iput(). I thought
> that would mean an eviction of the inode and a subsequent reload.
>
> Yesterday I coded up the following (applies on top of this series) but I can't
> seem to get it to work because I believe xfs is keeping a reference on the
> inode. What am I missing? I think if I could get xfs to recognize that the
> inode needs to be cleared from it's cache this would work, with some caveats.

You are missing the fact that dentries hold an active reference to
inodes. So a path lookup (access(), stat(), etc) will pin the inode
just as effectively as holding an open file because they instantiate
a dentry that holds a reference to the inode....

> Currently this works if I remount the fs or if I use <procfs>/drop_caches like
> Boaz mentioned.

drop_caches frees all the dentries that don't have an active
references before it iterates over inodes, thereby dropping the
cached reference(s) to the inode that pins it in memory before it
iterates the inode LRU.

> Isn't there a way to get xfs to do that on it's own?

Not reliably. Killing all the dentries doesn't guarantee the inode
will be reclaimed immediately. The ioctl() itself requires an open
file reference to the inode, and there's no telling how many other
references there are to the inode that the filesystem a) can't find,
and b) even if it can find them, it is illegal to release them.

IOWs, if you are relying on being able to force eviction of inode
from the cache for correct operation of a user controlled flag, then
it's just not going to work.

Cheers,

Dave.
--
Dave Chinner
[email protected]

2019-10-31 16:18:37

by Ira Weiny

[permalink] [raw]
Subject: Re: [PATCH 0/5] Enable per-file/directory DAX operations

On Mon, Oct 28, 2019 at 09:10:39AM +1100, Dave Chinner wrote:
> On Fri, Oct 25, 2019 at 01:49:26PM -0700, Ira Weiny wrote:

[snip]

>
> > Currently this works if I remount the fs or if I use <procfs>/drop_caches like
> > Boaz mentioned.
>
> drop_caches frees all the dentries that don't have an active
> references before it iterates over inodes, thereby dropping the
> cached reference(s) to the inode that pins it in memory before it
> iterates the inode LRU.
>
> > Isn't there a way to get xfs to do that on it's own?
>
> Not reliably. Killing all the dentries doesn't guarantee the inode
> will be reclaimed immediately. The ioctl() itself requires an open
> file reference to the inode, and there's no telling how many other
> references there are to the inode that the filesystem a) can't find,
> and b) even if it can find them, it is illegal to release them.
>
> IOWs, if you are relying on being able to force eviction of inode
> from the cache for correct operation of a user controlled flag, then
> it's just not going to work.

Agree, I see the difficulty of forcing the effective flag to change in this
path. However, the only thing I am relying on is that the ioctl will change
the physical flag.

IOW I am proposing that the semantic be that changing the physical flag does
_not_ immediately change the effective flag. With that clarified up front the
user can adjust accordingly.

After thinking about this more I think there is a strong use case to be able to
change the physical flag on a non-zero length file. That use case is to be
able to restore files from backups.

Therefore, having the effective flag flip at some later time when the a_ops can
safely be changed (for example a remount/drop_cache event) is beneficial.

I propose the user has no direct control over this event and it is mainly used
to restore files from backups which is mainly an admin operation where a
remount is a reasonable thing to do.

Users direct control of the effective flag is through inheritance. The user
needs to create the file in a DAX enable dir and they get effective operation
right away.

If in the future we can determine a safe way to trigger the a_ops change we can
add that to the semantic as an alternative for users.

Ira

>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> [email protected]

2019-11-01 22:48:03

by Dave Chinner

[permalink] [raw]
Subject: Re: [PATCH 0/5] Enable per-file/directory DAX operations

On Thu, Oct 31, 2019 at 09:17:58AM -0700, Ira Weiny wrote:
> On Mon, Oct 28, 2019 at 09:10:39AM +1100, Dave Chinner wrote:
> > On Fri, Oct 25, 2019 at 01:49:26PM -0700, Ira Weiny wrote:
>
> [snip]
>
> >
> > > Currently this works if I remount the fs or if I use <procfs>/drop_caches like
> > > Boaz mentioned.
> >
> > drop_caches frees all the dentries that don't have an active
> > references before it iterates over inodes, thereby dropping the
> > cached reference(s) to the inode that pins it in memory before it
> > iterates the inode LRU.
> >
> > > Isn't there a way to get xfs to do that on it's own?
> >
> > Not reliably. Killing all the dentries doesn't guarantee the inode
> > will be reclaimed immediately. The ioctl() itself requires an open
> > file reference to the inode, and there's no telling how many other
> > references there are to the inode that the filesystem a) can't find,
> > and b) even if it can find them, it is illegal to release them.
> >
> > IOWs, if you are relying on being able to force eviction of inode
> > from the cache for correct operation of a user controlled flag, then
> > it's just not going to work.
>
> Agree, I see the difficulty of forcing the effective flag to change in this
> path. However, the only thing I am relying on is that the ioctl will change
> the physical flag.
>
> IOW I am proposing that the semantic be that changing the physical flag does
> _not_ immediately change the effective flag. With that clarified up front the
> user can adjust accordingly.

Which makes it useless from an admin perspective. i.e. to change the
way the application uses DAX now, admins are going to have to end up
rebooting the machine to guarantee that the kernel has picked up the
change in the on-disk flag.

> After thinking about this more I think there is a strong use case to be able to
> change the physical flag on a non-zero length file. That use case is to be
> able to restore files from backups.

Why does that matter? Backup programs need to set the flag before
the data is written into the destination file, just like they do
with restoring other flags that influence data placement like the RT
device bit and extent size hints...

Basically, all these issues you keep trying to work around go away
if we can come up with a way of swapping the aops vector safely.
That's the problem we need to solve, anything else results in
largely unacceptible user visible admin warts.

> I propose the user has no direct control over this event and it is mainly used
> to restore files from backups which is mainly an admin operation where a
> remount is a reasonable thing to do.

As soon as users understand that they flag can be changed, they are
going to want to do that and they are going to want it to work
reliably.

> Users direct control of the effective flag is through inheritance. The user
> needs to create the file in a DAX enable dir and they get effective operation
> right away.

Until they realise the application is slow or broken because it is
using DAX, and they want to turn DAX off for that application. Then
they have *no control*. You cannot have it both ways - being able to
turn something on but not turn it off is not "effective operation"
or user friendly.

> If in the future we can determine a safe way to trigger the a_ops change we can
> add that to the semantic as an alternative for users.

No, the flag does not get turned on until we've solved the problems
that resulted in us turning it off. We've gone over this mutliple
times, and nobody has solved the issues that need solving - everyone
seems to just hack around the issues rather than solving it
properly. If we thought taking some kind of shortcut full of
compromises and gotchas was the right solution, we would have never
turned the flag off in the first place.

Cheers,

Dave.
--
Dave Chinner
[email protected]