2020-02-21 00:43:37

by Ira Weiny

Subject: [PATCH V4 00/13] Enable per-file/per-directory DAX operations V4

From: Ira Weiny <[email protected]>

https://github.com/weiny2/linux-kernel/pull/new/dax-file-state-change-v4

Changes from V3:
https://lore.kernel.org/lkml/[email protected]/

* Remove global locking... :-D
* Put back per-inode locking and remove premature optimizations
* Fix issues with Directories having IS_DAX() set
* Fix kernel crash issues reported by Jeff
* Add some clean up patches
* Consolidate diflags to iflags functions
* Update/add documentation
* Reorder/rename patches quite a bit

Changes from V2:
https://lore.kernel.org/lkml/[email protected]/

* Move i_dax_sem to be a global percpu_rw_sem rather than per inode
Internal discussions with Dan determined this would be easier,
just as performant, and slightly less overhead than having it
in the SB as suggested by Jan
* Fix locking order in comments and throughout code
* Change "mode" to "state" throughout commits
* Add CONFIG_FS_DAX wrapper to disable inode_[un]lock_state() when not
configured
* Add static branch which is activated by a device which supports
DAX in XFS
* Change "lock/unlock" to up/down read/write as appropriate
Previous names were oversimplified
* Update comments/documentation

* Remove the xfs specific lock and move it to the vfs (global) layer.
* Fix i_dax_sem locking order and comments

* Move 'i_mapped' count from struct inode to struct address_space and
rename it to mmap_count
* Add inode_has_mappings() call

* Fix build issues
* Clean up syntax spacing and minor issues
* Update man page text for STATX_ATTR_DAX
* Add reviewed-by's
* Rebase to 5.6

Rename patch:
from: fs/xfs: Add lock/unlock state to xfs
to: fs/xfs: Add write DAX lock to xfs layer
Add patch:
fs/xfs: Clarify lockdep dependency for xfs_isilocked()
Drop patch:
fs/xfs: Fix truncate up


At LSF/MM'19 [1] [2] we discussed applications that overestimate memory
consumption due to their inability to detect whether the kernel will
instantiate page cache for a file, and cases where a global dax enable via a
mount option is too coarse.

The following patch series enables selecting the use of DAX on individual files
and/or directories on xfs, and lays some groundwork to do so in ext4. In this
scheme the dax mount option can be omitted to allow the per-file property to
take effect.

The insight at LSF/MM was to separate the per-mount or per-file "physical"
capability switch from an "effective" attribute for the file.

At LSF/MM we discussed the difficulties of switching the DAX state of a file
with active mappings / page cache. It was thought the races could be avoided
by limiting DAX state flips to 0-length files.

However, this turns out to not be true.[3] This is because address space
operations (a_ops) may be in use at any time the inode is referenced and users
have expressed a desire to be able to change the DAX state on a file with data
in it. For those reasons this patch set allows changing the DAX state flag on
a file as long as it is not currently mapped.

Details of when and how DAX state can be changed on a file are included in a
documentation patch.

It should be noted that the physical DAX flag inheritance is not shown in this
patch set as it was maintained from previous work on XFS. The physical DAX
flag and its inheritance will need to be added to other file systems for user
control.

As submitted, this series has been tested on real hardware.
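
For readers who want to see how an application could use the statx
attribute from this series to detect whether DAX is in effect for a file,
here is a minimal userspace sketch. It is not part of the patch set. The
helper names dax_state_from_attrs() and file_dax_state() are hypothetical,
and the fallback value for STATX_ATTR_DAX is an assumption taken from the
flag as it was later merged into mainline; the value in this posting may
differ.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>      /* AT_FDCWD */
#include <stdint.h>
#include <sys/stat.h>   /* statx(), struct statx (glibc 2.28 or later) */

/* Assumption: value as later merged upstream; only used if headers lack it. */
#ifndef STATX_ATTR_DAX
#define STATX_ATTR_DAX 0x00200000U
#endif

/*
 * Hypothetical helper: interpret the statx attribute words.
 *   1  -> the file is currently in the DAX state
 *   0  -> the filesystem reports the attribute and it is clear
 *  -1  -> the attribute is not reported, so the state is unknown
 */
static int dax_state_from_attrs(uint64_t attrs, uint64_t mask)
{
	if (attrs & STATX_ATTR_DAX)
		return 1;
	if (mask & STATX_ATTR_DAX)
		return 0;
	return -1;
}

/* Hypothetical wrapper: same results as above, or -2 if statx() fails. */
static int file_dax_state(const char *path)
{
	struct statx stx;

	if (statx(AT_FDCWD, path, 0, STATX_BASIC_STATS, &stx) != 0)
		return -2;

	return dax_state_from_attrs(stx.stx_attributes,
				    stx.stx_attributes_mask);
}
```

The attribute bit is checked before the mask because a kernel may set the
bit without advertising it in stx_attributes_mask; treating that case as
"unknown" would hide a file that really is in the DAX state.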


[1] https://lwn.net/Articles/787973/
[2] https://lwn.net/Articles/787233/
[3] https://lkml.org/lkml/2019/10/20/96
[4] https://patchwork.kernel.org/patch/11310511/


To: [email protected]
Cc: Alexander Viro <[email protected]>
Cc: "Darrick J. Wong" <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Dave Chinner <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: "Theodore Y. Ts'o" <[email protected]>
Cc: Jan Kara <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]

Ira Weiny (13):
fs/xfs: Remove unnecessary initialization of i_rwsem
fs/xfs: Clarify lockdep dependency for xfs_isilocked()
fs: Remove unneeded IS_DAX() check
fs/stat: Define DAX statx attribute
fs/xfs: Isolate the physical DAX flag from enabled
fs/xfs: Create function xfs_inode_enable_dax()
fs: Add locking for a dynamic address space operations state
fs: Prevent DAX state change if file is mmap'ed
fs/xfs: Add write aops lock to xfs layer
fs/xfs: Clean up locking in dax invalidate
fs/xfs: Allow toggle of effective DAX flag
fs/xfs: Remove xfs_diflags_to_linux()
Documentation/dax: Update Usage section

Documentation/filesystems/dax.txt | 84 ++++++++++++++++++++++++++++-
Documentation/filesystems/vfs.rst | 16 ++++++
fs/attr.c | 1 +
fs/inode.c | 16 ++++--
fs/iomap/buffered-io.c | 1 +
fs/open.c | 4 ++
fs/stat.c | 5 ++
fs/xfs/xfs_icache.c | 5 +-
fs/xfs/xfs_inode.c | 24 +++++++--
fs/xfs/xfs_inode.h | 9 +++-
fs/xfs/xfs_ioctl.c | 89 +++++++++++++------------------
fs/xfs/xfs_iops.c | 69 ++++++++++++++++--------
include/linux/fs.h | 75 ++++++++++++++++++++++++--
include/uapi/linux/stat.h | 1 +
mm/fadvise.c | 7 ++-
mm/filemap.c | 4 ++
mm/huge_memory.c | 1 +
mm/khugepaged.c | 2 +
mm/mmap.c | 19 ++++++-
mm/util.c | 9 +++-
20 files changed, 347 insertions(+), 94 deletions(-)

--
2.21.0


2020-02-21 00:43:47

by Ira Weiny

Subject: [PATCH V4 09/13] fs/xfs: Add write aops lock to xfs layer

From: Ira Weiny <[email protected]>

XFS requires the use of the aops of an inode to be quiesced prior to
changing it to/from the DAX aops vector.

Take the aops write lock while changing DAX state.

We define a new XFS_DAX_EXCL lock type to carry the lock through to
transaction completion.

Signed-off-by: Ira Weiny <[email protected]>

---
Changes from v3:
Change locking function names to reflect changes in previous
patches.

Changes from V2:
Change name of patch (WAS: fs/xfs: Add lock/unlock state to xfs)
Remove the xfs specific lock and move to the vfs layer.
We still use XFS_DAX_EXCL to be able to pass this
flag through to the transaction code. But we no longer
have a lock specific to xfs. This removes a lot of code
from the XFS layer, preps us for using this in ext4, and
is actually more straightforward now that all the
locking requirements are better known.

Fix locking order comment
Rework for new 'state' names
(Other comments on the previous patch are not applicable with
new patch as much of the code was removed in favor of the vfs
level lock)
---
fs/xfs/xfs_inode.c | 22 ++++++++++++++++++++--
fs/xfs/xfs_inode.h | 7 +++++--
2 files changed, 25 insertions(+), 4 deletions(-)

diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index 35df324875db..5b014c428f0f 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -142,12 +142,12 @@ xfs_ilock_attr_map_shared(
*
* Basic locking order:
*
- * i_rwsem -> i_mmap_lock -> page_lock -> i_ilock
+ * s_dax_sem -> i_rwsem -> i_mmap_lock -> page_lock -> i_ilock
*
* mmap_sem locking order:
*
* i_rwsem -> page lock -> mmap_sem
- * mmap_sem -> i_mmap_lock -> page_lock
+ * s_dax_sem -> mmap_sem -> i_mmap_lock -> page_lock
*
* The difference in mmap_sem locking order mean that we cannot hold the
* i_mmap_lock over syscall based read(2)/write(2) based IO. These IO paths can
@@ -182,6 +182,9 @@ xfs_ilock(
(XFS_ILOCK_SHARED | XFS_ILOCK_EXCL));
ASSERT((lock_flags & ~(XFS_LOCK_MASK | XFS_LOCK_SUBCLASS_MASK)) == 0);

+ if (lock_flags & XFS_DAX_EXCL)
+ inode_aops_down_write(VFS_I(ip));
+
if (lock_flags & XFS_IOLOCK_EXCL) {
down_write_nested(&VFS_I(ip)->i_rwsem,
XFS_IOLOCK_DEP(lock_flags));
@@ -224,6 +227,8 @@ xfs_ilock_nowait(
* You can't set both SHARED and EXCL for the same lock,
* and only XFS_IOLOCK_SHARED, XFS_IOLOCK_EXCL, XFS_ILOCK_SHARED,
* and XFS_ILOCK_EXCL are valid values to set in lock_flags.
+ *
+ * XFS_DAX_* is not allowed
*/
ASSERT((lock_flags & (XFS_IOLOCK_SHARED | XFS_IOLOCK_EXCL)) !=
(XFS_IOLOCK_SHARED | XFS_IOLOCK_EXCL));
@@ -232,6 +237,7 @@ xfs_ilock_nowait(
ASSERT((lock_flags & (XFS_ILOCK_SHARED | XFS_ILOCK_EXCL)) !=
(XFS_ILOCK_SHARED | XFS_ILOCK_EXCL));
ASSERT((lock_flags & ~(XFS_LOCK_MASK | XFS_LOCK_SUBCLASS_MASK)) == 0);
+ ASSERT((lock_flags & XFS_DAX_EXCL) == 0);

if (lock_flags & XFS_IOLOCK_EXCL) {
if (!down_write_trylock(&VFS_I(ip)->i_rwsem))
@@ -318,6 +324,9 @@ xfs_iunlock(
else if (lock_flags & XFS_ILOCK_SHARED)
mrunlock_shared(&ip->i_lock);

+ if (lock_flags & XFS_DAX_EXCL)
+ inode_aops_up_write(VFS_I(ip));
+
trace_xfs_iunlock(ip, lock_flags, _RET_IP_);
}

@@ -333,6 +342,8 @@ xfs_ilock_demote(
ASSERT(lock_flags & (XFS_IOLOCK_EXCL|XFS_MMAPLOCK_EXCL|XFS_ILOCK_EXCL));
ASSERT((lock_flags &
~(XFS_IOLOCK_EXCL|XFS_MMAPLOCK_EXCL|XFS_ILOCK_EXCL)) == 0);
+ /* XFS_DAX_* is not allowed */
+ ASSERT((lock_flags & XFS_DAX_EXCL) == 0);

if (lock_flags & XFS_ILOCK_EXCL)
mrdemote(&ip->i_lock);
@@ -465,6 +476,9 @@ xfs_lock_inodes(
ASSERT(!(lock_mode & XFS_ILOCK_EXCL) ||
inodes <= XFS_ILOCK_MAX_SUBCLASS + 1);

+ /* XFS_DAX_* is not allowed */
+ ASSERT((lock_mode & XFS_DAX_EXCL) == 0);
+
if (lock_mode & XFS_IOLOCK_EXCL) {
ASSERT(!(lock_mode & (XFS_MMAPLOCK_EXCL | XFS_ILOCK_EXCL)));
} else if (lock_mode & XFS_MMAPLOCK_EXCL)
@@ -566,6 +580,10 @@ xfs_lock_two_inodes(
ASSERT(!(ip0_mode & (XFS_MMAPLOCK_SHARED|XFS_MMAPLOCK_EXCL)) ||
!(ip1_mode & (XFS_ILOCK_SHARED|XFS_ILOCK_EXCL)));

+ /* XFS_DAX_* is not allowed */
+ ASSERT((ip0_mode & XFS_DAX_EXCL) == 0);
+ ASSERT((ip1_mode & XFS_DAX_EXCL) == 0);
+
ASSERT(ip0->i_ino != ip1->i_ino);

if (ip0->i_ino > ip1->i_ino) {
diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
index 492e53992fa9..25fe20740bf7 100644
--- a/fs/xfs/xfs_inode.h
+++ b/fs/xfs/xfs_inode.h
@@ -278,10 +278,12 @@ static inline void xfs_ifunlock(struct xfs_inode *ip)
#define XFS_ILOCK_SHARED (1<<3)
#define XFS_MMAPLOCK_EXCL (1<<4)
#define XFS_MMAPLOCK_SHARED (1<<5)
+#define XFS_DAX_EXCL (1<<6)

#define XFS_LOCK_MASK (XFS_IOLOCK_EXCL | XFS_IOLOCK_SHARED \
| XFS_ILOCK_EXCL | XFS_ILOCK_SHARED \
- | XFS_MMAPLOCK_EXCL | XFS_MMAPLOCK_SHARED)
+ | XFS_MMAPLOCK_EXCL | XFS_MMAPLOCK_SHARED \
+ | XFS_DAX_EXCL)

#define XFS_LOCK_FLAGS \
{ XFS_IOLOCK_EXCL, "IOLOCK_EXCL" }, \
@@ -289,7 +291,8 @@ static inline void xfs_ifunlock(struct xfs_inode *ip)
{ XFS_ILOCK_EXCL, "ILOCK_EXCL" }, \
{ XFS_ILOCK_SHARED, "ILOCK_SHARED" }, \
{ XFS_MMAPLOCK_EXCL, "MMAPLOCK_EXCL" }, \
- { XFS_MMAPLOCK_SHARED, "MMAPLOCK_SHARED" }
+ { XFS_MMAPLOCK_SHARED, "MMAPLOCK_SHARED" }, \
+ { XFS_DAX_EXCL, "DAX_EXCL" }


/*
--
2.21.0
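
The acquire/release ordering that the hunks above add to xfs_ilock() and
xfs_iunlock() can be sketched with a small single-threaded bookkeeping
model. This is only an illustration, not kernel code: nothing here blocks,
the model_* names are hypothetical, and the flag values that do not appear
in the hunks above (bits 0 and 2) are assumed to follow the same pattern.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative stand-ins for the XFS lock flags; bits 0 and 2 are assumed. */
#define MODEL_IOLOCK_EXCL   (1u << 0)
#define MODEL_ILOCK_EXCL    (1u << 2)
#define MODEL_MMAPLOCK_EXCL (1u << 4)
#define MODEL_DAX_EXCL      (1u << 6)

#define MODEL_LOG_MAX 16

struct model_inode {
	unsigned int held;               /* bitmask of locks currently held */
	unsigned int log[MODEL_LOG_MAX]; /* flags in the order taken/dropped */
	size_t nlog;
};

static void model_log(struct model_inode *ip, unsigned int flag)
{
	assert(ip->nlog < MODEL_LOG_MAX);
	ip->log[ip->nlog++] = flag;
}

static void model_take(struct model_inode *ip, unsigned int flag)
{
	assert(!(ip->held & flag));      /* no recursive acquisition */
	ip->held |= flag;
	model_log(ip, flag);
}

static void model_drop(struct model_inode *ip, unsigned int flag)
{
	assert(ip->held & flag);         /* must be held to be dropped */
	ip->held &= ~flag;
	model_log(ip, flag);
}

/* Mirrors the patched xfs_ilock(): the aops lock is taken first. */
static void model_ilock(struct model_inode *ip, unsigned int flags)
{
	if (flags & MODEL_DAX_EXCL)
		model_take(ip, MODEL_DAX_EXCL);
	if (flags & MODEL_IOLOCK_EXCL)
		model_take(ip, MODEL_IOLOCK_EXCL);
	if (flags & MODEL_MMAPLOCK_EXCL)
		model_take(ip, MODEL_MMAPLOCK_EXCL);
	if (flags & MODEL_ILOCK_EXCL)
		model_take(ip, MODEL_ILOCK_EXCL);
}

/* Mirrors the patched xfs_iunlock(): the aops lock is dropped last. */
static void model_iunlock(struct model_inode *ip, unsigned int flags)
{
	if (flags & MODEL_IOLOCK_EXCL)
		model_drop(ip, MODEL_IOLOCK_EXCL);
	if (flags & MODEL_MMAPLOCK_EXCL)
		model_drop(ip, MODEL_MMAPLOCK_EXCL);
	if (flags & MODEL_ILOCK_EXCL)
		model_drop(ip, MODEL_ILOCK_EXCL);
	if (flags & MODEL_DAX_EXCL)
		model_drop(ip, MODEL_DAX_EXCL);
}
```

Taking the new lock outermost matches the "Basic locking order" comment
updated by the patch, and releasing it last means it stays held until
every other inode lock named in the flags has been dropped.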

2020-02-22 00:31:48

by Darrick J. Wong

Subject: Re: [PATCH V4 09/13] fs/xfs: Add write aops lock to xfs layer

On Thu, Feb 20, 2020 at 04:41:30PM -0800, [email protected] wrote:
> From: Ira Weiny <[email protected]>
>
> XFS requires the use of the aops of an inode to quiesced prior to
> changing it to/from the DAX aops vector.
>
> Take the aops write lock while changing DAX state.
>
> We define a new XFS_DAX_EXCL lock type to carry the lock through to
> transaction completion.
>
> Signed-off-by: Ira Weiny <[email protected]>
>
> ---
> Changes from v3:
> Change locking function names to reflect changes in previous
> patches.
>
> Changes from V2:
> Change name of patch (WAS: fs/xfs: Add lock/unlock state to xfs)
> Remove the xfs specific lock and move to the vfs layer.
> We still use XFS_LOCK_DAX_EXCL to be able to pass this
> flag through to the transaction code. But we no longer
> have a lock specific to xfs. This removes a lot of code
> from the XFS layer, preps us for using this in ext4, and
> is actually more straight forward now that all the
> locking requirements are better known.
>
> Fix locking order comment
> Rework for new 'state' names
> (Other comments on the previous patch are not applicable with
> new patch as much of the code was removed in favor of the vfs
> level lock)
> ---
> fs/xfs/xfs_inode.c | 22 ++++++++++++++++++++--
> fs/xfs/xfs_inode.h | 7 +++++--
> 2 files changed, 25 insertions(+), 4 deletions(-)
>
> diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> index 35df324875db..5b014c428f0f 100644
> --- a/fs/xfs/xfs_inode.c
> +++ b/fs/xfs/xfs_inode.c
> @@ -142,12 +142,12 @@ xfs_ilock_attr_map_shared(
> *
> * Basic locking order:
> *
> - * i_rwsem -> i_mmap_lock -> page_lock -> i_ilock
> + * s_dax_sem -> i_rwsem -> i_mmap_lock -> page_lock -> i_ilock

"dax_sem"? I thought this was now called i_aops_sem?

> *
> * mmap_sem locking order:
> *
> * i_rwsem -> page lock -> mmap_sem
> - * mmap_sem -> i_mmap_lock -> page_lock
> + * s_dax_sem -> mmap_sem -> i_mmap_lock -> page_lock
> *
> * The difference in mmap_sem locking order mean that we cannot hold the
> * i_mmap_lock over syscall based read(2)/write(2) based IO. These IO paths can
> @@ -182,6 +182,9 @@ xfs_ilock(
> (XFS_ILOCK_SHARED | XFS_ILOCK_EXCL));
> ASSERT((lock_flags & ~(XFS_LOCK_MASK | XFS_LOCK_SUBCLASS_MASK)) == 0);
>
> + if (lock_flags & XFS_DAX_EXCL)

And similarly, I think this should be XFS_OPSLOCK_EXCL...

--D

> + inode_aops_down_write(VFS_I(ip));
> +
> if (lock_flags & XFS_IOLOCK_EXCL) {
> down_write_nested(&VFS_I(ip)->i_rwsem,
> XFS_IOLOCK_DEP(lock_flags));
> @@ -224,6 +227,8 @@ xfs_ilock_nowait(
> * You can't set both SHARED and EXCL for the same lock,
> * and only XFS_IOLOCK_SHARED, XFS_IOLOCK_EXCL, XFS_ILOCK_SHARED,
> * and XFS_ILOCK_EXCL are valid values to set in lock_flags.
> + *
> + * XFS_DAX_* is not allowed
> */
> ASSERT((lock_flags & (XFS_IOLOCK_SHARED | XFS_IOLOCK_EXCL)) !=
> (XFS_IOLOCK_SHARED | XFS_IOLOCK_EXCL));
> @@ -232,6 +237,7 @@ xfs_ilock_nowait(
> ASSERT((lock_flags & (XFS_ILOCK_SHARED | XFS_ILOCK_EXCL)) !=
> (XFS_ILOCK_SHARED | XFS_ILOCK_EXCL));
> ASSERT((lock_flags & ~(XFS_LOCK_MASK | XFS_LOCK_SUBCLASS_MASK)) == 0);
> + ASSERT((lock_flags & XFS_DAX_EXCL) == 0);
>
> if (lock_flags & XFS_IOLOCK_EXCL) {
> if (!down_write_trylock(&VFS_I(ip)->i_rwsem))
> @@ -318,6 +324,9 @@ xfs_iunlock(
> else if (lock_flags & XFS_ILOCK_SHARED)
> mrunlock_shared(&ip->i_lock);
>
> + if (lock_flags & XFS_DAX_EXCL)
> + inode_aops_up_write(VFS_I(ip));
> +
> trace_xfs_iunlock(ip, lock_flags, _RET_IP_);
> }
>
> @@ -333,6 +342,8 @@ xfs_ilock_demote(
> ASSERT(lock_flags & (XFS_IOLOCK_EXCL|XFS_MMAPLOCK_EXCL|XFS_ILOCK_EXCL));
> ASSERT((lock_flags &
> ~(XFS_IOLOCK_EXCL|XFS_MMAPLOCK_EXCL|XFS_ILOCK_EXCL)) == 0);
> + /* XFS_DAX_* is not allowed */
> + ASSERT((lock_flags & XFS_DAX_EXCL) == 0);
>
> if (lock_flags & XFS_ILOCK_EXCL)
> mrdemote(&ip->i_lock);
> @@ -465,6 +476,9 @@ xfs_lock_inodes(
> ASSERT(!(lock_mode & XFS_ILOCK_EXCL) ||
> inodes <= XFS_ILOCK_MAX_SUBCLASS + 1);
>
> + /* XFS_DAX_* is not allowed */
> + ASSERT((lock_mode & XFS_DAX_EXCL) == 0);
> +
> if (lock_mode & XFS_IOLOCK_EXCL) {
> ASSERT(!(lock_mode & (XFS_MMAPLOCK_EXCL | XFS_ILOCK_EXCL)));
> } else if (lock_mode & XFS_MMAPLOCK_EXCL)
> @@ -566,6 +580,10 @@ xfs_lock_two_inodes(
> ASSERT(!(ip0_mode & (XFS_MMAPLOCK_SHARED|XFS_MMAPLOCK_EXCL)) ||
> !(ip1_mode & (XFS_ILOCK_SHARED|XFS_ILOCK_EXCL)));
>
> + /* XFS_DAX_* is not allowed */
> + ASSERT((ip0_mode & XFS_DAX_EXCL) == 0);
> + ASSERT((ip1_mode & XFS_DAX_EXCL) == 0);
> +
> ASSERT(ip0->i_ino != ip1->i_ino);
>
> if (ip0->i_ino > ip1->i_ino) {
> diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
> index 492e53992fa9..25fe20740bf7 100644
> --- a/fs/xfs/xfs_inode.h
> +++ b/fs/xfs/xfs_inode.h
> @@ -278,10 +278,12 @@ static inline void xfs_ifunlock(struct xfs_inode *ip)
> #define XFS_ILOCK_SHARED (1<<3)
> #define XFS_MMAPLOCK_EXCL (1<<4)
> #define XFS_MMAPLOCK_SHARED (1<<5)
> +#define XFS_DAX_EXCL (1<<6)
>
> #define XFS_LOCK_MASK (XFS_IOLOCK_EXCL | XFS_IOLOCK_SHARED \
> | XFS_ILOCK_EXCL | XFS_ILOCK_SHARED \
> - | XFS_MMAPLOCK_EXCL | XFS_MMAPLOCK_SHARED)
> + | XFS_MMAPLOCK_EXCL | XFS_MMAPLOCK_SHARED \
> + | XFS_DAX_EXCL)
>
> #define XFS_LOCK_FLAGS \
> { XFS_IOLOCK_EXCL, "IOLOCK_EXCL" }, \
> @@ -289,7 +291,8 @@ static inline void xfs_ifunlock(struct xfs_inode *ip)
> { XFS_ILOCK_EXCL, "ILOCK_EXCL" }, \
> { XFS_ILOCK_SHARED, "ILOCK_SHARED" }, \
> { XFS_MMAPLOCK_EXCL, "MMAPLOCK_EXCL" }, \
> - { XFS_MMAPLOCK_SHARED, "MMAPLOCK_SHARED" }
> + { XFS_MMAPLOCK_SHARED, "MMAPLOCK_SHARED" }, \
> + { XFS_DAX_EXCL, "DAX_EXCL" }
>
>
> /*
> --
> 2.21.0
>

2020-02-23 15:05:16

by Ira Weiny

Subject: Re: [PATCH V4 09/13] fs/xfs: Add write aops lock to xfs layer

On Fri, Feb 21, 2020 at 04:31:09PM -0800, Darrick J. Wong wrote:
> On Thu, Feb 20, 2020 at 04:41:30PM -0800, [email protected] wrote:
> > From: Ira Weiny <[email protected]>
> >
> > XFS requires the use of the aops of an inode to quiesced prior to
> > changing it to/from the DAX aops vector.
> >
> > Take the aops write lock while changing DAX state.
> >
> > We define a new XFS_DAX_EXCL lock type to carry the lock through to
> > transaction completion.
> >
> > Signed-off-by: Ira Weiny <[email protected]>
> >
> > ---
> > Changes from v3:
> > Change locking function names to reflect changes in previous
> > patches.
> >
> > Changes from V2:
> > Change name of patch (WAS: fs/xfs: Add lock/unlock state to xfs)
> > Remove the xfs specific lock and move to the vfs layer.
> > We still use XFS_LOCK_DAX_EXCL to be able to pass this
> > flag through to the transaction code. But we no longer
> > have a lock specific to xfs. This removes a lot of code
> > from the XFS layer, preps us for using this in ext4, and
> > is actually more straight forward now that all the
> > locking requirements are better known.
> >
> > Fix locking order comment
> > Rework for new 'state' names
> > (Other comments on the previous patch are not applicable with
> > new patch as much of the code was removed in favor of the vfs
> > level lock)
> > ---
> > fs/xfs/xfs_inode.c | 22 ++++++++++++++++++++--
> > fs/xfs/xfs_inode.h | 7 +++++--
> > 2 files changed, 25 insertions(+), 4 deletions(-)
> >
> > diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> > index 35df324875db..5b014c428f0f 100644
> > --- a/fs/xfs/xfs_inode.c
> > +++ b/fs/xfs/xfs_inode.c
> > @@ -142,12 +142,12 @@ xfs_ilock_attr_map_shared(
> > *
> > * Basic locking order:
> > *
> > - * i_rwsem -> i_mmap_lock -> page_lock -> i_ilock
> > + * s_dax_sem -> i_rwsem -> i_mmap_lock -> page_lock -> i_ilock
>
> "dax_sem"? I thought this was now called i_aops_sem?

:-/ yep...

>
> > *
> > * mmap_sem locking order:
> > *
> > * i_rwsem -> page lock -> mmap_sem
> > - * mmap_sem -> i_mmap_lock -> page_lock
> > + * s_dax_sem -> mmap_sem -> i_mmap_lock -> page_lock
> > *
> > * The difference in mmap_sem locking order mean that we cannot hold the
> > * i_mmap_lock over syscall based read(2)/write(2) based IO. These IO paths can
> > @@ -182,6 +182,9 @@ xfs_ilock(
> > (XFS_ILOCK_SHARED | XFS_ILOCK_EXCL));
> > ASSERT((lock_flags & ~(XFS_LOCK_MASK | XFS_LOCK_SUBCLASS_MASK)) == 0);
> >
> > + if (lock_flags & XFS_DAX_EXCL)
>
> And similarly, I think this should be XFS_OPSLOCK_EXCL...

... and ... yes...

Thanks for the review, I'll clean it up.

Ira

>
> --D
>
> > + inode_aops_down_write(VFS_I(ip));
> > +
> > if (lock_flags & XFS_IOLOCK_EXCL) {
> > down_write_nested(&VFS_I(ip)->i_rwsem,
> > XFS_IOLOCK_DEP(lock_flags));
> > @@ -224,6 +227,8 @@ xfs_ilock_nowait(
> > * You can't set both SHARED and EXCL for the same lock,
> > * and only XFS_IOLOCK_SHARED, XFS_IOLOCK_EXCL, XFS_ILOCK_SHARED,
> > * and XFS_ILOCK_EXCL are valid values to set in lock_flags.
> > + *
> > + * XFS_DAX_* is not allowed
> > */
> > ASSERT((lock_flags & (XFS_IOLOCK_SHARED | XFS_IOLOCK_EXCL)) !=
> > (XFS_IOLOCK_SHARED | XFS_IOLOCK_EXCL));
> > @@ -232,6 +237,7 @@ xfs_ilock_nowait(
> > ASSERT((lock_flags & (XFS_ILOCK_SHARED | XFS_ILOCK_EXCL)) !=
> > (XFS_ILOCK_SHARED | XFS_ILOCK_EXCL));
> > ASSERT((lock_flags & ~(XFS_LOCK_MASK | XFS_LOCK_SUBCLASS_MASK)) == 0);
> > + ASSERT((lock_flags & XFS_DAX_EXCL) == 0);
> >
> > if (lock_flags & XFS_IOLOCK_EXCL) {
> > if (!down_write_trylock(&VFS_I(ip)->i_rwsem))
> > @@ -318,6 +324,9 @@ xfs_iunlock(
> > else if (lock_flags & XFS_ILOCK_SHARED)
> > mrunlock_shared(&ip->i_lock);
> >
> > + if (lock_flags & XFS_DAX_EXCL)
> > + inode_aops_up_write(VFS_I(ip));
> > +
> > trace_xfs_iunlock(ip, lock_flags, _RET_IP_);
> > }
> >
> > @@ -333,6 +342,8 @@ xfs_ilock_demote(
> > ASSERT(lock_flags & (XFS_IOLOCK_EXCL|XFS_MMAPLOCK_EXCL|XFS_ILOCK_EXCL));
> > ASSERT((lock_flags &
> > ~(XFS_IOLOCK_EXCL|XFS_MMAPLOCK_EXCL|XFS_ILOCK_EXCL)) == 0);
> > + /* XFS_DAX_* is not allowed */
> > + ASSERT((lock_flags & XFS_DAX_EXCL) == 0);
> >
> > if (lock_flags & XFS_ILOCK_EXCL)
> > mrdemote(&ip->i_lock);
> > @@ -465,6 +476,9 @@ xfs_lock_inodes(
> > ASSERT(!(lock_mode & XFS_ILOCK_EXCL) ||
> > inodes <= XFS_ILOCK_MAX_SUBCLASS + 1);
> >
> > + /* XFS_DAX_* is not allowed */
> > + ASSERT((lock_mode & XFS_DAX_EXCL) == 0);
> > +
> > if (lock_mode & XFS_IOLOCK_EXCL) {
> > ASSERT(!(lock_mode & (XFS_MMAPLOCK_EXCL | XFS_ILOCK_EXCL)));
> > } else if (lock_mode & XFS_MMAPLOCK_EXCL)
> > @@ -566,6 +580,10 @@ xfs_lock_two_inodes(
> > ASSERT(!(ip0_mode & (XFS_MMAPLOCK_SHARED|XFS_MMAPLOCK_EXCL)) ||
> > !(ip1_mode & (XFS_ILOCK_SHARED|XFS_ILOCK_EXCL)));
> >
> > + /* XFS_DAX_* is not allowed */
> > + ASSERT((ip0_mode & XFS_DAX_EXCL) == 0);
> > + ASSERT((ip1_mode & XFS_DAX_EXCL) == 0);
> > +
> > ASSERT(ip0->i_ino != ip1->i_ino);
> >
> > if (ip0->i_ino > ip1->i_ino) {
> > diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
> > index 492e53992fa9..25fe20740bf7 100644
> > --- a/fs/xfs/xfs_inode.h
> > +++ b/fs/xfs/xfs_inode.h
> > @@ -278,10 +278,12 @@ static inline void xfs_ifunlock(struct xfs_inode *ip)
> > #define XFS_ILOCK_SHARED (1<<3)
> > #define XFS_MMAPLOCK_EXCL (1<<4)
> > #define XFS_MMAPLOCK_SHARED (1<<5)
> > +#define XFS_DAX_EXCL (1<<6)
> >
> > #define XFS_LOCK_MASK (XFS_IOLOCK_EXCL | XFS_IOLOCK_SHARED \
> > | XFS_ILOCK_EXCL | XFS_ILOCK_SHARED \
> > - | XFS_MMAPLOCK_EXCL | XFS_MMAPLOCK_SHARED)
> > + | XFS_MMAPLOCK_EXCL | XFS_MMAPLOCK_SHARED \
> > + | XFS_DAX_EXCL)
> >
> > #define XFS_LOCK_FLAGS \
> > { XFS_IOLOCK_EXCL, "IOLOCK_EXCL" }, \
> > @@ -289,7 +291,8 @@ static inline void xfs_ifunlock(struct xfs_inode *ip)
> > { XFS_ILOCK_EXCL, "ILOCK_EXCL" }, \
> > { XFS_ILOCK_SHARED, "ILOCK_SHARED" }, \
> > { XFS_MMAPLOCK_EXCL, "MMAPLOCK_EXCL" }, \
> > - { XFS_MMAPLOCK_SHARED, "MMAPLOCK_SHARED" }
> > + { XFS_MMAPLOCK_SHARED, "MMAPLOCK_SHARED" }, \
> > + { XFS_DAX_EXCL, "DAX_EXCL" }
> >
> >
> > /*
> > --
> > 2.21.0
> >

2020-02-24 00:35:23

by Dave Chinner

Subject: Re: [PATCH V4 09/13] fs/xfs: Add write aops lock to xfs layer

On Thu, Feb 20, 2020 at 04:41:30PM -0800, [email protected] wrote:
> From: Ira Weiny <[email protected]>
>
> XFS requires the use of the aops of an inode to quiesced prior to
> changing it to/from the DAX aops vector.
>
> Take the aops write lock while changing DAX state.
>
> We define a new XFS_DAX_EXCL lock type to carry the lock through to
> transaction completion.
>
> Signed-off-by: Ira Weiny <[email protected]>
>
> ---
> Changes from v3:
> Change locking function names to reflect changes in previous
> patches.
>
> Changes from V2:
> Change name of patch (WAS: fs/xfs: Add lock/unlock state to xfs)
> Remove the xfs specific lock and move to the vfs layer.
> We still use XFS_LOCK_DAX_EXCL to be able to pass this
> flag through to the transaction code. But we no longer
> have a lock specific to xfs. This removes a lot of code
> from the XFS layer, preps us for using this in ext4, and
> is actually more straight forward now that all the
> locking requirements are better known.
>
> Fix locking order comment
> Rework for new 'state' names
> (Other comments on the previous patch are not applicable with
> new patch as much of the code was removed in favor of the vfs
> level lock)
> ---
> fs/xfs/xfs_inode.c | 22 ++++++++++++++++++++--
> fs/xfs/xfs_inode.h | 7 +++++--
> 2 files changed, 25 insertions(+), 4 deletions(-)
>
> diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> index 35df324875db..5b014c428f0f 100644
> --- a/fs/xfs/xfs_inode.c
> +++ b/fs/xfs/xfs_inode.c
> @@ -142,12 +142,12 @@ xfs_ilock_attr_map_shared(
> *
> * Basic locking order:
> *
> - * i_rwsem -> i_mmap_lock -> page_lock -> i_ilock
> + * s_dax_sem -> i_rwsem -> i_mmap_lock -> page_lock -> i_ilock
> *
> * mmap_sem locking order:
> *
> * i_rwsem -> page lock -> mmap_sem
> - * mmap_sem -> i_mmap_lock -> page_lock
> + * s_dax_sem -> mmap_sem -> i_mmap_lock -> page_lock
> *
> * The difference in mmap_sem locking order mean that we cannot hold the
> * i_mmap_lock over syscall based read(2)/write(2) based IO. These IO paths can
> @@ -182,6 +182,9 @@ xfs_ilock(
> (XFS_ILOCK_SHARED | XFS_ILOCK_EXCL));
> ASSERT((lock_flags & ~(XFS_LOCK_MASK | XFS_LOCK_SUBCLASS_MASK)) == 0);
>
> + if (lock_flags & XFS_DAX_EXCL)
> + inode_aops_down_write(VFS_I(ip));

I largely don't see the point of adding this to xfs_ilock/iunlock.

It's only got one caller, so I don't see much point in adding it to
an interface that has over a hundred other call sites that don't
need or use this lock. Just open code it where it is needed in the
ioctl code.

Cheers,

Dave.
--
Dave Chinner
[email protected]

2020-02-24 22:33:15

by Dave Chinner

Subject: Re: [PATCH V4 09/13] fs/xfs: Add write aops lock to xfs layer

On Mon, Feb 24, 2020 at 11:57:36AM -0800, Ira Weiny wrote:
> On Mon, Feb 24, 2020 at 11:34:55AM +1100, Dave Chinner wrote:
> > On Thu, Feb 20, 2020 at 04:41:30PM -0800, [email protected] wrote:
> > > From: Ira Weiny <[email protected]>
> > >
>
> [snip]
>
> > >
> > > diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> > > index 35df324875db..5b014c428f0f 100644
> > > --- a/fs/xfs/xfs_inode.c
> > > +++ b/fs/xfs/xfs_inode.c
> > > @@ -142,12 +142,12 @@ xfs_ilock_attr_map_shared(
> > > *
> > > * Basic locking order:
> > > *
> > > - * i_rwsem -> i_mmap_lock -> page_lock -> i_ilock
> > > + * s_dax_sem -> i_rwsem -> i_mmap_lock -> page_lock -> i_ilock
> > > *
> > > * mmap_sem locking order:
> > > *
> > > * i_rwsem -> page lock -> mmap_sem
> > > - * mmap_sem -> i_mmap_lock -> page_lock
> > > + * s_dax_sem -> mmap_sem -> i_mmap_lock -> page_lock
> > > *
> > > * The difference in mmap_sem locking order mean that we cannot hold the
> > > * i_mmap_lock over syscall based read(2)/write(2) based IO. These IO paths can
> > > @@ -182,6 +182,9 @@ xfs_ilock(
> > > (XFS_ILOCK_SHARED | XFS_ILOCK_EXCL));
> > > ASSERT((lock_flags & ~(XFS_LOCK_MASK | XFS_LOCK_SUBCLASS_MASK)) == 0);
> > >
> > > + if (lock_flags & XFS_DAX_EXCL)
> > > + inode_aops_down_write(VFS_I(ip));
> >
> > I largely don't see the point of adding this to xfs_ilock/iunlock.
> >
> > It's only got one caller, so I don't see much point in adding it to
> > an interface that has over a hundred other call sites that don't
> > need or use this lock. just open code it where it is needed in the
> > ioctl code.
>
> I know it seems overkill but if we don't do this we need to code a flag to be
> returned from xfs_ioctl_setattr_dax_invalidate(). This flag is then used in
> xfs_ioctl_setattr_get_trans() to create the transaction log item which can then
> be properly used to unlock the lock in xfs_inode_item_release()
>
> I don't know of a cleaner way to communicate to xfs_inode_item_release() to
> unlock i_aops_sem after the transaction is complete.

We manually unlock inodes after transactions in many cases -
anywhere we do a rolling transaction, the inode locks do not get
released by the transaction. Hence for a one-off case like this it
doesn't really make sense to push all this infrastructure into the
transaction subsystem. Especially as we can manually lock before and
unlock after the transaction context without any real complexity.

This also means that we can, if necessary, do aops manipulation work
/after/ the transaction that changes on-disk state completes and we
still hold the aops reference exclusively. While we don't do that
now, I think it is worthwhile keeping our options open here....

Cheers,

Dave.
--
Dave Chinner
[email protected]
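
The shape of the suggestion above (take the aops lock by hand before the
transaction, drop it by hand afterwards) can be sketched with a toy
single-threaded model. Everything named model_* is hypothetical and only
tracks state; the boolean assignments stand in for the
inode_aops_down_write()/inode_aops_up_write() calls from the patch, and
the "commit releases the joined inode lock" behaviour is a simplification
for illustration.

```c
#include <assert.h>
#include <stdbool.h>

struct model_inode {
	bool aops_write_held;  /* stands in for the aops write lock */
	bool ilock_held;       /* stands in for an inode lock joined to a trans */
	bool dax_flag;         /* the state being changed */
};

struct model_trans {
	struct model_inode *ip;
	bool open;
};

/* Begin a transaction and take the inode lock it will own. */
static void model_trans_begin(struct model_trans *tp, struct model_inode *ip)
{
	assert(!ip->ilock_held);
	ip->ilock_held = true;
	tp->ip = ip;
	tp->open = true;
}

/* Commit: the lock owned by the transaction is released here. */
static void model_trans_commit(struct model_trans *tp)
{
	assert(tp->open);
	tp->ip->ilock_held = false;
	tp->open = false;
}

/* Open-coded locking in the (modelled) ioctl path. */
static int model_ioctl_set_dax(struct model_inode *ip, bool enable)
{
	struct model_trans tp;

	assert(!ip->aops_write_held);
	ip->aops_write_held = true;     /* manual lock before the transaction */

	model_trans_begin(&tp, ip);
	ip->dax_flag = enable;          /* the change the transaction records */
	model_trans_commit(&tp);

	/*
	 * The transaction is complete and its lock is gone, but the aops
	 * lock is still held, so post-commit aops work could go here.
	 */
	assert(ip->aops_write_held && !ip->ilock_held);

	ip->aops_write_held = false;    /* manual unlock after the transaction */
	return 0;
}
```

In this arrangement the transaction code never needs to know about the
aops lock, which is the point of keeping it out of the flag mask.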

2020-02-25 21:12:55

by Ira Weiny

Subject: Re: [PATCH V4 09/13] fs/xfs: Add write aops lock to xfs layer

On Tue, Feb 25, 2020 at 09:32:45AM +1100, Dave Chinner wrote:
> On Mon, Feb 24, 2020 at 11:57:36AM -0800, Ira Weiny wrote:
> > On Mon, Feb 24, 2020 at 11:34:55AM +1100, Dave Chinner wrote:
> > > On Thu, Feb 20, 2020 at 04:41:30PM -0800, [email protected] wrote:
> > > > From: Ira Weiny <[email protected]>
> > > >
> >
> > [snip]
> >
> > > >
> > > > diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> > > > index 35df324875db..5b014c428f0f 100644
> > > > --- a/fs/xfs/xfs_inode.c
> > > > +++ b/fs/xfs/xfs_inode.c
> > > > @@ -142,12 +142,12 @@ xfs_ilock_attr_map_shared(
> > > > *
> > > > * Basic locking order:
> > > > *
> > > > - * i_rwsem -> i_mmap_lock -> page_lock -> i_ilock
> > > > + * s_dax_sem -> i_rwsem -> i_mmap_lock -> page_lock -> i_ilock
> > > > *
> > > > * mmap_sem locking order:
> > > > *
> > > > * i_rwsem -> page lock -> mmap_sem
> > > > - * mmap_sem -> i_mmap_lock -> page_lock
> > > > + * s_dax_sem -> mmap_sem -> i_mmap_lock -> page_lock
> > > > *
> > > > * The difference in mmap_sem locking order mean that we cannot hold the
> > > > * i_mmap_lock over syscall based read(2)/write(2) based IO. These IO paths can
> > > > @@ -182,6 +182,9 @@ xfs_ilock(
> > > > (XFS_ILOCK_SHARED | XFS_ILOCK_EXCL));
> > > > ASSERT((lock_flags & ~(XFS_LOCK_MASK | XFS_LOCK_SUBCLASS_MASK)) == 0);
> > > >
> > > > + if (lock_flags & XFS_DAX_EXCL)
> > > > + inode_aops_down_write(VFS_I(ip));
> > >
> > > I largely don't see the point of adding this to xfs_ilock/iunlock.
> > >
> > > It's only got one caller, so I don't see much point in adding it to
> > > an interface that has over a hundred other call sites that don't
> > > need or use this lock. just open code it where it is needed in the
> > > ioctl code.
> >
> > I know it seems overkill but if we don't do this we need to code a flag to be
> > returned from xfs_ioctl_setattr_dax_invalidate(). This flag is then used in
> > xfs_ioctl_setattr_get_trans() to create the transaction log item which can then
> > be properly used to unlock the lock in xfs_inode_item_release()
> >
> > I don't know of a cleaner way to communicate to xfs_inode_item_release() to
> > unlock i_aops_sem after the transaction is complete.
>
> We manually unlock inodes after transactions in many cases -
> anywhere we do a rolling transaction, the inode locks do not get
> released by the transaction. Hence for a one-off case like this it
> doesn't really make sense to push all this infrastructure into the
> transaction subsystem. Especially as we can manually lock before and
> unlock after the transaction context without any real complexity.

So does xfs_trans_commit() operate synchronously?

I want to understand this better because I have fought with a lot of ABBA
issues with these locks. So... can I hold the lock until after
xfs_trans_commit() and safely unlock it there... because the XFS_MMAPLOCK_EXCL,
XFS_IOLOCK_EXCL, and XFS_ILOCK_EXCL will be released at that point? Thus
preserving the following lock order.

...
* Basic locking order:
*
* i_aops_sem -> i_rwsem -> i_mmap_lock -> page_lock -> i_ilock
*
...

Thanks for the review!
Ira

>
> This also means that we can, if necessary, do aops manipulation work
> /after/ the transaction that changes on-disk state completes and we
> still hold the aops reference exclusively. While we don't do that
> now, I think it is worthwhile keeping our options open here....
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> [email protected]
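
The ordering question raised above can be made concrete with a small
rank-based checker: each lock in the "Basic locking order" comment gets a
rank, and an acquisition is only legal if no lock of the same or a later
rank is already held. This is a hedged, single-threaded sketch for
reasoning about ABBA orderings, not kernel code and not lockdep; the
lock_rank and tracker_* names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Ranks follow: i_aops_sem -> i_rwsem -> i_mmap_lock -> page_lock -> i_ilock */
enum lock_rank {
	RANK_AOPS_SEM,
	RANK_I_RWSEM,
	RANK_I_MMAP_LOCK,
	RANK_PAGE_LOCK,
	RANK_I_ILOCK,
	RANK_COUNT
};

struct lock_tracker {
	unsigned int held;      /* bit r is set while the lock of rank r is held */
};

/*
 * Record an acquisition. Returns false, without recording anything, if a
 * lock of the same or a later rank is already held, since that would
 * invert the documented order.
 */
static bool tracker_acquire(struct lock_tracker *t, enum lock_rank r)
{
	unsigned int same_or_later = ~0u << r;

	assert(r < RANK_COUNT);
	if (t->held & same_or_later)
		return false;
	t->held |= 1u << r;
	return true;
}

/* Releases may happen in any order; only acquisition order is checked. */
static void tracker_release(struct lock_tracker *t, enum lock_rank r)
{
	assert(t->held & (1u << r));
	t->held &= ~(1u << r);
}
```

Under this model, holding the rank-0 lock until after the inner locks
have been released is always legal, while trying to take it with an
inner lock still held is flagged as an inversion.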

2020-02-26 18:03:16

by Ira Weiny

Subject: Re: [PATCH V4 09/13] fs/xfs: Add write aops lock to xfs layer

On Wed, Feb 26, 2020 at 09:59:41AM +1100, Dave Chinner wrote:
> On Tue, Feb 25, 2020 at 01:12:28PM -0800, Ira Weiny wrote:
> > On Tue, Feb 25, 2020 at 09:32:45AM +1100, Dave Chinner wrote:
> > > On Mon, Feb 24, 2020 at 11:57:36AM -0800, Ira Weiny wrote:
> > > > On Mon, Feb 24, 2020 at 11:34:55AM +1100, Dave Chinner wrote:
> > > > > On Thu, Feb 20, 2020 at 04:41:30PM -0800, [email protected] wrote:
> > > > > > From: Ira Weiny <[email protected]>
> > > > > >
> > > >
> > > > [snip]
> > > >
> > > > > >
> > > > > > diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> > > > > > index 35df324875db..5b014c428f0f 100644
> > > > > > --- a/fs/xfs/xfs_inode.c
> > > > > > +++ b/fs/xfs/xfs_inode.c
> > > > > > @@ -142,12 +142,12 @@ xfs_ilock_attr_map_shared(
> > > > > > *
> > > > > > * Basic locking order:
> > > > > > *
> > > > > > - * i_rwsem -> i_mmap_lock -> page_lock -> i_ilock
> > > > > > + * s_dax_sem -> i_rwsem -> i_mmap_lock -> page_lock -> i_ilock
> > > > > > *
> > > > > > * mmap_sem locking order:
> > > > > > *
> > > > > > * i_rwsem -> page lock -> mmap_sem
> > > > > > - * mmap_sem -> i_mmap_lock -> page_lock
> > > > > > + * s_dax_sem -> mmap_sem -> i_mmap_lock -> page_lock
> > > > > > *
> > > > > > * The difference in mmap_sem locking order mean that we cannot hold the
> > > > > > * i_mmap_lock over syscall based read(2)/write(2) based IO. These IO paths can
> > > > > > @@ -182,6 +182,9 @@ xfs_ilock(
> > > > > > (XFS_ILOCK_SHARED | XFS_ILOCK_EXCL));
> > > > > > ASSERT((lock_flags & ~(XFS_LOCK_MASK | XFS_LOCK_SUBCLASS_MASK)) == 0);
> > > > > >
> > > > > > + if (lock_flags & XFS_DAX_EXCL)
> > > > > > + inode_aops_down_write(VFS_I(ip));
> > > > >
> > > > > I largely don't see the point of adding this to xfs_ilock/iunlock.
> > > > >
> > > > > It's only got one caller, so I don't see much point in adding it to
> > > > > an interface that has over a hundred other call sites that don't
> > > > > need or use this lock. just open code it where it is needed in the
> > > > > ioctl code.
> > > >
> > > > I know it seems overkill but if we don't do this we need to code a flag to be
> > > > returned from xfs_ioctl_setattr_dax_invalidate(). This flag is then used in
> > > > xfs_ioctl_setattr_get_trans() to create the transaction log item which can then
> > > > be properly used to unlock the lock in xfs_inode_item_release()
> > > >
> > > > I don't know of a cleaner way to communicate to xfs_inode_item_release() to
> > > > unlock i_aops_sem after the transaction is complete.
> > >
> > > We manually unlock inodes after transactions in many cases -
> > > anywhere we do a rolling transaction, the inode locks do not get
> > > released by the transaction. Hence for a one-off case like this it
> > > doesn't really make sense to push all this infrastructure into the
> > > transaction subsystem. Especially as we can manually lock before and
> > > unlock after the transaction context without any real complexity.
> >
> > So does xfs_trans_commit() operate synchronously?
>
> What do you mean by "synchronously", and what are you expecting to
> occur (a)synchronously with respect to filesystem objects and/or
> on-disk state?
>
> Keep in mind that the xfs transaction subsystem is a complex
> asynchronous IO engine full of feedback loops and resource
> management,

This is precisely why I added the lock to the transaction state. So that I
could guarantee that the lock will be released in the proper order when the
complicated transaction subsystem was done with it. I did not see any reason
to allow operations to proceed before that time. And so this seemed safe...

> so asking if something is "synchronous" without any
> other context is a difficult question to answer :)

Or apparently it is difficult to even ask... ;-) (...not trying to be
sarcastic.) Seriously, I'm not an expert in this area so I did what I thought
was most safe. Which for me was the number 1 goal.

>
> > I want to understand this better because I have fought with a lot of ABBA
> > issues with these locks. So... can I hold the lock until after
> > xfs_trans_commit() and safely unlock it there... because the XFS_MMAPLOCK_EXCL,
> > XFS_IOLOCK_EXCL, and XFS_ILOCK_EXCL will be released at that point? Thus
> > preserving the following lock order.
>
> See how operations like xfs_create, xfs_unlink, etc work. They don't
> specify flags to xfs_ijoin(), and so the transaction commits don't
> automatically unlock the inode.

xfs_ijoin()? Do you mean xfs_trans_ijoin()?

> This is necessary so that rolling
> transactions are executed atomically w.r.t. inode access - no-one
> can lock and access the inode while a multi-commit rolling
> transaction on the inode is on-going.
>
> In this case it's just a single commit and we don't need to keep
> it locked after the change is made, so we can unlock the inode
> on commit. So for the XFS internal locks the code is fine and
> doesn't need to change. We just need to wrap the VFS aops lock (if
> we keep it) around the outside of all the XFS locking until the
> transaction commits and unlocks the XFS locks...

Ok, I went ahead and coded it up and it is testing now. Everything looks good.
I have to say that at this point I have to agree that I can't see how a
deadlock could occur so...
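
For anyone following along, here is a minimal userspace sketch of the ordering
described above: take the outer aops lock, take the XFS locks inside it, let
the commit drop the XFS locks, then drop the aops lock by hand. It is only a
single-threaded model that tracks which ranks are held; it does no real
synchronization and it is not the code from the patch. Every name in it
(model_inode, model_lock(), model_trans_commit(), model_set_dax(), the RANK_*
values) is hypothetical, and page_lock is left out on the assumption that this
path does not take it.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical ranks, outermost first, mirroring the order in the thread. */
enum lock_rank {
	RANK_AOPS = 0,	/* stands in for the VFS i_aops_sem */
	RANK_IOLOCK,	/* stands in for i_rwsem (XFS_IOLOCK_EXCL) */
	RANK_MMAPLOCK,	/* stands in for i_mmap_lock (XFS_MMAPLOCK_EXCL) */
	RANK_ILOCK,	/* stands in for i_ilock (XFS_ILOCK_EXCL) */
	RANK_COUNT
};

struct model_inode {
	unsigned int held;	/* bitmask of ranks currently held */
	bool dax;		/* the state the "transaction" changes */
};

/* Take rank r; refuse if r or any inner rank is already held. */
static bool model_lock(struct model_inode *ip, enum lock_rank r)
{
	unsigned int all = (1u << RANK_COUNT) - 1;
	unsigned int this_or_inner = (~0u << r) & all;

	if (ip->held & this_or_inner)
		return false;	/* would invert the documented order */
	ip->held |= 1u << r;
	return true;
}

static void model_unlock(struct model_inode *ip, enum lock_rank r)
{
	assert(ip->held & (1u << r));
	ip->held &= ~(1u << r);
}

/* Model of a commit that releases the locks joined to the transaction. */
static void model_trans_commit(struct model_inode *ip, unsigned int joined)
{
	assert((ip->held & joined) == joined);
	ip->held &= ~joined;
}

/* Outer lock wraps the whole transaction and is dropped manually. */
static int model_set_dax(struct model_inode *ip, bool enable)
{
	unsigned int joined = (1u << RANK_IOLOCK) |
			      (1u << RANK_MMAPLOCK) |
			      (1u << RANK_ILOCK);
	bool ok;

	if (!model_lock(ip, RANK_AOPS))
		return -1;

	/* Nothing else can be held here, so these always succeed. */
	ok = model_lock(ip, RANK_IOLOCK) &&
	     model_lock(ip, RANK_MMAPLOCK) &&
	     model_lock(ip, RANK_ILOCK);
	assert(ok);
	(void)ok;

	ip->dax = enable;
	model_trans_commit(ip, joined);	/* inner locks released here */

	/* Only the outer lock is left; release it after the commit. */
	assert(ip->held == (1u << RANK_AOPS));
	model_unlock(ip, RANK_AOPS);
	return 0;
}
```

In this model the outer lock is never released before the inner ones, and
trying to take RANK_AOPS while RANK_ILOCK is held is rejected, which is the
ABBA case I was worried about.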

Thanks for the review,
Ira

>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> [email protected]

2020-02-27 02:45:05

by Ira Weiny

Subject: Re: [PATCH V4 00/13] Enable per-file/per-directory DAX operations V4

On Wed, Feb 26, 2020 at 05:48:38PM -0500, Jeff Moyer wrote:
> Hi, Ira,
>
> [email protected] writes:
>
> > From: Ira Weiny <[email protected]>
> >
> > https://github.com/weiny2/linux-kernel/pull/new/dax-file-state-change-v4
> >
> > Changes from V3:
> > https://lore.kernel.org/lkml/[email protected]/
> >
> > * Remove global locking... :-D
> > * Put back per-inode locking and remove premature optimizations
> > * Fix issues with Directories having IS_DAX() set
> > * Fix kernel crash issues reported by Jeff
> > * Add some clean up patches
> > * Consolidate diflags to iflags functions
> > * Update/add documentation
> > * Reorder/rename patches quite a bit
>
> I left out patches 1 and 2, but applied the rest and tested. This
> passes xfs tests in the following configurations:
> 1) MKFS_OPTIONS="-m reflink=0" MOUNT_OPTIONS="-o dax"
> 2) MKFS_OPTIONS="-m reflink=0"
> but with the added configuration step of setting the dax attribute on
> the mounted test directory.
>
> I also tested to ensure that reflink fails when a file has the dax
> attribute set. I've got more testing to do, but figured I'd at least
> let you know I've been looking at it.

Thank you!

I need to update my xfstest which is specific to this as well... I'll get to
that tomorrow and send an updated patch...

Thanks!
Ira

>
> Thanks!
> Jeff
>