2019-06-06 01:47:08

by Ira Weiny

[permalink] [raw]
Subject: [PATCH RFC 00/10] RDMA/FS DAX truncate proposal

From: Ira Weiny <[email protected]>

... V1,000,000 ;-)

Pre-requisites:
John Hubbard's put_user_pages() patch series.[1]
Jan Kara's ext4_break_layouts() fixes[2]

Based on the feedback from LSFmm and the LWN article which resulted. I've
decided to take a slightly different tack on this problem.

The real issue is that there is no use case for a user to have RDMA pinn'ed
memory which is then truncated. So really any solution we present which:

A) Prevents file system corruption or data leaks
...and...
B) Informs the user that they did something wrong

Should be an acceptable solution.

Because this is slightly new behavior. And because this is gonig to be
specific to DAX (because of the lack of a page cache) we have made the user
"opt in" to this behavior.

The following patches implement the following solution.

1) The user has to opt in to allowing GUP pins on a file with a layout lease
(now made visible).
2) GUP will fail (EPERM) if a layout lease is not taken
3) Any truncate or hole punch operation on a GUP'ed DAX page will fail.
4) The user has the option of holding the layout lease to receive a SIGIO for
notification to the original thread that another thread has tried to delete
their data. Furthermore this indicates that if the user needs to GUP the
file again they will need to retake the Layout lease before doing so.


NOTE: If the user releases the layout lease or if it has been broken by another
operation further GUP operations on the file will fail without re-taking the
lease. This means that if a user would like to register pieces of a file and
continue to register other pieces later they would be advised to keep the
layout lease, get a SIGIO notification, and retake the lease.

NOTE2: Truncation of pages which are not actively pinned will succeed. Similar
to accessing an mmap to this area GUP pins of that memory may fail.


A general overview follows for background.

It should be noted that one solution for this problem is to use RDMA's On
Demand Paging (ODP). There are 2 big reasons this may not work.

1) The hardware being used for RDMA may not support ODP
2) ODP may be detrimental to the over all network (cluster or cloud)
performance

Therefore, in order to support RDMA to File system pages without On Demand
Paging (ODP) a number of things need to be done.

1) GUP "longterm" users need to inform the other subsystems that they have
taken a pin on a page which may remain pinned for a very "long time".[3]

2) Any page which is "controlled" by a file system needs to have special
handling. The details of the handling depends on if the page is page cache
fronted or not.

2a) A page cache fronted page which has been pinned by GUP long term can use a
bounce buffer to allow the file system to write back snap shots of the page.
This is handled by the FS recognizing the GUP long term pin and making a copy
of the page to be written back.
NOTE: this patch set does not address this path.

2b) A FS "controlled" page which is not page cache fronted is either easier
to deal with or harder depending on the operation the filesystem is trying
to do.

2ba) [Hard case] If the FS operation _is_ a truncate or hole punch the
FS can no longer use the pages in question until the pin has been
removed. This patch set presents a solution to this by introducing
some reasonable restrictions on user space applications.

2bb) [Easy case] If the FS operation is _not_ a truncate or hole punch
then there is nothing which need be done. Data is Read or Written
directly to the page. This is an easy case which would currently work
if not for GUP long term pins being disabled. Therefore this patch set
need not change access to the file data but does allow for GUP pins
after 2ba above is dealt with.


This patch series and presents a solution for problem 2ba)

[1] https://github.com/johnhubbard/linux/tree/gup_dma_core

[2] ext4/dev branch:

- https://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4.git/log/?h=dev

Specific patches:

[2a] ext4: wait for outstanding dio during truncate in nojournal mode

- https://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4.git/commit/?h=dev&id=82a25b027ca48d7ef197295846b352345853dfa8

[2b] ext4: do not delete unlinked inode from orphan list on failed truncate

- https://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4.git/commit/?h=dev&id=ee0ed02ca93ef1ecf8963ad96638795d55af2c14

[2c] ext4: gracefully handle ext4_break_layouts() failure during truncate

- https://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4.git/commit/?h=dev&id=b9c1c26739ec2d4b4fb70207a0a9ad6747e43f4c

[3] The definition of long time is debatable but it has been established
that RDMAs use of pages, minutes or hours after the pin is the extreme case
which makes this problem most severe.


Ira Weiny (10):
fs/locks: Add trace_leases_conflict
fs/locks: Export F_LAYOUT lease to user space
mm/gup: Pass flags down to __gup_device_huge* calls
mm/gup: Ensure F_LAYOUT lease is held prior to GUP'ing pages
fs/ext4: Teach ext4 to break layout leases
fs/ext4: Teach dax_layout_busy_page() to operate on a sub-range
fs/ext4: Fail truncate if pages are GUP pinned
fs/xfs: Teach xfs to use new dax_layout_busy_page()
fs/xfs: Fail truncate if pages are GUP pinned
mm/gup: Remove FOLL_LONGTERM DAX exclusion

fs/Kconfig | 1 +
fs/dax.c | 38 ++++++---
fs/ext4/ext4.h | 2 +-
fs/ext4/extents.c | 6 +-
fs/ext4/inode.c | 26 +++++--
fs/locks.c | 97 ++++++++++++++++++++---
fs/xfs/xfs_file.c | 24 ++++--
fs/xfs/xfs_inode.h | 5 +-
fs/xfs/xfs_ioctl.c | 15 +++-
fs/xfs/xfs_iops.c | 14 +++-
fs/xfs/xfs_pnfs.c | 14 ++--
include/linux/dax.h | 9 ++-
include/linux/fs.h | 2 +-
include/linux/mm.h | 2 +
include/trace/events/filelock.h | 35 +++++++++
include/uapi/asm-generic/fcntl.h | 3 +
mm/gup.c | 129 ++++++++++++-------------------
mm/huge_memory.c | 12 +++
18 files changed, 299 insertions(+), 135 deletions(-)

--
2.20.1


2019-06-06 01:47:52

by Ira Weiny

[permalink] [raw]
Subject: [PATCH RFC 06/10] fs/ext4: Teach dax_layout_busy_page() to operate on a sub-range

From: Ira Weiny <[email protected]>

Callers of dax_layout_busy_page() are only rarely operating on the
entire file of concern.

Teach dax_layout_busy_page() to operate on a sub-range of the
address_space provided. Specifying 0 - ULONG_MAX however, will continue
to operate on the "entire file" and XFS is split out to a separate patch
by this method.

This could potentially speed up dax_layout_busy_page() as well.

Signed-off-by: Ira Weiny <[email protected]>
---
fs/dax.c | 15 +++++++++++----
fs/ext4/ext4.h | 2 +-
fs/ext4/extents.c | 6 +++---
fs/ext4/inode.c | 19 ++++++++++++-------
fs/xfs/xfs_file.c | 3 ++-
include/linux/dax.h | 3 ++-
6 files changed, 31 insertions(+), 17 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 29ff3b683657..abd77b184879 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -568,8 +568,11 @@ bool dax_mapping_is_dax(struct address_space *mapping)
EXPORT_SYMBOL_GPL(dax_mapping_is_dax);

/**
- * dax_layout_busy_page - find first pinned page in @mapping
+ * dax_layout_busy_page - find first pinned page in @mapping within
+ * the range @off - @off + @len
* @mapping: address space to scan for a page with ref count > 1
+ * @off: offset to start at
+ * @len: length to scan through
*
* DAX requires ZONE_DEVICE mapped pages. These pages are never
* 'onlined' to the page allocator so they are considered idle when
@@ -582,9 +585,13 @@ EXPORT_SYMBOL_GPL(dax_mapping_is_dax);
* to be able to run unmap_mapping_range() and subsequently not race
* mapping_mapped() becoming true.
*/
-struct page *dax_layout_busy_page(struct address_space *mapping)
+struct page *dax_layout_busy_page(struct address_space *mapping,
+ loff_t off, loff_t len)
{
- XA_STATE(xas, &mapping->i_pages, 0);
+ unsigned long start_idx = off >> PAGE_SHIFT;
+ unsigned long end_idx = (len == ULONG_MAX) ? ULONG_MAX
+ : start_idx + (len >> PAGE_SHIFT);
+ XA_STATE(xas, &mapping->i_pages, start_idx);
void *entry;
unsigned int scanned = 0;
struct page *page = NULL;
@@ -607,7 +614,7 @@ struct page *dax_layout_busy_page(struct address_space *mapping)
unmap_mapping_range(mapping, 0, 0, 1);

xas_lock_irq(&xas);
- xas_for_each(&xas, entry, ULONG_MAX) {
+ xas_for_each(&xas, entry, end_idx) {
if (WARN_ON_ONCE(!xa_is_value(entry)))
continue;
if (unlikely(dax_is_locked(entry)))
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 1cb67859e051..ba5920c21023 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -2530,7 +2530,7 @@ extern int ext4_get_inode_loc(struct inode *, struct ext4_iloc *);
extern int ext4_inode_attach_jinode(struct inode *inode);
extern int ext4_can_truncate(struct inode *inode);
extern int ext4_truncate(struct inode *);
-extern int ext4_break_layouts(struct inode *);
+extern int ext4_break_layouts(struct inode *inode, loff_t offset, loff_t len);
extern int ext4_punch_hole(struct inode *inode, loff_t offset, loff_t length);
extern int ext4_truncate_restart_trans(handle_t *, struct inode *, int nblocks);
extern void ext4_set_inode_flags(struct inode *);
diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index d40ed940001e..9ddb117d8beb 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -4736,7 +4736,7 @@ static long ext4_zero_range(struct file *file, loff_t offset,
*/
down_write(&EXT4_I(inode)->i_mmap_sem);

- ret = ext4_break_layouts(inode);
+ ret = ext4_break_layouts(inode, offset, len);
if (ret) {
up_write(&EXT4_I(inode)->i_mmap_sem);
goto out_mutex;
@@ -5419,7 +5419,7 @@ int ext4_collapse_range(struct inode *inode, loff_t offset, loff_t len)
*/
down_write(&EXT4_I(inode)->i_mmap_sem);

- ret = ext4_break_layouts(inode);
+ ret = ext4_break_layouts(inode, offset, len);
if (ret)
goto out_mmap;

@@ -5572,7 +5572,7 @@ int ext4_insert_range(struct inode *inode, loff_t offset, loff_t len)
*/
down_write(&EXT4_I(inode)->i_mmap_sem);

- ret = ext4_break_layouts(inode);
+ ret = ext4_break_layouts(inode, offset, len);
if (ret)
goto out_mmap;

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index c7c99f51961f..75f543f384e4 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -4232,7 +4232,7 @@ static void ext4_wait_dax_page(struct ext4_inode_info *ei)
down_write(&ei->i_mmap_sem);
}

-int ext4_break_layouts(struct inode *inode)
+int ext4_break_layouts(struct inode *inode, loff_t offset, loff_t len)
{
struct ext4_inode_info *ei = EXT4_I(inode);
struct page *page;
@@ -4246,7 +4246,7 @@ int ext4_break_layouts(struct inode *inode)
break_layout(inode, true);

do {
- page = dax_layout_busy_page(inode->i_mapping);
+ page = dax_layout_busy_page(inode->i_mapping, offset, len);
if (!page)
return 0;

@@ -4333,7 +4333,7 @@ int ext4_punch_hole(struct inode *inode, loff_t offset, loff_t length)
*/
down_write(&EXT4_I(inode)->i_mmap_sem);

- ret = ext4_break_layouts(inode);
+ ret = ext4_break_layouts(inode, offset, length);
if (ret)
goto out_dio;

@@ -5605,10 +5605,15 @@ int ext4_setattr(struct dentry *dentry, struct iattr *attr)

down_write(&EXT4_I(inode)->i_mmap_sem);

- rc = ext4_break_layouts(inode);
- if (rc) {
- up_write(&EXT4_I(inode)->i_mmap_sem);
- return rc;
+ if (shrink) {
+ loff_t off = attr->ia_size;
+ loff_t len = inode->i_size - attr->ia_size;
+
+ rc = ext4_break_layouts(inode, off, len);
+ if (rc) {
+ up_write(&EXT4_I(inode)->i_mmap_sem);
+ return rc;
+ }
}

if (attr->ia_size != inode->i_size) {
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 76748255f843..ebddf911644c 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -746,7 +746,8 @@ xfs_break_dax_layouts(

ASSERT(xfs_isilocked(XFS_I(inode), XFS_MMAPLOCK_EXCL));

- page = dax_layout_busy_page(inode->i_mapping);
+ /* We default to the "whole file" */
+ page = dax_layout_busy_page(inode->i_mapping, 0, ULONG_MAX);
if (!page)
return 0;

diff --git a/include/linux/dax.h b/include/linux/dax.h
index ee6cbd56ddc4..3c3ab8dd76c6 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -107,7 +107,8 @@ int dax_writeback_mapping_range(struct address_space *mapping,
struct block_device *bdev, struct writeback_control *wbc);

bool dax_mapping_is_dax(struct address_space *mapping);
-struct page *dax_layout_busy_page(struct address_space *mapping);
+struct page *dax_layout_busy_page(struct address_space *mapping,
+ loff_t off, loff_t len);
dax_entry_t dax_lock_page(struct page *page);
void dax_unlock_page(struct page *page, dax_entry_t cookie);
#else
--
2.20.1

2019-06-06 10:42:48

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH RFC 00/10] RDMA/FS DAX truncate proposal

On Wed 05-06-19 18:45:33, [email protected] wrote:
> From: Ira Weiny <[email protected]>
>
> ... V1,000,000 ;-)
>
> Pre-requisites:
> John Hubbard's put_user_pages() patch series.[1]
> Jan Kara's ext4_break_layouts() fixes[2]
>
> Based on the feedback from LSFmm and the LWN article which resulted. I've
> decided to take a slightly different tack on this problem.
>
> The real issue is that there is no use case for a user to have RDMA pinn'ed
> memory which is then truncated. So really any solution we present which:
>
> A) Prevents file system corruption or data leaks
> ...and...
> B) Informs the user that they did something wrong
>
> Should be an acceptable solution.
>
> Because this is slightly new behavior. And because this is gonig to be
> specific to DAX (because of the lack of a page cache) we have made the user
> "opt in" to this behavior.
>
> The following patches implement the following solution.
>
> 1) The user has to opt in to allowing GUP pins on a file with a layout lease
> (now made visible).
> 2) GUP will fail (EPERM) if a layout lease is not taken
> 3) Any truncate or hole punch operation on a GUP'ed DAX page will fail.
> 4) The user has the option of holding the layout lease to receive a SIGIO for
> notification to the original thread that another thread has tried to delete
> their data. Furthermore this indicates that if the user needs to GUP the
> file again they will need to retake the Layout lease before doing so.
>
>
> NOTE: If the user releases the layout lease or if it has been broken by
> another operation further GUP operations on the file will fail without
> re-taking the lease. This means that if a user would like to register
> pieces of a file and continue to register other pieces later they would
> be advised to keep the layout lease, get a SIGIO notification, and retake
> the lease.
>
> NOTE2: Truncation of pages which are not actively pinned will succeed.
> Similar to accessing an mmap to this area GUP pins of that memory may
> fail.

So after some through I'm willing accept the fact that pinned DAX pages
will just make truncate / hole punch fail and shove it into a same bucket
of situations like "user can open a file and unlink won't delete it" or
"ETXTBUSY when user is executing a file being truncated". The problem I
have with this proposal is a lack of visibility from sysadmin POV. For
ETXTBUSY or "unlinked but open file" sysadmin can just do lsof, find the
problematic process and kill it. There's nothing like that with your
proposal since currently once you hold page reference, you can unmap the
file, drop layout lease, close the file, and there's no trace that you're
responsible for the pinned page anymore.

So I'd like to actually mandate that you *must* hold the file lease until
you unpin all pages in the given range (not just that you have an option to
hold a lease). And I believe the kernel should actually enforce this. That
way we maintain a sane state that if someone uses a physical location of
logical file offset on disk, he has a layout lease. Also once this is done,
sysadmin has a reasonably easy way to discover run-away RDMA application
and kill it if he wishes so.

The question is on how to exactly enforce that lease is taken until all
pages are unpinned. I belive it could be done by tracking number of
long-term pinned pages within a lease. Gup_longterm could easily increment
the count when verifying the lease exists, gup_longterm users will somehow
need to propagate corresponding 'filp' (struct file pointer) to
put_user_pages_longterm() callsites so that they can look up appropriate
lease to drop reference - probably I'd just transition all gup_longterm()
users to a saner API similar to the one we have in mm/frame_vector.c where
we don't hand out page pointers but an encapsulating structure that does
all the necessary tracking. Removing a lease would need to block until all
pins are released - this is probably the most hairy part since we need to
handle a case if application just closes the file descriptor which would
release the lease but OTOH we need to make sure task exit does not deadlock.
Maybe we could block only on explicit lease unlock and just drop the layout
lease on file close and if there are still pinned pages, send SIGKILL to an
application as a reminder it did something stupid...

What do people think about this?

Honza
>
>
> A general overview follows for background.
>
> It should be noted that one solution for this problem is to use RDMA's On
> Demand Paging (ODP). There are 2 big reasons this may not work.
>
> 1) The hardware being used for RDMA may not support ODP
> 2) ODP may be detrimental to the over all network (cluster or cloud)
> performance
>
> Therefore, in order to support RDMA to File system pages without On Demand
> Paging (ODP) a number of things need to be done.
>
> 1) GUP "longterm" users need to inform the other subsystems that they have
> taken a pin on a page which may remain pinned for a very "long time".[3]
>
> 2) Any page which is "controlled" by a file system needs to have special
> handling. The details of the handling depends on if the page is page cache
> fronted or not.
>
> 2a) A page cache fronted page which has been pinned by GUP long term can use a
> bounce buffer to allow the file system to write back snap shots of the page.
> This is handled by the FS recognizing the GUP long term pin and making a copy
> of the page to be written back.
> NOTE: this patch set does not address this path.
>
> 2b) A FS "controlled" page which is not page cache fronted is either easier
> to deal with or harder depending on the operation the filesystem is trying
> to do.
>
> 2ba) [Hard case] If the FS operation _is_ a truncate or hole punch the
> FS can no longer use the pages in question until the pin has been
> removed. This patch set presents a solution to this by introducing
> some reasonable restrictions on user space applications.
>
> 2bb) [Easy case] If the FS operation is _not_ a truncate or hole punch
> then there is nothing which need be done. Data is Read or Written
> directly to the page. This is an easy case which would currently work
> if not for GUP long term pins being disabled. Therefore this patch set
> need not change access to the file data but does allow for GUP pins
> after 2ba above is dealt with.
>
>
> This patch series and presents a solution for problem 2ba)
>
> [1] https://github.com/johnhubbard/linux/tree/gup_dma_core
>
> [2] ext4/dev branch:
>
> - https://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4.git/log/?h=dev
>
> Specific patches:
>
> [2a] ext4: wait for outstanding dio during truncate in nojournal mode
>
> - https://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4.git/commit/?h=dev&id=82a25b027ca48d7ef197295846b352345853dfa8
>
> [2b] ext4: do not delete unlinked inode from orphan list on failed truncate
>
> - https://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4.git/commit/?h=dev&id=ee0ed02ca93ef1ecf8963ad96638795d55af2c14
>
> [2c] ext4: gracefully handle ext4_break_layouts() failure during truncate
>
> - https://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4.git/commit/?h=dev&id=b9c1c26739ec2d4b4fb70207a0a9ad6747e43f4c
>
> [3] The definition of long time is debatable but it has been established
> that RDMAs use of pages, minutes or hours after the pin is the extreme case
> which makes this problem most severe.
>
>
> Ira Weiny (10):
> fs/locks: Add trace_leases_conflict
> fs/locks: Export F_LAYOUT lease to user space
> mm/gup: Pass flags down to __gup_device_huge* calls
> mm/gup: Ensure F_LAYOUT lease is held prior to GUP'ing pages
> fs/ext4: Teach ext4 to break layout leases
> fs/ext4: Teach dax_layout_busy_page() to operate on a sub-range
> fs/ext4: Fail truncate if pages are GUP pinned
> fs/xfs: Teach xfs to use new dax_layout_busy_page()
> fs/xfs: Fail truncate if pages are GUP pinned
> mm/gup: Remove FOLL_LONGTERM DAX exclusion
>
> fs/Kconfig | 1 +
> fs/dax.c | 38 ++++++---
> fs/ext4/ext4.h | 2 +-
> fs/ext4/extents.c | 6 +-
> fs/ext4/inode.c | 26 +++++--
> fs/locks.c | 97 ++++++++++++++++++++---
> fs/xfs/xfs_file.c | 24 ++++--
> fs/xfs/xfs_inode.h | 5 +-
> fs/xfs/xfs_ioctl.c | 15 +++-
> fs/xfs/xfs_iops.c | 14 +++-
> fs/xfs/xfs_pnfs.c | 14 ++--
> include/linux/dax.h | 9 ++-
> include/linux/fs.h | 2 +-
> include/linux/mm.h | 2 +
> include/trace/events/filelock.h | 35 +++++++++
> include/uapi/asm-generic/fcntl.h | 3 +
> mm/gup.c | 129 ++++++++++++-------------------
> mm/huge_memory.c | 12 +++
> 18 files changed, 299 insertions(+), 135 deletions(-)
>
> --
> 2.20.1
>
--
Jan Kara <[email protected]>
SUSE Labs, CR

2019-06-06 15:37:54

by Dan Williams

[permalink] [raw]
Subject: Re: [PATCH RFC 00/10] RDMA/FS DAX truncate proposal

On Thu, Jun 6, 2019 at 3:42 AM Jan Kara <[email protected]> wrote:
>
> On Wed 05-06-19 18:45:33, [email protected] wrote:
> > From: Ira Weiny <[email protected]>
> >
> > ... V1,000,000 ;-)
> >
> > Pre-requisites:
> > John Hubbard's put_user_pages() patch series.[1]
> > Jan Kara's ext4_break_layouts() fixes[2]
> >
> > Based on the feedback from LSFmm and the LWN article which resulted. I've
> > decided to take a slightly different tack on this problem.
> >
> > The real issue is that there is no use case for a user to have RDMA pinn'ed
> > memory which is then truncated. So really any solution we present which:
> >
> > A) Prevents file system corruption or data leaks
> > ...and...
> > B) Informs the user that they did something wrong
> >
> > Should be an acceptable solution.
> >
> > Because this is slightly new behavior. And because this is gonig to be
> > specific to DAX (because of the lack of a page cache) we have made the user
> > "opt in" to this behavior.
> >
> > The following patches implement the following solution.
> >
> > 1) The user has to opt in to allowing GUP pins on a file with a layout lease
> > (now made visible).
> > 2) GUP will fail (EPERM) if a layout lease is not taken
> > 3) Any truncate or hole punch operation on a GUP'ed DAX page will fail.
> > 4) The user has the option of holding the layout lease to receive a SIGIO for
> > notification to the original thread that another thread has tried to delete
> > their data. Furthermore this indicates that if the user needs to GUP the
> > file again they will need to retake the Layout lease before doing so.
> >
> >
> > NOTE: If the user releases the layout lease or if it has been broken by
> > another operation further GUP operations on the file will fail without
> > re-taking the lease. This means that if a user would like to register
> > pieces of a file and continue to register other pieces later they would
> > be advised to keep the layout lease, get a SIGIO notification, and retake
> > the lease.
> >
> > NOTE2: Truncation of pages which are not actively pinned will succeed.
> > Similar to accessing an mmap to this area GUP pins of that memory may
> > fail.
>
> So after some through I'm willing accept the fact that pinned DAX pages
> will just make truncate / hole punch fail and shove it into a same bucket
> of situations like "user can open a file and unlink won't delete it" or
> "ETXTBUSY when user is executing a file being truncated". The problem I
> have with this proposal is a lack of visibility from sysadmin POV. For
> ETXTBUSY or "unlinked but open file" sysadmin can just do lsof, find the
> problematic process and kill it. There's nothing like that with your
> proposal since currently once you hold page reference, you can unmap the
> file, drop layout lease, close the file, and there's no trace that you're
> responsible for the pinned page anymore.
>
> So I'd like to actually mandate that you *must* hold the file lease until
> you unpin all pages in the given range (not just that you have an option to
> hold a lease). And I believe the kernel should actually enforce this. That
> way we maintain a sane state that if someone uses a physical location of
> logical file offset on disk, he has a layout lease. Also once this is done,
> sysadmin has a reasonably easy way to discover run-away RDMA application
> and kill it if he wishes so.

Yes, this satisfies the primary concern that made me oppose failing
truncate. If the administrator determines that reclaiming capacity is
more important than maintaining active RDMA mappings "lsof + kill" is
a reasonable way to recover. I'd go so far as to say that anything
less is an abdication of the kernel's responsibility as an arbiter of
platform resources.

> The question is on how to exactly enforce that lease is taken until all
> pages are unpinned. I belive it could be done by tracking number of
> long-term pinned pages within a lease. Gup_longterm could easily increment
> the count when verifying the lease exists, gup_longterm users will somehow
> need to propagate corresponding 'filp' (struct file pointer) to
> put_user_pages_longterm() callsites so that they can look up appropriate
> lease to drop reference - probably I'd just transition all gup_longterm()
> users to a saner API similar to the one we have in mm/frame_vector.c where
> we don't hand out page pointers but an encapsulating structure that does
> all the necessary tracking. Removing a lease would need to block until all
> pins are released - this is probably the most hairy part since we need to
> handle a case if application just closes the file descriptor which would
> release the lease but OTOH we need to make sure task exit does not deadlock.
> Maybe we could block only on explicit lease unlock and just drop the layout
> lease on file close and if there are still pinned pages, send SIGKILL to an
> application as a reminder it did something stupid...
>
> What do people think about this?

SIGKILL on close() without explicit unlock and wait-on-last-pin with
explicit unlock sounds reasonable to me.

2019-06-06 18:53:15

by Ira Weiny

[permalink] [raw]
Subject: Re: [PATCH RFC 00/10] RDMA/FS DAX truncate proposal

On Wed, Jun 05, 2019 at 10:52:12PM -0700, John Hubbard wrote:
> On 6/5/19 6:45 PM, [email protected] wrote:
> > From: Ira Weiny <[email protected]>
> >
> > ... V1,000,000 ;-)
> >
> > Pre-requisites:
> > John Hubbard's put_user_pages() patch series.[1]
> > Jan Kara's ext4_break_layouts() fixes[2]
> >
> > Based on the feedback from LSFmm and the LWN article which resulted. I've
> > decided to take a slightly different tack on this problem.
> >
> > The real issue is that there is no use case for a user to have RDMA pinn'ed
> > memory which is then truncated. So really any solution we present which:
> >
> > A) Prevents file system corruption or data leaks
> > ...and...
> > B) Informs the user that they did something wrong
> >
> > Should be an acceptable solution.
> >
> > Because this is slightly new behavior. And because this is gonig to be
> > specific to DAX (because of the lack of a page cache) we have made the user
> > "opt in" to this behavior.
> >
> > The following patches implement the following solution.
> >
> > 1) The user has to opt in to allowing GUP pins on a file with a layout lease
> > (now made visible).
> > 2) GUP will fail (EPERM) if a layout lease is not taken
> > 3) Any truncate or hole punch operation on a GUP'ed DAX page will fail.
> > 4) The user has the option of holding the layout lease to receive a SIGIO for
> > notification to the original thread that another thread has tried to delete
> > their data. Furthermore this indicates that if the user needs to GUP the
> > file again they will need to retake the Layout lease before doing so.
> >
> >
> > NOTE: If the user releases the layout lease or if it has been broken by another
> > operation further GUP operations on the file will fail without re-taking the
> > lease. This means that if a user would like to register pieces of a file and
> > continue to register other pieces later they would be advised to keep the
> > layout lease, get a SIGIO notification, and retake the lease.
> >
> > NOTE2: Truncation of pages which are not actively pinned will succeed. Similar
> > to accessing an mmap to this area GUP pins of that memory may fail.
> >
>
> Hi Ira,
>
> Wow, great to see this. This looks like basically the right behavior, IMHO.
>
> 1. We'll need man page additions, to explain it. In fact, even after a quick first
> pass through, I'm vague on two points:

Of course. But I was not going to go through and attempt to write man pages
and other docs without some agreement on the final mechanisms. This works
which was the basic requirement I had to send an RFC. :-D But yes man pages
and updates to headers etc all have to be done.

>
> a) I'm not sure how this actually provides "opt-in to new behavior", because I
> don't see any CONFIG_* or boot time choices, and it looks like the new behavior
> just is there. That is, if user space doesn't set F_LAYOUT on a range,
> GUP FOLL_LONGTERM will now fail, which is new behavior. (Did I get that right?)

The opt in is at run time. Currently GUP FOLL_LONGTERM is _not_ _allowed_ on
the FS DAX pages at all. So the default behavior is the same, GUP fails. (Or
specifically ibv_reg_mr() fails. This fails as before, not change there.

The Opt in is that if a user knows what is involved they can take the lease and
the GUP will not fail. This comes with the price of knowing that other
processes can't truncate those pages in use.

>
> b) Truncate and hole punch behavior, with and without user space having a SIGIO
> handler. (I'm sure this is obvious after another look through, but it might go
> nicely in a man page.)

Sorry this was not clear. There are 2 points for this patch set which requires
the use of catching SIGIO.

1) If an application _actually_ does (somehow, somewhere, in some unforseen use
case) want to allow a truncate to happen. They can catch the SIGIO, finish
their use of the pages, and release them. As long as they can do this within
the <sysfs>/lease-time-break time they are ok and the truncate can proceed.

2) This is a bit more subtle and something I almost delayed sending these out
for. Currently the implementation of a lease break actually removes the
lease from the file. I did not want this to happen and I was thinking of
delaying this patch set to implement something which keeps the lease around
but I figured I should get something out for comments. Jan has proposed
something along these lines and I agree with him so I'm going to ask you to
read my response to him about the details.

Anyway so the key here is that currently an app needs the SIGIO to retake
the lease if they want to map the file again or in parts based on usage.
For example, they may only want to map some of the file for when they are
using it and then map another part later. Without the SIGIO they would lose
their lease or would have to just take the lease for each GUP pin (which
adds overhead). Like I said I did not like this but I left it to get
something which works out.

>
> 2. It *seems* like ext4, xfs are taken care of here, not just for the DAX case,
> but for general RDMA on them? Or is there more that must be done?

This is limited to DAX. All the functionality is limited to *_devmap or "is
DAX" cases. I'm still thinking that page cache backed files can have a better
solution for the user.

>
> 3. Christophe Hellwig's unified gup patchset wreaks havoc in gup.c, and will
> conflict violently, as I'm sure you noticed. :)

Yep... But I needed to get the conversation started on this idea.

Thanks for the feedback!
Ira

>
>
> thanks,
> --
> John Hubbard
> NVIDIA
>

2019-06-06 20:18:23

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH RFC 00/10] RDMA/FS DAX truncate proposal

On Thu, Jun 06, 2019 at 12:42:03PM +0200, Jan Kara wrote:

> So I'd like to actually mandate that you *must* hold the file lease until
> you unpin all pages in the given range (not just that you have an option to
> hold a lease). And I believe the kernel should actually enforce this. That
> way we maintain a sane state that if someone uses a physical location of
> logical file offset on disk, he has a layout lease. Also once this is done,
> sysadmin has a reasonably easy way to discover run-away RDMA application
> and kill it if he wishes so.
>
> The question is on how to exactly enforce that lease is taken until all
> pages are unpinned. I belive it could be done by tracking number of
> long-term pinned pages within a lease. Gup_longterm could easily increment
> the count when verifying the lease exists, gup_longterm users will somehow
> need to propagate corresponding 'filp' (struct file pointer) to
> put_user_pages_longterm() callsites so that they can look up appropriate
> lease to drop reference - probably I'd just transition all gup_longterm()
> users to a saner API similar to the one we have in mm/frame_vector.c where
> we don't hand out page pointers but an encapsulating structure that does
> all the necessary tracking. Removing a lease would need to block until all
> pins are released - this is probably the most hairy part since we need to
> handle a case if application just closes the file descriptor which
> would

I think if you are going to do this then the 'struct filp' that
represents the lease should be held in the kernel (ie inside the RDMA
umem) until the kernel is done with it.

Actually does someone have a pointer to this userspace lease API, I'm
not at all familiar with it, thanks

And yes, a better output format from GUP would be great..

> Maybe we could block only on explicit lease unlock and just drop the layout
> lease on file close and if there are still pinned pages, send SIGKILL to an
> application as a reminder it did something stupid...

Which process would you SIGKILL? At least for the rdma case a FD is
holding the GUP, so to do the put_user_pages() the kernel needs to
close the FD. I guess it would have to kill every process that has the
FD open? Seems complicated...

Regards,
Jason

2019-06-06 20:18:56

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH RFC 00/10] RDMA/FS DAX truncate proposal

On Thu, Jun 06, 2019 at 10:11:58AM -0700, Ira Weiny wrote:

> 2) This is a bit more subtle and something I almost delayed sending these out
> for. Currently the implementation of a lease break actually removes the
> lease from the file. I did not want this to happen and I was thinking of
> delaying this patch set to implement something which keeps the lease around
> but I figured I should get something out for comments. Jan has proposed
> something along these lines and I agree with him so I'm going to ask you to
> read my response to him about the details.
>
>
> Anyway so the key here is that currently an app needs the SIGIO to retake
> the lease if they want to map the file again or in parts based on usage.
> For example, they may only want to map some of the file for when they are
> using it and then map another part later. Without the SIGIO they would lose
> their lease or would have to just take the lease for each GUP pin (which
> adds overhead). Like I said I did not like this but I left it to get
> something which works out.

So to be clear..

Even though the lease is broken the GUP remains, the pages remain
pined, and truncate/etc continues to fail?

I like Jan's take on this actually.. see other email.

Jason

2019-06-06 22:39:50

by Ira Weiny

[permalink] [raw]
Subject: Re: [PATCH RFC 00/10] RDMA/FS DAX truncate proposal

On Thu, Jun 06, 2019 at 12:42:03PM +0200, Jan Kara wrote:
> On Wed 05-06-19 18:45:33, [email protected] wrote:
> > From: Ira Weiny <[email protected]>
> >
> > ... V1,000,000 ;-)
> >
> > Pre-requisites:
> > John Hubbard's put_user_pages() patch series.[1]
> > Jan Kara's ext4_break_layouts() fixes[2]
> >
> > Based on the feedback from LSFmm and the LWN article which resulted. I've
> > decided to take a slightly different tack on this problem.
> >
> > The real issue is that there is no use case for a user to have RDMA pinn'ed
> > memory which is then truncated. So really any solution we present which:
> >
> > A) Prevents file system corruption or data leaks
> > ...and...
> > B) Informs the user that they did something wrong
> >
> > Should be an acceptable solution.
> >
> > Because this is slightly new behavior. And because this is gonig to be
> > specific to DAX (because of the lack of a page cache) we have made the user
> > "opt in" to this behavior.
> >
> > The following patches implement the following solution.
> >
> > 1) The user has to opt in to allowing GUP pins on a file with a layout lease
> > (now made visible).
> > 2) GUP will fail (EPERM) if a layout lease is not taken
> > 3) Any truncate or hole punch operation on a GUP'ed DAX page will fail.
> > 4) The user has the option of holding the layout lease to receive a SIGIO for
> > notification to the original thread that another thread has tried to delete
> > their data. Furthermore this indicates that if the user needs to GUP the
> > file again they will need to retake the Layout lease before doing so.
> >
> >
> > NOTE: If the user releases the layout lease or if it has been broken by
> > another operation further GUP operations on the file will fail without
> > re-taking the lease. This means that if a user would like to register
> > pieces of a file and continue to register other pieces later they would
> > be advised to keep the layout lease, get a SIGIO notification, and retake
> > the lease.
> >
> > NOTE2: Truncation of pages which are not actively pinned will succeed.
> > Similar to accessing an mmap to this area GUP pins of that memory may
> > fail.
>
> So after some through I'm willing accept the fact that pinned DAX pages
> will just make truncate / hole punch fail and shove it into a same bucket
> of situations like "user can open a file and unlink won't delete it" or
> "ETXTBUSY when user is executing a file being truncated". The problem I
> have with this proposal is a lack of visibility from sysadmin POV. For
> ETXTBUSY or "unlinked but open file" sysadmin can just do lsof, find the
> problematic process and kill it. There's nothing like that with your
> proposal since currently once you hold page reference, you can unmap the
> file, drop layout lease, close the file, and there's no trace that you're
> responsible for the pinned page anymore.

Agreed. For some "GUP interfaces" one may be able to figure this out but I'm
not familiar with any. For RDMA there has been some additions for tracking
resources but I don't think any of that is useful here. Regardless from a FS
POV this is awkward to have to understand all the independent interfaces, so I
agree.

>
> So I'd like to actually mandate that you *must* hold the file lease until
> you unpin all pages in the given range (not just that you have an option to
> hold a lease). And I believe the kernel should actually enforce this. That
> way we maintain a sane state that if someone uses a physical location of
> logical file offset on disk, he has a layout lease. Also once this is done,
> sysadmin has a reasonably easy way to discover run-away RDMA application
> and kill it if he wishes so.

Fair enough.

I was kind of heading that direction but had not thought this far forward. I
was exploring how to have a lease remain on the file even after a "lease
break". But that is incompatible with the current semantics of a "layout"
lease (as currently defined in the kernel). [In the end I wanted to get an RFC
out to see what people think of this idea so I did not look at keeping the
lease.]

Also hitch is that currently a lease is forcefully broken after
<sysfs>/lease-break-time. To do what you suggest I think we would need a new
lease type with the semantics you describe.

Previously I had thought this would be a good idea (for other reasons). But
what does everyone think about using a "longterm lease" similar to [1] which
has the semantics you proppose? In [1] I was not sure "longterm" was a good
name but with your proposal I think it makes more sense.

>
> The question is on how to exactly enforce that lease is taken until all
> pages are unpinned. I belive it could be done by tracking number of
> long-term pinned pages within a lease. Gup_longterm could easily increment
> the count when verifying the lease exists, gup_longterm users will somehow
> need to propagate corresponding 'filp' (struct file pointer) to
> put_user_pages_longterm() callsites so that they can look up appropriate
> lease to drop reference

I actually think that might be pretty easy. I actually added a ref count to
the longterm lease before.[2] This was done to be able to take the lease
within the GUP code. We don't need that functionality exactly but that patch
implements some of what you propose. With a ref count on the lease we can
refuse to release it until all GUP users have released it.

>
> - probably I'd just transition all gup_longterm()
> users to a saner API similar to the one we have in mm/frame_vector.c where
> we don't hand out page pointers but an encapsulating structure that does
> all the necessary tracking.

I'll take a look at that code. But that seems like a pretty big change.

>
> Removing a lease would need to block until all
> pins are released - this is probably the most hairy part since we need to
> handle a case if application just closes the file descriptor which would
> release the lease but OTOH we need to make sure task exit does not deadlock.
> Maybe we could block only on explicit lease unlock and just drop the layout
> lease on file close and if there are still pinned pages, send SIGKILL to an
> application as a reminder it did something stupid...

As presented at LSFmm I'm not opposed to killing a process which does not
"follow the rules". But I'm concerned about how to handle this across a fork.

Limiting the open()/LEASE/GUP/close()/SIGKILL to a specific pid "leak"'s pins
to a child through the RDMA context. This was the major issue Jason had with
the SIGBUS proposal.

Always sending a SIGKILL would prevent an RDMA process from doing something
like system("ls") (would kill the child unnecessarily). Are we ok with that?

>
> What do people think about this?

But generally I like the idea of the leases being sticky. Not sure about the
SIGKILL.

Thanks for the review,
Ira

[1] https://patchwork.kernel.org/patch/10921171/
[2] https://patchwork.kernel.org/patch/10921177/

>
> Honza
> >
> >
> > A general overview follows for background.
> >
> > It should be noted that one solution for this problem is to use RDMA's On
> > Demand Paging (ODP). There are 2 big reasons this may not work.
> >
> > 1) The hardware being used for RDMA may not support ODP
> > 2) ODP may be detrimental to the over all network (cluster or cloud)
> > performance
> >
> > Therefore, in order to support RDMA to File system pages without On Demand
> > Paging (ODP) a number of things need to be done.
> >
> > 1) GUP "longterm" users need to inform the other subsystems that they have
> > taken a pin on a page which may remain pinned for a very "long time".[3]
> >
> > 2) Any page which is "controlled" by a file system needs to have special
> > handling. The details of the handling depends on if the page is page cache
> > fronted or not.
> >
> > 2a) A page cache fronted page which has been pinned by GUP long term can use a
> > bounce buffer to allow the file system to write back snap shots of the page.
> > This is handled by the FS recognizing the GUP long term pin and making a copy
> > of the page to be written back.
> > NOTE: this patch set does not address this path.
> >
> > 2b) A FS "controlled" page which is not page cache fronted is either easier
> > to deal with or harder depending on the operation the filesystem is trying
> > to do.
> >
> > 2ba) [Hard case] If the FS operation _is_ a truncate or hole punch the
> > FS can no longer use the pages in question until the pin has been
> > removed. This patch set presents a solution to this by introducing
> > some reasonable restrictions on user space applications.
> >
> > 2bb) [Easy case] If the FS operation is _not_ a truncate or hole punch
> > then there is nothing which need be done. Data is Read or Written
> > directly to the page. This is an easy case which would currently work
> > if not for GUP long term pins being disabled. Therefore this patch set
> > need not change access to the file data but does allow for GUP pins
> > after 2ba above is dealt with.
> >
> >
> > This patch series and presents a solution for problem 2ba)
> >
> > [1] https://github.com/johnhubbard/linux/tree/gup_dma_core
> >
> > [2] ext4/dev branch:
> >
> > - https://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4.git/log/?h=dev
> >
> > Specific patches:
> >
> > [2a] ext4: wait for outstanding dio during truncate in nojournal mode
> >
> > - https://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4.git/commit/?h=dev&id=82a25b027ca48d7ef197295846b352345853dfa8
> >
> > [2b] ext4: do not delete unlinked inode from orphan list on failed truncate
> >
> > - https://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4.git/commit/?h=dev&id=ee0ed02ca93ef1ecf8963ad96638795d55af2c14
> >
> > [2c] ext4: gracefully handle ext4_break_layouts() failure during truncate
> >
> > - https://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4.git/commit/?h=dev&id=b9c1c26739ec2d4b4fb70207a0a9ad6747e43f4c
> >
> > [3] The definition of long time is debatable but it has been established
> > that RDMAs use of pages, minutes or hours after the pin is the extreme case
> > which makes this problem most severe.
> >
> >
> > Ira Weiny (10):
> > fs/locks: Add trace_leases_conflict
> > fs/locks: Export F_LAYOUT lease to user space
> > mm/gup: Pass flags down to __gup_device_huge* calls
> > mm/gup: Ensure F_LAYOUT lease is held prior to GUP'ing pages
> > fs/ext4: Teach ext4 to break layout leases
> > fs/ext4: Teach dax_layout_busy_page() to operate on a sub-range
> > fs/ext4: Fail truncate if pages are GUP pinned
> > fs/xfs: Teach xfs to use new dax_layout_busy_page()
> > fs/xfs: Fail truncate if pages are GUP pinned
> > mm/gup: Remove FOLL_LONGTERM DAX exclusion
> >
> > fs/Kconfig | 1 +
> > fs/dax.c | 38 ++++++---
> > fs/ext4/ext4.h | 2 +-
> > fs/ext4/extents.c | 6 +-
> > fs/ext4/inode.c | 26 +++++--
> > fs/locks.c | 97 ++++++++++++++++++++---
> > fs/xfs/xfs_file.c | 24 ++++--
> > fs/xfs/xfs_inode.h | 5 +-
> > fs/xfs/xfs_ioctl.c | 15 +++-
> > fs/xfs/xfs_iops.c | 14 +++-
> > fs/xfs/xfs_pnfs.c | 14 ++--
> > include/linux/dax.h | 9 ++-
> > include/linux/fs.h | 2 +-
> > include/linux/mm.h | 2 +
> > include/trace/events/filelock.h | 35 +++++++++
> > include/uapi/asm-generic/fcntl.h | 3 +
> > mm/gup.c | 129 ++++++++++++-------------------
> > mm/huge_memory.c | 12 +++
> > 18 files changed, 299 insertions(+), 135 deletions(-)
> >
> > --
> > 2.20.1
> >
> --
> Jan Kara <[email protected]>
> SUSE Labs, CR

2019-06-06 22:41:06

by Ira Weiny

[permalink] [raw]
Subject: Re: [PATCH RFC 00/10] RDMA/FS DAX truncate proposal

On Thu, Jun 06, 2019 at 04:51:15PM -0300, Jason Gunthorpe wrote:
> On Thu, Jun 06, 2019 at 12:42:03PM +0200, Jan Kara wrote:
>
> > So I'd like to actually mandate that you *must* hold the file lease until
> > you unpin all pages in the given range (not just that you have an option to
> > hold a lease). And I believe the kernel should actually enforce this. That
> > way we maintain a sane state that if someone uses a physical location of
> > logical file offset on disk, he has a layout lease. Also once this is done,
> > sysadmin has a reasonably easy way to discover run-away RDMA application
> > and kill it if he wishes so.
> >
> > The question is on how to exactly enforce that lease is taken until all
> > pages are unpinned. I belive it could be done by tracking number of
> > long-term pinned pages within a lease. Gup_longterm could easily increment
> > the count when verifying the lease exists, gup_longterm users will somehow
> > need to propagate corresponding 'filp' (struct file pointer) to
> > put_user_pages_longterm() callsites so that they can look up appropriate
> > lease to drop reference - probably I'd just transition all gup_longterm()
> > users to a saner API similar to the one we have in mm/frame_vector.c where
> > we don't hand out page pointers but an encapsulating structure that does
> > all the necessary tracking. Removing a lease would need to block until all
> > pins are released - this is probably the most hairy part since we need to
> > handle a case if application just closes the file descriptor which
> > would
>
> I think if you are going to do this then the 'struct filp' that
> represents the lease should be held in the kernel (ie inside the RDMA
> umem) until the kernel is done with it.

Yea there seems merit to this. I'm still not resolving how this helps track
who has the pin across a fork.

>
> Actually does someone have a pointer to this userspace lease API, I'm
> not at all familiar with it, thanks

man fcntl
search for SETLEASE

But I had to add the F_LAYOUT lease type. (Personally I'm for calling it
F_LONGTERM at this point. I don't think LAYOUT is compatible with what we are
proposing here.)

Anyway, yea would be a libc change at lease for man page etc... But again I
want to get some buy in before going through all that.

>
> And yes, a better output format from GUP would be great..
>
> > Maybe we could block only on explicit lease unlock and just drop the layout
> > lease on file close and if there are still pinned pages, send SIGKILL to an
> > application as a reminder it did something stupid...
>
> Which process would you SIGKILL? At least for the rdma case a FD is
> holding the GUP, so to do the put_user_pages() the kernel needs to
> close the FD. I guess it would have to kill every process that has the
> FD open? Seems complicated...

Tending to agree... But I'm still not opposed to killing bad actors... ;-)

NOTE: Jason I think you need to be more clear about the FD you are speaking of.
I believe you mean the FD which refers to the RMDA context. That is what I
called it in my other email.

Ira

>
> Regards,
> Jason

2019-06-06 22:42:37

by Ira Weiny

[permalink] [raw]
Subject: Re: [PATCH RFC 00/10] RDMA/FS DAX truncate proposal

On Thu, Jun 06, 2019 at 03:03:30PM -0700, 'Ira Weiny' wrote:
> On Thu, Jun 06, 2019 at 12:42:03PM +0200, Jan Kara wrote:
> > On Wed 05-06-19 18:45:33, [email protected] wrote:
> > > From: Ira Weiny <[email protected]>
> > >
> > > ... V1,000,000 ;-)
> > >
> > > Pre-requisites:
> > > John Hubbard's put_user_pages() patch series.[1]
> > > Jan Kara's ext4_break_layouts() fixes[2]
> > >
> > > Based on the feedback from LSFmm and the LWN article which resulted. I've
> > > decided to take a slightly different tack on this problem.
> > >
> > > The real issue is that there is no use case for a user to have RDMA pinn'ed
> > > memory which is then truncated. So really any solution we present which:
> > >
> > > A) Prevents file system corruption or data leaks
> > > ...and...
> > > B) Informs the user that they did something wrong
> > >
> > > Should be an acceptable solution.
> > >
> > > Because this is slightly new behavior. And because this is gonig to be
> > > specific to DAX (because of the lack of a page cache) we have made the user
> > > "opt in" to this behavior.
> > >
> > > The following patches implement the following solution.
> > >
> > > 1) The user has to opt in to allowing GUP pins on a file with a layout lease
> > > (now made visible).
> > > 2) GUP will fail (EPERM) if a layout lease is not taken
> > > 3) Any truncate or hole punch operation on a GUP'ed DAX page will fail.
> > > 4) The user has the option of holding the layout lease to receive a SIGIO for
> > > notification to the original thread that another thread has tried to delete
> > > their data. Furthermore this indicates that if the user needs to GUP the
> > > file again they will need to retake the Layout lease before doing so.
> > >
> > >
> > > NOTE: If the user releases the layout lease or if it has been broken by
> > > another operation further GUP operations on the file will fail without
> > > re-taking the lease. This means that if a user would like to register
> > > pieces of a file and continue to register other pieces later they would
> > > be advised to keep the layout lease, get a SIGIO notification, and retake
> > > the lease.
> > >
> > > NOTE2: Truncation of pages which are not actively pinned will succeed.
> > > Similar to accessing an mmap to this area GUP pins of that memory may
> > > fail.
> >
> > So after some through I'm willing accept the fact that pinned DAX pages
> > will just make truncate / hole punch fail and shove it into a same bucket
> > of situations like "user can open a file and unlink won't delete it" or
> > "ETXTBUSY when user is executing a file being truncated". The problem I
> > have with this proposal is a lack of visibility from sysadmin POV. For
> > ETXTBUSY or "unlinked but open file" sysadmin can just do lsof, find the
> > problematic process and kill it. There's nothing like that with your
> > proposal since currently once you hold page reference, you can unmap the
> > file, drop layout lease, close the file, and there's no trace that you're
> > responsible for the pinned page anymore.
>
> Agreed. For some "GUP interfaces" one may be able to figure this out but I'm
> not familiar with any. For RDMA there has been some additions for tracking
> resources but I don't think any of that is useful here. Regardless from a FS
> POV this is awkward to have to understand all the independent interfaces, so I
> agree.
>
> >
> > So I'd like to actually mandate that you *must* hold the file lease until
> > you unpin all pages in the given range (not just that you have an option to
> > hold a lease). And I believe the kernel should actually enforce this. That
> > way we maintain a sane state that if someone uses a physical location of
> > logical file offset on disk, he has a layout lease. Also once this is done,
> > sysadmin has a reasonably easy way to discover run-away RDMA application
> > and kill it if he wishes so.
>
> Fair enough.
>
> I was kind of heading that direction but had not thought this far forward. I
> was exploring how to have a lease remain on the file even after a "lease
> break". But that is incompatible with the current semantics of a "layout"
> lease (as currently defined in the kernel). [In the end I wanted to get an RFC
> out to see what people think of this idea so I did not look at keeping the
> lease.]
>
> Also hitch is that currently a lease is forcefully broken after
> <sysfs>/lease-break-time. To do what you suggest I think we would need a new
> lease type with the semantics you describe.
>
> Previously I had thought this would be a good idea (for other reasons). But
> what does everyone think about using a "longterm lease" similar to [1] which
> has the semantics you proppose? In [1] I was not sure "longterm" was a good
> name but with your proposal I think it makes more sense.
>
> >
> > The question is on how to exactly enforce that lease is taken until all
> > pages are unpinned. I belive it could be done by tracking number of
> > long-term pinned pages within a lease. Gup_longterm could easily increment
> > the count when verifying the lease exists, gup_longterm users will somehow
> > need to propagate corresponding 'filp' (struct file pointer) to
> > put_user_pages_longterm() callsites so that they can look up appropriate
> > lease to drop reference
>
> I actually think that might be pretty easy. I actually added a ref count to
> the longterm lease before.[2] This was done to be able to take the lease
> within the GUP code. We don't need that functionality exactly but that patch
> implements some of what you propose. With a ref count on the lease we can
> refuse to release it until all GUP users have released it.
>
> >
> > - probably I'd just transition all gup_longterm()
> > users to a saner API similar to the one we have in mm/frame_vector.c where
> > we don't hand out page pointers but an encapsulating structure that does
> > all the necessary tracking.
>
> I'll take a look at that code. But that seems like a pretty big change.
>
> >
> > Removing a lease would need to block until all
> > pins are released - this is probably the most hairy part since we need to
> > handle a case if application just closes the file descriptor which would
> > release the lease but OTOH we need to make sure task exit does not deadlock.
> > Maybe we could block only on explicit lease unlock and just drop the layout
> > lease on file close and if there are still pinned pages, send SIGKILL to an
> > application as a reminder it did something stupid...
>
> As presented at LSFmm I'm not opposed to killing a process which does not
> "follow the rules". But I'm concerned about how to handle this across a fork.
>
> Limiting the open()/LEASE/GUP/close()/SIGKILL to a specific pid "leak"'s pins
> to a child through the RDMA context. This was the major issue Jason had with
> the SIGBUS proposal.
>
> Always sending a SIGKILL would prevent an RDMA process from doing something
> like system("ls") (would kill the child unnecessarily). Are we ok with that?

I might be wrong here. My memory said it closed all fd's but I'm not finding
any documentation of that. Perhaps we could say that the child would be
required to keep the fd open as well?

Ira

>
> >
> > What do people think about this?
>
> But generally I like the idea of the leases being sticky. Not sure about the
> SIGKILL.
>
> Thanks for the review,
> Ira
>
> [1] https://patchwork.kernel.org/patch/10921171/
> [2] https://patchwork.kernel.org/patch/10921177/
>
> >
> > Honza
> > >
> > >
> > > A general overview follows for background.
> > >
> > > It should be noted that one solution for this problem is to use RDMA's On
> > > Demand Paging (ODP). There are 2 big reasons this may not work.
> > >
> > > 1) The hardware being used for RDMA may not support ODP
> > > 2) ODP may be detrimental to the over all network (cluster or cloud)
> > > performance
> > >
> > > Therefore, in order to support RDMA to File system pages without On Demand
> > > Paging (ODP) a number of things need to be done.
> > >
> > > 1) GUP "longterm" users need to inform the other subsystems that they have
> > > taken a pin on a page which may remain pinned for a very "long time".[3]
> > >
> > > 2) Any page which is "controlled" by a file system needs to have special
> > > handling. The details of the handling depends on if the page is page cache
> > > fronted or not.
> > >
> > > 2a) A page cache fronted page which has been pinned by GUP long term can use a
> > > bounce buffer to allow the file system to write back snap shots of the page.
> > > This is handled by the FS recognizing the GUP long term pin and making a copy
> > > of the page to be written back.
> > > NOTE: this patch set does not address this path.
> > >
> > > 2b) A FS "controlled" page which is not page cache fronted is either easier
> > > to deal with or harder depending on the operation the filesystem is trying
> > > to do.
> > >
> > > 2ba) [Hard case] If the FS operation _is_ a truncate or hole punch the
> > > FS can no longer use the pages in question until the pin has been
> > > removed. This patch set presents a solution to this by introducing
> > > some reasonable restrictions on user space applications.
> > >
> > > 2bb) [Easy case] If the FS operation is _not_ a truncate or hole punch
> > > then there is nothing which need be done. Data is Read or Written
> > > directly to the page. This is an easy case which would currently work
> > > if not for GUP long term pins being disabled. Therefore this patch set
> > > need not change access to the file data but does allow for GUP pins
> > > after 2ba above is dealt with.
> > >
> > >
> > > This patch series and presents a solution for problem 2ba)
> > >
> > > [1] https://github.com/johnhubbard/linux/tree/gup_dma_core
> > >
> > > [2] ext4/dev branch:
> > >
> > > - https://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4.git/log/?h=dev
> > >
> > > Specific patches:
> > >
> > > [2a] ext4: wait for outstanding dio during truncate in nojournal mode
> > >
> > > - https://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4.git/commit/?h=dev&id=82a25b027ca48d7ef197295846b352345853dfa8
> > >
> > > [2b] ext4: do not delete unlinked inode from orphan list on failed truncate
> > >
> > > - https://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4.git/commit/?h=dev&id=ee0ed02ca93ef1ecf8963ad96638795d55af2c14
> > >
> > > [2c] ext4: gracefully handle ext4_break_layouts() failure during truncate
> > >
> > > - https://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4.git/commit/?h=dev&id=b9c1c26739ec2d4b4fb70207a0a9ad6747e43f4c
> > >
> > > [3] The definition of long time is debatable but it has been established
> > > that RDMAs use of pages, minutes or hours after the pin is the extreme case
> > > which makes this problem most severe.
> > >
> > >
> > > Ira Weiny (10):
> > > fs/locks: Add trace_leases_conflict
> > > fs/locks: Export F_LAYOUT lease to user space
> > > mm/gup: Pass flags down to __gup_device_huge* calls
> > > mm/gup: Ensure F_LAYOUT lease is held prior to GUP'ing pages
> > > fs/ext4: Teach ext4 to break layout leases
> > > fs/ext4: Teach dax_layout_busy_page() to operate on a sub-range
> > > fs/ext4: Fail truncate if pages are GUP pinned
> > > fs/xfs: Teach xfs to use new dax_layout_busy_page()
> > > fs/xfs: Fail truncate if pages are GUP pinned
> > > mm/gup: Remove FOLL_LONGTERM DAX exclusion
> > >
> > > fs/Kconfig | 1 +
> > > fs/dax.c | 38 ++++++---
> > > fs/ext4/ext4.h | 2 +-
> > > fs/ext4/extents.c | 6 +-
> > > fs/ext4/inode.c | 26 +++++--
> > > fs/locks.c | 97 ++++++++++++++++++++---
> > > fs/xfs/xfs_file.c | 24 ++++--
> > > fs/xfs/xfs_inode.h | 5 +-
> > > fs/xfs/xfs_ioctl.c | 15 +++-
> > > fs/xfs/xfs_iops.c | 14 +++-
> > > fs/xfs/xfs_pnfs.c | 14 ++--
> > > include/linux/dax.h | 9 ++-
> > > include/linux/fs.h | 2 +-
> > > include/linux/mm.h | 2 +
> > > include/trace/events/filelock.h | 35 +++++++++
> > > include/uapi/asm-generic/fcntl.h | 3 +
> > > mm/gup.c | 129 ++++++++++++-------------------
> > > mm/huge_memory.c | 12 +++
> > > 18 files changed, 299 insertions(+), 135 deletions(-)
> > >
> > > --
> > > 2.20.1
> > >
> > --
> > Jan Kara <[email protected]>
> > SUSE Labs, CR
>

2019-06-07 11:04:53

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH RFC 00/10] RDMA/FS DAX truncate proposal

On Thu 06-06-19 15:03:30, Ira Weiny wrote:
> On Thu, Jun 06, 2019 at 12:42:03PM +0200, Jan Kara wrote:
> > On Wed 05-06-19 18:45:33, [email protected] wrote:
> > > From: Ira Weiny <[email protected]>
> >
> > So I'd like to actually mandate that you *must* hold the file lease until
> > you unpin all pages in the given range (not just that you have an option to
> > hold a lease). And I believe the kernel should actually enforce this. That
> > way we maintain a sane state that if someone uses a physical location of
> > logical file offset on disk, he has a layout lease. Also once this is done,
> > sysadmin has a reasonably easy way to discover run-away RDMA application
> > and kill it if he wishes so.
>
> Fair enough.
>
> I was kind of heading that direction but had not thought this far forward. I
> was exploring how to have a lease remain on the file even after a "lease
> break". But that is incompatible with the current semantics of a "layout"
> lease (as currently defined in the kernel). [In the end I wanted to get an RFC
> out to see what people think of this idea so I did not look at keeping the
> lease.]
>
> Also hitch is that currently a lease is forcefully broken after
> <sysfs>/lease-break-time. To do what you suggest I think we would need a new
> lease type with the semantics you describe.

I'd do what Dave suggested - add flag to mark lease as unbreakable by
truncate and teach file locking core to handle that. There actually is
support for locks that are not broken after given timeout so there
shouldn't be too many changes need.

> Previously I had thought this would be a good idea (for other reasons). But
> what does everyone think about using a "longterm lease" similar to [1] which
> has the semantics you proppose? In [1] I was not sure "longterm" was a good
> name but with your proposal I think it makes more sense.

As I wrote elsewhere in this thread I think FL_LAYOUT name still makes
sense and I'd add there FL_UNBREAKABLE to mark unusal behavior with
truncate.

> > - probably I'd just transition all gup_longterm()
> > users to a saner API similar to the one we have in mm/frame_vector.c where
> > we don't hand out page pointers but an encapsulating structure that does
> > all the necessary tracking.
>
> I'll take a look at that code. But that seems like a pretty big change.

I was looking into that yesterday before proposing this and there aren't
than many gup_longterm() users and most of them anyway just stick pages
array into their tracking structure and then release them once done. So it
shouldn't be that complex to convert to a new convention (and you have to
touch all gup_longterm() users anyway to teach them track leases etc.).

> > Removing a lease would need to block until all
> > pins are released - this is probably the most hairy part since we need to
> > handle a case if application just closes the file descriptor which would
> > release the lease but OTOH we need to make sure task exit does not deadlock.
> > Maybe we could block only on explicit lease unlock and just drop the layout
> > lease on file close and if there are still pinned pages, send SIGKILL to an
> > application as a reminder it did something stupid...
>
> As presented at LSFmm I'm not opposed to killing a process which does not
> "follow the rules". But I'm concerned about how to handle this across a fork.
>
> Limiting the open()/LEASE/GUP/close()/SIGKILL to a specific pid "leak"'s pins
> to a child through the RDMA context. This was the major issue Jason had with
> the SIGBUS proposal.
>
> Always sending a SIGKILL would prevent an RDMA process from doing something
> like system("ls") (would kill the child unnecessarily). Are we ok with that?

I answered this in another email but system("ls") won't kill anybody.
fork(2) just creates new file descriptor for the same file and possibly
then closes it but since there is still another file descriptor for the
same struct file, the "close" code won't trigger.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2019-06-07 12:19:15

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH RFC 00/10] RDMA/FS DAX truncate proposal

On Fri, Jun 07, 2019 at 12:36:36PM +0200, Jan Kara wrote:

> Because the pins would be invisible to sysadmin from that point on.

It is not invisible, it just shows up in a rdma specific kernel
interface. You have to use rdma netlink to see the kernel object
holding this pin.

If this visibility is the main sticking point I suggest just enhancing
the existing MR reporting to include the file info for current GUP
pins and teaching lsof to collect information from there as well so it
is easy to use.

If the ownership of the lease transfers to the MR, and we report that
ownership to userspace in a way lsof can find, then I think all the
concerns that have been raised are met, right?

> ugly to live so we have to come up with something better. The best I can
> currently come up with is to have a method associated with the lease that
> would invalidate the RDMA context that holds the pins in the same way that
> a file close would do it.

This is back to requiring all RDMA HW to have some new behavior they
currently don't have..

The main objection to the current ODP & DAX solution is that very
little HW can actually implement it, having the alternative still
require HW support doesn't seem like progress.

I think we will eventually start seein some HW be able to do this
invalidation, but it won't be universal, and I'd rather leave it
optional, for recovery from truely catastrophic errors (ie my DAX is
on fire, I need to unplug it).

Jason

2019-06-07 14:51:25

by Ira Weiny

[permalink] [raw]
Subject: Re: [PATCH RFC 00/10] RDMA/FS DAX truncate proposal

On Fri, Jun 07, 2019 at 09:17:29AM -0300, Jason Gunthorpe wrote:
> On Fri, Jun 07, 2019 at 12:36:36PM +0200, Jan Kara wrote:
>
> > Because the pins would be invisible to sysadmin from that point on.
>
> It is not invisible, it just shows up in a rdma specific kernel
> interface. You have to use rdma netlink to see the kernel object
> holding this pin.
>
> If this visibility is the main sticking point I suggest just enhancing
> the existing MR reporting to include the file info for current GUP
> pins and teaching lsof to collect information from there as well so it
> is easy to use.
>
> If the ownership of the lease transfers to the MR, and we report that
> ownership to userspace in a way lsof can find, then I think all the
> concerns that have been raised are met, right?

I was contemplating some new lsof feature yesterday. But what I don't think we
want is sysadmins to have multiple tools for multiple subsystems. Or even have
to teach lsof something new for every potential new subsystem user of GUP pins.

I was thinking more along the lines of reporting files which have GUP pins on
them directly somewhere (dare I say procfs?) and teaching lsof to report that
information. That would cover any subsystem which does a longterm pin.

>
> > ugly to live so we have to come up with something better. The best I can
> > currently come up with is to have a method associated with the lease that
> > would invalidate the RDMA context that holds the pins in the same way that
> > a file close would do it.
>
> This is back to requiring all RDMA HW to have some new behavior they
> currently don't have..
>
> The main objection to the current ODP & DAX solution is that very
> little HW can actually implement it, having the alternative still
> require HW support doesn't seem like progress.
>
> I think we will eventually start seein some HW be able to do this
> invalidation, but it won't be universal, and I'd rather leave it
> optional, for recovery from truely catastrophic errors (ie my DAX is
> on fire, I need to unplug it).

Agreed. I think software wise there is not much some of the devices can do
with such an "invalidate".

Ira

2019-06-07 15:11:12

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH RFC 00/10] RDMA/FS DAX truncate proposal

On Fri, Jun 07, 2019 at 07:52:13AM -0700, Ira Weiny wrote:
> On Fri, Jun 07, 2019 at 09:17:29AM -0300, Jason Gunthorpe wrote:
> > On Fri, Jun 07, 2019 at 12:36:36PM +0200, Jan Kara wrote:
> >
> > > Because the pins would be invisible to sysadmin from that point on.
> >
> > It is not invisible, it just shows up in a rdma specific kernel
> > interface. You have to use rdma netlink to see the kernel object
> > holding this pin.
> >
> > If this visibility is the main sticking point I suggest just enhancing
> > the existing MR reporting to include the file info for current GUP
> > pins and teaching lsof to collect information from there as well so it
> > is easy to use.
> >
> > If the ownership of the lease transfers to the MR, and we report that
> > ownership to userspace in a way lsof can find, then I think all the
> > concerns that have been raised are met, right?
>
> I was contemplating some new lsof feature yesterday. But what I don't think we
> want is sysadmins to have multiple tools for multiple subsystems. Or even have
> to teach lsof something new for every potential new subsystem user of GUP pins.

Well.. it is a bit tricky, but you'd have to arrange for the lease
object to have a list of 'struct files' that are holding the
lease open.

The first would be the file that did the fcntl, the next would be all
the files that did longterm GUP - which means longterm GUP has to have
a chardev file/etc as well (seems OK)

Then lsof could query the list of lease objects for each file it
encounters and print them out too.

Jason

2019-06-07 19:00:44

by Ira Weiny

[permalink] [raw]
Subject: Re: [PATCH RFC 00/10] RDMA/FS DAX truncate proposal

On Fri, Jun 07, 2019 at 01:04:26PM +0200, Jan Kara wrote:
> On Thu 06-06-19 15:03:30, Ira Weiny wrote:
> > On Thu, Jun 06, 2019 at 12:42:03PM +0200, Jan Kara wrote:
> > > On Wed 05-06-19 18:45:33, [email protected] wrote:
> > > > From: Ira Weiny <[email protected]>
> > >
> > > So I'd like to actually mandate that you *must* hold the file lease until
> > > you unpin all pages in the given range (not just that you have an option to
> > > hold a lease). And I believe the kernel should actually enforce this. That
> > > way we maintain a sane state that if someone uses a physical location of
> > > logical file offset on disk, he has a layout lease. Also once this is done,
> > > sysadmin has a reasonably easy way to discover run-away RDMA application
> > > and kill it if he wishes so.
> >
> > Fair enough.
> >
> > I was kind of heading that direction but had not thought this far forward. I
> > was exploring how to have a lease remain on the file even after a "lease
> > break". But that is incompatible with the current semantics of a "layout"
> > lease (as currently defined in the kernel). [In the end I wanted to get an RFC
> > out to see what people think of this idea so I did not look at keeping the
> > lease.]
> >
> > Also hitch is that currently a lease is forcefully broken after
> > <sysfs>/lease-break-time. To do what you suggest I think we would need a new
> > lease type with the semantics you describe.
>
> I'd do what Dave suggested - add flag to mark lease as unbreakable by
> truncate and teach file locking core to handle that. There actually is
> support for locks that are not broken after given timeout so there
> shouldn't be too many changes need.
>
> > Previously I had thought this would be a good idea (for other reasons). But
> > what does everyone think about using a "longterm lease" similar to [1] which
> > has the semantics you proppose? In [1] I was not sure "longterm" was a good
> > name but with your proposal I think it makes more sense.
>
> As I wrote elsewhere in this thread I think FL_LAYOUT name still makes
> sense and I'd add there FL_UNBREAKABLE to mark unusal behavior with
> truncate.

Ok I want to make sure I understand what you and Dave are suggesting.

Are you suggesting that we have something like this from user space?

fcntl(fd, F_SETLEASE, F_LAYOUT | F_UNBREAKABLE);

>
> > > - probably I'd just transition all gup_longterm()
> > > users to a saner API similar to the one we have in mm/frame_vector.c where
> > > we don't hand out page pointers but an encapsulating structure that does
> > > all the necessary tracking.
> >
> > I'll take a look at that code. But that seems like a pretty big change.
>
> I was looking into that yesterday before proposing this and there aren't
> than many gup_longterm() users and most of them anyway just stick pages
> array into their tracking structure and then release them once done. So it
> shouldn't be that complex to convert to a new convention (and you have to
> touch all gup_longterm() users anyway to teach them track leases etc.).

I think in the direction we are heading this becomes more attractive for sure.
For me though it will take some time.

Should we convert the frame_vector over to this new mechanism? (Or more
accurately perhaps, add to frame_vector and use it?) It seems bad to have "yet
another object" returned from the pin pages interface...

And I think this is related to what Christoph Hellwig is doing with bio_vec and
dma. Really we want drivers out of the page processing business.

So for now I'm going to move forward with the idea of handing "some object" to
the GUP callers and figure out the lsof stuff, and let bigger questions like
this play out a bit more before I try and work with that code. Fair?

>
> > > Removing a lease would need to block until all
> > > pins are released - this is probably the most hairy part since we need to
> > > handle a case if application just closes the file descriptor which would
> > > release the lease but OTOH we need to make sure task exit does not deadlock.
> > > Maybe we could block only on explicit lease unlock and just drop the layout
> > > lease on file close and if there are still pinned pages, send SIGKILL to an
> > > application as a reminder it did something stupid...
> >
> > As presented at LSFmm I'm not opposed to killing a process which does not
> > "follow the rules". But I'm concerned about how to handle this across a fork.
> >
> > Limiting the open()/LEASE/GUP/close()/SIGKILL to a specific pid "leak"'s pins
> > to a child through the RDMA context. This was the major issue Jason had with
> > the SIGBUS proposal.
> >
> > Always sending a SIGKILL would prevent an RDMA process from doing something
> > like system("ls") (would kill the child unnecessarily). Are we ok with that?
>
> I answered this in another email but system("ls") won't kill anybody.
> fork(2) just creates new file descriptor for the same file and possibly
> then closes it but since there is still another file descriptor for the
> same struct file, the "close" code won't trigger.

Agreed. I was wrong. Sorry.

But if we can keep track of who has the pins in lsof can we agree no process
needs to be SIGKILL'ed? Admins can do this on their own "killing" if they
really need to stop the use of these files, right?

Ira

2019-06-07 19:02:40

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH RFC 00/10] RDMA/FS DAX truncate proposal

On Fri, Jun 07, 2019 at 11:25:35AM -0700, Ira Weiny wrote:

> And I think this is related to what Christoph Hellwig is doing with bio_vec and
> dma. Really we want drivers out of the page processing business.

At least for RDMA, and a few other places I've noticed, I'd really
like to get totally out of the handling struct pages game.

We are DMA based and really only want DMA addresses for the target
device. I know other places need CPU pages or more complicated
things.. But I also know there are other drivers like RDMA..

So I think it would be very helpful to have a driver API something
like:

int get_user_mem_for_dma(struct device *dma_device,
void __user *mem, size_t length,
struct gup_handle *res,
struct 'bio dma list' *dma_list,
const struct dma_params *params);
void put_user_mem_for_dma(struct gup_handle *res,
struct 'bio dma list' *dma_list);

And we could hope to put in there all the specialty logic we want to
have for this flow:
- The weird HMM stuff in hmm_range_dma_map()
- Interaction with DAX
- Interaction with DMA BUF
- Holding file leases
- PCI peer 2 peer features
- Optimizations for huge pages
- Handling page dirtying from DMA
- etc

I think Matthew was suggesting something like this at LS/MM, so +1
from here..

When Christoph sends his BIO dma work I was thinking of investigating
this avenue, as we already have something quite similiar in RDMA that
could perhaps be hoisted out for re-use into mm/

Jason

2019-06-08 00:59:22

by Dave Chinner

[permalink] [raw]
Subject: Re: [PATCH RFC 00/10] RDMA/FS DAX truncate proposal

On Fri, Jun 07, 2019 at 11:25:35AM -0700, Ira Weiny wrote:
> On Fri, Jun 07, 2019 at 01:04:26PM +0200, Jan Kara wrote:
> > On Thu 06-06-19 15:03:30, Ira Weiny wrote:
> > > On Thu, Jun 06, 2019 at 12:42:03PM +0200, Jan Kara wrote:
> > > > On Wed 05-06-19 18:45:33, [email protected] wrote:
> > > > > From: Ira Weiny <[email protected]>
> > > >
> > > > So I'd like to actually mandate that you *must* hold the file lease until
> > > > you unpin all pages in the given range (not just that you have an option to
> > > > hold a lease). And I believe the kernel should actually enforce this. That
> > > > way we maintain a sane state that if someone uses a physical location of
> > > > logical file offset on disk, he has a layout lease. Also once this is done,
> > > > sysadmin has a reasonably easy way to discover run-away RDMA application
> > > > and kill it if he wishes so.
> > >
> > > Fair enough.
> > >
> > > I was kind of heading that direction but had not thought this far forward. I
> > > was exploring how to have a lease remain on the file even after a "lease
> > > break". But that is incompatible with the current semantics of a "layout"
> > > lease (as currently defined in the kernel). [In the end I wanted to get an RFC
> > > out to see what people think of this idea so I did not look at keeping the
> > > lease.]
> > >
> > > Also hitch is that currently a lease is forcefully broken after
> > > <sysfs>/lease-break-time. To do what you suggest I think we would need a new
> > > lease type with the semantics you describe.
> >
> > I'd do what Dave suggested - add flag to mark lease as unbreakable by
> > truncate and teach file locking core to handle that. There actually is
> > support for locks that are not broken after given timeout so there
> > shouldn't be too many changes need.
> >
> > > Previously I had thought this would be a good idea (for other reasons). But
> > > what does everyone think about using a "longterm lease" similar to [1] which
> > > has the semantics you proppose? In [1] I was not sure "longterm" was a good
> > > name but with your proposal I think it makes more sense.
> >
> > As I wrote elsewhere in this thread I think FL_LAYOUT name still makes
> > sense and I'd add there FL_UNBREAKABLE to mark unusal behavior with
> > truncate.
>
> Ok I want to make sure I understand what you and Dave are suggesting.
>
> Are you suggesting that we have something like this from user space?
>
> fcntl(fd, F_SETLEASE, F_LAYOUT | F_UNBREAKABLE);

Rather than "unbreakable", perhaps a clearer description of the
policy it entails is "exclusive"?

i.e. what we are talking about here is an exclusive lease that
prevents other processes from changing the layout. i.e. the
mechanism used to guarantee a lease is exclusive is that the layout
becomes "unbreakable" at the filesystem level, but the policy we are
actually presenting to uses is "exclusive access"...

Cheers,

Dave.
--
Dave Chinner
[email protected]

2019-06-09 01:30:51

by Ira Weiny

[permalink] [raw]
Subject: Re: [PATCH RFC 00/10] RDMA/FS DAX truncate proposal

On Sat, Jun 08, 2019 at 10:10:36AM +1000, Dave Chinner wrote:
> On Fri, Jun 07, 2019 at 11:25:35AM -0700, Ira Weiny wrote:
> > On Fri, Jun 07, 2019 at 01:04:26PM +0200, Jan Kara wrote:
> > > On Thu 06-06-19 15:03:30, Ira Weiny wrote:
> > > > On Thu, Jun 06, 2019 at 12:42:03PM +0200, Jan Kara wrote:
> > > > > On Wed 05-06-19 18:45:33, [email protected] wrote:
> > > > > > From: Ira Weiny <[email protected]>
> > > > >
> > > > > So I'd like to actually mandate that you *must* hold the file lease until
> > > > > you unpin all pages in the given range (not just that you have an option to
> > > > > hold a lease). And I believe the kernel should actually enforce this. That
> > > > > way we maintain a sane state that if someone uses a physical location of
> > > > > logical file offset on disk, he has a layout lease. Also once this is done,
> > > > > sysadmin has a reasonably easy way to discover run-away RDMA application
> > > > > and kill it if he wishes so.
> > > >
> > > > Fair enough.
> > > >
> > > > I was kind of heading that direction but had not thought this far forward. I
> > > > was exploring how to have a lease remain on the file even after a "lease
> > > > break". But that is incompatible with the current semantics of a "layout"
> > > > lease (as currently defined in the kernel). [In the end I wanted to get an RFC
> > > > out to see what people think of this idea so I did not look at keeping the
> > > > lease.]
> > > >
> > > > Also hitch is that currently a lease is forcefully broken after
> > > > <sysfs>/lease-break-time. To do what you suggest I think we would need a new
> > > > lease type with the semantics you describe.
> > >
> > > I'd do what Dave suggested - add flag to mark lease as unbreakable by
> > > truncate and teach file locking core to handle that. There actually is
> > > support for locks that are not broken after given timeout so there
> > > shouldn't be too many changes need.
> > >
> > > > Previously I had thought this would be a good idea (for other reasons). But
> > > > what does everyone think about using a "longterm lease" similar to [1] which
> > > > has the semantics you proppose? In [1] I was not sure "longterm" was a good
> > > > name but with your proposal I think it makes more sense.
> > >
> > > As I wrote elsewhere in this thread I think FL_LAYOUT name still makes
> > > sense and I'd add there FL_UNBREAKABLE to mark unusal behavior with
> > > truncate.
> >
> > Ok I want to make sure I understand what you and Dave are suggesting.
> >
> > Are you suggesting that we have something like this from user space?
> >
> > fcntl(fd, F_SETLEASE, F_LAYOUT | F_UNBREAKABLE);
>
> Rather than "unbreakable", perhaps a clearer description of the
> policy it entails is "exclusive"?
>
> i.e. what we are talking about here is an exclusive lease that
> prevents other processes from changing the layout. i.e. the
> mechanism used to guarantee a lease is exclusive is that the layout
> becomes "unbreakable" at the filesystem level, but the policy we are
> actually presenting to uses is "exclusive access"...

That sounds good.

Ira

>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> [email protected]

2019-06-12 14:39:39

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH RFC 00/10] RDMA/FS DAX truncate proposal

On Fri 07-06-19 07:52:13, Ira Weiny wrote:
> On Fri, Jun 07, 2019 at 09:17:29AM -0300, Jason Gunthorpe wrote:
> > On Fri, Jun 07, 2019 at 12:36:36PM +0200, Jan Kara wrote:
> >
> > > Because the pins would be invisible to sysadmin from that point on.
> >
> > It is not invisible, it just shows up in a rdma specific kernel
> > interface. You have to use rdma netlink to see the kernel object
> > holding this pin.
> >
> > If this visibility is the main sticking point I suggest just enhancing
> > the existing MR reporting to include the file info for current GUP
> > pins and teaching lsof to collect information from there as well so it
> > is easy to use.
> >
> > If the ownership of the lease transfers to the MR, and we report that
> > ownership to userspace in a way lsof can find, then I think all the
> > concerns that have been raised are met, right?
>
> I was contemplating some new lsof feature yesterday. But what I don't
> think we want is sysadmins to have multiple tools for multiple
> subsystems. Or even have to teach lsof something new for every potential
> new subsystem user of GUP pins.

Agreed.

> I was thinking more along the lines of reporting files which have GUP
> pins on them directly somewhere (dare I say procfs?) and teaching lsof to
> report that information. That would cover any subsystem which does a
> longterm pin.

So lsof already parses /proc/<pid>/maps to learn about files held open by
memory mappings. It could parse some other file as well I guess. The good
thing about that would be that then "longterm pin" structure would just hold
struct file reference. That would avoid any needs of special behavior on
file close (the file reference in the "longterm pin" structure would make
sure struct file and thus the lease stays around, we'd just need to make
explicit lease unlock block until the "longterm pin" structure is freed).
The bad thing is that it requires us to come up with a sane new proc
interface for reporting "longterm pins" and associated struct file. Also we
need to define what this interface shows if the pinned pages are in DRAM
(either page cache or anon) and not on NVDIMM.

> > > ugly to live so we have to come up with something better. The best I can
> > > currently come up with is to have a method associated with the lease that
> > > would invalidate the RDMA context that holds the pins in the same way that
> > > a file close would do it.
> >
> > This is back to requiring all RDMA HW to have some new behavior they
> > currently don't have..
> >
> > The main objection to the current ODP & DAX solution is that very
> > little HW can actually implement it, having the alternative still
> > require HW support doesn't seem like progress.
> >
> > I think we will eventually start seein some HW be able to do this
> > invalidation, but it won't be universal, and I'd rather leave it
> > optional, for recovery from truely catastrophic errors (ie my DAX is
> > on fire, I need to unplug it).
>
> Agreed. I think software wise there is not much some of the devices can do
> with such an "invalidate".

So out of curiosity: What does RDMA driver do when userspace just closes
the file pointing to RDMA object? It has to handle that somehow by aborting
everything that's going on... And I wanted similar behavior here.

Honza

--
Jan Kara <[email protected]>
SUSE Labs, CR

2019-06-12 17:11:45

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH RFC 00/10] RDMA/FS DAX truncate proposal

On Wed 12-06-19 08:47:21, Jason Gunthorpe wrote:
> On Wed, Jun 12, 2019 at 12:29:17PM +0200, Jan Kara wrote:
>
> > > > The main objection to the current ODP & DAX solution is that very
> > > > little HW can actually implement it, having the alternative still
> > > > require HW support doesn't seem like progress.
> > > >
> > > > I think we will eventually start seein some HW be able to do this
> > > > invalidation, but it won't be universal, and I'd rather leave it
> > > > optional, for recovery from truely catastrophic errors (ie my DAX is
> > > > on fire, I need to unplug it).
> > >
> > > Agreed. I think software wise there is not much some of the devices can do
> > > with such an "invalidate".
> >
> > So out of curiosity: What does RDMA driver do when userspace just closes
> > the file pointing to RDMA object? It has to handle that somehow by aborting
> > everything that's going on... And I wanted similar behavior here.
>
> It aborts *everything* connected to that file descriptor. Destroying
> everything avoids creating inconsistencies that destroying a subset
> would create.
>
> What has been talked about for lease break is not destroying anything
> but very selectively saying that one memory region linked to the GUP
> is no longer functional.

OK, so what I had in mind was that if RDMA app doesn't play by the rules
and closes the file with existing pins (and thus layout lease) we would
force it to abort everything. Yes, it is disruptive but then the app didn't
obey the rule that it has to maintain file lease while holding pins. Thus
such situation should never happen unless the app is malicious / buggy.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2019-06-12 18:50:19

by Dan Williams

[permalink] [raw]
Subject: Re: [PATCH RFC 00/10] RDMA/FS DAX truncate proposal

On Wed, Jun 12, 2019 at 3:29 AM Jan Kara <[email protected]> wrote:
>
> On Fri 07-06-19 07:52:13, Ira Weiny wrote:
> > On Fri, Jun 07, 2019 at 09:17:29AM -0300, Jason Gunthorpe wrote:
> > > On Fri, Jun 07, 2019 at 12:36:36PM +0200, Jan Kara wrote:
> > >
> > > > Because the pins would be invisible to sysadmin from that point on.
> > >
> > > It is not invisible, it just shows up in a rdma specific kernel
> > > interface. You have to use rdma netlink to see the kernel object
> > > holding this pin.
> > >
> > > If this visibility is the main sticking point I suggest just enhancing
> > > the existing MR reporting to include the file info for current GUP
> > > pins and teaching lsof to collect information from there as well so it
> > > is easy to use.
> > >
> > > If the ownership of the lease transfers to the MR, and we report that
> > > ownership to userspace in a way lsof can find, then I think all the
> > > concerns that have been raised are met, right?
> >
> > I was contemplating some new lsof feature yesterday. But what I don't
> > think we want is sysadmins to have multiple tools for multiple
> > subsystems. Or even have to teach lsof something new for every potential
> > new subsystem user of GUP pins.
>
> Agreed.
>
> > I was thinking more along the lines of reporting files which have GUP
> > pins on them directly somewhere (dare I say procfs?) and teaching lsof to
> > report that information. That would cover any subsystem which does a
> > longterm pin.
>
> So lsof already parses /proc/<pid>/maps to learn about files held open by
> memory mappings. It could parse some other file as well I guess. The good
> thing about that would be that then "longterm pin" structure would just hold
> struct file reference. That would avoid any needs of special behavior on
> file close (the file reference in the "longterm pin" structure would make
> sure struct file and thus the lease stays around, we'd just need to make
> explicit lease unlock block until the "longterm pin" structure is freed).
> The bad thing is that it requires us to come up with a sane new proc
> interface for reporting "longterm pins" and associated struct file. Also we
> need to define what this interface shows if the pinned pages are in DRAM
> (either page cache or anon) and not on NVDIMM.

The anon vs shared detection case is important because a longterm pin
might be blocking a memory-hot-unplug operation if it is pinning
ZONE_MOVABLE memory, but I don't think we want DRAM vs NVDIMM to be an
explicit concern of the interface. For the anon / cached case I expect
it might be useful to put that communication under the memory-blocks
sysfs interface. I.e. a list of pids that are pinning that
memory-block from being hot-unplugged.

2019-06-12 19:14:52

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH RFC 00/10] RDMA/FS DAX truncate proposal

On Wed, Jun 12, 2019 at 02:09:07PM +0200, Jan Kara wrote:
> On Wed 12-06-19 08:47:21, Jason Gunthorpe wrote:
> > On Wed, Jun 12, 2019 at 12:29:17PM +0200, Jan Kara wrote:
> >
> > > > > The main objection to the current ODP & DAX solution is that very
> > > > > little HW can actually implement it, having the alternative still
> > > > > require HW support doesn't seem like progress.
> > > > >
> > > > > I think we will eventually start seein some HW be able to do this
> > > > > invalidation, but it won't be universal, and I'd rather leave it
> > > > > optional, for recovery from truely catastrophic errors (ie my DAX is
> > > > > on fire, I need to unplug it).
> > > >
> > > > Agreed. I think software wise there is not much some of the devices can do
> > > > with such an "invalidate".
> > >
> > > So out of curiosity: What does RDMA driver do when userspace just closes
> > > the file pointing to RDMA object? It has to handle that somehow by aborting
> > > everything that's going on... And I wanted similar behavior here.
> >
> > It aborts *everything* connected to that file descriptor. Destroying
> > everything avoids creating inconsistencies that destroying a subset
> > would create.
> >
> > What has been talked about for lease break is not destroying anything
> > but very selectively saying that one memory region linked to the GUP
> > is no longer functional.
>
> OK, so what I had in mind was that if RDMA app doesn't play by the rules
> and closes the file with existing pins (and thus layout lease) we would
> force it to abort everything. Yes, it is disruptive but then the app didn't
> obey the rule that it has to maintain file lease while holding pins. Thus
> such situation should never happen unless the app is malicious / buggy.

We do have the infrastructure to completely revoke the entire
*content* of a FD (this is called device disassociate). It is
basically close without the app doing close. But again it only works
with some drivers. However, this is more likely something a driver
could support without a HW change though.

It is quite destructive as it forcibly kills everything RDMA related
the process(es) are doing, but it is less violent than SIGKILL, and
there is perhaps a way for the app to recover from this, if it is
coded for it.

My preference would be to avoid this scenario, but if it is really
necessary, we could probably build it with some work.

The only case we use it today is forced HW hot unplug, so it is rarely
used and only for an 'emergency' like use case.

Jason

2019-06-13 15:15:22

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH RFC 00/10] RDMA/FS DAX truncate proposal

On Wed, Jun 12, 2019 at 06:14:46PM -0700, Dan Williams wrote:
> > Effectively, we would need a way for an admin to close a specific file
> > descriptor (or set of fds) which point to that file. AFAIK there is no way to
> > do that at all, is there?
>
> Even if there were that gets back to my other question, does RDMA
> teardown happen at close(fd), or at final fput() of the 'struct
> file'?

AFAIK there is no kernel side driver hook for close(fd).

rdma uses a normal chardev so it's lifetime is linked to the file_ops
release, which is called on last fput. So all the mmaps, all the dups,
everything must go before it releases its resources.

Jason

2019-06-13 15:31:21

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH RFC 00/10] RDMA/FS DAX truncate proposal

On Wed, Jun 12, 2019 at 08:23:20PM -0700, Matthew Wilcox wrote:
> On Thu, Jun 13, 2019 at 10:25:55AM +1000, Dave Chinner wrote:
> > On Wed, Jun 12, 2019 at 05:37:53AM -0700, Matthew Wilcox wrote:
> > > That's rather different from the normal meaning of 'exclusive' in the
> > > context of locks, which is "only one user can have access to this at
> > > a time".
> >
> > Layout leases are not locks, they are a user access policy object.
> > It is the process/fd which holds the lease and it's the process/fd
> > that is granted exclusive access. This is exactly the same semantic
> > as O_EXCL provides for granting exclusive access to a block device
> > via open(), yes?
>
> This isn't my understanding of how RDMA wants this to work, so we should
> probably clear that up before we get too far down deciding what name to
> give it.
>
> For the RDMA usage case, it is entirely possible that both process A
> and process B which don't know about each other want to perform RDMA to
> file F. So there will be two layout leases active on this file at the
> same time. It's fine for IOs to simultaneously be active to both leases.
> But if the filesystem wants to move blocks around, it has to break
> both leases.
>
> If Process C tries to do a write to file F without a lease, there's no
> problem, unless a side-effect of the write would be to change the block
> mapping, in which case either the leases must break first, or the write
> must be denied.
>
> Jason, please correct me if I've misunderstood the RDMA needs here.

Yes, I think you've captured it

Based on Dave's remarks how frequently a filesystem needs to break the
lease will determine if it is usuable to be combined with RDMA or
not...

Jason

2019-06-13 16:27:47

by Dan Williams

[permalink] [raw]
Subject: Re: [PATCH RFC 00/10] RDMA/FS DAX truncate proposal

On Thu, Jun 13, 2019 at 8:14 AM Jason Gunthorpe <[email protected]> wrote:
>
> On Wed, Jun 12, 2019 at 06:14:46PM -0700, Dan Williams wrote:
> > > Effectively, we would need a way for an admin to close a specific file
> > > descriptor (or set of fds) which point to that file. AFAIK there is no way to
> > > do that at all, is there?
> >
> > Even if there were that gets back to my other question, does RDMA
> > teardown happen at close(fd), or at final fput() of the 'struct
> > file'?
>
> AFAIK there is no kernel side driver hook for close(fd).
>
> rdma uses a normal chardev so it's lifetime is linked to the file_ops
> release, which is called on last fput. So all the mmaps, all the dups,
> everything must go before it releases its resources.

Oh, I must have missed where this conversation started talking about
the driver-device fd. I thought we were talking about the close /
release of the target file that is MAP_SHARED for the memory
registration. A release of the driver fd is orthogonal to coordinating
/ signalling actions relative to the leased file.

2019-06-13 16:34:41

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH RFC 00/10] RDMA/FS DAX truncate proposal

On Wed 12-06-19 15:13:36, Ira Weiny wrote:
> On Wed, Jun 12, 2019 at 04:14:21PM -0300, Jason Gunthorpe wrote:
> > On Wed, Jun 12, 2019 at 02:09:07PM +0200, Jan Kara wrote:
> > > On Wed 12-06-19 08:47:21, Jason Gunthorpe wrote:
> > > > On Wed, Jun 12, 2019 at 12:29:17PM +0200, Jan Kara wrote:
> > > >
> > > > > > > The main objection to the current ODP & DAX solution is that very
> > > > > > > little HW can actually implement it, having the alternative still
> > > > > > > require HW support doesn't seem like progress.
> > > > > > >
> > > > > > > I think we will eventually start seein some HW be able to do this
> > > > > > > invalidation, but it won't be universal, and I'd rather leave it
> > > > > > > optional, for recovery from truely catastrophic errors (ie my DAX is
> > > > > > > on fire, I need to unplug it).
> > > > > >
> > > > > > Agreed. I think software wise there is not much some of the devices can do
> > > > > > with such an "invalidate".
> > > > >
> > > > > So out of curiosity: What does RDMA driver do when userspace just closes
> > > > > the file pointing to RDMA object? It has to handle that somehow by aborting
> > > > > everything that's going on... And I wanted similar behavior here.
> > > >
> > > > It aborts *everything* connected to that file descriptor. Destroying
> > > > everything avoids creating inconsistencies that destroying a subset
> > > > would create.
> > > >
> > > > What has been talked about for lease break is not destroying anything
> > > > but very selectively saying that one memory region linked to the GUP
> > > > is no longer functional.
> > >
> > > OK, so what I had in mind was that if RDMA app doesn't play by the rules
> > > and closes the file with existing pins (and thus layout lease) we would
> > > force it to abort everything. Yes, it is disruptive but then the app didn't
> > > obey the rule that it has to maintain file lease while holding pins. Thus
> > > such situation should never happen unless the app is malicious / buggy.
> >
> > We do have the infrastructure to completely revoke the entire
> > *content* of a FD (this is called device disassociate). It is
> > basically close without the app doing close. But again it only works
> > with some drivers. However, this is more likely something a driver
> > could support without a HW change though.
> >
> > It is quite destructive as it forcibly kills everything RDMA related
> > the process(es) are doing, but it is less violent than SIGKILL, and
> > there is perhaps a way for the app to recover from this, if it is
> > coded for it.
>
> I don't think many are... I think most would effectively be "killed" if this
> happened to them.

Yes, I repeat we are in a situation when the application has a bug and
didn't propely manage its long term pins which are fully under its control.
So in my mind a situation similar to application using memory it has
already freed. The kernel has to manage that but we don't really care
what's left from the application when this happens.

That being said I'm not insisting this has to happen - tracking associated
"RDMA file" with a layout lease and somehow invalidating it on close of a
leased file is somewhat ugly anyway. But it is still an option if exposing
pins to userspace for lsof to consume proves even worse...

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2019-06-13 16:52:53

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [PATCH RFC 00/10] RDMA/FS DAX truncate proposal

On Thu, Jun 13, 2019 at 10:25:55AM +1000, Dave Chinner wrote:
> On Wed, Jun 12, 2019 at 05:37:53AM -0700, Matthew Wilcox wrote:
> > That's rather different from the normal meaning of 'exclusive' in the
> > context of locks, which is "only one user can have access to this at
> > a time".
>
> Layout leases are not locks, they are a user access policy object.
> It is the process/fd which holds the lease and it's the process/fd
> that is granted exclusive access. This is exactly the same semantic
> as O_EXCL provides for granting exclusive access to a block device
> via open(), yes?

This isn't my understanding of how RDMA wants this to work, so we should
probably clear that up before we get too far down deciding what name to
give it.

For the RDMA usage case, it is entirely possible that both process A
and process B which don't know about each other want to perform RDMA to
file F. So there will be two layout leases active on this file at the
same time. It's fine for IOs to simultaneously be active to both leases.
But if the filesystem wants to move blocks around, it has to break
both leases.

If Process C tries to do a write to file F without a lease, there's no
problem, unless a side-effect of the write would be to change the block
mapping, in which case either the leases must break first, or the write
must be denied.

Jason, please correct me if I've misunderstood the RDMA needs here.

2019-06-13 16:54:33

by Dan Williams

[permalink] [raw]
Subject: Re: [PATCH RFC 00/10] RDMA/FS DAX truncate proposal

On Wed, Jun 12, 2019 at 4:32 PM Ira Weiny <[email protected]> wrote:
>
> On Wed, Jun 12, 2019 at 03:54:19PM -0700, Dan Williams wrote:
> > On Wed, Jun 12, 2019 at 3:12 PM Ira Weiny <[email protected]> wrote:
> > >
> > > On Wed, Jun 12, 2019 at 04:14:21PM -0300, Jason Gunthorpe wrote:
> > > > On Wed, Jun 12, 2019 at 02:09:07PM +0200, Jan Kara wrote:
> > > > > On Wed 12-06-19 08:47:21, Jason Gunthorpe wrote:
> > > > > > On Wed, Jun 12, 2019 at 12:29:17PM +0200, Jan Kara wrote:
> > > > > >
> > > > > > > > > The main objection to the current ODP & DAX solution is that very
> > > > > > > > > little HW can actually implement it, having the alternative still
> > > > > > > > > require HW support doesn't seem like progress.
> > > > > > > > >
> > > > > > > > > I think we will eventually start seein some HW be able to do this
> > > > > > > > > invalidation, but it won't be universal, and I'd rather leave it
> > > > > > > > > optional, for recovery from truely catastrophic errors (ie my DAX is
> > > > > > > > > on fire, I need to unplug it).
> > > > > > > >
> > > > > > > > Agreed. I think software wise there is not much some of the devices can do
> > > > > > > > with such an "invalidate".
> > > > > > >
> > > > > > > So out of curiosity: What does RDMA driver do when userspace just closes
> > > > > > > the file pointing to RDMA object? It has to handle that somehow by aborting
> > > > > > > everything that's going on... And I wanted similar behavior here.
> > > > > >
> > > > > > It aborts *everything* connected to that file descriptor. Destroying
> > > > > > everything avoids creating inconsistencies that destroying a subset
> > > > > > would create.
> > > > > >
> > > > > > What has been talked about for lease break is not destroying anything
> > > > > > but very selectively saying that one memory region linked to the GUP
> > > > > > is no longer functional.
> > > > >
> > > > > OK, so what I had in mind was that if RDMA app doesn't play by the rules
> > > > > and closes the file with existing pins (and thus layout lease) we would
> > > > > force it to abort everything. Yes, it is disruptive but then the app didn't
> > > > > obey the rule that it has to maintain file lease while holding pins. Thus
> > > > > such situation should never happen unless the app is malicious / buggy.
> > > >
> > > > We do have the infrastructure to completely revoke the entire
> > > > *content* of a FD (this is called device disassociate). It is
> > > > basically close without the app doing close. But again it only works
> > > > with some drivers. However, this is more likely something a driver
> > > > could support without a HW change though.
> > > >
> > > > It is quite destructive as it forcibly kills everything RDMA related
> > > > the process(es) are doing, but it is less violent than SIGKILL, and
> > > > there is perhaps a way for the app to recover from this, if it is
> > > > coded for it.
> > >
> > > I don't think many are... I think most would effectively be "killed" if this
> > > happened to them.
> > >
> > > >
> > > > My preference would be to avoid this scenario, but if it is really
> > > > necessary, we could probably build it with some work.
> > > >
> > > > The only case we use it today is forced HW hot unplug, so it is rarely
> > > > used and only for an 'emergency' like use case.
> > >
> > > I'd really like to avoid this as well. I think it will be very confusing for
> > > RDMA apps to have their context suddenly be invalid. I think if we have a way
> > > for admins to ID who is pinning a file the admin can take more appropriate
> > > action on those processes. Up to and including killing the process.
> >
> > Can RDMA context invalidation, "device disassociate", be inflicted on
> > a process from the outside? Identifying the pid of a pin holder only
> > leaves SIGKILL of the entire process as the remediation for revoking a
> > pin, and I assume admins would use the finer grained invalidation
> > where it was available.
>
> No not in the way you are describing it. As Jason said you can hotplug the
> device which is "from the outside" but this would affect all users of that
> device.
>
> Effectively, we would need a way for an admin to close a specific file
> descriptor (or set of fds) which point to that file. AFAIK there is no way to
> do that at all, is there?

You can certainly give the lease holder the option to close the file
voluntarily via the siginfo_t that can be attached to a lease break
signal. But it's not really "close" you want as much as a finer
grained disassociate.

All that said you could require the lease taker opt-in to SIGKILL via
F_SETSIG before marking the lease "exclusive". That effectively
precludes failing truncate, but it's something we can enforce today
and work on finer grained / less drastic escalations over time for
something that should "never" happen.

2019-06-13 16:57:27

by Dan Williams

[permalink] [raw]
Subject: Re: [PATCH RFC 00/10] RDMA/FS DAX truncate proposal

On Wed, Jun 12, 2019 at 4:32 PM Ira Weiny <[email protected]> wrote:
>
> On Wed, Jun 12, 2019 at 03:54:19PM -0700, Dan Williams wrote:
> > On Wed, Jun 12, 2019 at 3:12 PM Ira Weiny <[email protected]> wrote:
> > >
> > > On Wed, Jun 12, 2019 at 04:14:21PM -0300, Jason Gunthorpe wrote:
> > > > On Wed, Jun 12, 2019 at 02:09:07PM +0200, Jan Kara wrote:
> > > > > On Wed 12-06-19 08:47:21, Jason Gunthorpe wrote:
> > > > > > On Wed, Jun 12, 2019 at 12:29:17PM +0200, Jan Kara wrote:
> > > > > >
> > > > > > > > > The main objection to the current ODP & DAX solution is that very
> > > > > > > > > little HW can actually implement it, having the alternative still
> > > > > > > > > require HW support doesn't seem like progress.
> > > > > > > > >
> > > > > > > > > I think we will eventually start seein some HW be able to do this
> > > > > > > > > invalidation, but it won't be universal, and I'd rather leave it
> > > > > > > > > optional, for recovery from truely catastrophic errors (ie my DAX is
> > > > > > > > > on fire, I need to unplug it).
> > > > > > > >
> > > > > > > > Agreed. I think software wise there is not much some of the devices can do
> > > > > > > > with such an "invalidate".
> > > > > > >
> > > > > > > So out of curiosity: What does RDMA driver do when userspace just closes
> > > > > > > the file pointing to RDMA object? It has to handle that somehow by aborting
> > > > > > > everything that's going on... And I wanted similar behavior here.
> > > > > >
> > > > > > It aborts *everything* connected to that file descriptor. Destroying
> > > > > > everything avoids creating inconsistencies that destroying a subset
> > > > > > would create.
> > > > > >
> > > > > > What has been talked about for lease break is not destroying anything
> > > > > > but very selectively saying that one memory region linked to the GUP
> > > > > > is no longer functional.
> > > > >
> > > > > OK, so what I had in mind was that if RDMA app doesn't play by the rules
> > > > > and closes the file with existing pins (and thus layout lease) we would
> > > > > force it to abort everything. Yes, it is disruptive but then the app didn't
> > > > > obey the rule that it has to maintain file lease while holding pins. Thus
> > > > > such situation should never happen unless the app is malicious / buggy.
> > > >
> > > > We do have the infrastructure to completely revoke the entire
> > > > *content* of a FD (this is called device disassociate). It is
> > > > basically close without the app doing close. But again it only works
> > > > with some drivers. However, this is more likely something a driver
> > > > could support without a HW change though.
> > > >
> > > > It is quite destructive as it forcibly kills everything RDMA related
> > > > the process(es) are doing, but it is less violent than SIGKILL, and
> > > > there is perhaps a way for the app to recover from this, if it is
> > > > coded for it.
> > >
> > > I don't think many are... I think most would effectively be "killed" if this
> > > happened to them.
> > >
> > > >
> > > > My preference would be to avoid this scenario, but if it is really
> > > > necessary, we could probably build it with some work.
> > > >
> > > > The only case we use it today is forced HW hot unplug, so it is rarely
> > > > used and only for an 'emergency' like use case.
> > >
> > > I'd really like to avoid this as well. I think it will be very confusing for
> > > RDMA apps to have their context suddenly be invalid. I think if we have a way
> > > for admins to ID who is pinning a file the admin can take more appropriate
> > > action on those processes. Up to and including killing the process.
> >
> > Can RDMA context invalidation, "device disassociate", be inflicted on
> > a process from the outside? Identifying the pid of a pin holder only
> > leaves SIGKILL of the entire process as the remediation for revoking a
> > pin, and I assume admins would use the finer grained invalidation
> > where it was available.
>
> No not in the way you are describing it. As Jason said you can hotplug the
> device which is "from the outside" but this would affect all users of that
> device.
>
> Effectively, we would need a way for an admin to close a specific file
> descriptor (or set of fds) which point to that file. AFAIK there is no way to
> do that at all, is there?

Even if there were that gets back to my other question, does RDMA
teardown happen at close(fd), or at final fput() of the 'struct file'?
I.e. does it also need munmap() to get the vma to drop its reference?
Perhaps a pointer to the relevant code would help me wrap my head
around this mechanism.

2019-06-13 16:59:35

by Dave Chinner

[permalink] [raw]
Subject: Re: [PATCH RFC 00/10] RDMA/FS DAX truncate proposal

On Wed, Jun 12, 2019 at 05:37:53AM -0700, Matthew Wilcox wrote:
> On Sat, Jun 08, 2019 at 10:10:36AM +1000, Dave Chinner wrote:
> > On Fri, Jun 07, 2019 at 11:25:35AM -0700, Ira Weiny wrote:
> > > Are you suggesting that we have something like this from user space?
> > >
> > > fcntl(fd, F_SETLEASE, F_LAYOUT | F_UNBREAKABLE);
> >
> > Rather than "unbreakable", perhaps a clearer description of the
> > policy it entails is "exclusive"?
> >
> > i.e. what we are talking about here is an exclusive lease that
> > prevents other processes from changing the layout. i.e. the
> > mechanism used to guarantee a lease is exclusive is that the layout
> > becomes "unbreakable" at the filesystem level, but the policy we are
> > actually presenting to uses is "exclusive access"...
>
> That's rather different from the normal meaning of 'exclusive' in the
> context of locks, which is "only one user can have access to this at
> a time".


Layout leases are not locks, they are a user access policy object.
It is the process/fd which holds the lease and it's the process/fd
that is granted exclusive access. This is exactly the same semantic
as O_EXCL provides for granting exclusive access to a block device
via open(), yes?

> As I understand it, this is rather more like a 'shared' or
> 'read' lock. The filesystem would be the one which wants an exclusive
> lock, so it can modify the mapping of logical to physical blocks.

ISTM that you're conflating internal filesystem implementation with
application visible semantics. Yes, the filesystem uses internal
locks to serialise the modification of the things the lease manages
access too, but that has nothing to do with the access policy the
lease provides to users.

e.g. Process A has an exclusive layout lease on file F. It does an
IO to file F. The filesystem IO path checks that Process A owns the
lease on the file and so skips straight through layout breaking
because it owns the lease and is allowed to modify the layout. It
then takes the inode metadata locks to allocate new space and write
new data.

Process B now tries to write to file F. The FS checks whether
Process B owns a layout lease on file F. It doesn't, so then it
tries to break the layout lease so the IO can proceed. The layout
breaking code sees that process A has an exclusive layout lease
granted, and so returns -ETXTBSY to process B - it is not allowed to
break the lease and so the IO fails with -ETXTBSY.

i.e. the exclusive layout lease prevents other processes from
performing operations that may need to modify the layout from
performing those operations. It does not "lock" the file/inode in
any way, it just changes how the layout lease breaking behaves.

Further, the "exclusiveness" of a layout lease is completely
irrelevant to the filesystem that is indicating that an operation
that may need to modify the layout is about to be performed. All the
filesystem has to do is handle failures to break the lease
appropriately. Yes, XFS serialises the layout lease validation
against other IO to the same file via it's IO locks, but that's an
internal data IO coherency requirement, not anything to do with
layout lease management.

Note that I talk about /writes/ here. This is interchangable with
any other operation that may need to modify the extent layout of the
file, be it truncate, fallocate, etc: the attempt to break the
layout lease by a non-owner should fail if the lease is "exclusive"
to the owner.

> The complication being that by default the filesystem has an exclusive
> lock on the mapping, and what we're trying to add is the ability for
> readers to ask the filesystem to give up its exclusive lock.

The filesystem doesn't even lock the "mapping" until after the
layout lease has been validated or broken.

Cheers,

Dave.
--
Dave Chinner
[email protected]

2019-06-13 17:01:53

by Ira Weiny

[permalink] [raw]
Subject: Re: [PATCH RFC 00/10] RDMA/FS DAX truncate proposal

On Wed, Jun 12, 2019 at 03:54:19PM -0700, Dan Williams wrote:
> On Wed, Jun 12, 2019 at 3:12 PM Ira Weiny <[email protected]> wrote:
> >
> > On Wed, Jun 12, 2019 at 04:14:21PM -0300, Jason Gunthorpe wrote:
> > > On Wed, Jun 12, 2019 at 02:09:07PM +0200, Jan Kara wrote:
> > > > On Wed 12-06-19 08:47:21, Jason Gunthorpe wrote:
> > > > > On Wed, Jun 12, 2019 at 12:29:17PM +0200, Jan Kara wrote:
> > > > >
> > > > > > > > The main objection to the current ODP & DAX solution is that very
> > > > > > > > little HW can actually implement it, having the alternative still
> > > > > > > > require HW support doesn't seem like progress.
> > > > > > > >
> > > > > > > > I think we will eventually start seein some HW be able to do this
> > > > > > > > invalidation, but it won't be universal, and I'd rather leave it
> > > > > > > > optional, for recovery from truely catastrophic errors (ie my DAX is
> > > > > > > > on fire, I need to unplug it).
> > > > > > >
> > > > > > > Agreed. I think software wise there is not much some of the devices can do
> > > > > > > with such an "invalidate".
> > > > > >
> > > > > > So out of curiosity: What does RDMA driver do when userspace just closes
> > > > > > the file pointing to RDMA object? It has to handle that somehow by aborting
> > > > > > everything that's going on... And I wanted similar behavior here.
> > > > >
> > > > > It aborts *everything* connected to that file descriptor. Destroying
> > > > > everything avoids creating inconsistencies that destroying a subset
> > > > > would create.
> > > > >
> > > > > What has been talked about for lease break is not destroying anything
> > > > > but very selectively saying that one memory region linked to the GUP
> > > > > is no longer functional.
> > > >
> > > > OK, so what I had in mind was that if RDMA app doesn't play by the rules
> > > > and closes the file with existing pins (and thus layout lease) we would
> > > > force it to abort everything. Yes, it is disruptive but then the app didn't
> > > > obey the rule that it has to maintain file lease while holding pins. Thus
> > > > such situation should never happen unless the app is malicious / buggy.
> > >
> > > We do have the infrastructure to completely revoke the entire
> > > *content* of a FD (this is called device disassociate). It is
> > > basically close without the app doing close. But again it only works
> > > with some drivers. However, this is more likely something a driver
> > > could support without a HW change though.
> > >
> > > It is quite destructive as it forcibly kills everything RDMA related
> > > the process(es) are doing, but it is less violent than SIGKILL, and
> > > there is perhaps a way for the app to recover from this, if it is
> > > coded for it.
> >
> > I don't think many are... I think most would effectively be "killed" if this
> > happened to them.
> >
> > >
> > > My preference would be to avoid this scenario, but if it is really
> > > necessary, we could probably build it with some work.
> > >
> > > The only case we use it today is forced HW hot unplug, so it is rarely
> > > used and only for an 'emergency' like use case.
> >
> > I'd really like to avoid this as well. I think it will be very confusing for
> > RDMA apps to have their context suddenly be invalid. I think if we have a way
> > for admins to ID who is pinning a file the admin can take more appropriate
> > action on those processes. Up to and including killing the process.
>
> Can RDMA context invalidation, "device disassociate", be inflicted on
> a process from the outside? Identifying the pid of a pin holder only
> leaves SIGKILL of the entire process as the remediation for revoking a
> pin, and I assume admins would use the finer grained invalidation
> where it was available.

No not in the way you are describing it. As Jason said you can hotplug the
device which is "from the outside" but this would affect all users of that
device.

Effectively, we would need a way for an admin to close a specific file
descriptor (or set of fds) which point to that file. AFAIK there is no way to
do that at all, is there?

Ira

2019-06-13 17:02:08

by Ira Weiny

[permalink] [raw]
Subject: Re: [PATCH RFC 00/10] RDMA/FS DAX truncate proposal

On Wed, Jun 12, 2019 at 05:37:53AM -0700, Matthew Wilcox wrote:
> On Sat, Jun 08, 2019 at 10:10:36AM +1000, Dave Chinner wrote:
> > On Fri, Jun 07, 2019 at 11:25:35AM -0700, Ira Weiny wrote:
> > > Are you suggesting that we have something like this from user space?
> > >
> > > fcntl(fd, F_SETLEASE, F_LAYOUT | F_UNBREAKABLE);
> >
> > Rather than "unbreakable", perhaps a clearer description of the
> > policy it entails is "exclusive"?
> >
> > i.e. what we are talking about here is an exclusive lease that
> > prevents other processes from changing the layout. i.e. the
> > mechanism used to guarantee a lease is exclusive is that the layout
> > becomes "unbreakable" at the filesystem level, but the policy we are
> > actually presenting to uses is "exclusive access"...
>
> That's rather different from the normal meaning of 'exclusive' in the
> context of locks, which is "only one user can have access to this at
> a time". As I understand it, this is rather more like a 'shared' or
> 'read' lock. The filesystem would be the one which wants an exclusive
> lock, so it can modify the mapping of logical to physical blocks.
>
> The complication being that by default the filesystem has an exclusive
> lock on the mapping, and what we're trying to add is the ability for
> readers to ask the filesystem to give up its exclusive lock.

This is an interesting view...

And after some more thought, exclusive does not seem like a good name for this
because technically F_WRLCK _is_ an exclusive lease...

In addition, the user does not need to take the "exclusive" write lease to be
notified of (broken by) an unexpected truncate. A "read" lease is broken by
truncate. (And "write" leases really don't do anything different WRT the
interaction of the FS and the user app. Write leases control "exclusive"
access between other file descriptors.)

Another thing to consider is that this patch set _allows_ a truncate/hole punch
to proceed _if_ the pages being affected are not actually pinned. So the
unbreakable/exclusive nature of the lease is not absolute.

Personally I like this functionality. I'm not quite sure I can make it work
with what Jan is suggesting. But I like it.

Given the muddied water of "exclusive" and "write" lease I'm now feeling like
Jeff has a point WRT the conflation of F_RDLCK/F_WRLCK/F_UNLCK and this new
functionality.

Should we use his suggested F_SETLAYOUT/F_GETLAYOUT cmd type?[1]

Ira

[1] https://lkml.org/lkml/2019/6/9/117

2019-06-13 17:10:34

by Dan Williams

[permalink] [raw]
Subject: Re: [PATCH RFC 00/10] RDMA/FS DAX truncate proposal

On Wed, Jun 12, 2019 at 3:12 PM Ira Weiny <[email protected]> wrote:
>
> On Wed, Jun 12, 2019 at 04:14:21PM -0300, Jason Gunthorpe wrote:
> > On Wed, Jun 12, 2019 at 02:09:07PM +0200, Jan Kara wrote:
> > > On Wed 12-06-19 08:47:21, Jason Gunthorpe wrote:
> > > > On Wed, Jun 12, 2019 at 12:29:17PM +0200, Jan Kara wrote:
> > > >
> > > > > > > The main objection to the current ODP & DAX solution is that very
> > > > > > > little HW can actually implement it, having the alternative still
> > > > > > > require HW support doesn't seem like progress.
> > > > > > >
> > > > > > > I think we will eventually start seein some HW be able to do this
> > > > > > > invalidation, but it won't be universal, and I'd rather leave it
> > > > > > > optional, for recovery from truely catastrophic errors (ie my DAX is
> > > > > > > on fire, I need to unplug it).
> > > > > >
> > > > > > Agreed. I think software wise there is not much some of the devices can do
> > > > > > with such an "invalidate".
> > > > >
> > > > > So out of curiosity: What does RDMA driver do when userspace just closes
> > > > > the file pointing to RDMA object? It has to handle that somehow by aborting
> > > > > everything that's going on... And I wanted similar behavior here.
> > > >
> > > > It aborts *everything* connected to that file descriptor. Destroying
> > > > everything avoids creating inconsistencies that destroying a subset
> > > > would create.
> > > >
> > > > What has been talked about for lease break is not destroying anything
> > > > but very selectively saying that one memory region linked to the GUP
> > > > is no longer functional.
> > >
> > > OK, so what I had in mind was that if RDMA app doesn't play by the rules
> > > and closes the file with existing pins (and thus layout lease) we would
> > > force it to abort everything. Yes, it is disruptive but then the app didn't
> > > obey the rule that it has to maintain file lease while holding pins. Thus
> > > such situation should never happen unless the app is malicious / buggy.
> >
> > We do have the infrastructure to completely revoke the entire
> > *content* of a FD (this is called device disassociate). It is
> > basically close without the app doing close. But again it only works
> > with some drivers. However, this is more likely something a driver
> > could support without a HW change though.
> >
> > It is quite destructive as it forcibly kills everything RDMA related
> > the process(es) are doing, but it is less violent than SIGKILL, and
> > there is perhaps a way for the app to recover from this, if it is
> > coded for it.
>
> I don't think many are... I think most would effectively be "killed" if this
> happened to them.
>
> >
> > My preference would be to avoid this scenario, but if it is really
> > necessary, we could probably build it with some work.
> >
> > The only case we use it today is forced HW hot unplug, so it is rarely
> > used and only for an 'emergency' like use case.
>
> I'd really like to avoid this as well. I think it will be very confusing for
> RDMA apps to have their context suddenly be invalid. I think if we have a way
> for admins to ID who is pinning a file the admin can take more appropriate
> action on those processes. Up to and including killing the process.

Can RDMA context invalidation, "device disassociate", be inflicted on
a process from the outside? Identifying the pid of a pin holder only
leaves SIGKILL of the entire process as the remediation for revoking a
pin, and I assume admins would use the finer grained invalidation
where it was available.

2019-06-13 17:12:55

by Ira Weiny

[permalink] [raw]
Subject: Re: [PATCH RFC 00/10] RDMA/FS DAX truncate proposal

On Wed, Jun 12, 2019 at 04:14:21PM -0300, Jason Gunthorpe wrote:
> On Wed, Jun 12, 2019 at 02:09:07PM +0200, Jan Kara wrote:
> > On Wed 12-06-19 08:47:21, Jason Gunthorpe wrote:
> > > On Wed, Jun 12, 2019 at 12:29:17PM +0200, Jan Kara wrote:
> > >
> > > > > > The main objection to the current ODP & DAX solution is that very
> > > > > > little HW can actually implement it, having the alternative still
> > > > > > require HW support doesn't seem like progress.
> > > > > >
> > > > > > I think we will eventually start seein some HW be able to do this
> > > > > > invalidation, but it won't be universal, and I'd rather leave it
> > > > > > optional, for recovery from truely catastrophic errors (ie my DAX is
> > > > > > on fire, I need to unplug it).
> > > > >
> > > > > Agreed. I think software wise there is not much some of the devices can do
> > > > > with such an "invalidate".
> > > >
> > > > So out of curiosity: What does RDMA driver do when userspace just closes
> > > > the file pointing to RDMA object? It has to handle that somehow by aborting
> > > > everything that's going on... And I wanted similar behavior here.
> > >
> > > It aborts *everything* connected to that file descriptor. Destroying
> > > everything avoids creating inconsistencies that destroying a subset
> > > would create.
> > >
> > > What has been talked about for lease break is not destroying anything
> > > but very selectively saying that one memory region linked to the GUP
> > > is no longer functional.
> >
> > OK, so what I had in mind was that if RDMA app doesn't play by the rules
> > and closes the file with existing pins (and thus layout lease) we would
> > force it to abort everything. Yes, it is disruptive but then the app didn't
> > obey the rule that it has to maintain file lease while holding pins. Thus
> > such situation should never happen unless the app is malicious / buggy.
>
> We do have the infrastructure to completely revoke the entire
> *content* of a FD (this is called device disassociate). It is
> basically close without the app doing close. But again it only works
> with some drivers. However, this is more likely something a driver
> could support without a HW change though.
>
> It is quite destructive as it forcibly kills everything RDMA related
> the process(es) are doing, but it is less violent than SIGKILL, and
> there is perhaps a way for the app to recover from this, if it is
> coded for it.

I don't think many are... I think most would effectively be "killed" if this
happened to them.

>
> My preference would be to avoid this scenario, but if it is really
> necessary, we could probably build it with some work.
>
> The only case we use it today is forced HW hot unplug, so it is rarely
> used and only for an 'emergency' like use case.

I'd really like to avoid this as well. I think it will be very confusing for
RDMA apps to have their context suddenly be invalid. I think if we have a way
for admins to ID who is pinning a file the admin can take more appropriate
action on those processes. Up to and including killing the process.

Ira

2019-06-13 20:33:40

by Ira Weiny

[permalink] [raw]
Subject: Re: [PATCH RFC 00/10] RDMA/FS DAX truncate proposal

On Thu, Jun 13, 2019 at 10:55:52AM +1000, Dave Chinner wrote:
> On Wed, Jun 12, 2019 at 04:30:24PM -0700, Ira Weiny wrote:
> > On Wed, Jun 12, 2019 at 05:37:53AM -0700, Matthew Wilcox wrote:
> > > On Sat, Jun 08, 2019 at 10:10:36AM +1000, Dave Chinner wrote:
> > > > On Fri, Jun 07, 2019 at 11:25:35AM -0700, Ira Weiny wrote:
> > > > > Are you suggesting that we have something like this from user space?
> > > > >
> > > > > fcntl(fd, F_SETLEASE, F_LAYOUT | F_UNBREAKABLE);
> > > >
> > > > Rather than "unbreakable", perhaps a clearer description of the
> > > > policy it entails is "exclusive"?
> > > >
> > > > i.e. what we are talking about here is an exclusive lease that
> > > > prevents other processes from changing the layout. i.e. the
> > > > mechanism used to guarantee a lease is exclusive is that the layout
> > > > becomes "unbreakable" at the filesystem level, but the policy we are
> > > > actually presenting to uses is "exclusive access"...
> > >
> > > That's rather different from the normal meaning of 'exclusive' in the
> > > context of locks, which is "only one user can have access to this at
> > > a time". As I understand it, this is rather more like a 'shared' or
> > > 'read' lock. The filesystem would be the one which wants an exclusive
> > > lock, so it can modify the mapping of logical to physical blocks.
> > >
> > > The complication being that by default the filesystem has an exclusive
> > > lock on the mapping, and what we're trying to add is the ability for
> > > readers to ask the filesystem to give up its exclusive lock.
> >
> > This is an interesting view...
> >
> > And after some more thought, exclusive does not seem like a good name for this
> > because technically F_WRLCK _is_ an exclusive lease...
> >
> > In addition, the user does not need to take the "exclusive" write lease to be
> > notified of (broken by) an unexpected truncate. A "read" lease is broken by
> > truncate. (And "write" leases really don't do anything different WRT the
> > interaction of the FS and the user app. Write leases control "exclusive"
> > access between other file descriptors.)
>
> I've been assuming that there is only one type of layout lease -
> there is no use case I've heard of for read/write layout leases, and
> like you say there is zero difference in behaviour at the filesystem
> level - they all have to be broken to allow a non-lease truncate to
> proceed.
>
> IMO, taking a "read lease" to be able to modify and write to the
> underlying mapping of a file makes absolutely no sense at all.
> IOWs, we're talking exaclty about a revokable layout lease vs an
> exclusive layout lease here, and so read/write really doesn't match
> the policy or semantics we are trying to provide.

I humbly disagree, at least depending on how you look at it... :-D

The patches as they stand expect the user to take a "read" layout lease which
indicates they are currently using "reading" the layout as is. They are not
changing ("writing" to) the layout. They then pin pages which locks parts of
the layout and therefore they expect no "writers" to change the layout.

The "write" layout lease breaks the "read" layout lease indicating that the
layout is being written to. Should the layout be pinned in such a way that the
layout can't be changed the "layout writer" (truncate) fails.

In fact, this is what NFS does right now. The lease it puts on the file is of
"read" type.

nfs4layouts.c:
static int
nfsd4_layout_setlease(struct nfs4_layout_stateid *ls)
{
...
fl->fl_flags = FL_LAYOUT;
fl->fl_type = F_RDLCK;
...
}

I was not changing that much from the NFS patter which meant the break lease
code worked.

Jans proposal is solid but it means that there is no breaking of the lease. I
tried to add an "exclusive" flag to the "write" lease but the __break_lease()
code gets weird. I'm not saying it is not possible. Just that I have not
seen a good way to do it.

>
> > Another thing to consider is that this patch set _allows_ a truncate/hole punch
> > to proceed _if_ the pages being affected are not actually pinned. So the
> > unbreakable/exclusive nature of the lease is not absolute.
>
> If you're talking about the process that owns the layout lease
> running the truncate, then that is fine.
>
> However, if you are talking about a process that does not own the
> layout lease being allowed to truncate a file without first breaking
> the layout lease, then that is fundamentally broken.

In both cases (local or remote process) the lease is broken prior to the
attempt to truncate.

>
> i.e. If you don't own a layout lease, the layout leases must be
> broken before the truncate can proceed.

Agreed.

>
> If it's an exclusive lease,
> then you cannot break the lease and the truncate *must fail before
> it is started*. i.e. the layout lease state must be correctly
> resolved before we start an operation that may modify a file layout.
>
> Determining if we can actually do the truncate based on page state
> occurs /after/ the lease says the truncate can proceed....

That makes a lot of sense and that is the way the patch currently works.

I need to think on this some more. Keeping the lease may not be critical. As
discussed with Jan; dealing with close() is best dealt with by tracking the
actual pins on the file. If that works then we could potentially keep the
lease semantics closer to what you and I are talking about here.

Ira

>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> [email protected]
>

2019-06-13 23:59:12

by Ira Weiny

[permalink] [raw]
Subject: Re: [PATCH RFC 00/10] RDMA/FS DAX truncate proposal

On Thu, Jun 13, 2019 at 08:45:30PM -0300, Jason Gunthorpe wrote:
> On Thu, Jun 13, 2019 at 02:13:21PM -0700, Ira Weiny wrote:
> > On Thu, Jun 13, 2019 at 08:27:55AM -0700, Matthew Wilcox wrote:
> > > On Thu, Jun 13, 2019 at 10:25:55AM +1000, Dave Chinner wrote:
> > > > e.g. Process A has an exclusive layout lease on file F. It does an
> > > > IO to file F. The filesystem IO path checks that Process A owns the
> > > > lease on the file and so skips straight through layout breaking
> > > > because it owns the lease and is allowed to modify the layout. It
> > > > then takes the inode metadata locks to allocate new space and write
> > > > new data.
> > > >
> > > > Process B now tries to write to file F. The FS checks whether
> > > > Process B owns a layout lease on file F. It doesn't, so then it
> > > > tries to break the layout lease so the IO can proceed. The layout
> > > > breaking code sees that process A has an exclusive layout lease
> > > > granted, and so returns -ETXTBSY to process B - it is not allowed to
> > > > break the lease and so the IO fails with -ETXTBSY.
> > >
> > > This description doesn't match the behaviour that RDMA wants either.
> > > Even if Process A has a lease on the file, an IO from Process A which
> > > results in blocks being freed from the file is going to result in the
> > > RDMA device being able to write to blocks which are now freed (and
> > > potentially reallocated to another file).
> >
> > I don't understand why this would not work for RDMA? As long as the layout
> > does not change the page pins can remain in place.
>
> Because process A had a layout lease (and presumably a MR) and the
> layout was still modified in way that invalidates the RDMA MR.

Oh sorry I miss read the above... (got Process A and B mixed up...)

Right, but Process A still can't free those blocks because the gup pin exists
on them... So yea it can't _just_ be a layout lease which controls this on the
"file fd".

Ira

2019-06-14 03:00:23

by Dave Chinner

[permalink] [raw]
Subject: Re: [PATCH RFC 00/10] RDMA/FS DAX truncate proposal

On Thu, Jun 13, 2019 at 01:34:05PM -0700, Ira Weiny wrote:
> On Thu, Jun 13, 2019 at 10:25:55AM +1000, Dave Chinner wrote:
> > On Wed, Jun 12, 2019 at 05:37:53AM -0700, Matthew Wilcox wrote:
> > > On Sat, Jun 08, 2019 at 10:10:36AM +1000, Dave Chinner wrote:
> > > > On Fri, Jun 07, 2019 at 11:25:35AM -0700, Ira Weiny wrote:
> > > > > Are you suggesting that we have something like this from user space?
> > > > >
> > > > > fcntl(fd, F_SETLEASE, F_LAYOUT | F_UNBREAKABLE);
> > > >
> > > > Rather than "unbreakable", perhaps a clearer description of the
> > > > policy it entails is "exclusive"?
> > > >
> > > > i.e. what we are talking about here is an exclusive lease that
> > > > prevents other processes from changing the layout. i.e. the
> > > > mechanism used to guarantee a lease is exclusive is that the layout
> > > > becomes "unbreakable" at the filesystem level, but the policy we are
> > > > actually presenting to uses is "exclusive access"...
> > >
> > > That's rather different from the normal meaning of 'exclusive' in the
> > > context of locks, which is "only one user can have access to this at
> > > a time".
> >
> >
> > Layout leases are not locks, they are a user access policy object.
> > It is the process/fd which holds the lease and it's the process/fd
> > that is granted exclusive access. This is exactly the same semantic
> > as O_EXCL provides for granting exclusive access to a block device
> > via open(), yes?
> >
> > > As I understand it, this is rather more like a 'shared' or
> > > 'read' lock. The filesystem would be the one which wants an exclusive
> > > lock, so it can modify the mapping of logical to physical blocks.
> >
> > ISTM that you're conflating internal filesystem implementation with
> > application visible semantics. Yes, the filesystem uses internal
> > locks to serialise the modification of the things the lease manages
> > access too, but that has nothing to do with the access policy the
> > lease provides to users.
> >
> > e.g. Process A has an exclusive layout lease on file F. It does an
> > IO to file F. The filesystem IO path checks that Process A owns the
> > lease on the file and so skips straight through layout breaking
> > because it owns the lease and is allowed to modify the layout. It
> > then takes the inode metadata locks to allocate new space and write
> > new data.
> >
> > Process B now tries to write to file F. The FS checks whether
> > Process B owns a layout lease on file F. It doesn't, so then it
> > tries to break the layout lease so the IO can proceed. The layout
> > breaking code sees that process A has an exclusive layout lease
> > granted, and so returns -ETXTBSY to process B - it is not allowed to
> > break the lease and so the IO fails with -ETXTBSY.
> >
> > i.e. the exclusive layout lease prevents other processes from
> > performing operations that may need to modify the layout from
> > performing those operations. It does not "lock" the file/inode in
> > any way, it just changes how the layout lease breaking behaves.
>
> Question: Do we expect Process A to get notified that Process B was attempting
> to change the layout?

In which case?

In the non-exclusive case, yes, the lease gets
recalled and the application needs to play nice and release it's
references and drop the lease.

In the exclusive case, no. The application has said "I don't play
nice with others" and so we basically tell process B to get stuffed
and process A can continue onwards oblivious to the wreckage it
leaves behind....

> This changes the exclusivity semantics. While Process A has an exclusive lease
> it could release it if notified to allow process B temporary exclusivity.

And then it's not an exclusive lease - it's just a normal layout
lease. Process B -does not need a lease- to write to the file.

All the layout lease does is provide notification to applications
that rely on the layout of the file being under their control that
someone else is about to modify the layout. The lease holder that
"plays nice" then releases the layout and drops it's lease, allowing
process B to begin it's operation.

Process A then immediately takes a new layout lease, and remaps the
file layout via FIEMAP or by creating a new RDMA MR for the mmap
region. THose operations get serialised by the filesystem because
the operation being run by process B is run atomically w.r.t. the
original lease being broken. Hence the new mapping that process A
gets with it's new lease reflects whatever change was made by
process B.

IOWs, the "normal" layout lease recall behaviour provides "temporary
exclusivity" for third parties. If you are able to release leases
temporarily and regain them then there is no need for an exclusive
lease.

> Question 2: Do we expect other process' (say Process C) to also be able to map
> and pin the file? I believe users will need this and for layout purposes it is
> ok to do so. But this means that Process A does not have "exclusive" access to
> the lease.

This is an application architecture problem, not a layout lease or
filesystem problem. :)

i.e. if you have a single process controlling all the RDMA mappings,
then you can use exclusive leases. If you have multiple processes
that are uncoordinated and all require layout access to the same
file then you can't use exclusive layout leases in the application.
i.e. your application has to play nice with others.

Indeed, this is more than a application architecture problem - it's
actually a system wide architecture problem. e.g. the pNFS server
cannot use exclusive layout leases because it has to play nice with
anything else on the local filesystem that might require a layout
lease. An example of this woudl be an app that provides coherent
RDMA access to the same storage that pNFS is sharing (e.g. a
userspace CIFS server).

Hence I see that exclusive layout leases will end up being the
exception rather than the norm, because most applications will need
to play nice with other applications on the system that also
directly access the storage under the filesystem....

> So given Process C has also placed a layout lease on the file. Indicating
> that it does not want the layout to change.

That is *not what layout leases provide*.

Layout leases grant the owner the ability to map the layout and
directly access the underlying storage and to do it safely because
they will get a notification of 3rd party access that will
invalidate their mapping. Layout leases do not prevent anyone from
_changing_ the layout and, in fact, pNFS _requires_ the lease holder
to be able to modify the layout.

IOWs, the layout lease _as it stands now_ is a notification
mechanism that tells the lease owner when someone else is about to
modify the layout. It does not make the file layout immutable.

The "exclusive" aspect of layout we have been discussing is a
mechanism that prevents 3rd party modification of the layout by
denying the ability to break the layout. This "exclusive" aspect
does not make the layout immutable, either, it just means the
layout is only modifiable by the exclusive lease holder.

Cheers,

Dave.
--
Dave Chinner
[email protected]

2019-06-14 03:10:21

by Dave Chinner

[permalink] [raw]
Subject: Re: [PATCH RFC 00/10] RDMA/FS DAX truncate proposal

On Thu, Jun 13, 2019 at 07:31:07PM -0700, Matthew Wilcox wrote:
> On Fri, Jun 14, 2019 at 12:09:21PM +1000, Dave Chinner wrote:
> > If the lease holder modifies the mapping in a way that causes it's
> > own internal state to screw up, then that's a bug in the lease
> > holder application.
>
> Sounds like the lease semantics aren't the right ones for the longterm
> GUP users then. The point of the longterm GUP is so the pages can be
> written to, and if the filesystem is going to move the pages around when
> they're written to, that just won't work.

And now we go full circle back to the constraints we decided on long
ago because we can't rely on demand paging RDMA hardware any time
soon to do everything we need to transparently support long-term GUP
on file-backed mappings. i.e.:

RDMA to file backed mappings must first preallocate and
write zeros to the range of the file they are mapping so
that the filesystem block mapping is complete and static for
the life of the RDMA mapping that will pin it.

IOWs, the layout lease will tell the RDMA application that the
static setup it has already done to work correctly with a file
backed mapping may be about to be broken by a third party.....

-Dave.
--
Dave Chinner
[email protected]

2019-06-14 03:44:03

by Dave Chinner

[permalink] [raw]
Subject: Re: [PATCH RFC 00/10] RDMA/FS DAX truncate proposal

On Thu, Jun 13, 2019 at 01:34:06PM -0700, Ira Weiny wrote:
> On Thu, Jun 13, 2019 at 10:55:52AM +1000, Dave Chinner wrote:
> > On Wed, Jun 12, 2019 at 04:30:24PM -0700, Ira Weiny wrote:
> > > On Wed, Jun 12, 2019 at 05:37:53AM -0700, Matthew Wilcox wrote:
> > > > On Sat, Jun 08, 2019 at 10:10:36AM +1000, Dave Chinner wrote:
> > > > > On Fri, Jun 07, 2019 at 11:25:35AM -0700, Ira Weiny wrote:
> > > > > > Are you suggesting that we have something like this from user space?
> > > > > >
> > > > > > fcntl(fd, F_SETLEASE, F_LAYOUT | F_UNBREAKABLE);
> > > > >
> > > > > Rather than "unbreakable", perhaps a clearer description of the
> > > > > policy it entails is "exclusive"?
> > > > >
> > > > > i.e. what we are talking about here is an exclusive lease that
> > > > > prevents other processes from changing the layout. i.e. the
> > > > > mechanism used to guarantee a lease is exclusive is that the layout
> > > > > becomes "unbreakable" at the filesystem level, but the policy we are
> > > > > actually presenting to uses is "exclusive access"...
> > > >
> > > > That's rather different from the normal meaning of 'exclusive' in the
> > > > context of locks, which is "only one user can have access to this at
> > > > a time". As I understand it, this is rather more like a 'shared' or
> > > > 'read' lock. The filesystem would be the one which wants an exclusive
> > > > lock, so it can modify the mapping of logical to physical blocks.
> > > >
> > > > The complication being that by default the filesystem has an exclusive
> > > > lock on the mapping, and what we're trying to add is the ability for
> > > > readers to ask the filesystem to give up its exclusive lock.
> > >
> > > This is an interesting view...
> > >
> > > And after some more thought, exclusive does not seem like a good name for this
> > > because technically F_WRLCK _is_ an exclusive lease...
> > >
> > > In addition, the user does not need to take the "exclusive" write lease to be
> > > notified of (broken by) an unexpected truncate. A "read" lease is broken by
> > > truncate. (And "write" leases really don't do anything different WRT the
> > > interaction of the FS and the user app. Write leases control "exclusive"
> > > access between other file descriptors.)
> >
> > I've been assuming that there is only one type of layout lease -
> > there is no use case I've heard of for read/write layout leases, and
> > like you say there is zero difference in behaviour at the filesystem
> > level - they all have to be broken to allow a non-lease truncate to
> > proceed.
> >
> > IMO, taking a "read lease" to be able to modify and write to the
> > underlying mapping of a file makes absolutely no sense at all.
> > IOWs, we're talking exaclty about a revokable layout lease vs an
> > exclusive layout lease here, and so read/write really doesn't match
> > the policy or semantics we are trying to provide.
>
> I humbly disagree, at least depending on how you look at it... :-D
>
> The patches as they stand expect the user to take a "read" layout lease which
> indicates they are currently using "reading" the layout as is.
> They are not
> changing ("writing" to) the layout.

As I said in a another email in the thread, a layout lease does not
make the layout "read only". It just means the lease owner will be
notified when someone else is about to modify it. The lease owner
can modify the mapping themselves, and they will not get notified
about their own modifications.

> They then pin pages which locks parts of
> the layout and therefore they expect no "writers" to change the layout.

Except they can change the layout themselves. It's perfectly valid
to get a layout lease, write() from offset 0 to EOF and fsync() to
intiialise the file and allocate all the space in the file, then
mmap() it and hand to off to RMDA, all while holding the layout
lease.

> The "write" layout lease breaks the "read" layout lease indicating that the
> layout is being written to.

Layout leases do not work this way.

> In fact, this is what NFS does right now. The lease it puts on the file is of
> "read" type.
>
> nfs4layouts.c:
> static int
> nfsd4_layout_setlease(struct nfs4_layout_stateid *ls)
> {
> ...
> fl->fl_flags = FL_LAYOUT;
> fl->fl_type = F_RDLCK;
> ...
> }

Yes, the existing /implementation/ uses F_RDLCK, but that doesn't
mean the layout is "read only". Look at the pNFS mapping layout code
- the ->map_blocks export operation:

int (*map_blocks)(struct inode *inode, loff_t offset,
u64 len, struct iomap *iomap,
bool write, u32 *device_generation);
^^^^^^^^^^

Yup, it has a write variable that, when set, causes the filesystem
to _allocate_ blocks if the range to be written to falls over a hole
in the file. IOWs, a pNFS layout lease can modify the file layout -
you're conflating use of a "read lock" API to mean that what the
lease _manages_ is "read only". That is not correct.

Layouts are /always writeable/ by the lease owner(s), the question
here is what we do with third parties attempting to modify a layout
covered by an "exclusive" layout lease. Hence, I'll repeat:

> > we're talking exaclty about a revokable layout lease vs an
> > exclusive layout lease here, and so read/write really doesn't match
> > the policy or semantics we are trying to provide.

Cheers,

Dave.
--
Dave Chinner
[email protected]