2018-02-23 07:28:07

by Dan Williams

Subject: [PATCH v2 0/5] vfio, dax: prevent long term filesystem-dax pins and other fixes

Changes since v1 [1]:

* Fix the detection of device-dax file instances in vma_is_fsdax().
(Haozhong, Gerd)

* Fix compile breakage in the FS_DAX=n and DEV_DAX=y case. (0day robot)

[1]: https://lists.01.org/pipermail/linux-nvdimm/2018-February/014046.html

---

The vfio interface, like RDMA, wants to set up long-term (indefinite)
pins of the pages backing an address range so that a guest or userspace
driver can perform DMA to them by physical address. Given that this
pinning may lead to filesystem operations deadlocking in the
filesystem-dax case, the pinning request needs to be rejected.

The longer-term fix for vfio, RDMA, and any other long-term pin user is
to provide a 'pin with lease' mechanism. Similar to the leases that are
held for pNFS RDMA layouts, this userspace lease gives the kernel a way
to notify userspace that the block layout of the file is changing and
the kernel is revoking access to pinned pages.

---

Dan Williams (5):
  dax: fix vma_is_fsdax() helper
  dax: fix dax_mapping() definition in the FS_DAX=n + DEV_DAX=y case
  dax: fix S_DAX definition
  dax: short circuit vma_is_fsdax() in the CONFIG_FS_DAX=n case
  vfio: disable filesystem-dax page pinning

 drivers/vfio/vfio_iommu_type1.c | 18 +++++++++++++++---
 include/linux/dax.h             |  9 ++++++---
 include/linux/fs.h              |  6 ++++--
 3 files changed, 25 insertions(+), 8 deletions(-)


2018-02-23 07:28:09

by Dan Williams

Subject: [PATCH v2 1/5] dax: fix vma_is_fsdax() helper

Gerd reports that ->i_mode may contain other bits besides S_IFCHR. Use
S_ISCHR() instead. Otherwise, get_user_pages_longterm() may fail on
device-dax instances when those are meant to be explicitly allowed.

Fixes: 2bb6d2837083 ("mm: introduce get_user_pages_longterm")
Cc: <[email protected]>
Reported-by: Gerd Rausch <[email protected]>
Reported-by: Haozhong Zhang <[email protected]>
Signed-off-by: Dan Williams <[email protected]>
---
include/linux/fs.h | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index 2a815560fda0..79c413985305 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -3198,7 +3198,7 @@ static inline bool vma_is_fsdax(struct vm_area_struct *vma)
 	if (!vma_is_dax(vma))
 		return false;
 	inode = file_inode(vma->vm_file);
-	if (inode->i_mode == S_IFCHR)
+	if (S_ISCHR(inode->i_mode))
 		return false; /* device-dax */
 	return true;
 }
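
The difference between the two tests can be reproduced outside the kernel. What follows is a minimal userspace sketch, assuming a POSIX <sys/stat.h>; the helper names is_chr_by_equality() and is_chr_by_macro() are invented for illustration and are not kernel functions.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <stdbool.h>
#include <sys/stat.h>
#include <sys/types.h>

/*
 * Models the pre-patch test: true only when the mode holds the
 * S_IFCHR type bits and nothing else (no permission bits at all).
 */
static bool is_chr_by_equality(mode_t mode)
{
	return mode == S_IFCHR;
}

/*
 * Models the post-patch test: S_ISCHR() masks the mode with S_IFMT
 * before comparing, so permission bits do not affect the result.
 */
static bool is_chr_by_macro(mode_t mode)
{
	return S_ISCHR(mode);
}
```

For a character device with mode S_IFCHR | 0600, is_chr_by_equality() returns false while is_chr_by_macro() returns true, which is the case the commit message describes for device-dax instances.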


2018-02-23 07:28:23

by Dan Williams

Subject: [PATCH v2 4/5] dax: short circuit vma_is_fsdax() in the CONFIG_FS_DAX=n case

Do not bother looking up the file type in the case when Filesystem-DAX
is disabled at build time.

Cc: Alexander Viro <[email protected]>
Cc: [email protected]
Cc: Christoph Hellwig <[email protected]>
Cc: Jan Kara <[email protected]>
Signed-off-by: Dan Williams <[email protected]>
---
include/linux/fs.h | 2 ++
1 file changed, 2 insertions(+)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index b2fa9b4c1e51..8f80d9fff86d 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -3195,6 +3195,8 @@ static inline bool vma_is_fsdax(struct vm_area_struct *vma)

 	if (!vma->vm_file)
 		return false;
+	if (!IS_ENABLED(CONFIG_FS_DAX))
+		return false;
 	if (!vma_is_dax(vma))
 		return false;
 	inode = file_inode(vma->vm_file);
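
The resulting order of checks can be modeled in userspace. This is only a sketch under stated assumptions: struct model_vma, MODEL_FS_DAX and model_vma_is_fsdax() are hypothetical stand-ins for the kernel's vm_area_struct, IS_ENABLED(CONFIG_FS_DAX) and vma_is_fsdax(), and the S_ISCHR() test reflects patch 1/5 of this series.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <stdbool.h>
#include <sys/stat.h>
#include <sys/types.h>

/* Stand-in for IS_ENABLED(CONFIG_FS_DAX): a compile-time 0 or 1. */
#ifndef MODEL_FS_DAX
#define MODEL_FS_DAX 0
#endif

/* Hypothetical, simplified view of the fields the helper looks at. */
struct model_vma {
	bool has_file;	/* models vma->vm_file != NULL */
	bool is_dax;	/* models vma_is_dax(vma) */
	mode_t mode;	/* models file_inode(vma->vm_file)->i_mode */
};

static bool model_vma_is_fsdax(const struct model_vma *vma)
{
	if (!vma->has_file)
		return false;
	/* Constant condition: the file type lookup below is skipped. */
	if (!MODEL_FS_DAX)
		return false;
	if (!vma->is_dax)
		return false;
	if (S_ISCHR(vma->mode))
		return false;	/* device-dax */
	return true;
}
```

With the default MODEL_FS_DAX of 0, every mapping is reported as not fs-dax, including a DAX mapping of a regular file. Building the sketch with -DMODEL_FS_DAX=1 makes that same mapping return true.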


2018-02-23 07:28:42

by Dan Williams

Subject: [PATCH v2 5/5] vfio: disable filesystem-dax page pinning

Filesystem-DAX is incompatible with 'longterm' page pinning. Without
page cache indirection, a DAX mapping maps filesystem blocks directly.
This means that the filesystem must not modify a file's block map while
any page in a mapping is pinned. To prevent userspace from holding off
filesystem operations indefinitely, disallow 'longterm' Filesystem-DAX
mappings.

RDMA has the same conflict and the plan there is to add a 'with lease'
mechanism to allow the kernel to notify userspace that the mapping is
being torn down for block-map maintenance. Perhaps something similar can
be put in place for vfio.

Note that xfs and ext4 still report:

"DAX enabled. Warning: EXPERIMENTAL, use at your own risk"

...at mount time, and resolving the dax-dma-vs-truncate problem is one
of the last hurdles to remove that designation.

Acked-by: Alex Williamson <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: [email protected]
Cc: <[email protected]>
Reported-by: Haozhong Zhang <[email protected]>
Fixes: d475c6346a38 ("dax,ext2: replace XIP read and write with DAX I/O")
Signed-off-by: Dan Williams <[email protected]>
---
drivers/vfio/vfio_iommu_type1.c | 18 +++++++++++++++---
1 file changed, 15 insertions(+), 3 deletions(-)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index e30e29ae4819..45657e2b1ff7 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -338,11 +338,12 @@ static int vaddr_get_pfn(struct mm_struct *mm, unsigned long vaddr,
 {
 	struct page *page[1];
 	struct vm_area_struct *vma;
+	struct vm_area_struct *vmas[1];
 	int ret;
 
 	if (mm == current->mm) {
-		ret = get_user_pages_fast(vaddr, 1, !!(prot & IOMMU_WRITE),
-					  page);
+		ret = get_user_pages_longterm(vaddr, 1, !!(prot & IOMMU_WRITE),
+					      page, vmas);
 	} else {
 		unsigned int flags = 0;
 
@@ -351,7 +352,18 @@ static int vaddr_get_pfn(struct mm_struct *mm, unsigned long vaddr,
 
 		down_read(&mm->mmap_sem);
 		ret = get_user_pages_remote(NULL, mm, vaddr, 1, flags, page,
-					    NULL, NULL);
+					    vmas, NULL);
+		/*
+		 * The lifetime of a vaddr_get_pfn() page pin is
+		 * userspace-controlled. In the fs-dax case this could
+		 * lead to indefinite stalls in filesystem operations.
+		 * Disallow attempts to pin fs-dax pages via this
+		 * interface.
+		 */
+		if (ret > 0 && vma_is_fsdax(vmas[0])) {
+			ret = -EOPNOTSUPP;
+			put_page(page[0]);
+		}
 		up_read(&mm->mmap_sem);
 	}
 
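
The remote-mm branch above follows a pin, check, release pattern: the page reference is taken first, and dropped again if the mapping turns out to be fs-dax. The userspace sketch below models only that control flow. struct model_page, struct model_vma and model_pin_page() are hypothetical; the reference counter stands in for the reference that get_user_pages_remote() takes and put_page() drops.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-ins for struct page and struct vm_area_struct. */
struct model_page {
	int refcount;
};

struct model_vma {
	bool fsdax;	/* models the result of vma_is_fsdax(vma) */
};

/*
 * Returns 1 when the page stays pinned, or -EOPNOTSUPP when the
 * mapping is fs-dax. On failure the reference taken here is dropped,
 * so the caller never holds a pin on an fs-dax page.
 */
static int model_pin_page(struct model_page *page,
			  const struct model_vma *vma)
{
	int ret;

	page->refcount++;	/* models a successful one-page pin */
	ret = 1;

	if (ret > 0 && vma->fsdax) {
		ret = -EOPNOTSUPP;
		page->refcount--;	/* models put_page() */
	}
	return ret;
}
```

The point of the ordering is that the error path leaves the reference count exactly where it started, so a rejected request does not leak a pin.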



2018-02-23 07:29:34

by Dan Williams

Subject: [PATCH v2 3/5] dax: fix S_DAX definition

Make sure S_DAX is defined in the CONFIG_FS_DAX=n + CONFIG_DEV_DAX=y
case. Otherwise vma_is_dax() may incorrectly return false in the
Device-DAX case.

Cc: Alexander Viro <[email protected]>
Cc: [email protected]
Cc: Christoph Hellwig <[email protected]>
Cc: Jan Kara <[email protected]>
Cc: <[email protected]>
Fixes: dee410792419 ("/dev/dax, core: file operations and dax-mmap")
Signed-off-by: Dan Williams <[email protected]>
---
include/linux/fs.h | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index 79c413985305..b2fa9b4c1e51 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1859,7 +1859,7 @@ struct super_operations {
 #define S_IMA 1024 /* Inode has an associated IMA struct */
 #define S_AUTOMOUNT 2048 /* Automount/referral quasi-directory */
 #define S_NOSEC 4096 /* no suid or xattr security attributes */
-#ifdef CONFIG_FS_DAX
+#if IS_ENABLED(CONFIG_FS_DAX) || IS_ENABLED(CONFIG_DEV_DAX)
 #define S_DAX 8192 /* Direct Access, avoiding the page cache */
 #else
 #define S_DAX 0 /* Make all the DAX code disappear */
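
Why a zero-valued flag matters can be shown with a small userspace model. This is a sketch, not kernel code: MODEL_DEV_DAX, MODEL_FS_DAX, OLD_S_DAX, NEW_S_DAX and the two helpers are hypothetical names, and the configuration modeled is device-dax enabled with filesystem-dax disabled.

```c
#include <assert.h>
#include <stdbool.h>

/* Modeled configuration: device-dax on, filesystem-dax off. */
#define MODEL_DEV_DAX 1
/* MODEL_FS_DAX is intentionally left undefined. */

/* Pre-patch rule: only filesystem-dax gives the flag a real bit. */
#ifdef MODEL_FS_DAX
#define OLD_S_DAX 8192u
#else
#define OLD_S_DAX 0u
#endif

/* Post-patch rule: either DAX flavor gives the flag a real bit. */
#if defined(MODEL_FS_DAX) || defined(MODEL_DEV_DAX)
#define NEW_S_DAX 8192u
#else
#define NEW_S_DAX 0u
#endif

/* A flag test against 0 is always false, whatever i_flags holds. */
static bool old_is_dax(unsigned int i_flags)
{
	return (i_flags & OLD_S_DAX) != 0;
}

static bool new_is_dax(unsigned int i_flags)
{
	return (i_flags & NEW_S_DAX) != 0;
}
```

In this configuration old_is_dax() can never return true, which corresponds to the vma_is_dax() misbehavior described in the commit message, while new_is_dax() reports true for flags that carry the 8192 bit.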


2018-02-23 07:29:38

by Dan Williams

Subject: [PATCH v2 2/5] dax: fix dax_mapping() definition in the FS_DAX=n + DEV_DAX=y case

An address_space will only have dax exceptional entries when FS_DAX is
enabled. The current reliance on S_DAX causes compile failures when
S_DAX is defined for DEV_DAX, but FS_DAX is disabled. Make dax_mapping()
always return false so that mm/truncate.c drops its link time
dependencies on fs/dax.c.

Cc: Alexander Viro <[email protected]>
Cc: [email protected]
Cc: Christoph Hellwig <[email protected]>
Cc: Jan Kara <[email protected]>
Cc: <[email protected]>
Reported-by: kbuild test robot <[email protected]>
Fixes: dee410792419 ("/dev/dax, core: file operations and dax-mmap")
Signed-off-by: Dan Williams <[email protected]>
---
include/linux/dax.h | 9 ++++++---
1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/include/linux/dax.h b/include/linux/dax.h
index 0185ecdae135..62e8cf7eb566 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -107,6 +107,10 @@ int dax_invalidate_mapping_entry_sync(struct address_space *mapping,
 int __dax_zero_page_range(struct block_device *bdev,
 		struct dax_device *dax_dev, sector_t sector,
 		unsigned int offset, unsigned int length);
+static inline bool dax_mapping(struct address_space *mapping)
+{
+	return mapping->host && IS_DAX(mapping->host);
+}
 #else
 static inline int __dax_zero_page_range(struct block_device *bdev,
 		struct dax_device *dax_dev, sector_t sector,
@@ -114,12 +118,11 @@ static inline int __dax_zero_page_range(struct block_device *bdev,
 {
 	return -ENXIO;
 }
-#endif
-
 static inline bool dax_mapping(struct address_space *mapping)
 {
-	return mapping->host && IS_DAX(mapping->host);
+	return false;
 }
+#endif
 
 struct writeback_control;
 int dax_writeback_mapping_range(struct address_space *mapping,


2018-02-23 08:58:08

by Haozhong Zhang

Subject: Re: [PATCH v2 0/5] vfio, dax: prevent long term filesystem-dax pins and other fixes

On 02/22/18 23:17 -0800, Dan Williams wrote:
> Changes since v1 [1]:
>
> * Fix the detection of device-dax file instances in vma_is_fsdax().
> (Haozhong, Gerd)
>
> * Fix compile breakage in the FS_DAX=n and DEV_DAX=y case. (0day robot)
>
> [1]: https://lists.01.org/pipermail/linux-nvdimm/2018-February/014046.html
>
> ---
>
> The vfio interface, like RDMA, wants to set up long-term (indefinite)
> pins of the pages backing an address range so that a guest or userspace
> driver can perform DMA to them by physical address. Given that this
> pinning may lead to filesystem operations deadlocking in the
> filesystem-dax case, the pinning request needs to be rejected.
>
> The longer-term fix for vfio, RDMA, and any other long-term pin user is
> to provide a 'pin with lease' mechanism. Similar to the leases that are
> held for pNFS RDMA layouts, this userspace lease gives the kernel a way
> to notify userspace that the block layout of the file is changing and
> the kernel is revoking access to pinned pages.
>
> ---
>
> Dan Williams (5):
> dax: fix vma_is_fsdax() helper
> dax: fix dax_mapping() definition in the FS_DAX=n + DEV_DAX=y case
> dax: fix S_DAX definition
> dax: short circuit vma_is_fsdax() in the CONFIG_FS_DAX=n case
> vfio: disable filesystem-dax page pinning
>
>
> drivers/vfio/vfio_iommu_type1.c | 18 +++++++++++++++---
> include/linux/dax.h | 9 ++++++---
> include/linux/fs.h | 6 ++++--
> 3 files changed, 25 insertions(+), 8 deletions(-)

Tested on QEMU with vfio passthrough, using fs-dax and device-dax in
turn as the vNVDIMM backend. QEMU fails in the fs-dax case as expected,
and the device-dax case now works normally.

Tested-by: Haozhong Zhang <[email protected]>


2018-02-23 16:57:15

by Dan Williams

Subject: Re: [PATCH v2 0/5] vfio, dax: prevent long term filesystem-dax pins and other fixes

On Fri, Feb 23, 2018 at 12:55 AM, Haozhong Zhang
<[email protected]> wrote:
> On 02/22/18 23:17 -0800, Dan Williams wrote:
>> Changes since v1 [1]:
>>
>> * Fix the detection of device-dax file instances in vma_is_fsdax().
>> (Haozhong, Gerd)
>>
>> * Fix compile breakage in the FS_DAX=n and DEV_DAX=y case. (0day robot)
>>
>> [1]: https://lists.01.org/pipermail/linux-nvdimm/2018-February/014046.html
>>
>> ---
>>
>> The vfio interface, like RDMA, wants to set up long-term (indefinite)
>> pins of the pages backing an address range so that a guest or userspace
>> driver can perform DMA to them by physical address. Given that this
>> pinning may lead to filesystem operations deadlocking in the
>> filesystem-dax case, the pinning request needs to be rejected.
>>
>> The longer-term fix for vfio, RDMA, and any other long-term pin user is
>> to provide a 'pin with lease' mechanism. Similar to the leases that are
>> held for pNFS RDMA layouts, this userspace lease gives the kernel a way
>> to notify userspace that the block layout of the file is changing and
>> the kernel is revoking access to pinned pages.
>>
>> ---
>>
>> Dan Williams (5):
>> dax: fix vma_is_fsdax() helper
>> dax: fix dax_mapping() definition in the FS_DAX=n + DEV_DAX=y case
>> dax: fix S_DAX definition
>> dax: short circuit vma_is_fsdax() in the CONFIG_FS_DAX=n case
>> vfio: disable filesystem-dax page pinning
>>
>>
>> drivers/vfio/vfio_iommu_type1.c | 18 +++++++++++++++---
>> include/linux/dax.h | 9 ++++++---
>> include/linux/fs.h | 6 ++++--
>> 3 files changed, 25 insertions(+), 8 deletions(-)
>
> Tested on QEMU with vfio passthrough, using fs-dax and device-dax in
> turn as the vNVDIMM backend. QEMU fails in the fs-dax case as expected,
> and the device-dax case now works normally.
>
> Tested-by: Haozhong Zhang <[email protected]>
>

Thank you!

2018-02-26 09:45:26

by Jan Kara

Subject: Re: [PATCH v2 3/5] dax: fix S_DAX definition

On Thu 22-02-18 23:17:56, Dan Williams wrote:
> Make sure S_DAX is defined in the CONFIG_FS_DAX=n + CONFIG_DEV_DAX=y
> case. Otherwise vma_is_dax() may incorrectly return false in the
> Device-DAX case.
>
> Cc: Alexander Viro <[email protected]>
> Cc: [email protected]
> Cc: Christoph Hellwig <[email protected]>
> Cc: Jan Kara <[email protected]>
> Cc: <[email protected]>
> Fixes: dee410792419 ("/dev/dax, core: file operations and dax-mmap")
> Signed-off-by: Dan Williams <[email protected]>

Looks good. You can add:

Reviewed-by: Jan Kara <[email protected]>

Honza

> ---
> include/linux/fs.h | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 79c413985305..b2fa9b4c1e51 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -1859,7 +1859,7 @@ struct super_operations {
> #define S_IMA 1024 /* Inode has an associated IMA struct */
> #define S_AUTOMOUNT 2048 /* Automount/referral quasi-directory */
> #define S_NOSEC 4096 /* no suid or xattr security attributes */
> -#ifdef CONFIG_FS_DAX
> +#if IS_ENABLED(CONFIG_FS_DAX) || IS_ENABLED(CONFIG_DEV_DAX)
> #define S_DAX 8192 /* Direct Access, avoiding the page cache */
> #else
> #define S_DAX 0 /* Make all the DAX code disappear */
>
--
Jan Kara <[email protected]>
SUSE Labs, CR

2018-02-26 09:46:02

by Jan Kara

Subject: Re: [PATCH v2 2/5] dax: fix dax_mapping() definition in the FS_DAX=n + DEV_DAX=y case

On Thu 22-02-18 23:17:51, Dan Williams wrote:
> An address_space will only have dax exceptional entries when FS_DAX is
> enabled. The current reliance on S_DAX causes compile failures when
> S_DAX is defined for DEV_DAX, but FS_DAX is disabled. Make dax_mapping()
> always return false so that mm/truncate.c drops its link time
> dependencies on fs/dax.c.
>
> Cc: Alexander Viro <[email protected]>
> Cc: [email protected]
> Cc: Christoph Hellwig <[email protected]>
> Cc: Jan Kara <[email protected]>
> Cc: <[email protected]>
> Reported-by: kbuild test robot <[email protected]>
> Fixes: dee410792419 ("/dev/dax, core: file operations and dax-mmap")
> Signed-off-by: Dan Williams <[email protected]>

Looks good. You can add:

Reviewed-by: Jan Kara <[email protected]>

Honza

> ---
> include/linux/dax.h | 9 ++++++---
> 1 file changed, 6 insertions(+), 3 deletions(-)
>
> diff --git a/include/linux/dax.h b/include/linux/dax.h
> index 0185ecdae135..62e8cf7eb566 100644
> --- a/include/linux/dax.h
> +++ b/include/linux/dax.h
> @@ -107,6 +107,10 @@ int dax_invalidate_mapping_entry_sync(struct address_space *mapping,
> int __dax_zero_page_range(struct block_device *bdev,
> struct dax_device *dax_dev, sector_t sector,
> unsigned int offset, unsigned int length);
> +static inline bool dax_mapping(struct address_space *mapping)
> +{
> + return mapping->host && IS_DAX(mapping->host);
> +}
> #else
> static inline int __dax_zero_page_range(struct block_device *bdev,
> struct dax_device *dax_dev, sector_t sector,
> @@ -114,12 +118,11 @@ static inline int __dax_zero_page_range(struct block_device *bdev,
> {
> return -ENXIO;
> }
> -#endif
> -
> static inline bool dax_mapping(struct address_space *mapping)
> {
> - return mapping->host && IS_DAX(mapping->host);
> + return false;
> }
> +#endif
>
> struct writeback_control;
> int dax_writeback_mapping_range(struct address_space *mapping,
>
--
Jan Kara <[email protected]>
SUSE Labs, CR