2024-06-05 19:30:25

by Martin Oliveira

Subject: [PATCH 0/6] Enable P2PDMA in Userspace RDMA

This patch series enables P2PDMA memory to be used in userspace RDMA
transfers. With this series, P2PDMA memory mmaped into userspace (i.e.
only NVMe CMBs, at the moment) can then be used with ibv_reg_mr() (or
similar) interfaces. This can be tested by passing a sysfs p2pmem
allocator to the --mmap flag of the perftest tools.

This requires addressing three issues:

* Stop exporting the P2PDMA VMAs with page_mkwrite, which is incompatible
with FOLL_LONGTERM.

* Fix folio_fast_pin_allowed() path to take into account ZONE_DEVICE pages.

* Remove the restriction on FOLL_LONGTERM with FOLL_PCI_P2PDMA, which was
initially put in place out of caution, on the assumption that P2PDMA
would have similar problems to fsdax with unmap_mapping_range(). Since
P2PDMA only uses unmap_mapping_range() on device unbind and immediately
waits for all page reference counts to go to zero after calling it, it
is believed to be safe from reuse and user access faults. See
[1] for more discussion.

This was tested using a Mellanox ConnectX-6 SmartNIC (MT28908 Family),
using the mlx5_core driver, as well as an NVMe CMB.

Thanks,
Martin

[1]: https://lore.kernel.org/linux-mm/[email protected]/T/

Martin Oliveira (6):
kernfs: create vm_operations_struct without page_mkwrite()
sysfs: add mmap_allocates parameter to struct bin_attribute
PCI/P2PDMA: create VMA without page_mkwrite() operator
mm/gup: handle ZONE_DEVICE pages in folio_fast_pin_allowed()
mm/gup: allow FOLL_LONGTERM & FOLL_PCI_P2PDMA
RDMA/umem: add support for P2P RDMA

drivers/infiniband/core/umem.c | 3 +++
drivers/pci/p2pdma.c | 1 +
fs/kernfs/file.c | 15 ++++++++++++++-
fs/sysfs/file.c | 25 +++++++++++++++++++------
include/linux/kernfs.h | 7 +++++++
include/linux/sysfs.h | 1 +
mm/gup.c | 9 ++++-----
7 files changed, 49 insertions(+), 12 deletions(-)


base-commit: c3f38fa61af77b49866b006939479069cd451173
--
2.34.1



2024-06-05 19:30:38

by Martin Oliveira

Subject: [PATCH 3/6] PCI/P2PDMA: create VMA without page_mkwrite() operator

The P2PDMA code does not need (or want) a page_mkwrite() operator on its
VMA.

Furthermore, having the page_mkwrite() operator causes
writable_file_mapping_allowed() to fail due to
vma_needs_dirty_tracking() on the gup flow, which is a pre-requisite for
enabling P2PDMA with FOLL_LONGTERM use cases.

Co-developed-by: Logan Gunthorpe <[email protected]>
Signed-off-by: Logan Gunthorpe <[email protected]>
Signed-off-by: Martin Oliveira <[email protected]>
---
drivers/pci/p2pdma.c | 1 +
1 file changed, 1 insertion(+)

diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
index 4f47a13cb500..ac07053abfea 100644
--- a/drivers/pci/p2pdma.c
+++ b/drivers/pci/p2pdma.c
@@ -171,6 +171,7 @@ static struct bin_attribute p2pmem_alloc_attr = {
* to be very large.
*/
.size = SZ_1T,
+ .mmap_allocates = true,
};

static struct attribute *p2pmem_attrs[] = {
--
2.34.1


2024-06-05 19:40:39

by Martin Oliveira

Subject: [PATCH 6/6] RDMA/umem: add support for P2P RDMA

If the device supports P2PDMA, add the FOLL_PCI_P2PDMA flag.

This allows ibv_reg_mr() and friends to use P2PDMA memory that has been
mmaped into userspace for MRs in IB and RDMA transactions.

Co-developed-by: Logan Gunthorpe <[email protected]>
Signed-off-by: Logan Gunthorpe <[email protected]>
Signed-off-by: Martin Oliveira <[email protected]>
---
drivers/infiniband/core/umem.c | 3 +++
1 file changed, 3 insertions(+)

diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c
index 07c571c7b699..b59bb6e1475e 100644
--- a/drivers/infiniband/core/umem.c
+++ b/drivers/infiniband/core/umem.c
@@ -208,6 +208,9 @@ struct ib_umem *ib_umem_get(struct ib_device *device, unsigned long addr,
if (umem->writable)
gup_flags |= FOLL_WRITE;

+ if (ib_dma_pci_p2p_dma_supported(device))
+ gup_flags |= FOLL_PCI_P2PDMA;
+
while (npages) {
cond_resched();
pinned = pin_user_pages_fast(cur_base,
--
2.34.1


2024-06-05 20:06:26

by Martin Oliveira

Subject: [PATCH 4/6] mm/gup: handle ZONE_DEVICE pages in folio_fast_pin_allowed()

folio_fast_pin_allowed() does not support ZONE_DEVICE pages because
currently it is impossible for that type of page to be used with
FOLL_LONGTERM. When this changes in a subsequent patch, this path will
attempt to read the mapping of a ZONE_DEVICE page which is not valid.

Instead, allow ZONE_DEVICE pages explicitly, since they shouldn't pose
any problem with the fast path.

Co-developed-by: Logan Gunthorpe <[email protected]>
Signed-off-by: Logan Gunthorpe <[email protected]>
Signed-off-by: Martin Oliveira <[email protected]>
---
mm/gup.c | 4 ++++
1 file changed, 4 insertions(+)

diff --git a/mm/gup.c b/mm/gup.c
index ca0f5cedce9b..00d0a77112f4 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -2847,6 +2847,10 @@ static bool gup_fast_folio_allowed(struct folio *folio, unsigned int flags)
if (folio_test_hugetlb(folio))
return true;

+ /* It makes no sense to access the mapping of ZONE_DEVICE pages */
+ if (folio_is_zone_device(folio))
+ return true;
+
/*
* GUP-fast disables IRQs. When IRQS are disabled, RCU grace periods
* cannot proceed, which means no actions performed under RCU can
--
2.34.1


2024-06-05 21:27:34

by Martin Oliveira

Subject: [PATCH 1/6] kernfs: create vm_operations_struct without page_mkwrite()

The standard kernfs vm_ops installs a page_mkwrite() operator which
modifies the file update time on write.

This not always required (or makes sense), such as in the P2PDMA, which
uses the sysfs file as an allocator from userspace.

Furthermore, having the page_mkwrite() operator causes
writable_file_mapping_allowed() to fail due to
vma_needs_dirty_tracking() on the gup flow, which is a pre-requisite for
enabling P2PDMA over RDMA.

Fix this by adding a new boolean on kernfs_ops to differentiate between
the different behaviours.

Co-developed-by: Logan Gunthorpe <[email protected]>
Signed-off-by: Logan Gunthorpe <[email protected]>
Signed-off-by: Martin Oliveira <[email protected]>
---
fs/kernfs/file.c | 15 ++++++++++++++-
include/linux/kernfs.h | 7 +++++++
2 files changed, 21 insertions(+), 1 deletion(-)

diff --git a/fs/kernfs/file.c b/fs/kernfs/file.c
index 8502ef68459b..d5e9fbded3dd 100644
--- a/fs/kernfs/file.c
+++ b/fs/kernfs/file.c
@@ -436,6 +436,12 @@ static const struct vm_operations_struct kernfs_vm_ops = {
.access = kernfs_vma_access,
};

+static const struct vm_operations_struct kernfs_vm_ops_mmap_allocates = {
+ .open = kernfs_vma_open,
+ .fault = kernfs_vma_fault,
+ .access = kernfs_vma_access,
+};
+
static int kernfs_fop_mmap(struct file *file, struct vm_area_struct *vma)
{
struct kernfs_open_file *of = kernfs_of(file);
@@ -482,13 +488,20 @@ static int kernfs_fop_mmap(struct file *file, struct vm_area_struct *vma)
if (vma->vm_ops && vma->vm_ops->close)
goto out_put;

+ if (ops->mmap_allocates)
+ vma->vm_ops = &kernfs_vm_ops_mmap_allocates;
+ else
+ vma->vm_ops = &kernfs_vm_ops;
+
+ if (ops->mmap_allocates && vma->vm_ops->page_mkwrite)
+ goto out_put;
+
rc = 0;
if (!of->mmapped) {
of->mmapped = true;
of_on(of)->nr_mmapped++;
of->vm_ops = vma->vm_ops;
}
- vma->vm_ops = &kernfs_vm_ops;
out_put:
kernfs_put_active(of->kn);
out_unlock:
diff --git a/include/linux/kernfs.h b/include/linux/kernfs.h
index 87c79d076d6d..d6ae7d4b0011 100644
--- a/include/linux/kernfs.h
+++ b/include/linux/kernfs.h
@@ -311,6 +311,13 @@ struct kernfs_ops {
* ->prealloc. Provide ->read and ->write with ->prealloc.
*/
bool prealloc;
+ /*
+ * Use the file as an allocator from userspace. This disables
+ * page_mkwrite() to prevent the file time from being updated on write
+ * which enables using GUP with FOLL_LONGTERM with memory that's been
+ * mmaped.
+ */
+ bool mmap_allocates;
ssize_t (*write)(struct kernfs_open_file *of, char *buf, size_t bytes,
loff_t off);

--
2.34.1


2024-06-05 21:27:39

by Martin Oliveira

Subject: [PATCH 5/6] mm/gup: allow FOLL_LONGTERM & FOLL_PCI_P2PDMA

This check existed originally due to concerns that P2PDMA needed to copy
fsdax until pgmap refcounts were fixed (see [1]).

The P2PDMA infrastructure will only call unmap_mapping_range() when the
underlying device is unbound, and immediately after unmapping it waits
for the reference of all ZONE_DEVICE pages to be released before
continuing. This does not allow for a page to be reused and no user
access fault is therefore possible. It does not have the same problem as
fsdax.

The one minor concern with FOLL_LONGTERM pins is they will block device
unbind until userspace releases them all.

Co-developed-by: Logan Gunthorpe <[email protected]>
Signed-off-by: Logan Gunthorpe <[email protected]>
Signed-off-by: Martin Oliveira <[email protected]>

[1]: https://lkml.kernel.org/r/[email protected]
---
mm/gup.c | 5 -----
1 file changed, 5 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index 00d0a77112f4..28060e41788d 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -2614,11 +2614,6 @@ static bool is_valid_gup_args(struct page **pages, int *locked,
if (WARN_ON_ONCE((gup_flags & (FOLL_GET | FOLL_PIN)) && !pages))
return false;

- /* We want to allow the pgmap to be hot-unplugged at all times */
- if (WARN_ON_ONCE((gup_flags & FOLL_LONGTERM) &&
- (gup_flags & FOLL_PCI_P2PDMA)))
- return false;
-
*gup_flags_p = gup_flags;
return true;
}
--
2.34.1
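
The effect of removing that check can be sketched with a small userspace model. This is only an illustration: the TOY_* bit values and the toy_* function names are made up for the example and do not match the kernel's definitions, and the real is_valid_gup_args() performs several other checks that are omitted here.

```c
#include <stdbool.h>

/* Illustrative bit values only; the kernel's FOLL_* definitions differ. */
#define TOY_FOLL_LONGTERM	(1u << 0)
#define TOY_FOLL_PCI_P2PDMA	(1u << 1)

/* Before the patch: a long-term pin of P2PDMA memory was always rejected. */
static bool toy_gup_args_valid_before(unsigned int gup_flags)
{
	if ((gup_flags & TOY_FOLL_LONGTERM) &&
	    (gup_flags & TOY_FOLL_PCI_P2PDMA))
		return false;

	return true;
}

/* After the patch: this particular combination is no longer rejected. */
static bool toy_gup_args_valid_after(unsigned int gup_flags)
{
	(void)gup_flags;	/* the flag combination is not inspected here */
	return true;
}
```

In this model a caller that sets both flags, as an RDMA memory registration of P2PDMA memory presumably would, fails the "before" check and passes the "after" one.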


2024-06-05 21:27:53

by Martin Oliveira

Subject: [PATCH 2/6] sysfs: add mmap_allocates parameter to struct bin_attribute

Now that a struct kernfs_ops can have an "mmap_allocates" parameter to
avoid the page_mkwrite() operator, the struct bin_attribute needs a way
to choose the appropriate kernfs_ops.

Introduce a "mmap_allocates" boolean on struct bin_attribute to achieve
that.

Co-developed-by: Logan Gunthorpe <[email protected]>
Signed-off-by: Logan Gunthorpe <[email protected]>
Signed-off-by: Martin Oliveira <[email protected]>
---
fs/sysfs/file.c | 25 +++++++++++++++++++------
include/linux/sysfs.h | 1 +
2 files changed, 20 insertions(+), 6 deletions(-)

diff --git a/fs/sysfs/file.c b/fs/sysfs/file.c
index d1995e2d6c94..77c21009ceee 100644
--- a/fs/sysfs/file.c
+++ b/fs/sysfs/file.c
@@ -264,6 +264,15 @@ static const struct kernfs_ops sysfs_bin_kfops_mmap = {
.llseek = sysfs_kf_bin_llseek,
};

+static const struct kernfs_ops sysfs_bin_kfops_mmap_allocates = {
+ .read = sysfs_kf_bin_read,
+ .write = sysfs_kf_bin_write,
+ .mmap = sysfs_kf_bin_mmap,
+ .open = sysfs_kf_bin_open,
+ .llseek = sysfs_kf_bin_llseek,
+ .mmap_allocates = true,
+};
+
int sysfs_add_file_mode_ns(struct kernfs_node *parent,
const struct attribute *attr, umode_t mode, kuid_t uid,
kgid_t gid, const void *ns)
@@ -323,16 +332,20 @@ int sysfs_add_bin_file_mode_ns(struct kernfs_node *parent,
const struct kernfs_ops *ops;
struct kernfs_node *kn;

- if (battr->mmap)
- ops = &sysfs_bin_kfops_mmap;
- else if (battr->read && battr->write)
+ if (battr->mmap) {
+ if (battr->mmap_allocates)
+ ops = &sysfs_bin_kfops_mmap_allocates;
+ else
+ ops = &sysfs_bin_kfops_mmap;
+ } else if (battr->read && battr->write) {
ops = &sysfs_bin_kfops_rw;
- else if (battr->read)
+ } else if (battr->read) {
ops = &sysfs_bin_kfops_ro;
- else if (battr->write)
+ } else if (battr->write) {
ops = &sysfs_bin_kfops_wo;
- else
+ } else {
ops = &sysfs_file_kfops_empty;
+ }

#ifdef CONFIG_DEBUG_LOCK_ALLOC
if (!attr->ignore_lockdep)
diff --git a/include/linux/sysfs.h b/include/linux/sysfs.h
index a7d725fbf739..190b4b9355df 100644
--- a/include/linux/sysfs.h
+++ b/include/linux/sysfs.h
@@ -294,6 +294,7 @@ struct bin_attribute {
struct attribute attr;
size_t size;
void *private;
+ bool mmap_allocates;
struct address_space *(*f_mapping)(void);
ssize_t (*read)(struct file *, struct kobject *, struct bin_attribute *,
char *, loff_t, size_t);
--
2.34.1


2024-06-05 21:45:24

by Bjorn Helgaas

Subject: Re: [PATCH 3/6] PCI/P2PDMA: create VMA without page_mkwrite() operator

On Wed, Jun 05, 2024 at 01:29:31PM -0600, Martin Oliveira wrote:
> The P2PDMA code does not need (or want) a page_mkwrite() operator on its
> VMA.
>
> Furthermore, having the page_mkwrite() operator causes
> writable_file_mapping_allowed() to fail due to
> vma_needs_dirty_tracking() on the gup flow, which is a pre-requisite for
> enabling P2PDMA with FOLL_LONGTERM use cases.
>
> Co-developed-by: Logan Gunthorpe <[email protected]>
> Signed-off-by: Logan Gunthorpe <[email protected]>
> Signed-off-by: Martin Oliveira <[email protected]>

Fine with me, but please s/create/Create/ in the subject to match
history of the file.

Acked-by: Bjorn Helgaas <[email protected]>

> ---
> drivers/pci/p2pdma.c | 1 +
> 1 file changed, 1 insertion(+)
>
> diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
> index 4f47a13cb500..ac07053abfea 100644
> --- a/drivers/pci/p2pdma.c
> +++ b/drivers/pci/p2pdma.c
> @@ -171,6 +171,7 @@ static struct bin_attribute p2pmem_alloc_attr = {
> * to be very large.
> */
> .size = SZ_1T,
> + .mmap_allocates = true,
> };
>
> static struct attribute *p2pmem_attrs[] = {
> --
> 2.34.1
>

2024-06-05 21:47:12

by Bjorn Helgaas

Subject: Re: [PATCH 1/6] kernfs: create vm_operations_struct without page_mkwrite()

On Wed, Jun 05, 2024 at 01:29:29PM -0600, Martin Oliveira wrote:
> The standard kernfs vm_ops installs a page_mkwrite() operator which
> modifies the file update time on write.
>
> This not always required (or makes sense), such as in the P2PDMA, which

s/This/This is/ ?

> uses the sysfs file as an allocator from userspace.
>
> Furthermore, having the page_mkwrite() operator causes
> writable_file_mapping_allowed() to fail due to
> vma_needs_dirty_tracking() on the gup flow, which is a pre-requisite for
> enabling P2PDMA over RDMA.
>
> Fix this by adding a new boolean on kernfs_ops to differentiate between
> the different behaviours.

> + * Use the file as an allocator from userspace. This disables
> + * page_mkwrite() to prevent the file time from being updated on write
> + * which enables using GUP with FOLL_LONGTERM with memory that's been
> + * mmaped.

"mmaped" does seem more commonly used in Linux than "mmapped", but the
base word "mapped" definitely requires "pp", so "mmaped" looks funny
to me.

2024-06-06 08:56:05

by Zhu Yanjun

Subject: Re: [PATCH 0/6] Enable P2PDMA in Userspace RDMA

On 05.06.24 21:29, Martin Oliveira wrote:
> This patch series enables P2PDMA memory to be used in userspace RDMA
> transfers. With this series, P2PDMA memory mmaped into userspace (ie.
> only NVMe CMBs, at the moment) can then be used with ibv_reg_mr() (or
> similar) interfaces. This can be tested by passing a sysfs p2pmem
> allocator to the --mmap flag of the perftest tools.

Do you mean the following --mmap flag?
"
--mmap=file Use an mmap'd file as the buffer for testing P2P transfers.
"
I am interested in this. Can you provide the full steps to make tests
with this patch series?

Thanks a lot.
Zhu Yanjun

>
> This requires addressing three issues:
>
> * Stop exporting the P2PDMA VMAs with page_mkwrite which is incompatible
> with FOLL_LONGTERM
>
> * Fix folio_fast_pin_allowed() path to take into account ZONE_DEVICE pages.
>
> * Remove the restriction on FOLL_LONGTERM with FOLL_PCI_P2PDMA which was
> initially put in place due to excessive caution with assuming P2PDMA
> would have similar problems to fsdax with unmap_mapping_range(). Seeing
> P2PDMA only uses unmap_mapping_range() on device unbind and immediately
> waits for all page reference counts to go to zero after calling it, it
> is actually believed to be safe from reuse and user access faults. See
> [1] for more discussion.
>
> This was tested using a Mellanox ConnectX-6 SmartNIC (MT28908 Family),
> using the mlx5_core driver, as well as an NVMe CMB.
>
> Thanks,
> Martin
>
> [1]: https://lore.kernel.org/linux-mm/[email protected]/T/
>
> Martin Oliveira (6):
> kernfs: create vm_operations_struct without page_mkwrite()
> sysfs: add mmap_allocates parameter to struct bin_attribute
> PCI/P2PDMA: create VMA without page_mkwrite() operator
> mm/gup: handle ZONE_DEVICE pages in folio_fast_pin_allowed()
> mm/gup: allow FOLL_LONGTERM & FOLL_PCI_P2PDMA
> RDMA/umem: add support for P2P RDMA
>
> drivers/infiniband/core/umem.c | 3 +++
> drivers/pci/p2pdma.c | 1 +
> fs/kernfs/file.c | 15 ++++++++++++++-
> fs/sysfs/file.c | 25 +++++++++++++++++++------
> include/linux/kernfs.h | 7 +++++++
> include/linux/sysfs.h | 1 +
> mm/gup.c | 9 ++++-----
> 7 files changed, 49 insertions(+), 12 deletions(-)
>
>
> base-commit: c3f38fa61af77b49866b006939479069cd451173


2024-06-06 20:54:22

by Greg Kroah-Hartman

Subject: Re: [PATCH 1/6] kernfs: create vm_operations_struct without page_mkwrite()

On Wed, Jun 05, 2024 at 01:29:29PM -0600, Martin Oliveira wrote:
> The standard kernfs vm_ops installs a page_mkwrite() operator which
> modifies the file update time on write.
>
> This not always required (or makes sense), such as in the P2PDMA, which
> uses the sysfs file as an allocator from userspace.

That's not a good idea, please don't do that. sysfs binary files are
"pass through", why would you want to use this as an allocator?

> Furthermore, having the page_mkwrite() operator causes
> writable_file_mapping_allowed() to fail due to
> vma_needs_dirty_tracking() on the gup flow, which is a pre-requisite for
> enabling P2PDMA over RDMA.
>
> Fix this by adding a new boolean on kernfs_ops to differentiate between
> the different behaviours.

This isn't going to work well.

What exactly are you wanting to do in sysfs that you feel this is
required?

thanks,

greg k-h

2024-06-06 21:32:44

by Martin Oliveira

Subject: Re: [PATCH 0/6] Enable P2PDMA in Userspace RDMA

On 2024-06-06 02:53, Zhu Yanjun wrote:
> On 05.06.24 21:29, Martin Oliveira wrote:
>> This patch series enables P2PDMA memory to be used in userspace RDMA
>> transfers. With this series, P2PDMA memory mmaped into userspace (ie.
>> only NVMe CMBs, at the moment) can then be used with ibv_reg_mr() (or
>> similar) interfaces. This can be tested by passing a sysfs p2pmem
>> allocator to the --mmap flag of the perftest tools.
>
> Do you mean the following --mmap flag?
> "
> --mmap=file  Use an mmap'd file as the buffer for testing P2P transfers.
> "

Yes

> I am interested in this. Can you provide the full steps to make tests
> with this patch series?

First start the server with:

ib_read_bw

Then run a client with something like this:

ib_read_bw --mmap /sys/bus/pci/devices/0000\:c5\:00.0/p2pmem/allocate <host>

Martin
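
For readers who want to see what mapping such a file as a test buffer could look like, here is a minimal sketch. It is not taken from perftest: map_shared_buffer() is a hypothetical helper, and the choice of MAP_SHARED at offset 0 is an assumption about what the p2pmem allocator file expects.

```c
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Map `len` bytes of the file at `path` as a shared, writable buffer.
 * Returns the mapping, or NULL if the file cannot be opened or mapped.
 */
static void *map_shared_buffer(const char *path, size_t len)
{
	void *buf;
	int fd;

	fd = open(path, O_RDWR);
	if (fd < 0)
		return NULL;

	/* Assumption: a shared mapping at offset 0 is what the file expects. */
	buf = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	close(fd);	/* the mapping remains valid after the descriptor is closed */

	return buf == MAP_FAILED ? NULL : buf;
}
```

With this series applied, the returned pointer could then be handed to ibv_reg_mr() like any other buffer, and released with munmap() after the MR has been deregistered.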

2024-06-06 21:33:24

by Logan Gunthorpe

Subject: Re: [PATCH 1/6] kernfs: create vm_operations_struct without page_mkwrite()

Hi Greg,

On 2024-06-06 14:54, Greg Kroah-Hartman wrote:
> On Wed, Jun 05, 2024 at 01:29:29PM -0600, Martin Oliveira wrote:
>> The standard kernfs vm_ops installs a page_mkwrite() operator which
>> modifies the file update time on write.
>>
>> This not always required (or makes sense), such as in the P2PDMA, which
>> uses the sysfs file as an allocator from userspace.
>
> That's not a good idea, please don't do that. sysfs binary files are
> "pass through", why would you want to use this as an allocator?

The P2PDMA code already creates a binary attribute which is used to
allocate P2PDMA memory into userspace[1]. It was done this way a couple
of years ago at the suggestion of Christoph[2]. Using a sysfs attribute
made the code substantially simpler and got rid of a bunch of pseudofs
mess that was required when mmaping a char device. The attribute already
exists and is used by userspace so it's not something we can change at
this point.

The attribute has worked well for what was needed until we wanted to use
P2PDMA memory with FOLL_LONGTERM and GUP. That path specifically denies
FOLL_LONGTERM pins when the underlying VMA has a .page_mkwrite operator,
which sysfs/kernfs forces on us. P2PDMA doesn't benefit from this
operator in any way so the simplest thing is to remove it for this use case.

>> Furthermore, having the page_mkwrite() operator causes
>> writable_file_mapping_allowed() to fail due to
>> vma_needs_dirty_tracking() on the gup flow, which is a pre-requisite for
>> enabling P2PDMA over RDMA.
>>
>> Fix this by adding a new boolean on kernfs_ops to differentiate between
>> the different behaviours.
>
> This isn't going to work well.

What about it are you worried won't work well? We're open to other
suggestions.

Thanks,

Logan

[1] https://elixir.bootlin.com/linux/latest/source/drivers/pci/p2pdma.c#L164
[2] https://lore.kernel.org/all/[email protected]/
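
The interaction described above, where a .page_mkwrite operator on the VMA causes FOLL_LONGTERM pins to be denied, can be modelled roughly as follows. This is a simplified userspace sketch, not the kernel code: every toy_* name and TOY_* value is invented for the example, and the real vma_needs_dirty_tracking() and writable_file_mapping_allowed() consider more than what is shown here.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-ins for the kernel structures; only the relevant fields. */
struct toy_vm_ops {
	int (*fault)(void);
	int (*page_mkwrite)(void);
};

struct toy_vma {
	bool shared_writable;
	const struct toy_vm_ops *vm_ops;
};

/* Illustrative bit values only; the kernel's FOLL_* definitions differ. */
#define TOY_FOLL_PIN		(1u << 0)
#define TOY_FOLL_LONGTERM	(1u << 1)

static int toy_fault(void) { return 0; }
static int toy_mkwrite(void) { return 0; }

/* Roughly what kernfs installs today, and what the series proposes instead. */
static const struct toy_vm_ops toy_ops_with_mkwrite = {
	.fault = toy_fault,
	.page_mkwrite = toy_mkwrite,
};
static const struct toy_vm_ops toy_ops_without_mkwrite = {
	.fault = toy_fault,
};

/* A shared writable mapping with page_mkwrite wants write notification. */
static bool toy_needs_dirty_tracking(const struct toy_vma *vma)
{
	if (!vma->shared_writable)
		return false;

	return vma->vm_ops != NULL && vma->vm_ops->page_mkwrite != NULL;
}

/* Only long-term pins are refused on mappings that need dirty tracking. */
static bool toy_longterm_pin_allowed(const struct toy_vma *vma,
				     unsigned int gup_flags)
{
	unsigned int both = TOY_FOLL_PIN | TOY_FOLL_LONGTERM;

	if ((gup_flags & both) != both)
		return true;

	return !toy_needs_dirty_tracking(vma);
}
```

In this model, dropping page_mkwrite from the ops table is enough to flip the long-term pin from refused to allowed, while short-term pins are unaffected either way.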

2024-06-07 05:03:54

by Christoph Hellwig

Subject: Re: [PATCH 1/6] kernfs: create vm_operations_struct without page_mkwrite()

On Thu, Jun 06, 2024 at 10:54:06PM +0200, Greg Kroah-Hartman wrote:
> On Wed, Jun 05, 2024 at 01:29:29PM -0600, Martin Oliveira wrote:
> > The standard kernfs vm_ops installs a page_mkwrite() operator which
> > modifies the file update time on write.
> >
> > This not always required (or makes sense), such as in the P2PDMA, which
> > uses the sysfs file as an allocator from userspace.
>
> That's not a good idea, please don't do that. sysfs binary files are
> "pass through", why would you want to use this as an allocator?

I think the real question is why sysfs binary files implement
page_mkwrite by default. page_mkwrite is needed for file systems that
need to allocate space from a free space pool, which seems odd for
sysfs.

2024-06-07 16:18:01

by Logan Gunthorpe

Subject: Re: [PATCH 1/6] kernfs: create vm_operations_struct without page_mkwrite()



On 2024-06-06 23:03, Christoph Hellwig wrote:
> On Thu, Jun 06, 2024 at 10:54:06PM +0200, Greg Kroah-Hartman wrote:
>> On Wed, Jun 05, 2024 at 01:29:29PM -0600, Martin Oliveira wrote:
>>> The standard kernfs vm_ops installs a page_mkwrite() operator which
>>> modifies the file update time on write.
>>>
>>> This not always required (or makes sense), such as in the P2PDMA, which
>>> uses the sysfs file as an allocator from userspace.
>>
>> That's not a good idea, please don't do that. sysfs binary files are
>> "pass through", why would you want to use this as an allocator?
>
> I think the real question is why sysfs binary files implement
> page_mkwrite by default. page_mkwrite is needed for file systems that
> need to allocate space from a free space pool, which seems odd for
> sysfs.

The default page_mkwrite in kernfs just calls file_update_time() but, as
I understand it, the fault code should call file_update_time() if
page_mkwrite isn't set. So perhaps the easiest thing is to simply not
add a page_mkwrite unless the vm_ops adds one.

It's not the easiest thing to trace, but as best as I can tell there are
no kernfs binary attributes that use page_mkwrite. So alternatively,
perhaps we could just disallow page_mkwrite in kernfs entirely?

Logan

2024-06-07 19:18:40

by Greg Kroah-Hartman

Subject: Re: [PATCH 1/6] kernfs: create vm_operations_struct without page_mkwrite()

On Fri, Jun 07, 2024 at 10:16:58AM -0600, Logan Gunthorpe wrote:
>
>
> On 2024-06-06 23:03, Christoph Hellwig wrote:
> > On Thu, Jun 06, 2024 at 10:54:06PM +0200, Greg Kroah-Hartman wrote:
> >> On Wed, Jun 05, 2024 at 01:29:29PM -0600, Martin Oliveira wrote:
> >>> The standard kernfs vm_ops installs a page_mkwrite() operator which
> >>> modifies the file update time on write.
> >>>
> >>> This not always required (or makes sense), such as in the P2PDMA, which
> >>> uses the sysfs file as an allocator from userspace.
> >>
> >> That's not a good idea, please don't do that. sysfs binary files are
> >> "pass through", why would you want to use this as an allocator?
> >
> > I think the real question is why sysfs binary files implement
> > page_mkwrite by default. page_mkwrite is needed for file systems that
> > need to allocate space from a free space pool, which seems odd for
> > sysfs.
>
> The default page_mkwrite in kernfs just calls file_update_time() but, as
> I understand it, the fault code should call file_update_time() if
> page_mkwrite isn't set. So perhaps the easiest thing is to simply not
> add a page_mkwrite unless the vm_ops adds one.
>
> It's not the easiest thing to trace, but as best as I can tell there are
> no kernfs binary attributes that use page_mkwrite. So alternatively,
> perhaps we could just disallow page_mkwrite in kernfs entirely?

Sure, let's do that.

2024-06-10 12:12:11

by Jason Gunthorpe

Subject: Re: [PATCH 6/6] RDMA/umem: add support for P2P RDMA

On Wed, Jun 05, 2024 at 01:29:34PM -0600, Martin Oliveira wrote:
> If the device supports P2PDMA, add the FOLL_PCI_P2PDMA flag
>
> This allows ibv_reg_mr() and friends to use P2PDMA memory that has been
> mmaped into userspace for MRs in IB and RDMA transactions.
>
> Co-developed-by: Logan Gunthorpe <[email protected]>
> Signed-off-by: Logan Gunthorpe <[email protected]>
> Signed-off-by: Martin Oliveira <[email protected]>
> ---
> drivers/infiniband/core/umem.c | 3 +++
> 1 file changed, 3 insertions(+)

Acked-by: Jason Gunthorpe <[email protected]>

Jason