(pin_/get)_user_pages[_remote]() each provide an optional output parameter
for an array of VMA objects associated with each page in the input range.
These provide the means for VMAs to be returned, as long as mm->mmap_lock
is never released during the GUP operation (i.e. the internal flag
FOLL_UNLOCKABLE is not specified).
In addition, these VMAs must only be accessed under the mmap_lock, and
become invalid the moment it is released.
The vast majority of invocations do not use this functionality and of those
that do, all but one retrieve a single VMA to perform checks upon.
It is not egregious in the single VMA cases to simply replace the operation
with a vma_lookup(). In these cases we duplicate the (fast) lookup on a
slow path already under the mmap_lock.
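As a rough illustration of the single VMA case, the sketch below models a
caller that performs its own lookup rather than reading vmas[0]. It is a
userspace model only: vma_model, mm_model, vma_lookup_model(),
check_single_vma() and VM_MAYEXEC_MODEL are hypothetical stand-ins invented
for this example, not kernel definitions, and no locking is modelled.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical userspace stand-ins; these are not the kernel definitions. */
struct vma_model {
	unsigned long start;	/* first address covered (inclusive) */
	unsigned long end;	/* first address past the VMA (exclusive) */
	unsigned long flags;
};

struct mm_model {
	const struct vma_model *vmas;	/* sorted, non-overlapping */
	size_t nr_vmas;
};

#define VM_MAYEXEC_MODEL 0x1UL

/* Models vma_lookup(): the VMA containing addr, or NULL if unmapped. */
static const struct vma_model *vma_lookup_model(const struct mm_model *mm,
						unsigned long addr)
{
	size_t i;

	assert(mm != NULL);

	for (i = 0; i < mm->nr_vmas; i++) {
		if (addr >= mm->vmas[i].start && addr < mm->vmas[i].end)
			return &mm->vmas[i];
	}
	return NULL;
}

/*
 * Models a single-page caller that previously inspected vmas[0]: it now
 * looks the VMA up itself and applies its check to that one VMA.
 * Returns 0 on success, -1 if the address is unmapped, -2 if the VMA
 * lacks the required flags.
 */
static int check_single_vma(const struct mm_model *mm, unsigned long addr,
			    unsigned long required_flags)
{
	const struct vma_model *vma = vma_lookup_model(mm, addr);

	if (!vma)
		return -1;
	if ((vma->flags & required_flags) != required_flags)
		return -2;
	return 0;
}
```

In the kernel the lookup and the check would both have to happen while the
mmap_lock is still held, which is already the case on these slow paths.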
The special case is io_uring, where io_pin_pages() specifically needs to
assert that all the VMAs possess the same vma->vm_file (possibly NULL) and
that they are either anonymous or hugetlb pages.
To continue to provide this functionality, we introduce the FOLL_SAME_FILE
flag which asserts that the vma->vm_file remains the same throughout,
erroring out if this is not the case.
We can then replace the io_uring case by passing FOLL_SAME_FILE, looking
up the first VMA manually and performing the required checks on this
alone. The combination of the two amounts to the same checks being
performed (and avoids an allocation).
Eliminating this parameter eliminates an entire class of errors - the vmas
array used to become a set of dangling pointers if access after release of
mmap_lock was attempted. This is simply no longer possible.
In addition the API is simplified and now clearly expresses what it is for
- applying the specified GUP flags and (if pinning) returning pinned pages.
This change additionally opens the door to further potential improvements
in GUP and the possible marrying of disparate code paths.
Thanks to Matthew Wilcox for suggesting this refactoring!
Lorenzo Stoakes (7):
mm/gup: remove unused vmas parameter from get_user_pages()
mm/gup: remove unused vmas parameter from pin_user_pages_remote()
mm/gup: remove vmas parameter from get_user_pages_remote()
mm/gup: introduce the FOLL_SAME_FILE GUP flag
io_uring: rsrc: use FOLL_SAME_FILE on pin_user_pages()
mm/gup: remove vmas parameter from pin_user_pages()
mm/gup: remove vmas array from internal GUP functions
arch/arm64/kernel/mte.c | 5 +-
arch/powerpc/mm/book3s64/iommu_api.c | 2 +-
arch/s390/kvm/interrupt.c | 2 +-
arch/x86/kernel/cpu/sgx/ioctl.c | 2 +-
drivers/gpu/drm/radeon/radeon_ttm.c | 2 +-
drivers/infiniband/hw/qib/qib_user_pages.c | 2 +-
drivers/infiniband/hw/usnic/usnic_uiom.c | 2 +-
drivers/infiniband/sw/siw/siw_mem.c | 2 +-
drivers/iommu/iommufd/pages.c | 4 +-
drivers/media/v4l2-core/videobuf-dma-sg.c | 2 +-
drivers/misc/sgi-gru/grufault.c | 2 +-
drivers/vdpa/vdpa_user/vduse_dev.c | 2 +-
drivers/vfio/vfio_iommu_type1.c | 2 +-
drivers/vhost/vdpa.c | 2 +-
fs/exec.c | 2 +-
include/linux/hugetlb.h | 10 +-
include/linux/mm.h | 10 +-
include/linux/mm_types.h | 2 +
io_uring/rsrc.c | 39 +++----
kernel/events/uprobes.c | 10 +-
mm/gup.c | 121 ++++++++-------------
mm/gup_test.c | 14 +--
mm/hugetlb.c | 24 ++--
mm/memory.c | 9 +-
mm/process_vm_access.c | 2 +-
mm/rmap.c | 2 +-
net/xdp/xdp_umem.c | 2 +-
security/tomoyo/domain.c | 2 +-
virt/kvm/async_pf.c | 3 +-
virt/kvm/kvm_main.c | 4 +-
30 files changed, 125 insertions(+), 164 deletions(-)
--
2.40.0
No invocation of get_user_pages() uses the vmas parameter, so remove
it.
The GUP API is confusing and caveated. Recent changes have done much to
improve that, but there is more we can do. Exporting vmas is a prime
target, as the caller has to be extremely careful not to use them after
the mmap_lock has been released, or otherwise be left with dangling
pointers.
Removing the vmas parameter focuses the GUP functions upon their primary
purpose - pinning (and outputting) pages as well as performing the actions
implied by the input flags.
This is part of a patch series aiming to remove the vmas parameter
altogether.
Signed-off-by: Lorenzo Stoakes <[email protected]>
Suggested-by: Matthew Wilcox (Oracle) <[email protected]>
---
arch/x86/kernel/cpu/sgx/ioctl.c | 2 +-
drivers/gpu/drm/radeon/radeon_ttm.c | 2 +-
drivers/misc/sgi-gru/grufault.c | 2 +-
include/linux/mm.h | 3 +--
mm/gup.c | 9 +++------
mm/gup_test.c | 5 ++---
virt/kvm/kvm_main.c | 4 ++--
7 files changed, 11 insertions(+), 16 deletions(-)
diff --git a/arch/x86/kernel/cpu/sgx/ioctl.c b/arch/x86/kernel/cpu/sgx/ioctl.c
index 21ca0a831b70..5d390df21440 100644
--- a/arch/x86/kernel/cpu/sgx/ioctl.c
+++ b/arch/x86/kernel/cpu/sgx/ioctl.c
@@ -214,7 +214,7 @@ static int __sgx_encl_add_page(struct sgx_encl *encl,
if (!(vma->vm_flags & VM_MAYEXEC))
return -EACCES;
- ret = get_user_pages(src, 1, 0, &src_page, NULL);
+ ret = get_user_pages(src, 1, 0, &src_page);
if (ret < 1)
return -EFAULT;
diff --git a/drivers/gpu/drm/radeon/radeon_ttm.c b/drivers/gpu/drm/radeon/radeon_ttm.c
index 1e8e287e113c..0597540f0dde 100644
--- a/drivers/gpu/drm/radeon/radeon_ttm.c
+++ b/drivers/gpu/drm/radeon/radeon_ttm.c
@@ -362,7 +362,7 @@ static int radeon_ttm_tt_pin_userptr(struct ttm_device *bdev, struct ttm_tt *ttm
struct page **pages = ttm->pages + pinned;
r = get_user_pages(userptr, num_pages, write ? FOLL_WRITE : 0,
- pages, NULL);
+ pages);
if (r < 0)
goto release_pages;
diff --git a/drivers/misc/sgi-gru/grufault.c b/drivers/misc/sgi-gru/grufault.c
index b836936e9747..378cf02a2aa1 100644
--- a/drivers/misc/sgi-gru/grufault.c
+++ b/drivers/misc/sgi-gru/grufault.c
@@ -185,7 +185,7 @@ static int non_atomic_pte_lookup(struct vm_area_struct *vma,
#else
*pageshift = PAGE_SHIFT;
#endif
- if (get_user_pages(vaddr, 1, write ? FOLL_WRITE : 0, &page, NULL) <= 0)
+ if (get_user_pages(vaddr, 1, write ? FOLL_WRITE : 0, &page) <= 0)
return -EFAULT;
*paddr = page_to_phys(page);
put_page(page);
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 5d5ba1556ae9..faeed36c2d04 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2380,8 +2380,7 @@ long pin_user_pages_remote(struct mm_struct *mm,
unsigned int gup_flags, struct page **pages,
struct vm_area_struct **vmas, int *locked);
long get_user_pages(unsigned long start, unsigned long nr_pages,
- unsigned int gup_flags, struct page **pages,
- struct vm_area_struct **vmas);
+ unsigned int gup_flags, struct page **pages);
long pin_user_pages(unsigned long start, unsigned long nr_pages,
unsigned int gup_flags, struct page **pages,
struct vm_area_struct **vmas);
diff --git a/mm/gup.c b/mm/gup.c
index 1f72a717232b..7e454d6b157e 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -2251,8 +2251,6 @@ long get_user_pages_remote(struct mm_struct *mm,
* @pages: array that receives pointers to the pages pinned.
* Should be at least nr_pages long. Or NULL, if caller
* only intends to ensure the pages are faulted in.
- * @vmas: array of pointers to vmas corresponding to each page.
- * Or NULL if the caller does not require them.
*
* This is the same as get_user_pages_remote(), just with a less-flexible
* calling convention where we assume that the mm being operated on belongs to
@@ -2260,16 +2258,15 @@ long get_user_pages_remote(struct mm_struct *mm,
* obviously don't pass FOLL_REMOTE in here.
*/
long get_user_pages(unsigned long start, unsigned long nr_pages,
- unsigned int gup_flags, struct page **pages,
- struct vm_area_struct **vmas)
+ unsigned int gup_flags, struct page **pages)
{
int locked = 1;
- if (!is_valid_gup_args(pages, vmas, NULL, &gup_flags, FOLL_TOUCH))
+ if (!is_valid_gup_args(pages, NULL, NULL, &gup_flags, FOLL_TOUCH))
return -EINVAL;
return __get_user_pages_locked(current->mm, start, nr_pages, pages,
- vmas, &locked, gup_flags);
+ NULL, &locked, gup_flags);
}
EXPORT_SYMBOL(get_user_pages);
diff --git a/mm/gup_test.c b/mm/gup_test.c
index 8ae7307a1bb6..9ba8ea23f84e 100644
--- a/mm/gup_test.c
+++ b/mm/gup_test.c
@@ -139,8 +139,7 @@ static int __gup_test_ioctl(unsigned int cmd,
pages + i);
break;
case GUP_BASIC_TEST:
- nr = get_user_pages(addr, nr, gup->gup_flags, pages + i,
- NULL);
+ nr = get_user_pages(addr, nr, gup->gup_flags, pages + i);
break;
case PIN_FAST_BENCHMARK:
nr = pin_user_pages_fast(addr, nr, gup->gup_flags,
@@ -161,7 +160,7 @@ static int __gup_test_ioctl(unsigned int cmd,
pages + i, NULL);
else
nr = get_user_pages(addr, nr, gup->gup_flags,
- pages + i, NULL);
+ pages + i);
break;
default:
ret = -EINVAL;
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index d255964ec331..2d2446df0900 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2474,7 +2474,7 @@ static inline int check_user_page_hwpoison(unsigned long addr)
{
int rc, flags = FOLL_HWPOISON | FOLL_WRITE;
- rc = get_user_pages(addr, 1, flags, NULL, NULL);
+ rc = get_user_pages(addr, 1, flags, NULL);
return rc == -EHWPOISON;
}
--
2.40.0
No invocation of pin_user_pages_remote() uses the vmas parameter, so remove
it. This forms part of a larger patch set eliminating the use of the vmas
parameters altogether.
Signed-off-by: Lorenzo Stoakes <[email protected]>
---
drivers/iommu/iommufd/pages.c | 4 ++--
drivers/vfio/vfio_iommu_type1.c | 2 +-
include/linux/mm.h | 2 +-
mm/gup.c | 8 +++-----
mm/process_vm_access.c | 2 +-
5 files changed, 8 insertions(+), 10 deletions(-)
diff --git a/drivers/iommu/iommufd/pages.c b/drivers/iommu/iommufd/pages.c
index f8d92c9bb65b..9d55a2188a64 100644
--- a/drivers/iommu/iommufd/pages.c
+++ b/drivers/iommu/iommufd/pages.c
@@ -786,7 +786,7 @@ static int pfn_reader_user_pin(struct pfn_reader_user *user,
user->locked = 1;
}
rc = pin_user_pages_remote(pages->source_mm, uptr, npages,
- user->gup_flags, user->upages, NULL,
+ user->gup_flags, user->upages,
&user->locked);
}
if (rc <= 0) {
@@ -1787,7 +1787,7 @@ static int iopt_pages_rw_page(struct iopt_pages *pages, unsigned long index,
rc = pin_user_pages_remote(
pages->source_mm, (uintptr_t)(pages->uptr + index * PAGE_SIZE),
1, (flags & IOMMUFD_ACCESS_RW_WRITE) ? FOLL_WRITE : 0, &page,
- NULL, NULL);
+ NULL);
mmap_read_unlock(pages->source_mm);
if (rc != 1) {
if (WARN_ON(rc >= 0))
diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 493c31de0edb..e6dc8fec3ed5 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -562,7 +562,7 @@ static int vaddr_get_pfns(struct mm_struct *mm, unsigned long vaddr,
mmap_read_lock(mm);
ret = pin_user_pages_remote(mm, vaddr, npages, flags | FOLL_LONGTERM,
- pages, NULL, NULL);
+ pages, NULL);
if (ret > 0) {
int i;
diff --git a/include/linux/mm.h b/include/linux/mm.h
index faeed36c2d04..513d5fab02f1 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2378,7 +2378,7 @@ long get_user_pages_remote(struct mm_struct *mm,
long pin_user_pages_remote(struct mm_struct *mm,
unsigned long start, unsigned long nr_pages,
unsigned int gup_flags, struct page **pages,
- struct vm_area_struct **vmas, int *locked);
+ int *locked);
long get_user_pages(unsigned long start, unsigned long nr_pages,
unsigned int gup_flags, struct page **pages);
long pin_user_pages(unsigned long start, unsigned long nr_pages,
diff --git a/mm/gup.c b/mm/gup.c
index 7e454d6b157e..931c805bc32b 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -3093,8 +3093,6 @@ EXPORT_SYMBOL_GPL(pin_user_pages_fast);
* @gup_flags: flags modifying lookup behaviour
* @pages: array that receives pointers to the pages pinned.
* Should be at least nr_pages long.
- * @vmas: array of pointers to vmas corresponding to each page.
- * Or NULL if the caller does not require them.
* @locked: pointer to lock flag indicating whether lock is held and
* subsequently whether VM_FAULT_RETRY functionality can be
* utilised. Lock must initially be held.
@@ -3109,14 +3107,14 @@ EXPORT_SYMBOL_GPL(pin_user_pages_fast);
long pin_user_pages_remote(struct mm_struct *mm,
unsigned long start, unsigned long nr_pages,
unsigned int gup_flags, struct page **pages,
- struct vm_area_struct **vmas, int *locked)
+ int *locked)
{
int local_locked = 1;
- if (!is_valid_gup_args(pages, vmas, locked, &gup_flags,
+ if (!is_valid_gup_args(pages, NULL, locked, &gup_flags,
FOLL_PIN | FOLL_TOUCH | FOLL_REMOTE))
return 0;
- return __gup_longterm_locked(mm, start, nr_pages, pages, vmas,
+ return __gup_longterm_locked(mm, start, nr_pages, pages, NULL,
locked ? locked : &local_locked,
gup_flags);
}
diff --git a/mm/process_vm_access.c b/mm/process_vm_access.c
index 78dfaf9e8990..0523edab03a6 100644
--- a/mm/process_vm_access.c
+++ b/mm/process_vm_access.c
@@ -104,7 +104,7 @@ static int process_vm_rw_single_vec(unsigned long addr,
mmap_read_lock(mm);
pinned_pages = pin_user_pages_remote(mm, pa, pinned_pages,
flags, process_pages,
- NULL, &locked);
+ &locked);
if (locked)
mmap_read_unlock(mm);
if (pinned_pages <= 0)
--
2.40.0
This flag causes GUP to assert that all VMAs within the input range possess
the same vma->vm_file. If not, the operation fails.
This is part of a patch series which eliminates the vmas parameter from the
GUP API, implementing the one remaining assertion within the entire kernel
that requires access to the VMAs associated with a GUP range.
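To illustrate the intended semantics, here is a minimal userspace model of
the per-page walk with the flag applied. All names suffixed _model are
hypothetical stand-ins invented for this sketch. It looks up the VMA for
every page, whereas the real walk only does so when crossing a VMA
boundary, and it assumes that a failure part-way through the range yields
the count of pages already processed rather than an error.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define PAGE_SIZE_MODEL		0x1000UL
#define FOLL_SAME_FILE_MODEL	(1U << 12)

/* Hypothetical userspace stand-ins; these are not the kernel definitions. */
struct file_model {
	int id;
};

struct vma_model {
	unsigned long start;	/* inclusive */
	unsigned long end;	/* exclusive */
	const struct file_model *file;	/* NULL models an anonymous VMA */
};

struct mm_model {
	const struct vma_model *vmas;	/* sorted, non-overlapping */
	size_t nr_vmas;
};

static const struct vma_model *vma_lookup_model(const struct mm_model *mm,
						unsigned long addr)
{
	size_t i;

	assert(mm != NULL);

	for (i = 0; i < mm->nr_vmas; i++) {
		if (addr >= mm->vmas[i].start && addr < mm->vmas[i].end)
			return &mm->vmas[i];
	}
	return NULL;
}

/*
 * Walk nr_pages pages from start. With FOLL_SAME_FILE_MODEL set, stop as
 * soon as a VMA's file differs from that of the previous VMA. The first
 * page is compared against its own VMA's file, so it always passes.
 * Returns the number of pages walked, or -EFAULT if none were.
 */
static long walk_range_model(const struct mm_model *mm, unsigned long start,
			     unsigned long nr_pages, unsigned int gup_flags)
{
	const struct file_model *file = NULL;
	unsigned long i;

	for (i = 0; i < nr_pages; i++) {
		const struct vma_model *vma =
			vma_lookup_model(mm, start + i * PAGE_SIZE_MODEL);
		const struct file_model *expected;

		if (!vma)
			return i ? (long)i : -EFAULT;

		expected = (i == 0) ? vma->file : file;
		if ((gup_flags & FOLL_SAME_FILE_MODEL) && vma->file != expected)
			return i ? (long)i : -EFAULT;

		file = vma->file;
	}
	return (long)nr_pages;
}
```

Under that assumption, a caller which requires the whole range to share
one file should treat any short return as a failure.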
Signed-off-by: Lorenzo Stoakes <[email protected]>
---
include/linux/mm_types.h | 2 ++
mm/gup.c | 16 ++++++++++++----
2 files changed, 14 insertions(+), 4 deletions(-)
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 3fc9e680f174..84d1aec9dbab 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -1185,6 +1185,8 @@ enum {
FOLL_PCI_P2PDMA = 1 << 10,
/* allow interrupts from generic signals */
FOLL_INTERRUPTIBLE = 1 << 11,
+ /* assert that the range spans VMAs with the same vma->vm_file */
+ FOLL_SAME_FILE = 1 << 12,
/* See also internal only FOLL flags in mm/internal.h */
};
diff --git a/mm/gup.c b/mm/gup.c
index 9440aa54c741..3954ce499a4a 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -959,7 +959,8 @@ static int faultin_page(struct vm_area_struct *vma,
return 0;
}
-static int check_vma_flags(struct vm_area_struct *vma, unsigned long gup_flags)
+static int check_vma_flags(struct vm_area_struct *vma, struct file *file,
+ unsigned long gup_flags)
{
vm_flags_t vm_flags = vma->vm_flags;
int write = (gup_flags & FOLL_WRITE);
@@ -968,7 +969,7 @@ static int check_vma_flags(struct vm_area_struct *vma, unsigned long gup_flags)
if (vm_flags & (VM_IO | VM_PFNMAP))
return -EFAULT;
- if (gup_flags & FOLL_ANON && !vma_is_anonymous(vma))
+ if ((gup_flags & FOLL_ANON) && !vma_is_anonymous(vma))
return -EFAULT;
if ((gup_flags & FOLL_LONGTERM) && vma_is_fsdax(vma))
@@ -977,6 +978,9 @@ static int check_vma_flags(struct vm_area_struct *vma, unsigned long gup_flags)
if (vma_is_secretmem(vma))
return -EFAULT;
+ if ((gup_flags & FOLL_SAME_FILE) && vma->vm_file != file)
+ return -EFAULT;
+
if (write) {
if (!(vm_flags & VM_WRITE)) {
if (!(gup_flags & FOLL_FORCE))
@@ -1081,6 +1085,7 @@ static long __get_user_pages(struct mm_struct *mm,
long ret = 0, i = 0;
struct vm_area_struct *vma = NULL;
struct follow_page_context ctx = { NULL };
+ struct file *file = NULL;
if (!nr_pages)
return 0;
@@ -1111,10 +1116,13 @@ static long __get_user_pages(struct mm_struct *mm,
ret = -EFAULT;
goto out;
}
- ret = check_vma_flags(vma, gup_flags);
+ ret = check_vma_flags(vma, i == 0 ? vma->vm_file : file,
+ gup_flags);
if (ret)
goto out;
+ file = vma->vm_file;
+
if (is_vm_hugetlb_page(vma)) {
i = follow_hugetlb_page(mm, vma, pages, vmas,
&start, &nr_pages, i,
@@ -1595,7 +1603,7 @@ long faultin_vma_page_range(struct vm_area_struct *vma, unsigned long start,
* We want to report -EINVAL instead of -EFAULT for any permission
* problems or incompatible mappings.
*/
- if (check_vma_flags(vma, gup_flags))
+ if (check_vma_flags(vma, vma->vm_file, gup_flags))
return -EINVAL;
ret = __get_user_pages(mm, start, nr_pages, gup_flags,
--
2.40.0
Commit edd478269640 ("io_uring/rsrc: disallow multi-source reg buffers")
prevents io_pin_pages() from pinning pages spanning multiple VMAs with
permitted characteristics (anon/huge), requiring that all VMAs share the
same vm_file.
The newly introduced FOLL_SAME_FILE flag permits this to be expressed as a
GUP flag rather than having to retrieve VMAs to perform the check.
We then only need to perform a VMA lookup for the first VMA to assert the
anon/hugepage requirement as we know the rest of the VMAs will possess the
same characteristics.
Doing this eliminates the one instance of vmas being used by
pin_user_pages().
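The following userspace sketch compares the old per-VMA loop with the new
"same file plus first VMA" check. file_model, vma_model, old_check() and
new_check() are hypothetical names invented for this illustration. The
model assumes that VMAs mapping the same file agree on whether they are
shmem, and it represents the FOLL_SAME_FILE failure as the -EFAULT that
io_pin_pages() returns on a short pin.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace stand-ins; these are not the kernel definitions. */
struct file_model {
	bool is_hugetlb;	/* models is_file_hugepages(file) */
};

struct vma_model {
	const struct file_model *file;	/* NULL models an anonymous VMA */
	bool is_shmem;			/* models vma_is_shmem(vma) */
};

/* Models the removed loop over the vmas array, one entry per page. */
static int old_check(const struct vma_model *const *vmas, size_t nr)
{
	const struct file_model *file;
	size_t i;

	assert(nr > 0);
	file = vmas[0]->file;

	for (i = 0; i < nr; i++) {
		if (vmas[i]->file != file)
			return -EINVAL;
		if (!file)
			continue;
		if (!vmas[i]->is_shmem && !file->is_hugetlb)
			return -EOPNOTSUPP;
	}
	return 0;
}

/* Models the new flow: same-file enforcement, then a first-VMA check. */
static int new_check(const struct vma_model *const *vmas, size_t nr)
{
	const struct vma_model *first;
	size_t i;

	assert(nr > 0);
	first = vmas[0];

	/* Stand-in for FOLL_SAME_FILE: a differing file gives a short pin. */
	for (i = 1; i < nr; i++) {
		if (vmas[i]->file != first->file)
			return -EFAULT;
	}

	if (first->file && !first->is_shmem && !first->file->is_hugetlb)
		return -EOPNOTSUPP;
	return 0;
}
```

In this model both variants accept and reject the same inputs. The only
difference is the error code for mixed files, -EINVAL before and -EFAULT
after.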
Signed-off-by: Lorenzo Stoakes <[email protected]>
Suggested-by: Matthew Wilcox (Oracle) <[email protected]>
---
io_uring/rsrc.c | 39 ++++++++++++++++-----------------------
1 file changed, 16 insertions(+), 23 deletions(-)
diff --git a/io_uring/rsrc.c b/io_uring/rsrc.c
index 7a43aed8e395..adc860bcbd4f 100644
--- a/io_uring/rsrc.c
+++ b/io_uring/rsrc.c
@@ -1141,9 +1141,8 @@ static int io_buffer_account_pin(struct io_ring_ctx *ctx, struct page **pages,
struct page **io_pin_pages(unsigned long ubuf, unsigned long len, int *npages)
{
unsigned long start, end, nr_pages;
- struct vm_area_struct **vmas = NULL;
struct page **pages = NULL;
- int i, pret, ret = -ENOMEM;
+ int pret, ret = -ENOMEM;
end = (ubuf + len + PAGE_SIZE - 1) >> PAGE_SHIFT;
start = ubuf >> PAGE_SHIFT;
@@ -1153,31 +1152,26 @@ struct page **io_pin_pages(unsigned long ubuf, unsigned long len, int *npages)
if (!pages)
goto done;
- vmas = kvmalloc_array(nr_pages, sizeof(struct vm_area_struct *),
- GFP_KERNEL);
- if (!vmas)
- goto done;
-
ret = 0;
mmap_read_lock(current->mm);
- pret = pin_user_pages(ubuf, nr_pages, FOLL_WRITE | FOLL_LONGTERM,
- pages, vmas);
+
+ pret = pin_user_pages(ubuf, nr_pages,
+ FOLL_WRITE | FOLL_LONGTERM | FOLL_SAME_FILE,
+ pages, NULL);
if (pret == nr_pages) {
- struct file *file = vmas[0]->vm_file;
+ /*
+ * lookup the first VMA, we require that all VMAs in range
+ * maintain the same file characteristics, as enforced by
+ * FOLL_SAME_FILE
+ */
+ struct vm_area_struct *vma = vma_lookup(current->mm, ubuf);
+ struct file *file;
/* don't support file backed memory */
- for (i = 0; i < nr_pages; i++) {
- if (vmas[i]->vm_file != file) {
- ret = -EINVAL;
- break;
- }
- if (!file)
- continue;
- if (!vma_is_shmem(vmas[i]) && !is_file_hugepages(file)) {
- ret = -EOPNOTSUPP;
- break;
- }
- }
+ file = vma->vm_file;
+ if (file && !vma_is_shmem(vma) && !is_file_hugepages(file))
+ ret = -EOPNOTSUPP;
+
*npages = nr_pages;
} else {
ret = pret < 0 ? pret : -EFAULT;
@@ -1194,7 +1188,6 @@ struct page **io_pin_pages(unsigned long ubuf, unsigned long len, int *npages)
}
ret = 0;
done:
- kvfree(vmas);
if (ret < 0) {
kvfree(pages);
pages = ERR_PTR(ret);
--
2.40.0
After the introduction of FOLL_SAME_FILE we no longer require vmas for any
invocation of pin_user_pages(), so eliminate this parameter from the
function and all callers.
This clears the way to removing the vmas parameter from GUP altogether.
Signed-off-by: Lorenzo Stoakes <[email protected]>
---
arch/powerpc/mm/book3s64/iommu_api.c | 2 +-
drivers/infiniband/hw/qib/qib_user_pages.c | 2 +-
drivers/infiniband/hw/usnic/usnic_uiom.c | 2 +-
drivers/infiniband/sw/siw/siw_mem.c | 2 +-
drivers/media/v4l2-core/videobuf-dma-sg.c | 2 +-
drivers/vdpa/vdpa_user/vduse_dev.c | 2 +-
drivers/vhost/vdpa.c | 2 +-
include/linux/mm.h | 3 +--
io_uring/rsrc.c | 2 +-
mm/gup.c | 9 +++------
mm/gup_test.c | 9 ++++-----
net/xdp/xdp_umem.c | 2 +-
12 files changed, 17 insertions(+), 22 deletions(-)
diff --git a/arch/powerpc/mm/book3s64/iommu_api.c b/arch/powerpc/mm/book3s64/iommu_api.c
index 81d7185e2ae8..d19fb1f3007d 100644
--- a/arch/powerpc/mm/book3s64/iommu_api.c
+++ b/arch/powerpc/mm/book3s64/iommu_api.c
@@ -105,7 +105,7 @@ static long mm_iommu_do_alloc(struct mm_struct *mm, unsigned long ua,
ret = pin_user_pages(ua + (entry << PAGE_SHIFT), n,
FOLL_WRITE | FOLL_LONGTERM,
- mem->hpages + entry, NULL);
+ mem->hpages + entry);
if (ret == n) {
pinned += n;
continue;
diff --git a/drivers/infiniband/hw/qib/qib_user_pages.c b/drivers/infiniband/hw/qib/qib_user_pages.c
index f693bc753b6b..1bb7507325bc 100644
--- a/drivers/infiniband/hw/qib/qib_user_pages.c
+++ b/drivers/infiniband/hw/qib/qib_user_pages.c
@@ -111,7 +111,7 @@ int qib_get_user_pages(unsigned long start_page, size_t num_pages,
ret = pin_user_pages(start_page + got * PAGE_SIZE,
num_pages - got,
FOLL_LONGTERM | FOLL_WRITE,
- p + got, NULL);
+ p + got);
if (ret < 0) {
mmap_read_unlock(current->mm);
goto bail_release;
diff --git a/drivers/infiniband/hw/usnic/usnic_uiom.c b/drivers/infiniband/hw/usnic/usnic_uiom.c
index 2a5cac2658ec..84e0f41e7dfa 100644
--- a/drivers/infiniband/hw/usnic/usnic_uiom.c
+++ b/drivers/infiniband/hw/usnic/usnic_uiom.c
@@ -140,7 +140,7 @@ static int usnic_uiom_get_pages(unsigned long addr, size_t size, int writable,
ret = pin_user_pages(cur_base,
min_t(unsigned long, npages,
PAGE_SIZE / sizeof(struct page *)),
- gup_flags, page_list, NULL);
+ gup_flags, page_list);
if (ret < 0)
goto out;
diff --git a/drivers/infiniband/sw/siw/siw_mem.c b/drivers/infiniband/sw/siw/siw_mem.c
index f51ab2ccf151..e6e25f15567d 100644
--- a/drivers/infiniband/sw/siw/siw_mem.c
+++ b/drivers/infiniband/sw/siw/siw_mem.c
@@ -422,7 +422,7 @@ struct siw_umem *siw_umem_get(u64 start, u64 len, bool writable)
umem->page_chunk[i].plist = plist;
while (nents) {
rv = pin_user_pages(first_page_va, nents, foll_flags,
- plist, NULL);
+ plist);
if (rv < 0)
goto out_sem_up;
diff --git a/drivers/media/v4l2-core/videobuf-dma-sg.c b/drivers/media/v4l2-core/videobuf-dma-sg.c
index 53001532e8e3..405b89ea1054 100644
--- a/drivers/media/v4l2-core/videobuf-dma-sg.c
+++ b/drivers/media/v4l2-core/videobuf-dma-sg.c
@@ -180,7 +180,7 @@ static int videobuf_dma_init_user_locked(struct videobuf_dmabuf *dma,
data, size, dma->nr_pages);
err = pin_user_pages(data & PAGE_MASK, dma->nr_pages, gup_flags,
- dma->pages, NULL);
+ dma->pages);
if (err != dma->nr_pages) {
dma->nr_pages = (err >= 0) ? err : 0;
diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c b/drivers/vdpa/vdpa_user/vduse_dev.c
index 0c3b48616a9f..1f80254604f0 100644
--- a/drivers/vdpa/vdpa_user/vduse_dev.c
+++ b/drivers/vdpa/vdpa_user/vduse_dev.c
@@ -995,7 +995,7 @@ static int vduse_dev_reg_umem(struct vduse_dev *dev,
goto out;
pinned = pin_user_pages(uaddr, npages, FOLL_LONGTERM | FOLL_WRITE,
- page_list, NULL);
+ page_list);
if (pinned != npages) {
ret = pinned < 0 ? pinned : -ENOMEM;
goto out;
diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
index 7be9d9d8f01c..4317128c1c62 100644
--- a/drivers/vhost/vdpa.c
+++ b/drivers/vhost/vdpa.c
@@ -952,7 +952,7 @@ static int vhost_vdpa_pa_map(struct vhost_vdpa *v,
while (npages) {
sz2pin = min_t(unsigned long, npages, list_size);
pinned = pin_user_pages(cur_base, sz2pin,
- gup_flags, page_list, NULL);
+ gup_flags, page_list);
if (sz2pin != pinned) {
if (pinned < 0) {
ret = pinned;
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 8dfa236cfb58..3f7d36ad7de7 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2382,8 +2382,7 @@ long pin_user_pages_remote(struct mm_struct *mm,
long get_user_pages(unsigned long start, unsigned long nr_pages,
unsigned int gup_flags, struct page **pages);
long pin_user_pages(unsigned long start, unsigned long nr_pages,
- unsigned int gup_flags, struct page **pages,
- struct vm_area_struct **vmas);
+ unsigned int gup_flags, struct page **pages);
long get_user_pages_unlocked(unsigned long start, unsigned long nr_pages,
struct page **pages, unsigned int gup_flags);
long pin_user_pages_unlocked(unsigned long start, unsigned long nr_pages,
diff --git a/io_uring/rsrc.c b/io_uring/rsrc.c
index adc860bcbd4f..92d0d47e322c 100644
--- a/io_uring/rsrc.c
+++ b/io_uring/rsrc.c
@@ -1157,7 +1157,7 @@ struct page **io_pin_pages(unsigned long ubuf, unsigned long len, int *npages)
pret = pin_user_pages(ubuf, nr_pages,
FOLL_WRITE | FOLL_LONGTERM | FOLL_SAME_FILE,
- pages, NULL);
+ pages);
if (pret == nr_pages) {
/*
* lookup the first VMA, we require that all VMAs in range
diff --git a/mm/gup.c b/mm/gup.c
index 3954ce499a4a..714970ef3b30 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -3132,8 +3132,6 @@ EXPORT_SYMBOL(pin_user_pages_remote);
* @gup_flags: flags modifying lookup behaviour
* @pages: array that receives pointers to the pages pinned.
* Should be at least nr_pages long.
- * @vmas: array of pointers to vmas corresponding to each page.
- * Or NULL if the caller does not require them.
*
* Nearly the same as get_user_pages(), except that FOLL_TOUCH is not set, and
* FOLL_PIN is set.
@@ -3142,15 +3140,14 @@ EXPORT_SYMBOL(pin_user_pages_remote);
* see Documentation/core-api/pin_user_pages.rst for details.
*/
long pin_user_pages(unsigned long start, unsigned long nr_pages,
- unsigned int gup_flags, struct page **pages,
- struct vm_area_struct **vmas)
+ unsigned int gup_flags, struct page **pages)
{
int locked = 1;
- if (!is_valid_gup_args(pages, vmas, NULL, &gup_flags, FOLL_PIN))
+ if (!is_valid_gup_args(pages, NULL, NULL, &gup_flags, FOLL_PIN))
return 0;
return __gup_longterm_locked(current->mm, start, nr_pages,
- pages, vmas, &locked, gup_flags);
+ pages, NULL, &locked, gup_flags);
}
EXPORT_SYMBOL(pin_user_pages);
diff --git a/mm/gup_test.c b/mm/gup_test.c
index 9ba8ea23f84e..1668ce0e0783 100644
--- a/mm/gup_test.c
+++ b/mm/gup_test.c
@@ -146,18 +146,17 @@ static int __gup_test_ioctl(unsigned int cmd,
pages + i);
break;
case PIN_BASIC_TEST:
- nr = pin_user_pages(addr, nr, gup->gup_flags, pages + i,
- NULL);
+ nr = pin_user_pages(addr, nr, gup->gup_flags, pages + i);
break;
case PIN_LONGTERM_BENCHMARK:
nr = pin_user_pages(addr, nr,
gup->gup_flags | FOLL_LONGTERM,
- pages + i, NULL);
+ pages + i);
break;
case DUMP_USER_PAGES_TEST:
if (gup->test_flags & GUP_TEST_FLAG_DUMP_PAGES_USE_PIN)
nr = pin_user_pages(addr, nr, gup->gup_flags,
- pages + i, NULL);
+ pages + i);
else
nr = get_user_pages(addr, nr, gup->gup_flags,
pages + i);
@@ -270,7 +269,7 @@ static inline int pin_longterm_test_start(unsigned long arg)
gup_flags, pages);
else
cur_pages = pin_user_pages(addr, remaining_pages,
- gup_flags, pages, NULL);
+ gup_flags, pages);
if (cur_pages < 0) {
pin_longterm_test_stop();
ret = cur_pages;
diff --git a/net/xdp/xdp_umem.c b/net/xdp/xdp_umem.c
index 02207e852d79..06cead2b8e34 100644
--- a/net/xdp/xdp_umem.c
+++ b/net/xdp/xdp_umem.c
@@ -103,7 +103,7 @@ static int xdp_umem_pin_pages(struct xdp_umem *umem, unsigned long address)
mmap_read_lock(current->mm);
npgs = pin_user_pages(address, umem->npgs,
- gup_flags | FOLL_LONGTERM, &umem->pgs[0], NULL);
+ gup_flags | FOLL_LONGTERM, &umem->pgs[0]);
mmap_read_unlock(current->mm);
if (npgs != umem->npgs) {
--
2.40.0
Now we have eliminated all callers to GUP APIs which use the vmas
parameter, eliminate it altogether.
This eliminates a class of bugs where vmas might have been kept around
longer than the mmap_lock was held. We therefore need not be concerned about
locks being dropped during this operation leaving behind dangling pointers.
This simplifies the GUP API and makes it considerably clearer as to its
purpose - follow flags are applied and if pinning, an array of pages is
returned.
Signed-off-by: Lorenzo Stoakes <[email protected]>
---
include/linux/hugetlb.h | 10 ++---
mm/gup.c | 83 +++++++++++++++--------------------------
mm/hugetlb.c | 24 +++++-------
3 files changed, 45 insertions(+), 72 deletions(-)
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 28703fe22386..2735e7a2b998 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -141,9 +141,8 @@ int copy_hugetlb_page_range(struct mm_struct *, struct mm_struct *,
struct page *hugetlb_follow_page_mask(struct vm_area_struct *vma,
unsigned long address, unsigned int flags);
long follow_hugetlb_page(struct mm_struct *, struct vm_area_struct *,
- struct page **, struct vm_area_struct **,
- unsigned long *, unsigned long *, long, unsigned int,
- int *);
+ struct page **, unsigned long *, unsigned long *,
+ long, unsigned int, int *);
void unmap_hugepage_range(struct vm_area_struct *,
unsigned long, unsigned long, struct page *,
zap_flags_t);
@@ -297,9 +296,8 @@ static inline struct page *hugetlb_follow_page_mask(struct vm_area_struct *vma,
static inline long follow_hugetlb_page(struct mm_struct *mm,
struct vm_area_struct *vma, struct page **pages,
- struct vm_area_struct **vmas, unsigned long *position,
- unsigned long *nr_pages, long i, unsigned int flags,
- int *nonblocking)
+ unsigned long *position, unsigned long *nr_pages,
+ long i, unsigned int flags, int *nonblocking)
{
BUG();
return 0;
diff --git a/mm/gup.c b/mm/gup.c
index 714970ef3b30..385e428a4acb 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1028,8 +1028,6 @@ static int check_vma_flags(struct vm_area_struct *vma, struct file *file,
* @pages: array that receives pointers to the pages pinned.
* Should be at least nr_pages long. Or NULL, if caller
* only intends to ensure the pages are faulted in.
- * @vmas: array of pointers to vmas corresponding to each page.
- * Or NULL if the caller does not require them.
* @locked: whether we're still with the mmap_lock held
*
* Returns either number of pages pinned (which may be less than the
@@ -1043,8 +1041,6 @@ static int check_vma_flags(struct vm_area_struct *vma, struct file *file,
*
* The caller is responsible for releasing returned @pages, via put_page().
*
- * @vmas are valid only as long as mmap_lock is held.
- *
* Must be called with mmap_lock held. It may be released. See below.
*
* __get_user_pages walks a process's page tables and takes a reference to
@@ -1080,7 +1076,7 @@ static int check_vma_flags(struct vm_area_struct *vma, struct file *file,
static long __get_user_pages(struct mm_struct *mm,
unsigned long start, unsigned long nr_pages,
unsigned int gup_flags, struct page **pages,
- struct vm_area_struct **vmas, int *locked)
+ int *locked)
{
long ret = 0, i = 0;
struct vm_area_struct *vma = NULL;
@@ -1124,9 +1120,9 @@ static long __get_user_pages(struct mm_struct *mm,
file = vma->vm_file;
if (is_vm_hugetlb_page(vma)) {
- i = follow_hugetlb_page(mm, vma, pages, vmas,
- &start, &nr_pages, i,
- gup_flags, locked);
+ i = follow_hugetlb_page(mm, vma, pages,
+ &start, &nr_pages, i,
+ gup_flags, locked);
if (!*locked) {
/*
* We've got a VM_FAULT_RETRY
@@ -1191,10 +1187,6 @@ static long __get_user_pages(struct mm_struct *mm,
ctx.page_mask = 0;
}
next_page:
- if (vmas) {
- vmas[i] = vma;
- ctx.page_mask = 0;
- }
page_increm = 1 + (~(start >> PAGE_SHIFT) & ctx.page_mask);
if (page_increm > nr_pages)
page_increm = nr_pages;
@@ -1349,7 +1341,6 @@ static __always_inline long __get_user_pages_locked(struct mm_struct *mm,
unsigned long start,
unsigned long nr_pages,
struct page **pages,
- struct vm_area_struct **vmas,
int *locked,
unsigned int flags)
{
@@ -1387,7 +1378,7 @@ static __always_inline long __get_user_pages_locked(struct mm_struct *mm,
pages_done = 0;
for (;;) {
ret = __get_user_pages(mm, start, nr_pages, flags, pages,
- vmas, locked);
+ locked);
if (!(flags & FOLL_UNLOCKABLE)) {
/* VM_FAULT_RETRY couldn't trigger, bypass */
pages_done = ret;
@@ -1451,7 +1442,7 @@ static __always_inline long __get_user_pages_locked(struct mm_struct *mm,
*locked = 1;
ret = __get_user_pages(mm, start, 1, flags | FOLL_TRIED,
- pages, NULL, locked);
+ pages, locked);
if (!*locked) {
/* Continue to retry until we succeeded */
BUG_ON(ret != 0);
@@ -1549,7 +1540,7 @@ long populate_vma_page_range(struct vm_area_struct *vma,
* not result in a stack expansion that recurses back here.
*/
ret = __get_user_pages(mm, start, nr_pages, gup_flags,
- NULL, NULL, locked ? locked : &local_locked);
+ NULL, locked ? locked : &local_locked);
lru_add_drain();
return ret;
}
@@ -1607,7 +1598,7 @@ long faultin_vma_page_range(struct vm_area_struct *vma, unsigned long start,
return -EINVAL;
ret = __get_user_pages(mm, start, nr_pages, gup_flags,
- NULL, NULL, locked);
+ NULL, locked);
lru_add_drain();
return ret;
}
@@ -1675,8 +1666,7 @@ int __mm_populate(unsigned long start, unsigned long len, int ignore_errors)
#else /* CONFIG_MMU */
static long __get_user_pages_locked(struct mm_struct *mm, unsigned long start,
unsigned long nr_pages, struct page **pages,
- struct vm_area_struct **vmas, int *locked,
- unsigned int foll_flags)
+ int *locked, unsigned int foll_flags)
{
struct vm_area_struct *vma;
bool must_unlock = false;
@@ -1720,8 +1710,7 @@ static long __get_user_pages_locked(struct mm_struct *mm, unsigned long start,
if (pages[i])
get_page(pages[i]);
}
- if (vmas)
- vmas[i] = vma;
+
start = (start + PAGE_SIZE) & PAGE_MASK;
}
@@ -1902,8 +1891,7 @@ struct page *get_dump_page(unsigned long addr)
int locked = 0;
int ret;
- ret = __get_user_pages_locked(current->mm, addr, 1, &page, NULL,
- &locked,
+ ret = __get_user_pages_locked(current->mm, addr, 1, &page, &locked,
FOLL_FORCE | FOLL_DUMP | FOLL_GET);
return (ret == 1) ? page : NULL;
}
@@ -2076,7 +2064,6 @@ static long __gup_longterm_locked(struct mm_struct *mm,
unsigned long start,
unsigned long nr_pages,
struct page **pages,
- struct vm_area_struct **vmas,
int *locked,
unsigned int gup_flags)
{
@@ -2084,13 +2071,13 @@ static long __gup_longterm_locked(struct mm_struct *mm,
long rc, nr_pinned_pages;
if (!(gup_flags & FOLL_LONGTERM))
- return __get_user_pages_locked(mm, start, nr_pages, pages, vmas,
+ return __get_user_pages_locked(mm, start, nr_pages, pages,
locked, gup_flags);
flags = memalloc_pin_save();
do {
nr_pinned_pages = __get_user_pages_locked(mm, start, nr_pages,
- pages, vmas, locked,
+ pages, locked,
gup_flags);
if (nr_pinned_pages <= 0) {
rc = nr_pinned_pages;
@@ -2108,9 +2095,8 @@ static long __gup_longterm_locked(struct mm_struct *mm,
* Check that the given flags are valid for the exported gup/pup interface, and
* update them with the required flags that the caller must have set.
*/
-static bool is_valid_gup_args(struct page **pages, struct vm_area_struct **vmas,
- int *locked, unsigned int *gup_flags_p,
- unsigned int to_set)
+static bool is_valid_gup_args(struct page **pages, int *locked,
+ unsigned int *gup_flags_p, unsigned int to_set)
{
unsigned int gup_flags = *gup_flags_p;
@@ -2152,13 +2138,6 @@ static bool is_valid_gup_args(struct page **pages, struct vm_area_struct **vmas,
(gup_flags & FOLL_PCI_P2PDMA)))
return false;
- /*
- * Can't use VMAs with locked, as locked allows GUP to unlock
- * which invalidates the vmas array
- */
- if (WARN_ON_ONCE(vmas && (gup_flags & FOLL_UNLOCKABLE)))
- return false;
-
*gup_flags_p = gup_flags;
return true;
}
@@ -2227,11 +2206,11 @@ long get_user_pages_remote(struct mm_struct *mm,
{
int local_locked = 1;
- if (!is_valid_gup_args(pages, NULL, locked, &gup_flags,
+ if (!is_valid_gup_args(pages, locked, &gup_flags,
FOLL_TOUCH | FOLL_REMOTE))
return -EINVAL;
- return __get_user_pages_locked(mm, start, nr_pages, pages, NULL,
+ return __get_user_pages_locked(mm, start, nr_pages, pages,
locked ? locked : &local_locked,
gup_flags);
}
@@ -2266,11 +2245,11 @@ long get_user_pages(unsigned long start, unsigned long nr_pages,
{
int locked = 1;
- if (!is_valid_gup_args(pages, NULL, NULL, &gup_flags, FOLL_TOUCH))
+ if (!is_valid_gup_args(pages, NULL, &gup_flags, FOLL_TOUCH))
return -EINVAL;
return __get_user_pages_locked(current->mm, start, nr_pages, pages,
- NULL, &locked, gup_flags);
+ &locked, gup_flags);
}
EXPORT_SYMBOL(get_user_pages);
@@ -2294,12 +2273,12 @@ long get_user_pages_unlocked(unsigned long start, unsigned long nr_pages,
{
int locked = 0;
- if (!is_valid_gup_args(pages, NULL, NULL, &gup_flags,
+ if (!is_valid_gup_args(pages, NULL, &gup_flags,
FOLL_TOUCH | FOLL_UNLOCKABLE))
return -EINVAL;
return __get_user_pages_locked(current->mm, start, nr_pages, pages,
- NULL, &locked, gup_flags);
+ &locked, gup_flags);
}
EXPORT_SYMBOL(get_user_pages_unlocked);
@@ -2982,7 +2961,7 @@ static int internal_get_user_pages_fast(unsigned long start,
start += nr_pinned << PAGE_SHIFT;
pages += nr_pinned;
ret = __gup_longterm_locked(current->mm, start, nr_pages - nr_pinned,
- pages, NULL, &locked,
+ pages, &locked,
gup_flags | FOLL_TOUCH | FOLL_UNLOCKABLE);
if (ret < 0) {
/*
@@ -3024,7 +3003,7 @@ int get_user_pages_fast_only(unsigned long start, int nr_pages,
* FOLL_FAST_ONLY is required in order to match the API description of
* this routine: no fall back to regular ("slow") GUP.
*/
- if (!is_valid_gup_args(pages, NULL, NULL, &gup_flags,
+ if (!is_valid_gup_args(pages, NULL, &gup_flags,
FOLL_GET | FOLL_FAST_ONLY))
return -EINVAL;
@@ -3057,7 +3036,7 @@ int get_user_pages_fast(unsigned long start, int nr_pages,
* FOLL_GET, because gup fast is always a "pin with a +1 page refcount"
* request.
*/
- if (!is_valid_gup_args(pages, NULL, NULL, &gup_flags, FOLL_GET))
+ if (!is_valid_gup_args(pages, NULL, &gup_flags, FOLL_GET))
return -EINVAL;
return internal_get_user_pages_fast(start, nr_pages, gup_flags, pages);
}
@@ -3082,7 +3061,7 @@ EXPORT_SYMBOL_GPL(get_user_pages_fast);
int pin_user_pages_fast(unsigned long start, int nr_pages,
unsigned int gup_flags, struct page **pages)
{
- if (!is_valid_gup_args(pages, NULL, NULL, &gup_flags, FOLL_PIN))
+ if (!is_valid_gup_args(pages, NULL, &gup_flags, FOLL_PIN))
return -EINVAL;
return internal_get_user_pages_fast(start, nr_pages, gup_flags, pages);
}
@@ -3115,10 +3094,10 @@ long pin_user_pages_remote(struct mm_struct *mm,
{
int local_locked = 1;
- if (!is_valid_gup_args(pages, NULL, locked, &gup_flags,
+ if (!is_valid_gup_args(pages, locked, &gup_flags,
FOLL_PIN | FOLL_TOUCH | FOLL_REMOTE))
return 0;
- return __gup_longterm_locked(mm, start, nr_pages, pages, NULL,
+ return __gup_longterm_locked(mm, start, nr_pages, pages,
locked ? locked : &local_locked,
gup_flags);
}
@@ -3144,10 +3123,10 @@ long pin_user_pages(unsigned long start, unsigned long nr_pages,
{
int locked = 1;
- if (!is_valid_gup_args(pages, NULL, NULL, &gup_flags, FOLL_PIN))
+ if (!is_valid_gup_args(pages, NULL, &gup_flags, FOLL_PIN))
return 0;
return __gup_longterm_locked(current->mm, start, nr_pages,
- pages, NULL, &locked, gup_flags);
+ pages, &locked, gup_flags);
}
EXPORT_SYMBOL(pin_user_pages);
@@ -3161,11 +3140,11 @@ long pin_user_pages_unlocked(unsigned long start, unsigned long nr_pages,
{
int locked = 0;
- if (!is_valid_gup_args(pages, NULL, NULL, &gup_flags,
+ if (!is_valid_gup_args(pages, NULL, &gup_flags,
FOLL_PIN | FOLL_TOUCH | FOLL_UNLOCKABLE))
return 0;
- return __gup_longterm_locked(current->mm, start, nr_pages, pages, NULL,
+ return __gup_longterm_locked(current->mm, start, nr_pages, pages,
&locked, gup_flags);
}
EXPORT_SYMBOL(pin_user_pages_unlocked);
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index a08fb47fb200..85138a0394b9 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -6371,17 +6371,14 @@ int hugetlb_mfill_atomic_pte(pte_t *dst_pte,
}
#endif /* CONFIG_USERFAULTFD */
-static void record_subpages_vmas(struct page *page, struct vm_area_struct *vma,
- int refs, struct page **pages,
- struct vm_area_struct **vmas)
+static void record_subpages(struct page *page, struct vm_area_struct *vma,
+ int refs, struct page **pages)
{
int nr;
for (nr = 0; nr < refs; nr++) {
if (likely(pages))
pages[nr] = nth_page(page, nr);
- if (vmas)
- vmas[nr] = vma;
}
}
@@ -6454,9 +6451,9 @@ struct page *hugetlb_follow_page_mask(struct vm_area_struct *vma,
}
long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
- struct page **pages, struct vm_area_struct **vmas,
- unsigned long *position, unsigned long *nr_pages,
- long i, unsigned int flags, int *locked)
+ struct page **pages, unsigned long *position,
+ unsigned long *nr_pages, long i, unsigned int flags,
+ int *locked)
{
unsigned long pfn_offset;
unsigned long vaddr = *position;
@@ -6584,7 +6581,7 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
* If subpage information not requested, update counters
* and skip the same_page loop below.
*/
- if (!pages && !vmas && !pfn_offset &&
+ if (!pages && !pfn_offset &&
(vaddr + huge_page_size(h) < vma->vm_end) &&
(remainder >= pages_per_huge_page(h))) {
vaddr += huge_page_size(h);
@@ -6599,11 +6596,10 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
refs = min3(pages_per_huge_page(h) - pfn_offset, remainder,
(vma->vm_end - ALIGN_DOWN(vaddr, PAGE_SIZE)) >> PAGE_SHIFT);
- if (pages || vmas)
- record_subpages_vmas(nth_page(page, pfn_offset),
- vma, refs,
- likely(pages) ? pages + i : NULL,
- vmas ? vmas + i : NULL);
+ if (pages)
+ record_subpages(nth_page(page, pfn_offset),
+ vma, refs,
+ likely(pages) ? pages + i : NULL);
if (pages) {
/*
--
2.40.0
The only instances of get_user_pages_remote() invocations which used the
vmas parameter were for a single page which can instead simply look up the
VMA directly. In particular:-
- __update_ref_ctr() looked up the VMA but did nothing with it so we simply
remove it.
- __access_remote_vm() was already using vma_lookup() when the original
lookup failed so by doing the lookup directly this also de-duplicates the
code.
This forms part of a broader set of patches intended to eliminate the vmas
parameter altogether.
Signed-off-by: Lorenzo Stoakes <[email protected]>
---
arch/arm64/kernel/mte.c | 5 +++--
arch/s390/kvm/interrupt.c | 2 +-
fs/exec.c | 2 +-
include/linux/mm.h | 2 +-
kernel/events/uprobes.c | 10 +++++-----
mm/gup.c | 12 ++++--------
mm/memory.c | 9 +++++----
mm/rmap.c | 2 +-
security/tomoyo/domain.c | 2 +-
virt/kvm/async_pf.c | 3 +--
10 files changed, 23 insertions(+), 26 deletions(-)
diff --git a/arch/arm64/kernel/mte.c b/arch/arm64/kernel/mte.c
index f5bcb0dc6267..74d8d4007dec 100644
--- a/arch/arm64/kernel/mte.c
+++ b/arch/arm64/kernel/mte.c
@@ -437,8 +437,9 @@ static int __access_remote_tags(struct mm_struct *mm, unsigned long addr,
struct page *page = NULL;
ret = get_user_pages_remote(mm, addr, 1, gup_flags, &page,
- &vma, NULL);
- if (ret <= 0)
+ NULL);
+ vma = vma_lookup(mm, addr);
+ if (ret <= 0 || !vma)
break;
/*
diff --git a/arch/s390/kvm/interrupt.c b/arch/s390/kvm/interrupt.c
index 9250fde1f97d..c19d0cb7d2f2 100644
--- a/arch/s390/kvm/interrupt.c
+++ b/arch/s390/kvm/interrupt.c
@@ -2777,7 +2777,7 @@ static struct page *get_map_page(struct kvm *kvm, u64 uaddr)
mmap_read_lock(kvm->mm);
get_user_pages_remote(kvm->mm, uaddr, 1, FOLL_WRITE,
- &page, NULL, NULL);
+ &page, NULL);
mmap_read_unlock(kvm->mm);
return page;
}
diff --git a/fs/exec.c b/fs/exec.c
index 87cf3a2f0e9a..d8d48ee15aac 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -219,7 +219,7 @@ static struct page *get_arg_page(struct linux_binprm *bprm, unsigned long pos,
*/
mmap_read_lock(bprm->mm);
ret = get_user_pages_remote(bprm->mm, pos, 1, gup_flags,
- &page, NULL, NULL);
+ &page, NULL);
mmap_read_unlock(bprm->mm);
if (ret <= 0)
return NULL;
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 513d5fab02f1..8dfa236cfb58 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2374,7 +2374,7 @@ extern int __access_remote_vm(struct mm_struct *mm, unsigned long addr,
long get_user_pages_remote(struct mm_struct *mm,
unsigned long start, unsigned long nr_pages,
unsigned int gup_flags, struct page **pages,
- struct vm_area_struct **vmas, int *locked);
+ int *locked);
long pin_user_pages_remote(struct mm_struct *mm,
unsigned long start, unsigned long nr_pages,
unsigned int gup_flags, struct page **pages,
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 59887c69d54c..35e8a7ec884c 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -365,7 +365,6 @@ __update_ref_ctr(struct mm_struct *mm, unsigned long vaddr, short d)
{
void *kaddr;
struct page *page;
- struct vm_area_struct *vma;
int ret;
short *ptr;
@@ -373,7 +372,7 @@ __update_ref_ctr(struct mm_struct *mm, unsigned long vaddr, short d)
return -EINVAL;
ret = get_user_pages_remote(mm, vaddr, 1,
- FOLL_WRITE, &page, &vma, NULL);
+ FOLL_WRITE, &page, NULL);
if (unlikely(ret <= 0)) {
/*
* We are asking for 1 page. If get_user_pages_remote() fails,
@@ -475,8 +474,9 @@ int uprobe_write_opcode(struct arch_uprobe *auprobe, struct mm_struct *mm,
gup_flags |= FOLL_SPLIT_PMD;
/* Read the page with vaddr into memory */
ret = get_user_pages_remote(mm, vaddr, 1, gup_flags,
- &old_page, &vma, NULL);
- if (ret <= 0)
+ &old_page, NULL);
+ vma = vma_lookup(mm, vaddr);
+ if (ret <= 0 || !vma)
return ret;
ret = verify_opcode(old_page, vaddr, &opcode);
@@ -2028,7 +2028,7 @@ static int is_trap_at_addr(struct mm_struct *mm, unsigned long vaddr)
* essentially a kernel access to the memory.
*/
result = get_user_pages_remote(mm, vaddr, 1, FOLL_FORCE, &page,
- NULL, NULL);
+ NULL);
if (result < 0)
return result;
diff --git a/mm/gup.c b/mm/gup.c
index 931c805bc32b..9440aa54c741 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -2165,8 +2165,6 @@ static bool is_valid_gup_args(struct page **pages, struct vm_area_struct **vmas,
* @pages: array that receives pointers to the pages pinned.
* Should be at least nr_pages long. Or NULL, if caller
* only intends to ensure the pages are faulted in.
- * @vmas: array of pointers to vmas corresponding to each page.
- * Or NULL if the caller does not require them.
* @locked: pointer to lock flag indicating whether lock is held and
* subsequently whether VM_FAULT_RETRY functionality can be
* utilised. Lock must initially be held.
@@ -2181,8 +2179,6 @@ static bool is_valid_gup_args(struct page **pages, struct vm_area_struct **vmas,
*
* The caller is responsible for releasing returned @pages, via put_page().
*
- * @vmas are valid only as long as mmap_lock is held.
- *
* Must be called with mmap_lock held for read or write.
*
* get_user_pages_remote walks a process's page tables and takes a reference
@@ -2219,15 +2215,15 @@ static bool is_valid_gup_args(struct page **pages, struct vm_area_struct **vmas,
long get_user_pages_remote(struct mm_struct *mm,
unsigned long start, unsigned long nr_pages,
unsigned int gup_flags, struct page **pages,
- struct vm_area_struct **vmas, int *locked)
+ int *locked)
{
int local_locked = 1;
- if (!is_valid_gup_args(pages, vmas, locked, &gup_flags,
+ if (!is_valid_gup_args(pages, NULL, locked, &gup_flags,
FOLL_TOUCH | FOLL_REMOTE))
return -EINVAL;
- return __get_user_pages_locked(mm, start, nr_pages, pages, vmas,
+ return __get_user_pages_locked(mm, start, nr_pages, pages, NULL,
locked ? locked : &local_locked,
gup_flags);
}
@@ -2237,7 +2233,7 @@ EXPORT_SYMBOL(get_user_pages_remote);
long get_user_pages_remote(struct mm_struct *mm,
unsigned long start, unsigned long nr_pages,
unsigned int gup_flags, struct page **pages,
- struct vm_area_struct **vmas, int *locked)
+ int *locked)
{
return 0;
}
diff --git a/mm/memory.c b/mm/memory.c
index ea8fdca35df3..43426147f9f7 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -5596,7 +5596,11 @@ int __access_remote_vm(struct mm_struct *mm, unsigned long addr, void *buf,
struct page *page = NULL;
ret = get_user_pages_remote(mm, addr, 1,
- gup_flags, &page, &vma, NULL);
+ gup_flags, &page, NULL);
+ vma = vma_lookup(mm, addr);
+ if (!vma)
+ break;
+
if (ret <= 0) {
#ifndef CONFIG_HAVE_IOREMAP_PROT
break;
@@ -5605,9 +5609,6 @@ int __access_remote_vm(struct mm_struct *mm, unsigned long addr, void *buf,
* Check if this is a VM_IO | VM_PFNMAP VMA, which
* we can access using slightly different code.
*/
- vma = vma_lookup(mm, addr);
- if (!vma)
- break;
if (vma->vm_ops && vma->vm_ops->access)
ret = vma->vm_ops->access(vma, addr, buf,
len, write);
diff --git a/mm/rmap.c b/mm/rmap.c
index ba901c416785..756ea8a9bb90 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -2324,7 +2324,7 @@ int make_device_exclusive_range(struct mm_struct *mm, unsigned long start,
npages = get_user_pages_remote(mm, start, npages,
FOLL_GET | FOLL_WRITE | FOLL_SPLIT_PMD,
- pages, NULL, NULL);
+ pages, NULL);
if (npages < 0)
return npages;
diff --git a/security/tomoyo/domain.c b/security/tomoyo/domain.c
index 31af29f669d2..ac20c0bdff9d 100644
--- a/security/tomoyo/domain.c
+++ b/security/tomoyo/domain.c
@@ -916,7 +916,7 @@ bool tomoyo_dump_page(struct linux_binprm *bprm, unsigned long pos,
*/
mmap_read_lock(bprm->mm);
ret = get_user_pages_remote(bprm->mm, pos, 1,
- FOLL_FORCE, &page, NULL, NULL);
+ FOLL_FORCE, &page, NULL);
mmap_read_unlock(bprm->mm);
if (ret <= 0)
return false;
diff --git a/virt/kvm/async_pf.c b/virt/kvm/async_pf.c
index 9bfe1d6f6529..e033c79d528e 100644
--- a/virt/kvm/async_pf.c
+++ b/virt/kvm/async_pf.c
@@ -61,8 +61,7 @@ static void async_pf_execute(struct work_struct *work)
* access remotely.
*/
mmap_read_lock(mm);
- get_user_pages_remote(mm, addr, 1, FOLL_WRITE, NULL, NULL,
- &locked);
+ get_user_pages_remote(mm, addr, 1, FOLL_WRITE, NULL, &locked);
if (locked)
mmap_read_unlock(mm);
--
2.40.0
On 2023/04/15 8:27, Lorenzo Stoakes wrote:
> diff --git a/arch/arm64/kernel/mte.c b/arch/arm64/kernel/mte.c
> index f5bcb0dc6267..74d8d4007dec 100644
> --- a/arch/arm64/kernel/mte.c
> +++ b/arch/arm64/kernel/mte.c
> @@ -437,8 +437,9 @@ static int __access_remote_tags(struct mm_struct *mm, unsigned long addr,
> struct page *page = NULL;
>
> ret = get_user_pages_remote(mm, addr, 1, gup_flags, &page,
> - &vma, NULL);
> - if (ret <= 0)
> + NULL);
> + vma = vma_lookup(mm, addr);
> + if (ret <= 0 || !vma)
> break;
This conversion looks wrong. When get_user_pages_remote(&page) returned > 0,
put_page(page) is needed even if vma_lookup() returned NULL, isn't it?
On Sat, Apr 15, 2023 at 12:27:13AM +0100, Lorenzo Stoakes wrote:
> No invocation of get_user_pages() uses the vmas parameter, so remove
> it.
>
> The GUP API is confusing and caveated. Recent changes have done much to
> improve that, however there is more we can do. Exporting vmas is a prime
> target as the caller has to be extremely careful to preclude their use
> after the mmap_lock has expired or otherwise be left with dangling
> pointers.
>
> Removing the vmas parameter focuses the GUP functions upon their primary
> purpose - pinning (and outputting) pages as well as performing the actions
> implied by the input flags.
>
> This is part of a patch series aiming to remove the vmas parameter
> altogether.
>
> Signed-off-by: Lorenzo Stoakes <[email protected]>
> Suggested-by: Matthew Wilcox (Oracle) <[email protected]>
Acked-by: Greg Kroah-Hartman <[email protected]>
On Sat, Apr 15, 2023 at 09:25:51AM +0900, Tetsuo Handa wrote:
> On 2023/04/15 8:27, Lorenzo Stoakes wrote:
> > diff --git a/arch/arm64/kernel/mte.c b/arch/arm64/kernel/mte.c
> > index f5bcb0dc6267..74d8d4007dec 100644
> > --- a/arch/arm64/kernel/mte.c
> > +++ b/arch/arm64/kernel/mte.c
> > @@ -437,8 +437,9 @@ static int __access_remote_tags(struct mm_struct *mm, unsigned long addr,
> > struct page *page = NULL;
> >
> > ret = get_user_pages_remote(mm, addr, 1, gup_flags, &page,
> > - &vma, NULL);
> > - if (ret <= 0)
> > + NULL);
> > + vma = vma_lookup(mm, addr);
> > + if (ret <= 0 || !vma)
> > break;
>
> This conversion looks wrong. When get_user_pages_remote(&page) returned > 0,
> put_page(page) is needed even if vma_lookup() returned NULL, isn't it?
>
You're right, though actually it's not possible for ret > 0 and vma == NULL
because the GUP code requires the VMA to exist for it to have returned > 0.
I was trying to be too cute here, I think; in any case we only want to do
that lookup if the GUP succeeded.
Let me respin with a fix for this.
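
For illustration only, here is a userspace model of the ordering discussed
above: look up the VMA only once GUP has succeeded, and drop the page
reference if the lookup fails. The mock_* types and helpers are hypothetical
stand-ins, not kernel code, and this is a sketch of the idea rather than
the respun patch.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Userspace stand-ins; none of these are the kernel's real types. */
struct mock_page { int refcount; };
struct mock_vma  { unsigned long start, end; };

struct mock_mm {
	struct mock_vma *vma;	/* single mapping, or NULL if none */
	struct mock_page page;	/* page backing that mapping */
};

/* Models vma_lookup(): the VMA containing addr, or NULL. */
static struct mock_vma *mock_vma_lookup(struct mock_mm *mm, unsigned long addr)
{
	if (mm->vma && addr >= mm->vma->start && addr < mm->vma->end)
		return mm->vma;
	return NULL;
}

/* Models a one-page get_user_pages_remote(): takes a reference on success. */
static long mock_gup_one(struct mock_mm *mm, unsigned long addr,
			 struct mock_page **page)
{
	if (!mock_vma_lookup(mm, addr))
		return -EFAULT;
	mm->page.refcount++;
	*page = &mm->page;
	return 1;
}

/* Models put_page(): drops the reference taken by GUP. */
static void mock_put_page(struct mock_page *page)
{
	assert(page->refcount > 0);
	page->refcount--;
}

/* GUP first; only then look up the VMA, releasing the page on failure. */
static long access_one_page(struct mock_mm *mm, unsigned long addr,
			    struct mock_vma **vma_out)
{
	struct mock_page *page = NULL;
	long ret = mock_gup_one(mm, addr, &page);

	if (ret <= 0)
		return ret;

	*vma_out = mock_vma_lookup(mm, addr);
	if (!*vma_out) {
		mock_put_page(page);	/* do not leak the reference */
		return -EINVAL;
	}

	/* ... a real caller would use page and *vma_out here ... */
	mock_put_page(page);
	return 1;
}
```

In this model every successful mock_gup_one() is paired with exactly one
mock_put_page(), whichever branch is taken.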
On Sat, Apr 15, 2023 at 12:27:45AM +0100, Lorenzo Stoakes wrote:
> Commit edd478269640 ("io_uring/rsrc: disallow multi-source reg buffers")
> prevents io_pin_pages() from pinning pages spanning multiple VMAs with
> permitted characteristics (anon/huge), requiring that all VMAs share the
> same vm_file.
That commit doesn't really explain why io_uring is doing such a weird
thing.
What exactly is the problem with mixing struct pages from different
files and why of all the GUP users does only io_uring need to care
about this?
If there is no justification then let's revert that commit instead.
> /* don't support file backed memory */
> - for (i = 0; i < nr_pages; i++) {
> - if (vmas[i]->vm_file != file) {
> - ret = -EINVAL;
> - break;
> - }
> - if (!file)
> - continue;
> - if (!vma_is_shmem(vmas[i]) && !is_file_hugepages(file)) {
> - ret = -EOPNOTSUPP;
> - break;
> - }
> - }
> + file = vma->vm_file;
> + if (file && !vma_is_shmem(vma) && !is_file_hugepages(file))
> + ret = -EOPNOTSUPP;
> +
Also, why is it doing this?
All GUP users don't work entirely right for any fops implementation
that assumes write protect is unconditionally possible, e.g. most
filesystems.
We've been ignoring blocking it because it is an ABI break and it does
sort of work in some cases.
I'd rather see something like FOLL_ALLOW_BROKEN_FILE_MAPPINGS than
io_uring open coding this kind of stuff.
Jason
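
For reference, the two forms of the check in the hunk quoted above can be
compared in a small userspace model. Everything named mock_* is a
hypothetical stand-in, and the model assumes the old loop starts from
file = vmas[0]->vm_file and that VMAs mapping the same file agree on
whether they are shmem. It is a sketch, not the io_uring code.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Stand-in for struct file. vma_is_shmem() really inspects the VMA; the
 * model hangs the property off the file, assuming VMAs of one file agree.
 */
struct mock_file { bool is_shmem; bool is_hugetlb; };

/* Stand-in for struct vm_area_struct; vm_file is NULL for anonymous memory. */
struct mock_vma { struct mock_file *vm_file; };

/* Old form: walk a per-page vmas[] array, as in the removed loop. */
static int check_per_page(struct mock_vma **vmas, int nr_pages)
{
	struct mock_file *file;
	int i;

	assert(nr_pages > 0);
	file = vmas[0]->vm_file;	/* assumed starting point */

	for (i = 0; i < nr_pages; i++) {
		if (vmas[i]->vm_file != file)
			return -EINVAL;
		if (!file)
			continue;
		if (!file->is_shmem && !file->is_hugetlb)
			return -EOPNOTSUPP;
	}
	return 0;
}

/* New form: a same-file assertion plus one check on the first VMA. */
static int check_same_file_then_first(struct mock_vma **vmas, int nr_pages)
{
	struct mock_file *file;
	int i;

	assert(nr_pages > 0);
	file = vmas[0]->vm_file;

	/* Models what FOLL_SAME_FILE would assert inside GUP. */
	for (i = 1; i < nr_pages; i++)
		if (vmas[i]->vm_file != file)
			return -EINVAL;

	/* Models the caller's check on the first VMA alone. */
	if (file && !file->is_shmem && !file->is_hugetlb)
		return -EOPNOTSUPP;
	return 0;
}
```

Under these assumptions both forms accept and reject the same inputs. The
error code can differ when the first VMA is unsupported and a later one
maps a different file: the old loop reports -EOPNOTSUPP first, whereas the
new form reports -EINVAL.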
On Sat, Apr 15, 2023 at 12:27:13AM +0100, Lorenzo Stoakes wrote:
> No invocation of get_user_pages() uses the vmas parameter, so remove
> it.
>
> The GUP API is confusing and caveated. Recent changes have done much to
> improve that, however there is more we can do. Exporting vmas is a prime
> target as the caller has to be extremely careful to preclude their use
> after the mmap_lock has expired or otherwise be left with dangling
> pointers.
>
> Removing the vmas parameter focuses the GUP functions upon their primary
> purpose - pinning (and outputting) pages as well as performing the actions
> implied by the input flags.
>
> This is part of a patch series aiming to remove the vmas parameter
> altogether.
>
> Signed-off-by: Lorenzo Stoakes <[email protected]>
> Suggested-by: Matthew Wilcox (Oracle) <[email protected]>
> ---
> arch/x86/kernel/cpu/sgx/ioctl.c | 2 +-
> drivers/gpu/drm/radeon/radeon_ttm.c | 2 +-
> drivers/misc/sgi-gru/grufault.c | 2 +-
> include/linux/mm.h | 3 +--
> mm/gup.c | 9 +++------
> mm/gup_test.c | 5 ++---
> virt/kvm/kvm_main.c | 4 ++--
> 7 files changed, 11 insertions(+), 16 deletions(-)
Reviewed-by: Jason Gunthorpe <[email protected]>
Jason
On Sat, Apr 15, 2023 at 12:27:23AM +0100, Lorenzo Stoakes wrote:
> No invocation of pin_user_pages_remote() uses the vmas parameter, so remove
> it. This forms part of a larger patch set eliminating the use of the vmas
> parameters altogether.
>
> Signed-off-by: Lorenzo Stoakes <[email protected]>
> ---
> drivers/iommu/iommufd/pages.c | 4 ++--
> drivers/vfio/vfio_iommu_type1.c | 2 +-
> include/linux/mm.h | 2 +-
> mm/gup.c | 8 +++-----
> mm/process_vm_access.c | 2 +-
> 5 files changed, 8 insertions(+), 10 deletions(-)
Reviewed-by: Jason Gunthorpe <[email protected]>
Jason
On Sat, Apr 15, 2023 at 12:27:31AM +0100, Lorenzo Stoakes wrote:
> The only instances of get_user_pages_remote() invocations which used the
> vmas parameter were for a single page which can instead simply look up the
> VMA directly. In particular:-
>
> - __update_ref_ctr() looked up the VMA but did nothing with it so we simply
> remove it.
>
> - __access_remote_vm() was already using vma_lookup() when the original
> lookup failed so by doing the lookup directly this also de-duplicates the
> code.
>
> This forms part of a broader set of patches intended to eliminate the vmas
> parameter altogether.
>
> Signed-off-by: Lorenzo Stoakes <[email protected]>
> ---
> arch/arm64/kernel/mte.c | 5 +++--
> arch/s390/kvm/interrupt.c | 2 +-
> fs/exec.c | 2 +-
> include/linux/mm.h | 2 +-
> kernel/events/uprobes.c | 10 +++++-----
> mm/gup.c | 12 ++++--------
> mm/memory.c | 9 +++++----
> mm/rmap.c | 2 +-
> security/tomoyo/domain.c | 2 +-
> virt/kvm/async_pf.c | 3 +--
> 10 files changed, 23 insertions(+), 26 deletions(-)
>
> diff --git a/arch/arm64/kernel/mte.c b/arch/arm64/kernel/mte.c
> index f5bcb0dc6267..74d8d4007dec 100644
> --- a/arch/arm64/kernel/mte.c
> +++ b/arch/arm64/kernel/mte.c
> @@ -437,8 +437,9 @@ static int __access_remote_tags(struct mm_struct *mm, unsigned long addr,
> struct page *page = NULL;
>
> ret = get_user_pages_remote(mm, addr, 1, gup_flags, &page,
> - &vma, NULL);
> - if (ret <= 0)
> + NULL);
> + vma = vma_lookup(mm, addr);
> + if (ret <= 0 || !vma)
> break;
Given the slightly tricky error handling, it would make sense to turn
this pattern into a helper function:
page = get_single_user_page_locked(mm, addr, gup_flags, &vma);
if (IS_ERR(page))
[..]
static inline struct page *get_single_user_page_locked(struct mm_struct *mm,
		unsigned long addr, int gup_flags, struct vm_area_struct **vma)
{
	struct page *page;
	int ret;

	ret = get_user_pages_remote(mm, addr, 1, gup_flags, &page, NULL, NULL);
	if (ret < 0)
		return ERR_PTR(ret);
	if (WARN_ON(ret == 0))
		return ERR_PTR(-EINVAL);
	*vma = vma_lookup(mm, addr);
	if (WARN_ON(!*vma)) {
		/* Drop the reference GUP took before reporting the error. */
		put_page(page);
		return ERR_PTR(-EINVAL);
	}
	return page;
}
It could be its own patch so this change was just a mechanical removal
of NULL
Jason
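
As an aside for readers unfamiliar with the ERR_PTR() convention the
helper above relies on, here is a userspace re-implementation of the idea:
a small negative errno is encoded in the pointer value itself, so a single
return value can carry either a valid pointer or an error. The mock_page
type, get_single_page() and use_page() are hypothetical, and the three
macros are modelled as functions purely for illustration.

```c
#include <errno.h>
#include <stdint.h>

#define MAX_ERRNO 4095

/* Encode a negative errno as a pointer value. */
static inline void *ERR_PTR(long error)
{
	return (void *)(intptr_t)error;
}

/* Recover the errno from an error pointer. */
static inline long PTR_ERR(const void *ptr)
{
	return (long)(intptr_t)ptr;
}

/* True if ptr lies in the top MAX_ERRNO values of the address space. */
static inline int IS_ERR(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

struct mock_page { int id; };

static struct mock_page the_page = { 42 };

/* Hypothetical lookup: succeeds only for addr == 0x1000. */
static struct mock_page *get_single_page(unsigned long addr)
{
	if (addr != 0x1000)
		return ERR_PTR(-EFAULT);
	return &the_page;
}

/* Caller pattern: test with IS_ERR(), then unpack with PTR_ERR(). */
static long use_page(unsigned long addr)
{
	struct mock_page *page = get_single_page(addr);

	if (IS_ERR(page))
		return PTR_ERR(page);
	return page->id;
}
```

The caller never has to inspect a separate return code and an output
pointer together, which is where the leak discussed earlier in the thread
crept in.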
On Mon, Apr 17, 2023 at 10:09:36AM -0300, Jason Gunthorpe wrote:
> On Sat, Apr 15, 2023 at 12:27:31AM +0100, Lorenzo Stoakes wrote:
> > The only instances of get_user_pages_remote() invocations which used the
> > vmas parameter were for a single page which can instead simply look up the
> > VMA directly. In particular:-
> >
> > - __update_ref_ctr() looked up the VMA but did nothing with it so we simply
> > remove it.
> >
> > - __access_remote_vm() was already using vma_lookup() when the original
> > lookup failed so by doing the lookup directly this also de-duplicates the
> > code.
> >
> > This forms part of a broader set of patches intended to eliminate the vmas
> > parameter altogether.
> >
> > Signed-off-by: Lorenzo Stoakes <[email protected]>
> > ---
> > arch/arm64/kernel/mte.c | 5 +++--
> > arch/s390/kvm/interrupt.c | 2 +-
> > fs/exec.c | 2 +-
> > include/linux/mm.h | 2 +-
> > kernel/events/uprobes.c | 10 +++++-----
> > mm/gup.c | 12 ++++--------
> > mm/memory.c | 9 +++++----
> > mm/rmap.c | 2 +-
> > security/tomoyo/domain.c | 2 +-
> > virt/kvm/async_pf.c | 3 +--
> > 10 files changed, 23 insertions(+), 26 deletions(-)
> >
> > diff --git a/arch/arm64/kernel/mte.c b/arch/arm64/kernel/mte.c
> > index f5bcb0dc6267..74d8d4007dec 100644
> > --- a/arch/arm64/kernel/mte.c
> > +++ b/arch/arm64/kernel/mte.c
> > @@ -437,8 +437,9 @@ static int __access_remote_tags(struct mm_struct *mm, unsigned long addr,
> > struct page *page = NULL;
> >
> > ret = get_user_pages_remote(mm, addr, 1, gup_flags, &page,
> > - &vma, NULL);
> > - if (ret <= 0)
> > + NULL);
> > + vma = vma_lookup(mm, addr);
> > + if (ret <= 0 || !vma)
> > break;
>
> Given the slightly tricky error handling, it would make sense to turn
> this pattern into a helper function:
>
> page = get_single_user_page_locked(mm, addr, gup_flags, &vma);
> if (IS_ERR(page))
> [..]
>
> static inline struct page *get_single_user_page_locked(struct mm_struct *mm,
> 		unsigned long addr, int gup_flags, struct vm_area_struct **vma)
> {
> 	struct page *page;
> 	int ret;
>
> 	ret = get_user_pages_remote(mm, addr, 1, gup_flags, &page, NULL, NULL);
> 	if (ret < 0)
> 		return ERR_PTR(ret);
> 	if (WARN_ON(ret == 0))
> 		return ERR_PTR(-EINVAL);
> 	*vma = vma_lookup(mm, addr);
> 	if (WARN_ON(!*vma)) {
> 		/* Drop the reference GUP took before reporting the error. */
> 		put_page(page);
> 		return ERR_PTR(-EINVAL);
> 	}
> 	return page;
> }
>
> It could be its own patch so this change was just a mechanical removal
> of NULL
>
> Jason
>
Agreed, though I think this would work better as a follow-up patch so as
not to distract too much from the core change. I feel like there are quite
a few things we can follow up on, including assessing whether we might be
able to use _fast() paths in places (I haven't assessed this yet).
On Sat, Apr 15, 2023 at 12:27:40AM +0100, Lorenzo Stoakes wrote:
> This flag causes GUP to assert that all VMAs within the input range possess
> the same vma->vm_file. If not, the operation fails.
>
> This is part of a patch series which eliminates the vmas parameter from the
> GUP API, implementing the one remaining assertion within the entire kernel
> that requires access to the VMAs associated with a GUP range.
>
> Signed-off-by: Lorenzo Stoakes <[email protected]>
> ---
> include/linux/mm_types.h | 2 ++
> mm/gup.c | 16 ++++++++++++----
> 2 files changed, 14 insertions(+), 4 deletions(-)
>
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 3fc9e680f174..84d1aec9dbab 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -1185,6 +1185,8 @@ enum {
> FOLL_PCI_P2PDMA = 1 << 10,
> /* allow interrupts from generic signals */
> FOLL_INTERRUPTIBLE = 1 << 11,
> + /* assert that the range spans VMAs with the same vma->vm_file */
> + FOLL_SAME_FILE = 1 << 12,
I hope we don't add this flag, but it needs to be rejected in
internal_get_user_pages_fast()
Jason
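
To illustrate the point about the fast path: GUP-fast walks page tables
without taking mmap_lock or consulting VMAs, so it has no way to compare
vm_file across a range. A flag that depends on VMAs therefore has to be
refused there. The userspace sketch below models this with an allow-list.
The three FOLL_* values are taken from the quoted hunk. MOCK_FOLL_WRITE,
MOCK_FAST_ALLOWED and mock_fast_gup_check_flags() are hypothetical names,
and the mask is not the kernel's actual one.

```c
#include <errno.h>

/* Values taken from the quoted include/linux/mm_types.h hunk. */
#define FOLL_PCI_P2PDMA		(1u << 10)
#define FOLL_INTERRUPTIBLE	(1u << 11)
#define FOLL_SAME_FILE		(1u << 12)

/* Illustrative bit value only. */
#define MOCK_FOLL_WRITE		(1u << 0)

/* Hypothetical allow-list for the model's lockless fast path. */
#define MOCK_FAST_ALLOWED	(MOCK_FOLL_WRITE | FOLL_PCI_P2PDMA)

/*
 * Refuse any flag not explicitly permitted. FOLL_SAME_FILE is absent from
 * the allow-list because the fast path never sees a VMA to compare.
 */
static int mock_fast_gup_check_flags(unsigned int gup_flags)
{
	if (gup_flags & ~MOCK_FAST_ALLOWED)
		return -EINVAL;
	return 0;
}
```

With an allow-list, a newly introduced flag is rejected by default until
someone decides the fast path can honour it.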
On Mon, Apr 17, 2023 at 02:13:39PM +0100, Lorenzo Stoakes wrote:
> On Mon, Apr 17, 2023 at 10:09:36AM -0300, Jason Gunthorpe wrote:
> > On Sat, Apr 15, 2023 at 12:27:31AM +0100, Lorenzo Stoakes wrote:
> > > The only instances of get_user_pages_remote() invocations which used the
> > > vmas parameter were for a single page which can instead simply look up the
> > > VMA directly. In particular:-
> > >
> > > - __update_ref_ctr() looked up the VMA but did nothing with it so we simply
> > > remove it.
> > >
> > > - __access_remote_vm() was already using vma_lookup() when the original
> > > lookup failed so by doing the lookup directly this also de-duplicates the
> > > code.
> > >
> > > This forms part of a broader set of patches intended to eliminate the vmas
> > > parameter altogether.
> > >
> > > Signed-off-by: Lorenzo Stoakes <[email protected]>
> > > ---
> > > arch/arm64/kernel/mte.c | 5 +++--
> > > arch/s390/kvm/interrupt.c | 2 +-
> > > fs/exec.c | 2 +-
> > > include/linux/mm.h | 2 +-
> > > kernel/events/uprobes.c | 10 +++++-----
> > > mm/gup.c | 12 ++++--------
> > > mm/memory.c | 9 +++++----
> > > mm/rmap.c | 2 +-
> > > security/tomoyo/domain.c | 2 +-
> > > virt/kvm/async_pf.c | 3 +--
> > > 10 files changed, 23 insertions(+), 26 deletions(-)
> > >
> > > diff --git a/arch/arm64/kernel/mte.c b/arch/arm64/kernel/mte.c
> > > index f5bcb0dc6267..74d8d4007dec 100644
> > > --- a/arch/arm64/kernel/mte.c
> > > +++ b/arch/arm64/kernel/mte.c
> > > @@ -437,8 +437,9 @@ static int __access_remote_tags(struct mm_struct *mm, unsigned long addr,
> > > struct page *page = NULL;
> > >
> > > ret = get_user_pages_remote(mm, addr, 1, gup_flags, &page,
> > > - &vma, NULL);
> > > - if (ret <= 0)
> > > + NULL);
> > > + vma = vma_lookup(mm, addr);
> > > + if (ret <= 0 || !vma)
> > > break;
> >
> > Given the slightly tricky error handling, it would make sense to turn
> > this pattern into a helper function:
> >
> > page = get_single_user_page_locked(mm, addr, gup_flags, &vma);
> > if (IS_ERR(page))
> > [..]
> >
> > static inline struct page *get_single_user_page_locked(struct mm_struct *mm,
> > 		unsigned long addr, int gup_flags, struct vm_area_struct **vma)
> > {
> > 	struct page *page;
> > 	int ret;
> >
> > 	ret = get_user_pages_remote(mm, addr, 1, gup_flags, &page, NULL, NULL);
> > 	if (ret < 0)
> > 		return ERR_PTR(ret);
> > 	if (WARN_ON(ret == 0))
> > 		return ERR_PTR(-EINVAL);
> > 	*vma = vma_lookup(mm, addr);
> > 	if (WARN_ON(!*vma)) {
> > 		/* Drop the reference GUP took before reporting the error. */
> > 		put_page(page);
> > 		return ERR_PTR(-EINVAL);
> > 	}
> > 	return page;
> > }
> >
> > It could be its own patch so this change was just a mechanical removal
> > of NULL
> >
> > Jason
> >
>
> Agreed, though I think this would work better as a follow-up patch so as
> not to distract too much from the core change.
I don't think you should open code sketchy error handling in several
places and then clean it up later. Just do it right from the start.
Jason
On Mon, Apr 17, 2023 at 09:56:34AM -0300, Jason Gunthorpe wrote:
> On Sat, Apr 15, 2023 at 12:27:45AM +0100, Lorenzo Stoakes wrote:
> > Commit edd478269640 ("io_uring/rsrc: disallow multi-source reg buffers")
> > prevents io_pin_pages() from pinning pages spanning multiple VMAs with
> > permitted characteristics (anon/huge), requiring that all VMAs share the
> > same vm_file.
>
> That commit doesn't really explain why io_uring is doing such a weird
> thing.
>
> What exactly is the problem with mixing struct pages from different
> files and why of all the GUP users does only io_uring need to care
> about this?
>
> If there is no justification then let's revert that commit instead.
>
> > /* don't support file backed memory */
> > - for (i = 0; i < nr_pages; i++) {
> > - if (vmas[i]->vm_file != file) {
> > - ret = -EINVAL;
> > - break;
> > - }
> > - if (!file)
> > - continue;
> > - if (!vma_is_shmem(vmas[i]) && !is_file_hugepages(file)) {
> > - ret = -EOPNOTSUPP;
> > - break;
> > - }
> > - }
> > + file = vma->vm_file;
> > + if (file && !vma_is_shmem(vma) && !is_file_hugepages(file))
> > + ret = -EOPNOTSUPP;
> > +
>
> Also, why is it doing this?
>
> All GUP users don't work entirely right for any fops implementation
> that assumes write protect is unconditionally possible. eg most
> filesystems.
>
> We've been ignoring blocking it because it is an ABI break and it does
> sort of work in some cases.
>
I will leave this to Jens and Pavel to revert on!
> I'd rather see something like FOLL_ALLOW_BROKEN_FILE_MAPPINGS than
> io_uring open coding this kind of stuff.
>
How would the semantics of this work? What is broken? It is a little
frustrating that we have FOLL_ANON but hugetlb as an outlying case, adding
FOLL_ANON_OR_HUGETLB was another consideration...
> Jason
On Mon, Apr 17, 2023 at 10:16:28AM -0300, Jason Gunthorpe wrote:
> On Mon, Apr 17, 2023 at 02:13:39PM +0100, Lorenzo Stoakes wrote:
> > On Mon, Apr 17, 2023 at 10:09:36AM -0300, Jason Gunthorpe wrote:
> > > On Sat, Apr 15, 2023 at 12:27:31AM +0100, Lorenzo Stoakes wrote:
> > > > The only instances of get_user_pages_remote() invocations which used the
> > > > vmas parameter were for a single page which can instead simply look up the
> > > > VMA directly. In particular:-
> > > >
> > > > - __update_ref_ctr() looked up the VMA but did nothing with it so we simply
> > > > remove it.
> > > >
> > > > - __access_remote_vm() was already using vma_lookup() when the original
> > > > lookup failed so by doing the lookup directly this also de-duplicates the
> > > > code.
> > > >
> > > > This forms part of a broader set of patches intended to eliminate the vmas
> > > > parameter altogether.
> > > >
> > > > Signed-off-by: Lorenzo Stoakes <[email protected]>
> > > > ---
> > > > arch/arm64/kernel/mte.c | 5 +++--
> > > > arch/s390/kvm/interrupt.c | 2 +-
> > > > fs/exec.c | 2 +-
> > > > include/linux/mm.h | 2 +-
> > > > kernel/events/uprobes.c | 10 +++++-----
> > > > mm/gup.c | 12 ++++--------
> > > > mm/memory.c | 9 +++++----
> > > > mm/rmap.c | 2 +-
> > > > security/tomoyo/domain.c | 2 +-
> > > > virt/kvm/async_pf.c | 3 +--
> > > > 10 files changed, 23 insertions(+), 26 deletions(-)
> > > >
> > > > diff --git a/arch/arm64/kernel/mte.c b/arch/arm64/kernel/mte.c
> > > > index f5bcb0dc6267..74d8d4007dec 100644
> > > > --- a/arch/arm64/kernel/mte.c
> > > > +++ b/arch/arm64/kernel/mte.c
> > > > @@ -437,8 +437,9 @@ static int __access_remote_tags(struct mm_struct *mm, unsigned long addr,
> > > > struct page *page = NULL;
> > > >
> > > > ret = get_user_pages_remote(mm, addr, 1, gup_flags, &page,
> > > > - &vma, NULL);
> > > > - if (ret <= 0)
> > > > + NULL);
> > > > + vma = vma_lookup(mm, addr);
> > > > + if (ret <= 0 || !vma)
> > > > break;
> > >
> > > Given the slightly tricky error handling, it would make sense to turn
> > > this pattern into a helper function:
> > >
> > > page = get_single_user_page_locked(mm, addr, gup_flags, &vma);
> > > if (IS_ERR(page))
> > > [..]
> > >
> > > static inline struct page *get_single_user_page_locked(struct mm_struct *mm,
> > > unsigned long addr, int gup_flags, struct vm_area_struct **vma)
> > > {
> > > struct page *page;
> > > int ret;
> > >
> > > ret = get_user_pages_remote(mm, addr, 1, gup_flags, &page, NULL, NULL);
> > > if (ret < 0)
> > > return ERR_PTR(ret);
> > > if (WARN_ON(ret == 0))
> > > return ERR_PTR(-EINVAL);
> > > *vma = vma_lookup(mm, addr);
> > > if (WARN_ON(!*vma)) {
> > > put_page(page);
> > > return ERR_PTR(-EINVAL);
> > > }
> > > return page;
> > > }
> > >
> > > It could be its own patch so this change was just a mechanical removal
> > > of NULL
> > >
> > > Jason
> > >
> >
> > Agreed, I think this would work better as a follow up patch however so as
> > not to distract too much from the core change.
>
> I don't think you should open code sketchy error handling in several
> places and then clean it up later. Just do it right from the start.
>
Intent was to do smallest change possible (though through review that grew
of course), but I see your point, in this instance this is fiddly stuff and
probably better to abstract it to enforce correct handling.
I'll respin + add something like this.
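As a rough illustration of the error handling such a helper would centralise, here is a userspace model in C. It is only a sketch and not kernel code: ERR_PTR(), IS_ERR() and PTR_ERR() are reimplemented locally, struct page and struct vma are toy types, and single_page_result() is a hypothetical name that takes the GUP return value and the VMA lookup result as plain arguments instead of calling into the kernel.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Userspace stand-ins for the kernel's ERR_PTR()/PTR_ERR()/IS_ERR(). */
#define MAX_ERRNO 4095

static inline void *ERR_PTR(long err)
{
	return (void *)(intptr_t)err;
}

static inline long PTR_ERR(const void *ptr)
{
	return (long)(intptr_t)ptr;
}

static inline int IS_ERR(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

/* Toy types: refs models the reference taken by a successful GUP. */
struct page { int refs; };
struct vma { unsigned long start, end; };

/*
 * Hypothetical model of the helper's control flow.
 * gup_ret: what the single-page GUP call returned.
 * found:   what the VMA lookup returned (NULL if none).
 */
static struct page *single_page_result(long gup_ret, struct page *page,
				       struct vma *found, struct vma **vma)
{
	if (gup_ret < 0)
		return ERR_PTR(gup_ret);	/* propagate the GUP error */
	if (gup_ret == 0)
		return ERR_PTR(-EINVAL);	/* no page and no error: unexpected */

	*vma = found;
	if (!*vma) {
		page->refs--;			/* drop the reference GUP took */
		return ERR_PTR(-EINVAL);
	}
	return page;
}
```

A caller then only needs a single IS_ERR() check, and the release of the page on the missing-VMA path cannot be forgotten at individual call sites.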
> Jason
On Mon, Apr 17, 2023 at 10:14:02AM -0300, Jason Gunthorpe wrote:
> On Sat, Apr 15, 2023 at 12:27:40AM +0100, Lorenzo Stoakes wrote:
> > This flag causes GUP to assert that all VMAs within the input range possess
> > the same vma->vm_file. If not, the operation fails.
> >
> > This is part of a patch series which eliminates the vmas parameter from the
> > GUP API, implementing the one remaining assertion within the entire kernel
> > that requires access to the VMAs associated with a GUP range.
> >
> > Signed-off-by: Lorenzo Stoakes <[email protected]>
> > ---
> > include/linux/mm_types.h | 2 ++
> > mm/gup.c | 16 ++++++++++++----
> > 2 files changed, 14 insertions(+), 4 deletions(-)
> >
> > diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> > index 3fc9e680f174..84d1aec9dbab 100644
> > --- a/include/linux/mm_types.h
> > +++ b/include/linux/mm_types.h
> > @@ -1185,6 +1185,8 @@ enum {
> > FOLL_PCI_P2PDMA = 1 << 10,
> > /* allow interrupts from generic signals */
> > FOLL_INTERRUPTIBLE = 1 << 11,
> > + /* assert that the range spans VMAs with the same vma->vm_file */
> > + FOLL_SAME_FILE = 1 << 12,
>
> I hope we don't add this flag, but it needs to be rejected in
> internal_get_user_pages_fast()
>
internal_get_user_pages_fast() checks against the complement of accepted
masks, therefore it will reject this as-is unless I'm missing something.
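The "complement of accepted masks" pattern can be sketched in userspace C as follows. The bit values of FOLL_PCI_P2PDMA, FOLL_INTERRUPTIBLE and FOLL_SAME_FILE are the ones in the quoted diff; the other bit values, the contents of FAST_GUP_ACCEPTED and the function name fast_gup_check_flags() are assumptions made for illustration and are not the kernel's actual definitions.

```c
#include <assert.h>
#include <errno.h>

enum {
	FOLL_WRITE         = 1 << 0,	/* illustrative bit value */
	FOLL_GET           = 1 << 1,	/* illustrative bit value */
	FOLL_PCI_P2PDMA    = 1 << 10,	/* as in the quoted diff */
	FOLL_INTERRUPTIBLE = 1 << 11,	/* as in the quoted diff */
	FOLL_SAME_FILE     = 1 << 12,	/* as in the quoted diff */
};

/* Assumed allow-list for the fast path; the real set differs. */
#define FAST_GUP_ACCEPTED (FOLL_WRITE | FOLL_GET | FOLL_PCI_P2PDMA)

/*
 * Reject any flag that is not explicitly accepted. A newly added
 * flag is therefore refused without touching this check.
 */
static int fast_gup_check_flags(unsigned int gup_flags)
{
	if (gup_flags & ~(unsigned int)FAST_GUP_ACCEPTED)
		return -EINVAL;
	return 0;
}
```

Because the check is an allow-list, a new flag such as FOLL_SAME_FILE is rejected by default until someone deliberately adds it to the accepted set.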
As for not adding the flag (an entirely understandable sentiment), it would
be good to get an insight into the semantics of what you feel would be more
suitable!
> Jason
On Mon, Apr 17, 2023 at 02:19:16PM +0100, Lorenzo Stoakes wrote:
> > I'd rather see something like FOLL_ALLOW_BROKEN_FILE_MAPPINGS than
> > io_uring open coding this kind of stuff.
> >
>
> How would the semantics of this work? What is broken? It is a little
> frustrating that we have FOLL_ANON but hugetlb as an outlying case, adding
> FOLL_ANON_OR_HUGETLB was another consideration...
It says "historically this user has accepted file backed pages and we
think there may actually be users doing that, so don't break the
uABI"
Without the flag GUP would refuse to return file backed pages that can
trigger kernel crashes or data corruption.
Eg we'd want most places to not specify the flag and the few that do
to have some justification.
We should consider removing FOLL_ANON, I'm not sure it really makes
sense these days for what proc is doing with it. All that proc stuff
could likely be turned into a kthread_use_mm() and a simple
copy_to/from user?
I suspect that eliminates the need to check for FOLL_ANON?
Jason
On Mon, Apr 17, 2023 at 02:25:54PM +0100, Lorenzo Stoakes wrote:
> On Mon, Apr 17, 2023 at 10:14:02AM -0300, Jason Gunthorpe wrote:
> > On Sat, Apr 15, 2023 at 12:27:40AM +0100, Lorenzo Stoakes wrote:
> > > This flag causes GUP to assert that all VMAs within the input range possess
> > > the same vma->vm_file. If not, the operation fails.
> > >
> > > This is part of a patch series which eliminates the vmas parameter from the
> > > GUP API, implementing the one remaining assertion within the entire kernel
> > > that requires access to the VMAs associated with a GUP range.
> > >
> > > Signed-off-by: Lorenzo Stoakes <[email protected]>
> > > ---
> > > include/linux/mm_types.h | 2 ++
> > > mm/gup.c | 16 ++++++++++++----
> > > 2 files changed, 14 insertions(+), 4 deletions(-)
> > >
> > > diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> > > index 3fc9e680f174..84d1aec9dbab 100644
> > > --- a/include/linux/mm_types.h
> > > +++ b/include/linux/mm_types.h
> > > @@ -1185,6 +1185,8 @@ enum {
> > > FOLL_PCI_P2PDMA = 1 << 10,
> > > /* allow interrupts from generic signals */
> > > FOLL_INTERRUPTIBLE = 1 << 11,
> > > + /* assert that the range spans VMAs with the same vma->vm_file */
> > > + FOLL_SAME_FILE = 1 << 12,
> >
> > I hope we don't add this flag, but it needs to be rejected in
> > internal_get_user_pages_fast()
> >
>
> internal_get_user_pages_fast() checks against the complement of accepted
> masks, therefore it will reject this as-is unless I'm missing
> something.
Hmm, yes, that looks OK
> As for not adding the flag (an entirely understandable sentiment), it would
> be good to get an insight into the semantics of what you feel would be more
> suitable!
I'm hoping there is not a solid justification for why io_uring is
doing this check so we just delete it.
Jason
On Mon, Apr 17, 2023 at 10:26:09AM -0300, Jason Gunthorpe wrote:
> On Mon, Apr 17, 2023 at 02:19:16PM +0100, Lorenzo Stoakes wrote:
>
> > > I'd rather see something like FOLL_ALLOW_BROKEN_FILE_MAPPINGS than
> > > io_uring open coding this kind of stuff.
> > >
> >
> > How would the semantics of this work? What is broken? It is a little
> > frustrating that we have FOLL_ANON but hugetlb as an outlying case, adding
> > FOLL_ANON_OR_HUGETLB was another consideration...
>
> It says "historically this user has accepted file backed pages and we
> think there may actually be users doing that, so don't break the
> uABI"
Having written a bunch here I suddenly realised that you probably mean for
this flag to NOT be applied to the io_uring code and thus have it enforce
the 'anonymous or hugetlb' check by default?
>
> Without the flag GUP would refuse to return file backed pages that can
> trigger kernel crashes or data corruption.
>
> Eg we'd want most places to not specify the flag and the few that do
> to have some justification.
>
So you mean to disallow file-backed page pinning as a whole unless this
flag is specified? For FOLL_GET I can see that access to the underlying
data is dangerous as the memory may get reclaimed or migrated, but surely
DMA-pinned memory (as is the case here) is safe?
Or is this a product more so of some kernel process accessing file-backed
pages for a file system which expects write-notify semantics and doesn't
get them in this case, which could indeed be horribly broken.
In which case yes this seems sensible.
> We should consider removing FOLL_ANON, I'm not sure it really makes
> sense these days for what proc is doing with it. All that proc stuff
> could likely be turned into a kthread_use_mm() and a simple
> copy_to/from user?
>
> I suspect that eliminates the need to check for FOLL_ANON?
>
> Jason
I am definitely in favour of cutting things down if possible, and very much
prefer the use of uaccess if we are able to do so rather than GUP.
I do feel that GUP should be focused purely on pinning memory rather than
manipulating it (whether read or write) so I agree with this sentiment.
On Mon, Apr 17, 2023 at 03:00:16PM +0100, Lorenzo Stoakes wrote:
> On Mon, Apr 17, 2023 at 10:26:09AM -0300, Jason Gunthorpe wrote:
> > On Mon, Apr 17, 2023 at 02:19:16PM +0100, Lorenzo Stoakes wrote:
> >
> > > > I'd rather see something like FOLL_ALLOW_BROKEN_FILE_MAPPINGS than
> > > > io_uring open coding this kind of stuff.
> > > >
> > >
> > > How would the semantics of this work? What is broken? It is a little
> > > frustrating that we have FOLL_ANON but hugetlb as an outlying case, adding
> > > FOLL_ANON_OR_HUGETLB was another consideration...
> >
> > It says "historically this user has accepted file backed pages and we
> > think there may actually be users doing that, so don't break the
> > uABI"
>
> Having written a bunch here I suddenly realised that you probably mean for
> this flag to NOT be applied to the io_uring code and thus have it enforce
> the 'anonymous or hugetlb' check by default?
Yes
> So you mean to disallow file-backed page pinning as a whole unless this
> flag is specified?
Yes
> For FOLL_GET I can see that access to the underlying
> data is dangerous as the memory may get reclaimed or migrated, but surely
> DMA-pinned memory (as is the case here) is safe?
No, it is all broken, read-only access is safe.
We are trying to get to a point where pin access will interact properly
with the filesystem, but it isn't done yet.
> Or is this a product more so of some kernel process accessing file-backed
> pages for a file system which expects write-notify semantics and doesn't
> get them in this case, which could indeed be horribly broken.
Yes, broadly
> I am definitely in favour of cutting things down if possible, and very much
> prefer the use of uaccess if we are able to do so rather than GUP.
>
> I do feel that GUP should be focused purely on pinning memory rather than
> manipulating it (whether read or write) so I agree with this sentiment.
Yes, someone needs to be brave enough to go and try to adjust these
old places :)
I see in the git history this was added to solve CVE-2018-1120 - eg
FUSE can hold off fault-in indefinitely. So the flag is really badly
misnamed - it is "FOLL_DONT_BLOCK_ON_USERSPACE" and anon memory is a
simple, but overly narrow, way to get that property.
If it is changed to use kthread_use_mm() it needs a VMA based check
for the same idea.
Jason
On Mon, Apr 17, 2023 at 10:07:53AM -0500, Eric W. Biederman wrote:
> Lorenzo Stoakes <[email protected]> writes:
>
> > On Mon, Apr 17, 2023 at 10:16:28AM -0300, Jason Gunthorpe wrote:
> >> On Mon, Apr 17, 2023 at 02:13:39PM +0100, Lorenzo Stoakes wrote:
> >> > On Mon, Apr 17, 2023 at 10:09:36AM -0300, Jason Gunthorpe wrote:
> >> > > On Sat, Apr 15, 2023 at 12:27:31AM +0100, Lorenzo Stoakes wrote:
> >> > > > The only instances of get_user_pages_remote() invocations which used the
> >> > > > vmas parameter were for a single page which can instead simply look up the
> >> > > > VMA directly. In particular:-
> >> > > >
> >> > > > - __update_ref_ctr() looked up the VMA but did nothing with it so we simply
> >> > > > remove it.
> >> > > >
> >> > > > - __access_remote_vm() was already using vma_lookup() when the original
> >> > > > lookup failed so by doing the lookup directly this also de-duplicates the
> >> > > > code.
> >> > > >
> >> > > > This forms part of a broader set of patches intended to eliminate the vmas
> >> > > > parameter altogether.
> >> > > >
> >> > > > Signed-off-by: Lorenzo Stoakes <[email protected]>
> >> > > > ---
> >> > > > arch/arm64/kernel/mte.c | 5 +++--
> >> > > > arch/s390/kvm/interrupt.c | 2 +-
> >> > > > fs/exec.c | 2 +-
> >> > > > include/linux/mm.h | 2 +-
> >> > > > kernel/events/uprobes.c | 10 +++++-----
> >> > > > mm/gup.c | 12 ++++--------
> >> > > > mm/memory.c | 9 +++++----
> >> > > > mm/rmap.c | 2 +-
> >> > > > security/tomoyo/domain.c | 2 +-
> >> > > > virt/kvm/async_pf.c | 3 +--
> >> > > > 10 files changed, 23 insertions(+), 26 deletions(-)
> >> > > >
> >> > > > diff --git a/arch/arm64/kernel/mte.c b/arch/arm64/kernel/mte.c
> >> > > > index f5bcb0dc6267..74d8d4007dec 100644
> >> > > > --- a/arch/arm64/kernel/mte.c
> >> > > > +++ b/arch/arm64/kernel/mte.c
> >> > > > @@ -437,8 +437,9 @@ static int __access_remote_tags(struct mm_struct *mm, unsigned long addr,
> >> > > > struct page *page = NULL;
> >> > > >
> >> > > > ret = get_user_pages_remote(mm, addr, 1, gup_flags, &page,
> >> > > > - &vma, NULL);
> >> > > > - if (ret <= 0)
> >> > > > + NULL);
> >> > > > + vma = vma_lookup(mm, addr);
> >> > > > + if (ret <= 0 || !vma)
> >> > > > break;
> >> > >
> >> > > Given the slightly tricky error handling, it would make sense to turn
> >> > > this pattern into a helper function:
> >> > >
> >> > > page = get_single_user_page_locked(mm, addr, gup_flags, &vma);
> >> > > if (IS_ERR(page))
> >> > > [..]
> >> > >
> >> > > static inline struct page *get_single_user_page_locked(struct mm_struct *mm,
> >> > > unsigned long addr, int gup_flags, struct vm_area_struct **vma)
> >> > > {
> >> > > struct page *page;
> >> > > int ret;
> >> > >
> >> > > ret = get_user_pages_remote(mm, addr, 1, gup_flags, &page, NULL, NULL);
> >> > > if (ret < 0)
> >> > > return ERR_PTR(ret);
> >> > > if (WARN_ON(ret == 0))
> >> > > return ERR_PTR(-EINVAL);
> >> > > *vma = vma_lookup(mm, addr);
> >> > > if (WARN_ON(!*vma)) {
> >> > > put_page(page);
> >> > > return ERR_PTR(-EINVAL);
> >> > > }
> >> > > return page;
> >> > > }
> >> > >
> >> > > It could be its own patch so this change was just a mechanical removal
> >> > > of NULL
> >> > >
> >> > > Jason
> >> > >
> >> >
> >> > Agreed, I think this would work better as a follow up patch however so as
> >> > not to distract too much from the core change.
> >>
> >> I don't think you should open code sketchy error handling in several
> >> places and then clean it up later. Just do it right from the start.
> >>
> >
> > Intent was to do smallest change possible (though through review that grew
> > of course), but I see your point, in this instance this is fiddly stuff and
> > probably better to abstract it to enforce correct handling.
> >
> > I'll respin + add something like this.
>
> Could you include in your description why looking up the vma after
> getting the page does not introduce a race?
>
> I am probably silly and just looking at this quickly but it does not
> seem immediately obvious why the vma and the page should match.
>
> I would not be surprised if you hold the appropriate mutex over the
> entire operation but it just isn't apparent from the diff.
>
> I am concerned because it is an easy mistake to refactor something into
> two steps and then discover you have introduced a race.
>
> Eric
>
The mmap_lock is held so VMAs cannot be altered and no such race can
occur. get_user_pages_remote() requires that the user calls it under the
lock so it is implicitly assured that this cannot happen.
I'll update the description to make this clear on the next spin!
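To make the locking argument concrete, here is a toy userspace model in C. Everything in it is hypothetical: struct toy_mm, the lock_depth counter standing in for mmap_lock, and the toy_* functions do not exist in the kernel. The only property borrowed from the real API is the contract of vma_lookup(), which returns the VMA containing the address or NULL. The model shows both steps running inside a single lock hold, and it copies what it needs out of the VMA before unlocking rather than returning the VMA pointer.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct toy_vma { unsigned long vm_start, vm_end; };

struct toy_mm {
	int lock_depth;			/* > 0 models "mmap_lock is held" */
	struct toy_vma vmas[2];
	size_t nr_vmas;
};

static void toy_mmap_read_lock(struct toy_mm *mm)   { mm->lock_depth++; }
static void toy_mmap_read_unlock(struct toy_mm *mm) { mm->lock_depth--; }

/* Same contract as vma_lookup(): the VMA containing addr, or NULL. */
static struct toy_vma *toy_vma_lookup(struct toy_mm *mm, unsigned long addr)
{
	assert(mm->lock_depth > 0);	/* caller must hold the lock */
	for (size_t i = 0; i < mm->nr_vmas; i++)
		if (addr >= mm->vmas[i].vm_start && addr < mm->vmas[i].vm_end)
			return &mm->vmas[i];
	return NULL;
}

/* Models a one-page GUP: returns 1 if addr is mapped, else -EFAULT. */
static int toy_get_one_page(struct toy_mm *mm, unsigned long addr)
{
	assert(mm->lock_depth > 0);	/* caller must hold the lock */
	return toy_vma_lookup(mm, addr) ? 1 : -EFAULT;
}

/*
 * Page first, VMA second, both under one lock hold, so the layout
 * cannot change between the two steps. The VMA is only dereferenced
 * while the lock is held.
 */
static int toy_page_then_vma(struct toy_mm *mm, unsigned long addr,
			     unsigned long *vm_start)
{
	struct toy_vma *vma;
	int ret;

	toy_mmap_read_lock(mm);
	ret = toy_get_one_page(mm, addr);
	if (ret == 1) {
		vma = toy_vma_lookup(mm, addr);
		*vm_start = vma->vm_start;
		ret = 0;
	}
	toy_mmap_read_unlock(mm);
	return ret;
}
```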
(side-note: here is another irritating issue with GUP, we don't suffix with
_locked() consistently)
On Mon, Apr 17, 2023 at 11:15:10AM -0300, Jason Gunthorpe wrote:
> On Mon, Apr 17, 2023 at 03:00:16PM +0100, Lorenzo Stoakes wrote:
> > On Mon, Apr 17, 2023 at 10:26:09AM -0300, Jason Gunthorpe wrote:
> > > On Mon, Apr 17, 2023 at 02:19:16PM +0100, Lorenzo Stoakes wrote:
> > >
> > > > > I'd rather see something like FOLL_ALLOW_BROKEN_FILE_MAPPINGS than
> > > > > io_uring open coding this kind of stuff.
> > > > >
> > > >
> > > > How would the semantics of this work? What is broken? It is a little
> > > > frustrating that we have FOLL_ANON but hugetlb as an outlying case, adding
> > > > FOLL_ANON_OR_HUGETLB was another consideration...
> > >
> > > It says "historically this user has accepted file backed pages and we
> > > think there may actually be users doing that, so don't break the
> > > uABI"
> >
> > Having written a bunch here I suddenly realised that you probably mean for
> > this flag to NOT be applied to the io_uring code and thus have it enforce
> > the 'anonymous or hugetlb' check by default?
>
> Yes
>
> > So you mean to disallow file-backed page pinning as a whole unless this
> > flag is specified?
>
> Yes
>
> > For FOLL_GET I can see that access to the underlying
> > data is dangerous as the memory may get reclaimed or migrated, but surely
> > DMA-pinned memory (as is the case here) is safe?
>
> No, it is all broken, read-only access is safe.
>
> We are trying to get to a point where pin access will interact properly
> with the filesystem, but it isn't done yet.
>
> > Or is this a product more so of some kernel process accessing file-backed
> > pages for a file system which expects write-notify semantics and doesn't
> > get them in this case, which could indeed be horribly broken.
>
> Yes, broadly
>
> > I am definitely in favour of cutting things down if possible, and very much
> > prefer the use of uaccess if we are able to do so rather than GUP.
> >
> > I do feel that GUP should be focused purely on pinning memory rather than
> > manipulating it (whether read or write) so I agree with this sentiment.
>
> Yes, someone needs to be brave enough to go and try to adjust these
> old places :)
Well, I like to think of myself as stupid^W brave enough to do such things
so may try a separate patch series on that :)
>
> I see in the git history this was added to solve CVE-2018-1120 - eg
> FUSE can hold off fault-in indefinitely. So the flag is really badly
> misnamed - it is "FOLL_DONT_BLOCK_ON_USERSPACE" and anon memory is a
> simple, but overly narrow, way to get that property.
>
> If it is changed to use kthread_use_mm() it needs a VMA based check
> for the same idea.
>
> Jason
I'll try my hand at patching this also!
As for FOLL_ALLOW_BROKEN_FILE_MAPPINGS, I do really like this idea, and
think it is actually probably quite important we do it, however this feels
a bit out of scope for this patch series.
I think perhaps the way forward is, if Jens and Pavel don't have any issue
with it, we open code the check and drop FOLL_SAME_FILE for this series,
then introduce it in a separate one + replace the open coding there?
I am eager to try to keep this focused on the specific task of dropping the
vmas parameter as I think FOLL_ALLOW_BROKEN_FILE_MAPPINGS is likely to
garner some discussion which should be kept separate.
Lorenzo Stoakes <[email protected]> writes:
> On Mon, Apr 17, 2023 at 10:16:28AM -0300, Jason Gunthorpe wrote:
>> On Mon, Apr 17, 2023 at 02:13:39PM +0100, Lorenzo Stoakes wrote:
>> > On Mon, Apr 17, 2023 at 10:09:36AM -0300, Jason Gunthorpe wrote:
>> > > On Sat, Apr 15, 2023 at 12:27:31AM +0100, Lorenzo Stoakes wrote:
>> > > > The only instances of get_user_pages_remote() invocations which used the
>> > > > vmas parameter were for a single page which can instead simply look up the
>> > > > VMA directly. In particular:-
>> > > >
>> > > > - __update_ref_ctr() looked up the VMA but did nothing with it so we simply
>> > > > remove it.
>> > > >
>> > > > - __access_remote_vm() was already using vma_lookup() when the original
>> > > > lookup failed so by doing the lookup directly this also de-duplicates the
>> > > > code.
>> > > >
>> > > > This forms part of a broader set of patches intended to eliminate the vmas
>> > > > parameter altogether.
>> > > >
>> > > > Signed-off-by: Lorenzo Stoakes <[email protected]>
>> > > > ---
>> > > > arch/arm64/kernel/mte.c | 5 +++--
>> > > > arch/s390/kvm/interrupt.c | 2 +-
>> > > > fs/exec.c | 2 +-
>> > > > include/linux/mm.h | 2 +-
>> > > > kernel/events/uprobes.c | 10 +++++-----
>> > > > mm/gup.c | 12 ++++--------
>> > > > mm/memory.c | 9 +++++----
>> > > > mm/rmap.c | 2 +-
>> > > > security/tomoyo/domain.c | 2 +-
>> > > > virt/kvm/async_pf.c | 3 +--
>> > > > 10 files changed, 23 insertions(+), 26 deletions(-)
>> > > >
>> > > > diff --git a/arch/arm64/kernel/mte.c b/arch/arm64/kernel/mte.c
>> > > > index f5bcb0dc6267..74d8d4007dec 100644
>> > > > --- a/arch/arm64/kernel/mte.c
>> > > > +++ b/arch/arm64/kernel/mte.c
>> > > > @@ -437,8 +437,9 @@ static int __access_remote_tags(struct mm_struct *mm, unsigned long addr,
>> > > > struct page *page = NULL;
>> > > >
>> > > > ret = get_user_pages_remote(mm, addr, 1, gup_flags, &page,
>> > > > - &vma, NULL);
>> > > > - if (ret <= 0)
>> > > > + NULL);
>> > > > + vma = vma_lookup(mm, addr);
>> > > > + if (ret <= 0 || !vma)
>> > > > break;
>> > >
>> > > Given the slightly tricky error handling, it would make sense to turn
>> > > this pattern into a helper function:
>> > >
>> > > page = get_single_user_page_locked(mm, addr, gup_flags, &vma);
>> > > if (IS_ERR(page))
>> > > [..]
>> > >
>> > > static inline struct page *get_single_user_page_locked(struct mm_struct *mm,
>> > > unsigned long addr, int gup_flags, struct vm_area_struct **vma)
>> > > {
>> > > struct page *page;
>> > > int ret;
>> > >
>> > > ret = get_user_pages_remote(mm, addr, 1, gup_flags, &page, NULL, NULL);
>> > > if (ret < 0)
>> > > return ERR_PTR(ret);
>> > > if (WARN_ON(ret == 0))
>> > > return ERR_PTR(-EINVAL);
>> > > *vma = vma_lookup(mm, addr);
>> > > if (WARN_ON(!*vma)) {
>> > > put_page(page);
>> > > return ERR_PTR(-EINVAL);
>> > > }
>> > > return page;
>> > > }
>> > >
>> > > It could be its own patch so this change was just a mechanical removal
>> > > of NULL
>> > >
>> > > Jason
>> > >
>> >
>> > Agreed, I think this would work better as a follow up patch however so as
>> > not to distract too much from the core change.
>>
>> I don't think you should open code sketchy error handling in several
>> places and then clean it up later. Just do it right from the start.
>>
>
> Intent was to do smallest change possible (though through review that grew
> of course), but I see your point, in this instance this is fiddly stuff and
> probably better to abstract it to enforce correct handling.
>
> I'll respin + add something like this.
Could you include in your description why looking up the vma after
getting the page does not introduce a race?
I am probably silly and just looking at this quickly but it does not
seem immediately obvious why the vma and the page should match.
I would not be surprised if you hold the appropriate mutex over the
entire operation but it just isn't apparent from the diff.
I am concerned because it is an easy mistake to refactor something into
two steps and then discover you have introduced a race.
Eric
On Mon, Apr 17, 2023 at 10:26:09AM -0300, Jason Gunthorpe wrote:
> On Mon, Apr 17, 2023 at 02:19:16PM +0100, Lorenzo Stoakes wrote:
>
> > > I'd rather see something like FOLL_ALLOW_BROKEN_FILE_MAPPINGS than
> > > io_uring open coding this kind of stuff.
> > >
> >
> > How would the semantics of this work? What is broken? It is a little
> > frustrating that we have FOLL_ANON but hugetlb as an outlying case, adding
> > FOLL_ANON_OR_HUGETLB was another consideration...
>
> It says "historically this user has accepted file backed pages and we
> think there may actually be users doing that, so don't break the
> uABI"
>
> Without the flag GUP would refuse to return file backed pages that can
> trigger kernel crashes or data corruption.
>
> Eg we'd want most places to not specify the flag and the few that do
> to have some justification.
>
> We should consider removing FOLL_ANON, I'm not sure it really makes
> sense these days for what proc is doing with it. All that proc stuff
> could likely be turned into a kthread_use_mm() and a simple
> copy_to/from user?
>
> I suspect that eliminates the need to check for FOLL_ANON?
>
> Jason
The proc invocations utilising FOLL_ANON are get_mm_proctitle(),
get_mm_cmdline() and environ_read(), which each pass it to
access_remote_vm() and which are called from a process context,
i.e. with tsk->mm != NULL, but kthread_use_mm() explicitly disallows the
(slightly mind boggling) idea of swapping out an established mm.
So I don't think this route is plausible unless you were thinking of
somehow offloading to a thread?
In any case, if we institute the FOLL_ALLOW_BROKEN_FILE_MAPPINGS flag we
can just drop FOLL_ANON altogether right, as this will be implied and
hugetlb should work here too?
Separately, I find the semantics of access_remote_vm() kind of weird, and
with a possible mmap_lock-free future it does make me wonder whether
something better could be done there.
(Section where I sound like I might be going mad) Perhaps having some means
of context switching into the kernel portion of the remote process as if
it were a system call or soft interrupt handler and having that actually do
the uaccess operation could be useful here?
I'm guessing nothing like that exists yet?
On Mon, Apr 17, 2023 at 08:00:48PM +0100, Lorenzo Stoakes wrote:
> So I don't think this route is plausible unless you were thinking of
> somehow offloading to a thread?
ah, fair enough
> In any case, if we institute the FOLL_ALLOW_BROKEN_FILE_MAPPINGS flag we
> can just drop FOLL_ANON altogether right, as this will be implied and
> hugetlb should work here too?
Well.. no, as I said read-only access to the pages works fine, so GUP
should not block that. It is only write that has issues
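Putting the pieces of this sub-thread together, the proposed semantics can be sketched as a small decision function in userspace C. This is an interpretation and not an agreed design: FOLL_ALLOW_BROKEN_FILE_MAPPINGS does not exist, its bit value and the -EOPNOTSUPP return code are assumptions, and needs_write_notify is a hypothetical stand-in for "file-backed mapping whose filesystem relies on write-protect".

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

enum {
	FOLL_WRITE                      = 1 << 0,	/* illustrative bit value */
	FOLL_ALLOW_BROKEN_FILE_MAPPINGS = 1 << 30,	/* hypothetical flag */
};

/*
 * needs_write_notify: true for a file-backed mapping whose filesystem
 * expects to write-protect pages (assumed false for anon memory).
 */
static int gup_file_mapping_policy(bool needs_write_notify,
				   unsigned int gup_flags)
{
	if (!needs_write_notify)
		return 0;		/* nothing to break */
	if (!(gup_flags & FOLL_WRITE))
		return 0;		/* read-only access is safe */
	if (gup_flags & FOLL_ALLOW_BROKEN_FILE_MAPPINGS)
		return 0;		/* caller opted in to legacy behaviour */
	return -EOPNOTSUPP;		/* assumed error code */
}
```

Under this reading most callers would pass no extra flag and get the safe default, while the few that need to keep accepting writable file-backed pages for uABI reasons would opt in explicitly.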
> Separately, I find the semantics of access_remote_vm() kind of weird, and
> with a possible mmap_lock-free future it does make me wonder whether
> something better could be done there.
Yes, it is very weird, kthread_use_mm is much nicer.
> (Section where I sound like I might be going mad) Perhaps having some means
> of context switching into the kernel portion of the remote process as if
> it were a system call or soft interrupt handler and having that actually do
> the uaccess operation could be useful here?
This is the kthread_use_mm() approach, that is basically what it
does. You are suggesting to extend it to kthreads that already have a
process attached...
access_remote_vm is basically copy_to/from_user built using kmap and
GUP.
Even a simple step of localizing FOLL_ANON to __access_remote_vm,
since it must have the VMA anyhow, would be an improvement.
Jason
On Mon, Apr 17, 2023 at 04:24:04PM -0300, Jason Gunthorpe wrote:
> On Mon, Apr 17, 2023 at 08:00:48PM +0100, Lorenzo Stoakes wrote:
>
> > So I don't think this route is plausible unless you were thinking of
> > somehow offloading to a thread?
>
> ah, fair enough
>
> > In any case, if we institute the FOLL_ALLOW_BROKEN_FILE_MAPPINGS flag we
> > can just drop FOLL_ANON altogether right, as this will be implied and
> > hugetlb should work here too?
>
> Well.. no, as I said read-only access to the pages works fine, so GUP
> should not block that. It is only write that has issues
>
> > Separately, I find the semantics of access_remote_vm() kind of weird, and
> > with a possible mmap_lock-free future it does make me wonder whether
> > something better could be done there.
>
> Yes, it is very weird, kthread_use_mm is much nicer.
>
> > (Section where I sound like I might be going mad) Perhaps having some means
> > of context switching into the kernel portion of the remote process as if
> > it were a system call or soft interrupt handler and having that actually do
> > the uaccess operation could be useful here?
>
> This is the kthread_use_mm() approach, that is basically what it
> does. You are suggesting to extend it to kthreads that already have a
> process attached...
Yeah, I wonder how plausible this is as we could in theory simply eliminate
these remote cases altogether which could be relatively efficient if we
could find a way to batch up operations.
>
> access_remote_vm is basically copy_to/from_user built using kmap and
> GUP.
>
> Even a simple step of localizing FOLL_ANON to __access_remote_vm,
> since it must have the VMA anyhow, would be an improvement.
This is used from places where this flag might not be set though,
e.g. access_process_vm() and ptrace.
However, access_remote_vm() is only used by the proc stuff, so I will spin
up a patch to move this function and treat it as a helper which sets
FOLL_ANON.
>
> Jason
On 4/17/23 13:56, Jason Gunthorpe wrote:
> On Sat, Apr 15, 2023 at 12:27:45AM +0100, Lorenzo Stoakes wrote:
>> Commit edd478269640 ("io_uring/rsrc: disallow multi-source reg buffers")
>> prevents io_pin_pages() from pinning pages spanning multiple VMAs with
>> permitted characteristics (anon/huge), requiring that all VMAs share the
>> same vm_file.
>
> That commit doesn't really explain why io_uring is doing such a weird
> thing.
>
> What exactly is the problem with mixing struct pages from different
> files and why of all the GUP users does only io_uring need to care
> about this?
Simply because it doesn't seem sane to mix and register buffers of
different "nature" as one. It's not a huge deal for currently allowed
types, e.g. mixing normal and huge anon pages, but it's rather a matter
of time before it gets extended, and then it'll certainly become a
problem. We've been asked just recently to allow registering bufs
provided mapped by some specific driver, or there might be DMA mapped
memory in the future.
Rejecting based on vmas might be too conservative, I agree and am all
for it if someone can help to make it right.
> If there is no justification then let's revert that commit instead.
>
>> /* don't support file backed memory */
>> - for (i = 0; i < nr_pages; i++) {
>> - if (vmas[i]->vm_file != file) {
>> - ret = -EINVAL;
>> - break;
>> - }
>> - if (!file)
>> - continue;
>> - if (!vma_is_shmem(vmas[i]) && !is_file_hugepages(file)) {
>> - ret = -EOPNOTSUPP;
>> - break;
>> - }
>> - }
>> + file = vma->vm_file;
>> + if (file && !vma_is_shmem(vma) && !is_file_hugepages(file))
>> + ret = -EOPNOTSUPP;
>> +
>
> Also, why is it doing this?
There were problems with filesystem mappings, I believe.
Jens may remember what it was.
> All GUP users don't work entirely right for any fops implementation
> that assumes write protect is unconditionally possible. eg most
> filesystems.
>
> We've been ignoring blocking it because it is an ABI break and it does
> sort of work in some cases.
>
> I'd rather see something like FOLL_ALLOW_BROKEN_FILE_MAPPINGS than
> io_uring open coding this kind of stuff.
--
Pavel Begunkov
On 4/18/23 17:25, Pavel Begunkov wrote:
> On 4/17/23 13:56, Jason Gunthorpe wrote:
>> On Sat, Apr 15, 2023 at 12:27:45AM +0100, Lorenzo Stoakes wrote:
>>> Commit edd478269640 ("io_uring/rsrc: disallow multi-source reg buffers")
>>> prevents io_pin_pages() from pinning pages spanning multiple VMAs with
>>> permitted characteristics (anon/huge), requiring that all VMAs share the
>>> same vm_file.
>>
>> That commit doesn't really explain why io_uring is doing such a weird
>> thing.
>>
>> What exactly is the problem with mixing struct pages from different
>> files and why of all the GUP users does only io_uring need to care
>> about this?
>
> Simply because it doesn't seem sane to mix and register buffers of
> different "nature" as one. It's not a huge deal for currently allowed
> types, e.g. mixing normal and huge anon pages, but it's rather a matter
> of time before it gets extended, and then it'll certainly become a
> problem. We've been asked just recently to allow registering bufs
> provided or mapped by some specific driver, or there might be DMA mapped
> memory in the future.
>
> Rejecting based on vmas might be too conservative, I agree, and I am all
> for it if someone can help to make it right.
For some reason I thought it was rejecting anything that involves more
than one distinct VMA. The ->vm_file checks still sound fair to me, but
in any case I'm open to changing it.
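The difference between the two behaviours mentioned here can be
sketched in a small userspace model. This is only an illustration, not
kernel code: struct model_vma and both helper functions are hypothetical
names invented for this sketch, and only the vm_file field is modelled.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for a VMA; only vm_file is modelled. */
struct model_vma {
	const void *vm_file;	/* NULL models an anonymous mapping */
};

/*
 * The ->vm_file check: several VMAs are fine as long as every one of
 * them has the same vm_file as the first (possibly NULL).
 */
static int model_check_same_file(const struct model_vma *const *vmas,
				 size_t n)
{
	for (size_t i = 1; i < n; i++)
		if (vmas[i]->vm_file != vmas[0]->vm_file)
			return -EINVAL;
	return 0;
}

/*
 * The stricter reading: reject as soon as the range touches a second,
 * distinct VMA, regardless of what backs it.
 */
static int model_check_single_vma(const struct model_vma *const *vmas,
				  size_t n)
{
	for (size_t i = 1; i < n; i++)
		if (vmas[i] != vmas[0])
			return -EINVAL;
	return 0;
}
```

In this model, a range covering two separate anonymous VMAs passes
model_check_same_file() but fails model_check_single_vma(), which is the
distinction being drawn above.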
--
Pavel Begunkov
On Tue, Apr 18, 2023 at 05:25:08PM +0100, Pavel Begunkov wrote:
> On 4/17/23 13:56, Jason Gunthorpe wrote:
> > On Sat, Apr 15, 2023 at 12:27:45AM +0100, Lorenzo Stoakes wrote:
> > > Commit edd478269640 ("io_uring/rsrc: disallow multi-source reg buffers")
> > > prevents io_pin_pages() from pinning pages spanning multiple VMAs with
> > > permitted characteristics (anon/huge), requiring that all VMAs share the
> > > same vm_file.
> >
> > That commit doesn't really explain why io_uring is doing such a weird
> > thing.
> >
> > What exactly is the problem with mixing struct pages from different
> > files and why of all the GUP users does only io_uring need to care
> > about this?
>
> Simply because it doesn't seem sane to mix and register buffers of
> different "nature" as one.
That is not a good reason. Once things are converted to struct pages
they don't need to care about their "nature".
> problem. We've been asked just recently to allow registering bufs
> provided or mapped by some specific driver, or there might be DMA mapped
> memory in the future.
We already have GUP flags to deal with it, e.g. FOLL_PCI_P2PDMA.
> Rejecting based on vmas might be too conservative, I agree, and I am all
> for it if someone can help to make it right.
It is GUP's problem to deal with this, not the callers'.
GUP is defined to return a list of normal CPU DRAM in struct page
format. The caller doesn't care where or what this memory is, it is
all interchangeable - by API contract of GUP itself.
If you use FOLL_PCI_P2PDMA then the definition expands to allow struct
pages that are MMIO.
In future, if someone invents new memory or new struct pages with
special needs it is their job to ensure it is blocked from GUP - for
*everyone*, e.g. how PCI_P2PDMA was blocked from normal GUP.
io_uring is not special, there are many users of GUP, they all need to
work consistently.
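To make the shape of this argument concrete, here is a minimal
userspace model. It is a sketch under stated assumptions, not kernel
code: model_gup(), enum model_page_kind and MODEL_FOLL_PCI_P2PDMA are
hypothetical stand-ins, with the flag playing the role that
FOLL_PCI_P2PDMA plays in the description above.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical page kinds for this model; not kernel definitions. */
enum model_page_kind {
	MODEL_PAGE_DRAM,	/* normal CPU DRAM */
	MODEL_PAGE_P2PDMA,	/* struct pages that are really MMIO */
};

/* Hypothetical opt-in flag standing in for FOLL_PCI_P2PDMA. */
#define MODEL_FOLL_PCI_P2PDMA	(1u << 0)

/* The single place where the policy lives, shared by every caller. */
static bool model_gup_page_allowed(enum model_page_kind kind,
				   unsigned int flags)
{
	switch (kind) {
	case MODEL_PAGE_DRAM:
		return true;
	case MODEL_PAGE_P2PDMA:
		/* Only callers that explicitly opt in may see these. */
		return (flags & MODEL_FOLL_PCI_P2PDMA) != 0;
	}
	return false;
}

/* Returns the number of pages accepted, or -1 if any page is refused. */
static long model_gup(const enum model_page_kind *kinds, size_t n,
		      unsigned int flags)
{
	for (size_t i = 0; i < n; i++)
		if (!model_gup_page_allowed(kinds[i], flags))
			return -1;
	return (long)n;
}
```

In this model no caller needs to inspect page kinds itself: a caller
passing flags == 0 is refused any range containing MODEL_PAGE_P2PDMA,
and one passing MODEL_FOLL_PCI_P2PDMA is not.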
> > Also, why is it doing this?
>
> There were problems with filesystem mappings, I believe.
> Jens may remember what it was.
Yes, I know about this, but as above, io_uring is not special. If we
want to block this, GUP should block it to protect all users, rather
than io_uring protecting only itself.
Jason
On 4/18/23 17:36, Jason Gunthorpe wrote:
> On Tue, Apr 18, 2023 at 05:25:08PM +0100, Pavel Begunkov wrote:
>> On 4/17/23 13:56, Jason Gunthorpe wrote:
>>> On Sat, Apr 15, 2023 at 12:27:45AM +0100, Lorenzo Stoakes wrote:
>>>> Commit edd478269640 ("io_uring/rsrc: disallow multi-source reg buffers")
>>>> prevents io_pin_pages() from pinning pages spanning multiple VMAs with
>>>> permitted characteristics (anon/huge), requiring that all VMAs share the
>>>> same vm_file.
>>>
>>> That commit doesn't really explain why io_uring is doing such a weird
>>> thing.
>>>
>>> What exactly is the problem with mixing struct pages from different
>>> files and why of all the GUP users does only io_uring need to care
>>> about this?
>>
>> Simply because it doesn't seem sane to mix and register buffers of
>> different "nature" as one.
>
> That is not a good reason. Once things are converted to struct pages
> they don't need to care about their "nature"
Arguing purely about uapi, I do think it is. Even though it can be
passed down and a page is a page, a Frankenstein's monster mixing anon
pages, pages for io_uring queues, device shared memory, and whatnot
doesn't seem right for uapi. I see keeping buffers as a single entity,
as opposed to a set of random pages, as beneficial for the future.
And again, as for how it's internally done, I don't have any preference
whatsoever.
>> problem. We've been asked just recently to allow registering bufs
>> provided or mapped by some specific driver, or there might be DMA mapped
>> memory in the future.
>
> We already have GUP flags to deal with it, eg FOLL_PCI_P2PDMA
>
>> Rejecting based on vmas might be too conservative, I agree, and I am all
>> for it if someone can help to make it right.
>
> It is GUP's problem to deal with this, not the callers.
Ok, that's even better for io_uring if the same can be achieved
just by passing flags.
> GUP is defined to return a list of normal CPU DRAM in struct page
> format. The caller doesn't care where or what this memory is, it is
> all interchangeable - by API contract of GUP itself.
>
> If you use FOLL_PCI_P2PDMA then the definition expands to allow struct
> pages that are MMIO.
>
> In future, if someone invents new memory or new struct pages with
> special needs it is their job to ensure it is blocked from GUP - for
> *everyone*. eg how the PCI_P2PDMA was blocked from normal GUP.
>
> io_uring is not special, there are many users of GUP, they all need to
> work consistently.
--
Pavel Begunkov
On Tue, Apr 18, 2023 at 06:25:24PM +0100, Pavel Begunkov wrote:
> On 4/18/23 17:36, Jason Gunthorpe wrote:
> > On Tue, Apr 18, 2023 at 05:25:08PM +0100, Pavel Begunkov wrote:
> > > On 4/17/23 13:56, Jason Gunthorpe wrote:
> > > > On Sat, Apr 15, 2023 at 12:27:45AM +0100, Lorenzo Stoakes wrote:
> > > > > Commit edd478269640 ("io_uring/rsrc: disallow multi-source reg buffers")
> > > > > prevents io_pin_pages() from pinning pages spanning multiple VMAs with
> > > > > permitted characteristics (anon/huge), requiring that all VMAs share the
> > > > > same vm_file.
> > > >
> > > > That commit doesn't really explain why io_uring is doing such a weird
> > > > thing.
> > > >
> > > > What exactly is the problem with mixing struct pages from different
> > > > files and why of all the GUP users does only io_uring need to care
> > > > about this?
> > >
> > > Simply because it doesn't seem sane to mix and register buffers of
> > > different "nature" as one.
> >
> > That is not a good reason. Once things are converted to struct pages
> > they don't need to care about their "nature"
>
> Arguing purely about uapi, I do think it is. Even though it can be
> passed down and a page is a page, a Frankenstein's monster mixing anon
> pages, pages for io_uring queues, device shared memory, and whatnot
> doesn't seem right for uapi. I see keeping buffers as a single entity,
> as opposed to a set of random pages, as beneficial for the future.
Again, it is not up to io_uring to make this choice. We have GUP as
part of our uAPI all over the place, GUP decides how it works, not
random different ideas all over the place.
We don't have these kinds of restrictions for O_DIRECT, for instance.
There should be consistency in the uAPI across everything.
Jason