2021-12-01 19:38:05

by Catalin Marinas

Subject: [PATCH v2 0/4] Avoid live-lock in fault-in+uaccess loops with sub-page faults

Hi,

Following the discussions on the first series,

https://lore.kernel.org/r/[email protected]

this new patchset aims to generalise the sub-page probing and introduce
a minimum size to the fault_in_*() functions. I called this 'v2' but I
can rebase it on top of v1 and keep v1 as a backportable btrfs
live-lock fix. The fault_in_*() API improvements would then be a new
series. Anyway, I'd first like to know whether this is heading in the
right direction and whether it's worth adding min_size to all
fault_in_*() (more below).

v2 adds a 'min_size' argument to all fault_in_*() functions, with the
current callers passing 0 (or we could make it 1). A probe_subpage_*()
call is made for the min_size range, though with all callers passing 0
this has no effect. The only exception is btrfs search_ioctl() in the
last patch, which passes a non-zero min_size to avoid the live-lock
(functionally that's the same as the v1 series).
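
Condensed from the patches below, the resulting calling convention looks
roughly like this (just an illustration, not a new hunk):

	/* existing callers keep their behaviour by passing a min_size of 0 */
	if (unlikely(fault_in_iov_iter_readable(i, bytes, 0))) {
		status = -EFAULT;
		break;
	}

	/* btrfs search_ioctl() asks for the whole remaining range as minimum */
	size_t len = *buf_size - sk_offset;

	if (fault_in_writeable(ubuf + sk_offset, len, len))
		break;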

In terms of sub-page probing, I don't think anything other than
search_ioctl() matters with the current kernel. Buffered file I/O can
already cope with the current fault_in_*() + copy_*_user() loops (the
uaccess makes progress). Direct I/O either goes via GUP + kernel mapping
access (and memcpy() can't fault) or, if the user buffer is not
page-aligned, it may fall back to buffered I/O. So we really only care
about fault_in_writeable(), as in v1.
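
To illustrate the difference, a simplified sketch (not a verbatim quote
of either loop; names are only for the example):

	/*
	 * Buffered I/O pattern: fault_in_*() probes the exact byte the copy
	 * starts from, the copy makes at least one byte of progress if that
	 * byte is accessible and the iterator advances by the amount copied,
	 * so every iteration either bails out or moves forward.
	 */
	static ssize_t buffered_copy_example(char *buf, size_t left,
					     struct iov_iter *i)
	{
		size_t copied;

		while (left) {
			if (fault_in_iov_iter_readable(i, left, 0))
				return -EFAULT;
			copied = copy_from_iter(buf, left, i);
			buf += copied;
			left -= copied;
		}
		return 0;
	}

	/*
	 * search_ioctl() pattern: on a copy_to_user_nofault() fault, sk_offset
	 * is rewound to the previous record boundary, so the loop can retry
	 * the exact same fault_in_writeable() + copy sequence and never
	 * advance when the fault is at sub-page granularity.
	 */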

Linus suggested that we could use min_size to request a minimum
guaranteed probed size (in most cases this would be 1) and put a cap on
the faulted-in size, say two pages. All the fault_in_iov_iter_*()
callers would then need to check the actual quantity returned by
fault_in_*() rather than bail out on a non-zero return, but Andreas
already has a patch for that (though I think a few cases in btrfs etc.
remain):

https://lore.kernel.org/r/[email protected]

With these callers fixed, we could add something like the diff below.
But, again, min_size doesn't actually have any current use in the kernel
other than fault_in_writeable() and search_ioctl().

Thanks for having a look. Suggestions welcomed.

------------------8<-------------------------------
diff --git a/mm/gup.c b/mm/gup.c
index 7fa69b0fb859..3aa88aa8ce9d 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1658,6 +1658,8 @@ static long __get_user_pages_locked(struct mm_struct *mm, unsigned long start,
}
#endif /* !CONFIG_MMU */

+#define MAX_FAULT_IN_SIZE (2 * PAGE_SIZE)
+
/**
* fault_in_writeable - fault in userspace address range for writing
* @uaddr: start of address range
@@ -1671,6 +1673,7 @@ size_t fault_in_writeable(char __user *uaddr, size_t size, size_t min_size)
{
char __user *start = uaddr, *end;
size_t faulted_in = size;
+ size_t max_size = max_t(size_t, MAX_FAULT_IN_SIZE, min_size);

if (unlikely(size == 0))
return 0;
@@ -1679,7 +1682,7 @@ size_t fault_in_writeable(char __user *uaddr, size_t size, size_t min_size)
return size;
uaddr = (char __user *)PAGE_ALIGN((unsigned long)uaddr);
}
- end = (char __user *)PAGE_ALIGN((unsigned long)start + size);
+ end = (char __user *)PAGE_ALIGN((unsigned long)start + max_size);
if (unlikely(end < start))
end = NULL;
while (uaddr != end) {
@@ -1726,9 +1729,10 @@ size_t fault_in_safe_writeable(const char __user *uaddr, size_t size,
struct vm_area_struct *vma = NULL;
int locked = 0;
size_t faulted_in = size;
+ size_t max_size = max_t(size_t, MAX_FAULT_IN_SIZE, min_size);

nstart = start & PAGE_MASK;
- end = PAGE_ALIGN(start + size);
+ end = PAGE_ALIGN(start + max_size);
if (end < nstart)
end = 0;
for (; nstart != end; nstart = nend) {
@@ -1759,7 +1763,7 @@ size_t fault_in_safe_writeable(const char __user *uaddr, size_t size,
if (locked)
mmap_read_unlock(mm);
if (nstart != end)
- faulted_in = min_t(size_t, nstart - start, size);
+ faulted_in = min_t(size_t, nstart - start, max_size);
if (faulted_in < min_size ||
(min_size && probe_subpage_safe_writeable(uaddr, min_size)))
return size;
@@ -1782,6 +1786,7 @@ size_t fault_in_readable(const char __user *uaddr, size_t size,
const char __user *start = uaddr, *end;
volatile char c;
size_t faulted_in = size;
+ size_t max_size = max_t(size_t, MAX_FAULT_IN_SIZE, min_size);

if (unlikely(size == 0))
return 0;
@@ -1790,7 +1795,7 @@ size_t fault_in_readable(const char __user *uaddr, size_t size,
return size;
uaddr = (const char __user *)PAGE_ALIGN((unsigned long)uaddr);
}
- end = (const char __user *)PAGE_ALIGN((unsigned long)start + size);
+ end = (const char __user *)PAGE_ALIGN((unsigned long)start + max_size);
if (unlikely(end < start))
end = NULL;
while (uaddr != end) {
------------------8<-------------------------------

Catalin Marinas (4):
mm: Introduce a 'min_size' argument to fault_in_*()
mm: Probe for sub-page faults in fault_in_*()
arm64: Add support for user sub-page fault probing
btrfs: Avoid live-lock in search_ioctl() on hardware with sub-page
faults

arch/Kconfig | 7 ++++
arch/arm64/Kconfig | 1 +
arch/arm64/include/asm/uaccess.h | 59 +++++++++++++++++++++++++++++
arch/powerpc/kernel/kvm.c | 2 +-
arch/powerpc/kernel/signal_32.c | 4 +-
arch/powerpc/kernel/signal_64.c | 2 +-
arch/x86/kernel/fpu/signal.c | 2 +-
drivers/gpu/drm/armada/armada_gem.c | 2 +-
fs/btrfs/file.c | 6 +--
fs/btrfs/ioctl.c | 7 +++-
fs/f2fs/file.c | 2 +-
fs/fuse/file.c | 2 +-
fs/gfs2/file.c | 8 ++--
fs/iomap/buffered-io.c | 2 +-
fs/ntfs/file.c | 2 +-
fs/ntfs3/file.c | 2 +-
include/linux/pagemap.h | 8 ++--
include/linux/uaccess.h | 53 ++++++++++++++++++++++++++
include/linux/uio.h | 6 ++-
lib/iov_iter.c | 28 +++++++++++---
mm/filemap.c | 2 +-
mm/gup.c | 37 +++++++++++++-----
22 files changed, 203 insertions(+), 41 deletions(-)



2021-12-01 19:38:11

by Catalin Marinas

Subject: [PATCH v2 1/4] mm: Introduce a 'min_size' argument to fault_in_*()

There is no functional change after this patch as all callers pass a
min_size of 0. This argument will be used in subsequent patches to probe
for faults at sub-page granularity (e.g. arm64 MTE and SPARC ADI).

With a non-zero 'min_size' argument, the fault_in_*() functions return
the full range if they don't manage to fault in the minimum size.

Signed-off-by: Catalin Marinas <[email protected]>
---
arch/powerpc/kernel/kvm.c | 2 +-
arch/powerpc/kernel/signal_32.c | 4 ++--
arch/powerpc/kernel/signal_64.c | 2 +-
arch/x86/kernel/fpu/signal.c | 2 +-
drivers/gpu/drm/armada/armada_gem.c | 2 +-
fs/btrfs/file.c | 6 ++---
fs/btrfs/ioctl.c | 2 +-
fs/f2fs/file.c | 2 +-
fs/fuse/file.c | 2 +-
fs/gfs2/file.c | 8 +++----
fs/iomap/buffered-io.c | 2 +-
fs/ntfs/file.c | 2 +-
fs/ntfs3/file.c | 2 +-
include/linux/pagemap.h | 8 ++++---
include/linux/uio.h | 6 +++--
lib/iov_iter.c | 28 +++++++++++++++++++-----
mm/filemap.c | 2 +-
mm/gup.c | 34 ++++++++++++++++++++---------
18 files changed, 75 insertions(+), 41 deletions(-)

diff --git a/arch/powerpc/kernel/kvm.c b/arch/powerpc/kernel/kvm.c
index 6568823cf306..7a7fb08df4c4 100644
--- a/arch/powerpc/kernel/kvm.c
+++ b/arch/powerpc/kernel/kvm.c
@@ -670,7 +670,7 @@ static void __init kvm_use_magic_page(void)

/* Quick self-test to see if the mapping works */
if (fault_in_readable((const char __user *)KVM_MAGIC_PAGE,
- sizeof(u32))) {
+ sizeof(u32), 0)) {
kvm_patching_worked = false;
return;
}
diff --git a/arch/powerpc/kernel/signal_32.c b/arch/powerpc/kernel/signal_32.c
index 3e053e2fd6b6..7c817881d418 100644
--- a/arch/powerpc/kernel/signal_32.c
+++ b/arch/powerpc/kernel/signal_32.c
@@ -1048,7 +1048,7 @@ SYSCALL_DEFINE3(swapcontext, struct ucontext __user *, old_ctx,
if (new_ctx == NULL)
return 0;
if (!access_ok(new_ctx, ctx_size) ||
- fault_in_readable((char __user *)new_ctx, ctx_size))
+ fault_in_readable((char __user *)new_ctx, ctx_size, 0))
return -EFAULT;

/*
@@ -1239,7 +1239,7 @@ SYSCALL_DEFINE3(debug_setcontext, struct ucontext __user *, ctx,
#endif

if (!access_ok(ctx, sizeof(*ctx)) ||
- fault_in_readable((char __user *)ctx, sizeof(*ctx)))
+ fault_in_readable((char __user *)ctx, sizeof(*ctx), 0))
return -EFAULT;

/*
diff --git a/arch/powerpc/kernel/signal_64.c b/arch/powerpc/kernel/signal_64.c
index d1e1fc0acbea..732fa4e10d24 100644
--- a/arch/powerpc/kernel/signal_64.c
+++ b/arch/powerpc/kernel/signal_64.c
@@ -688,7 +688,7 @@ SYSCALL_DEFINE3(swapcontext, struct ucontext __user *, old_ctx,
if (new_ctx == NULL)
return 0;
if (!access_ok(new_ctx, ctx_size) ||
- fault_in_readable((char __user *)new_ctx, ctx_size))
+ fault_in_readable((char __user *)new_ctx, ctx_size, 0))
return -EFAULT;

/*
diff --git a/arch/x86/kernel/fpu/signal.c b/arch/x86/kernel/fpu/signal.c
index d5958278eba6..c9bd217e3364 100644
--- a/arch/x86/kernel/fpu/signal.c
+++ b/arch/x86/kernel/fpu/signal.c
@@ -309,7 +309,7 @@ static bool restore_fpregs_from_user(void __user *buf, u64 xrestore,
if (ret != X86_TRAP_PF)
return false;

- if (!fault_in_readable(buf, size))
+ if (!fault_in_readable(buf, size, 0))
goto retry;
return false;
}
diff --git a/drivers/gpu/drm/armada/armada_gem.c b/drivers/gpu/drm/armada/armada_gem.c
index 147abf1a3968..0f44219c0120 100644
--- a/drivers/gpu/drm/armada/armada_gem.c
+++ b/drivers/gpu/drm/armada/armada_gem.c
@@ -351,7 +351,7 @@ int armada_gem_pwrite_ioctl(struct drm_device *dev, void *data,
if (!access_ok(ptr, args->size))
return -EFAULT;

- if (fault_in_readable(ptr, args->size))
+ if (fault_in_readable(ptr, args->size, 0))
return -EFAULT;

dobj = armada_gem_object_lookup(file, args->handle);
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 11204dbbe053..96ac4b186b72 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1718,7 +1718,7 @@ static noinline ssize_t btrfs_buffered_write(struct kiocb *iocb,
* Fault pages before locking them in prepare_pages
* to avoid recursive lock
*/
- if (unlikely(fault_in_iov_iter_readable(i, write_bytes))) {
+ if (unlikely(fault_in_iov_iter_readable(i, write_bytes, 0))) {
ret = -EFAULT;
break;
}
@@ -2021,7 +2021,7 @@ static ssize_t btrfs_direct_write(struct kiocb *iocb, struct iov_iter *from)
if (left == prev_left) {
err = -ENOTBLK;
} else {
- fault_in_iov_iter_readable(from, left);
+ fault_in_iov_iter_readable(from, left, 0);
prev_left = left;
goto again;
}
@@ -3772,7 +3772,7 @@ static ssize_t btrfs_direct_read(struct kiocb *iocb, struct iov_iter *to)
* the first time we are retrying. Fault in as many pages
* as possible and retry.
*/
- fault_in_iov_iter_writeable(to, left);
+ fault_in_iov_iter_writeable(to, left, 0);
prev_left = left;
goto again;
}
diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 92138ac2a4e2..c7d74c8776a1 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -2223,7 +2223,7 @@ static noinline int search_ioctl(struct inode *inode,

while (1) {
ret = -EFAULT;
- if (fault_in_writeable(ubuf + sk_offset, *buf_size - sk_offset))
+ if (fault_in_writeable(ubuf + sk_offset, *buf_size - sk_offset, 0))
break;

ret = btrfs_search_forward(root, &key, path, sk->min_transid);
diff --git a/fs/f2fs/file.c b/fs/f2fs/file.c
index 92ec2699bc85..fb6eceac0d57 100644
--- a/fs/f2fs/file.c
+++ b/fs/f2fs/file.c
@@ -4276,7 +4276,7 @@ static ssize_t f2fs_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
size_t target_size = 0;
int err;

- if (fault_in_iov_iter_readable(from, iov_iter_count(from)))
+ if (fault_in_iov_iter_readable(from, iov_iter_count(from), 0))
set_inode_flag(inode, FI_NO_PREALLOC);

if ((iocb->ki_flags & IOCB_NOWAIT)) {
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 9d6c5f6361f7..c823b9f70215 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -1162,7 +1162,7 @@ static ssize_t fuse_fill_write_pages(struct fuse_io_args *ia,

again:
err = -EFAULT;
- if (fault_in_iov_iter_readable(ii, bytes))
+ if (fault_in_iov_iter_readable(ii, bytes, 0))
break;

err = -ENOMEM;
diff --git a/fs/gfs2/file.c b/fs/gfs2/file.c
index 3e718cfc19a7..f7bd3bfd0690 100644
--- a/fs/gfs2/file.c
+++ b/fs/gfs2/file.c
@@ -847,7 +847,7 @@ static ssize_t gfs2_file_direct_read(struct kiocb *iocb, struct iov_iter *to,
size_t leftover;

gfs2_holder_allow_demote(gh);
- leftover = fault_in_iov_iter_writeable(to, window_size);
+ leftover = fault_in_iov_iter_writeable(to, window_size, 0);
gfs2_holder_disallow_demote(gh);
if (leftover != window_size) {
if (!gfs2_holder_queued(gh))
@@ -916,7 +916,7 @@ static ssize_t gfs2_file_direct_write(struct kiocb *iocb, struct iov_iter *from,
size_t leftover;

gfs2_holder_allow_demote(gh);
- leftover = fault_in_iov_iter_readable(from, window_size);
+ leftover = fault_in_iov_iter_readable(from, window_size, 0);
gfs2_holder_disallow_demote(gh);
if (leftover != window_size) {
if (!gfs2_holder_queued(gh))
@@ -985,7 +985,7 @@ static ssize_t gfs2_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
size_t leftover;

gfs2_holder_allow_demote(&gh);
- leftover = fault_in_iov_iter_writeable(to, window_size);
+ leftover = fault_in_iov_iter_writeable(to, window_size, 0);
gfs2_holder_disallow_demote(&gh);
if (leftover != window_size) {
if (!gfs2_holder_queued(&gh)) {
@@ -1063,7 +1063,7 @@ static ssize_t gfs2_file_buffered_write(struct kiocb *iocb,
size_t leftover;

gfs2_holder_allow_demote(gh);
- leftover = fault_in_iov_iter_readable(from, window_size);
+ leftover = fault_in_iov_iter_readable(from, window_size, 0);
gfs2_holder_disallow_demote(gh);
if (leftover != window_size) {
from->count = min(from->count, window_size - leftover);
diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 1753c26c8e76..e7a529405775 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -750,7 +750,7 @@ static loff_t iomap_write_iter(struct iomap_iter *iter, struct iov_iter *i)
* same page as we're writing to, without it being marked
* up-to-date.
*/
- if (unlikely(fault_in_iov_iter_readable(i, bytes))) {
+ if (unlikely(fault_in_iov_iter_readable(i, bytes, 0))) {
status = -EFAULT;
break;
}
diff --git a/fs/ntfs/file.c b/fs/ntfs/file.c
index 2ae25e48a41a..441aeefda8b6 100644
--- a/fs/ntfs/file.c
+++ b/fs/ntfs/file.c
@@ -1830,7 +1830,7 @@ static ssize_t ntfs_perform_write(struct file *file, struct iov_iter *i,
* pages being swapped out between us bringing them into memory
* and doing the actual copying.
*/
- if (unlikely(fault_in_iov_iter_readable(i, bytes))) {
+ if (unlikely(fault_in_iov_iter_readable(i, bytes, 0))) {
status = -EFAULT;
break;
}
diff --git a/fs/ntfs3/file.c b/fs/ntfs3/file.c
index 787b53b984ee..208686bda052 100644
--- a/fs/ntfs3/file.c
+++ b/fs/ntfs3/file.c
@@ -990,7 +990,7 @@ static ssize_t ntfs_compress_write(struct kiocb *iocb, struct iov_iter *from)
frame_vbo = pos & ~(frame_size - 1);
index = frame_vbo >> PAGE_SHIFT;

- if (unlikely(fault_in_iov_iter_readable(from, bytes))) {
+ if (unlikely(fault_in_iov_iter_readable(from, bytes, 0))) {
err = -EFAULT;
goto out;
}
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 1a0c646eb6ff..79d328031247 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -909,9 +909,11 @@ void folio_add_wait_queue(struct folio *folio, wait_queue_entry_t *waiter);
/*
* Fault in userspace address range.
*/
-size_t fault_in_writeable(char __user *uaddr, size_t size);
-size_t fault_in_safe_writeable(const char __user *uaddr, size_t size);
-size_t fault_in_readable(const char __user *uaddr, size_t size);
+size_t fault_in_writeable(char __user *uaddr, size_t size, size_t min_size);
+size_t fault_in_safe_writeable(const char __user *uaddr, size_t size,
+ size_t min_size);
+size_t fault_in_readable(const char __user *uaddr, size_t size,
+ size_t min_size);

int add_to_page_cache_locked(struct page *page, struct address_space *mapping,
pgoff_t index, gfp_t gfp);
diff --git a/include/linux/uio.h b/include/linux/uio.h
index 6350354f97e9..06c54c3ab3f8 100644
--- a/include/linux/uio.h
+++ b/include/linux/uio.h
@@ -134,8 +134,10 @@ size_t copy_page_from_iter_atomic(struct page *page, unsigned offset,
size_t bytes, struct iov_iter *i);
void iov_iter_advance(struct iov_iter *i, size_t bytes);
void iov_iter_revert(struct iov_iter *i, size_t bytes);
-size_t fault_in_iov_iter_readable(const struct iov_iter *i, size_t bytes);
-size_t fault_in_iov_iter_writeable(const struct iov_iter *i, size_t bytes);
+size_t fault_in_iov_iter_readable(const struct iov_iter *i, size_t bytes,
+ size_t min_size);
+size_t fault_in_iov_iter_writeable(const struct iov_iter *i, size_t bytes,
+ size_t min_size);
size_t iov_iter_single_seg_count(const struct iov_iter *i);
size_t copy_page_to_iter(struct page *page, size_t offset, size_t bytes,
struct iov_iter *i);
diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index 66a740e6e153..ecb95bb5c423 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -191,7 +191,7 @@ static size_t copy_page_to_iter_iovec(struct page *page, size_t offset, size_t b
buf = iov->iov_base + skip;
copy = min(bytes, iov->iov_len - skip);

- if (IS_ENABLED(CONFIG_HIGHMEM) && !fault_in_writeable(buf, copy)) {
+ if (IS_ENABLED(CONFIG_HIGHMEM) && !fault_in_writeable(buf, copy, 0)) {
kaddr = kmap_atomic(page);
from = kaddr + offset;

@@ -275,7 +275,7 @@ static size_t copy_page_from_iter_iovec(struct page *page, size_t offset, size_t
buf = iov->iov_base + skip;
copy = min(bytes, iov->iov_len - skip);

- if (IS_ENABLED(CONFIG_HIGHMEM) && !fault_in_readable(buf, copy)) {
+ if (IS_ENABLED(CONFIG_HIGHMEM) && !fault_in_readable(buf, copy, 0)) {
kaddr = kmap_atomic(page);
to = kaddr + offset;

@@ -433,6 +433,7 @@ static size_t copy_page_to_iter_pipe(struct page *page, size_t offset, size_t by
* fault_in_iov_iter_readable - fault in iov iterator for reading
* @i: iterator
* @size: maximum length
+ * @min_size: minimum size to be faulted in
*
* Fault in one or more iovecs of the given iov_iter, to a maximum length of
* @size. For each iovec, fault in each page that constitutes the iovec.
@@ -442,25 +443,32 @@ static size_t copy_page_to_iter_pipe(struct page *page, size_t offset, size_t by
*
* Always returns 0 for non-userspace iterators.
*/
-size_t fault_in_iov_iter_readable(const struct iov_iter *i, size_t size)
+size_t fault_in_iov_iter_readable(const struct iov_iter *i, size_t size,
+ size_t min_size)
{
if (iter_is_iovec(i)) {
size_t count = min(size, iov_iter_count(i));
const struct iovec *p;
size_t skip;
+ size_t orig_size = size;

size -= count;
for (p = i->iov, skip = i->iov_offset; count; p++, skip = 0) {
size_t len = min(count, p->iov_len - skip);
+ size_t min_len = min(len, min_size);
size_t ret;

if (unlikely(!len))
continue;
- ret = fault_in_readable(p->iov_base + skip, len);
+ ret = fault_in_readable(p->iov_base + skip, len,
+ min_len);
count -= len - ret;
+ min_size -= min(min_size, len - ret);
if (ret)
break;
}
+ if (min_size)
+ return orig_size;
return count + size;
}
return 0;
@@ -471,6 +479,7 @@ EXPORT_SYMBOL(fault_in_iov_iter_readable);
* fault_in_iov_iter_writeable - fault in iov iterator for writing
* @i: iterator
* @size: maximum length
+ * @min_size: minimum size to be faulted in
*
* Faults in the iterator using get_user_pages(), i.e., without triggering
* hardware page faults. This is primarily useful when we already know that
@@ -481,25 +490,32 @@ EXPORT_SYMBOL(fault_in_iov_iter_readable);
*
* Always returns 0 for non-user-space iterators.
*/
-size_t fault_in_iov_iter_writeable(const struct iov_iter *i, size_t size)
+size_t fault_in_iov_iter_writeable(const struct iov_iter *i, size_t size,
+ size_t min_size)
{
if (iter_is_iovec(i)) {
size_t count = min(size, iov_iter_count(i));
const struct iovec *p;
size_t skip;
+ size_t orig_size = size;

size -= count;
for (p = i->iov, skip = i->iov_offset; count; p++, skip = 0) {
size_t len = min(count, p->iov_len - skip);
+ size_t min_len = min(len, min_size);
size_t ret;

if (unlikely(!len))
continue;
- ret = fault_in_safe_writeable(p->iov_base + skip, len);
+ ret = fault_in_safe_writeable(p->iov_base + skip, len,
+ min_len);
count -= len - ret;
+ min_size -= min(min_size, len - ret);
if (ret)
break;
}
+ if (min_size)
+ return orig_size;
return count + size;
}
return 0;
diff --git a/mm/filemap.c b/mm/filemap.c
index daa0e23a6ee6..e5d7f5b1e5cc 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -3743,7 +3743,7 @@ ssize_t generic_perform_write(struct file *file,
* same page as we're writing to, without it being marked
* up-to-date.
*/
- if (unlikely(fault_in_iov_iter_readable(i, bytes))) {
+ if (unlikely(fault_in_iov_iter_readable(i, bytes, 0))) {
status = -EFAULT;
break;
}
diff --git a/mm/gup.c b/mm/gup.c
index 2c51e9748a6a..baa8240615a4 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1662,13 +1662,15 @@ static long __get_user_pages_locked(struct mm_struct *mm, unsigned long start,
* fault_in_writeable - fault in userspace address range for writing
* @uaddr: start of address range
* @size: size of address range
+ * @min_size: minimum size to be faulted in
*
* Returns the number of bytes not faulted in (like copy_to_user() and
* copy_from_user()).
*/
-size_t fault_in_writeable(char __user *uaddr, size_t size)
+size_t fault_in_writeable(char __user *uaddr, size_t size, size_t min_size)
{
char __user *start = uaddr, *end;
+ size_t faulted_in = size;

if (unlikely(size == 0))
return 0;
@@ -1688,8 +1690,10 @@ size_t fault_in_writeable(char __user *uaddr, size_t size)

out:
if (size > uaddr - start)
- return size - (uaddr - start);
- return 0;
+ faulted_in = uaddr - start;
+ if (faulted_in < min_size)
+ return size;
+ return size - faulted_in;
}
EXPORT_SYMBOL(fault_in_writeable);

@@ -1697,6 +1701,7 @@ EXPORT_SYMBOL(fault_in_writeable);
* fault_in_safe_writeable - fault in an address range for writing
* @uaddr: start of address range
* @size: length of address range
+ * @min_size: minimum size to be faulted in
*
* Faults in an address range using get_user_pages, i.e., without triggering
* hardware page faults. This is primarily useful when we already know that
@@ -1711,13 +1716,15 @@ EXPORT_SYMBOL(fault_in_writeable);
* Returns the number of bytes not faulted in, like copy_to_user() and
* copy_from_user().
*/
-size_t fault_in_safe_writeable(const char __user *uaddr, size_t size)
+size_t fault_in_safe_writeable(const char __user *uaddr, size_t size,
+ size_t min_size)
{
unsigned long start = (unsigned long)untagged_addr(uaddr);
unsigned long end, nstart, nend;
struct mm_struct *mm = current->mm;
struct vm_area_struct *vma = NULL;
int locked = 0;
+ size_t faulted_in = size;

nstart = start & PAGE_MASK;
end = PAGE_ALIGN(start + size);
@@ -1750,9 +1757,11 @@ size_t fault_in_safe_writeable(const char __user *uaddr, size_t size)
}
if (locked)
mmap_read_unlock(mm);
- if (nstart == end)
- return 0;
- return size - min_t(size_t, nstart - start, size);
+ if (nstart != end)
+ faulted_in = min_t(size_t, nstart - start, size);
+ if (faulted_in < min_size)
+ return size;
+ return size - faulted_in;
}
EXPORT_SYMBOL(fault_in_safe_writeable);

@@ -1760,14 +1769,17 @@ EXPORT_SYMBOL(fault_in_safe_writeable);
* fault_in_readable - fault in userspace address range for reading
* @uaddr: start of user address range
* @size: size of user address range
+ * @min_size: minimum size to be faulted in
*
* Returns the number of bytes not faulted in (like copy_to_user() and
* copy_from_user()).
*/
-size_t fault_in_readable(const char __user *uaddr, size_t size)
+size_t fault_in_readable(const char __user *uaddr, size_t size,
+ size_t min_size)
{
const char __user *start = uaddr, *end;
volatile char c;
+ size_t faulted_in = size;

if (unlikely(size == 0))
return 0;
@@ -1788,8 +1800,10 @@ size_t fault_in_readable(const char __user *uaddr, size_t size)
out:
(void)c;
if (size > uaddr - start)
- return size - (uaddr - start);
- return 0;
+ faulted_in = uaddr - start;
+ if (faulted_in < min_size)
+ return size;
+ return size - faulted_in;
}
EXPORT_SYMBOL(fault_in_readable);


2021-12-01 19:38:25

by Catalin Marinas

Subject: [PATCH v2 2/4] mm: Probe for sub-page faults in fault_in_*()

On hardware with features like arm64 MTE or SPARC ADI, an access fault
can be triggered at sub-page granularity. Depending on how the
fault_in_*() functions are used, the caller can get into a live-lock by
continuously retrying the fault-in on an address different from the one
where the uaccess failed.

In the majority of cases progress is ensured by the following
conditions:

1. copy_{to,from}_user_nofault() guarantees at least one byte access if
the user address is not faulting.

2. The fault_in_*() loop is resumed from the next address that could not
be accessed by copy_{to,from}_user_nofault().

If the loop iteration is restarted from an earlier point, the loop is
repeated with the same conditions and it would live-lock. The same
problem exists if fault_in_*() is attempted on the fault address
reported by copy_*_user_nofault(), since the latter does not guarantee
that the maximum possible number of bytes was written and fault_in_*()
would succeed by probing a single byte.

Introduce probe_subpage_*() and call them from the corresponding
fault_in_*() functions on the requested 'min_size' range. The arch code
with sub-page faults will have to implement the specific probing
functionality.

Signed-off-by: Catalin Marinas <[email protected]>
---
arch/Kconfig | 7 ++++++
include/linux/uaccess.h | 53 +++++++++++++++++++++++++++++++++++++++++
mm/gup.c | 9 ++++---
3 files changed, 66 insertions(+), 3 deletions(-)

diff --git a/arch/Kconfig b/arch/Kconfig
index 26b8ed11639d..02502b3362aa 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -27,6 +27,13 @@ config HAVE_IMA_KEXEC
config SET_FS
bool

+config ARCH_HAS_SUBPAGE_FAULTS
+ bool
+ help
+ Select if the architecture can check permissions at sub-page
+ granularity (e.g. arm64 MTE). The probe_user_*() functions
+ must be implemented.
+
config HOTPLUG_SMT
bool

diff --git a/include/linux/uaccess.h b/include/linux/uaccess.h
index ac0394087f7d..04ad214c98cd 100644
--- a/include/linux/uaccess.h
+++ b/include/linux/uaccess.h
@@ -271,6 +271,59 @@ static inline bool pagefault_disabled(void)
*/
#define faulthandler_disabled() (pagefault_disabled() || in_atomic())

+#ifndef CONFIG_ARCH_HAS_SUBPAGE_FAULTS
+
+/**
+ * probe_subpage_writeable: probe the user range for write faults at sub-page
+ * granularity (e.g. arm64 MTE)
+ * @uaddr: start of address range
+ * @size: size of address range
+ *
+ * Returns 0 on success, the number of bytes not probed on fault.
+ *
+ * It is expected that the caller checked for the write permission of each
+ * page in the range either by put_user() or GUP. The architecture port can
+ * implement a more efficient get_user() probing if the same sub-page faults
+ * are triggered by either a read or a write.
+ */
+static inline size_t probe_subpage_writeable(void __user *uaddr, size_t size)
+{
+ return 0;
+}
+
+/**
+ * probe_subpage_safe_writeable: probe the user range for write faults at
+ * sub-page granularity without corrupting the
+ * existing data
+ * @uaddr: start of address range
+ * @size: size of address range
+ *
+ * Returns 0 on success, the number of bytes not probed on fault.
+ *
+ * It is expected that the caller checked for the write permission of each
+ * page in the range either by put_user() or GUP.
+ */
+static inline size_t probe_subpage_safe_writeable(void __user *uaddr,
+ size_t size)
+{
+ return 0;
+}
+
+/**
+ * probe_subpage_readable: probe the user range for read faults at sub-page
+ * granularity
+ * @uaddr: start of address range
+ * @size: size of address range
+ *
+ * Returns 0 on success, the number of bytes not probed on fault.
+ */
+static inline size_t probe_subpage_readable(void __user *uaddr, size_t size)
+{
+ return 0;
+}
+
+#endif
+
#ifndef ARCH_HAS_NOCACHE_UACCESS

static inline __must_check unsigned long
diff --git a/mm/gup.c b/mm/gup.c
index baa8240615a4..7fa69b0fb859 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1691,7 +1691,8 @@ size_t fault_in_writeable(char __user *uaddr, size_t size, size_t min_size)
out:
if (size > uaddr - start)
faulted_in = uaddr - start;
- if (faulted_in < min_size)
+ if (faulted_in < min_size ||
+ (min_size && probe_subpage_writeable(start, min_size)))
return size;
return size - faulted_in;
}
@@ -1759,7 +1760,8 @@ size_t fault_in_safe_writeable(const char __user *uaddr, size_t size,
mmap_read_unlock(mm);
if (nstart != end)
faulted_in = min_t(size_t, nstart - start, size);
- if (faulted_in < min_size)
+ if (faulted_in < min_size ||
+ (min_size && probe_subpage_safe_writeable(uaddr, min_size)))
return size;
return size - faulted_in;
}
@@ -1801,7 +1803,8 @@ size_t fault_in_readable(const char __user *uaddr, size_t size,
(void)c;
if (size > uaddr - start)
faulted_in = uaddr - start;
- if (faulted_in < min_size)
+ if (faulted_in < min_size ||
+ (min_size && probe_subpage_readable(start, min_size)))
return size;
return size - faulted_in;
}

2021-12-01 19:38:31

by Catalin Marinas

Subject: [PATCH v2 3/4] arm64: Add support for user sub-page fault probing

With MTE, even if the pte allows an access, a mismatched tag somewhere
within a page can still cause a fault. Select ARCH_HAS_SUBPAGE_FAULTS if
MTE is enabled and implement the probe_subpage_*() functions. Note that
get_user() is sufficient for the writeable checks since the same tag
mismatch fault would be triggered by a read.

Signed-off-by: Catalin Marinas <[email protected]>
---
arch/arm64/Kconfig | 1 +
arch/arm64/include/asm/uaccess.h | 59 ++++++++++++++++++++++++++++++++
2 files changed, 60 insertions(+)

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index c4207cf9bb17..dff89fd0d817 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -1777,6 +1777,7 @@ config ARM64_MTE
depends on AS_HAS_LSE_ATOMICS
# Required for tag checking in the uaccess routines
depends on ARM64_PAN
+ select ARCH_HAS_SUBPAGE_FAULTS
select ARCH_USES_HIGH_VMA_FLAGS
help
Memory Tagging (part of the ARMv8.5 Extensions) provides
diff --git a/arch/arm64/include/asm/uaccess.h b/arch/arm64/include/asm/uaccess.h
index 6e2e0b7031ab..bcbd24b97917 100644
--- a/arch/arm64/include/asm/uaccess.h
+++ b/arch/arm64/include/asm/uaccess.h
@@ -445,4 +445,63 @@ static inline int __copy_from_user_flushcache(void *dst, const void __user *src,
}
#endif

+#ifdef CONFIG_ARCH_HAS_SUBPAGE_FAULTS
+
+/*
+ * Return 0 on success, the number of bytes not accessed otherwise.
+ */
+static inline size_t __mte_probe_user_range(const char __user *uaddr,
+ size_t size, bool skip_first)
+{
+ const char __user *end = uaddr + size;
+ int err = 0;
+ char val;
+
+ uaddr = PTR_ALIGN_DOWN(uaddr, MTE_GRANULE_SIZE);
+ if (skip_first)
+ uaddr += MTE_GRANULE_SIZE;
+ while (uaddr < end) {
+ /*
+ * A read is sufficient for MTE, the caller should have probed
+ * for the pte write permission if required.
+ */
+ __raw_get_user(val, uaddr, err);
+ if (err)
+ return end - uaddr;
+ uaddr += MTE_GRANULE_SIZE;
+ }
+ (void)val;
+
+ return 0;
+}
+
+static inline size_t probe_subpage_writeable(const void __user *uaddr,
+ size_t size)
+{
+ if (!system_supports_mte())
+ return 0;
+ /* first put_user() done in the caller */
+ return __mte_probe_user_range(uaddr, size, true);
+}
+
+static inline size_t probe_subpage_safe_writeable(const void __user *uaddr,
+ size_t size)
+{
+ if (!system_supports_mte())
+ return 0;
+ /* the caller used GUP, don't skip the first granule */
+ return __mte_probe_user_range(uaddr, size, false);
+}
+
+static inline size_t probe_subpage_readable(const void __user *uaddr,
+ size_t size)
+{
+ if (!system_supports_mte())
+ return 0;
+ /* first get_user() done in the caller */
+ return __mte_probe_user_range(uaddr, size, true);
+}
+
+#endif /* CONFIG_ARCH_HAS_SUBPAGE_FAULTS */
+
#endif /* __ASM_UACCESS_H */

2021-12-01 19:38:41

by Catalin Marinas

Subject: [PATCH v2 4/4] btrfs: Avoid live-lock in search_ioctl() on hardware with sub-page faults

Commit a48b73eca4ce ("btrfs: fix potential deadlock in the search
ioctl") addressed a lockdep warning by pre-faulting the user pages and
attempting the copy_to_user_nofault() in an infinite loop. On
architectures like arm64 with MTE, an access may fault within a page at
a location different from what fault_in_writeable() probed. Since the
sk_offset is rewound to the previous struct btrfs_ioctl_search_header
boundary, there is no guaranteed forward progress and search_ioctl() may
live-lock.

Request a 'min_size' of (*buf_size - sk_offset) from
fault_in_writeable() to check this range for sub-page faults.

Fixes: a48b73eca4ce ("btrfs: fix potential deadlock in the search ioctl")
Signed-off-by: Catalin Marinas <[email protected]>
Reported-by: Al Viro <[email protected]>
---
fs/btrfs/ioctl.c | 7 ++++++-
1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index c7d74c8776a1..439cf38f320a 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -2222,8 +2222,13 @@ static noinline int search_ioctl(struct inode *inode,
key.offset = sk->min_offset;

while (1) {
+ size_t len = *buf_size - sk_offset;
ret = -EFAULT;
- if (fault_in_writeable(ubuf + sk_offset, *buf_size - sk_offset, 0))
+ /*
+ * Ensure that the whole user buffer is faulted in at sub-page
+ * granularity, otherwise the loop may live-lock.
+ */
+ if (fault_in_writeable(ubuf + sk_offset, len, len))
break;

ret = btrfs_search_forward(root, &key, path, sk->min_transid);

2021-12-01 20:30:16

by Mark Rutland

Subject: Re: [PATCH v2 3/4] arm64: Add support for user sub-page fault probing

Hi Catalin,

On Wed, Dec 01, 2021 at 07:37:49PM +0000, Catalin Marinas wrote:
> With MTE, even if the pte allows an access, a mismatched tag somewhere
> within a page can still cause a fault. Select ARCH_HAS_SUBPAGE_FAULTS if
> MTE is enabled and implement the probe_subpage_*() functions. Note that
> get_user() is sufficient for the writeable checks since the same tag
> mismatch fault would be triggered by a read.
>
> Signed-off-by: Catalin Marinas <[email protected]>
> ---
> arch/arm64/Kconfig | 1 +
> arch/arm64/include/asm/uaccess.h | 59 ++++++++++++++++++++++++++++++++
> 2 files changed, 60 insertions(+)
>
> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> index c4207cf9bb17..dff89fd0d817 100644
> --- a/arch/arm64/Kconfig
> +++ b/arch/arm64/Kconfig
> @@ -1777,6 +1777,7 @@ config ARM64_MTE
> depends on AS_HAS_LSE_ATOMICS
> # Required for tag checking in the uaccess routines
> depends on ARM64_PAN
> + select ARCH_HAS_SUBPAGE_FAULTS
> select ARCH_USES_HIGH_VMA_FLAGS
> help
> Memory Tagging (part of the ARMv8.5 Extensions) provides
> diff --git a/arch/arm64/include/asm/uaccess.h b/arch/arm64/include/asm/uaccess.h
> index 6e2e0b7031ab..bcbd24b97917 100644
> --- a/arch/arm64/include/asm/uaccess.h
> +++ b/arch/arm64/include/asm/uaccess.h
> @@ -445,4 +445,63 @@ static inline int __copy_from_user_flushcache(void *dst, const void __user *src,
> }
> #endif
>
> +#ifdef CONFIG_ARCH_HAS_SUBPAGE_FAULTS
> +
> +/*
> + * Return 0 on success, the number of bytes not accessed otherwise.
> + */
> +static inline size_t __mte_probe_user_range(const char __user *uaddr,
> + size_t size, bool skip_first)
> +{
> + const char __user *end = uaddr + size;
> + int err = 0;
> + char val;
> +
> + uaddr = PTR_ALIGN_DOWN(uaddr, MTE_GRANULE_SIZE);
> + if (skip_first)
> + uaddr += MTE_GRANULE_SIZE;

Do we need the skipping for a functional reason, or is that an optimization?

From the comments in probe_subpage_writeable() and
probe_subpage_safe_writeable() I wasn't sure if the skipping was because we
*don't need to* check the first granule, or because we *must not* check the
first granule.

> + while (uaddr < end) {
> + /*
> + * A read is sufficient for MTE, the caller should have probed
> + * for the pte write permission if required.
> + */
> + __raw_get_user(val, uaddr, err);
> + if (err)
> + return end - uaddr;
> + uaddr += MTE_GRANULE_SIZE;
> + }

I think we may need to account for the residue from PTR_ALIGN_DOWN(), or we can
report more bytes not copied than was passed in `size` in the first place,
which I think might confuse some callers.

Consider MTE_GRANULE_SIZE is 16, uaddr is 31, and size is 1 (so end is 32). We
align uaddr down to 16, and if we fail the first access we return (32 - 16),
i.e. 16.

Thanks,
Mark.

> + (void)val;
> +
> + return 0;
> +}
> +
> +static inline size_t probe_subpage_writeable(const void __user *uaddr,
> + size_t size)
> +{
> + if (!system_supports_mte())
> + return 0;
> + /* first put_user() done in the caller */
> + return __mte_probe_user_range(uaddr, size, true);
> +}
> +
> +static inline size_t probe_subpage_safe_writeable(const void __user *uaddr,
> + size_t size)
> +{
> + if (!system_supports_mte())
> + return 0;
> + /* the caller used GUP, don't skip the first granule */
> + return __mte_probe_user_range(uaddr, size, false);
> +}
> +
> +static inline size_t probe_subpage_readable(const void __user *uaddr,
> + size_t size)
> +{
> + if (!system_supports_mte())
> + return 0;
> + /* first get_user() done in the caller */
> + return __mte_probe_user_range(uaddr, size, true);
> +}
> +
> +#endif /* CONFIG_ARCH_HAS_SUBPAGE_FAULTS */
> +
> #endif /* __ASM_UACCESS_H */

2021-12-02 16:09:16

by Catalin Marinas

Subject: Re: [PATCH v2 3/4] arm64: Add support for user sub-page fault probing

Hi Mark,

On Wed, Dec 01, 2021 at 08:29:06PM +0000, Mark Rutland wrote:
> On Wed, Dec 01, 2021 at 07:37:49PM +0000, Catalin Marinas wrote:
> > +/*
> > + * Return 0 on success, the number of bytes not accessed otherwise.
> > + */
> > +static inline size_t __mte_probe_user_range(const char __user *uaddr,
> > + size_t size, bool skip_first)
> > +{
> > + const char __user *end = uaddr + size;
> > + int err = 0;
> > + char val;
> > +
> > + uaddr = PTR_ALIGN_DOWN(uaddr, MTE_GRANULE_SIZE);
> > + if (skip_first)
> > + uaddr += MTE_GRANULE_SIZE;
>
> Do we need the skipping for a functional reason, or is that an optimization?

An optimisation, and very likely not a noticeable one. Given that we'd do a
read following an earlier put_user() or get_user(), the cacheline is already
allocated and another load may be nearly as fast as the uaddr increment.

> From the comments in probe_subpage_writeable() and
> probe_subpage_safe_writeable() I wasn't sure if the skipping was because we
> *don't need to* check the first granule, or because we *must not* check the
> first granule.

The "don't need to" part. But thinking about this, I'll just drop it as
it's confusing.

> > + while (uaddr < end) {
> > + /*
> > + * A read is sufficient for MTE, the caller should have probed
> > + * for the pte write permission if required.
> > + */
> > + __raw_get_user(val, uaddr, err);
> > + if (err)
> > + return end - uaddr;
> > + uaddr += MTE_GRANULE_SIZE;
> > + }
>
> I think we may need to account for the residue from PTR_ALIGN_DOWN(), or we can
> report more bytes not copied than was passed in `size` in the first place,
> which I think might confused some callers.
>
> Consider MTE_GRANULE_SIZE is 16, uaddr is 31, and size is 1 (so end is 32). We
> align uaddr down to 16, and if we fail the first access we return (32 - 16),
> i.e. 16.

Good point. This is fine if we skip the first byte but not otherwise.
Planning to fold in this diff:

diff --git a/arch/arm64/include/asm/uaccess.h b/arch/arm64/include/asm/uaccess.h
index bcbd24b97917..213b30841beb 100644
--- a/arch/arm64/include/asm/uaccess.h
+++ b/arch/arm64/include/asm/uaccess.h
@@ -451,15 +451,17 @@ static inline int __copy_from_user_flushcache(void *dst, const void __user *src,
* Return 0 on success, the number of bytes not accessed otherwise.
*/
static inline size_t __mte_probe_user_range(const char __user *uaddr,
- size_t size, bool skip_first)
+ size_t size)
{
const char __user *end = uaddr + size;
int err = 0;
char val;

- uaddr = PTR_ALIGN_DOWN(uaddr, MTE_GRANULE_SIZE);
- if (skip_first)
- uaddr += MTE_GRANULE_SIZE;
+ __raw_get_user(val, uaddr, err);
+ if (err)
+ return size;
+
+ uaddr = PTR_ALIGN(uaddr, MTE_GRANULE_SIZE);
while (uaddr < end) {
/*
* A read is sufficient for MTE, the caller should have probed
@@ -480,8 +482,7 @@ static inline size_t probe_subpage_writeable(const void __user *uaddr,
{
if (!system_supports_mte())
return 0;
- /* first put_user() done in the caller */
- return __mte_probe_user_range(uaddr, size, true);
+ return __mte_probe_user_range(uaddr, size);
}

static inline size_t probe_subpage_safe_writeable(const void __user *uaddr,
@@ -489,8 +490,7 @@ static inline size_t probe_subpage_safe_writeable(const void __user *uaddr,
{
if (!system_supports_mte())
return 0;
- /* the caller used GUP, don't skip the first granule */
- return __mte_probe_user_range(uaddr, size, false);
+ return __mte_probe_user_range(uaddr, size);
}

static inline size_t probe_subpage_readable(const void __user *uaddr,
@@ -498,8 +498,7 @@ static inline size_t probe_subpage_readable(const void __user *uaddr,
{
if (!system_supports_mte())
return 0;
- /* first get_user() done in the caller */
- return __mte_probe_user_range(uaddr, size, true);
+ return __mte_probe_user_range(uaddr, size);
}

#endif /* CONFIG_ARCH_HAS_SUBPAGE_FAULTS */

--
Catalin

2021-12-03 15:29:37

by Andreas Gruenbacher

Subject: Re: [PATCH v2 0/4] Avoid live-lock in fault-in+uaccess loops with sub-page faults

Catalin,

On Wed, Dec 1, 2021 at 8:38 PM Catalin Marinas <[email protected]> wrote:
>
> Hi,
>
> Following the discussions on the first series,
>
> https://lore.kernel.org/r/[email protected]
>
> this new patchset aims to generalise the sub-page probing and introduce
> a minimum size to the fault_in_*() functions. I called this 'v2' but I
> can rebase it on top of v1 and keep v1 as a btrfs live-lock
> back-portable fix.

that's what I was actually expecting, an updated patch series that
changes the btrfs code to keep track of the user-copy fault address,
the corresponding changes to the fault_in functions to call the
appropriate arch functions, and the arch functions that probe far
enough from the fault address to prevent deadlocks. In this step, how
far the arch functions need to probe depends on the fault windows of
the user-copy functions.

A next step (as a patch series on top) would be to make sure direct
I/O also takes sub-page faults into account. That seems to require
probing the entire address range before the actual copying. A concern
I have about this is time-of-check versus time-of-use: what if
sub-page faults are added after the probing but before the copying?
Other than that, an approach like adding min_size parameters might
work except that maybe we can find a better name. Also, in order not
to make things even more messy, the fault_in functions should probably
continue to report how much of the address range they've failed to
fault in. Callers can then check for themselves whether the function
could fault in min_size bytes or not.

> The fault_in_*() API improvements would be a new
> series. Anyway, I'd first like to know whether this is heading in the
> right direction and whether it's worth adding min_size to all
> fault_in_*() (more below).
>
> v2 adds a 'min_size' argument to all fault_in_*() functions with current
> callers passing 0 (or we could make it 1). A probe_subpage_*() call is
> made for the min_size range, though with all 0 this wouldn't have any
> effect. The only difference is btrfs search_ioctl() in the last patch
> which passes a non-zero min_size to avoid the live-lock (functionally
> that's the same as the v1 series).

In the btrfs case, the copying will already trigger sub-page faults;
we only need to make sure that the next fault-in attempt happens at
the fault address. (And that the fault_in functions take the user-copy
fuzz into account, which we also need for byte granularity copying
anyway.) Otherwise, we're creating the same time-of-check versus
time-of-use disparity as for direct-IO here, unnecessarily.

> In terms of sub-page probing, I don't think with the current kernel
> anything other than search_ioctl() matters. The buffered file I/O can
> already cope with current fault_in_*() + copy_*_user() loops (the
> uaccess makes progress). Direct I/O either goes via GUP + kernel mapping
> access (and memcpy() can't fault) or, if the user buffer is not PAGE
> aligned, it may fall back to buffered I/O. So we really only care about
> fault_in_writeable(), as in v1.

Yes from a regression point of view, but note that direct I/O still
circumvents the sub-page fault checking, which seems to defeat the
whole point.

> Linus suggested that we could use the min_size to request a minimum
> guaranteed probed size (in most cases this would be 1) and put a cap on
> the faulted-in size, say two pages. All the fault_in_iov_iter_*()
> callers will need to check the actual quantity returned by fault_in_*()
> rather than bail out on non-zero but Andreas has a patch already (though
> I think there are a few cases in btrfs etc.):
>
> https://lore.kernel.org/r/[email protected]
>
> With these callers fixed, we could add something like the diff below.
> But, again, min_size doesn't actually have any current use in the kernel
> other than fault_in_writeable() and search_ioctl().

We're trying pretty hard to handle large I/O requests efficiently at
the filesystem level. A small, static upper limit in the fault-in
functions has the potential to ruin those efforts. So I'm not a fan of
that.

> Thanks for having a look. Suggestions welcomed.
>
> ------------------8<-------------------------------
> diff --git a/mm/gup.c b/mm/gup.c
> index 7fa69b0fb859..3aa88aa8ce9d 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -1658,6 +1658,8 @@ static long __get_user_pages_locked(struct mm_struct *mm, unsigned long start,
> }
> #endif /* !CONFIG_MMU */
>
> +#define MAX_FAULT_IN_SIZE (2 * PAGE_SIZE)
> +
> /**
> * fault_in_writeable - fault in userspace address range for writing
> * @uaddr: start of address range
> @@ -1671,6 +1673,7 @@ size_t fault_in_writeable(char __user *uaddr, size_t size, size_t min_size)
> {
> char __user *start = uaddr, *end;
> size_t faulted_in = size;
> + size_t max_size = max_t(size_t, MAX_FAULT_IN_SIZE, min_size);
>
> if (unlikely(size == 0))
> return 0;
> @@ -1679,7 +1682,7 @@ size_t fault_in_writeable(char __user *uaddr, size_t size, size_t min_size)
> return size;
> uaddr = (char __user *)PAGE_ALIGN((unsigned long)uaddr);
> }
> - end = (char __user *)PAGE_ALIGN((unsigned long)start + size);
> + end = (char __user *)PAGE_ALIGN((unsigned long)start + max_size);
> if (unlikely(end < start))
> end = NULL;
> while (uaddr != end) {
> @@ -1726,9 +1729,10 @@ size_t fault_in_safe_writeable(const char __user *uaddr, size_t size,
> struct vm_area_struct *vma = NULL;
> int locked = 0;
> size_t faulted_in = size;
> + size_t max_size = max_t(size_t, MAX_FAULT_IN_SIZE, min_size);
>
> nstart = start & PAGE_MASK;
> - end = PAGE_ALIGN(start + size);
> + end = PAGE_ALIGN(start + max_size);
> if (end < nstart)
> end = 0;
> for (; nstart != end; nstart = nend) {
> @@ -1759,7 +1763,7 @@ size_t fault_in_safe_writeable(const char __user *uaddr, size_t size,
> if (locked)
> mmap_read_unlock(mm);
> if (nstart != end)
> - faulted_in = min_t(size_t, nstart - start, size);
> + faulted_in = min_t(size_t, nstart - start, max_size);
> if (faulted_in < min_size ||
> (min_size && probe_subpage_safe_writeable(uaddr, min_size)))
> return size;
> @@ -1782,6 +1786,7 @@ size_t fault_in_readable(const char __user *uaddr, size_t size,
> const char __user *start = uaddr, *end;
> volatile char c;
> size_t faulted_in = size;
> + size_t max_size = max_t(size_t, MAX_FAULT_IN_SIZE, min_size);
>
> if (unlikely(size == 0))
> return 0;
> @@ -1790,7 +1795,7 @@ size_t fault_in_readable(const char __user *uaddr, size_t size,
> return size;
> uaddr = (const char __user *)PAGE_ALIGN((unsigned long)uaddr);
> }
> - end = (const char __user *)PAGE_ALIGN((unsigned long)start + size);
> + end = (const char __user *)PAGE_ALIGN((unsigned long)start + max_size);
> if (unlikely(end < start))
> end = NULL;
> while (uaddr != end) {
> ------------------8<-------------------------------
>
> Catalin Marinas (4):
> mm: Introduce a 'min_size' argument to fault_in_*()
> mm: Probe for sub-page faults in fault_in_*()
> arm64: Add support for user sub-page fault probing
> btrfs: Avoid live-lock in search_ioctl() on hardware with sub-page
> faults
>
> arch/Kconfig | 7 ++++
> arch/arm64/Kconfig | 1 +
> arch/arm64/include/asm/uaccess.h | 59 +++++++++++++++++++++++++++++
> arch/powerpc/kernel/kvm.c | 2 +-
> arch/powerpc/kernel/signal_32.c | 4 +-
> arch/powerpc/kernel/signal_64.c | 2 +-
> arch/x86/kernel/fpu/signal.c | 2 +-
> drivers/gpu/drm/armada/armada_gem.c | 2 +-
> fs/btrfs/file.c | 6 +--
> fs/btrfs/ioctl.c | 7 +++-
> fs/f2fs/file.c | 2 +-
> fs/fuse/file.c | 2 +-
> fs/gfs2/file.c | 8 ++--
> fs/iomap/buffered-io.c | 2 +-
> fs/ntfs/file.c | 2 +-
> fs/ntfs3/file.c | 2 +-
> include/linux/pagemap.h | 8 ++--
> include/linux/uaccess.h | 53 ++++++++++++++++++++++++++
> include/linux/uio.h | 6 ++-
> lib/iov_iter.c | 28 +++++++++++---
> mm/filemap.c | 2 +-
> mm/gup.c | 37 +++++++++++++-----
> 22 files changed, 203 insertions(+), 41 deletions(-)
>

Thanks,
Andreas


2021-12-03 17:58:21

by Linus Torvalds

Subject: Re: [PATCH v2 0/4] Avoid live-lock in fault-in+uaccess loops with sub-page faults

On Fri, Dec 3, 2021 at 7:29 AM Andreas Gruenbacher <[email protected]> wrote:
>
>
> We're trying pretty hard to handle large I/O requests efficiently at
> the filesystem level. A small, static upper limit in the fault-in
> functions has the potential to ruin those efforts. So I'm not a fan of
> that.

I don't think fault-in should happen under any sane normal circumstances.

Except for low-memory situations, and then you don't want to fault in
large areas.

Do you really expect to write big areas that the user has never even
touched? That would be literally insane.

And if the user _has_ touched them, then they'll be in-core. Except
for the "swapped out" case.

End result: this is purely a correctness issue, not a performance issue.

Linus

2021-12-03 18:12:15

by Andreas Gruenbacher

Subject: Re: [PATCH v2 0/4] Avoid live-lock in fault-in+uaccess loops with sub-page faults

On Fri, Dec 3, 2021 at 6:58 PM Linus Torvalds
<[email protected]> wrote:
> On Fri, Dec 3, 2021 at 7:29 AM Andreas Gruenbacher <[email protected]> wrote:
> > We're trying pretty hard to handle large I/O requests efficiently at
> > the filesystem level. A small, static upper limit in the fault-in
> > functions has the potential to ruin those efforts. So I'm not a fan of
> > that.
>
> I don't think fault-in should happen under any sane normal circumstances.
>
> Except for low-memory situations, and then you don't want to fault in
> large areas.
>
> Do you really expect to write big areas that the user has never even
> touched? That would be literally insane.
>
> And if the user _has_ touched them, then they'll in in-core. Except
> for the "swapped out" case.
>
> End result: this is purely a correctness issue, not a performance issue.

It happens when you mmap a file and write the mmapped region to
another file, for example. I don't think we want to make filesystems
go bonkers in such scenarios. Scaling down in response to memory
pressure sounds perfectly fine though.

Thanks,
Andreas


2021-12-03 18:25:40

by Linus Torvalds

Subject: Re: [PATCH v2 0/4] Avoid live-lock in fault-in+uaccess loops with sub-page faults

On Fri, Dec 3, 2021 at 10:12 AM Andreas Gruenbacher <[email protected]> wrote:
>
> It happens when you mmap a file and write the mmapped region to
> another file, for example.

Do you actually have such loads? Nobody should use mmap() for
single-access file copy purposes. It's slower than just doing the copy
exactly due to page fault overhead.

In other words, you seem to be worrying about the performance of a
load that is _explicitly_ badly written. You should be fixing the
application, not making the kernel do stupid things.

Also, it's worth noting that that situation should be caught by the
page-in code, which will map multiple pages in one go
(do_fault_around() - for when the pages are cached), and do the
readahead logic (filemap_fault() - for when the pages aren't in the
page cache).

Both of which are a lot more important than the "synchronously fault
in pages one at a time".

Linus

2021-12-03 19:51:55

by Catalin Marinas

Subject: Re: [PATCH v2 0/4] Avoid live-lock in fault-in+uaccess loops with sub-page faults

Hi Andreas,

On Fri, Dec 03, 2021 at 04:29:18PM +0100, Andreas Gruenbacher wrote:
> On Wed, Dec 1, 2021 at 8:38 PM Catalin Marinas <[email protected]> wrote:
> > Following the discussions on the first series,
> >
> > https://lore.kernel.org/r/[email protected]
> >
> > this new patchset aims to generalise the sub-page probing and introduce
> > a minimum size to the fault_in_*() functions. I called this 'v2' but I
> > can rebase it on top of v1 and keep v1 as a btrfs live-lock
> > back-portable fix.
>
> that's what I was actually expecting, an updated patch series that
> changes the btrfs code to keep track of the user-copy fault address,
> the corresponding changes to the fault_in functions to call the
> appropriate arch functions, and the arch functions that probe far
> enough from the fault address to prevent deadlocks. In this step, how
> far the arch functions need to probe depends on the fault windows of
> the user-copy functions.

I have that series as well, see the top patch here (well, you've seen it
already):

https://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux.git/log/?h=devel/btrfs-live-lock-fix

But I'm not convinced it's worth it if we go for the approach in v2
here. A key difference between v2 and the above branch is that
probe_subpage_writeable() checks exactly what is given (min_size) in v2
while in the devel/btrfs-live-lock-fix branch it can be given a
PAGE_SIZE or more but only checks the beginning 16 bytes to cover the
copy_to_user() error margin. The latter assumes that the caller will
always attempt the fault_in() from where the uaccess failed rather than
relying on the fault_in() itself to avoid the live-lock.
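
As a rough sketch of the latter (paraphrased from the description above,
not the actual code in that branch), the arm64 probe would look
something like:

	/* hypothetical constant: only cover the uaccess error margin */
	#define SUBPAGE_PROBE_SIZE	16	/* MTE_GRANULE_SIZE */

	static inline size_t probe_subpage_writeable(const void __user *uaddr,
						     size_t size)
	{
		if (!system_supports_mte())
			return 0;
		/* check only the first granule(s) from the fault address */
		return __mte_probe_user_range(uaddr,
					      min_t(size_t, size, SUBPAGE_PROBE_SIZE));
	}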

The v1 posted earlier also checks the full range, but only in
fault_in_writeable(), which seems to be relevant only for btrfs in the
arm64 case.

Maybe I should post the other series as an alternative, get some input
on it.

> A next step (as a patch series on top) would be to make sure direct
> I/O also takes sub-page faults into account. That seems to require
> probing the entire address range before the actual copying. A concern
> I have about this is time-of-check versus time-of-use: what if
> sub-page faults are added after the probing but before the copying?

With direct I/O (that doesn't fall back to buffered), the access is done
via the kernel mapping following a get_user_pages(). Since the access
here cannot cope with exceptions, it must be unchecked. Yes, we do have
the time-of-use vs check problem but I'm not worried. I regard MTE as a
probabilistic security feature. Things get even murkier if the I/O is
done by some DMA engine which ignores tags anyway.

CHERI, OTOH, is a lot more strict but there is no check vs use issue
here since all permissions are encoded in the pointer itself (we might
just expand access_ok() to take this into account).

We could use the min_size logic for this, I think: in functions like
gfs2_file_direct_read() you'd have to fault in first and then invoke
iomap_dio_rw().

Anyway, like I said before, I'd leave the MTE accesses for direct I/O
unchecked as they currently are, I don't think it's worth the effort and
the potential slow-down (it will be significant).

> Other than that, an approach like adding min_size parameters might
> work except that maybe we can find a better name. Also, in order not
> to make things even more messy, the fault_in functions should probably
> continue to report how much of the address range they've failed to
> fault in. Callers can then check for themselves whether the function
> could fault in min_size bytes or not.

That's fine as well. I did it this way because I found the logic easier
to write.

> > The fault_in_*() API improvements would be a new
> > series. Anyway, I'd first like to know whether this is heading in the
> > right direction and whether it's worth adding min_size to all
> > fault_in_*() (more below).
> >
> > v2 adds a 'min_size' argument to all fault_in_*() functions with current
> > callers passing 0 (or we could make it 1). A probe_subpage_*() call is
> > made for the min_size range, though with all 0 this wouldn't have any
> > effect. The only difference is btrfs search_ioctl() in the last patch
> > which passes a non-zero min_size to avoid the live-lock (functionally
> > that's the same as the v1 series).
>
> In the btrfs case, the copying will already trigger sub-page faults;
> we only need to make sure that the next fault-in attempt happens at
> the fault address. (And that the fault_in functions take the user-copy
> fuzz into account, which we also need for byte granularity copying
> anyway.) Otherwise, we're creating the same time-of-check versus
> time-of-use disparity as for direct-IO here, unnecessarily.

I don't think it matters for btrfs. In some way, you'd have the time of
check vs use problem even if you fault in from where uaccess failed.
It's just that in practice it's impossible to live-lock as it needs very
precise synchronisation to change the tags from another CPU. But you do
guarantee that the uaccess was correct.

> > In terms of sub-page probing, I don't think with the current kernel
> > anything other than search_ioctl() matters. The buffered file I/O can
> > already cope with current fault_in_*() + copy_*_user() loops (the
> > uaccess makes progress). Direct I/O either goes via GUP + kernel mapping
> > access (and memcpy() can't fault) or, if the user buffer is not PAGE
> > aligned, it may fall back to buffered I/O. So we really only care about
> > fault_in_writeable(), as in v1.
>
> Yes from a regression point of view, but note that direct I/O still
> circumvents the sub-page fault checking, which seems to defeat the
> whole point.

It doesn't entirely defeat it. From my perspective MTE is more of a best
effort to find use-after-free etc. bugs. It has a performance penalty
and I wouldn't want to make it worse. Some libc allocators even go for
untagged memory (unchecked) if the required size is over some threshold
(usually when it falls back to multiple page allocations). That's more
likely to be involved in direct I/O anyway, so the additional check in
fault_in() won't matter.

> > Linus suggested that we could use the min_size to request a minimum
> > guaranteed probed size (in most cases this would be 1) and put a cap on
> > the faulted-in size, say two pages. All the fault_in_iov_iter_*()
> > callers will need to check the actual quantity returned by fault_in_*()
> > rather than bail out on non-zero but Andreas has a patch already (though
> > I think there are a few cases in btrfs etc.):
> >
> > https://lore.kernel.org/r/[email protected]
> >
> > With these callers fixed, we could add something like the diff below.
> > But, again, min_size doesn't actually have any current use in the kernel
> > other than fault_in_writeable() and search_ioctl().
>
> We're trying pretty hard to handle large I/O requests efficiently at
> the filesystem level. A small, static upper limit in the fault-in
> functions has the potential to ruin those efforts. So I'm not a fan of
> that.

I can't comment on this, I haven't spent much time in fs land. But I did
notice that generic_perform_write(), for example, already limits the
fault-in to PAGE_SIZE, so this potential min_size optimisation wouldn't
make any difference there.
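
For reference, the relevant part of generic_perform_write() (paraphrased,
with the patch 1 signature applied):

	offset = (pos & (PAGE_SIZE - 1));
	bytes = min_t(unsigned long, PAGE_SIZE - offset,
		      iov_iter_count(i));
	...
	/* bytes is already capped at one page, so the fault-in is too */
	if (unlikely(fault_in_iov_iter_readable(i, bytes, 0))) {
		status = -EFAULT;
		break;
	}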

--
Catalin