2021-08-27 16:51:54

by Andreas Gruenbacher

Subject: [PATCH v7 00/19] gfs2: Fix mmap + page fault deadlocks

Hi all,

here's another update on top of v5.14-rc7. Changes:

* Some of the patch descriptions have been improved.

* Patch "gfs2: Eliminate ip->i_gh" has been moved further to the front.

At this point, I'm not aware of anything that still needs fixing.


The first two patches are independent of the core of this patch queue
and I've asked the respective maintainers to have a look, but I've not
heard back from them. The first patch should just go into Al's tree;
it's a relatively straightforward fix. The second patch really needs
to be looked at; it might break things:

iov_iter: Fix iov_iter_get_pages{,_alloc} page fault return value
powerpc/kvm: Fix kvm_use_magic_page


Al and Linus seem to disagree about the error reporting semantics that
the functions fault_in_{readable,writeable} and
fault_in_iov_iter_{readable,writeable} should have. I've implemented
Linus's suggestion of returning the number of bytes not faulted in, and
I think that being able to tell whether "nothing", "something", or
"everything" could be faulted in does help, but I'll live with anything
that allows us to make progress.
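
For illustration, here is a minimal sketch (not part of any patch) of
how a caller can tell the three cases apart with that convention; only
fault_in_writeable()'s signature comes from this series, the wrapper
function is made up:

/* Hypothetical caller, for illustration only. */
static ssize_t example_fault_in(char __user *uaddr, size_t size)
{
	size_t left = fault_in_writeable(uaddr, size);

	if (left == 0)
		return size;		/* "everything" was faulted in */
	if (left < size)
		return size - left;	/* "something": a usable prefix */
	return -EFAULT;			/* "nothing" could be faulted in */
}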


The iomap changes should ideally be reviewed by Christoph; I've not
heard from him about those.


Thanks,
Andreas

Andreas Gruenbacher (16):
iov_iter: Fix iov_iter_get_pages{,_alloc} page fault return value
powerpc/kvm: Fix kvm_use_magic_page
gup: Turn fault_in_pages_{readable,writeable} into
fault_in_{readable,writeable}
iov_iter: Turn iov_iter_fault_in_readable into
fault_in_iov_iter_readable
iov_iter: Introduce fault_in_iov_iter_writeable
gfs2: Add wrapper for iomap_file_buffered_write
gfs2: Clean up function may_grant
gfs2: Move the inode glock locking to gfs2_file_buffered_write
gfs2: Eliminate ip->i_gh
gfs2: Fix mmap + page fault deadlocks for buffered I/O
iomap: Fix iomap_dio_rw return value for user copies
iomap: Support partial direct I/O on user copy failures
iomap: Add done_before argument to iomap_dio_rw
gup: Introduce FOLL_NOFAULT flag to disable page faults
iov_iter: Introduce nofault flag to disable page faults
gfs2: Fix mmap + page fault deadlocks for direct I/O

Bob Peterson (3):
gfs2: Eliminate vestigial HIF_FIRST
gfs2: Remove redundant check from gfs2_glock_dq
gfs2: Introduce flag for glock holder auto-demotion

arch/powerpc/kernel/kvm.c | 3 +-
arch/powerpc/kernel/signal_32.c | 4 +-
arch/powerpc/kernel/signal_64.c | 2 +-
arch/x86/kernel/fpu/signal.c | 7 +-
drivers/gpu/drm/armada/armada_gem.c | 7 +-
fs/btrfs/file.c | 7 +-
fs/btrfs/ioctl.c | 5 +-
fs/ext4/file.c | 5 +-
fs/f2fs/file.c | 2 +-
fs/fuse/file.c | 2 +-
fs/gfs2/bmap.c | 60 +----
fs/gfs2/file.c | 245 ++++++++++++++++++--
fs/gfs2/glock.c | 340 +++++++++++++++++++++-------
fs/gfs2/glock.h | 20 ++
fs/gfs2/incore.h | 5 +-
fs/iomap/buffered-io.c | 2 +-
fs/iomap/direct-io.c | 21 +-
fs/ntfs/file.c | 2 +-
fs/xfs/xfs_file.c | 6 +-
fs/zonefs/super.c | 4 +-
include/linux/iomap.h | 11 +-
include/linux/mm.h | 3 +-
include/linux/pagemap.h | 58 +----
include/linux/uio.h | 4 +-
lib/iov_iter.c | 103 +++++++--
mm/filemap.c | 4 +-
mm/gup.c | 139 +++++++++++-
27 files changed, 785 insertions(+), 286 deletions(-)

--
2.26.3


2021-08-27 16:52:00

by Andreas Gruenbacher

Subject: [PATCH v7 07/19] gfs2: Clean up function may_grant

Pass the first current glock holder into function may_grant and
deobfuscate the logic there.

We're now using function find_first_holder in do_promote, so move the
function's definition above do_promote.

Signed-off-by: Andreas Gruenbacher <[email protected]>
---
fs/gfs2/glock.c | 120 ++++++++++++++++++++++++++++--------------------
1 file changed, 70 insertions(+), 50 deletions(-)

diff --git a/fs/gfs2/glock.c b/fs/gfs2/glock.c
index 1f3902ecdded..545b435f55ea 100644
--- a/fs/gfs2/glock.c
+++ b/fs/gfs2/glock.c
@@ -301,46 +301,59 @@ void gfs2_glock_put(struct gfs2_glock *gl)
}

/**
- * may_grant - check if its ok to grant a new lock
+ * may_grant - check if it's ok to grant a new lock
* @gl: The glock
+ * @current_gh: One of the current holders of @gl
* @gh: The lock request which we wish to grant
*
- * Returns: true if its ok to grant the lock
+ * With our current compatibility rules, if a glock has one or more active
+ * holders (HIF_HOLDER flag set), any of those holders can be passed in as
+ * @current_gh; they are all the same as far as compatibility with the new @gh
+ * goes.
+ *
+ * Returns true if it's ok to grant the lock.
*/

-static inline int may_grant(const struct gfs2_glock *gl, const struct gfs2_holder *gh)
-{
- const struct gfs2_holder *gh_head = list_first_entry(&gl->gl_holders, const struct gfs2_holder, gh_list);
+static inline bool may_grant(const struct gfs2_glock *gl,
+ const struct gfs2_holder *current_gh,
+ const struct gfs2_holder *gh)
+{
+ if (current_gh) {
+ BUG_ON(!test_bit(HIF_HOLDER, &current_gh->gh_iflags));
+
+ switch(current_gh->gh_state) {
+ case LM_ST_EXCLUSIVE:
+ /*
+ * Here we make a special exception to grant holders
+ * who agree to share the EX lock with other holders
+ * who also have the bit set. If the original holder
+ * has the LM_FLAG_NODE_SCOPE bit set, we grant more
+ * holders with the bit set.
+ */
+ return gh->gh_state == LM_ST_EXCLUSIVE &&
+ (current_gh->gh_flags & LM_FLAG_NODE_SCOPE) &&
+ (gh->gh_flags & LM_FLAG_NODE_SCOPE);

- if (gh != gh_head) {
- /**
- * Here we make a special exception to grant holders who agree
- * to share the EX lock with other holders who also have the
- * bit set. If the original holder has the LM_FLAG_NODE_SCOPE bit
- * is set, we grant more holders with the bit set.
- */
- if (gh_head->gh_state == LM_ST_EXCLUSIVE &&
- (gh_head->gh_flags & LM_FLAG_NODE_SCOPE) &&
- gh->gh_state == LM_ST_EXCLUSIVE &&
- (gh->gh_flags & LM_FLAG_NODE_SCOPE))
- return 1;
- if ((gh->gh_state == LM_ST_EXCLUSIVE ||
- gh_head->gh_state == LM_ST_EXCLUSIVE))
- return 0;
+ case LM_ST_SHARED:
+ case LM_ST_DEFERRED:
+ return gh->gh_state == current_gh->gh_state;
+
+ default:
+ return false;
+ }
}
+
if (gl->gl_state == gh->gh_state)
- return 1;
+ return true;
if (gh->gh_flags & GL_EXACT)
- return 0;
+ return false;
if (gl->gl_state == LM_ST_EXCLUSIVE) {
- if (gh->gh_state == LM_ST_SHARED && gh_head->gh_state == LM_ST_SHARED)
- return 1;
- if (gh->gh_state == LM_ST_DEFERRED && gh_head->gh_state == LM_ST_DEFERRED)
- return 1;
+ return gh->gh_state == LM_ST_SHARED ||
+ gh->gh_state == LM_ST_DEFERRED;
}
- if (gl->gl_state != LM_ST_UNLOCKED && (gh->gh_flags & LM_FLAG_ANY))
- return 1;
- return 0;
+ if (gh->gh_flags & LM_FLAG_ANY)
+ return gl->gl_state != LM_ST_UNLOCKED;
+ return false;
}

static void gfs2_holder_wake(struct gfs2_holder *gh)
@@ -380,6 +393,24 @@ static void do_error(struct gfs2_glock *gl, const int ret)
}
}

+/**
+ * find_first_holder - find the first "holder" gh
+ * @gl: the glock
+ */
+
+static inline struct gfs2_holder *find_first_holder(const struct gfs2_glock *gl)
+{
+ struct gfs2_holder *gh;
+
+ if (!list_empty(&gl->gl_holders)) {
+ gh = list_first_entry(&gl->gl_holders, struct gfs2_holder,
+ gh_list);
+ if (test_bit(HIF_HOLDER, &gh->gh_iflags))
+ return gh;
+ }
+ return NULL;
+}
+
/**
* do_promote - promote as many requests as possible on the current queue
* @gl: The glock
@@ -393,14 +424,16 @@ __releases(&gl->gl_lockref.lock)
__acquires(&gl->gl_lockref.lock)
{
const struct gfs2_glock_operations *glops = gl->gl_ops;
- struct gfs2_holder *gh, *tmp;
+ struct gfs2_holder *gh, *tmp, *first_gh;
int ret;

+ first_gh = find_first_holder(gl);
+
restart:
list_for_each_entry_safe(gh, tmp, &gl->gl_holders, gh_list) {
if (test_bit(HIF_HOLDER, &gh->gh_iflags))
continue;
- if (may_grant(gl, gh)) {
+ if (may_grant(gl, first_gh, gh)) {
if (gh->gh_list.prev == &gl->gl_holders &&
glops->go_lock) {
spin_unlock(&gl->gl_lockref.lock);
@@ -722,23 +755,6 @@ __acquires(&gl->gl_lockref.lock)
spin_lock(&gl->gl_lockref.lock);
}

-/**
- * find_first_holder - find the first "holder" gh
- * @gl: the glock
- */
-
-static inline struct gfs2_holder *find_first_holder(const struct gfs2_glock *gl)
-{
- struct gfs2_holder *gh;
-
- if (!list_empty(&gl->gl_holders)) {
- gh = list_first_entry(&gl->gl_holders, struct gfs2_holder, gh_list);
- if (test_bit(HIF_HOLDER, &gh->gh_iflags))
- return gh;
- }
- return NULL;
-}
-
/**
* run_queue - do all outstanding tasks related to a glock
* @gl: The glock in question
@@ -1354,8 +1370,12 @@ __acquires(&gl->gl_lockref.lock)
GLOCK_BUG_ON(gl, true);

if (gh->gh_flags & (LM_FLAG_TRY | LM_FLAG_TRY_1CB)) {
- if (test_bit(GLF_LOCK, &gl->gl_flags))
- try_futile = !may_grant(gl, gh);
+ if (test_bit(GLF_LOCK, &gl->gl_flags)) {
+ struct gfs2_holder *first_gh;
+
+ first_gh = find_first_holder(gl);
+ try_futile = !may_grant(gl, first_gh, gh);
+ }
if (test_bit(GLF_INVALIDATE_IN_PROGRESS, &gl->gl_flags))
goto fail;
}
--
2.26.3

2021-08-27 16:52:04

by Andreas Gruenbacher

Subject: [PATCH v7 03/19] gup: Turn fault_in_pages_{readable,writeable} into fault_in_{readable,writeable}

Turn fault_in_pages_{readable,writeable} into versions that return the
number of bytes not faulted in (similar to copy_to_user) instead of
returning a non-zero value when any of the requested pages couldn't be
faulted in. This supports the existing users that require all pages to
be faulted in as well as new users that are happy if any pages can be
faulted in at all.

Neither of these functions is entirely trivial and it doesn't seem
useful to inline them, so move them to mm/gup.c.

Rename the functions to fault_in_{readable,writeable} to make sure that
this change doesn't silently break things.
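
For illustration only, the two usage patterns look roughly like this
(the wrapper functions are made up; only the fault_in_readable()
signature comes from this patch):

/* Existing users: any byte that couldn't be faulted in is a failure. */
static int want_it_all(const char __user *buf, size_t len)
{
	if (fault_in_readable(buf, len))
		return -EFAULT;
	return 0;
}

/* New users: accept whatever leading part could be faulted in. */
static ssize_t take_what_we_can(const char __user *buf, size_t len)
{
	size_t left = fault_in_readable(buf, len);

	if (left == len)
		return -EFAULT;
	return len - left;	/* number of bytes now accessible */
}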

Signed-off-by: Andreas Gruenbacher <[email protected]>
---
arch/powerpc/kernel/kvm.c | 3 +-
arch/powerpc/kernel/signal_32.c | 4 +-
arch/powerpc/kernel/signal_64.c | 2 +-
arch/x86/kernel/fpu/signal.c | 7 ++-
drivers/gpu/drm/armada/armada_gem.c | 7 ++-
fs/btrfs/ioctl.c | 5 +-
include/linux/pagemap.h | 57 ++---------------------
lib/iov_iter.c | 10 ++--
mm/filemap.c | 2 +-
mm/gup.c | 72 +++++++++++++++++++++++++++++
10 files changed, 93 insertions(+), 76 deletions(-)

diff --git a/arch/powerpc/kernel/kvm.c b/arch/powerpc/kernel/kvm.c
index d89cf802d9aa..6568823cf306 100644
--- a/arch/powerpc/kernel/kvm.c
+++ b/arch/powerpc/kernel/kvm.c
@@ -669,7 +669,8 @@ static void __init kvm_use_magic_page(void)
on_each_cpu(kvm_map_magic_page, &features, 1);

/* Quick self-test to see if the mapping works */
- if (fault_in_pages_readable((const char *)KVM_MAGIC_PAGE, sizeof(u32))) {
+ if (fault_in_readable((const char __user *)KVM_MAGIC_PAGE,
+ sizeof(u32))) {
kvm_patching_worked = false;
return;
}
diff --git a/arch/powerpc/kernel/signal_32.c b/arch/powerpc/kernel/signal_32.c
index 0608581967f0..38c3eae40c14 100644
--- a/arch/powerpc/kernel/signal_32.c
+++ b/arch/powerpc/kernel/signal_32.c
@@ -1048,7 +1048,7 @@ SYSCALL_DEFINE3(swapcontext, struct ucontext __user *, old_ctx,
if (new_ctx == NULL)
return 0;
if (!access_ok(new_ctx, ctx_size) ||
- fault_in_pages_readable((u8 __user *)new_ctx, ctx_size))
+ fault_in_readable((char __user *)new_ctx, ctx_size))
return -EFAULT;

/*
@@ -1237,7 +1237,7 @@ SYSCALL_DEFINE3(debug_setcontext, struct ucontext __user *, ctx,
#endif

if (!access_ok(ctx, sizeof(*ctx)) ||
- fault_in_pages_readable((u8 __user *)ctx, sizeof(*ctx)))
+ fault_in_readable((char __user *)ctx, sizeof(*ctx)))
return -EFAULT;

/*
diff --git a/arch/powerpc/kernel/signal_64.c b/arch/powerpc/kernel/signal_64.c
index 1831bba0582e..9f471b4a11e3 100644
--- a/arch/powerpc/kernel/signal_64.c
+++ b/arch/powerpc/kernel/signal_64.c
@@ -688,7 +688,7 @@ SYSCALL_DEFINE3(swapcontext, struct ucontext __user *, old_ctx,
if (new_ctx == NULL)
return 0;
if (!access_ok(new_ctx, ctx_size) ||
- fault_in_pages_readable((u8 __user *)new_ctx, ctx_size))
+ fault_in_readable((char __user *)new_ctx, ctx_size))
return -EFAULT;

/*
diff --git a/arch/x86/kernel/fpu/signal.c b/arch/x86/kernel/fpu/signal.c
index 445c57c9c539..ba6bdec81603 100644
--- a/arch/x86/kernel/fpu/signal.c
+++ b/arch/x86/kernel/fpu/signal.c
@@ -205,7 +205,7 @@ int copy_fpstate_to_sigframe(void __user *buf, void __user *buf_fx, int size)
fpregs_unlock();

if (ret) {
- if (!fault_in_pages_writeable(buf_fx, fpu_user_xstate_size))
+ if (!fault_in_writeable(buf_fx, fpu_user_xstate_size))
goto retry;
return -EFAULT;
}
@@ -278,10 +278,9 @@ static int restore_fpregs_from_user(void __user *buf, u64 xrestore,
if (ret != -EFAULT)
return -EINVAL;

- ret = fault_in_pages_readable(buf, size);
- if (!ret)
+ if (!fault_in_readable(buf, size))
goto retry;
- return ret;
+ return -EFAULT;
}

/*
diff --git a/drivers/gpu/drm/armada/armada_gem.c b/drivers/gpu/drm/armada/armada_gem.c
index 21909642ee4c..8fbb25913327 100644
--- a/drivers/gpu/drm/armada/armada_gem.c
+++ b/drivers/gpu/drm/armada/armada_gem.c
@@ -336,7 +336,7 @@ int armada_gem_pwrite_ioctl(struct drm_device *dev, void *data,
struct drm_armada_gem_pwrite *args = data;
struct armada_gem_object *dobj;
char __user *ptr;
- int ret;
+ int ret = 0;

DRM_DEBUG_DRIVER("handle %u off %u size %u ptr 0x%llx\n",
args->handle, args->offset, args->size, args->ptr);
@@ -349,9 +349,8 @@ int armada_gem_pwrite_ioctl(struct drm_device *dev, void *data,
if (!access_ok(ptr, args->size))
return -EFAULT;

- ret = fault_in_pages_readable(ptr, args->size);
- if (ret)
- return ret;
+ if (fault_in_readable(ptr, args->size))
+ return -EFAULT;

dobj = armada_gem_object_lookup(file, args->handle);
if (dobj == NULL)
diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 0ba98e08a029..9233ecc31e2e 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -2244,9 +2244,8 @@ static noinline int search_ioctl(struct inode *inode,
key.offset = sk->min_offset;

while (1) {
- ret = fault_in_pages_writeable(ubuf + sk_offset,
- *buf_size - sk_offset);
- if (ret)
+ ret = -EFAULT;
+ if (fault_in_writeable(ubuf + sk_offset, *buf_size - sk_offset))
break;

ret = btrfs_search_forward(root, &key, path, sk->min_transid);
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index ed02aa522263..7c9edc9694d9 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -734,61 +734,10 @@ int wait_on_page_private_2_killable(struct page *page);
extern void add_page_wait_queue(struct page *page, wait_queue_entry_t *waiter);

/*
- * Fault everything in given userspace address range in.
+ * Fault in userspace address range.
*/
-static inline int fault_in_pages_writeable(char __user *uaddr, int size)
-{
- char __user *end = uaddr + size - 1;
-
- if (unlikely(size == 0))
- return 0;
-
- if (unlikely(uaddr > end))
- return -EFAULT;
- /*
- * Writing zeroes into userspace here is OK, because we know that if
- * the zero gets there, we'll be overwriting it.
- */
- do {
- if (unlikely(__put_user(0, uaddr) != 0))
- return -EFAULT;
- uaddr += PAGE_SIZE;
- } while (uaddr <= end);
-
- /* Check whether the range spilled into the next page. */
- if (((unsigned long)uaddr & PAGE_MASK) ==
- ((unsigned long)end & PAGE_MASK))
- return __put_user(0, end);
-
- return 0;
-}
-
-static inline int fault_in_pages_readable(const char __user *uaddr, int size)
-{
- volatile char c;
- const char __user *end = uaddr + size - 1;
-
- if (unlikely(size == 0))
- return 0;
-
- if (unlikely(uaddr > end))
- return -EFAULT;
-
- do {
- if (unlikely(__get_user(c, uaddr) != 0))
- return -EFAULT;
- uaddr += PAGE_SIZE;
- } while (uaddr <= end);
-
- /* Check whether the range spilled into the next page. */
- if (((unsigned long)uaddr & PAGE_MASK) ==
- ((unsigned long)end & PAGE_MASK)) {
- return __get_user(c, end);
- }
-
- (void)c;
- return 0;
-}
+size_t fault_in_writeable(char __user *uaddr, size_t size);
+size_t fault_in_readable(const char __user *uaddr, size_t size);

int add_to_page_cache_locked(struct page *page, struct address_space *mapping,
pgoff_t index, gfp_t gfp_mask);
diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index 25dfc48536d7..069cedd9d7b4 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -191,7 +191,7 @@ static size_t copy_page_to_iter_iovec(struct page *page, size_t offset, size_t b
buf = iov->iov_base + skip;
copy = min(bytes, iov->iov_len - skip);

- if (IS_ENABLED(CONFIG_HIGHMEM) && !fault_in_pages_writeable(buf, copy)) {
+ if (IS_ENABLED(CONFIG_HIGHMEM) && !fault_in_writeable(buf, copy)) {
kaddr = kmap_atomic(page);
from = kaddr + offset;

@@ -275,7 +275,7 @@ static size_t copy_page_from_iter_iovec(struct page *page, size_t offset, size_t
buf = iov->iov_base + skip;
copy = min(bytes, iov->iov_len - skip);

- if (IS_ENABLED(CONFIG_HIGHMEM) && !fault_in_pages_readable(buf, copy)) {
+ if (IS_ENABLED(CONFIG_HIGHMEM) && !fault_in_readable(buf, copy)) {
kaddr = kmap_atomic(page);
to = kaddr + offset;

@@ -446,13 +446,11 @@ int iov_iter_fault_in_readable(const struct iov_iter *i, size_t bytes)
bytes = i->count;
for (p = i->iov, skip = i->iov_offset; bytes; p++, skip = 0) {
size_t len = min(bytes, p->iov_len - skip);
- int err;

if (unlikely(!len))
continue;
- err = fault_in_pages_readable(p->iov_base + skip, len);
- if (unlikely(err))
- return err;
+ if (fault_in_readable(p->iov_base + skip, len))
+ return -EFAULT;
bytes -= len;
}
}
diff --git a/mm/filemap.c b/mm/filemap.c
index d1458ecf2f51..4dec3bc7752e 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -88,7 +88,7 @@
* ->lock_page (access_process_vm)
*
* ->i_mutex (generic_perform_write)
- * ->mmap_lock (fault_in_pages_readable->do_page_fault)
+ * ->mmap_lock (fault_in_readable->do_page_fault)
*
* bdi->wb.list_lock
* sb_lock (fs/fs-writeback.c)
diff --git a/mm/gup.c b/mm/gup.c
index b94717977d17..0cf47955e5a1 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1672,6 +1672,78 @@ static long __get_user_pages_locked(struct mm_struct *mm, unsigned long start,
}
#endif /* !CONFIG_MMU */

+/**
+ * fault_in_writeable - fault in userspace address range for writing
+ * @uaddr: start of address range
+ * @size: size of address range
+ *
+ * Returns the number of bytes not faulted in (like copy_to_user() and
+ * copy_from_user()).
+ */
+size_t fault_in_writeable(char __user *uaddr, size_t size)
+{
+ char __user *start = uaddr, *end;
+
+ if (unlikely(size == 0))
+ return 0;
+ if (!PAGE_ALIGNED(uaddr)) {
+ if (unlikely(__put_user(0, uaddr) != 0))
+ return size;
+ uaddr = (char __user *)PAGE_ALIGN((unsigned long)uaddr);
+ }
+ end = (char __user *)PAGE_ALIGN((unsigned long)start + size);
+ if (unlikely(end < start))
+ end = NULL;
+ while (uaddr != end) {
+ if (unlikely(__put_user(0, uaddr) != 0))
+ goto out;
+ uaddr += PAGE_SIZE;
+ }
+
+out:
+ if (size > uaddr - start)
+ return size - (uaddr - start);
+ return 0;
+}
+EXPORT_SYMBOL(fault_in_writeable);
+
+/**
+ * fault_in_readable - fault in userspace address range for reading
+ * @uaddr: start of user address range
+ * @size: size of user address range
+ *
+ * Returns the number of bytes not faulted in (like copy_to_user() and
+ * copy_from_user()).
+ */
+size_t fault_in_readable(const char __user *uaddr, size_t size)
+{
+ const char __user *start = uaddr, *end;
+ volatile char c;
+
+ if (unlikely(size == 0))
+ return 0;
+ if (!PAGE_ALIGNED(uaddr)) {
+ if (unlikely(__get_user(c, uaddr) != 0))
+ return size;
+ uaddr = (const char __user *)PAGE_ALIGN((unsigned long)uaddr);
+ }
+ end = (const char __user *)PAGE_ALIGN((unsigned long)start + size);
+ if (unlikely(end < start))
+ end = NULL;
+ while (uaddr != end) {
+ if (unlikely(__get_user(c, uaddr) != 0))
+ goto out;
+ uaddr += PAGE_SIZE;
+ }
+
+out:
+ (void)c;
+ if (size > uaddr - start)
+ return size - (uaddr - start);
+ return 0;
+}
+EXPORT_SYMBOL(fault_in_readable);
+
/**
* get_dump_page() - pin user page in memory while writing it to core dump
* @addr: user address
--
2.26.3

2021-08-27 16:52:16

by Andreas Gruenbacher

Subject: [PATCH v7 04/19] iov_iter: Turn iov_iter_fault_in_readable into fault_in_iov_iter_readable

Turn iov_iter_fault_in_readable into a function that returns the number
of bytes not faulted in (similar to copy_to_user) instead of returning a
non-zero value when any of the requested pages couldn't be faulted in.
This supports the existing users that require all pages to be faulted in
as well as new users that are happy if any pages can be faulted in at
all.

Rename iov_iter_fault_in_readable to fault_in_iov_iter_readable to make
sure that this change doesn't silently break things.
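
As an illustration (a simplified sketch, not taken from any actual
caller), a buffered write loop only has to give up when nothing at all
could be faulted in:

	while (iov_iter_count(i)) {
		size_t bytes = min_t(size_t, iov_iter_count(i), PAGE_SIZE);

		if (fault_in_iov_iter_readable(i, bytes) == bytes)
			return -EFAULT;	/* nothing could be faulted in */

		/*
		 * Copy up to 'bytes' from the iterator into the page
		 * cache; a short copy just goes around the loop again.
		 */
	}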

Signed-off-by: Andreas Gruenbacher <[email protected]>
---
fs/btrfs/file.c | 2 +-
fs/f2fs/file.c | 2 +-
fs/fuse/file.c | 2 +-
fs/iomap/buffered-io.c | 2 +-
fs/ntfs/file.c | 2 +-
include/linux/uio.h | 2 +-
lib/iov_iter.c | 33 +++++++++++++++++++++------------
mm/filemap.c | 2 +-
8 files changed, 28 insertions(+), 19 deletions(-)

diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index ee34497500e1..281c77cfe91a 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1698,7 +1698,7 @@ static noinline ssize_t btrfs_buffered_write(struct kiocb *iocb,
* Fault pages before locking them in prepare_pages
* to avoid recursive lock
*/
- if (unlikely(iov_iter_fault_in_readable(i, write_bytes))) {
+ if (unlikely(fault_in_iov_iter_readable(i, write_bytes))) {
ret = -EFAULT;
break;
}
diff --git a/fs/f2fs/file.c b/fs/f2fs/file.c
index 6afd4562335f..b04b6c909a8b 100644
--- a/fs/f2fs/file.c
+++ b/fs/f2fs/file.c
@@ -4259,7 +4259,7 @@ static ssize_t f2fs_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
size_t target_size = 0;
int err;

- if (iov_iter_fault_in_readable(from, iov_iter_count(from)))
+ if (fault_in_iov_iter_readable(from, iov_iter_count(from)))
set_inode_flag(inode, FI_NO_PREALLOC);

if ((iocb->ki_flags & IOCB_NOWAIT)) {
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 97f860cfc195..da49ef71dab5 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -1160,7 +1160,7 @@ static ssize_t fuse_fill_write_pages(struct fuse_io_args *ia,

again:
err = -EFAULT;
- if (iov_iter_fault_in_readable(ii, bytes))
+ if (fault_in_iov_iter_readable(ii, bytes))
break;

err = -ENOMEM;
diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 87ccb3438bec..7dc42dd3a724 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -749,7 +749,7 @@ iomap_write_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
* same page as we're writing to, without it being marked
* up-to-date.
*/
- if (unlikely(iov_iter_fault_in_readable(i, bytes))) {
+ if (unlikely(fault_in_iov_iter_readable(i, bytes))) {
status = -EFAULT;
break;
}
diff --git a/fs/ntfs/file.c b/fs/ntfs/file.c
index ab4f3362466d..a43adeacd930 100644
--- a/fs/ntfs/file.c
+++ b/fs/ntfs/file.c
@@ -1829,7 +1829,7 @@ static ssize_t ntfs_perform_write(struct file *file, struct iov_iter *i,
* pages being swapped out between us bringing them into memory
* and doing the actual copying.
*/
- if (unlikely(iov_iter_fault_in_readable(i, bytes))) {
+ if (unlikely(fault_in_iov_iter_readable(i, bytes))) {
status = -EFAULT;
break;
}
diff --git a/include/linux/uio.h b/include/linux/uio.h
index 82c3c3e819e0..12d30246c2e9 100644
--- a/include/linux/uio.h
+++ b/include/linux/uio.h
@@ -119,7 +119,7 @@ size_t copy_page_from_iter_atomic(struct page *page, unsigned offset,
size_t bytes, struct iov_iter *i);
void iov_iter_advance(struct iov_iter *i, size_t bytes);
void iov_iter_revert(struct iov_iter *i, size_t bytes);
-int iov_iter_fault_in_readable(const struct iov_iter *i, size_t bytes);
+size_t fault_in_iov_iter_readable(const struct iov_iter *i, size_t bytes);
size_t iov_iter_single_seg_count(const struct iov_iter *i);
size_t copy_page_to_iter(struct page *page, size_t offset, size_t bytes,
struct iov_iter *i);
diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index 069cedd9d7b4..082ab155496d 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -430,33 +430,42 @@ static size_t copy_page_to_iter_pipe(struct page *page, size_t offset, size_t by
}

/*
+ * fault_in_iov_iter_readable - fault in iov iterator for reading
+ * @i: iterator
+ * @size: maximum length
+ *
* Fault in one or more iovecs of the given iov_iter, to a maximum length of
- * bytes. For each iovec, fault in each page that constitutes the iovec.
+ * @size. For each iovec, fault in each page that constitutes the iovec.
+ *
+ * Returns the number of bytes not faulted in (like copy_to_user() and
+ * copy_from_user()).
*
- * Return 0 on success, or non-zero if the memory could not be accessed (i.e.
- * because it is an invalid address).
+ * Always returns 0 for non-userspace iterators.
*/
-int iov_iter_fault_in_readable(const struct iov_iter *i, size_t bytes)
+size_t fault_in_iov_iter_readable(const struct iov_iter *i, size_t size)
{
if (iter_is_iovec(i)) {
+ size_t count = min(size, iov_iter_count(i));
const struct iovec *p;
size_t skip;

- if (bytes > i->count)
- bytes = i->count;
- for (p = i->iov, skip = i->iov_offset; bytes; p++, skip = 0) {
- size_t len = min(bytes, p->iov_len - skip);
+ size -= count;
+ for (p = i->iov, skip = i->iov_offset; count; p++, skip = 0) {
+ size_t len = min(count, p->iov_len - skip);
+ size_t ret;

if (unlikely(!len))
continue;
- if (fault_in_readable(p->iov_base + skip, len))
- return -EFAULT;
- bytes -= len;
+ ret = fault_in_readable(p->iov_base + skip, len);
+ count -= len - ret;
+ if (ret)
+ break;
}
+ return count + size;
}
return 0;
}
-EXPORT_SYMBOL(iov_iter_fault_in_readable);
+EXPORT_SYMBOL(fault_in_iov_iter_readable);

void iov_iter_init(struct iov_iter *i, unsigned int direction,
const struct iovec *iov, unsigned long nr_segs,
diff --git a/mm/filemap.c b/mm/filemap.c
index 4dec3bc7752e..83af8a534339 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -3643,7 +3643,7 @@ ssize_t generic_perform_write(struct file *file,
* same page as we're writing to, without it being marked
* up-to-date.
*/
- if (unlikely(iov_iter_fault_in_readable(i, bytes))) {
+ if (unlikely(fault_in_iov_iter_readable(i, bytes))) {
status = -EFAULT;
break;
}
--
2.26.3

2021-08-27 16:52:25

by Andreas Gruenbacher

Subject: [PATCH v7 05/19] iov_iter: Introduce fault_in_iov_iter_writeable

Introduce a new fault_in_iov_iter_writeable helper for safely faulting
in an iterator for writing. It uses get_user_pages() to fault in the
pages without actually writing to them, which would be destructive.

We'll use fault_in_iov_iter_writeable in gfs2 once we've determined that
the user buffer described by the iterator passed to .read_iter isn't
resident in memory.
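
To illustrate the intended use, here is a sketch only; do_locked_read()
is a made-up placeholder for a read carried out with page faults
disabled while holding the inode glock:

static ssize_t example_read_retry(struct kiocb *iocb, struct iov_iter *to)
{
	size_t window = 16 << PAGE_SHIFT;	/* arbitrary example size */
	ssize_t ret;

retry:
	ret = do_locked_read(iocb, to);		/* hypothetical helper */
	if (ret != -EFAULT)
		return ret;
	/*
	 * Fault the destination pages in without writing to them; if at
	 * least part of the buffer is now resident, retry the read.
	 */
	if (fault_in_iov_iter_writeable(to, window) != window)
		goto retry;
	return ret;
}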

Signed-off-by: Andreas Gruenbacher <[email protected]>
---
include/linux/pagemap.h | 1 +
include/linux/uio.h | 1 +
lib/iov_iter.c | 39 +++++++++++++++++++++++++
mm/gup.c | 63 +++++++++++++++++++++++++++++++++++++++++
4 files changed, 104 insertions(+)

diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 7c9edc9694d9..a629807edb8c 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -737,6 +737,7 @@ extern void add_page_wait_queue(struct page *page, wait_queue_entry_t *waiter);
* Fault in userspace address range.
*/
size_t fault_in_writeable(char __user *uaddr, size_t size);
+size_t fault_in_safe_writeable(const char __user *uaddr, size_t size);
size_t fault_in_readable(const char __user *uaddr, size_t size);

int add_to_page_cache_locked(struct page *page, struct address_space *mapping,
diff --git a/include/linux/uio.h b/include/linux/uio.h
index 12d30246c2e9..ffa431aeb067 100644
--- a/include/linux/uio.h
+++ b/include/linux/uio.h
@@ -120,6 +120,7 @@ size_t copy_page_from_iter_atomic(struct page *page, unsigned offset,
void iov_iter_advance(struct iov_iter *i, size_t bytes);
void iov_iter_revert(struct iov_iter *i, size_t bytes);
size_t fault_in_iov_iter_readable(const struct iov_iter *i, size_t bytes);
+size_t fault_in_iov_iter_writeable(const struct iov_iter *i, size_t bytes);
size_t iov_iter_single_seg_count(const struct iov_iter *i);
size_t copy_page_to_iter(struct page *page, size_t offset, size_t bytes,
struct iov_iter *i);
diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index 082ab155496d..968f2d2595cd 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -467,6 +467,45 @@ size_t fault_in_iov_iter_readable(const struct iov_iter *i, size_t size)
}
EXPORT_SYMBOL(fault_in_iov_iter_readable);

+/*
+ * fault_in_iov_iter_writeable - fault in iov iterator for writing
+ * @i: iterator
+ * @size: maximum length
+ *
+ * Faults in the iterator using get_user_pages(), i.e., without triggering
+ * hardware page faults. This is primarily useful when we know that some or
+ * all of the pages in @i aren't in memory.
+ *
+ * Returns the number of bytes not faulted in (like copy_to_user() and
+ * copy_from_user()).
+ *
+ * Always returns 0 for non-user space iterators.
+ */
+size_t fault_in_iov_iter_writeable(const struct iov_iter *i, size_t size)
+{
+ if (iter_is_iovec(i)) {
+ size_t count = min(size, iov_iter_count(i));
+ const struct iovec *p;
+ size_t skip;
+
+ size -= count;
+ for (p = i->iov, skip = i->iov_offset; count; p++, skip = 0) {
+ size_t len = min(count, p->iov_len - skip);
+ size_t ret;
+
+ if (unlikely(!len))
+ continue;
+ ret = fault_in_safe_writeable(p->iov_base + skip, len);
+ count -= len - ret;
+ if (ret)
+ break;
+ }
+ return count + size;
+ }
+ return 0;
+}
+EXPORT_SYMBOL(fault_in_iov_iter_writeable);
+
void iov_iter_init(struct iov_iter *i, unsigned int direction,
const struct iovec *iov, unsigned long nr_segs,
size_t count)
diff --git a/mm/gup.c b/mm/gup.c
index 0cf47955e5a1..03ab03b68dc7 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1707,6 +1707,69 @@ size_t fault_in_writeable(char __user *uaddr, size_t size)
}
EXPORT_SYMBOL(fault_in_writeable);

+/*
+ * fault_in_safe_writeable - fault in an address range for writing
+ * @uaddr: start of address range
+ * @size: length of address range
+ *
+ * Faults in an address range using get_user_pages, i.e., without triggering
+ * hardware page faults. This is primarily useful when we know that some or
+ * all of the pages in the address range aren't in memory.
+ *
+ * Other than fault_in_writeable(), this function is non-destructive.
+ *
+ * Note that we don't pin or otherwise hold the pages referenced that we fault
+ * in. There's no guarantee that they'll stay in memory for any duration of
+ * time.
+ *
+ * Returns the number of bytes not faulted in (like copy_to_user() and
+ * copy_from_user()).
+ */
+size_t fault_in_safe_writeable(const char __user *uaddr, size_t size)
+{
+ unsigned long start = (unsigned long)uaddr;
+ unsigned long end, nstart, nend;
+ struct mm_struct *mm = current->mm;
+ struct vm_area_struct *vma = NULL;
+ int locked = 0;
+
+ nstart = start & PAGE_MASK;
+ end = PAGE_ALIGN(start + size);
+ if (end < nstart)
+ end = 0;
+ for (; nstart != end; nstart = nend) {
+ unsigned long nr_pages;
+ long ret;
+
+ if (!locked) {
+ locked = 1;
+ mmap_read_lock(mm);
+ vma = find_vma(mm, nstart);
+ } else if (nstart >= vma->vm_end)
+ vma = vma->vm_next;
+ if (!vma || vma->vm_start >= end)
+ break;
+ nend = end ? min(end, vma->vm_end) : vma->vm_end;
+ if (vma->vm_flags & (VM_IO | VM_PFNMAP))
+ continue;
+ if (nstart < vma->vm_start)
+ nstart = vma->vm_start;
+ nr_pages = (nend - nstart) / PAGE_SIZE;
+ ret = __get_user_pages_locked(mm, nstart, nr_pages,
+ NULL, NULL, &locked,
+ FOLL_TOUCH | FOLL_WRITE);
+ if (ret <= 0)
+ break;
+ nend = nstart + ret * PAGE_SIZE;
+ }
+ if (locked)
+ mmap_read_unlock(mm);
+ if (nstart == end)
+ return 0;
+ return size - min_t(size_t, nstart - start, size);
+}
+EXPORT_SYMBOL(fault_in_safe_writeable);
+
/**
* fault_in_readable - fault in userspace address range for reading
* @uaddr: start of user address range
--
2.26.3

2021-08-27 16:52:39

by Andreas Gruenbacher

Subject: [PATCH v7 06/19] gfs2: Add wrapper for iomap_file_buffered_write

Add a wrapper around iomap_file_buffered_write. We'll add code for when
the operation needs to be retried here later.

Signed-off-by: Andreas Gruenbacher <[email protected]>
---
fs/gfs2/file.c | 20 ++++++++++++++------
1 file changed, 14 insertions(+), 6 deletions(-)

diff --git a/fs/gfs2/file.c b/fs/gfs2/file.c
index 84ec053d43b4..55ec1cadc9e6 100644
--- a/fs/gfs2/file.c
+++ b/fs/gfs2/file.c
@@ -876,6 +876,18 @@ static ssize_t gfs2_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
return written ? written : ret;
}

+static ssize_t gfs2_file_buffered_write(struct kiocb *iocb, struct iov_iter *from)
+{
+ struct file *file = iocb->ki_filp;
+ struct inode *inode = file_inode(file);
+ ssize_t ret;
+
+ current->backing_dev_info = inode_to_bdi(inode);
+ ret = iomap_file_buffered_write(iocb, from, &gfs2_iomap_ops);
+ current->backing_dev_info = NULL;
+ return ret;
+}
+
/**
* gfs2_file_write_iter - Perform a write to a file
* @iocb: The io context
@@ -927,9 +939,7 @@ static ssize_t gfs2_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
goto out_unlock;

iocb->ki_flags |= IOCB_DSYNC;
- current->backing_dev_info = inode_to_bdi(inode);
- buffered = iomap_file_buffered_write(iocb, from, &gfs2_iomap_ops);
- current->backing_dev_info = NULL;
+ buffered = gfs2_file_buffered_write(iocb, from);
if (unlikely(buffered <= 0)) {
if (!ret)
ret = buffered;
@@ -951,9 +961,7 @@ static ssize_t gfs2_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
if (!ret || ret2 > 0)
ret += ret2;
} else {
- current->backing_dev_info = inode_to_bdi(inode);
- ret = iomap_file_buffered_write(iocb, from, &gfs2_iomap_ops);
- current->backing_dev_info = NULL;
+ ret = gfs2_file_buffered_write(iocb, from);
if (likely(ret > 0)) {
iocb->ki_pos += ret;
ret = generic_write_sync(iocb, ret);
--
2.26.3

2021-08-27 16:52:54

by Andreas Gruenbacher

Subject: [PATCH v7 08/19] gfs2: Eliminate vestigial HIF_FIRST

From: Bob Peterson <[email protected]>

Holder flag HIF_FIRST is no longer used or needed, so remove it.

Signed-off-by: Bob Peterson <[email protected]>
---
fs/gfs2/glock.c | 2 --
fs/gfs2/incore.h | 1 -
2 files changed, 3 deletions(-)

diff --git a/fs/gfs2/glock.c b/fs/gfs2/glock.c
index 545b435f55ea..fd280b6c37ce 100644
--- a/fs/gfs2/glock.c
+++ b/fs/gfs2/glock.c
@@ -2097,8 +2097,6 @@ static const char *hflags2str(char *buf, u16 flags, unsigned long iflags)
*p++ = 'H';
if (test_bit(HIF_WAIT, &iflags))
*p++ = 'W';
- if (test_bit(HIF_FIRST, &iflags))
- *p++ = 'F';
*p = 0;
return buf;
}
diff --git a/fs/gfs2/incore.h b/fs/gfs2/incore.h
index e6f820f146cb..5c6b985254aa 100644
--- a/fs/gfs2/incore.h
+++ b/fs/gfs2/incore.h
@@ -253,7 +253,6 @@ struct gfs2_lkstats {
enum {
/* States */
HIF_HOLDER = 6, /* Set for gh that "holds" the glock */
- HIF_FIRST = 7,
HIF_WAIT = 10,
};

--
2.26.3

2021-08-27 16:53:02

by Andreas Gruenbacher

Subject: [PATCH v7 09/19] gfs2: Remove redundant check from gfs2_glock_dq

From: Bob Peterson <[email protected]>

Function gfs2_glock_dq checks whether it can take the fast path. Before
this patch, it checked both "find_first_holder(gl) == NULL" and
list_empty(&gl->gl_holders), which is redundant: if gl_holders is empty,
then find_first_holder must return NULL. This patch removes the
redundancy.

Signed-off-by: Bob Peterson <[email protected]>
---
fs/gfs2/glock.c | 11 +++++------
1 file changed, 5 insertions(+), 6 deletions(-)

diff --git a/fs/gfs2/glock.c b/fs/gfs2/glock.c
index fd280b6c37ce..f24db2ececfb 100644
--- a/fs/gfs2/glock.c
+++ b/fs/gfs2/glock.c
@@ -1514,12 +1514,11 @@ void gfs2_glock_dq(struct gfs2_holder *gh)

list_del_init(&gh->gh_list);
clear_bit(HIF_HOLDER, &gh->gh_iflags);
- if (find_first_holder(gl) == NULL) {
- if (list_empty(&gl->gl_holders) &&
- !test_bit(GLF_PENDING_DEMOTE, &gl->gl_flags) &&
- !test_bit(GLF_DEMOTE, &gl->gl_flags))
- fast_path = 1;
- }
+ if (list_empty(&gl->gl_holders) &&
+ !test_bit(GLF_PENDING_DEMOTE, &gl->gl_flags) &&
+ !test_bit(GLF_DEMOTE, &gl->gl_flags))
+ fast_path = 1;
+
if (!test_bit(GLF_LFLUSH, &gl->gl_flags) && demote_ok(gl))
gfs2_glock_add_to_lru(gl);

--
2.26.3

2021-08-27 16:53:25

by Andreas Gruenbacher

Subject: [PATCH v7 11/19] gfs2: Move the inode glock locking to gfs2_file_buffered_write

So far, for buffered writes, we were taking the inode glock in
gfs2_iomap_begin and dropping it in gfs2_iomap_end with the intention of
not holding the inode glock while iomap_write_actor faults in user
pages. It turns out that iomap_write_actor is called inside iomap_begin
... iomap_end, so the user pages were still faulted in while holding the
inode glock and the locking code in iomap_begin / iomap_end was
completely pointless.

Move the locking into gfs2_file_buffered_write instead. We'll take care
of the potential deadlocks due to faulting in user pages while holding a
glock in a subsequent patch.

Signed-off-by: Andreas Gruenbacher <[email protected]>
---
fs/gfs2/bmap.c | 60 +-------------------------------------------------
fs/gfs2/file.c | 27 +++++++++++++++++++++++
2 files changed, 28 insertions(+), 59 deletions(-)

diff --git a/fs/gfs2/bmap.c b/fs/gfs2/bmap.c
index ed8b67b21718..0d90f1809efb 100644
--- a/fs/gfs2/bmap.c
+++ b/fs/gfs2/bmap.c
@@ -961,46 +961,6 @@ static int __gfs2_iomap_get(struct inode *inode, loff_t pos, loff_t length,
goto out;
}

-static int gfs2_write_lock(struct inode *inode)
-{
- struct gfs2_inode *ip = GFS2_I(inode);
- struct gfs2_sbd *sdp = GFS2_SB(inode);
- int error;
-
- gfs2_holder_init(ip->i_gl, LM_ST_EXCLUSIVE, 0, &ip->i_gh);
- error = gfs2_glock_nq(&ip->i_gh);
- if (error)
- goto out_uninit;
- if (&ip->i_inode == sdp->sd_rindex) {
- struct gfs2_inode *m_ip = GFS2_I(sdp->sd_statfs_inode);
-
- error = gfs2_glock_nq_init(m_ip->i_gl, LM_ST_EXCLUSIVE,
- GL_NOCACHE, &m_ip->i_gh);
- if (error)
- goto out_unlock;
- }
- return 0;
-
-out_unlock:
- gfs2_glock_dq(&ip->i_gh);
-out_uninit:
- gfs2_holder_uninit(&ip->i_gh);
- return error;
-}
-
-static void gfs2_write_unlock(struct inode *inode)
-{
- struct gfs2_inode *ip = GFS2_I(inode);
- struct gfs2_sbd *sdp = GFS2_SB(inode);
-
- if (&ip->i_inode == sdp->sd_rindex) {
- struct gfs2_inode *m_ip = GFS2_I(sdp->sd_statfs_inode);
-
- gfs2_glock_dq_uninit(&m_ip->i_gh);
- }
- gfs2_glock_dq_uninit(&ip->i_gh);
-}
-
static int gfs2_iomap_page_prepare(struct inode *inode, loff_t pos,
unsigned len, struct iomap *iomap)
{
@@ -1119,11 +1079,6 @@ static int gfs2_iomap_begin_write(struct inode *inode, loff_t pos,
return ret;
}

-static inline bool gfs2_iomap_need_write_lock(unsigned flags)
-{
- return (flags & IOMAP_WRITE) && !(flags & IOMAP_DIRECT);
-}
-
static int gfs2_iomap_begin(struct inode *inode, loff_t pos, loff_t length,
unsigned flags, struct iomap *iomap,
struct iomap *srcmap)
@@ -1136,12 +1091,6 @@ static int gfs2_iomap_begin(struct inode *inode, loff_t pos, loff_t length,
iomap->flags |= IOMAP_F_BUFFER_HEAD;

trace_gfs2_iomap_start(ip, pos, length, flags);
- if (gfs2_iomap_need_write_lock(flags)) {
- ret = gfs2_write_lock(inode);
- if (ret)
- goto out;
- }
-
ret = __gfs2_iomap_get(inode, pos, length, flags, iomap, &mp);
if (ret)
goto out_unlock;
@@ -1169,10 +1118,7 @@ static int gfs2_iomap_begin(struct inode *inode, loff_t pos, loff_t length,
ret = gfs2_iomap_begin_write(inode, pos, length, flags, iomap, &mp);

out_unlock:
- if (ret && gfs2_iomap_need_write_lock(flags))
- gfs2_write_unlock(inode);
release_metapath(&mp);
-out:
trace_gfs2_iomap_end(ip, iomap, ret);
return ret;
}
@@ -1220,15 +1166,11 @@ static int gfs2_iomap_end(struct inode *inode, loff_t pos, loff_t length,
}

if (unlikely(!written))
- goto out_unlock;
+ return 0;

if (iomap->flags & IOMAP_F_SIZE_CHANGED)
mark_inode_dirty(inode);
set_bit(GLF_DIRTY, &ip->i_gl->gl_flags);
-
-out_unlock:
- if (gfs2_iomap_need_write_lock(flags))
- gfs2_write_unlock(inode);
return 0;
}

diff --git a/fs/gfs2/file.c b/fs/gfs2/file.c
index 55ec1cadc9e6..813154d60834 100644
--- a/fs/gfs2/file.c
+++ b/fs/gfs2/file.c
@@ -880,11 +880,38 @@ static ssize_t gfs2_file_buffered_write(struct kiocb *iocb, struct iov_iter *fro
{
struct file *file = iocb->ki_filp;
struct inode *inode = file_inode(file);
+ struct gfs2_inode *ip = GFS2_I(inode);
+ struct gfs2_sbd *sdp = GFS2_SB(inode);
ssize_t ret;

+ gfs2_holder_init(ip->i_gl, LM_ST_EXCLUSIVE, 0, &ip->i_gh);
+ ret = gfs2_glock_nq(&ip->i_gh);
+ if (ret)
+ goto out_uninit;
+
+ if (inode == sdp->sd_rindex) {
+ struct gfs2_inode *m_ip = GFS2_I(sdp->sd_statfs_inode);
+
+ ret = gfs2_glock_nq_init(m_ip->i_gl, LM_ST_EXCLUSIVE,
+ GL_NOCACHE, &m_ip->i_gh);
+ if (ret)
+ goto out_unlock;
+ }
+
current->backing_dev_info = inode_to_bdi(inode);
ret = iomap_file_buffered_write(iocb, from, &gfs2_iomap_ops);
current->backing_dev_info = NULL;
+
+ if (inode == sdp->sd_rindex) {
+ struct gfs2_inode *m_ip = GFS2_I(sdp->sd_statfs_inode);
+
+ gfs2_glock_dq_uninit(&m_ip->i_gh);
+ }
+
+out_unlock:
+ gfs2_glock_dq(&ip->i_gh);
+out_uninit:
+ gfs2_holder_uninit(&ip->i_gh);
return ret;
}

--
2.26.3

2021-08-27 16:53:30

by Andreas Gruenbacher

Subject: [PATCH v7 10/19] gfs2: Introduce flag for glock holder auto-demotion

From: Bob Peterson <[email protected]>

This patch introduces a new HIF_MAY_DEMOTE flag and infrastructure that
will allow glocks to be demoted automatically on locking conflicts.
When a locking request comes in that isn't compatible with the locking
state of an active holder and that holder has the HIF_MAY_DEMOTE flag
set, the holder will be demoted before the incoming locking request is
granted.

Note that this mechanism demotes active holders (with the HIF_HOLDER
flag set), while we were only demoting glocks without any active holders
before. This allows processes to keep hold of locks that may form a
cyclic locking dependency; the core glock logic will then break those
dependencies in case a conflicting locking request actually occurs.
We'll use this to avoid giving up the inode glock proactively before
faulting in pages.

Processes that allow a glock holder to be taken away indicate this by
calling gfs2_holder_allow_demote(). When they need the glock again,
they call gfs2_holder_disallow_demote(). Then they check if the holder
is still queued: if it is, they are still holding the glock; if it
isn't, they can re-acquire the glock or abort.
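
A sketch of that calling pattern (illustration only; what happens
between the allow and disallow calls, and the final check, are just
examples of what a caller might do):

static bool example_fault_in_under_glock(struct gfs2_holder *gh,
					 struct iov_iter *i, size_t len)
{
	/* Tell the glock code that this holder may be taken away. */
	gfs2_holder_allow_demote(gh);
	fault_in_iov_iter_readable(i, len);
	gfs2_holder_disallow_demote(gh);

	/*
	 * If the holder is still queued, we still hold the glock and can
	 * carry on; otherwise we have to re-acquire the glock or abort.
	 */
	return !list_empty(&gh->gh_list);
}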

Signed-off-by: Bob Peterson <[email protected]>
Signed-off-by: Andreas Gruenbacher <[email protected]>
---
fs/gfs2/glock.c | 221 +++++++++++++++++++++++++++++++++++++++--------
fs/gfs2/glock.h | 20 +++++
fs/gfs2/incore.h | 1 +
3 files changed, 206 insertions(+), 36 deletions(-)

diff --git a/fs/gfs2/glock.c b/fs/gfs2/glock.c
index f24db2ececfb..d1b06a09ce2f 100644
--- a/fs/gfs2/glock.c
+++ b/fs/gfs2/glock.c
@@ -58,6 +58,7 @@ struct gfs2_glock_iter {
typedef void (*glock_examiner) (struct gfs2_glock * gl);

static void do_xmote(struct gfs2_glock *gl, struct gfs2_holder *gh, unsigned int target);
+static void __gfs2_glock_dq(struct gfs2_holder *gh);

static struct dentry *gfs2_root;
static struct workqueue_struct *glock_workqueue;
@@ -197,6 +198,12 @@ static int demote_ok(const struct gfs2_glock *gl)

if (gl->gl_state == LM_ST_UNLOCKED)
return 0;
+ /*
+ * Note that demote_ok is used for the lru process of disposing of
+ * glocks. For this purpose, we don't care if the glock's holders
+ * have the HIF_MAY_DEMOTE flag set or not. If someone is using
+ * them, don't demote.
+ */
if (!list_empty(&gl->gl_holders))
return 0;
if (glops->go_demote_ok)
@@ -379,7 +386,7 @@ static void do_error(struct gfs2_glock *gl, const int ret)
struct gfs2_holder *gh, *tmp;

list_for_each_entry_safe(gh, tmp, &gl->gl_holders, gh_list) {
- if (test_bit(HIF_HOLDER, &gh->gh_iflags))
+ if (!test_bit(HIF_WAIT, &gh->gh_iflags))
continue;
if (ret & LM_OUT_ERROR)
gh->gh_error = -EIO;
@@ -393,6 +400,40 @@ static void do_error(struct gfs2_glock *gl, const int ret)
}
}

+/**
+ * demote_incompat_holders - demote incompatible demoteable holders
+ * @gl: the glock we want to promote
+ * @new_gh: the new holder to be promoted
+ */
+static void demote_incompat_holders(struct gfs2_glock *gl,
+ struct gfs2_holder *new_gh)
+{
+ struct gfs2_holder *gh;
+
+ /*
+ * Demote incompatible holders before we make ourselves eligible.
+ * (This holder may or may not allow auto-demoting, but we don't want
+ * to demote the new holder before it's even granted.)
+ */
+ list_for_each_entry(gh, &gl->gl_holders, gh_list) {
+ /*
+ * Since holders are at the front of the list, we stop when we
+ * find the first non-holder.
+ */
+ if (!test_bit(HIF_HOLDER, &gh->gh_iflags))
+ return;
+ if (test_bit(HIF_MAY_DEMOTE, &gh->gh_iflags) &&
+ !may_grant(gl, new_gh, gh)) {
+ /*
+ * We should not recurse into do_promote because
+ * __gfs2_glock_dq only calls handle_callback,
+ * gfs2_glock_add_to_lru and __gfs2_glock_queue_work.
+ */
+ __gfs2_glock_dq(gh);
+ }
+ }
+}
+
/**
* find_first_holder - find the first "holder" gh
* @gl: the glock
@@ -411,6 +452,26 @@ static inline struct gfs2_holder *find_first_holder(const struct gfs2_glock *gl)
return NULL;
}

+/**
+ * find_first_strong_holder - find the first non-demoteable holder
+ * @gl: the glock
+ *
+ * Find the first holder that doesn't have the HIF_MAY_DEMOTE flag set.
+ */
+static inline struct gfs2_holder
+*find_first_strong_holder(struct gfs2_glock *gl)
+{
+ struct gfs2_holder *gh;
+
+ list_for_each_entry(gh, &gl->gl_holders, gh_list) {
+ if (!test_bit(HIF_HOLDER, &gh->gh_iflags))
+ return NULL;
+ if (!test_bit(HIF_MAY_DEMOTE, &gh->gh_iflags))
+ return gh;
+ }
+ return NULL;
+}
+
/**
* do_promote - promote as many requests as possible on the current queue
* @gl: The glock
@@ -425,15 +486,27 @@ __acquires(&gl->gl_lockref.lock)
{
const struct gfs2_glock_operations *glops = gl->gl_ops;
struct gfs2_holder *gh, *tmp, *first_gh;
+ bool incompat_holders_demoted = false;
int ret;

- first_gh = find_first_holder(gl);
+ first_gh = find_first_strong_holder(gl);

restart:
list_for_each_entry_safe(gh, tmp, &gl->gl_holders, gh_list) {
- if (test_bit(HIF_HOLDER, &gh->gh_iflags))
+ if (!test_bit(HIF_WAIT, &gh->gh_iflags))
continue;
if (may_grant(gl, first_gh, gh)) {
+ if (!incompat_holders_demoted) {
+ demote_incompat_holders(gl, first_gh);
+ incompat_holders_demoted = true;
+ first_gh = gh;
+ }
+ /*
+ * The first holder (and only the first holder) on the
+ * list to be promoted needs to call the go_lock
+ * function. This does things like inode_refresh
+ * to read an inode from disk.
+ */
if (gh->gh_list.prev == &gl->gl_holders &&
glops->go_lock) {
spin_unlock(&gl->gl_lockref.lock);
@@ -459,6 +532,11 @@ __acquires(&gl->gl_lockref.lock)
gfs2_holder_wake(gh);
continue;
}
+ /*
+ * If we get here, it means we may not grant this holder for
+ * some reason. If this holder is the head of the list, it
+ * means we have a blocked holder at the head, so return 1.
+ */
if (gh->gh_list.prev == &gl->gl_holders)
return 1;
do_error(gl, 0);
@@ -1373,7 +1451,7 @@ __acquires(&gl->gl_lockref.lock)
if (test_bit(GLF_LOCK, &gl->gl_flags)) {
struct gfs2_holder *first_gh;

- first_gh = find_first_holder(gl);
+ first_gh = find_first_strong_holder(gl);
try_futile = !may_grant(gl, first_gh, gh);
}
if (test_bit(GLF_INVALIDATE_IN_PROGRESS, &gl->gl_flags))
@@ -1382,7 +1460,8 @@ __acquires(&gl->gl_lockref.lock)

list_for_each_entry(gh2, &gl->gl_holders, gh_list) {
if (unlikely(gh2->gh_owner_pid == gh->gh_owner_pid &&
- (gh->gh_gl->gl_ops->go_type != LM_TYPE_FLOCK)))
+ (gh->gh_gl->gl_ops->go_type != LM_TYPE_FLOCK) &&
+ !test_bit(HIF_MAY_DEMOTE, &gh2->gh_iflags)))
goto trap_recursive;
if (try_futile &&
!(gh2->gh_flags & (LM_FLAG_TRY | LM_FLAG_TRY_1CB))) {
@@ -1478,51 +1557,83 @@ int gfs2_glock_poll(struct gfs2_holder *gh)
return test_bit(HIF_WAIT, &gh->gh_iflags) ? 0 : 1;
}

-/**
- * gfs2_glock_dq - dequeue a struct gfs2_holder from a glock (release a glock)
- * @gh: the glock holder
- *
- */
+static inline bool needs_demote(struct gfs2_glock *gl)
+{
+ return (test_bit(GLF_DEMOTE, &gl->gl_flags) ||
+ test_bit(GLF_PENDING_DEMOTE, &gl->gl_flags));
+}

-void gfs2_glock_dq(struct gfs2_holder *gh)
+static void __gfs2_glock_dq(struct gfs2_holder *gh)
{
struct gfs2_glock *gl = gh->gh_gl;
struct gfs2_sbd *sdp = gl->gl_name.ln_sbd;
unsigned delay = 0;
int fast_path = 0;

- spin_lock(&gl->gl_lockref.lock);
/*
- * If we're in the process of file system withdraw, we cannot just
- * dequeue any glocks until our journal is recovered, lest we
- * introduce file system corruption. We need two exceptions to this
- * rule: We need to allow unlocking of nondisk glocks and the glock
- * for our own journal that needs recovery.
+ * This while loop is similar to function demote_incompat_holders:
+ * If the glock is due to be demoted (which may be from another node
+ * or even if this holder is GL_NOCACHE), the weak holders are
+ * demoted as well, allowing the glock to be demoted.
*/
- if (test_bit(SDF_WITHDRAW_RECOVERY, &sdp->sd_flags) &&
- glock_blocked_by_withdraw(gl) &&
- gh->gh_gl != sdp->sd_jinode_gl) {
- sdp->sd_glock_dqs_held++;
- spin_unlock(&gl->gl_lockref.lock);
- might_sleep();
- wait_on_bit(&sdp->sd_flags, SDF_WITHDRAW_RECOVERY,
- TASK_UNINTERRUPTIBLE);
- spin_lock(&gl->gl_lockref.lock);
- }
- if (gh->gh_flags & GL_NOCACHE)
- handle_callback(gl, LM_ST_UNLOCKED, 0, false);
+ while (gh) {
+ /*
+ * If we're in the process of file system withdraw, we cannot
+ * just dequeue any glocks until our journal is recovered, lest
+ * we introduce file system corruption. We need two exceptions
+ * to this rule: We need to allow unlocking of nondisk glocks
+ * and the glock for our own journal that needs recovery.
+ */
+ if (test_bit(SDF_WITHDRAW_RECOVERY, &sdp->sd_flags) &&
+ glock_blocked_by_withdraw(gl) &&
+ gh->gh_gl != sdp->sd_jinode_gl) {
+ sdp->sd_glock_dqs_held++;
+ spin_unlock(&gl->gl_lockref.lock);
+ might_sleep();
+ wait_on_bit(&sdp->sd_flags, SDF_WITHDRAW_RECOVERY,
+ TASK_UNINTERRUPTIBLE);
+ spin_lock(&gl->gl_lockref.lock);
+ }
+
+ /*
+ * This holder should not be cached, so mark it for demote.
+ * Note: this should be done before the check for needs_demote
+ * below.
+ */
+ if (gh->gh_flags & GL_NOCACHE)
+ handle_callback(gl, LM_ST_UNLOCKED, 0, false);

- list_del_init(&gh->gh_list);
- clear_bit(HIF_HOLDER, &gh->gh_iflags);
- if (list_empty(&gl->gl_holders) &&
- !test_bit(GLF_PENDING_DEMOTE, &gl->gl_flags) &&
- !test_bit(GLF_DEMOTE, &gl->gl_flags))
- fast_path = 1;
+ list_del_init(&gh->gh_list);
+ clear_bit(HIF_HOLDER, &gh->gh_iflags);
+ trace_gfs2_glock_queue(gh, 0);
+
+ /*
+ * If there hasn't been a demote request we are done.
+ * (Let the remaining holders, if any, keep holding it.)
+ */
+ if (!needs_demote(gl)) {
+ if (list_empty(&gl->gl_holders))
+ fast_path = 1;
+ break;
+ }
+ /*
+ * If we have another strong holder (we cannot auto-demote)
+ * we are done. It keeps holding it until it is done.
+ */
+ if (find_first_strong_holder(gl))
+ break;
+
+ /*
+ * If we have a weak holder at the head of the list, it
+ * (and all others like it) must be auto-demoted. If there
+ * are no more weak holders, we exit the while loop.
+ */
+ gh = find_first_holder(gl);
+ }

if (!test_bit(GLF_LFLUSH, &gl->gl_flags) && demote_ok(gl))
gfs2_glock_add_to_lru(gl);

- trace_gfs2_glock_queue(gh, 0);
if (unlikely(!fast_path)) {
gl->gl_lockref.count++;
if (test_bit(GLF_PENDING_DEMOTE, &gl->gl_flags) &&
@@ -1531,6 +1642,19 @@ void gfs2_glock_dq(struct gfs2_holder *gh)
delay = gl->gl_hold_time;
__gfs2_glock_queue_work(gl, delay);
}
+}
+
+/**
+ * gfs2_glock_dq - dequeue a struct gfs2_holder from a glock (release a glock)
+ * @gh: the glock holder
+ *
+ */
+void gfs2_glock_dq(struct gfs2_holder *gh)
+{
+ struct gfs2_glock *gl = gh->gh_gl;
+
+ spin_lock(&gl->gl_lockref.lock);
+ __gfs2_glock_dq(gh);
spin_unlock(&gl->gl_lockref.lock);
}

@@ -1693,6 +1817,7 @@ void gfs2_glock_dq_m(unsigned int num_gh, struct gfs2_holder *ghs)

void gfs2_glock_cb(struct gfs2_glock *gl, unsigned int state)
{
+ struct gfs2_holder mock_gh = { .gh_gl = gl, .gh_state = state, };
unsigned long delay = 0;
unsigned long holdtime;
unsigned long now = jiffies;
@@ -1707,6 +1832,28 @@ void gfs2_glock_cb(struct gfs2_glock *gl, unsigned int state)
if (test_bit(GLF_REPLY_PENDING, &gl->gl_flags))
delay = gl->gl_hold_time;
}
+ /*
+ * Note 1: We cannot call demote_incompat_holders from handle_callback
+ * or gfs2_set_demote due to recursion problems like: gfs2_glock_dq ->
+ * handle_callback -> demote_incompat_holders -> gfs2_glock_dq
+ * Plus, we only want to demote the holders if the request comes from
+ * a remote cluster node because local holder conflicts are resolved
+ * elsewhere.
+ *
+ * Note 2: if a remote node wants this glock in EX mode, lock_dlm will
+ * request that we set our state to UNLOCKED. Here we mock up a holder
+ * to make it look like someone wants the lock EX locally. Any SH
+ * and DF requests should be able to share the lock without demoting.
+ *
+ * Note 3: We only want to demote the demoteable holders when there
+ * are no more strong holders. The demoteable holders might as well
+ * keep the glock until the last strong holder is done with it.
+ */
+ if (!find_first_strong_holder(gl)) {
+ if (state == LM_ST_UNLOCKED)
+ mock_gh.gh_state = LM_ST_EXCLUSIVE;
+ demote_incompat_holders(gl, &mock_gh);
+ }
handle_callback(gl, state, delay, true);
__gfs2_glock_queue_work(gl, delay);
spin_unlock(&gl->gl_lockref.lock);
@@ -2096,6 +2243,8 @@ static const char *hflags2str(char *buf, u16 flags, unsigned long iflags)
*p++ = 'H';
if (test_bit(HIF_WAIT, &iflags))
*p++ = 'W';
+ if (test_bit(HIF_MAY_DEMOTE, &iflags))
+ *p++ = 'D';
*p = 0;
return buf;
}
diff --git a/fs/gfs2/glock.h b/fs/gfs2/glock.h
index 31a8f2f649b5..9012487da4c6 100644
--- a/fs/gfs2/glock.h
+++ b/fs/gfs2/glock.h
@@ -150,6 +150,8 @@ static inline struct gfs2_holder *gfs2_glock_is_locked_by_me(struct gfs2_glock *
list_for_each_entry(gh, &gl->gl_holders, gh_list) {
if (!test_bit(HIF_HOLDER, &gh->gh_iflags))
break;
+ if (test_bit(HIF_MAY_DEMOTE, &gh->gh_iflags))
+ continue;
if (gh->gh_owner_pid == pid)
goto out;
}
@@ -325,6 +327,24 @@ static inline void glock_clear_object(struct gfs2_glock *gl, void *object)
spin_unlock(&gl->gl_lockref.lock);
}

+static inline void gfs2_holder_allow_demote(struct gfs2_holder *gh)
+{
+ struct gfs2_glock *gl = gh->gh_gl;
+
+ spin_lock(&gl->gl_lockref.lock);
+ set_bit(HIF_MAY_DEMOTE, &gh->gh_iflags);
+ spin_unlock(&gl->gl_lockref.lock);
+}
+
+static inline void gfs2_holder_disallow_demote(struct gfs2_holder *gh)
+{
+ struct gfs2_glock *gl = gh->gh_gl;
+
+ spin_lock(&gl->gl_lockref.lock);
+ clear_bit(HIF_MAY_DEMOTE, &gh->gh_iflags);
+ spin_unlock(&gl->gl_lockref.lock);
+}
+
extern void gfs2_inode_remember_delete(struct gfs2_glock *gl, u64 generation);
extern bool gfs2_inode_already_deleted(struct gfs2_glock *gl, u64 generation);

diff --git a/fs/gfs2/incore.h b/fs/gfs2/incore.h
index 5c6b985254aa..e73a81db0714 100644
--- a/fs/gfs2/incore.h
+++ b/fs/gfs2/incore.h
@@ -252,6 +252,7 @@ struct gfs2_lkstats {

enum {
/* States */
+ HIF_MAY_DEMOTE = 1,
HIF_HOLDER = 6, /* Set for gh that "holds" the glock */
HIF_WAIT = 10,
};
--
2.26.3

2021-08-27 16:53:32

by Andreas Gruenbacher

Subject: [PATCH v7 12/19] gfs2: Eliminate ip->i_gh

Now that gfs2_file_buffered_write is the only remaining user of
ip->i_gh, we can move the glock holder to the stack (or rather, use the
one we already have on the stack); there is no need for keeping the
holder in the inode anymore.

This is slightly complicated by the fact that we're using ip->i_gh for
the statfs inode in gfs2_file_buffered_write as well. Writing to the
statfs inode isn't very common, so allocate the statfs holder
dynamically when needed.

Signed-off-by: Andreas Gruenbacher <[email protected]>
---
fs/gfs2/file.c | 40 +++++++++++++++++++++++++++-------------
fs/gfs2/incore.h | 3 +--
2 files changed, 28 insertions(+), 15 deletions(-)

diff --git a/fs/gfs2/file.c b/fs/gfs2/file.c
index 813154d60834..5f328bc21d0b 100644
--- a/fs/gfs2/file.c
+++ b/fs/gfs2/file.c
@@ -876,16 +876,31 @@ static ssize_t gfs2_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
return written ? written : ret;
}

-static ssize_t gfs2_file_buffered_write(struct kiocb *iocb, struct iov_iter *from)
+static ssize_t gfs2_file_buffered_write(struct kiocb *iocb,
+ struct iov_iter *from,
+ struct gfs2_holder *gh)
{
struct file *file = iocb->ki_filp;
struct inode *inode = file_inode(file);
struct gfs2_inode *ip = GFS2_I(inode);
struct gfs2_sbd *sdp = GFS2_SB(inode);
+ struct gfs2_holder *statfs_gh = NULL;
ssize_t ret;

- gfs2_holder_init(ip->i_gl, LM_ST_EXCLUSIVE, 0, &ip->i_gh);
- ret = gfs2_glock_nq(&ip->i_gh);
+ /*
+ * In this function, we disable page faults when we're holding the
+ * inode glock while doing I/O. If a page fault occurs, we drop the
+ * inode glock, fault in the pages manually, and retry.
+ */
+
+ if (inode == sdp->sd_rindex) {
+ statfs_gh = kmalloc(sizeof(*statfs_gh), GFP_NOFS);
+ if (!statfs_gh)
+ return -ENOMEM;
+ }
+
+ gfs2_holder_init(ip->i_gl, LM_ST_EXCLUSIVE, 0, gh);
+ ret = gfs2_glock_nq(gh);
if (ret)
goto out_uninit;

@@ -893,7 +908,7 @@ static ssize_t gfs2_file_buffered_write(struct kiocb *iocb, struct iov_iter *fro
struct gfs2_inode *m_ip = GFS2_I(sdp->sd_statfs_inode);

ret = gfs2_glock_nq_init(m_ip->i_gl, LM_ST_EXCLUSIVE,
- GL_NOCACHE, &m_ip->i_gh);
+ GL_NOCACHE, statfs_gh);
if (ret)
goto out_unlock;
}
@@ -902,16 +917,15 @@ static ssize_t gfs2_file_buffered_write(struct kiocb *iocb, struct iov_iter *fro
ret = iomap_file_buffered_write(iocb, from, &gfs2_iomap_ops);
current->backing_dev_info = NULL;

- if (inode == sdp->sd_rindex) {
- struct gfs2_inode *m_ip = GFS2_I(sdp->sd_statfs_inode);
-
- gfs2_glock_dq_uninit(&m_ip->i_gh);
- }
+ if (inode == sdp->sd_rindex)
+ gfs2_glock_dq_uninit(statfs_gh);

out_unlock:
- gfs2_glock_dq(&ip->i_gh);
+ gfs2_glock_dq(gh);
out_uninit:
- gfs2_holder_uninit(&ip->i_gh);
+ gfs2_holder_uninit(gh);
+ if (statfs_gh)
+ kfree(statfs_gh);
return ret;
}

@@ -966,7 +980,7 @@ static ssize_t gfs2_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
goto out_unlock;

iocb->ki_flags |= IOCB_DSYNC;
- buffered = gfs2_file_buffered_write(iocb, from);
+ buffered = gfs2_file_buffered_write(iocb, from, &gh);
if (unlikely(buffered <= 0)) {
if (!ret)
ret = buffered;
@@ -988,7 +1002,7 @@ static ssize_t gfs2_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
if (!ret || ret2 > 0)
ret += ret2;
} else {
- ret = gfs2_file_buffered_write(iocb, from);
+ ret = gfs2_file_buffered_write(iocb, from, &gh);
if (likely(ret > 0)) {
iocb->ki_pos += ret;
ret = generic_write_sync(iocb, ret);
diff --git a/fs/gfs2/incore.h b/fs/gfs2/incore.h
index e73a81db0714..87abdcc1de0c 100644
--- a/fs/gfs2/incore.h
+++ b/fs/gfs2/incore.h
@@ -387,9 +387,8 @@ struct gfs2_inode {
u64 i_generation;
u64 i_eattr;
unsigned long i_flags; /* GIF_... */
- struct gfs2_glock *i_gl; /* Move into i_gh? */
+ struct gfs2_glock *i_gl;
struct gfs2_holder i_iopen_gh;
- struct gfs2_holder i_gh; /* for prepare/commit_write only */
struct gfs2_qadata *i_qadata; /* quota allocation data */
struct gfs2_holder i_rgd_gh;
struct gfs2_blkreserv i_res; /* rgrp multi-block reservation */
--
2.26.3

2021-08-27 16:53:42

by Andreas Gruenbacher

[permalink] [raw]
Subject: [PATCH v7 13/19] gfs2: Fix mmap + page fault deadlocks for buffered I/O

In the .read_iter and .write_iter file operations, we're accessing
user-space memory while holding the inode glock. There is a possibility
that the memory is mapped to the same file, in which case we'd recurse
on the same glock.

More complex scenarios can involve multiple glocks, processes, and even
cluster nodes.

Avoid these kinds of problems by disabling page faults while holding the
inode glock. If a page fault would occur, we either end up with a
partial read or write, or with -EFAULT if nothing could be read or
written. In either case, we know that we're not done with the
operation, so we indicate that we're willing to give up the inode glock
(HIF_MAY_DEMOTE) and then we fault in the missing pages. If that made
us lose the inode glock, we return a partial read or write. Otherwise,
we resume the operation.
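
For illustration, the read-side retry loop from the diff below looks like this
in condensed form (declarations and the unlikely() annotations trimmed); the
write side follows the same pattern:

retry:
	ret = gfs2_glock_nq(&gh);
	if (ret)
		goto out_uninit;
retry_under_glock:
	pagefault_disable();
	ret = generic_file_read_iter(iocb, to);
	pagefault_enable();
	if (ret > 0)
		written += ret;

	if (iov_iter_count(to) && (ret > 0 || ret == -EFAULT) &&
	    should_fault_in_pages(to, &prev_count, &window_size)) {
		gfs2_holder_allow_demote(&gh);
		leftover = fault_in_iov_iter_writeable(to, window_size);
		gfs2_holder_disallow_demote(&gh);
		if (leftover != window_size) {
			if (!gfs2_holder_queued(&gh)) {
				/* lost the glock: return the partial read */
				if (written)
					goto out_uninit;
				goto retry;
			}
			goto retry_under_glock;
		}
	}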

This locking problem was originally reported by Jan Kara. Linus came up
with the proposal to disable page faults. Many thanks to Al Viro and
Matthew Wilcox for their feedback.

Signed-off-by: Andreas Gruenbacher <[email protected]>
---
fs/gfs2/file.c | 91 +++++++++++++++++++++++++++++++++++++++++++++++---
1 file changed, 87 insertions(+), 4 deletions(-)

diff --git a/fs/gfs2/file.c b/fs/gfs2/file.c
index 5f328bc21d0b..fce3a5249e19 100644
--- a/fs/gfs2/file.c
+++ b/fs/gfs2/file.c
@@ -776,6 +776,36 @@ static int gfs2_fsync(struct file *file, loff_t start, loff_t end,
return ret ? ret : ret1;
}

+static bool should_fault_in_pages(struct iov_iter *i, size_t *prev_count,
+ size_t *window_size)
+{
+ char __user *p = i->iov[0].iov_base + i->iov_offset;
+ size_t count = iov_iter_count(i);
+ size_t size;
+
+ if (!iter_is_iovec(i))
+ return false;
+
+ if (*prev_count != count || !*window_size) {
+ int pages, nr_dirtied;
+
+ pages = min_t(int, BIO_MAX_VECS,
+ DIV_ROUND_UP(iov_iter_count(i), PAGE_SIZE));
+ nr_dirtied = max(current->nr_dirtied_pause -
+ current->nr_dirtied, 1);
+ pages = min(pages, nr_dirtied);
+ size = (size_t)PAGE_SIZE * pages - offset_in_page(p);
+ } else {
+ size = (size_t)PAGE_SIZE - offset_in_page(p);
+ if (*window_size <= size)
+ return false;
+ }
+
+ *prev_count = count;
+ *window_size = size;
+ return true;
+}
+
static ssize_t gfs2_file_direct_read(struct kiocb *iocb, struct iov_iter *to,
struct gfs2_holder *gh)
{
@@ -840,9 +870,16 @@ static ssize_t gfs2_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
{
struct gfs2_inode *ip;
struct gfs2_holder gh;
+ size_t prev_count = 0, window_size = 0;
size_t written = 0;
ssize_t ret;

+ /*
+ * In this function, we disable page faults when we're holding the
+ * inode glock while doing I/O. If a page fault occurs, we drop the
+ * inode glock, fault in the pages manually, and retry.
+ */
+
if (iocb->ki_flags & IOCB_DIRECT) {
ret = gfs2_file_direct_read(iocb, to, &gh);
if (likely(ret != -ENOTBLK))
@@ -864,13 +901,35 @@ static ssize_t gfs2_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
}
ip = GFS2_I(iocb->ki_filp->f_mapping->host);
gfs2_holder_init(ip->i_gl, LM_ST_SHARED, 0, &gh);
+retry:
ret = gfs2_glock_nq(&gh);
if (ret)
goto out_uninit;
+retry_under_glock:
+ pagefault_disable();
ret = generic_file_read_iter(iocb, to);
+ pagefault_enable();
if (ret > 0)
written += ret;
- gfs2_glock_dq(&gh);
+
+ if (unlikely(iov_iter_count(to) && (ret > 0 || ret == -EFAULT)) &&
+ should_fault_in_pages(to, &prev_count, &window_size)) {
+ size_t leftover;
+
+ gfs2_holder_allow_demote(&gh);
+ leftover = fault_in_iov_iter_writeable(to, window_size);
+ gfs2_holder_disallow_demote(&gh);
+ if (leftover != window_size) {
+ if (!gfs2_holder_queued(&gh)) {
+ if (written)
+ goto out_uninit;
+ goto retry;
+ }
+ goto retry_under_glock;
+ }
+ }
+ if (gfs2_holder_queued(&gh))
+ gfs2_glock_dq(&gh);
out_uninit:
gfs2_holder_uninit(&gh);
return written ? written : ret;
@@ -885,6 +944,8 @@ static ssize_t gfs2_file_buffered_write(struct kiocb *iocb,
struct gfs2_inode *ip = GFS2_I(inode);
struct gfs2_sbd *sdp = GFS2_SB(inode);
struct gfs2_holder *statfs_gh = NULL;
+ size_t prev_count = 0, window_size = 0;
+ size_t read = 0;
ssize_t ret;

/*
@@ -900,10 +961,11 @@ static ssize_t gfs2_file_buffered_write(struct kiocb *iocb,
}

gfs2_holder_init(ip->i_gl, LM_ST_EXCLUSIVE, 0, gh);
+retry:
ret = gfs2_glock_nq(gh);
if (ret)
goto out_uninit;
-
+retry_under_glock:
if (inode == sdp->sd_rindex) {
struct gfs2_inode *m_ip = GFS2_I(sdp->sd_statfs_inode);

@@ -914,19 +976,40 @@ static ssize_t gfs2_file_buffered_write(struct kiocb *iocb,
}

current->backing_dev_info = inode_to_bdi(inode);
+ pagefault_disable();
ret = iomap_file_buffered_write(iocb, from, &gfs2_iomap_ops);
+ pagefault_enable();
current->backing_dev_info = NULL;
+ if (ret > 0)
+ read += ret;

if (inode == sdp->sd_rindex)
gfs2_glock_dq_uninit(statfs_gh);

+ if (unlikely(iov_iter_count(from) && (ret > 0 || ret == -EFAULT)) &&
+ should_fault_in_pages(from, &prev_count, &window_size)) {
+ size_t leftover;
+
+ gfs2_holder_allow_demote(gh);
+ leftover = fault_in_iov_iter_readable(from, window_size);
+ gfs2_holder_disallow_demote(gh);
+ if (leftover != window_size) {
+ if (!gfs2_holder_queued(gh)) {
+ if (read)
+ goto out_uninit;
+ goto retry;
+ }
+ goto retry_under_glock;
+ }
+ }
out_unlock:
- gfs2_glock_dq(gh);
+ if (gfs2_holder_queued(gh))
+ gfs2_glock_dq(gh);
out_uninit:
gfs2_holder_uninit(gh);
if (statfs_gh)
kfree(statfs_gh);
- return ret;
+ return read ? read : ret;
}

/**
--
2.26.3

2021-08-27 16:54:03

by Andreas Gruenbacher

[permalink] [raw]
Subject: [PATCH v7 15/19] iomap: Support partial direct I/O on user copy failures

In iomap_dio_rw, when iomap_apply returns an -EFAULT error and the
IOMAP_DIO_PARTIAL flag is set, complete the request synchronously and
return a partial result. This allows the caller to deal with the page
fault and retry the remainder of the request.
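
A hedged sketch of how a caller is expected to use the new flag (gfs2 is wired
up to do this in later patches of this series):

	ret = iomap_dio_rw(iocb, iter, ops, NULL, IOMAP_DIO_PARTIAL);
	if (iov_iter_count(iter) && (ret > 0 || ret == -EFAULT)) {
		/* fault in the remaining user pages, then retry */
	}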

Signed-off-by: Andreas Gruenbacher <[email protected]>
---
fs/iomap/direct-io.c | 6 ++++++
include/linux/iomap.h | 7 +++++++
2 files changed, 13 insertions(+)

diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index 8054f5d6c273..ba88fe51b77a 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -561,6 +561,12 @@ __iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
ret = iomap_apply(inode, pos, count, iomap_flags, ops, dio,
iomap_dio_actor);
if (ret <= 0) {
+ if (ret == -EFAULT && dio->size &&
+ (dio_flags & IOMAP_DIO_PARTIAL)) {
+ wait_for_completion = true;
+ ret = 0;
+ }
+
/* magic error code to fall back to buffered I/O */
if (ret == -ENOTBLK) {
wait_for_completion = true;
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index 479c1da3e221..bcae4814b8e3 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -267,6 +267,13 @@ struct iomap_dio_ops {
*/
#define IOMAP_DIO_OVERWRITE_ONLY (1 << 1)

+/*
+ * When a page fault occurs, return a partial synchronous result and allow
+ * the caller to retry the rest of the operation after dealing with the page
+ * fault.
+ */
+#define IOMAP_DIO_PARTIAL (1 << 2)
+
ssize_t iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
const struct iomap_ops *ops, const struct iomap_dio_ops *dops,
unsigned int dio_flags);
--
2.26.3

2021-08-27 16:54:24

by Andreas Gruenbacher

[permalink] [raw]
Subject: [PATCH v7 19/19] gfs2: Fix mmap + page fault deadlocks for direct I/O

Also disable page faults during direct I/O requests and implement a
similar kind of retry logic as in the buffered I/O case.

The retry logic in the direct I/O case differs from the buffered I/O
case in the following way: direct I/O doesn't provide the kinds of
consistency guarantees between concurrent reads and writes that buffered
I/O provides, so when we lose the inode glock while faulting in user
pages, we always resume the operation. We never need to return a
partial read or write.
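
Condensed from the diff below, the core of the direct-read case looks like
this (to->nofault and IOMAP_DIO_PARTIAL come from the preceding patches); note
that when we lose the glock, we simply re-acquire it and resume instead of
returning a partial result:

	pagefault_disable();
	to->nofault = true;
	ret = iomap_dio_rw(iocb, to, &gfs2_iomap_ops, NULL,
			   IOMAP_DIO_PARTIAL, written);
	to->nofault = false;
	pagefault_enable();
	if (ret > 0)
		written = ret;

	if (iov_iter_count(to) && (ret > 0 || ret == -EFAULT) &&
	    should_fault_in_pages(to, &prev_count, &window_size)) {
		gfs2_holder_allow_demote(gh);
		leftover = fault_in_iov_iter_writeable(to, window_size);
		gfs2_holder_disallow_demote(gh);
		if (leftover != window_size) {
			if (!gfs2_holder_queued(gh))
				goto retry;
			goto retry_under_glock;
		}
	}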

This locking problem was originally reported by Jan Kara. Linus came up
with the proposal to disable page faults. Many thanks to Al Viro and
Matthew Wilcox for their feedback.

Signed-off-by: Andreas Gruenbacher <[email protected]>
---
fs/gfs2/file.c | 99 ++++++++++++++++++++++++++++++++++++++++++++------
1 file changed, 87 insertions(+), 12 deletions(-)

diff --git a/fs/gfs2/file.c b/fs/gfs2/file.c
index 64bf2f68e6d6..6603d9cd8739 100644
--- a/fs/gfs2/file.c
+++ b/fs/gfs2/file.c
@@ -811,22 +811,64 @@ static ssize_t gfs2_file_direct_read(struct kiocb *iocb, struct iov_iter *to,
{
struct file *file = iocb->ki_filp;
struct gfs2_inode *ip = GFS2_I(file->f_mapping->host);
- size_t count = iov_iter_count(to);
+ size_t prev_count = 0, window_size = 0;
+ size_t written = 0;
ssize_t ret;

- if (!count)
+ /*
+ * In this function, we disable page faults when we're holding the
+ * inode glock while doing I/O. If a page fault occurs, we drop the
+ * inode glock, fault in the pages manually, and retry.
+ *
+ * Unlike generic_file_read_iter, for reads, iomap_dio_rw can trigger
+ * physical as well as manual page faults, and we need to disable both
+ * kinds.
+ *
+ * For direct I/O, gfs2 takes the inode glock in deferred mode. This
+ * locking mode is compatible with other deferred holders, so multiple
+ * processes and nodes can do direct I/O to a file at the same time.
+ * There's no guarantee that reads or writes will be atomic. Any
+ * coordination among readers and writers needs to happen externally.
+ */
+
+ if (!iov_iter_count(to))
return 0; /* skip atime */

gfs2_holder_init(ip->i_gl, LM_ST_DEFERRED, 0, gh);
+retry:
ret = gfs2_glock_nq(gh);
if (ret)
goto out_uninit;
+retry_under_glock:
+ pagefault_disable();
+ to->nofault = true;
+ ret = iomap_dio_rw(iocb, to, &gfs2_iomap_ops, NULL,
+ IOMAP_DIO_PARTIAL, written);
+ to->nofault = false;
+ pagefault_enable();
+ if (ret > 0)
+ written = ret;

- ret = iomap_dio_rw(iocb, to, &gfs2_iomap_ops, NULL, 0, 0);
- gfs2_glock_dq(gh);
+ if (unlikely(iov_iter_count(to) && (ret > 0 || ret == -EFAULT)) &&
+ should_fault_in_pages(to, &prev_count, &window_size)) {
+ size_t leftover;
+
+ gfs2_holder_allow_demote(gh);
+ leftover = fault_in_iov_iter_writeable(to, window_size);
+ gfs2_holder_disallow_demote(gh);
+ if (leftover != window_size) {
+ if (!gfs2_holder_queued(gh))
+ goto retry;
+ goto retry_under_glock;
+ }
+ }
+ if (gfs2_holder_queued(gh))
+ gfs2_glock_dq(gh);
out_uninit:
gfs2_holder_uninit(gh);
- return ret;
+ if (ret < 0)
+ return ret;
+ return written;
}

static ssize_t gfs2_file_direct_write(struct kiocb *iocb, struct iov_iter *from,
@@ -835,10 +877,19 @@ static ssize_t gfs2_file_direct_write(struct kiocb *iocb, struct iov_iter *from,
struct file *file = iocb->ki_filp;
struct inode *inode = file->f_mapping->host;
struct gfs2_inode *ip = GFS2_I(inode);
- size_t len = iov_iter_count(from);
- loff_t offset = iocb->ki_pos;
+ size_t prev_count = 0, window_size = 0;
+ size_t read = 0;
ssize_t ret;

+ /*
+ * In this function, we disable page faults when we're holding the
+ * inode glock while doing I/O. If a page fault occurs, we drop the
+ * inode glock, fault in the pages manually, and retry.
+ *
+ * For writes, iomap_dio_rw only triggers manual page faults, so we
+ * don't need to disable physical ones.
+ */
+
/*
* Deferred lock, even if its a write, since we do no allocation on
* this path. All we need to change is the atime, and this lock mode
@@ -848,22 +899,46 @@ static ssize_t gfs2_file_direct_write(struct kiocb *iocb, struct iov_iter *from,
* VFS does.
*/
gfs2_holder_init(ip->i_gl, LM_ST_DEFERRED, 0, gh);
+retry:
ret = gfs2_glock_nq(gh);
if (ret)
goto out_uninit;
-
+retry_under_glock:
/* Silently fall back to buffered I/O when writing beyond EOF */
- if (offset + len > i_size_read(&ip->i_inode))
+ if (iocb->ki_pos + iov_iter_count(from) > i_size_read(&ip->i_inode))
goto out;

- ret = iomap_dio_rw(iocb, from, &gfs2_iomap_ops, NULL, 0, 0);
+ from->nofault = true;
+ ret = iomap_dio_rw(iocb, from, &gfs2_iomap_ops, NULL,
+ IOMAP_DIO_PARTIAL, read);
+ from->nofault = false;
+
if (ret == -ENOTBLK)
ret = 0;
+ if (ret > 0)
+ read = ret;
+
+ if (unlikely(iov_iter_count(from) && (ret > 0 || ret == -EFAULT)) &&
+ should_fault_in_pages(from, &prev_count, &window_size)) {
+ size_t leftover;
+
+ gfs2_holder_allow_demote(gh);
+ leftover = fault_in_iov_iter_readable(from, window_size);
+ gfs2_holder_disallow_demote(gh);
+ if (leftover != window_size) {
+ if (!gfs2_holder_queued(gh))
+ goto retry;
+ goto retry_under_glock;
+ }
+ }
out:
- gfs2_glock_dq(gh);
+ if (gfs2_holder_queued(gh))
+ gfs2_glock_dq(gh);
out_uninit:
gfs2_holder_uninit(gh);
- return ret;
+ if (ret < 0)
+ return ret;
+ return read;
}

static ssize_t gfs2_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
--
2.26.3

2021-08-27 16:55:37

by Andreas Gruenbacher

[permalink] [raw]
Subject: [PATCH v7 17/19] gup: Introduce FOLL_NOFAULT flag to disable page faults

Introduce a new FOLL_NOFAULT flag that causes get_user_pages to return
-EFAULT when it would otherwise trigger a page fault. This is roughly
similar to FOLL_FAST_ONLY but available on all architectures, and less
fragile.
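
A hedged sketch of the intended use (the helper name below is made up for
illustration): pin whatever pages can be grabbed without faulting and get back
a short count, or -EFAULT if none could be pinned:

	/* hypothetical helper, for illustration only */
	static int pin_resident_pages(unsigned long addr, int npages,
				      struct page **pages)
	{
		return get_user_pages_fast(addr, npages,
					   FOLL_WRITE | FOLL_NOFAULT, pages);
	}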

Signed-off-by: Andreas Gruenbacher <[email protected]>
---
include/linux/mm.h | 3 ++-
mm/gup.c | 4 +++-
2 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 7ca22e6e694a..958246aa343f 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2850,7 +2850,8 @@ struct page *follow_page(struct vm_area_struct *vma, unsigned long address,
#define FOLL_FORCE 0x10 /* get_user_pages read/write w/o permission */
#define FOLL_NOWAIT 0x20 /* if a disk transfer is needed, start the IO
* and return without waiting upon it */
-#define FOLL_POPULATE 0x40 /* fault in page */
+#define FOLL_POPULATE 0x40 /* fault in pages (with FOLL_MLOCK) */
+#define FOLL_NOFAULT 0x80 /* do not fault in pages */
#define FOLL_HWPOISON 0x100 /* check page is hwpoisoned */
#define FOLL_NUMA 0x200 /* force NUMA hinting page fault */
#define FOLL_MIGRATION 0x400 /* wait for page to replace migration entry */
diff --git a/mm/gup.c b/mm/gup.c
index 03ab03b68dc7..69056adcc8c9 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -932,6 +932,8 @@ static int faultin_page(struct vm_area_struct *vma,
/* mlock all present pages, but do not fault in new pages */
if ((*flags & (FOLL_POPULATE | FOLL_MLOCK)) == FOLL_MLOCK)
return -ENOENT;
+ if (*flags & FOLL_NOFAULT)
+ return -EFAULT;
if (*flags & FOLL_WRITE)
fault_flags |= FAULT_FLAG_WRITE;
if (*flags & FOLL_REMOTE)
@@ -2857,7 +2859,7 @@ static int internal_get_user_pages_fast(unsigned long start,

if (WARN_ON_ONCE(gup_flags & ~(FOLL_WRITE | FOLL_LONGTERM |
FOLL_FORCE | FOLL_PIN | FOLL_GET |
- FOLL_FAST_ONLY)))
+ FOLL_FAST_ONLY | FOLL_NOFAULT)))
return -EINVAL;

if (gup_flags & FOLL_PIN)
--
2.26.3

2021-08-27 16:56:13

by Andreas Gruenbacher

[permalink] [raw]
Subject: [PATCH v7 14/19] iomap: Fix iomap_dio_rw return value for user copies

When a user copy fails in one of the helpers of iomap_dio_rw, fail with
-EFAULT instead of returning 0. This matches what iomap_dio_bio_actor
returns when it gets an -EFAULT from bio_iov_iter_get_pages. With these
changes, iomap_dio_actor now consistently fails with -EFAULT when a user
page cannot be faulted in.

Signed-off-by: Andreas Gruenbacher <[email protected]>
---
fs/iomap/direct-io.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index 9398b8c31323..8054f5d6c273 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -370,7 +370,7 @@ iomap_dio_hole_actor(loff_t length, struct iomap_dio *dio)
{
length = iov_iter_zero(length, dio->submit.iter);
dio->size += length;
- return length;
+ return length ? length : -EFAULT;
}

static loff_t
@@ -397,7 +397,7 @@ iomap_dio_inline_actor(struct inode *inode, loff_t pos, loff_t length,
copied = copy_to_iter(iomap->inline_data + pos, length, iter);
}
dio->size += copied;
- return copied;
+ return copied ? copied : -EFAULT;
}

static loff_t
--
2.26.3

2021-08-27 16:56:13

by Andreas Gruenbacher

[permalink] [raw]
Subject: [PATCH v7 16/19] iomap: Add done_before argument to iomap_dio_rw

Add a done_before argument to iomap_dio_rw that indicates how much of
the request has already been transferred. When the request succeeds, we
report that done_before additional bytes were transferred. This is
useful for finishing a request asynchronously when part of the request
has already been completed synchronously.

We'll use that to allow iomap_dio_rw to be used with page faults
disabled: when a page fault occurs while submitting a request, we
synchronously complete the part of the request that has already been
submitted. The caller can then take care of the page fault and call
iomap_dio_rw again for the rest of the request, passing in the number of
bytes already transferred.
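
A hedged, heavily simplified sketch of the calling pattern (declarations,
locking, and the nofault handling omitted):

	ret = iomap_dio_rw(iocb, to, ops, NULL, IOMAP_DIO_PARTIAL, 0);
	if (ret > 0)
		done = ret;
	if (iov_iter_count(to) && (ret > 0 || ret == -EFAULT)) {
		fault_in_iov_iter_writeable(to, iov_iter_count(to));
		ret = iomap_dio_rw(iocb, to, ops, NULL,
				   IOMAP_DIO_PARTIAL, done);
		/* a successful result now covers the whole request */
	}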

Signed-off-by: Andreas Gruenbacher <[email protected]>
---
fs/btrfs/file.c | 5 +++--
fs/ext4/file.c | 5 +++--
fs/gfs2/file.c | 4 ++--
fs/iomap/direct-io.c | 11 ++++++++---
fs/xfs/xfs_file.c | 6 +++---
fs/zonefs/super.c | 4 ++--
include/linux/iomap.h | 4 ++--
7 files changed, 23 insertions(+), 16 deletions(-)

diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 281c77cfe91a..8817fe6b5fc0 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1945,7 +1945,7 @@ static ssize_t btrfs_direct_write(struct kiocb *iocb, struct iov_iter *from)
}

dio = __iomap_dio_rw(iocb, from, &btrfs_dio_iomap_ops, &btrfs_dio_ops,
- 0);
+ 0, 0);

btrfs_inode_unlock(inode, ilock_flags);

@@ -3637,7 +3637,8 @@ static ssize_t btrfs_direct_read(struct kiocb *iocb, struct iov_iter *to)
return 0;

btrfs_inode_lock(inode, BTRFS_ILOCK_SHARED);
- ret = iomap_dio_rw(iocb, to, &btrfs_dio_iomap_ops, &btrfs_dio_ops, 0);
+ ret = iomap_dio_rw(iocb, to, &btrfs_dio_iomap_ops, &btrfs_dio_ops,
+ 0, 0);
btrfs_inode_unlock(inode, BTRFS_ILOCK_SHARED);
return ret;
}
diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 816dedcbd541..4a5e7fd31fb5 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -74,7 +74,7 @@ static ssize_t ext4_dio_read_iter(struct kiocb *iocb, struct iov_iter *to)
return generic_file_read_iter(iocb, to);
}

- ret = iomap_dio_rw(iocb, to, &ext4_iomap_ops, NULL, 0);
+ ret = iomap_dio_rw(iocb, to, &ext4_iomap_ops, NULL, 0, 0);
inode_unlock_shared(inode);

file_accessed(iocb->ki_filp);
@@ -566,7 +566,8 @@ static ssize_t ext4_dio_write_iter(struct kiocb *iocb, struct iov_iter *from)
if (ilock_shared)
iomap_ops = &ext4_iomap_overwrite_ops;
ret = iomap_dio_rw(iocb, from, iomap_ops, &ext4_dio_write_ops,
- (unaligned_io || extend) ? IOMAP_DIO_FORCE_WAIT : 0);
+ (unaligned_io || extend) ? IOMAP_DIO_FORCE_WAIT : 0,
+ 0);
if (ret == -ENOTBLK)
ret = 0;

diff --git a/fs/gfs2/file.c b/fs/gfs2/file.c
index fce3a5249e19..64bf2f68e6d6 100644
--- a/fs/gfs2/file.c
+++ b/fs/gfs2/file.c
@@ -822,7 +822,7 @@ static ssize_t gfs2_file_direct_read(struct kiocb *iocb, struct iov_iter *to,
if (ret)
goto out_uninit;

- ret = iomap_dio_rw(iocb, to, &gfs2_iomap_ops, NULL, 0);
+ ret = iomap_dio_rw(iocb, to, &gfs2_iomap_ops, NULL, 0, 0);
gfs2_glock_dq(gh);
out_uninit:
gfs2_holder_uninit(gh);
@@ -856,7 +856,7 @@ static ssize_t gfs2_file_direct_write(struct kiocb *iocb, struct iov_iter *from,
if (offset + len > i_size_read(&ip->i_inode))
goto out;

- ret = iomap_dio_rw(iocb, from, &gfs2_iomap_ops, NULL, 0);
+ ret = iomap_dio_rw(iocb, from, &gfs2_iomap_ops, NULL, 0, 0);
if (ret == -ENOTBLK)
ret = 0;
out:
diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index ba88fe51b77a..dcf9a2b4381f 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -31,6 +31,7 @@ struct iomap_dio {
atomic_t ref;
unsigned flags;
int error;
+ size_t done_before;
bool wait_for_completion;

union {
@@ -126,6 +127,9 @@ ssize_t iomap_dio_complete(struct iomap_dio *dio)
if (ret > 0 && (dio->flags & IOMAP_DIO_NEED_SYNC))
ret = generic_write_sync(iocb, ret);

+ if (ret > 0)
+ ret += dio->done_before;
+
kfree(dio);

return ret;
@@ -450,7 +454,7 @@ iomap_dio_actor(struct inode *inode, loff_t pos, loff_t length,
struct iomap_dio *
__iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
const struct iomap_ops *ops, const struct iomap_dio_ops *dops,
- unsigned int dio_flags)
+ unsigned int dio_flags, size_t done_before)
{
struct address_space *mapping = iocb->ki_filp->f_mapping;
struct inode *inode = file_inode(iocb->ki_filp);
@@ -477,6 +481,7 @@ __iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
dio->dops = dops;
dio->error = 0;
dio->flags = 0;
+ dio->done_before = done_before;

dio->submit.iter = iter;
dio->submit.waiter = current;
@@ -648,11 +653,11 @@ EXPORT_SYMBOL_GPL(__iomap_dio_rw);
ssize_t
iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
const struct iomap_ops *ops, const struct iomap_dio_ops *dops,
- unsigned int dio_flags)
+ unsigned int dio_flags, size_t done_before)
{
struct iomap_dio *dio;

- dio = __iomap_dio_rw(iocb, iter, ops, dops, dio_flags);
+ dio = __iomap_dio_rw(iocb, iter, ops, dops, dio_flags, done_before);
if (IS_ERR_OR_NULL(dio))
return PTR_ERR_OR_ZERO(dio);
return iomap_dio_complete(dio);
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index cc3cfb12df53..3103d9bda466 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -259,7 +259,7 @@ xfs_file_dio_read(
ret = xfs_ilock_iocb(iocb, XFS_IOLOCK_SHARED);
if (ret)
return ret;
- ret = iomap_dio_rw(iocb, to, &xfs_read_iomap_ops, NULL, 0);
+ ret = iomap_dio_rw(iocb, to, &xfs_read_iomap_ops, NULL, 0, 0);
xfs_iunlock(ip, XFS_IOLOCK_SHARED);

return ret;
@@ -569,7 +569,7 @@ xfs_file_dio_write_aligned(
}
trace_xfs_file_direct_write(iocb, from);
ret = iomap_dio_rw(iocb, from, &xfs_direct_write_iomap_ops,
- &xfs_dio_write_ops, 0);
+ &xfs_dio_write_ops, 0, 0);
out_unlock:
if (iolock)
xfs_iunlock(ip, iolock);
@@ -647,7 +647,7 @@ xfs_file_dio_write_unaligned(

trace_xfs_file_direct_write(iocb, from);
ret = iomap_dio_rw(iocb, from, &xfs_direct_write_iomap_ops,
- &xfs_dio_write_ops, flags);
+ &xfs_dio_write_ops, flags, 0);

/*
* Retry unaligned I/O with exclusive blocking semantics if the DIO
diff --git a/fs/zonefs/super.c b/fs/zonefs/super.c
index 70055d486bf7..85ca2f5fe06e 100644
--- a/fs/zonefs/super.c
+++ b/fs/zonefs/super.c
@@ -864,7 +864,7 @@ static ssize_t zonefs_file_dio_write(struct kiocb *iocb, struct iov_iter *from)
ret = zonefs_file_dio_append(iocb, from);
else
ret = iomap_dio_rw(iocb, from, &zonefs_iomap_ops,
- &zonefs_write_dio_ops, 0);
+ &zonefs_write_dio_ops, 0, 0);
if (zi->i_ztype == ZONEFS_ZTYPE_SEQ &&
(ret > 0 || ret == -EIOCBQUEUED)) {
if (ret > 0)
@@ -999,7 +999,7 @@ static ssize_t zonefs_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
}
file_accessed(iocb->ki_filp);
ret = iomap_dio_rw(iocb, to, &zonefs_iomap_ops,
- &zonefs_read_dio_ops, 0);
+ &zonefs_read_dio_ops, 0, 0);
} else {
ret = generic_file_read_iter(iocb, to);
if (ret == -EIO)
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index bcae4814b8e3..908bda10024c 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -276,10 +276,10 @@ struct iomap_dio_ops {

ssize_t iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
const struct iomap_ops *ops, const struct iomap_dio_ops *dops,
- unsigned int dio_flags);
+ unsigned int dio_flags, size_t done_before);
struct iomap_dio *__iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
const struct iomap_ops *ops, const struct iomap_dio_ops *dops,
- unsigned int dio_flags);
+ unsigned int dio_flags, size_t done_before);
ssize_t iomap_dio_complete(struct iomap_dio *dio);
int iomap_dio_iopoll(struct kiocb *kiocb, bool spin);

--
2.26.3

2021-08-27 16:56:13

by Andreas Gruenbacher

[permalink] [raw]
Subject: [PATCH v7 18/19] iov_iter: Introduce nofault flag to disable page faults

Introduce a new nofault flag to indicate to get_user_pages to use the
FOLL_NOFAULT flag. This will cause get_user_pages to fail when it
would otherwise fault in a page.

Currently, the nofault flag is only checked in iov_iter_get_pages and
iov_iter_get_pages_alloc. This is enough for iomap_dio_rw, but it
may make sense to check in other contexts as well.

Signed-off-by: Andreas Gruenbacher <[email protected]>
---
include/linux/uio.h | 1 +
lib/iov_iter.c | 20 +++++++++++++++-----
2 files changed, 16 insertions(+), 5 deletions(-)

diff --git a/include/linux/uio.h b/include/linux/uio.h
index ffa431aeb067..ea35e511268f 100644
--- a/include/linux/uio.h
+++ b/include/linux/uio.h
@@ -29,6 +29,7 @@ enum iter_type {

struct iov_iter {
u8 iter_type;
+ bool nofault;
bool data_source;
size_t iov_offset;
size_t count;
diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index 968f2d2595cd..22a82f272754 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -513,6 +513,7 @@ void iov_iter_init(struct iov_iter *i, unsigned int direction,
WARN_ON(direction & ~(READ | WRITE));
*i = (struct iov_iter) {
.iter_type = ITER_IOVEC,
+ .nofault = false,
.data_source = direction,
.iov = iov,
.nr_segs = nr_segs,
@@ -1523,13 +1524,17 @@ ssize_t iov_iter_get_pages(struct iov_iter *i,
return 0;

if (likely(iter_is_iovec(i))) {
+ unsigned int gup_flags = 0;
unsigned long addr;

+ if (iov_iter_rw(i) != WRITE)
+ gup_flags |= FOLL_WRITE;
+ if (i->nofault)
+ gup_flags |= FOLL_NOFAULT;
+
addr = first_iovec_segment(i, &len, start, maxsize, maxpages);
n = DIV_ROUND_UP(len, PAGE_SIZE);
- res = get_user_pages_fast(addr, n,
- iov_iter_rw(i) != WRITE ? FOLL_WRITE : 0,
- pages);
+ res = get_user_pages_fast(addr, n, gup_flags, pages);
if (unlikely(res <= 0))
return res;
return (res == n ? len : res * PAGE_SIZE) - *start;
@@ -1645,15 +1650,20 @@ ssize_t iov_iter_get_pages_alloc(struct iov_iter *i,
return 0;

if (likely(iter_is_iovec(i))) {
+ unsigned int gup_flags = 0;
unsigned long addr;

+ if (iov_iter_rw(i) != WRITE)
+ gup_flags |= FOLL_WRITE;
+ if (i->nofault)
+ gup_flags |= FOLL_NOFAULT;
+
addr = first_iovec_segment(i, &len, start, maxsize, ~0U);
n = DIV_ROUND_UP(len, PAGE_SIZE);
p = get_pages_array(n);
if (!p)
return -ENOMEM;
- res = get_user_pages_fast(addr, n,
- iov_iter_rw(i) != WRITE ? FOLL_WRITE : 0, p);
+ res = get_user_pages_fast(addr, n, gup_flags, p);
if (unlikely(res <= 0)) {
kvfree(p);
*pages = NULL;
--
2.26.3

2021-08-27 17:20:39

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v7 00/19] gfs2: Fix mmap + page fault deadlocks

On Fri, Aug 27, 2021 at 9:49 AM Andreas Gruenbacher <[email protected]> wrote:
>
> here's another update on top of v5.14-rc7. Changes:
>
> * Some of the patch descriptions have been improved.
>
> * Patch "gfs2: Eliminate ip->i_gh" has been moved further to the front.
>
> At this point, I'm not aware of anything that still needs fixing,

From a quick scan, I didn't see anything that raised my hackles.

But I skipped all the gfs2-specific changes in the series, since
that's all above my paygrade.

Linus

2021-08-27 18:31:31

by Darrick J. Wong

[permalink] [raw]
Subject: Re: [PATCH v7 16/19] iomap: Add done_before argument to iomap_dio_rw

On Fri, Aug 27, 2021 at 06:49:23PM +0200, Andreas Gruenbacher wrote:
> Add a done_before argument to iomap_dio_rw that indicates how much of
> the request has already been transferred. When the request succeeds, we
> report that done_before additional bytes were transferred. This is
> useful for finishing a request asynchronously when part of the request
> has already been completed synchronously.
>
> We'll use that to allow iomap_dio_rw to be used with page faults
> disabled: when a page fault occurs while submitting a request, we
> synchronously complete the part of the request that has already been
> submitted. The caller can then take care of the page fault and call
> iomap_dio_rw again for the rest of the request, passing in the number of
> bytes already transferred.
>
> Signed-off-by: Andreas Gruenbacher <[email protected]>
> ---
> fs/btrfs/file.c | 5 +++--
> fs/ext4/file.c | 5 +++--
> fs/gfs2/file.c | 4 ++--
> fs/iomap/direct-io.c | 11 ++++++++---
> fs/xfs/xfs_file.c | 6 +++---
> fs/zonefs/super.c | 4 ++--
> include/linux/iomap.h | 4 ++--
> 7 files changed, 23 insertions(+), 16 deletions(-)
>

<snip to the interesting parts>

> diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
> index ba88fe51b77a..dcf9a2b4381f 100644
> --- a/fs/iomap/direct-io.c
> +++ b/fs/iomap/direct-io.c
> @@ -31,6 +31,7 @@ struct iomap_dio {
> atomic_t ref;
> unsigned flags;
> int error;
> + size_t done_before;
> bool wait_for_completion;
>
> union {
> @@ -126,6 +127,9 @@ ssize_t iomap_dio_complete(struct iomap_dio *dio)
> if (ret > 0 && (dio->flags & IOMAP_DIO_NEED_SYNC))
> ret = generic_write_sync(iocb, ret);
>
> + if (ret > 0)
> + ret += dio->done_before;

Pardon my ignorance since this is the first time I've had a crack at
this patchset, but why is it necessary to carry the "bytes copied"
count from the /previous/ iomap_dio_rw call all the way through to dio
completion of the current call?

If the directio operation succeeds even partially and the PARTIAL flag
is set, won't that push the iov iter ahead by however many bytes
completed?

In other words, why won't this loop work for gfs2?

	size_t copied = 0;
	while (iov_iter_count(iov) > 0) {
		ssize_t ret = iomap_dio_rw(iocb, iov, ..., IOMAP_DIO_PARTIAL);
		if (iov_iter_count(iov) == 0 || ret != -EFAULT)
			break;

		copied += ret;
		/* strange gfs2 relocking I don't understand */
		/* deal with page faults... */
	};
	if (ret < 0)
		return ret;
	return copied + ret;

It feels clunky to make the caller pass the results of a previous
operation through the current operation just so the caller can catch the
value again afterwards. Is there something I'm missing?

--D

> +
> kfree(dio);
>
> return ret;
> @@ -450,7 +454,7 @@ iomap_dio_actor(struct inode *inode, loff_t pos, loff_t length,
> struct iomap_dio *
> __iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
> const struct iomap_ops *ops, const struct iomap_dio_ops *dops,
> - unsigned int dio_flags)
> + unsigned int dio_flags, size_t done_before)
> {
> struct address_space *mapping = iocb->ki_filp->f_mapping;
> struct inode *inode = file_inode(iocb->ki_filp);
> @@ -477,6 +481,7 @@ __iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
> dio->dops = dops;
> dio->error = 0;
> dio->flags = 0;
> + dio->done_before = done_before;
>
> dio->submit.iter = iter;
> dio->submit.waiter = current;
> @@ -648,11 +653,11 @@ EXPORT_SYMBOL_GPL(__iomap_dio_rw);
> ssize_t
> iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
> const struct iomap_ops *ops, const struct iomap_dio_ops *dops,
> - unsigned int dio_flags)
> + unsigned int dio_flags, size_t done_before)
> {
> struct iomap_dio *dio;
>
> - dio = __iomap_dio_rw(iocb, iter, ops, dops, dio_flags);
> + dio = __iomap_dio_rw(iocb, iter, ops, dops, dio_flags, done_before);
> if (IS_ERR_OR_NULL(dio))
> return PTR_ERR_OR_ZERO(dio);
> return iomap_dio_complete(dio);
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index cc3cfb12df53..3103d9bda466 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -259,7 +259,7 @@ xfs_file_dio_read(
> ret = xfs_ilock_iocb(iocb, XFS_IOLOCK_SHARED);
> if (ret)
> return ret;
> - ret = iomap_dio_rw(iocb, to, &xfs_read_iomap_ops, NULL, 0);
> + ret = iomap_dio_rw(iocb, to, &xfs_read_iomap_ops, NULL, 0, 0);
> xfs_iunlock(ip, XFS_IOLOCK_SHARED);
>
> return ret;
> @@ -569,7 +569,7 @@ xfs_file_dio_write_aligned(
> }
> trace_xfs_file_direct_write(iocb, from);
> ret = iomap_dio_rw(iocb, from, &xfs_direct_write_iomap_ops,
> - &xfs_dio_write_ops, 0);
> + &xfs_dio_write_ops, 0, 0);
> out_unlock:
> if (iolock)
> xfs_iunlock(ip, iolock);
> @@ -647,7 +647,7 @@ xfs_file_dio_write_unaligned(
>
> trace_xfs_file_direct_write(iocb, from);
> ret = iomap_dio_rw(iocb, from, &xfs_direct_write_iomap_ops,
> - &xfs_dio_write_ops, flags);
> + &xfs_dio_write_ops, flags, 0);
>
> /*
> * Retry unaligned I/O with exclusive blocking semantics if the DIO
> diff --git a/fs/zonefs/super.c b/fs/zonefs/super.c
> index 70055d486bf7..85ca2f5fe06e 100644
> --- a/fs/zonefs/super.c
> +++ b/fs/zonefs/super.c
> @@ -864,7 +864,7 @@ static ssize_t zonefs_file_dio_write(struct kiocb *iocb, struct iov_iter *from)
> ret = zonefs_file_dio_append(iocb, from);
> else
> ret = iomap_dio_rw(iocb, from, &zonefs_iomap_ops,
> - &zonefs_write_dio_ops, 0);
> + &zonefs_write_dio_ops, 0, 0);
> if (zi->i_ztype == ZONEFS_ZTYPE_SEQ &&
> (ret > 0 || ret == -EIOCBQUEUED)) {
> if (ret > 0)
> @@ -999,7 +999,7 @@ static ssize_t zonefs_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
> }
> file_accessed(iocb->ki_filp);
> ret = iomap_dio_rw(iocb, to, &zonefs_iomap_ops,
> - &zonefs_read_dio_ops, 0);
> + &zonefs_read_dio_ops, 0, 0);
> } else {
> ret = generic_file_read_iter(iocb, to);
> if (ret == -EIO)
> diff --git a/include/linux/iomap.h b/include/linux/iomap.h
> index bcae4814b8e3..908bda10024c 100644
> --- a/include/linux/iomap.h
> +++ b/include/linux/iomap.h
> @@ -276,10 +276,10 @@ struct iomap_dio_ops {
>
> ssize_t iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
> const struct iomap_ops *ops, const struct iomap_dio_ops *dops,
> - unsigned int dio_flags);
> + unsigned int dio_flags, size_t done_before);
> struct iomap_dio *__iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
> const struct iomap_ops *ops, const struct iomap_dio_ops *dops,
> - unsigned int dio_flags);
> + unsigned int dio_flags, size_t done_before);
> ssize_t iomap_dio_complete(struct iomap_dio *dio);
> int iomap_dio_iopoll(struct kiocb *kiocb, bool spin);
>
> --
> 2.26.3
>

2021-08-27 18:53:23

by Al Viro

[permalink] [raw]
Subject: Re: [PATCH v7 18/19] iov_iter: Introduce nofault flag to disable page faults

On Fri, Aug 27, 2021 at 06:49:25PM +0200, Andreas Gruenbacher wrote:
> Introduce a new nofault flag to indicate to get_user_pages to use the
> FOLL_NOFAULT flag. This will cause get_user_pages to fail when it
> would otherwise fault in a page.
>
> Currently, the nofault flag is only checked in iov_iter_get_pages and
> iov_iter_get_pages_alloc. This is enough for iomap_dio_rw, but it
> may make sense to check in other contexts as well.

I can live with that, but
* direct assignments (as in the next patch) are fucking hard to
grep for. Is it intended to be "we set it for duration of primitive",
or...?
* it would be nice to have a description of intended semantics
for that thing. This "may make sense to check in other contexts" really
needs to be elaborated (and agreed) upon. Details, please.

2021-08-27 18:54:51

by Al Viro

[permalink] [raw]
Subject: Re: [PATCH v7 04/19] iov_iter: Turn iov_iter_fault_in_readable into fault_in_iov_iter_readable

On Fri, Aug 27, 2021 at 06:49:11PM +0200, Andreas Gruenbacher wrote:
> Turn iov_iter_fault_in_readable into a function that returns the number
> of bytes not faulted in (similar to copy_to_user) instead of returning a
> non-zero value when any of the requested pages couldn't be faulted in.
> This supports the existing users that require all pages to be faulted in
> as well as new users that are happy if any pages can be faulted in at
> all.
>
> Rename iov_iter_fault_in_readable to fault_in_iov_iter_readable to make
> sure that this change doesn't silently break things.

I really disagree with these calling conventions. "Number not faulted in"
is bloody useless; make it "nothing could be faulted in"/"something had
been faulted in" and it would make sense. Failure several pages into the
area should not be treated as a hard error, for one thing, and ANY user
of that thing will have to cope with short copies anyway, no matter how
much you've managed to fault in.

2021-08-27 18:56:38

by Al Viro

[permalink] [raw]
Subject: Re: [PATCH v7 05/19] iov_iter: Introduce fault_in_iov_iter_writeable

On Fri, Aug 27, 2021 at 06:49:12PM +0200, Andreas Gruenbacher wrote:
> Introduce a new fault_in_iov_iter_writeable helper for safely faulting
> in an iterator for writing. Uses get_user_pages() to fault in the pages
> without actually writing to them, which would be destructive.
>
> We'll use fault_in_iov_iter_writeable in gfs2 once we've determined that
> the iterator passed to .read_iter isn't in memory.

Again, the calling conventions are wrong. Make it success/failure or
0/-EFAULT. And it's inconsistent for iovec and non-iovec cases as it is.

2021-08-27 18:59:35

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v7 04/19] iov_iter: Turn iov_iter_fault_in_readable into fault_in_iov_iter_readable

On Fri, Aug 27, 2021 at 11:53 AM Al Viro <[email protected]> wrote:
>
> I really disagree with these calling conventions. "Number not faulted in"
> is bloody useless

It's what we already have for copy_to/from_user(), so it's actually
consistent with that.

And it avoids changing all the existing tests where people really
cared only about the "everything ok" case.

Andreas' first patch did that changed version, and it was ugly as hell.

But if you have a version that avoids the ugliness...

Linus

2021-08-27 19:09:00

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v7 05/19] iov_iter: Introduce fault_in_iov_iter_writeable

On Fri, Aug 27, 2021 at 11:52 AM Al Viro <[email protected]> wrote:
>
> Again, the calling conventions are wrong. Make it success/failure or
> 0/-EFAULT. And it's inconsistent for iovec and non-iovec cases as it is.

Al, the 0/-EFAULT thing DOES NOT WORK.

The whole "success vs failure" model is broken.

Because "success" for some people is "everything worked".

And for other people it is "at least _part_ of it worked".

So no, 0/-EFAULT fundamentally cannot work, because the return needs
to be able to handle that ternary situation (ie "nothing" vs
"something" vs "everything").

This is *literally* the exact same thing that we have for
copy_to/from_user(). And Andreas' solution (based on my suggestion) is
the exact same one that we have had for that code since basically day
#1.

The whole "0/-EFAULT" is simpler, yes. And it's what
"{get|put}_user()" uses, yes. And it's more common to a lot of other
functions that return zero or an error.

But see above. People *need* that ternary result, and "bytes/pages
uncopied" is not only the traditional one we use elsewhere in similar
situations, it's the one that has the easiest error tests for existing
users (because zero remains "everything worked").

Andreas originally had that "how many bytes/pages succeeded" return
value instead, and yes, that's also ternary. But it means that now the
common "complete success" test ends up being a lot uglier, and the
semantics of the function changes completely where "0" no longer means
success, and that messes up much more.

So I really think you are barking entirely up the wrong tree.

If there is any inconsistency, maybe we should make _more_ cases use
that "how many bytes/pages not copied" logic, but in a lot of cases
you don't actually need the ternary decision value.

So the inconsistency is EXACTLY the same as the one we have always had
for get|put_user() vs copy_to|from_user(), and it exists for the EXACT
same reason.

IOW, please explain how you'd solve the ternary problem without making
the code a lot uglier.
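
To put it concretely (illustration only, using the semantics from this
series):

	size_t left = fault_in_iov_iter_readable(i, bytes);

	if (left == bytes)	/* nothing could be faulted in */
		return -EFAULT;
	if (left)		/* something, but not everything */
		bytes -= left;
	/* left == 0: complete success, and the common test stays "!left" */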

Linus

2021-08-27 19:09:54

by Al Viro

[permalink] [raw]
Subject: Re: [PATCH v7 03/19] gup: Turn fault_in_pages_{readable,writeable} into fault_in_{readable,writeable}

On Fri, Aug 27, 2021 at 06:49:10PM +0200, Andreas Gruenbacher wrote:
> Turn fault_in_pages_{readable,writeable} into versions that return the
> number of bytes not faulted in (similar to copy_to_user) instead of
> returning a non-zero value when any of the requested pages couldn't be
> faulted in. This supports the existing users that require all pages to
> be faulted in as well as new users that are happy if any pages can be
> faulted in at all.
>
> Neither of these functions is entirely trivial and it doesn't seem
> useful to inline them, so move them to mm/gup.c.
>
> Rename the functions to fault_in_{readable,writeable} to make sure that
> this change doesn't silently break things.

I'm sorry, but this is wrong. The callers need to be reviewed and
sanitized. You have several oddball callers (most of them simply
wrong) *and* the ones on a very hot path in write(2). And _there_
the existing behaviour does the wrong thing for memory poisoning setups.

Do we have *any* cases where we both need the fault-in at all *and*
would not be better off with "fail only if the first byte couldn't have been
faulted in"?

> diff --git a/arch/powerpc/kernel/signal_32.c b/arch/powerpc/kernel/signal_32.c
> index 0608581967f0..38c3eae40c14 100644
> --- a/arch/powerpc/kernel/signal_32.c
> +++ b/arch/powerpc/kernel/signal_32.c
> @@ -1048,7 +1048,7 @@ SYSCALL_DEFINE3(swapcontext, struct ucontext __user *, old_ctx,
> if (new_ctx == NULL)
> return 0;
> if (!access_ok(new_ctx, ctx_size) ||
> - fault_in_pages_readable((u8 __user *)new_ctx, ctx_size))
> + fault_in_readable((char __user *)new_ctx, ctx_size))
> return -EFAULT;

This is completely pointless. Look at do_setcontext() there. Seriously,
it immediately does
	if (!user_read_access_begin(ucp, sizeof(*ucp)))
		return -EFAULT;
so this access_ok() is so much garbage. Then it does normal unsafe_get_...()
stuff, so it doesn't need that fault-in crap at all - it *must* handle
copyin failures, fault-in or not. Just lose that fault_in_... call and be
done with that.


> @@ -1237,7 +1237,7 @@ SYSCALL_DEFINE3(debug_setcontext, struct ucontext __user *, ctx,
> #endif
>
> if (!access_ok(ctx, sizeof(*ctx)) ||
> - fault_in_pages_readable((u8 __user *)ctx, sizeof(*ctx)))
> + fault_in_readable((char __user *)ctx, sizeof(*ctx)))
> return -EFAULT;

Ditto.

> diff --git a/arch/powerpc/kernel/signal_64.c b/arch/powerpc/kernel/signal_64.c
> index 1831bba0582e..9f471b4a11e3 100644
> --- a/arch/powerpc/kernel/signal_64.c
> +++ b/arch/powerpc/kernel/signal_64.c
> @@ -688,7 +688,7 @@ SYSCALL_DEFINE3(swapcontext, struct ucontext __user *, old_ctx,
> if (new_ctx == NULL)
> return 0;
> if (!access_ok(new_ctx, ctx_size) ||
> - fault_in_pages_readable((u8 __user *)new_ctx, ctx_size))
> + fault_in_readable((char __user *)new_ctx, ctx_size))
> return -EFAULT;

... and again.

> diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
> index 0ba98e08a029..9233ecc31e2e 100644
> --- a/fs/btrfs/ioctl.c
> +++ b/fs/btrfs/ioctl.c
> @@ -2244,9 +2244,8 @@ static noinline int search_ioctl(struct inode *inode,
> key.offset = sk->min_offset;
>
> while (1) {
> - ret = fault_in_pages_writeable(ubuf + sk_offset,
> - *buf_size - sk_offset);
> - if (ret)
> + ret = -EFAULT;
> + if (fault_in_writeable(ubuf + sk_offset, *buf_size - sk_offset))
> break;

Really?

> diff --git a/lib/iov_iter.c b/lib/iov_iter.c
> index 25dfc48536d7..069cedd9d7b4 100644
> --- a/lib/iov_iter.c
> +++ b/lib/iov_iter.c
> @@ -191,7 +191,7 @@ static size_t copy_page_to_iter_iovec(struct page *page, size_t offset, size_t b
> buf = iov->iov_base + skip;
> copy = min(bytes, iov->iov_len - skip);
>
> - if (IS_ENABLED(CONFIG_HIGHMEM) && !fault_in_pages_writeable(buf, copy)) {
> + if (IS_ENABLED(CONFIG_HIGHMEM) && !fault_in_writeable(buf, copy)) {

Here we definitely want "fail only if nothing could be faulted in"

> kaddr = kmap_atomic(page);
> from = kaddr + offset;
>
> @@ -275,7 +275,7 @@ static size_t copy_page_from_iter_iovec(struct page *page, size_t offset, size_t
> buf = iov->iov_base + skip;
> copy = min(bytes, iov->iov_len - skip);
>
> - if (IS_ENABLED(CONFIG_HIGHMEM) && !fault_in_pages_readable(buf, copy)) {
> + if (IS_ENABLED(CONFIG_HIGHMEM) && !fault_in_readable(buf, copy)) {

Same.

> @@ -446,13 +446,11 @@ int iov_iter_fault_in_readable(const struct iov_iter *i, size_t bytes)
> bytes = i->count;
> for (p = i->iov, skip = i->iov_offset; bytes; p++, skip = 0) {
> size_t len = min(bytes, p->iov_len - skip);
> - int err;
>
> if (unlikely(!len))
> continue;
> - err = fault_in_pages_readable(p->iov_base + skip, len);
> - if (unlikely(err))
> - return err;
> + if (fault_in_readable(p->iov_base + skip, len))
> + return -EFAULT;

... and the same, except that here we want failure only if nothing had already
been faulted in.

2021-08-27 19:20:02

by Al Viro

[permalink] [raw]
Subject: Re: [PATCH v7 04/19] iov_iter: Turn iov_iter_fault_in_readable into fault_in_iov_iter_readable

On Fri, Aug 27, 2021 at 11:57:19AM -0700, Linus Torvalds wrote:
> On Fri, Aug 27, 2021 at 11:53 AM Al Viro <[email protected]> wrote:
> >
> > I really disagree with these calling conventions. "Number not faulted in"
> > is bloody useless
>
> It's what we already have for copy_to/from_user(), so it's actually
> consistent with that.

After copy_to/copy_from you've got the data copied and it's not going
anywhere. After fault-in you still have to copy, and it still can give
you less data than fault-in had succeeded for. So you must handle short
copies separately, no matter how much you've got from fault-in.

> And it avoids changing all the existing tests where people really
> cared only about the "everything ok" case.

The thing is, the checks tend to be wrong. We can't rely upon the full
fault-in to expect the full copy-in/copy-out, so the checks downstream
are impossible to avoid anyway. And fault-in failure is always a slow
path, so we are not saving time here.

And for the memory poisoning we end up aborting a copy potentially
a lot earlier than we should.

> Andreas' first patch did that changed version, and was ugly as hell.
>
> But if you have a version that avoids the ugliness...

I'll need to dig my notes out...

2021-08-27 19:24:48

by Al Viro

[permalink] [raw]
Subject: Re: [PATCH v7 05/19] iov_iter: Introduce fault_in_iov_iter_writeable

On Fri, Aug 27, 2021 at 12:05:32PM -0700, Linus Torvalds wrote:

> But see above. People *need* that ternary result, and "bytes/pages
> uncopied" is not only the traditional one we use elsewhere in similar
> situations, it's the one that has the easiest error tests for existing
> users (because zero remains "everything worked").

Could you show the cases where "partial copy, so it's OK" behaviour would
break anything?

For that you would need the case where
	* partial fault-in is currently rejected by the check
	* checks downstream from there (for failing copy-in/copy-out) would
	  be either missing or would not be handled correctly in case of
	  partial fault-in or would slow a fast path down.

I don't see any such cases and I would be very surprised if such existed.
If you see any, please describe them - I could be wrong. And I would
like to take a good look at any such case and see how well does it handle
possible short copy after full fault-in.

2021-08-27 19:35:47

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v7 05/19] iov_iter: Introduce fault_in_iov_iter_writeable

On Fri, Aug 27, 2021 at 12:23 PM Al Viro <[email protected]> wrote:
>
> Could you show the cases where "partial copy, so it's OK" behaviour would
> break anything?

Absolutely.

For example, it would cause an infinite loop in
restore_fpregs_from_user() if the "buf" argument is a situation where
the first page is fine, but the next page is not.

Why? Because __restore_fpregs_from_user() would take a fault, but then
fault_in_pages_readable() (renamed) would succeed, so you'd just do
that "retry" forever and ever.

Probably there are a number of other places too. That was literally
the *first* place I looked at.

Seriously. The current semantics are "check the whole area".

THOSE MUST NOT CHANGE.

Linus

2021-08-27 19:39:53

by Al Viro

[permalink] [raw]
Subject: Re: [PATCH v7 05/19] iov_iter: Introduce fault_in_iov_iter_writeable

On Fri, Aug 27, 2021 at 12:33:00PM -0700, Linus Torvalds wrote:
> On Fri, Aug 27, 2021 at 12:23 PM Al Viro <[email protected]> wrote:
> >
> > Could you show the cases where "partial copy, so it's OK" behaviour would
> > break anything?
>
> Absolutely.
>
> For example, it would cause an infinite loop in
> restore_fpregs_from_user() if the "buf" argument is a situation where
> the first page is fine, but the next page is not.
>
> Why? Because __restore_fpregs_from_user() would take a fault, but then
> fault_in_pages_readable() (renamed) would succeed, so you'd just do
> that "retry" forever and ever.
>
> Probably there are a number of other places too. That was literally
> the *first* place I looked at.

OK...

Let me dig out the notes from the last time I looked through that area
and grep around a bit. Should be about an hour or two.

2021-08-27 19:59:14

by Andreas Gruenbacher

[permalink] [raw]
Subject: Re: [PATCH v7 18/19] iov_iter: Introduce nofault flag to disable page faults

On Fri, Aug 27, 2021 at 8:47 PM Al Viro <[email protected]> wrote:
> On Fri, Aug 27, 2021 at 06:49:25PM +0200, Andreas Gruenbacher wrote:
> > Introduce a new nofault flag to indicate to get_user_pages to use the
> > FOLL_NOFAULT flag. This will cause get_user_pages to fail when it
> > would otherwise fault in a page.
> >
> > Currently, the nofault flag is only checked in iov_iter_get_pages and
> > iov_iter_get_pages_alloc. This is enough for iomap_dio_rw, but it
> > may make sense to check in other contexts as well.
>
> I can live with that, but
> * direct assignments (as in the next patch) are fucking hard to
> grep for. Is it intended to be "we set it for duration of primitive",
> or...?

It's for this kind of pattern:

	pagefault_disable();
	to->nofault = true;
	ret = iomap_dio_rw(iocb, to, &gfs2_iomap_ops, NULL,
			   IOMAP_DIO_PARTIAL, written);
	to->nofault = false;
	pagefault_enable();

Clearing the flag at the end isn't strictly necessary, but it kind of
documents that the flag pertains to iomap_dio_rw and not something
else.

> * it would be nice to have a description of intended semantics
> for that thing. This "may make sense to check in other contexts" really
> needs to be elaborated (and agreed) upon. Details, please.

Maybe the description should just be something like:

"Introduce a new nofault flag to indicate to iov_iter_get_pages not to
fault in user pages.

This is implemented by passing the FOLL_NOFAULT flag to get_user_pages,
which causes get_user_pages to fail when it would otherwise fault in a
page. We'll use the ->nofault flag to prevent iomap_dio_rw from faulting
in pages when page faults are not allowed."

Thanks,
Andreas

2021-08-27 20:18:07

by Andreas Gruenbacher

[permalink] [raw]
Subject: Re: [PATCH v7 16/19] iomap: Add done_before argument to iomap_dio_rw

On Fri, Aug 27, 2021 at 8:30 PM Darrick J. Wong <[email protected]> wrote:
> On Fri, Aug 27, 2021 at 06:49:23PM +0200, Andreas Gruenbacher wrote:
> > Add a done_before argument to iomap_dio_rw that indicates how much of
> > the request has already been transferred. When the request succeeds, we
> > report that done_before additional bytes were transferred. This is
> > useful for finishing a request asynchronously when part of the request
> > has already been completed synchronously.
> >
> > We'll use that to allow iomap_dio_rw to be used with page faults
> > disabled: when a page fault occurs while submitting a request, we
> > synchronously complete the part of the request that has already been
> > submitted. The caller can then take care of the page fault and call
> > iomap_dio_rw again for the rest of the request, passing in the number of
> > bytes already transferred.
> >
> > Signed-off-by: Andreas Gruenbacher <[email protected]>
> > ---
> > fs/btrfs/file.c | 5 +++--
> > fs/ext4/file.c | 5 +++--
> > fs/gfs2/file.c | 4 ++--
> > fs/iomap/direct-io.c | 11 ++++++++---
> > fs/xfs/xfs_file.c | 6 +++---
> > fs/zonefs/super.c | 4 ++--
> > include/linux/iomap.h | 4 ++--
> > 7 files changed, 23 insertions(+), 16 deletions(-)
> >
>
> <snip to the interesting parts>
>
> > diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
> > index ba88fe51b77a..dcf9a2b4381f 100644
> > --- a/fs/iomap/direct-io.c
> > +++ b/fs/iomap/direct-io.c
> > @@ -31,6 +31,7 @@ struct iomap_dio {
> > atomic_t ref;
> > unsigned flags;
> > int error;
> > + size_t done_before;
> > bool wait_for_completion;
> >
> > union {
> > @@ -126,6 +127,9 @@ ssize_t iomap_dio_complete(struct iomap_dio *dio)
> > if (ret > 0 && (dio->flags & IOMAP_DIO_NEED_SYNC))
> > ret = generic_write_sync(iocb, ret);
> >
> > + if (ret > 0)
> > + ret += dio->done_before;
>
> Pardon my ignorance since this is the first time I've had a crack at
> this patchset, but why is it necessary to carry the "bytes copied"
> count from the /previous/ iomap_dio_rw call all the way through to dio
> completion of the current call?

Consider the following situation:

* A user submits an asynchronous read request.

* The first page of the buffer is in memory, but the following
pages are not. This isn't uncommon for consecutive reads
into freshly allocated memory.

* iomap_dio_rw writes into the first page. Then it
hits the next page which is missing, so it returns a partial
result, synchronously.

* We then fault in the remaining pages and call iomap_dio_rw
for the rest of the request.

* The rest of the request completes asynchronously.
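
(Concretely, with made-up numbers: for a 1 MiB read where only the first
page is resident, the first iomap_dio_rw call returns 4096 synchronously;
after faulting in the rest, the second call is passed done_before = 4096,
and when it completes asynchronously, the completion reports the full
1 MiB.)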

Does that answer your question?

Thanks,
Andreas

2021-08-27 20:58:46

by Kari Argillander

[permalink] [raw]
Subject: Re: [PATCH v7 04/19] iov_iter: Turn iov_iter_fault_in_readable into fault_in_iov_iter_readable

On Fri, Aug 27, 2021 at 06:49:11PM +0200, Andreas Gruenbacher wrote:
> Turn iov_iter_fault_in_readable into a function that returns the number
> of bytes not faulted in (similar to copy_to_user) instead of returning a
> non-zero value when any of the requested pages couldn't be faulted in.
> This supports the existing users that require all pages to be faulted in
> as well as new users that are happy if any pages can be faulted in at
> all.
>
> Rename iov_iter_fault_in_readable to fault_in_iov_iter_readable to make
> sure that this change doesn't silently break things.

At least this patch will break ntfs3, which is in -next. It has only been
there for a couple of weeks, so I understand. I have added Konstantin and
the ntfs3 list so that we know what is going on. Can you please let us
know if and when we need to rebase?

We are in a situation where ntfs3 might get into 5.15, but that is
uncertain, so it would be best to sort this out. Just the information is
enough.

Argillander

>
> Signed-off-by: Andreas Gruenbacher <[email protected]>
> ---
> fs/btrfs/file.c | 2 +-
> fs/f2fs/file.c | 2 +-
> fs/fuse/file.c | 2 +-
> fs/iomap/buffered-io.c | 2 +-
> fs/ntfs/file.c | 2 +-
> include/linux/uio.h | 2 +-
> lib/iov_iter.c | 33 +++++++++++++++++++++------------
> mm/filemap.c | 2 +-
> 8 files changed, 28 insertions(+), 19 deletions(-)
>
> diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
> index ee34497500e1..281c77cfe91a 100644
> --- a/fs/btrfs/file.c
> +++ b/fs/btrfs/file.c
> @@ -1698,7 +1698,7 @@ static noinline ssize_t btrfs_buffered_write(struct kiocb *iocb,
> * Fault pages before locking them in prepare_pages
> * to avoid recursive lock
> */
> - if (unlikely(iov_iter_fault_in_readable(i, write_bytes))) {
> + if (unlikely(fault_in_iov_iter_readable(i, write_bytes))) {
> ret = -EFAULT;
> break;
> }
> diff --git a/fs/f2fs/file.c b/fs/f2fs/file.c
> index 6afd4562335f..b04b6c909a8b 100644
> --- a/fs/f2fs/file.c
> +++ b/fs/f2fs/file.c
> @@ -4259,7 +4259,7 @@ static ssize_t f2fs_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
> size_t target_size = 0;
> int err;
>
> - if (iov_iter_fault_in_readable(from, iov_iter_count(from)))
> + if (fault_in_iov_iter_readable(from, iov_iter_count(from)))
> set_inode_flag(inode, FI_NO_PREALLOC);
>
> if ((iocb->ki_flags & IOCB_NOWAIT)) {
> diff --git a/fs/fuse/file.c b/fs/fuse/file.c
> index 97f860cfc195..da49ef71dab5 100644
> --- a/fs/fuse/file.c
> +++ b/fs/fuse/file.c
> @@ -1160,7 +1160,7 @@ static ssize_t fuse_fill_write_pages(struct fuse_io_args *ia,
>
> again:
> err = -EFAULT;
> - if (iov_iter_fault_in_readable(ii, bytes))
> + if (fault_in_iov_iter_readable(ii, bytes))
> break;
>
> err = -ENOMEM;
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index 87ccb3438bec..7dc42dd3a724 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -749,7 +749,7 @@ iomap_write_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
> * same page as we're writing to, without it being marked
> * up-to-date.
> */
> - if (unlikely(iov_iter_fault_in_readable(i, bytes))) {
> + if (unlikely(fault_in_iov_iter_readable(i, bytes))) {
> status = -EFAULT;
> break;
> }
> diff --git a/fs/ntfs/file.c b/fs/ntfs/file.c
> index ab4f3362466d..a43adeacd930 100644
> --- a/fs/ntfs/file.c
> +++ b/fs/ntfs/file.c
> @@ -1829,7 +1829,7 @@ static ssize_t ntfs_perform_write(struct file *file, struct iov_iter *i,
> * pages being swapped out between us bringing them into memory
> * and doing the actual copying.
> */
> - if (unlikely(iov_iter_fault_in_readable(i, bytes))) {
> + if (unlikely(fault_in_iov_iter_readable(i, bytes))) {
> status = -EFAULT;
> break;
> }
> diff --git a/include/linux/uio.h b/include/linux/uio.h
> index 82c3c3e819e0..12d30246c2e9 100644
> --- a/include/linux/uio.h
> +++ b/include/linux/uio.h
> @@ -119,7 +119,7 @@ size_t copy_page_from_iter_atomic(struct page *page, unsigned offset,
> size_t bytes, struct iov_iter *i);
> void iov_iter_advance(struct iov_iter *i, size_t bytes);
> void iov_iter_revert(struct iov_iter *i, size_t bytes);
> -int iov_iter_fault_in_readable(const struct iov_iter *i, size_t bytes);
> +size_t fault_in_iov_iter_readable(const struct iov_iter *i, size_t bytes);
> size_t iov_iter_single_seg_count(const struct iov_iter *i);
> size_t copy_page_to_iter(struct page *page, size_t offset, size_t bytes,
> struct iov_iter *i);
> diff --git a/lib/iov_iter.c b/lib/iov_iter.c
> index 069cedd9d7b4..082ab155496d 100644
> --- a/lib/iov_iter.c
> +++ b/lib/iov_iter.c
> @@ -430,33 +430,42 @@ static size_t copy_page_to_iter_pipe(struct page *page, size_t offset, size_t by
> }
>
> /*
> + * fault_in_iov_iter_readable - fault in iov iterator for reading
> + * @i: iterator
> + * @size: maximum length
> + *
> * Fault in one or more iovecs of the given iov_iter, to a maximum length of
> - * bytes. For each iovec, fault in each page that constitutes the iovec.
> + * @size. For each iovec, fault in each page that constitutes the iovec.
> + *
> + * Returns the number of bytes not faulted in (like copy_to_user() and
> + * copy_from_user()).
> *
> - * Return 0 on success, or non-zero if the memory could not be accessed (i.e.
> - * because it is an invalid address).
> + * Always returns 0 for non-userspace iterators.
> */
> -int iov_iter_fault_in_readable(const struct iov_iter *i, size_t bytes)
> +size_t fault_in_iov_iter_readable(const struct iov_iter *i, size_t size)
> {
> if (iter_is_iovec(i)) {
> + size_t count = min(size, iov_iter_count(i));
> const struct iovec *p;
> size_t skip;
>
> - if (bytes > i->count)
> - bytes = i->count;
> - for (p = i->iov, skip = i->iov_offset; bytes; p++, skip = 0) {
> - size_t len = min(bytes, p->iov_len - skip);
> + size -= count;
> + for (p = i->iov, skip = i->iov_offset; count; p++, skip = 0) {
> + size_t len = min(count, p->iov_len - skip);
> + size_t ret;
>
> if (unlikely(!len))
> continue;
> - if (fault_in_readable(p->iov_base + skip, len))
> - return -EFAULT;
> - bytes -= len;
> + ret = fault_in_readable(p->iov_base + skip, len);
> + count -= len - ret;
> + if (ret)
> + break;
> }
> + return count + size;
> }
> return 0;
> }
> -EXPORT_SYMBOL(iov_iter_fault_in_readable);
> +EXPORT_SYMBOL(fault_in_iov_iter_readable);
>
> void iov_iter_init(struct iov_iter *i, unsigned int direction,
> const struct iovec *iov, unsigned long nr_segs,
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 4dec3bc7752e..83af8a534339 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -3643,7 +3643,7 @@ ssize_t generic_perform_write(struct file *file,
> * same page as we're writing to, without it being marked
> * up-to-date.
> */
> - if (unlikely(iov_iter_fault_in_readable(i, bytes))) {
> + if (unlikely(fault_in_iov_iter_readable(i, bytes))) {
> status = -EFAULT;
> break;
> }
> --
> 2.26.3
>

2021-08-27 21:36:26

by Darrick J. Wong

[permalink] [raw]
Subject: Re: [PATCH v7 16/19] iomap: Add done_before argument to iomap_dio_rw

On Fri, Aug 27, 2021 at 10:15:11PM +0200, Andreas Gruenbacher wrote:
> On Fri, Aug 27, 2021 at 8:30 PM Darrick J. Wong <[email protected]> wrote:
> > On Fri, Aug 27, 2021 at 06:49:23PM +0200, Andreas Gruenbacher wrote:
> > > Add a done_before argument to iomap_dio_rw that indicates how much of
> > > the request has already been transferred. When the request succeeds, we
> > > report that done_before additional bytes were transferred. This is
> > > useful for finishing a request asynchronously when part of the request
> > > has already been completed synchronously.
> > >
> > > We'll use that to allow iomap_dio_rw to be used with page faults
> > > disabled: when a page fault occurs while submitting a request, we
> > > synchronously complete the part of the request that has already been
> > > submitted. The caller can then take care of the page fault and call
> > > iomap_dio_rw again for the rest of the request, passing in the number of
> > > bytes already transferred.
> > >
> > > Signed-off-by: Andreas Gruenbacher <[email protected]>
> > > ---
> > > fs/btrfs/file.c | 5 +++--
> > > fs/ext4/file.c | 5 +++--
> > > fs/gfs2/file.c | 4 ++--
> > > fs/iomap/direct-io.c | 11 ++++++++---
> > > fs/xfs/xfs_file.c | 6 +++---
> > > fs/zonefs/super.c | 4 ++--
> > > include/linux/iomap.h | 4 ++--
> > > 7 files changed, 23 insertions(+), 16 deletions(-)
> > >
> >
> > <snip to the interesting parts>
> >
> > > diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
> > > index ba88fe51b77a..dcf9a2b4381f 100644
> > > --- a/fs/iomap/direct-io.c
> > > +++ b/fs/iomap/direct-io.c
> > > @@ -31,6 +31,7 @@ struct iomap_dio {
> > > atomic_t ref;
> > > unsigned flags;
> > > int error;
> > > + size_t done_before;
> > > bool wait_for_completion;
> > >
> > > union {
> > > @@ -126,6 +127,9 @@ ssize_t iomap_dio_complete(struct iomap_dio *dio)
> > > if (ret > 0 && (dio->flags & IOMAP_DIO_NEED_SYNC))
> > > ret = generic_write_sync(iocb, ret);
> > >
> > > + if (ret > 0)
> > > + ret += dio->done_before;
> >
> > Pardon my ignorance since this is the first time I've had a crack at
> > this patchset, but why is it necessary to carry the "bytes copied"
> > count from the /previous/ iomap_dio_rw call all the way through to dio
> > completion of the current call?
>
> Consider the following situation:
>
> * A user submits an asynchronous read request.
>
> * The first page of the buffer is in memory, but the following
> pages are not. This isn't uncommon for consecutive reads
> into freshly allocated memory.
>
> * iomap_dio_rw writes into the first page. Then it
> hits the next page which is missing, so it returns a partial
> result, synchronously.
>
> * We then fault in the remaining pages and call iomap_dio_rw
> for the rest of the request.
>
> * The rest of the request completes asynchronously.
>
> Does that answer your question?

No, because you totally ignored the second question:

If the directio operation succeeds even partially and the PARTIAL flag
is set, won't that push the iov iter ahead by however many bytes
completed?

We already finished the IO for the first page, so the second attempt
should pick up where it left off, i.e. the second page.

--D

> Thanks,
> Andreas
>

2021-08-27 21:50:59

by Andreas Grünbacher

[permalink] [raw]
Subject: Re: [PATCH v7 16/19] iomap: Add done_before argument to iomap_dio_rw

On Fri, Aug 27, 2021 at 11:33 PM Darrick J. Wong <[email protected]> wrote:
> On Fri, Aug 27, 2021 at 10:15:11PM +0200, Andreas Gruenbacher wrote:
> > On Fri, Aug 27, 2021 at 8:30 PM Darrick J. Wong <[email protected]> wrote:
> > > On Fri, Aug 27, 2021 at 06:49:23PM +0200, Andreas Gruenbacher wrote:
> > > > Add a done_before argument to iomap_dio_rw that indicates how much of
> > > > the request has already been transferred. When the request succeeds, we
> > > > report that done_before additional bytes were transferred. This is
> > > > useful for finishing a request asynchronously when part of the request
> > > > has already been completed synchronously.
> > > >
> > > > We'll use that to allow iomap_dio_rw to be used with page faults
> > > > disabled: when a page fault occurs while submitting a request, we
> > > > synchronously complete the part of the request that has already been
> > > > submitted. The caller can then take care of the page fault and call
> > > > iomap_dio_rw again for the rest of the request, passing in the number of
> > > > bytes already transferred.
> > > >
> > > > Signed-off-by: Andreas Gruenbacher <[email protected]>
> > > > ---
> > > > fs/btrfs/file.c | 5 +++--
> > > > fs/ext4/file.c | 5 +++--
> > > > fs/gfs2/file.c | 4 ++--
> > > > fs/iomap/direct-io.c | 11 ++++++++---
> > > > fs/xfs/xfs_file.c | 6 +++---
> > > > fs/zonefs/super.c | 4 ++--
> > > > include/linux/iomap.h | 4 ++--
> > > > 7 files changed, 23 insertions(+), 16 deletions(-)
> > > >
> > >
> > > <snip to the interesting parts>
> > >
> > > > diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
> > > > index ba88fe51b77a..dcf9a2b4381f 100644
> > > > --- a/fs/iomap/direct-io.c
> > > > +++ b/fs/iomap/direct-io.c
> > > > @@ -31,6 +31,7 @@ struct iomap_dio {
> > > > atomic_t ref;
> > > > unsigned flags;
> > > > int error;
> > > > + size_t done_before;
> > > > bool wait_for_completion;
> > > >
> > > > union {
> > > > @@ -126,6 +127,9 @@ ssize_t iomap_dio_complete(struct iomap_dio *dio)
> > > > if (ret > 0 && (dio->flags & IOMAP_DIO_NEED_SYNC))
> > > > ret = generic_write_sync(iocb, ret);
> > > >
> > > > + if (ret > 0)
> > > > + ret += dio->done_before;
> > >
> > > Pardon my ignorance since this is the first time I've had a crack at
> > > this patchset, but why is it necessary to carry the "bytes copied"
> > > count from the /previous/ iomap_dio_rw call all the way through to dio
> > > completion of the current call?
> >
> > Consider the following situation:
> >
> > * A user submits an asynchronous read request.
> >
> > * The first page of the buffer is in memory, but the following
> > pages are not. This isn't uncommon for consecutive reads
> > into freshly allocated memory.
> >
> > * iomap_dio_rw writes into the first page. Then it
> > hits the next page which is missing, so it returns a partial
> > result, synchronously.
> >
> > * We then fault in the remaining pages and call iomap_dio_rw
> > for the rest of the request.
> >
> > * The rest of the request completes asynchronously.
> >
> > Does that answer your question?
>
> No, because you totally ignored the second question:
>
> If the directio operation succeeds even partially and the PARTIAL flag
> is set, won't that push the iov iter ahead by however many bytes
> completed?

Yes, exactly as it should.

> We already finished the IO for the first page, so the second attempt
> should pick up where it left off, i.e. the second page.

Yes, so what's the question?

Thanks,
Andreas

2021-08-27 21:57:23

by Al Viro

[permalink] [raw]
Subject: Re: [PATCH v7 05/19] iov_iter: Introduce fault_in_iov_iter_writeable

On Fri, Aug 27, 2021 at 07:37:25PM +0000, Al Viro wrote:
> On Fri, Aug 27, 2021 at 12:33:00PM -0700, Linus Torvalds wrote:
> > On Fri, Aug 27, 2021 at 12:23 PM Al Viro <[email protected]> wrote:
> > >
> > > Could you show the cases where "partial copy, so it's OK" behaviour would
> > > break anything?
> >
> > Absolutely.
> >
> > For example, it would cause an infinite loop in
> > restore_fpregs_from_user() if the "buf" argument is a situation where
> > the first page is fine, but the next page is not.
> >
> > Why? Because __restore_fpregs_from_user() would take a fault, but then
> > fault_in_pages_readable() (renamed) would succeed, so you'd just do
> > that "retry" forever and ever.
> >
> > Probably there are a number of other places too. That was literally
> > the *first* place I looked at.
>
> OK...
>
> Let me dig out the notes from the last time I looked through that area
> and grep around a bit. Should be about an hour or two.

OK, I've dug it out and rechecked the current mainline.

Call trees:

fault_in_pages_readable()
kvm_use_magic_page()

Broken, as per mpe. Relevant part (see <[email protected]> in
your mailbox back in early May for the full story):
|The current code is confused, ie. broken.
...
|We want to check that the mapping succeeded, that the address is
|readable (& writeable as well actually).
...
|diff --git a/arch/powerpc/kernel/kvm.c b/arch/powerpc/kernel/kvm.c
...
|- if (!fault_in_pages_readable((const char *)KVM_MAGIC_PAGE, sizeof(u32))) {
|+ if (get_kernel_nofault(c, (const char *)KVM_MAGIC_PAGE)) {

[ppc32]swapcontext()
[ppc32]debug_setcontext()
[ppc64]swapcontext()

Same situation in all three - it's going to kill the process if copy-in
fails, so it tries to be gentler about it and treat fault-in failures
as -EFAULT from syscall. AFAICS, it's pointless, but I would like
comments from ppc folks. Note that bogus *contents* of the
struct ucontext passed by user is almost certainly going to end up
with segfault; trying to catch the cases when bogus address happens
to point someplace unreadable is rather useless in that situation.

restore_fpregs_from_user()
The one you've caught; hadn't been there last time I'd checked (back in
April). Its counterpart in copy_fpstate_to_sigframe() had been, though.

armada_gem_pwrite_ioctl()
Pointless, along with the access_ok() there - it does copy_from_user()
on that area shortly afterwards and failure of either is not a fast path.
copy_page_from_iter_iovec()
Will do the right thing on short copy of any kind; we are fine with either
semantics.
iov_iter_fault_in_readable()
generic_perform_write()
Any short copy that had not led to progress (== rejected by ->write_end())
will lead to the next chunk being shortened accordingly, so ->write_begin() would be
asked to prepare for the amount we expect to be able to copy; ->write_end()
should be fine with that. Failure to copy anything at all (possible due to
eviction on memory pressure, etc.) leads to retry of the same chunk as the
last time, and that's where we rely on fault-in rejecting "nothing could be
faulted in" case. That one is fine with partial fault-in reported as success.
f2fs_file_write_iter()
Odd prealloc-related stuff. AFAICS, from the correctness POV either variant
of semantics would do, but I'm not sure if either is the right match
to what they are trying to do there.
fuse_fill_write_pages()
Similar to generic_perform_write() situation, only simpler (no ->write_end()
counterpart there). All we care about is failure if nothing could be faulted
in.
btrfs_buffered_write()
Again, similar to generic_perform_write(). More convoluted (after a short
copy it switches to going page-by-page and getting destination pages uptodate,
which will be equivalent to ->write_end() always accepting everything it's
given from that point on), but it's the same "we care only about failure
to fault in the first page" situation.
ntfs_perform_write()
Another generic_perform_write() analogue. Same situation wrt fault-in
semantics.
iomap_write_actor()
Another generic_perform_write() relative. Same situation.


fault_in_pages_writeable()
copy_fpstate_to_sigframe()
Same kind of "retry everything from scratch on short copy" as in the other
fpu/signal.c case.
[btrfs]search_ioctl()
Broken with memory poisoning, for either variant of semantics. Same for
arm64 sub-page permission differences, I think.
copy_page_to_iter_iovec()
Will do the right thing on short copy of any kind; we are fine with either
semantics.

So we have 3 callers where we want all-or-nothing semantics - two in
arch/x86/kernel/fpu/signal.c and one in btrfs. HWPOISON will be a problem
for all 3, AFAICS...

IOW, it looks like we have two different things mixed here - one that wants
to try and fault stuff in, with callers caring only about having _something_
faulted in (most of the users) and one that wants to make sure we *can* do
stores or loads on each byte in the affected area.

Just accessing a byte in each page really won't suffice for the second kind.
Neither will g-u-p use, unless we teach it about HWPOISON and other fun
beasts... Looks like we want that thing to be a separate primitive; for
btrfs I'd probably replace fault_in_pages_writeable() with clear_user()
as a quick fix for now...
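
In search_ioctl() terms, that would mean replacing

	ret = fault_in_pages_writeable(ubuf + sk_offset,
				       *buf_size - sk_offset);
	if (ret)
		break;

with something like (untested, just to illustrate the idea; clear_user()
returns the number of bytes it could not zero, so it probes every byte of
the not-yet-filled part of the buffer rather than one byte per page):

	if (clear_user(ubuf + sk_offset, *buf_size - sk_offset)) {
		ret = -EFAULT;
		break;
	}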

Comments?

2021-08-27 21:58:24

by Al Viro

[permalink] [raw]
Subject: Re: [PATCH v7 05/19] iov_iter: Introduce fault_in_iov_iter_writeable

On Fri, Aug 27, 2021 at 09:48:55PM +0000, Al Viro wrote:

> [btrfs]search_ioctl()
> Broken with memory poisoning, for either variant of semantics. Same for
> arm64 sub-page permission differences, I think.


> So we have 3 callers where we want all-or-nothing semantics - two in
> arch/x86/kernel/fpu/signal.c and one in btrfs. HWPOISON will be a problem
> for all 3, AFAICS...
>
> IOW, it looks like we have two different things mixed here - one that wants
> to try and fault stuff in, with callers caring only about having _something_
> faulted in (most of the users) and one that wants to make sure we *can* do
> stores or loads on each byte in the affected area.
>
> Just accessing a byte in each page really won't suffice for the second kind.
> Neither will g-u-p use, unless we teach it about HWPOISON and other fun
> beasts... Looks like we want that thing to be a separate primitive; for
> btrfs I'd probably replace fault_in_pages_writeable() with clear_user()
> as a quick fix for now...
>
> Comments?

Wait a sec... Wasn't HWPOISON a per-page thing? arm64 definitely does have
smaller-than-page areas with different permissions, so btrfs search_ioctl()
has a problem there, but arch/x86/kernel/fpu/signal.c doesn't have to deal
with that...

Sigh... I really need more coffee...

2021-08-27 22:37:24

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v7 16/19] iomap: Add done_before argument to iomap_dio_rw

On Fri, Aug 27, 2021 at 2:32 PM Darrick J. Wong <[email protected]> wrote:
>
> No, because you totally ignored the second question:
>
> If the directio operation succeeds even partially and the PARTIAL flag
> is set, won't that push the iov iter ahead by however many bytes
> completed?
>
> We already finished the IO for the first page, so the second attempt
> should pick up where it left off, i.e. the second page.

Darrick, I think you're missing the point.

It's the *return value* that is the issue, not the iovec.

The iovec is updated as you say. But the return value from the async
part is - without Andreas' patch - only the async part of it.

With Andreas' patch, the async part will now return the full return
value, including the part that was done synchronously.

And the return value is returned from that async part, which somehow
thus needs to know what predated it.

Could that pre-existing part perhaps be saved somewhere else? Very
possibly. That 'struct iomap_dio' addition is kind of ugly. So maybe
what Andreas did could be done differently. But I think you guys are
arguing past each other.

Linus

2021-08-27 23:25:18

by Luck, Tony

[permalink] [raw]
Subject: Re: [PATCH v7 05/19] iov_iter: Introduce fault_in_iov_iter_writeable

On Fri, Aug 27, 2021 at 09:57:10PM +0000, Al Viro wrote:
> On Fri, Aug 27, 2021 at 09:48:55PM +0000, Al Viro wrote:
>
> > [btrfs]search_ioctl()
> > Broken with memory poisoning, for either variant of semantics. Same for
> > arm64 sub-page permission differences, I think.
>
>
> > So we have 3 callers where we want all-or-nothing semantics - two in
> > arch/x86/kernel/fpu/signal.c and one in btrfs. HWPOISON will be a problem
> > for all 3, AFAICS...
> >
> > IOW, it looks like we have two different things mixed here - one that wants
> > to try and fault stuff in, with callers caring only about having _something_
> > faulted in (most of the users) and one that wants to make sure we *can* do
> > stores or loads on each byte in the affected area.
> >
> > Just accessing a byte in each page really won't suffice for the second kind.
> > Neither will g-u-p use, unless we teach it about HWPOISON and other fun
> > beasts... Looks like we want that thing to be a separate primitive; for
> > btrfs I'd probably replace fault_in_pages_writeable() with clear_user()
> > as a quick fix for now...
> >
> > Comments?
>
> Wait a sec... Wasn't HWPOISON a per-page thing? arm64 definitely does have
> smaller-than-page areas with different permissions, so btrfs search_ioctl()
> has a problem there, but arch/x86/kernel/fpu/signal.c doesn't have to deal
> with that...
>
> Sigh... I really need more coffee...

On Intel poison is tracked at the cache line granularity. Linux
inflates that to per-page (because it can only take a whole page away).
For faults triggered in ring3 this is pretty much the same thing because
mm/memory_failure.c unmaps the page ... so while you see a #MC on first
access, you get #PF when you retry. The x86 fault handler sees a magic
signature in the page table and sends a SIGBUS.

But it's all different if the #MC is triggered from ring0. The machine
check handler can't unmap the page. It just schedules task_work to do
the unmap when next returning to the user.

But if your kernel code loops and tries again without a return to user,
then you get another #MC.

-Tony

2021-08-28 02:21:56

by Luck, Tony

[permalink] [raw]
Subject: RE: [PATCH v7 05/19] iov_iter: Introduce fault_in_iov_iter_writeable

> But if your kernel code loops and tries again without a return to user,
> then you get another #MC.

I've been trying to push this patch:

https://lore.kernel.org/linux-edac/[email protected]/

which turns the infinite loops of machine checks into a panic.

-Tony

2021-08-28 17:15:29

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v7 04/19] iov_iter: Turn iov_iter_fault_in_readable into fault_in_iov_iter_readable

On Fri, Aug 27, 2021 at 1:56 PM Kari Argillander
<[email protected]> wrote:
>
> At least this patch will break ntfs3, which is in -next. It has only been
> there a couple of weeks, so I understand. I have added Konstantin and the
> ntfs3 list so that we know what is going on. Can you please let us know if
> and when we need to rebase?

No need to rebase. It just makes it harder for me to pick one pull
over another, since it would mix the two things together.

I'll notice the semantic conflict as I do my merge build test, and
it's easy for me to fix as part of the merge - whichever one I merge
later.

It's good if both sides remind me about the issue, but these kinds of
conflicts are not a problem.

And yes, it does happen that I miss conflicts like this if I merge
while on the road and don't do my full build tests, or if it's some
architecture-specific thing or a problem that doesn't happen on my
usual allmodconfig testing. But neither of those cases should be
present in this situation.

Linus

2021-08-28 19:37:14

by Al Viro

[permalink] [raw]
Subject: [RFC][arm64] possible infinite loop in btrfs search_ioctl()

AFAICS, a48b73eca4ce "btrfs: fix potential deadlock in the search ioctl"
has introduced a bug at least on arm64.

Relevant bits: in search_ioctl() we have
while (1) {
ret = fault_in_pages_writeable(ubuf + sk_offset,
*buf_size - sk_offset);
if (ret)
break;

ret = btrfs_search_forward(root, &key, path, sk->min_transid);
if (ret != 0) {
if (ret > 0)
ret = 0;
goto err;
}
ret = copy_to_sk(path, &key, sk, buf_size, ubuf,
&sk_offset, &num_found);
btrfs_release_path(path);
if (ret)
break;

}
and in copy_to_sk() -
sh.objectid = key->objectid;
sh.offset = key->offset;
sh.type = key->type;
sh.len = item_len;
sh.transid = found_transid;

/*
* Copy search result header. If we fault then loop again so we
* can fault in the pages and -EFAULT there if there's a
* problem. Otherwise we'll fault and then copy the buffer in
* properly this next time through
*/
if (copy_to_user_nofault(ubuf + *sk_offset, &sh, sizeof(sh))) {
ret = 0;
goto out;
}
with sk_offset left unchanged if the very first copy_to_user_nofault() fails.

Now, consider a situation on arm64 where ubuf points to the beginning of page,
ubuf[0] can be accessed, but ubuf[16] can not (possible with MTE, AFAICS). We do
fault_in_pages_writeable(), which succeeds. When we get to copy_to_user_nofault()
we fail as soon as it gets past the first 16 bytes. And we repeat everything from
scratch, with no progress made, since short copies are treated as "discard and
repeat" here.

Am I misreading what's going on there?

2021-08-28 21:52:51

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH v7 05/19] iov_iter: Introduce fault_in_iov_iter_writeable

On Fri, Aug 27 2021 at 16:22, Tony Luck wrote:
> On Fri, Aug 27, 2021 at 09:57:10PM +0000, Al Viro wrote:
>> On Fri, Aug 27, 2021 at 09:48:55PM +0000, Al Viro wrote:
>>
>> > [btrfs]search_ioctl()
>> > Broken with memory poisoning, for either variant of semantics. Same for
>> > arm64 sub-page permission differences, I think.
>>
>>
>> > So we have 3 callers where we want all-or-nothing semantics - two in
>> > arch/x86/kernel/fpu/signal.c and one in btrfs. HWPOISON will be a problem
>> > for all 3, AFAICS...
>> >
>> > IOW, it looks like we have two different things mixed here - one that wants
>> > to try and fault stuff in, with callers caring only about having _something_
>> > faulted in (most of the users) and one that wants to make sure we *can* do
>> > stores or loads on each byte in the affected area.
>> >
>> > Just accessing a byte in each page really won't suffice for the second kind.
>> > Neither will g-u-p use, unless we teach it about HWPOISON and other fun
>> > beasts... Looks like we want that thing to be a separate primitive; for
>> > btrfs I'd probably replace fault_in_pages_writeable() with clear_user()
>> > as a quick fix for now...
>> >
>> > Comments?
>>
>> Wait a sec... Wasn't HWPOISON a per-page thing? arm64 definitely does have
>> smaller-than-page areas with different permissions, so btrfs search_ioctl()
>> has a problem there, but arch/x86/kernel/fpu/signal.c doesn't have to deal
>> with that...
>>
>> Sigh... I really need more coffee...
>
> On Intel poison is tracked at the cache line granularity. Linux
> inflates that to per-page (because it can only take a whole page away).
> For faults triggered in ring3 this is pretty much the same thing because
> mm/memory_failure.c unmaps the page ... so while you see a #MC on first
> access, you get #PF when you retry. The x86 fault handler sees a magic
> signature in the page table and sends a SIGBUS.
>
> But it's all different if the #MC is triggered from ring0. The machine
> check handler can't unmap the page. It just schedules task_work to do
> the unmap when next returning to the user.
>
> But if your kernel code loops and tries again without a return to user,
> then you get another #MC.

But that's not the case for restore_fpregs_from_user() when it hits #MC.

restore_fpregs_from_user()
...
ret = __restore_fpregs_from_user(buf, xrestore, fx_only)

/* Try to handle #PF, but anything else is fatal. */
if (ret != -EFAULT)
return -EINVAL;

Now let's look at __restore_fpregs_from_user()

__restore_fpregs_from_user()
return $FPUVARIANT_rstor_from_user_sigframe()

which all end up in user_insn(). user_insn() returns 0 or the negated
trap number, which results in -EFAULT for #PF, but for #MC the negated
trap number is -18 i.e. != -EFAULT. IOW, there is no endless loop.

This used to be a problem before commit:

aee8c67a4faa ("x86/fpu: Return proper error codes from user access functions")

and as the changelog says the initial reason for this was #GP going into
the fault path, but I'm pretty sure that I also discussed the #MC angle with
Borislav back then. Should have added some more comments there
obviously.

Thanks,

tglx

2021-08-28 22:09:12

by Al Viro

[permalink] [raw]
Subject: Re: [PATCH v7 05/19] iov_iter: Introduce fault_in_iov_iter_writeable

On Sat, Aug 28, 2021 at 11:47:03PM +0200, Thomas Gleixner wrote:

> /* Try to handle #PF, but anything else is fatal. */
> if (ret != -EFAULT)
> return -EINVAL;

> which all end up in user_insn(). user_insn() returns 0 or the negated
> trap number, which results in -EFAULT for #PF, but for #MC the negated
> trap number is -18 i.e. != -EFAULT. IOW, there is no endless loop.
>
> This used to be a problem before commit:
>
> aee8c67a4faa ("x86/fpu: Return proper error codes from user access functions")
>
> and as the changelog says the initial reason for this was #GP going into
> the fault path, but I'm pretty sure that I also discussed the #MC angle with
> Borislav back then. Should have added some more comments there
> obviously.

... or at least have that check spelled

if (ret != -X86_TRAP_PF)
return -EINVAL;

Unless I'm misreading your explanation, that is...

2021-08-28 22:12:38

by Al Viro

[permalink] [raw]
Subject: Re: [PATCH v7 05/19] iov_iter: Introduce fault_in_iov_iter_writeable

On Sat, Aug 28, 2021 at 10:04:41PM +0000, Al Viro wrote:
> On Sat, Aug 28, 2021 at 11:47:03PM +0200, Thomas Gleixner wrote:
>
> > /* Try to handle #PF, but anything else is fatal. */
> > if (ret != -EFAULT)
> > return -EINVAL;
>
> > which all end up in user_insn(). user_insn() returns 0 or the negated
> > trap number, which results in -EFAULT for #PF, but for #MC the negated
> > trap number is -18 i.e. != -EFAULT. IOW, there is no endless loop.
> >
> > This used to be a problem before commit:
> >
> > aee8c67a4faa ("x86/fpu: Return proper error codes from user access functions")
> >
> > and as the changelog says the initial reason for this was #GP going into
> > the fault path, but I'm pretty sure that I also discussed the #MC angle with
> > Borislav back then. Should have added some more comments there
> > obviously.
>
> ... or at least have that check spelled
>
> if (ret != -X86_TRAP_PF)
> return -EINVAL;
>
> Unless I'm misreading your explanation, that is...

BTW, is #MC triggered on stores to a poisoned cacheline? Existence of CLZERO
would seem to argue against that...

2021-08-28 22:21:02

by Al Viro

[permalink] [raw]
Subject: Re: [PATCH v7 05/19] iov_iter: Introduce fault_in_iov_iter_writeable

On Sat, Aug 28, 2021 at 10:11:36PM +0000, Al Viro wrote:
> On Sat, Aug 28, 2021 at 10:04:41PM +0000, Al Viro wrote:
> > On Sat, Aug 28, 2021 at 11:47:03PM +0200, Thomas Gleixner wrote:
> >
> > > /* Try to handle #PF, but anything else is fatal. */
> > > if (ret != -EFAULT)
> > > return -EINVAL;
> >
> > > which all end up in user_insn(). user_insn() returns 0 or the negated
> > > trap number, which results in -EFAULT for #PF, but for #MC the negated
> > > trap number is -18 i.e. != -EFAULT. IOW, there is no endless loop.
> > >
> > > This used to be a problem before commit:
> > >
> > > aee8c67a4faa ("x86/fpu: Return proper error codes from user access functions")
> > >
> > > and as the changelog says the initial reason for this was #GP going into
> > > the fault path, but I'm pretty sure that I also discussed the #MC angle with
> > > Borislav back then. Should have added some more comments there
> > > obviously.
> >
> > ... or at least have that check spelled
> >
> > if (ret != -X86_TRAP_PF)
> > return -EINVAL;
> >
> > Unless I'm misreading your explanation, that is...
>
> BTW, is #MC triggered on stores to a poisoned cacheline? Existence of CLZERO
> would seem to argue against that...

How about taking __clear_user() out of copy_fpregs_to_sigframe()
and replacing the call of fault_in_pages_writeable() with
if (!clear_user(buf_fx, fpu_user_xstate_size))
goto retry;
return -EFAULT;
in the caller?

2021-08-28 22:22:02

by Luck, Tony

[permalink] [raw]
Subject: Re: [PATCH v7 05/19] iov_iter: Introduce fault_in_iov_iter_writeable

On Sat, Aug 28, 2021 at 3:12 PM Al Viro <[email protected]> wrote:
> BTW, is #MC triggered on stores to a poisoned cacheline? Existence of CLZERO
> would seem to argue against that...

No #MC on stores. Just on loads. Note that you can't clear poison
state with a series of small writes to the cache line. But a single
64-byte store might do it (architects didn't want to guarantee that
it would work when I asked about avx512 stores to clear poison
many years ago).

-Tony

2021-08-28 22:25:40

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH v7 05/19] iov_iter: Introduce fault_in_iov_iter_writeable

On Sat, Aug 28 2021 at 22:04, Al Viro wrote:

> On Sat, Aug 28, 2021 at 11:47:03PM +0200, Thomas Gleixner wrote:
>
>> /* Try to handle #PF, but anything else is fatal. */
>> if (ret != -EFAULT)
>> return -EINVAL;
>
>> which all end up in user_insn(). user_insn() returns 0 or the negated
>> trap number, which results in -EFAULT for #PF, but for #MC the negated
>> trap number is -18 i.e. != -EFAULT. IOW, there is no endless loop.
>>
>> This used to be a problem before commit:
>>
>> aee8c67a4faa ("x86/fpu: Return proper error codes from user access functions")
>>
>> and as the changelog says the initial reason for this was #GP going into
>> the fault path, but I'm pretty sure that I also discussed the #MC angle with
>> Borislav back then. Should have added some more comments there
>> obviously.
>
> ... or at least have that check spelled
>
> if (ret != -X86_TRAP_PF)
> return -EINVAL;
>
> Unless I'm misreading your explanation, that is...

Yes, that makes a lot of sense.

2021-08-28 23:04:32

by Al Viro

[permalink] [raw]
Subject: Re: [PATCH v7 05/19] iov_iter: Introduce fault_in_iov_iter_writeable

On Sat, Aug 28, 2021 at 10:19:04PM +0000, Al Viro wrote:

> How about taking __clear_user() out of copy_fpregs_to_sigframe()
> and replacing the call of fault_in_pages_writeable() with
> if (!clear_user(buf_fx, fpu_user_xstate_size))
> goto retry;
> return -EFAULT;
> in the caller?

Something like this (completely untested)

Lift __clear_user() out of copy_fpregs_to_sigframe(), do not confuse EFAULT with
X86_TRAP_PF, don't bother with fault_in_pages_writeable() (pointless, since now
__clear_user() on error is not under pagefault_disable()). And don't bother
with retries on anything other than #PF...

diff --git a/arch/x86/include/asm/fpu/internal.h b/arch/x86/include/asm/fpu/internal.h
index 5a18694a89b2..71c6621a262f 100644
--- a/arch/x86/include/asm/fpu/internal.h
+++ b/arch/x86/include/asm/fpu/internal.h
@@ -17,6 +17,7 @@
#include <linux/mm.h>

#include <asm/user.h>
+#include <asm/trapnr.h>
#include <asm/fpu/api.h>
#include <asm/fpu/xstate.h>
#include <asm/fpu/xcr.h>
@@ -345,7 +346,7 @@ static inline int xsave_to_user_sigframe(struct xregs_state __user *buf)
*/
err = __clear_user(&buf->header, sizeof(buf->header));
if (unlikely(err))
- return -EFAULT;
+ return -X86_TRAP_PF;

stac();
XSTATE_OP(XSAVE, buf, lmask, hmask, err);
diff --git a/arch/x86/kernel/fpu/signal.c b/arch/x86/kernel/fpu/signal.c
index 445c57c9c539..611b9ed9c06b 100644
--- a/arch/x86/kernel/fpu/signal.c
+++ b/arch/x86/kernel/fpu/signal.c
@@ -135,18 +135,12 @@ static inline int save_xstate_epilog(void __user *buf, int ia32_frame)

static inline int copy_fpregs_to_sigframe(struct xregs_state __user *buf)
{
- int err;
-
if (use_xsave())
- err = xsave_to_user_sigframe(buf);
- else if (use_fxsr())
- err = fxsave_to_user_sigframe((struct fxregs_state __user *) buf);
+ return xsave_to_user_sigframe(buf);
+ if (use_fxsr())
+ return fxsave_to_user_sigframe((struct fxregs_state __user *) buf);
else
- err = fnsave_to_user_sigframe((struct fregs_state __user *) buf);
-
- if (unlikely(err) && __clear_user(buf, fpu_user_xstate_size))
- err = -EFAULT;
- return err;
+ return fnsave_to_user_sigframe((struct fregs_state __user *) buf);
}

/*
@@ -205,9 +199,10 @@ int copy_fpstate_to_sigframe(void __user *buf, void __user *buf_fx, int size)
fpregs_unlock();

if (ret) {
- if (!fault_in_pages_writeable(buf_fx, fpu_user_xstate_size))
+ if (!__clear_user(buf_fx, fpu_user_xstate_size) &&
+ ret == -X86_TRAP_PF)
goto retry;
- return -EFAULT;
+ return -1;
}

/* Save the fsave header for the 32-bit frames. */
@@ -275,7 +270,7 @@ static int restore_fpregs_from_user(void __user *buf, u64 xrestore,
fpregs_unlock();

/* Try to handle #PF, but anything else is fatal. */
- if (ret != -EFAULT)
+ if (ret != -X86_TRAP_PF)
return -EINVAL;

ret = fault_in_pages_readable(buf, size);

2021-08-29 01:56:49

by Al Viro

[permalink] [raw]
Subject: Re: [PATCH v7 05/19] iov_iter: Introduce fault_in_iov_iter_writeable

On Fri, Aug 27, 2021 at 09:48:55PM +0000, Al Viro wrote:

> So we have 3 callers where we want all-or-nothing semantics - two in
> arch/x86/kernel/fpu/signal.c and one in btrfs. HWPOISON will be a problem
> for all 3, AFAICS...
>
> IOW, it looks like we have two different things mixed here - one that wants
> to try and fault stuff in, with callers caring only about having _something_
> faulted in (most of the users) and one that wants to make sure we *can* do
> stores or loads on each byte in the affected area.
>
> Just accessing a byte in each page really won't suffice for the second kind.
> Neither will g-u-p use, unless we teach it about HWPOISON and other fun
> beasts... Looks like we want that thing to be a separate primitive; for
> btrfs I'd probably replace fault_in_pages_writeable() with clear_user()
> as a quick fix for now...

Looks like out of these 3 we have
* x86 restoring FPU state on sigreturn: correct, if somewhat obfuscated;
HWPOISON is not an issue. We want full fault-in there (1 or 2 pages)
* x86 saving FPU state into sigframe: not really needed; we do
__clear_user() on any error anyway, and taking it into the caller past the
pagefault_enable() will serve just fine instead of fault-in of the same
for write.
* btrfs search_ioctl(): HWPOISON is not an issue (no #MC on stores),
but arm64 side of the things very likely is a problem with MTE; there we
can have successful store in some bytes in a page with faults on stores
elsewhere in it. With such setups that thing will loop indefinitely.
And unlike x86 FPU handling, btrfs is arch-independent.

IOW, unless I'm misreading the situation, we have one caller where "all or
nothing" semantics is correct and needed, several where fault-in is pointless,
one where the current use of fault-in is actively wrong (ppc kvm, patch from
ppc folks exists), another place where neither semantics is right (btrfs on
arm64) and a bunch where "can we access at least the first byte?" should be
fine...

2021-08-29 02:05:57

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [PATCH v7 05/19] iov_iter: Introduce fault_in_iov_iter_writeable

On Sat, Aug 28, 2021 at 03:20:58PM -0700, Tony Luck wrote:
> On Sat, Aug 28, 2021 at 3:12 PM Al Viro <[email protected]> wrote:
> > BTW, is #MC triggered on stores to a poisoned cacheline? Existence of CLZERO
> > would seem to argue against that...
>
> No #MC on stores. Just on loads. Note that you can't clear poison
> state with a series of small writes to the cache line. But a single
> 64-byte store might do it (architects didn't want to guarantee that
> it would work when I asked about avx512 stores to clear poison
> many years ago).

Dave Jiang thinks MOVDIR64B clears poison.

http://archive.lwn.net:8080/linux-kernel/157617505636.42350.1170110675242558018.stgit@djiang5-desk3.ch.intel.com/

2021-08-29 18:46:02

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH v7 05/19] iov_iter: Introduce fault_in_iov_iter_writeable

On Sat, Aug 28 2021 at 22:51, Al Viro wrote:
> @@ -345,7 +346,7 @@ static inline int xsave_to_user_sigframe(struct xregs_state __user *buf)
> */
> err = __clear_user(&buf->header, sizeof(buf->header));
> if (unlikely(err))
> - return -EFAULT;
> + return -X86_TRAP_PF;

This clear_user can be lifted into copy_fpstate_to_sigframe(). Something
like the below.

Thanks,

tglx
---
--- a/arch/x86/kernel/fpu/signal.c
+++ b/arch/x86/kernel/fpu/signal.c
@@ -135,18 +135,12 @@ static inline int save_xstate_epilog(voi

static inline int copy_fpregs_to_sigframe(struct xregs_state __user *buf)
{
- int err;
-
if (use_xsave())
- err = xsave_to_user_sigframe(buf);
- else if (use_fxsr())
- err = fxsave_to_user_sigframe((struct fxregs_state __user *) buf);
+ return xsave_to_user_sigframe(buf);
+ if (use_fxsr())
+ return fxsave_to_user_sigframe((struct fxregs_state __user *) buf);
else
- err = fnsave_to_user_sigframe((struct fregs_state __user *) buf);
-
- if (unlikely(err) && __clear_user(buf, fpu_user_xstate_size))
- err = -EFAULT;
- return err;
+ return fnsave_to_user_sigframe((struct fregs_state __user *) buf);
}

/*
@@ -188,6 +182,16 @@ int copy_fpstate_to_sigframe(void __user

if (!access_ok(buf, size))
return -EACCES;
+
+ if (use_xsave()) {
+ /*
+ * Clear the xsave header first, so that reserved fields are
+ * initialized to zero.
+ */
+ ret = __clear_user(&buf->header, sizeof(buf->header));
+ if (unlikely(ret))
+ return ret;
+ }
retry:
/*
* Load the FPU registers if they are not valid for the current task.
@@ -205,9 +209,10 @@ int copy_fpstate_to_sigframe(void __user
fpregs_unlock();

if (ret) {
- if (!fault_in_pages_writeable(buf_fx, fpu_user_xstate_size))
+ if (!__clear_user(buf_fx, fpu_user_xstate_size) &&
+ ret == -X86_TRAP_PF)
goto retry;
- return -EFAULT;
+ return -1;
}

/* Save the fsave header for the 32-bit frames. */
@@ -275,7 +280,7 @@ static int restore_fpregs_from_user(void
fpregs_unlock();

/* Try to handle #PF, but anything else is fatal. */
- if (ret != -EFAULT)
+ if (ret != -X86_TRAP_PF)
return -EINVAL;

ret = fault_in_pages_readable(buf, size);
--- a/arch/x86/include/asm/fpu/internal.h
+++ b/arch/x86/include/asm/fpu/internal.h
@@ -323,9 +323,12 @@ static inline void os_xrstor(struct xreg
* We don't use modified optimization because xrstor/xrstors might track
* a different application.
*
- * We don't use compacted format xsave area for
- * backward compatibility for old applications which don't understand
- * compacted format of xsave area.
+ * We don't use compacted format xsave area for backward compatibility for
+ * old applications which don't understand the compacted format of the
+ * xsave area.
+ *
+ * The caller has to zero buf::header before calling this because XSAVE*
+ * does not touch them.
*/
static inline int xsave_to_user_sigframe(struct xregs_state __user *buf)
{
@@ -339,14 +342,6 @@ static inline int xsave_to_user_sigframe
u32 hmask = mask >> 32;
int err;

- /*
- * Clear the xsave header first, so that reserved fields are
- * initialized to zero.
- */
- err = __clear_user(&buf->header, sizeof(buf->header));
- if (unlikely(err))
- return -EFAULT;
-
stac();
XSTATE_OP(XSAVE, buf, lmask, hmask, err);
clac();

2021-08-29 19:51:44

by Al Viro

[permalink] [raw]
Subject: Re: [PATCH v7 05/19] iov_iter: Introduce fault_in_iov_iter_writeable

On Sun, Aug 29, 2021 at 08:44:04PM +0200, Thomas Gleixner wrote:
> On Sat, Aug 28 2021 at 22:51, Al Viro wrote:
> > @@ -345,7 +346,7 @@ static inline int xsave_to_user_sigframe(struct xregs_state __user *buf)
> > */
> > err = __clear_user(&buf->header, sizeof(buf->header));
> > if (unlikely(err))
> > - return -EFAULT;
> > + return -X86_TRAP_PF;
>
> This clear_user can be lifted into copy_fpstate_to_sigframe(). Something
> like the below.

Hmm... This mixing of -X86_TRAP_... with -E... looks like it's asking for
trouble in general. Might be worth making e.g. fpu__restore_sig() (and
its callers) return bool, seeing that we only check for 0/non-zero in
there.

2021-08-29 19:53:46

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH v7 05/19] iov_iter: Introduce fault_in_iov_iter_writeable

On Sun, Aug 29 2021 at 19:46, Al Viro wrote:

> On Sun, Aug 29, 2021 at 08:44:04PM +0200, Thomas Gleixner wrote:
>> On Sat, Aug 28 2021 at 22:51, Al Viro wrote:
>> > @@ -345,7 +346,7 @@ static inline int xsave_to_user_sigframe(struct xregs_state __user *buf)
>> > */
>> > err = __clear_user(&buf->header, sizeof(buf->header));
>> > if (unlikely(err))
>> > - return -EFAULT;
>> > + return -X86_TRAP_PF;
>>
>> This clear_user can be lifted into copy_fpstate_to_sigframe(). Something
>> like the below.
>
> Hmm... This mixing of -X86_TRAP_... with -E... looks like it's asking for
> trouble in general. Might be worth making e.g. fpu__restore_sig() (and
> its callers) return bool, seeing that we only check for 0/non-zero in
> there.

Let me fix that.

2021-08-30 15:46:10

by Luck, Tony

[permalink] [raw]
Subject: RE: [PATCH v7 05/19] iov_iter: Introduce fault_in_iov_iter_writeable

>> No #MC on stores. Just on loads. Note that you can't clear poison
>> state with a series of small writes to the cache line. But a single
>> 64-byte store might do it (architects didn't want to guarantee that
>> it would work when I asked about avx512 stores to clear poison
>> many years ago).
>
> Dave Jiang thinks MOVDIR64B clears poison.
>
> http://archive.lwn.net:8080/linux-kernel/157617505636.42350.1170110675242558018.stgit@djiang5-desk3.ch.intel.com/

MOVDIR64B has some explicit guarantees (does a write-back invalidate if the target is already
in the cache) that a 64-byte avx512 write doesn't.

Of course it would stop working if some future CPU were to have a cache line longer than 64 bytes.

-Tony

2021-08-31 13:57:49

by Catalin Marinas

[permalink] [raw]
Subject: Re: [RFC][arm64] possible infinite loop in btrfs search_ioctl()

On Sat, Aug 28, 2021 at 08:28:17PM +0100, Al Viro wrote:
> AFAICS, a48b73eca4ce "btrfs: fix potential deadlock in the search ioctl"
> has introduced a bug at least on arm64.
>
> Relevant bits: in search_ioctl() we have
> while (1) {
> ret = fault_in_pages_writeable(ubuf + sk_offset,
> *buf_size - sk_offset);
> if (ret)
> break;
>
> ret = btrfs_search_forward(root, &key, path, sk->min_transid);
> if (ret != 0) {
> if (ret > 0)
> ret = 0;
> goto err;
> }
> ret = copy_to_sk(path, &key, sk, buf_size, ubuf,
> &sk_offset, &num_found);
> btrfs_release_path(path);
> if (ret)
> break;
>
> }
> and in copy_to_sk() -
> sh.objectid = key->objectid;
> sh.offset = key->offset;
> sh.type = key->type;
> sh.len = item_len;
> sh.transid = found_transid;
>
> /*
> * Copy search result header. If we fault then loop again so we
> * can fault in the pages and -EFAULT there if there's a
> * problem. Otherwise we'll fault and then copy the buffer in
> * properly this next time through
> */
> if (copy_to_user_nofault(ubuf + *sk_offset, &sh, sizeof(sh))) {
> ret = 0;
> goto out;
> }
> with sk_offset left unchanged if the very first copy_to_user_nofault() fails.
>
> Now, consider a situation on arm64 where ubuf points to the beginning of page,
> ubuf[0] can be accessed, but ubuf[16] can not (possible with MTE, AFAICS). We do
> fault_in_pages_writeable(), which succeeds. When we get to copy_to_user_nofault()
> we fail as soon as it gets past the first 16 bytes. And we repeat everything from
> scratch, with no progress made, since short copies are treated as "discard and
> repeat" here.

So if copy_to_user_nofault() returns -EFAULT, copy_to_sk() returns 0
(following commit a48b73eca4ce). I think you are right, search_ioctl()
can get into an infinite loop attempting to write to user if the
architecture can trigger faults at smaller granularity than the page
boundary. fault_in_pages_writeable() won't fix it if ubuf[0] is
writable and doesn't trigger an MTE tag check fault.

An arm64-specific workaround would be for pagefault_disable() to disable
tag checking. It's a pretty big hammer, weakening the out of bounds
access detection of MTE. My preference would be a fix in the btrfs code.

A btrfs option would be for copy_to_sk() to return an indication of
where the fault occurred and get fault_in_pages_writeable() to check
that location, even if the copying would restart from an earlier offset
(this requires open-coding copy_to_user_nofault()). An attempt below,
untested and does not cover read_extent_buffer_to_user_nofault():

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 0ba98e08a029..9e74ba1c955d 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -2079,6 +2079,7 @@ static noinline int copy_to_sk(struct btrfs_path *path,
size_t *buf_size,
char __user *ubuf,
unsigned long *sk_offset,
+ unsigned long *fault_offset,
int *num_found)
{
u64 found_transid;
@@ -2143,7 +2144,11 @@ static noinline int copy_to_sk(struct btrfs_path *path,
* problem. Otherwise we'll fault and then copy the buffer in
* properly this next time through
*/
- if (copy_to_user_nofault(ubuf + *sk_offset, &sh, sizeof(sh))) {
+ pagefault_disable();
+ ret = __copy_to_user_inatomic(ubuf + *sk_offset, &sh, sizeof(sh));
+ pagefault_enable();
+ *fault_offset = *sk_offset + sizeof(sh) - ret;
+ if (ret) {
ret = 0;
goto out;
}
@@ -2218,6 +2223,7 @@ static noinline int search_ioctl(struct inode *inode,
int ret;
int num_found = 0;
unsigned long sk_offset = 0;
+ unsigned long fault_offset = 0;

if (*buf_size < sizeof(struct btrfs_ioctl_search_header)) {
*buf_size = sizeof(struct btrfs_ioctl_search_header);
@@ -2244,8 +2250,8 @@ static noinline int search_ioctl(struct inode *inode,
key.offset = sk->min_offset;

while (1) {
- ret = fault_in_pages_writeable(ubuf + sk_offset,
- *buf_size - sk_offset);
+ ret = fault_in_pages_writeable(ubuf + fault_offset,
+ *buf_size - fault_offset);
if (ret)
break;

@@ -2256,7 +2262,7 @@ static noinline int search_ioctl(struct inode *inode,
goto err;
}
ret = copy_to_sk(path, &key, sk, buf_size, ubuf,
- &sk_offset, &num_found);
+ &sk_offset, &fault_offset, &num_found);
btrfs_release_path(path);
if (ret)
break;

--
Catalin

2021-08-31 15:50:15

by Al Viro

[permalink] [raw]
Subject: Re: [RFC][arm64] possible infinite loop in btrfs search_ioctl()

On Tue, Aug 31, 2021 at 02:54:50PM +0100, Catalin Marinas wrote:

> An arm64-specific workaround would be for pagefault_disable() to disable
> tag checking. It's a pretty big hammer, weakening the out of bounds
> access detection of MTE. My preference would be a fix in the btrfs code.
>
> A btrfs option would be for copy_to_sk() to return an indication of
> where the fault occurred and get fault_in_pages_writeable() to check
> that location, even if the copying would restart from an earlier offset
> (this requires open-coding copy_to_user_nofault()). An attempt below,
> untested and does not cover read_extent_buffer_to_user_nofault():

Umm... There's another copy_to_user_nofault() call in the same function
(same story, AFAICS).

Can't say I'm fond of their ABI, but then I guess it could've been worse -
iterating over btree, running a user-supplied chunk of INTERCAL over it,
with all details of internal representation cast in stone by that exposure...

2021-08-31 16:05:40

by Catalin Marinas

[permalink] [raw]
Subject: Re: [RFC][arm64] possible infinite loop in btrfs search_ioctl()

On Tue, Aug 31, 2021 at 03:28:57PM +0000, Al Viro wrote:
> On Tue, Aug 31, 2021 at 02:54:50PM +0100, Catalin Marinas wrote:
>
> > An arm64-specific workaround would be for pagefault_disable() to disable
> > tag checking. It's a pretty big hammer, weakening the out of bounds
> > access detection of MTE. My preference would be a fix in the btrfs code.
> >
> > A btrfs option would be for copy_to_sk() to return an indication of
> > where the fault occurred and get fault_in_pages_writeable() to check
> > that location, even if the copying would restart from an earlier offset
> > (this requires open-coding copy_to_user_nofault()). An attempt below,
> > untested and does not cover read_extent_buffer_to_user_nofault():
>
> Umm... There's another copy_to_user_nofault() call in the same function
> (same story, AFAICS).

Yeah, I was too lazy to do it all and I don't have a setup to test the
patch quickly either. BTW, my hack is missing an access_ok() check.

I wonder whether copy_{to,from}_user_nofault() could actually return the
number of bytes left to copy, just like their non-nofault counterparts.
These are only used in a few places, so fairly easy to change. If we go
for a btrfs fix along the lines of my diff, it saves us from duplicating
the copy_to_user_nofault() code.
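
Roughly like this (untested sketch of the idea, mirroring what the
non-nofault copy_to_user() returns):

	unsigned long copy_to_user_nofault(void __user *dst, const void *src,
					   size_t size)
	{
		unsigned long ret = size;

		if (access_ok(dst, size)) {
			pagefault_disable();
			ret = __copy_to_user_inatomic(dst, src, size);
			pagefault_enable();
		}
		return ret;	/* bytes left to copy, 0 on success */
	}

Callers that only care about success/failure would then check for a
non-zero return instead of -EFAULT.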

--
Catalin

2021-09-01 22:45:30

by Andreas Gruenbacher

[permalink] [raw]
Subject: Re: [PATCH v7 00/19] gfs2: Fix mmap + page fault deadlocks

On Fri, Aug 27, 2021 at 7:17 PM Linus Torvalds
<[email protected]> wrote:
> On Fri, Aug 27, 2021 at 9:49 AM Andreas Gruenbacher <[email protected]> wrote:
> >
> > here's another update on top of v5.14-rc7. Changes:
> >
> > * Some of the patch descriptions have been improved.
> >
> > * Patch "gfs2: Eliminate ip->i_gh" has been moved further to the front.
> >
> > At this point, I'm not aware of anything that still needs fixing,
>
> From a quick scan, I didn't see anything that raised my hackles.

So there's a minor merge conflict between Christoph's iomap_iter
conversion and this patch queue now, and I should probably clarify the
description of "iomap: Add done_before argument to iomap_dio_rw" that
Darrick ran into. Then there are the user copy issues that Al has
pointed out. Fixing those will create superficial conflicts with this
patch queue, but probably nothing serious.

So how should I proceed: do you expect a v8 of this patch queue on top
of the current mainline?

Thanks,
Andreas

2021-09-03 15:10:37

by Filipe Manana

[permalink] [raw]
Subject: Re: [PATCH v7 00/19] gfs2: Fix mmap + page fault deadlocks

On Fri, Aug 27, 2021 at 5:51 PM Andreas Gruenbacher <[email protected]> wrote:
>
> Hi all,
>
> here's another update on top of v5.14-rc7. Changes:
>
> * Some of the patch descriptions have been improved.
>
> * Patch "gfs2: Eliminate ip->i_gh" has been moved further to the front.
>
> At this point, I'm not aware of anything that still needs fixing,

Hi, thanks for doing this.

In btrfs we also have a deadlock (after the conversion to use iomap
for direct IO) triggered by your recent test case for fstests,
generic/647 [1].
Even though we can fix it in btrfs without touching iomap, iov_iter,
etc, it would be too complex for such a rare and exotic case (a user
passing a buffer for a direct IO read/write that is memory mapped to
the same file range of the operation is very uncommon at least). But
this patchset would make the fix much simpler and cleaner.

One thing I noticed is that, for direct IO reads, despite setting the
->nofault attribute of the iov_iter to true, we can still get page
faults while in the iomap code.
This happens when reading from holes and unwritten/prealloc extents,
because iomap calls iov_iter_zero() and this seems to ignore the value
of ->nofault.
Is that intentional? I can get around it by surrounding the iomap call
with pagefault_disable() / pagefault_enable(), but it seems odd to do
so, given that iov_iter->nofault was set to true.
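
That is, roughly the following around the direct IO read (schematic only;
the btrfs ops names are the existing ones, and 'read' stands for the bytes
already done, i.e. the done_before argument from this series):

	pagefault_disable();
	to->nofault = true;
	ret = iomap_dio_rw(iocb, to, &btrfs_dio_iomap_ops, &btrfs_dio_ops,
			   0, read);
	to->nofault = false;
	pagefault_enable();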

[1] https://git.kernel.org/pub/scm/fs/xfs/xfstests-dev.git/commit/?id=d3cbdabffc4cb28850e97bc7bd8a7a1460db94e5

Thanks.

>
>
> The first two patches are independent of the core of this patch queue
> and I've asked the respective maintainers to have a look, but I've not
> heard back from them. The first patch should just go into Al's tree;
> it's a relatively straight-forward fix. The second patch really needs
> to be looked at; it might break things:
>
> iov_iter: Fix iov_iter_get_pages{,_alloc} page fault return value
> powerpc/kvm: Fix kvm_use_magic_page
>
>
> Al and Linus seem to have a disagreement about the error reporting
> semantics that functions fault_in_{readable,writeable} and
> fault_in_iov_iter_{readable,writeable} should have. I've implemented
> Linus's suggestion of returning the number of bytes not faulted in and I
> think that being able to tell if "nothing", "something" or "everything"
> could be faulted in does help, but I'll live with anything that allows
> us to make progress.
>
>
> The iomap changes should ideally be reviewed by Christoph; I've not
> heard from him about those.
>
>
> Thanks,
> Andreas
>
> Andreas Gruenbacher (16):
> iov_iter: Fix iov_iter_get_pages{,_alloc} page fault return value
> powerpc/kvm: Fix kvm_use_magic_page
> gup: Turn fault_in_pages_{readable,writeable} into
> fault_in_{readable,writeable}
> iov_iter: Turn iov_iter_fault_in_readable into
> fault_in_iov_iter_readable
> iov_iter: Introduce fault_in_iov_iter_writeable
> gfs2: Add wrapper for iomap_file_buffered_write
> gfs2: Clean up function may_grant
> gfs2: Move the inode glock locking to gfs2_file_buffered_write
> gfs2: Eliminate ip->i_gh
> gfs2: Fix mmap + page fault deadlocks for buffered I/O
> iomap: Fix iomap_dio_rw return value for user copies
> iomap: Support partial direct I/O on user copy failures
> iomap: Add done_before argument to iomap_dio_rw
> gup: Introduce FOLL_NOFAULT flag to disable page faults
> iov_iter: Introduce nofault flag to disable page faults
> gfs2: Fix mmap + page fault deadlocks for direct I/O
>
> Bob Peterson (3):
> gfs2: Eliminate vestigial HIF_FIRST
> gfs2: Remove redundant check from gfs2_glock_dq
> gfs2: Introduce flag for glock holder auto-demotion
>
> arch/powerpc/kernel/kvm.c | 3 +-
> arch/powerpc/kernel/signal_32.c | 4 +-
> arch/powerpc/kernel/signal_64.c | 2 +-
> arch/x86/kernel/fpu/signal.c | 7 +-
> drivers/gpu/drm/armada/armada_gem.c | 7 +-
> fs/btrfs/file.c | 7 +-
> fs/btrfs/ioctl.c | 5 +-
> fs/ext4/file.c | 5 +-
> fs/f2fs/file.c | 2 +-
> fs/fuse/file.c | 2 +-
> fs/gfs2/bmap.c | 60 +----
> fs/gfs2/file.c | 245 ++++++++++++++++++--
> fs/gfs2/glock.c | 340 +++++++++++++++++++++-------
> fs/gfs2/glock.h | 20 ++
> fs/gfs2/incore.h | 5 +-
> fs/iomap/buffered-io.c | 2 +-
> fs/iomap/direct-io.c | 21 +-
> fs/ntfs/file.c | 2 +-
> fs/xfs/xfs_file.c | 6 +-
> fs/zonefs/super.c | 4 +-
> include/linux/iomap.h | 11 +-
> include/linux/mm.h | 3 +-
> include/linux/pagemap.h | 58 +----
> include/linux/uio.h | 4 +-
> lib/iov_iter.c | 103 +++++++--
> mm/filemap.c | 4 +-
> mm/gup.c | 139 +++++++++++-
> 27 files changed, 785 insertions(+), 286 deletions(-)
>
> --
> 2.26.3
>


--
Filipe David Manana,

“Whether you think you can, or you think you can't — you're right.”

2021-09-03 16:16:39

by Filipe Manana

[permalink] [raw]
Subject: Re: [PATCH v7 03/19] gup: Turn fault_in_pages_{readable,writeable} into fault_in_{readable,writeable}

On Fri, Aug 27, 2021 at 5:52 PM Andreas Gruenbacher <[email protected]> wrote:
>
> Turn fault_in_pages_{readable,writeable} into versions that return the
> number of bytes not faulted in (similar to copy_to_user) instead of
> returning a non-zero value when any of the requested pages couldn't be
> faulted in. This supports the existing users that require all pages to
> be faulted in as well as new users that are happy if any pages can be
> faulted in at all.
>
> Neither of these functions is entirely trivial and it doesn't seem
> useful to inline them, so move them to mm/gup.c.
>
> Rename the functions to fault_in_{readable,writeable} to make sure that
> this change doesn't silently break things.
>
> Signed-off-by: Andreas Gruenbacher <[email protected]>
> ---
> arch/powerpc/kernel/kvm.c | 3 +-
> arch/powerpc/kernel/signal_32.c | 4 +-
> arch/powerpc/kernel/signal_64.c | 2 +-
> arch/x86/kernel/fpu/signal.c | 7 ++-
> drivers/gpu/drm/armada/armada_gem.c | 7 ++-
> fs/btrfs/ioctl.c | 5 +-
> include/linux/pagemap.h | 57 ++---------------------
> lib/iov_iter.c | 10 ++--
> mm/filemap.c | 2 +-
> mm/gup.c | 72 +++++++++++++++++++++++++++++
> 10 files changed, 93 insertions(+), 76 deletions(-)
>
> diff --git a/arch/powerpc/kernel/kvm.c b/arch/powerpc/kernel/kvm.c
> index d89cf802d9aa..6568823cf306 100644
> --- a/arch/powerpc/kernel/kvm.c
> +++ b/arch/powerpc/kernel/kvm.c
> @@ -669,7 +669,8 @@ static void __init kvm_use_magic_page(void)
> on_each_cpu(kvm_map_magic_page, &features, 1);
>
> /* Quick self-test to see if the mapping works */
> - if (fault_in_pages_readable((const char *)KVM_MAGIC_PAGE, sizeof(u32))) {
> + if (fault_in_readable((const char __user *)KVM_MAGIC_PAGE,
> + sizeof(u32))) {
> kvm_patching_worked = false;
> return;
> }
> diff --git a/arch/powerpc/kernel/signal_32.c b/arch/powerpc/kernel/signal_32.c
> index 0608581967f0..38c3eae40c14 100644
> --- a/arch/powerpc/kernel/signal_32.c
> +++ b/arch/powerpc/kernel/signal_32.c
> @@ -1048,7 +1048,7 @@ SYSCALL_DEFINE3(swapcontext, struct ucontext __user *, old_ctx,
> if (new_ctx == NULL)
> return 0;
> if (!access_ok(new_ctx, ctx_size) ||
> - fault_in_pages_readable((u8 __user *)new_ctx, ctx_size))
> + fault_in_readable((char __user *)new_ctx, ctx_size))
> return -EFAULT;
>
> /*
> @@ -1237,7 +1237,7 @@ SYSCALL_DEFINE3(debug_setcontext, struct ucontext __user *, ctx,
> #endif
>
> if (!access_ok(ctx, sizeof(*ctx)) ||
> - fault_in_pages_readable((u8 __user *)ctx, sizeof(*ctx)))
> + fault_in_readable((char __user *)ctx, sizeof(*ctx)))
> return -EFAULT;
>
> /*
> diff --git a/arch/powerpc/kernel/signal_64.c b/arch/powerpc/kernel/signal_64.c
> index 1831bba0582e..9f471b4a11e3 100644
> --- a/arch/powerpc/kernel/signal_64.c
> +++ b/arch/powerpc/kernel/signal_64.c
> @@ -688,7 +688,7 @@ SYSCALL_DEFINE3(swapcontext, struct ucontext __user *, old_ctx,
> if (new_ctx == NULL)
> return 0;
> if (!access_ok(new_ctx, ctx_size) ||
> - fault_in_pages_readable((u8 __user *)new_ctx, ctx_size))
> + fault_in_readable((char __user *)new_ctx, ctx_size))
> return -EFAULT;
>
> /*
> diff --git a/arch/x86/kernel/fpu/signal.c b/arch/x86/kernel/fpu/signal.c
> index 445c57c9c539..ba6bdec81603 100644
> --- a/arch/x86/kernel/fpu/signal.c
> +++ b/arch/x86/kernel/fpu/signal.c
> @@ -205,7 +205,7 @@ int copy_fpstate_to_sigframe(void __user *buf, void __user *buf_fx, int size)
> fpregs_unlock();
>
> if (ret) {
> - if (!fault_in_pages_writeable(buf_fx, fpu_user_xstate_size))
> + if (!fault_in_writeable(buf_fx, fpu_user_xstate_size))
> goto retry;
> return -EFAULT;
> }
> @@ -278,10 +278,9 @@ static int restore_fpregs_from_user(void __user *buf, u64 xrestore,
> if (ret != -EFAULT)
> return -EINVAL;
>
> - ret = fault_in_pages_readable(buf, size);
> - if (!ret)
> + if (!fault_in_readable(buf, size))
> goto retry;
> - return ret;
> + return -EFAULT;
> }
>
> /*
> diff --git a/drivers/gpu/drm/armada/armada_gem.c b/drivers/gpu/drm/armada/armada_gem.c
> index 21909642ee4c..8fbb25913327 100644
> --- a/drivers/gpu/drm/armada/armada_gem.c
> +++ b/drivers/gpu/drm/armada/armada_gem.c
> @@ -336,7 +336,7 @@ int armada_gem_pwrite_ioctl(struct drm_device *dev, void *data,
> struct drm_armada_gem_pwrite *args = data;
> struct armada_gem_object *dobj;
> char __user *ptr;
> - int ret;
> + int ret = 0;
>
> DRM_DEBUG_DRIVER("handle %u off %u size %u ptr 0x%llx\n",
> args->handle, args->offset, args->size, args->ptr);
> @@ -349,9 +349,8 @@ int armada_gem_pwrite_ioctl(struct drm_device *dev, void *data,
> if (!access_ok(ptr, args->size))
> return -EFAULT;
>
> - ret = fault_in_pages_readable(ptr, args->size);
> - if (ret)
> - return ret;
> + if (fault_in_readable(ptr, args->size))
> + return -EFAULT;
>
> dobj = armada_gem_object_lookup(file, args->handle);
> if (dobj == NULL)
> diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
> index 0ba98e08a029..9233ecc31e2e 100644
> --- a/fs/btrfs/ioctl.c
> +++ b/fs/btrfs/ioctl.c
> @@ -2244,9 +2244,8 @@ static noinline int search_ioctl(struct inode *inode,
> key.offset = sk->min_offset;
>
> while (1) {
> - ret = fault_in_pages_writeable(ubuf + sk_offset,
> - *buf_size - sk_offset);
> - if (ret)
> + ret = -EFAULT;
> + if (fault_in_writeable(ubuf + sk_offset, *buf_size - sk_offset))
> break;
>
> ret = btrfs_search_forward(root, &key, path, sk->min_transid);
> diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
> index ed02aa522263..7c9edc9694d9 100644
> --- a/include/linux/pagemap.h
> +++ b/include/linux/pagemap.h
> @@ -734,61 +734,10 @@ int wait_on_page_private_2_killable(struct page *page);
> extern void add_page_wait_queue(struct page *page, wait_queue_entry_t *waiter);
>
> /*
> - * Fault everything in given userspace address range in.
> + * Fault in userspace address range.
> */
> -static inline int fault_in_pages_writeable(char __user *uaddr, int size)
> -{
> - char __user *end = uaddr + size - 1;
> -
> - if (unlikely(size == 0))
> - return 0;
> -
> - if (unlikely(uaddr > end))
> - return -EFAULT;
> - /*
> - * Writing zeroes into userspace here is OK, because we know that if
> - * the zero gets there, we'll be overwriting it.
> - */
> - do {
> - if (unlikely(__put_user(0, uaddr) != 0))
> - return -EFAULT;
> - uaddr += PAGE_SIZE;
> - } while (uaddr <= end);
> -
> - /* Check whether the range spilled into the next page. */
> - if (((unsigned long)uaddr & PAGE_MASK) ==
> - ((unsigned long)end & PAGE_MASK))
> - return __put_user(0, end);
> -
> - return 0;
> -}
> -
> -static inline int fault_in_pages_readable(const char __user *uaddr, int size)
> -{
> - volatile char c;
> - const char __user *end = uaddr + size - 1;
> -
> - if (unlikely(size == 0))
> - return 0;
> -
> - if (unlikely(uaddr > end))
> - return -EFAULT;
> -
> - do {
> - if (unlikely(__get_user(c, uaddr) != 0))
> - return -EFAULT;
> - uaddr += PAGE_SIZE;
> - } while (uaddr <= end);
> -
> - /* Check whether the range spilled into the next page. */
> - if (((unsigned long)uaddr & PAGE_MASK) ==
> - ((unsigned long)end & PAGE_MASK)) {
> - return __get_user(c, end);
> - }
> -
> - (void)c;
> - return 0;
> -}
> +size_t fault_in_writeable(char __user *uaddr, size_t size);
> +size_t fault_in_readable(const char __user *uaddr, size_t size);
>
> int add_to_page_cache_locked(struct page *page, struct address_space *mapping,
> pgoff_t index, gfp_t gfp_mask);
> diff --git a/lib/iov_iter.c b/lib/iov_iter.c
> index 25dfc48536d7..069cedd9d7b4 100644
> --- a/lib/iov_iter.c
> +++ b/lib/iov_iter.c
> @@ -191,7 +191,7 @@ static size_t copy_page_to_iter_iovec(struct page *page, size_t offset, size_t b
> buf = iov->iov_base + skip;
> copy = min(bytes, iov->iov_len - skip);
>
> - if (IS_ENABLED(CONFIG_HIGHMEM) && !fault_in_pages_writeable(buf, copy)) {
> + if (IS_ENABLED(CONFIG_HIGHMEM) && !fault_in_writeable(buf, copy)) {
> kaddr = kmap_atomic(page);
> from = kaddr + offset;
>
> @@ -275,7 +275,7 @@ static size_t copy_page_from_iter_iovec(struct page *page, size_t offset, size_t
> buf = iov->iov_base + skip;
> copy = min(bytes, iov->iov_len - skip);
>
> - if (IS_ENABLED(CONFIG_HIGHMEM) && !fault_in_pages_readable(buf, copy)) {
> + if (IS_ENABLED(CONFIG_HIGHMEM) && !fault_in_readable(buf, copy)) {
> kaddr = kmap_atomic(page);
> to = kaddr + offset;
>
> @@ -446,13 +446,11 @@ int iov_iter_fault_in_readable(const struct iov_iter *i, size_t bytes)
> bytes = i->count;
> for (p = i->iov, skip = i->iov_offset; bytes; p++, skip = 0) {
> size_t len = min(bytes, p->iov_len - skip);
> - int err;
>
> if (unlikely(!len))
> continue;
> - err = fault_in_pages_readable(p->iov_base + skip, len);
> - if (unlikely(err))
> - return err;
> + if (fault_in_readable(p->iov_base + skip, len))
> + return -EFAULT;
> bytes -= len;
> }
> }
> diff --git a/mm/filemap.c b/mm/filemap.c
> index d1458ecf2f51..4dec3bc7752e 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -88,7 +88,7 @@
> * ->lock_page (access_process_vm)
> *
> * ->i_mutex (generic_perform_write)
> - * ->mmap_lock (fault_in_pages_readable->do_page_fault)
> + * ->mmap_lock (fault_in_readable->do_page_fault)
> *
> * bdi->wb.list_lock
> * sb_lock (fs/fs-writeback.c)
> diff --git a/mm/gup.c b/mm/gup.c
> index b94717977d17..0cf47955e5a1 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -1672,6 +1672,78 @@ static long __get_user_pages_locked(struct mm_struct *mm, unsigned long start,
> }
> #endif /* !CONFIG_MMU */
>
> +/**
> + * fault_in_writeable - fault in userspace address range for writing
> + * @uaddr: start of address range
> + * @size: size of address range
> + *
> + * Returns the number of bytes not faulted in (like copy_to_user() and
> + * copy_from_user()).
> + */
> +size_t fault_in_writeable(char __user *uaddr, size_t size)
> +{
> + char __user *start = uaddr, *end;
> +
> + if (unlikely(size == 0))
> + return 0;
> + if (!PAGE_ALIGNED(uaddr)) {
> + if (unlikely(__put_user(0, uaddr) != 0))
> + return size;
> + uaddr = (char __user *)PAGE_ALIGN((unsigned long)uaddr);
> + }
> + end = (char __user *)PAGE_ALIGN((unsigned long)start + size);
> + if (unlikely(end < start))
> + end = NULL;
> + while (uaddr != end) {
> + if (unlikely(__put_user(0, uaddr) != 0))
> + goto out;
> + uaddr += PAGE_SIZE;

Won't we loop endlessly or corrupt some unwanted page when 'end' was
set to NULL?

> + }
> +
> +out:
> + if (size > uaddr - start)
> + return size - (uaddr - start);
> + return 0;
> +}
> +EXPORT_SYMBOL(fault_in_writeable);
> +
> +/**
> + * fault_in_readable - fault in userspace address range for reading
> + * @uaddr: start of user address range
> + * @size: size of user address range
> + *
> + * Returns the number of bytes not faulted in (like copy_to_user() and
> + * copy_from_user()).
> + */
> +size_t fault_in_readable(const char __user *uaddr, size_t size)
> +{
> + const char __user *start = uaddr, *end;
> + volatile char c;
> +
> + if (unlikely(size == 0))
> + return 0;
> + if (!PAGE_ALIGNED(uaddr)) {
> + if (unlikely(__get_user(c, uaddr) != 0))
> + return size;
> + uaddr = (const char __user *)PAGE_ALIGN((unsigned long)uaddr);
> + }
> + end = (const char __user *)PAGE_ALIGN((unsigned long)start + size);
> + if (unlikely(end < start))
> + end = NULL;
> + while (uaddr != end) {

Same kind of issue here, when 'end' was set to NULL?

Thanks.

> + if (unlikely(__get_user(c, uaddr) != 0))
> + goto out;
> + uaddr += PAGE_SIZE;
> + }
> +
> +out:
> + (void)c;
> + if (size > uaddr - start)
> + return size - (uaddr - start);
> + return 0;
> +}
> +EXPORT_SYMBOL(fault_in_readable);
> +
> /**
> * get_dump_page() - pin user page in memory while writing it to core dump
> * @addr: user address
> --
> 2.26.3
>


--
Filipe David Manana,

“Whether you think you can, or you think you can't — you're right.”

2021-09-03 16:52:52

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v7 00/19] gfs2: Fix mmap + page fault deadlocks

On Wed, Sep 1, 2021 at 12:53 PM Andreas Gruenbacher <[email protected]> wrote:
>
> So there's a minor merge conflict between Christoph's iomap_iter
> conversion and this patch queue now, and I should probably clarify the
> description of "iomap: Add done_before argument to iomap_dio_rw" that
> Darrick ran into. Then there are the user copy issues that Al has
> pointed out. Fixing those will create superficial conflicts with this
> patch queue, but probably nothing serious.
>
> So how should I proceed: do you expect a v8 of this patch queue on top
> of the current mainline?

So if you rebase for fixes, it's going to be a "next merge window" thing again.

Personally, I'm ok with the series as is, and the conflict isn't an
issue. So I'd take it as is, and then people can fix up niggling
issues later.

But if somebody screams loudly..

Linus

2021-09-03 18:34:55

by Al Viro

[permalink] [raw]
Subject: Re: [PATCH v7 00/19] gfs2: Fix mmap + page fault deadlocks

On Fri, Sep 03, 2021 at 08:52:00AM -0700, Linus Torvalds wrote:
> On Wed, Sep 1, 2021 at 12:53 PM Andreas Gruenbacher <[email protected]> wrote:
> >
> > So there's a minor merge conflict between Christoph's iomap_iter
> > conversion and this patch queue now, and I should probably clarify the
> > description of "iomap: Add done_before argument to iomap_dio_rw" that
> > Darrick ran into. Then there are the user copy issues that Al has
> > pointed out. Fixing those will create superficial conflicts with this
> > patch queue, but probably nothing serious.
> >
> > So how should I proceed: do you expect a v8 of this patch queue on top
> > of the current mainline?
>
> So if you rebase for fixes, it's going to be a "next merge window" thing again.
>
> Personally, I'm ok with the series as is, and the conflict isn't an
> issue. So I'd take it as is, and then people can fix up niggling
> issues later.
>
> But if somebody screams loudly..

FWIW, my objections regarding the calling conventions are still there.

Out of 3 callers that do want more than "success/failure", one is gone
(series by tglx) and one more is broken (regardless of the semantics,
btrfs on arm64). Which leaves 1 caller (fault-in for read in FPU
handling on x86 sigreturn). That caller turns out to be correct, but
IMO there are fairly strong arguments in favour of *not* using the
normal fault-in in that case.

"how many bytes can we fault in" is misleading, unless we really
poke into every cacheline in the affected area. Which might be a primitive
worth having, but it's a lot heavier than needed by the majority of callers.
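
To make that concrete, such a primitive might look roughly like the
sketch below (hypothetical name; like the existing helpers it assumes
the caller has done access_ok(), and it ignores wrap-around).  The
point is that it probes per granule rather than per page, which is
what would make "how many bytes can we fault in" meaningful when
faults can happen at sub-page granularity:

        static size_t fault_in_exact_writeable(char __user *uaddr, size_t size)
        {
                char __user *p = uaddr, *end = uaddr + size;

                if (unlikely(size == 0))
                        return 0;
                while (p < end) {
                        if (unlikely(__put_user(0, p) != 0))
                                return end - p;
                        /* step to the next granule, not the next page */
                        p = (char __user *)ALIGN((unsigned long)p + 1,
                                                 SMP_CACHE_BYTES);
                }
                return 0;
        }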

2021-09-03 18:49:20

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v7 00/19] gfs2: Fix mmap + page fault deadlocks

On Fri, Sep 3, 2021 at 11:28 AM Al Viro <[email protected]> wrote:
>
> FWIW, my objections regarding the calling conventions are still there.

So I'm happy to further change the calling conventions, but by now
_that_ part is most definitely a "not this merge window". The need for
that ternary state is still there.

It might go away in the future, but I think that's literally that: a
future cleanup. Not really related to the problem at hand.

Linus

2021-09-03 18:56:26

by Darrick J. Wong

[permalink] [raw]
Subject: Re: [PATCH v7 15/19] iomap: Support partial direct I/O on user copy failures

On Fri, Aug 27, 2021 at 06:49:22PM +0200, Andreas Gruenbacher wrote:
> In iomap_dio_rw, when iomap_apply returns an -EFAULT error and the
> IOMAP_DIO_PARTIAL flag is set, complete the request synchronously and
> return a partial result. This allows the caller to deal with the page
> fault and retry the remainder of the request.
>
> Signed-off-by: Andreas Gruenbacher <[email protected]>

Pretty straightforward.
Reviewed-by: Darrick J. Wong <[email protected]>

--D

> ---
> fs/iomap/direct-io.c | 6 ++++++
> include/linux/iomap.h | 7 +++++++
> 2 files changed, 13 insertions(+)
>
> diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
> index 8054f5d6c273..ba88fe51b77a 100644
> --- a/fs/iomap/direct-io.c
> +++ b/fs/iomap/direct-io.c
> @@ -561,6 +561,12 @@ __iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
> ret = iomap_apply(inode, pos, count, iomap_flags, ops, dio,
> iomap_dio_actor);
> if (ret <= 0) {
> + if (ret == -EFAULT && dio->size &&
> + (dio_flags & IOMAP_DIO_PARTIAL)) {
> + wait_for_completion = true;
> + ret = 0;
> + }
> +
> /* magic error code to fall back to buffered I/O */
> if (ret == -ENOTBLK) {
> wait_for_completion = true;
> diff --git a/include/linux/iomap.h b/include/linux/iomap.h
> index 479c1da3e221..bcae4814b8e3 100644
> --- a/include/linux/iomap.h
> +++ b/include/linux/iomap.h
> @@ -267,6 +267,13 @@ struct iomap_dio_ops {
> */
> #define IOMAP_DIO_OVERWRITE_ONLY (1 << 1)
>
> +/*
> + * When a page fault occurs, return a partial synchronous result and allow
> + * the caller to retry the rest of the operation after dealing with the page
> + * fault.
> + */
> +#define IOMAP_DIO_PARTIAL (1 << 2)
> +
> ssize_t iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
> const struct iomap_ops *ops, const struct iomap_dio_ops *dops,
> unsigned int dio_flags);
> --
> 2.26.3
>

2021-09-03 18:58:29

by Darrick J. Wong

[permalink] [raw]
Subject: Re: [PATCH v7 14/19] iomap: Fix iomap_dio_rw return value for user copies

On Fri, Aug 27, 2021 at 06:49:21PM +0200, Andreas Gruenbacher wrote:
> When a user copy fails in one of the helpers of iomap_dio_rw, fail with
> -EFAULT instead of returning 0. This matches what iomap_dio_bio_actor
> returns when it gets an -EFAULT from bio_iov_iter_get_pages. With these
> changes, iomap_dio_actor now consistently fails with -EFAULT when a user
> page cannot be faulted in.
>
> Signed-off-by: Andreas Gruenbacher <[email protected]>

Reviewed-by: Darrick J. Wong <[email protected]>

--D

> ---
> fs/iomap/direct-io.c | 4 ++--
> 1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
> index 9398b8c31323..8054f5d6c273 100644
> --- a/fs/iomap/direct-io.c
> +++ b/fs/iomap/direct-io.c
> @@ -370,7 +370,7 @@ iomap_dio_hole_actor(loff_t length, struct iomap_dio *dio)
> {
> length = iov_iter_zero(length, dio->submit.iter);
> dio->size += length;
> - return length;
> + return length ? length : -EFAULT;
> }
>
> static loff_t
> @@ -397,7 +397,7 @@ iomap_dio_inline_actor(struct inode *inode, loff_t pos, loff_t length,
> copied = copy_to_iter(iomap->inline_data + pos, length, iter);
> }
> dio->size += copied;
> - return copied;
> + return copied ? copied : -EFAULT;
> }
>
> static loff_t
> --
> 2.26.3
>

2021-09-03 19:18:34

by Darrick J. Wong

[permalink] [raw]
Subject: Re: [PATCH v7 16/19] iomap: Add done_before argument to iomap_dio_rw

On Fri, Aug 27, 2021 at 06:49:23PM +0200, Andreas Gruenbacher wrote:
> Add a done_before argument to iomap_dio_rw that indicates how much of
> the request has already been transferred. When the request succeeds, we
> report that done_before additional bytes were transferred. This is
> useful for finishing a request asynchronously when part of the request
> has already been completed synchronously.
>
> We'll use that to allow iomap_dio_rw to be used with page faults
> disabled: when a page fault occurs while submitting a request, we
> synchronously complete the part of the request that has already been
> submitted. The caller can then take care of the page fault and call
> iomap_dio_rw again for the rest of the request, passing in the number of
> bytes already transferred.
>
> Signed-off-by: Andreas Gruenbacher <[email protected]>
> ---
> fs/btrfs/file.c | 5 +++--
> fs/ext4/file.c | 5 +++--
> fs/gfs2/file.c | 4 ++--
> fs/iomap/direct-io.c | 11 ++++++++---
> fs/xfs/xfs_file.c | 6 +++---
> fs/zonefs/super.c | 4 ++--
> include/linux/iomap.h | 4 ++--
> 7 files changed, 23 insertions(+), 16 deletions(-)
>
> diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
> index 281c77cfe91a..8817fe6b5fc0 100644
> --- a/fs/btrfs/file.c
> +++ b/fs/btrfs/file.c
> @@ -1945,7 +1945,7 @@ static ssize_t btrfs_direct_write(struct kiocb *iocb, struct iov_iter *from)
> }
>
> dio = __iomap_dio_rw(iocb, from, &btrfs_dio_iomap_ops, &btrfs_dio_ops,
> - 0);
> + 0, 0);
>
> btrfs_inode_unlock(inode, ilock_flags);
>
> @@ -3637,7 +3637,8 @@ static ssize_t btrfs_direct_read(struct kiocb *iocb, struct iov_iter *to)
> return 0;
>
> btrfs_inode_lock(inode, BTRFS_ILOCK_SHARED);
> - ret = iomap_dio_rw(iocb, to, &btrfs_dio_iomap_ops, &btrfs_dio_ops, 0);
> + ret = iomap_dio_rw(iocb, to, &btrfs_dio_iomap_ops, &btrfs_dio_ops,
> + 0, 0);
> btrfs_inode_unlock(inode, BTRFS_ILOCK_SHARED);
> return ret;
> }
> diff --git a/fs/ext4/file.c b/fs/ext4/file.c
> index 816dedcbd541..4a5e7fd31fb5 100644
> --- a/fs/ext4/file.c
> +++ b/fs/ext4/file.c
> @@ -74,7 +74,7 @@ static ssize_t ext4_dio_read_iter(struct kiocb *iocb, struct iov_iter *to)
> return generic_file_read_iter(iocb, to);
> }
>
> - ret = iomap_dio_rw(iocb, to, &ext4_iomap_ops, NULL, 0);
> + ret = iomap_dio_rw(iocb, to, &ext4_iomap_ops, NULL, 0, 0);
> inode_unlock_shared(inode);
>
> file_accessed(iocb->ki_filp);
> @@ -566,7 +566,8 @@ static ssize_t ext4_dio_write_iter(struct kiocb *iocb, struct iov_iter *from)
> if (ilock_shared)
> iomap_ops = &ext4_iomap_overwrite_ops;
> ret = iomap_dio_rw(iocb, from, iomap_ops, &ext4_dio_write_ops,
> - (unaligned_io || extend) ? IOMAP_DIO_FORCE_WAIT : 0);
> + (unaligned_io || extend) ? IOMAP_DIO_FORCE_WAIT : 0,
> + 0);
> if (ret == -ENOTBLK)
> ret = 0;
>
> diff --git a/fs/gfs2/file.c b/fs/gfs2/file.c
> index fce3a5249e19..64bf2f68e6d6 100644
> --- a/fs/gfs2/file.c
> +++ b/fs/gfs2/file.c
> @@ -822,7 +822,7 @@ static ssize_t gfs2_file_direct_read(struct kiocb *iocb, struct iov_iter *to,
> if (ret)
> goto out_uninit;
>
> - ret = iomap_dio_rw(iocb, to, &gfs2_iomap_ops, NULL, 0);
> + ret = iomap_dio_rw(iocb, to, &gfs2_iomap_ops, NULL, 0, 0);
> gfs2_glock_dq(gh);
> out_uninit:
> gfs2_holder_uninit(gh);
> @@ -856,7 +856,7 @@ static ssize_t gfs2_file_direct_write(struct kiocb *iocb, struct iov_iter *from,
> if (offset + len > i_size_read(&ip->i_inode))
> goto out;
>
> - ret = iomap_dio_rw(iocb, from, &gfs2_iomap_ops, NULL, 0);
> + ret = iomap_dio_rw(iocb, from, &gfs2_iomap_ops, NULL, 0, 0);
> if (ret == -ENOTBLK)
> ret = 0;
> out:
> diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
> index ba88fe51b77a..dcf9a2b4381f 100644
> --- a/fs/iomap/direct-io.c
> +++ b/fs/iomap/direct-io.c
> @@ -31,6 +31,7 @@ struct iomap_dio {
> atomic_t ref;
> unsigned flags;
> int error;
> + size_t done_before;

So, now that I actually understand the reason why the count of
previously transferred bytes has to be passed into the iomap_dio, I
would like this field to have a comment so that stupid maintainers like
me don't forget the subtleties again:

	/*
	 * For asynchronous IO, we have one chance to call the iocb
	 * completion method with the results of the directio operation.
	 * If this operation is a resubmission after a previous partial
	 * completion (e.g. page fault), we need to know about that
	 * progress so that we can report that and the result of the
	 * resubmission to the iocb completion.
	 */
	size_t done_before;

With that added, I think I can live with this enough to:
Reviewed-by: Darrick J. Wong <[email protected]>

--D

> bool wait_for_completion;
>
> union {
> @@ -126,6 +127,9 @@ ssize_t iomap_dio_complete(struct iomap_dio *dio)
> if (ret > 0 && (dio->flags & IOMAP_DIO_NEED_SYNC))
> ret = generic_write_sync(iocb, ret);
>
> + if (ret > 0)
> + ret += dio->done_before;
> +
> kfree(dio);
>
> return ret;
> @@ -450,7 +454,7 @@ iomap_dio_actor(struct inode *inode, loff_t pos, loff_t length,
> struct iomap_dio *
> __iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
> const struct iomap_ops *ops, const struct iomap_dio_ops *dops,
> - unsigned int dio_flags)
> + unsigned int dio_flags, size_t done_before)
> {
> struct address_space *mapping = iocb->ki_filp->f_mapping;
> struct inode *inode = file_inode(iocb->ki_filp);
> @@ -477,6 +481,7 @@ __iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
> dio->dops = dops;
> dio->error = 0;
> dio->flags = 0;
> + dio->done_before = done_before;
>
> dio->submit.iter = iter;
> dio->submit.waiter = current;
> @@ -648,11 +653,11 @@ EXPORT_SYMBOL_GPL(__iomap_dio_rw);
> ssize_t
> iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
> const struct iomap_ops *ops, const struct iomap_dio_ops *dops,
> - unsigned int dio_flags)
> + unsigned int dio_flags, size_t done_before)
> {
> struct iomap_dio *dio;
>
> - dio = __iomap_dio_rw(iocb, iter, ops, dops, dio_flags);
> + dio = __iomap_dio_rw(iocb, iter, ops, dops, dio_flags, done_before);
> if (IS_ERR_OR_NULL(dio))
> return PTR_ERR_OR_ZERO(dio);
> return iomap_dio_complete(dio);
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index cc3cfb12df53..3103d9bda466 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -259,7 +259,7 @@ xfs_file_dio_read(
> ret = xfs_ilock_iocb(iocb, XFS_IOLOCK_SHARED);
> if (ret)
> return ret;
> - ret = iomap_dio_rw(iocb, to, &xfs_read_iomap_ops, NULL, 0);
> + ret = iomap_dio_rw(iocb, to, &xfs_read_iomap_ops, NULL, 0, 0);
> xfs_iunlock(ip, XFS_IOLOCK_SHARED);
>
> return ret;
> @@ -569,7 +569,7 @@ xfs_file_dio_write_aligned(
> }
> trace_xfs_file_direct_write(iocb, from);
> ret = iomap_dio_rw(iocb, from, &xfs_direct_write_iomap_ops,
> - &xfs_dio_write_ops, 0);
> + &xfs_dio_write_ops, 0, 0);
> out_unlock:
> if (iolock)
> xfs_iunlock(ip, iolock);
> @@ -647,7 +647,7 @@ xfs_file_dio_write_unaligned(
>
> trace_xfs_file_direct_write(iocb, from);
> ret = iomap_dio_rw(iocb, from, &xfs_direct_write_iomap_ops,
> - &xfs_dio_write_ops, flags);
> + &xfs_dio_write_ops, flags, 0);
>
> /*
> * Retry unaligned I/O with exclusive blocking semantics if the DIO
> diff --git a/fs/zonefs/super.c b/fs/zonefs/super.c
> index 70055d486bf7..85ca2f5fe06e 100644
> --- a/fs/zonefs/super.c
> +++ b/fs/zonefs/super.c
> @@ -864,7 +864,7 @@ static ssize_t zonefs_file_dio_write(struct kiocb *iocb, struct iov_iter *from)
> ret = zonefs_file_dio_append(iocb, from);
> else
> ret = iomap_dio_rw(iocb, from, &zonefs_iomap_ops,
> - &zonefs_write_dio_ops, 0);
> + &zonefs_write_dio_ops, 0, 0);
> if (zi->i_ztype == ZONEFS_ZTYPE_SEQ &&
> (ret > 0 || ret == -EIOCBQUEUED)) {
> if (ret > 0)
> @@ -999,7 +999,7 @@ static ssize_t zonefs_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
> }
> file_accessed(iocb->ki_filp);
> ret = iomap_dio_rw(iocb, to, &zonefs_iomap_ops,
> - &zonefs_read_dio_ops, 0);
> + &zonefs_read_dio_ops, 0, 0);
> } else {
> ret = generic_file_read_iter(iocb, to);
> if (ret == -EIO)
> diff --git a/include/linux/iomap.h b/include/linux/iomap.h
> index bcae4814b8e3..908bda10024c 100644
> --- a/include/linux/iomap.h
> +++ b/include/linux/iomap.h
> @@ -276,10 +276,10 @@ struct iomap_dio_ops {
>
> ssize_t iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
> const struct iomap_ops *ops, const struct iomap_dio_ops *dops,
> - unsigned int dio_flags);
> + unsigned int dio_flags, size_t done_before);
> struct iomap_dio *__iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
> const struct iomap_ops *ops, const struct iomap_dio_ops *dops,
> - unsigned int dio_flags);
> + unsigned int dio_flags, size_t done_before);
> ssize_t iomap_dio_complete(struct iomap_dio *dio);
> int iomap_dio_iopoll(struct kiocb *kiocb, bool spin);
>
> --
> 2.26.3
>

2021-09-03 19:19:26

by Darrick J. Wong

[permalink] [raw]
Subject: Re: [PATCH v7 16/19] iomap: Add done_before argument to iomap_dio_rw

On Fri, Aug 27, 2021 at 03:35:06PM -0700, Linus Torvalds wrote:
> On Fri, Aug 27, 2021 at 2:32 PM Darrick J. Wong <[email protected]> wrote:
> >
> > No, because you totally ignored the second question:
> >
> > If the directio operation succeeds even partially and the PARTIAL flag
> > is set, won't that push the iov iter ahead by however many bytes
> > completed?
> >
> > We already finished the IO for the first page, so the second attempt
> > should pick up where it left off, i.e. the second page.
>
> Darrick, I think you're missing the point.
>
> It's the *return*value* that is the issue, not the iovec.
>
> The iovec is updated as you say. But the return value from the async
> part is - without Andreas' patch - only the async part of it.
>
> With Andreas' patch, the async part will now return the full return
> value, including the part that was done synchronously.
>
> And the return value is returned from that async part, which somehow
> thus needs to know what predated it.

Aha, that was the missing piece, thank you. I'd forgotten that
iomap_dio_complete_work calls iocb->ki_complete with the return value of
iomap_dio_complete, which means that the iomap_dio has to know if there
was a previous transfer that stopped short so that the caller could do
more work and resubmit.

> Could that pre-existing part perhaps be saved somewhere else? Very
> possibly. That 'struct iomap_dio' addition is kind of ugly. So maybe
> what Andreas did could be done differently.

There's probably a more elegant way for the ->ki_complete functions to
figure out how much got transferred, but that's sufficiently ugly and
invasive so as not to be suitable for a bug fix.

> But I think you guys are arguing past each other.

Yes, definitely.

--D

>
> Linus

2021-09-09 11:22:18

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH v7 14/19] iomap: Fix iomap_dio_rw return value for user copies

On Fri, Aug 27, 2021 at 06:49:21PM +0200, Andreas Gruenbacher wrote:
> When a user copy fails in one of the helpers of iomap_dio_rw, fail with
> -EFAULT instead of returning 0. This matches what iomap_dio_bio_actor
> returns when it gets an -EFAULT from bio_iov_iter_get_pages. With these
> changes, iomap_dio_actor now consistently fails with -EFAULT when a user
> page cannot be faulted in.
>
> Signed-off-by: Andreas Gruenbacher <[email protected]>
> ---
> fs/iomap/direct-io.c | 4 ++--
> 1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
> index 9398b8c31323..8054f5d6c273 100644
> --- a/fs/iomap/direct-io.c
> +++ b/fs/iomap/direct-io.c
> @@ -370,7 +370,7 @@ iomap_dio_hole_actor(loff_t length, struct iomap_dio *dio)
> {
> length = iov_iter_zero(length, dio->submit.iter);
> dio->size += length;
> - return length;
> + return length ? length : -EFAULT;

What's wrong with a good old:

	if (!length)
		return -EFAULT;
	return length?

Besides this nit and the fact that the patch needs a rebase, it looks
good to me:

Reviewed-by: Christoph Hellwig <[email protected]>

2021-09-09 11:25:10

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH v7 15/19] iomap: Support partial direct I/O on user copy failures

On Fri, Aug 27, 2021 at 06:49:22PM +0200, Andreas Gruenbacher wrote:
> In iomap_dio_rw, when iomap_apply returns an -EFAULT error and the
> IOMAP_DIO_PARTIAL flag is set, complete the request synchronously and
> return a partial result. This allows the caller to deal with the page
> fault and retry the remainder of the request.
>
> Signed-off-by: Andreas Gruenbacher <[email protected]>
> ---
> fs/iomap/direct-io.c | 6 ++++++
> include/linux/iomap.h | 7 +++++++
> 2 files changed, 13 insertions(+)
>
> diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
> index 8054f5d6c273..ba88fe51b77a 100644
> --- a/fs/iomap/direct-io.c
> +++ b/fs/iomap/direct-io.c
> @@ -561,6 +561,12 @@ __iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
> ret = iomap_apply(inode, pos, count, iomap_flags, ops, dio,
> iomap_dio_actor);
> if (ret <= 0) {
> + if (ret == -EFAULT && dio->size &&
> + (dio_flags & IOMAP_DIO_PARTIAL)) {
> + wait_for_completion = true;
> + ret = 0;

Do we need a NOWAIT check here to skip the wait_for_completion
for that case?

2021-09-09 11:34:09

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH v7 16/19] iomap: Add done_before argument to iomap_dio_rw

What about just passing done_before as an argument to
iomap_dio_complete? gfs2 would have to switch to __iomap_dio_rw +
iomap_dio_complete instead of iomap_dio_rw for that, and it obviously
won't work for async completions, but you force sync in this case
anyway, right?
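
I.e. something like this on the gfs2 side (just a sketch of the idea,
untested; for the synchronous case it ends up equivalent to passing
done_before into iomap_dio_complete directly):

        dio = __iomap_dio_rw(iocb, from, &gfs2_iomap_ops, NULL, dio_flags);
        if (IS_ERR_OR_NULL(dio))
                return PTR_ERR_OR_ZERO(dio);
        ret = iomap_dio_complete(dio);
        if (ret > 0)
                ret += done_before;     /* tracked by the gfs2 caller */
        return ret;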

2021-09-09 11:39:24

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH v7 17/19] gup: Introduce FOLL_NOFAULT flag to disable page faults

On Fri, Aug 27, 2021 at 06:49:24PM +0200, Andreas Gruenbacher wrote:
> Introduce a new FOLL_NOFAULT flag that causes get_user_pages to return
> -EFAULT when it would otherwise trigger a page fault. This is roughly
> similar to FOLL_FAST_ONLY but available on all architectures, and less
> fragile.

So, FOLL_FAST_ONLY only has a single user, through
get_user_pages_fast_only (pin_user_pages_fast_only is entirely unused,
which makes total sense given that "give up on fault" combined with
"pin" is not exactly a useful set of semantics).

But it looks like they want to call it from atomic context, so we can't
really share it. Sigh, I hate all these single-user FOLL flags that
make gup.c a complete mess.

But otherwise this looks fine.

2021-09-09 17:19:33

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v7 17/19] gup: Introduce FOLL_NOFAULT flag to disable page faults

On Thu, Sep 9, 2021 at 4:36 AM Christoph Hellwig <[email protected]> wrote:
>
> On Fri, Aug 27, 2021 at 06:49:24PM +0200, Andreas Gruenbacher wrote:
> > Introduce a new FOLL_NOFAULT flag that causes get_user_pages to return
> > -EFAULT when it would otherwise trigger a page fault. This is roughly
> > similar to FOLL_FAST_ONLY but available on all architectures, and less
> > fragile.
>
> So, FOLL_FAST_ONLY only has a single user, through
> get_user_pages_fast_only (pin_user_pages_fast_only is entirely unused,
> which makes total sense given that "give up on fault" combined with
> "pin" is not exactly a useful set of semantics).

So I think we should treat FOLL_FAST_ONLY as a special "internal to
gup.c" flag, and perhaps not really compare it to the new
FOLL_NOFAULT.

In fact, maybe we could even just make FOLL_FAST_ONLY be the high bit,
and not expose it in <linux/mm.h> and make it entirely private as a
name in gup.c.

Because FOLL_FAST_ONLY really is meant more as a "this way we can
share code easily inside gup.c, by having the internal helpers that
*can* do everything, but not do it all when the user is one of the
limited interfaces".

Because we don't really expect people to use FOLL_FAST_ONLY externally
- they'll use the explicit interfaces we have instead (ie
"get_user_pages_fast()"). Those use-cases that want that fast-only
thing really are so special that they need to be very explicitly so.

FOLL_NOFAULT is different, in that that is something an external user
_would_ use.

Admittedly we'd only have one single case for now, but I think we may
end up with other filesystems - or other cases entirely - having that
same kind of "I am holding locks, so I can't fault into the MM, but
I'm otherwise ok with the immediate mmap_sem lock usage and sleeping".

End result: FOLL_FAST_ONLY and FOLL_NOFAULT have some similarities,
but at the same time I think they are fundamentally different.

The FAST_ONLY is the very very special "I can't sleep, I can't even
take the fundamental MM lock, and we export special interfaces because
it's _so_ special and can be used in interrupts etc".

In contrast, NOFAULT is not _that_ special. It's just another flag,
and has generic use.

Linus

2021-09-09 17:24:50

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v7 16/19] iomap: Add done_before argument to iomap_dio_rw

On Thu, Sep 9, 2021 at 4:31 AM Christoph Hellwig <[email protected]> wrote:
>
> What about just passing done_before as an argument to
> iomap_dio_complete? gfs2 would have to switch to __iomap_dio_rw +
> iomap_dio_complete instead of iomap_dio_rw for that, and it obviously
> won't work for async completions, but you force sync in this case
> anyway, right?

I think you misunderstand.

Or maybe I do.

It very much doesn't force sync in this case. It did the *first* part
of it synchronously, but then it wants to continue with that async
part for the rest, and very much do that async completion.

And that's why it wants to add that "I already did X much of the
work", exactly so that the async completion can report the full end
result.

But maybe now it's me who is misunderstanding.

Linus

2021-09-10 07:29:49

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH v7 17/19] gup: Introduce FOLL_NOFAULT flag to disable page faults

On Thu, Sep 09, 2021 at 10:17:14AM -0700, Linus Torvalds wrote:
> So I think we should treat FOLL_FAST_ONLY as a special "internal to
> gup.c" flag, and perhaps not really compare it to the new
> FOLL_NOFAULT.
>
> In fact, maybe we could even just make FOLL_FAST_ONLY be the high bit,
> and not expose it in <linux/mm.h> and make it entirely private as a
> name in gup.c.

There are quite a few bits like that. I've been wanting to make them
private for 5.16.

2021-09-10 07:41:57

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH v7 16/19] iomap: Add done_before argument to iomap_dio_rw

On Thu, Sep 09, 2021 at 10:22:56AM -0700, Linus Torvalds wrote:
> I think you misunderstand.
>
> Or maybe I do.
>
> It very much doesn't force sync in this case. It did the *first* part
> of it synchronously, but then it wants to continue with that async
> part for the rest, and very much do that async completion.
>
> And that's why it wants to add that "I already did X much of the
> work", exactly so that the async completion can report the full end
> result.

Could be, and yes in that case it won't work.

2021-09-28 15:04:24

by Andreas Gruenbacher

[permalink] [raw]
Subject: Re: [PATCH v7 03/19] gup: Turn fault_in_pages_{readable,writeable} into fault_in_{readable,writeable}

On Fri, Sep 3, 2021 at 4:57 PM Filipe Manana <[email protected]> wrote:
> On Fri, Aug 27, 2021 at 5:52 PM Andreas Gruenbacher <[email protected]> wrote:
> >
> > Turn fault_in_pages_{readable,writeable} into versions that return the
> > number of bytes not faulted in (similar to copy_to_user) instead of
> > returning a non-zero value when any of the requested pages couldn't be
> > faulted in. This supports the existing users that require all pages to
> > be faulted in as well as new users that are happy if any pages can be
> > faulted in at all.
> >
> > Neither of these functions is entirely trivial and it doesn't seem
> > useful to inline them, so move them to mm/gup.c.
> >
> > Rename the functions to fault_in_{readable,writeable} to make sure that
> > this change doesn't silently break things.
> >
> > Signed-off-by: Andreas Gruenbacher <[email protected]>
> > ---
> > arch/powerpc/kernel/kvm.c | 3 +-
> > arch/powerpc/kernel/signal_32.c | 4 +-
> > arch/powerpc/kernel/signal_64.c | 2 +-
> > arch/x86/kernel/fpu/signal.c | 7 ++-
> > drivers/gpu/drm/armada/armada_gem.c | 7 ++-
> > fs/btrfs/ioctl.c | 5 +-
> > include/linux/pagemap.h | 57 ++---------------------
> > lib/iov_iter.c | 10 ++--
> > mm/filemap.c | 2 +-
> > mm/gup.c | 72 +++++++++++++++++++++++++++++
> > 10 files changed, 93 insertions(+), 76 deletions(-)
> >
> > diff --git a/arch/powerpc/kernel/kvm.c b/arch/powerpc/kernel/kvm.c
> > index d89cf802d9aa..6568823cf306 100644
> > --- a/arch/powerpc/kernel/kvm.c
> > +++ b/arch/powerpc/kernel/kvm.c
> > @@ -669,7 +669,8 @@ static void __init kvm_use_magic_page(void)
> > on_each_cpu(kvm_map_magic_page, &features, 1);
> >
> > /* Quick self-test to see if the mapping works */
> > - if (fault_in_pages_readable((const char *)KVM_MAGIC_PAGE, sizeof(u32))) {
> > + if (fault_in_readable((const char __user *)KVM_MAGIC_PAGE,
> > + sizeof(u32))) {
> > kvm_patching_worked = false;
> > return;
> > }
> > diff --git a/arch/powerpc/kernel/signal_32.c b/arch/powerpc/kernel/signal_32.c
> > index 0608581967f0..38c3eae40c14 100644
> > --- a/arch/powerpc/kernel/signal_32.c
> > +++ b/arch/powerpc/kernel/signal_32.c
> > @@ -1048,7 +1048,7 @@ SYSCALL_DEFINE3(swapcontext, struct ucontext __user *, old_ctx,
> > if (new_ctx == NULL)
> > return 0;
> > if (!access_ok(new_ctx, ctx_size) ||
> > - fault_in_pages_readable((u8 __user *)new_ctx, ctx_size))
> > + fault_in_readable((char __user *)new_ctx, ctx_size))
> > return -EFAULT;
> >
> > /*
> > @@ -1237,7 +1237,7 @@ SYSCALL_DEFINE3(debug_setcontext, struct ucontext __user *, ctx,
> > #endif
> >
> > if (!access_ok(ctx, sizeof(*ctx)) ||
> > - fault_in_pages_readable((u8 __user *)ctx, sizeof(*ctx)))
> > + fault_in_readable((char __user *)ctx, sizeof(*ctx)))
> > return -EFAULT;
> >
> > /*
> > diff --git a/arch/powerpc/kernel/signal_64.c b/arch/powerpc/kernel/signal_64.c
> > index 1831bba0582e..9f471b4a11e3 100644
> > --- a/arch/powerpc/kernel/signal_64.c
> > +++ b/arch/powerpc/kernel/signal_64.c
> > @@ -688,7 +688,7 @@ SYSCALL_DEFINE3(swapcontext, struct ucontext __user *, old_ctx,
> > if (new_ctx == NULL)
> > return 0;
> > if (!access_ok(new_ctx, ctx_size) ||
> > - fault_in_pages_readable((u8 __user *)new_ctx, ctx_size))
> > + fault_in_readable((char __user *)new_ctx, ctx_size))
> > return -EFAULT;
> >
> > /*
> > diff --git a/arch/x86/kernel/fpu/signal.c b/arch/x86/kernel/fpu/signal.c
> > index 445c57c9c539..ba6bdec81603 100644
> > --- a/arch/x86/kernel/fpu/signal.c
> > +++ b/arch/x86/kernel/fpu/signal.c
> > @@ -205,7 +205,7 @@ int copy_fpstate_to_sigframe(void __user *buf, void __user *buf_fx, int size)
> > fpregs_unlock();
> >
> > if (ret) {
> > - if (!fault_in_pages_writeable(buf_fx, fpu_user_xstate_size))
> > + if (!fault_in_writeable(buf_fx, fpu_user_xstate_size))
> > goto retry;
> > return -EFAULT;
> > }
> > @@ -278,10 +278,9 @@ static int restore_fpregs_from_user(void __user *buf, u64 xrestore,
> > if (ret != -EFAULT)
> > return -EINVAL;
> >
> > - ret = fault_in_pages_readable(buf, size);
> > - if (!ret)
> > + if (!fault_in_readable(buf, size))
> > goto retry;
> > - return ret;
> > + return -EFAULT;
> > }
> >
> > /*
> > diff --git a/drivers/gpu/drm/armada/armada_gem.c b/drivers/gpu/drm/armada/armada_gem.c
> > index 21909642ee4c..8fbb25913327 100644
> > --- a/drivers/gpu/drm/armada/armada_gem.c
> > +++ b/drivers/gpu/drm/armada/armada_gem.c
> > @@ -336,7 +336,7 @@ int armada_gem_pwrite_ioctl(struct drm_device *dev, void *data,
> > struct drm_armada_gem_pwrite *args = data;
> > struct armada_gem_object *dobj;
> > char __user *ptr;
> > - int ret;
> > + int ret = 0;
> >
> > DRM_DEBUG_DRIVER("handle %u off %u size %u ptr 0x%llx\n",
> > args->handle, args->offset, args->size, args->ptr);
> > @@ -349,9 +349,8 @@ int armada_gem_pwrite_ioctl(struct drm_device *dev, void *data,
> > if (!access_ok(ptr, args->size))
> > return -EFAULT;
> >
> > - ret = fault_in_pages_readable(ptr, args->size);
> > - if (ret)
> > - return ret;
> > + if (fault_in_readable(ptr, args->size))
> > + return -EFAULT;
> >
> > dobj = armada_gem_object_lookup(file, args->handle);
> > if (dobj == NULL)
> > diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
> > index 0ba98e08a029..9233ecc31e2e 100644
> > --- a/fs/btrfs/ioctl.c
> > +++ b/fs/btrfs/ioctl.c
> > @@ -2244,9 +2244,8 @@ static noinline int search_ioctl(struct inode *inode,
> > key.offset = sk->min_offset;
> >
> > while (1) {
> > - ret = fault_in_pages_writeable(ubuf + sk_offset,
> > - *buf_size - sk_offset);
> > - if (ret)
> > + ret = -EFAULT;
> > + if (fault_in_writeable(ubuf + sk_offset, *buf_size - sk_offset))
> > break;
> >
> > ret = btrfs_search_forward(root, &key, path, sk->min_transid);
> > diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
> > index ed02aa522263..7c9edc9694d9 100644
> > --- a/include/linux/pagemap.h
> > +++ b/include/linux/pagemap.h
> > @@ -734,61 +734,10 @@ int wait_on_page_private_2_killable(struct page *page);
> > extern void add_page_wait_queue(struct page *page, wait_queue_entry_t *waiter);
> >
> > /*
> > - * Fault everything in given userspace address range in.
> > + * Fault in userspace address range.
> > */
> > -static inline int fault_in_pages_writeable(char __user *uaddr, int size)
> > -{
> > - char __user *end = uaddr + size - 1;
> > -
> > - if (unlikely(size == 0))
> > - return 0;
> > -
> > - if (unlikely(uaddr > end))
> > - return -EFAULT;
> > - /*
> > - * Writing zeroes into userspace here is OK, because we know that if
> > - * the zero gets there, we'll be overwriting it.
> > - */
> > - do {
> > - if (unlikely(__put_user(0, uaddr) != 0))
> > - return -EFAULT;
> > - uaddr += PAGE_SIZE;
> > - } while (uaddr <= end);
> > -
> > - /* Check whether the range spilled into the next page. */
> > - if (((unsigned long)uaddr & PAGE_MASK) ==
> > - ((unsigned long)end & PAGE_MASK))
> > - return __put_user(0, end);
> > -
> > - return 0;
> > -}
> > -
> > -static inline int fault_in_pages_readable(const char __user *uaddr, int size)
> > -{
> > - volatile char c;
> > - const char __user *end = uaddr + size - 1;
> > -
> > - if (unlikely(size == 0))
> > - return 0;
> > -
> > - if (unlikely(uaddr > end))
> > - return -EFAULT;
> > -
> > - do {
> > - if (unlikely(__get_user(c, uaddr) != 0))
> > - return -EFAULT;
> > - uaddr += PAGE_SIZE;
> > - } while (uaddr <= end);
> > -
> > - /* Check whether the range spilled into the next page. */
> > - if (((unsigned long)uaddr & PAGE_MASK) ==
> > - ((unsigned long)end & PAGE_MASK)) {
> > - return __get_user(c, end);
> > - }
> > -
> > - (void)c;
> > - return 0;
> > -}
> > +size_t fault_in_writeable(char __user *uaddr, size_t size);
> > +size_t fault_in_readable(const char __user *uaddr, size_t size);
> >
> > int add_to_page_cache_locked(struct page *page, struct address_space *mapping,
> > pgoff_t index, gfp_t gfp_mask);
> > diff --git a/lib/iov_iter.c b/lib/iov_iter.c
> > index 25dfc48536d7..069cedd9d7b4 100644
> > --- a/lib/iov_iter.c
> > +++ b/lib/iov_iter.c
> > @@ -191,7 +191,7 @@ static size_t copy_page_to_iter_iovec(struct page *page, size_t offset, size_t b
> > buf = iov->iov_base + skip;
> > copy = min(bytes, iov->iov_len - skip);
> >
> > - if (IS_ENABLED(CONFIG_HIGHMEM) && !fault_in_pages_writeable(buf, copy)) {
> > + if (IS_ENABLED(CONFIG_HIGHMEM) && !fault_in_writeable(buf, copy)) {
> > kaddr = kmap_atomic(page);
> > from = kaddr + offset;
> >
> > @@ -275,7 +275,7 @@ static size_t copy_page_from_iter_iovec(struct page *page, size_t offset, size_t
> > buf = iov->iov_base + skip;
> > copy = min(bytes, iov->iov_len - skip);
> >
> > - if (IS_ENABLED(CONFIG_HIGHMEM) && !fault_in_pages_readable(buf, copy)) {
> > + if (IS_ENABLED(CONFIG_HIGHMEM) && !fault_in_readable(buf, copy)) {
> > kaddr = kmap_atomic(page);
> > to = kaddr + offset;
> >
> > @@ -446,13 +446,11 @@ int iov_iter_fault_in_readable(const struct iov_iter *i, size_t bytes)
> > bytes = i->count;
> > for (p = i->iov, skip = i->iov_offset; bytes; p++, skip = 0) {
> > size_t len = min(bytes, p->iov_len - skip);
> > - int err;
> >
> > if (unlikely(!len))
> > continue;
> > - err = fault_in_pages_readable(p->iov_base + skip, len);
> > - if (unlikely(err))
> > - return err;
> > + if (fault_in_readable(p->iov_base + skip, len))
> > + return -EFAULT;
> > bytes -= len;
> > }
> > }
> > diff --git a/mm/filemap.c b/mm/filemap.c
> > index d1458ecf2f51..4dec3bc7752e 100644
> > --- a/mm/filemap.c
> > +++ b/mm/filemap.c
> > @@ -88,7 +88,7 @@
> > * ->lock_page (access_process_vm)
> > *
> > * ->i_mutex (generic_perform_write)
> > - * ->mmap_lock (fault_in_pages_readable->do_page_fault)
> > + * ->mmap_lock (fault_in_readable->do_page_fault)
> > *
> > * bdi->wb.list_lock
> > * sb_lock (fs/fs-writeback.c)
> > diff --git a/mm/gup.c b/mm/gup.c
> > index b94717977d17..0cf47955e5a1 100644
> > --- a/mm/gup.c
> > +++ b/mm/gup.c
> > @@ -1672,6 +1672,78 @@ static long __get_user_pages_locked(struct mm_struct *mm, unsigned long start,
> > }
> > #endif /* !CONFIG_MMU */
> >
> > +/**
> > + * fault_in_writeable - fault in userspace address range for writing
> > + * @uaddr: start of address range
> > + * @size: size of address range
> > + *
> > + * Returns the number of bytes not faulted in (like copy_to_user() and
> > + * copy_from_user()).
> > + */
> > +size_t fault_in_writeable(char __user *uaddr, size_t size)
> > +{
> > + char __user *start = uaddr, *end;
> > +
> > + if (unlikely(size == 0))
> > + return 0;
> > + if (!PAGE_ALIGNED(uaddr)) {
> > + if (unlikely(__put_user(0, uaddr) != 0))
> > + return size;
> > + uaddr = (char __user *)PAGE_ALIGN((unsigned long)uaddr);
> > + }
> > + end = (char __user *)PAGE_ALIGN((unsigned long)start + size);
> > + if (unlikely(end < start))
> > + end = NULL;
> > + while (uaddr != end) {
> > + if (unlikely(__put_user(0, uaddr) != 0))
> > + goto out;
> > + uaddr += PAGE_SIZE;
>
> Won't we loop endlessly or corrupt some unwanted page when 'end' was
> set to NULL?

What do you mean? We set 'end' to NULL when start + size < start
exactly so that the loop will stop when uaddr wraps around.

> > + }
> > +
> > +out:
> > + if (size > uaddr - start)
> > + return size - (uaddr - start);
> > + return 0;
> > +}
> > +EXPORT_SYMBOL(fault_in_writeable);
> > +
> > +/**
> > + * fault_in_readable - fault in userspace address range for reading
> > + * @uaddr: start of user address range
> > + * @size: size of user address range
> > + *
> > + * Returns the number of bytes not faulted in (like copy_to_user() and
> > + * copy_from_user()).
> > + */
> > +size_t fault_in_readable(const char __user *uaddr, size_t size)
> > +{
> > + const char __user *start = uaddr, *end;
> > + volatile char c;
> > +
> > + if (unlikely(size == 0))
> > + return 0;
> > + if (!PAGE_ALIGNED(uaddr)) {
> > + if (unlikely(__get_user(c, uaddr) != 0))
> > + return size;
> > + uaddr = (const char __user *)PAGE_ALIGN((unsigned long)uaddr);
> > + }
> > + end = (const char __user *)PAGE_ALIGN((unsigned long)start + size);
> > + if (unlikely(end < start))
> > + end = NULL;
> > + while (uaddr != end) {
>
> Same kind of issue here, when 'end' was set to NULL?
>
> Thanks.
>
> > + if (unlikely(__get_user(c, uaddr) != 0))
> > + goto out;
> > + uaddr += PAGE_SIZE;
> > + }
> > +
> > +out:
> > + (void)c;
> > + if (size > uaddr - start)
> > + return size - (uaddr - start);
> > + return 0;
> > +}
> > +EXPORT_SYMBOL(fault_in_readable);
> > +
> > /**
> > * get_dump_page() - pin user page in memory while writing it to core dump
> > * @addr: user address
> > --
> > 2.26.3
> >

Thanks,
Andreas

2021-09-28 15:07:18

by Andreas Gruenbacher

[permalink] [raw]
Subject: Re: [PATCH v7 15/19] iomap: Support partial direct I/O on user copy failures

On Thu, Sep 9, 2021 at 1:22 PM Christoph Hellwig <[email protected]> wrote:
> On Fri, Aug 27, 2021 at 06:49:22PM +0200, Andreas Gruenbacher wrote:
> > In iomap_dio_rw, when iomap_apply returns an -EFAULT error and the
> > IOMAP_DIO_PARTIAL flag is set, complete the request synchronously and
> > return a partial result. This allows the caller to deal with the page
> > fault and retry the remainder of the request.
> >
> > Signed-off-by: Andreas Gruenbacher <[email protected]>
> > ---
> > fs/iomap/direct-io.c | 6 ++++++
> > include/linux/iomap.h | 7 +++++++
> > 2 files changed, 13 insertions(+)
> >
> > diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
> > index 8054f5d6c273..ba88fe51b77a 100644
> > --- a/fs/iomap/direct-io.c
> > +++ b/fs/iomap/direct-io.c
> > @@ -561,6 +561,12 @@ __iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
> > ret = iomap_apply(inode, pos, count, iomap_flags, ops, dio,
> > iomap_dio_actor);
> > if (ret <= 0) {
> > + if (ret == -EFAULT && dio->size &&
> > + (dio_flags & IOMAP_DIO_PARTIAL)) {
> > + wait_for_completion = true;
> > + ret = 0;
>
> Do we need a NOWAIT check here to skip the wait_for_completion
> for that case?

Hmm, you're probably right, yes. I'll add that.

Thanks,
Andreas

2021-09-28 16:44:18

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [PATCH v7 03/19] gup: Turn fault_in_pages_{readable,writeable} into fault_in_{readable,writeable}

On Tue, Sep 28, 2021 at 05:02:43PM +0200, Andreas Gruenbacher wrote:
> On Fri, Sep 3, 2021 at 4:57 PM Filipe Manana <[email protected]> wrote:
> > On Fri, Aug 27, 2021 at 5:52 PM Andreas Gruenbacher <[email protected]> wrote:
> > > +size_t fault_in_writeable(char __user *uaddr, size_t size)
> > > +{
> > > + char __user *start = uaddr, *end;
> > > +
> > > + if (unlikely(size == 0))
> > > + return 0;
> > > + if (!PAGE_ALIGNED(uaddr)) {
> > > + if (unlikely(__put_user(0, uaddr) != 0))
> > > + return size;
> > > + uaddr = (char __user *)PAGE_ALIGN((unsigned long)uaddr);
> > > + }
> > > + end = (char __user *)PAGE_ALIGN((unsigned long)start + size);
> > > + if (unlikely(end < start))
> > > + end = NULL;
> > > + while (uaddr != end) {
> > > + if (unlikely(__put_user(0, uaddr) != 0))
> > > + goto out;
> > > + uaddr += PAGE_SIZE;
> >
> > Won't we loop endlessly or corrupt some unwanted page when 'end' was
> > set to NULL?
>
> What do you mean? We set 'end' to NULL when start + size < start
> exactly so that the loop will stop when uaddr wraps around.

But think about x86-64. The virtual address space (unless you have 5
level PTs) looks like:

[0, 2^47) userspace
[2^47, 2^64 - 2^47) hole
[2^64 - 2^47, 2^64) kernel space

If we try to copy from the hole we'll get some kind of fault (I forget
the details). We have to stop at the top of userspace.

2021-09-28 20:46:28

by Andreas Gruenbacher

[permalink] [raw]
Subject: Re: [PATCH v7 03/19] gup: Turn fault_in_pages_{readable,writeable} into fault_in_{readable,writeable}

Hi Willy,

On Tue, Sep 28, 2021 at 6:40 PM Matthew Wilcox <[email protected]> wrote:
> On Tue, Sep 28, 2021 at 05:02:43PM +0200, Andreas Gruenbacher wrote:
> > On Fri, Sep 3, 2021 at 4:57 PM Filipe Manana <[email protected]> wrote:
> > > On Fri, Aug 27, 2021 at 5:52 PM Andreas Gruenbacher <[email protected]> wrote:
> > > > +size_t fault_in_writeable(char __user *uaddr, size_t size)
> > > > +{
> > > > + char __user *start = uaddr, *end;
> > > > +
> > > > + if (unlikely(size == 0))
> > > > + return 0;
> > > > + if (!PAGE_ALIGNED(uaddr)) {
> > > > + if (unlikely(__put_user(0, uaddr) != 0))
> > > > + return size;
> > > > + uaddr = (char __user *)PAGE_ALIGN((unsigned long)uaddr);
> > > > + }
> > > > + end = (char __user *)PAGE_ALIGN((unsigned long)start + size);
> > > > + if (unlikely(end < start))
> > > > + end = NULL;
> > > > + while (uaddr != end) {
> > > > + if (unlikely(__put_user(0, uaddr) != 0))
> > > > + goto out;
> > > > + uaddr += PAGE_SIZE;
> > >
> > > Won't we loop endlessly or corrupt some unwanted page when 'end' was
> > > set to NULL?
> >
> > What do you mean? We set 'end' to NULL when start + size < start
> > exactly so that the loop will stop when uaddr wraps around.
>
> But think about x86-64. The virtual address space (unless you have 5
> level PTs) looks like:
>
> [0, 2^47) userspace
> [2^47, 2^64 - 2^47) hole
> [2^64 - 2^47, 2^64) kernel space
>
> If we try to copy from the hole we'll get some kind of fault (I forget
> the details). We have to stop at the top of userspace.

If you look at the before and after state of this patch,
fault_in_pages_readable and fault_in_pages_writeable did fail, with
-EFAULT, any attempt to fault in a range that wraps. That's sensible
for a function that returns an all-or-nothing result. We now want to
return how much of the range was (or wasn't) faulted in. We could do
that and still reject ranges that wrap outright. Or we could try to
fault in however much we reasonably can even if the range wraps. The
patch tries the latter, which is where the stopping at NULL is coming
from: when the range wraps, we *definitely* don't want to go any
further.

If the range extends into the hole, we'll get a failure from
__get_user or __put_user where that happens. That's entirely the
expected result, isn't it?

Thanks,
Andreas

2021-10-11 17:40:31

by Catalin Marinas

[permalink] [raw]
Subject: Re: [RFC][arm64] possible infinite loop in btrfs search_ioctl()

On Tue, Aug 31, 2021 at 03:28:57PM +0000, Al Viro wrote:
> On Tue, Aug 31, 2021 at 02:54:50PM +0100, Catalin Marinas wrote:
> > An arm64-specific workaround would be for pagefault_disable() to disable
> > tag checking. It's a pretty big hammer, weakening the out of bounds
> > access detection of MTE. My preference would be a fix in the btrfs code.
> >
> > A btrfs option would be for copy_to_sk() to return an indication of
> > where the fault occurred and get fault_in_pages_writeable() to check
> > that location, even if the copying would restart from an earlier offset
> > (this requires open-coding copy_to_user_nofault()). An attempt below,
> > untested and does not cover read_extent_buffer_to_user_nofault():
>
> Umm... There's another copy_to_user_nofault() call in the same function
> (same story, AFAICS).

I cleaned up this patch [1] but I realised it still doesn't solve it.
The arm64 __copy_to_user_inatomic(), while ensuring progress if called
in a loop, does not guarantee a precise copy up to the fault position. The
copy_to_sk(), after returning an error, starts again from the previous
sizeof(sh) boundary rather than from where the __copy_to_user_inatomic()
stopped. So it can get stuck attempting to copy the same search header.
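
Roughly, the stuck pattern looks like this (a simplified sketch, not the
actual btrfs code; the names are borrowed from copy_to_sk()/search_ioctl()):

	while (1) {
		/* Succeeds: the start of the range is writable... */
		if (fault_in_pages_writeable(ubuf + sk_offset,
					     *buf_size - sk_offset))
			break;
		/* ...but the copy faults part-way into the header, */
		if (!copy_to_user_nofault(ubuf + sk_offset, &sh, sizeof(sh)))
			break;	/* (only reached if the whole header copies) */
		/* and since sk_offset is not advanced, we retry the same
		 * header forever. */
	}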

An ugly fix is to fall back to byte by byte copying so that we can
attempt the actual fault address in fault_in_pages_writeable().

If the sh being recreated in copy_to_sk() is the same on the retried
iteration, we could use an *sk_offset that is not a multiple of
sizeof(sh) in order to have progress. But it's not clear to me whether
the data being copied can change once btrfs_release_path() is called.

[1] https://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux.git/log/?h=devel/btrfs-fix

--
Catalin

2021-10-11 19:18:39

by Linus Torvalds

[permalink] [raw]
Subject: Re: [RFC][arm64] possible infinite loop in btrfs search_ioctl()

On Mon, Oct 11, 2021 at 10:38 AM Catalin Marinas
<[email protected]> wrote:
>
> I cleaned up this patch [1] but I realised it still doesn't solve it.
> The arm64 __copy_to_user_inatomic(), while ensuring progress if called
> in a loop, does not guarantee a precise copy up to the fault position.

That should be ok. We've always allowed the user copy to return early
if it does word copies and hits a page crosser that causes a fault.

Any user then has the choice of:

- partial copies are bad

- partial copies are handled and then you retry from the place
copy_to_user() failed at

and in that second case, the next time around, you'll get the fault
immediately (or you'll make some more progress - maybe the user copy
loop did something different just because the length and/or alignment
was different).

If you get the fault immediately, that's -EFAULT.

And if you make some more progress, it's again up to the caller to
rinse and repeat.

End result: user copy functions do not have to report errors exactly.
It is the caller that has to handle the situation.

Most importantly: "exact error or not" doesn't actually _matter_ to
the caller. If the caller does the right thing for an exact error, it
will do the right thing for an inexact one too. See above.
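
In code, that second choice looks roughly like this (a sketch only, assuming
the size_t-returning fault_in_writeable() from this series; the buffer names
are made up):

	size_t done = 0;

	while (done < len) {
		size_t left, chunk = len - done;

		pagefault_disable();
		left = copy_to_user(ubuf + done, kbuf + done, chunk);
		pagefault_enable();
		done += chunk - left;
		if (!left)
			break;			/* everything copied */
		/* Retry from where the copy stopped; only give up when
		 * nothing at that address can be faulted in. */
		if (fault_in_writeable(ubuf + done, left) == left)
			return -EFAULT;
	}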

> The copy_to_sk(), after returning an error, starts again from the previous
> sizeof(sh) boundary rather than from where the __copy_to_user_inatomic()
> stopped. So it can get stuck attempting to copy the same search header.

That seems to be purely a copy_to_sk() bug.

Or rather, it looks like a bug in the caller. copy_to_sk() itself does

	if (copy_to_user_nofault(ubuf + *sk_offset, &sh, sizeof(sh))) {
		ret = 0;
		goto out;
	}

and the comment says

* 0: all items from this leaf copied, continue with next

but that comment is then obviously not actually true in that it's not
"continue with next" at all.

But this is all very much a bug in the btrfs
search_ioctl()/copy_to_sk() code: it simply doesn't do the proper
thing for a partial result.

Because no, "just retry the whole thing" is by definition not the proper thing.

That said, I think that if we can have faults at non-page-aligned
boundaries, then we just need to make fault_in_pages_writeable() check
non-page boundaries.

> An ugly fix is to fall back to byte by byte copying so that we can
> attempt the actual fault address in fault_in_pages_writeable().

No, changing the user copy machinery is wrong. The caller really has
to do the right thing with partial results.

And I think we need to make fault_in_pages_writeable() match the
actual faulting cases - maybe remove the "pages" part of the name?

That would fix the btrfs code - it's not doing the right thing as-is,
but it's "close enough' to right that I think fixing
fault_in_pages_writeable() should fix it.

No?

Linus

2021-10-11 21:10:58

by Catalin Marinas

[permalink] [raw]
Subject: Re: [RFC][arm64] possible infinite loop in btrfs search_ioctl()

On Mon, Oct 11, 2021 at 12:15:43PM -0700, Linus Torvalds wrote:
> On Mon, Oct 11, 2021 at 10:38 AM Catalin Marinas
> <[email protected]> wrote:
> > I cleaned up this patch [1] but I realised it still doesn't solve it.
> > The arm64 __copy_to_user_inatomic(), while ensuring progress if called
> > in a loop, does not guarantee a precise copy up to the fault position.
>
> That should be ok. We've always allowed the user copy to return early
> if it does word copies and hits a page crosser that causes a fault.
>
> Any user then has the choice of:
>
> - partial copies are bad
>
> - partial copies are handled and then you retry from the place
> copy_to_user() failed at
>
> and in that second case, the next time around, you'll get the fault
> immediately (or you'll make some more progress - maybe the user copy
> loop did something different just because the length and/or alignment
> was different).
>
> If you get the fault immediately, that's -EFAULT.
>
> And if you make some more progress, it's again up to the caller to
> rinse and repeat.
>
> End result: user copy functions do not have to report errors exactly.
> It is the caller that has to handle the situation.
>
> Most importantly: "exact error or not" doesn't actually _matter_ to
> the caller. If the caller does the right thing for an exact error, it
> will do the right thing for an inexact one too. See above.

Yes, that's my expectation (though this was only fixed fairly recently in
the arm64 user copy routines).

> > The copy_to_sk(), after returning an error, starts again from the previous
> > sizeof(sh) boundary rather than from where the __copy_to_user_inatomic()
> > stopped. So it can get stuck attempting to copy the same search header.
>
> That seems to be purely a copy_to_sk() bug.
>
> Or rather, it looks like a bug in the caller. copy_to_sk() itself does
>
>         if (copy_to_user_nofault(ubuf + *sk_offset, &sh, sizeof(sh))) {
>                 ret = 0;
>                 goto out;
>         }
>
> and the comment says
>
> * 0: all items from this leaf copied, continue with next
>
> but that comment is then obviously not actually true in that it's not
> "continue with next" at all.

The comments were correct before commit a48b73eca4ce ("btrfs: fix
potential deadlock in the search ioctl") which introduced the
potentially infinite loop.

Something like this would make the comments match (I think):

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index cc61813213d8..1debf6a124e8 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -2161,7 +2161,7 @@ static noinline int copy_to_sk(struct btrfs_path *path,
 		 * properly this next time through
 		 */
 		if (copy_to_user_nofault(ubuf + *sk_offset, &sh, sizeof(sh))) {
-			ret = 0;
+			ret = -EFAULT;
 			goto out;
 		}
 
@@ -2175,7 +2175,7 @@ static noinline int copy_to_sk(struct btrfs_path *path,
 			 */
 			if (read_extent_buffer_to_user_nofault(leaf, up,
 							       item_off, item_len)) {
-				ret = 0;
+				ret = -EFAULT;
 				*sk_offset -= sizeof(sh);
 				goto out;
 			}
@@ -2260,12 +2260,8 @@ static noinline int search_ioctl(struct inode *inode,
 	key.type = sk->min_type;
 	key.offset = sk->min_offset;
 
-	while (1) {
-		ret = fault_in_pages_writeable(ubuf + sk_offset,
-					       *buf_size - sk_offset);
-		if (ret)
-			break;
-
+	ret = fault_in_pages_writeable(ubuf, *buf_size);
+	while (ret == 0) {
 		ret = btrfs_search_forward(root, &key, path, sk->min_transid);
 		if (ret != 0) {
 			if (ret > 0)
@@ -2275,9 +2271,14 @@ static noinline int search_ioctl(struct inode *inode,
 		ret = copy_to_sk(path, &key, sk, buf_size, ubuf,
 				 &sk_offset, &num_found);
 		btrfs_release_path(path);
-		if (ret)
-			break;
 
+		/*
+		 * Fault in copy_to_sk(), attempt to bring the page in after
+		 * releasing the locks and retry.
+		 */
+		if (ret == -EFAULT)
+			ret = fault_in_pages_writeable(ubuf + sk_offset,
+					sizeof(struct btrfs_ioctl_search_header));
 	}
 	if (ret > 0)
 		ret = 0;

> > An ugly fix is to fall back to byte by byte copying so that we can
> > attempt the actual fault address in fault_in_pages_writeable().
>
> No, changing the user copy machinery is wrong. The caller really has
> to do the right thing with partial results.
>
> And I think we need to make fault_in_pages_writeable() match the
> actual faulting cases - maybe remove the "pages" part of the name?

Ah, good point. Without removing "pages" from the name (too late over
here to grep the kernel), something like below:

diff --git a/arch/arm64/include/asm/page-def.h b/arch/arm64/include/asm/page-def.h
index 2403f7b4cdbf..3768ac4a6610 100644
--- a/arch/arm64/include/asm/page-def.h
+++ b/arch/arm64/include/asm/page-def.h
@@ -15,4 +15,9 @@
#define PAGE_SIZE (_AC(1, UL) << PAGE_SHIFT)
#define PAGE_MASK (~(PAGE_SIZE-1))

+#ifdef CONFIG_ARM64_MTE
+#define FAULT_GRANULE_SIZE (16)
+#define FAULT_GRANULE_MASK (~(FAULT_GRANULE_SIZE-1))
+#endif
+
#endif /* __ASM_PAGE_DEF_H */
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 62db6b0176b9..7aef732e4fa7 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -16,6 +16,11 @@
#include <linux/hardirq.h> /* for in_interrupt() */
#include <linux/hugetlb_inline.h>

+#ifndef FAULT_GRANULE_SIZE
+#define FAULT_GRANULE_SIZE PAGE_SIZE
+#define FAULT_GRANULE_MASK PAGE_MASK
+#endif
+
struct pagevec;

static inline bool mapping_empty(struct address_space *mapping)
@@ -751,12 +756,12 @@ static inline int fault_in_pages_writeable(char __user *uaddr, size_t size)
 	do {
 		if (unlikely(__put_user(0, uaddr) != 0))
 			return -EFAULT;
-		uaddr += PAGE_SIZE;
+		uaddr += FAULT_GRANULE_SIZE;
 	} while (uaddr <= end);
 
-	/* Check whether the range spilled into the next page. */
-	if (((unsigned long)uaddr & PAGE_MASK) ==
-	    ((unsigned long)end & PAGE_MASK))
+	/* Check whether the range spilled into the next granule. */
+	if (((unsigned long)uaddr & FAULT_GRANULE_MASK) ==
+	    ((unsigned long)end & FAULT_GRANULE_MASK))
 		return __put_user(0, end);
 
 	return 0;

If this looks in the right direction, I'll do some proper patches
tomorrow.

--
Catalin

2021-10-12 00:09:18

by Linus Torvalds

[permalink] [raw]
Subject: Re: [RFC][arm64] possible infinite loop in btrfs search_ioctl()

On Mon, Oct 11, 2021 at 2:08 PM Catalin Marinas <[email protected]> wrote:
>
> +#ifdef CONFIG_ARM64_MTE
> +#define FAULT_GRANULE_SIZE (16)
> +#define FAULT_GRANULE_MASK (~(FAULT_GRANULE_SIZE-1))

[...]

> If this looks in the right direction, I'll do some proper patches
> tomorrow.

Looks fine to me. It's going to be quite expensive and bad for caches, though.

That said, fault_in_writable() is _supposed_ to all be for the slow
path when things go south and the normal path didn't work out, so I
think it's fine.

I do wonder how the sub-page granularity works. Is it sufficient to
just read from it? Because then a _slightly_ better option might be to
do one write per page (to catch page table writability) and then one
read per "granule" (to catch pointer coloring or cache poisoning
issues)?
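
Something like this, perhaps (purely illustrative, reusing Catalin's
FAULT_GRANULE_SIZE; the function name is made up and alignment handling is
omitted):

	static int probe_writeable(char __user *uaddr, size_t size)
	{
		char __user *p = uaddr, *end = uaddr + size;
		char __user *g;
		char c;

		for (; p < end; p += PAGE_SIZE) {
			/* one write per page: catches pte writability */
			if (unlikely(__put_user(0, p) != 0))
				return -EFAULT;
			/* one read per granule: catches tag/poisoning faults */
			for (g = p; g < p + PAGE_SIZE && g < end;
			     g += FAULT_GRANULE_SIZE)
				if (unlikely(__get_user(c, g) != 0))
					return -EFAULT;
		}
		return 0;
	}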

That said, since this is all preparatory to us wanting to write to it
eventually anyway, maybe marking it all dirty in the caches is only
good.

Linus

2021-10-12 17:29:54

by Catalin Marinas

[permalink] [raw]
Subject: Re: [RFC][arm64] possible infinite loop in btrfs search_ioctl()

On Mon, Oct 11, 2021 at 04:59:28PM -0700, Linus Torvalds wrote:
> On Mon, Oct 11, 2021 at 2:08 PM Catalin Marinas <[email protected]> wrote:
> > +#ifdef CONFIG_ARM64_MTE
> > +#define FAULT_GRANULE_SIZE (16)
> > +#define FAULT_GRANULE_MASK (~(FAULT_GRANULE_SIZE-1))
>
> [...]
>
> > If this looks in the right direction, I'll do some proper patches
> > tomorrow.
>
> Looks fine to me. It's going to be quite expensive and bad for caches,
> though.
>
> That said, fault_in_writable() is _supposed_ to all be for the slow
> path when things go south and the normal path didn't work out, so I
> think it's fine.
>
> I do wonder how the sub-page granularity works. Is it sufficient to
> just read from it?

For arm64 MTE and I think SPARC ADI, just reading should be sufficient.
In the long run there is CHERI, if it takes off, where the user can set
independent read/write permissions and uaccess would use the capability
rather than a match-all pointer (so the access would be checked).

> Because then a _slightly_ better option might be to
> do one write per page (to catch page table writability) and then one
> read per "granule" (to catch pointer coloring or cache poisoning
> issues)?
>
> That said, since this is all preparatory to us wanting to write to it
> eventually anyway, maybe marking it all dirty in the caches is only
> good.

It depends on how much would be written in the actual copy. For
significant memcpy on arm CPUs, write streaming usually kicks in and the
cache dirtying is skipped. This probably matters more for
copy_page_to_iter_iovec() than the btrfs search ioctl.

Apart from fault_in_pages_*(), there's also fault_in_user_writeable()
called from the futex code which uses the GUP mechanism as the write
would be destructive. It looks like it could potentially trigger the
same infinite loop on -EFAULT. For arm64 MTE, we get away with this by
disabling the tag checking around the arch futex code (we did it for an
unrelated issue - we don't have LDXR/STXR that would run with user
permissions in kernel mode like we do with LDTR/STTR).
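
The problematic shape, roughly (a sketch, not the actual futex code):

	for (;;) {
		ret = arch_futex_atomic_op_inuser(op, oparg, &oldval, uaddr);
		if (ret != -EFAULT)
			break;			/* done, or a real error */
		if (fault_in_user_writeable(uaddr))
			return -EFAULT;		/* the pte really is bad */
		/* An MTE tag mismatch is not fixed by gup, so the next
		 * access faults again and we spin here forever. */
	}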

I wonder whether we should actually just disable tag checking around the
problematic accesses. What these callers seem to have in common is using
pagefault_disable/enable(). We could abuse this to disable tag checking
or maybe in_atomic() when handling the exception to lazily disable such
faults temporarily.

A more invasive change would be to return a different error for such
faults like -EACCESS and treat them differently in the caller.

--
Catalin

2021-10-12 18:00:19

by Linus Torvalds

[permalink] [raw]
Subject: Re: [RFC][arm64] possible infinite loop in btrfs search_ioctl()

On Tue, Oct 12, 2021 at 10:27 AM Catalin Marinas
<[email protected]> wrote:
>
> Apart from fault_in_pages_*(), there's also fault_in_user_writeable()
> called from the futex code which uses the GUP mechanism as the write
> would be destructive. It looks like it could potentially trigger the
> same infinite loop on -EFAULT.

Hmm.

I think the reason we do fault_in_user_writeable() using GUP is that

(a) we can avoid the page fault overhead

(b) we don't have any good "atomic_inc_user()" interface or similar
that could do a write with a zero increment or something like that.

We do have that "arch_futex_atomic_op_inuser()" thing, of course. It's
all kinds of crazy, but we *could* do

arch_futex_atomic_op_inuser(FUTEX_OP_ADD, 0, &dummy, uaddr);

instead of doing the fault_in_user_writeable().

That might be a good idea anyway. I dunno.

But I agree other options exist:

> I wonder whether we should actually just disable tag checking around the
> problematic accesses. What these callers seem to have in common is using
> pagefault_disable/enable(). We could abuse this to disable tag checking
> or maybe in_atomic() when handling the exception to lazily disable such
> faults temporarily.

Hmm. That would work for MTE, but possibly be very inconvenient for
other situations.

> A more invasive change would be to return a different error for such
> faults like -EACCESS and treat them differently in the caller.

That's _really_ hard for things like "copy_to_user()", that isn't a
single operation, and is supposed to return the bytes left.

Adding another error return would be nasty.

We've had hacks like "squirrel away the actual error code in the task
structure", but that tends to be unmaintainable because we have
interrupts (and NMI's) doing their own possibly nested atomics, so
even disabling preemption won't actually fix some of the nesting
issues.

All of these things make me think that the proper fix ends up being to
make sure that our "fault_in_xyz()" functions simply should always
handle all faults.

Another option may be to teach the GUP code to actually check
architecture-specific sub-page ranges.

Linus

2021-10-18 17:23:13

by Catalin Marinas

[permalink] [raw]
Subject: Re: [RFC][arm64] possible infinite loop in btrfs search_ioctl()

On Tue, Oct 12, 2021 at 10:58:46AM -0700, Linus Torvalds wrote:
> On Tue, Oct 12, 2021 at 10:27 AM Catalin Marinas
> <[email protected]> wrote:
> > Apart from fault_in_pages_*(), there's also fault_in_user_writeable()
> > called from the futex code which uses the GUP mechanism as the write
> > would be destructive. It looks like it could potentially trigger the
> > same infinite loop on -EFAULT.
>
> Hmm.
>
> I think the reason we do fault_in_user_writeable() using GUP is that
>
> (a) we can avoid the page fault overhead
>
> (b) we don't have any good "atomic_inc_user()" interface or similar
> that could do a write with a zero increment or something like that.
>
> We do have that "arch_futex_atomic_op_inuser()" thing, of course. It's
> all kinds of crazy, but we *could* do
>
> arch_futex_atomic_op_inuser(FUTEX_OP_ADD, 0, &dummy, uaddr);
>
> instead of doing the fault_in_user_writeable().
>
> That might be a good idea anyway. I dunno.

I gave this a quick try for futex (though MTE is not affected at the
moment):

https://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux.git/log/?h=devel/sub-page-faults

However, I still have doubts about fault_in_pages_*() probing every 16
bytes, especially if one decides to change these routines to be
GUP-based.

> > A more invasive change would be to return a different error for such
> > faults like -EACCESS and treat them differently in the caller.
>
> That's _really_ hard for things like "copy_to_user()", that isn't a
> single operation, and is supposed to return the bytes left.
>
> Adding another error return would be nasty.
>
> We've had hacks like "squirrel away the actual error code in the task
> structure", but that tends to be unmaintainable because we have
> interrupts (and NMI's) doing their own possibly nested atomics, so
> even disabling preemption won't actually fix some of the nesting
> issues.

I think we can do something similar to the __get_user_error() on arm64.
We can keep the __copy_to_user_inatomic() etc. returning the number of
bytes left but change the exception handling path in those routines to
set an error code or boolean to a pointer passed at uaccess routine call
time. The caller would do something along these lines:

	bool page_fault;

	left = copy_to_user_inatomic(dst, src, size, &page_fault);
	if (left && page_fault)
		goto repeat_fault_in;

copy_to_user_nofault() could also change its return type from -EFAULT to
something else based on whether page_fault was set or not.

Most architectures will use a generic copy_to_user_inatomic() wrapper
where page_fault == true for any fault. Arm64 needs some adjustment to
the uaccess fault handling to pass the fault code down to the exception
code. This way, at least for arm64, I don't think an interrupt or NMI
would be problematic.
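
For the generic case that wrapper could be as simple as this (a sketch; the
four-argument form is the proposed API above, not something that exists
today):

	static inline unsigned long
	copy_to_user_inatomic(void __user *to, const void *from,
			      unsigned long n, bool *page_fault)
	{
		unsigned long left = __copy_to_user_inatomic(to, from, n);

		/* no sub-page faults here: any fault is a page fault */
		*page_fault = left != 0;
		return left;
	}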

> All of these things make me think that the proper fix ends up being to
> make sure that our "fault_in_xyz()" functions simply should always
> handle all faults.
>
> Another option may be to teach the GUP code to actually check
> architecture-specific sub-page ranges.

Teaching GUP about this is likely to be expensive. A put_user() for
probing on arm64 uses a STTR instruction that's run with user privileges
on the user address and the user tag checking mode. The GUP code for
MTE, OTOH, would need to explicitly read the tag in memory and compare
it with the user pointer tag (which is normally cleared in the GUP code
by untagged_addr()).

To me it makes more sense for the fault_in_*() functions to only deal
with those permissions the kernel controls, i.e. the pte. Sub-page
permissions like MTE or CHERI are controlled by the user directly, so
the kernel cannot fix them up anyway. Rather than overloading
fault_in_*() with additional checks, I think we should expand the
in-atomic uaccess API to cover the type of fault.

--
Catalin

2021-10-21 00:49:11

by Andreas Gruenbacher

[permalink] [raw]
Subject: Re: [RFC][arm64] possible infinite loop in btrfs search_ioctl()

On Tue, Oct 12, 2021 at 1:59 AM Linus Torvalds
<[email protected]> wrote:
> On Mon, Oct 11, 2021 at 2:08 PM Catalin Marinas <[email protected]> wrote:
> >
> > +#ifdef CONFIG_ARM64_MTE
> > +#define FAULT_GRANULE_SIZE (16)
> > +#define FAULT_GRANULE_MASK (~(FAULT_GRANULE_SIZE-1))
>
> [...]
>
> > If this looks in the right direction, I'll do some proper patches
> > tomorrow.
>
> Looks fine to me. It's going to be quite expensive and bad for caches, though.
>
> That said, fault_in_writable() is _supposed_ to all be for the slow
> path when things go south and the normal path didn't work out, so I
> think it's fine.

Let me get back to this; I'm actually not convinced that we need to
worry about sub-page-size fault granules in fault_in_pages_readable or
fault_in_pages_writeable.

From a filesystem point of view, we can get into trouble when a
user-space read or write triggers a page fault while we're holding
filesystem locks, and that page fault ends up calling back into the
filesystem. To deal with that, we're performing those user-space
accesses with page faults disabled. When a page fault would occur, we
get back an error instead, and then we try to fault in the offending
pages. If a page is resident and we still get a fault trying to access
it, trying to fault in the same page again isn't going to help and we
have a true error. We're clearly looking at memory at a page
granularity; faults at a sub-page level don't matter at this level of
abstraction (but they do show similar error behavior). To avoid
getting stuck, when it gets a short result or -EFAULT, the filesystem
implements the following backoff strategy: first, it tries to fault in
a number of pages. When the read or write still doesn't make progress,
it scales back and faults in a single page. Finally, when that still
doesn't help, it gives up. This strategy is needed for actual page
faults, but it also handles sub-page faults appropriately as long as
the user-space access functions give sensible results.
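
In pseudo-code, the backoff is roughly this (simplified, not the actual gfs2
code; do_copy() is a placeholder for the pagefault-disabled user copy):

	size_t window = 32 * PAGE_SIZE;	/* start with a batch of pages */
	ssize_t copied;

	do {
		pagefault_disable();
		copied = do_copy(iter);	/* 0 if the copy faulted immediately */
		pagefault_enable();
		if (copied > 0)
			break;		/* made progress */
		if (!window) {
			copied = -EFAULT;	/* single page didn't help */
			break;
		}
		fault_in_iov_iter_writeable(iter, window);
		/* scale back from the batch to a single page, then give up */
		window = window > PAGE_SIZE ? PAGE_SIZE : 0;
	} while (1);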

What am I missing?

Thanks,
Andreas

> I do wonder how the sub-page granularity works. Is it sufficient to
> just read from it? Because then a _slightly_ better option might be to
> do one write per page (to catch page table writability) and then one
> read per "granule" (to catch pointer coloring or cache poisoning
> issues)?
>
> That said, since this is all preparatory to us wanting to write to it
> eventually anyway, maybe marking it all dirty in the caches is only
> good.
>
> Linus
>

2021-10-21 10:09:12

by Catalin Marinas

[permalink] [raw]
Subject: Re: [RFC][arm64] possible infinite loop in btrfs search_ioctl()

On Thu, Oct 21, 2021 at 02:46:10AM +0200, Andreas Gruenbacher wrote:
> On Tue, Oct 12, 2021 at 1:59 AM Linus Torvalds
> <[email protected]> wrote:
> > On Mon, Oct 11, 2021 at 2:08 PM Catalin Marinas <[email protected]> wrote:
> > >
> > > +#ifdef CONFIG_ARM64_MTE
> > > +#define FAULT_GRANULE_SIZE (16)
> > > +#define FAULT_GRANULE_MASK (~(FAULT_GRANULE_SIZE-1))
> >
> > [...]
> >
> > > If this looks in the right direction, I'll do some proper patches
> > > tomorrow.
> >
> > Looks fine to me. It's going to be quite expensive and bad for caches, though.
> >
> > That said, fault_in_writable() is _supposed_ to all be for the slow
> > path when things go south and the normal path didn't work out, so I
> > think it's fine.
>
> Let me get back to this; I'm actually not convinced that we need to
> worry about sub-page-size fault granules in fault_in_pages_readable or
> fault_in_pages_writeable.
>
> From a filesystem point of view, we can get into trouble when a
> user-space read or write triggers a page fault while we're holding
> filesystem locks, and that page fault ends up calling back into the
> filesystem. To deal with that, we're performing those user-space
> accesses with page faults disabled.

Yes, this makes sense.

> When a page fault would occur, we
> get back an error instead, and then we try to fault in the offending
> pages. If a page is resident and we still get a fault trying to access
> it, trying to fault in the same page again isn't going to help and we
> have a true error.

You can't be sure the second fault is a true error. The unlocked
fault_in_*() may race with some LRU scheme making the pte not accessible
or a write-back making it clean/read-only. copy_to_user() with
pagefault_disabled() fails again but that's a benign fault. The
filesystem should re-attempt the fault-in (gup would correct the pte),
disable page faults and copy_to_user(), potentially in an infinite loop.
If you bail out on the second/third uaccess following a fault_in_*()
call, you may get some unexpected errors (though very rare). Maybe the
filesystems avoid this problem somehow but I couldn't figure it out.

> We're clearly looking at memory at a page
> granularity; faults at a sub-page level don't matter at this level of
> abstraction (but they do show similar error behavior). To avoid
> getting stuck, when it gets a short result or -EFAULT, the filesystem
> implements the following backoff strategy: first, it tries to fault in
> a number of pages. When the read or write still doesn't make progress,
> it scales back and faults in a single page. Finally, when that still
> doesn't help, it gives up. This strategy is needed for actual page
> faults, but it also handles sub-page faults appropriately as long as
> the user-space access functions give sensible results.

As I said above, I think with this approach there's a small chance of
incorrectly reporting an error when the fault is recoverable. If you
change it to an infinite loop, you'd run into the sub-page fault
problem.

There are some places with such infinite loops: futex_wake_op(),
search_ioctl() in the btrfs code. I still have to get my head around
generic_perform_write() but I think we get away here because it faults
in the page with a get_user() rather than gup (and copy_from_user() is
guaranteed to make progress if any bytes can still be accessed).

--
Catalin

2021-10-21 14:44:22

by Andreas Gruenbacher

[permalink] [raw]
Subject: Re: [RFC][arm64] possible infinite loop in btrfs search_ioctl()

On Thu, Oct 21, 2021 at 12:06 PM Catalin Marinas
<[email protected]> wrote:
> On Thu, Oct 21, 2021 at 02:46:10AM +0200, Andreas Gruenbacher wrote:
> > On Tue, Oct 12, 2021 at 1:59 AM Linus Torvalds
> > <[email protected]> wrote:
> > > On Mon, Oct 11, 2021 at 2:08 PM Catalin Marinas <[email protected]> wrote:
> > > >
> > > > +#ifdef CONFIG_ARM64_MTE
> > > > +#define FAULT_GRANULE_SIZE (16)
> > > > +#define FAULT_GRANULE_MASK (~(FAULT_GRANULE_SIZE-1))
> > >
> > > [...]
> > >
> > > > If this looks in the right direction, I'll do some proper patches
> > > > tomorrow.
> > >
> > > Looks fine to me. It's going to be quite expensive and bad for caches, though.
> > >
> > > That said, fault_in_writable() is _supposed_ to all be for the slow
> > > path when things go south and the normal path didn't work out, so I
> > > think it's fine.
> >
> > Let me get back to this; I'm actually not convinced that we need to
> > worry about sub-page-size fault granules in fault_in_pages_readable or
> > fault_in_pages_writeable.
> >
> > From a filesystem point of view, we can get into trouble when a
> > user-space read or write triggers a page fault while we're holding
> > filesystem locks, and that page fault ends up calling back into the
> > filesystem. To deal with that, we're performing those user-space
> > accesses with page faults disabled.
>
> Yes, this makes sense.
>
> > When a page fault would occur, we
> > get back an error instead, and then we try to fault in the offending
> > pages. If a page is resident and we still get a fault trying to access
> > it, trying to fault in the same page again isn't going to help and we
> > have a true error.
>
> You can't be sure the second fault is a true error. The unlocked
> fault_in_*() may race with some LRU scheme making the pte not accessible
> or a write-back making it clean/read-only. copy_to_user() with
> pagefault_disabled() fails again but that's a benign fault. The
> filesystem should re-attempt the fault-in (gup would correct the pte),
> disable page faults and copy_to_user(), potentially in an infinite loop.
> If you bail out on the second/third uaccess following a fault_in_*()
> call, you may get some unexpected errors (though very rare). Maybe the
> filesystems avoid this problem somehow but I couldn't figure it out.

Good point, we can indeed only bail out if both the user copy and the
fault-in fail.

But probing the entire memory range in fault domain granularity in the
page fault-in functions still doesn't actually make sense. Those
functions really only need to guarantee that we'll be able to make
progress eventually. From that point of view, it should be enough to
probe the first byte of the requested memory range, so when one of
those functions reports that the next N bytes should be accessible,
this really means that the first byte surely isn't permanently
inaccessible and that the rest is likely accessible. Functions
fault_in_readable and fault_in_writeable already work that way, so
this only leaves function fault_in_safe_writeable to worry about.

> > We're clearly looking at memory at a page
> > granularity; faults at a sub-page level don't matter at this level of
> > abstraction (but they do show similar error behavior). To avoid
> > getting stuck, when it gets a short result or -EFAULT, the filesystem
> > implements the following backoff strategy: first, it tries to fault in
> > a number of pages. When the read or write still doesn't make progress,
> > it scales back and faults in a single page. Finally, when that still
> > doesn't help, it gives up. This strategy is needed for actual page
> > faults, but it also handles sub-page faults appropriately as long as
> > the user-space access functions give sensible results.
>
> As I said above, I think with this approach there's a small chance of
> incorrectly reporting an error when the fault is recoverable. If you
> change it to an infinite loop, you'd run into the sub-page fault
> problem.

Yes, I see now, thanks.

> There are some places with such infinite loops: futex_wake_op(),
> search_ioctl() in the btrfs code. I still have to get my head around
> generic_perform_write() but I think we get away here because it faults
> in the page with a get_user() rather than gup (and copy_from_user() is
> guaranteed to make progress if any bytes can still be accessed).

Thanks,
Andreas

2021-10-21 17:14:14

by Catalin Marinas

[permalink] [raw]
Subject: Re: [RFC][arm64] possible infinite loop in btrfs search_ioctl()

On Thu, Oct 21, 2021 at 04:42:33PM +0200, Andreas Gruenbacher wrote:
> On Thu, Oct 21, 2021 at 12:06 PM Catalin Marinas
> <[email protected]> wrote:
> > On Thu, Oct 21, 2021 at 02:46:10AM +0200, Andreas Gruenbacher wrote:
> > > When a page fault would occur, we
> > > get back an error instead, and then we try to fault in the offending
> > > pages. If a page is resident and we still get a fault trying to access
> > > it, trying to fault in the same page again isn't going to help and we
> > > have a true error.
> >
> > You can't be sure the second fault is a true error. The unlocked
> > fault_in_*() may race with some LRU scheme making the pte not accessible
> > or a write-back making it clean/read-only. copy_to_user() with
> > pagefault_disabled() fails again but that's a benign fault. The
> > filesystem should re-attempt the fault-in (gup would correct the pte),
> > disable page faults and copy_to_user(), potentially in an infinite loop.
> > If you bail out on the second/third uaccess following a fault_in_*()
> > call, you may get some unexpected errors (though very rare). Maybe the
> > filesystems avoid this problem somehow but I couldn't figure it out.
>
> Good point, we can indeed only bail out if both the user copy and the
> fault-in fail.
>
> But probing the entire memory range in fault domain granularity in the
> page fault-in functions still doesn't actually make sense. Those
> functions really only need to guarantee that we'll be able to make
> progress eventually. From that point of view, it should be enough to
> probe the first byte of the requested memory range, so when one of
> those functions reports that the next N bytes should be accessible,
> this really means that the first byte surely isn't permanently
> inaccessible and that the rest is likely accessible. Functions
> fault_in_readable and fault_in_writeable already work that way, so
> this only leaves function fault_in_safe_writeable to worry about.

I agree, that's why generic_perform_write() works. It does a get_user()
from the first byte in that range and the subsequent copy_from_user()
will make progress of at least one byte if it was readable. Eventually
it will hit the byte that faults. The gup-based fault_in_*() are a bit
more problematic.

Your series introduces fault_in_safe_writeable() and I think for MTE
doing a _single_ get_user(uaddr) (in addition to the gup checks for
write) would be sufficient as long as generic_file_read_iter() advances
by at least one byte (eventually).

This discussion started with the btrfs search_ioctl() where, even if
some bytes were written in copy_to_sk(), it always restarts from an
earlier position, reattempting to write the same bytes. Since
copy_to_sk() doesn't guarantee forward progress even if some bytes are
writable, Linus' suggestion was for fault_in_writable() to probe the
whole range. I consider this overkill since btrfs is the only one that
needs probing every 16 bytes. The other cases like the new
fault_in_safe_writeable() can be fixed by probing the first byte only
followed by gup.
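
For fault_in_safe_writeable() that would mean something like this (a sketch
of the idea only; the gup-based helper name is made up):

	size_t fault_in_safe_writeable(const char __user *uaddr, size_t size)
	{
		char c;

		/* one user-mode read catches sub-page (e.g. MTE tag) faults */
		if (unlikely(__get_user(c, uaddr)))
			return size;
		/* the rest only needs present, writable ptes: gup's job */
		return fixup_user_fault_range(uaddr, size);	/* hypothetical */
	}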

I think we need to better define the semantics of the fault_in + uaccess
sequences. For uaccess, we document "a hard requirement that not storing
anything at all (i.e. returning size) should happen only when nothing
could be copied" (from linux/uaccess.h). I think we can add a
requirement for the new size_t-based fault_in_* variants without
mandating that the whole range is probed, something like: "returning
leftover < size guarantees that a subsequent user access at uaddr copies
at least one byte eventually". I said "eventually" but maybe we can come
up with some clearer wording for a liveness property.

Such requirement would ensure that infinite loops of fault_in_* +
uaccess make progress as long as they don't reset the probed range. Code
like btrfs search_ioctl() would need to be adjusted to avoid such range
reset and guarantee forward progress.

--
Catalin

2021-10-21 18:03:45

by Andreas Gruenbacher

[permalink] [raw]
Subject: Re: [RFC][arm64] possible infinite loop in btrfs search_ioctl()

On Thu, Oct 21, 2021 at 7:09 PM Catalin Marinas <[email protected]> wrote:
> On Thu, Oct 21, 2021 at 04:42:33PM +0200, Andreas Gruenbacher wrote:
> > On Thu, Oct 21, 2021 at 12:06 PM Catalin Marinas
> > <[email protected]> wrote:
> > > On Thu, Oct 21, 2021 at 02:46:10AM +0200, Andreas Gruenbacher wrote:
> > > > When a page fault would occur, we
> > > > get back an error instead, and then we try to fault in the offending
> > > > pages. If a page is resident and we still get a fault trying to access
> > > > it, trying to fault in the same page again isn't going to help and we
> > > > have a true error.
> > >
> > > You can't be sure the second fault is a true error. The unlocked
> > > fault_in_*() may race with some LRU scheme making the pte not accessible
> > > or a write-back making it clean/read-only. copy_to_user() with
> > > pagefault_disabled() fails again but that's a benign fault. The
> > > filesystem should re-attempt the fault-in (gup would correct the pte),
> > > disable page faults and copy_to_user(), potentially in an infinite loop.
> > > If you bail out on the second/third uaccess following a fault_in_*()
> > > call, you may get some unexpected errors (though very rare). Maybe the
> > > filesystems avoid this problem somehow but I couldn't figure it out.
> >
> > Good point, we can indeed only bail out if both the user copy and the
> > fault-in fail.
> >
> > But probing the entire memory range in fault domain granularity in the
> > page fault-in functions still doesn't actually make sense. Those
> > functions really only need to guarantee that we'll be able to make
> > progress eventually. From that point of view, it should be enough to
> > probe the first byte of the requested memory range, so when one of
> > those functions reports that the next N bytes should be accessible,
> > this really means that the first byte surely isn't permanently
> > inaccessible and that the rest is likely accessible. Functions
> > fault_in_readable and fault_in_writeable already work that way, so
> > this only leaves function fault_in_safe_writeable to worry about.
>
> I agree, that's why generic_perform_write() works. It does a get_user()
> from the first byte in that range and the subsequent copy_from_user()
> will make progress of at least one byte if it was readable. Eventually
> it will hit the byte that faults. The gup-based fault_in_*() are a bit
> more problematic.
>
> Your series introduces fault_in_safe_writeable() and I think for MTE
> doing a _single_ get_user(uaddr) (in addition to the gup checks for
> write) would be sufficient as long as generic_file_read_iter() advances
> by at least one byte (eventually).
>
> This discussion started with the btrfs search_ioctl() where, even if
> some bytes were written in copy_to_sk(), it always restarts from an
> earlier position, reattempting to write the same bytes. Since
> copy_to_sk() doesn't guarantee forward progress even if some bytes are
> writable, Linus' suggestion was for fault_in_writable() to probe the
> whole range. I consider this overkill since btrfs is the only one that
> needs probing every 16 bytes. The other cases like the new
> fault_in_safe_writeable() can be fixed by probing the first byte only
> followed by gup.

Hmm. Direct I/O request sizes are multiples of the underlying device
block size, so we'll also get stuck there if fault-in won't give us a
full block. This is getting pretty ugly. So scratch that idea; let's
stick with probing the whole range.

Thanks,
Andreas

> I think we need to better define the semantics of the fault_in + uaccess
> sequences. For uaccess, we document "a hard requirement that not storing
> anything at all (i.e. returning size) should happen only when nothing
> could be copied" (from linux/uaccess.h). I think we can add a
> requirement for the new size_t-based fault_in_* variants without
> mandating that the whole range is probed, something like: "returning
> leftover < size guarantees that a subsequent user access at uaddr copies
> at least one byte eventually". I said "eventually" but maybe we can come
> up with some clearer wording for a liveness property.
>
> Such requirement would ensure that infinite loops of fault_in_* +
> uaccess make progress as long as they don't reset the probed range. Code
> like btrfs search_ioctl() would need to be adjusted to avoid such range
> reset and guarantee forward progress.
>
> --
> Catalin
>

2021-10-22 02:33:10

by Linus Torvalds

[permalink] [raw]
Subject: Re: [RFC][arm64] possible infinite loop in btrfs search_ioctl()

On Thu, Oct 21, 2021 at 4:42 AM Andreas Gruenbacher <[email protected]> wrote:
>
> But probing the entire memory range in fault domain granularity in the
> page fault-in functions still doesn't actually make sense. Those
> functions really only need to guarantee that we'll be able to make
> progress eventually. From that point of view, it should be enough to
> probe the first byte of the requested memory range

That's probably fine.

Although it should be more than one byte - "copy_from_user()" might do
word-at-a-time optimizations, so you could have an infinite loop of

(a) copy_from_user() fails because the chunk it tried to get failed partly

(b) fault_in() probing succeeds, because the beginning part is fine

so I agree that the fault-in code doesn't need to do the whole area,
but it needs to at least do some <N bytes, up to length> thing, to
handle the situation where the copy_to/from_user requires more than a
single byte.
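
That is, something like this (illustrative only; the exact constant doesn't
matter much):

	size_t probe = min_t(size_t, len, 2 * sizeof(unsigned long));

	if (fault_in_writeable(uaddr, probe) == probe)
		return -EFAULT;	/* nothing in the prefix could be faulted in */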

Linus

2021-10-22 09:35:28

by Catalin Marinas

[permalink] [raw]
Subject: Re: [RFC][arm64] possible infinite loop in btrfs search_ioctl()

On Thu, Oct 21, 2021 at 04:30:30PM -1000, Linus Torvalds wrote:
> On Thu, Oct 21, 2021 at 4:42 AM Andreas Gruenbacher <[email protected]> wrote:
> > But probing the entire memory range in fault domain granularity in the
> > page fault-in functions still doesn't actually make sense. Those
> > functions really only need to guarantee that we'll be able to make
> > progress eventually. From that point of view, it should be enough to
> > probe the first byte of the requested memory range
>
> That's probably fine.
>
> Although it should be more than one byte - "copy_from_user()" might do
> word-at-a-time optimizations, so you could have an infinite loop of
>
> (a) copy_from_user() fails because the chunk it tried to get failed partly
>
> (b) fault_in() probing succeeds, because the beginning part is fine
>
> so I agree that the fault-in code doesn't need to do the whole area,
> but it needs to at least do some <N bytes, up to length> thing, to
> handle the situation where the copy_to/from_user requires more than a
> single byte.

From a discussion with Al some months ago, if there are bytes still
accessible, copy_from_user() is not allowed to fail fully (i.e. return
the requested copy size) even when it uses word-at-a-time. In the worst
case, it should return size - 1. If the fault_in() then continues
probing from uaddr + 1, it should eventually hit the faulty address.

The problem appears when fault_in() restarts from uaddr rather than
where copy_from_user() stopped. That's what the btrfs search_ioctl()
does. I also need to check the direct I/O cases that Andreas mentioned,
maybe they can be changed not to attempt the fault_in() from the
beginning of the block.

--
Catalin

2021-10-22 18:42:41

by Catalin Marinas

[permalink] [raw]
Subject: Re: [RFC][arm64] possible infinite loop in btrfs search_ioctl()

On Thu, Oct 21, 2021 at 08:00:50PM +0200, Andreas Gruenbacher wrote:
> On Thu, Oct 21, 2021 at 7:09 PM Catalin Marinas <[email protected]> wrote:
> > This discussion started with the btrfs search_ioctl() where, even if
> > some bytes were written in copy_to_sk(), it always restarts from an
> > earlier position, reattempting to write the same bytes. Since
> > copy_to_sk() doesn't guarantee forward progress even if some bytes are
> > writable, Linus' suggestion was for fault_in_writable() to probe the
> > whole range. I consider this overkill since btrfs is the only one that
> > needs probing every 16 bytes. The other cases like the new
> > fault_in_safe_writeable() can be fixed by probing the first byte only
> > followed by gup.
>
> Hmm. Direct I/O request sizes are multiples of the underlying device
> block size, so we'll also get stuck there if fault-in won't give us a
> full block. This is getting pretty ugly. So scratch that idea; let's
> stick with probing the whole range.

Ah, I wasn't aware of this. I got lost in the call trees but I noticed
__iomap_dio_rw() does an iov_iter_revert() only if direction is READ. Is
this the case for writes as well?

--
Catalin

2021-10-26 00:22:55

by Andreas Gruenbacher

[permalink] [raw]
Subject: Re: [RFC][arm64] possible infinite loop in btrfs search_ioctl()

On Fri, Oct 22, 2021 at 8:41 PM Catalin Marinas <[email protected]> wrote:
> On Thu, Oct 21, 2021 at 08:00:50PM +0200, Andreas Gruenbacher wrote:
> > On Thu, Oct 21, 2021 at 7:09 PM Catalin Marinas <[email protected]> wrote:
> > > This discussion started with the btrfs search_ioctl() where, even if
> > > some bytes were written in copy_to_sk(), it always restarts from an
> > > earlier position, reattempting to write the same bytes. Since
> > > copy_to_sk() doesn't guarantee forward progress even if some bytes are
> > > writable, Linus' suggestion was for fault_in_writable() to probe the
> > > whole range. I consider this overkill since btrfs is the only one that
> > > needs probing every 16 bytes. The other cases like the new
> > > fault_in_safe_writeable() can be fixed by probing the first byte only
> > > followed by gup.
> >
> > Hmm. Direct I/O request sizes are multiples of the underlying device
> > block size, so we'll also get stuck there if fault-in won't give us a
> > full block. This is getting pretty ugly. So scratch that idea; let's
> > stick with probing the whole range.
>
> Ah, I wasn't aware of this. I got lost in the call trees but I noticed
> __iomap_dio_rw() does an iov_iter_revert() only if direction is READ. Is
> this the case for writes as well?

It's the EOF case, so it only applies to reads:

	/*
	 * We only report that we've read data up to i_size.
	 * Revert iter to a state corresponding to that as some callers (such
	 * as the splice code) rely on it.
	 */
	if (iov_iter_rw(iter) == READ && iomi.pos >= dio->i_size)
		iov_iter_revert(iter, iomi.pos - dio->i_size);

Andreas