2021-07-18 22:41:08

by Andreas Gruenbacher

[permalink] [raw]
Subject: [PATCH v2 0/6] gfs2: Fix mmap + page fault deadlocks

Hi Linus et al.,

here's an update to the gfs2 mmap + page faults fix that implements
Linus's suggestion of disabling page faults while we're holding the
inode glock.

This passes fstests except for test generic/095, which fails due to an
independent bug in the iov_iter code. I'm currently trying to get
initial feedback from Al and Christoph on that.

Any feedback would be welcome.

Thanks a lot,
Andreas

Andreas Gruenbacher (6):
iov_iter: Introduce fault_in_iov_iter helper
iomap: Fix iomap_dio_rw return value for page faults
gfs2: Add wrapper for iomap_file_buffered_write
gfs2: Fix mmap + page fault deadlocks for buffered I/O
iov_iter: Introduce ITER_FLAG_FAST_ONLY flag
gfs2: Fix mmap + page fault deadlocks for direct I/O

fs/gfs2/file.c | 79 ++++++++++++++++++++++++++++++++++++++++----
fs/iomap/direct-io.c | 2 ++
include/linux/mm.h | 3 ++
include/linux/uio.h | 16 +++++++--
lib/iov_iter.c | 62 +++++++++++++++++++++++++++++++---
mm/gup.c | 68 ++++++++++++++++++++++++++++++++++++++
6 files changed, 215 insertions(+), 15 deletions(-)

--
2.26.3


2021-07-18 22:41:16

by Andreas Gruenbacher

[permalink] [raw]
Subject: [PATCH v2 2/6] iomap: Fix iomap_dio_rw return value for page faults

When a page fault occurs during a direct I/O, iomap_dio_rw can currently return
0 when a page cannot be accessed. In that case, -EFAULT should be returned
instead. (For reads, a return value of 0 indicates the end of file.) Fix that
by casting the return value of iomap_apply from 0 to -EFAULT: in that position,
we know that we should have been able to read something.

Signed-off-by: Andreas Gruenbacher <[email protected]>
---
fs/iomap/direct-io.c | 2 ++
1 file changed, 2 insertions(+)

diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index 9398b8c31323..a87a43ee8278 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -561,6 +561,8 @@ __iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
ret = iomap_apply(inode, pos, count, iomap_flags, ops, dio,
iomap_dio_actor);
if (ret <= 0) {
+ if (ret == 0)
+ ret = -EFAULT;
/* magic error code to fall back to buffered I/O */
if (ret == -ENOTBLK) {
wait_for_completion = true;
--
2.26.3

2021-07-18 22:41:18

by Andreas Gruenbacher

[permalink] [raw]
Subject: [PATCH v2 3/6] gfs2: Add wrapper for iomap_file_buffered_write

Add a wrapper around iomap_file_buffered_write. We'll add code for when
the operation needs to be retried here later.

Signed-off-by: Andreas Gruenbacher <[email protected]>
---
fs/gfs2/file.c | 20 ++++++++++++++------
1 file changed, 14 insertions(+), 6 deletions(-)

diff --git a/fs/gfs2/file.c b/fs/gfs2/file.c
index 493a83e3f590..13f701493c3c 100644
--- a/fs/gfs2/file.c
+++ b/fs/gfs2/file.c
@@ -857,6 +857,18 @@ static ssize_t gfs2_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
return written ? written : ret;
}

+static ssize_t gfs2_file_buffered_write(struct kiocb *iocb, struct iov_iter *from)
+{
+ struct file *file = iocb->ki_filp;
+ struct inode *inode = file_inode(file);
+ ssize_t ret;
+
+ current->backing_dev_info = inode_to_bdi(inode);
+ ret = iomap_file_buffered_write(iocb, from, &gfs2_iomap_ops);
+ current->backing_dev_info = NULL;
+ return ret;
+}
+
/**
* gfs2_file_write_iter - Perform a write to a file
* @iocb: The io context
@@ -908,9 +920,7 @@ static ssize_t gfs2_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
goto out_unlock;

iocb->ki_flags |= IOCB_DSYNC;
- current->backing_dev_info = inode_to_bdi(inode);
- buffered = iomap_file_buffered_write(iocb, from, &gfs2_iomap_ops);
- current->backing_dev_info = NULL;
+ buffered = gfs2_file_buffered_write(iocb, from);
if (unlikely(buffered <= 0)) {
if (!ret)
ret = buffered;
@@ -932,9 +942,7 @@ static ssize_t gfs2_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
if (!ret || ret2 > 0)
ret += ret2;
} else {
- current->backing_dev_info = inode_to_bdi(inode);
- ret = iomap_file_buffered_write(iocb, from, &gfs2_iomap_ops);
- current->backing_dev_info = NULL;
+ ret = gfs2_file_buffered_write(iocb, from);
if (likely(ret > 0)) {
iocb->ki_pos += ret;
ret = generic_write_sync(iocb, ret);
--
2.26.3

2021-07-18 22:42:00

by Andreas Gruenbacher

[permalink] [raw]
Subject: [PATCH v2 4/6] gfs2: Fix mmap + page fault deadlocks for buffered I/O

In the .read_iter and .write_iter file operations, we're accessing
user-space memory while holding the inodes glock. There's a possibility
that the memory is mapped to the same file, in which case we'd recurse on
the same glock.

More complex scenarios can involve multiple glocks, processes, and even cluster
nodes.

Avoids these kinds of problems by disabling page faults while holding a glock.
If a page fault occurs, we either end up with a partial read or write, or with
-EFAULT if nothing could be read or written. In that case, we drop the glock,
fault in the requested pages manually, and repeat the operation.

This locking problem in gfs2 was originally reported by Jan Kara. Linus came
up with the proposal to disable page faults. Many thanks to Al Viro and
Matthew Wilcox for their feedback as well.

Signed-off-by: Andreas Gruenbacher <[email protected]>
---
fs/gfs2/file.c | 24 ++++++++++++++++++++++++
1 file changed, 24 insertions(+)

diff --git a/fs/gfs2/file.c b/fs/gfs2/file.c
index 13f701493c3c..99df7934b4d8 100644
--- a/fs/gfs2/file.c
+++ b/fs/gfs2/file.c
@@ -824,6 +824,12 @@ static ssize_t gfs2_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
size_t written = 0;
ssize_t ret;

+ /*
+ * In this function, we disable page faults when whe're holding the
+ * inode glock while doing I/O. If a page fault occurs, we drop the
+ * inode glock, fault in the pages manually, and then we retry.
+ */
+
if (iocb->ki_flags & IOCB_DIRECT) {
ret = gfs2_file_direct_read(iocb, to, &gh);
if (likely(ret != -ENOTBLK))
@@ -831,6 +837,7 @@ static ssize_t gfs2_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
iocb->ki_flags &= ~IOCB_DIRECT;
}
iocb->ki_flags |= IOCB_NOIO;
+ /* Leave page faults enabled while we're not holding any locks. */
ret = generic_file_read_iter(iocb, to);
iocb->ki_flags &= ~IOCB_NOIO;
if (ret >= 0) {
@@ -845,13 +852,19 @@ static ssize_t gfs2_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
}
ip = GFS2_I(iocb->ki_filp->f_mapping->host);
gfs2_holder_init(ip->i_gl, LM_ST_SHARED, 0, &gh);
+retry:
ret = gfs2_glock_nq(&gh);
if (ret)
goto out_uninit;
+ pagefault_disable();
ret = generic_file_read_iter(iocb, to);
+ pagefault_enable();
if (ret > 0)
written += ret;
gfs2_glock_dq(&gh);
+ if (unlikely(iov_iter_count(to) && (ret > 0 || ret == -EFAULT)) &&
+ fault_in_iov_iter(to))
+ goto retry;
out_uninit:
gfs2_holder_uninit(&gh);
return written ? written : ret;
@@ -863,9 +876,20 @@ static ssize_t gfs2_file_buffered_write(struct kiocb *iocb, struct iov_iter *fro
struct inode *inode = file_inode(file);
ssize_t ret;

+ /*
+ * In this function, we disable page faults when whe're holding the
+ * inode glock while doing I/O. If a page fault occurs, we drop the
+ * inode glock, fault in the pages manually, and then we retry.
+ */
+
+retry:
current->backing_dev_info = inode_to_bdi(inode);
+ pagefault_disable();
ret = iomap_file_buffered_write(iocb, from, &gfs2_iomap_ops);
+ pagefault_enable();
current->backing_dev_info = NULL;
+ if (unlikely(ret == -EFAULT) && fault_in_iov_iter(from))
+ goto retry;
return ret;
}

--
2.26.3

2021-07-18 22:42:22

by Andreas Gruenbacher

[permalink] [raw]
Subject: [PATCH v2 5/6] iov_iter: Introduce ITER_FLAG_FAST_ONLY flag

Introduce a new ITER_FLAG_FAST_ONLY flag to indicate to get_user_pages to use
the FOLL_FAST_ONLY flag. This will cause get_user_pages to fail when it would
otherwise fault in a page.

Currently, the ITER_FLAG_FAST_ONLY flag is only checked in iov_iter_get_pages
and iov_iter_get_pages_alloc. This is enough for iomaop_dio_rw(), but it may
make sense to check for this flag in other contexts as well.

Signed-off-by: Andreas Gruenbacher <[email protected]>
---
include/linux/uio.h | 15 ++++++++++++---
lib/iov_iter.c | 20 +++++++++++++++-----
2 files changed, 27 insertions(+), 8 deletions(-)

diff --git a/include/linux/uio.h b/include/linux/uio.h
index 74f819c41735..d3d629c2153a 100644
--- a/include/linux/uio.h
+++ b/include/linux/uio.h
@@ -18,6 +18,9 @@ struct kvec {
};

enum iter_type {
+ /* set if get_user_pages should use FOLL_FAST_ONLY */
+ ITER_FLAG_FAST_ONLY = 2,
+
/* iter types */
ITER_IOVEC = 4,
ITER_KVEC = 8,
@@ -30,8 +33,9 @@ enum iter_type {
struct iov_iter {
/*
* Bit 0 is the read/write bit, set if we're writing.
- * Bit 1 is the BVEC_FLAG_NO_REF bit, set if type is a bvec and
- * the caller isn't expecting to drop a page reference when done.
+ * Bit 1 is the ITER_FLAG_FAST_ONLY bit, set if get_user_pages
+ * should use the FOLL_FAST_ONLY flag when trying to fault in pages
+ * (only useful for type ITER_IOVEC).
*/
unsigned int type;
size_t iov_offset;
@@ -55,7 +59,7 @@ struct iov_iter {

static inline enum iter_type iov_iter_type(const struct iov_iter *i)
{
- return i->type & ~(READ | WRITE);
+ return i->type & ~(READ | WRITE | ITER_FLAG_FAST_ONLY);
}

static inline bool iter_is_iovec(const struct iov_iter *i)
@@ -93,6 +97,11 @@ static inline unsigned char iov_iter_rw(const struct iov_iter *i)
return i->type & (READ | WRITE);
}

+static inline bool iov_iter_is_fast_only(const struct iov_iter *i)
+{
+ return i->type & ITER_FLAG_FAST_ONLY;
+}
+
/*
* Total number of bytes covered by an iovec.
*
diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index 3beecf8f77de..182ff2afed19 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -1538,6 +1538,8 @@ ssize_t iov_iter_get_pages(struct iov_iter *i,
struct page **pages, size_t maxsize, unsigned maxpages,
size_t *start)
{
+ unsigned int gup_flags = 0;
+
if (maxsize > i->count)
maxsize = i->count;

@@ -1548,6 +1550,11 @@ ssize_t iov_iter_get_pages(struct iov_iter *i,
if (unlikely(iov_iter_is_discard(i)))
return -EFAULT;

+ if (iov_iter_rw(i) != WRITE)
+ gup_flags |= FOLL_WRITE;
+ if (iov_iter_is_fast_only(i))
+ gup_flags |= FOLL_FAST_ONLY;
+
iterate_all_kinds(i, maxsize, v, ({
unsigned long addr = (unsigned long)v.iov_base;
size_t len = v.iov_len + (*start = addr & (PAGE_SIZE - 1));
@@ -1558,9 +1565,7 @@ ssize_t iov_iter_get_pages(struct iov_iter *i,
len = maxpages * PAGE_SIZE;
addr &= ~(PAGE_SIZE - 1);
n = DIV_ROUND_UP(len, PAGE_SIZE);
- res = get_user_pages_fast(addr, n,
- iov_iter_rw(i) != WRITE ? FOLL_WRITE : 0,
- pages);
+ res = get_user_pages_fast(addr, n, gup_flags, pages);
if (unlikely(res < 0))
return res;
return (res == n ? len : res * PAGE_SIZE) - *start;
@@ -1665,6 +1670,7 @@ ssize_t iov_iter_get_pages_alloc(struct iov_iter *i,
struct page ***pages, size_t maxsize,
size_t *start)
{
+ unsigned int gup_flags = 0;
struct page **p;

if (maxsize > i->count)
@@ -1677,6 +1683,11 @@ ssize_t iov_iter_get_pages_alloc(struct iov_iter *i,
if (unlikely(iov_iter_is_discard(i)))
return -EFAULT;

+ if (iov_iter_rw(i) != WRITE)
+ gup_flags |= FOLL_WRITE;
+ if (iov_iter_is_fast_only(i))
+ gup_flags |= FOLL_FAST_ONLY;
+
iterate_all_kinds(i, maxsize, v, ({
unsigned long addr = (unsigned long)v.iov_base;
size_t len = v.iov_len + (*start = addr & (PAGE_SIZE - 1));
@@ -1688,8 +1699,7 @@ ssize_t iov_iter_get_pages_alloc(struct iov_iter *i,
p = get_pages_array(n);
if (!p)
return -ENOMEM;
- res = get_user_pages_fast(addr, n,
- iov_iter_rw(i) != WRITE ? FOLL_WRITE : 0, p);
+ res = get_user_pages_fast(addr, n, gup_flags, p);
if (unlikely(res < 0)) {
kvfree(p);
return res;
--
2.26.3

2021-07-18 22:42:27

by Andreas Gruenbacher

[permalink] [raw]
Subject: [PATCH v2 6/6] gfs2: Fix mmap + page fault deadlocks for direct I/O

Direct I/O differs from buffered I/O in that it uses bio_iov_iter_get_pages for
grabbing page references and for manually faulting in pages instead of
triggering actual page faults. For disabling these manual page faults, it's
not enough to call pagefault_disable(); instead, we use the new
ITER_FLAG_FAST_ONLY flag for telling iomap_dio_rw to stop faulting pages in for
us.

Signed-off-by: Andreas Gruenbacher <[email protected]>
---
fs/gfs2/file.c | 35 ++++++++++++++++++++++++++++++++++-
1 file changed, 34 insertions(+), 1 deletion(-)

diff --git a/fs/gfs2/file.c b/fs/gfs2/file.c
index 99df7934b4d8..6feb857a8a1c 100644
--- a/fs/gfs2/file.c
+++ b/fs/gfs2/file.c
@@ -763,21 +763,42 @@ static ssize_t gfs2_file_direct_read(struct kiocb *iocb, struct iov_iter *to,
struct file *file = iocb->ki_filp;
struct gfs2_inode *ip = GFS2_I(file->f_mapping->host);
size_t count = iov_iter_count(to);
+ size_t written = 0;
ssize_t ret;

+ /*
+ * In this function, we disable page faults when whe're holding the
+ * inode glock while doing I/O. If a page fault occurs, we drop the
+ * inode glock, fault in the pages manually, and then we retry. Other
+ * than in gfs2_file_read_iter, iomap_dio_rw can trigger implicit as
+ * well as manual page faults, and we need to disable both kinds
+ * separately.
+ */
+
if (!count)
return 0; /* skip atime */

gfs2_holder_init(ip->i_gl, LM_ST_DEFERRED, 0, gh);
+retry:
ret = gfs2_glock_nq(gh);
if (ret)
goto out_uninit;

+ pagefault_disable();
+ to->type |= ITER_FLAG_FAST_ONLY;
ret = iomap_dio_rw(iocb, to, &gfs2_iomap_ops, NULL, 0);
+ to->type &= ~ITER_FLAG_FAST_ONLY;
+ pagefault_enable();
+
gfs2_glock_dq(gh);
+ if (ret > 0)
+ written += ret;
+ if (unlikely(iov_iter_count(to) && (ret > 0 || ret == -EFAULT)) &&
+ fault_in_iov_iter(to))
+ goto retry;
out_uninit:
gfs2_holder_uninit(gh);
- return ret;
+ return written ? written : ret;
}

static ssize_t gfs2_file_direct_write(struct kiocb *iocb, struct iov_iter *from,
@@ -790,6 +811,12 @@ static ssize_t gfs2_file_direct_write(struct kiocb *iocb, struct iov_iter *from,
loff_t offset = iocb->ki_pos;
ssize_t ret;

+ /*
+ * In this function, we disable page faults when whe're holding the
+ * inode glock while doing I/O. If a page fault occurs, we drop the
+ * inode glock, fault in the pages manually, and then we retry.
+ */
+
/*
* Deferred lock, even if its a write, since we do no allocation on
* this path. All we need to change is the atime, and this lock mode
@@ -799,6 +826,7 @@ static ssize_t gfs2_file_direct_write(struct kiocb *iocb, struct iov_iter *from,
* VFS does.
*/
gfs2_holder_init(ip->i_gl, LM_ST_DEFERRED, 0, gh);
+retry:
ret = gfs2_glock_nq(gh);
if (ret)
goto out_uninit;
@@ -807,11 +835,16 @@ static ssize_t gfs2_file_direct_write(struct kiocb *iocb, struct iov_iter *from,
if (offset + len > i_size_read(&ip->i_inode))
goto out;

+ from->type |= ITER_FLAG_FAST_ONLY;
ret = iomap_dio_rw(iocb, from, &gfs2_iomap_ops, NULL, 0);
+ from->type &= ~ITER_FLAG_FAST_ONLY;
+
if (ret == -ENOTBLK)
ret = 0;
out:
gfs2_glock_dq(gh);
+ if (unlikely(ret == -EFAULT) && fault_in_iov_iter(from))
+ goto retry;
out_uninit:
gfs2_holder_uninit(gh);
return ret;
--
2.26.3

2021-07-19 20:38:30

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v2 0/6] gfs2: Fix mmap + page fault deadlocks

On Sun, Jul 18, 2021 at 3:39 PM Andreas Gruenbacher <[email protected]> wrote:
>
> here's an update to the gfs2 mmap + page faults fix that implements
> Linus's suggestion of disabling page faults while we're holding the
> inode glock.

Apart from some wording/naming issues, I think this looks a _lot_
better, and should fix the fundamental and underlying deadlock
properly.

So Ack from me with the trivial suggestions I sent to the individual patches.

Linus

2021-07-19 20:38:30

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v2 5/6] iov_iter: Introduce ITER_FLAG_FAST_ONLY flag

On Sun, Jul 18, 2021 at 3:40 PM Andreas Gruenbacher <[email protected]> wrote:
>
> Introduce a new ITER_FLAG_FAST_ONLY flag

I think the code is fine, but I think it might be best to call this
"ITER_FLAG_NOIO" or something like that.

The "FAST_ONLY" name makes sense in the context of
"get_user_pages_fast()" where we have that "fast" naming (and the long
history too). But I don't think it makes much sense as a name in the
context of iov_iter.

Linus

2021-07-19 20:56:46

by Al Viro

[permalink] [raw]
Subject: Re: [PATCH v2 5/6] iov_iter: Introduce ITER_FLAG_FAST_ONLY flag

On Mon, Jul 19, 2021 at 12:29:35PM -0700, Linus Torvalds wrote:
> On Sun, Jul 18, 2021 at 3:40 PM Andreas Gruenbacher <[email protected]> wrote:
> >
> > Introduce a new ITER_FLAG_FAST_ONLY flag
>
> I think the code is fine, but I think it might be best to call this
> "ITER_FLAG_NOIO" or something like that.
>
> The "FAST_ONLY" name makes sense in the context of
> "get_user_pages_fast()" where we have that "fast" naming (and the long
> history too). But I don't think it makes much sense as a name in the
> context of iov_iter.

This code has never been tested with current lib/iov_iter.c as it is
in mainline. Or had been in -next during the last cycle. It won't
apply at all.

Sure, I can try to port that series over to the current mainline, but
I'd rather see that done (and tested) by the series author...

2021-07-19 20:56:46

by Al Viro

[permalink] [raw]
Subject: Re: [PATCH v2 0/6] gfs2: Fix mmap + page fault deadlocks

On Mon, Jul 19, 2021 at 12:39:26AM +0200, Andreas Gruenbacher wrote:
> Hi Linus et al.,
>
> here's an update to the gfs2 mmap + page faults fix that implements
> Linus's suggestion of disabling page faults while we're holding the
> inode glock.
>
> This passes fstests except for test generic/095, which fails due to an
> independent bug in the iov_iter code. I'm currently trying to get
> initial feedback from Al and Christoph on that.
>
> Any feedback would be welcome.

What tree is that against? Because in mainline your #5 sure as hell
won't apply...

There uio.h has
enum iter_type {
/* iter types */
ITER_IOVEC,
ITER_KVEC,
ITER_BVEC,
ITER_PIPE,
ITER_XARRAY,
ITER_DISCARD,
};

and your series assumes

enum iter_type {
/* iter types */
ITER_IOVEC = 4,
ITER_KVEC = 8,
ITER_BVEC = 16,
ITER_PIPE = 32,
ITER_DISCARD = 64,
};

What had that been tested with?