2021-05-31 17:12:42

by Andreas Gruenbacher

Subject: [RFC 4/9] gfs2: Fix mmap + page fault deadlocks (part 1)

When the buffer passed to a read or write system call is memory mapped
to the same file, a page fault can occur in filemap_fault. In that
case, the task will already be holding the inode glock, and trying to
take the same lock again will result in a BUG in add_to_queue().
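
For example (user-space sketch; the path and sizes are arbitrary and
error checking is omitted):

	#include <fcntl.h>
	#include <sys/mman.h>
	#include <unistd.h>

	int main(void)
	{
		/* hypothetical file on a gfs2 mount */
		int fd = open("/mnt/gfs2/foo", O_RDWR);
		char *map = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
				 MAP_SHARED, fd, 0);

		/* read() holds the inode glock while copying into 'map';
		   the resulting page fault tries to take it again. */
		read(fd, map, 4096);
		return 0;
	}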

Fix that by recognizing the self-recursion case: either skip taking
the lock (when the glock is already held in a compatible way), or fail
the operation.

Likewise, a request to un-share a copy-on-write page can *probably*
happen in similar situations, so treat the locking in gfs2_page_mkwrite
in the same way.

A future patch will handle these cases more gracefully by retrying
operations instead of failing them, along with addressing more complex
deadlock scenarios.

Reported-by: Jan Kara <[email protected]>
Signed-off-by: Andreas Gruenbacher <[email protected]>
---
fs/gfs2/file.c | 40 ++++++++++++++++++++++++++++++----------
1 file changed, 30 insertions(+), 10 deletions(-)

diff --git a/fs/gfs2/file.c b/fs/gfs2/file.c
index 6d77743f11a4..7d88abb4629b 100644
--- a/fs/gfs2/file.c
+++ b/fs/gfs2/file.c
@@ -423,6 +423,7 @@ static vm_fault_t gfs2_page_mkwrite(struct vm_fault *vmf)
 	struct page *page = vmf->page;
 	struct inode *inode = file_inode(vmf->vma->vm_file);
 	struct gfs2_inode *ip = GFS2_I(inode);
+	struct gfs2_holder *outer_gh = gfs2_glock_is_locked_by_me(ip->i_gl);
 	struct gfs2_sbd *sdp = GFS2_SB(inode);
 	struct gfs2_alloc_parms ap = { .aflags = 0, };
 	u64 offset = page_offset(page);
@@ -436,10 +437,18 @@ static vm_fault_t gfs2_page_mkwrite(struct vm_fault *vmf)
 	sb_start_pagefault(inode->i_sb);
 
 	gfs2_holder_init(ip->i_gl, LM_ST_EXCLUSIVE, 0, &gh);
-	err = gfs2_glock_nq(&gh);
-	if (err) {
-		ret = block_page_mkwrite_return(err);
-		goto out_uninit;
+	if (likely(!outer_gh)) {
+		err = gfs2_glock_nq(&gh);
+		if (err) {
+			ret = block_page_mkwrite_return(err);
+			goto out_uninit;
+		}
+	} else {
+		if (!gfs2_holder_is_compatible(outer_gh, LM_ST_EXCLUSIVE)) {
+			/* We could try to upgrade outer_gh here. */
+			ret = VM_FAULT_SIGBUS;
+			goto out_uninit;
+		}
 	}
 
 	/* Check page index against inode size */
@@ -540,7 +549,8 @@ static vm_fault_t gfs2_page_mkwrite(struct vm_fault *vmf)
 out_quota_unlock:
 	gfs2_quota_unlock(ip);
 out_unlock:
-	gfs2_glock_dq(&gh);
+	if (likely(!outer_gh))
+		gfs2_glock_dq(&gh);
 out_uninit:
 	gfs2_holder_uninit(&gh);
 	if (ret == VM_FAULT_LOCKED) {
@@ -555,6 +565,7 @@ static vm_fault_t gfs2_fault(struct vm_fault *vmf)
 {
 	struct inode *inode = file_inode(vmf->vma->vm_file);
 	struct gfs2_inode *ip = GFS2_I(inode);
+	struct gfs2_holder *outer_gh = gfs2_glock_is_locked_by_me(ip->i_gl);
 	struct gfs2_holder gh;
 	vm_fault_t ret;
 	u16 state;
@@ -562,13 +573,22 @@ static vm_fault_t gfs2_fault(struct vm_fault *vmf)
 
 	state = (vmf->flags & FAULT_FLAG_WRITE) ? LM_ST_EXCLUSIVE : LM_ST_SHARED;
 	gfs2_holder_init(ip->i_gl, state, 0, &gh);
-	err = gfs2_glock_nq(&gh);
-	if (err) {
-		ret = block_page_mkwrite_return(err);
-		goto out_uninit;
+	if (likely(!outer_gh)) {
+		err = gfs2_glock_nq(&gh);
+		if (err) {
+			ret = block_page_mkwrite_return(err);
+			goto out_uninit;
+		}
+	} else {
+		if (!gfs2_holder_is_compatible(outer_gh, state)) {
+			/* We could try to upgrade outer_gh here. */
+			ret = VM_FAULT_SIGBUS;
+			goto out_uninit;
+		}
 	}
 	ret = filemap_fault(vmf);
-	gfs2_glock_dq(&gh);
+	if (likely(!outer_gh))
+		gfs2_glock_dq(&gh);
 out_uninit:
 	gfs2_holder_uninit(&gh);
 	return ret;
--
2.26.3


2021-06-01 06:02:55

by Linus Torvalds

Subject: Re: [RFC 4/9] gfs2: Fix mmap + page fault deadlocks (part 1)

On Mon, May 31, 2021 at 7:01 AM Andreas Gruenbacher <[email protected]> wrote:
>
> Fix that by recognizing the self-recursion case.

Hmm. I get the feeling that the self-recursion case should never have
been allowed to happen in the first place.

IOW, is there some reason why you can't make the user accesses always
be done with page faults disabled (ie using the "atomic" user space
access model), and then if you get a partial read (or write) to user
space, at that point you drop the locks in read/write, do the "try to
make readable/writable" and try again.

IOW, none of this "detect recursion" thing. Just "no recursion in the
first place".

That way you'd not have these odd rules at fault time at all, because
a fault while holding a lock would never get to the filesystem at all,
it would be aborted early. And you'd not have any odd "inner/outer"
locks, or lock compatibility rules or anything like that. You'd
literally have just "oh, I didn't get everything at RW time while I
held locks, so let's drop the locks, try to access user space, and
retry".

Wouldn't that be a lot simpler and more robust?

Because what if the mmap is something a bit more complex, like
overlayfs or userfaultfd, and completing the fault isn't about gfs2
handling it as a "fault", but about some *other* entity calling back
to gfs2 and doing a read/write instead? Now all your "inner/outer"
lock logic ends up being entirely pointless, as far as I can tell, and
you end up deadlocking on the lock you are holding over the user space
access _anyway_.

So I literally think that your approach is

(a) too complicated

(b) doesn't actually fix the issue in the more general case

But maybe I'm missing something.

Linus


2021-06-02 11:18:23

by Andreas Gruenbacher

Subject: Re: [RFC 4/9] gfs2: Fix mmap + page fault deadlocks (part 1)

On Tue, Jun 1, 2021 at 8:00 AM Linus Torvalds
<[email protected]> wrote:
> On Mon, May 31, 2021 at 7:01 AM Andreas Gruenbacher <[email protected]> wrote:
> >
> > Fix that by recognizing the self-recursion case.
>
> Hmm. I get the feeling that the self-recursion case should never have
> been allowed to happen in the first place.
>
> IOW, is there some reason why you can't make the user accesses always
> be done with page faults disabled (ie using the "atomic" user space
> access model), and then if you get a partial read (or write) to user
> space, at that point you drop the locks in read/write, do the "try to
> make readable/writable" and try again.
>
> IOW, none of this "detect recursion" thing. Just "no recursion in the
> first place".
>
> That way you'd not have these odd rules at fault time at all, because
> a fault while holding a lock would never get to the filesystem at all,
> it would be aborted early. And you'd not have any odd "inner/outer"
> locks, or lock compatibility rules or anything like that. You'd
> literally have just "oh, I didn't get everything at RW time while I
> held locks, so let's drop the locks, try to access user space, and
> retry".

Well, iomap_file_buffered_write() does that by using
iov_iter_fault_in_readable() and iov_iter_copy_from_user_atomic() as
in iomap_write_actor(), but the read and direct I/O side doesn't seem
to have equivalents. I suspect we can't just wrap
generic_file_read_iter() and iomap_dio_rw() calls in
pagefault_disable().
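
For reference, the write-side loop currently looks roughly like this
(condensed from iomap_write_actor(); the short-copy retry and
cond_resched() are omitted):

	do {
		unsigned long offset = offset_in_page(pos);
		unsigned long bytes = min_t(unsigned long, PAGE_SIZE - offset,
					    iov_iter_count(i));
		size_t copied;
		struct page *page;

		/*
		 * Bring in the user page _first_, with no page or
		 * filesystem locks held, so a fault here cannot deadlock.
		 */
		if (unlikely(iov_iter_fault_in_readable(i, bytes))) {
			status = -EFAULT;
			break;
		}

		status = iomap_write_begin(inode, pos, bytes, 0, &page,
					   iomap, srcmap);
		if (unlikely(status))
			break;

		/* Page faults are disabled during the atomic copy; a
		 * fault results in a short copy instead of recursion. */
		copied = iov_iter_copy_from_user_atomic(page, i, offset,
							bytes);

		copied = iomap_write_end(inode, pos, bytes, copied, page,
					 iomap, srcmap);
		iov_iter_advance(i, copied);
		pos += copied;
		written += copied;
	} while (iov_iter_count(i));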

> Wouldn't that be a lot simpler and more robust?

Sure, with vfs primitives that support atomic user-space access and
an iov_iter_fault_in_writeable()-like operation, we could do that.
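
Such an operation might look something like this (hypothetical sketch;
no such helper exists yet, it only handles the first ITER_IOVEC
segment, and writing zeroes is tolerable only because a read is about
to overwrite the buffer anyway):

	int iov_iter_fault_in_writeable(const struct iov_iter *i, size_t bytes)
	{
		char __user *p = i->iov->iov_base + i->iov_offset;
		size_t len = min(bytes, i->iov->iov_len - i->iov_offset);
		char __user *end = p + len;

		if (!len)
			return 0;
		/* Touch one byte in each page with a write access. */
		while (p < end) {
			if (put_user(0, p))
				return -EFAULT;
			p += PAGE_SIZE;
		}
		/* Make sure the last page is covered as well. */
		if (put_user(0, end - 1))
			return -EFAULT;
		return 0;
	}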

> Because what if the mmap is something a bit more complex, like
> overlayfs or userfaultfd, and completing the fault isn't about gfs2
> handling it as a "fault", but about some *other* entity calling back
> to gfs2 and doing a read/write instead? Now all your "inner/outer"
> lock logic ends up being entirely pointless, as far as I can tell, and
> you end up deadlocking on the lock you are holding over the user space
> access _anyway_.

Yes, those kinds of deadlocks would still be possible.

Until we have a better solution, wouldn't it make sense to at least
prevent those self-recursion deadlocks? I'll send a separate pull
request in case you find that acceptable.

Thanks,
Andreas

2021-06-11 16:28:46

by Al Viro

Subject: Re: [RFC 4/9] gfs2: Fix mmap + page fault deadlocks (part 1)

On Wed, Jun 02, 2021 at 01:16:32PM +0200, Andreas Gruenbacher wrote:

> Well, iomap_file_buffered_write() does that by using
> iov_iter_fault_in_readable() and iov_iter_copy_from_user_atomic() as
> in iomap_write_actor(), but the read and direct I/O side doesn't seem
> to have equivalents. I suspect we can't just wrap
> generic_file_read_iter() and iomap_dio_rw() calls in
> pagefault_disable().

And it will have zero effect on O_DIRECT case, so you get the same
deadlocks right back. Because there you hit
	iomap_dio_bio_actor()
	  bio_iov_iter_get_pages()
	    ....
	    get_user_pages_fast()
	      ....
	      faultin_page()
	        handle_mm_fault()
and at no point does the CPU hit an exception, so pagefault_disable() will
have no effect whatsoever. You can bloody well hit gfs2 readpage/mkwrite
if the destination is in an mmapped area of some GFS2 file. Do that
while holding GFS2 locks and you are fucked.

No amount of prefaulting will protect you, BTW - it might make the
deadlock harder to reproduce, but that's it.
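
To spell out the difference (simplified illustration, not an actual
kernel excerpt):

	pagefault_disable();
	left = copy_from_user(dst, ubuf, len);	/* missing page => the
						   fault is aborted early,
						   we get a short copy */
	pagefault_enable();

	pinned = get_user_pages_fast(uaddr, nr, FOLL_WRITE, pages);
	/* missing page => no CPU exception at all; gup walks the page
	   tables and calls handle_mm_fault() directly, which happily
	   calls ->fault()/->page_mkwrite() no matter what
	   pagefault_disable() may have set */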

2021-06-12 21:09:23

by Al Viro

Subject: Re: [RFC 4/9] gfs2: Fix mmap + page fault deadlocks (part 1)

On Fri, Jun 11, 2021 at 04:25:10PM +0000, Al Viro wrote:
> On Wed, Jun 02, 2021 at 01:16:32PM +0200, Andreas Gruenbacher wrote:
>
> > Well, iomap_file_buffered_write() does that by using
> > iov_iter_fault_in_readable() and iov_iter_copy_from_user_atomic() as
> > in iomap_write_actor(), but the read and direct I/O side doesn't seem
> > to have equivalents. I suspect we can't just wrap
> > generic_file_read_iter() and iomap_dio_rw() calls in
> > pagefault_disable().
>
> And it will have zero effect on O_DIRECT case, so you get the same
> deadlocks right back. Because there you hit
> 	iomap_dio_bio_actor()
> 	  bio_iov_iter_get_pages()
> 	    ....
> 	    get_user_pages_fast()
> 	      ....
> 	      faultin_page()
> 	        handle_mm_fault()
> and at no point does the CPU hit an exception, so pagefault_disable() will
> have no effect whatsoever. You can bloody well hit gfs2 readpage/mkwrite
> if the destination is in an mmapped area of some GFS2 file. Do that
> while holding GFS2 locks and you are fucked.
>
> No amount of prefaulting will protect you, BTW - it might make the
> deadlock harder to reproduce, but that's it.

AFAICS, what we have is
	* handle_mm_fault() can hit gfs2_fault(), which grabs the
	  per-inode lock shared
	* handle_mm_fault() for write can hit gfs2_page_mkwrite(), which
	  grabs the per-inode lock exclusive
	* pagefault_disable() prevents that for real page faults, but not
	  for get_user_pages_fast()
	* normal write:
		with inode_lock(inode)
		    in a loop
			with per-inode lock exclusive
			    __gfs2_iomap_get
			    possibly gfs2_iomap_begin_write
			    in a loop
				fault-in [read faults]
				iomap_write_begin
				copy_page_from_iter_atomic() [pf disabled]
				iomap_write_end
			    gfs2_iomap_end
	* O_DIRECT write:
		with inode_lock(inode) and per-inode lock deferred (?)
		    in a loop
			__gfs2_iomap_get
			possibly gfs2_iomap_begin_write
			bio_iov_iter_get_pages(), map and submit [gup]
			gfs2_iomap_end
	* normal read:
		in a loop
		    filemap_get_pages (grab pages and readpage them if needed)
		    copy_page_to_iter() for each [write faults]
	* O_DIRECT read:
		with per-inode lock deferred
		    in a loop
			__gfs2_iomap_get
			either iov_iter_zero() (on hole) [write faults]
			or bio_iov_iter_get_pages(), map and submit [gup]
			gfs2_iomap_end

... with some amount of waiting on buffered IO in case of O_DIRECT writes

Is the above an accurate description of the mainline situation there?
In particular, normal read doesn't seem to bother with locks at all.
What exactly are those cluster locks for in O_DIRECT read?

2021-06-12 21:36:55

by Al Viro

Subject: Re: [RFC 4/9] gfs2: Fix mmap + page fault deadlocks (part 1)

On Sat, Jun 12, 2021 at 09:05:40PM +0000, Al Viro wrote:

> Is the above an accurate description of the mainline situation there?
> In particular, normal read doesn't seem to bother with locks at all.
> What exactly are those cluster locks for in O_DIRECT read?

BTW, assuming the lack of contention, how costly is dropping/regaining
such cluster lock?

2021-06-13 08:55:39

by Steven Whitehouse

Subject: Re: [Cluster-devel] [RFC 4/9] gfs2: Fix mmap + page fault deadlocks (part 1)

Hi,

On Sat, 2021-06-12 at 21:35 +0000, Al Viro wrote:
> On Sat, Jun 12, 2021 at 09:05:40PM +0000, Al Viro wrote:
>
> > Is the above an accurate description of the mainline situation
> > there?
> > In particular, normal read doesn't seem to bother with locks at
> > all.
> > What exactly are those cluster locks for in O_DIRECT read?
>
> BTW, assuming the lack of contention, how costly is
> dropping/regaining
> such cluster lock?
>

The answer is that it depends...

The locking modes for inode glocks look like this:

========== ========== ============== ========== ==============
Glock mode Cache data Cache Metadata Dirty Data Dirty Metadata
========== ========== ============== ========== ==============
UN No No No No
SH Yes Yes No No
DF No Yes No No
EX Yes Yes Yes Yes
========== ========== ============== ========== ==============

The above is a copy & paste from
Documentation/filesystems/gfs2-glocks.rst. If you think of these locks
as cache control, then it makes a lot more sense.

The DF (deferred) mode is there only for DIO. It is a shared lock mode
that is incompatible with the normal SH mode. That is because it is ok
to cache data pages under SH but not under DF. That is the only other
difference between the two shared modes. DF is used for both read and
write under DIO, meaning that it is possible for multiple nodes to read
& write the same file at the same time with DIO, leaving any
synchronisation to the application layer. As soon as one performs an
operation which alters the metadata tree (truncate, extend, hole
filling), we drop back to the normal EX mode, so DF is only used for
preallocated files.
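
To make that concrete, here is (roughly) where DF gets used; condensed
from the mainline gfs2_file_direct_read(), with error handling and the
size checks trimmed:

	gfs2_holder_init(ip->i_gl, LM_ST_DEFERRED, 0, &gh);
	ret = gfs2_glock_nq(&gh);
	if (ret)
		goto out_uninit;

	/* No data pages may be cached under DF, so nodes do real I/O. */
	ret = iomap_dio_rw(iocb, to, &gfs2_iomap_ops, ...);

	gfs2_glock_dq(&gh);
out_uninit:
	gfs2_holder_uninit(&gh);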

Your original question, though, was about the cost of locking, and there
is a wide variation according to circumstances. The glock layer caches
the results of DLM requests and will continue to hold glocks gained
from remote nodes until either memory pressure forces them out or a
request to drop the lock arrives from another node.

When no other nodes are interested in a lock, all such cluster lock
activity is local. There is a cost to it though, and if (for example)
you tried to take and drop the cluster lock on every page, that would
definitely be noticeable. There are probably optimisations that could
be done on what is quite a complex code path, but in general that's what
we've discovered from testing. The introduction of ->readpages() vs the
old ->readpage() made a measurable difference, and likewise on the write
side, iomap has also shown performance increases due to the reduction in
locking on multi-page writes.

If there is another node that has an interest in a lock, then it can
get very expensive in terms of latency to regain a lock. Dropping the
lock to a lower mode may involve I/O (from EX mode) and journal
flush(es), and getting the lock back again involves I/O to other nodes
and then a wait while they finish what they are doing. To avoid
starvation there is a "minimum hold time" so that when a node gains a
glock, it is allowed to retain it, in the absence of local requests,
for a short period. The idea is that if a large number of glock
requests are being made on a node, each for a short time, we allow
several of those to complete before we do the expensive glock release
to another node.

See Documentation/filesystems/gfs2-glocks.rst for a longer explanation
and the locking order/rules between different lock types,

Steve.