This patch removes one kind of the deadlocks inside the fuse daemon. The
problem appear when the fuse daemon itself makes a file operation on its
filesystem and receives a fatal signal.
This deadlock can be interrupted via fusectl filesystem. But if you have
many fuse mountpoints, it will be difficult to figure out which
connection to break.
This patch aborts the connection if the fuse server receives a fatal
signal.
Reproducer: https://github.com/sargun/fuse-example
Reference: CVE-2019-20794
Fixes: 51eb01e73599 ("[PATCH] fuse: no backgrounding on interrupt")
Сс: Andrew Morton <[email protected]>
Cc: Miklos Szeredi <[email protected]>
Cc: "Eric W. Biederman" <[email protected]>
Acked-by: "Eric W. Biederman" <[email protected]>
Signed-off-by: Alexey Gladkov <[email protected]>
---
fs/fuse/dev.c | 26 +++++++++++++++++++++++++-
fs/fuse/fuse_i.h | 6 ++++++
fs/fuse/inode.c | 3 +++
3 files changed, 34 insertions(+), 1 deletion(-)
diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index 02b3c36b3676..eadfed675791 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -21,6 +21,7 @@
#include <linux/swap.h>
#include <linux/splice.h>
#include <linux/sched.h>
+#include <linux/fdtable.h>
MODULE_ALIAS_MISCDEV(FUSE_MINOR);
MODULE_ALIAS("devname:fuse");
@@ -357,6 +358,29 @@ static int queue_interrupt(struct fuse_iqueue *fiq, struct fuse_req *req)
return 0;
}
+static int match_fusedev(const void *p, struct file *file, unsigned fd)
+{
+ return ((struct fuse_conn *) p)->fusedev_file == file;
+}
+
+static inline bool is_fuse_daemon(struct fuse_conn *fc)
+{
+ return iterate_fd(current->files, 0, match_fusedev, fc);
+}
+
+static inline bool is_conn_untrusted(struct fuse_conn *fc)
+{
+ return (fc->sb->s_iflags & SB_I_UNTRUSTED_MOUNTER);
+}
+
+static inline bool is_event_finished(struct fuse_conn *fc, struct fuse_req *req)
+{
+ if (fc->check_fusedev_file &&
+ fatal_signal_pending(current) && is_conn_untrusted(fc) && is_fuse_daemon(fc))
+ fuse_abort_conn(fc);
+ return test_bit(FR_FINISHED, &req->flags);
+}
+
static void request_wait_answer(struct fuse_conn *fc, struct fuse_req *req)
{
struct fuse_iqueue *fiq = &fc->iq;
@@ -399,7 +423,7 @@ static void request_wait_answer(struct fuse_conn *fc, struct fuse_req *req)
* Either request is already in userspace, or it was forced.
* Wait it out.
*/
- wait_event(req->waitq, test_bit(FR_FINISHED, &req->flags));
+ wait_event(req->waitq, is_event_finished(fc, req));
}
static void __fuse_request_send(struct fuse_conn *fc, struct fuse_req *req)
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index 740a8a7d7ae6..ee9986b3c932 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -516,6 +516,9 @@ struct fuse_conn {
/** The group id for this mount */
kgid_t group_id;
+ /** The /dev/fuse file for this mount */
+ struct file *fusedev_file;
+
/** The pid namespace for this mount */
struct pid_namespace *pid_ns;
@@ -720,6 +723,9 @@ struct fuse_conn {
/* Do not show mount options */
unsigned int no_mount_options:1;
+ /** Do not check fusedev_file (virtiofs) */
+ unsigned int check_fusedev_file:1;
+
/** The number of requests waiting for completion */
atomic_t num_waiting;
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index bba747520e9b..8dc86e5079e6 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -1201,6 +1201,8 @@ int fuse_fill_super_common(struct super_block *sb, struct fuse_fs_context *ctx)
fc->no_control = ctx->no_control;
fc->no_force_umount = ctx->no_force_umount;
fc->no_mount_options = ctx->no_mount_options;
+ fc->fusedev_file = fget(ctx->fd);
+ fc->check_fusedev_file = 1;
err = -ENOMEM;
root = fuse_get_root_inode(sb, ctx->rootmode);
@@ -1348,6 +1350,7 @@ static void fuse_sb_destroy(struct super_block *sb)
fuse_abort_conn(fc);
fuse_wait_aborted(fc);
+ fput(fc->fusedev_file);
down_write(&fc->killsb);
fc->sb = NULL;
--
2.25.4
On Mon, Nov 9, 2020 at 1:48 PM Alexey Gladkov <[email protected]> wrote:
>
> This patch removes one kind of the deadlocks inside the fuse daemon. The
> problem appear when the fuse daemon itself makes a file operation on its
> filesystem and receives a fatal signal.
>
> This deadlock can be interrupted via fusectl filesystem. But if you have
> many fuse mountpoints, it will be difficult to figure out which
> connection to break.
>
> This patch aborts the connection if the fuse server receives a fatal
> signal.
The patch itself might be acceptable, but I have some questions.
To logic of this patch says:
"If a task having the fuse device open in it's fd table receives
SIGKILL (and filesystem was initially mounted in a non-init user
namespace), then abort the filesystem operation"
You just say "server" instead of "task having the fuse device open in
it's fd table" which is sloppy to say the least. It might also lead
to regressions, although I agree that it's unlikely.
Also how is this solving any security issue? Just create the request
loop using two fuse filesystems and the deadlock avoidance has just
been circumvented. So AFAICS "selling" this as a CVE fix is not
appropriate.
What's the reason for making this user-ns only? If we drop the
security aspect, then I don't see any reason not to do this
unconditionally.
Also note, there's a proper solution for making fuse requests always
killable, and that is to introduce a shadow locking that ensures
correct fs operation in the face of requests that have returned and
released their respective VFS locks. Now this would be a much more
complex solution, but also a much more correct one, not having issues
with correctly defining what a server is (which is not a solvable
problem).
Thanks,
Miklos
Miklos Szeredi <[email protected]> writes:
> On Mon, Nov 9, 2020 at 1:48 PM Alexey Gladkov <[email protected]> wrote:
>>
>> This patch removes one kind of the deadlocks inside the fuse daemon. The
>> problem appear when the fuse daemon itself makes a file operation on its
>> filesystem and receives a fatal signal.
>>
>> This deadlock can be interrupted via fusectl filesystem. But if you have
>> many fuse mountpoints, it will be difficult to figure out which
>> connection to break.
>>
>> This patch aborts the connection if the fuse server receives a fatal
>> signal.
>
> The patch itself might be acceptable, but I have some questions.
>
> To logic of this patch says:
>
> "If a task having the fuse device open in it's fd table receives
> SIGKILL (and filesystem was initially mounted in a non-init user
> namespace), then abort the filesystem operation"
>
> You just say "server" instead of "task having the fuse device open in
> it's fd table" which is sloppy to say the least. It might also lead
> to regressions, although I agree that it's unlikely.
>
> Also how is this solving any security issue? Just create the request
> loop using two fuse filesystems and the deadlock avoidance has just
> been circumvented. So AFAICS "selling" this as a CVE fix is not
> appropriate.
The original report came in with a CVE on it. So referencing that CVE
seems reasonable. Even if the issue isn't particularly serious. It is
very annoying not to be able to kill processes with SIGKILL or the OOM
killer.
You have a good point about the looping issue. I wonder if there is a
way to enhance this comparatively simple approach to prevent the more
complex scenario you mention.
Does tweaking the code to close every connection represented by a fuse
file descriptor after a SIGKILL has been delevered create any problems?
> What's the reason for making this user-ns only? If we drop the
> security aspect, then I don't see any reason not to do this
> unconditionally.
> Also note, there's a proper solution for making fuse requests always
> killable, and that is to introduce a shadow locking that ensures
> correct fs operation in the face of requests that have returned and
> released their respective VFS locks. Now this would be a much more
> complex solution, but also a much more correct one, not having issues
> with correctly defining what a server is (which is not a solvable
> problem).
Is this the solution that was removed at some point from fuse,
or are you talking about something else?
I think you are talking about adding a set of fuse specific locks
so fuse does not need to rely on the vfs locks. I don't quite have
enough insight to see that bigger problem so if you can expand in more
detail I would appreciate it.
Eric
On Mon, Nov 9, 2020 at 7:54 PM Eric W. Biederman <[email protected]> wrote:
>
> Miklos Szeredi <[email protected]> writes:
>
> > On Mon, Nov 9, 2020 at 1:48 PM Alexey Gladkov <[email protected]> wrote:
> >>
> >> This patch removes one kind of the deadlocks inside the fuse daemon. The
> >> problem appear when the fuse daemon itself makes a file operation on its
> >> filesystem and receives a fatal signal.
> >>
> >> This deadlock can be interrupted via fusectl filesystem. But if you have
> >> many fuse mountpoints, it will be difficult to figure out which
> >> connection to break.
> >>
> >> This patch aborts the connection if the fuse server receives a fatal
> >> signal.
> >
> > The patch itself might be acceptable, but I have some questions.
> >
> > To logic of this patch says:
> >
> > "If a task having the fuse device open in it's fd table receives
> > SIGKILL (and filesystem was initially mounted in a non-init user
> > namespace), then abort the filesystem operation"
> >
> > You just say "server" instead of "task having the fuse device open in
> > it's fd table" which is sloppy to say the least. It might also lead
> > to regressions, although I agree that it's unlikely.
> >
> > Also how is this solving any security issue? Just create the request
> > loop using two fuse filesystems and the deadlock avoidance has just
> > been circumvented. So AFAICS "selling" this as a CVE fix is not
> > appropriate.
>
> The original report came in with a CVE on it. So referencing that CVE
> seems reasonable. Even if the issue isn't particularly serious. It is
> very annoying not to be able to kill processes with SIGKILL or the OOM
> killer.
>
> You have a good point about the looping issue. I wonder if there is a
> way to enhance this comparatively simple approach to prevent the more
> complex scenario you mention.
Let's take a concrete example:
- task A is "server" for fuse fs a
- task B is "server" for fuse fs b
- task C: chmod(/a/x, ...)
- task A: read UNLINK request
- task A: chmod(/b/x, ...)
- task B: read UNLINK request
- task B: chmod (/a/x, ...)
Now B is blocking on i_mutex on x , A is waiting for reply from B, C
is holding i_mutex on x and waiting for reply from A.
At this point B is truly uninterruptible (and I'm not betting large
sums on Al accepting killable VFS locks patches), so killing B is out.
Killing A with this patch does nothing, since A does not have b's dev
fd in its fdtable.
Killing C again does nothing, since it has no fuse dev fd at all.
> Does tweaking the code to close every connection represented by a fuse
> file descriptor after a SIGKILL has been delevered create any problems?
In the above example are you suggesting that SIGKILL on A would abort
"a" from fs b's code? Yeah, that would work, I guess. Poking into
another instance this way sounds pretty horrid, though.
> > What's the reason for making this user-ns only? If we drop the
> > security aspect, then I don't see any reason not to do this
> > unconditionally.
>
>
> > Also note, there's a proper solution for making fuse requests always
> > killable, and that is to introduce a shadow locking that ensures
> > correct fs operation in the face of requests that have returned and
> > released their respective VFS locks. Now this would be a much more
> > complex solution, but also a much more correct one, not having issues
> > with correctly defining what a server is (which is not a solvable
> > problem).
>
> Is this the solution that was removed at some point from fuse,
> or are you talking about something else?
>
> I think you are talking about adding a set of fuse specific locks
> so fuse does not need to rely on the vfs locks. I don't quite have
> enough insight to see that bigger problem so if you can expand in more
> detail I would appreciate it.
Okay, so the problem with making the wait_event() at the end of
request_wait_answer() killable is that it would allow compromising the
server's integrity by unlocking the VFS level lock (which protects the
fs) while the server hasn't yet finished the request.
The way this would be solvable is to add a fuse level lock for each
VFS level lock. That lock would be taken before the request is sent
to userspace and would be released when the answer is received.
Normally there would be zero contention on these shadow locks, but if
a request is forcibly killed, then the VFS lock is released and the
shadow lock now protects the filesystem.
This wouldn't solve the case where a fuse fs is deadlocked on a VFS
lock (e.g. task B), but would allow tasks blocked directly on a fuse
filesystem to be killed (e.g. task A or C, both of which would unwind
the deadlock).
Thanks,
Miklos
Miklos Szeredi <[email protected]> writes:
> On Mon, Nov 9, 2020 at 7:54 PM Eric W. Biederman <[email protected]> wrote:
>>
>> Miklos Szeredi <[email protected]> writes:
>>
>> > On Mon, Nov 9, 2020 at 1:48 PM Alexey Gladkov <[email protected]> wrote:
>> >>
>> >> This patch removes one kind of the deadlocks inside the fuse daemon. The
>> >> problem appear when the fuse daemon itself makes a file operation on its
>> >> filesystem and receives a fatal signal.
>> >>
>> >> This deadlock can be interrupted via fusectl filesystem. But if you have
>> >> many fuse mountpoints, it will be difficult to figure out which
>> >> connection to break.
>> >>
>> >> This patch aborts the connection if the fuse server receives a fatal
>> >> signal.
>> >
>> > The patch itself might be acceptable, but I have some questions.
>> >
>> > To logic of this patch says:
>> >
>> > "If a task having the fuse device open in it's fd table receives
>> > SIGKILL (and filesystem was initially mounted in a non-init user
>> > namespace), then abort the filesystem operation"
>> >
>> > You just say "server" instead of "task having the fuse device open in
>> > it's fd table" which is sloppy to say the least. It might also lead
>> > to regressions, although I agree that it's unlikely.
>> >
>> > Also how is this solving any security issue? Just create the request
>> > loop using two fuse filesystems and the deadlock avoidance has just
>> > been circumvented. So AFAICS "selling" this as a CVE fix is not
>> > appropriate.
>>
>> The original report came in with a CVE on it. So referencing that CVE
>> seems reasonable. Even if the issue isn't particularly serious. It is
>> very annoying not to be able to kill processes with SIGKILL or the OOM
>> killer.
>>
>> You have a good point about the looping issue. I wonder if there is a
>> way to enhance this comparatively simple approach to prevent the more
>> complex scenario you mention.
>
> Let's take a concrete example:
>
> - task A is "server" for fuse fs a
> - task B is "server" for fuse fs b
> - task C: chmod(/a/x, ...)
> - task A: read UNLINK request
> - task A: chmod(/b/x, ...)
> - task B: read UNLINK request
> - task B: chmod (/a/x, ...)
>
> Now B is blocking on i_mutex on x , A is waiting for reply from B, C
> is holding i_mutex on x and waiting for reply from A.
>
> At this point B is truly uninterruptible (and I'm not betting large
> sums on Al accepting killable VFS locks patches), so killing B is out.
>
> Killing A with this patch does nothing, since A does not have b's dev
> fd in its fdtable.
>
> Killing C again does nothing, since it has no fuse dev fd at all.
>
>> Does tweaking the code to close every connection represented by a fuse
>> file descriptor after a SIGKILL has been delevered create any problems?
>
> In the above example are you suggesting that SIGKILL on A would abort
> "a" from fs b's code? Yeah, that would work, I guess. Poking into
> another instance this way sounds pretty horrid, though.
Yes. That is what I am suggesting.
Layering purity it does not have. It is also fragile as it only
handles interactions between fuse instances.
The advantage is that it is a very small amount of code. I think there
is enough care to get a small change like that in. (With a big fat
comment describing why it is imperfect). I don't know if there is
enough care to get the general solution (you describe below) implemented
and merged in any kind of timely manner.
>> > What's the reason for making this user-ns only? If we drop the
>> > security aspect, then I don't see any reason not to do this
>> > unconditionally.
>>
>>
>> > Also note, there's a proper solution for making fuse requests always
>> > killable, and that is to introduce a shadow locking that ensures
>> > correct fs operation in the face of requests that have returned and
>> > released their respective VFS locks. Now this would be a much more
>> > complex solution, but also a much more correct one, not having issues
>> > with correctly defining what a server is (which is not a solvable
>> > problem).
>>
>> Is this the solution that was removed at some point from fuse,
>> or are you talking about something else?
>>
>> I think you are talking about adding a set of fuse specific locks
>> so fuse does not need to rely on the vfs locks. I don't quite have
>> enough insight to see that bigger problem so if you can expand in more
>> detail I would appreciate it.
>
> Okay, so the problem with making the wait_event() at the end of
> request_wait_answer() killable is that it would allow compromising the
> server's integrity by unlocking the VFS level lock (which protects the
> fs) while the server hasn't yet finished the request.
>
> The way this would be solvable is to add a fuse level lock for each
> VFS level lock. That lock would be taken before the request is sent
> to userspace and would be released when the answer is received.
> Normally there would be zero contention on these shadow locks, but if
> a request is forcibly killed, then the VFS lock is released and the
> shadow lock now protects the filesystem.
>
> This wouldn't solve the case where a fuse fs is deadlocked on a VFS
> lock (e.g. task B), but would allow tasks blocked directly on a fuse
> filesystem to be killed (e.g. task A or C, both of which would unwind
> the deadlock).
Are we just talking the inode lock here?
I am trying to figure out if this is a straight forward change.
Or if it will take a fair amount of work.
If the change is just wordy we can probably do the good version and call
fuse well and truly fixed. But I don't currently see the problem well
enough to know what the good change would look like even on a single
code path.
Eric
On Wed, Nov 11, 2020 at 8:42 AM Eric W. Biederman <[email protected]> wrote:
>
> Miklos Szeredi <[email protected]> writes:
> > Okay, so the problem with making the wait_event() at the end of
> > request_wait_answer() killable is that it would allow compromising the
> > server's integrity by unlocking the VFS level lock (which protects the
> > fs) while the server hasn't yet finished the request.
> >
> > The way this would be solvable is to add a fuse level lock for each
> > VFS level lock. That lock would be taken before the request is sent
> > to userspace and would be released when the answer is received.
> > Normally there would be zero contention on these shadow locks, but if
> > a request is forcibly killed, then the VFS lock is released and the
> > shadow lock now protects the filesystem.
> >
> > This wouldn't solve the case where a fuse fs is deadlocked on a VFS
> > lock (e.g. task B), but would allow tasks blocked directly on a fuse
> > filesystem to be killed (e.g. task A or C, both of which would unwind
> > the deadlock).
>
> Are we just talking the inode lock here?
>
> I am trying to figure out if this is a straight forward change.
> Or if it will take a fair amount of work.
Inode lock and cross directory rename lock should suffice, I think.
One issue is that we are losing normal ref on dentry+mount, so in case
the process is killed we need to take a ref on the inode.
Since multiple inode locks can be held for one op, we need to take
care of ordering the shadow locks as well.
It's not a trivial change, but I'd be much happier if we would take
this instead of the hackish one.
Thanks,
Miklos